DIA: A TTS Model for Ultra-Realistic Dialogue Generation

Nari Labs has released Dia, a groundbreaking 1.6B parameter text-to-speech model that’s changing the landscape of AI-generated dialogue. What makes Dia special is its ability to generate incredibly realistic, natural-sounding dialogue in a single pass – something that traditionally required multiple processing steps or models.
What Makes Dia Revolutionary?
Dia stands out from other TTS models with its unique capabilities:
- Single-Pass Dialogue Generation: Creates realistic conversations between multiple speakers in one go
- Non-Verbal Communication: Naturally incorporates laughs, coughs, throat clearing, and other human sounds
- Audio Conditioning: Clone voices or control emotion/tone by providing audio samples
- Open Weights: Fully accessible for research and development
Key Features
- Multi-Speaker Support: Easily switch between speakers using `[S1]` and `[S2]` tags
- Natural Non-Verbal Elements: Generate authentic human sounds like `(laughs)`, `(coughs)`, `(sighs)`, and more
- Voice Cloning: Match specific voice characteristics by providing sample audio
- High Performance: Runs at 2.2x realtime on modern GPUs (RTX 4090) with float16 precision
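In practice, a Dia script is just plain text with speaker tags and parenthesized non-verbal cues. Here is a tiny illustration of the format (the dialogue wording is made up; the tag syntax matches the official examples further down):

```python
# Illustrative script using Dia's speaker tags and non-verbal cues.
# [S1]/[S2] mark speaker turns; parenthesized sounds like (laughs) are rendered as audio.
script = (
    "[S1] Did you try the new dialogue model? "
    "[S2] I did. (laughs) It even coughs and sighs on cue. (coughs) "
    "[S1] That's surprisingly convincing. (sighs)"
)
```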
Getting Started with Dia
Quick Installation
```bash
# Install directly from GitHub
pip install git+https://github.com/nari-labs/dia.git
```
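After installing, a quick sanity check along these lines can confirm that the package imports and that PyTorch sees a CUDA device (which Dia currently requires); this is just a sketch, not part of the official setup:

```python
# Post-install sanity check (sketch): verify the import and CUDA visibility.
import torch
from dia.model import Dia  # import from the installed package

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```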
Running the Gradio UI
```bash
git clone https://github.com/nari-labs/dia.git
cd dia && uv run app.py

# Or without uv
git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install -e .
python app.py
```
Using Dia in Python
```python
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on GitHub or Hugging Face."

output = model.generate(text, use_torch_compile=True, verbose=True)

model.save_audio("dialogue.mp3", output)
```
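Since loading the model is the slow part, it is worth reusing one loaded instance for several scripts. The following is a small convenience sketch built only from the calls shown above (the filenames and dialogue lines are just placeholders):

```python
# Sketch: reuse one loaded model to render several dialogue scripts.
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

scripts = {
    "intro.mp3": "[S1] Welcome to the show. [S2] Glad to be here. (laughs)",
    "outro.mp3": "[S1] That's all for today. [S2] See you next time!",
}

for filename, text in scripts.items():
    output = model.generate(text, use_torch_compile=True, verbose=True)
    model.save_audio(filename, output)
```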
For a more complete example, here is sample code I've put together that demonstrates voice cloning from a reference recording:
```python
# Example of voice cloning with DIA
# This demonstrates how to use a reference audio file to clone a voice

from dia.model import Dia
import numpy as np
import soundfile as sf
import torch

# Load the DIA model
print("Loading DIA model...")
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Path to your reference audio file for voice cloning
reference_audio_path = "reference_voice.wav"

# The transcript of what's said in the reference audio
# This helps the model understand the voice characteristics
reference_transcript = "[S1] This is my natural speaking voice that I want the model to clone."

# Load the reference audio file
audio_data, sample_rate = sf.read(reference_audio_path)

# Convert to mono if stereo
if len(audio_data.shape) > 1:
    audio_data = np.mean(audio_data, axis=1)

# Convert to float32 tensor
reference_audio = torch.tensor(audio_data).float()

# Create your script using the same speaker tag from the reference
script = """
[S1] Hello everyone! I'm excited to demonstrate voice cloning with the DIA model.
[S1] This dialogue is being generated using the voice characteristics from my reference audio.
[S1] The model can maintain my speaking style, accent, and tone across multiple sentences. (laughs)
[S1] It can even handle expressions like laughter or sighs while keeping my voice consistent.
"""

# Generate audio using the reference voice
print("Generating audio with cloned voice...")
output = model.generate(
    reference_transcript + script,  # Combine reference transcript with new script
    audio_prompt=reference_audio,   # Provide the reference audio
    sample_rate=sample_rate,
    use_torch_compile=True,
    verbose=True
)

# Save the generated audio
model.save_audio("cloned_voice_output.mp3", output)
print("Voice cloning complete! Audio saved as 'cloned_voice_output.mp3'")

# Additional voice cloning tips:
# 1. Use clear, high-quality reference audio (minimal background noise)
# 2. Make sure the reference transcript accurately matches what's said in the audio
# 3. Keep the reference audio between 3-10 seconds for best results
# 4. Use the same speaker tag ([S1] or [S2]) consistently for the cloned voice
```
Hardware Requirements
Dia currently requires a GPU with CUDA support (tested on CUDA 12.6 with PyTorch 2.0+). CPU support is planned for future releases.
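Before loading the model, it can help to confirm that your GPU and available VRAM meet the figures in the table below; a minimal check using standard PyTorch calls might look like this (purely a sketch, not part of the Dia API):

```python
# Sketch: confirm a CUDA GPU is present and report total VRAM before loading Dia.
import torch

if not torch.cuda.is_available():
    raise SystemExit("Dia currently requires a CUDA-capable GPU.")

props = torch.cuda.get_device_properties(0)
total_vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, total VRAM: {total_vram_gb:.1f} GB")
# float16/bfloat16 inference needs roughly 10 GB; float32 roughly 13 GB (see table below).
```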
| Precision | Realtime Factor (w/ compile) | Realtime Factor (w/o compile) | VRAM Usage |
|---|---|---|---|
| bfloat16 | 2.1x | 1.5x | ~10GB |
| float16 | 2.2x | 1.3x | ~10GB |
| float32 | 1.0x | 0.9x | ~13GB |
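To see where your own setup lands in this table, you can time a generation run and divide the output's audio duration by the wall-clock time. The sketch below reads the saved file back with soundfile to get its duration, which assumes your soundfile/libsndfile build can decode the saved format; it also ignores torch.compile warm-up, so run it twice for a fair number:

```python
# Sketch: rough realtime-factor measurement for your own GPU.
import time
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")
text = "[S1] Benchmarking Dia on this machine. [S2] Let's see how fast it runs. (laughs)"

start = time.perf_counter()
output = model.generate(text, use_torch_compile=True, verbose=True)
elapsed = time.perf_counter() - start

model.save_audio("benchmark.mp3", output)
audio, sr = sf.read("benchmark.mp3")  # assumes libsndfile can read the saved format
print(f"Realtime factor: {len(audio) / sr / elapsed:.2f}x")
```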
Try It Now
- HuggingFace Space: Try the live demo
- ZeroGPU Space: Available for those without GPU access
- Community Support: Join their Discord server for help and updates
- Extended Access: Join the waitlist for access to larger models and additional features
Ethical Considerations
Dia is intended for research and educational purposes. Nari Labs explicitly prohibits:
- Creating audio that impersonates real individuals without permission
- Generating deceptive or misleading content
- Any illegal or harmful applications
The Future of Conversational AI
Dia represents a significant leap forward in generating natural-sounding dialogue. By condensing what was previously a multi-step process into a single model pass, Dia opens new possibilities for creative content, accessibility tools, and conversational AI systems.
With voice cloning capabilities and support for non-verbal communication, Dia can produce audio content that captures the nuance and natural flow of human conversation in ways that weren’t previously possible with open models.
Resources
Dia is licensed under the Apache License 2.0 and is currently available as an open-weight model for research and development purposes.