DIA: A TTS Model for Ultra-Realistic Dialogue Generation

Nari Labs has released Dia, a groundbreaking 1.6B parameter text-to-speech model that’s changing the landscape of AI-generated dialogue. What makes Dia special is its ability to generate incredibly realistic, natural-sounding dialogue in a single pass – something that traditionally required multiple processing steps or models.
What Makes Dia Revolutionary?
Dia stands out from other TTS models with its unique capabilities:
- Single-Pass Dialogue Generation: Creates realistic conversations between multiple speakers in one go
- Non-Verbal Communication: Naturally incorporates laughs, coughs, throat clearing, and other human sounds
- Audio Conditioning: Clone voices or control emotion/tone by providing audio samples
- Open Weights: Fully accessible for research and development
Key Features
- Multi-Speaker Support: Easily switch between speakers using `[S1]` and `[S2]` tags
- Natural Non-Verbal Elements: Generate authentic human sounds like `(laughs)`, `(coughs)`, `(sighs)`, and more
- Voice Cloning: Match specific voice characteristics by providing sample audio
- High Performance: Runs at 2.2x realtime on modern GPUs (RTX 4090) with float16 precision
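In practice, a Dia script is just plain text with speaker tags and parenthesized non-verbal cues. Here is a tiny illustration of the format (the dialogue wording is made up; the tag syntax matches the official examples further down):

```python
# Illustrative script using Dia's speaker tags and non-verbal cues.
# [S1]/[S2] mark speaker turns; parenthesized sounds like (laughs) are rendered as audio.
script = (
    "[S1] Did you try the new dialogue model? "
    "[S2] I did. (laughs) It even coughs and sighs on cue. (coughs) "
    "[S1] That's surprisingly convincing. (sighs)"
)
```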
Getting Started with Dia
Quick Installation
```bash
# Install directly from GitHub
pip install git+https://github.com/nari-labs/dia.git
```
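After installing, a quick sanity check along these lines can confirm that the package imports and that PyTorch sees a CUDA device (which Dia currently requires); this is just a sketch, not part of the official setup:

```python
# Post-install sanity check (sketch): verify the import and CUDA visibility.
import torch
from dia.model import Dia  # import from the installed package

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```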
Running the Gradio UI
```bash
git clone https://github.com/nari-labs/dia.git
cd dia && uv run app.py

# Or without uv
git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install -e .
python app.py
```
Using Dia in Python
```python
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on GitHub or Hugging Face."

output = model.generate(text, use_torch_compile=True, verbose=True)

model.save_audio("dialogue.mp3", output)
```
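Since loading the model is the slow part, it is worth reusing one loaded instance for several scripts. The following is a small convenience sketch built only from the calls shown above (the filenames and dialogue lines are just placeholders):

```python
# Sketch: reuse one loaded model to render several dialogue scripts.
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

scripts = {
    "intro.mp3": "[S1] Welcome to the show. [S2] Glad to be here. (laughs)",
    "outro.mp3": "[S1] That's all for today. [S2] See you next time!",
}

for filename, text in scripts.items():
    output = model.generate(text, use_torch_compile=True, verbose=True)
    model.save_audio(filename, output)
```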
For a more complete example, here is sample code I've put together that demonstrates voice cloning from a reference recording:
```python
# Example of voice cloning with DIA
# This demonstrates how to use a reference audio file to clone a voice

from dia.model import Dia
import numpy as np
import soundfile as sf
import torch

# Load the DIA model
print("Loading DIA model...")
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Path to your reference audio file for voice cloning
reference_audio_path = "reference_voice.wav"

# The transcript of what's said in the reference audio
# This helps the model understand the voice characteristics
reference_transcript = "[S1] This is my natural speaking voice that I want the model to clone."

# Load the reference audio file
audio_data, sample_rate = sf.read(reference_audio_path)

# Convert to mono if stereo
if len(audio_data.shape) > 1:
    audio_data = np.mean(audio_data, axis=1)

# Convert to float32 tensor
reference_audio = torch.tensor(audio_data).float()

# Create your script using the same speaker tag from the reference
script = """
[S1] Hello everyone! I'm excited to demonstrate voice cloning with the DIA model.
[S1] This dialogue is being generated using the voice characteristics from my reference audio.
[S1] The model can maintain my speaking style, accent, and tone across multiple sentences. (laughs)
[S1] It can even handle expressions like laughter or sighs while keeping my voice consistent.
"""

# Generate audio using the reference voice
print("Generating audio with cloned voice...")
output = model.generate(
    reference_transcript + script,  # Combine reference transcript with new script
    audio_prompt=reference_audio,   # Provide the reference audio
    sample_rate=sample_rate,
    use_torch_compile=True,
    verbose=True
)

# Save the generated audio
model.save_audio("cloned_voice_output.mp3", output)
print("Voice cloning complete! Audio saved as 'cloned_voice_output.mp3'")

# Additional voice cloning tips:
# 1. Use clear, high-quality reference audio (minimal background noise)
# 2. Make sure the reference transcript accurately matches what's said in the audio
# 3. Keep the reference audio between 3-10 seconds for best results
# 4. Use the same speaker tag ([S1] or [S2]) consistently for the cloned voice
```
Hardware Requirements
Dia currently requires a GPU with CUDA support (tested on CUDA 12.6 with PyTorch 2.0+). CPU support is planned for future releases.
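Before loading the model, it can help to confirm that your GPU and available VRAM meet the figures in the table below; a minimal check using standard PyTorch calls might look like this (purely a sketch, not part of the Dia API):

```python
# Sketch: confirm a CUDA GPU is present and report total VRAM before loading Dia.
import torch

if not torch.cuda.is_available():
    raise SystemExit("Dia currently requires a CUDA-capable GPU.")

props = torch.cuda.get_device_properties(0)
total_vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, total VRAM: {total_vram_gb:.1f} GB")
# float16/bfloat16 inference needs roughly 10 GB; float32 roughly 13 GB (see table below).
```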
| Precision | Realtime Factor (w/ compile) | Realtime Factor (w/o compile) | VRAM Usage |
|---|---|---|---|
| bfloat16 | 2.1x | 1.5x | ~10GB |
| float16 | 2.2x | 1.3x | ~10GB |
| float32 | 1.0x | 0.9x | ~13GB |
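To see where your own setup lands in this table, you can time a generation run and divide the output's audio duration by the wall-clock time. The sketch below reads the saved file back with soundfile to get its duration, which assumes your soundfile/libsndfile build can decode the saved format; it also ignores torch.compile warm-up, so run it twice for a fair number:

```python
# Sketch: rough realtime-factor measurement for your own GPU.
import time
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")
text = "[S1] Benchmarking Dia on this machine. [S2] Let's see how fast it runs. (laughs)"

start = time.perf_counter()
output = model.generate(text, use_torch_compile=True, verbose=True)
elapsed = time.perf_counter() - start

model.save_audio("benchmark.mp3", output)
audio, sr = sf.read("benchmark.mp3")  # assumes libsndfile can read the saved format
print(f"Realtime factor: {len(audio) / sr / elapsed:.2f}x")
```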
Try It Now
- HuggingFace Space: Try the live demo
- ZeroGPU Space: Available for those without GPU access
- Community Support: Join their Discord server for help and updates
- Extended Access: Join the waitlist for access to larger models and additional features
Ethical Considerations
Dia is intended for research and educational purposes. Nari Labs explicitly prohibits:
- Creating audio that impersonates real individuals without permission
- Generating deceptive or misleading content
- Any illegal or harmful applications
The Future of Conversational AI
Dia represents a significant leap forward in generating natural-sounding dialogue. By condensing what was previously a multi-step process into a single model pass, Dia opens new possibilities for creative content, accessibility tools, and conversational AI systems.
With voice cloning capabilities and support for non-verbal communication, Dia can produce audio content that captures the nuance and natural flow of human conversation in ways that weren’t previously possible with open models.
Resources
Dia is licensed under the Apache License 2.0 and is currently available as an open-weight model for research and development purposes.