A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more

A two-person startup by the name of Nari Labs has introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts. One of its creators claims it rivals or surpasses competing proprietary offerings such as ElevenLabs and Google’s hit NotebookLM AI podcast generation product.

It could also threaten uptake of OpenAI’s recent gpt-4o-mini-tts.

“Dia rivals NotebookLM’s podcast feature while surpassing ElevenLabs Studio and Sesame’s open model in quality,” said Toby Kim, one of the co-creators of Nari and Dia, in a post from his account on the social network X.

In a separate post, Kim noted that the model was built with “zero funding,” and added across a thread: “…we were not AI experts from the beginning. It all started when we fell in love with NotebookLM’s podcast feature when it was released last year. We wanted more—more control over the voices, more freedom in the script. We tried every TTS API on the market. None of them sounded like real human conversation.”

Kim further credited Google for giving him and his collaborator access to the company’s Tensor Processing Unit chips (TPUs) for training Dia through Google’s Research Cloud.

Dia’s code and weights (the model’s learned parameters) are now available for download and local deployment by anyone from Hugging Face or GitHub. Individual users can try generating speech from it on a Hugging Face Space.

Advanced controls and more customizable features

Dia supports nuanced features like emotional tone, speaker tagging, and nonverbal audio cues—all from plain text.

Users can mark speaker turns with tags like [S1] and [S2], and include cues like (laughs), (coughs), or (clears throat) to enrich the resulting dialogue with nonverbal behaviors.

These tags are correctly interpreted by Dia during generation—something not reliably supported by other available models, according to the company’s examples page.
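
The tagging convention described above can be sketched in a few lines of Python. This is a minimal illustration that only assembles and checks the text prompt; it does not call Dia. The helper names `build_script` and `unknown_cues` are hypothetical, the two-speaker limit is an assumption based on the [S1] and [S2] examples, and the cue list contains only the three cues named in this article, so the full set Dia accepts may differ.

```python
import re

# Cues named in the article; the full set Dia accepts may be larger (assumption).
KNOWN_CUES = {"laughs", "coughs", "clears throat"}


def build_script(turns):
    """Join (speaker_number, text) pairs into one string with [S1]/[S2] tags."""
    parts = []
    for speaker, text in turns:
        # Assumption: only two speakers, matching the [S1] and [S2] examples.
        if speaker not in (1, 2):
            raise ValueError(f"unsupported speaker: {speaker}")
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)


def unknown_cues(script):
    """Return parenthesized cues that are not in KNOWN_CUES."""
    found = re.findall(r"\(([^()]+)\)", script)
    return [cue for cue in found if cue.lower() not in KNOWN_CUES]


script = build_script([
    (1, "Did you hear about the new model?"),
    (2, "I did. (laughs) It sounds surprisingly human."),
    (1, "(clears throat) Let's try it."),
])
print(script)
print(unknown_cues(script))
```

Checking cues before generation is a design choice for catching typos early; a string produced this way would then be passed to whichever generation interface the project exposes.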

The model is currently English-only and not tied to any single speaker’s voice, producing different voices per run unless users fix the generation seed or provide an audio prompt. Audio conditioning, or voice cloning, lets users guide speech tone and voice likeness by uploading a sample clip.
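
Why fixing the seed yields a repeatable voice can be shown with a toy example. The sketch below is a hypothetical stand-in for any sampling-based generator and does not call Dia; the voice names and the `sample_voices` function are invented for illustration only.

```python
import random

# Hypothetical stand-in for a sampling-based generator; it does not call Dia.
VOICES = ["voice_a", "voice_b", "voice_c", "voice_d"]


def sample_voices(n_runs, seed=None):
    """Pick one voice per run; a fixed seed makes the picks repeatable."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    return [rng.choice(VOICES) for _ in range(n_runs)]


first = sample_voices(5, seed=42)
second = sample_voices(5, seed=42)
print(first == second)  # same seed, same sequence of picks
```

Without a seed, repeated calls are free to differ, which mirrors the run-to-run voice variation described above.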

Nari Labs offers example code to facilitate this process and a Gradio-based demo so users can try it without setup.

Comparison with ElevenLabs and Sesame

Nari offers a host of example audio files generated by Dia on its Notion website, comparing it to other leading text-to-speech rivals, specifically ElevenLabs Studio and Sesame CSM-1B. The latter is a new text-to-speech model from Oculus VR headset co-creator Brendan Iribe that went somewhat viral on X earlier this year.

Side-by-side examples shared by Nari Labs show how Dia outperforms the competition in several areas:

In standard dialogue scenarios, Dia handles both natural timing and nonverbal expressions better. For example, in a script ending with (laughs), Dia interprets and delivers actual laughter, whereas ElevenLabs and Sesame output textual substitutions like “haha”.


In multi-turn conversations with emotional range, Dia demonstrates smoother transitions and tone shifts. One test included a dramatic, emotionally-charged emergency scene. Dia rendered the urgency and speaker stress effectively, while competing models often flattened delivery or lost pacing.

Dia uniquely handles nonverbal-only scripts, such as a humorous exchange involving coughs, sniffs, and laughs. Competing models failed to recognize these tags or skipped them entirely.

Even with rhythmically complex content like rap lyrics, Dia generates fluid, performance-style speech that maintains tempo. This contrasts with more monotone or disjointed outputs from ElevenLabs and Sesame’s 1B model.

Using audio prompts, Dia can extend or continue a speaker’s voice style into new lines. An example using a conversational clip as a seed showed how Dia carried vocal traits from the sample through the rest of the scripted dialogue. This feature isn’t robustly supported in other models.

In one set of tests, Nari Labs noted that Sesame’s best website demo likely used an internal 8B version of the model rather than the public 1B checkpoint, resulting in a gap between advertised and actual performance.

Model access and tech specs

Developers can access Dia from Nari Labs’ GitHub repository and its Hugging Face model page.

The model runs on PyTorch 2.0+ and CUDA 12.6 and requires about 10GB of VRAM.

Inference on enterprise-grade GPUs like the NVIDIA RTX A4000 delivers roughly 40 tokens per second.
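
To put the 40 tokens per second figure in perspective, the arithmetic below estimates how long a clip would take to generate. The conversion of about 86 tokens per second of audio is an assumption not stated in this article and should be checked against the project’s documentation; the function names are illustrative.

```python
TOKENS_PER_SECOND = 40.0        # throughput reported in the article
TOKENS_PER_AUDIO_SECOND = 86.0  # assumption, not from the article


def generation_seconds(audio_seconds):
    """Estimated wall-clock time to generate a clip of the given length."""
    return audio_seconds * TOKENS_PER_AUDIO_SECOND / TOKENS_PER_SECOND


def real_time_factor():
    """Seconds of audio produced per second of compute (1.0 = real time)."""
    return TOKENS_PER_SECOND / TOKENS_PER_AUDIO_SECOND


print(f"10 s of audio: about {generation_seconds(10):.1f} s to generate")
print(f"real-time factor: about {real_time_factor():.2f}")
```

Under that assumption, generation at this rate would run at a little under half of real time.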

While the current version only runs on GPU, Nari plans to offer CPU support and a quantized release to improve accessibility.

The startup offers both a Python library and CLI tool to further streamline deployment.

Dia’s flexibility opens use cases from content creation to assistive technologies and synthetic voiceovers.

Nari Labs is also developing a consumer version of Dia aimed at casual users looking to remix or share generated conversations. Interested users can sign up via email to a waitlist for early access.

Fully open source

The model is distributed under a fully open source Apache 2.0 license, which means it can be used for commercial purposes — something that will obviously appeal to enterprises or indie app developers.

Nari Labs explicitly prohibits usage that includes impersonating individuals, spreading misinformation, or engaging in illegal activities. The team encourages responsible experimentation and has taken a stance against unethical deployment.

Nari Labs credits Dia’s development to support from the Google TPU Research Cloud and Hugging Face’s ZeroGPU grant program, and to prior work on SoundStorm, Parakeet, and Descript Audio Codec.

Nari Labs itself comprises just two engineers, one full-time and one part-time, but it actively invites community contributions through its Discord server and GitHub.

With a clear focus on expressive quality, reproducibility, and open access, Dia adds a distinctive new voice to the landscape of generative speech models.

