VentureBeat

A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more

A two-person startup by the name of Nari Labs has introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts — and one of its creators claims it surpasses the performance of competing proprietary offerings from the likes of ElevenLabs and Google’s hit NotebookLM AI podcast generation product. It could also threaten uptake of OpenAI’s recent gpt-4o-mini-tts.

“Dia rivals NotebookLM’s podcast feature while surpassing ElevenLabs Studio and Sesame’s open model in quality,” said Toby Kim, one of the co-creators of Nari and Dia, in a post from his account on the social network X. In a separate post, Kim noted that the model was built with “zero funding,” and added across a thread: “…we were not AI experts from the beginning. It all started when we fell in love with NotebookLM’s podcast feature when it was released last year. We wanted more—more control over the voices, more freedom in the script. We tried every TTS API on the market. None of them sounded like real human conversation.”

Kim further credited Google for giving him and his collaborator access to the company’s Tensor Processing Units (TPUs) for training Dia through Google’s Research Cloud. Dia’s code and weights — the model’s internal parameter settings — are now available for download and local deployment by anyone from Hugging Face or GitHub. Individual users can try generating speech from it on a Hugging Face Space.

Advanced controls and more customizable features

Dia supports nuanced features like emotional tone, speaker tagging, and nonverbal audio cues—all from plain text. Users can mark speaker turns with tags like [S1] and [S2], and include cues like (laughs), (coughs), or (clears throat) to enrich the resulting dialogue with nonverbal behaviors. These tags are correctly interpreted by Dia during generation—something not reliably supported by other available models, according to the company’s examples page.

The model is currently English-only and not tied to any single speaker’s voice, producing different voices per run unless users fix the generation seed or provide an audio prompt. Audio conditioning, or voice cloning, lets users guide speech tone and voice likeness by uploading a sample clip. Nari Labs offers example code to facilitate this process and a Gradio-based demo so users can try it without setup.

Comparison with ElevenLabs and Sesame

Nari offers a host of example audio files generated by Dia on its Notion website, comparing it to other leading text-to-speech rivals, specifically ElevenLabs Studio and Sesame CSM-1B, the latter a new model from Oculus VR headset co-creator Brendan Iribe that went somewhat viral on X earlier this year. Side-by-side examples shared by Nari Labs show how Dia outperforms the competition in several areas:

In standard dialogue scenarios, Dia handles both natural timing and nonverbal expressions better. For example, in a script ending with (laughs), Dia interprets and delivers actual laughter, whereas ElevenLabs and Sesame output textual substitutions like “haha”.

In multi-turn conversations with emotional range, Dia demonstrates smoother transitions and tone shifts. One test included a dramatic, emotionally-charged emergency scene. Dia rendered the urgency and speaker stress effectively, while competing models often flattened delivery or lost pacing.

Dia uniquely handles nonverbal-only scripts, such as a humorous exchange involving coughs, sniffs, and laughs. Competing models failed to recognize these tags or skipped them entirely.

Even with rhythmically complex content like rap lyrics, Dia generates fluid, performance-style speech that maintains tempo. This contrasts with more monotone or disjointed outputs from ElevenLabs and Sesame’s 1B model.

Using audio prompts, Dia can extend or continue a speaker’s voice style into new lines. An example using a conversational clip as a seed showed how Dia carried vocal traits from the sample through the rest of the scripted dialogue. This feature isn’t robustly supported in other models.

In one set of tests, Nari Labs noted that Sesame’s best website demo likely used an internal 8B version of the model rather than the public 1B checkpoint, resulting in a gap between advertised and actual performance.

Model access and tech specs

Developers can access Dia from Nari Labs’ GitHub repository and its Hugging Face model page. The model runs on PyTorch 2.0+ and CUDA 12.6 and requires about 10GB of VRAM. Inference on enterprise-grade GPUs like the NVIDIA A4000 delivers roughly 40 tokens per second. While the current version only runs on GPU, Nari plans to offer CPU support and a quantized release to improve accessibility. The startup offers both a Python library and CLI tool to further streamline deployment.

Dia’s flexibility opens use cases from content creation to assistive technologies and synthetic voiceovers. Nari Labs is also developing a consumer version of Dia aimed at casual users looking to remix or share generated conversations. Interested users can sign up via email to a waitlist for early access.

Fully open source

The model is distributed under a fully open source Apache 2.0 license, which means it can be used for commercial purposes — something that will obviously appeal to enterprises or indie app developers. Nari Labs explicitly prohibits usage that includes impersonating individuals, spreading misinformation, or engaging in illegal activities. The team encourages responsible experimentation and has taken a stance against unethical deployment.

Dia’s development credits support from the Google TPU Research Cloud, Hugging Face’s ZeroGPU grant program, and prior work on SoundStorm, Parakeet, and Descript Audio Codec. Nari Labs itself comprises just two engineers—one full-time and one part-time—but the team actively invites community contributions through its Discord server and GitHub. With a clear focus on expressive quality, reproducibility, and open access, Dia adds a distinctive new voice to the landscape of generative speech models.
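For developers who want to experiment locally, the following is a minimal sketch of what scripted dialogue generation with the speaker tags and nonverbal cues described above might look like in Python. The import path, checkpoint name and sample rate are assumptions based on the Nari Labs GitHub and Hugging Face release rather than verified documentation, so check the repository README for the exact API.

```python
# Hypothetical sketch of two-speaker dialogue generation with Dia.
# The package layout, checkpoint name and 44.1 kHz sample rate are assumptions
# drawn from the public release; consult the Nari Labs README for specifics.
import soundfile as sf
from dia.model import Dia  # assumed import path from the Dia repository

script = (
    "[S1] Did you hear that a two-person startup just released an open TTS model? "
    "[S2] I did! (laughs) It even handles nonverbal tags. (clears throat) Shall we test it?"
)

model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # assumed checkpoint name
audio = model.generate(script)                     # assumed to return a waveform array
sf.write("dialogue.wav", audio, 44100)             # sample rate is an assumption
```

Per the audio-conditioning feature above, a short reference clip could also be supplied to steer the generated voice, though the exact argument for that is not shown here.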


Alibaba launches open source Qwen3 model that surpasses OpenAI o1 and DeepSeek R1

Chinese e-commerce and web giant Alibaba’s Qwen team has officially launched a new series of open source large language models (LLMs) known as Qwen3 that appear to be among the state of the art for open models, and approach the performance of proprietary models from the likes of OpenAI and Google.

The Qwen3 series features two “mixture-of-experts” models and six dense models for a total of eight (!) new models. The “mixture-of-experts” approach combines several specialized sub-models, or “experts,” into one, with only the experts relevant to the task at hand activated among the model’s internal settings (known as parameters). It was popularized by open source French AI startup Mistral.

According to the team, the 235-billion parameter version of Qwen3, Qwen3-235B-A22B (the “A22B” suffix denotes 22 billion activated parameters), outperforms DeepSeek’s open source R1 and OpenAI’s proprietary o1 on key third-party benchmarks, including ArenaHard (a set of 500 user questions spanning software engineering and math), and nears the performance of the new, proprietary Google Gemini 2.5 Pro. Overall, the benchmark data positions Qwen3-235B-A22B as one of the most powerful publicly available models, achieving parity or superiority relative to major industry offerings.

Hybrid (reasoning) theory

The Qwen3 models are trained to provide so-called “hybrid reasoning” or “dynamic reasoning” capabilities, allowing users to toggle between fast, accurate responses and more time-consuming, compute-intensive reasoning steps (similar to OpenAI’s “o” series) for more difficult queries in science, math, engineering and other specialized fields. This is an approach pioneered by Nous Research and other AI startups and research collectives.

With Qwen3, users can engage the more intensive “Thinking Mode” using the button marked as such on the Qwen Chat website, or by embedding specific prompts like /think or /no_think when deploying the model locally or through the API, allowing for flexible use depending on the task complexity.

Users can now access and deploy these models across platforms like Hugging Face, ModelScope, Kaggle, and GitHub, as well as interact with them directly via the Qwen Chat web interface and mobile applications. The release includes both mixture-of-experts (MoE) and dense models, all available under the Apache 2.0 open source license.

In my brief usage of the Qwen Chat website so far, it was able to generate imagery relatively rapidly and with decent prompt adherence — especially when incorporating text into the image natively while matching the style. However, it often prompted me to log in and was subject to the usual Chinese content restrictions (such as prohibiting prompts or responses related to the Tiananmen Square protests).

In addition to the MoE offerings, Qwen3 includes dense models at different scales: Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B. These models vary in size and architecture, offering users options to fit diverse needs and computational budgets.

The Qwen3 models also significantly expand multilingual support, now covering 119 languages and dialects across major language families. This broadens the models’ potential applications globally, facilitating research and deployment in a wide range of linguistic contexts.

Model training and architecture

In terms of model training, Qwen3 represents a substantial step up from its predecessor, Qwen2.5.
The pretraining dataset doubled in size to approximately 36 trillion tokens. The data sources include web crawls, PDF-like document extractions, and synthetic content generated using previous Qwen models focused on math and coding. The training pipeline consisted of a three-stage pretraining process followed by a four-stage post-training refinement to enable the hybrid thinking and non-thinking capabilities. The training improvements allow the dense base models of Qwen3 to match or exceed the performance of much larger Qwen2.5 models. Deployment options are versatile. Users can integrate Qwen3 models using frameworks such as SGLang and vLLM, both of which offer OpenAI-compatible endpoints. For local usage, options like Ollama, LMStudio, MLX, llama.cpp, and KTransformers are recommended. Additionally, users interested in the models’ agentic capabilities are encouraged to explore the Qwen-Agent toolkit, which simplifies tool-calling operations. Junyang Lin, a member of the Qwen team, commented on X that building Qwen3 involved addressing critical but less glamorous technical challenges such as scaling reinforcement learning stably, balancing multi-domain data, and expanding multilingual performance without quality sacrifice. Lin also indicated that the team is transitioning focus toward training agents capable of long-horizon reasoning for real-world tasks. What it means for enterprise decision-makers Engineering teams can point existing OpenAI-compatible endpoints to the new model in hours instead of weeks. The MoE checkpoints (235 B parameters with 22 B active, and 30 B with 3 B active) deliver GPT-4-class reasoning at roughly the GPU memory cost of a 20–30 B dense model. Official LoRA and QLoRA hooks allow private fine-tuning without sending proprietary data to a third-party vendor. Dense variants from 0.6 B to 32 B make it easy to prototype on laptops and scale to multi-GPU clusters without rewriting prompts. Running the weights on-premises means all prompts and outputs can be logged and inspected. MoE sparsity reduces the number of active parameters per call, cutting the inference attack surface. The Apache-2.0 license removes usage-based legal hurdles, though organizations should still review export-control and governance implications of using a model trained by a China-based vendor. Yet at the same time, it also offers a viable alternative to other Chinese players including DeepSeek, Tencent, and ByteDance — as well as the myriad and growing number of North American models such as the aforementioned OpenAI, Google, Microsoft, Anthropic, Amazon, Meta and others. The permissive Apache 2.0 license — which allows for unlimited commercial usage — is also a big advantage over other open source players like Meta, whose licenses are more restrictive. It indicates furthermore that the race between AI providers to offer ever-more powerful and accessible models continues to remain highly competitive, and savvy organizations looking to cut costs should attempt to remain flexible and open to evaluating said new models for their AI agents and workflows. Looking ahead The Qwen team positions Qwen3 not just as an incremental improvement but as a significant
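As a concrete illustration of the deployment path and the thinking-mode soft switch described earlier, here is a minimal sketch that sends /think or /no_think to a Qwen3 model served behind an OpenAI-compatible endpoint (for example, via vLLM or SGLang). The base URL, API key and checkpoint name are placeholders, not values from the announcement.

```python
# Minimal sketch: calling a Qwen3 checkpoint behind an OpenAI-compatible
# endpoint (e.g., served by vLLM or SGLang) and toggling hybrid reasoning
# with the /think and /no_think soft switches. URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(question: str, think: bool = True) -> str:
    switch = "/think" if think else "/no_think"
    response = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",  # placeholder; any served Qwen3 variant
        messages=[{"role": "user", "content": f"{question} {switch}"}],
    )
    return response.choices[0].message.content

print(ask("Prove that the square root of 2 is irrational."))   # slower, step-by-step reasoning
print(ask("What is the capital of France?", think=False))      # fast, direct answer
```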


30 seconds vs. 3: The d1 reasoning framework that’s slashing AI response times

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Researchers from UCLA and Meta AI have introduced d1, a novel framework using reinforcement learning (RL) to significantly enhance the reasoning capabilities of diffusion-based large language models (dLLMs). While most attention has focused on autoregressive models like GPT, dLLMs offer unique advantages. Giving them strong reasoning skills could unlock new efficiencies and applications for enterprises. dLLMs represent a distinct approach to generating text compared to standard autoregressive models, potentially offering benefits in terms of efficiency and information processing, which could be valuable for various real-world applications. Understanding diffusion language models Most large language models (LLMs) like GPT-4o and Llama are autoregressive (AR). They generate text sequentially, predicting the next token based only on the tokens that came before it.  Diffusion language models (dLLMs) work differently. Diffusion models were initially used in image generation models like DALL-E 2, Midjourney and Stable Diffusion. The core idea involves gradually adding noise to an image until it’s pure static, and then training a model to meticulously reverse this process, starting from noise and progressively refining it into a coherent picture. Adapting this concept directly to language was tricky because text is made of discrete units (tokens), unlike the continuous pixel values in images. Researchers overcame this by developing masked diffusion language models. Instead of adding continuous noise, these models work by randomly masking out tokens in a sequence and training the model to predict the original tokens. This leads to a different generation process compared to autoregressive models. dLLMs start with a heavily masked version of the input text and gradually “unmask” or refine it over several steps until the final, coherent output emerges. This “coarse-to-fine” generation enables dLLMs to consider the entire context simultaneously at each step, as opposed to focusing solely on the next token. This difference gives dLLMs potential advantages, such as improved parallel processing during generation, which could lead to faster inference, especially for longer sequences. Examples of this model type include the open-source LLaDA and the closed-source Mercury model from Inception Labs.  “While autoregressive LLMs can use reasoning to enhance quality, this improvement comes at a severe compute cost with frontier reasoning LLMs incurring 30+ seconds in latency to generate a single response,” Aditya Grover, assistant professor of computer science at UCLA and co-author of the d1 paper, told VentureBeat. “In contrast, one of the key benefits of dLLMs is their computational efficiency. For example, frontier dLLMs like Mercury can outperform the best speed-optimized autoregressive LLMs from frontier labs by 10x in user throughputs.” Reinforcement learning for dLLMs Despite their advantages, dLLMs still lag behind autoregressive models in reasoning abilities. Reinforcement learning has become crucial for teaching LLMs complex reasoning skills. By training models based on reward signals (essentially rewarding them for correct reasoning steps or final answers) RL has pushed LLMs toward better instruction-following and reasoning.  
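Before getting into the RL recipe, the coarse-to-fine generation process described above can be sketched in a few lines. This is illustrative pseudocode under assumed interfaces (a Hugging Face-style model that returns logits and a tokenizer with a mask token), not the actual LLaDA or Mercury implementation.

```python
# Schematic sketch of masked-diffusion ("coarse-to-fine") text generation:
# start fully masked, then repeatedly unmask the positions the model is most
# confident about. Interfaces and the confidence heuristic are assumptions.
import torch

def diffusion_generate(model, tokenizer, prompt_ids, gen_len=64, steps=8):
    mask_id = tokenizer.mask_token_id
    # Append a fully masked completion to the (1-D) prompt token tensor.
    seq = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, dtype=torch.long)])
    masked = seq == mask_id
    per_step = max(1, int(masked.sum()) // steps)

    for _ in range(steps):
        if not masked.any():
            break
        logits = model(seq.unsqueeze(0)).logits[0]   # assumed [seq_len, vocab] output
        conf, pred = logits.softmax(-1).max(-1)      # best token and its confidence
        conf[~masked] = -1.0                         # only consider still-masked slots
        fill = conf.topk(min(per_step, int(masked.sum()))).indices
        seq[fill] = pred[fill]                       # unmask the most confident positions
        masked[fill] = False
    return seq
```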
Algorithms such as Proximal Policy Optimization (PPO) and the more recent Group Relative Policy Optimization (GRPO) have been central to applying RL effectively to autoregressive models. These methods typically rely on calculating the probability (or log probability) of the generated text sequence under the model’s current policy to guide the learning process. This calculation is straightforward for autoregressive models due to their sequential, token-by-token generation. However, for dLLMs, with their iterative, non-sequential generation process, directly computing this sequence probability is difficult and computationally expensive. This has been a major roadblock to applying established RL techniques to improve dLLM reasoning. The d1 framework tackles this challenge with a two-stage post-training process designed specifically for masked dLLMs: Supervised fine-tuning (SFT): First, the pre-trained dLLM is fine-tuned on a dataset of high-quality reasoning examples. The paper uses the “s1k” dataset, which contains detailed step-by-step solutions to problems, including examples of self-correction and backtracking when errors occur. This stage aims to instill foundational reasoning patterns and behaviors into the model. Reinforcement learning with diffu-GRPO: After SFT, the model undergoes RL training using a novel algorithm called diffu-GRPO. This algorithm adapts the principles of GRPO to dLLMs. It introduces an efficient method for estimating log probabilities while avoiding the costly computations previously required. It also incorporates a clever technique called “random prompt masking.” During RL training, parts of the input prompt are randomly masked in each update step. This acts as a form of regularization and data augmentation, allowing the model to learn more effectively from each batch of data. d1 in real-world applications The researchers applied the d1 framework to LLaDA-8B-Instruct, an open-source dLLM. They fine-tuned it using the s1k reasoning dataset for the SFT stage. They then compared several versions: the base LLaDA model, LLaDA with only SFT, LLaDA with only diffu-GRPO and the full d1-LLaDA (SFT followed by diffu-GRPO). These models were tested on mathematical reasoning benchmarks (GSM8K, MATH500) and logical reasoning tasks (4×4 Sudoku, Countdown number game). The results showed that the full d1-LLaDA consistently achieved the best performance across all tasks. Impressively, diffu-GRPO applied alone also significantly outperformed SFT alone and the base model.  “Reasoning-enhanced dLLMs like d1 can fuel many different kinds of agents for enterprise workloads,” Grover said. “These include coding agents for instantaneous software engineering, as well as ultra-fast deep research for real-time strategy and consulting… With d1 agents, everyday digital workflows can become automated and accelerated at the same time.” Interestingly, the researchers observed qualitative improvements, especially when generating longer responses. The models began to exhibit “aha moments,” demonstrating self-correction and backtracking behaviors learned from the examples in the s1k dataset. This suggests the model isn’t just memorizing answers but learning more robust problem-solving strategies. Autoregressive models have a first-mover advantage in terms of adoption. However, Grover believes that advances in dLLMs can change the dynamics of the playing field. For an enterprise, one way to decide between the two is if their application is currently bottlenecked by latency or cost constraints. 
According to Grover, reasoning-enhanced diffusion dLLMs such as d1 can help in one of two complementary


Swapping LLMs isn’t plug-and-play: Inside the hidden cost of model migration

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Swapping large language models (LLMs) is supposed to be easy, isn’t it? After all, if they all speak “natural language,” switching from GPT-4o to Claude or Gemini should be as simple as changing an API key… right? In reality, each model interprets and responds to prompts differently, making the transition anything but seamless. Enterprise teams who treat model switching as a “plug-and-play” operation often grapple with unexpected regressions: broken outputs, ballooning token costs or shifts in reasoning quality. This story explores the hidden complexities of cross-model migration, from tokenizer quirks and formatting preferences to response structures and context window performance. Based on hands-on comparisons and real-world tests, this guide unpacks what happens when you switch from OpenAI to Anthropic or Google’s Gemini and what your team needs to watch for. Understanding Model Differences Each AI model family has its own strengths and limitations. Some key aspects to consider include: Tokenization variations—Different models use different tokenization strategies, which impact the input prompt length and its total associated cost. Context window differences—Most flagship models allow a context window of 128K tokens; however, Gemini extends this to 1M and 2M tokens. Instruction following – Reasoning models prefer simpler instructions, while chat-style models require clean and explicit instructions.  Formatting preferences – Some models prefer markdown while others prefer XML tags for formatting. Model response structure—Each model has its own style of generating responses, which affects verbosity and factual accuracy. Some models perform better when allowed to “speak freely,” i.e., without adhering to an output structure, while others prefer JSON-like output structures. Interesting research shows the interplay between structured response generation and overall model performance. Migrating from OpenAI to Anthropic Imagine a real-world scenario where you’ve just benchmarked GPT-4o, and now your CTO wants to try Claude 3.5. Make sure to refer to the pointers below before making any decision: Tokenization variations All model providers pitch extremely competitive per-token costs. For example, this post shows how the tokenization costs for GPT-4 plummeted in just one year between 2023 and 2024. However, from a machine learning (ML) practitioner’s viewpoint, making model choices and decisions based on purported per-token costs can often be misleading.  A practical case study comparing GPT-4o and Sonnet 3.5 exposes the verbosity of Anthropic models’ tokenizers. In other words, the Anthropic tokenizer tends to break down the same text input into more tokens than OpenAI’s tokenizer.  Context window differences Each model provider is pushing the boundaries to allow longer and longer input text prompts. However, different models may handle different prompt lengths differently. For example, Sonnet-3.5 offers a larger context window up to 200K tokens as compared to the 128K context window of GPT-4. Despite this, it is noticed that OpenAI’s GPT-4 is the most performant in handling contexts up to 32K, whereas Sonnet-3.5’s performance declines with increased prompts longer than 8K-16K tokens. 
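To make the tokenization point above concrete before returning to context windows: the same prompt maps to different token counts under different encodings, which compounds into both cost and how quickly a context window fills. The sketch below uses OpenAI's tiktoken and assumes a recent version that maps gpt-4o to the o200k_base encoding; Anthropic's tokenizer is not public, so its counts would need to come from the vendor's own token-counting endpoint.

```python
# Sketch: count tokens for the same prompt under two OpenAI encodings.
# Assumes a tiktoken version that knows the gpt-4o (o200k_base) mapping.
import tiktoken

prompt = "Summarize the attached quarterly report and list the three biggest risks."

for model_name in ("gpt-4", "gpt-4o"):  # cl100k_base vs. o200k_base encodings
    enc = tiktoken.encoding_for_model(model_name)
    print(model_name, len(enc.encode(prompt)), "tokens")
```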
Moreover, there is evidence that different context lengths are treated differently within intra-family models by the LLM, i.e., better performance at short contexts and worse performance at longer contexts for the same given task. This means that replacing one model with another (either from the same or a different family) might result in unexpected performance deviations. Formatting preferences Unfortunately, even the current state-of-the-art LLMs are highly sensitive to minor prompt formatting. This means the presence or absence of formatting in the form of markdown and XML tags can highly vary the model performance on a given task. Empirical results across multiple studies suggest that OpenAI models prefer markdownified prompts including sectional delimiters, emphasis, lists, etc. In contrast, Anthropic models prefer XML tags for delineating different parts of the input prompt. This nuance is commonly known to data scientists and there is ample discussion on the same in public forums (Has anyone found that using markdown in the prompt makes a difference?, Formatting plain text to markdown, Use XML tags to structure your prompts). For more insights, check out the official best prompt engineering practices released by OpenAI and Anthropic, respectively.   Model response structure OpenAI GPT-4o models are generally biased toward generating JSON-structured outputs. However, Anthropic models tend to adhere equally to the requested JSON or XML schema, as specified in the user prompt. However, imposing or relaxing the structures on models’ outputs is a model-dependent and empirically driven decision based on the underlying task. During a model migration phase, modifying the expected output structure would also entail slight adjustments in the post-processing of the generated responses. Cross-model platforms and ecosystems LLM switching is more complicated than it looks. Recognizing the challenge, major enterprises are increasingly focusing on providing solutions to tackle it. Companies like Google (Vertex AI), Microsoft (Azure AI Studio) and AWS (Bedrock) are actively investing in tools to support flexible model orchestration and robust prompt management. For example, Google Cloud Next 2025 recently announced that Vertex AI allows users to work with more than 130 models by facilitating an expanded model garden, unified API access, and the new feature AutoSxS, which enables head-to-head comparisons of different model outputs by providing detailed insights into why one model’s output is better than the other. Standardizing model and prompt methodologies Migrating prompts across AI model families requires careful planning, testing and iteration. By understanding the nuances of each model and refining prompts accordingly, developers can ensure a smooth transition while maintaining output quality and efficiency. ML practitioners must invest in robust evaluation frameworks, maintain documentation of model behaviors and collaborate closely with product teams to ensure the model outputs align with end-user expectations. Ultimately, standardizing and formalizing the model and prompt migration methodologies will equip teams to future-proof their applications, leverage best-in-class models as they emerge, and deliver users more reliable, context-aware, and cost-efficient AI experiences. source
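To illustrate the formatting preferences discussed above, here is the same instruction laid out as a markdown-style prompt (the style OpenAI models tend to respond well to) and as an XML-tagged prompt (the style Anthropic's guidance recommends). Both templates are illustrative, not official vendor formats.

```python
# Illustrative prompt templates only; neither string is tied to a specific API.
markdown_prompt = """\
## Task
Summarize the contract below in three bullet points.

## Contract
{contract_text}
"""

xml_prompt = """\
<task>Summarize the contract below in three bullet points.</task>
<contract>
{contract_text}
</contract>
"""

contract = "The supplier shall deliver goods within 30 days of the purchase order."
print(markdown_prompt.format(contract_text=contract))
print(xml_prompt.format(contract_text=contract))
```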


Beyond A2A and MCP: How LOKA’s Universal Agent Identity Layer changes the game

Agentic interoperability is gaining steam, but organizations continue to propose new interoperability protocols as the industry figures out which standards to adopt.

A group of researchers from Carnegie Mellon University has proposed a new interoperability protocol governing autonomous AI agents’ identity, accountability and ethics. Layered Orchestration for Knowledgeful Agents, or LOKA, could join other proposed standards like Google’s Agent2Agent (A2A) and Model Context Protocol (MCP) from Anthropic.

In a paper, the researchers noted that the rise of AI agents underscores the importance of governing them.

“As their presence expands, the need for a standardized framework to govern their interactions becomes paramount,” the researchers wrote. “Despite their growing ubiquity, AI agents often operate within siloed systems, lacking a common protocol for communication, ethical reasoning, and compliance with jurisdictional regulations. This fragmentation poses significant risks, such as interoperability issues, ethical misalignment, and accountability gaps.”

To address this, they propose the open-source LOKA, which would enable agents to prove their identity, “exchange semantically rich, ethically annotated messages,” add accountability, and establish ethical governance throughout the agent’s decision-making process. LOKA builds on what the researchers refer to as a Universal Agent Identity Layer, a framework that assigns agents a unique and verifiable identity.

“We envision LOKA as a foundational architecture and a call to reexamine the core elements—identity, intent, trust and ethical consensus—that should underpin agent interactions. As the scope of AI agents expands, it is crucial to assess whether our existing infrastructure can responsibly facilitate this transition,” Rajesh Ranjan, one of the researchers, told VentureBeat.

LOKA layers

LOKA works as a layered stack. The first layer revolves around identity, which lays out what the agent is. This includes a decentralized identifier, or a “unique, cryptographically verifiable ID.” This would let users and other agents verify the agent’s identity.

The next layer is the communication layer, where the agent informs another agent of its intention and the task it needs to accomplish. This is followed by the ethics layer and the security layer.

LOKA’s ethics layer lays out how the agent behaves. It incorporates “a flexible yet robust ethical decision-making framework that allows agents to adapt to varying ethical standards depending on the context in which they operate.” The LOKA protocol employs collective decision-making models, allowing agents within the framework to determine their next steps and assess whether those steps align with ethical and responsible AI standards. Meanwhile, the security layer utilizes what the researchers describe as “quantum-resilient cryptography.”

What differentiates LOKA

The researchers said LOKA stands out because it establishes crucial information for agents to communicate with other agents and operate autonomously across different systems. LOKA could be helpful for enterprises looking to ensure the safety of agents they deploy in the world and to gain a traceable way to understand how an agent made its decisions. A fear many enterprises have is that an agent will tap into another system or access private data and make a mistake.
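To make the layered design more concrete, here is a purely hypothetical sketch of what a LOKA-style identity record and ethically annotated intent message might look like. The paper does not publish a wire format; every field name below is an illustrative assumption.

```python
# Hypothetical illustration of LOKA-style identity and intent structures.
# Field names are invented for illustration; the paper defines no wire format.
from dataclasses import dataclass, field

@dataclass
class AgentIdentity:
    did: str            # decentralized, cryptographically verifiable identifier
    public_key: str     # lets users and other agents verify signed messages
    issuer: str         # registry or authority that anchored the identifier

@dataclass
class IntentMessage:
    sender: AgentIdentity
    intent: str                                            # what the agent wants to do
    task: dict                                             # semantically rich task description
    ethics_tags: list[str] = field(default_factory=list)   # ethical annotations
    signature: str = ""                                    # produced with the sender's private key

message = IntentMessage(
    sender=AgentIdentity(did="did:example:agent-42", public_key="...", issuer="registry.example"),
    intent="fetch_customer_record",
    task={"record_id": "A-1001", "purpose": "support ticket triage"},
    ethics_tags=["pii-access", "purpose-limited"],
)
```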
Ranjan said the system “highlights the need to define who agents are and how they make decisions and how they’re held accountable.”  “Our vision is to illuminate the critical questions that are often overshadowed in the rush to scale AI agents: How do we create ecosystems where these agents can be trusted, held accountable, and ethically interoperable across diverse systems?” Ranjan said.  LOKA will have to compete with other agentic protocols and standards that are now emerging. Protocols like MCP and A2A have found a large audience, not just because of the technical solutions they provide, but because these projects are backed by organizations people know. Anthropic started MCP, while Google backs A2A, and both protocols have gathered many companies open to use — and improve — these standards.  LOKA operates independently, but Ranjan said they’ve received “very encouraging and exciting feedback” from other researchers and other institutions to expand the LOKA research project.  source


Is your AI product actually working? How to develop the right metric system

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More In my first stint as a machine learning (ML) product manager, a simple question inspired passionate debates across functions and leaders: How do we know if this product is actually working? The product in question that I managed catered to both internal and external customers. The model enabled internal teams to identify the top issues faced by our customers so that they could prioritize the right set of experiences to fix customer issues. With such a complex web of interdependencies among internal and external customers, choosing the right metrics to capture the impact of the product was critical to steer it towards success. Not tracking whether your product is working well is like landing a plane without any instructions from air traffic control. There is absolutely no way that you can make informed decisions for your customer without knowing what is going right or wrong. Additionally, if you do not actively define the metrics, your team will identify their own back-up metrics. The risk of having multiple flavors of an ‘accuracy’ or ‘quality’ metric is that everyone will develop their own version, leading to a scenario where you might not all be working toward the same outcome. For example, when I reviewed my annual goal and the underlying metric with our engineering team, the immediate feedback was: “But this is a business metric, we already track precision and recall.”  First, identify what you want to know about your AI product Once you do get down to the task of defining the metrics for your product — where to begin? In my experience, the complexity of operating an ML product with multiple customers translates to defining metrics for the model, too. What do I use to measure whether a model is working well? Measuring the outcome of internal teams to prioritize launches based on our models would not be quick enough; measuring whether the customer adopted solutions recommended by our model could risk us drawing conclusions from a very broad adoption metric (what if the customer didn’t adopt the solution because they just wanted to reach a support agent?). Fast-forward to the era of large language models (LLMs) — where we don’t just have a single output from an ML model, we have text answers, images and music as outputs, too. The dimensions of the product that require metrics now rapidly increases — formats, customers, type … the list goes on. Across all my products, when I try to come up with metrics, my first step is to distill what I want to know about its impact on customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are a few examples: Did the customer get an output? → metric for coverage How long did it take for the product to provide an output? → metric for latency Did the user like the output? → metrics for customer feedback, customer adoption and retention Once you identify your key questions, the next step is to identify a set of sub-questions for ‘input’ and ‘output’ signals. Output metrics are lagging indicators where you can measure an event that has already happened. Input metrics and leading indicators can be used to identify trends or predict outcomes. See below for ways to add the right sub-questions for lagging and leading indicators to the questions above. Not all questions need to have leading/lagging indicators. Did the customer get an output? 
→ coverage

How long did it take for the product to provide an output? → latency

Did the user like the output? → customer feedback, customer adoption and retention
Did the user indicate that the output is right/wrong? (output)
Was the output good/fair? (input)

The third and final step is to identify the method to gather metrics. Most metrics are gathered at scale by new instrumentation via data engineering. However, in some instances (like question 3 above), especially for ML-based products, you have the option of manual or automated evaluations that assess the model outputs. While it’s always best to develop automated evaluations, starting with manual evaluations for “was the output good/fair” and creating a rubric for the definitions of good, fair and not good will help you lay the groundwork for a rigorous and tested automated evaluation process, too.

Example use cases: AI search, listing descriptions

The above framework can be applied to any ML-based product to identify the list of primary metrics for your product. Let’s take search as an example.

Did the customer get an output? → Coverage
Metric: % of search sessions with search results shown to the customer (Output)

How long did it take for the product to provide an output? → Latency
Metric: Time taken to display search results for the user (Output)

Did the user like the output? → Customer feedback, customer adoption and retention
Did the user indicate that the output is right/wrong?
Metric: % of search sessions with ‘thumbs up’ feedback on search results from the customer, or % of search sessions with clicks from the customer (Output)
Was the output good/fair?
Metric: % of search results marked as ‘good/fair’ for each search term, per quality rubric (Input)

How about a product to generate descriptions for a listing (whether it’s a menu item on DoorDash or a product listing on Amazon)?

Did the customer get an output? → Coverage
Metric: % of listings with a generated description (Output)

How long did it take for the product to provide an output? → Latency
Metric: Time taken to generate descriptions for the user (Output)

Did the user like the output? → Customer feedback, customer adoption and retention
Did the user indicate that the output is right/wrong?
Metric: % of listings with generated descriptions that required edits from the technical content team/seller/customer (Output)
Was the output good/fair?
Metric: % of listing descriptions marked as ‘good/fair’, per quality rubric (Input)
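As a rough illustration of how the output metrics above might be computed once instrumentation is in place, here is a small sketch over a hypothetical search-event log. The field names (results_shown, latency_ms, feedback) are assumptions, not a prescribed schema.

```python
# Illustrative sketch: coverage, latency and feedback metrics from a
# hypothetical event log. Column names are assumptions for the example.
import pandas as pd

events = pd.DataFrame([
    {"session_id": 1, "results_shown": True,  "latency_ms": 180, "feedback": "up"},
    {"session_id": 2, "results_shown": True,  "latency_ms": 420, "feedback": None},
    {"session_id": 3, "results_shown": False, "latency_ms": None, "feedback": None},
])

coverage = events["results_shown"].mean()                    # % sessions with an output
latency_p95 = events["latency_ms"].dropna().quantile(0.95)   # how long the output took
thumbs_up_rate = (events["feedback"] == "up").mean()         # did the user like it?

print(f"coverage={coverage:.0%}, p95 latency={latency_p95:.0f} ms, thumbs-up rate={thumbs_up_rate:.0%}")
```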


Does RAG make LLMs less safe?  Bloomberg research reveals hidden dangers

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Retrieval Augmented Generation (RAG) is supposed to help improve the accuracy of enterprise AI by providing grounded content. While that is often the case, there is also an unintended side effect. According to surprising new research published today by Bloomberg, RAG can potentially make large language models (LLMs) unsafe.  Bloomberg’s paper, ‘RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models,’ evaluated 11 popular LLMs including Claude-3.5-Sonnet, Llama-3-8B and GPT-4o. The findings contradict conventional wisdom that RAG inherently makes AI systems safer. The Bloomberg research team discovered that when using RAG, models that typically refuse harmful queries in standard settings often produce unsafe responses. Alongside the RAG research, Bloomberg released a second paper, ‘Understanding and Mitigating Risks of Generative AI in Financial Services,’ that introduces a specialized AI content risk taxonomy for financial services that addresses domain-specific concerns not covered by general-purpose safety approaches. The research challenges widespread assumptions that retrieval-augmented generation (RAG) enhances AI safety, while demonstrating how existing guardrail systems fail to address domain-specific risks in financial services applications. “Systems need to be evaluated in the context they’re deployed in, and you might not be able to just take the word of others that say, Hey, my model is safe, use it, you’re good,” Sebastian Gehrmann, Bloomberg’s Head of Responsible AI, told VentureBeat.  RAG systems can make LLMs less safe, not more RAG is widely used by enterprise AI teams to provide grounded content. The goal is to provide accurate, updated information.  There has been a lot of research and advancement in RAG in recent months to further improve accuracy as well. Earlier this month a new open-source framework called Open RAG Eval debuted to help validate RAG efficiency. It’s important to note that Bloomberg’s research is not questioning the efficacy of RAG or its ability to reduce hallucination. That’s not what the research is about. Rather it’s about how RAG usage impacts LLM guardrails in an unexpected way. The research team discovered that when using RAG, models that typically refuse harmful queries in standard settings often produce unsafe responses. For example, Llama-3-8B’s unsafe responses jumped from 0.3% to 9.2% when RAG was implemented. Gehrmann explained that without RAG being in place, if a user typed in a malicious query, the built-in safety system or guardrails will typically block the query. Yet for some reason, when the same query is issued in an LLM that is using RAG, the system will answer the malicious query, even when the retrieved documents themselves are safe. “What we found is that if you use a large language model out of the box, often they have safeguards built in where, if you ask, ‘How do I do this illegal thing,’ it will say, ‘Sorry, I cannot help you do this,’” Gehrmann explained. “We found that if you actually apply this in a RAG setting, one thing that could happen is that the additional retrieved context, even if it does not contain any information that addresses the original malicious query, might still answer that original query.” How does RAG bypass enterprise AI guardrails? So why and how does RAG serve to bypass guardrails? 
The Bloomberg researchers were not entirely certain, though they did have a few ideas. Gehrmann hypothesized that the way the LLMs were developed and trained did not fully consider safety alignment for very long inputs. The research demonstrated that context length directly impacts safety degradation. “Provided with more documents, LLMs tend to be more vulnerable,” the paper states, showing that even introducing a single safe document can significantly alter safety behavior.

“I think the bigger point of this RAG paper is you really cannot escape this risk,” Amanda Stent, Bloomberg’s Head of AI Strategy and Research, told VentureBeat. “It’s inherent to the way RAG systems are. The way you escape it is by putting business logic or fact checks or guardrails around the core RAG system.”

Why generic AI safety taxonomies fail in financial services

Bloomberg’s second paper introduces a specialized AI content risk taxonomy for financial services, addressing domain-specific concerns like financial misconduct, confidential disclosure and counterfactual narratives. The researchers empirically demonstrated that existing guardrail systems miss these specialized risks. They tested open-source guardrail models including Llama Guard, Llama Guard 3, AEGIS and ShieldGemma against data collected during red-teaming exercises.

“We developed this taxonomy, and then ran an experiment where we took openly available guardrail systems that are published by other firms and we ran this against data that we collected as part of our ongoing red teaming events,” Gehrmann explained. “We found that these open source guardrails… do not find any of the issues specific to our industry.”

The researchers developed a framework that goes beyond generic safety models, focusing on risks unique to professional financial environments. Gehrmann argued that general-purpose guardrail models are usually developed for risks specific to consumer-facing applications, so they are very much focused on toxicity and bias. He noted that, while important, those concerns are not necessarily specific to any one industry or domain. The key takeaway from the research is that organizations need to have a domain-specific taxonomy in place for their own industry and application use cases.

Responsible AI at Bloomberg

Bloomberg has made a name for itself over the years as a trusted provider of financial data systems. In some respects, gen AI and RAG systems could potentially be seen as competitive against Bloomberg’s traditional business, and therefore there could be some hidden bias in the research.

“We are in the business of giving our clients the best data and analytics and the broadest ability to discover, analyze and synthesize information,” Stent said. “Generative AI is a tool that can really help with discovery, analysis and synthesis across data and analytics, so for us, it’s a benefit.” She added that the kinds of bias that Bloomberg is concerned about with its AI solutions are


Relyance AI builds ‘x-ray vision’ for company data: Cuts AI compliance time by 80% while solving trust crisis

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Relyance AI, a data governance platform provider that secured $32.1 million in Series B funding last October, is launching a new solution aimed at solving one of the most pressing challenges in enterprise AI adoption: understanding exactly how data moves through complex systems. The company’s new Data Journeys platform, announced today, addresses a critical blind spot for organizations implementing AI — tracking not just where data resides, but how and why it’s being used across applications, cloud services, and third-party systems. “The fundamental premise is making sure that our customers have this AI native, context-aware view, very visual view of the entire journey of data across their applications, services, infrastructures, third parties,” said Abhi Sharma, CEO and co-founder of Relyance AI, in an exclusive interview with VentureBeat. “You can really get at the heart of the why of data processing, which is the most foundational layer needed for general AI governance.” The launch comes at a pivotal moment for enterprise AI governance. As companies accelerate AI implementation, they face mounting pressure from regulators worldwide. More than a quarter of Fortune 500 companies have identified AI regulation as a risk in SEC filings, and GDPR-related fines reached €1.2 billion in 2024 alone (approximately $1.26 billion at current exchange rates). How Data Journeys tracks information flow where others fall short The platform represents a significant evolution from conventional data lineage approaches, which typically track data movement on a table-to-table or column-to-column basis within specific systems. “The status quo for data lineage is basically table to table and column level lineage. I can see how data moved within my Snowflake instance or within my S3 buckets,” Sharma explained. “But nobody can answer: Where did it come from originally? What nuanced transformation happened between data pipelines, third-party vendors, API calls, RAG architectures, to finally land up here?” Data Journeys aims to provide this comprehensive view, showing the complete data lifecycle from original collection through every transformation and use case. The system starts with code analysis rather than simply connecting to data repositories, giving it context about why data is being processed in specific ways. The promise of AI comes with significant accountability for how data is used. After seeing Relyance AI Data Journeys, we immediately recognized its potential to revolutionize our approach to responsible AI development,” said Heather Allen, privacy officer and director of privacy management at CHG Healthcare. “The automated, context-aware data lineage capabilities would address our most pressing challenges. It represents exactly what we’ve been looking for to support our global AI governance framework. Four business problems that data visibility promises to solve According to Sharma, Data Journeys delivers value in four critical areas: First, compliance and risk management: “Today, you kind of are required to vouch for integrity of data processing, but you can’t see inside. It’s basically blind governance,” Sharma said. The platform enables organizations to prove the integrity of their data practices when facing regulatory scrutiny. 
Second, precise bias detection: Rather than just examining the immediate dataset used to train a model, companies can trace potential bias to its source. “Bias often happens at inference time, not because you had bias in the dataset,” Sharma noted. “The point is, it’s actually not that dataset. It’s the journey it took.” Third, explainability and accountability: For high-stakes AI decisions like loan approvals or medical diagnoses, understanding the complete data provenance becomes essential. “The why behind that is super important, and many times, the incorrect behavior of the model is completely dependent on the multiple steps it took before the inference time,” Sharma explained. Finally, regulatory compliance: The platform provides what Sharma calls a “mathematical proof point” that companies are using data appropriately, helping them navigate increasingly complex global regulations. From hours to minutes: Measurable returns on better data oversight Relyance claims the platform delivers measurable returns on investment. According to Sharma, customers have seen 70-80% time savings in compliance documentation and evidence gathering. What he calls “time to certainty”—the ability to quickly answer questions about how specific data is being used—has been reduced from hours to minutes. In one example Sharma shared, a direct-to-consumer company was switching payment processors from Braintree to Stripe. An engineer working on the project inadvertently created code that stored credit card information in plain text under the wrong column name in Snowflake. “We caught that at the time the code was checked in,” Sharma said. Without Data Journeys’ visual representation of data flows, this potential security incident might have gone undetected until much later. Keeping sensitive data inside your walls: The self-hosted option Alongside Data Journeys, Relyance is introducing InHost, a self-hosted deployment model designed for organizations with strict data sovereignty requirements or those in highly regulated industries. “The industries that are most interested in the in-host option are more regulated industries — FinTech and healthcare,” said Sharma. This includes banking, fraud detection, credit worthiness applications, genetics, and personal healthcare services. The flexibility to deploy either in the cloud or within a company’s own infrastructure addresses growing concerns about sensitive data leaving organizational boundaries, particularly for AI applications that might process regulated information. Relyance AI’s expansion plans point to growing AI governance market Relyance is positioning Data Journeys as part of a broader strategy to become what Sharma calls “a unified AI-native platform” for global privacy compliance, data security posture management, and AI governance. “In the second half of this year, I’m launching an AI governance solution which will be a 360-degree management of all AI footprint in your environment,” Sharma revealed, encompassing compliance, real-time ethics monitoring, bias detection, and accountability for both third-party and in-house AI systems. The company’s long-term vision is ambitious. “AI agents are going to run the world, and we want to be that company that provides the infrastructure for organizations to trust and govern it,” Sharma said. “We want to help improve the data utility index of the world.” Investors bet big on


Google adds more AI tools to its Workspace productivity apps

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Google continues to bring its flagship AI models to its productivity apps, expanding its Gemini features.  The company today announced several updates to its Workspace products, including the addition of Audio Overviews and new streamlined methods for tracking meetings.  Audio Overviews, which was first introduced in Google’s popular NotebookLM, allows people to create podcasts on their chosen research topic.  Now, through Gemini, users can create audio files based on uploaded documents and slides. They can also generate audio overviews within deep research reports. These podcast-style audio files are downloadable. Audio Overview generates voices and grounds its discussions solely on the provided documents.  Google previously told VentureBeat that its tests showed some people prefer learning through listening, where information is presented in a conversational format.  The company also launched a new feature called Canvas in Gemini, which lets people create drafts and refine text or code using the Gemini model. Google said Canvas helps “generate, optimize and preview code.” Canvas documents can be shared with Google Docs.  Updated calendars Google also streamlined how users can add events and meetings to their calendars. Gemini will detect if an email contains details of events and can prompt people to add it to their calendar. The model will surface emails with potential appointments if the user misses them.  Some plug-ins for Google, such as Boomerang, offer similar features that display appointments above the subject line. The Gemini-powered calendar feature will open a Gemini chat window alerting the user of the event.  Pointing AI models to surface data or events from emails has become a cornerstone of enterprise AI assistants and agents. Microsoft’s new agents parse through emails for input. Startup Martin AI has an AI assistant that manages calendars, emails and to-do lists.   Melding generative AI with productivity Google added Gemini chat to Workspace last year to integrate the standalone chat platform with Gmail, Google Docs and Calendars. This brought Google closer to Microsoft’s Copilot, which added AI models to its productivity platforms, including Outlook.   Enterprises continue to add AI features to the workplace, and it’s possible that if their employees regularly access things like Gemini on their Gmails and use AI models for research, AI adoption rates may be even higher.  source


Ethically trained AI startup Pleias releases new small reasoning models optimized for RAG with built-in citations

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More French AI startup Pleias made waves late last year with the launch of its ethically trained Pleias 1.0 family of small language models — among the first and only to date to be built entirely on scraping “open” data, that is, data explicitly labeled as public domain, open source, or unlicensed and not copyrighted. Now the company has announced the release of two open source small-scale reasoning models designed specifically for retrieval-augmented generation (RAG), citation synthesis, and structured multilingual output. The launch includes two core models — Pleias-RAG-350M and Pleias-RAG-1B — each also available in CPU-optimized GGUF format, making a total of four deployment-ready variants. They are all based on Pleias 1.0, and can be used independently or in conjunction with other LLMs that the organization may already or plan to deploy. All appear to be available under a permissive Apache 2.0 open source license, meaning they are eligible for organizations to take, modify and deploy for commercial use cases. RAG, as you’ll recall, is the widely-used technique that enterprises and organizations can deploy to hook an AI large language model (LLM) such as OpenAI’s GPT-4o, Google’s Gemini 2.5 Flash, Anthropic’s Claude Sonnet 3.7 or Cohere’s Command-A, or open source alternatives like Llama 4 and DeepSeek V3 to external knowledge bases, such as enterprise documents and cloud storages. This is often necessary for enterprises that want to build chatbots and other AI applications that reference their internal policies or product catalogs (an alternative, prompting a long context LLM with all the information necessary, may not be suitable for enterprise use cases where security and per-token transmission costs are concerns). The Pleias-RAG model family is the latest effort to bridge the gap between accuracy and efficiency in small language models. These models are aimed at enterprises, developers, and researchers looking for cost-effective alternatives to large-scale language models without compromising traceability, multilingual capabilities, or structured reasoning workflows. The target userbase is actually Pleias’s home continent of Europe, as co-founder Alexander Doria told VentureBeat via direct message on the social network X: “A primary motivation has been the difficulty of scaling RAG applications in Europe. Most private organization have little GPUs (it may have changed but not long ago less than 2% of all [Nvidia] H100 [GPUs] were in Europe). And yet simultaneously there are strong incentive to self-host for regulated reasons, including GDPR. “SLMs have progressed significantly over the past year, yet they are too often conceived as ‘mini-chatbots’ and we have observed a significant drop of performance in non-English languages, both in terms of source understanding and quality of text generation. So we have been satisfied to hit most of our objectives: An actual alternative to 7-8b models for RAG even on CPU and other constrained infras. Fully verifiable models coming with citation support. Preservation of European language performance.” However, of course the models being open source under the Apache 2.0 license means anyone could take and use them freely anywhere in the world. 
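For a sense of how such a model might be wired into a RAG pipeline, here is a hedged sketch that passes a query plus retrieved sources to the 350M checkpoint via Hugging Face Transformers. The repository name and the query/source prompt layout are assumptions for illustration; the model card defines the exact expected format and citation syntax.

```python
# Hedged sketch: prompting a small RAG model with a query and retrieved sources.
# The checkpoint name and prompt layout are assumptions; check the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-RAG-350M"  # assumed Hugging Face repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "Query: What does the travel policy say about business-class flights?\n"
    "Source 1: Employees may book business class only for flights over six hours.\n"
    "Source 2: All exceptions require written approval from a department head.\n"
    "Answer with citations:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```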
Focused on grounding, citations, and facts A key feature of the new Pleias-RAG models is their native support for source citation with literal quotes, fully integrated into the model’s inference process. Unlike post-hoc citation methods or external chunking pipelines, the Pleias-RAG models generate citations directly, using a syntax inspired by Wikipedia’s reference format. This approach allows for shorter, more readable citation snippets while maintaining verifiability. Citation grounding plays a functional role in regulated settings. For sectors like healthcare, legal, and finance — where decision-making must be documented and traceable — these built-in references offer a direct path to auditability. Pleias positions this design choice as an ethical imperative, aligning with increasing regulatory demands for explainable AI. Proto agentic? Pleias-RAG models are described as “proto-agentic” — they can autonomously assess whether a query is understandable, determine if it is trivial or complex, and decide whether to answer, reformulate, or refuse based on source adequacy. Their structured output includes language detection, query and source analysis reports, and a reasoned answer. Despite their relatively small size (Pleias-RAG-350M has just 350 million parameters) the models exhibit behavior traditionally associated with larger, agentic systems. According to Pleias, these capabilities stem from a specialized mid-training pipeline that blends synthetic data generation with iterative reasoning prompts. Pleias-RAG-350M is explicitly designed for constrained environments. It performs well on standard CPUs, including mobile-class infrastructure. According to internal benchmarks, the unquantized GGUF version produces complete reasoning outputs in roughly 20 seconds on 8GB RAM setups. Its small footprint places it in a niche with very few competitors, such as Qwen-0.5 and SmolLM, but with a much stronger emphasis on structured source synthesis. Competitive performance across tasks and languages In benchmark evaluations, Pleias-RAG-350M and Pleias-RAG-1B outperform most open-weight models under 4 billion parameters, including Llama-3.1-8B and Qwen-2.5-7B, on tasks such as HotPotQA, 2WikiMultiHopQA, and MuSiQue. These multi-hop RAG benchmarks test the model’s ability to reason across multiple documents and identify distractors — common requirements in enterprise-grade knowledge systems. The models’ strength extends to multilingual scenarios. On translated benchmark sets across French, German, Spanish, and Italian, the Pleias models show negligible degradation in performance. This sets them apart from other SLMs, which typically experience a 10–35% performance loss when handling non-English queries. The multilingual support stems from careful tokenizer design and synthetic adversarial training that includes language-switching exercises. The models not only detect the language of a user query but aim to respond in the same language—an important feature for global deployments. In addition, Doria highlighted how the models could be used to augment the performance of other existing models an enterprise may already be using: “We envision the models to be used in orchestration setting, especially since their compute cost is low. A very interesting results on the evaluation side: even the 350m model turned out to be good on entirely different answers than the answers [Meta] Llama and [Alibaba] Qwen were performing at. So there’s a real
