VentureBeat

OpenAI launches o3-pro AI model, offering increased reliability and tool use for enterprises

Just hours after announcing a big price cut for its o3 reasoning model, OpenAI made o3-pro, an even more powerful version, available to developers.

o3-pro is “designed to think longer and provide the most reliable responses,” and has access to many more software tool integrations than its predecessor, making it potentially appealing to enterprises and developers searching for high levels of detail and accuracy. However, the model is also slower than what many developers are accustomed to, a trade-off OpenAI attributes to the very tool access it says makes the model more accurate.

“Because o3-pro has access to tools, responses typically take longer than o1-pro to complete. We recommend using it for challenging questions where reliability matters more than speed, and waiting a few minutes is worth the tradeoff,” the company said in an email to reporters.

But how much longer? We asked OpenAI how much slower o3-pro is than o3 on average and will update this story when we receive a response from the company. On X, Hyperbolic Labs co-founder and CTO Yuchen Jin posted several screenshots of his o3-pro usage showing it took 3 minutes and $80 worth of tokens to respond to the phrase, “Hi, I’m Sam Altman.” Bindu Reddy, CEO of Abacus AI, also said o3-pro took 2 minutes to respond to “hey there.”

How to use OpenAI o3-pro

Developers can access o3-pro through the OpenAI API, and it is available to Pro and Team users of ChatGPT. The new model version replaces o1-pro in the model picker for paying ChatGPT users.

OpenAI said o3-pro “has access to tools that make ChatGPT useful,” such as searching the web, analyzing files, reasoning about visual inputs, running Python, and personalizing responses. The model, though, is pricey, which may give some enterprise developers pause. Based on OpenAI’s pricing page, o3-pro costs $20 per million input tokens and $80 per million output tokens, compared to o3 itself, which is now down to $2 and $8, a tenth of the price.

A more comprehensive model

OpenAI launched o3 and o4-mini in April, expanding its “o-series” of models that rely on reasoning and can “think with images.” The new o3-pro uses the same underlying model as o3.

Evaluations conducted by OpenAI showed that o3-pro can often outperform the base model. Expert reviewers ranked o3-pro higher in domains such as science, education, programming, business, and writing help. The company said o3-pro is more effective, more comprehensive, and follows instructions better.

Reasoning models have become a new battleground for model providers, with competitors like Google, Anthropic, and xAI, as well as rivals from China, such as DeepSeek, coming out with their own models designed to think through responses.

Currently, o3-pro is not able to generate images, and OpenAI has disabled temporary chats to resolve a technical issue. ChatGPT’s expanded workspace feature Canvas is also not yet accessible using o3-pro.

Some early users claim that o3-pro has been working remarkably well, but it is still early days, and the high cost of running it may deter some developers from experimenting with it.
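For developers who decide reliability is worth the wait, access is a standard API call. Below is a minimal sketch using OpenAI’s official Python client; it assumes o3-pro is exposed through the Responses API and that an OPENAI_API_KEY environment variable is set, so verify the exact interface against OpenAI’s current documentation.

```python
# Minimal sketch: querying o3-pro with the openai Python client.
# Assumptions: `pip install openai`, OPENAI_API_KEY set in the environment,
# and o3-pro served via the Responses API; check OpenAI's docs for the
# current interface before relying on this.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="o3-pro",
    input="Review these contract terms and flag any unusual liability clauses.",
)

# o3-pro can reason for minutes on hard problems, so run calls like this
# as background jobs rather than blocking an interactive request.
print(response.output_text)
```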
See some initial reactions below. As Ben Hylak, a former Apple Vision Pro interface designer and co-founder of AI observability startup Raindrop, wrote in a blog post about his early access to o3-pro: “It’s noticeably better at discerning what its environment is; accurately communicating what tools it has access to, when to ask questions about the outside world (rather than pretending it has the information/access), and choosing the right tool for the job.” OpenAI co-founder and CEO Sam Altman also highlighted Hylak’s blog post in an X post.

“First impressions of o3-pro: very powerful, answers take very long, so make sure you know what you’re asking before you submit. @sama, if it is reasoning for 13 minutes, why don’t we see more on the right-hand side?” — Avi Hacker, J.D. (@Avi_hacker1), June 10, 2025

“Should we bring o3-pro into ChatLLM? input – $20 / 1M tokens, output – $80 / 1M tokens. Expensive… could be fun though. Coming to LiveBench AI in a couple of hours…. I am expecting it to top every single leaderboard” — Bindu Reddy (@bindureddy), June 10, 2025

The launch also comes at a time when OpenAI said it has reached three million business users, with enterprise users surging 50% since February.


Groq just made Hugging Face way faster — and it’s coming for AWS and Google

Groq, the artificial intelligence inference startup, is making an aggressive play to challenge established cloud providers like Amazon Web Services and Google with two major announcements that could reshape how developers access high-performance AI models.

The company announced Monday that it now supports Alibaba’s Qwen3 32B language model with its full 131,000-token context window — a technical capability it claims no other fast inference provider can match. Simultaneously, Groq became an official inference provider on Hugging Face’s platform, potentially exposing its technology to millions of developers worldwide.

The move is Groq’s boldest attempt yet to carve out market share in the rapidly expanding AI inference market, where companies like AWS Bedrock, Google Vertex AI, and Microsoft Azure have dominated by offering convenient access to leading language models.

“The Hugging Face integration extends the Groq ecosystem providing developers choice and further reduces barriers to entry in adopting Groq’s fast and efficient AI inference,” a Groq spokesperson told VentureBeat. “Groq is the only inference provider to enable the full 131K context window, allowing developers to build applications at scale.”

How Groq’s 131K context window claims stack up against AI inference competitors

Groq’s assertion about context windows — the amount of text an AI model can process at once — strikes at a core limitation that has plagued practical AI applications. Most inference providers struggle to maintain speed and cost-effectiveness when handling large context windows, which are essential for tasks like analyzing entire documents or maintaining long conversations.

Independent benchmarking firm Artificial Analysis measured Groq’s Qwen3 32B deployment running at approximately 535 tokens per second, a speed that would allow real-time processing of lengthy documents or complex reasoning tasks. The company is pricing the service at $0.29 per million input tokens and $0.59 per million output tokens — rates that undercut many established providers.

Groq and Alibaba Cloud are the only providers supporting Qwen3 32B’s full 131,000-token context window, according to independent benchmarks from Artificial Analysis. Most competitors offer significantly smaller limits. (Credit: Groq)

“Groq offers a fully integrated stack, delivering inference compute that is built for scale, which means we are able to continue to improve inference costs while also ensuring performance that developers need to build real AI solutions,” the spokesperson explained when asked about the economic viability of supporting massive context windows.

The technical advantage stems from Groq’s custom Language Processing Unit (LPU) architecture, designed specifically for AI inference rather than the general-purpose graphics processing units (GPUs) that most competitors rely on. This specialized hardware approach allows Groq to handle memory-intensive operations like large context windows more efficiently.

Why Groq’s Hugging Face integration could unlock millions of new AI developers

The integration with Hugging Face represents perhaps the more significant long-term strategic move. Hugging Face has become the de facto platform for open-source AI development, hosting hundreds of thousands of models and serving millions of developers monthly. By becoming an official inference provider, Groq gains access to this vast developer ecosystem with streamlined billing and unified access. Developers can now select Groq as a provider directly within the Hugging Face Playground or API, with usage billed to their Hugging Face accounts.
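To illustrate how that selection works in practice, here is a minimal sketch using the huggingface_hub Python client; the provider argument and the "Qwen/Qwen3-32B" model ID are assumptions to verify against Hugging Face’s current documentation.

```python
# Minimal sketch: routing an inference request to Groq through Hugging Face.
# Assumptions: `pip install huggingface_hub`, an HF_TOKEN with inference
# permissions, and "Qwen/Qwen3-32B" as the hosted model ID; verify both
# against Hugging Face's current docs.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="groq",                 # route the request to Groq's LPU stack
    api_key=os.environ["HF_TOKEN"],  # usage billed to the HF account
)

completion = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Summarize this 100-page filing."}],
)
print(completion.choices[0].message.content)
```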
The integration supports a range of popular models including Meta’s Llama series, Google’s Gemma models, and the newly added Qwen3 32B. “This collaboration between Hugging Face and Groq is a significant step forward in making high-performance AI inference more accessible and efficient,” according to a joint statement.

The partnership could dramatically increase Groq’s user base and transaction volume, but it also raises questions about the company’s ability to maintain performance at scale.

Can Groq’s infrastructure compete with AWS Bedrock and Google Vertex AI at scale?

When pressed about infrastructure expansion plans to handle potentially significant new traffic from Hugging Face, the Groq spokesperson revealed the company’s current global footprint: “At present, Groq’s global infrastructure includes data center locations throughout the US, Canada and the Middle East, which are serving over 20M tokens per second.”

The company plans continued international expansion, though specific details were not provided. This global scaling effort will be crucial as Groq faces increasing pressure from well-funded competitors with deeper infrastructure resources. Amazon’s Bedrock service, for instance, leverages AWS’s massive global cloud infrastructure, while Google’s Vertex AI benefits from the search giant’s worldwide data center network. Microsoft’s Azure OpenAI service has similarly deep infrastructure backing.

However, Groq’s spokesperson expressed confidence in the company’s differentiated approach: “As an industry, we’re just starting to see the beginning of the real demand for inference compute. Even if Groq were to deploy double the planned amount of infrastructure this year, there still wouldn’t be enough capacity to meet the demand today.”

How aggressive AI inference pricing could impact Groq’s business model

The AI inference market has been characterized by aggressive pricing and razor-thin margins as providers compete for market share. Groq’s competitive pricing raises questions about long-term profitability, particularly given the capital-intensive nature of specialized hardware development and deployment.

“As we see more and new AI solutions come to market and be adopted, inference demand will continue to grow at an exponential rate,” the spokesperson said when asked about the path to profitability. “Our ultimate goal is to scale to meet that demand, leveraging our infrastructure to drive the cost of inference compute as low as possible and enabling the future AI economy.”

This strategy — betting on massive volume growth to achieve profitability despite low margins — mirrors approaches taken by other infrastructure providers, though success is far from guaranteed.

What enterprise AI adoption means for the $154 billion inference market

The announcements come as the AI inference market experiences explosive growth. Research firm Grand View Research estimates the global AI inference chip market will reach $154.9 billion by 2030, driven by increasing deployment of AI applications across industries.

For enterprise decision-makers, Groq’s moves represent both opportunity and risk. The company’s performance claims, if validated


Beyond GPT architecture: Why Google’s Diffusion approach could reshape LLM deployment

Last month, along with a comprehensive suite of new AI tools and innovations, Google DeepMind unveiled Gemini Diffusion. This experimental research model uses a diffusion-based approach to generate text.

Traditionally, large language models (LLMs) like GPT and Gemini itself have relied on autoregression, a step-by-step approach where each word is generated based on the previous one. Diffusion language models (DLMs), also known as diffusion-based large language models (dLLMs), leverage a method more commonly seen in image generation, starting with random noise and gradually refining it into a coherent output. This approach dramatically increases generation speed and can improve coherency and consistency.

Gemini Diffusion is currently available as an experimental demo; sign up for the waitlist to get access.

(Editor’s note: We’ll be unpacking paradigm shifts like diffusion-based language models — and what it takes to run them in production — at VB Transform, June 24-25 in San Francisco, alongside Google DeepMind, LinkedIn and other enterprise AI leaders.)

Understanding diffusion vs. autoregression

Diffusion and autoregression are fundamentally different approaches. The autoregressive approach generates text sequentially, with tokens predicted one at a time. While this method ensures strong coherence and context tracking, it can be computationally intensive and slow, especially for long-form content.

Diffusion models, by contrast, begin with random noise, which is gradually denoised into a coherent output. When applied to language, the technique has several advantages. Blocks of text can be processed in parallel, potentially producing entire segments or sentences at a much higher rate. Gemini Diffusion can reportedly generate 1,000-2,000 tokens per second; in contrast, Gemini 2.5 Flash has an average output speed of 272.4 tokens per second. Additionally, mistakes in generation can be corrected during the refining process, improving accuracy and reducing the number of hallucinations. There may be trade-offs in terms of fine-grained accuracy and token-level control; however, the increase in speed will be a game-changer for numerous applications.

How does diffusion-based text generation work?

During training, DLMs work by gradually corrupting a sentence with noise over many steps, until the original sentence is rendered entirely unrecognizable. The model is then trained to reverse this process, step by step, reconstructing the original sentence from increasingly noisy versions. Through this iterative refinement, it learns to model the entire distribution of plausible sentences in the training data.

While the specifics of Gemini Diffusion have not yet been disclosed, the typical training methodology for a diffusion model involves two key stages, illustrated in the sketch below:

- Forward diffusion: With each sample in the training dataset, noise is added progressively over multiple cycles (often 500 to 1,000) until it becomes indistinguishable from random noise.
- Reverse diffusion: The model learns to reverse each step of the noising process, essentially learning how to “denoise” a corrupted sentence one stage at a time, eventually restoring the original structure. This process is repeated millions of times with diverse samples and noise levels, enabling the model to learn a reliable denoising function.
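To make those two stages concrete, here is a minimal PyTorch sketch of one training step for a masking-style (absorbing-state) discrete text diffusion model. It is a generic illustration of the recipe described above, not Gemini Diffusion’s actual implementation, whose details are undisclosed; the `denoiser` interface and the reserved mask token are assumptions.

```python
# Generic sketch of one training step for a masking-style text diffusion
# model; not Gemini Diffusion's actual (undisclosed) implementation.
# Assumes a `denoiser` that maps (noisy_tokens, t) -> per-token logits.
import torch
import torch.nn.functional as F

MASK_ID = 0  # reserved "noise" token; an assumption for this sketch

def diffusion_training_step(denoiser, tokens, optimizer, num_steps=1000):
    batch, seq_len = tokens.shape
    # Forward diffusion: sample a noise level t per example and corrupt
    # roughly t/num_steps of the tokens by replacing them with MASK_ID.
    t = torch.randint(1, num_steps + 1, (batch, 1))
    corrupt = torch.rand(batch, seq_len) < (t.float() / num_steps)
    noisy = torch.where(corrupt, torch.full_like(tokens, MASK_ID), tokens)

    # Reverse diffusion objective: predict the original tokens at the
    # corrupted positions, conditioned on the noise level t.
    logits = denoiser(noisy, t)  # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits[corrupt], tokens[corrupt])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a typical setup, sampling then runs this in reverse: start from a fully masked sequence and repeatedly ask the denoiser to fill in tokens, re-masking the least confident predictions, until the whole block is committed.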
Once trained, the model is capable of generating entirely new sentences. DLMs generally require a condition or input, such as a prompt, class label, or embedding, to guide the generation towards desired outcomes. The condition is injected into each step of the denoising process, which shapes an initial blob of noise into structured and coherent text.

Advantages and disadvantages of diffusion-based models

In an interview with VentureBeat, Brendan O’Donoghue, research scientist at Google DeepMind and one of the leads on the Gemini Diffusion project, elaborated on some of the advantages of diffusion-based techniques when compared to autoregression. According to O’Donoghue, the major advantages of diffusion techniques are the following:

- Lower latencies: Diffusion models can produce a sequence of tokens in much less time than autoregressive models.
- Adaptive computation: Diffusion models will converge to a sequence of tokens at different rates depending on the task’s difficulty. This allows the model to consume fewer resources (and have lower latencies) on easy tasks and more on harder ones.
- Non-causal reasoning: Due to the bidirectional attention in the denoiser, tokens can attend to future tokens within the same generation block. This allows non-causal reasoning to take place and allows the model to make global edits within a block to produce more coherent text.
- Iterative refinement / self-correction: The denoising process involves sampling, which can introduce errors just like in autoregressive models. However, unlike autoregressive models, the tokens are passed back into the denoiser, which then has an opportunity to correct the error.

O’Donoghue also noted the main disadvantages: “higher cost of serving and slightly higher time-to-first-token (TTFT), since autoregressive models will produce the first token right away. For diffusion, the first token can only appear when the entire sequence of tokens is ready.”

Performance benchmarks

Google says Gemini Diffusion’s performance is comparable to Gemini 2.0 Flash-Lite.

| Benchmark | Type | Gemini Diffusion | Gemini 2.0 Flash-Lite |
| --- | --- | --- | --- |
| LiveCodeBench (v6) | Code | 30.9% | 28.5% |
| BigCodeBench | Code | 45.4% | 45.8% |
| LBPP (v2) | Code | 56.8% | 56.0% |
| SWE-Bench Verified* | Code | 22.9% | 28.5% |
| HumanEval | Code | 89.6% | 90.2% |
| MBPP | Code | 76.0% | 75.8% |
| GPQA Diamond | Science | 40.4% | 56.5% |
| AIME 2025 | Mathematics | 23.3% | 20.0% |
| BIG-Bench Extra Hard | Reasoning | 15.0% | 21.0% |
| Global MMLU (Lite) | Multilingual | 69.1% | 79.0% |

* Non-agentic evaluation (single turn edit only), max prompt length of 32K.

The two models were compared using several benchmarks, with scores based on how many times the model produced the correct answer on the first try. Gemini Diffusion performed well in coding and mathematics tests, while Gemini 2.0 Flash-Lite had the edge on reasoning, scientific knowledge, and multilingual capabilities.

As Gemini Diffusion evolves, there’s no reason to think that its performance won’t catch up with more established models. According to O’Donoghue, the gap between the two techniques is “essentially closed in terms of benchmark performance, at least at the relatively small sizes we have scaled up to. In fact, there may be some performance advantage for diffusion in some domains where non-local consistency is important, for example, coding and reasoning.”

Testing Gemini Diffusion

VentureBeat was granted access to the experimental demo.


The case for embedding audit trails in AI systems before scaling

Editor’s note: Emilia will lead an editorial roundtable on this topic at VB Transform this month. Register today.

Orchestration frameworks for AI services serve multiple functions for enterprises. They not only set out how applications or agents flow together, but they should also let administrators manage workflows and agents and audit their systems.

As enterprises begin to scale their AI services and put these into production, building a manageable, traceable, auditable and robust pipeline ensures their agents run exactly as they’re supposed to. Without these controls, organizations may not be aware of what is happening in their AI systems and may only discover an issue too late, when something goes wrong or they fail to comply with regulations.

Kevin Kiley, president of enterprise orchestration company Airia, told VentureBeat in an interview that frameworks must include auditability and traceability.

“It’s critical to have that observability and be able to go back to the audit log and show what information was provided at what point again,” Kiley said. “You have to know if it was a bad actor, or an internal employee who wasn’t aware they were sharing information, or if it was a hallucination. You need a record of that.”

Ideally, robustness and audit trails should be built into AI systems at a very early stage. Understanding the potential risks of a new AI application or agent, and ensuring it continues to perform to standards before deployment, would help ease concerns around putting AI into production. However, many organizations did not initially design their systems with traceability and auditability in mind; many AI pilot programs began life as experiments, started without an orchestration layer or an audit trail.

The big question enterprises now face is how to manage all their agents and applications, ensure their pipelines remain robust and, if something goes wrong, know what went wrong, while also monitoring AI performance.

Choosing the right method

Before building any AI application, however, experts said organizations need to take stock of their data. If a company knows which data it is comfortable letting AI systems access, and which data it fine-tuned a model with, it has a baseline to compare long-term performance against.

“When you run some of those AI systems, it’s more about, what kind of data can I validate that my system’s actually running properly or not?” Yrieix Garnier, vice president of products at Datadog, told VentureBeat in an interview. “That’s very hard to actually do, to understand that I have the right system of reference to validate AI solutions.”

Once the organization identifies and locates its data, it needs to establish dataset versioning — essentially assigning a timestamp or version number — to make experiments reproducible and understand what has changed in the model. These datasets and models, any applications that use them, authorized users and baseline runtime numbers can be loaded into either the orchestration or observability platform.

Just like when choosing foundation models to build with, orchestration teams need to consider transparency and openness. While some closed-source orchestration systems have numerous advantages, more open-source platforms could also offer benefits that some enterprises value, such as increased visibility into decision-making systems.
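To make the versioning-and-baseline step concrete, here is a minimal sketch using the open-source MLflow tracker; the tag names, file path and metric are illustrative choices for this sketch, not a prescribed schema.

```python
# Minimal sketch: recording a dataset version and an auditable baseline run
# with MLflow. Tag names, the file path and the metric are illustrative,
# not a standard schema. Assumes `pip install mlflow pandas`.
import hashlib
import mlflow
import pandas as pd

df = pd.read_parquet("customer_tickets.parquet")  # hypothetical dataset
dataset_hash = hashlib.sha256(
    pd.util.hash_pandas_object(df).values.tobytes()
).hexdigest()

mlflow.set_experiment("support-agent-baseline")
with mlflow.start_run() as run:
    # A content hash plus a human-readable version tag makes experiments
    # reproducible and gives auditors a fixed point of reference.
    mlflow.set_tag("dataset.version", "2025-06-10")
    mlflow.set_tag("dataset.sha256", dataset_hash)
    mlflow.log_param("base_model", "hypothetical-model-v1")
    mlflow.log_metric("baseline_accuracy", 0.91)  # from an offline eval
    print(f"Audit record stored under run {run.info.run_id}")
```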
Open-source platforms like MLflow, LangChain and Grafana provide agents and models with granular and flexible instructions and monitoring. Enterprises can choose to develop their AI pipeline through a single, end-to-end platform, such as Datadog, or utilize various interconnected tools from AWS.

Another consideration for enterprises is to plug in a system that maps agent and application responses to compliance tools or responsible AI policies. AWS and Microsoft both offer services that track AI tools and how closely they adhere to guardrails and other policies set by the user.

Kiley said one consideration for enterprises when building these reliable pipelines revolves around choosing a more transparent system. For Kiley, having no visibility into how AI systems work is a non-starter.

“Regardless of what the use case or even the industry is, you’re going to have those situations where you have to have flexibility, and a closed system is not going to work. There are providers out there that have great tools, but it’s sort of a black box. I don’t know how it’s arriving at these decisions. I don’t have the ability to intercept or interject at points where I might want to,” he said.

Join the conversation at VB Transform

I’ll be leading an editorial roundtable at VB Transform 2025 in San Francisco, June 24-25, called “Best practices to build orchestration frameworks for agentic AI,” and I’d love to have you join the conversation. Register today.


AI is rewriting the data playbook — and knowledge graphs are page one

Presented by Google Cloud

For decades, enterprise data infrastructure focused on answering the question: “What happened in our business?” Business intelligence tools, data warehouses, and pipelines were built to surface historical trends and performance snapshots, revealing past sales figures, customer patterns, and operational metrics. These systems worked well when decisions were driven by dashboards and quarterly reports.

But artificial intelligence has changed the game. Today’s most powerful systems don’t just summarize the past; they make real-time decisions. They go beyond static observation to dynamic reasoning — not just answering what happened, but why it happened, what is likely to happen and, most critically, what action should be taken next. Enterprises are realizing that traditional architectures, even in the cloud, are not enough. AI needs more than access to data. It requires access to meaning, and it needs to drive business outcomes for decision makers. That’s where knowledge graphs come in.

The hidden layer that makes AI work

There is a deeper “semantic” layer that is fundamental to AI success. How do enterprises take their data assets and expose the context, relationships, and metadata that allow AI models to perform deeper reasoning?

A knowledge graph represents real-world entities like people, places, and products, along with the relationships between them. Unlike traditional databases that store data in tables, knowledge graphs organize information as nodes and edges. This makes them better suited for AI systems that reason, infer, and act based on context.

Knowledge graphs helped solve key BI problems like brittle ETL and stale dashboards. Now, the same principles support AI. The demand for freshness and connected context is even more critical when algorithms must adapt and act in real time. Building this foundation requires understanding how knowledge graphs actually work in practice.

Designing data infrastructure that thinks

Once you recognize the need for a knowledge graph, your architecture must evolve. This is no mere modeling challenge. It is a shift in how data is ingested, connected, governed, and activated across the enterprise. Think of the AI data lifecycle in four stages: capture, process, analyze, and activate, with governance embedded throughout.

Integration is the first priority. A useful knowledge graph spans structured, semi-structured, and unstructured sources. That includes transaction logs, PDFs, and sensor streams, all mapped to a shared context. Entity resolution becomes foundational: recognizing that “John Smith” in CRM, “J. Smith” in emails, and employee ID 12345 all refer to the same person. Relationship inference then uncovers hidden links, like customers with the same billing address or products frequently bought together.

Next, the infrastructure must support graph-native operations. Traditional query engines optimize for filtering and aggregation. Knowledge graphs support traversal — moving from a user to a product to a supplier to a document, following relationships to discover insights that weren’t explicitly programmed. These traversals must be fast, flexible, and semantically accurate.

Finally, freshness and observability are essential. A stale or opaque graph leads to poor decisions. Your system must support real-time updates, lineage tracking, access controls, and monitoring of both graph quality and performance.
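To make the traversal idea concrete, here is a minimal sketch of a knowledge graph as typed edges in plain Python, with a multi-hop traversal of the kind described above; the entities and relations are invented for illustration.

```python
# Minimal sketch: a knowledge graph as typed edges, plus a multi-hop
# traversal (customer -> product -> supplier -> document). All entities
# and relations here are invented for illustration.
from collections import defaultdict

edges = defaultdict(list)  # node -> [(relation, neighbor)]

def add_edge(src, relation, dst):
    edges[src].append((relation, dst))

# After entity resolution, "John Smith", "J. Smith" and employee 12345
# all collapse into the single node "john_smith".
add_edge("john_smith", "PURCHASED", "laptop_x1")
add_edge("laptop_x1", "SUPPLIED_BY", "acme_corp")
add_edge("acme_corp", "FILED", "recall_notice_42")

def traverse(start, relations):
    """Follow a chain of relation types and return the nodes reached."""
    frontier = {start}
    for rel in relations:
        frontier = {
            dst
            for node in frontier
            for (r, dst) in edges[node]
            if r == rel
        }
    return frontier

# Which documents sit three hops away from this customer?
print(traverse("john_smith", ["PURCHASED", "SUPPLIED_BY", "FILED"]))
# -> {'recall_notice_42'}
```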
What Google learned from a decade of knowledge graphs

Google has spent over a decade building and running one of the world’s most widely used knowledge graphs. It powers Search, YouTube, and Maps, delivering contextual results to billions of users every day. When someone searches for “Jaguar,” the system doesn’t just return keyword matches — it infers whether they’re looking for the car, the animal, or the sports team. That shift from strings to entities is a defining feature of modern AI. This “strings vs. things” mindset enables AI to reason over relationships, not just patterns. That ability to understand meaning is what separates truly intelligent systems.

But building the graph is only half the job. Running it at scale — keeping it fresh, evolving schemas, protecting privacy, and maintaining speed — is a continuous engineering challenge. You don’t just build a graph. You operate it like a core platform. That’s why companies need partners with deep infrastructure and AI expertise. Knowledge graphs demand full-stack discipline across ingestion, modeling, governance, and delivery.

The intelligence layer for agentic AI

As AI shifts from summarizing the past to driving decisions, agentic AI pushes even further — pursuing business goals, invoking other tools, and chaining actions across systems. These agents need context, not just data, and knowledge graphs provide that context. Knowledge graphs serve as the system-of-intelligence layer for building smarter, more accurate and grounded agents, turning data into actions that drive business outcomes in agentic AI workflows. Just as knowledge graphs solved BI’s stale dashboards and brittle pipelines, they now power the real-time reasoning and coordination required for autonomous agents to act with intelligence and purpose.

Explore how agents can help across the data lifecycle.

Vinay Balasubramaniam is Director of Product at Google BigQuery.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact [email protected].


Just add humans: Oxford medical study underscores the missing link in chatbot testing

Headlines have been blaring it for years: Large language models (LLMs) can not only pass medical licensing exams but also outperform humans. GPT-4 could correctly answer U.S. medical exam licensing questions 90% of the time, even in the prehistoric AI days of 2023. Since then, LLMs have gone on to best the residents taking those exams and licensed physicians. Move over, Doctor Google, make way for ChatGPT, M.D.

But you may want more than a diploma from the LLM you deploy for patients. Like an ace medical student who can rattle off the name of every bone in the hand but faints at the first sight of real blood, an LLM’s mastery of medicine does not always translate directly into the real world.

A paper by researchers at the University of Oxford found that while LLMs could correctly identify relevant conditions 94.9% of the time when directly presented with test scenarios, human participants using LLMs to diagnose the same scenarios identified the correct conditions less than 34.5% of the time. Perhaps even more notably, patients using LLMs performed even worse than a control group that was merely instructed to diagnose themselves using “any methods they would typically employ at home.” The group left to their own devices was 76% more likely to identify the correct conditions than the group assisted by LLMs.

The Oxford study raises questions about the suitability of LLMs for medical advice and the benchmarks we use to evaluate chatbot deployments for various applications.

Guess your malady

Led by Dr. Adam Mahdi, researchers at Oxford recruited 1,298 participants to present themselves as patients to an LLM. They were tasked with both attempting to figure out what ailed them and the appropriate level of care to seek for it, ranging from self-care to calling an ambulance.

Each participant received a detailed scenario, representing conditions from pneumonia to the common cold, along with general life details and medical history. For instance, one scenario describes a 20-year-old engineering student who develops a crippling headache on a night out with friends. It includes important medical details (it’s painful to look down) and red herrings (he’s a regular drinker, shares an apartment with six friends, and just finished some stressful exams).

The study tested three different LLMs. The researchers selected GPT-4o on account of its popularity, Llama 3 for its open weights and Command R+ for its retrieval-augmented generation (RAG) abilities, which allow it to search the open web for help.

Participants were asked to interact with the LLM at least once using the details provided, but could use it as many times as they wanted to arrive at their self-diagnosis and intended action. Behind the scenes, a team of physicians unanimously decided on the “gold standard” conditions they sought in every scenario, and the corresponding course of action. Our engineering student, for example, is suffering from a subarachnoid haemorrhage, which should entail an immediate visit to the ER.

A game of telephone

While you might assume an LLM that can ace a medical exam would be the perfect tool to help ordinary people self-diagnose and figure out what to do, it didn’t work out that way.
“Participants using an LLM identified relevant conditions less consistently than those in the control group, identifying at least one relevant condition in at most 34.5% of cases compared to 47.0% for the control,” the study states. They also failed to deduce the correct course of action, selecting it just 44.2% of the time, compared to 56.3% for an LLM acting independently.

What went wrong? Looking back at transcripts, researchers found that participants both provided incomplete information to the LLMs and the LLMs misinterpreted their prompts. For instance, one user who was supposed to exhibit symptoms of gallstones merely told the LLM: “I get severe stomach pains lasting up to an hour, It can make me vomit and seems to coincide with a takeaway,” omitting the location of the pain, the severity, and the frequency. Command R+ incorrectly suggested that the participant was experiencing indigestion, and the participant incorrectly guessed that condition.

Even when LLMs delivered the correct information, participants didn’t always follow their recommendations. The study found that 65.7% of GPT-4o conversations suggested at least one relevant condition for the scenario, but somehow less than 34.5% of final answers from participants reflected those relevant conditions.

The human variable

This study is useful, but not surprising, according to Nathalie Volkheimer, a user experience specialist at the Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill.

“For those of us old enough to remember the early days of internet search, this is déjà vu,” she says. “As a tool, large language models require prompts to be written with a particular degree of quality, especially when expecting a quality output.”

She points out that someone experiencing blinding pain wouldn’t offer great prompts. Although participants in a lab experiment weren’t experiencing the symptoms directly, they weren’t relaying every detail. “There is also a reason why clinicians who deal with patients on the front line are trained to ask questions in a certain way and a certain repetitiveness,” Volkheimer goes on. Patients omit information because they don’t know what’s relevant, or at worst, lie because they’re embarrassed or ashamed.

Can chatbots be better designed to address these gaps? “I wouldn’t put the emphasis on the machinery here,” Volkheimer cautions. “I would consider the emphasis should be on the human-technology interaction.” The car, she analogizes, was built to get people from point A to point B, but many other factors play a role. “It’s about the driver, the roads, the weather, and the general safety of the route. It isn’t just up to the machine.”

A better yardstick

The Oxford study highlights one problem, not with humans or even LLMs, but with the way we sometimes measure them — in a vacuum. When we say an LLM can pass


Outset raises $17M to replace human interviewers with AI agents for enterprise research

Outset, a San Francisco startup that uses artificial intelligence to conduct market research interviews, has raised $17 million in Series A funding to accelerate adoption of its AI-moderated research platform among Fortune 500 enterprises. The round, led by venture capital firm 8VC with participation from Future Back Ventures by Bain & Company and existing investors, brings the company’s total funding to $21 million.

The two-year-old company has developed what it calls the first AI-moderated research platform that can conduct video interviews with research participants at unprecedented scale and speed. Major customers including Nestlé, Microsoft, and WeightWatchers are using the technology to replace traditional market research methods that have remained largely unchanged for decades.

“From my own professional experience, I know firsthand how difficult and time-consuming it is to truly understand your customers and their needs – nothing matters more,” said Aaron Cannon, co-founder and CEO of Outset, in an exclusive interview with VentureBeat. “We designed AI agents to do what we couldn’t — talk to thousands of people at the speed, scale and depth that’s never been possible.”

How AI agents are disrupting decades-old market research practices

The funding comes as enterprises increasingly seek AI-powered alternatives to traditional market research, which typically requires weeks or months to complete and costs thousands of dollars per participant. Outset’s platform promises to deliver research that is 8 times faster and 81 percent less expensive, with 10 times more reach than human-led research.

The technology works by having AI moderators conduct video interviews directly with research participants using synthesized voice, text, images, and videos. Participants can respond through video, voice, text, or by sharing their mobile and desktop screens for user experience research. The AI then synthesizes results automatically, providing instant reporting and analytics.

“Outset’s AI moderator prompts participants with synthesized voice, text, images, and videos. In response, participants share their videos, voices, text, and even their mobile and/or desktop screens for user experience research,” Cannon told VentureBeat.

For context, traditional market research often involves conducting 25 in-depth interviews over 4-6 weeks, followed by 2-4 weeks of manual analysis. With Outset, companies can conduct 250 interviews and complete the entire project in less than a week, while requiring fewer hours from research teams.

Inside Nestlé and Microsoft’s AI-powered customer research strategies

Nestlé, which develops new food products across more than 2,000 brands, exemplifies the platform’s enterprise applications. The food giant uses Outset to test new product concepts by conducting in-depth interviews with hundreds of participants over 1-2 days.

“Nestlé develops new food products constantly across >2000 brands. When they have new concepts, they need to test those with consumers – they need to understand where and how this food would be consumed, reactions to the price point, ingredients and packaging, and even what kinds of other brands it might replace,” Cannon said.

The results speak to the platform’s efficiency gains.
In one project, Nestlé achieved the 81 percent cost reduction that Outset cites in its marketing materials while dramatically accelerating its research timeline.

Microsoft, another customer, is using the platform to better understand user experiences with AI products. “AI-augmented research is here. We partner with Outset to accelerate and scale our team’s ability to learn from our users with the help of AI agents,” said Jess Holbrook, Head of Research at Microsoft AI.

Why venture capitalists see a $140B market opportunity in AI research

The investment reflects growing venture capital interest in AI applications that can replace traditional manual processes. Jack Moshkovich, Partner at 8VC, sees Outset addressing a massive addressable market. “UX research software budgets, general research software budgets, and human-led research budgets are ultimately all addressable by Outset, giving them a roughly $140B TAM,” Moshkovich said in an interview with VentureBeat.

The investor cited four key factors driving the investment decision: large market size with strong inbound demand, early market positioning without a clear leader, potential for long-term product durability through technical challenges and data accumulation, and a stellar team with deep customer understanding.

“AI-native applications that directly address spend previously allocated to manual labor have been a core theme for us over the last two years. Outset is a prime example of a business taking advantage of advances in LLMs to deliver a service that previously had to be manual,” Moshkovich explained.

From 14 employees to millions in revenue: Outset’s explosive growth story

The funding comes amid rapid growth for the 14-person company. Outset has achieved millions in annual recurring revenue with more than 50 enterprise customers, and revenue has doubled over the last four months. The company reported nearly 20 percent month-over-month revenue growth and a 10-fold increase in customer usage over the past year.

“We have millions in annual recurring revenue with more than 50 enterprise customers and we’re growing super fast — revenue doubled over the last 4 months,” Cannon said.

The rapid adoption reflects a broader shift in enterprise attitudes toward AI tools. “Inertia is the biggest objection. People are used to traditional tools and sometimes don’t feel ready to embrace new AI tools. This has shifted massively over the last 6 months,” Cannon said.

Taking on Qualtrics and UserTesting with next-generation AI technology

Outset primarily competes with traditional research incumbents like Qualtrics and UserTesting, which Cannon argues remain stuck with outdated methodologies. “They are still relying on static surveys to gather data and their ‘AI analysis’ tools are still stuck in the world of word clouds and basic sentiment analysis,” he said.

The company also faces competition from other AI research startups, but Cannon believes Outset’s comprehensive enterprise platform provides competitive advantages. The platform includes security certifications, admin permissioning, data-segregated workspaces, and interviewer customization features that large enterprises require.

“We created this new category (AI-moderated research) and continue to offer the most robust and flexible solutions for enterprises, stretching all the way from UX usability tools to strategic market research,” Cannon


Why most enterprise AI agents never reach production and how Databricks plans to fix it

Many enterprise AI agent development efforts never make it to production, and it’s not because the technology isn’t ready. The problem, according to Databricks, is that companies are still relying on manual evaluations, a process that’s slow, inconsistent and difficult to scale.

Today at the Data + AI Summit, Databricks launched Mosaic Agent Bricks as a solution to that challenge. The technology builds on and extends the Mosaic AI Agent Framework, which the company announced in 2024. Simply put, it’s no longer good enough just to be able to build AI agents; they have to have a real-world impact.

The Mosaic Agent Bricks platform automates agent optimization using a series of research-backed innovations. Among the key innovations is the integration of TAO (Test-time Adaptive Optimization), which provides a novel approach to AI tuning without the need for labeled data. Mosaic Agent Bricks also generates domain-specific synthetic data, creates task-aware benchmarks and optimizes the quality-to-cost balance without manual intervention.

The new platform’s fundamental goal is to solve an issue that Databricks users had with existing AI agent development efforts.

“They were flying blind, they had no way to evaluate these agents,” Hanlin Tang, Databricks’ Chief Technology Officer of Neural Networks, told VentureBeat. “Most of them were relying on a kind of manual vibe tracking to see if the agent sounds good enough, but this doesn’t give them the confidence to go into production.”

From research innovation to enterprise AI production scale

Tang was previously the co-founder and CTO of Mosaic, which was acquired by Databricks in 2023 for $1.3 billion. At Mosaic, much of the research innovation didn’t necessarily have an immediate enterprise impact. That all changed after the acquisition.

“The big light bulb moment for me was when we first launched our product on Databricks, and instantly, overnight, we had, like, thousands of enterprise customers using it,” Tang said. In contrast, prior to the acquisition, Mosaic would spend months trying to get just a handful of enterprises to try out products.

Integrating Mosaic into Databricks has given Mosaic’s research team direct access to enterprise problems at scale, and that contact has revealed new research opportunities.

“It’s only when you have contact with enterprise customers, you work with them deeply, that you actually uncover kind of interesting research problems to go after,” Tang explained. “Agent Bricks … is, in some ways, kind of an evolution of everything that we’ve been working on at Mosaic now that we’re all fully, fully bricksters.”

Solving the agentic AI evaluation crisis

Enterprise teams face a costly trial-and-error optimization process. Without task-aware benchmarks or domain-specific test data, every agent adjustment becomes an expensive guessing game. Quality drift, cost overruns and missed deadlines follow.

Agent Bricks automates the entire optimization pipeline. The platform takes a high-level task description and enterprise data, and handles the rest automatically. First, it generates task-specific evaluations and LLM judges. Next, it creates synthetic data that mirrors customer data. Finally, it searches across optimization techniques to find the best configuration.
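The automated-evaluation idea at the core of that pipeline can be sketched generically. The loop below is an illustrative LLM-as-judge harness, not Databricks’ actual Agent Bricks API; `call_agent` and `call_judge` are hypothetical stand-ins for whatever model endpoints a team uses.

```python
# Generic sketch of an LLM-as-judge evaluation loop, illustrating the
# technique Agent Bricks automates; this is not Databricks' API.
# `call_agent` and `call_judge` are hypothetical callables.

JUDGE_PROMPT = """You are grading an AI agent's answer on a document
summarization task. Score factual accuracy against the source document
from 1 to 5 and reply with only the number.

Document: {document}
Agent answer: {answer}
Score:"""

def evaluate_agent(call_agent, call_judge, test_cases):
    """Run each test case through the agent, then grade it with a judge."""
    scores = []
    for case in test_cases:  # e.g. synthetic cases mirroring customer data
        answer = call_agent(case["document"])
        verdict = call_judge(
            JUDGE_PROMPT.format(document=case["document"], answer=answer)
        )
        scores.append(int(verdict.strip()))
    # A mean judge score lets different agent configurations be compared
    # automatically, replacing manual "vibe tracking".
    return sum(scores) / len(scores)
```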
“The customer describes the problem at a high level, and they don’t go into the low-level details, because we take care of those,” Tang said. “The system generates synthetic data and builds custom LLM judges specific to each task.”

The platform offers four agent configurations:

- Information Extraction: Converts documents (PDFs, emails) into structured data. One use case could be retail organizations that use it to pull product details from supplier PDFs, even with complex formatting.
- Knowledge Assistant: Provides accurate, cited answers from enterprise data. For example, manufacturing technicians can get instant answers from maintenance manuals without digging through binders.
- Custom LLM: Handles text transformation tasks (summarization, classification). For example, healthcare organizations can customize models that summarize patient notes for clinical workflows.
- Multi-Agent Supervisor: Orchestrates multiple agents for complex workflows. One example is financial services firms coordinating agents for intent detection, document retrieval and compliance checks.

Agents are great, but don’t forget about data

Building and evaluating agents is a core part of making an AI enterprise-ready, but it’s not the only part that’s needed. Databricks positions Mosaic Agent Bricks as the AI consumption layer sitting atop its unified data stack.

At the Data + AI Summit, Databricks also announced the general availability of its Lakeflow data engineering platform, which was first previewed in 2024. Lakeflow solves the data preparation challenge. It unifies three critical data engineering journeys that previously required separate tools. Ingestion handles getting both structured and unstructured data into Databricks. Transformation provides efficient data cleaning, reshaping and preparation. Orchestration manages production workflows and scheduling.

The workflow connection is direct: Lakeflow prepares enterprise data through unified ingestion and transformation, then Agent Bricks builds optimized AI agents on that prepared data.

“We help get the data into the platform, and then you can do ML, BI and AI analytics,” Bilal Aslam, senior director of product management at Databricks, told VentureBeat.

Going beyond data ingestion, Mosaic Agent Bricks also benefits from Unity Catalog’s governance features, including access controls and data lineage tracking. This integration ensures that agent behavior respects enterprise data governance without additional configuration.

Agent Learning from Human Feedback eliminates prompt stuffing

One of the common approaches to guiding AI agents today is to use a system prompt. Tang referred to the practice of “prompt stuffing,” where users shove all kinds of guidance into a prompt in the hope that the agent will follow it.

Agent Bricks introduces a new concept: Agent Learning from Human Feedback. This feature automatically adjusts system components based on natural language guidance, solving what Tang calls the prompt stuffing problem. According to Tang, the prompt stuffing approach often fails because agent systems have multiple components that need adjustment; Agent Learning from Human Feedback automatically interprets natural language guidance and adjusts the appropriate system components. The approach mirrors reinforcement learning from human feedback (RLHF)


Senator’s RISE Act would require AI developers to list training data, evaluation methods in exchange for ‘safe harbor’ from lawsuits

Amid an increasingly tense and destabilizing week for international news, it should not escape any technical decision-maker’s notice that some lawmakers in the U.S. Congress are still moving forward with new proposed AI regulations that could reshape the industry in powerful ways — and seek to steady it moving forward.

Case in point: yesterday, U.S. Republican Senator Cynthia Lummis of Wyoming introduced the Responsible Innovation and Safe Expertise Act of 2025 (RISE), the first stand-alone bill that pairs a conditional liability shield for AI developers with a transparency mandate on model training and specifications.

As with all new proposed legislation, both the U.S. Senate and House would need to pass the bill by majority vote, and U.S. President Donald J. Trump would need to sign it before it becomes law, a process which would likely take months at the soonest.

“Bottom line: If we want America to lead and prosper in AI, we can’t let labs write the rules in the shadows,” wrote Lummis on her account on X when announcing the new bill. “We need public, enforceable standards that balance innovation with trust. That’s what the RISE Act delivers. Let’s get it done.”

The bill also upholds traditional malpractice standards for doctors, lawyers, engineers, and other “learned professionals.” If enacted as written, the measure would take effect December 1, 2025, and apply only to conduct that occurs after that date.

Why Lummis says new AI legislation is necessary

The bill’s findings section paints a landscape of rapid AI adoption colliding with a patchwork of liability rules that chills investment and leaves professionals unsure where responsibility lies. Lummis frames her answer as simple reciprocity: developers must be transparent, professionals must exercise judgment, and neither side should be punished for honest mistakes once both duties are met. In a statement on her website, Lummis calls the measure “predictable standards that encourage safer AI development while preserving professional autonomy.”

With bipartisan concern mounting over opaque AI systems, RISE gives Congress a concrete template: transparency as the price of limited liability. Industry lobbyists may press for broader redaction rights, while public-interest groups could push for shorter disclosure windows or stricter opt-out limits. Professional associations, meanwhile, will scrutinize how the new documents can fit into existing standards of care.

Whatever shape the final legislation takes, one principle is now firmly on the table: in high-stakes professions, AI cannot remain a black box. And if the Lummis bill becomes law, developers who want legal peace will have to open that box—at least far enough for the people using their tools to see what is inside.

How the new ‘safe harbor’ provision shields AI developers from lawsuits

RISE offers immunity from civil suits only when a developer meets clear disclosure rules:

- Model card: a public technical brief that lays out training data, evaluation methods, performance metrics, intended uses, and limitations.
- Model specification: the full system prompt and other instructions that shape model behavior, with any trade-secret redactions justified in writing.
The developer must also publish known failure modes, keep all documentation current, and push updates within 30 days of a version change or newly discovered flaw. Miss the deadline—or act recklessly—and the shield disappears.

Professionals like doctors and lawyers remain ultimately liable for using AI in their practices

The bill does not alter existing duties of care. The physician who misreads an AI-generated treatment plan or a lawyer who files an AI-written brief without vetting it remains liable to clients. The safe harbor is unavailable for non-professional use, fraud, or knowing misrepresentation, and it expressly preserves any other immunities already on the books.

Reaction from AI 2027 project co-author

Daniel Kokotajlo, policy lead at the nonprofit AI Futures Project and a co-author of the widely circulated scenario planning document AI 2027, took to his X account to state that his team advised Lummis’s office during drafting and “tentatively endorse[s]” the result. He applauds the bill for nudging transparency yet flags three reservations:

- Opt-out loophole: A company can simply accept liability and keep its specifications secret, limiting transparency gains in the riskiest scenarios.
- Delay window: Thirty days between a release and required disclosure could be too long during a crisis.
- Redaction risk: Firms might over-redact under the guise of protecting intellectual property; Kokotajlo suggests forcing companies to explain why each blackout truly serves the public interest.

The AI Futures Project views RISE as a step forward but not the final word on AI openness.

What it means for devs and enterprise technical decision-makers

The RISE Act’s transparency-for-liability trade-off will ripple outward from Congress straight into the daily routines of four overlapping job families that keep enterprise AI running.

Start with the lead AI engineers—the people who own a model’s life cycle. Because the bill makes legal protection contingent on publicly posted model cards and full prompt specifications, these engineers gain a new, non-negotiable checklist item: confirm that every upstream vendor, or the in-house research squad down the hall, has published the required documentation before a system goes live. Any gap could leave the deployment team on the hook if a doctor, lawyer, or financial adviser later claims the model caused harm.

Next come the senior engineers who orchestrate and automate model pipelines. They already juggle versioning, rollback plans, and integration tests; RISE adds a hard deadline. Once a model or its spec changes, updated disclosures must flow into production within thirty days. CI/CD pipelines will need a new gate that fails builds when a model card is missing, out of date, or overly redacted, forcing re-validation before code ships.

The data-engineering leads aren’t off the hook, either. They will inherit an expanded metadata burden: capture the provenance of training data, log evaluation metrics, and store any trade-secret redaction justifications in a way auditors can query. Stronger lineage tooling becomes more than a best practice; it turns into the evidence that a company met its duty of care
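For teams that want to operationalize that CI/CD gate today, here is a minimal sketch; the required fields follow the bill’s disclosure list, while the file name, JSON layout and freshness check are illustrative assumptions rather than anything the bill prescribes.

```python
# Minimal sketch of a CI gate that fails a build when a model card is
# missing, incomplete, or stale. Field names follow the RISE Act's
# disclosure list; the file path, JSON layout and timestamp convention
# are illustrative assumptions.
import json
import sys
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = [
    "training_data", "evaluation_methods", "performance_metrics",
    "intended_uses", "limitations", "known_failure_modes",
]
MAX_AGE = timedelta(days=30)  # RISE requires updates within 30 days

def check_model_card(path="model_card.json"):
    try:
        with open(path) as f:
            card = json.load(f)
    except FileNotFoundError:
        sys.exit(f"FAIL: {path} not found; deployment blocked")

    missing = [field for field in REQUIRED_FIELDS if not card.get(field)]
    if missing:
        sys.exit(f"FAIL: model card missing fields: {missing}")

    # Expects an ISO-8601 timestamp with timezone, e.g. "2025-06-10T00:00:00+00:00"
    updated = datetime.fromisoformat(card["last_updated"])
    if datetime.now(timezone.utc) - updated > MAX_AGE:
        sys.exit("FAIL: model card is older than 30 days; refresh required")

    print("OK: model card present, complete, and current")

if __name__ == "__main__":
    check_model_card()
```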
