VentureBeat

Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data

Every AI model release inevitably includes charts touting how it outperformed its competitors on this benchmark test or that evaluation metric. However, these benchmarks often test for general capabilities. For organizations that want to use models and large language model-based agents, it is harder to evaluate how well the agent or the model actually understands their specific needs.

Model repository Hugging Face has launched Yourbench, an open-source tool that lets developers and enterprises create their own benchmarks to test model performance against their internal data.

Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced Yourbench on X. The tool offers "custom benchmarking and synthetic data generation from ANY of your documents. It's a big step towards improving how model evaluations work." He added that Hugging Face knows "that for many use cases what really matters is how well a model performs your specific task. Yourbench lets you evaluate models on what matters to you."

Creating custom evaluations

Hugging Face said in a paper that Yourbench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark "using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings."

Organizations need to pre-process their documents before Yourbench can work. This involves three stages: document ingestion to "normalize" file formats, semantic chunking to break the documents down to meet context window limits and focus the model's attention, and document summarization. Next comes the question-and-answer generation process, which creates questions from information in the documents. This is where the user brings in their chosen LLM to see which one best answers the questions.

Hugging Face tested Yourbench with DeepSeek V3 and R1 models, Alibaba's Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral 3.1 Small, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o-mini and o3-mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku. Shashidhar said Hugging Face also offers cost analysis of the models and found that Qwen and Gemini 2.0 Flash "produce tremendous value for very very low costs."

Compute limitations

However, creating custom LLM benchmarks based on an organization's documents comes at a cost. Yourbench requires a lot of compute power to work. Shashidhar said on X that the company is "adding capacity" as fast as it can. Hugging Face runs its own GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat reached out to Hugging Face about Yourbench's compute usage.

Benchmarking is not perfect

Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not perfectly capture how the models will perform in day-to-day use. Some have even voiced skepticism that benchmark tests fail to reveal models' limitations and can lead to false conclusions about their safety and performance. A study also warned that benchmarking agents could be "misleading."

However, enterprises cannot avoid evaluating models now that there are many choices on the market, and technology leaders must justify the rising cost of using AI models. This has led to different methods for testing model performance and reliability.
Google DeepMind introduced FACTS Grounding, which tests a model's ability to generate factually accurate responses based on information from documents. Researchers from Yale and Tsinghua University developed self-invoking code benchmarks to help enterprises determine which coding LLMs work best for them.
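The workflow described above (ingest and chunk documents, generate question-answer pairs from them, then score candidate models on those questions) can be sketched in a few lines of Python. This is a conceptual illustration only, not the actual Yourbench API; the `call_llm` and `grade` helpers stand in for whatever LLM client and scoring rubric an organization already uses.

```python
# Conceptual sketch of a Yourbench-style custom benchmark (not the real Yourbench API).
# call_llm(model, prompt) and grade(candidate, reference) are hypothetical helpers
# wrapping your preferred LLM client and answer-grading logic.

def chunk(text: str, max_chars: int = 4000) -> list[str]:
    """Naive stand-in for semantic chunking: split a normalized document into pieces."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def build_benchmark(documents: list[str], generator_model: str, call_llm) -> list[dict]:
    """Generate question-answer pairs from internal documents."""
    qa_pairs = []
    for doc in documents:
        for piece in chunk(doc):
            qa = call_llm(generator_model,
                          f"Write one question and its answer based only on:\n{piece}")
            question, _, answer = qa.partition("\nAnswer:")
            qa_pairs.append({"context": piece, "question": question, "answer": answer})
    return qa_pairs

def evaluate(models: list[str], qa_pairs: list[dict], call_llm, grade) -> dict[str, float]:
    """Score each candidate model on the generated questions."""
    scores = {}
    for model in models:
        correct = sum(
            grade(call_llm(model, f"{qa['context']}\n\nQuestion: {qa['question']}"),
                  qa["answer"])
            for qa in qa_pairs
        )
        scores[model] = correct / len(qa_pairs)
    return scores
```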


A new, enterprise-specific AI speech model is here: Jargonic from aiOla claims to best rivals at your business’s lingo

Speech recognition models have become increasingly accurate in recent years. However, they are often built and benchmarked under ideal conditions—quiet rooms, clear audio and general-purpose vocabulary. For enterprises, real-world audio is far messier. That's the challenge aiOla aims to address with the launch of Jargonic, its new automatic speech recognition (ASR) model built specifically for enterprise use.

The Israeli startup is unveiling Jargonic today: a speech-to-text model designed to handle specialized jargon, background noise and diverse accents without extensive retraining or fine-tuning. "Our model focuses on three key challenges in speech recognition: jargon, background noise and accents," said Gill Hetz, aiOla vice president of AI. "We built a model that understands specific industry jargon in a zero-shot manner, handles noisy environments and supports a wide range of accents." Available now via API on aiOla's enterprise platform, Jargonic is positioned as a production-ready ASR solution for businesses in industries such as manufacturing, logistics, financial services and healthcare.

aiOla team. Credit: aiOla

From product-first to AI-first

The launch of Jargonic represents a shift in focus for aiOla itself. According to company leadership, the team redefined its approach to prioritize AI research and deployment. "When I arrived here, I saw an amazing product company that had invested heavily in advanced AI capabilities, but was mostly known for helping people fill out forms," said Assaf Asbag, aiOla's chief technology and product officer. "We shifted the perspective and became an AI company with a great product, instead of a product company with AI capabilities." "We decided to open our capabilities to the world," Asbag added. "Instead of serving our model only to enterprises within our product, we developed an API and are now launching it to make our enterprise-grade, bulletproof model available to everyone."

Jargon recognition, zero-shot adaptation

One of Jargonic's distinguishing features is its approach to specialized vocabulary. Speech recognition systems typically struggle when confronted with domain-specific jargon that does not appear in standard training data. Jargonic addresses this challenge with a proprietary keyword spotting system that allows for zero-shot adaptation—enterprises can simply provide a list of terms without additional retraining.

In benchmark tests, Jargonic demonstrated a 5.91% average word error rate (WER) across four leading English academic datasets, outperforming competitors such as ElevenLabs, AssemblyAI, OpenAI's Whisper and Deepgram Nova-3. However, the company has not yet disclosed performance comparisons against newer multimodal transcription models like OpenAI's GPT-4o-transcribe, which launched nine days earlier and boasts top benchmark performance, with a WER of only 2.46% in English. aiOla claims its model is still better at picking out specific business jargon. Jargonic also achieved an 89.3% recall rate on specialized financial terms and consistently outperformed others in multilingual jargon recognition, reaching over 95% accuracy across five languages. "Once you have heavy jargon, recognition accuracy typically drops by 20%," Asbag explained. "But with our zero-shot approach, where you just list important keywords, accuracy jumps back up to 95%. That's unique to us." This capability is designed to eliminate the time-consuming, resource-intensive retraining process typically required to adapt ASR systems for specific industries.

Optimized for the enterprise environment

Jargonic's development was informed by years of experience building solutions for enterprise clients. The model was trained on over one million hours of transcribed speech, including significant data from industrial and business environments, ensuring robustness in noisy, real-life settings. "What differentiates us is that we've spent years solving real-world enterprise problems," Hetz said. "We optimized for speed, accuracy, and the ability to handle complex environments—not just podcasts or videos, but noisy, messy, real-life workplaces." The model's architecture integrates keyword spotting directly into the transcription process, allowing Jargonic to maintain accuracy even in unpredictable audio conditions.

The voice-first future

For aiOla's leadership, Jargonic is a step toward a broader shift in how people interact with technology. The company sees speech recognition not only as a business tool, but as an essential interface for the future of human-computer interaction. "Our vision is that every machine interface will soon be voice-first," Hetz said. "You'll be able to talk to your refrigerator, your vacuum cleaner, any machine—and it will act and do whatever you want. That's the future we're building toward." Asbag echoed that sentiment, adding, "Conversational AI is going to become the new web browser. Machines are starting to understand us, and now we have a reason to interact with them naturally." For now, aiOla's focus remains on the enterprise. Jargonic is available immediately to enterprise customers via API, allowing them to integrate the model's speech recognition capabilities into their own workflows, applications, or customer-facing services.
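For readers unfamiliar with the metrics quoted above: word error rate is the word-level edit distance between a reference transcript and the system's hypothesis, divided by the number of reference words, and keyword recall is simply the fraction of listed terms the system transcribed correctly. A minimal sketch of both (standard textbook definitions, not aiOla's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def keyword_recall(keywords: list[str], hypothesis: str) -> float:
    """Fraction of supplied jargon terms that appear in the transcript."""
    hyp = hypothesis.lower()
    return sum(1 for kw in keywords if kw.lower() in hyp) / max(len(keywords), 1)

# Example: one wrong word out of seven reference words is a WER of ~14%;
# a 5.91% WER corresponds to roughly 6 errors per 100 reference words.
print(word_error_rate("switch the centrifugal pump to bypass mode",
                      "switch the centrifugal pump to bypass mod"))
```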


Augment Code debuts AI agent with 70% win rate over GitHub Copilot and record-breaking SWE-bench score

Augment Code, an AI coding assistant startup, unveiled its new "Augment Agent" technology today, designed to tackle the complexity of large software engineering projects rather than simple code generation. The company claims its approach represents a significant departure from other AI coding tools by focusing on helping developers navigate and modify large, established codebases that span millions of lines of code across multiple repositories. The company also announced it has achieved the highest score to date on SWE-bench Verified, an industry benchmark for AI coding capabilities, by combining Anthropic's Claude 3.7 Sonnet with OpenAI's o1 reasoning model.

"Most work in the coding AI space, which is clearly a hot sector, has focused on what people call 'zero to one' or 'vibe coding' – starting with nothing and producing a piece of software by the end of the session," said Scott Dietzen, CEO of Augment Code, in an exclusive interview with VentureBeat. "What we targeted instead is the software engineering discipline of maintaining big, complex systems — databases, networking stacks, storage — codebases that have evolved over many years with hundreds of developers working on them collaboratively."

Founded in 2022, Augment Code has raised $270 million in total funding, including a $227 million Series B round announced in April 2024 at a post-money valuation of $977 million. The company's investors include Sutter Hill Ventures, Index Ventures, Innovation Endeavors (led by former Google CEO Eric Schmidt), Lightspeed Venture Partners, and Meritech Capital.

How Augment's context engine tackles multi-million line codebases

What sets Augment Agent apart, according to the company, is its ability to understand context across massive codebases. The agent boasts a 200,000-token context window, significantly larger than those of most competitors. "The challenge for any AI system, including Augment, is that when you're working with large systems containing tens of millions of lines of code – which is typical for meaningful software applications – you simply can't pass all that as context to today's large language models," explained Dietzen. "We've trained our AI models to perform sophisticated real-time sampling, identifying precisely the right subset of the codebase that allows the agent to do its job effectively." This approach contrasts with competitors that either don't handle large codebases or require developers to manually assemble the relevant context themselves.

Another differentiator is Augment's real-time synchronization of code changes across teams. "Most of our competitors work with stale versions of the codebase," said Dietzen. "If you and I are collaborating in the same code branch and I make a change, you'd naturally want your AI to be aware of that change, just as you would be. That's why we've implemented real-time synchronization of everyone's view of the code." The company reports its approach has led to a 70% win rate against GitHub Copilot when competing for enterprise business.

Why the 'Memories' feature helps AI match your personal coding style

Augment Agent includes a "Memories" feature that learns from developer interactions to better align with individual coding styles and preferences over time. "Part of what we wanted to be able to deliver with our agents is autonomy in the sense that you can give them tasks, but you can also intervene," Dietzen said.
"Memories are a tool for the model to generalize your intent, to capture that when I'm in this situation, I want you to take this path rather than the path that you took." Contrary to the notion that coding is purely mathematical logic without stylistic elements, Dietzen emphasized that many developers care deeply about the aesthetic and structural aspects of their code. "There is definitely a mathematical aspect to code, but there's also an art to coding as well," he noted. "Many of our developers want to stay in the code. Some use our agents to write all of the code, but there's a whole group of engineers that care about what the ultimate code looks like and have strong opinions about that."

Enterprise adoption of AI coding tools has been slowed by concerns about intellectual property protection and security. Augment has focused on addressing these issues with a robust security architecture and enterprise-grade integrations. "Agents need to be trusted. If you're going to give them this autonomy, you want to make sure that they're not going to do any harm," said Dietzen. "We were the first to offer the various levels of SOC compliance and all of the associated penetration testing to harden our solution." The company has also established integrations with developer tools like GitHub, Linear, Jira, Notion, Google Search, and Slack. Unlike some competitors that implement these integrations on the client side, Augment handles these connections in the cloud, making them "easily shareable and consistent across a larger team," according to Dietzen.

Augment Agent is generally available for VS Code users starting today, with early preview access for JetBrains users. The company maintains full compatibility with Microsoft's ecosystem, unlike competitor Cursor, which forked VS Code. "At some level, customers that choose Cursor are opting out of the Microsoft ecosystem. They're not allowed to use all of the standard VS Code plug-ins that Microsoft provides for access to their environment, whereas we've preserved 100% compatibility with VS Code and the Microsoft ecosystem," Dietzen explained.

The evolving partnership between human engineers and AI assistants

Despite the advances in AI coding assistance, Dietzen believes human software engineers will remain essential for the foreseeable future. "The arguments around whether software engineering is a good discipline for people going forward are very much off the mark today," he said. "The discipline of software engineering is very, very different in terms of crafting and evolving these large code bases, and human insight is going to be needed for years to come." However, he envisions a future where AI can take on more proactive roles in software development: "The real excitement around where we can ultimately get to with AI is AI just going in and assessing quality
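Augment has not published its context engine, but the behavior Dietzen describes, namely sampling just the relevant slice of a multi-million-line codebase so it fits a 200,000-token window, can be approximated with a standard retrieval pattern. A hedged sketch of that pattern, assuming an `embed()` function from whatever embedding model you already use (everything here is illustrative, not Augment's implementation):

```python
import math

# embed(text) -> list[float] is assumed to come from any embedding model you already use.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) + 1e-9)

def index_codebase(files: dict[str, str], embed, chunk_lines: int = 80) -> list[dict]:
    """Split every file into line-based chunks and embed each chunk once."""
    chunks = []
    for path, text in files.items():
        lines = text.splitlines()
        for start in range(0, len(lines), chunk_lines):
            body = "\n".join(lines[start:start + chunk_lines])
            chunks.append({"path": path, "start": start, "text": body, "vec": embed(body)})
    return chunks

def select_context(task: str, chunks: list[dict], embed, token_budget: int = 200_000) -> str:
    """Pick the most task-relevant chunks until the context budget is spent."""
    query_vec = embed(task)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    context, used = [], 0
    for c in ranked:
        est_tokens = len(c["text"]) // 4  # rough chars-per-token heuristic
        if used + est_tokens > token_budget:
            break
        context.append(f"# {c['path']} (line {c['start']})\n{c['text']}")
        used += est_tokens
    return "\n\n".join(context)
```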


OpenAI to release open-source model as AI economics force strategic shift

OpenAI announced plans to release its first "open-weight" language model since 2019, marking a dramatic strategic shift for the company that built its business on proprietary AI systems. Sam Altman, OpenAI's chief executive, revealed the news in a post on X on Monday. "We are excited to release a powerful new open-weight language model with reasoning in the coming months," Altman wrote. The model would allow developers to run it on their own hardware, departing from OpenAI's cloud-based subscription approach that has driven its revenue. "We've been thinking about this for a long time but other priorities took precedence. Now it feels important to do," Altman added.

The announcement coincided with OpenAI securing $40 billion in new funding at a $300 billion valuation — the largest fundraise in the company's history. These major developments follow Altman's admission during a February Reddit Q&A that OpenAI had been "on the wrong side of history" regarding open-source AI — a statement prompted by January's release of DeepSeek R1, an open-source model from China that reportedly matches OpenAI's performance at just 5-10% of the operating cost.

OpenAI faces mounting economic pressure in a marketplace increasingly dominated by efficient open-source alternatives. The company reportedly spends $7-8 billion annually on operations, according to AI scholar Kai-Fu Lee, who recently questioned OpenAI's sustainability against competitors with fundamentally different cost structures. "You're spending $7 billion or $8 billion a year, making a massive loss, and here you have a competitor coming in with an open-source model that's for free," Lee said in a Bloomberg Television interview last week, comparing OpenAI's finances with DeepSeek AI.

Meta's Llama models have established a formidable market presence since their 2023 debut, surpassing one billion downloads as of this March. This widespread adoption demonstrates how quickly the field has shifted toward open models that can be deployed without the recurring costs of API-based services. Clement Delangue, CEO of Hugging Face, celebrated the announcement, writing on X: "Amazing news for the field and the world. Everyone benefits from open-source AI!"

The billion-dollar gamble: Why OpenAI is risking its primary revenue stream

OpenAI's move represents a high-stakes bet that could either secure its future relevance or accelerate its financial challenges. By releasing an open model, the company implicitly acknowledges that foundation models are becoming commoditized — an extraordinary concession from a company that has raised billions on the premise that its proprietary technology would remain superior and exclusive. The economics of AI have shifted dramatically since OpenAI's founding.
Training costs have fallen precipitously as hardware efficiency improves and algorithmic innovations like DeepSeek's approach demonstrate that state-of-the-art performance no longer requires Google-scale infrastructure investments. For OpenAI, this creates an existential dilemma: maintain course with increasingly expensive proprietary models or adapt to a market that increasingly views base models as utilities rather than premium products. Its choice to release an open model suggests it has concluded that relevance and ecosystem influence may ultimately prove more valuable than short-term subscription revenue. This decision also reflects the company's growing realization that competitive moats in AI may not lie in the base models themselves, but in the specialized fine-tuning, domain expertise, and application development that build upon them.

Balancing openness with responsibility: How OpenAI plans to control what it can't contain

OpenAI emphasizes that safety remains central to its approach despite embracing greater openness. "Before release, we will evaluate this model according to our preparedness framework, like we would for any other model. And we will do extra work given that we know this model will be modified post-release," Altman wrote. This represents the fundamental tension in open-weight releases: once published, these models can be modified, fine-tuned, and deployed in ways the original creators never intended. OpenAI's challenge lies in creating guardrails that maintain reasonable safety without undermining the very openness it has promised. The company plans to host developer events to gather feedback and showcase early prototypes, beginning in San Francisco in the coming weeks before expanding to Europe and Asia-Pacific regions. These sessions may provide insight into how OpenAI plans to balance openness with responsibility.

Enterprise impact: What CIOs and technical decision makers need to know about OpenAI's strategic shift

For enterprise customers, OpenAI's move could significantly reshape AI implementation strategies. Organizations that have hesitated to build critical infrastructure atop subscription-based models now have reason to reconsider their approach. The ability to run models locally addresses persistent concerns around data sovereignty, vendor lock-in, and long-term cost management. This shift particularly matters for regulated industries like healthcare, finance, and government, where data privacy requirements have limited cloud-based AI adoption. Self-hosted models potentially enable these sectors to implement AI in previously restricted contexts, though questions around compute requirements and operational complexity remain unanswered. For existing OpenAI enterprise customers, the announcement creates uncertainty about long-term investment strategies. Those who have built systems atop GPT-4 or o1 APIs must now evaluate whether to maintain that approach or begin planning migrations to self-hosted alternatives — a decision complicated by the lack of specific details about the forthcoming model's capabilities.

Beyond base models: How the AI industry's competitive landscape is fundamentally changing

OpenAI's pivot highlights a broader industry trend: the commoditization of foundation models and the shifting focus toward specialized applications. As base models become increasingly accessible, differentiation increasingly happens at the application layer — creating opportunities for startups and established players alike to build domain-specific solutions.
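For teams weighing the self-hosting scenario described above, the mechanics are already familiar from existing open-weight models. Here is a minimal sketch using the Hugging Face transformers library; the model identifier is a placeholder, since OpenAI's open-weight model has not shipped, and any locally available open-weight checkpoint (such as Meta's Llama weights mentioned earlier) follows the same pattern:

```python
# Minimal sketch of self-hosted inference with an open-weight model via transformers.
# The model ID is a placeholder: OpenAI's open-weight model has not been released yet,
# so substitute any open-weight checkpoint you are licensed to run locally.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/open-weight-model"  # placeholder identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# device_map="auto" (requires the accelerate package) spreads weights across available GPUs.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "Summarize our data-retention policy for a customer-facing FAQ."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because everything runs on hardware the organization controls, no prompt or document leaves its network, which is the data-sovereignty argument in concrete form.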
This doesn’t mean the race to build better base models has


Emergence AI's new system automatically creates AI agents rapidly in real time based on the work at hand

Another day, another announcement about AI agents. Hailed by various market research reports as the big tech trend of 2025 — especially in the enterprise — AI agents seem to debut every 12 hours or so in another way to make, orchestrate (link together) or otherwise optimize purpose-built AI tools and workflows designed to handle routine white-collar work. Yet Emergence AI, a startup founded by former IBM Research veterans that late last year debuted its own cross-platform AI agent orchestration framework, is out with something that departs from the rest: a new AI agent creation platform that lets the human user specify, via text prompts, what work they are trying to accomplish, and then turns it over to AI models to create the agents they believe are necessary to accomplish said work.

The new system is a no-code, natural-language, AI-powered multi-agent builder, and it works in real time. Emergence AI describes it as a milestone in recursive intelligence that aims to simplify and accelerate complex data workflows for enterprise users. "Recursive intelligence paves the path for agents to create agents," said Satya Nitta, co-founder and CEO of Emergence AI. "Our systems allow creativity and intelligence to scale fluidly, without human bottlenecks, but always within human-defined boundaries."

Image of Dr. Satya Nitta, co-founder and CEO of Emergence AI, during his keynote at the AI Engineer World's Fair 2024, where he unveiled Emergence's Orchestrator meta-agent and introduced the open-source web agent, Agent-E. (Photo courtesy AI Engineer World's Fair)

The platform is designed to evaluate incoming tasks, check its existing agent registry, and, if necessary, autonomously generate new agents tailored to fulfill specific enterprise needs. It can also proactively create agent variants to anticipate related tasks, broadening its problem-solving capabilities over time. According to Nitta, the orchestrator's architecture enables entirely new levels of autonomy in enterprise automation. "Our orchestrator stitches multiple agents together autonomously to create multi-agent systems without human coding. If it doesn't have an agent for a task, it will auto-generate one and even self-play to learn related tasks by creating new agents itself," he explained.

A brief demo shown to VentureBeat over a video call last week appeared duly impressive, with Nitta showing how a simple text instruction to have the AI categorize email sparked a wave of newly created agents, displayed on a visual timeline with each agent represented as a colored dot in a column designating the category of work it was designed to carry out.

Animated GIF showing Emergence AI's user interface for automatically creating multiple enterprise AI agents.

Nitta also said the user can stop and intervene in this process, conveying additional text instructions, at any time.

Bringing agentic coding to enterprise workflows

Emergence AI's technology focuses on automating data-centric enterprise workflows such as ETL pipeline creation, data migration, transformation, and analysis. The platform's agents are equipped with agentic loops, long-term memory, and self-improvement abilities through planning, verification, and self-play. This enables the system to not only complete individual tasks but also understand and navigate surrounding task spaces for adjacent use cases.
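Emergence has not released the orchestrator's code, but the check-the-registry-then-create behavior it describes maps onto a simple control loop. A hedged sketch of that loop, with `build_agent()` standing in for whatever agent-generation call the platform actually makes, and with `can_handle`, `run` and `capability` as assumed interfaces on the agent objects:

```python
# Conceptual sketch of a "registry-first" orchestrator loop (not Emergence AI's code).
# build_agent(task) is a hypothetical factory that asks an LLM to assemble a new agent;
# agents are assumed to expose can_handle(task), run(task) and a capability label.

class Orchestrator:
    def __init__(self, build_agent):
        self.registry: dict[str, object] = {}  # capability name -> agent
        self.build_agent = build_agent

    def find_capable_agent(self, task: str):
        """Return a registered agent that claims it can handle the task, if any."""
        for agent in self.registry.values():
            if agent.can_handle(task):
                return agent
        return None

    def handle(self, task: str):
        agent = self.find_capable_agent(task)
        if agent is None:
            # No existing agent fits: create one, register it, and reuse it next time.
            agent = self.build_agent(task)
            self.registry[agent.capability] = agent
        return agent.run(task)
```

Reusing registered agents before building new ones is what keeps the agent count from growing without bound, the behavior Emergence describes below.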
"We're in a weird time in the development of technology and our society. We now have AI joining meetings," Nitta said. "But beyond that, one of the most exciting things that's happened in AI over the last two, three years is that large language models are producing code. They're getting better, but they're probabilistic systems. The code might not always be perfect, and they don't execute, verify, or correct it." Emergence AI's platform seeks to fill that gap by integrating large language models' code-generation abilities with autonomous agent technology. "We're marrying LLMs' code generation capabilities with autonomous agent technology," Nitta added. "Agentic coding has enormous implications and will be the story of the next year and the next several years. The disruption is profound."

Emergence AI highlights the platform's ability to integrate with leading AI models such as OpenAI's GPT-4o and GPT-4.5, Anthropic's Claude 3.7 Sonnet, and Meta's Llama 3.3, as well as frameworks like LangChain, Crew AI, and Microsoft Autogen. The emphasis is on interoperability—allowing enterprises to bring their own models and third-party agents into the platform.

Expanding multi-agent capabilities

With the current release, the platform expands to include connector agents and data and text intelligence agents, allowing enterprises to build more complex systems without writing manual code. The orchestrator's ability to evaluate its own limitations and take action is central to Emergence's approach. "A very non-trivial thing that's happening is when a new task comes in, the orchestrator figures out if it can solve the task by checking the registry of existing agents," Nitta said. "If it can't, it creates a new agent and registers it." He added that this process is not simply reactive, but generative. "The orchestrator is not just creating agents; it's creating goals for itself. It says, 'I can't solve this task, so I will create a goal to make a new agent.' That's what's truly exciting."

But lest you worry the orchestrator will spiral out of control and create too many needless custom agents for each new task, Emergence's research on its platform shows that it has been designed to — and successfully does — winnow down the number of agents it creates as it gets closer to completing a task, adding more generally applicable agents to its internal registry for your enterprise and checking that registry before creating any new ones.

Graph showing the number of tasks increasing while the number of Emergence AI "core agents" and "multi agents" levels off over time. Credit: Emergence AI

Prioritizing safety, verification, and human oversight

To maintain oversight and ensure responsible use, Emergence AI incorporates several safety and compliance features. These include guardrails and access controls, verification rubrics to evaluate agent performance, and human-in-the-loop oversight to validate key decisions. Nitta


Researchers warn of ‘catastrophic overtraining’ in LLMs

A new academic study challenges a core assumption in developing large language models (LLMs), warning that more pre-training data may not always lead to better models. Researchers from some of the leading computer science institutions in the West and around the world—including Carnegie Mellon University, Stanford University, Harvard University and Princeton University—have introduced the concept of "Catastrophic Overtraining." They show that extended pre-training can actually make language models harder to fine-tune, ultimately degrading their performance. The study, "Overtrained Language Models Are Harder to Fine-Tune," is available on arXiv and was led by Jacob Mitchell Springer. Its co-authors are Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig and Aditi Raghunathan.

The law of diminishing returns

The research focuses on a surprising trend observed in modern LLM development: while models are pre-trained on ever-expanding pools of data—licensed or scraped from the web, represented to an LLM as a series of tokens, or numerical representations of concepts and ideas—increasing the number of tokens used during pre-training may lead to reduced effectiveness when those models are later fine-tuned for specific tasks. The team conducted a series of empirical evaluations and theoretical analyses to examine the effect of extended pre-training on model adaptability.

One of the key findings centers on AI2's open-source OLMo-1B model. The researchers compared two versions of this model: one pre-trained on 2.3 trillion tokens and another on 3 trillion tokens. Despite being trained on 30% more data, the 3T-token model performed worse after instruction tuning. Specifically, it showed over 2% worse performance on several standard language model benchmarks compared with its 2.3T-token counterpart. In some evaluations, the degradation in performance reached up to 3%. The researchers argue that this decline is not an anomaly but rather a consistent phenomenon they term "Catastrophic Overtraining."

Understanding sensitivity and forgetting

The paper attributes this degradation to a systematic increase in what they call "progressive sensitivity." As models undergo extended pre-training, their parameters become more sensitive to changes. This increased fragility makes them more vulnerable to degradation during post-training modifications such as instruction tuning, fine-tuning for multimodal tasks, or even simple weight perturbations. The researchers provide evidence that, beyond a certain point in pre-training, any modification—whether structured like fine-tuning or unstructured like adding Gaussian noise—leads to a greater loss of previously learned capabilities. This sensitivity results in "forgetting," where the model's original strengths deteriorate as new training data is introduced. The study identifies an "inflection point" in pre-training, after which additional training leads to diminishing and even negative returns regarding fine-tuning outcomes. For the OLMo-1B model, this threshold emerged around 2.5 trillion tokens.

A wealth of evidence

The team's analysis spans real-world and controlled experimental settings. They tested the phenomenon across different tasks, including instruction tuning using datasets like Anthropic-HH and TULU and multimodal fine-tuning using the LLaVA framework.
The results consistently showed that models pre-trained beyond certain token budgets underperformed after fine-tuning. Furthermore, the researchers constructed a theoretical model using linear networks to better understand why overtraining leads to increased sensitivity. Their analysis confirmed that progressive sensitivity and catastrophic overtraining are mathematically inevitable when pre-training continues indefinitely without proper constraints.

The ultimate takeaway? Model providers and trainers must make trade-offs

The findings challenge the widespread assumption that more pre-training data is always better. Instead, the paper suggests a nuanced trade-off: while longer pre-training improves the base model's capabilities, it also increases the risk that fine-tuning will degrade those capabilities. In practice, attempts to mitigate this effect—such as adjusting fine-tuning learning rates or adding regularization—may delay the onset of catastrophic overtraining but cannot fully eliminate it without sacrificing downstream performance. Thus, for enterprises looking to leverage LLMs to improve business workflows and outcomes — for instance, by fine-tuning an open-source model — the lesson from this research is that fine-tuning smaller models pre-trained on less material is likely to yield a more reliable production model. The authors acknowledge that further research is needed to understand the factors influencing when and how catastrophic overtraining occurs. Open questions include whether the pre-training optimizer, training objective, or data distribution can impact the severity of the phenomenon.

Implications for future LLM and AI model development

The study has significant implications for how organizations and researchers design and train large language models. As the field continues to pursue larger and more capable models, this research highlights the importance of balancing pre-training duration with post-training adaptability. Additionally, the findings may influence how model developers think about resource allocation. Rather than focusing exclusively on increasing pre-training budgets, developers may need to reassess strategies to optimize downstream performance without incurring the negative effects of catastrophic overtraining.
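The "progressive sensitivity" the authors describe can be probed directly: perturb a checkpoint's weights with Gaussian noise of a fixed scale and measure how much its evaluation loss moves. According to the paper, checkpoints trained past the inflection point degrade more under the same perturbation. A rough PyTorch sketch of that probe (an illustration of the idea, not the authors' code; the comparison at the bottom assumes you have two checkpoints and a fixed evaluation function):

```python
import torch

@torch.no_grad()
def perturbation_sensitivity(model, eval_loss_fn, sigma: float = 1e-3) -> float:
    """Return the increase in eval loss after adding N(0, sigma^2) noise to every weight.
    eval_loss_fn(model) -> float should compute loss on a fixed held-out batch.
    Cloning all parameters doubles memory, so this is meant for small research
    checkpoints (e.g. a 1B-parameter model), not frontier-scale ones."""
    baseline = eval_loss_fn(model)

    originals = [p.detach().clone() for p in model.parameters()]
    for p in model.parameters():
        p.add_(torch.randn_like(p) * sigma)   # unstructured Gaussian perturbation

    perturbed = eval_loss_fn(model)

    for p, orig in zip(model.parameters(), originals):
        p.copy_(orig)                          # restore the original weights

    return perturbed - baseline

# Usage idea: compare checkpoints pre-trained on different token budgets.
# delta_short = perturbation_sensitivity(ckpt_2_3T_tokens, eval_loss_fn)
# delta_long  = perturbation_sensitivity(ckpt_3T_tokens, eval_loss_fn)
# The paper's claim is that the longer-trained checkpoint shows the larger delta.
```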


The tool integration problem that’s holding back enterprise AI (and how CoTools solves it)

Researchers from Soochow University in China have introduced Chain-of-Tools (CoTools), a novel framework designed to enhance how large language models (LLMs) use external tools. CoTools aims to provide a more efficient and flexible approach than existing methods, allowing LLMs to leverage vast toolsets directly within their reasoning process, including tools they haven't explicitly been trained on. For enterprises looking to build sophisticated AI agents, this capability could unlock more powerful and adaptable applications without the typical drawbacks of current tool integration techniques.

While modern LLMs excel at text generation, understanding and even complex reasoning, for many tasks they need to interact with external resources and tools such as databases or applications. Equipping LLMs with external tools—essentially APIs or functions they can call—is crucial for extending their capabilities into practical, real-world applications. However, current methods for enabling tool use face significant trade-offs. One common approach involves fine-tuning the LLM on examples of tool usage. While this can make the model proficient at calling the specific tools seen during training, it often restricts the model to only those tools. Furthermore, the fine-tuning process itself can sometimes negatively impact the LLM's general reasoning abilities, such as chain-of-thought (CoT) reasoning, potentially diminishing the core strengths of the foundation model.

The alternative approach relies on in-context learning (ICL), where the LLM is provided with descriptions of available tools and examples of how to use them directly within the prompt. This method offers flexibility, allowing the model to potentially use tools it hasn't seen before. However, constructing these complex prompts can be cumbersome, and the model's efficiency decreases significantly as the number of available tools grows, making it less practical for scenarios with large, dynamic toolsets. As the researchers note in the paper introducing Chain-of-Tools, an LLM agent "should be capable of efficiently managing a large amount of tools and fully utilizing unseen ones during the CoT reasoning, as many new tools may emerge daily in real-world application scenarios."

CoTools offers a compelling alternative by cleverly combining aspects of fine-tuning and semantic understanding while crucially keeping the core LLM "frozen"—meaning its original weights and powerful reasoning capabilities remain untouched. Instead of fine-tuning the entire model, CoTools trains lightweight, specialized modules that work alongside the LLM during its generation process. "The core idea of CoTools is to leverage the semantic representation capabilities of frozen foundation models for determining where to call tools and which tools to call," the researchers write. In essence, CoTools taps into the rich understanding embedded within the LLM's internal representations, often called "hidden states," which are computed as the model processes text and generates response tokens.
CoTools architecture. Credit: arXiv

The CoTools framework comprises three main components that operate sequentially during the LLM's reasoning process:

Tool Judge: As the LLM generates its response token by token, the Tool Judge analyzes the hidden state associated with the potential next token and decides whether calling a tool is appropriate at that specific point in the reasoning chain.

Tool Retriever: If the Judge determines a tool is needed, the Retriever chooses the most suitable tool for the task. The Tool Retriever has been trained to create an embedding of the query and compare it to embeddings of the available tools. This allows it to efficiently select the most semantically relevant tool from the pool of available tools, including "unseen" tools (i.e., those not part of the training data for the CoTools modules).

Tool Calling: Once the best tool is selected, CoTools uses an ICL prompt that demonstrates filling in the tool's parameters based on the context. This targeted use of ICL avoids the inefficiency of adding thousands of demonstrations to the prompt for the initial tool selection. Once the selected tool is executed, its result is inserted back into the LLM's response generation.

By separating the decision-making (Judge) and selection (Retriever), both based on semantic understanding, from the parameter filling (Calling via focused ICL), CoTools achieves efficiency even with massive toolsets while preserving the LLM's core abilities and allowing flexible use of new tools. However, since CoTools requires access to the model's hidden states, it can only be applied to open-weight models such as Llama and Mistral rather than private models such as GPT-4o and Claude.

Example of CoTools in action. Credit: arXiv

The researchers evaluated CoTools across two distinct application scenarios: numerical reasoning using arithmetic tools and knowledge-based question answering (KBQA), which requires retrieval from knowledge bases. On arithmetic benchmarks like GSM8K-XL (using basic operations) and FuncQA (using more complex functions), CoTools applied to LLaMA2-7B achieved performance comparable to ChatGPT on GSM8K-XL and slightly outperformed or matched another tool-learning method, ToolkenGPT, on FuncQA variants. The results highlighted that CoTools effectively enhances the capabilities of the underlying foundation model. For the KBQA tasks, tested on the KAMEL dataset and a newly constructed SimpleToolQuestions (STQuestions) dataset featuring a very large tool pool (1,836 tools, 837 of which are unseen in the test set), CoTools demonstrated superior tool selection accuracy. It particularly excelled in scenarios with massive numbers of tools and when dealing with unseen tools, leveraging their descriptive information for effective retrieval where methods relying solely on trained tool representations faltered. The experiments also indicated that CoTools maintained strong performance despite lower-quality training data.

Implications for the enterprise

Chain-of-Tools presents a promising direction for building more practical and powerful LLM-powered agents in the enterprise. This is especially useful as new standards such as the Model Context Protocol (MCP) enable developers to integrate external tools and resources easily into their applications. Enterprises can potentially deploy agents that adapt to new internal or external APIs and functions with minimal retraining overhead.
The framework’s reliance on semantic understanding via hidden states allows for nuanced and accurate tool selection, which could lead to more reliable AI assistants in tasks that require interaction with diverse information sources and systems. “CoTools explores the way to equip LLMs with massive new tools in a simple
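To make the Judge/Retriever split concrete, here is a stripped-down sketch of the two trainable modules as described above: a binary classifier and an embedding head, both operating on the frozen LLM's hidden states. It is an interpretation of the paper's description, not the authors' released code, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 4096  # illustrative hidden-state size of the frozen LLM

class ToolJudge(nn.Module):
    """Decides, from the hidden state of the potential next token,
    whether a tool call is appropriate at this point in the reasoning chain."""
    def __init__(self, hidden: int = HIDDEN):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(hidden, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.scorer(hidden_state))  # probability of "call a tool here"

class ToolRetriever(nn.Module):
    """Projects the query hidden state and tool-description states into a shared space
    and ranks tools by similarity, so unseen tools can still be selected from their
    descriptions."""
    def __init__(self, hidden: int = HIDDEN, dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(hidden, dim)
        self.tool_proj = nn.Linear(hidden, dim)

    def forward(self, query_state: torch.Tensor, tool_states: torch.Tensor) -> torch.Tensor:
        q = F.normalize(self.query_proj(query_state), dim=-1)   # (dim,)
        t = F.normalize(self.tool_proj(tool_states), dim=-1)    # (n_tools, dim)
        return t @ q                                             # similarity score per tool

# Sketched decision step: call a tool only when the judge fires, then pick the best match
# and fill its parameters with a focused in-context-learning prompt. hidden_state and
# tool_states would come from the frozen LLM's forward pass over the query and the
# tool descriptions.
```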


Runway Gen-4 solves AI video’s biggest problem: character consistency across scenes

Runway AI Inc. launched its most advanced AI video generation model today, entering the next phase of competition to create tools that could transform film production. The new Gen-4 system introduces character and scene consistency across multiple shots — a capability that has evaded most AI video generators until now. The New York-based startup, backed by Google, Nvidia and Salesforce, is releasing Gen-4 to all paid subscribers and enterprise customers, with additional features planned for later this week. Users can generate five- and ten-second clips at 720p resolution.

The launch comes just days after OpenAI released a new image generation feature that also allows character consistency across its images. That release created a cultural phenomenon, with millions of users requesting Studio Ghibli-style images through ChatGPT; it was in part the consistency of the Ghibli style across chats that created the furor. The viral trend became so popular that it temporarily crashed OpenAI's servers, with CEO Sam Altman tweeting that "our GPUs are melting" due to unprecedented demand. The Ghibli-style images also sparked heated debates about copyright, with many questioning whether AI companies can legally mimic distinctive artistic styles.

Visual continuity: The missing piece in AI filmmaking until now

So if character consistency led to massive viral growth for OpenAI's image feature, could the same happen for Runway in video? Character and scene consistency — maintaining the same visual elements across multiple shots and angles — has been the Achilles' heel of AI video generation. When a character's face subtly changes between cuts or a background element disappears without explanation, the artificial nature of the content becomes immediately apparent to viewers. The challenge stems from how these models work at a fundamental level. Previous AI generators treated each frame as a separate creative task, with only loose connections between them. Imagine asking a room full of artists to each draw one frame of a film without seeing what came before or after — the result would be visually disjointed.

Runway's Gen-4 appears to have tackled this problem by creating what amounts to a persistent memory of visual elements. Once a character, object, or environment is established, the system can render it from different angles while maintaining its core attributes. This isn't just a technical improvement; it's the difference between creating interesting visual snippets and telling actual stories. In its announcement post on X, Runway said that "using visual references, combined with instructions, Gen-4 allows you to create new images and videos with consistent styles, subjects, locations and more. Allowing for continuity and control within your stories."

According to Runway's documentation, Gen-4 allows users to provide reference images of subjects and describe the composition they want, with the AI generating consistent outputs from different angles. The company claims the model can render videos with realistic motion while maintaining subject, object, and style consistency. To showcase the model's capabilities, Runway released several short films created entirely with Gen-4.
One film, "New York is a Zoo," demonstrates the model's visual effects by placing realistic animals in cinematic New York settings. Another, titled "The Retrieval," follows explorers searching for a mysterious flower and was produced in less than a week.

From facial animation to world models: Runway's AI filmmaking evolution

Gen-4 builds on Runway's previous tools. In October, the company released Act-One, a feature that allows filmmakers to capture facial expressions from smartphone video and transfer them to AI-generated characters. The following month, Runway added advanced 3D-like camera controls to its Gen-3 Alpha Turbo model, enabling users to zoom in and out of scenes while preserving character forms. This trajectory reveals Runway's strategic vision. While competitors focus on creating ever more realistic single images or clips, Runway has been assembling the components of a complete digital production pipeline. The approach feels more akin to how actual filmmakers work — addressing problems of performance, coverage, and visual continuity as interconnected challenges rather than isolated technical hurdles. The evolution from facial animation tools to consistent world models suggests Runway understands that AI-assisted filmmaking needs to follow the logic of traditional production to be truly useful. It's the difference between creating a tech demo and building tools professionals can actually incorporate into their workflows.

AI video's billion-dollar battle heats up

The financial implications are substantial for Runway, which is reportedly raising a new funding round that would value the company at $4 billion. According to financial reports, the startup aims to reach $300 million in annualized revenue this year following the launch of new products and an API for its video-generating models. Runway has pursued Hollywood partnerships, securing a deal with Lionsgate to create a custom AI video generation model based on the studio's catalog of more than 20,000 titles. The company has also established the Hundred Film Fund, offering filmmakers up to $1 million to produce movies using AI. "We believe that the best stories are yet to be told, but that traditional funding mechanisms often overlook new and emerging visions within the larger industry ecosystem," Runway explains on its fund's website. However, the technology raises concerns for film industry professionals. A 2024 study commissioned by the Animation Guild found that 75% of film production companies that have adopted AI have reduced, consolidated, or eliminated jobs. The study projects that more than 100,000 U.S. entertainment jobs will be affected by generative AI by 2026.

Copyright questions follow AI's creative explosion

Like other AI companies, Runway faces legal scrutiny over its training data. The company is currently defending itself in a lawsuit brought by artists who allege their copyrighted work was used to train AI models without permission. Runway has cited the fair use doctrine as its defense, though courts have yet to definitively rule on this application of copyright law. The copyright debate intensified last week with OpenAI's Studio Ghibli feature, which allowed users to generate images in the


Google’s Gemini 2.5 Pro is the smartest model you’re not using – and 4 reasons it matters for enterprise AI

The release of Gemini 2.5 Pro on Tuesday didn't exactly dominate the news cycle. It landed the same week OpenAI's image-generation update lit up social media with Studio Ghibli-inspired avatars and jaw-dropping instant renders. However, while the buzz went to OpenAI, Google may have quietly dropped the most enterprise-ready reasoning model to date.

Gemini 2.5 Pro marks a significant leap forward for Google in the foundation model race—not just in benchmarks but also in usability. Based on early experiments, benchmark data and hands-on developer reactions, it's a model worth serious attention from enterprise technical decision-makers, particularly those who've historically defaulted to OpenAI or Claude for production-grade reasoning. Here are four major takeaways for enterprise teams evaluating Gemini 2.5 Pro.

1. Transparent, structured reasoning – a new bar for chain-of-thought clarity

What sets Gemini 2.5 Pro apart isn't just its intelligence – it's how clearly that intelligence shows its work. Google's step-by-step training approach results in a structured chain of thought (CoT) that doesn't feel like the rambling or guesswork we've seen from models like DeepSeek. Nor are these CoTs truncated into shallow summaries, as they are with OpenAI's models. The new Gemini model presents ideas in numbered steps, with sub-bullets and internal logic that's remarkably coherent and transparent.

In practical terms, this is a breakthrough for trust and steerability. Enterprise users evaluating output for critical tasks – like reviewing policy implications, coding logic, or summarizing complex research – can now see how the model arrived at an answer. That means they can validate, correct, or redirect it more confidently. It's a major evolution from the "black box" feel that still plagues many large language model (LLM) outputs. For a deeper walkthrough of how this works in action, check out the video breakdown where we test Gemini 2.5 Pro live. One example we discuss: When asked about the limitations of large language models, Gemini 2.5 Pro showed remarkable awareness. It recited common weaknesses and categorized them into areas like "physical intuition," "novel concept synthesis," "long-range planning" and "ethical nuances," providing a framework that helps users understand what the model knows and how it's approaching the problem.

Enterprise technical teams can leverage this capability to:
Debug complex reasoning chains in critical applications
Better understand model limitations in specific domains
Provide more transparent AI-assisted decision-making to stakeholders
Improve their own critical thinking by studying the model's approach

One limitation worth noting is that while this structured reasoning is available in the Gemini app and Google AI Studio, it's not yet accessible via the API—a shortcoming for developers looking to integrate this capability into enterprise applications.

2. A real contender for state-of-the-art – not just on paper

The model is currently sitting at the top of the Chatbot Arena leaderboard by a notable margin – 35 Elo points ahead of the next-best model, which, notably, is the OpenAI GPT-4o update that dropped the day after Gemini 2.5 Pro did. And while benchmark supremacy is often a fleeting crown (as new models drop weekly), Gemini 2.5 Pro feels genuinely different.

Top of the LM Arena leaderboard, as of publishing.
It excels in tasks that reward deep reasoning: coding, nuanced problem-solving, synthesis across documents and even abstract planning. In internal testing, it's performed especially well on previously hard-to-crack benchmarks like Humanity's Last Exam, a favorite for exposing LLM weaknesses in abstract and nuanced domains. (You can see Google's announcement here, along with all of the benchmark information.) Enterprise teams might not care which model wins which academic leaderboard. But they'll care that this one can think – and show you how it's thinking. The vibe test matters, and for once, it's Google's turn to feel like they've passed it. As respected AI engineer Nathan Lambert noted, "Google has the best models again, as they should have started this whole AI bloom. The strategic error has been righted." Enterprise users should view this not just as Google catching up to competitors, but potentially leapfrogging them in capabilities that matter for business applications.

3. Finally, Google's coding game is strong

Historically, Google has lagged behind OpenAI and Anthropic in developer-focused coding assistance. Gemini 2.5 Pro changes that—in a big way. In hands-on tests, it's shown strong one-shot capability on coding challenges, including building a working Tetris game that ran on the first try when exported to Replit—no debugging needed. Even more notable, it reasoned through the code structure with clarity, labeling variables and steps thoughtfully and laying out its approach before writing a single line of code.

The model rivals Anthropic's Claude 3.7 Sonnet, which has been considered the leader in code generation and a major reason for Anthropic's success in the enterprise. But Gemini 2.5 offers a critical advantage: a massive 1-million-token context window. Claude 3.7 Sonnet is only now getting around to offering 500,000 tokens. This massive context window opens new possibilities for reasoning across entire codebases, reading documentation inline and working across multiple interdependent files. Software engineer Simon Willison's experience illustrates this advantage. When using Gemini 2.5 Pro to implement a new feature across his codebase, the model identified necessary changes across 18 different files and completed the entire project in approximately 45 minutes, averaging less than three minutes per modified file. This is a serious tool for enterprises experimenting with agent frameworks or AI-assisted development environments.

4. Multimodal integration with agent-like behavior

While some models like OpenAI's latest GPT-4o may show more dazzle with flashy image generation, Gemini 2.5 Pro feels like it is quietly redefining what grounded, multimodal reasoning looks like. In one example, Ben Dickson's hands-on testing for VentureBeat demonstrated the model's ability to extract key information from a technical article about search algorithms and create a corresponding SVG flowchart – then later improve that flowchart when shown a rendered version with visual errors. This level of multimodal reasoning enables new workflows that weren't previously possible with text-only models. In another example, developer


Hands on with Gemini 2.5 Pro: why it might be the most useful reasoning model yet

Unfortunately for Google, the release of its latest flagship language model, Gemini 2.5 Pro, got buried under the Studio Ghibli AI image storm that sucked the air out of the AI space. And perhaps wary after its previous failed launches, Google cautiously presented it as "our most intelligent AI model," rather than following the approach of other AI labs, which introduce their new models as the best in the world. However, practical experiments with real-world examples show that Gemini 2.5 Pro is really impressive and might currently be the best reasoning model. This opens the way for many new applications and possibly puts Google at the forefront of the generative AI race.

Source: Polymarket

Long context with good coding capabilities

The standout feature of Gemini 2.5 Pro is its very long context window and output length. The model can process up to 1 million tokens (with 2 million coming soon), making it possible to fit multiple long documents and entire code repositories into the prompt when necessary. The model also has an output limit of 64,000 tokens, compared with around 8,000 for other Gemini models. The long context window also allows for extended conversations, as each interaction with a reasoning model can generate tens of thousands of tokens, especially if it involves code, images and video (I've run into this issue with Claude 3.7 Sonnet, which has a 200,000-token context window).

For example, software engineer Simon Willison used Gemini 2.5 Pro to create a new feature for his website. Willison wrote in a blog post, "It crunched through my entire codebase and figured out all of the places I needed to change—18 files in total, as you can see in the resulting PR. The whole project took about 45 minutes from start to finish—averaging less than three minutes per file I had to modify. I've thrown a whole bunch of other coding challenges at it, and the bottleneck on evaluating them has become my own mental capacity to review the resulting code!"

Impressive multimodal reasoning

Gemini 2.5 Pro also has impressive reasoning abilities over unstructured text, images and video. For example, I provided it with the text of my recent article about sampling-based search and prompted it to create an SVG graphic that depicts the algorithm described in the text. Gemini 2.5 Pro correctly extracted key information from the article and created a flowchart for the sampling and search process, even getting the conditional steps right. (For reference, the same task took multiple interactions with Claude 3.7 Sonnet, and I eventually maxed out the token limit.) The rendered image had some visual errors (misplaced arrowheads) and could use a facelift, so I next tested Gemini 2.5 Pro with a multimodal prompt, giving it a screenshot of the rendered SVG file along with the code and prompting it to improve it. The results were impressive. It corrected the arrowheads and improved the visual quality of the diagram.

Other users have had similar experiences with multimodal prompts. For example, in their tests, DataCamp replicated the runner game example presented in the Google blog, then provided the code and a video recording of the game to Gemini 2.5 Pro and prompted it to make some changes to the game's code. The model could reason over the visuals, find the part of the code that needed to be changed, and make the correct modifications.
It is worth noting, however, that like other generative models, Gemini 2.5 Pro is prone to making mistakes such as modifying unrelated files and code segments. The more precise your instructions are, the lower the risk of the model making incorrect changes.

Data analysis with useful reasoning trace

Finally, I tested Gemini 2.5 Pro on my classic messy data analysis test for reasoning models. I provided it with a file containing a mix of plain text and raw HTML data I had copied and pasted from different stock history pages in Yahoo! Finance. Then I prompted it to calculate the value of a portfolio that would invest $140 at the beginning of each month, spread evenly across the Magnificent 7 stocks, from January 2024 to the latest date in the file. The model correctly identified which stocks it had to pick from the file (Amazon, Apple, Nvidia, Microsoft, Tesla, Alphabet and Meta), extracted the financial information from the HTML data, and calculated the value of each investment based on the price of the stocks at the beginning of each month. It responded with a well-formatted table showing stock and portfolio values for each month and provided a breakdown of how much the entire investment was worth at the end of the period.

More importantly, I found the reasoning trace to be very useful. It is not clear whether Google reveals the raw chain-of-thought (CoT) tokens for Gemini 2.5 Pro, but the reasoning trace is very detailed. You can clearly see how the model is reasoning over the data, extracting different bits of information, and calculating the results before generating the answer. This can help troubleshoot the model's behavior and steer it in the right direction when it makes mistakes.

Enterprise-grade reasoning?

One concern about Gemini 2.5 Pro is that it is only available in reasoning mode, which means the model always goes through the "thinking" process even for very simple prompts that can be answered directly. Gemini 2.5 Pro is currently in preview release. Once the full model is released and pricing information is available, we will have a better understanding of how much it will cost to build enterprise applications over the model. However, as inference costs continue to fall, we can expect it to become practical at scale.

Gemini 2.5 Pro might not have had the splashiest debut, but its capabilities demand attention. Its massive context window, impressive multimodal reasoning and detailed reasoning chain offer tangible advantages for complex enterprise workloads, from codebase refactoring to nuanced data analysis.
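The portfolio task described above is easy to verify by hand, which is part of what makes it a good test: it boils down to buying $20 of each of the seven stocks at the start of every month and valuing the accumulated shares at the last available price. A small sketch of that arithmetic (the price table is a placeholder for illustration, not the actual Yahoo Finance data used in the test):

```python
# Dollar-cost-averaging check for the "Magnificent 7" portfolio test:
# $140 per month, split evenly ($20 per stock), from January 2024 onward.
# monthly_prices maps ticker -> price at the start of each month; the numbers
# below are illustrative placeholders, not the real Yahoo Finance data.

monthly_prices = {
    "AAPL":  [185.0, 182.0, 171.0],
    "MSFT":  [376.0, 397.0, 413.0],
    "NVDA":  [48.0, 61.0, 79.0],
    "AMZN":  [151.0, 155.0, 176.0],
    "GOOGL": [139.0, 142.0, 138.0],
    "META":  [346.0, 393.0, 490.0],
    "TSLA":  [248.0, 187.0, 202.0],
}

MONTHLY_BUDGET = 140.0
per_stock = MONTHLY_BUDGET / len(monthly_prices)  # $20 per stock per month
n_months = len(next(iter(monthly_prices.values())))

shares = {ticker: 0.0 for ticker in monthly_prices}
for month in range(n_months):
    for ticker, prices in monthly_prices.items():
        shares[ticker] += per_stock / prices[month]  # buy $20 worth at this month's price

# Value the accumulated shares at the last price in the table (stand-in for
# "the latest date in the file").
final_value = sum(shares[t] * monthly_prices[t][-1] for t in monthly_prices)
invested = MONTHLY_BUDGET * n_months
print(f"Invested ${invested:.2f}, portfolio now worth ${final_value:.2f}")
```

Running this kind of independent check against the model's table is a quick way to confirm that its reasoning trace and its final numbers actually agree.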
