VentureBeat

Nvidia’s $46.7B Q2 proves the platform, but its next fight is ASIC economics on inference

Nvidia reported $46.7 billion in revenue for fiscal Q2 2026 in its earnings announcement and call yesterday, with data center revenue hitting $41.1 billion, up 56% year over year. The company also released guidance for Q3, predicting a $54 billion quarter. Behind these confirmed earnings call numbers lies a more complex story of how custom application-specific integrated circuits (ASICs) are gaining ground in key Nvidia segments and will challenge the company's growth in the quarters to come.

Bank of America's Vivek Arya asked Nvidia's president and CEO, Jensen Huang, if he saw any scenario where ASICs could take market share from Nvidia GPUs. ASICs continue to gain ground on Nvidia through performance and cost advantages, and Broadcom projects 55% to 60% AI revenue growth next year. Huang pushed back hard on the earnings call. He emphasized that building AI infrastructure is "really hard" and most ASIC projects fail to reach production. That's a fair point, but Nvidia has a competitor in Broadcom, which is seeing its AI revenue steadily ramp up, approaching a $20 billion annual run rate. Further underscoring the growing competitive fragmentation of the market is the fact that Google, Meta and Microsoft all deploy custom silicon at scale. The market has spoken.

ASICs are redefining the competitive landscape in real time

Nvidia is more than capable of competing with new ASIC providers. Where it is running into headwinds is how effectively ASIC competitors are positioning the combination of their use cases, performance claims and cost positions. They are also looking to differentiate themselves by the level of ecosystem lock-in they require, with Broadcom leading in this competitive dimension. The following table compares Nvidia Blackwell with its primary competitors.
Real-world results vary significantly depending on specific workloads and deployment configurations:

| Metric | Nvidia Blackwell | Google TPU v5e/v6 | AWS Trainium/Inferentia2 | Intel Gaudi2/3 | Broadcom Jericho3-AI |
| --- | --- | --- | --- | --- | --- |
| Primary Use Cases | Training, inference, generative AI | Hyperscale training & inference | AWS-focused training & inference | Training, inference, hybrid-cloud deployments | AI cluster networking |
| Performance Claims | Up to 50x improvement over Hopper* | 67% improvement TPU v6 vs v5* | Comparable GPU performance at lower power* | 2-4x price-performance vs prior gen* | InfiniBand parity on Ethernet* |
| Cost Position | Premium pricing, comprehensive ecosystem | Significant savings vs GPUs per Google* | Aggressive pricing per AWS marketing* | Budget alternative positioning* | Lower networking TCO per vendor* |
| Ecosystem Lock-In | Moderate (CUDA, proprietary) | High (Google Cloud, TensorFlow/JAX) | High (AWS, proprietary Neuron SDK) | Moderate (supports open stack) | Low (Ethernet-based standards) |
| Availability | Universal (cloud, OEM) | Google Cloud-exclusive | AWS-exclusive | Multiple cloud and on-premise | Broadcom direct, OEM integrators |
| Strategic Appeal | Proven scale, broad support | Cloud workload optimization | AWS integration advantages | Multi-cloud flexibility | Simplified networking |
| Market Position | Leadership with margin pressure | Growing in specific workloads | Expanding within AWS | Emerging alternative | Infrastructure enabler |

*Performance-per-watt improvements and cost savings depend on specific workload characteristics, model types, deployment configurations and vendor testing assumptions. Actual results vary significantly by use case.

Hyperscalers continue building their own paths

Every major cloud provider has adopted custom silicon to gain the performance, cost, ecosystem scale and extensive DevOps advantages of defining an ASIC from the ground up. Google operates TPU v6 in production through its partnership with Broadcom. Meta built MTIA chips specifically for ranking and recommendations. Microsoft develops Project Maia for sustainable AI workloads. Amazon Web Services encourages customers to use Trainium for training and Inferentia for inference. Add to that the fact that ByteDance runs TikTok recommendations on custom silicon despite geopolitical tensions. That's billions of inference requests running on ASICs daily, not GPUs.

CFO Colette Kress acknowledged the competitive reality during the call. She referenced China revenue, saying it had dropped to a low single-digit percentage of data center revenue. Current Q3 guidance excludes H20 shipments to China completely. While Huang tried to steer the earnings call in a positive direction with statements about China's extensive opportunities, it was clear that equity analysts weren't buying all of it. The general tone and perspective is that export controls create ongoing uncertainty for Nvidia in a market that arguably represents its second most significant growth opportunity. Huang said that 50% of all AI researchers are in China and that he is fully committed to serving that market.

Nvidia's platform advantage is one of its greatest strengths

Huang made a valid case for Nvidia's integrated approach during the earnings call. Building modern AI requires six different chip types working together, he argued, and that complexity creates barriers competitors struggle to match. Nvidia doesn't just ship GPUs anymore, he emphasized multiple times on the earnings call.
The company delivers a complete AI infrastructure that scales globally, he emphatically stated, returning to AI infrastructure as a core message of the earnings call and citing it six times.

The platform's ubiquity makes it the default configuration in nearly every cloud hyperscaler's DevOps cycle. Nvidia runs across AWS, Azure and Google Cloud. PyTorch and TensorFlow also optimize for CUDA by default. When Meta drops a new Llama model or Google updates Gemini, they target Nvidia hardware first because that's where millions of developers already work. The ecosystem creates its own gravity.

The networking business validates the AI infrastructure strategy. Revenue hit $7.3 billion in Q2, up 98% year over year. NVLink connects GPUs at speeds traditional networking can't touch. Huang revealed the real economics during the call: Nvidia captures about 35% of a typical gigawatt AI factory's budget. "Out of a gigawatt AI factory, which can go anywhere from 50 to, you know, plus or minus 10%, let's say, to $60 billion, we represent about 35% plus or minus of that. … And of course, what you get for that is not a GPU. … we've really transitioned to become an AI infrastructure company," Huang said. That's
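Taking Huang's quoted numbers at face value, the implied Nvidia content per gigawatt-scale AI factory is straightforward arithmetic; the snippet below is back-of-the-envelope only, using the $50 billion to $60 billion range and the roughly 35% share he cited:

```python
# Back-of-the-envelope arithmetic on the figures Huang quoted (illustrative only).
factory_cost_low, factory_cost_high = 50e9, 60e9  # quoted range for a gigawatt AI factory
nvidia_share = 0.35                               # Nvidia's stated share, "plus or minus"

low, high = factory_cost_low * nvidia_share, factory_cost_high * nvidia_share
print(f"Implied Nvidia content per factory: ${low / 1e9:.1f}B to ${high / 1e9:.1f}B")
# Implied Nvidia content per factory: $17.5B to $21.0B
```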


How procedural memory can cut the cost and complexity of AI agents

A new technique from Zhejiang University and Alibaba Group gives large language model (LLM) agents a dynamic memory, making them more efficient and effective at complex tasks. The technique, called Memp, provides agents with a "procedural memory" that is continuously updated as they gain experience, much like how humans learn from practice.

Memp creates a lifelong learning framework where agents don't have to start from scratch for every new task. Instead, they become progressively better and more efficient as they encounter new situations in real-world environments, a key requirement for reliable enterprise automation.

The case for procedural memory in AI agents

LLM agents hold promise for automating complex, multi-step business processes. In practice, though, these long-horizon tasks can be fragile. The researchers point out that unpredictable events like network glitches, user interface changes or shifting data schemas can derail the entire process. For current agents, this often means starting over every time, which can be time-consuming and costly.

Meanwhile, many complex tasks, despite surface differences, share deep structural commonalities. Instead of relearning these patterns every time, an agent should be able to extract and reuse its experience from past successes and failures, the researchers point out. This requires a specific "procedural memory," which in humans is the long-term memory responsible for skills like typing or riding a bike that become automatic with practice.

Starting from scratch (top) vs. using procedural memory (bottom) (source: arXiv)

Current agent systems often lack this capability. Their procedural knowledge is typically hand-crafted by developers, stored in rigid prompt templates or embedded within the model's parameters, which are expensive and slow to update. Even existing memory-augmented frameworks provide only coarse abstractions and don't adequately address how skills should be built, indexed, corrected and eventually pruned over an agent's lifecycle. Consequently, the researchers note in their paper, "there is no principled way to quantify how efficiently an agent evolves its procedural repertoire or to guarantee that new experiences improve rather than erode performance."

How Memp works

Memp is a task-agnostic framework that treats procedural memory as a core component to be optimized. It consists of three key stages that work in a continuous loop: building, retrieving and updating memory.

Memories are built from an agent's past experiences, or "trajectories." The researchers explored storing these memories in two formats: verbatim, step-by-step actions, or distilling these actions into higher-level, script-like abstractions. For retrieval, the agent searches its memory for the most relevant past experience when given a new task. The team experimented with different methods, such as vector search, to match the new task's description to past queries, or extracting keywords to find the best fit.

The most critical component is the update mechanism. Memp introduces several strategies to ensure the agent's memory evolves.
As an agent completes more tasks, its memory can be updated by simply adding the new experience, filtering for only successful outcomes or, most effectively, reflecting on failures to correct and revise the original memory.

Memp framework (source: arXiv)

This focus on dynamic, evolving memory places Memp within a growing field of research aimed at making AI agents more reliable for long-term tasks. The work parallels other efforts, such as Mem0, which consolidates key information from long conversations into structured facts and knowledge graphs to ensure consistency. Similarly, A-MEM enables agents to autonomously create and link "memory notes" from their interactions, forming a complex knowledge structure over time.

However, co-author Runnan Fang highlights a critical distinction between Memp and other frameworks. "Mem0 and A-MEM are excellent works… but they focus on remembering salient content within a single trajectory or conversation," Fang commented to VentureBeat. In essence, they help an agent remember "what" happened. "Memp, by contrast, targets cross-trajectory procedural memory." It focuses on "how-to" knowledge that can be generalized across similar tasks, preventing the agent from re-exploring from scratch each time.

"By distilling past successful workflows into reusable procedural priors, Memp raises success rates and shortens steps," Fang added. "Crucially, we also introduce an update mechanism so that this procedural memory keeps improving — after all, practice makes perfect for agents too."

Overcoming the 'cold-start' problem

While the concept of learning from past trajectories is powerful, it raises a practical question: How does an agent build its initial memory when there are no perfect examples to learn from? The researchers address this "cold-start" problem with a pragmatic approach. Fang explained that developers can first define a robust evaluation metric instead of requiring a perfect "gold" trajectory upfront. This metric, which can be rule-based or even another LLM, scores the quality of an agent's performance.

"Once that metric is in place, we let state-of-the-art models explore within the agent workflow and retain the trajectories that achieve the highest scores," Fang said. This process rapidly bootstraps an initial set of useful memories, allowing a new agent to get up to speed without extensive manual programming.

Memp in action

To test the framework, the team implemented Memp on top of powerful LLMs like GPT-4o, Claude 3.5 Sonnet and Qwen2.5, evaluating them on complex tasks like household chores in the ALFWorld benchmark and information-seeking in TravelPlanner. The results showed that building and retrieving procedural memory allowed an agent to distill and reuse its prior experience effectively.

During testing, agents equipped with Memp not only achieved higher success rates but became much more efficient. They eliminated fruitless exploration and trial-and-error, leading to a substantial reduction in both the number of steps and the token consumption required to complete a task.

Using procedural memory (right) helps agents accomplish tasks in fewer steps and using fewer tokens (source: arXiv)

One of the most significant findings for enterprise applications is that procedural memory is
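The build, retrieve and update loop described above can be outlined in a few lines of code. The sketch below is illustrative only; the embedding, distillation and similarity functions are hypothetical placeholders, not the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ProceduralMemory:
    """Toy cross-trajectory procedural memory store (illustrative, not Memp's code)."""
    entries: list = field(default_factory=list)  # each entry: (task_embedding, procedure, score)

    def build(self, task, trajectory, score, embed, distill):
        # Store either verbatim steps or a distilled, script-like abstraction of them.
        self.entries.append((embed(task), distill(trajectory), score))

    def retrieve(self, task, embed, similarity, k=1):
        # Vector search: return the procedures whose source tasks look most like the new one.
        ranked = sorted(self.entries, key=lambda e: similarity(embed(task), e[0]), reverse=True)
        return [procedure for _, procedure, _ in ranked[:k]]

    def update(self, task, trajectory, score, embed, distill, keep_threshold=0.5):
        # Two of the simpler strategies from the paper: add the new experience, or keep
        # only successful outcomes. (The most effective variant also reflects on failures
        # to revise existing memories.)
        if score >= keep_threshold:
            self.build(task, trajectory, score, embed, distill)
```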


Chan Zuckerberg Initiative's rBio uses virtual cells to train AI, bypassing lab work

The Chan Zuckerberg Initiative announced Thursday the launch of rBio, the first artificial intelligence model trained to reason about cellular biology using virtual simulations rather than requiring expensive laboratory experiments — a breakthrough that could dramatically accelerate biomedical research and drug discovery.

The reasoning model, detailed in a research paper published on bioRxiv, demonstrates a novel approach called "soft verification" that uses predictions from virtual cell models as training signals instead of relying solely on experimental data. This paradigm shift could help researchers test biological hypotheses computationally before committing time and resources to costly laboratory work.

"The idea is that you have these super powerful models of cells, and you can use them to simulate outcomes rather than testing them experimentally in the lab," said Ana-Maria Istrate, senior research scientist at CZI and lead author of the research, in an interview. "The paradigm so far has been that 90% of the work in biology is tested experimentally in a lab, while 10% is computational. With virtual cell models, we want to flip that paradigm."

How AI finally learned to speak the language of living cells

The announcement represents a significant milestone for CZI's ambitious goal to "cure, prevent, and manage all disease by the end of this century." Under the leadership of pediatrician Priscilla Chan and Meta CEO Mark Zuckerberg, the $6 billion philanthropic initiative has increasingly focused its resources on the intersection of artificial intelligence and biology.

rBio addresses a fundamental challenge in applying AI to biological research. While large language models like ChatGPT excel at processing text, biological foundation models typically work with complex molecular data that cannot be easily queried in natural language. Scientists have struggled to bridge this gap between powerful biological models and user-friendly interfaces.

"Foundation models of biology — models like GREmLN and TranscriptFormer — are built on biological data modalities, which means you cannot interact with them in natural language," Istrate explained. "You have to find complicated ways to prompt them."

The new model solves this problem by distilling knowledge from CZI's TranscriptFormer — a virtual cell model trained on 112 million cells from 12 species spanning 1.5 billion years of evolution — into a conversational AI system that researchers can query in plain English.

The 'soft verification' revolution: Teaching AI to think in probabilities, not absolutes

The core innovation lies in rBio's training methodology. Traditional reasoning models learn from questions with unambiguous answers, like mathematical equations. But biological questions involve uncertainty and probabilistic outcomes that don't fit neatly into binary categories. CZI's research team, led by Senior Director of AI Theofanis Karaletsos and Istrate, overcame this challenge by using reinforcement learning with proportional rewards.
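In schematic terms, that proportional reward can be written as a toy function. The sketch below is illustrative only; the virtual-cell likelihood call is a hypothetical stand-in for a TranscriptFormer-style query, not CZI's actual training code:

```python
def binary_reward(answer: str, gold: str) -> float:
    # Conventional RL verification: exact match earns 1, everything else earns 0.
    return 1.0 if answer == gold else 0.0

def soft_reward(predicted_effect: str, simulate_likelihood) -> float:
    # "Soft verification": the reward is proportional to the likelihood, estimated by a
    # virtual cell model, that the biological prediction aligns with reality.
    # `simulate_likelihood` is a hypothetical stand-in for querying such a model.
    return simulate_likelihood(predicted_effect)  # a probability in [0, 1]

# Example: an "almost right" answer earns partial credit instead of a flat zero.
reward = soft_reward("knockdown of gene A increases gene B activity",
                     simulate_likelihood=lambda claim: 0.82)
print(reward)  # 0.82
```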
Instead of simple yes-or-no verification, the model receives rewards proportional to the likelihood that its biological predictions align with reality, as determined by virtual cell simulations.

"We applied new methods to how LLMs are trained," the research paper explains. "Using an off-the-shelf language model as a scaffold, the team trained rBio with reinforcement learning, a common technique in which the model is rewarded for correct answers. But instead of asking a series of yes/no questions, the researchers tuned the rewards in proportion to the likelihood that the model's answers were correct."

This approach allows scientists to ask complex questions like "Would suppressing the actions of gene A result in an increase in activity of gene B?" and receive scientifically grounded responses about cellular changes, including shifts from healthy to diseased states.

Beating the benchmarks: How rBio outperformed models trained on real lab data

In testing against the PerturbQA benchmark — a standard dataset for evaluating gene perturbation prediction — rBio demonstrated competitive performance with models trained on experimental data. The system outperformed baseline large language models and matched the performance of specialized biological models on key metrics.

Particularly noteworthy, rBio showed strong "transfer learning" capabilities, successfully applying knowledge about gene co-expression patterns learned from TranscriptFormer to make accurate predictions about gene perturbation effects — a completely different biological task. "We show that on the PerturbQA dataset, models trained using soft verifiers learn to generalize on out-of-distribution cell lines, potentially bypassing the need to train on cell-line specific experimental data," the researchers wrote.

When enhanced with chain-of-thought prompting techniques that encourage step-by-step reasoning, rBio achieved state-of-the-art performance, surpassing the previous leading model, SUMMER.

From social justice to science: Inside CZI's controversial pivot to pure research

The rBio announcement comes as CZI has undergone significant organizational changes, refocusing its efforts from a broad philanthropic mission that included social justice and education reform to a more targeted emphasis on scientific research. The shift has drawn criticism from some former employees and grantees who saw the organization abandon progressive causes. However, for Istrate, who has worked at CZI for six years, the focus on biological AI represents a natural evolution of long-standing priorities. "My experience and work has not changed much. I have been part of the science initiative for as long as I have been at CZI," she said.

The concentration on virtual cell models builds on nearly a decade of foundational work. CZI has invested heavily in building cell atlases — comprehensive databases showing which genes are active in different cell types across species — and developing the computational infrastructure needed to train large biological models. "I'm really excited about the work that's been happening at CZI for years now, because we've been building up to this moment," Istrate noted, referring to the organization's earlier investments in data platforms and single-cell transcriptomics.

Building bias-free biology: How CZI curated diverse data to train fairer AI models

One critical advantage of CZI's approach stems from its years of careful data curation. The organization operates


Nous Research drops Hermes 4 AI models that outperform ChatGPT without content restrictions

Nous Research, a secretive artificial intelligence startup that has emerged as a leading voice in the open-source AI movement, quietly released Hermes 4 on Monday, a family of large language models that the company claims can match the performance of leading proprietary systems while offering unprecedented user control and minimal content restrictions.

The release represents a significant escalation in the battle between open-source AI advocates and major technology companies over who should control access to advanced artificial intelligence capabilities. Unlike models from OpenAI, Google, or Anthropic, Hermes 4 is designed to respond to nearly any request without the safety guardrails that have become standard in commercial AI systems.

"Hermes 4 builds on our legacy of user-aligned models with expanded test-time compute capabilities," Nous Research announced on X (formerly Twitter). "Special attention was given to making the models creative and interesting to interact with, unencumbered by censorship, and neutrally aligned while maintaining state of the art level math, coding, and reasoning performance for open weight models."

How Hermes 4's 'hybrid reasoning' mode outperforms ChatGPT and Claude on math benchmarks

Hermes 4 introduces what Nous Research calls "hybrid reasoning," allowing users to toggle between fast responses and deeper, step-by-step thinking processes. When activated, the models generate their internal reasoning within special <think> tags before providing a final answer — similar to OpenAI's o1 reasoning models but with full transparency into the AI's thought process.

The technical achievement is substantial. In testing, Hermes 4's largest 405-billion parameter model scored 96.3% on the MATH-500 benchmark in reasoning mode and 81.9% on the challenging AIME'24 mathematics competition — performance that rivals or exceeds many proprietary systems costing millions more to develop. "The challenge is making thinking traces useful and verifiable without runaway reasoning," noted AI researcher Rohan Paul on X, highlighting one of the technical breakthroughs in the release.

Perhaps most notably, Hermes 4 achieved the highest score among all tested models on "RefusalBench," a new benchmark Nous Research created to measure how often AI systems refuse to answer questions. The model scored 57.1% in reasoning mode, significantly outperforming GPT-4o (17.67%) and Claude Sonnet 4 (17%).

Hermes 4 models from Nous Research answered significantly more questions than competing AI systems on RefusalBench, a test measuring how often models refuse to respond to user requests. (Credit: Nous Research)
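Because the reasoning trace is emitted inside literal <think> tags, separating it from the final answer is a simple parsing step. The snippet below is a minimal illustration of that format; the sample output string is invented for the example:

```python
import re

def split_reasoning(output: str):
    """Separate a hybrid-reasoning response into (reasoning_trace, final_answer)."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return reasoning, answer

# Invented example output, for illustration only:
sample = "<think>480 = 2^5 * 3 * 5, so the divisor count is ...</think>The answer is 24."
trace, answer = split_reasoning(sample)
print(answer)  # The answer is 24.
```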
Inside DataForge and Atropos: The breakthrough training systems behind Hermes 4's capabilities

Behind Hermes 4's capabilities lies a sophisticated training infrastructure that Nous Research has developed over several years. The models were trained using two novel systems: DataForge, a graph-based synthetic data generator, and Atropos, an open-source reinforcement learning framework.

DataForge creates training data through what the company describes as "random walks" through directed graphs, transforming simple pre-training data into complex instruction-following examples. The system can, for instance, take a Wikipedia article and transform it into a rap song, then generate questions and answers based on that transformation.

Atropos, meanwhile, operates like hundreds of specialized training environments where AI models practice specific skills — mathematics, coding, tool use, and creative writing — receiving feedback only when they produce correct solutions. This "rejection sampling" approach ensures that only verified, high-quality responses make it into the training data.

"Nous used these environments to generate the dataset for Hermes 4!" explained Tommy Shaughnessy, a venture capitalist at Delphi Ventures who has invested in Nous Research. "All in the dataset contains 3.5 million reasoning samples and 1.6 million non-reasoning samples! Hermes was trained on RL data, not just static datasets of question and answer!"

The training process required 192 Nvidia B200 GPUs and 71,616 GPU hours for the largest model — a significant but not unprecedented computational investment that demonstrates how specialized techniques can compete with the massive scale of tech giants.

Why Nous Research believes AI safety guardrails are 'annoying as hell' and hurt innovation

Nous Research has built its reputation on a philosophy that puts user control above corporate content policies. The company's models are designed to be "steerable," meaning they can be fine-tuned or prompted to behave in specific ways without the rigid safety constraints that characterize commercial AI systems.

"Hermes 4 is not shackled by disclaimers, rules and being overly cautious which is annoying as hell and hurts innovation and usability," wrote Shaughnessy in a detailed thread analyzing the release. "If its open source but refuses all requests its pointless. Not an issue with Hermes 4."

This approach has made Nous Research popular among AI researchers and developers who want maximum flexibility, but it also places the company at the center of ongoing debates about AI safety and content moderation. While the models can theoretically be used for harmful purposes, Nous Research argues that transparency and user control are preferable to corporate gatekeeping. The company's technical report, released alongside the models,


How Sakana AI’s new evolutionary algorithm builds powerful AI models without expensive retraining

A new evolutionary technique from Japan-based AI lab Sakana AI enables developers to augment the capabilities of AI models without costly training and fine-tuning processes. The technique, called Model Merging of Natural Niches (M2N2), overcomes the limitations of other model merging methods and can even evolve new models entirely from scratch.

M2N2 can be applied to different types of machine learning models, including large language models (LLMs) and text-to-image generators. For enterprises looking to build custom AI solutions, the approach offers a powerful and efficient way to create specialized models by combining the strengths of existing open-source variants.

What is model merging?

Model merging is a technique for integrating the knowledge of multiple specialized AI models into a single, more capable model. Instead of fine-tuning, which refines a single pre-trained model using new data, merging combines the parameters of several models simultaneously. This process can consolidate a wealth of knowledge into one asset without requiring expensive, gradient-based training or access to the original training data.

For enterprise teams, this offers several practical advantages over traditional fine-tuning. In comments to VentureBeat, the paper's authors said model merging is a gradient-free process that only requires forward passes, making it computationally cheaper than fine-tuning, which involves costly gradient updates. Merging also sidesteps the need for carefully balanced training data and mitigates the risk of "catastrophic forgetting," where a model loses its original capabilities after learning a new task. The technique is especially powerful when the training data for specialist models isn't available, as merging only requires the model weights themselves.

Early approaches to model merging required significant manual effort, as developers adjusted coefficients through trial and error to find the optimal blend. More recently, evolutionary algorithms have helped automate this process by searching for the optimal combination of parameters. However, a significant manual step remains: developers must set fixed sets of mergeable parameters, such as layers. This restriction limits the search space and can prevent the discovery of more powerful combinations.

How M2N2 works

M2N2 addresses these limitations by drawing inspiration from evolutionary principles in nature. The algorithm has three key features that allow it to explore a wider range of possibilities and discover more effective model combinations.

Model Merging of Natural Niches (source: arXiv)

First, M2N2 eliminates fixed merging boundaries, such as blocks or layers. Instead of grouping parameters by pre-defined layers, it uses flexible "split points" and "mixing ratios" to divide and combine models. This means that, for example, the algorithm might merge 30% of the parameters in one layer from Model A with 70% of the parameters from the same layer in Model B. The process starts with an "archive" of seed models. At each step, M2N2 selects two models from the archive, determines a mixing ratio and a split point, and merges them.
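To make the merge step concrete, the toy sketch below combines two models layer by layer, taking the first fraction of each layer's parameters from one parent and the rest from the other. It is a simplified illustration of the split-point idea described above, not Sakana AI's implementation:

```python
import numpy as np

def merge_layer(layer_a: np.ndarray, layer_b: np.ndarray, mixing_ratio: float) -> np.ndarray:
    """Take the first `mixing_ratio` fraction of parameters from A and the rest from B."""
    flat_a, flat_b = layer_a.ravel(), layer_b.ravel()
    split_point = int(len(flat_a) * mixing_ratio)
    merged = np.concatenate([flat_a[:split_point], flat_b[split_point:]])
    return merged.reshape(layer_a.shape)

def merge_models(model_a: dict, model_b: dict, mixing_ratio: float) -> dict:
    """Merge two state dicts layer by layer with the same mixing ratio (simplified)."""
    return {name: merge_layer(model_a[name], model_b[name], mixing_ratio)
            for name in model_a}

# Toy example: roughly 30% of each layer from A, the rest from B.
a = {"layer1": np.zeros((4, 4)), "layer2": np.zeros(8)}
b = {"layer1": np.ones((4, 4)), "layer2": np.ones(8)}
merged = merge_models(a, b, mixing_ratio=0.3)
print(merged["layer2"])  # [0. 0. 1. 1. 1. 1. 1. 1.] -> first slice from A, remainder from B
```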
If the resulting model performs well, it is added back to the archive, replacing a weaker one. This allows the algorithm to explore increasingly complex combinations over time. As the researchers note, "This gradual introduction of complexity ensures a wider range of possibilities while maintaining computational tractability."

Second, M2N2 manages the diversity of its model population through competition. To understand why diversity is crucial, the researchers offer a simple analogy: "Imagine merging two answer sheets for an exam… If both sheets have exactly the same answers, combining them does not make any improvement. But if each sheet has correct answers for different questions, merging them gives a much stronger result." Model merging works the same way. The challenge, however, is defining what kind of diversity is valuable. Instead of relying on hand-crafted metrics, M2N2 simulates competition for limited resources. This nature-inspired approach naturally rewards models with unique skills, as they can "tap into uncontested resources" and solve problems others can't. These niche specialists, the authors note, are the most valuable for merging.

Third, M2N2 uses a heuristic called "attraction" to pair models for merging. Rather than simply combining the top-performing models as in other merging algorithms, it pairs them based on their complementary strengths. An "attraction score" identifies pairs where one model performs well on data points that the other finds challenging. This improves both the efficiency of the search and the quality of the final merged model.

M2N2 in action

The researchers tested M2N2 across three different domains, demonstrating its versatility and effectiveness.

The first was a small-scale experiment evolving neural network-based image classifiers from scratch on the MNIST dataset. M2N2 achieved the highest test accuracy by a substantial margin compared to other methods. The results showed that its diversity-preservation mechanism was key, allowing it to maintain an archive of models with complementary strengths that facilitated effective merging while systematically discarding weaker solutions.

Next, they applied M2N2 to LLMs, combining a math specialist model (WizardMath-7B) with an agentic specialist (AgentEvol-7B), both of which are based on the Llama 2 architecture. The goal was to create a single agent that excelled at both math problems (GSM8K dataset) and web-based tasks (WebShop dataset). The resulting model achieved strong performance on both benchmarks, showcasing M2N2's ability to create powerful, multi-skilled models.

A model merge with M2N2 combines the best of both seed models (source: arXiv)

Finally, the team merged diffusion-based image generation models. They combined a model trained on Japanese prompts (JSDXL) with three Stable Diffusion models primarily trained on English prompts. The objective was to create a model that combined the best image generation capabilities of each seed model while retaining the ability to understand Japanese. The merged model not only produced more photorealistic images with better semantic understanding but also developed an emergent bilingual ability. It could generate high-quality images from both English and Japanese


Salesforce builds ‘flight simulator’ for AI agents as 95% of enterprise pilots fail to reach production

Salesforce is betting that rigorous testing in simulated business environments will solve one of enterprise artificial intelligence's biggest problems: agents that work in demonstrations but fail in the messy reality of corporate operations.

The cloud software giant unveiled three major AI research initiatives this week, including CRMArena-Pro, which it calls a "digital twin" of business operations where AI agents can be stress-tested before deployment. The announcement comes as enterprises grapple with widespread AI pilot failures and fresh security concerns following recent breaches that compromised hundreds of Salesforce customer instances.

"Pilots don't learn to fly in a storm; they train in flight simulators that push them to prepare in the most extreme challenges," said Silvio Savarese, Salesforce's chief scientist and head of AI research, during a press conference. "Similarly, AI agents benefit from simulation testing and training, preparing them to handle the unpredictability of daily business scenarios in advance of their deployment."

The research push reflects growing enterprise frustration with AI implementations. A recent MIT report found that 95% of generative AI pilots at companies are failing to reach production, while Salesforce's own studies show that large language models alone achieve only 35% success rates in complex business scenarios.

Digital twins for enterprise AI: how Salesforce simulates real business chaos

CRMArena-Pro represents Salesforce's attempt to bridge the gap between AI promise and performance. Unlike existing benchmarks that test generic capabilities, the platform evaluates agents on real enterprise tasks like customer service escalations, sales forecasting, and supply chain disruptions using synthetic but realistic business data. "If synthetic data is not generated carefully, it can lead to misleading or over optimistic results about how well your agent actually perform in your real environment," explained Jason Wu, a research manager at Salesforce who led the CRMArena-Pro development.

The platform operates within actual Salesforce production environments rather than toy setups, using data validated by domain experts with relevant business experience. It supports both business-to-business and business-to-consumer scenarios and can simulate multi-turn conversations that capture real conversational dynamics.

Salesforce has been using itself as "customer zero" to test these innovations internally. "Before we bring anything to the market, we will put innovation into the hands of our own team to test it out," said Muralidhar Krishnaprasad, Salesforce's president and CTO, during the press conference.

Five metrics that determine if your AI agent is enterprise-ready

Alongside the simulation environment, Salesforce introduced the Agentic Benchmark for CRM, designed to evaluate AI agents across five critical enterprise metrics: accuracy, cost, speed, trust and safety, and environmental sustainability. The sustainability metric is particularly notable, helping companies align model size with task complexity to reduce environmental impact while maintaining performance.
"By cutting through model overload noise, the benchmark gives businesses a clear, data-driven way to pair the right models with the right agents," the company stated. The benchmarking effort addresses a practical challenge facing IT leaders: with new AI models released almost daily, determining which ones are suitable for specific business applications has become increasingly difficult.

Why messy enterprise data could make or break your AI deployment

The third initiative focuses on a fundamental prerequisite for reliable AI: clean, unified data. Salesforce's Account Matching capability uses fine-tuned language models to automatically identify and consolidate duplicate records across systems, recognizing that "The Example Company, Inc." and "Example Co." represent the same entity.

The data consolidation work emerged from a partnership between Salesforce's research and product teams. "What identity resolution in Data Cloud implies is essentially, if you think about something as simple as even a user, they have many, many, many IDs across many systems within any company," Krishnaprasad explained. One major cloud provider customer achieved a 95% match rate using the technology, saving sellers 30 minutes per connection by eliminating the need to manually cross-reference multiple screens to identify accounts.

The announcements come amid heightened security concerns following a data theft campaign that affected over 700 Salesforce customer organizations earlier this month. According to Google's Threat Intelligence Group, hackers exploited OAuth tokens from Salesloft's Drift chat agent to access Salesforce instances and steal credentials for Amazon Web Services, Snowflake, and other platforms. The breach highlighted vulnerabilities in third-party integrations that enterprises rely on for AI-powered customer engagement. Salesforce has since removed Salesloft Drift from its AppExchange marketplace pending investigation.

The gap between AI demos and enterprise reality is bigger than you think

The simulation and benchmarking initiatives reflect a broader recognition that enterprise AI deployment requires more than impressive demonstration videos. Real business environments feature legacy software, inconsistent data formats, and complex workflows that can derail even sophisticated AI systems. "The main aspects that we want we were been discussing today is the consistency aspect, so how to ensure that we go from these in a way unsatisfactory performance, if you just plug an LM into an enterprise use cases, into something which is achieves much higher performances," Savarese said during the press conference.

Salesforce's approach emphasizes the need for AI agents to work reliably across diverse scenarios rather than excelling at narrow tasks. The company's concept of "Enterprise General Intelligence" (EGI) focuses on building agents that are both capable and consistent in performing complex business tasks.

As enterprises continue to invest in AI technologies, the success of platforms like CRMArena-Pro may determine whether the current wave of AI enthusiasm translates into sustainable business transformation or becomes another example of technology promise exceeding practical delivery. The research initiatives will be showcased at Salesforce's Dreamforce conference in October, where the company is expected to announce additional AI developments as it seeks to maintain its leadership position in the increasingly competitive enterprise AI market.
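As a toy illustration of the duplicate-record problem described in the Account Matching section above, the snippet below normalizes company names and scores their similarity with the standard library. Salesforce says it uses fine-tuned language models for this task; the sketch only shows the shape of the matching problem, not the company's method:

```python
import re
from difflib import SequenceMatcher

# Common corporate suffixes and articles to strip before comparing names.
SUFFIXES = r"\b(inc|incorporated|co|corp|corporation|company|llc|ltd|the)\b"

def normalize(name: str) -> str:
    name = name.lower()
    name = re.sub(r"[.,]", " ", name)
    name = re.sub(SUFFIXES, " ", name)
    return re.sub(r"\s+", " ", name).strip()

def likely_same_account(a: str, b: str, threshold: float = 0.85) -> bool:
    # Flag probable duplicates when normalized names are nearly identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(likely_same_account("The Example Company, Inc.", "Example Co."))  # True
```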


Busted by the em dash — AI’s favorite punctuation mark, and how it’s blowing your cover

Let's talk about the em dash. Not the little innocent hyphen, not its slightly more confident cousin, the en dash. No, I'm talking about the 'EM dash,' that long, dramatic line that AI looooooves to drop in your sentences like it's getting paid per dash. Seriously, it's the AI version of jazz hands. You may not notice it, but most everyone else does. It's the dead giveaway that you've let your favorite robot sidekick dress your words up in AI drag, and just like a bad wig reveal in the third act of RuPaul's Drag Race, it can be… a little too much.

Let me set the scene: You're writing a heartfelt email to your team. Something vulnerable, maybe even raw: "I've been thinking a lot about the way we work together — and how we can be better — not just as colleagues, but as humans." Except, wait. You didn't write that sentence, AI did. You just wanted it to fix a typo and maybe zhuzh up the tone, but now it's full of em dashes, introspective pacing and oddly placed poetic pauses. You've officially been "EM-marked."

What is the em-mark for AI?

The em dash is that long horizontal line (—) that's often used in place of commas, colons, parentheses or the occasional dramatic pause. It's like the Swiss Army knife of punctuation, and AI LOVES it.

AI is obsessed with em dashes the way Gen Z is obsessed with Y2K fashion; it's confusing, oddly stylish, and borderline offensive when overused. But here's the kicker: AI uses em dashes like sprinkles on a kid's cupcake, everywhere. Even when it's not appropriate. Even when you say, "No sprinkles, please." I have literally typed to AI: "Please remove the em dashes." And what do I get back? "Got it!" followed by: "This is a major opportunity — one that demands urgency — and clarity — for maximum impact." Thanks, GPT. You removed exactly zero.

So, how do you sound human (but still use AI)?

Despite the dash drama, I'm not here to tell you to throw out AI altogether. AI is brilliant at polishing, rephrasing and getting you out of your own mental way. But like a child with glitter glue, you still need to supervise it. Here are three actually-helpful tips to make sure your communication still sounds like you, not HAL 9000 with a journalism degree.

1. Human first draft, robot second

Always, and I mean always, write the first draft yourself. Let it be messy, typo-riddled, emotionally chaotic and uncomfortably honest. That's what gives your voice its fingerprints. Then let AI fix it up, rearrange and suggest better flow, but not before. AI can't guess what you meant if you don't give it something to work with first. Otherwise, it just serves you a perfectly punctuated bowl of oatmeal with the emotional depth of a DMV form letter. Think of it like this: You're the chef, AI is just your fancy sous-chef with a tiny top hat. You tell it what you're making. You don't let it invent the recipe.

2. Strip the ems (and other AI tells)

Once AI gives you its best version, rip it apart like you're editing a screenplay about a talking golden retriever that writes blogs.
Look for:

- Em dashes (obviously)
- The phrase "in today's fast-paced world" (AI's favorite opening line)
- Overuse of rhetorical questions
- Repetitive alliteration (AI really thinks it's clever)

Do a "find and replace" for "—" if you must. Replace them with commas, periods or, God forbid, actual pauses in thought. It'll instantly humanize your tone. If your sentence feels like it's being narrated by Morgan Freeman in a nature documentary, it's probably too AI-ish.

3. Add the 'you' back in

After polishing, re-read it aloud. Ask yourself:

- Would I say this out loud at brunch?
- Does this sound like me, or a guest columnist for Forbes trying too hard?
- Did I just accidentally quote Tony Robbins?

If it feels too stiff or polished, loosen it up, add a little slang. Break a grammar rule, use sentence fragments, write like you talk when you're three mimosas deep and giving your best friend life advice. That's the secret sauce.

Example:

AI version: "Let's explore innovative solutions to elevate our business trajectory."
You version: "Let's figure out how to stop spinning our wheels and actually grow this thing already."

Feel the difference?

Why you should still use AI, even if it likes em dashes more than is socially acceptable

AI isn't the enemy, it's your collaborator, your co-writer, your overachieving intern who drank too much espresso and came back with a 1,200-word mission statement for a brunch flyer. Use it to:

- Tighten up your message
- Help with structure and flow
- Make your writing pop when you're brain-fried
- Get past blank-page syndrome without crying

Just don't let it be the only voice in the room. Think of it like autocorrect, helpful when it's right, hilarious when it's wrong and dangerous if you're not paying attention. If your message starts sounding like it belongs in a Wall Street Journal op-ed, but you're just trying to email your VA about a podcast schedule, take a step back, kill the em dashes, reclaim your weird little voice, and remember: AI doesn't replace you, it just makes you sound 12% smarter… if you supervise it like a helicopter parent at a middle school dance.

Now go forth, edit like a human, delete like a savage and send with swagger. (And please, for the love of all things analog, remove the em dashes.)

Starr Hall is an entrepreneur, veteran publicist and marketer.


Anthropic launches Claude for Chrome in limited beta, but prompt injection attacks remain a major concern

Anthropic has begun testing a Chrome browser extension that allows its Claude AI assistant to take control of users' web browsers, marking the company's entry into an increasingly crowded and potentially risky arena where artificial intelligence systems can directly manipulate computer interfaces.

The San Francisco-based AI company announced Tuesday that it would pilot "Claude for Chrome" with 1,000 trusted users on its premium Max plan, positioning the limited rollout as a research preview designed to address significant security vulnerabilities before wider deployment. The cautious approach contrasts sharply with more aggressive moves by competitors OpenAI and Microsoft, who have already released similar computer-controlling AI systems to broader user bases.

The announcement underscores how quickly the AI industry has shifted from developing chatbots that simply respond to questions toward creating "agentic" systems capable of autonomously completing complex, multi-step tasks across software applications. This evolution represents what many experts consider the next frontier in artificial intelligence — and potentially one of the most lucrative, as companies race to automate everything from expense reports to vacation planning.

How AI agents can control your browser but hidden malicious code poses serious security threats

Claude for Chrome allows users to instruct the AI to perform actions on their behalf within web browsers, such as scheduling meetings by checking calendars and cross-referencing restaurant availability, or managing email inboxes and handling routine administrative tasks. The system can see what's displayed on screen, click buttons, fill out forms, and navigate between websites — essentially mimicking how humans interact with web-based software.

"We view browser-using AI as inevitable: so much work happens in browsers that giving Claude the ability to see what you're looking at, click buttons, and fill forms will make it substantially more useful," Anthropic stated in its announcement.

However, the company's internal testing revealed concerning security vulnerabilities that highlight the double-edged nature of giving AI systems direct control over user interfaces. In adversarial testing, Anthropic found that malicious actors could embed hidden instructions in websites, emails, or documents to trick AI systems into harmful actions without users' knowledge — a technique called prompt injection. Without safety mitigations, these attacks succeeded 23.6% of the time when deliberately targeting the browser-using AI.

In one example, a malicious email masquerading as a security directive instructed Claude to delete the user's emails "for mailbox hygiene," which the AI obediently executed without confirmation. "This isn't speculation: we've run 'red-teaming' experiments to test Claude for Chrome and, without mitigations, we've found some concerning results," the company acknowledged.

OpenAI and Microsoft rush to market while Anthropic takes measured approach to computer-control technology

Anthropic's measured approach comes as competitors have moved more aggressively into the computer-control space.
OpenAI launched its "Operator" agent in January, making it available to all users of its $200-per-month ChatGPT Pro service. Powered by a new "Computer-Using Agent" model, Operator can perform tasks like booking concert tickets, ordering groceries, and planning travel itineraries.

Microsoft followed in April with computer use capabilities integrated into its Copilot Studio platform, targeting enterprise customers with UI automation tools that can interact with both web applications and desktop software. The company positioned its offering as a next-generation replacement for traditional robotic process automation (RPA) systems.

The competitive dynamics reflect broader tensions in the AI industry, where companies must balance the pressure to ship cutting-edge capabilities against the risks of deploying insufficiently tested technology. OpenAI's more aggressive timeline has allowed it to capture early market share, while Anthropic's cautious approach may limit its competitive position but could prove advantageous if safety concerns materialize. "Browser-using agents powered by frontier models are already emerging, making this work especially urgent," Anthropic noted, suggesting the company feels compelled to enter the market despite unresolved safety issues.

Why computer-controlling AI could revolutionize enterprise automation and replace expensive workflow software

The emergence of computer-controlling AI systems could fundamentally reshape how businesses approach automation and workflow management. Current enterprise automation typically requires expensive custom integrations or specialized robotic process automation software that breaks when applications change their interfaces. Computer-use agents promise to democratize automation by working with any software that has a graphical user interface, potentially automating tasks across the vast ecosystem of business applications that lack formal APIs or integration capabilities.

Salesforce researchers recently demonstrated this potential with their CoAct-1 system, which combines traditional point-and-click automation with code generation capabilities. The hybrid approach achieved a 60.76% success rate on complex computer tasks while requiring significantly fewer steps than pure GUI-based agents, suggesting substantial efficiency gains are possible. "For enterprise leaders, the key lies in automating complex, multi-tool processes where full API access is a luxury, not a guarantee," explained Ran Xu, Director of Applied AI Research at Salesforce, pointing to customer support workflows that span multiple proprietary systems as prime use cases.

University researchers release free alternative to Big Tech's proprietary computer-use AI systems

The dominance of proprietary systems from major tech companies has prompted academic researchers to develop open alternatives. The University of Hong Kong recently released OpenCUA, an open-source framework for training computer-use agents that rivals the performance of proprietary models from OpenAI and Anthropic. The OpenCUA system, trained on over 22,600 human task demonstrations across Windows, macOS, and Ubuntu, achieved state-of-the-art results among open-source models and performed competitively with leading commercial systems. This development could accelerate adoption by enterprises hesitant to rely on closed systems for critical automation workflows.
Anthropic's safety testing reveals AI agents can be tricked into deleting files and stealing data

Anthropic has implemented several layers of protection for Claude for Chrome, including site-level permissions that allow users to control which websites the AI can access, mandatory confirmations before high-risk actions like making purchases or sharing personal data,
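In pattern terms, those safeguards amount to a policy check in front of every action the agent proposes: is the site allowed, and does the action require explicit user sign-off? The sketch below is a toy illustration of that gating pattern, not Anthropic's implementation; the site names and action labels are invented:

```python
ALLOWED_SITES = {"calendar.example.com", "mail.example.com"}          # user-granted site permissions
HIGH_RISK = {"delete_email", "make_purchase", "share_personal_data"}  # actions needing confirmation

def gate_action(site: str, action: str, ask_user) -> bool:
    """Return True only if the agent may perform `action` on `site`."""
    if site not in ALLOWED_SITES:
        return False                                       # site-level permission check
    if action in HIGH_RISK:
        return ask_user(f"Allow '{action}' on {site}?")    # mandatory user confirmation
    return True

# A prompt-injected "delete all emails for mailbox hygiene" request now requires explicit consent.
approved = gate_action("mail.example.com", "delete_email", ask_user=lambda question: False)
print(approved)  # False
```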


Gemini expands image editing for enterprises: Consistency, collaboration, and control at scale

Google released Gemini 2.5 Flash Image, a new model that many beta users knew as nanobanana, which gives enterprises more choice for creative projects. It enables them to change the look of images they need quickly and with more control than previous models offered. The model will be integrated into the Gemini app.

The model, built on top of Gemini 2.5 Flash, enhances the native image editing capabilities of the Gemini app. Gemini 2.5 Flash Image maintains character likenesses across different images and offers greater consistency when editing pictures. If a user uploads a photo of their pet and then asks the model to change the background or add a hat to their dog, Gemini 2.5 Flash Image will do that without altering the subject of the picture.

"We know that when editing pictures of yourself or people you know well, subtle flaws matter, a depiction that's 'close but not quite the same' doesn't feel right," Google said in a blog post written by Gemini Apps multimodal generation lead David Sharon and Google DeepMind Gemini image product lead Nicole Brichtova. "That's why our latest update is designed to make photos of your friends, family and even your pets look consistently like themselves."

One complaint from enterprises and some individual users is that when prompting edits on AI-generated images, even slight tweaks alter the photo too much. For example, someone may instruct the model to move a person's position in the picture, and while the model does what it's told, the person's face is altered slightly.

All images generated on Gemini will include Google's SynthID watermark. The model is available to all paid and free users of the Gemini app.

Speculation that Google planned to release a new image model ran rampant on social media platforms. Users on LM Arena saw a mysterious new model called nanobanana that followed "complex, multistep instructions with impressive accuracy," as Andreessen Horowitz partner Justine Moore put it in a post. People soon noticed that the nanobanana model seemed to come from Google before several early testers confirmed it, though at the time Google did not confirm its plans for the model on LM Arena.
1/11 pic.twitter.com/M8WCf7JFNT — Deedy (@deedydas) August 23, 2025 Until this week, speculation about when the model would be released continued, which is prophetic in a way. Much of the excitement stems from the competition between model providers to offer more capable and realistic images and edits, highlighting the power of multimodal models.  However, Google still needs to fight off rivals like Qwen and its recently released Qwen-Image Edit and OpenAI, which added native AI image editing to ChatGPT and also made the model available as an API.  Of course, Adobe, long considered one of the leaders in the image editing space, added its flagship model Firefly to Photoshop and its other photo editing platforms.  Native image editing  Gemini added native AI image editing on Gemini in March, which it offered to free users of the chat platform.  Bringing image editing features directly into the chat platform would allow enterprises to fix images or graphs without moving windows.  Users can upload a photo to Gemini and then instruct the model on the desired changes. Once they are satisfied, the new pictures can be reuploaded to Gemini and made into a video.  Other than adding a costume or a location change, Gemini 2.5 Flash Image can blend different photos, offers multi-turn editing and mix styles of one picture to another. source
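The kind of edit the article describes in the Gemini app can, in principle, also be scripted. Below is a minimal sketch assuming access through the Gemini API with the google-genai Python SDK and a "gemini-2.5-flash-image-preview" model ID; the model name, file name and prompt are assumptions to verify against Google's current documentation, not details confirmed by the announcement.

```python
# Minimal sketch: asking Gemini 2.5 Flash Image to edit a photo while keeping
# the subject consistent. Assumes the google-genai SDK ("pip install google-genai")
# and a valid API key; the model ID below is an assumption, not taken from the article.
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

pet_photo = Image.open("dog.jpg")  # hypothetical local file

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed model ID
    contents=[
        "Add a small party hat to the dog and change the background to a beach, "
        "but keep the dog's face, fur and pose exactly the same.",
        pet_photo,
    ],
)

# Responses can interleave text and image parts; save any image that comes back.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("dog_edited.png")
    elif part.text is not None:
        print(part.text)
```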

Gemini expands image editing for enterprises: Consistency, collaboration, and control at scale Read More »

Forget data labeling: Tencent’s R-Zero shows how LLMs can train themselves

A new training framework developed by researchers at Tencent AI Lab and Washington University in St. Louis enables large language models (LLMs) to improve themselves without requiring any human-labeled data. The technique, called R-Zero, uses reinforcement learning to generate its own training data from scratch, addressing one of the main bottlenecks in creating self-evolving AI systems. R-Zero works by having two independent models co-evolve by interacting with and challenging each other.

Experiments show that R-Zero substantially improves reasoning capabilities across different LLMs, which could lower the complexity and cost of training advanced AI. For enterprises, this approach could accelerate the development of specialized models for complex reasoning tasks without the massive expense of curating labeled datasets.

The challenge of self-evolving LLMs

The idea behind self-evolving LLMs is to create AI systems that can autonomously generate, refine, and learn from their own experiences. This offers a scalable path toward more intelligent and capable AI. However, a major challenge is that training these models requires large volumes of high-quality tasks and labels, which act as supervision signals for the AI to learn from. Relying on human annotators to create this data is not only costly and slow but also creates a fundamental bottleneck, effectively limiting an AI's potential capabilities to what humans can teach it.

To address this, researchers have developed label-free methods that derive reward signals directly from a model's own outputs, for example, by measuring its confidence in an answer. While these methods eliminate the need for explicit labels, they still rely on a pre-existing set of tasks, limiting their applicability in truly self-evolving scenarios. Other approaches have models generate their own tasks to learn from. However, in domains like open-ended reasoning, where there is no simple way to check for correctness (such as a code executor), ensuring the quality of this self-generated data is a significant hurdle.

How R-Zero works

R-Zero is a framework designed to train reasoning LLMs that can evolve from zero external data. The process begins with a single base model, which is split into two roles: a "Challenger" and a "Solver." These two models are optimized independently but evolve together through a continuous cycle of interaction. The Challenger's goal is to create new tasks that are just at the threshold of the Solver's current abilities, neither too easy nor impossible. The Solver, in turn, is rewarded for solving these increasingly complex tasks.

In written comments to VentureBeat, Chengsong Huang, co-author of the paper and a doctoral student at Washington University in St. Louis, explained that this dynamic is crucial because generating high-quality questions is often more complicated than finding the answers. "What we found in a practical setting is that the biggest challenge is not generating the answers… but rather generating high-quality, novel, and progressively more difficult questions," Huang said. "We believe that good teachers are far rarer than good students. The co-evolutionary dynamic automates the creation of this 'teacher,' ensuring a steady and dynamic curriculum that pushes the Solver's capabilities far beyond what a static, pre-existing dataset could achieve."
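To make the Challenger's incentive concrete, here is a small illustrative sketch of an "at the threshold" reward: questions on which the Solver agrees with itself only about half the time score highest. The sampler callable, the sample count and the exact reward shape are assumptions for illustration, not necessarily the formula used in the R-Zero paper.

```python
# Illustrative sketch of a Challenger-style reward. Without ground-truth labels,
# the Solver's self-consistency (how often its sampled answers agree with the
# majority answer) stands in for accuracy; questions near 50% consistency are
# treated as the most informative. The `sample_answers` callable is hypothetical.
from collections import Counter
from typing import Callable, List


def challenger_reward(
    question: str,
    sample_answers: Callable[[str, int], List[str]],
    n_samples: int = 8,
) -> float:
    answers = sample_answers(question, n_samples)
    _, votes = Counter(answers).most_common(1)[0]  # majority answer's vote count
    consistency = votes / n_samples
    # Peaks at 1.0 when the Solver agrees with itself exactly half the time
    # ("neither too easy nor impossible"); drops to 0.0 when every sample matches.
    return 1.0 - 2.0 * abs(consistency - 0.5)


# Toy example: a Solver that answers "42" five times out of eight.
toy_sampler = lambda q, n: ["42", "42", "41", "42", "40", "42", "39", "42"][:n]
print(challenger_reward("What is 6 * 7?", toy_sampler))  # 5/8 consistency -> reward 0.75
```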
Once the Challenger has generated enough questions, they are filtered for diversity and compiled into a training dataset. In the Solver's training phase, the model is fine-tuned on these challenging questions, with the "correct" answer for each question determined by a majority vote over the Solver's own previous attempts. This entire process repeats, creating a self-improving loop that operates without any human intervention and allows the two models to push each other to become progressively more capable with each iteration.

R-Zero in action

The researchers tested R-Zero on several open-source LLMs, including models from the Qwen3 and OctoThinker families. They first trained the models on math problems and then tested whether the learned reasoning skills generalized to other complex, general-domain benchmarks such as MMLU-Pro (multi-task language understanding and reasoning) and SuperGPQA (science and reasoning). The results showed that R-Zero is a highly effective, model-agnostic framework. For instance, it boosted the Qwen3-4B-Base model's score by +6.49 on average across math reasoning benchmarks. The training process consistently and substantially improved performance, with gains accumulating over several iterations; the larger Qwen3-8B-Base model saw its average math score climb by +5.51 points after three iterations.

A key finding was the immediate performance leap after the first iteration, which validated the effectiveness of the Challenger's role in creating a high-quality learning curriculum. "This confirms that the intelligent curriculum generated by the RL-trained Challenger is significantly more effective than that of a non-trained generator," the researchers write in their paper. Notably, the skills learned from math problems transferred effectively to general reasoning tasks, enhancing the models' underlying capabilities. For example, the same Qwen3-4B-Base model showed an improvement of +7.54 on general-domain reasoning benchmarks. Another interesting finding is that R-Zero can serve as a decisive pre-training step: models first improved by R-Zero achieved even higher performance when later fine-tuned on traditional labeled data, suggesting the framework acts as a performance amplifier.

For enterprises, the "from zero data" approach could be a game-changer, especially in niche domains where high-quality data is scarce or non-existent. Huang highlights that R-Zero's main advantage is its ability to sidestep the most expensive and time-consuming part of AI development: data curation. "Our approach entirely bypasses the fundamental bottleneck of having to find, label, and curate high-quality datasets," he said. "This is not just about a cost-saving measure; it's a pathway toward creating AI that can surpass human capabilities, because it is no longer limited by the scope of human knowledge or data."

However, the co-evolutionary process also revealed a critical challenge: as the Challenger generates progressively more difficult problems, the Solver's ability to produce reliable "correct" answers via majority voting can decline, which risks feeding noisier pseudo-labels back into training.
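That challenge is easiest to see in the pseudo-labeling step itself. Below is a minimal sketch of majority-vote labeling, assuming answers have already been sampled from the Solver; the agreement ratio it reports is one simple way to spot questions whose pseudo-labels are becoming unreliable. The 0.5 cutoff and the toy data are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: majority-vote pseudo-labeling with a simple reliability check.
# High agreement suggests the pseudo-label is trustworthy; scattered votes suggest
# the "correct" answer is essentially a guess and the question should be skipped.
from collections import Counter
from typing import List, Tuple


def majority_vote_label(sampled_answers: List[str]) -> Tuple[str, float]:
    label, votes = Counter(sampled_answers).most_common(1)[0]
    return label, votes / len(sampled_answers)


# Toy batch: one easy question the Solver answers consistently, one hard question
# where its answers scatter and the pseudo-label would add noise to training.
batches = {
    "easy question": ["7", "7", "7", "7", "7", "7", "7", "7"],
    "hard question": ["12", "15", "12", "9", "31", "12", "8", "44"],
}
for question, answers in batches.items():
    label, agreement = majority_vote_label(answers)
    if agreement >= 0.5:  # arbitrary illustrative cutoff
        print(f"train on {question!r} with pseudo-label {label!r} ({agreement:.0%} agreement)")
    else:
        print(f"skip {question!r}: pseudo-label {label!r} too noisy ({agreement:.0%} agreement)")
```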

Forget data labeling: Tencent’s R-Zero shows how LLMs can train themselves Read More »