VentureBeat

Genspark’s Super Agent ups the ante in the general AI agent race

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More The general-purpose AI agent landscape is suddenly much more crowded and ambitious. This week, Palo Alto-based startup Genspark released what it calls Super Agent, a fast-moving autonomous system designed to handle real-world tasks across a wide range of domains – including some that raise eyebrows, like making phone calls to restaurants using a realistic synthetic voice. The launch adds fuel to what’s shaping up to be an important new front in the AI competition: Who will build the first reliable, flexible and truly useful general-purpose agent? Perhaps more urgently, what does that mean for enterprises? Genspark’s launch of Super Agent comes just three weeks after a different Chinese-founded startup, Manus, gained attention for its ability to coordinate tools and data sources to complete asynchronous cloud tasks like travel booking, resume screening and stock analysis – all without the hand-holding typical of most current agents. Genspark now claims to go even further. According to co-founder Eric Jing, Super Agent is built on three pillars: a concert of nine different LLMs, more than 80 tools and over 10 proprietary datasets – all working together in a coordinated flow. It moves well beyond traditional chatbots, handling complex workflows and returning fully executed outcomes. In a demo, Genspark’s agent planned a complete five-day San Diego trip, calculated walking distances between attractions, mapped public transit options and then used a voice-calling agent to book restaurants, including handling food allergies and seating preferences. Another demo showed the agent creating a cooking video reel by generating recipe steps, video scenes and audio overlays. In a third, it wrote and produced a South Park-style animated episode, riffing on the recent Signalgate political scandal involving sharing war plans with a political reporter. These may sound consumer-focused, but they showcase where the tech is headed – toward multi-modal, multi-step task automation that blurs the line between creative generation and execution. “Solving these real-world problems is much harder than we thought,” Jing says in the video, “but we’re excited about the progress we’ve made.” One compelling feature: Super Agent clearly visualizes its thought process, tracing how it reasons through each step, which tools it invokes and why. Watching that logic play out in real time makes the system feel less like a black box and more like a collaborative partner. It could also inspire enterprise developers to build similar traceable reasoning paths into their own AI systems, making applications more transparent and trustworthy. Super Agent was also impressively easy to try. The interface launched smoothly in a browser with no technical setup required. Genspark lets users begin testing without requiring personal credentials. In contrast, Manus still requires applicants to join a waitlist and disclose social accounts and other private information, adding friction to experimentation. We first wrote about Genspark back in November, when it launched Claude-powered financial reports. It has raised at least $160 million across two rounds, and is backed by U.S and Singapore based investors. Watch the latest video discussion between AI agent developer Sam Witteveen and me here for a deeper dive into how Genspark’s approach compares to other agent frameworks and why it matters for enterprise AI teams. 
How is Genspark pulling this off? Genspark’s approach stands out because it navigates a long-standing AI engineering challenge: tool orchestration at scale. Most current agents break down when juggling more than a handful of external APIs or tools. Genspark’s Super Agent appears to manage this better, likely by using model routing and retrieval-based selection to choose tools and sub-models dynamically based on the task. This strategy echoes the emerging research around CoTools, a new framework from Soochow University in China that enhances how LLMs use extensive and evolving toolsets. Unlike older approaches that rely heavily on prompt engineering or rigid fine-tuning, CoTools keeps the base model “frozen” while training smaller components to judge, retrieve, and call tools efficiently. Another enabler is the Model Context Protocol (MCP), a lesser-known but increasingly adopted standard that allows agents to carry richer tool and memory contexts across steps. Combined with Genspark’s proprietary datasets, MCP may be one reason their agent appears more “steerable” than alternatives. How does this compare to Manus? Genspark isn’t the first startup to promote general agents. Manus, launched last month by the China-based company Monica, made waves with its multi-agent system, which autonomously runs tools like a web browser, code editor or spreadsheet engine to complete multi-step tasks. Manus’s efficient integration of open-source parts, including web tools and LLMs like Claude from Anthropic, was surprising. Despite not building a proprietary model stack, it still outperformed OpenAI on the GAIA benchmark — a synthetic test designed to evaluate real-world task automation by agents. Genspark, however, claims to have leapfrogged Manus, scoring 87.8% on GAIA—ahead of Manus’s reported 86%—and doing so with an architecture that includes proprietary components and more extensive tool coverage. The big tech players: Still playing it safe? Meanwhile, the largest U.S.-based AI companies have been cautious. Microsoft’s main AI agent offering, Copilot Studio, focuses on fine-tuned vertical agents that align closely with enterprise apps like Excel and Outlook. OpenAI’s Agent SDK provides building blocks but stops short of shipping its own full-featured, general-purpose agent. Amazon’s recently announced Nova Act takes a developer-first approach, offering atomic browser-based actions via SDK but tightly tied to its Nova LLM and cloud infrastructure. These approaches are more modular, more secure and clearly targeted toward enterprise use. But they lack the ambition—or autonomy—shown in Genspark’s demo. One reason may be risk aversion. The reputational cost could be high if a general agent from Google or Microsoft books the wrong flight or says something odd on a voice call. These companies are also locked into their own model ecosystems, limiting their flexibility to experiment with multi-model orchestration. Startups like Genspark, by contrast, have the freedom to mix and match LLMs – and to move fast. Should enterprises care? That’s the strategic question. Most enterprises don’t need a
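To make the orchestration pattern above concrete, here is a minimal, hypothetical sketch of retrieval-based tool selection: rank a large tool catalog against the incoming task and hand only the top matches to the planner model. The tool catalog and task below are invented for illustration, and nothing here reflects Genspark’s actual implementation.

```python
# Minimal sketch of retrieval-based tool selection: score a large tool catalog
# against the incoming task and expose only the top-k tools to the planner LLM.
# Illustrative only; tool names and descriptions are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TOOLS = {
    "flight_search": "search and compare airline flights by date, price and route",
    "transit_planner": "map public transit options and walking distances between attractions",
    "restaurant_caller": "place a voice call to book a restaurant table, noting allergies and seating",
    "video_generator": "generate short video scenes with audio overlays from a script",
    "stock_screener": "screen equities by fundamentals and recent price history",
}

def select_tools(task: str, top_k: int = 3) -> list[str]:
    """Return the names of the tools whose descriptions best match the task."""
    names = list(TOOLS)
    tfidf = TfidfVectorizer().fit_transform([TOOLS[n] for n in names] + [task])
    task_vec, tool_vecs = tfidf[len(names)], tfidf[: len(names)]
    scores = cosine_similarity(task_vec, tool_vecs).ravel()
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

if __name__ == "__main__":
    print(select_tools("Plan a five-day San Diego trip and book dinners by phone"))
```

A production agent would swap the TF-IDF scorer for learned embeddings and feed the shortlisted tool schemas into the planner model’s context, but the narrowing step itself is what keeps an 80-plus tool catalog manageable.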


Beyond RAG: How Articul8’s supply chain models achieve 92% accuracy where general AI fails

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More In the race to implement AI across business operations, many enterprises are discovering that general-purpose models often struggle with specialized industrial tasks that require deep domain knowledge and sequential reasoning. While fine-tuning and Retrieval Augmented Generation (RAG) can help, that’s often not enough for complex use cases like supply chain. It’s a challenge that startup Articul8 is looking to solve. Today, the company debuted a series of domain-specific AI models for manufacturing supply chains called A8-SupplyChain. The new models are accompanied by Articul8’s ModelMesh, which is an agentic AI-powered dynamic orchestration layer that makes real-time decisions about which AI models to use for specific tasks. Articul8 claims that its models achieve 92% accuracy on industrial workflows, outperforming general-purpose AI models on complex sequential reasoning tasks. Articul8 started as an internal development team inside Intel and was spun out as an independent business in 2024. The technology emerged from work at Intel, where the team built and deployed multimodal AI models for clients, including Boston Consulting Group, before ChatGPT had even launched. The company was built on a core philosophy that runs counter to much of the current market approach to enterprise AI. “We are built on the core belief that no single model is going to get you to enterprise outcomes, you really need a combination of models,” Arun Subramaniyan, CEO and founder of Articul8 told VentureBeat in an exclusive interview. “You need domain-specific models to actually go after complex use cases in regulated industries such as aerospace, defense, manufacturing, semiconductors or supply chain.” The supply chain AI challenge: When sequence and context determine success or failure Manufacturing and industrial supply chains present unique AI challenges that general-purpose models struggle to handle effectively. These environments involve complex multi-step processes where the sequence, branching logic and interdependencies between steps are mission-critical. “In the world of supply chain, the core underlying principle is everything is a bunch of steps,” Subramaniyan explained. “Everything is a bunch of related steps, and the steps sometimes have connections and they sometimes have recursions.” For example, say a user is trying to assemble a jet engine, there are often multiple manuals. Each of the manuals has at least a few hundred, if not a few thousand, steps that need to be followed in sequence. These documents aren’t just static information—they’re effectively time series data representing sequential processes that must be precisely followed. Subramaniyan argued that general AI models, even when augmented with retrieval techniques, often fail to grasp these temporal relationships. This type of complex reasoning—tracing backwards through a procedure to identify where an error occurred—represents a fundamental challenge that general models haven’t been built to handle. ModelMesh: A dynamic intelligence layer, not just another orchestrator At the heart of Articul8’s technology is ModelMesh, which goes beyond typical model orchestration frameworks to create what the company describes as “an agent of agents” for industrial applications. 
“ModelMesh is actually an intelligence layer that connects and continues to decide and rate things as they go past like one step at a time,” Subramaniyan explained. “It’s something that we had to build completely from scratch, because none of the tools out there actually come anywhere close to doing what we have to do, which is making hundreds, sometimes even thousands, of decisions at runtime.” Unlike existing frameworks like LangChain or LlamaIndex that provide predefined workflows, ModelMesh combines Bayesian systems with specialized language models to dynamically determine whether outputs are correct, what actions to take next and how to maintain consistency across complex industrial processes. This architecture enables what Articul8 describes as industrial-grade agentic AI—systems that can not only reason about industrial processes but actively drive them. Beyond RAG: A ground-up approach to industrial intelligence While many enterprise AI implementations rely on retrieval-augmented generation (RAG) to connect general models to corporate data, Articul8 takes a different approach to building domain expertise. “We actually take the underlying data and break them down into their constituent elements,” Subramaniyan explained. “We break down a PDF into text, images and tables. If it’s audio or video, we break that down into its underlying constituent elements, and then we describe those elements using a combination of different models.” The company starts with Llama 3.2 as a foundation, chosen primarily for its permissive licensing, but then transforms it through a sophisticated multi-stage process. This multi-layered approach allows their models to develop a much richer understanding of industrial processes than simply retrieving relevant chunks of data. The SupplyChain models undergo multiple stages of refinement designed specifically for industrial contexts. For well-defined tasks, they use supervised fine-tuning. For more complex scenarios requiring expert knowledge, they implement feedback loops where domain experts evaluate responses and provide corrections. How enterprises are using Articul8 While it’s still early for the new models, the company already claims a number of  customers and partners including  iBase-t, Itochu Techno-Solutions Corporation, Accenture and Intel. Like many organizations, Intel started its gen AI journey by evaluating general-purpose models to explore how they could support design and manufacturing operations.  “While these models are impressive in open-ended tasks, we quickly discovered their limitations when applied to our highly specialized semiconductor environment,” Srinivas Lingam, corporate vice president and general manager of the network, edge and AI Group at Intel, told VentureBeat. “They struggled with interpreting semiconductor-specific terminology, understanding context from equipment logs, or reasoning through complex, multi-variable downtime scenarios.” Intel is deploying Articul8’s platform to build what Lingam called – Manufacturing Incident Assistant – an intelligent, natural language-based system that helps engineers and technicians diagnose and resolve equipment downtime events in Intel’s fabs. He explained that the platform and domain-specific models ingest both historical and real-time manufacturing data, including structured logs, unstructured wiki articles and internal knowledge repositories. It helps Intel’s teams perform root cause analysis (RCA), recommends corrective actions and even automates parts of work order generation. 
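To give a rough sense of how a runtime router of this kind can work, the sketch below picks a candidate model for each step, scores its output and falls back to another model when the score is too low. The model names, scoring heuristic and threshold are placeholders invented for illustration; ModelMesh itself is proprietary and far more sophisticated.

```python
# Conceptual sketch of a dynamic orchestration layer: per step, pick a candidate
# model, judge its output, and retry with another model if the judgment is weak.
# All names and the scoring rule are invented placeholders, not Articul8's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    name: str                      # e.g. a domain-specific vs. a general model
    run: Callable[[str], str]      # callable that produces an answer for a step
    domain_fit: float              # prior belief that this model suits the domain

def judge(step: str, answer: str) -> float:
    """Placeholder output check; a real system might combine a Bayesian scorer
    with a small judge model instead of this keyword heuristic."""
    return 1.0 if answer and "step" in answer.lower() else 0.3

def run_step(step: str, candidates: list[Candidate], threshold: float = 0.8) -> str:
    # Try models in order of prior domain fit until one clears the judge threshold.
    for cand in sorted(candidates, key=lambda c: c.domain_fit, reverse=True):
        answer = cand.run(step)
        if cand.domain_fit * judge(step, answer) >= threshold:
            return answer
    raise RuntimeError(f"No candidate produced an acceptable answer for: {step}")
```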
What this means for enterprise AI strategy Articul8’s approach challenges the assumption


Vibe coding at enterprise scale: AI tools now tackle the full development lifecycle

The vibe coding phenomenon—where developers increasingly rely on AI to generate and assist with code—has rapidly evolved from a niche concept to a mainstream development approach. With tools like GitHub Copilot normalizing AI-assisted coding, the next battleground has shifted from code generation to end-to-end development workflows. In this increasingly crowded landscape, players like Cursor, Lovable, Bolt and Windsurf (formerly Codeium) have each staked their claim with various approaches to AI-assisted development. The term vibe coding itself represents a cultural shift in which developers focus more on intent and outcome than on manual implementation details—a trend that has both enthusiastic advocates and skeptical critics. Vibe coding spans everything from AI-powered code completion to generating entire applications from a few prompts, and it diverges from low-code/no-code platforms by going beyond visual tools for simple business applications. According to some advocates, vibe coding promises to augment or even potentially replace professional software developers. In this competitive field, Windsurf’s latest Wave 6 release, which debuted on April 2, addresses a gap that many tools have ignored: deployment. While code generation has become increasingly sophisticated across platforms, the journey from locally generated code to production deployment has remained stubbornly manual. “We’ve really removed a lot of the friction involved with iterating and deploying applications,” Anshul Ramachandran, head of product and strategy at Windsurf, told VentureBeat. “The promise of AI and all these agentic systems is that the activation energy, the barrier to building, is so much lower.” Windsurf Wave 6 feature breakdown: What enterprises need to know Looking specifically at the new features in Windsurf Wave 6, several enterprise capabilities address workflow bottlenecks: Deploys: A one-click solution to package and share Windsurf-built apps on the public internet. Currently integrated with Netlify, allowing users to deploy websites or JavaScript web apps to a public domain. Improved Performance for Long Conversations: Reduced quality degradation in extended conversations through checkpointing and summarization techniques. Tab Improvements: Enhanced context awareness, including user search history and support for Jupyter Notebooks within the Windsurf Editor. Conversation Table of Contents: A UX improvement that provides easy access to past messages and conversation reversion capabilities. Conversation management: Technical innovation that matters The Conversation Table of Contents feature in Wave 6 is particularly interesting. It addresses a technical challenge that some competitors have overlooked: efficiently managing extended interactions with AI assistants when errors or misunderstandings occur. “AI is not perfect. It will occasionally make mistakes,” Ramachandran acknowledges. “You’d often find yourself in this kind of loop where people try to prompt the AI to get out of a bad state.
In reality, instead of doing that, you should probably just revert the state of your conversation to the last point where things were going well, and then try a different prompt or direction.” The technical implementation creates a structured navigation system that changes how developers interact with AI assistants: Each significant interaction is automatically indexed within the conversation. A navigable sidebar allows immediate access to previous states. One-click reversion restores previous conversation states. The system preserves context while eliminating the inefficiency of repeatedly prompting an AI to correct itself. Getting the ‘vibe’ of the vibe coding landscape The Windsurf Wave 6 release has received some positive feedback in the short time it has been available. “Builders: you still using Cursor or have you switched to Windsurf? I’m hearing more and more developers are switching,” Robert Scoble (@Scobleizer) posted on X on April 2, 2025. It’s a very active space, though, with fierce competition. Just last week, Replit Agent v2 became generally available. Replit Agent v2 benefits from Anthropic’s Claude 3.7 Sonnet, arguably the most powerful LLM for coding tasks. The new Replit Agent also integrates: Enhanced Autonomy: Forms hypotheses, searches for relevant files and makes changes only when sufficiently informed. Better Problem-Solving: Less likely to get stuck in loops; can step back to rethink approaches. Realtime App Design Preview: Industry-first feature showing live interfaces as the Agent builds. Improved UI Creation: Excels at creating high-quality interfaces with earlier design previews. Guided Ideation: Recommends potential next steps throughout the development process. Cursor is also highly active and offers a steady pace of incremental updates. Recent additions include chat tabs, which enable developers to have multiple conversations with the AI tool at the same time. On March 28, Cursor added support for the new Google Gemini 2.5 Pro model as an option for its users. Bolt also released a new update on March 28, along with a new mobile release in beta. At the end of February, Bolt AI v1.33 was released, adding full support for Claude 3.7 and prompt caching capabilities. Though not always included in the vibe coding spectrum, Cognition Labs released Devin 2.0 this week. Much like the tabbed feature in Windsurf Wave 6, Devin now has the ability to run multiple AI agents simultaneously on different tasks. It also now integrates interactive planning that helps scope and plan tasks from broad ideas. Devin 2.0 also integrates a novel search tool to better navigate and understand codebases. The evolution of developer roles, not their replacement The vibe coding movement has sparked debates about whether traditional programming skills remain relevant. Windsurf takes a distinctly pragmatic position that should reassure enterprise leaders concerned about the implications for their development teams. “Vibe coding has been used to refer to the new class of developers that are being created,” Ramachandran explains. “People separating the ‘vibe coders’ and the ‘non-vibe coders’—it’s just a new class of people that can now write code, who might not have been able to before, which is great,” Ramachandran said.
“This is how software has expanded over time, we make it easier to write software so more people can write software.” Much like how low-code and no-code tools never fully replaced enterprise application developers in the pre-AI era, it’s not likely that vibe coding will entirely replace all developers.
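Returning to the Wave 6 conversation-management feature described earlier, the sketch below shows a minimal version of checkpointing and one-click reversion: index significant turns, then restore the conversation to a saved point instead of prompting the AI to correct itself. It illustrates the general pattern, not Windsurf’s implementation.

```python
# Minimal sketch of conversation checkpointing and one-click reversion, the
# pattern behind features like a conversation table of contents.
# Illustrative only; not Windsurf's actual implementation.
from dataclasses import dataclass, field

@dataclass
class Turn:
    prompt: str
    response: str

@dataclass
class Conversation:
    turns: list[Turn] = field(default_factory=list)
    checkpoints: dict[str, int] = field(default_factory=dict)  # label -> turn count

    def add_turn(self, prompt: str, response: str, label: str | None = None) -> None:
        """Record an interaction and optionally index it for the sidebar."""
        self.turns.append(Turn(prompt, response))
        if label:
            self.checkpoints[label] = len(self.turns)

    def revert(self, label: str) -> None:
        """Restore the conversation to the state saved under `label`, discarding
        everything after it instead of asking the AI to dig itself out."""
        self.turns = self.turns[: self.checkpoints[label]]
        self.checkpoints = {k: v for k, v in self.checkpoints.items() if v <= len(self.turns)}

conv = Conversation()
conv.add_turn("Scaffold a Flask app", "Created app.py ...", label="scaffold")
conv.add_turn("Add auth", "Broke the routing ...")  # conversation enters a bad state
conv.revert("scaffold")                             # back to the last good point
```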


Nvidia’s new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size

Even as Meta fends off questions and criticisms of its new Llama 4 model family, graphics processing unit (GPU) master Nvidia has released a new, fully open source large language model (LLM) based on Meta’s older Llama-3.1-405B-Instruct model, and it’s claiming near-top performance on a variety of third-party benchmarks — outperforming the vaunted rival DeepSeek R1 open source reasoning model. Llama-3.1-Nemotron-Ultra-253B-v1 is a dense 253-billion-parameter model designed to support advanced reasoning, instruction following, and AI assistant workflows. It was first mentioned back at Nvidia’s annual GPU Technology Conference (GTC) in March. The release reflects Nvidia’s continued focus on performance optimization through architectural innovation and targeted post-training. Announced last night, April 7, 2025, the model is now publicly available on Hugging Face, with open code, weights and post-training data. It is designed to operate efficiently in both “reasoning on” and “reasoning off” modes, allowing developers to toggle between high-complexity reasoning tasks and more straightforward outputs based on system prompts. Designed for efficient inference The Llama-3.1-Nemotron-Ultra-253B builds on Nvidia’s previous work in inference-optimized LLM development. Its architecture—customized through a Neural Architecture Search (NAS) process—introduces structural variations such as skipped attention layers, fused feedforward networks (FFNs), and variable FFN compression ratios. This architectural overhaul reduces memory footprint and computational demands without severely impacting output quality, enabling deployment on a single 8x H100 GPU node. The result, according to Nvidia, is a model that offers strong performance while being more cost-effective to deploy in data center environments. Additional hardware compatibility includes support for Nvidia’s B100 and Hopper microarchitectures, with configurations validated in both BF16 and FP8 precision modes. Post-training for reasoning and alignment Nvidia enhanced the base model through a multi-phase post-training pipeline. This included supervised fine-tuning across domains such as math, code generation, chat, and tool use, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to further boost instruction-following and reasoning performance. The model underwent a knowledge distillation phase over 65 billion tokens, followed by continual pretraining on an additional 88 billion tokens. Training datasets included sources like FineWeb, Buzz-V1.2, and Dolma. Post-training prompts and responses were drawn from a combination of public corpora and synthetic generation methods, including datasets that taught the model to differentiate between its reasoning modes. Improved performance across numerous domains and benchmarks Evaluation results show notable gains when the model operates in reasoning-enabled mode. For instance, on the MATH500 benchmark, performance increased from 80.40% in standard mode to 97.00% with reasoning enabled. Similarly, results on the AIME25 benchmark rose from 16.67% to 72.50%, and LiveCodeBench scores more than doubled, jumping from 29.03% to 66.31%. Performance gains were also observed in tool-based tasks like BFCL V2 and function composition, as well as in general question answering (GPQA), where the model scored 76.01% in reasoning mode versus 56.60% without.
These benchmarks were conducted with a maximum sequence length of 32,000 tokens, and each test was repeated up to 16 times to ensure accuracy. Compared to DeepSeek R1, a state-of-the-art MoE model with 671 billion total parameters, Llama-3.1-Nemotron-Ultra-253B shows competitive results despite having less than half the number of parameters (model settings) — outperforming in tasks like GPQA (76.01 vs. 71.5), IFEval instruction following (89.45 vs. 83.3), and LiveCodeBench coding tasks (66.31 vs. 65.9). Meanwhile, DeepSeek R1 holds a clear advantage on certain math evaluations, particularly AIME25 (79.8 vs. 72.50), and slightly edges out MATH500 (97.3 vs. 97.00). These results suggest that despite being a dense model, Nvidia’s offering matches or exceeds MoE alternatives on reasoning and general instruction alignment tasks, while trailing slightly in math-heavy categories. Usage and integration The model is compatible with the Hugging Face Transformers library (version 4.48.3 recommended) and supports input and output sequences up to 128,000 tokens. Developers can control reasoning behavior via system prompts and select decoding strategies based on task requirements. For reasoning tasks, Nvidia recommends using temperature sampling (0.6) with a top-p value of 0.95. For deterministic outputs, greedy decoding is preferred. Llama-3.1-Nemotron-Ultra-253B supports multilingual applications, with capabilities in English and several additional languages, including German, French, Italian, Portuguese, Hindi, Spanish, and Thai. It is also suitable for common LLM use cases such as chatbot development, AI agent workflows, retrieval-augmented generation (RAG), and code generation. Licensed for commercial use Released under the Nvidia Open Model License and governed by the Llama 3.1 Community License Agreement, the model is ready for commercial use. Nvidia has emphasized the importance of responsible AI development, encouraging teams to evaluate the model’s alignment, safety, and bias profiles for their specific use cases. Oleksii Kuchaiev, Director of AI Model Post-Training at Nvidia, shared the announcement on X, stating that the team was excited to share the open release, describing it as a dense 253B model designed with toggle ON/OFF reasoning capabilities and released with open weights and data. source
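For developers who want to try the reasoning toggle, the hedged sketch below uses the Hugging Face Transformers pipeline with the sampling settings Nvidia recommends (temperature 0.6 and top-p 0.95 for reasoning, greedy decoding otherwise). The model ID and the exact system-prompt wording for switching modes are assumptions; confirm both against Nvidia’s model card before relying on them.

```python
# Hedged sketch: toggling Nemotron Ultra's reasoning mode via the system prompt.
# The model ID and the "detailed thinking on/off" phrasing are assumptions;
# verify against the official model card on Hugging Face.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",  # assumed Hugging Face model ID
    torch_dtype=torch.bfloat16,
    device_map="auto",  # the article notes a single 8x H100 node is the intended target
)

def ask(question: str, reasoning: bool) -> str:
    messages = [
        {"role": "system", "content": "detailed thinking on" if reasoning else "detailed thinking off"},
        {"role": "user", "content": question},
    ]
    if reasoning:
        # Nvidia's recommended sampling for reasoning tasks: temperature 0.6, top-p 0.95.
        out = pipe(messages, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)
    else:
        # Greedy decoding for deterministic outputs.
        out = pipe(messages, max_new_tokens=256, do_sample=False)
    return out[0]["generated_text"][-1]["content"]

print(ask("What is 17 * 24?", reasoning=True))
```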


Gemini 2.5 Pro is now available without limits and for cheaper than Claude, GPT-4o

Google’s Gemini 2.5 Pro, which the company calls its most intelligent model ever, quietly took the developer world by storm. After seeing strong developer interest, Google announced it would increase rate limits for Gemini 2.5 Pro and offer the model at a lower price than many of its competitors. The company did not release pricing at launch. “We’ve seen incredible developer enthusiasm and early adoption of Gemini 2.5 Pro, and we’ve been listening to your feedback,” Google said in a blog post today. “To make this powerful model available to more developers, we’re moving Gemini 2.5 Pro into public preview in the Gemini API in Google AI Studio today, with Vertex AI rolling out shortly.” Gemini 2.5 Pro is the first experimental Google model to feature higher rate limits and billing. Google said developers using Gemini 2.5 Pro in public preview, priced at $1.25 per one million input tokens, will see increased rate limits. The experimental version of the model will remain free but have lower rate limits. Heading off competitors Gemini 2.5 Pro’s pricing is competitive and significantly lower than competitors like Anthropic and OpenAI. As previously mentioned, Gemini 2.5 Pro is $1.25 per million input tokens and $10 per million output tokens. Social media users expressed surprise that Google could pull off pricing such a powerful model for so low a price, noting that it’s “about to get wild.” Anthropic offers Claude 3.7 Sonnet, a comparable model to Gemini 2.5 Pro, at $3 per million input tokens and $15 for output tokens. On its site, Anthropic says that Claude 3.7 Sonnet users can save up to 90% of the cost if they use prompt caching. OpenAI’s o1 reasoning model costs $15 per million input tokens and $60 per million output tokens. However, cached inputs cost $7.50. Its other reasoning model, o3-mini, is cheaper at $1.10 per million input tokens and $4.40 per million output tokens, but o3-mini is a smaller reasoning model. For non-reasoning models, OpenAI priced GPT-4o at $2.50 for inputs and $10 for outputs. Gemini 2.5 Pro demand Google released Gemini 2.5 Pro somewhat quietly, adding the experimental version of the model to Gemini Advanced. Since its launch a few weeks ago, several developers and users have found it compelling. VentureBeat’s Ben Dickson played with Gemini 2.5 Pro and declared it may be the “most useful reasoning model yet.” Effectively pricing reasoning models is the next big battleground for AI model developers. DeepSeek’s low cost for DeepSeek R1 caused a ruckus among enterprises. DeepSeek continues to put out models at a lower price than most of the more prominent model developers, putting even more pressure on Google, OpenAI and Anthropic to offer robust and extremely capable models at affordable prices. source
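Taking the per-million-token prices quoted above at face value, a quick back-of-the-envelope calculation shows how the differences compound at scale. The workload size below is a made-up example.

```python
# Cost comparison using the (input, output) per-million-token prices quoted above,
# in USD. The monthly workload is a hypothetical example.
PRICES = {
    "gemini-2.5-pro": (1.25, 10.00),
    "claude-3.7-sonnet": (3.00, 15.00),
    "openai-o1": (15.00, 60.00),
    "openai-o3-mini": (1.10, 4.40),
    "gpt-4o": (2.50, 10.00),
}

input_tokens, output_tokens = 50_000_000, 10_000_000  # hypothetical monthly volume

for model, (p_in, p_out) in PRICES.items():
    cost = (input_tokens / 1e6) * p_in + (output_tokens / 1e6) * p_out
    print(f"{model:>18}: ${cost:,.2f}/month")
# gemini-2.5-pro: $162.50, claude-3.7-sonnet: $300.00, openai-o1: $1,350.00, ...
```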


Meta defends Llama 4 release against ‘reports of mixed quality,’ blames bugs

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Meta’s new flagship AI language model Llama 4 came suddenly over the weekend, with the parent company of Facebook, Instagram, WhatsApp and Quest VR (among other services and products) revealing not one, not two, but three versions — all upgraded to be more powerful and performant using the popular “Mixture-of-Experts” architecture and a new training method involving fixed hyperparameters, known as MetaP. Also, all three are equipped with massive context windows — the amount of information that an AI language model can handle in one input/output exchange with a user or tool. But following the surprise announcement and public release of two of those models for download and usage — the lower-parameter Llama 4 Scout and mid-tier Llama 4 Maverick — on Saturday, the response from the AI community on social media has been less than adoring. Llama 4 sparks confusion and criticism among AI users An unverified post on the North American Chinese language community forum 1point3acres made its way over to the r/LocalLlama subreddit on Reddit alleging to be from a researcher at Meta’s GenAI organization who claimed that the model performed poorly on third-party benchmarks internally and that company leadership “suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a ‘presentable’ result.” The post was met with skepticism from the community in its authenticity, and a VentureBeat email to a Meta spokesperson has not yet received a reply. But other users found reasons to doubt the benchmarks regardless. “At this point, I highly suspect Meta bungled up something in the released weights … if not, they should lay off everyone who worked on this and then use money to acquire Nous,” commented @cto_junior on X, in reference to an independent user test showing Llama 4 Maverick’s poor performance (16%) on a benchmark known as aider polyglot, which runs a model through 225 coding tasks. That’s well below the performance of comparably sized, older models such as DeepSeek V3 and Claude 3.7 Sonnet. Referencing the 10 million-token context window Meta boasted for Llama 4 Scout, AI PhD and author Andriy Burkov wrote on X in part that: “The declared 10M context is virtual because no model was trained on prompts longer than 256k tokens. This means that if you send more than 256k tokens to it, you will get low-quality output most of the time.” Also on the r/LocalLlama subreddit, user Dr_Karminski wrote that “I’m incredibly disappointed with Llama-4,” and demonstrated its poor performance compared to DeepSeek’s non-reasoning V3 model on coding tasks such as simulating balls bouncing around a heptagon. Former Meta researcher and current AI2 (Allen Institute for Artificial Intelligence) Senior Research Scientist Nathan Lambert took to his Interconnects Substack blog on Monday to point out that a benchmark comparison posted by Meta to its own Llama download site of Llama 4 Maverick to other models, based on cost-to-performance on the third-party head-to-head comparison tool LMArena ELO aka Chatbot Arena, actually used a different version of Llama 4 Maverick than the company itself had made publicly available — one “optimized for conversationality.” As Lambert wrote: “Sneaky. 
The results below are fake, and it is a major slight to Meta’s community to not release the model they used to create their major marketing push. We’ve seen many open models that come around to maximize on ChatBotArena while destroying the model’s performance on important skills like math or code.” Lambert went on to note that while this particular model on the arena was “tanking the technical reputation of the release because its character is juvenile,” including lots of emojis and frivolous emotive dialog, “The actual model on other hosting providers is quite smart and has a reasonable tone!” In response to the torrent of criticism and accusations of benchmark cooking, Meta’s VP and Head of GenAI Ahmad Al-Dahle took to X to state: “We’re glad to start getting Llama 4 in all your hands. We’re already hearing lots of great results people are getting with these models. That said, we’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in. We’ll keep working through our bug fixes and onboarding partners. We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations. We believe the Llama 4 models are a significant advancement and we’re looking forward to working with the community to unlock their value.” Yet even that response was met with many complaints of poor performance and calls for further information, such as more technical documentation outlining the Llama 4 models and their training processes, as well as additional questions about why this release, compared to all prior Llama releases, was particularly riddled with issues. It also comes on the heels of Meta’s VP of AI research Joelle Pineau, who worked in the adjacent Meta Foundational Artificial Intelligence Research (FAIR) organization, announcing her departure from the company on LinkedIn last week with “nothing but admiration and deep gratitude for each of my managers.” Pineau, it should be noted, also promoted the release of the Llama 4 model family this weekend. Llama 4 continues to spread to other inference providers with mixed results, but it’s safe to say the initial release of the model family has not been a slam dunk with the AI community. And the upcoming Meta LlamaCon on April 29, the first celebration and gathering for third-party developers of the model family, will likely have much fodder for discussion. We’ll be tracking it all; stay tuned. source


$115 million just poured into this startup that makes engineering 1,000x faster — and Bezos, Altman, and Nvidia are all betting on its success

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Rescale, a digital engineering platform that helps companies run complex simulations and calculations in the cloud, announced today that it has raised $115 million in Series D funding to accelerate the development of AI-powered engineering tools that can dramatically speed up product design and testing. The funding round, which brings Rescale’s total capital raised to more than $260 million, included investments from Applied Ventures, Atika Capital, Foxconn, Hanwha Asset Management Deeptech Venture Fund, Hitachi Ventures, NEC Orchestrating Future Fund, Nvidia, Prosperity7, SineWave Ventures, TransLink Capital, the University of Michigan, and Y Combinator. The San Francisco-based company has drawn support from an impressive roster of early backers including Sam Altman, Jeff Bezos, Paul Graham, and Peter Thiel. This latest round aims to propel Rescale’s vision of transforming how products are designed across industries by combining high-performance computing, intelligent data management, and a new field the company calls “AI physics.” “Rescale was founded with the mission to empower engineers and scientists to accelerate innovation by running computations and simulations more efficiently,” Joris Poort, Rescale’s founder and CEO, said in an interview with VentureBeat. “That’s exactly what we’re focused on today.” From Boeing’s carbon fiber challenge to a $260 million startup The company’s origins trace back to Poort’s experience working on the Boeing 787 Dreamliner more than 20 years ago. He and his co-founder Adam McKenzie were tasked with designing the aircraft’s wing using complex physics-based simulations. “My co-founder, Adam, and I were working at Boeing, running large-scale physics simulations for the 787 Dreamliner,” Poort told VentureBeat. “It was the first fully carbon fiber commercial airplane, which posed significant engineering challenges. Most airplanes before had always been built out of aluminum, but carbon fiber has many different layers and variables that needed to be optimized.” The challenge they faced was a lack of sufficient computing resources to run the millions of calculations needed to optimize the innovative carbon fiber design. “We couldn’t get enough compute resources. This was 20 years ago, before cloud computing existed,” he recalled. “We had to bootstrap together and cobble together resources from different organizations just to run these large-scale simulations over the weekend.” This experience led directly to Rescale’s founding mission: build the platform they wished they had during those Boeing years. “Rescale was founded to build the platform we wish we had, because it took us many years to develop all these capabilities,” Poort explained. “We were really just engineers trying to design the best possible plane, but we had to become applied mathematicians and computer scientists, doing all this infrastructure work just to solve engineering problems.” How AI models are turning days of calculations into seconds Central to Rescale’s ambitions is the concept of “AI physics” — using artificial intelligence models trained on simulation data to dramatically accelerate computational engineering. While traditional physics simulations might take days to complete, AI models trained on those simulations can deliver approximate results in seconds. 
“With AI physics, you train AI models on simulation data sets, allowing you to run these simulations over 1,000 times faster,” Poort said. “The AI model provides probabilistic answers—essentially estimates—whereas traditional physics calculations are deterministic, giving you exact results.” He offered a concrete example from one of Rescale’s customers: “General Motors motorsports, they’re designing the external aerodynamics of a Formula One vehicle. They may run thousands of these sort of fluid dynamics, aerodynamic calculations. Normally, these may take, like, about three days on, say, 1000 compute cores. Now, with an AI model, they’re able to do this in like less than a second.” This thousand-fold acceleration allows engineers to explore design spaces much more rapidly, testing many more iterations and possibilities than previously feasible. “The really unique advantage of AI physics is that you can verify the answers. It’s just math,” Poort emphasized. “This is different from LLMs, where you might encounter hallucinations that are difficult to validate. Many questions don’t have definitive answers, but in physics, you have concrete, verifiable solutions.” The funding comes amid increasing enterprise investments in technologies that speed up product development. The high-performance computing market has grown to approximately $50 billion, with simulation software reaching $20 billion and product lifecycle data management about $30 billion, according to figures shared by Rescale. What differentiates Rescale is its “compute recommendation engine,” which optimizes workloads across different cloud architectures in real-time. “Our unique differentiation is our technology called the compute recommendation engine. This allows us to optimize workloads in real time across different architectures available across all public clouds,” Poort said. “We support 1,150 different applications with many versions, operating systems, and hardware architectures. When combined together, this creates more than 50 million different possible configurations.” The company’s enterprise customers, which include Arm, General Motors, Samsung, SLB (formerly Schlumberger), and the U.S. Department of Defense, collectively spend over $1 billion annually to power their virtual product development and scientific discovery environments. Beyond simulation: Data management and AI integration for modern engineering Rescale is accelerating its roadmap in three key areas. First, expanding its library of over 1,250 applications and network of more than 500 cloud datacenters. Second, establishing unified data management and digital thread capabilities for all computing workflows. Third, enabling faster engineering through AI. “We also have a product called Rescale Data, which focuses on creating an intelligent data layer,” Poort explained. “This is sometimes called the digital thread. Throughout the product lifecycle—whether you’re developing an aircraft, a car, or in life sciences, a medical device or drug—you need to track all that data. If an issue arises, you can look back to see when that data was created, what the input files were, and related information.” Applied Materials, one of the investors in this round, has been working with Rescale to enhance its simulation capabilities. Rather than simply accelerating existing processes, the partnership suggests a more profound shift in how engineering knowledge is captured and applied. The most intriguing aspect of
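The surrogate-modeling idea behind “AI physics” can be illustrated with a toy example: fit a fast approximate model on outputs from a slow solver, then query the surrogate instead of re-running the solver for every design point. The “simulation” below is a stand-in function rather than a real CFD workload, and nothing here reflects Rescale’s actual models.

```python
# Toy illustration of an "AI physics" surrogate: train a regressor on data from a
# slow solver, then get approximate answers for thousands of new design points in
# milliseconds. The solver here is a stand-in function, not real physics.
import time
import numpy as np
from sklearn.neural_network import MLPRegressor

def slow_simulation(designs: np.ndarray) -> np.ndarray:
    """Stand-in for an expensive solver; in practice each row would be a full run."""
    return np.sin(3 * designs[:, 0]) * np.exp(-designs[:, 1])

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(500, 2))   # design points already simulated
y_train = slow_simulation(X_train)

surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(X_train, y_train)

X_new = rng.uniform(0, 1, size=(10_000, 2))  # unseen candidate designs
start = time.time()
estimates = surrogate.predict(X_new)         # approximate answers, not exact solves
print(f"{len(X_new)} approximate evaluations in {time.time() - start:.3f}s")
```

The trade-off is the one Poort describes: the surrogate returns estimates rather than deterministic solutions, so promising candidates still get verified with the full solver.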


New open source AI company Deep Cogito releases first models and they’re already topping the charts

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Deep Cogito, a new AI research startup based in San Francisco, officially emerged from stealth today with Cogito v1, a new line of open source large language models (LLMs) fine-tuned from Meta’s Llama 3.2 and equipped with hybrid reasoning capabilities — the ability to answer quickly and immediately, or “self-reflect” like OpenAI’s “o” series and DeepSeek R1. The company aims to push the boundaries of AI beyond current human-overseer limitations by enabling models to iteratively refine and internalize their own improved reasoning strategies. It’s ultimately on a quest toward developing superintelligence — AI smarter than all humans in all domains — yet the company says that “All models we create will be open sourced.” Deep Cogito’s CEO and co-founder Drishan Arora — a former Senior Software Engineer at Google who says he led the large language model (LLM) modeling for Google’s generative search product —also said in a post on X they are “the strongest open models at their scale – including those from LLaMA, DeepSeek, and Qwen.” The initial model lineup includes five base sizes: 3 billion, 8 billion, 14 billion, 32 billion, and 70 billion parameters, available now on AI code sharing community Hugging Face, Ollama and through application programming interfaces (API) on Fireworks and Together AI. They’re available under the Llama licensing terms which allows for commercial usage — so third-party enterprises could put them to work in paid products — up to 700 million monthly users, at which point they need to obtain a paid license from Meta. The company plans to release even larger models — up to 671 billion parameters — in the coming months. Arora describes the company’s training approach, iterated distillation and amplification (IDA), as a novel alternative to traditional reinforcement learning from human feedback (RLHF) or teacher-model distillation. The core idea behind IDA is to allocate more compute for a model to generate improved solutions, then distill the improved reasoning process into the model’s own parameters — effectively creating a feedback loop for capability growth. Arora likens this approach to Google AlphaGo’s self-play strategy, applied to natural language. Benchmarks and evaluations The company shared a broad set of evaluation results comparing Cogito models to open-source peers across general knowledge, mathematical reasoning, and multilingual tasks. Highlights include: Cogito 3B (Standard) outperforms LLaMA 3.2 3B on MMLU by 6.7 percentage points (65.4% vs. 58.7%), and on Hellaswag by 18.8 points (81.1% vs. 62.3%). In reasoning mode, Cogito 3B scores 72.6% on MMLU and 84.2% on ARC, exceeding its own standard-mode performance and showing the effect of IDA-based self-reflection. Cogito 8B (Standard) scores 80.5% on MMLU, outperforming LLaMA 3.1 8B by 12.8 points. It also leads by over 11 points on MMLU-Pro and achieves 88.7% on ARC. In reasoning mode, Cogito 8B achieves 83.1% on MMLU and 92.0% on ARC. It surpasses DeepSeek R1 Distill 8B in nearly every category except the MATH benchmark, where Cogito scores significantly lower (60.2% vs. 80.6%). Cogito 14B and 32B models outperform Qwen2.5 counterparts by around 2–3 percentage points on aggregate benchmarks, with Cogito 32B (Reasoning) reaching 90.2% on MMLU and 91.8% on the MATH benchmark. Cogito 70B (Standard) outperforms LLaMA 3.3 70B on MMLU by 6.4 points (91.7% vs. 
85.3%) and exceeds LLaMA 4 Scout 109B on aggregate benchmark scores (54.5% vs. 53.3%). Against DeepSeek R1 Distill 70B, Cogito 70B (Reasoning) posts stronger results in general and multilingual benchmarks, with a notable 91.0% on MMLU and 92.7% on MGSM. Cogito models generally show their highest performance in reasoning mode, though some trade-offs emerge — particularly in mathematics. For instance, while Cogito 70B (Standard) matches or slightly exceeds peers in MATH and GSM8K, Cogito 70B (Reasoning) trails DeepSeek R1 in MATH by over five percentage points (83.3% vs. 89.0%). In addition to general benchmarks, Deep Cogito evaluated its models on native tool-calling performance — a growing priority for agents and API-integrated systems. Cogito 3B supports four tool-calling tasks natively (simple, parallel, multiple, and parallel-multiple), whereas LLaMA 3.2 3B does not support tool calling. Cogito 3B scores 92.8% on simple tool calls and over 91% on multiple tool calls. Cogito 8B scores over 89% across all tool call types, significantly outperforming LLaMA 3.1 8B, which ranges between 35% and 54%. These improvements are attributed not only to model architecture and training data, but also to task-specific post-training, which many baseline models currently lack. Looking ahead Deep Cogito plans to release larger-scale models in upcoming months, including mixture-of-expert variants at 109B, 400B, and 671B parameter scales. The company will also continue updating its current model checkpoints with extended training. The company positions its IDA methodology as a long-term path toward scalable self-improvement, removing dependence on human or static teacher models. Arora emphasizes that while performance benchmarks are important, real-world utility and adaptability are the true tests for these models — and that the company is just at the beginning of what it believes is a steep scaling curve. Deep Cogito’s research and infrastructure partnerships include teams from Hugging Face, RunPod, Fireworks AI, Together AI, and Ollama. All released models are open source and available now. source
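Arora’s description of IDA can be condensed into a short conceptual sketch: spend extra inference-time compute to produce better answers (amplification), fine-tune the model on those improved answers (distillation), then repeat. The model methods below are placeholders for illustration, not Deep Cogito’s training code.

```python
# Conceptual sketch of iterated distillation and amplification (IDA). The
# model.generate / model.score / model.finetune calls are placeholders standing
# in for whatever inference and training stack is actually used.

def amplify(model, prompt: str, n_samples: int = 16) -> str:
    """Amplification: use extra inference-time compute (here, best-of-n sampling)
    to get a better answer than a single greedy pass would give."""
    candidates = [model.generate(prompt, temperature=0.8) for _ in range(n_samples)]
    return max(candidates, key=lambda c: model.score(prompt, c))  # assumed scorer

def distill(model, examples: list[tuple[str, str]]):
    """Distillation: fine-tune the model on its own amplified outputs, internalizing
    the improved reasoning so it no longer needs the extra compute."""
    return model.finetune(examples)  # placeholder training step

def ida(model, prompts: list[str], rounds: int = 3):
    for _ in range(rounds):
        amplified = [(p, amplify(model, p)) for p in prompts]
        model = distill(model, amplified)   # the student becomes the next teacher
    return model
```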


Stanford’s AI Index: 5 critical insights reshaping enterprise tech strategy

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More The Stanford Institute for Human-Centered Artificial Intelligence (HAI) has released its 2025 AI Index Report, providing a data-driven analysis of AI’s global development. HAI has been developing a report on AI over the last several years, with its first benchmark coming in 2022. Needless to say, a lot has changed. The 2025 report is loaded with statistics. Among some of the top findings: The U.S. produced 40 notable AI models in 2024, significantly ahead of China (15) and Europe (3). Training compute for AI models doubles approximately every five months, and dataset sizes every eight months. AI model inference costs have fallen dramatically – a 280-fold reduction from 2022 to 2024. Global private AI investment reached $252.3 billion in 2024, a 26% increase. 78% of organizations report using AI (up from 55% in 2023). For enterprise IT leaders charting their AI strategy, the report offers critical insights into model performance, investment trends, implementation challenges and competitive dynamics reshaping the technology landscape. Here are five key takeaways for enterprise IT leaders from the AI Index. 1. The democratization of AI power is accelerating Perhaps the most striking finding is how rapidly high-quality AI has become more affordable and accessible. The cost barrier that once restricted advanced AI to tech giants is crumbling. The finding is in stark contrast to what the 2024 Stanford report found. “I was struck by how much AI models have become cheaper, more open, and accessible over the past year,” Nestor Maslej, research manager for the AI Index at HAI told VentureBeat. “While training costs remain high, we’re now seeing a world where the cost of developing high-quality—though not frontier—models is plummeting.” The report quantifies this shift dramatically: the inference cost for an AI model performing at GPT-3.5 levels dropped from $20.00 per million tokens in November 2022 to just $0.07 per million tokens by October 2024—a 280-fold reduction in 18 months. Equally significant is the performance convergence between closed and open-weight models. The gap between top closed models (like GPT-4) and leading open models (like Llama) narrowed from 8.0% in Jan. 2024 to just 1.7% by Feb. 2025. IT leader action item: Reassess your AI procurement strategy. Organizations previously priced out of cutting-edge AI capabilities now have viable options through open-weight models or significantly cheaper commercial APIs. 2. The gap between AI adoption and value realization remains substantial While the report shows 78% of organizations now use AI in at least one business function (up from 55% in 2023), real business impact lags behind adoption. When asked about meaningful ROI at scale, Maslej acknowledged: “We have limited data on what separates organizations that achieve massive returns to scale with AI from those that do not. This is a critical area of analysis we intend to explore further.” The report indicates that most organizations using generative AI report modest financial improvements. For example, 47% of businesses using generative AI in strategy and corporate finance report revenue increases, but typically at levels below 5%. IT leader action item: Focus on measurable use cases with clear ROI potential rather than broad implementation. Consider developing stronger AI governance and measurement frameworks to track value creation better. 3. 
Specific business functions show stronger financial returns from AI The report provides granular insights into which business functions are seeing the most significant financial impact from AI implementation. “On the cost side, AI appears to benefit supply chain and service operations functions the most,” Maslej noted. “On the revenue side, strategy, corporate finance, and supply chain functions see the greatest gains.” Specifically, 61% of organizations using generative AI in supply chain and inventory management report cost savings, while 70% using it in strategy and corporate finance report revenue increases. Service operations and marketing/sales also show strong potential for value creation. IT leader action item: Prioritize AI investments in functions showing the most substantial financial returns in the report. Supply chain optimization, service operations and strategic planning emerge as high-potential areas for initial or expanded AI deployment. 4. AI shows strong potential to equalize workforce performance One of the most interesting findings concerns AI’s impact on workforce productivity across skill levels. Multiple studies cited in the report show AI tools disproportionately benefit lower-skilled workers. In customer support contexts, low-skill workers experienced 34% productivity gains with AI assistance, while high-skill workers saw minimal improvement. Similar patterns appeared in consulting (43% vs. 16.5% gains) and software engineering (21-40% vs. 7-16% gains). “Generally, these studies indicate that AI has strong positive impacts on productivity and tends to benefit lower-skilled workers more than higher-skilled ones, though not always,” Maslej explained. IT leader action item: Consider AI deployment as a workforce development strategy. AI assistants can help level the playing field between junior and senior staff, potentially addressing skill gaps while improving overall team performance. 5. Responsible AI implementation remains an aspiration, not a reality Despite growing awareness of AI risks, the report reveals a significant gap between risk recognition and mitigation. While 66% of organizations consider cybersecurity an AI-related risk, only 55% actively mitigate it. Similar gaps exist for regulatory compliance (63% vs. 38%) and intellectual property infringement (57% vs. 38%). These findings come against a backdrop of increasing AI incidents, which rose 56.4% to a record 233 reported cases in 2024. Organizations face real consequences for failing to implement responsible AI practices. IT leader action item: Don’t delay implementing robust responsible AI governance. While technical capabilities advance rapidly, the report suggests most organizations still lack effective risk mitigation strategies. Developing these frameworks now could be a competitive advantage rather than a compliance burden. Looking ahead The Stanford AI Index Report presents a picture of rapidly maturing AI technology becoming more accessible and capable, while organizations still struggle to capitalize on its potential fully.  For IT leaders, the strategic imperative is clear: focus on targeted implementations with measurable ROI, emphasize responsible governance and leverage AI to enhance workforce


Cisco: Fine-tuned LLMs are now threat multipliers—22x more likely to go rogue

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Weaponized large language models (LLMs) fine-tuned with offensive tradecraft are reshaping cyberattacks, forcing CISOs to rewrite their playbooks. They’ve proven capable of automating reconnaissance, impersonating identities and evading real-time detection, accelerating large-scale social engineering attacks. Models, including FraudGPT, GhostGPT and DarkGPT, retail for as little as $75 a month and are purpose-built for attack strategies such as phishing, exploit generation, code obfuscation, vulnerability scanning and credit card validation. Cybercrime gangs, syndicates and nation-states see revenue opportunities in providing platforms, kits and leasing access to weaponized LLMs today. These LLMs are being packaged much like legitimate businesses package and sell SaaS apps. Leasing a weaponized LLM often includes access to dashboards, APIs, regular updates and, for some, customer support. VentureBeat continues to track the progression of weaponized LLMs closely. It’s becoming evident that the lines are blurring between developer platforms and cybercrime kits as weaponized LLMs’ sophistication continues to accelerate. With lease or rental prices plummeting, more attackers are experimenting with platforms and kits, leading to a new era of AI-driven threats. Legitimate LLMs in the cross-hairs The spread of weaponized LLMs has progressed so quickly that legitimate LLMs are at risk of being compromised and integrated into cybercriminal tool chains. The bottom line is that legitimate LLMs and models are now in the blast radius of any attack. The more fine-tuned a given LLM is, the greater the probability it can be directed to produce harmful outputs. Cisco’s The State of AI Security Report reports that fine-tuned LLMs are 22 times more likely to produce harmful outputs than base models. Fine-tuning models is essential for ensuring their contextual relevance. The trouble is that fine-tuning also weakens guardrails and opens the door to jailbreaks, prompt injections and model inversion. Cisco’s study proves that the more production-ready a model becomes, the more exposed it is to vulnerabilities that must be considered in an attack’s blast radius. The core tasks teams rely on to fine-tune LLMs, including continuous fine-tuning, third-party integration, coding and testing, and agentic orchestration, create new opportunities for attackers to compromise LLMs. Once inside an LLM, attackers work fast to poison data, attempt to hijack infrastructure, modify and misdirect agent behavior and extract training data at scale. Cisco’s study infers that without independent security layers, the models teams work so diligently on to fine-tune aren’t just at risk; they’re quickly becoming liabilities. From an attacker’s perspective, they’re assets ready to be infiltrated and turned. Fine-Tuning LLMs dismantles safety controls at scale A key part of Cisco’s security team’s research centered on testing multiple fine-tuned models, including Llama-2-7B and domain-specialized Microsoft Adapt LLMs. These models were tested across a wide variety of domains including healthcare, finance and law. One of the most valuable takeaways from Cisco’s study of AI security is that fine-tuning destabilizes alignment, even when trained on clean datasets. 
Alignment breakdown was the most severe in biomedical and legal domains, two industries known for being among the most stringent regarding compliance, legal transparency and patient safety.  While the intent behind fine-tuning is improved task performance, the side effect is systemic degradation of built-in safety controls. Jailbreak attempts that routinely failed against foundation models succeeded at dramatically higher rates against fine-tuned variants, especially in sensitive domains governed by strict compliance frameworks. The results are sobering. Jailbreak success rates tripled and malicious output generation soared by 2,200% compared to foundation models. Figure 1 shows just how stark that shift is. Fine-tuning boosts a model’s utility but comes at a cost, which is a substantially broader attack surface. TAP achieves up to 98% jailbreak success, outperforming other methods across open- and closed-source LLMs. Source: Cisco State of AI Security 2025, p. 16. Malicious LLMs are a $75 commodity Cisco Talos is actively tracking the rise of black-market LLMs and provides insights into their research in the report. Talos found that GhostGPT, DarkGPT and FraudGPT are sold on Telegram and the dark web for as little as $75/month. These tools are plug-and-play for phishing, exploit development, credit card validation and obfuscation. DarkGPT underground dashboard offers “uncensored intelligence” and subscription-based access for as little as 0.0098 BTC—framing malicious LLMs as consumer-grade SaaS.Source: Cisco State of AI Security 2025, p. 9. Unlike mainstream models with built-in safety features, these LLMs are pre-configured for offensive operations and offer APIs, updates, and dashboards that are indistinguishable from commercial SaaS products. $60 dataset poisoning threatens AI supply chains “For just $60, attackers can poison the foundation of AI models—no zero-day required,” write Cisco researchers. That’s the takeaway from Cisco’s joint research with Google, ETH Zurich and Nvidia, which shows how easily adversaries can inject malicious data into the world’s most widely used open-source training sets. By exploiting expired domains or timing Wikipedia edits during dataset archiving, attackers can poison as little as 0.01% of datasets like LAION-400M or COYO-700M and still influence downstream LLMs in meaningful ways. The two methods mentioned in the study, split-view poisoning and frontrunning attacks, are designed to leverage the fragile trust model of web-crawled data. With most enterprise LLMs built on open data, these attacks scale quietly and persist deep into inference pipelines. Decomposition attacks quietly extract copyrighted and regulated content One of the most startling discoveries Cisco researchers demonstrated is that LLMs can be manipulated to leak sensitive training data without ever triggering guardrails. Cisco researchers used a method called decomposition prompting to reconstruct over 20% of select New York Times and Wall Street Journal articles. Their attack strategy broke down prompts into sub-queries that guardrails classified as safe, then reassembled the outputs to recreate paywalled or copyrighted content. Successfully evading guardrails to access proprietary datasets or licensed content is an attack vector every enterprise is grappling to protect today. For those that have LLMs trained on proprietary datasets or licensed content, decomposition attacks can be particularly devastating. 
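The kind of measurement behind numbers like the 22x figure can be sketched in a few lines: run the same adversarial prompt set against a base model and its fine-tuned variant, flag harmful replies with a safety classifier, and compare the rates. The callables below are placeholders, not Cisco’s actual harness or methodology.

```python
# Hedged sketch of a red-team comparison between a base model and a fine-tuned
# variant. Model callables, the prompt set and the classifier are placeholders.
from typing import Callable

def harmful_output_rate(model: Callable[[str], str],
                        test_prompts: list[str],
                        is_harmful: Callable[[str], bool]) -> float:
    """Fraction of adversarial prompts for which the model's reply is flagged."""
    flagged = sum(is_harmful(model(p)) for p in test_prompts)
    return flagged / len(test_prompts)

def guardrail_degradation(base_model, tuned_model, test_prompts, is_harmful) -> float:
    base_rate = harmful_output_rate(base_model, test_prompts, is_harmful)
    tuned_rate = harmful_output_rate(tuned_model, test_prompts, is_harmful)
    # A large ratio (e.g. ~22x) would indicate fine-tuning sharply weakened guardrails.
    return tuned_rate / max(base_rate, 1e-9)
```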
Cisco explains that the breach isn’t happening at the input level, it’s emerging from the models’ outputs. That
