VentureBeat

MIT report misunderstood: Shadow AI economy booms while headlines cry failure

The most widely cited statistic from a new MIT report has been deeply misunderstood. While headlines trumpet that “95% of generative AI pilots at companies are failing,” the report actually reveals something far more remarkable: the fastest and most successful enterprise technology adoption in corporate history is happening right under executives’ noses.

The study, released this week by MIT’s Project NANDA, has sparked anxiety across social media and business circles, with many interpreting it as evidence that artificial intelligence is failing to deliver on its promises. But a closer reading of the 26-page report tells a starkly different story — one of unprecedented grassroots technology adoption that has quietly revolutionized work while corporate initiatives stumble.

The researchers found that 90% of employees regularly use personal AI tools for work, even though only 40% of their companies have official AI subscriptions. “While only 40% of companies say they purchased an official LLM subscription, workers from over 90% of the companies we surveyed reported regular use of personal AI tools for work tasks,” the study explains. “In fact, almost every single person used an LLM in some form for their work.”

Employees use personal A.I. tools at more than twice the rate of official corporate adoption, according to the MIT report. (Credit: MIT)

How employees cracked the AI code while executives stumbled

The MIT researchers discovered what they call a “shadow AI economy” where workers use personal ChatGPT accounts, Claude subscriptions and other consumer tools to handle significant portions of their jobs. These employees aren’t just experimenting — they’re using AI “multiples times a day every day of their weekly workload,” the study found.

This underground adoption has outpaced the early spread of email, smartphones, and cloud computing in corporate environments. A corporate lawyer quoted in the MIT report exemplified the pattern: Her organization invested $50,000 in a specialized AI contract analysis tool, yet she consistently used ChatGPT for drafting work because “the fundamental quality difference is noticeable. ChatGPT consistently produces better outputs, even though our vendor claims to use the same underlying technology.”

The pattern repeats across industries. Corporate systems get described as “brittle, overengineered, or misaligned with actual workflows,” while consumer AI tools win praise for “flexibility, familiarity, and immediate utility.” As one chief information officer told researchers: “We’ve seen dozens of demos this year. Maybe one or two are genuinely useful. The rest are wrappers or science projects.”

The 95% failure rate that has dominated headlines applies specifically to custom enterprise AI solutions — the expensive, bespoke systems companies commission from vendors or build internally. These tools fail because they lack what the MIT researchers call “learning capability.” Most corporate AI systems “do not retain feedback, adapt to context, or improve over time,” the study found.
Users complained that enterprise tools “don’t learn from our feedback” and that there is “too much manual context required each time.” Consumer tools like ChatGPT succeed because they feel responsive and flexible, even though they reset with each conversation. Enterprise tools feel rigid and static, requiring extensive setup for each use.

The learning gap creates a strange hierarchy in user preferences. For quick tasks like emails and basic analysis, 70% of workers prefer AI over human colleagues. But for complex, high-stakes work, 90% still want humans. The dividing line isn’t intelligence — it’s memory and adaptability.

General-purpose A.I. tools like ChatGPT reach production 40% of the time, while task-specific enterprise tools succeed only 5% of the time. (Credit: MIT)

The hidden billion-dollar productivity boom happening under IT’s radar

Far from showing AI failure, the shadow economy reveals massive productivity gains that don’t appear in corporate metrics. Workers have solved integration challenges that stymie official initiatives, proving AI works when implemented correctly.

“This shadow economy demonstrates that individuals can successfully cross the GenAI Divide when given access to flexible, responsive tools,” the report explains. Some companies have started paying attention: “Forward-thinking organizations are beginning to bridge this gap by learning from shadow usage and analyzing which personal tools deliver value before procuring enterprise alternatives.”

The productivity gains are real and measurable, just hidden from traditional corporate accounting. Workers automate routine tasks, accelerate research, and streamline communication — all while their companies’ official AI budgets produce little return.

Workers prefer A.I. for routine tasks like emails but still trust humans for complex, multi-week projects. (Credit: MIT)

Why buying beats building: external partnerships succeed twice as often

Another finding challenges conventional tech wisdom: companies should stop trying to build AI internally. External partnerships with AI vendors reached deployment 67% of the time, compared to 33% for internally built tools. The most successful implementations came from organizations that “treated AI startups less like software vendors and more like business service providers,” holding them to operational outcomes rather than technical benchmarks. These companies demanded deep customization and continuous improvement rather than flashy demos.

“Despite conventional wisdom that enterprises resist training AI systems, most teams in our interviews expressed willingness to do so, provided the benefits were clear and guardrails were in place,” the researchers found. The key was partnership, not just purchasing.

Seven industries avoiding disruption are actually being smart

The MIT report found that only technology and media sectors show meaningful structural change from AI, while seven major industries — including healthcare, finance, and manufacturing — show “significant pilot activity but little to no structural change.” This measured approach isn’t a failure — it’s wisdom. Industries avoiding disruption are being thoughtful about implementation rather than rushing into chaotic change. In healthcare and energy, “most executives report no current or anticipated hiring reductions over the next five years.” Technology and media move faster because they can absorb more risk.


Meta is partnering with Midjourney and will license its technology for ‘future models and products’

Even three years after its debut, with ever-increasing competition in the AI image and video generation space, Midjourney, the bootstrapped San Francisco startup, remains the “gold standard” for its 20 million users — including us here at VentureBeat, where we use it to generate the “header” art for many of our articles.

Apparently, the leaders of Facebook and Instagram parent company Meta feel similarly. Today, Alexandr Wang, the former Scale AI founder and CEO who has become Meta’s Chief AI Officer and head of the company’s newly formed Meta Superintelligence Labs (MSL), announced a partnership with Midjourney — believed to be the first of its kind for the independent AI image startup.

Meta will “license their aesthetic technology for our future models and products, bringing beauty to billions,” Wang wrote on X, a rival social network to Meta’s own Threads and Facebook.

1/ Today we’re proud to announce a partnership with @midjourney, to license their aesthetic technology for our future models and products, bringing beauty to billions. — Alexandr Wang (@alexandr_wang) August 22, 2025

Midjourney had reportedly been in discussions with Elon Musk and xAI for some integration with the latter company’s Grok image generation capabilities, but xAI initially debuted Grok image generation powered by startup Black Forest Labs’ Flux AI image model, and now appears to have native image generation capabilities.

Two months ago, Midjourney added the ability to turn images created on the site or uploaded by a user into artistic, captivating videos, a feature that has also impressed many users and exceeded their expectations.

Wang framed the move as part of a bigger philosophy — Meta’s “all-of-the-above” approach to building advanced AI. That means recruiting top research talent, pouring billions into computing infrastructure, and, in this case, teaming up with a company whose work complements Meta’s in ways it can’t easily build on its own. Midjourney, Wang said, has achieved “true feats of technical and aesthetic excellence,” and Meta is eager to put that expertise to work.

For Midjourney, the partnership is an opportunity to see its technology woven into one of the largest digital ecosystems on the planet. But in his own X post, Midjourney founder David Holz was quick to stress what isn’t changing: the lab’s independence. He reminded followers that Midjourney remains community-backed, has no outside investors, and is still pursuing an ambitious slate of projects aimed at shaping what he calls more “humane futures.”

Bringing sublime tools of creation and beauty to billions of people is squarely within our mission. Excited to partner with the titans of industry to make this happen. https://t.co/LJqcrDtGSz — David (@DavidSHolz) August 22, 2025

We remain an independent, community-backed research lab, with no investors, working on a staggering array of ambitious projects focused on bringing about humane futures where we are all mid-journey. Join us! — David (@DavidSHolz) August 22, 2025

On paper, the tie-up makes sense. Meta brings scale, distribution, and staggering compute power.
Midjourney brings a creative edge, honed through years of training models to generate imagery that resonates with actual human tastes. It’s a marriage of brute force and design flair, an alliance that could help Meta’s AI systems feel less utilitarian and more inspired.

Details are lacking: how much $$$ is Midjourney getting from the partnership?

But for now, the details are hazy. Neither company has said how much the deal is worth. There have been no statements as to when Midjourney’s technology will start showing up in Meta’s products, or to what degree it will be baked into the company’s AI strategy. Is this about upgrading the polish of Meta’s widely mocked and recently criticized chatbots — one of which a user allegedly mistook for a real person and died traveling to visit? Will it be used to enhance Meta’s virtual reality worlds? Or supercharge creative tools across Instagram and Facebook? The answers remain vague for now.

Similarly, a big question concerns what will happen with Midjourney’s stated plans to pursue an external application programming interface (API) for other enterprises to build products and services atop its powerful image generation models. Last month, the official Midjourney account on X posted that it was “starting to investigate opening up an Enterprise API for people to start integrating Midjourney into their companies/services,” and provided an Enterprise API application questionnaire for interested parties to fill out.

We’re starting to investigate opening up an Enterprise API for people to start integrating Midjourney into their companies/services. If you’d like to apply, help us figure out what to provide, or just want follow-up updates please fill out our Enterprise API application below — Midjourney (@midjourney) July 16, 2025

That application remains online for now, but with Meta inking a deal with Midjourney, the question becomes whether or not the deal is exclusive and will stop plans for a separate Midjourney API in their tracks. I have messaged founder Holz to ask about the API and will update upon receiving a response.

The announcement lands against the backdrop of Meta’s massive internal shake-up. In August, the company reorganized its AI operations, creating Meta Superintelligence Labs, with Wang — who joined after Meta’s $14.3 billion investment in Scale AI — at the helm. The reorg split AI work into four core tracks: research, training, product, and infrastructure, as Business Insider first reported. Wang now oversees an elite bench of talent recruited from OpenAI, Anthropic, Google DeepMind, and beyond — lured with eye-watering pay packages in the multi-hundred-million-dollar range — and tasked with pushing Meta toward its stated goal: personalized artificial superintelligence for each user. It’s an ambitious mission, and a controversial one. Some researchers inside Meta are reportedly uneasy


LLMs generate ‘fluent nonsense’ when reasoning outside their training zone

A new study from Arizona State University researchers suggests that the celebrated “Chain-of-Thought” (CoT) reasoning in Large Language Models (LLMs) may be more of a “brittle mirage” than genuine intelligence. The research builds on a growing body of work questioning the depth of LLM reasoning, but it takes a unique “data distribution” lens to test where and why CoT breaks down systematically. Crucially for application builders, the paper goes beyond critique to offer clear, practical guidance on how to account for these limitations when developing LLM-powered applications, from testing strategies to the role of fine-tuning.

The promise and problem of Chain-of-Thought

CoT prompting, which asks an LLM to “think step by step,” has shown impressive results on complex tasks, leading to the perception that models are engaging in human-like inferential processes. However, a closer inspection often reveals logical inconsistencies that challenge this view. Various studies show that LLMs frequently rely on surface-level semantics and clues rather than logical procedures. The models generate plausible-sounding logic by repeating token patterns they have seen during training. Still, this approach often fails on tasks that deviate from familiar templates or when irrelevant information is introduced.

Despite these observations, the researchers of the new study argue that “a systematic understanding of why and when CoT reasoning fails is still a mystery,” which their study aims to address. Previous work has already shown that LLMs struggle to generalize their reasoning abilities. As the paper notes, “theoretical and empirical evidence shows that CoT generalizes well only when test inputs share latent structures with training data; otherwise, performance declines sharply.”

A new lens on LLM reasoning

The ASU researchers propose a new lens to view this problem: CoT isn’t an act of reasoning but a sophisticated form of pattern matching, fundamentally bound by the statistical patterns in its training data. They posit that “CoT’s success stems not from a model’s inherent reasoning capacity, but from its ability to generalize conditionally to out-of-distribution (OOD) test cases that are structurally similar to in-distribution exemplars.” In other words, an LLM is good at applying old patterns to new data that looks similar, but not at solving truly novel problems.

The data distribution lens (Source: GitHub)

To test this hypothesis, they dissected CoT’s capabilities across three dimensions of “distributional shift” (changes between the training data and the test data). First, they tested “task generalization” to see if a model could apply a learned reasoning process to a new type of task. Second, they examined “length generalization” to determine if it could handle reasoning chains that are significantly longer or shorter than those it was trained on. Finally, they assessed “format generalization” to measure how sensitive the model is to minor changes in the prompt’s wording or structure.
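For builders who want to apply the same lens to their own systems, here is a minimal sketch of what probing these three shift dimensions could look like in practice. It is illustrative only: the example tasks, prompt tweaks, and dummy model are hypothetical stand-ins, not part of the framework described in the paper.

```python
# Illustrative sketch of out-of-distribution (OOD) probing for CoT prompts.
# The tasks, prompt variants, and dummy model are placeholders, not the
# researchers' DataAlchemy framework.

def dummy_model(prompt: str) -> str:
    # Stand-in for a real LLM call (swap in your API client of choice).
    return "step 1 ... step 2 ... final answer: 408"

def probe_distribution_shift(model, base_task: str) -> dict:
    variants = {
        "in_distribution": base_task,
        # Task shift: a problem type absent from the prompt exemplars.
        "task_shift": "List the prime factors of 2310, showing your steps.",
        # Length shift: same task family, but a much longer reasoning chain.
        "length_shift": base_task + " Then repeat the procedure for ten more inputs.",
        # Format shift: superficial rewording of the prompt scaffold.
        "format_shift": base_task.replace("Question:", "Q>>"),
    }
    # Collect raw outputs; in practice, score each against held-out ground truth.
    return {name: model(prompt) for name, prompt in variants.items()}

if __name__ == "__main__":
    base = "Question: What is 17 * 24? Think step by step."
    for name, output in probe_distribution_shift(dummy_model, base).items():
        print(f"{name}: {output[:60]}")
```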
For their analysis, they developed a framework called DataAlchemy to train smaller LLMs from scratch in a controlled environment, allowing them to precisely measure how performance degrades when pushed beyond the training data. “The data distribution lens and controlled environment are both central to what we were trying to convey,” Chengshuai Zhao, doctoral student at ASU and co-author of the paper, told VentureBeat. “We hope to create a space where the public, researchers, and developers can freely explore and probe the nature of LLMs and advance the boundaries of human knowledge.”

The mirage confirmed

Based on their findings, the researchers conclude that CoT reasoning is a “sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training.” When tested even slightly outside this distribution, performance collapses. What looks like structured reasoning is more of a mirage, “emerging from memorized or interpolated patterns in the training data rather than logical inference.”

The breakdown was consistent across all three dimensions. On new tasks, models failed to generalize and instead replicated the closest patterns they had seen during training. When faced with reasoning chains of different lengths, they struggled, often trying to artificially add or remove steps to match the length of their training examples. Finally, their performance proved highly sensitive to superficial changes in the prompt, especially variations in core elements and instructions.

Interestingly, the researchers found that these failures could be quickly fixed. By fine-tuning the models on a very small sample of the new, unseen data through supervised fine-tuning (SFT), performance on that specific type of problem increased rapidly. However, this quick fix further supports the pattern-matching theory, suggesting the model isn’t learning to reason more abstractly but is instead just memorizing a new pattern to overcome a specific weakness.

Takeaways for the enterprise

The researchers offer a direct warning to practitioners, highlighting “the risk of relying on CoT as a plug-and-play solution for reasoning tasks and caution against equating CoT-style output with human thinking.” They provide three key pieces of advice for developers building applications with LLMs.

1) Guard against over-reliance and false confidence. CoT should not be treated as a reliable module for reasoning in high-stakes fields like finance or legal analysis. LLMs can produce “fluent nonsense” (plausible but logically flawed reasoning) that is more deceptive than an outright incorrect answer. “That doesn’t mean enterprises should abandon it entirely — it can still provide value within familiar, in-distribution tasks. The key is not to treat it as a ‘plug-and-play’ reasoning engine,” Zhao said, adding that in high-stakes domains such as finance or health, guardrails such as rigorous auditing by domain experts, multi-model cross-checking, and fallback strategies are “essential if CoT outputs are to be used safely.”

2) Prioritize out-of-distribution (OOD) testing. Standard validation, where test data mirrors training data, is not enough to measure true robustness. Developers must implement rigorous testing that systematically probes for failures across task, length, and


OpenCUA’s open source computer-use agents rival proprietary models from OpenAI and Anthropic

A new framework from researchers at The University of Hong Kong (HKU) and collaborating institutions provides an open source foundation for creating robust AI agents that can operate computers. The framework, called OpenCUA, includes the tools, data, and recipes for scaling the development of computer-use agents (CUAs). Models trained using this framework perform strongly on CUA benchmarks, outperforming existing open source models and competing closely with closed agents from leading AI labs like OpenAI and Anthropic.

The challenge of building computer-use agents

Computer-use agents are designed to autonomously complete tasks on a computer, from navigating websites to operating complex software. They can also help automate workflows in the enterprise. However, the most capable CUA systems are proprietary, with critical details about their training data, architectures, and development processes kept private. “As the lack of transparency limits technical advancements and raises safety concerns, the research community needs truly open CUA frameworks to study their capabilities, limitations, and risks,” the researchers state in their paper.

At the same time, open source efforts face their own set of hurdles. There has been no scalable infrastructure for collecting the diverse, large-scale data needed to train these agents. Existing open source datasets for graphical user interfaces (GUIs) have limited data, and many research projects provide insufficient detail about their methods, making it difficult for others to replicate their work. According to the paper, “These limitations collectively hinder advances in general-purpose CUAs and restrict a meaningful exploration of their scalability, generalizability, and potential learning approaches.”

Introducing OpenCUA

OpenCUA framework (Source: XLANG Lab at HKU)

OpenCUA is an open source framework designed to address these challenges by scaling both the data collection and the models themselves. At its core is the AgentNet Tool for recording human demonstrations of computer tasks on different operating systems. The tool streamlines data collection by running in the background on an annotator’s personal computer, capturing screen videos, mouse and keyboard inputs, and the underlying accessibility tree, which provides structured information about on-screen elements. This raw data is then processed into “state-action trajectories,” pairing a screenshot of the computer (the state) with the user’s corresponding action (a click, key press, etc.). Annotators can then review, edit, and submit these demonstrations.

AgentNet tool (Source: XLang Lab at HKU)

Using this tool, the researchers collected the AgentNet dataset, which contains over 22,600 task demonstrations across Windows, macOS, and Ubuntu, spanning more than 200 applications and websites. “This dataset authentically captures the complexity of human behaviors and environmental dynamics from users’ personal computing environments,” the paper notes.

Recognizing that screen-recording tools raise significant data privacy concerns for enterprises, the researchers designed the AgentNet Tool with security in mind.
Xinyuan Wang, co-author of the paper and PhD student at HKU, explained that they implemented a multi-layer privacy protection framework. “First, annotators themselves can fully observe the data they generate… before deciding whether to submit it,” he told VentureBeat. The data then undergoes manual verification for privacy issues and automated scanning by a large model to detect any remaining sensitive content before release. “This layered process ensures enterprise-grade robustness for environments handling sensitive customer or financial data,” Wang added.

To accelerate evaluation, the team also curated AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent’s performance.

A new recipe for training agents

The OpenCUA framework introduces a novel pipeline for processing data and training computer-use agents. The first step converts the raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that simply training models on these pairs yields limited performance gains, even with large amounts of data.

OpenCUA chain-of-thought pipeline (Source: XLang Lab at HKU)

The key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed “inner monologue” for each action, which includes planning, memory, and reflection. This structured reasoning is organized into three levels: a high-level observation of the screen, reflective thoughts that analyze the situation and plan the next steps, and finally, the concise, executable action. This approach helps the agent develop a deeper understanding of the tasks. “We find natural language reasoning crucial for generalizable computer-use foundation models, helping CUAs internalize cognitive capabilities,” the researchers write.

This data synthesis pipeline is a general framework that can be adapted by companies to train agents on their own unique internal tools. According to Wang, an enterprise can record demonstrations of its proprietary workflows and use the same “reflector” and “generator” pipeline to create the necessary training data. “This allows them to bootstrap a high-performing agent tailored to their internal tools without needing to handcraft reasoning traces manually,” he explained.

Putting OpenCUA to the test

The researchers applied the OpenCUA framework to train a range of open source VLMs, including variants of Qwen and Kimi-VL, with parameter sizes from 3 billion to 32 billion. The models were evaluated on a suite of online and offline benchmarks that test their ability to perform tasks and understand GUIs. The 32-billion-parameter model, OpenCUA-32B, established a new state-of-the-art success rate among open source models on the OSWorld-Verified benchmark. It also surpassed OpenAI’s GPT-4o-based CUA and significantly closed the performance gap with Anthropic’s leading proprietary models.

OpenCUA shows massive improvement over base models (left) while competing with leading CUA models (right). (Source: XLANG Lab at HKU)

For enterprise developers and product leaders, the research offers several key findings. The OpenCUA method is broadly applicable, improving performance on models with different architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across a diverse range of tasks and operating systems. According
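As a rough illustration of the data structures described above, here is a minimal sketch of a state-action trajectory step augmented with the three reasoning levels (observation, thought, action). The field names and example values are assumptions for illustration, not OpenCUA’s actual schema.

```python
# Illustrative sketch of a CoT-augmented state-action step, loosely modeled on
# the pipeline described above. Field names are assumptions, not OpenCUA's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrajectoryStep:
    screenshot_path: str  # the "state": a screen capture at this step
    observation: str      # level 1: high-level description of the screen
    thought: str          # level 2: reflection and plan for the next move
    action: str           # level 3: concise, executable action

@dataclass
class Demonstration:
    task: str
    os: str
    steps: List[TrajectoryStep] = field(default_factory=list)

demo = Demonstration(
    task="Export the open spreadsheet as a PDF",
    os="macOS",
    steps=[
        TrajectoryStep(
            screenshot_path="frames/0001.png",
            observation="A spreadsheet is open; the File menu is visible in the menu bar.",
            thought="To export as PDF, open the File menu and look for an Export option.",
            action="click(x=42, y=12)  # File menu",
        )
    ],
)
print(f"{demo.task}: {len(demo.steps)} annotated step(s)")
```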


Don’t sleep on Cohere: Command A Reasoning, its first reasoning model, is built for enterprise customer service and more

I was in more meetings than usual today, so I just caught up to the fact that Cohere, the Canadian startup co-founded by Transformer paper co-author Aidan Gomez and geared toward making generative AI products work easily, powerfully, and securely for enterprises, has released its first reasoning large language model (LLM), Command A Reasoning.

It looks to be a strong release. Benchmarks, technical specs, and early tests suggest the model delivers on flexibility, efficiency, and raw reasoning power. Customer service, market research, scheduling, and data analysis are some of the tasks Cohere says it’s built to handle automatically at scale inside secure enterprise environments. It is a text-only model, but it should be easy enough to hook up to multimodal models and tools. In fact, tool use is one of its primary selling points.

While it’s open for researchers to use for non-commercial purposes, enterprises will need to pay Cohere to get access, and the company doesn’t publicly list its pricing because it says it offers bespoke customization and private deployment. Cohere was valued at $6.8 billion when it announced its latest funding round of $500 million a week and a day ago.

Tuned for enterprises

Command A Reasoning is tuned for enterprises with sprawling document libraries, long email chains, and workflows that can’t afford hallucinations. It supports up to 256,000 tokens on multi-GPU setups, a decent size and comparable to OpenAI’s GPT-5. The research release weighs in at 111 billion parameters, trained with tool use and multilingual performance in mind. It supports 23 languages out of the box, including English, French, Spanish, Japanese, Arabic, and Hindi. That multilingual depth is key for global enterprises that need consistent agent quality across markets.

The model slots directly into North, Cohere’s new platform for deploying AI agents and automations on-premises. That means enterprises can spin up custom agents that live entirely within their infrastructure, giving them control over data flows while still tapping into advanced reasoning. Cohere appears to have thought cleverly and strategically here, identifying some of the recurring functions across enterprises — onboarding, market research and analysis, development — and training its model to support agentic workflows that handle these automatically.

Controlled thinking

As with many other recent reasoning releases, including Nvidia’s new Nemotron-Nano-9B-v2, Command A Reasoning introduces a token budget feature to let users or developers specify how much reasoning to allocate to specific inputs and tasks. Less budget means faster, cheaper replies. More budget means deeper, more accurate reasoning. The Hugging Face release even exposes this tradeoff directly: reasoning can be toggled on or off through a simple parameter. Developers can run the model in “reasoning mode” for maximum performance or switch it off for lower-latency tasks — without changing models.

Excels at enterprise-targeted benchmarks

So how does it perform in practice? Cohere’s benchmarks paint a clear picture.
On enterprise reasoning tasks, Command A Reasoning consistently outpaces peers like DeepSeek-R1 0528, gpt-oss-120b, and Mistral Magistral Medium. It handles multilingual benchmarks with equal strength, important for global businesses.

The token budget system isn’t just a gimmick. In head-to-head comparisons against Cohere’s previous Command A model, satisfaction scores climbed steadily as the budget increased. Even with “instant” minimal reasoning, Command A Reasoning beat its predecessor. At higher budgets, it pulled further ahead.

The story is the same in deep research. On the DeepResearch Bench — which measures instruction following, readability, insight, and comprehensiveness — Cohere’s system came out on top against offerings from Gemini, OpenAI, Anthropic, Perplexity, and xAI’s Grok. The model excelled in turning sprawling questions into reports that are not only detailed but readable, a key challenge in enterprise knowledge work.

Beyond benchmarks, the model is wired for action. Cohere trained it specifically for conversational tool use — letting it call APIs, connect to databases, or query external systems during a task. Developers can define tools via JSON schema and feed them into chat templates in Transformers, making it easier to integrate the model into existing enterprise systems. That design supports Cohere’s larger bet on agentic workflows: AI systems made up of multiple coordinated agents, each handling a piece of a bigger job. Command A Reasoning is the reasoning engine that keeps those workflows coherent and on task.

Safety: built for high-stakes work

Cohere is also pitching safety as a central feature. The model is trained to avoid the common enterprise headache of over-refusal — when an AI rejects legitimate requests out of caution — while still filtering harmful or malicious content. Evaluations focused on five high-risk categories: child safety, self-harm, violence and hate, explicit material, and conspiracy theories. For companies looking to deploy AI in regulated industries or sensitive domains, this balance is meant to make the model more practical in day-to-day operations.

Early buy-in from large enterprises

SAP SE is one of the first major partners to integrate the model. Dr. Walter Sun, SVP and Global Head of AI, said the collaboration will enhance SAP’s generative AI capabilities within the SAP Business Technology Platform. For customers, that means agentic applications that can be customized to fit enterprise-specific needs.

Availability and licensing

Command A Reasoning is available now on the Cohere platform, and for research use on Hugging Face. The Hugging Face repository provides open weights for research under a CC-BY-NC license, requiring users to share contact information and adhere to Cohere’s Acceptable Use Policy. Enterprises interested in commercial or private deployments can contact Cohere’s sales team for bespoke pricing.

For enterprises, the pitch is straightforward: one model, multiple modes of deployment, fine-grained control over performance, multilingual capability, tool integration, and benchmark results that suggest it outperforms its peers.
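To make the tool-use integration described above a bit more concrete, here is a minimal sketch of defining a tool as a JSON schema and passing it through a Transformers chat template. The model ID is a placeholder, and the exact repository name and prompt format for Command A Reasoning are assumptions; check Cohere’s model card for the actual usage.

```python
# Minimal sketch: passing a JSON-schema tool definition through a Transformers
# chat template. The model ID below is a placeholder, not Cohere's actual repo name.
from transformers import AutoTokenizer

MODEL_ID = "your-org/your-tool-use-model"  # placeholder model repository
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# A tool described with JSON schema, as mentioned in the article.
get_invoice_status = {
    "type": "function",
    "function": {
        "name": "get_invoice_status",
        "description": "Look up the payment status of an invoice by its ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "invoice_id": {"type": "string", "description": "Internal invoice ID"},
            },
            "required": ["invoice_id"],
        },
    },
}

messages = [{"role": "user", "content": "Has invoice INV-1042 been paid?"}]

# Recent Transformers versions accept a `tools` argument when rendering chat templates.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_invoice_status],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # the rendered prompt now includes the tool definition
```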


Qwen-Image Edit gives Photoshop a run for its money with AI-powered text-to-image edits that work in seconds

Adobe Photoshop is among the most recognizable pieces of software ever created, used by more than 90% of the world’s creative professionals, according to Photutorial. So the fact that a new open source AI model — Qwen-Image Edit, released yesterday by Chinese e-commerce giant Alibaba’s Qwen Team of AI researchers — is now able to accomplish a huge number of Photoshop-like editing jobs with text inputs alone is a notable achievement.

Built on the 20-billion-parameter Qwen-Image foundation model released earlier this month, Qwen-Image-Edit extends the system’s unique strengths in text rendering to cover a wide spectrum of editing tasks, from subtle appearance changes to broader semantic transformations. Simply upload a starting image — I tried one of myself from VentureBeat’s last annual Transform conference in San Francisco — and then type instructions for what you want to change, and Qwen-Image-Edit will return a new image with those edits applied.

Input image example (Photo credit: Michael O’Donnell Photography)

Output image example with prompt: “Make the man wearing a tuxedo.”

The model is available now across several platforms, including Qwen Chat, Hugging Face, ModelScope, GitHub, and the Alibaba Cloud application programming interface (API), the latter of which allows any third-party developer or enterprise to integrate this new model into their own applications and workflows. I created my examples above on Qwen Chat, the Qwen Team’s rival to OpenAI’s ChatGPT. It should be noted for any aspiring users that generations are limited to about eight free jobs (input/outputs) per 12-hour period before the quota resets. Paying users have access to more jobs.

With support for both English and Chinese inputs, and a dual focus on both semantic meaning and visual fidelity, Qwen-Image-Edit aims to lower barriers to professional-grade visual content creation. And given that the model is available as open source code under an Apache 2.0 license, it’s safe for enterprises to download and set up for free on their own hardware or virtual clouds/machines, potentially resulting in huge cost savings compared with proprietary software like Photoshop.

As Junyang Lin, a Qwen Team researcher, wrote on X, “it can remove a strand of hair, very delicate image modification.” The team’s announcement echoes this sentiment, presenting Qwen-Image-Edit not as an entirely new system, but as a natural extension of Qwen-Image that applies its unique text rendering and dual-encoding approach directly to editing tasks.

Dual encodings allow for edits preserving style and content of original image

Qwen-Image-Edit builds on the foundation established by Qwen-Image, which was introduced earlier this year as a large-scale model specializing in both image generation and text rendering. Qwen-Image’s technical report highlighted its ability to handle complex tasks like paragraph-level text rendering, Chinese and English characters, and multi-line layouts with accuracy. The report also emphasized a dual-encoding mechanism, feeding images simultaneously into Qwen2.5-VL for semantic control and a variational autoencoder (VAE) for reconstructive detail.
This approach allows edits that remain faithful to both the intent of the prompt and the look of the original image. Those same architectural choices underpin Qwen-Image-Edit. By leveraging dual encodings, the model can adjust at two levels: semantic edits that change the meaning or structure of a scene, and appearance edits that introduce or remove elements while keeping the rest untouched.

Semantic editing includes creating new intellectual property, rotating objects 90 or 180 degrees to reveal different views, or transforming an input into another style such as Studio Ghibli-inspired art. These edits typically modify many pixels but preserve the underlying identity of objects. One example of semantic editing comes from Shridhar Athinarayanan, an engineer at AI applications platform Replicate, who used a Replicate-hosted implementation or “inference” of Qwen to reskin a photo of Manhattan to look like a toy Lego set.

Appearance editing focuses on precise, local changes. In these cases, most of the image remains unchanged while specific objects are altered. Demonstrations include adding a signboard that generates a reflection in water, removing stray hair strands from a portrait, and changing the color of a single letter in a text image. One good example of appearance editing with Qwen-Image Edit comes from AnswerAI co-founder and CEO Thomas Hill, who posted a side-by-side on X showing his wife in her wedding dress below an archway, and another with the same archway covered with graffiti.

Combined with Qwen’s established strength in rendering Chinese and English text, the editing-focused system is positioned as a flexible tool for creators who need more than simple generative imagery. The dual control over semantic scope and appearance fidelity means the same tool can serve very different needs, from creative IP development to production-level photo retouching.

Adding or removing text to images

Another standout capability is bilingual text editing. Qwen-Image-Edit allows users to add, remove, or modify text in both Chinese and English while preserving font, size, and style. This expands on Qwen-Image’s reputation for strong text rendering, particularly in challenging scenarios like intricate Chinese characters. In practice, this allows for accurate editing of posters, signs, T-shirts, or calligraphy artworks where small text details matter, as seen in another example from Replicate.

One demonstration involved correcting errors in a piece of generated Chinese calligraphy through a step-by-step chained editing process. Users could highlight incorrect regions, instruct the system to fix them, and then further refine details until the correct characters were rendered. This iterative approach shows how the model can be applied to high-stakes editing tasks where precision is essential.

Applications and use cases

The Qwen team has highlighted a range of potential applications:

Creative design and IP expansion, such as generating mascot-based emoji packs.

Advertising and content creation, where logos, signage, and text-heavy visuals can be


Scaling agentic AI safely — and stopping the next big security breach

Presented by Gravitee

There’s a growing disconnect between AI ambition and operational readiness, as businesses race to prove value with AI before their competitors. Across numerous organizations, a growing number of AI agents are deployed and operating without guardrails, and that’s going to have major consequences sooner rather than later, says Rory Blundell, CEO of Gravitee.

“Organizations are rushing to implement AI agents without the necessary security frameworks or structured onboarding processes in place,” Blundell explains. “As a result, we believe there’s a strong likelihood that within the next couple of years, there will be a major data breach caused by an agent acting outside of its intended remit, whether unknowingly or due to oversight by its human operators. It’s a risk that businesses must get ahead of now, before it’s too late.”

According to Gravitee’s recent State of Agentic AI survey, 72% of organizations are already using agentic AI systems. Additionally, 75% of respondents cite governance as their top concern. However, many global business leaders still don’t fully understand the breadth of risks inherent in their agentic experiments, especially as the number of agents they deploy stacks up.

The risks of accelerated agent sprawl

The challenges of agent sprawl now echo those of the early API days: individual teams spin up their own agents to tackle specific tasks, from chatbots to workflow automation, but without a centralized plan. Before long, agents are interacting with LLMs, triggering actions, or tapping into sensitive tools and data, all without shared oversight or visibility into performance, security posture, or cost, and the consequences could be far-reaching.

“You’re going to have exactly the same challenge you had with services and micro-services 10 or 15 years ago,” Blundell says. “As more agents are created without centralized control, it becomes nearly impossible to monitor their behavior, ensure they’re operating efficiently, or maintain critical security in all the interactions happening all the time between tools, LLMs, and agents. Then you’ve got badly monitored agents with clashing protocols and unsupervised behaviors that hamper speed to innovation, gum up systems, and actually cause inefficiencies instead of solving them.”

Speed versus innovation

Long-established enterprises with decades of pre-AI security, governance and control measures already in place have another challenge: balancing strict safety protocols for agents against getting overtaken by competitors who are faster and more nimble, or who throw agents at the wall and see what sticks. Those companies that are rushing into the fray might be exposing themselves to major risk, but interestingly, they’re also advancing far more quickly, Blundell says.

“When you don’t yet have the required security measures in place or a properly structured way to introduce more agents, you’re going to self-censor, and speed will reduce,” Blundell says. “You therefore risk the business not being able to achieve what it should achieve, and risk your overall company prosperity. Other businesses that don’t have that baggage will be able to accelerate beyond you, and this is something we’re keen to help businesses prevent.”

The role of centralized governance and control

It’s now possible to address all of these challenges at once with a centralized governance layer that provides visibility into an entire agentic system through a unified interface.
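As a generic illustration of the kind of centralized checkpoint such a governance layer adds, a minimal sketch might look like the following. It is not Gravitee’s Agent Mesh or the A2A protocol’s actual API (the article does not detail either); the policy rules, audit log, and function names are illustrative assumptions.

```python
# Generic sketch of a governance checkpoint wrapped around agent tool calls.
# The policy list, audit log, and call signature are illustrative assumptions,
# not the Gravitee Agent Mesh or A2A APIs.
import time
from typing import Any, Callable, Dict

AUDIT_LOG: list = []
ALLOWED_TOOLS = {"search_catalog", "summarize_ticket"}  # example policy

def governed_call(agent_name: str, tool: str, payload: Dict[str, Any],
                  handler: Callable[[Dict[str, Any]], Dict[str, Any]]) -> Dict[str, Any]:
    # Block calls that fall outside the agent's permitted remit.
    if tool not in ALLOWED_TOOLS:
        AUDIT_LOG.append({"agent": agent_name, "tool": tool, "decision": "blocked"})
        raise PermissionError(f"{agent_name} is not permitted to call {tool}")
    # Execute the call and record latency for observability.
    start = time.time()
    result = handler(payload)
    AUDIT_LOG.append({
        "agent": agent_name, "tool": tool, "decision": "allowed",
        "latency_s": round(time.time() - start, 3),
    })
    return result

# Example: a permitted call is executed and logged; a blocked one raises.
print(governed_call("support-bot", "summarize_ticket", {"id": "T-99"},
                    lambda p: {"summary": f"Ticket {p['id']} resolved"}))
print(AUDIT_LOG)
```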
Putting a solution like Gravitee’s Agent Mesh in place can significantly accelerate innovation almost immediately, especially once an organization resolves performance issues caused by overburdened, inefficient agents that monopolize resources. The mesh is underpinned by a secure communication protocol, because agents are only as useful as their ability to coordinate and safely share data. It uses Google’s open Agent-to-Agent (A2A) protocol: an open-source project developed by Google Cloud in partnership with 50+ other companies, including Gravitee. The A2A protocol manages secure information exchange and coordinated actions among autonomous AI agents, regardless of whether their underlying technology or framework matches, allowing them to discover one another, authenticate securely, and collaborate, ensuring organizations are protected.

The A2A protocol and other agent governance tools, such as Gravitee’s Agent Mesh, are designed to reduce duplication as well as enforce policy. The mesh also adds the observability and order that ensure multi-agent ecosystems are governed, efficient, and aligned with internal policies.

When governance comes into play

Given the current pace of AI enthusiasm and experimentation, most organizations are not starting their agentic AI journey with governance — nor should they, especially in the early stages, says Blundell. However, a four-stage AI readiness maturity model can help guide business progression:

Stage one: Proof of concept and experimentation

Stage two: Incorporation of tools and LLMs, and single-agent use

Stage three: Multi-agent deployment

Stage four: Fully governed, observable, and secure multi-agent ecosystems

While AI tools and LLMs are becoming more commonplace, true multi-agent architectures remain rare. Most organizations today fall somewhere between stages one and two, says Blundell — and that’s exactly where they should be, continuing to test, learn, and experiment before layering in governance.

“Leaving governance to a later stage might seem contradictory, but it won’t actually make sense unless you start to understand what AI means to your organization first,” he explains. “Our recommendation is to go through all the initial stages of AI readiness and early agent proof of concepts, figuring out where agents fit in and how, before you consider a governance strategy. You don’t want to be locked into a framework that doesn’t work for the multi-agent architecture that works for your business.”

An organization grappling with a messy, ungoverned, multi-agent architecture can plug existing agent APIs directly into the Gravitee Agent Mesh, without any risk to how that architecture functions. “Upgrading to Agent Mesh just requires a very short period of time to direct everything to the right places and centralize your agents, start monitoring performance, and enforcing best practices,” he says. “And then you’ll have unlocked the ability to accelerate forward to full AI maturity.”

Dig deeper: To learn more about the role of governance and control when supercharging AI innovation, and why observability is critical to an agentic strategy that delivers fast, visit here. Also, Gravitee is hosting an A2A Summit for


VB AI Impact Series: Can you really govern multi-agent AI?

Single copilots are yesterday’s news. Competitive differentiation is all about launching a network of specialized agents that collaborate, self-critique, and call the right model for every step. The latest installment of VentureBeat’s AI Impact Series, presented by SAP in San Francisco, tackled the issue of deploying and governing multi-agent AI systems. Yaad Oren, managing director of SAP Labs U.S. and global head of research & innovation at SAP, and Raj Jampa, SVP and CIO at Agilent, an analytical and clinical laboratory technology firm, discussed how to deploy these systems in real-world environments while staying inside cost, latency, and compliance guardrails.

SAP’s goal is to ensure that customers can scale their AI agents, but safely, Oren said. “You can be almost fully autonomous if you like, but we make sure there are a lot of checkpoints and monitoring to help to improve and fix,” he said. “This technology needs to be monitored at scale. It’s not perfect yet. This is the tip of the iceberg around what we’re doing to make sure that agents can scale, and also minimize any vulnerabilities.”

Deploying active AI pilots across the organization

Right now, Agilent is actively integrating AI across the organization, Jampa said. The results are promising, but they’re still in the process of tackling those vulnerability and scaling issues. “We’re in a stage where we’re seeing results,” he explained. “We’re now having to deal with problems like, how do we enhance monitoring for AI? How do we do cost optimization for AI? We’re definitely in the second stage of it, where we’re not exploring anymore. We’re looking at new challenges and how we deal with these costs and monitoring tools.”

Within Agilent, AI is deployed across three strategic pillars, Jampa said. First, on the product side, they’re exploring how to accelerate innovation by embedding AI into the instruments they develop. Second, on the customer-facing side, they’re identifying which AI capabilities will deliver the greatest value to their clients. Third, they’re applying AI to internal operations, building solutions like self-healing networks to boost efficiency and capacity.

“As we implement these use cases, one thing that we’ve focused on a lot is the governance framework,” Jampa explained. That includes setting policy-based boundaries and ensuring the guardrails for each solution remove unnecessary restrictions while still maintaining compliance and security. The importance of this was recently underscored when one of their agents did a config update, but they didn’t have a check in place to ensure its boundaries were solid. The upgrade immediately caused issues, Jampa said — but the network was quick to detect them, because the second piece of the pillar is auditing, or ensuring that every input and every output is logged and can be traced back.

Adding a human layer is the last piece. “The small, lowercase use cases are pretty straightforward, but when you talk about natural language, big translations, those are scenarios where we have complex models involved,” he said. “For those bigger decisions, we add the element where the agent says, I need a human to intervene and approve my next step.” And the question of speed versus accuracy comes into play early during the decision-making process, he added, because costs can add up fast. Complex models for low-latency tasks push those costs substantially higher.
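As a rough sketch of the human-in-the-loop checkpoint and input/output tracing Jampa describes, something like the following could sit in front of agent actions. The stakes threshold, logging format, and function names are illustrative assumptions, not Agilent’s or SAP’s actual implementation.

```python
# Illustrative sketch: route high-stakes agent actions to a human approver and
# log every input and output so decisions can be traced later. All names and
# thresholds are assumptions for illustration only.
import json
import time

TRACE = []

def record(event: dict) -> None:
    event["ts"] = time.time()
    TRACE.append(event)

def execute_agent_action(action: str, payload: dict, stakes: str) -> str:
    record({"type": "input", "action": action, "payload": payload, "stakes": stakes})
    if stakes == "high":
        # The agent pauses and asks a person to approve the next step.
        approved = input(f"Approve '{action}' with {json.dumps(payload)}? [y/N] ") == "y"
        record({"type": "approval", "action": action, "approved": approved})
        if not approved:
            return "action rejected by human reviewer"
    result = f"performed {action}"  # stand-in for the real agent/tool call
    record({"type": "output", "action": action, "result": result})
    return result

print(execute_agent_action("update_network_config", {"vlan": 42}, stakes="high"))
print(json.dumps(TRACE, indent=2))
```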
A governance layer helps monitor the speed, latency and accuracy of agent results, so that they can identify opportunities to build on their existing deployments and continue to expand their AI strategy.

Solving agent integration challenges

Integration between AI agents and existing enterprise solutions remains a major pain point. While legacy on-premise systems can connect through data APIs or event-driven architecture, the best practice is to first ensure all solutions operate within a cloud framework. “As long as you have the cloud solution, it’s easier to have all the connections, all the delivery cycles,” Oren said. “Many enterprises have on-premise installations. We’re helping, using AI and agents, to migrate them into the cloud solution.”

With SAP’s integrated tool chain, complexities like customization of legacy software are easily maintained in the cloud as well. Once everything is within the cloud infrastructure, the data layer comes in, which is equally if not more important. At SAP, the Business Data Cloud serves as a unified data platform that brings together information from both SAP and non-SAP sources. Much like Google indexes web content, the Business Data Cloud can index business data and add semantic context. Added Oren: “The agents then have the ability to connect and create business processes end-to-end.”

Addressing gaps in enterprise agentic activations

While many elements factor into the equation, three are critical: the data layer, the orchestration layer, and the privacy and security layer. Clean, well-structured data is, of course, crucial, and successful agentic deployments depend on a unified data layer. The orchestration layer manages agent connections, enabling powerful agentic automation across the system. “The way you orchestrate [agents] is a science, but an art as well,” Oren says. “Otherwise, you can have not only failures, but also auditing and other challenges.”

Finally, investing in security and privacy is non-negotiable — especially when a swarm of agents is operating across your databases and enterprise architecture, where authorization and identity management are paramount. For example, an HR team member may need access to salary or personally identifiable information, but no one else should be able to view it.

We’re headed toward a future in which human enterprise teams are joined by agent and robotic team members, and that’s when identity management becomes even more vital, Oren said. “We’re starting to look at agents more and more like they’re humans, but they need extra monitoring,” he added. “This involves onboarding and authorization. It also needs change management. Agents are starting to take on a professional personality that you need to maintain, just like an employee, just with much more monitoring and improvement. It’s not autonomous in terms of life cycle management. You have checkpoints to see what you need to change and improve.”


Stop benchmarking in the lab: Inclusion Arena shows how LLMs perform in production

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now Benchmark testing models have become essential for enterprises, allowing them to choose the type of performance that resonates with their needs. But not all benchmarks are built the same and many test models are based on static datasets or testing environments.  Researchers from Inclusion AI, which is affiliated with Alibaba’s Ant Group, proposed a new model leaderboard and benchmark that focuses more on a model’s performance in real-life scenarios. They argue that LLMs need a leaderboard that takes into account how people use them and how much people prefer their answers compared to the static knowledge capabilities models have.  In a paper, the researchers laid out the foundation for Inclusion Arena, which ranks models based on user preferences.   “To address these gaps, we propose Inclusion Arena, a live leaderboard that bridges real-world AI-powered applications with state-of-the-art LLMs and MLLMs. Unlike crowdsourced platforms, our system randomly triggers model battles during multi-turn human-AI dialogues in real-world apps,” the paper said.  AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO Inclusion Arena stands out among other model leaderboards, such as MMLU and OpenLLM, due to its real-life aspect and its unique method of ranking models. It employs the Bradley-Terry modeling method, similar to the one used by Chatbot Arena.  Inclusion Arena works by integrating the benchmark into AI applications to gather datasets and conduct human evaluations. The researchers admit that “the number of initially integrated AI-powered applications is limited, but we aim to build an open alliance to expand the ecosystem.” By now, most people are familiar with the leaderboards and benchmarks touting the performance of each new LLM released by companies like OpenAI, Google or Anthropic. VentureBeat is no stranger to these leaderboards since some models, like xAI’s Grok 3, show their might by topping the Chatbot Arena leaderboard. The Inclusion AI researchers argue that their new leaderboard “ensures evaluations reflect practical usage scenarios,” so enterprises have better information around models they plan to choose.  Using the Bradley-Terry method  Inclusion Arena draws inspiration from Chatbot Arena, utilizing the Bradley-Terry method, while Chatbot Arena also employs the Elo ranking method concurrently.  Most leaderboards rely on the Elo method to set rankings and performance. Elo refers to the Elo rating in chess, which determines the relative skill of players. Both Elo and Bradley-Terry are probabilistic frameworks, but the researchers said Bradley-Terry produces more stable ratings.  “The Bradley-Terry model provides a robust framework for inferring latent abilities from pairwise comparison outcomes,” the paper said. “However, in practical scenarios, particularly with a large and growing number of models, the prospect of exhaustive pairwise comparisons becomes computationally prohibitive and resource-intensive. 
To make ranking more efficient in the face of a large number of LLMs, Inclusion Arena adds two other components: a placement match mechanism and proximity sampling. The placement match mechanism estimates an initial ranking for new models registered to the leaderboard. Proximity sampling then limits comparisons to models within the same trust region, that is, models with similar estimated ratings.

How it works

Inclusion Arena’s framework integrates into AI-powered applications. Currently, two apps feed the leaderboard: the character chat app Joyland and the education communication app T-Box. When people use the apps, their prompts are sent to multiple LLMs behind the scenes. The users then choose which answer they like best, without knowing which model generated each response.

The framework uses these preferences to generate pairs of models for comparison, and the Bradley-Terry algorithm then calculates a score for each model, producing the final leaderboard.

Inclusion AI capped its experiment at data gathered through July 2025, comprising 501,003 pairwise comparisons. In these initial experiments, the top-performing model was Anthropic’s Claude 3.7 Sonnet, followed by DeepSeek v3-0324, Claude 3.5 Sonnet, DeepSeek v3 and Qwen Max-0125. This was data from only two apps, with more than 46,611 active users, according to the paper; the researchers said a more robust and precise leaderboard will come with more data.

More leaderboards, more choices

The growing number of models being released makes it harder for enterprises to decide which LLMs to evaluate. Leaderboards and benchmarks point technical decision makers toward models that could deliver the best performance for their needs; organizations should then conduct internal evaluations to confirm those LLMs work for their applications. Leaderboards also offer a view of the broader LLM landscape, highlighting which models are becoming competitive with their peers. Recent benchmarks such as RewardBench 2 from the Allen Institute for AI attempt to align models with real-life use cases for enterprises.

source
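The paper’s placement-match and trust-region mechanics are more involved than this, but the basic idea behind proximity sampling can be sketched as follows: pick an anchor model, then draw an opponent whose current rating sits within a small window, so each user-judged battle is between roughly comparable models. The threshold and ratings below are illustrative assumptions, not values from the paper.

```python
import random

def pick_battle_pair(ratings, delta=0.05):
    """Choose two models to answer the next user prompt.

    ratings: dict of model name -> current Bradley-Terry strength.
    delta:   width of the 'trust region'; only models whose strengths are
             within delta of the anchor are considered, so battles pit
             roughly comparable models against each other.
    """
    anchor = random.choice(list(ratings))
    rivals = [m for m in ratings
              if m != anchor and abs(ratings[m] - ratings[anchor]) <= delta]
    if not rivals:  # no nearby rival: fall back to any other model
        rivals = [m for m in ratings if m != anchor]
    return anchor, random.choice(rivals)

# Example with made-up ratings (e.g., output of the fitting step sketched above).
ratings = {"model_a": 0.52, "model_b": 0.28, "model_c": 0.20}
print(pick_battle_pair(ratings))
```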

Stop benchmarking in the lab: Inclusion Arena shows how LLMs perform in production Read More »

The looming crisis of AI speed without guardrails

OpenAI’s GPT-5 has arrived, bringing faster performance, more dependable reasoning and stronger tool use. It joins Claude Opus 4.1 and other frontier models in signaling a rapidly advancing cognitive frontier. While artificial general intelligence (AGI) remains in the future, DeepMind’s Demis Hassabis has described this era as “10 times bigger than the Industrial Revolution, and maybe 10 times faster.” According to OpenAI CEO Sam Altman, GPT-5 is “a significant fraction of the way to something very AGI-like.”

What is unfolding is not just a shift in tools, but a reordering of personal value, purpose, meaning and institutional trust. The challenge ahead is not only to innovate, but to build the moral, civic and institutional frameworks necessary to absorb this acceleration without collapse.

Transformation without readiness

Anthropic CEO Dario Amodei provided an expansive view in his 2024 essay Machines of Loving Grace. He imagined AI compressing a century of human progress into a decade, with commensurate advances in health, economic development, mental well-being and even democratic governance. However, “it will not be achieved without a huge amount of effort and struggle by many brave and dedicated people.” He added that everyone “will need to do their part both to prevent [AI] risks and to fully realize the benefits.”

That is the fragile fulcrum on which these promises rest. Our AI-fueled future is near, even as the destination of this cognitive migration, which is nothing less than a reorientation of human purpose in a world of thinking machines, remains uncertain. While my earlier articles mapped where people and institutions must migrate, this one asks how we match acceleration with capacity.

What this moment asks of us is not just technical adoption but cultural and social reinvention. That is a hard ask, as our governance, educational systems and civic norms were forged in a slower, more linear era. They moved with the gravity of precedent, not the velocity of code.

Empowerment without inclusion

In a New Yorker essay, Dartmouth professor Dan Rockmore describes how a neuroscientist colleague, on a long drive, conversed with ChatGPT and together they brainstormed a possible solution to a problem in his research. ChatGPT suggested he investigate a technique called “disentanglement” to simplify his mathematical model. The bot then wrote some code that was waiting at the end of the drive. The researcher ran it, and it worked. He said of the experience: “I feel like I’m accelerating with less time, I’m accelerating my learning, and improving my creativity, and I’m enjoying my work in a way I haven’t in a while.”

This is a compelling illustration of how powerful emerging AI technology can be in the hands of certain professionals. It is an excellent thought partner and collaborator, ideal for a university professor or anyone tasked with developing innovative ideas. But what about its usefulness for, and impact on, everyone else?
Consider the logistics planners, procurement managers and budget analysts whose roles risk displacement rather than enhancement. Without targeted retraining, robust social protections or institutional clarity, their futures could quickly move from uncertain to untenable. The result is a yawning gap between what our technologies enable and what our social institutions can support. That is where true fragility lies: not in the AI tools themselves, but in the expectation that our existing systems can absorb their impact without fracture.

Change without infrastructure

Many have argued that some amount of societal disruption always accompanies a technological revolution, as when wagon wheel manufacturers were displaced by the rise of the automobile. But these narratives quickly shift to the wonders of what came next. The Industrial Revolution, now remembered for its long-term gains, began with decades of upheaval, exploitation and institutional lag. Public health systems, labor protections and universal education were not designed in advance. They emerged later, often painfully, as reactions to harms already done. Charles Dickens’ Oliver Twist, with its orphaned child laborers and brutal workhouses, captured the social dislocation of that era with haunting clarity. The book was not a critique of technology itself, but of a society unprepared for its consequences.

If the AI revolution is, as Hassabis suggests, an order of magnitude greater in scope and speed of implementation than that earlier transformation, then our margin for error is commensurately narrower and the timeline for societal response more compressed. In that context, hope is at best an invitation to dialogue and, at worst, a soft response to hard and fast-arriving problems.

Vision without pathways

What are those responses? Despite the sweeping visions, there remains little consensus on how these ambitions will be integrated into the core functions of society. What does a “gentle singularity” look like in an understaffed, underfunded hospital? How do “machines of loving grace” support a public school system still struggling to provide basic literacy? How do these utopian aspirations square with predictions of 20% unemployment within five years?

For all the talk of transformation, the mechanisms for wealth distribution, societal adaptation and business accountability remain vague at best. In many cases, AI is arriving haphazardly on unfettered market momentum. Language models are being embedded into government services, customer support, financial platforms and legal assistance tools, often without transparent review or meaningful public discourse, and almost certainly without regulation. Even when these tools are helpful, their rollout bypasses the democratic and institutional channels that would otherwise confer trust. They arrive not through deliberation but as fait accompli, products of unregulated market momentum.

It is no wonder, then, that the result is not a coordinated march toward abundance, but a patchwork of adoption defined more by technical possibility than social preparedness. In this environment, power accrues

The looming crisis of AI speed without guardrails Read More »