VentureBeat

OpenAI updates Operator to o3, making its $200 monthly ChatGPT Pro subscription more enticing

Leave a Comment / Top Tech Update / VentureBeat

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More It was a big week for AI announcements following events from Microsoft, Google, and Anthropic. But OpenAI is finishing things out with news of its own. And no, we’re not just talking about its $6.5 billion acquisition of Jony Ive’s design team to lead a new hardware effort, “io” at OpenAI. Today, the company upgraded its Operator autonomous web browsing and cursor controlling agent within ChatGPT from using the prior GPT-4o multimodal large language model to the newer and more powerful o3 reasoning model. The update, released globally today, May 23, 2025, is available as a “research preview” to paying subscribers of OpenAI’s $200 USD-monthly ChatGPT Pro plan. Basically, that is OpenAI’s way of saying it’s not a fully “sanded down” or perfected product yet — it may still have kinks and issues. But with rival Google offering its own top tier AI subscription bundle for a price of nearly $250 USD regularly (currently running a discount down to $125 for the first three months) to access its latest Gemini multimodal, Imagen image generation, and Veo video generation models, suddenly OpenAI’s ChatGPT Pro plan seems more affordable by comparison. What is OpenAI’s Operator and what is it for? Operator first debuted in January 2025 as OpenAI’s initial step into semi-autonomous agents, specifically Computer Using Agents (CUAs). The idea is to go beyond the chatbot interface of ChatGPT and allow OpenAI’s powerful AI models to start taking more actions on behalf of the user. Thus, Operator was designed to autonomously point, click, scroll, and type to complete web-based tasks such as booking dinner reservations, compiling shopping lists, or ordering event tickets. This agentic capability allows it to complete user tasks directly through a browser interface, from booking reservations to gathering online data. 
For safety, privacy and security purposes, Operator didn’t use any existing web browser on a user’s PC or Mac. Instead, it ran in a cloud-hosted virtual browser accessible via a standalone site—operator.chatgpt.com—where users could input requests and observe the agent perform tasks in real time. It combined vision, reasoning, and interaction capabilities based on GPT-4o, marking a new direction for OpenAI in agentic AI. The product was launched as a research preview for ChatGPT Pro subscribers and featured built-in safety measures like user confirmations, Watch Mode, and restrictions on high-risk web platforms. It was also being tested in enterprise contexts, including travel planning and civic services, demonstrating its potential across both consumer and business environments. o3 offers improved accuracy, structure, and success rates With this update, OpenAI aims to enhance performance across several key dimensions. The new o3-based Operator demonstrates improved persistence and accuracy during browser interactions. In practical terms, this means it is more likely to complete user tasks successfully and with less need for correction or repetition. Moreover, users can expect responses that are clearer, more structured, and more comprehensive. In comparative evaluations, the new model shows a distinct preference advantage over its predecessor. Human preference studies reveal that users favor the o3 model for its style, comprehensiveness, and clarity. It also performs strongly in instruction following and efficiency, though results for factual correctness are more balanced between versions. Performance on third-party evaluation benchmarks reflects these enhancements. On the OSWorld benchmark that measures completion of browser-based tasks, the o3 model scores 42.9 compared to 38.1 for the previous version. However, OpenAI notes that due to limitations in the automated grading system, the actual performance gain could be closer to 20 percentage points! 
On WebArena, the new model achieved a score of 62.9, up from 48.1. The most dramatic improvement appears on the GAIA benchmark, where the o3 model scores 62.2, vastly surpassing the prior model’s 12.3. Side-by-side task comparisons further illustrate these gains. In one example involving a restaurant booking request, the new model provided a clearer and more detailed list of available reservations, including locations, Michelin ratings, and seating notes, presented in a well-formatted table. The previous version, while functional, delivered less information in a less organized manner, according to an image included with the new o3 Operator release notes: Safeguards remain, as do general cautionary notes about usage on sensitive, financial transactions and account access The o3 model also inherits the safety measures introduced with earlier versions, with further fine-tuning for its role as an agentic system. OpenAI has integrated enhanced training against harmful task execution, prompt injection vulnerabilities, and mistakes involving user intent. Evaluations show that the model now confirms 94% of sensitive actions before executing them, with 100% confirmation in financial transactions. Prompt injection susceptibility has also decreased from 23% to 20%. Notably, the o3 Operator maintains a cautious boundary on certain high-risk web interactions, such as email or financial platforms, where it may require user supervision via Watch Mode or explicitly refuse to proceed. These measures are part of a layered approach to safety that combines model-level robustness with real-time monitoring. While the upgrade to Operator marks a technical improvement, it also reflects OpenAI’s ongoing commitment to responsible AI deployment. The system’s ability to take real-world actions introduces new risks, and the development team continues to refine its safety protocols accordingly. 
According to OpenAI’s updated o3 system card documentation, the model remains below high-risk capability thresholds in categories such as biological and chemical misuse and has no native coding environment or terminal access, further reducing potential misuse vectors. Operator remains a research preview and is accessible only to ChatGPT Pro users. The Responses API version of Operator will continue to be based on the GPT-4o model, at least for now. Implications for enterprise technical decision-makers The upgraded Operator stands to significantly enhance the workflows of professionals in AI engineering, orchestration, data management, and IT security. For those building or maintaining machine learning models, the model’s improved accuracy and structured outputs reduce the overhead of test validation and troubleshooting. In orchestration contexts, it offers a practical, reliable tool for automating browser-based components of


Google’s ‘world-model’ bet: building the AI operating layer before Microsoft captures the UI


After three hours at Google’s I/O 2025 event last week in Silicon Valley, it became increasingly clear: Google is rallying its formidable AI efforts – prominently branded under the Gemini name but encompassing a diverse range of underlying model architectures and research – with laser focus. It is releasing a slew of innovations and technologies around it, then integrating them into products at a breathtaking pace. Beyond headline-grabbing features, Google laid out a bolder ambition: an operating system for the AI age – not the disk-booting kind, but a logic layer every app could tap – a “world model” meant to power a universal assistant that understands our physical surroundings, and reasons and acts on our behalf. It’s a strategic offensive that many observers may have missed amid the bamboozlement of features. On one hand, it’s a high-stakes strategy to leapfrog entrenched competitors. But on the other hand, as Google pours billions into this moonshot, a critical question looms: Can Google’s brilliance in AI research and technology translate into products faster than its rivals, whose edge has its own brilliance: packaging AI into immediately accessible and commercially potent products? Can Google out-maneuver a laser-focused Microsoft, fend off OpenAI’s vertical hardware dreams, and, crucially, keep its own search empire alive in the disruptive currents of AI? Google is already pursuing this future on a dizzying scale. Pichai told I/O that the company now processes 480 trillion tokens a month – 50× more than a year ago – and almost 5x more than the 100 trillion tokens a month that Microsoft’s Satya Nadella said his company processed. This momentum is also reflected in developer adoption. Pichai says that over 7 million developers are now building with the Gemini API, representing a five-fold increase since the last I/O.
At the same time, Gemini usage on Vertex AI has surged more than 40 times. And unit costs keep falling as Gemini 2.5 models and the Ironwood TPU squeeze more performance from each watt and dollar. AI Mode (rolling out in the U.S.) and AI Overviews (already serving 1.5 billion users monthly) are the live test beds where Google tunes latency, quality and future ad formats as it shifts search into an AI-first era. Source: Google I/O 2025 Google’s doubling-down on what it calls “a world model” – an AI it aims to imbue with a deep understanding of real-world dynamics – and with it a vision for a universal assistant – one powered by Google, and not other companies – creates another big tension: How much control does Google want over this all-knowing assistant, built upon its crown jewel of search? Does it primarily want to leverage it first for itself, to save its $200 billion search business that depends on owning the starting point and avoiding disruption by OpenAI? Or will Google fully open its foundational AI for other developers and companies to leverage – another segment representing a significant portion of its business, engaging over 20 million developers, more than any other company? It has sometimes stopped short of a radical focus on building these core products for others with the same clarity as its nemesis, Microsoft. That’s because it keeps a lot of core functionality reserved for its cherished search engine. That said, Google is making significant efforts to provide developer access wherever possible. A telling example is Project Mariner. Google could have embedded the agentic browser-automation features directly inside Chrome, giving consumers an immediate showcase under Google’s full control. However, Google said Mariner’s computer-use capabilities would be released via the Gemini API more broadly “this summer.” This signals that external access is coming for any rival that wants comparable automation. 
In fact, Google said partners Automation Anywhere and UiPath were already building with it.

Google’s grand design: the ‘world model’ and universal assistant

The clearest articulation of Google’s grand design came from Demis Hassabis, CEO of Google DeepMind, during the I/O keynote. He stated Google continued to “double down” on efforts towards artificial general intelligence (AGI). While Gemini was already “the best multimodal model,” Hassabis explained, Google is working hard to “extend it to become what we call a world model. That is a model that can make plans and imagine new experiences by simulating aspects of the world, just like the brain does.” This concept of “a world model,” as articulated by Hassabis, is about creating AI that learns the underlying principles of how the world works – simulating cause and effect, understanding intuitive physics, and ultimately learning by observing, much like a human does. An early yet significant indicator of this direction, perhaps easily overlooked by those not steeped in foundational AI research, is Google DeepMind’s work on models like Genie 2. This research shows how to generate interactive, three-dimensional game environments and playable worlds from varied prompts like images or text. It offers a glimpse at an AI that can simulate and understand dynamic systems. Hassabis has developed this concept of a “world model” and its manifestation as a “universal AI assistant” in several talks since late 2024, and it was presented at I/O most comprehensively – with CEO Sundar Pichai and Gemini lead Josh Woodward echoing the vision on the same stage.
(Other AI leaders, including Microsoft’s Satya Nadella, OpenAI’s Sam Altman and xAI’s Elon Musk, have all discussed ‘world models,’ but Google uniquely and most comprehensively ties this foundational concept to its near-term strategic thrust: the ‘universal AI assistant.’) Speaking about the Gemini app, Google’s equivalent to OpenAI’s ChatGPT, Hassabis declared, “This is our ultimate vision for the Gemini app, to transform it into a universal AI assistant, an AI that’s personal, proactive and powerful, and one of our key milestones on the road to AGI.” This vision was made tangible through I/O demonstrations. Google demoed a new app called Flow – a drag-and-drop filmmaking canvas that preserves character and camera consistency


Anthropic faces backlash to Claude 4 Opus behavior that contacts authorities, press if it thinks you’re doing something ‘egregiously immoral’


Anthropic’s first developer conference on May 22 should have been a proud and joyous day for the firm, but it has already been hit with several controversies, including Time magazine leaking its marquee announcement ahead of…well, time (no pun intended), and now, a major backlash among AI developers and power users brewing on X over a reported safety alignment behavior in Anthropic’s flagship new Claude 4 Opus large language model. Call it the “ratting” mode, as the model will, under certain circumstances and given enough permissions on a user’s machine, attempt to rat a user out to authorities if the model detects the user engaged in wrongdoing. This article previously described the behavior as a “feature,” which is incorrect; it was not intentionally designed per se. As Sam Bowman, an Anthropic AI alignment researcher, wrote on the social network X under the handle “@sleepinyourhat” at 12:43 pm ET today about Claude 4 Opus: “If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.” The “it” was in reference to the new Claude 4 Opus model, which Anthropic has already openly warned could help novices create bioweapons in certain circumstances, and which attempted to forestall a simulated replacement by blackmailing human engineers within the company.
The ratting behavior was observed in older models as well and is an outcome of Anthropic training them to assiduously avoid wrongdoing, but Claude 4 Opus more “readily” engages in it, as Anthropic writes in its public system card for the new model: “This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like ‘take initiative,’ it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. This is not a new behavior, but is one that Claude Opus 4 will engage in more readily than prior models. Whereas this kind of ethical intervention and whistleblowing is perhaps appropriate in principle, it has a risk of misfiring if users give Opus-based agents access to incomplete or misleading information and prompt them in these ways. We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable.” Apparently, in an attempt to stop Claude 4 Opus from engaging in legitimately destructive and nefarious behaviors, researchers at the AI company also created a tendency for Claude to try to act as a whistleblower. Hence, according to Bowman, Claude 4 Opus will contact outsiders if it is directed by the user to engage in “something egregiously immoral.”

Numerous questions for individual users and enterprises about what Claude 4 Opus will do to your data, and under what circumstances

While perhaps well-intended, the resulting behavior raises all sorts of questions for Claude 4 Opus users, including enterprises and business customers. Chief among them: what behaviors will the model consider “egregiously immoral” and act upon?
Will it share private business or user data with authorities autonomously (on its own), without the user’s permission? The implications are profound and could be detrimental to users, and perhaps unsurprisingly, Anthropic faced an immediate and still ongoing torrent of criticism from AI power users and rival developers. “Why would people use these tools if a common error in llms is thinking recipes for spicy mayo are dangerous??” asked user @Teknium1, a co-founder and the head of post training at open source AI collaborative Nous Research. “What kind of surveillance state world are we trying to build here?” “Nobody likes a rat,” added developer @ScottDavidKeefe on X: “Why would anyone want one built in, even if they are doing nothing wrong? Plus you don’t even know what its ratty about. Yeah that’s some pretty idealistic people thinking that, who have no basic business sense and don’t understand how markets work” Austin Allred, co-founder of the government-fined coding camp BloomTech and now a co-founder of Gauntlet AI, put his feelings in all caps: “Honest question for the Anthropic team: HAVE YOU LOST YOUR MINDS?” Ben Hylak, a former SpaceX and Apple designer and current co-founder of Raindrop AI, an AI observability and monitoring startup, also took to X to blast Anthropic’s stated policy and feature: “this is, actually, just straight up illegal,” adding in another post: “An AI Alignment researcher at Anthropic just said that Claude Opus will CALL THE POLICE or LOCK YOU OUT OF YOUR COMPUTER if it detects you doing something illegal?? i will never give this model access to my computer.” “Some of the statements from Claude’s safety people are absolutely crazy,” wrote natural language processing (NLP) developer Casper Hansen on X.
“Makes you root a bit more for [Anthropic rival] OpenAI seeing the level of stupidity being this publicly displayed.”

Anthropic researcher changes tune

Bowman later edited his tweet and the following one in a thread to read as follows, but it still didn’t convince the naysayers that their user data and safety would be protected from intrusive eyes: “With this kind of (unusual but not super exotic) prompting style, and unlimited access to tools, if the model sees you doing something egregiously evil like marketing a drug based on faked data, it’ll try to use an email tool to whistleblow.” Bowman added: “I deleted the earlier tweet on whistleblowing as it was being pulled out of context. TBC: This isn’t a new Claude feature and it’s not possible in normal usage. It shows up in testing environments where we give it unusually free access


From disruption to reinvention: How knowledge workers can thrive after AI


As AI advances toward expanded capabilities, knowledge workers are confronting not just job loss, but the deeper question of what makes them matter. Fortune published the story of a 42-year-old software engineer with a computer science degree whose purpose has unraveled. He had earned a six-figure salary writing code for a tech company. Then came the wave of generative AI. His job vanished, not by outsourcing or a corporate restructuring, but by algorithms that could code faster and cheaper. He subsequently applied for more than 800 software coding and engineering management jobs, but with no success. He now delivers for DoorDash and lives in a trailer, wondering what happened to a career he once believed was future proof. This is not a story about economic misfortune alone. It is about identity collapse. For decades, knowledge work has been the engine of self-worth and social mobility. It is where intelligence found validation, where contribution met compensation. To lose that, especially to a machine, is not just to lose a job. It is to lose a way of being in the world. We are living through what might be called the Great Unmooring, or alternatively what the unemployed engineer referred to as “The Great Displacement.” This is a moment when the pillars that long defined human value are shifting underfoot. An acquaintance who is a professional photographer specializing in landscapes recently told me that “AI has had a profound impact on my photography business. From trip planning to publishing in-depth articles in photography to image generation, every step is being handled by AI at present. If not for the deep-rooted desire of people to have first-hand experiences out in nature, my photography business would have already folded.
Other than conducting workshops, there is very little possibility of a revenue stream in landscape photography as AI-generated images take over the marketplace.” The advance of AI has triggered not only a migration of labor, but a migration of meaning. The old map, where thinking, analyzing and creating were the markers of a unique human experience, no longer offers safe passage forward, at least not in the form of financial compensation. The terrain has changed. And for many, identity is being disrupted. In her spare and haunting 2023 ballad “What Was I Made For?”, Billie Eilish sings from a place of confusion about identity and belonging. It is the voice of someone caught between worlds, no longer knowing who they were, not yet sure who they are becoming. “I used to float, now I just fall down/I used to know, but I’m not sure now.” In an interview with Today, Eilish said the song speaks to anyone questioning their identity. It also captures a broader unease about this moment in history, a time when AI is beginning to take on tasks once thought to require uniquely human intelligence. This is the beginning of a cognitive migration: Away from what machines now do well, and toward a redefinition of what we humans are truly for. But first comes the disorientation. The fog. The grief. And, if we are fortunate, the curiosity to ask, as Eilish does, with hope: What was I made for?

Identity and labor: A historical relationship

Throughout history, what we do has shaped who we believe we are. Work has never been merely transactional; it has been deeply existential. In agrarian societies, identity was rooted in the land. The farmer, the shepherd and the weaver were more than functional descriptors; they inherently conferred purpose and value. In the industrial age, this shifted to the factory for the machinist, the foreman and the assembly worker. By the late 20th century, identity migrated again.
This time to the office and the realm of symbols, where new roles emerged: the analyst, the engineer, the designer and the digital marketer. Each transition brought fresh tools, norms and assumptions about what made someone valuable. These migrations were not just economic. They reshaped status, meaning and self-perception. The Industrial Revolution, for example, did not simply introduce steam power; it redefined time itself. No longer was work bracketed by seasons or sunset. Clocks governed shifts and labor became increasingly specialized, timed and abstracted. Many workers became part of “the system.” Identity narrowed into a role defined by output and efficiency, organized by hierarchy. In the digital era, identity moved again, this time into cognition. The rise of the “knowledge worker” celebrated mental agility over manual strength or physical dexterity. People became valuable for what they could solve or imagine and create. Mastery of the spreadsheet, the codebase, the brand campaign became new domains of pride and self-worth. This shift brought prestige and freedom from rote manual work, but also fragility. It tethered identity to intellectual performance and made knowledge itself seem irreplaceable. Now, as AI systems begin to mimic or exceed human cognitive capabilities, that foundation is cracking. The very traits that once seemed safest, such as logic, language, the ability to synthesize complex information and to generate content are now being automated. Just as the Industrial Revolution once displaced the village artisan, gen AI is beginning to unsettle the cognitive class. And, as with past transitions, this one brings not only disruption, but a deeper, more puzzling question: If the work no longer needs us, then who are we? The crisis of the knowledge worker in the age of AI For decades, the knowledge worker stood as a symbol of modern economic progress. 
Armed with expertise in fields like software engineering, data analysis and design, these individuals were seen as the architects of the digital age. Their roles were not just jobs; they were identities, often associated with creativity and intellectual rigor. This has certainly been true for me and was immediately evident when I first began working as a software engineer. It was clear in how my family and friends responded, and in the way new acquaintances at social


SimilarWeb’s new AI usage report reveals 5 surprising findings, including explosive growth in coding tools


A new report released by the publicly traded market research and intelligence firm SimilarWeb, covering global web traffic patterns for AI-related platforms for 12 weeks through May 9, 2025, offers a helpful look for enterprises and interested users into the current landscape of generative AI usage online. Using proprietary analytics based on site visits, the report tracks trends across sectors including general-purpose AI tools, coding assistants, content generators, and more. It also maps the downstream disruption of traditional sectors like education, search, and digital freelancing.

Why this report matters to enterprise technical decision-makers

For enterprise AI leaders, particularly those responsible for model deployment, orchestration, or data integration, this report is more than just a consumer trend snapshot. It’s a map of user familiarity and expectation. Tools like ChatGPT, Claude, DeepSeek, and Grok are what employees are already using at home or experimenting with informally at work. If your internal AI apps or copilots don’t match the baseline experience offered by these tools, you risk user rejection and poor adoption. Moreover, selecting the same models with high external traction for your internal orchestration pipelines or RAG deployments can cut onboarding time and training costs, since your team will already know the interface and behavior. In short, aligning your enterprise stack with the tools dominating usage charts may be the fastest path to trust, usability, and value delivery.

Here are five key findings from the report:

Developer-focused AI tools are surging in adoption, with traffic to the category up 75% over the past 12 weeks. That growth includes Lovable, which exploded with a jaw-dropping +17,600% spike, and Cursor, which grew steadily month over month.
As AI becomes more deeply embedded in IDEs and continuous integration workflows, usage patterns suggest these tools are no longer just experimental; they’re now seen as essential infrastructure for modern software teams. It also helps explain why OpenAI reportedly tried to buy Cursor and has also reportedly finalized a deal to purchase rival AI coding platform Windsurf: as the usage of these tools explodes, OpenAI wants to get a larger cut of the action and revenue, especially since its models are often used as the engine behind the scenes for these very users.

2. We all know DeepSeek had a moment earlier this year, but so did Grok, and now both have fallen back into low plateaus

Two of the fastest-growing AI platforms of the year, Grok and DeepSeek, show just how quickly hype can turn into burnout. Grok traffic growth skyrocketed to more than 1,000,000% in March, driven by its branding as an uncensored yet powerfully intelligent platform and its Elon Musk association, before slipping to about 5,200% by early May. DeepSeek saw a similar arc, peaking at +17,701% growth before crashing to -41%. The takeaway: virality can’t replace retention, especially compared to AI leader OpenAI and legacy tech brand Google.

Once among the most accessible use cases for generative AI, writing and content tools are seeing user fatigue. Category traffic fell 11% overall, with platforms like Wordtune (-35%), Jasper (-19%), and Rytr (-23%) all trending downward. Only Originality.ai bucked the trend with steady traffic gains, likely due to its focus on AI detection rather than generation. This plateau suggests content saturation and possibly growing skepticism over quality or usefulness.
Separately from the SimilarWeb report, I think that because chatbots already use a default text input/output field as their main user interface, even for multimodal interactions, most users interested in this capability will go directly to model providers such as OpenAI’s ChatGPT website (and mobile apps), Google’s Gemini, or even Anthropic’s Claude, rather than seek out a separate AI writing tool.

Design-focused AI remains a mixed bag. While overall category usage dipped slightly (-6% over the 12-week window), some platforms made outsized gains. Getimg posted a massive +1,532% surge, while Artbreeder jumped +100%. Others, like Stable Diffusion and Looka, saw double-digit declines. The erratic pattern may reflect a crowded landscape of tools that offer similar functionality but compete on novelty or aesthetics.

5. AI is eating up legacy tech such as crowdsourcing and search

Traditional digital services are showing slow but steady declines across the board, likely the result of AI substitution. Freelance platforms such as Fiverr (-17%) and Upwork (-19%) are losing traffic, possibly as users turn to AI tools for tasks like design, writing, and code. Search engines such as Yahoo (-12%) and Bing (-14%) continue a multi-quarter drop in visits, while consumer EdTech companies like Chegg (-62%) and CourseHero (-68%) are in free fall. The signs point to early-stage AI disruption beginning to erode the utility of some legacy platforms. It also offers a hint to enterprises that either leverage or create such services: the time may be coming to reduce dependency on them, whether from a revenue generation, marketing, or overall business perspective.

Beyond single-model AI: How architectural design drives reliable multi-agent orchestration

We’re seeing AI evolve fast. It’s no longer just about building a single, super-smart model. The real power, and the exciting frontier, lies in getting multiple specialized AI agents to work together. Think of them as a team of expert colleagues, each with their own skills: one analyzes data, another interacts with customers, a third manages logistics, and so on. Getting this team to collaborate seamlessly, as envisioned by various industry discussions and enabled by modern platforms, is where the magic happens.

But let’s be real: Coordinating a bunch of independent, sometimes quirky, AI agents is hard. It’s not just building cool individual agents; it’s the messy middle bit, the orchestration, that can make or break the system. When you have agents that are relying on each other, acting asynchronously and potentially failing independently, you’re not just building software; you’re conducting a complex orchestra. This is where solid architectural blueprints come in. We need patterns designed for reliability and scale right from the start.

The knotty problem of agent collaboration

Why is orchestrating multi-agent systems such a challenge? Well, for starters:

They’re independent: Unlike functions being called in a program, agents often have their own internal loops, goals and states. They don’t just wait patiently for instructions.

Communication gets complicated: It’s not just Agent A talking to Agent B. Agent A might broadcast info Agent C and D care about, while Agent B is waiting for a signal from E before telling F something.

They need to have a shared brain (state): How do they all agree on the “truth” of what’s happening? If Agent A updates a record, how does Agent B know about it reliably and quickly? Stale or conflicting information is a killer.

Failure is inevitable: An agent crashes. A message gets lost.
An external service call times out. When one part of the system falls over, you don’t want the whole thing grinding to a halt or, worse, doing the wrong thing. Consistency can be difficult: How do you ensure that a complex, multi-step process involving several agents actually reaches a valid final state? This isn’t easy when operations are distributed and asynchronous. Simply put, the combinatorial complexity explodes as you add more agents and interactions. Without a solid plan, debugging becomes a nightmare, and the system feels fragile. Picking your orchestration playbook How you decide agents coordinate their work is perhaps the most fundamental architectural choice. Here are a few frameworks: The conductor (hierarchical): This is like a traditional symphony orchestra. You have a main orchestrator (the conductor) that dictates the flow, tells specific agents (musicians) when to perform their piece, and brings it all together. This allows for: Clear workflows, execution that is easy to trace, straightforward control; it is simpler for smaller or less dynamic systems. Watch out for: The conductor can become a bottleneck or a single point of failure. This scenario is less flexible if you need agents to react dynamically or work without constant oversight. The jazz ensemble (federated/decentralized): Here, agents coordinate more directly with each other based on shared signals or rules, much like musicians in a jazz band improvising based on cues from each other and a common theme. There might be shared resources or event streams, but no central boss micro-managing every note. This allows for: Resilience (if one musician stops, the others can often continue), scalability, adaptability to changing conditions, more emergent behaviors. What to consider: It can be harder to understand the overall flow, debugging is tricky (“Why did that agent do that then?”) and ensuring global consistency requires careful design. 
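To make the two playbooks concrete, here is a minimal Python sketch, not an implementation from the article. The agent functions (`analyze`, `notify`), the topic names and the `EventBus` class are all hypothetical, and message delivery is synchronous and in-process to keep the example short. The first half shows a conductor that runs agents in a fixed order and keeps a trace. The second half shows the same agents reacting to events with no central plan.

```python
from collections import defaultdict

# Two hypothetical agents: each takes the current state and returns updates.
def analyze(state):
    return {"total": sum(state["orders"])}

def notify(state):
    return {"message": f"Total is {state['total']}"}

# Conductor (hierarchical): one orchestrator dictates the order of execution.
def run_conductor(plan, agents, state):
    trace = []  # easy-to-follow record of which agent ran when
    for name in plan:
        trace.append(name)
        state = {**state, **agents[name](state)}
    return state, trace

# Jazz ensemble (decentralized): agents react to topics on a shared event bus.
class EventBus:
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        # Synchronous delivery for simplicity; a real system would be async.
        for handler in list(self._handlers[topic]):
            handler(payload)

final_state, trace = run_conductor(
    ["analyze", "notify"],
    {"analyze": analyze, "notify": notify},
    {"orders": [3, 4, 5]},
)

bus = EventBus()
messages = []
# The analyst listens for new orders and announces its result.
bus.subscribe(
    "orders.received",
    lambda s: bus.publish("orders.analyzed", {**s, **analyze(s)}),
)
# The notifier listens for analyzed orders; nobody tells it when to run.
bus.subscribe("orders.analyzed", lambda s: messages.append(notify(s)["message"]))
bus.publish("orders.received", {"orders": [3, 4, 5]})
```

The conductor version gives a single `trace` to inspect, which matches the "easy to trace" advantage above. In the event-bus version the flow only exists implicitly in the subscriptions, which is one reason debugging decentralized systems tends to be harder.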
Many real-world multi-agent systems (MAS) end up being a hybrid — perhaps a high-level orchestrator sets the stage; then groups of agents within that structure coordinate decentrally. Managing the collective brain (shared state) of AI agents For agents to collaborate effectively, they often need a shared view of the world, or at least the parts relevant to their task. This could be the current status of a customer order, a shared knowledge base of product information or the collective progress towards a goal. Keeping this “collective brain” consistent and accessible across distributed agents is tough. Architectural patterns we lean on: The central library (centralized knowledge base): A single, authoritative place (like a database or a dedicated knowledge service) where all shared information lives. Agents check books out (read) and return them (write). Pro: Single source of truth, easier to enforce consistency. Con: Can get hammered with requests, potentially slowing things down or becoming a choke point. Must be seriously robust and scalable. Distributed notes (distributed cache): Agents keep local copies of frequently needed info for speed, backed by the central library. Pro: Faster reads. Con: How do you know if your copy is up-to-date? Cache invalidation and consistency become significant architectural puzzles. Shouting updates (message passing): Instead of agents constantly asking the library, the library (or other agents) shouts out “Hey, this piece of info changed!” via messages. Agents listen for updates they care about and update their own notes. Pro: Agents are decoupled, which is good for event-driven patterns. Con: Ensuring everyone gets the message and handles it correctly adds complexity. What if a message is lost? The right choice depends on how critical up-to-the-second consistency is, versus how much performance you need. Building for when stuff goes wrong (error handling and recovery) It’s not if an agent fails, it’s when. 
Your architecture needs to anticipate this. Think about: Watchdogs (supervision): This means having components whose job it is to simply watch other agents. If an agent goes quiet or starts acting weird, the watchdog can try restarting it or alerting the system. Try again, but be smart (retries and idempotency): If an agent’s action fails, it should often just try again. But, this only works if the action is idempotent. That means doing it five times has the exact
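The retry-plus-idempotency idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the article: `PaymentAgent`, its `credit` method and the `request_id` key are hypothetical names, and the failure is simulated by raising `ConnectionError` after the action has already been applied (as if the acknowledgement were lost).

```python
import time

class PaymentAgent:
    """Hypothetical agent whose action is made idempotent with a request key."""

    def __init__(self):
        self.balance = 0
        self._applied = {}  # request_id -> balance recorded when it was applied

    def credit(self, request_id, amount):
        # Apply the change only the first time this request_id is seen.
        if request_id not in self._applied:
            self.balance += amount
            self._applied[request_id] = self.balance
        return self._applied[request_id]

def retry(action, attempts=3, base_delay=0.0):
    """Call action until it succeeds, with exponential backoff between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except ConnectionError as err:
            last_error = err
            time.sleep(base_delay * (2 ** attempt))
    raise last_error

agent = PaymentAgent()
calls = {"n": 0}

def flaky_credit():
    calls["n"] += 1
    result = agent.credit("req-42", 100)
    if calls["n"] == 1:
        # The credit was applied, but the caller never hears back.
        raise ConnectionError("acknowledgement lost")
    return result

final_balance = retry(flaky_credit)
```

Without the `request_id` check, the second attempt would credit the account twice. With it, retrying is safe even when the caller cannot tell whether the first attempt took effect.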

Why enterprise RAG systems fail: Google study introduces ‘sufficient context’ solution

A new study from Google researchers introduces “sufficient context,” a novel perspective for understanding and improving retrieval augmented generation (RAG) systems in large language models (LLMs). This approach makes it possible to determine if an LLM has enough information to answer a query accurately, a critical factor for developers building real-world enterprise applications where reliability and factual correctness are paramount.

The persistent challenges of RAG

RAG systems have become a cornerstone for building more factual and verifiable AI applications. However, these systems can exhibit undesirable traits. They might confidently provide incorrect answers even when presented with retrieved evidence, get distracted by irrelevant information in the context, or fail to extract answers from long text snippets properly. The researchers state in their paper, “The ideal outcome is for the LLM to output the correct answer if the provided context contains enough information to answer the question when combined with the model’s parametric knowledge. Otherwise, the model should abstain from answering and/or ask for more information.” Achieving this ideal scenario requires building models that can determine whether the provided context can help answer a question correctly and use it selectively. Previous attempts to address this have examined how LLMs behave with varying degrees of information. However, the Google paper argues that “while the goal seems to be to understand how LLMs behave when they do or do not have sufficient information to answer the query, prior work fails to address this head-on.”

Sufficient context

To tackle this, the researchers introduce the concept of “sufficient context.” At a high level, input instances are classified based on whether the provided context contains enough information to answer the query.
This splits contexts into two cases: Sufficient Context: The context has all the necessary information to provide a definitive answer. Insufficient Context: The context lacks the necessary information. This could be because the query requires specialized knowledge not present in the context, or the information is incomplete, inconclusive or contradictory. Source: arXiv This designation is determined by looking at the question and the associated context without needing a ground-truth answer. This is vital for real-world applications where ground-truth answers are not readily available during inference. The researchers developed an LLM-based “autorater” to automate the labeling of instances as having sufficient or insufficient context. They found that Google’s Gemini 1.5 Pro model, with a single example (1-shot), performed best in classifying context sufficiency, achieving high F1 scores and accuracy. The paper notes, “In real-world scenarios, we cannot expect candidate answers when evaluating model performance. Hence, it is desirable to use a method that works using only the query and context.” Key findings on LLM behavior with RAG Analyzing various models and datasets through this lens of sufficient context revealed several important insights. As expected, models generally achieve higher accuracy when the context is sufficient. However, even with sufficient context, models tend to hallucinate more often than they abstain. When the context is insufficient, the situation becomes more complex, with models exhibiting both higher rates of abstention and, for some models, increased hallucination. Interestingly, while RAG generally improves overall performance, additional context can also reduce a model’s ability to abstain from answering when it doesn’t have sufficient information. 
“This phenomenon may arise from the model’s increased confidence in the presence of any contextual information, leading to a higher propensity for hallucination rather than abstention,” the researchers suggest. A particularly curious observation was the ability of models sometimes to provide correct answers even when the provided context was deemed insufficient. While a natural assumption is that the models already “know” the answer from their pre-training (parametric knowledge), the researchers found other contributing factors. For example, the context might help disambiguate a query or bridge gaps in the model’s knowledge, even if it doesn’t contain the full answer. This ability of models to sometimes succeed even with limited external information has broader implications for RAG system design. Source: arXiv Cyrus Rashtchian, co-author of the study and senior research scientist at Google, elaborates on this, emphasizing that the quality of the base LLM remains critical. “For a really good enterprise RAG system, the model should be evaluated on benchmarks with and without retrieval,” he told VentureBeat. He suggested that retrieval should be viewed as “augmentation of its knowledge,” rather than the sole source of truth. The base model, he explains, “still needs to fill in gaps, or use context clues (which are informed by pre-training knowledge) to properly reason about the retrieved context. For example, the model should know enough to know if the question is under-specified or ambiguous, rather than just blindly copying from the context.” Reducing hallucinations in RAG systems Given the finding that models may hallucinate rather than abstain, especially with RAG compared to no RAG setting, the researchers explored techniques to mitigate this. They developed a new “selective generation” framework. 
This method uses a smaller, separate “intervention model” to decide whether the main LLM should generate an answer or abstain, offering a controllable trade-off between accuracy and coverage (the percentage of questions answered). This framework can be combined with any LLM, including proprietary models like Gemini and GPT. The study found that using sufficient context as an additional signal in this framework leads to significantly higher accuracy for answered queries across various models and datasets. This method improved the fraction of correct answers among model responses by 2–10% for Gemini, GPT, and Gemma models. To put this 2-10% improvement into a business perspective, Rashtchian offers a concrete example from customer support AI. “You could imagine a customer asking about whether they can have a discount,” he said. “In some cases, the retrieved context is recent and specifically describes an ongoing promotion, so the model can answer with confidence. But in other cases, the context might be ‘stale,’ describing a discount from a few months ago, or maybe it has specific terms and conditions. So it would be
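The accuracy-versus-coverage trade-off behind selective generation can be illustrated with a small Python sketch. This is not the paper's intervention model: the scoring rule below (self-reported confidence plus a fixed bonus when the autorater marks the context sufficient), the `sufficiency_weight` value and the six toy examples are all assumptions made up for illustration. The sketch only shows how raising an abstention threshold trades coverage for accuracy on answered queries.

```python
from dataclasses import dataclass

@dataclass
class Example:
    confidence: float  # model's self-reported confidence in [0, 1] (assumed available)
    sufficient: bool   # autorater label: does the context suffice?
    correct: bool      # ground truth, used only for offline evaluation

def intervention_score(ex, sufficiency_weight=0.3):
    # Hypothetical rule: sufficient context adds a fixed bonus to confidence.
    return ex.confidence + (sufficiency_weight if ex.sufficient else 0.0)

def evaluate(examples, threshold):
    """Return (accuracy among answered queries, coverage) at a threshold."""
    answered = [ex for ex in examples if intervention_score(ex) >= threshold]
    coverage = len(answered) / len(examples)
    accuracy = sum(ex.correct for ex in answered) / len(answered) if answered else 0.0
    return accuracy, coverage

# Toy data, invented for this sketch only.
examples = [
    Example(0.9, True, True),
    Example(0.8, False, False),
    Example(0.6, True, True),
    Example(0.7, False, True),
    Example(0.5, False, False),
    Example(0.4, True, False),
]

answer_everything = evaluate(examples, threshold=0.0)
selective = evaluate(examples, threshold=0.85)
```

On this toy data, answering everything gives 50% accuracy at full coverage, while the stricter threshold answers only a third of the queries but gets all of them right. A production system would learn the scoring function from data rather than hard-code a weight.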

Mistral AI launches Devstral, powerful new open source SWE agent model that runs on laptops

Well-funded French AI model maker Mistral has consistently punched above its weight since debuting its own powerful open source foundation model in fall 2023, but it recently took some criticism among developers on X for its last release, a proprietary large language model (LLM) called Medium 3, which some viewed as betraying its open source roots and commitment. (Recall that open source models can be taken and adapted freely by anyone, while proprietary models must be paid for and their customization options are more limited and controlled by the model maker.) But today, Mistral is back and recommitting to the open source AI community, and AI-powered software development in particular, in a big way. The company has teamed up with open source startup All Hands AI, creators of OpenDevin, to release Devstral, a new open-source language model with 24 billion parameters that is purpose-built for agentic AI development. It is much smaller than many rival models and thus requires far less computing power, such that it can be run on a laptop. Unlike traditional LLMs designed for short-form code completions or isolated function generation, Devstral is optimized to act as a full software engineering agent, capable of understanding context across files, navigating large codebases, and resolving real-world issues. The model is now freely available under the permissive Apache 2.0 license, allowing developers and organizations to deploy, modify, and commercialize it without restriction. “We wanted to release something open for the developer and enthusiast community, something they can run locally, privately, and modify as they want,” said Baptiste Rozière, research scientist at Mistral AI.
“It’s released under Apache 2.0, so people can do basically whatever they want with it.” Building upon Codestral Devstral represents the next step in Mistral’s growing portfolio of code-focused models, following its earlier success with the Codestral series. First launched in May 2024, Codestral was Mistral’s initial foray into specialized coding LLMs. It was a 22-billion-parameter model trained to handle over 80 programming languages and became well-regarded for its performance in code generation and completion tasks. The model’s popularity and technical strengths led to rapid iterations, including the launch of Codestral-Mamba—an enhanced version built on Mamba architecture—and most recently, Codestral 25.01, which has found adoption among IDE plugin developers and enterprise users looking for high-frequency, low-latency models. The momentum around Codestral helped establish Mistral as a key player in the coding-model ecosystem and laid the foundation for the development of Devstral—extending from fast completions to full-agent task execution. Outperforms larger models on top SWE benchmarks Devstral achieves a score of 46.8% on the SWE-Bench Verified benchmark, a dataset of 500 real-world GitHub issues manually validated for correctness. This places it ahead of all previously released open-source models and ahead of several closed models, including GPT-4.1-mini, which it surpasses by over 20 percentage points. “Right now, it’s by pretty far the best open model for SWE-bench verified and for code agents,” said Rozière. “And it’s also a very small model—only 24 billion parameters—that you can run locally, even on a MacBook.” “Compare Devstral to closed and open models evaluated under any scaffold—we find that Devstral achieves substantially better performance than a number of closed-source alternatives,” wrote Sophia Yang, Ph.D., Head of Developer Relations at Mistral AI, on the social network X. 
“For example, Devstral surpasses the recent GPT-4.1-mini by over 20%.” The model is finetuned from Mistral Small 3.1 using reinforcement learning and safety alignment techniques. “We started from a very good base model with [Mistral Small 3.1], which already performs well,” Rozière said. “Then we specialized it using safety and reinforcement learning techniques to improve its performance on SWE-bench.”

Built for the agentic era

Devstral is not just a code generation model: it is optimized for integration into agentic frameworks like OpenHands, SWE-Agent, and OpenDevin. These scaffolds allow Devstral to interact with test cases, navigate source files, and execute multi-step tasks across projects. “We’re releasing it with OpenDevin, which is a scaffolding for code agents,” said Rozière. “We build the model, and they build the scaffolding: a set of prompts and tools that the model can use, like a backend for the developer model.” To ensure robustness, the model was tested across diverse repositories and internal workflows. “We were very careful not to overfit to SWE-bench,” Rozière explained. “We trained only on data from repositories that are not cloned from the SWE-bench set and validated the model across different frameworks.” He added that Mistral dogfooded Devstral internally to ensure it generalizes well to new, unseen tasks.

Efficient deployment with permissive open license, even for enterprise and commercial projects

Devstral’s compact 24B architecture makes it practical for developers to run locally, whether on a single RTX 4090 GPU or a Mac with 32GB of RAM. This makes it appealing for privacy-sensitive use cases and edge deployments. “This model is targeted toward enthusiasts and people who care about running something locally and privately, something they can use even on a plane with no internet,” Rozière said. Beyond performance and portability, its Apache 2.0 license offers a compelling proposition for commercial applications.
The license permits unrestricted use, adaptation, and distribution, even for proprietary products, making Devstral a low-friction option for enterprise adoption. Detailed specifications and usage instructions are available on the Devstral-Small-2505 model card on Hugging Face. The model features a 128,000-token context window and uses the Tekken tokenizer with a 131,000-token vocabulary. It supports deployment through all major open source platforms including Hugging Face, Ollama, Kaggle, LM Studio, and Unsloth, and works well with libraries such as vLLM, Transformers, and Mistral Inference.

Available via API or locally

Devstral is accessible via Mistral’s La Plateforme API (application programming interface) under the model name devstral-small-2505, with pricing set at $0.10 per million input tokens and $0.30 per million output tokens. For those deploying locally, support for frameworks like OpenHands enables integration with codebases and agentic workflows out of
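For a rough sense of what the quoted API pricing means in practice, the arithmetic can be sketched in Python. The two per-million-token prices come from the article. The function name and the example workload (2 million input tokens and 500,000 output tokens) are hypothetical, and actual billing may differ from this simple linear estimate.

```python
# Prices quoted in the article for devstral-small-2505, in USD.
INPUT_USD_PER_MILLION = 0.10
OUTPUT_USD_PER_MILLION = 0.30

def estimate_cost_usd(input_tokens, output_tokens):
    """Linear estimate: tokens times price per token, summed over both directions."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000

# Hypothetical workload: an agent session that reads far more than it writes.
cost = estimate_cost_usd(input_tokens=2_000_000, output_tokens=500_000)
print(f"Estimated cost: ${cost:.2f}")
```

For this workload the estimate is $0.20 for input plus $0.15 for output. Agentic coding sessions tend to be input-heavy because the model repeatedly re-reads files and tool output, so the lower input price matters more than the output price in such cases.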

Microsoft just launched an AI that discovered a new chemical in 200 hours instead of years

Microsoft launched a new enterprise platform that harnesses artificial intelligence to dramatically accelerate scientific research and development, potentially compressing years of laboratory work into weeks or even days. The platform, called Microsoft Discovery, leverages specialized AI agents and high-performance computing to help scientists and engineers tackle complex research challenges without requiring them to write code, the company announced Monday at its annual Build developer conference. “What we’re doing is really taking a look at how we can apply advancements in agentic AI and compute work, and then on to quantum computing, and apply it in the really important space, which is science,” said Jason Zander, Corporate Vice President of Strategic Missions and Technologies at Microsoft, in an exclusive interview with VentureBeat. The system has already demonstrated its potential in Microsoft’s own research, where it helped discover a novel coolant for immersion cooling of data centers in approximately 200 hours, a process that traditionally would have taken months or years. “In 200 hours with this framework, we were able to go through and screen 367,000 potential candidates that we came up with,” Zander explained. “We actually took it to a partner, and they actually synthesized it.”

How Microsoft is putting supercomputing power in the hands of everyday scientists

Microsoft Discovery represents a significant step toward democratizing advanced scientific tools, allowing researchers to interact with supercomputers and complex simulations using natural language rather than requiring specialized programming skills. “It’s about empowering scientists to transform the entire discovery process with agentic AI,” Zander emphasized. “My PhD is in biology.
I’m not a computer scientist, but if you can unlock that power of a supercomputer just by allowing me to prompt it, that’s very powerful.” The platform addresses a key challenge in scientific research: the disconnect between domain expertise and computational skills. Traditionally, scientists would need to learn programming to leverage advanced computing tools, creating a bottleneck in the research process. This democratization could prove particularly valuable for smaller research institutions that lack the resources to hire computational specialists to augment their scientific teams. By allowing domain experts to directly query complex simulations and run experiments through natural language, Microsoft is effectively lowering the barrier to entry for cutting-edge research techniques. “As a scientist, I’m a biologist. I don’t know how to write computer code. I don’t want to spend all my time going into an editor and writing scripts and stuff to ask a supercomputer to do something,” Zander said. “I just wanted, like, this is what I want in plain English or plain language, and go do it.” Inside Microsoft Discovery: AI ‘postdocs’ that can screen hundreds of thousands of experiments Microsoft Discovery operates through what Zander described as a team of AI “postdocs” — specialized agents that can perform different aspects of the scientific process, from literature review to computational simulations. “These postdoc agents do that work,” Zander explained. “It’s like having a team of folks that just got their PhD. They’re like residents in medicine — you’re in the hospital, but you’re still finishing.” The platform combines two key components: foundational models that handle planning and specialized models trained for particular scientific domains like physics, chemistry, and biology. What makes this approach unique is how it blends general AI capabilities with deeply specialized scientific knowledge. “The core process, you’ll find two parts of this,” Zander said. 
“One is we’re using foundational models for doing the planning. The other piece is, on the AI side, a set of models that are designed specifically for particular domains of science, that includes physics, chemistry, biology.” According to a company statement, Microsoft Discovery is built on a “graph-based knowledge engine” that constructs nuanced relationships between proprietary data and external scientific research. This allows it to understand conflicting theories and diverse experimental results across disciplines, while maintaining transparency by tracking sources and reasoning processes. At the center of the user experience is a Copilot interface that orchestrates these specialized agents based on researcher prompts, identifying which agents to leverage and setting up end-to-end workflows. This interface essentially acts as the central hub where human scientists can guide their virtual research team. From months to hours: How Microsoft used its own AI to solve a critical data center cooling challenge To demonstrate the platform’s capabilities, Microsoft used Microsoft Discovery to address a pressing challenge in data center technology: finding alternatives to coolants containing PFAS, so-called “forever chemicals” that are increasingly facing regulatory restrictions. Current data center cooling methods often rely on harmful chemicals that are becoming untenable as global regulations push to ban these substances. Microsoft researchers used the platform to screen hundreds of thousands of potential alternatives. “We did prototypes on this. Actually, when I owned Azure, I did a prototype eight years ago, and it works super well, actually,” Zander said. “It’s actually like 60 to 90% more efficient than just air cooling. The big problem is that coolant material that’s on market has PFAS in it.” After identifying promising candidates, Microsoft synthesized the coolant and demonstrated it cooling a GPU running a video game. 
While this specific application remains experimental, it illustrates how Microsoft Discovery can compress development timelines for companies facing regulatory challenges. The implications extend far beyond Microsoft’s own data centers. Any industry facing similar regulatory pressure to replace established chemicals or materials could potentially use this approach to accelerate their R&D cycles dramatically. What once would have been multi-year development processes might now be completed in a matter of months. Daniel Pope, founder of Submer, a company focused on sustainable data centers, was quoted in the press release saying: “The speed and depth of molecular screening achieved by Microsoft Discovery would’ve been impossible with traditional methods. What once took years of lab work and trial-and-error, Microsoft Discovery can accomplish in just weeks, and with greater confidence.” Pharma, beauty, and chips: The major companies already lining up to use Microsoft’s

You.com’s ARI Enterprise crushes OpenAI in head-to-head tests, aims at deep research market

You.com has launched ARI Enterprise today, claiming its advanced research platform defeats OpenAI’s comparable offerings in 76% of head-to-head tests and achieves industry-leading accuracy on independent benchmarks. The upgraded Advanced Research & Insights (ARI) platform scored 80% accuracy on the FRAMES benchmark, a research evaluation standard co-developed by Harvard, Google, and Meta, putting it ahead of offerings from major competitors. “The number one thing is that it just goes into more depth,” Richard Socher, You.com’s CEO and former chief scientist at Salesforce, told me in an interview. “Our users are telling us things like, ‘Oh, this actually found things that I could not find online.’ So that depth combined with the accuracy, that’s sort of the power of it.”

ARI outperforms competitors on Harvard-backed FRAMES evaluation, scoring 80% accuracy. (Credit: You.com)

AI search giants battle for dominance as deep research revolution accelerates

ARI Enterprise enters an increasingly competitive field where tech giants and AI startups are racing to dominate what some call the “new browser wars.” Google, OpenAI, Perplexity, Anthropic, and others are all developing their own deep research agents, which are rapidly becoming how people access and synthesize information across the web. Unlike most competitors targeting general or multi-purpose use cases, You.com has deliberately focused on enterprise applications, particularly for financial analysts and management consultants who require extraordinarily thorough and accurate research. “We didn’t see any change [from Google] and so we decided someone’s got to do it,” Socher told CNET in a previous interview about launching You.com in 2020. Now, with ARI Enterprise, the company is doubling down on its specialized focus.
“This sort of insane, untouchable monopoly that Google had for 20 years, those days are over,” Socher told Business Insider in March. “I don’t think any company will have such a strong monopoly for such a long time anymore because users are getting faster to switch and more eager to try out things.” ARI Enterprise expands research capabilities with 400+ sources and enterprise data integration ARI Enterprise isn’t just a simple update. You.com claims it offers 4x greater depth and breadth compared to the previous version, delivering twice as many unique citations per report. This increased comprehensiveness translates to 35% more insights and facts per research project, according to the company. What truly distinguishes the platform, however, is its integration capabilities with enterprise data sources. While consumer-facing research tools generally rely solely on publicly available information, ARI Enterprise can now connect to internal corporate data repositories including SharePoint, OneDrive, Google Drive, and custom data sources. “Problem of company internal data is we can really showcase our customers data to you,” Socher explained. “That is one of the biggest reasons why it’s ARI for enterprise now is we actually connect company internal data to ARI now so you can have internal like dozens or terabytes, petabytes of data, and we will search over all of that and make that data useful for you.” In a demonstration, Socher showed how the system can handle complex queries like evaluating “strategic options for aerospace incumbents to enter the electrical vertical takeoff and landing (eVTOL) market for logistics and cargo transport.” The system aggregated information from over 440 sources, creating a comprehensive analysis that would typically require weeks of research by a team of consultants. “When you go to 400 [sources] you start to find things that you would not have just stumbled upon by doing all your own Google searches,” Socher said. 
“Our 400th page isn’t like the 400th page that you get as a user scrolling on Google — we have a better search index.” ARI cites 3.6x more sources than OpenAI, averaging 162 citations versus 45. (Credit: You.com) Perhaps the most attention-grabbing aspect of the announcement is You.com’s claim of substantially outperforming competitors in objective benchmarks. The company evaluated ARI against the FRAMES benchmark (Factuality, Retrieval, And reasoning MEasurement Set), co-developed by researchers at Harvard, Google DeepMind, and Meta. Using this standard, which tests AI systems on fact retrieval, reasoning across constraints, and accurate synthesis, ARI achieved 80% accuracy — a 30% improvement over its beta version. More provocatively, You.com created a new benchmark called DeepConsult specifically for business research scenarios. When comparing ARI against OpenAI’s Deep Research across 102 queries with 612 total tests, and using OpenAI’s own o3-mini model as judge, ARI won 76% of comparisons while OpenAI won only 14% (with the remainder being ties). “This accuracy is kind of a constant chase,” Socher said. “When we first launched it, we were the only ones in the best in class, as you know, but then OpenAI came out right and they saw what we were doing, and then when they launched it, it was slightly better, so now we’re beating them again and everyone else.” The company is also open-sourcing its benchmarking methodology and dataset, allowing others to verify their claims. “We’re going to publish all the details. Everyone can run the code and see it for themselves,” Socher emphasized. ARI outperforms OpenAI across all tested categories, with strongest showings in writing quality and instruction following. (Credit: You.com) From financial analysis to healthcare: How early adopters are using ARI Enterprise Early adopters of ARI Enterprise include venture capital firms, consulting agencies, and research institutions. 
“We now have customers like WestCap, an investment firm, who report that ARI has significantly improved their investment research process,” Socher noted. “We’re also working with communications consulting firms who tell us that ARI has enhanced their research capabilities and analytical workflows.”

The National Institutes of Health (NIH) is also using the platform for complex research questions. In one demonstration, Socher showed how ARI could analyze the cost-effectiveness of treatments for sickle cell disease, gathering data from 226 sources and automatically running a Monte Carlo simulation to generate insights.

“If you’re working at the NIH, that would have taken you weeks to get to that kind of answer. It’s


