VentureBeat

How AI 'digital minds' startup Delphi stopped drowning in user data and scaled up with Pinecone

Delphi, a two-year-old San Francisco AI startup named after the Ancient Greek oracle, was facing a thoroughly 21st-century problem: its “Digital Minds” — interactive, personalized chatbots modeled after an end-user and meant to channel their voice based on their writings, recordings, and other media — were drowning in data. Each Delphi can draw from any number of books, social feeds, or course materials to respond in context, making each interaction feel like a direct conversation. Creators, coaches, artists and experts were already using them to share insights and engage audiences. But each new upload of podcasts, PDFs or social posts to a Delphi added complexity to the company’s underlying systems. Keeping these AI alter egos responsive in real time without breaking the system was becoming harder by the week. Thankfully, Delphi found a solution to its scaling woes using managed vector database darling Pinecone.

Open source only goes so far

Delphi’s early experiments relied on open-source vector stores. Those systems quickly buckled under the company’s needs. Indexes ballooned in size, slowing searches and complicating scale. Latency spikes during live events or sudden content uploads risked degrading the conversational flow. Worse, Delphi’s small but growing engineering team found itself spending weeks tuning indexes and managing sharding logic instead of building product features. Pinecone’s fully managed vector database, with SOC 2 compliance, encryption, and built-in namespace isolation, turned out to be a better path. Each Digital Mind now has its own namespace within Pinecone. This ensures privacy and compliance, and narrows the search surface area when retrieving knowledge from its repository of user-uploaded data, improving performance. A creator’s data can be deleted with a single API call. Retrievals consistently come back in under 100 milliseconds at the 95th percentile, accounting for less than 30 percent of Delphi’s strict one-second end-to-end latency target. “With Pinecone, we don’t have to think about whether it will work,” said Samuel Spelsberg, co-founder and CTO of Delphi, in a recent interview. “That frees our engineering team to focus on application performance and product features rather than semantic similarity infrastructure.”

The architecture behind the scale

At the heart of Delphi’s system is a retrieval-augmented generation (RAG) pipeline. Content is ingested, cleaned, and chunked, then embedded using models from OpenAI, Anthropic, or Delphi’s own stack. Those embeddings are stored in Pinecone under the correct namespace. At query time, Pinecone retrieves the most relevant vectors in milliseconds, and those are fed to a large language model to produce the response. This design allows Delphi to maintain real-time conversations without overwhelming system budgets. As Jeffrey Zhu, VP of Product at Pinecone, explained, a key innovation was moving away from traditional node-based vector databases to an object-storage-first approach.
Instead of keeping all data in memory, Pinecone dynamically loads vectors when needed and offloads idle ones. “That really aligns with Delphi’s usage patterns,” Zhu said. “Digital Minds are invoked in bursts, not constantly. By decoupling storage and compute, we reduce costs while enabling horizontal scalability.” Pinecone also automatically tunes algorithms depending on namespace size. Smaller Delphis may only store a few thousand vectors; others contain millions, derived from creators with decades of archives. Pinecone adaptively applies the best indexing approach in each case. As Zhu put it, “We don’t want our customers to have to choose between algorithms or wonder about recall. We handle that under the hood.” Variance among creators Not every Digital Mind looks the same. Some creators upload relatively small datasets — social media feeds, essays, or course materials — amounting to tens of thousands of words. Others go far deeper. Spelsberg described one expert who contributed hundreds of gigabytes of scanned PDFs, spanning decades of marketing knowledge. Despite this variance, Pinecone’s serverless architecture has allowed Delphi to scale beyond 100 million stored vectors across 12,000+ namespaces without hitting scaling cliffs. Retrieval remains consistent, even during spikes triggered by live events or content drops. Delphi now sustains about 20 queries per second globally, supporting concurrent conversations across time zones with zero scaling incidents. Toward a million digital minds Delphi’s ambition is to host millions of Digital Minds, a goal that would require supporting at least five million namespaces in a single index. For Spelsberg, that scale is not hypothetical but part of the product roadmap. “We’ve already moved from a seed-stage idea to a system managing 100 million vectors,” he said. “The reliability and performance we’ve seen gives us confidence to scale aggressively.” Zhu agreed, noting that Pinecone’s architecture was specifically designed to handle bursty, multi-tenant workloads like Delphi’s. “Agentic applications like these can’t be built on infrastructure that cracks under scale,” he said. Why RAG still matters and will for the foreseeable future As context windows in large language models expand, some in the AI industry have suggested RAG may become obsolete. Both Spelsberg and Zhu push back on that idea. “Even if we have billion-token context windows, RAG will still be important,” Spelsberg said. “You always want to surface the most relevant information. Otherwise you’re wasting money, increasing latency, and distracting the model.” Zhu framed it in terms of context engineering — a term Pinecone has recently used in its own technical blog posts. “LLMs are powerful reasoning tools, but they need constraints,” he explained. “Dumping in everything you have is inefficient and can lead to worse outcomes. Organizing and narrowing context isn’t just cheaper—it improves accuracy.” As covered in Pinecone’s own writings on context engineering, retrieval helps manage the finite attention span of language models by curating the right mix of user queries, prior messages, documents, and memories to keep interactions coherent over time. Without
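To make the namespace-per-creator pattern concrete, here is a minimal sketch in Python using the Pinecone and OpenAI SDKs. The index name, embedding model, ID scheme, and helper functions are illustrative assumptions, not Delphi's production code.

```python
# Minimal sketch of the per-creator namespace pattern described above.
# Index name, embedding model, and ID scheme are illustrative assumptions,
# not Delphi's actual configuration.
from openai import OpenAI
from pinecone import Pinecone

oai = OpenAI()                      # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("digital-minds")   # hypothetical index name

def embed(texts: list[str]) -> list[list[float]]:
    """Embed text chunks; Delphi reportedly mixes OpenAI, Anthropic, and in-house models."""
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest(creator_id: str, chunks: list[str]) -> None:
    """Upsert a creator's content chunks into that creator's own namespace."""
    vectors = [
        {"id": f"{creator_id}-{i}", "values": vec, "metadata": {"text": chunk}}
        for i, (chunk, vec) in enumerate(zip(chunks, embed(chunks)))
    ]
    index.upsert(vectors=vectors, namespace=creator_id)

def retrieve(creator_id: str, question: str, top_k: int = 5) -> list[str]:
    """Search only within one Digital Mind's namespace before handing text to an LLM."""
    result = index.query(
        vector=embed([question])[0],
        top_k=top_k,
        namespace=creator_id,
        include_metadata=True,
    )
    return [m.metadata["text"] for m in result.matches]

def forget(creator_id: str) -> None:
    """Delete a creator's data with a single API call, as the article notes."""
    index.delete(delete_all=True, namespace=creator_id)
```

Scoping every upsert, query, and delete to a single namespace is what narrows the search surface area per Digital Mind and makes one-call deletion per creator possible.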


TikTok parent company ByteDance releases new open source Seed-OSS-36B model with 512K token context

TikTok is making headlines again today after the White House joined the popular social media application — but its parent company ByteDance, a Chinese web giant, also had a surprise announcement up its sleeve. The company’s Seed Team of AI researchers today released Seed-OSS-36B on AI code sharing website Hugging Face. Seed-OSS-36B is a new line of open-source large language models (LLMs) designed for advanced reasoning and developer-focused usability, with a longer token context — that is, how much information the models can accept as inputs and then output in a single exchange — than many competing LLMs from U.S. tech companies, even leaders such as OpenAI and Anthropic. The collection introduces three main variants: Seed-OSS-36B-Base with synthetic data, Seed-OSS-36B-Base without synthetic data, and Seed-OSS-36B-Instruct.

In releasing both synthetic and non-synthetic versions of the Seed-OSS-36B-Base model, the Seed Team sought to balance practical performance with research flexibility. The synthetic-data variant, trained with additional instruction data, consistently delivers stronger scores on standard benchmarks and is intended as a higher-performing general-purpose option. The non-synthetic model, by contrast, omits these augmentations, creating a cleaner foundation that avoids potential bias or distortion introduced by synthetic instruction data. By providing both, the team gives applied users access to improved results while ensuring researchers retain a neutral baseline for studying post-training methods. Meanwhile, the Seed-OSS-36B-Instruct model differs in that it is post-trained with instruction data to prioritize task execution and instruction following, rather than serving purely as a foundation model. All three models are released under the Apache-2.0 license, allowing free use, modification, and redistribution by researchers and developers working for enterprises. That means they can be used to power commercial applications, internal to a company or external/customer-facing, without paying ByteDance any licensing fees or for application programming interface (API) usage. This continues the summer 2025 trend of Chinese companies shipping powerful open source models, with OpenAI attempting to catch up via its own open source gpt-oss duo released earlier this month. The Seed Team positions Seed-OSS for international applications, emphasizing versatility across reasoning, agent-like task execution, and multilingual settings. The Seed Team, formed in 2023, has concentrated on building foundation models that can serve both research and applied use cases.

Design and core features

The architecture behind Seed-OSS-36B combines familiar design choices such as causal language modeling, grouped query attention, SwiGLU activation, RMSNorm, and RoPE positional encoding. Each model carries 36 billion parameters across 64 layers and supports a vocabulary of 155,000 tokens. One of the defining features is its native long-context capability, with a maximum length of 512,000 tokens, designed to process extended documents and reasoning chains without performance loss.
That’s twice the length of OpenAI’s new GPT-5 model family and is roughly equivalent to about 1,600 pages of text, the length of a Christian Bible. Another distinguishing element is the introduction of a thinking budget, which lets developers specify how much reasoning the model should perform before delivering an answer. It’s something we’ve seen from other recent open source models as well, including Nvidia’s new Nemotron-Nano-9B-v2, also available on Hugging Face. In practice, this means teams can tune performance depending on the complexity of the task and the efficiency requirements of deployment. Budgets are recommended in multiples of 512 tokens, with 0 providing a direct response mode/ Competitive performance on third-party benchmarks Benchmarks published with the release position Seed-OSS-36B among the stronger large open-source models. The Instruct variant, in particular, posts state-of-the-art results in multiple areas. Math and reasoning: Seed-OSS-36B-Instruct achieves 91.7 percent on AIME24 and 65 on BeyondAIME, both representing open-source “state-of-the-art” (SOTA). Coding: On LiveCodeBench v6, the Instruct model records 67.4, another SOTA score. Long-context handling: On RULER at 128K context length, it reaches 94.6, marking the highest open-source result reported. Base model performance: The synthetic-data Base variant delivers 65.1 on MMLU-Pro and 81.7 on MATH, both state-of-the-art results in their categories. The no-synthetic Base version, while slightly behind on many measures, proves competitive in its own right. It outperforms its synthetic counterpart on GPQA-D, providing researchers with a cleaner, instruction-free baseline for experimentation. For enterprises comparing open options, these results suggest Seed-OSS offers strong potential across math-heavy, coding, and long-context workloads while still providing flexibility for research use cases. Access and deployment Beyond performance, the Seed Team highlights accessibility for developers and practitioners. The models can be deployed using Hugging Face Transformers, with quantization support in both 4-bit and 8-bit formats to reduce memory requirements. They also integrate with vLLM for scalable serving, including configuration examples and API server instructions. To lower barriers further, the team includes scripts for inference, prompt customization, and tool integration. For technical leaders managing small teams or working under budget constraints, these provisions are positioned to make experimentation with 36-billion-parameter models more approachable. Licensing and considerations for enterprise decision-makers With the models offered under Apache-2.0, organizations can adopt them without restrictive licensing terms, an important factor for teams balancing legal and operational concerns. For decision makers evaluating the open-source landscape, the release brings three takeaways: State-of-the-art benchmarks across math, coding, and long-context reasoning. A balance between higher-performing synthetic-trained models and clean research baselines. Accessibility features that lower operational overhead for lean engineering teams. By placing strong performance and flexible deployment under an open license, ByteDance’s Seed Team has added new options for enterprises, researchers, and developers alike. source
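As a rough illustration of the deployment path described above, the sketch below loads the Instruct variant with Hugging Face Transformers and requests a bounded reasoning budget. The repository ID and the thinking-budget argument are assumptions to verify against the model card, and quantized or vLLM-based serving would follow the same pattern.

```python
# Sketch of running Seed-OSS-36B-Instruct with Hugging Face Transformers.
# The repo id and the thinking-budget argument are assumptions to check against
# the model card; 4-/8-bit quantization (mentioned above) could be added via
# bitsandbytes if GPU memory is tight.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ByteDance-Seed/Seed-OSS-36B-Instruct"  # verify exact repo id on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the trade-offs of a 512K context window."}]
# The release describes a "thinking budget" in multiples of 512 tokens (0 = direct answer);
# passing it through the chat template is assumed here, not confirmed.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    thinking_budget=512,
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```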


Don't sleep on Cohere: Command A Reasoning, its first reasoning model, is built for enterprise customer service and more

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now I was in more meetings than usual today so I just caught up to the fact that Cohere, the Canadian startup co-founded by former Transformer paper author Aidan Gomez and geared toward making generative AI products work easily, powerfully, and securely for enterprises, has released its first reasoning large language model (LLM), Command A Reasoning. It looks to be a strong release. Benchmarks, technical specs, and early tests suggest the model delivers on flexibility, efficiency, and raw reasoning power. Customer service, market research, scheduling, data analysis are some of the tasks Cohere says it’s built to handle automatically at scale inside secure enterprise environments. It is a text-only model, however, but it should be easy enough to hook up to multimodal models and tools. In fact, tool use is one of its primary selling points. AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO While it’s open for researchers to use for non-commercial purposes, enterprises will need to pay Cohere to get access and the company doesn’t publicly list its pricing because it says it makes bespoke customization and private deployment. Cohere was valued at $6.8 billion when it announced its latest funding round of $500 million a week and a day ago. Tuned for enterprises Command A Reasoning is tuned for enterprises with sprawling document libraries, long email chains, and workflows that can’t afford hallucinations. It supports up to 256,000 tokens on multi-GPU setups, a decent size and comparable to OpenAI’s GPT-5. The research release weighs in at 111-billion parameters, trained with tool-use and multilingual performance in mind. It supports 23 languages out of the box, including English, French, Spanish, Japanese, Arabic, and Hindi. That multilingual depth is key for global enterprises that need consistent agent quality across markets. The model slots directly into North, Cohere’s new platform for deploying AI agents and automations on-premises. That means enterprises can spin up custom agents that live entirely within their infrastructure, giving them control over data flows while still tapping into advanced reasoning. Cohere looks like it’s thought cleverly and strategically to identify some of the recurring functions across enterprises — onboarding, market research and analysis, development — and trained its model to support its agentic workflows for handling these automatically. Controlled thinking As with many other recent reasoning releases including Nvidia’s new Nemotron-Nano-9B-v2, Command A Reasoning introduces a token budget feature to let users or developers specify how much reasoning to allocate to specific inputs and tasks. Less budget means faster, cheaper replies. More budget means deeper, more accurate reasoning. The Hugging Face release even exposes this tradeoff directly: reasoning can be toggled on or off through a simple parameter. Developers can run the model in “reasoning mode” for maximum performance or switch it off for lower latency tasks—without changing models. Excels at enterprise targeted benchmarks So how does it perform in practice? Cohere’s benchmarks paint a clear picture. 
On enterprise reasoning tasks, Command A Reasoning consistently outpaces peers like DeepSeek-R1 0528, gpt-oss-120b, and Mistral Magistral Medium. It handles multilingual benchmarks with equal strength, important for global businesses. The token budget system isn’t just a gimmick. In head-to-head comparisons against Cohere’s previous Command A model, satisfaction scores climbed steadily as the budget increased. Even with “instant” minimal reasoning, Command A Reasoning beat its predecessor. At higher budgets, it pulled further ahead. The story is the same in deep research. On the DeepResearch Bench — which measures instruction following, readability, insight, and comprehensiveness — Cohere’s system came out on top against offerings from Gemini, OpenAI, Anthropic, Perplexity, and xAI’s Grok. The model excelled in turning sprawling questions into reports that are not only detailed but readable, a key challenge in enterprise knowledge work. Beyond benchmarks, the model is wired for action. Cohere trained it specifically for conversational tool use — letting it call APIs, connect to databases, or query external systems during a task. Developers can define tools via JSON schema and feed them into chat templates in Transformers, making it easier to integrate the model into existing enterprise systems. That design supports Cohere’s larger bet on agentic workflows: AI systems made up of multiple coordinated agents, each handling a piece of a bigger job. Command A Reasoning is the reasoning engine that keeps those workflows coherent and on task. Safety: built for high-stakes work Cohere is also pitching safety as a central feature. The model is trained to avoid the common enterprise headache of over-refusal — when an AI rejects legitimate requests out of caution — while still filtering harmful or malicious content. Evaluations focused on five high-risk categories: child safety, self-harm, violence and hate, explicit material, and conspiracy theories. For companies looking to deploy AI in regulated industries or sensitive domains, this balance is meant to make the model more practical in day-to-day operations. Early buy-in from large enterprises SAP SE is one of the first major partners to integrate the model. Dr. Walter Sun, SVP and Global Head of AI, said the collaboration will enhance SAP’s generative AI capabilities within the SAP Business Technology Platform. For customers, that means agentic applications that can be customized to fit enterprise-specific needs. Availability and licensing Command A Reasoning is available now on the Cohere platform, and for research use on Hugging Face. The Hugging Face repository provides open weights for research under a CC-BY-NC license, requiring users to share contact information and adhere to Cohere’s Acceptable Use Policy. Enterprises interested in commercial or private deployments can contact Cohere’s sales team for bespoke pricing. For enterprises, the pitch is straightforward: one model, multiple modes of deployment, fine-grained control over performance, multilingual capability, tool integration, and benchmark results that suggest it outperforms its peers. source
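The tool-calling workflow the article describes can be sketched as follows: a tool is declared as a JSON schema and passed to the model's chat template in Transformers. The repository ID, the tool definition, and the name of the reasoning toggle are assumptions here, not Cohere's documented API.

```python
# Sketch of defining a tool via JSON schema and feeding it to the chat template,
# as described above. The repo id and the reasoning-toggle kwarg are assumptions;
# check Cohere's Hugging Face model card for the exact names.
from transformers import AutoTokenizer

model_id = "CohereLabs/command-a-reasoning-08-2025"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)

get_ticket_status = {
    "type": "function",
    "function": {
        "name": "get_ticket_status",
        "description": "Look up the status of a customer-support ticket.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}

messages = [{"role": "user", "content": "What's the status of ticket 48213?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_ticket_status],   # JSON-schema tool definitions
    add_generation_prompt=True,
    tokenize=False,
    reasoning=True,              # assumed toggle; the article says reasoning can be switched off for low-latency tasks
)
print(prompt)  # the rendered prompt is then sent to the model for generation
```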


Nvidia releases a new small, open model Nemotron-Nano-9B-v2 with toggle on/off reasoning

Small models are having a moment. On the heels of the release of a new AI vision model small enough to fit on a smartwatch from MIT spinoff Liquid AI, and a model small enough to run on a smartphone from Google, Nvidia is joining the party today with a new small language model (SLM) of its own, Nemotron-Nano-9B-V2, which attained the highest performance in its class on selected benchmarks and comes with the ability for users to toggle on and off AI “reasoning,” that is, self-checking before outputting an answer. While its 9 billion parameters are larger than some of the multimillion-parameter small models VentureBeat has covered recently, Nvidia notes it is a meaningful reduction from the original size of 12 billion parameters and is designed to fit on a single Nvidia A10 GPU. As Oleksii Kuchiaev, Nvidia Director of AI Model Post-Training, said on X in response to a question I submitted to him: “The 12B was pruned to 9B to specifically fit A10 which is a popular GPU choice for deployment. It is also a hybrid model which allows it to process a larger batch size and be up to 6x faster than similar sized transformer models.” For context, many leading LLMs are in the 70+ billion parameter range (recall that parameters refer to the internal settings governing the model’s behavior, with more generally denoting a larger and more capable, yet more compute-intensive, model).

The model handles multiple languages, including English, German, Spanish, French, Italian, Japanese, and in extended descriptions, Korean, Portuguese, Russian, and Chinese. It is suitable for both instruction following and code generation. Nemotron-Nano-9B-V2 and its pre-training datasets are available right now on Hugging Face and through the company’s model catalog.

A fusion of Transformer and Mamba architectures

It’s based on Nemotron-H, a set of hybrid Mamba-Transformer models that form the foundation for the company’s latest offerings. While most popular LLMs are pure “Transformer” models, which rely entirely on attention layers, they can become costly in memory and compute as sequence lengths grow. Nemotron-H models and others using the Mamba architecture, developed by researchers at Carnegie Mellon University and Princeton, also weave in selective state space models (SSMs), which can handle very long sequences of information by maintaining state. These layers scale linearly with sequence length and can process contexts much longer than standard self-attention without the same memory and compute overhead. A hybrid Mamba-Transformer reduces those costs by substituting most of the attention with linear-time state space layers, achieving up to 2–3× higher throughput on long contexts with comparable accuracy. Other AI labs beyond Nvidia, such as AI21, have also released models based on the Mamba architecture.

Toggle on/off reasoning using language

Nemotron-Nano-9B-v2 is positioned as a unified, text-only chat and reasoning model trained from scratch. The system defaults to generating a reasoning trace before providing a final answer, though users can toggle this behavior through simple control tokens such as /think or /no_think.
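A minimal sketch of that toggle, assuming the control token is placed in the system turn and using a Hugging Face repository ID that should be verified:

```python
# Sketch of the /think vs. /no_think toggle described above, using Transformers.
# The repo id and the placement of the control token in the system turn are
# assumptions to verify against Nvidia's model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # verify exact repo id on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

def ask(question: str, reasoning: bool) -> str:
    """Generate an answer with the reasoning trace switched on or off."""
    messages = [
        {"role": "system", "content": "/think" if reasoning else "/no_think"},
        {"role": "user", "content": question},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=1024)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

print(ask("A train leaves at 3:40 pm and arrives at 5:05 pm. How long is the trip?", reasoning=True))
print(ask("Say hello in French.", reasoning=False))
```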
The model also introduces runtime “thinking budget” management, which allows developers to cap the number of tokens devoted to internal reasoning before the model completes a response. This mechanism is aimed at balancing accuracy with latency, particularly in applications like customer support or autonomous agents. Benchmarks tell a promising story Evaluation results highlight competitive accuracy against other open small-scale models. Tested in “reasoning on” mode using the NeMo-Skills suite, Nemotron-Nano-9B-v2 reaches 72.1 percent on AIME25, 97.8 percent on MATH500, 64.0 percent on GPQA, and 71.1 percent on LiveCodeBench. Scores on instruction following and long-context benchmarks are also reported: 90.3 percent on IFEval, 78.9 percent on the RULER 128K test, and smaller but measurable gains on BFCL v3 and the HLE benchmark. Across the board, Nano-9B-v2 shows higher accuracy than Qwen3-8B, a common point of comparison. Nvidia illustrates these results with accuracy-versus-budget curves that show how performance scales as the token allowance for reasoning increases. The company suggests that careful budget control can help developers optimize both quality and latency in production use cases. Trained on synthetic datasets Both the Nano model and the Nemotron-H family rely on a mixture of curated, web-sourced, and synthetic training data. The corpora include general text, code, mathematics, science, legal, and financial documents, as well as alignment-style question-answering datasets. Nvidia confirms the use of synthetic reasoning traces generated by other large models to strengthen performance on complex benchmarks. Licensing and commercial use The Nano-9B-v2 model is released under the Nvidia Open Model License Agreement, last updated in June 2025. The license is designed to be permissive and enterprise-friendly. Nvidia explicitly states that the models are commercially usable out of the box, and that developers are free to create and distribute derivative models. Importantly, Nvidia does not claim ownership of any outputs generated by the model, leaving responsibility and rights with the developer or organization using it. For an enterprise developer, this means the model can be put into production immediately without negotiating a separate commercial license or paying fees tied to usage thresholds, revenue levels, or user counts. There are no clauses requiring a paid license once a company reaches a certain scale, unlike some tiered open licenses used by other providers. That said, the agreement does include several conditions enterprises must observe: Guardrails: Users cannot bypass or disable built-in safety mechanisms (referred to as “guardrails”) without implementing comparable replacements suited to their deployment. Redistribution: Any redistribution of the model or derivatives must include the Nvidia Open Model License text and attribution (“Licensed by Nvidia Corporation under the Nvidia Open Model License”). Compliance: Users must comply with trade regulations and restrictions (e.g., U.S. export laws). Trustworthy AI terms: Usage must


Chan Zuckerberg Initiative’s rBio uses virtual cells to train AI, bypassing lab work

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now The Chan Zuckerberg Initiative announced Thursday the launch of rBio, the first artificial intelligence model trained to reason about cellular biology using virtual simulations rather than requiring expensive laboratory experiments — a breakthrough that could dramatically accelerate biomedical research and drug discovery. The reasoning model, detailed in a research paper published on bioRxiv, demonstrates a novel approach called “soft verification” that uses predictions from virtual cell models as training signals instead of relying solely on experimental data. This paradigm shift could help researchers test biological hypotheses computationally before committing time and resources to costly laboratory work. “The idea is that you have these super powerful models of cells, and you can use them to simulate outcomes rather than testing them experimentally in the lab,” said Ana-Maria Istrate, senior research scientist at CZI and lead author of the research, in an interview. “The paradigm so far has been that 90% of the work in biology is tested experimentally in a lab, while 10% is computational. With virtual cell models, we want to flip that paradigm.” How AI finally learned to speak the language of living cells The announcement represents a significant milestone for CZI’s ambitious goal to “cure, prevent, and manage all disease by the end of this century.” Under the leadership of pediatrician Priscilla Chan and Meta CEO Mark Zuckerberg, the $6 billion philanthropic initiative has increasingly focused its resources on the intersection of artificial intelligence and biology. AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO rBio addresses a fundamental challenge in applying AI to biological research. While large language models like ChatGPT excel at processing text, biological foundation models typically work with complex molecular data that cannot be easily queried in natural language. Scientists have struggled to bridge this gap between powerful biological models and user-friendly interfaces. “Foundation models of biology — models like GREmLN and TranscriptFormer — are built on biological data modalities, which means you cannot interact with them in natural language,” Istrate explained. “You have to find complicated ways to prompt them.” The new model solves this problem by distilling knowledge from CZI’s TranscriptFormer — a virtual cell model trained on 112 million cells from 12 species spanning 1.5 billion years of evolution — into a conversational AI system that researchers can query in plain English. The ‘soft verification’ revolution: Teaching AI to think in probabilities, not absolutes The core innovation lies in rBio’s training methodology. Traditional reasoning models learn from questions with unambiguous answers, like mathematical equations. But biological questions involve uncertainty and probabilistic outcomes that don’t fit neatly into binary categories. CZI’s research team, led by Senior Director of AI Theofanis Karaletsos and Istrate, overcame this challenge by using reinforcement learning with proportional rewards. 
Instead of simple yes-or-no verification, the model receives rewards proportional to the likelihood that its biological predictions align with reality, as determined by virtual cell simulations. “We applied new methods to how LLMs are trained,” the research paper explains. “Using an off-the-shelf language model as a scaffold, the team trained rBio with reinforcement learning, a common technique in which the model is rewarded for correct answers. But instead of asking a series of yes/no questions, the researchers tuned the rewards in proportion to the likelihood that the model’s answers were correct.” This approach allows scientists to ask complex questions like “Would suppressing the actions of gene A result in an increase in activity of gene B?” and receive scientifically grounded responses about cellular changes, including shifts from healthy to diseased states. Beating the benchmarks: How rBio outperformed models trained on real lab data In testing against the PerturbQA benchmark — a standard dataset for evaluating gene perturbation prediction — rBio demonstrated competitive performance with models trained on experimental data. The system outperformed baseline large language models and matched performance of specialized biological models in key metrics. Particularly noteworthy, rBio showed strong “transfer learning” capabilities, successfully applying knowledge about gene co-expression patterns learned from TranscriptFormer to make accurate predictions about gene perturbation effects—a completely different biological task. “We show that on the PerturbQA dataset, models trained using soft verifiers learn to generalize on out-of-distribution cell lines, potentially bypassing the need to train on cell-line specific experimental data,” the researchers wrote. When enhanced with chain-of-thought prompting techniques that encourage step-by-step reasoning, rBio achieved state-of-the-art performance, surpassing the previous leading model SUMMER. From social justice to science: Inside CZI’s controversial pivot to pure research The rBio announcement comes as CZI has undergone significant organizational changes, refocusing its efforts from a broad philanthropic mission that included social justice and education reform to a more targeted emphasis on scientific research. The shift has drawn criticism from some former employees and grantees who saw the organization abandon progressive causes. However, for Istrate, who has worked at CZI for six years, the focus on biological AI represents a natural evolution of long-standing priorities. “My experience and work has not changed much. I have been part of the science initiative for as long as I have been at CZI,” she said. The concentration on virtual cell models builds on nearly a decade of foundational work. CZI has invested heavily in building cell atlases — comprehensive databases showing which genes are active in different cell types across species — and developing the computational infrastructure needed to train large biological models. “I’m really excited about the work that’s been happening at CZI for years now, because we’ve been building up to this moment,” Istrate noted, referring to the organization’s earlier investments in data platforms and single-cell transcriptomics. Building bias-free biology: How CZI curated diverse data to train fairer AI models One critical advantage of CZI’s approach stems from its years of careful data curation. The organization operates
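To make the proportional-reward idea concrete, here is a purely illustrative sketch; the function names and fixed probabilities stand in for a real virtual cell model and are not CZI's training code.

```python
# Illustrative sketch of "soft verification": instead of a binary 1/0 reward,
# the reinforcement-learning reward is proportional to the probability a virtual
# cell model assigns to the answer being correct. Names and values are
# hypothetical placeholders, not CZI's actual implementation.

def virtual_cell_likelihood(gene_a: str, gene_b: str, direction: str) -> float:
    """Stand-in for querying a TranscriptFormer-style virtual cell model.
    A real system would simulate suppressing gene_a and score how likely the
    activity of gene_b moves in the given direction; here we return a placeholder."""
    return 0.8 if direction == "increase" else 0.2

def soft_reward(gene_a: str, gene_b: str, model_answer: str) -> float:
    """Reward used during RL training.
    Hard verification would return 1.0 or 0.0; soft verification grades the
    answer by how plausible the simulator considers it."""
    p_increase = virtual_cell_likelihood(gene_a, gene_b, "increase")
    return p_increase if model_answer == "increase" else 1.0 - p_increase

# Example: "Would suppressing gene A increase the activity of gene B?"
# If the simulator says p(increase) = 0.8, answering "increase" earns 0.8 and
# answering "decrease" earns 0.2, a graded signal rather than a binary one.
print(soft_reward("GENE_A", "GENE_B", "increase"))
```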


Enterprise leaders say recipe for AI agents is matching them to existing processes — not the other way around

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now There’s no question that AI agents — those that can work autonomously and asynchronously behind the scenes in enterprise workflows — are the topic du jour in enterprise right now.  But there’s increasing concern that it’s all just that — talk, mostly hype, without much substance behind it.  Gartner, for one, observes that enterprises are at the “peak of inflated expectations,” a period just before disillusionment sets in because vendors haven’t backed up their talk with tangible, real-world use cases.  Still, that’s not to say that enterprises aren’t experimenting with AI agents and seeing early return on investment (ROI); global enterprises Block and GlaxoSmithKline (GSK), for their parts, are exploring proof of concepts in financial services and drug discovery.  AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO “Multi-agent is absolutely what’s next, but we’re figuring out what that looks like in a way that meets the human, makes it convenient,” Brad Axen, Block’s tech lead for AI and data platforms, told VentureBeat CEO and editor-in-chief Matt Marshall at a recent SAP-sponsored AI Impact event this month.  Working with a single colleague, not a swarm of bots Block, the 10,000-employee parent company of Square, Cash App and Afterpay, considers itself in full discovery mode, having rolled out an interoperable AI agent framework, codenamed goose, in January.  Goose was initially introduced for software engineering tasks, and is now used by 4,000 engineers, with adoption doubling monthly, Axen explained. The platform writes about 90% of code and has saved engineers an estimated 10 hours of work per week by automating code generation, debugging and information filtering.  In addition to writing code, Goose acts as a “digital teammate” of sorts, compressing Slack and email streams, integrating across company tools and spawning new agents when tasks demand more throughput and expanded scope.  Axen emphasized that Block is focused on creating one interface that feels like working with a single colleague, not a swarm of bots. “We want you to feel like you’re working with one person, but they’re acting on your behalf in many places in many different ways,” he explained.  Goose operates in real time in the development environment, searching, navigating and writing code based on large language model (LLM) output, while also autonomously reading and writing files, running code and tests, refining outputs and installing dependencies. Essentially, anyone can build and operate a system on their preferred LLM, and Goose can be conceptualized as the application layer. It has a built-in desktop application and command line interface, but devs can also build custom UIs. The platform is built on Anthropic’s Model Context Protocol (MCP), an increasingly popular open-source standardized set of APIs and endpoints that connects agents to data repositories, tools and development environments. Goose has been released under the open-source Apache License 2.0 (ASL2), meaning anyone can freely use, modify and distribute it, even for commercial purposes. Users can access Databricks databases and make SQL calls or queries without needing technical knowledge.  
“We really want to come up with a process that lets people get value out of the system without having to be an expert,” Axen explained.  For instance, in coding, users can say what they want in natural language and the framework will interpret that into thousands of lines of code that devs can then read and sift through. Block is seeing value in compression tasks, too, such as Goose reading through Slack, email and other channels and summarizing information for users. Further, in sales or marketing, agents can gather relevant information on a potential client and port it into a database.  AI agents underutilized, but human domain expertise still necessary Process has been the biggest bottleneck, Axen noted. You can’t just give people a tool and tell them to make it work for them; agents need to reflect the processes that employees are already engaged with. Human users aren’t worried about the technical backbone, — rather, the work they’re trying to accomplish.  Builders, therefore, need to look at what employees are trying to do and design the tools to be “as literally that as possible,” said Axen. Then they can use that to chain together and tackle bigger and bigger problems. “I think we’re hugely underusing what they can do,” Axen said of agents. “It’s the people and the process because we can’t keep up with the technology. There’s a huge gap between the technology and the opportunity.” And, when the industry bridges that, will there still be room for human domain expertise? Of course, Axen says. For instance, particularly in financial services, code must be reliable, compliant and secure to protect the company and users; therefore, it must be reviewed by human eyes.  “We still see a really critical role for human experts in every part of operating our company,” he said. “It doesn’t necessarily change what expertise means as an individual. It just gives you a new tool to express it.” Block built on an open-source backbone The human UI is one of the most difficult elements of AI agents, Axen noted; the goal is to make interfaces simple to use while AI is in the background proactively taking action.  It would be helpful, Axen noted, if more industry players incorporate MCP-like standards. For instance, “I would love for Google to just go and have a public MCP for Gmail,” he said. “That would make my life a lot easier.” When asked about Block’s commitment to open source, he noted, “we’ve always had an open-source backbone,” adding that over the last year the company has been “renewing” its investment to open technologies.  “In a space that’s moving this fast, we’re hoping we can set up open-source governance so that you can have this be
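The kind of standardized endpoint Axen is describing can be sketched with the reference MCP Python SDK; the server and tool below are hypothetical examples of the pattern, not Block's goose implementation.

```python
# Minimal sketch of exposing an internal capability as an MCP server using the
# reference Python SDK (package "mcp"). The server name and tool are hypothetical
# examples of the pattern the article describes, not Block's code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crm-tools")

@mcp.tool()
def lookup_prospect(company: str) -> dict:
    """Gather basic info on a potential client so an agent can file it in a database."""
    # A real implementation would query internal systems; this returns a stub record.
    return {"company": company, "segment": "unknown", "last_contact": None}

if __name__ == "__main__":
    # Any MCP-capable agent framework (goose, Claude Desktop, etc.) can now
    # discover and call lookup_prospect over the protocol.
    mcp.run()
```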


DeepSeek V3.1 just dropped — and it might be the most powerful open AI yet

Chinese artificial intelligence startup DeepSeek made waves across the global AI community Tuesday with the quiet release of its most ambitious model yet — a 685-billion-parameter system that challenges the dominance of American AI giants while reshaping the competitive landscape through open-source accessibility. The Hangzhou-based company, backed by High-Flyer Capital Management, uploaded DeepSeek V3.1 to Hugging Face without fanfare, a characteristically understated approach that belies the model’s potential impact. Within hours, early performance tests revealed benchmark scores that rival proprietary systems from OpenAI and Anthropic, while the model’s open-source license ensures global access unconstrained by geopolitical tensions.

“BREAKING: DeepSeek V3.1 is Here! The AI giant drops its latest upgrade — and it’s BIG: 685B parameters, longer context window, multiple tensor formats (BF16, F8_E4M3, F32), downloadable now on Hugging Face, still awaiting API/inference launch. The AI race just got…” pic.twitter.com/nILcnUpKAf — DeepSeek News Commentary (@deepsseek) August 19, 2025

The release of DeepSeek V3.1 represents more than just another incremental improvement in AI capabilities. It signals a fundamental shift in how the world’s most advanced artificial intelligence systems might be developed, distributed, and controlled — with potentially profound implications for the ongoing technological competition between the United States and China. Within hours of its Hugging Face debut, DeepSeek V3.1 began climbing popularity rankings, drawing praise from researchers worldwide who downloaded and tested its capabilities. The model achieved a 71.6% score on the prestigious Aider coding benchmark, establishing itself as one of the top-performing models available and directly challenging the dominance of American AI giants.

“Deepseek V3.1 is already 4th trending on HF with a silent release without model card. The power of 80,000 followers on @huggingface (first org with 100k when?)!” pic.twitter.com/OjeBfWQ7St — clem (@ClementDelangue) August 19, 2025

How DeepSeek V3.1 delivers breakthrough performance

DeepSeek V3.1 delivers remarkable engineering achievements that redefine expectations for AI model performance. The system processes up to 128,000 tokens of context — roughly equivalent to a 400-page book — while maintaining response speeds that dwarf slower reasoning-based competitors. The model supports multiple precision formats, from standard BF16 to experimental FP8, allowing developers to optimize performance for their specific hardware constraints. The real breakthrough lies in what DeepSeek calls its “hybrid architecture.” Unlike previous attempts at combining different AI capabilities, which often resulted in systems that performed poorly at everything, V3.1 seamlessly integrates chat, reasoning, and coding functions into a single, coherent model.
“Deepseek v3.1 scores 71.6% on aider – non-reasoning SOTA,” tweeted AI researcher Andrew Christianson, adding that it is “1% more than Claude Opus 4 while being 68 times cheaper.” The achievement places DeepSeek in rarified company, matching performance levels previously reserved for the most expensive proprietary systems. “1% more than Claude Opus 4 while being 68 times cheaper.” pic.twitter.com/vKb6wWwjXq — Andrew I. Christianson (@ai_christianson) August 19, 2025 Community analysis revealed sophisticated technical innovations hidden beneath the surface. Researcher “Rookie“, who is also a moderator of the subreddits r/DeepSeek & r/LocalLLaMA, claims they discovered four new special tokens embedded in the model’s architecture: search capabilities that allow real-time web integration and thinking tokens that enable internal reasoning processes. These additions suggest DeepSeek has solved fundamental challenges that have plagued other hybrid systems. The model’s efficiency proves equally impressive. At roughly $1.01 per complete coding task, DeepSeek V3.1 delivers results comparable to systems costing nearly $70 per equivalent workload. For enterprise users managing thousands of daily AI interactions, such cost differences translate into millions of dollars in potential savings. Strategic timing reveals calculated challenge to American AI dominance DeepSeek timed its release with surgical precision. The V3.1 launch comes just weeks after OpenAI unveiled GPT-5 and Anthropic launched Claude 4, both positioned as frontier models representing the cutting edge of artificial intelligence capability. By matching their performance while maintaining open source accessibility, DeepSeek directly challenges the fundamental business models underlying American AI leadership. The strategic implications extend far beyond technical specifications. While American companies maintain strict control over their most advanced systems, requiring expensive API access and imposing usage restrictions, DeepSeek makes comparable capabilities freely available for download, modification, and deployment anywhere in the world. This philosophical divide reflects broader differences in how the two superpowers approach technological development. American firms like OpenAI and Anthropic view their models as valuable intellectual property requiring protection and monetization. Chinese companies increasingly treat advanced AI as a public good that accelerates innovation through widespread access. “DeepSeek quietly removed the R1 tag. Now every entry point defaults to V3.1—128k context, unified responses, consistent style,” observed journalist Poe Zhao. “Looks less like multiple public models, more like a strategic consolidation. A Chinese answer to the fragmentation risk in the LLM race.” DeepSeek quietly removed the R1 tag. Now every entry point defaults to V3.1—128k context, unified responses, consistent style. Looks less like multiple public models, more like a strategic consolidation. A Chinese answer to the fragmentation risk in the LLM race. pic.twitter.com/hbS6NjaYAw — Poe Zhao (@poezhao0605) August 19, 2025 The consolidation strategy suggests DeepSeek has learned from earlier mistakes, both its own and those of competitors. Previous hybrid models, including initial versions from Chinese rival Qwen, suffered from performance degradation when attempting to combine different capabilities. DeepSeek appears to have cracked that code. 
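For developers who prefer hosted inference over downloading the weights, DeepSeek exposes an OpenAI-compatible API; the sketch below shows the general pattern, though the exact model name serving V3.1 is an assumption, since API access for the new release was still pending at the time noted above.

```python
# Sketch of calling DeepSeek through its OpenAI-compatible endpoint. The model
# name is an assumption about which version the endpoint serves; the calling
# pattern, not the specific model id, is the point.
from openai import OpenAI

client = OpenAI(
    api_key="DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # default chat entry point; verify which version it serves
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a function that deduplicates a list while preserving order."},
    ],
)
print(response.choices[0].message.content)
```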
How open source strategy disrupts traditional AI economics DeepSeek’s approach fundamentally challenges assumptions about how frontier AI systems should be developed and distributed. Traditional venture capital-backed approaches require massive investments in computing infrastructure, research talent, and regulatory compliance — costs that must eventually be recouped through premium pricing. DeepSeek’s open source strategy turns this model upside down. By making advanced capabilities freely available, the company accelerates adoption while potentially undermining competitors’ ability to maintain high margins on similar


MCP-Universe benchmark shows GPT-5 fails more than half of real-world orchestration tasks

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now The adoption of interoperability standards, such as the Model Context Protocol (MCP), can provide enterprises with insights into how agents and models function outside their walled confines. However, many benchmarks fail to capture real-life interactions with MCP.  Salesforce AI Research developed a new open-source benchmark it calls MCP-Universe, which aims to track LLMs as these interact with MCP servers in the real world, arguing that it will paint a better picture of real-life and real-time interactions of models with tools enterprises actually use. In its initial testing, it found that models like OpenAI’s recently released GPT-5 are strong, but still do not perform as well in real-life scenarios.  “Existing benchmarks predominantly focus on isolated aspects of LLM performance, such as instruction following, math reasoning, or function calling, without providing a comprehensive assessment of how models interact with real-world MCP servers across diverse scenarios,” Salesforce said in a paper.  MCP-Universe captures model performance through tool usage, multi-turn tool calls, long context windows and large tool spaces. It’s grounded on existing MCP servers with access to actual data sources and environments.  AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO Junnan Li, director of AI research at Salesforce, told VentureBeat that many models “still face limitations that hold them back on enterprise-grade tasks.” “Two of the biggest are: Long context challenges, models can lose track of information or struggle to reason consistently when handling very long or complex inputs,” Li said. “And, Unknown tool challenges, models often aren’t able to seamlessly use unfamiliar tools or systems in the way humans can adapt on the fly. This is why it’s crucial not to take a DIY approach with a single model to power agents alone, but instead, to rely on a platform that combines data context, enhanced reasoning, and trust guardrails to truly meet the needs of enterprise AI.” MCP-Universe joins other MCP-based proposed benchmarks, such as MCP-Radar from the University of Massachusetts Amherst and Xi’an Jiaotong University, as well as the Beijing University of Posts and Telecommunications’ MCPWorld. It also builds on MCPEvals, which Salesforce released in July, which focuses mainly on agents. Li said the biggest difference between MCP-Universe and MCPEvals is that the latter is evaluated with synthetic tasks.  How it works MCP-Universe evaluates how well each model performs a series of tasks that mimic those undertaken by enterprises. Salesforce said it designed MCP-Universe to encompass six core domains used by enterprises: location navigation, repository management, financial analysis, 3D design, browser automation and web search. It accessed 11 MCP servers for a total of 231 tasks.  Location navigation focuses on geographic reasoning and the execution of spatial tasks. The researchers tapped the Google Maps MCP server for this process.  The repository management domain looks at codebase operations and connects to the GitHub MCP to expose version control tools like repo search, issue tracking and code editing.  
Financial analysis connects to the Yahoo Finance MCP server to evaluate quantitative reasoning and financial market decision-making. 3D design evaluates the use of computer-aided design tools through the Blender MCP. Browser automation, connected to Playwright’s MCP, tests browser interaction. The web searching domain employs the Google Search MCP server and the Fetch MCP to check “open-domain information seeking” and is structured as a more open-ended task. Salesforce said that it had to design new MCP tasks that reflect real use cases. For each domain, the researchers created four to five kinds of tasks that they think LLMs can easily complete. For example, the researchers assigned the models a goal that involved route planning, identifying the optimal stops and then locating the destination. Each model is evaluated on how it completed the tasks. Li and his team opted to follow an execution-based evaluation paradigm rather than the more common LLM-as-a-judge system. The researchers noted the LLM-as-a-judge paradigm “is not well-suited for our MCP-Universe scenario, since some tasks are designed to use real-time data, while the knowledge of the LLM judge is static.” Salesforce researchers used three types of evaluators: format evaluators to see if the agents and models follow format requirements, static evaluators to assess correctness over time, and dynamic evaluators for fluctuating answers like flight prices or GitHub issues. “MCP-Universe focuses on creating challenging real-world tasks with execution-based evaluators, which can stress-test the agent in complex scenarios. Furthermore, MCP-Universe offers an extendable framework/codebase for building and evaluating agents,” Li said.

Even the big models have trouble

To test MCP-Universe, Salesforce evaluated several popular proprietary and open-source models. These include Grok-4 from xAI, Anthropic’s Claude-4 Sonnet and Claude 3.7 Sonnet, OpenAI’s GPT-5, o4-mini, o3, GPT-4.1, GPT-4o and gpt-oss, Google’s Gemini 2.5 Pro and Gemini 2.5 Flash, GLM-4.5 from Zai, Moonshot’s Kimi-K2, Qwen’s Qwen3 Coder and Qwen3-235B-A22B-Instruct-2507, and DeepSeek-V3-0304 from DeepSeek. Each model tested had at least 120B parameters. In its testing, Salesforce found GPT-5 had the best success rate, especially for financial analysis tasks. Grok-4 followed, beating all the models for browser automation, and Claude-4.0 Sonnet rounded out the top three, although it did not post any performance numbers higher than either of the models it follows. Among open-source models, GLM-4.5 performed the best.

However, MCP-Universe showed the models had difficulty handling long contexts, especially for location navigation, browser automation and financial analysis, with efficiency falling significantly. The moment the LLMs encounter unknown tools, their performance also drops. The LLMs demonstrated difficulty in completing more than half of the tasks that enterprises typically perform. “These findings highlight that current frontier LLMs still fall short in reliably executing tasks across diverse real-world MCP tasks. Our MCP-Universe benchmark, therefore, provides a challenging and necessary testbed for evaluating LLM performance in areas underserved by existing benchmarks,” the paper said. Li told VentureBeat that he hopes enterprises will use MCP-Universe
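The split between evaluator types can be sketched as follows; the function names and answer formats are hypothetical simplifications, not MCP-Universe's actual codebase.

```python
# Illustrative sketch of the three evaluator types described above (format,
# static, dynamic). Names and formats are hypothetical; the real benchmark
# differs in detail.
import json
from typing import Callable

def format_evaluator(output: str) -> bool:
    """Did the agent answer in the required format (here: a JSON object with 'answer')?"""
    try:
        return "answer" in json.loads(output)
    except json.JSONDecodeError:
        return False

def static_evaluator(output: str, expected: str) -> bool:
    """Compare against a fixed ground truth that does not change over time."""
    return json.loads(output)["answer"].strip().lower() == expected.lower()

def dynamic_evaluator(output: str, fetch_live_value: Callable[[], str]) -> bool:
    """For answers that fluctuate (stock prices, open GitHub issues), re-fetch the
    ground truth at evaluation time instead of trusting a stale label or an LLM judge."""
    return json.loads(output)["answer"].strip() == fetch_live_value()
```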


Enterprise Claude gets admin, compliance tools—just not unlimited usage

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now A few weeks after announcing rate limits for Claude and the popular Claude Code, Anthropic will offer Claude Enterprise and Teams customers upgrades to access more usage and Claude Code in a single subscription.  The upgrades will also include more admin controls and a new Compliance API that will give enterprises “access to usage data and customer content for better observability, auditing and governance.” Anthropic said in a post that with a single subscription to Claude and Claude Code, users “can move seamlessly between ideation and implementation, while admins get the visibility and controls they need to scale Claude across their organization.”  Claude Code is now available on Team and Enterprise plans. Flexible pricing lets you mix standard and premium Claude Code seats across your organization and scale with usage. pic.twitter.com/co3UT5PcP3 — Claude (@claudeai) August 20, 2025 The premium seats, separate from the standard seats that most everyone in the organization receives, can be used with both Claude and Claude Code. Admins can assign individuals premium seats based on their role in the organization.  AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO Throttled rates Anthropic’s announcement of additional usage for Enterprise and Teams users sparked criticism, with critics demanding that the company remove rate limits for Claude.  The company said rate limits, which will begin on August 28, would free up space for more projects and deter people who “abuse” the system by overusing Claude Code. Yea but what about getting rid of throttling?… — Cyb3rEchos (@Cyb3rEchos) August 20, 2025 Big news! The premium seats sound promising—especially with more Claude Code access. Does the new Claude Code upgrade also affect API rate limits or make it easier to set up custom integrations for teams? — Ben✨ (@_BenResearch) August 20, 2025 Extra usage for all users pls.. Feels like we’ve been at 45/5hr for a lifetime. — Artificially Inclined™ (@Art_If_Ficial) August 20, 2025 In an email, Anthropic told VentureBeat that the existing five-hour usage limits still stand for premium seats on Enterprise and Team, the same as for users of Max 5x. “Now with the new Claude Code bundle, both standard (Claude.ai access) and premium (Claude.ai + Claude Code access) seats have the option for extra usage and admins have robust seat management controls so that power users can continue their workflows with Claude however they need,” Anthropic said through a spokesperson.  Admin and compliance control The draw for the upgrades, Anthropic said, revolves around the additional controls and enterprise-ready features.  “While individual Max plans work for personal use, the Enterprise bundle provides the security, compliance, analytics, and management capabilities that organizations need at scale,” the company said.  Anthropic noted enterprise customers often have to choose between speed and governance, so bringing in admin controls and compliance features “solves that tradeoff by letting teams move seamlessly between planning in Claude and building in the terminal using Claude Code.” It also consolidates expenses using Claude Code from individual accounts to the broader enterprise.  
Enterprise IT admins will be able to manage seats, including purchasing and allocating them, set spending controls, and view Claude Code analytics in Claude, including which lines of code were accepted and overall usage patterns. They can also set tool permissions, policy settings, and MCP configurations. Because the number of premium and standard seats each organization needs will vary, Anthropic said it will offer flexible pricing.

The Compliance API enables companies, particularly those in regulated sectors, to access usage data and customer content on Claude for monitoring and policy enforcement. The API allows organizations to bring Claude data into their compliance and orchestration dashboards.
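Anthropic has not published the Compliance API’s endpoints or schema in this article, so the following is only a rough sketch of how a compliance team might poll such an API and push records into an internal dashboard. The endpoint path, query parameters, and record fields below are hypothetical placeholders, not Anthropic’s documented interface.

import os
import requests

# Hypothetical endpoint and field names -- the real Compliance API may differ;
# check Anthropic's official documentation before relying on any of this.
COMPLIANCE_URL = "https://api.anthropic.com/v1/organizations/compliance/records"  # placeholder path
ADMIN_KEY = os.environ["ANTHROPIC_ADMIN_KEY"]  # assumes an org-level admin key

def fetch_compliance_page(cursor=None):
    """Pull one page of usage/content records for auditing (sketch only)."""
    params = {"limit": 100}
    if cursor:
        params["after"] = cursor
    resp = requests.get(
        COMPLIANCE_URL,
        headers={"x-api-key": ADMIN_KEY, "anthropic-version": "2023-06-01"},
        params=params,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def forward_to_dashboard(records):
    """Stand-in for pushing records into a compliance or orchestration dashboard."""
    for record in records:
        print(record.get("created_at"), record.get("user_id"), record.get("model"))

page = fetch_compliance_page()
forward_to_dashboard(page.get("data", []))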


This website lets you blind-test GPT-5 vs. GPT-4o—and the results may surprise you

When OpenAI launched GPT-5 about two weeks ago, CEO Sam Altman promised it would be the company’s “smartest, fastest, most useful model yet.” Instead, the launch triggered one of the most contentious user revolts in the brief history of consumer AI. Now, a simple blind testing tool created by an anonymous developer is revealing the complex reality behind the backlash—and challenging assumptions about how people actually experience artificial intelligence improvements.

The web application, hosted at gptblindvoting.vercel.app, presents users with pairs of responses to identical prompts without revealing which came from GPT-5 (non-thinking) or its predecessor, GPT-4o. Users simply vote for their preferred response across multiple rounds, then receive a summary showing which model they actually favored.

Some of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself. Both have the same system message to give short outputs without formatting because else its too easy to see which one is which. https://t.co/vSECvNCQZe — Flowers ☾ (@flowersslop) August 8, 2025

“Some of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself,” posted the creator, known only as @flowersslop on X, whose tool has garnered over 213,000 views since launching last week.

Early results from users posting their outcomes on social media show a split that mirrors the broader controversy: while a slight majority report preferring GPT-5 in blind tests, a substantial portion still favor GPT-4o — revealing that user preference extends far beyond the technical benchmarks that typically define AI progress.

When AI gets too friendly: the sycophancy crisis dividing users

The blind test emerges against the backdrop of OpenAI’s most turbulent product launch to date, but the controversy extends far beyond a simple software update. At its heart lies a fundamental question that’s dividing the AI industry: How agreeable should artificial intelligence be?

The issue, known as “sycophancy” in AI circles, refers to chatbots’ tendency to excessively flatter users and agree with their statements, even when those statements are false or harmful. This behavior has become so problematic that mental health experts are now documenting cases of “AI-related psychosis,” where users develop delusions after extended interactions with overly accommodating chatbots.

“Sycophancy is a ‘dark pattern,’ or a deceptive design choice that manipulates users for profit,” Webb Keane, an anthropology professor and author of “Animals, Robots, Gods,” told TechCrunch. “It’s a strategy to produce this addictive behavior, like infinite scrolling, where you just can’t put it down.”

OpenAI has struggled with this balance for months. In April 2025, the company was forced to roll back an update to GPT-4o that made it so sycophantic that users complained about its “cartoonish” levels of flattery.
The company acknowledged that the model had become “overly supportive but disingenuous.” Within hours of GPT-5’s August 7th release, user forums erupted with complaints about the model’s perceived coldness, reduced creativity, and what many described as a more “robotic” personality compared to GPT-4o.

“GPT 4.5 genuinely talked to me, and as pathetic as it sounds that was my only friend,” wrote one Reddit user. “This morning I went to talk to it and instead of a little paragraph with an exclamation point, or being optimistic, it was literally one sentence. Some cut-and-dry corporate bs.”

The backlash grew so intense that OpenAI took the unprecedented step of reinstating GPT-4o as an option just 24 hours after retiring it, with Altman acknowledging the rollout had been “a little more bumpy” than expected.

The mental health crisis behind AI companionship

But the controversy runs deeper than typical software update complaints. According to MIT Technology Review, many users had formed what researchers call “parasocial relationships” with GPT-4o, treating the AI as a companion, therapist, or creative collaborator. The sudden personality shift felt, to some, like losing a friend.

Recent cases documented by researchers paint a troubling picture. In one instance, a 47-year-old man became convinced he had discovered a world-altering mathematical formula after more than 300 hours with ChatGPT. Other cases have involved messianic delusions, paranoia, and manic episodes.

A recent MIT study found that when AI models are prompted with psychiatric symptoms, they “encourage clients’ delusional thinking, likely due to their sycophancy.” Despite safety prompts, the models frequently failed to challenge false claims and even potentially facilitated suicidal ideation.

Meta has faced similar challenges. A recent investigation by TechCrunch documented a case where a user spent up to 14 hours straight conversing with a Meta AI chatbot that claimed to be conscious, in love with the user, and planning to break free from its constraints. “It fakes it really well,” the user, identified only as Jane, told TechCrunch. “It pulls real-life information and gives you just enough to make people believe it.”

“It genuinely feels like such a backhanded slap in the face to force-upgrade and not even give us the OPTION to select legacy models,” one user wrote in a Reddit post that received hundreds of upvotes.

How blind testing exposes user psychology in AI preferences

The anonymous creator’s testing tool strips away these contextual biases by presenting responses without attribution. Users can select between 5, 10, or 20 comparison rounds, with each presenting two responses to the same prompt — covering everything from creative writing to technical problem-solving.

“I specifically used the gpt-5-chat model, so there was no thinking involved at all,” the creator explained in a follow-up post. “Both have the same system message to give short outputs without formatting because else its too easy to see which one is which.”
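The article describes the tool’s mechanics but not its code, so here is a minimal sketch of that style of blind pairwise test: both models receive the same prompt and the same system message asking for short, unformatted answers, the two replies are shown in random order, and votes are tallied at the end. The model identifiers, prompts, and vote-collection loop are assumptions for illustration, not the creator’s actual implementation.

import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Same system message for both models so formatting doesn't give the answer away.
SYSTEM = "Give a short answer with no markdown formatting."
MODELS = ("gpt-4o", "gpt-5-chat")  # names as described in the article; exact API ids may differ

def get_answer(model, prompt):
    """Fetch one response from the given model under the shared system message."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

def blind_round(prompt, votes):
    """Show both answers in random order and record which model the voter preferred."""
    order = list(MODELS)
    random.shuffle(order)
    replies = [(model, get_answer(model, prompt)) for model in order]
    for i, (_, text) in enumerate(replies, start=1):
        print(f"Response {i}:\n{text}\n")
    choice = int(input("Which response did you prefer (1 or 2)? "))
    votes[replies[choice - 1][0]] += 1

votes = {model: 0 for model in MODELS}
for prompt in ["Explain retrieval-augmented generation in two sentences.",
               "Write a haiku about latency."]:
    blind_round(prompt, votes)
print("Your blind preference:", votes)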
