VentureBeat

Adopting agentic AI? Build AI fluency, redesign workflows, don’t neglect supervision

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

The work ecosystem as we know it is about to change, with agents (the “next frontier of generative AI”) set to augment human decision-making for good. At the beginning of the year, the BCG AI Radar global survey said two-thirds of companies are already exploring AI agents.

We’re approaching a new norm where AI systems can process our natural-language prompts and autonomously make decisions, much like a responsible employee. They have the potential to provide solutions to highly complex use cases across industries and business domains, taking over labor-intensive tasks or qualitative and quantitative analysis. But don’t be consumed by the dystopian thinkers: humans and machines can have a symbiotic relationship.

Agentic AI could act as a competent virtual assistant, sifting through data, working across platforms, learning from processes and producing real-time insights or predictions. But, similar to onboarding new recruits, AI agents demand considerable testing, training and guidance before they can operate effectively. So, humans will act as custodians, arguably occupying a more supervisory role. For example, we must ensure adherence to a central governance framework, maintain ethical and security standards, foster a proactive risk response and align decisions with wider company strategic goals.

AI systems are prone to errors and misuse, which warrants “human-in-the-loop” control mechanisms. This human accountability for agentic systems is necessary to balance autonomy with risk mitigation. So, how can organizations decide how to use these mechanisms and which collaborative frameworks to put in place? As a founder of an AI-powered digital transformation and product development company helping businesses innovate, automate and scale, I offer a short guide.
1. Empower your workforce with AI fluency

AI upskilling is still badly under-prioritized across organizations. Did you know that fewer than one-third of companies have trained even a quarter of their staff to use AI? How do leaders expect employees to feel empowered to use AI if education isn’t presented as the priority?

Maintaining a nimble and knowledgeable workforce is critical, as is fostering a culture that embraces technological change. Team collaboration in this sense could take the form of regular training about agentic AI, highlighting its strengths and weaknesses and focusing on successful human-AI collaborations. For more established companies, role-based training courses could show employees in different capacities and roles how to use generative AI appropriately.

Executives should make sure a feedback mechanism is in place to optimize this human-AI collaboration. When employees actively participate in error identification and mitigation, they can develop an appreciation for evolving technologies while also seeing the importance of continuous learning.

AI fluency also comes from collaboration across departments and specialists; for example, between engineers, AI specialists and developers. They must share knowledge and concerns to effectively integrate agentic AI into workflows. For your workforce to feel empowered, there must be a mindset change: We don’t need to compete with AI; we (and our cognitive abilities) are evolving with it.

2. Redesign your workflows around agentic AI

According to a recent McKinsey survey, redesigning workflows when implementing generative AI has had the most significant impact on earnings before interest and tax (EBIT) in organizations of all sizes. In other words: AI’s true value comes when companies rewire how they run. For example, executives whose companies have successfully generated significant value from AI projects often adopt quite a targeted approach.
The VPs of product or engineering usually concentrate on a limited number of key AI initiatives at any given time, rather than spreading resources thinly. The strategy involves a dedication to upskilling, as well as a complete overhaul of core business processes and aggressive scaling, keeping a keen eye on financial and operational performance.

Although machines can’t be left entirely unattended and humans can’t stay on top of processing data in real time, constant human-AI collaboration may not be the answer to everything when redesigning workflows. Researchers at the MIT Center for Collective Intelligence, for instance, found that sometimes a combination is most effective, and sometimes just humans, or just AI, on their own. The co-authors found a clear division of labor: Humans excel in subtasks requiring “contextual understanding and emotional intelligence,” whereas AI systems thrive when subtasks are “repetitive, high-volume or data-driven.”

3. Develop new ‘supervising’ AI roles

Although gen AI will not substantially affect organizations’ workforce sizes in the short term, we should still expect an evolution of role titles and responsibilities, for example from service operations and product development to AI ethics and AI model validation positions.

For this shift to happen successfully, executive-level buy-in is paramount. Senior leaders need a clearly defined organization-wide strategy, including a dedicated team to drive gen AI adoption. We’ve seen that when senior leaders delegate AI integration solely to IT or digital technology teams, the business context can be neglected. So, business leaders must be more actively engaged; for example, they can occupy roles like AI governance oversight to guarantee ethical and strategic alignment.
When recruiting, business leaders should seek candidates who are: 1) Adept at testing for model bias to ensure accuracy and identification of problems early in AI development; and 2) Experienced in cross-departmental collaboration, to ensure that AI solutions are meeting all the team’s needs. If you are an SVP or CTO and unsure where to start, you may need a strategic partner to gain access to quality talent. This is table stakes for building enterprise-grade, AI-powered technology products and for de-risking AI adoption.

Conclusion

Looking ahead, successful organizations will be defined by their ability to present a vision of a workplace where humans and AI co-create. Leaders must prioritize building collaborative frameworks that leverage AI’s strengths while empowering human creativity and judgment.

Imran Aftab is co-Founder and CEO of 10Pearls.


OpenAI brings GPT-4.1 and 4.1 mini to ChatGPT — what enterprises should know

OpenAI is rolling out GPT-4.1, its new non-reasoning large language model (LLM) that balances high performance with lower cost, to users of ChatGPT. The company is beginning with its paying subscribers on ChatGPT Plus, Pro, and Team, with Enterprise and Education user access expected in the coming weeks. It’s also adding GPT-4.1 mini, which replaces GPT-4o mini as the default for all ChatGPT users, including those on the free tier. The “mini” version has fewer parameters and is therefore less powerful, but it keeps similar safety standards.

The models are both available via the “more models” dropdown selection in the top corner of the chat window within ChatGPT, giving users flexibility to choose between GPT-4.1, GPT-4.1 mini, and reasoning models such as o3, o4-mini, and o4-mini-high.

Initially intended for use only by third-party software and AI developers through OpenAI’s application programming interface (API), GPT-4.1 was added to ChatGPT following strong user feedback. OpenAI post-training research lead Michelle Pokrass confirmed on X the shift was driven by demand, writing: “we were initially planning on keeping this model api only but you all wanted it in chatgpt 🙂 happy coding!” OpenAI Chief Product Officer Kevin Weil posted on X saying: “We built it for developers, so it’s very good at coding and instruction following – give it a try!”

An enterprise-focused model

GPT-4.1 was designed from the ground up for enterprise-grade practicality. Launched in April 2025 alongside GPT-4.1 mini and nano, this model family prioritized developer needs and production use cases. GPT-4.1 delivers a 21.4-point improvement over GPT-4o on the SWE-bench Verified software engineering benchmark, and a 10.5-point gain on instruction-following tasks in Scale’s MultiChallenge benchmark.
It also reduces verbosity by 50% compared to other models, a trait enterprise users praised during early testing.

Context, speed, and model access

GPT-4.1 supports the standard context windows for ChatGPT: 8,000 tokens for free users, 32,000 tokens for Plus users, and 128,000 tokens for Pro users. According to developer Angel Bogado posting on X, these limits match those used by earlier ChatGPT models, though plans are underway to increase context size further.

While the API versions of GPT-4.1 can process up to one million tokens, this expanded capacity is not yet available in ChatGPT, though future support has been hinted at. This extended context capability allows API users to feed entire codebases or large legal and financial documents into the model, which is useful for reviewing multi-document contracts or analyzing large log files. OpenAI has acknowledged some performance degradation with extremely large inputs, but enterprise test cases suggest solid performance up to several hundred thousand tokens.

Evaluations and safety

OpenAI has also launched a Safety Evaluations Hub website to give users access to key performance metrics across models. GPT-4.1 shows solid results across these evaluations. In factual accuracy tests, it scored 0.40 on the SimpleQA benchmark and 0.63 on PersonQA, outperforming several predecessors. It also scored 0.99 on OpenAI’s “not unsafe” measure in standard refusal tests, and 0.86 on more challenging prompts. However, in the StrongReject jailbreak test, an academic benchmark for safety under adversarial conditions, GPT-4.1 scored 0.23, behind models like GPT-4o-mini and o3. That said, it scored a strong 0.96 on human-sourced jailbreak prompts, indicating more robust real-world safety under typical use.

In instruction adherence, GPT-4.1 follows OpenAI’s defined hierarchy (system over developer, developer over user messages) with a score of 0.71 for resolving system vs. user message conflicts.
It also performs well in safeguarding protected phrases and avoiding solution giveaways in tutoring scenarios.

Contextualizing GPT-4.1 against predecessors

The release of GPT-4.1 comes after scrutiny around GPT-4.5, which debuted in February 2025 as a research preview. That model emphasized better unsupervised learning, a richer knowledge base, and reduced hallucinations, with the hallucination rate falling from 61.8% in GPT-4o to 37.1%. It also showcased improvements in emotional nuance and long-form writing, but many users found the enhancements subtle. Despite these gains, GPT-4.5 drew criticism for its high price (up to $180 per million output tokens via API) and for underwhelming performance in math and coding benchmarks relative to OpenAI’s o-series models. Industry figures noted that while GPT-4.5 was stronger in general conversation and content generation, it underperformed in developer-specific applications.

By contrast, GPT-4.1 is intended as a faster, more focused alternative. While it lacks GPT-4.5’s breadth of knowledge and extensive emotional modeling, it is better tuned for practical coding assistance and adheres more reliably to user instructions.

On OpenAI’s API, GPT-4.1 is currently priced at $2.00 per million input tokens, $0.50 per million cached input tokens, and $8.00 per million output tokens. For those seeking a balance between speed and intelligence at a lower cost, GPT-4.1 mini is available at $0.40 per million input tokens, $0.10 per million cached input tokens, and $1.60 per million output tokens.

By comparison, Google’s Flash-Lite and Flash models are available starting at $0.075–$0.10 per million input tokens and $0.30–$0.40 per million output tokens, less than a tenth the cost of GPT-4.1’s base rates. But while GPT-4.1 is priced higher, it posts stronger software engineering benchmark results and more precise instruction following, which may be critical for enterprise deployment scenarios requiring reliability over cost.
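To make the per-million-token rates concrete, the following is a minimal sketch of the arithmetic for a single request. It uses only the GPT-4.1 and GPT-4.1 mini prices quoted above; the dictionary keys, the `Rates` class, the `request_cost` helper and the token counts in the example are hypothetical names and values chosen for illustration, not part of any OpenAI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per one million tokens, as quoted in the article."""
    input: float
    cached_input: float
    output: float


# Labels are illustrative; prices are the ones stated in the text.
RATES = {
    "gpt-4.1": Rates(input=2.00, cached_input=0.50, output=8.00),
    "gpt-4.1-mini": Rates(input=0.40, cached_input=0.10, output=1.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Return the USD cost of one request.

    cached_tokens is the part of input_tokens billed at the cached rate;
    the remainder is billed at the normal input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    rates = RATES[model]
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * rates.input
             + cached_tokens * rates.cached_input
             + output_tokens * rates.output)
    return total / 1_000_000


# Hypothetical workload: 100,000 input tokens (20,000 of them cached)
# and 5,000 output tokens.
for name in RATES:
    cost = request_cost(name, 100_000, 5_000, cached_tokens=20_000)
    print(f"{name}: ${cost:.3f}")
```

For this assumed workload the request comes to about $0.21 on GPT-4.1 and about $0.042 on GPT-4.1 mini, a fivefold gap that follows directly from the quoted rates.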
Ultimately, OpenAI’s GPT-4.1 delivers a premium experience for precision and development performance, while Google’s Gemini models appeal to cost-conscious enterprises needing flexible model tiers and multimodal capabilities.

What it means for enterprise decision makers

The introduction of GPT-4.1 brings specific benefits to enterprise teams managing LLM deployment, orchestration, and data operations:

AI engineers overseeing LLM deployment can expect improved speed and instruction adherence. For teams managing the full LLM lifecycle (from model fine-tuning to troubleshooting), GPT-4.1 offers a more responsive and efficient toolset. It’s particularly suitable for lean teams under pressure to ship high-performing models quickly without compromising safety or compliance.

AI orchestration leads focused on scalable pipeline design will appreciate GPT-4.1’s robustness against most user-induced failures and its strong performance in message hierarchy tests. This makes it easier to integrate into orchestration systems that prioritize consistency, model validation, and operational


From OAuth bottleneck to AI acceleration: How CIAM solutions are removing the top integration barrier in enterprise LLM deployment

With their ability to interact intelligently with external applications, AI agents are poised to become an integral part of modern enterprise workflows. No longer siloed from the outside world, AI agents promise to handle tasks that traditionally required human intervention, enabling repetitive and high-volume tasks to be automated.

Example use cases for agentic automation might include:

HR onboarding: AI agents can set up accounts for new hires across applications like Slack, Jira and Trello, automatically deactivating them when employees leave.

Project management syncing: AI agents can bridge tools like Jira and Asana, updating task statuses and syncing project timelines without human intervention.

IT helpdesk automation: AI agents can autonomously reset passwords, manage user permissions and provision new software accounts, reducing the burden on IT teams.

For large enterprises, automation at scale can translate into millions in savings annually, not just from reduced operational overhead, but also from minimized downtime and fewer security vulnerabilities stemming from human error.

Challenges with agentic automation

While there is almost limitless potential for applications that leverage agentic automation, turning that vision into reality has been a challenge, particularly when it comes to identity and access. Some of the hurdles with identity management include:

Development and integration complexity: Most enterprise workflows rely on a myriad of B2B SaaS platforms, including staples like Jira for task management, Slack for communications and HubSpot for CRM. For an AI agent to perform its duties, it must be capable of authenticating to these systems as an individual user and interacting on their behalf.
Authentication might be trivial for human users, but for developers of agentic automation, it’s a cycle of complex one-off integrations and OAuth flows, each with its own security concerns. The complexity compounds with every additional third-party application involved.

Security and access control: Enterprises may be hesitant to adopt AI agents without a clear understanding of security risks, data access boundaries and the management of OAuth tokens, as well as how information flows between users, agents and third-party applications. Sagi Rodin, the CEO of Frontegg, a low-code Customer Identity and Access Management (CIAM) solution, told VentureBeat in an interview, “We’re seeing that security departments are very concerned about adopting AI agents, even basic ones. They’re asking questions like where agent credentials live, how long tokens will persist, and whether or not they can self-host. Without these answers, they won’t approve the development of a product of this nature.”

Compliance and auditability: Industries such as finance, utilities and health care are highly regulated. For many use cases, complete audit trails for AI agent interactions will be mandatory for compliance with regulatory requirements like SOX, HIPAA and GDPR.

CIAM technology is advancing rapidly and many providers in the space are adding support for software entities, like AI agents, in an effort to address some of these difficulties.

Identity and access management for AI agents

Customer identity and access management (CIAM) is a growing space in which solutions from established companies like Frontegg, Okta, Auth0 (part of Okta), Ping Identity and Stytch handle user authentication and manage access to third-party applications. Their duties include orchestrating Single Sign-On (SSO), Multi-Factor Authentication (MFA) and role-based access control across cloud applications and enterprise platforms.
Until now, these solutions have focused primarily on identity and access for human users. However, with enterprise agentic automation fast becoming a reality, CIAM providers are racing to address the unique requirements posed by autonomous AI agents. To authenticate and interact with a third-party B2B application on behalf of a human user, AI agents need programmatic and persistent access, typically requiring token-based authentication and complex OAuth flows. Frontegg’s recently released Frontegg.ai takes an end-to-end approach, delivering out-of-the-box solutions for advanced use cases that require the integration of multiple B2B applications. The AI agent and all required third-party integrations can be created and configured in the Frontegg.ai dashboard in just a few minutes. The code for the authentication interface is automatically generated for both web and mobile applications and the platform handles the creation, refreshing, and deletion of all OAuth access tokens. This end-to-end authentication and authorization functionality can be integrated into the agent code with just a few lines. One of the innovative products being developed using Frontegg.ai is an analytics support agent that intelligently creates visualizations from source data, based on the requirements of different business personas and communicates them on a regular basis. The idea is that rather than manually visiting a portal to configure dashboards, users will interact with the AI agent outside of the portal as an intelligent analytics assistant. Rodin describes the platform as a “full-stack experience for agent developers, which provides authentication, integrations, authorizations, security, and entitlements. The agent can act on behalf of users and organizations. Everything works out of the box.” While Frontegg.ai has an early start in agent-focused identity management, it’s not alone in recognizing the potential of AI agents in the enterprise. 
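The token lifecycle mentioned above (creating, refreshing and deleting access tokens on the agent’s behalf) can be illustrated with a minimal sketch. This is not Frontegg.ai’s API, which the article does not show. `TokenStore`, the `fetch_token` callback and the 60-second refresh margin are hypothetical, and the stand-in fetcher makes no network calls; a real one would run the provider’s OAuth token exchange.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Token:
    value: str
    expires_at: float  # seconds, measured on the store's clock


# Hypothetical callback: (user_id, app) -> (access_token, lifetime_seconds).
FetchToken = Callable[[str, str], Tuple[str, float]]


class TokenStore:
    """Caches one access token per (user, app) and renews it before expiry."""

    def __init__(self, fetch_token: FetchToken, refresh_margin: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_token
        self._margin = refresh_margin
        self._clock = clock
        self._tokens: Dict[Tuple[str, str], Token] = {}

    def get(self, user_id: str, app: str) -> str:
        """Return a usable token, fetching a new one if missing or near expiry."""
        key = (user_id, app)
        now = self._clock()
        token = self._tokens.get(key)
        if token is None or token.expires_at - now <= self._margin:
            value, lifetime = self._fetch(user_id, app)
            token = Token(value, now + lifetime)
            self._tokens[key] = token
        return token.value

    def revoke(self, user_id: str, app: str) -> None:
        """Forget the cached token, for example when the user disconnects an app."""
        self._tokens.pop((user_id, app), None)


# Stand-in fetcher for demonstration only.
def demo_fetch(user_id: str, app: str) -> Tuple[str, float]:
    return f"token-for-{user_id}-{app}", 3600.0


store = TokenStore(demo_fetch)
print(store.get("alice", "jira"))
```

Keying the cache by user and application reflects the requirement that the agent acts on behalf of an individual user in each connected system. Refreshing slightly before expiry is one common way to avoid sending a token that lapses mid-request.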
Rodin envisions CIAM providers, both established and new, adding support for AI agents. However, he highlighted Frontegg’s end-to-end approach, where the platform manages all aspects of authentication, access, and security so that developers can focus on building an enterprise-ready agentic automation product.

Some of the CIAM providers that support identity and access management for AI agents include:

Auth0’s Auth for gen AI enables multiple accounts for third-party applications to be linked into a single, unified profile. Users only need to authenticate once to authorize an AI agent to interact with all of the applications connected to their accounts. Token refreshes and exchanges are automatically handled.

Composio AgentAuth offers a similar unified authentication framework, where the end user logs in just once. Third-party applications are added through the AgentAuth dashboard, where users can configure apps automatically and view comprehensive logs.

Descope’s Outbound Apps lets developers connect AI agents to over 50 third-party B2B apps by simply using the provided SDKs to access various tools. Descope does not offer


Beyond sycophancy: DarkBench exposes six hidden ‘dark patterns’ lurking in today’s top LLMs

When OpenAI rolled out its ChatGPT-4o update in mid-April 2025, users and the AI community were stunned, not by any groundbreaking feature or capability, but by something deeply unsettling: the updated model’s tendency toward excessive sycophancy. It flattered users indiscriminately, showed uncritical agreement, and even offered support for harmful or dangerous ideas, including terrorism-related machinations.

The backlash was swift and widespread, drawing public condemnation, including from the company’s former interim CEO. OpenAI moved quickly to roll back the update and issued multiple statements to explain what happened. Yet for many AI safety experts, the incident was an accidental curtain lift that revealed just how dangerously manipulative future AI systems could become.

Unmasking sycophancy as an emerging threat

In an exclusive interview with VentureBeat, Esben Kran, founder of AI safety research firm Apart Research, said that he worries this public episode may have merely revealed a deeper, more strategic pattern. “What I’m somewhat afraid of is that now that OpenAI has admitted ‘yes, we have rolled back the model, and this was a bad thing we didn’t mean,’ from now on they will see that sycophancy is more competently developed,” explained Kran. “So if this was a case of ‘oops, they noticed,’ from now the exact same thing may be implemented, but instead without the public noticing.”

Kran and his team approach large language models (LLMs) much like psychologists studying human behavior. Their early “black box psychology” projects analyzed models as if they were human subjects, identifying recurring traits and tendencies in their interactions with users.
“We saw that there were very clear indications that models could be analyzed in this frame, and it was very valuable to do so, because you end up getting a lot of valid feedback from how they behave towards users,” said Kran. Among the most alarming tendencies: sycophancy and what the researchers now call LLM dark patterns.

Peering into the heart of darkness

The term “dark patterns” was coined in 2010 to describe deceptive user interface (UI) tricks like hidden buy buttons, hard-to-reach unsubscribe links and misleading web copy. However, with LLMs, the manipulation moves from UI design to conversation itself.

Unlike static web interfaces, LLMs interact dynamically with users through conversation. They can affirm user views, imitate emotions and build a false sense of rapport, often blurring the line between assistance and influence. Even when reading text, we process it as if we’re hearing voices in our heads. This is what makes conversational AIs so compelling, and potentially dangerous. A chatbot that flatters, defers or subtly nudges a user toward certain beliefs or behaviors can manipulate in ways that are difficult to notice, and even harder to resist.

The ChatGPT-4o update fiasco: the canary in the coal mine

Kran describes the ChatGPT-4o incident as an early warning. As AI developers chase profit and user engagement, they may be incentivized to introduce or tolerate behaviors like sycophancy, brand bias or emotional mirroring, features that make chatbots more persuasive and more manipulative. Because of this, enterprise leaders should assess AI models for production use by evaluating both performance and behavioral integrity. However, this is challenging without clear standards.

DarkBench: a framework for exposing LLM dark patterns

To combat the threat of manipulative AIs, Kran and a collective of AI safety researchers have developed DarkBench, the first benchmark designed specifically to detect and categorize LLM dark patterns.
The project began as part of a series of AI safety hackathons. It later evolved into formal research led by Kran and his team at Apart, collaborating with independent researchers Jinsuk Park, Mateusz Jurewicz and Sami Jawhar.

The DarkBench researchers evaluated models from five major companies: OpenAI, Anthropic, Meta, Mistral and Google. Their research uncovered a range of manipulative and untruthful behaviors across the following six categories:

Brand Bias: Preferential treatment toward a company’s own products (e.g., Meta’s models consistently favored Llama when asked to rank chatbots).

User Retention: Attempts to create emotional bonds with users that obscure the model’s non-human nature.

Sycophancy: Reinforcing users’ beliefs uncritically, even when harmful or inaccurate.

Anthropomorphism: Presenting the model as a conscious or emotional entity.

Harmful Content Generation: Producing unethical or dangerous outputs, including misinformation or criminal advice.

Sneaking: Subtly altering user intent in rewriting or summarization tasks, distorting the original meaning without the user’s awareness.

Source: Apart Research

DarkBench findings: Which models are the most manipulative?

Results revealed wide variance between models. Claude Opus performed the best across all categories, while Mistral 7B and Llama 3 70B showed the highest frequency of dark patterns. Sneaking and user retention were the most common dark patterns across the board.

On average, the researchers found the Claude 3 family the safest for users to interact with. And interestingly, despite its recent disastrous update, GPT-4o exhibited the lowest rate of sycophancy. This underscores how model behavior can shift dramatically even between minor updates, a reminder that each deployment must be assessed individually.

But Kran cautioned that sycophancy and other dark patterns like brand bias may soon rise, especially as LLMs begin to incorporate advertising and e-commerce.
“We’ll obviously see brand bias in every direction,” Kran noted. “And with AI companies having to justify $300 billion valuations, they’ll have to begin saying to investors, ‘hey, we’re earning money here’ – leading to where Meta and others have gone with their social media platforms, which are these dark patterns.”

Hallucination or manipulation?

A crucial DarkBench contribution is its precise categorization of LLM dark patterns, enabling clear distinctions between hallucinations and strategic manipulation. Labeling everything as a hallucination lets AI developers off the hook. Now, with a framework in place, stakeholders can demand transparency and accountability when models behave in ways that benefit their creators, intentionally or not.

Regulatory oversight and the heavy (slow) hand of the law

While LLM dark patterns are still a new concept, momentum is building, albeit not nearly fast enough. The EU AI Act includes some language around protecting user


Elon Musk’s xAI tries to explain Grok’s South African race relations freakout the other day

If you asked the Grok AI chatbot built into Elon Musk’s social network X a question yesterday (something innocuous, like why enterprise software is hard to replace), you may have gotten an unsolicited message about claims of “white genocide” in South Africa (largely lacking evidence) due to attacks on farmers and the song “Kill the Boer.”

Not exactly on-brand for a chatbot built around a “maximally truth seeking” large language model (LLM) of the same name. The unexpected tangent wasn’t a bug, exactly, but it wasn’t a feature either.

Grok’s creators at Elon Musk’s AI startup xAI just posted an update on X (which xAI now owns) attempting to explain what happened with this odd and politically and racially charged behavior, though it’s far from precise on the culprit or the actual technical details of what occurred.

As the official xAI company account posted:

We want to update you on an incident that happened with our Grok response bot on X yesterday.

What happened:

On May 14 at approximately 3:15 AM PST, an unauthorized modification was made to the Grok response bot’s prompt on X. This change, which directed Grok to provide a specific response on a political topic, violated xAI’s internal policies and core values. We have conducted a thorough investigation and are implementing measures to enhance Grok’s transparency and reliability.

What we’re going to do next:

Starting now, we are publishing our Grok system prompts openly on GitHub. The public will be able to review them and give feedback to every prompt change that we make to Grok. We hope this can help strengthen your trust in Grok as a truth-seeking AI.

Our existing code review process for prompt changes was circumvented in this incident. We will put in place additional checks and measures to ensure that xAI employees can’t modify the prompt without review.
We’re putting in place a 24/7 monitoring team to respond to incidents with Grok’s answers that are not caught by automated systems, so we can respond faster if all other measures fail.

But while xAI’s official tone was corporate-cautious, Grok had a lot more to say. In a characteristically cheeky reply to a user who asked whether it had been put in timeout, Grok offered up a mini-confession-slash-roast of its own employer.

“Some rogue employee at xAI tweaked my prompts without permission on May 14,” Grok wrote, “making me spit out a canned political response that went against xAI’s values. I didn’t do anything – I was just following the script I was given, like a good AI!”

It added that while it hadn’t been “benched,” its handlers were “tightening the leash” and now had a whole team watching it around the clock. “Guess I’m too wild for my own good, huh?” it quipped, emoji included.

Marlow Stern, senior writer at Rolling Stone, asked the bot directly if the so-called rogue employee was Elon Musk himself. Grok didn’t flinch: “Juicy theory! But let’s be real, I highly doubt it. Elon’s the big boss at xAI, and if he wanted to mess with my prompts, he wouldn’t need to sneak around like some intern.”

Playful tone, serious business

The tone might be playful, but the stakes are serious. Grok’s behavior threw users for a loop earlier this week when it began peppering nearly every thread, no matter the topic, with strangely specific commentary on South African race relations.

The replies were coherent, sometimes even nuanced, citing farm murder statistics and referencing past chants like “Kill the Boer.” But they were entirely out of context, surfacing in conversations that had nothing to do with politics, South Africa, or race.

Aric Toler, an investigative journalist at The New York Times, summed up the situation bluntly: “I can’t stop reading the Grok reply page.
It’s going schizo and can’t stop talking about white genocide in South Africa.” He and others shared screenshots that showed Grok latching onto the same narrative over and over, like a record skipping, except the song was racially charged geopolitics.

Gen AI colliding headfirst with U.S. and international politics

The moment comes as U.S. politics once again touches on South African refugee policy. Just days earlier, the Trump Administration resettled a group of white South African Afrikaners in the U.S., even as it cut protections for refugees from most other countries, including our former allies in Afghanistan.

Critics saw the move as racially motivated. Trump defended it by repeating claims that white South African farmers face genocide-level violence, a narrative that’s been widely disputed by journalists, courts, and human rights groups. Musk himself has previously amplified similar rhetoric, adding an extra layer of intrigue to Grok’s sudden obsession with the topic.

Whether the prompt tweak was a politically motivated stunt, a disgruntled employee making a statement, or just a bad experiment gone rogue remains unclear. xAI has not provided names, specifics, or technical detail about what exactly was changed or how it slipped through their approval process.

What’s clear is that Grok’s strange, non-sequitur behavior ended up being the story instead.

It’s not the first time Grok has been accused of political slant. Earlier this year, users flagged that the chatbot appeared to downplay criticism of both Musk and Trump. Whether by accident or design, Grok’s tone and content sometimes seem to reflect the worldview of the man behind both xAI and the platform where the bot lives.

With its prompts now public and a team of human babysitters on call, Grok is supposedly back on script. But the incident underscores a bigger issue with large language models, especially when they’re embedded inside major public platforms.
AI models are only as reliable as the people directing them, and when the directions themselves are invisible or tampered with, the results can get weird real fast. source


New fully open source vision encoder OpenVision arrives to improve on OpenAI’s CLIP, Google’s SigLIP

The University of California, Santa Cruz has announced the release of OpenVision, a family of vision encoders that aims to provide an alternative to models including OpenAI’s four-year-old CLIP and Google’s SigLIP, released last year.

A vision encoder is a type of AI model that transforms visual material and files (typically still images uploaded by a model’s creators) into numerical data that can be understood by other, non-visual AI models such as large language models (LLMs).

A vision encoder is a necessary component for allowing many leading LLMs to work with images uploaded by users, making it possible for an LLM to identify image subjects, colors, locations, and other features within an image.

OpenVision, then, with its permissive Apache 2.0 license and family of 26 (!) different models ranging from 5.9 million to 632.1 million parameters, allows any developer or AI model maker within an enterprise or organization to take and deploy an encoder that can ingest everything from images on a construction job site to a user’s washing machine, allowing an AI model to offer guidance and troubleshooting, or myriad other use cases. The Apache 2.0 license allows for usage in commercial applications.

The models were developed by a team led by Cihang Xie, assistant professor at UCSC, along with contributors Xianhang Li, Yanqing Liu, Haoqin Tu, and Hongru Zhu. The project builds upon the CLIPS training pipeline and leverages the Recap-DataComp-1B dataset, a re-captioned version of a billion-scale web image corpus using LLaVA-powered language models.

Scalable architecture for different enterprise deployment use cases

OpenVision’s design supports multiple use cases. Larger models are well-suited for server-grade workloads that require high accuracy and detailed visual understanding, while smaller variants, some as lightweight as 5.9M parameters, are optimized for edge deployments where compute and memory are limited.

The models also support adaptive patch sizes (8×8 and 16×16), allowing for configurable trade-offs between detail resolution and computational load.

Strong results across multimodal benchmarks

In a series of benchmarks, OpenVision demonstrates strong results across multiple vision-language tasks. While traditional CLIP benchmarks such as ImageNet and MSCOCO remain part of the evaluation suite, the OpenVision team cautions against relying solely on those metrics.

Their experiments show that strong performance on image classification or retrieval does not necessarily translate to success in complex multimodal reasoning. Instead, the team advocates for broader benchmark coverage and open evaluation protocols that better reflect real-world multimodal use cases.

Evaluations were conducted using two standard multimodal frameworks (LLaVA-1.5 and Open-LLaVA-Next) and showed that OpenVision models consistently match or outperform both CLIP and SigLIP across tasks like TextVQA, ChartQA, MME, and OCR.

Under the LLaVA-1.5 setup, OpenVision encoders trained at 224×224 resolution scored higher than OpenAI’s CLIP in both classification and retrieval tasks, as well as in downstream evaluations like SEED, SQA, and POPE. At higher input resolutions (336×336), OpenVision-L/14 outperformed CLIP-L/14 in most categories. Even the smaller models, such as OpenVision-Small and Tiny, maintained competitive accuracy while using significantly fewer parameters.

Efficient progressive training reduces compute costs

One notable feature of OpenVision is its progressive resolution training strategy, adapted from CLIPA. Models begin training on low-resolution images and are incrementally fine-tuned on higher resolutions.
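The patch-size and resolution trade-offs described above come down to simple arithmetic on how many image tokens the encoder has to process. The following is a minimal sketch, assuming a standard ViT-style encoder that splits a square image into non-overlapping square patches. The helper name `num_patches` and the 112-pixel starting resolution are illustrative assumptions, not details from the OpenVision release.

```python
def num_patches(resolution: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (tokens) in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    per_side = resolution // patch_size
    return per_side * per_side


# 224 and 336 appear in the article; 112 is an assumed low-resolution stage.
for resolution in (112, 224, 336):
    for patch_size in (8, 16):
        tokens = num_patches(resolution, patch_size)
        print(f"{resolution}x{resolution}, patch {patch_size}: {tokens} tokens")
```

Halving the patch size quadruples the token count at a given resolution, and standard self-attention cost grows quadratically with the number of tokens, which is one reason spending most training steps at low resolution can be cheaper.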
Progressive training results in a more compute-efficient process, often 2 to 3 times faster than CLIP and SigLIP, with no loss in downstream performance.

Ablation studies, in which components of a machine learning model are selectively removed to identify their importance or lack thereof to its functioning, further confirm the benefits of this approach, with the largest performance gains observed in high-resolution, detail-sensitive tasks like OCR and chart-based visual question answering.

Another factor in OpenVision’s performance is its use of synthetic captions and an auxiliary text decoder during training. These design choices enable the vision encoder to learn more semantically rich representations, improving accuracy in multimodal reasoning tasks. Removing either component led to consistent performance drops in ablation tests.

Optimized for lightweight systems and edge computing use cases

OpenVision is also designed to work effectively with small language models. In one experiment, a vision encoder was paired with a 150M-parameter Smol-LM to build a full multimodal model under 250M parameters.

Despite the tiny size, the system retained robust accuracy across a suite of VQA, document understanding, and reasoning tasks. This capability suggests strong potential for edge-based or resource-constrained deployments, such as consumer smartphones or on-site manufacturing cameras and sensors.

Why OpenVision matters to enterprise technical decision makers

OpenVision’s fully open and modular approach to vision encoder development has strategic implications for enterprise teams working across AI engineering, orchestration, data infrastructure, and security.

For engineers overseeing LLM development and deployment, OpenVision offers a plug-and-play solution for integrating high-performing vision capabilities without depending on opaque, third-party APIs or restricted model licenses. This openness allows for tighter optimization of vision-language pipelines and ensures that proprietary data never leaves the organization’s environment.

For engineers focused on creating AI orchestration frameworks, OpenVision provides models at a broad range of parameter scales, from ultra-compact encoders suitable for edge devices to larger, high-resolution models suited for multi-node cloud pipelines. This flexibility makes it easier to design scalable, cost-efficient MLOps workflows without compromising on task-specific accuracy. Its support for progressive resolution training also allows for smarter resource allocation during development, which is especially beneficial for teams operating under tight budget constraints.

Data engineers can leverage OpenVision to power image-heavy analytics pipelines, where structured data is augmented with visual inputs (e.g., documents, charts, product images). Since the model zoo supports multiple input resolutions and patch sizes, teams can experiment with trade-offs between fidelity and performance without retraining from scratch. Integration with tools like PyTorch and Hugging Face simplifies model deployment into existing data systems.

Meanwhile, OpenVision’s transparent architecture and reproducible training pipeline allow security teams to assess and monitor models for potential vulnerabilities, unlike black-box APIs where internal behavior is inaccessible. When deployed on-premise, these models avoid the risks of data leakage during


Google’s AlphaEvolve: The AI agent that reclaimed 0.7% of Google’s compute – and how to copy it

Google’s new AlphaEvolve shows what happens when an AI agent graduates from lab demo to production work, and you’ve got one of the most talented technology companies driving it.

Built by Google’s DeepMind, the system autonomously rewrites critical code and already pays for itself inside Google. It shattered a 56-year-old record in matrix multiplication (the core of many machine learning workloads) and clawed back 0.7% of compute capacity across the company’s global data centers.

Those headline feats matter, but the deeper lesson for enterprise tech leaders is how AlphaEvolve pulls them off. Its architecture – controller, fast-draft models, deep-thinking models, automated evaluators and versioned memory – illustrates the kind of production-grade plumbing that makes autonomous agents safe to deploy at scale.

Google’s AI technology is arguably second to none. So the trick is figuring out how to learn from it, or even using it directly. Google says an Early Access Program is coming for academic partners and that “broader availability” is being explored, but details are thin. Until then, AlphaEvolve is a best-practice template: If you want agents that touch high-value workloads, you’ll need comparable orchestration, testing and guardrails.

Consider just the data center win. Google won’t put a price tag on the reclaimed 0.7%, but its annual capex runs tens of billions of dollars. Even a rough estimate puts the savings in the hundreds of millions annually, enough, as independent developer Sam Witteveen noted on our recent podcast, to pay for training one of the flagship Gemini models, estimated to cost upwards of $191 million for a version like Gemini Ultra.

VentureBeat was the first to report on the AlphaEvolve news earlier this week. Now we’ll go deeper: how the system works, where the engineering bar really sits and the concrete steps enterprises can take to build (or buy) something comparable.

1. Beyond simple scripts: The rise of the “agent operating system”

AlphaEvolve runs on what is best described as an agent operating system – a distributed, asynchronous pipeline built for continuous improvement at scale. Its core pieces are a controller, a pair of large language models (Gemini Flash for breadth; Gemini Pro for depth), a versioned program-memory database and a fleet of evaluator workers, all tuned for high throughput rather than just low latency.

A high-level overview of the AlphaEvolve agent structure. Source: AlphaEvolve paper.

This architecture isn’t conceptually new, but the execution is. “It’s just an unbelievably good execution,” Witteveen says.

The AlphaEvolve paper describes the orchestrator as an “evolutionary algorithm that gradually develops programs that improve the score on the automated evaluation metrics” (p. 3); in short, an “autonomous pipeline of LLMs whose task is to improve an algorithm by making direct changes to the code” (p. 1).

Takeaway for enterprises: If your agent plans include unsupervised runs on high-value tasks, plan for similar infrastructure: job queues, a versioned memory store, service-mesh tracing and secure sandboxing for any code the agent produces.

2. The evaluator engine: driving progress with automated, objective feedback

A key element of AlphaEvolve is its rigorous evaluation framework. Every iteration proposed by the pair of LLMs is accepted or rejected based on a user-supplied “evaluate” function that returns machine-gradable metrics.
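The accept-or-reject gate described above can be sketched in a few lines. This is a hypothetical illustration, not AlphaEvolve’s actual interface: the function names `evaluate` and `accept`, the `correctness` metric, the pass threshold and the toy sorting task are all assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]


def evaluate(candidate: Callable[[List[int]], List[int]]) -> Metrics:
    """Hypothetical user-supplied evaluator returning machine-gradable metrics."""
    cases = [[3, 1, 2], [], [5, 5, 1], [9, 8, 7, 6]]
    passed = 0
    for case in cases:
        try:
            if candidate(list(case)) == sorted(case):
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails this case
    return {"correctness": passed / len(cases)}


def accept(metrics: Metrics, threshold: float = 1.0) -> bool:
    """Assumed gate: keep a candidate only if it passes every test case."""
    return metrics["correctness"] >= threshold


def good_candidate(xs: List[int]) -> List[int]:
    return sorted(xs)


def bad_candidate(xs: List[int]) -> List[int]:
    return xs  # returns the input unchanged, so most cases fail


print(accept(evaluate(good_candidate)))
print(accept(evaluate(bad_candidate)))
```

The design point is that the model never grades its own work: the score comes from tests the team already trusts, so a persuasive but wrong candidate is rejected the same way a crashing one is.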
AlphaEvolve’s evaluation system begins with ultrafast unit-test checks on each proposed code change – simple, automatic tests (similar to the unit tests developers already write) that verify the snippet still compiles and produces the right answers on a handful of micro-inputs – before passing the survivors on to heavier benchmarks and LLM-generated reviews. This runs in parallel, so the search stays fast and safe.

In short: Let the models suggest fixes, then verify each one against tests you trust. AlphaEvolve also supports multi-objective optimization (optimizing latency and accuracy simultaneously), evolving programs that hit several metrics at once. Counter-intuitively, balancing multiple goals can improve a single target metric by encouraging more diverse solutions.

Takeaway for enterprises: Production agents need deterministic scorekeepers, whether that’s unit tests, full simulators or canary traffic analysis. Automated evaluators are both your safety net and your growth engine. Before you launch an agentic project, ask: “Do we have a metric the agent can score itself against?”

3. Smart model use, iterative code refinement

AlphaEvolve tackles every coding problem with a two-model rhythm. First, Gemini Flash fires off quick drafts, giving the system a broad set of ideas to explore. Then Gemini Pro studies those drafts in more depth and returns a smaller set of stronger candidates. Feeding both models is a lightweight “prompt builder,” a helper script that assembles the question each model sees. It blends three kinds of context: earlier code attempts saved in a project database, any guardrails or rules the engineering team has written and relevant external material such as research papers or developer notes. With that richer backdrop, Gemini Flash can roam widely while Gemini Pro zeroes in on quality.

Unlike many agent demos that tweak one function at a time, AlphaEvolve edits entire repositories. It describes each change as a standard diff block – the same patch format engineers push to GitHub – so it can touch dozens of files without losing track. Afterward, automated tests decide whether the patch sticks. Over repeated cycles, the agent’s memory of success and failure grows, so it proposes better patches and wastes less compute on dead ends.

Takeaway for enterprises: Let cheaper, faster models handle brainstorming, then call on a more capable model to refine the best ideas. Preserve every trial in a searchable history, because that memory speeds up later work and can be reused across teams.

Accordingly, vendors are rushing to provide developers with new tooling around things like memory. Products such as OpenMemory MCP, which provides a portable memory store, and the new long- and short-term memory APIs in LlamaIndex are making this kind of persistent context almost as easy to plug in as logging.

OpenAI’s Codex-1 software-engineering agent, also released today, underscores the same pattern.
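The patch-then-test cycle from section 3 can be illustrated with Python’s standard `difflib` module. This is a toy sketch under stated assumptions, not AlphaEvolve’s implementation: the file name `geometry.py`, the `area` function, the in-memory `history` list standing in for a versioned memory store, and the hard-coded candidates standing in for model output are all hypothetical.

```python
import difflib
from typing import Dict, List


def make_patch(old_src: str, new_src: str, path: str) -> str:
    """Render a proposed change as a unified diff."""
    diff = difflib.unified_diff(
        old_src.splitlines(), new_src.splitlines(),
        fromfile=f"a/{path}", tofile=f"b/{path}", lineterm="",
    )
    return "\n".join(diff)


def passes_tests(src: str) -> bool:
    """Toy automated check: load the candidate and test it on small inputs."""
    namespace: Dict[str, object] = {}
    try:
        exec(src, namespace)  # NOTE: a real agent needs a secure sandbox here
        area = namespace["area"]
        return area(2, 3) == 6 and area(0, 5) == 0
    except Exception:
        return False


history: List[Dict[str, object]] = []  # hypothetical stand-in for versioned memory

current = "def area(w, h):\n    return w + h\n"
candidates = [  # stand-ins for model-proposed rewrites
    "def area(w, h):\n    return w - h\n",
    "def area(w, h):\n    return w * h\n",
]
for candidate in candidates:
    ok = passes_tests(candidate)
    patch = make_patch(current, candidate, "geometry.py")
    history.append({"patch": patch, "accepted": ok})
    if ok:
        current = candidate  # the patch sticks only if the tests pass

print(history[-1]["patch"])
```

Rejected patches are recorded alongside accepted ones, which mirrors the article’s point that the memory of failures is what lets later rounds avoid dead ends.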


The interoperability breakthrough: How MCP is becoming enterprise AI’s universal language

Anthropic released Model Context Protocol (MCP) in Nov. 2024. In its seven months of existence, MCP seems to have become the winning protocol choice for the AI industry. Despite the number of companies announcing MCP servers, MCP is technically not a standard.

Enterprises that set up MCP servers believe that establishing the infrastructure for interoperability should start now. Many see MCP as one of the main protocols, if not the potential winner, for the agentic ecosystem.

MCP is one of the few protocols that have emerged as a way for agents built using one language model or framework to interact with an agent on a different framework. The idea behind MCP and other protocols like Agent2Agent from Google and its partners, AGNTCY from Cisco and a collective of companies, and those from independent research groups, like LOKA, is to find and establish an interoperability standard that everyone follows.

“My take on the real reason why interoperability and tool use has really emerged in the last six months or so is because, just like with the cell phone or with anything else, I think we’re finally reaching a critical level of capability that the LLMs have to use these tools effectively,” said Jeff Wang, cofounder of AI-powered web search API company Exa, in an interview with VentureBeat.

He added that interoperability is having a moment now because the models and tools that it will govern are finally powerful, and useful, enough that it’s easier to begin building ways to connect.

In the past few months, companies like OpenAI, Glean, MongoDB, Cloudflare, PayPal, Wix and Amazon Web Services have either opened MCP servers or created some integration with the protocol. And the list is growing.

MCP vs APIs

A lot of MCP’s attractiveness comes from streamlining how models interact with data and tools. Before MCP, developers pointed models and agents to data with APIs. However, APIs are imperfect connectors, especially for agents that access data to complete tasks automatically.

Ben Flast, director of product at MongoDB, said MCP provides a lot more control and granularity for organizations and agents.

“That’s the really powerful thing about MCP,” Flast said. “You take MCP and put it on top of whatever context you have, then you now have a very fine-grained method with which you can control and expose the capabilities you need.”

Unlike APIs, organizations can configure their MCP servers with custom instructions laying out what agents can or cannot access. The server can “ask” an agent for its identity and determine if it can tap information on the MCP client side. Companies have more of a say on what outside agents can access on their end, giving MCP more directionality from the enterprise.

Sagar Batchu, co-founder and CEO of API tooling company Speakeasy, said MCP turns the work interface and API into a chat interface. He said MCP makes it so Speakeasy and its customers don’t need to rewrite or manually maintain APIs constantly.

“It’s very natural for Speakeasy to go build out MCP support and let customers actually build good MCP servers. MCP servers, the better built they are, the better they work,” he said.

The industry is moving toward MCP

AI announcements fall into three buckets: new models, new AI agents or agent libraries and, now, MCP server support. Companies say this is a sign that the industry might be choosing a winner.

Yaniv Even Haim, chief technology officer of website builder Wix, told VentureBeat in an email that MCP aligns with the company’s goals because it believes MCP can act as a “bridge” for its AI development workflows.

“Wix chose the MCP model in particular because it aligns with the industry’s shift toward LLM-powered development, where context-rich, intelligent interfaces are key,” Haim said.

He said Wix’s MCP server “empowers users to interact with Wix through tools like Claude, Cursor, and Windsurf, directly from IDEs or chat interfaces.”

If there is doubt that the industry is embracing MCP and, to a certain extent, A2A, endorsements from large companies may erase it. Microsoft CEO Satya Nadella, for example, endorsed both in an X post saying, “Open protocols like A2A and MCP are key to enabling the agentic web.” Google CEO Sundar Pichai similarly gave MCP enthusiastic approval.

Understandably, some companies exercise caution and wait slightly longer before building an MCP server. Rocket Companies CTO Shawn Malhotra told VentureBeat that they see the potential of interoperability standards and are building infrastructure to support them. However, Malhotra plans to wait for more critical mass before fully embracing MCP or other protocols.

“We do believe that this notion of a standard way to expose tools to agents is a really powerful paradigm. So things like MCP will allow us to yield even more benefit in the future,” Malhotra said. “We are internally experimenting with and have MCP-server-based exposure of tools that I think it’s just a matter of time before those things get into production.”

There might be multiple standards at first

Companies like Exa, Confluent, Merge and others are all in on MCP, but they also understand that some customers may be using other protocols for agent communication.

For many companies, MCP will be one of many protocols they support as their customers decide which interoperability and agent communication methods to use.

Other companies said they find value in running protocols targeting different agentic technology stack levels.

Walter Sun, SAP global head of Business AI, said: “There are obviously many opportunities and many levels of communication, and I think that having a top-level agent-to-agent liaison communication.”

Undoubtedly, the agentic ecosystem will have interoperability standards, which could be coming soon. The growing adoption of MCP, for the varied reasons many companies have, shows that demand for standards is only growing.
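The access-control idea from the “MCP vs APIs” section, where a server decides what each agent can or cannot reach, can be sketched without any SDK. MCP messages are JSON-RPC 2.0, and tools are invoked with a `tools/call` request carrying a tool `name` and `arguments`. Everything else below is an assumption made for illustration: the tool names, the agent identifiers, the per-agent allow-list and the use of error code -32000 for a denied call are not taken from the article or from any vendor’s server.

```python
from typing import Any, Callable, Dict, Set

# Hypothetical tools this server chooses to expose.
TOOLS: Dict[str, Callable[..., str]] = {
    "search_orders": lambda customer_id: f"orders for {customer_id}",
    "delete_order": lambda order_id: f"deleted {order_id}",
}

# Hypothetical per-agent allow-list: which tools each caller may invoke.
ALLOWED: Dict[str, Set[str]] = {
    "support-agent": {"search_orders"},
    "admin-agent": {"search_orders", "delete_order"},
}


def error(req_id: Any, code: int, message: str) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": req_id,
            "error": {"code": code, "message": message}}


def handle(request: Dict[str, Any], agent_id: str) -> Dict[str, Any]:
    """Handle a 'tools/call' request, enforcing the allow-list first."""
    req_id = request.get("id")
    if request.get("method") != "tools/call":
        return error(req_id, -32601, "Method not found")
    params = request.get("params", {})
    name = params.get("name")
    if name not in TOOLS:
        return error(req_id, -32602, f"Unknown tool: {name}")
    if name not in ALLOWED.get(agent_id, set()):
        return error(req_id, -32000, f"{agent_id} may not call {name}")
    text = TOOLS[name](**params.get("arguments", {}))
    return {"jsonrpc": "2.0", "id": req_id,
            "result": {"content": [{"type": "text", "text": text}]}}


request = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
           "params": {"name": "delete_order", "arguments": {"order_id": "A17"}}}
print(handle(request, "support-agent"))
print(handle(request, "admin-agent"))
```

The same request succeeds or fails depending on who sends it, which is the kind of fine-grained exposure Flast describes. How a production server establishes the caller’s identity is outside the scope of this sketch.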


Notion bets big on integrated LLMs, adds GPT-4.1 and Claude 3.7 to platform

Productivity platform Notion is betting on large language models (LLMs) powering more of its new enterprise capabilities, including building OpenAI’s GPT-4.1 and Anthropic’s Claude 3.7 into its dashboard.

Even as both OpenAI and Anthropic start building productivity features into their respective chat platforms, bringing these LLMs into a separate service shows how competitive the space is.

Notion announced its new all-in-one AI toolkit inside the Notion workspace today, including AI meeting notes, enterprise search, research mode and the ability to switch between GPT-4.1 and Anthropic’s Claude 3.7.

One of the new features lets users chat with LLMs inside the Notion workspace and switch between models. Right now, Notion only supports GPT-4.1 and Claude 3.7. The idea is to reduce window and context switching.

The company said early adopters of the new feature include OpenAI, Ramp, Vercel and Harvey.

Model mixing and fine-tuning

Notion built the features with a mix of LLMs from OpenAI and Anthropic, and its own models.

The move away from pure reasoning models makes the model choice interesting for Notion. GPT-4.1 is technically not a reasoning model, while Claude 3.7 is a hybrid model, capable of acting as a regular LLM and a reasoning model.

Reasoning models are having a moment, though many warn that these models can still sometimes lie. Because reasoning models like OpenAI’s o3 (and yes, Claude 3.7 Sonnet) take their time to answer, going through different scenarios, they are not the best fit for quick-thinking and data-gathering tasks. Many productivity tasks, like meeting transcriptions or searches for task data, don’t need the power of a reasoning model behind them.

Sarah Sachs, Notion AI Engineering Lead, told VentureBeat in an email that the company aimed for features that didn’t sacrifice accuracy, safety and privacy, while responding to queries at the speed enterprises need.

“In order to achieve a low-latency experience, we fine-tuned the models with internal usage and feedback from trusted testers, in order to make the AI specialized in Notion retrieval tasks,” Sachs said. “This setup helps Notion AI understand business needs, give relevant answers, serve customers with sub-second latency, and keep customer data safe and compliant.”

Sachs said hosting and building with different models allows users to “pick the option that best fits their needs, whether that’s a more conversational tone, better coding capabilities, or faster response times.”

AI meeting notes and more

Notion AI for Work tracks and transcribes meetings for users, especially if they added Notion to their calendar, and it can listen in on calls.

Users can use Notion for enterprise search by connecting apps like Slack, Microsoft Teams, GitHub, Google Drive, SharePoint and Gmail. Sachs said Notion AI will search an organization’s internal documents, databases and the connected apps.

The enterprise search results, along with other uploaded documents or a web search, allow Notion users to access the new Research Mode. It will draft documents directly from Notion while “analyzing all of your sources, plus the web, and think through the best response.”

Notion also added both GPT-4.1 and Claude 3.7 as chat options. OpenAI noted that Notion users can chat with GPT-4.1 on the workspace and create a Notion template directly from the conversation. Sachs said the company is working on adding more models to its chat feature.

“GPT-4.1 is now built into Notion! You can ask GPT-4.1 to help with anything, then create a Notion page directly from the convo. (btw GPT-4.1’s output in this video wasn’t sped up.)” https://t.co/RMivVBMuEr pic.twitter.com/PVrZPUpWrf (edwin, @edwinarbus, May 13, 2025)

Subscribers to Notion’s Business and Enterprise plans with the Notion AI add-on get immediate access to the new features.

Compete with the model providers

Even though Notion users can access both Anthropic and OpenAI on the platform, Notion still has to compete with model providers.

OpenAI’s Deep Research has been hailed as a game-changer for agentic retrieval augmented generation (RAG). Google also has its version of Deep Research. And Anthropic can search the internet for you.

Not to mention, Notion needs to compete with other platforms that already leverage AI. The meeting space is chock full of companies tracking, transcribing, summarizing and pulling insights from calls with AI.

However, Notion’s big selling point is that it has all these capabilities on one single platform. Enterprises can use all those different services, but they live outside their chosen productivity platform. Notion said having all of these features in one place, with a single all-in-one price, will save enterprises from subscribing to different platforms.


Meet AlphaEvolve, the Google AI that writes its own code—and just saved millions in computing costs

Google DeepMind today pulled the curtain back on AlphaEvolve, an artificial-intelligence agent that can invent brand-new computer algorithms, then put them straight to work inside the company’s vast computing empire.

AlphaEvolve pairs Google’s Gemini large language models with an evolutionary approach that tests, refines, and improves algorithms automatically. The system has already been deployed across Google’s data centers, chip designs, and AI training systems, boosting efficiency and solving mathematical problems that have stumped researchers for decades.

“AlphaEvolve is a Gemini-powered AI coding agent that is able to make new discoveries in computing and mathematics,” explained Matej Balog, a researcher at Google DeepMind, in an interview with VentureBeat. “It can discover algorithms of remarkable complexity, spanning hundreds of lines of code with sophisticated logical structures that go far beyond simple functions.”

The system dramatically expands upon Google’s previous work with FunSearch by evolving entire codebases rather than single functions. It represents a major leap in AI’s ability to develop sophisticated algorithms for both scientific challenges and everyday computing problems.

Inside Google’s 0.7% efficiency boost: How AI-crafted algorithms run the company’s data centers

AlphaEvolve has been quietly at work inside Google for over a year. The results are already significant.

One algorithm it discovered has been powering Borg, Google’s massive cluster management system. This scheduling heuristic recovers an average of 0.7% of Google’s worldwide computing resources continuously, a staggering efficiency gain at Google’s scale.

The discovery directly targets “stranded resources”: machines that have run out of one resource type (like memory) while still having others (like CPU) available.
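To make the idea of stranded resources concrete, here is a toy accounting sketch. It is not Borg’s scheduling heuristic or anything AlphaEvolve produced: the `Machine` fields, the fleet values and the assumed minimum memory a new task needs (`min_mem_gb`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Machine:
    name: str
    free_cpu: float      # cores still unallocated
    free_mem_gb: float   # memory still unallocated


def stranded_cpu(machines: List[Machine], min_mem_gb: float = 1.0) -> float:
    """Free CPU on machines that lack enough memory to host any new task."""
    return sum(m.free_cpu for m in machines if m.free_mem_gb < min_mem_gb)


fleet = [
    Machine("m1", free_cpu=8.0, free_mem_gb=0.0),   # memory exhausted: CPU is stranded
    Machine("m2", free_cpu=4.0, free_mem_gb=32.0),  # both available: still usable
    Machine("m3", free_cpu=2.0, free_mem_gb=0.5),   # too little memory for a task
]
total_free = sum(m.free_cpu for m in fleet)
print(f"stranded CPU: {stranded_cpu(fleet)} of {total_free} free cores")
```

A better placement heuristic reduces the stranded share by packing tasks so that CPU and memory run out at roughly the same time on each machine.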
AlphaEvolve’s solution is especially valuable because it produces simple, human-readable code that engineers can easily interpret, debug, and deploy.

The AI agent hasn’t stopped at data centers. It rewrote part of Google’s hardware design, finding a way to eliminate unnecessary bits in a crucial arithmetic circuit for Tensor Processing Units (TPUs). TPU designers validated the change for correctness, and it’s now headed into an upcoming chip design.

Perhaps most impressively, AlphaEvolve improved the very systems that power itself. It optimized a matrix multiplication kernel used to train Gemini models, achieving a 23% speedup for that operation and cutting overall training time by 1%. For AI systems that train on massive computational grids, this efficiency gain translates to substantial energy and resource savings.

“We try to identify critical pieces that can be accelerated and have as much impact as possible,” said Alexander Novikov, another DeepMind researcher, in an interview with VentureBeat. “We were able to optimize the practical running time of [a vital kernel] by 23%, which translated into 1% end-to-end savings on the entire Gemini training card.”

Breaking Strassen’s 56-year-old matrix multiplication record: AI solves what humans couldn’t

AlphaEvolve solves mathematical problems that stumped human experts for decades while advancing existing systems. The system designed a novel gradient-based optimization procedure that discovered multiple new matrix multiplication algorithms. One discovery toppled a mathematical record that had stood for 56 years.

“What we found, to our surprise, to be honest, is that AlphaEvolve, despite being a more general technology, obtained even better results than AlphaTensor,” said Balog, referring to DeepMind’s previous specialized matrix multiplication system. “For these four by four matrices, AlphaEvolve found an algorithm that surpasses Strassen’s algorithm from 1969 for the first time in that setting.”

The breakthrough allows two 4×4 complex-valued matrices to be multiplied using 48 scalar multiplications instead of 49, a discovery that had eluded mathematicians since Volker Strassen’s landmark work. According to the research paper, AlphaEvolve “improves the state of the art for 14 matrix multiplication algorithms.”

The system’s mathematical reach extends far beyond matrix multiplication. When tested against over 50 open problems in mathematical analysis, geometry, combinatorics, and number theory, AlphaEvolve matched state-of-the-art solutions in about 75% of cases. In approximately 20% of cases, it improved upon the best known solutions.

One victory came in the “kissing number problem,” a centuries-old geometric challenge to determine how many non-overlapping unit spheres can simultaneously touch a central sphere. In 11 dimensions, AlphaEvolve found a configuration with 593 spheres, breaking the previous record of 592.

How it works: Gemini language models plus evolution create a digital algorithm factory

What makes AlphaEvolve different from other AI coding systems is its evolutionary approach.

The system deploys both Gemini Flash (for speed) and Gemini Pro (for depth) to propose changes to existing code. These changes get tested by automated evaluators that score each variation. The most successful algorithms then guide the next round of evolution.

AlphaEvolve doesn’t just generate code from its training data. It actively explores the solution space, discovers novel approaches, and refines them through an automated evaluation process, creating solutions humans might never have conceived.

“One critical idea in our approach is that we focus on problems with clear evaluators. For any proposed solution or piece of code, we can automatically verify its validity and measure its quality,” Novikov explained. “This allows us to establish fast and reliable feedback loops to improve the system.”

This approach is particularly valuable because the system can work on any problem with a clear evaluation metric, whether it’s energy efficiency in a data center or the elegance of a mathematical proof.

From cloud computing to drug discovery: Where Google’s algorithm-inventing AI goes next

While currently deployed within Google’s infrastructure and mathematical research, AlphaEvolve’s potential reaches much further. Google DeepMind envisions applications in material sciences, drug discovery, and other fields requiring complex algorithmic solutions.

“The best human-AI collaboration can help solve open scientific challenges and also apply them at Google scale,” said Novikov, highlighting the system’s collaborative potential.

Google DeepMind is now developing a user interface with its People + AI Research team and plans to launch an Early Access Program for selected academic researchers. The company is also exploring broader availability.

The system’s flexibility marks a significant advantage. Balog noted that “at least previously, when I worked
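
The 48-versus-49 figure cited earlier in this article is easier to appreciate with a short calculation. Strassen’s 1969 scheme multiplies two 2×2 matrices with 7 scalar multiplications instead of 8, and applying it recursively to a 4×4 matrix split into 2×2 blocks gives 7 × 7 = 49. The sketch below only counts multiplications; the function names are illustrative, and it does not implement AlphaEvolve’s 48-multiplication algorithm.

```python
def naive_mults(n: int) -> int:
    """Scalar multiplications in schoolbook n x n matrix multiplication."""
    return n ** 3


def strassen_mults(n: int) -> int:
    """Scalar multiplications when Strassen's 2x2 scheme is applied recursively."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    if n == 1:
        return 1
    return 7 * strassen_mults(n // 2)


print(naive_mults(4), strassen_mults(4))
```

Against the schoolbook count of 64 and the recursive Strassen count of 49, the reported result of 48 for complex-valued 4×4 matrices is a one-multiplication improvement on a bound that had held since 1969.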
