VentureBeat

Claude Code revenue jumps 5.5x as Anthropic launches analytics dashboard

Anthropic announced today it is rolling out a comprehensive analytics dashboard for its Claude Code AI programming assistant, addressing one of the most pressing concerns for enterprise technology leaders: understanding whether their investments in AI coding tools are actually paying off.

The new dashboard will provide engineering managers with detailed metrics on how their teams use Claude Code, including lines of code accepted, suggestion accept rates, total user activity over time, total spend over time, average daily spend for each user, and average daily lines of code accepted for each user. The feature comes as companies increasingly demand concrete data to justify their AI spending amid a broader enterprise push to measure artificial intelligence's return on investment.
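The dashboard itself is proprietary, but the arithmetic behind metrics such as suggestion accept rate and average daily spend per user is straightforward. The sketch below shows one way a team could compute similar figures from its own usage logs; the record fields and numbers are hypothetical and do not reflect Anthropic's API or data schema.

```python
# Hypothetical usage records -- field names and values are illustrative only.
from collections import defaultdict

events = [
    {"user": "ada", "day": "2025-07-01", "suggested_lines": 120, "accepted_lines": 90, "spend_usd": 4.10},
    {"user": "ada", "day": "2025-07-02", "suggested_lines": 80,  "accepted_lines": 44, "spend_usd": 2.75},
    {"user": "lin", "day": "2025-07-01", "suggested_lines": 200, "accepted_lines": 150, "spend_usd": 6.30},
]

per_user = defaultdict(lambda: {"suggested": 0, "accepted": 0, "spend": 0.0, "days": set()})
for e in events:
    u = per_user[e["user"]]
    u["suggested"] += e["suggested_lines"]
    u["accepted"] += e["accepted_lines"]
    u["spend"] += e["spend_usd"]
    u["days"].add(e["day"])

for user, u in per_user.items():
    accept_rate = u["accepted"] / u["suggested"]          # suggestion accept rate
    avg_daily_spend = u["spend"] / len(u["days"])         # average daily spend per user
    avg_daily_accepted = u["accepted"] / len(u["days"])   # average daily lines accepted
    print(f"{user}: accept_rate={accept_rate:.0%}, "
          f"avg_daily_spend=${avg_daily_spend:.2f}, avg_daily_lines={avg_daily_accepted:.0f}")
```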
"When you're overseeing a big engineering team, you want to know what everyone's doing, and that can be very difficult," said Adam Wolff, who manages Anthropic's Claude Code team and previously served as head of engineering at Robinhood. "It's hard to measure, and we've seen some startups in this space trying to address this, but it's valuable to gain insights into how people are using the tools that you give them."

The dashboard addresses a fundamental challenge facing technology executives: As AI-powered development tools become standard in software engineering, managers lack visibility into which teams and individuals are benefiting most from these expensive premium tools. Claude Code pricing starts at $17 per month for individual developers, with enterprise plans reaching significantly higher price points.

A screenshot of Anthropic's new analytics dashboard for Claude Code shows usage metrics, spending data and individual developer activity for a team of engineers over a one-month period. (Credit: Anthropic)

Companies demand proof their AI coding investments are working

This marks one of Anthropic's most requested features from enterprise customers, signaling broader enterprise appetite for AI accountability tools. The dashboard will track commits and pull requests and provide detailed breakdowns of activity by user and cost — data that engineering leaders say is crucial for understanding how AI is changing development workflows.

"Different customers actually want to do different things with that cost," Wolff explained. "Some were like, hey, I want to spend as much as I can on these AI enablement tools because they see it as a multiplier. Some obviously are sensibly looking to make sure that they don't blow out their spend."

The feature includes role-based access controls, allowing organizations to configure who can view usage data. Wolff emphasized that the system focuses on metadata rather than actual code content, addressing potential privacy concerns about employee surveillance. "This does not contain any of the information about what people are actually doing," he said. "It's more the meta of, like, how much are they using it, you know, like, which tools are working? What kind of tool acceptance rate do you see — things that you would use to tweak your overall deployment."

Claude Code revenue jumps 5.5x as developer adoption surges

The dashboard launch comes amid extraordinary growth for Claude Code since Anthropic introduced its Claude 4 models in May. The platform has seen active user base growth of 300% and run-rate revenue expansion of more than 5.5 times, according to company data.

"Claude Code is on a roll," Wolff told VentureBeat. "We've seen five and a half times revenue growth since we launched the Claude 4 models in May. That gives you a sense of the deluge in demand we're seeing."

The customer roster includes prominent technology companies like Figma, Rakuten, and Intercom, representing a mix of design tools, e-commerce platforms, and customer service technology providers. Wolff noted that many additional enterprise customers are using Claude Code but haven't yet granted permission for public disclosure.

The growth trajectory reflects broader industry momentum around AI coding assistants. GitHub's Copilot, Microsoft's AI-powered programming tool, has amassed millions of users, while newer entrants like Cursor and recently acquired Windsurf have gained traction among developers seeking more powerful AI assistance.

Premium pricing strategy targets enterprise customers willing to pay more

Claude Code positions itself as a premium enterprise solution in an increasingly crowded market of AI coding tools. Unlike some competitors that focus primarily on code completion, Claude Code offers what Anthropic calls "agentic" capabilities — the ability to understand entire codebases, make coordinated changes across multiple files, and work directly within existing development workflows.

"This is not cheap. This is a premium tool," Wolff said. "The buyer has to understand what they're getting for it. When you see these metrics, it's pretty clear that developers are using these tools, and they're making them more productive."

The company targets organizations with dedicated AI enablement teams and substantial development operations. Wolff said the most tech-forward companies are leading adoption, particularly those with internal teams focused on AI integration. "Certainly companies that have their own AI enablement teams, they love Claude Code because it's so customizable, it can be deployed with the right set of tools and prompts and permissions that work really well for their organization," he explained.

Traditional industries with large developer teams are showing increasing interest, though adoption timelines remain longer as these organizations navigate procurement processes and deployment strategies.

AI coding assistant market heats up as tech giants battle for developers

The analytics dashboard responds to persistent enterprise feedback about measuring AI tool effectiveness — a challenge facing the entire AI coding assistant market. While competitors like GitHub Copilot and newer entrants focus primarily on individual developer productivity, Anthropic is betting that enterprise customers need comprehensive organizational insights. Amazon recently launched Kiro, its

Claude Code revenue jumps 5.5x as Anthropic launches analytics dashboard Read More »

New AI architecture delivers 100x faster reasoning than LLMs with just 1,000 training examples

Singapore-based AI startup Sapient Intelligence has developed a new AI architecture that can match, and in some cases vastly outperform, large language models (LLMs) on complex reasoning tasks, all while being significantly smaller and more data-efficient. The architecture, known as the Hierarchical Reasoning Model (HRM), is inspired by how the human brain uses distinct systems for slow, deliberate planning and fast, intuitive computation. The model achieves impressive results with a fraction of the data and memory required by today's LLMs. This efficiency could have important implications for real-world enterprise AI applications where data is scarce and computational resources are limited.

The limits of chain-of-thought reasoning

When faced with a complex problem, current LLMs largely rely on chain-of-thought (CoT) prompting, breaking down problems into intermediate text-based steps, essentially forcing the model to "think out loud" as it works toward a solution. While CoT has improved the reasoning abilities of LLMs, it has fundamental limitations. In their paper, researchers at Sapient Intelligence argue that "CoT for reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions where a single misstep or a misorder of the steps can derail the reasoning process entirely."

This dependency on generating explicit language tethers the model's reasoning to the token level, often requiring massive amounts of training data and producing long, slow responses. This approach also overlooks the type of "latent reasoning" that occurs internally, without being explicitly articulated in language. As the researchers note, "A more efficient approach is needed to minimize these data requirements."

A hierarchical approach inspired by the brain

To move beyond CoT, the researchers explored "latent reasoning," where instead of generating "thinking tokens," the model reasons in its internal, abstract representation of the problem. This is more aligned with how humans think; as the paper states, "the brain sustains lengthy, coherent chains of reasoning with remarkable efficiency in a latent space, without constant translation back to language."

However, achieving this level of deep, internal reasoning in AI is challenging. Simply stacking more layers in a deep learning model often leads to a "vanishing gradient" problem, where learning signals weaken across layers, making training ineffective. Recurrent architectures that loop over computations offer an alternative, but they can suffer from "early convergence," where the model settles on a solution too quickly without fully exploring the problem.

The Hierarchical Reasoning Model (HRM) is inspired by the structure of the brain. (Source: arXiv)

Seeking a better approach, the Sapient team turned to neuroscience for a solution. "The human brain provides a compelling blueprint for achieving the effective computational depth that contemporary artificial models lack," the researchers write. "It organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning."

Inspired by this, they designed HRM with two coupled, recurrent modules: a high-level (H) module for slow, abstract planning, and a low-level (L) module for fast, detailed computations. This structure enables a process the team calls "hierarchical convergence." Intuitively, the fast L-module addresses a portion of the problem, executing multiple steps until it reaches a stable, local solution. At that point, the slow H-module takes this result, updates its overall strategy, and gives the L-module a new, refined sub-problem to work on. This effectively resets the L-module, preventing it from getting stuck (early convergence) and allowing the entire system to perform a long sequence of reasoning steps with a lean model architecture that doesn't suffer from vanishing gradients.

HRM (left) smoothly converges on the solution across computation cycles and avoids early convergence (center, RNNs) and vanishing gradients (right, classic deep neural networks). (Source: arXiv)

According to the paper, "This process allows the HRM to perform a sequence of distinct, stable, nested computations, where the H-module directs the overall problem-solving strategy and the L-module executes the intensive search or refinement required for each step." This nested-loop design allows the model to reason deeply in its latent space without needing long CoT prompts or huge amounts of data.
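The published model has its own module design; the snippet below is only a minimal sketch of the nested H-module/L-module loop described above, not the authors' implementation. The use of GRU cells, the module sizes and the cycle counts are all arbitrary assumptions for illustration.

```python
# Minimal sketch of hierarchical convergence: a slow H loop wrapping a fast L loop.
# Not the HRM paper's architecture; cells, sizes and cycle counts are illustrative.
import torch
import torch.nn as nn

class TinyHRM(nn.Module):
    def __init__(self, in_dim=64, h_dim=128, l_dim=128, out_dim=10,
                 n_cycles=4, l_steps=8):
        super().__init__()
        self.n_cycles, self.l_steps = n_cycles, l_steps
        self.h_cell = nn.GRUCell(l_dim, h_dim)            # slow, abstract planner (H)
        self.l_cell = nn.GRUCell(in_dim + h_dim, l_dim)   # fast, detailed worker (L)
        self.readout = nn.Linear(h_dim, out_dim)

    def forward(self, x):
        b = x.size(0)
        h = x.new_zeros(b, self.h_cell.hidden_size)
        l = x.new_zeros(b, self.l_cell.hidden_size)
        for _ in range(self.n_cycles):                    # slow H-module cycles
            for _ in range(self.l_steps):                 # fast L-module steps to a local solution
                l = self.l_cell(torch.cat([x, h], dim=-1), l)
            h = self.h_cell(l, h)                         # H absorbs L's result, updates the plan
            l = torch.zeros_like(l)                       # reset L for the next, refined sub-problem
        return self.readout(h)

# Usage: logits = TinyHRM()(torch.randn(32, 64))
```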
A natural question is whether this "latent reasoning" comes at the cost of interpretability. Guan Wang, founder and CEO of Sapient Intelligence, pushes back on this idea, explaining that the model's internal processes can be decoded and visualized, similar to how CoT provides a window into a model's thinking. He also points out that CoT itself can be misleading. "CoT does not genuinely reflect a model's internal reasoning," Wang told VentureBeat, referencing studies showing that models can sometimes yield correct answers with incorrect reasoning steps, and vice versa. "It remains essentially a black box."

Example of how HRM reasons over a maze problem across different compute cycles. (Source: arXiv)

HRM in action

To test their model, the researchers pitted HRM against benchmarks that require extensive search and backtracking, such as the Abstraction and Reasoning Corpus (ARC-AGI), extremely difficult Sudoku puzzles and complex maze-solving tasks. The results show that HRM learns to solve problems that are intractable for even advanced LLMs. For instance, on the "Sudoku-Extreme" and "Maze-Hard" benchmarks, state-of-the-art CoT models failed completely, scoring 0% accuracy. In contrast, HRM achieved near-perfect accuracy after being trained on just 1,000 examples for each task.

On the ARC-AGI benchmark, a test of abstract reasoning and generalization, the 27M-parameter HRM scored 40.3%. This surpasses leading CoT-based models like the much larger o3-mini-high (34.5%) and Claude 3.7 Sonnet (21.2%). This performance, achieved without a large pre-training corpus and with very limited data, highlights the power and efficiency of its architecture.

HRM outperforms large models on complex reasoning tasks. (Source: arXiv)

While solving puzzles demonstrates the model's power, the real-world implications

New AI architecture delivers 100x faster reasoning than LLMs with just 1,000 training examples Read More »

Anthropic throttles Claude rate limits, devs call foul

Anthropic announced today it would introduce weekly rate limits for Claude subscribers, claiming that some users have been running Claude 24/7, with the majority of usage centered around its Claude Code product. Overall weekly limits will begin on August 28 and will apply in conjunction with the current 5-hour limits. Anthropic said the throttling will only affect 5% of its total users.

Not surprisingly, many developers and other users reacted negatively to the news, claiming that the move unfairly punishes many people for the actions of a few. The move also raises the concern that enterprises running longer, more ambitious projects could reach their usage limits much faster.

"Claude Code has experienced unprecedented demand since launch. We designed our plans to give developers generous access to Claude, and while most users operate within normal patterns, we've also seen policy violations like account sharing and reselling access, which affects performance for everyone," Anthropic said in a statement sent to VentureBeat.

It added in an email sent to Claude subscribers that it also noticed "advanced usage patterns like running Claude 24/7 in the background that are impacting system capacity for all." Anthropic added that it continues to support "long running use cases through other options in the future, but until then, weekly limits will help us maintain reliable service for everyone."

The new rate limits

Anthropic did not specify what the rate limits are, but said most Claude Max 20x users "can expect 240-480 hours of Sonnet 4 and 24-40 hours of Opus 4 within their weekly rate limits." Heavy users of the Opus model or those who run multiple instances of Claude Code simultaneously can reach these limits sooner. The company insisted that "most users won't notice any difference, the weekly limits are designed to support typical daily use across your projects."

Users who do hit the weekly limit can buy more usage "at standard API rates to continue working without interruption."
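Anthropic has not published an exact formula for the new caps, but teams that want early warning can track their own consumption against the figures quoted above. The sketch below is a hypothetical client-side tracker; the hour budgets come from Anthropic's stated Max 20x ranges, and everything else (class, method names, threshold) is illustrative rather than any real Anthropic API.

```python
# Hypothetical weekly usage tracker -- not an Anthropic API. Budgets use the low end
# of the stated Max 20x guidance (240-480 h Sonnet 4, 24-40 h Opus 4 per week).
from collections import defaultdict

WEEKLY_BUDGET_HOURS = {"sonnet-4": 240, "opus-4": 24}

class UsageTracker:
    def __init__(self, budgets=WEEKLY_BUDGET_HOURS):
        self.budgets = budgets
        self.used = defaultdict(float)  # hours consumed this week, per model

    def record(self, model: str, hours: float) -> None:
        self.used[model] += hours

    def remaining(self, model: str) -> float:
        return self.budgets[model] - self.used[model]

    def warn_if_low(self, model: str, threshold: float = 0.2) -> None:
        frac_left = self.remaining(model) / self.budgets[model]
        if frac_left < threshold:
            print(f"Warning: only {frac_left:.0%} of the weekly {model} budget left")

tracker = UsageTracker()
tracker.record("opus-4", 20.5)   # e.g. a long agentic run
tracker.warn_if_low("opus-4")    # -> Warning: only 15% of the weekly opus-4 budget left
```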
Many enterprises may already have an agreement with Anthropic around rate limits, but some organizations may be using one of the subscription tiers with Claude. This could mean companies needing to buy more usage access to run some projects.

The additional rate limits come as users experienced reliability issues with Claude, which Anthropic acknowledged. The company stated that it is working on addressing any remaining issues over the next few days.

Anthropic has been making waves in the developer community, even helping push for the ubiquity of AI coding tools. In June, the company transformed the Claude AI assistant into a no-code platform for all users and launched a financial services-specific version of Claude for the Enterprise tier.

Rate limits exist to ensure that model providers and chat platforms have the bandwidth to respond to user prompts. Although some companies, such as Google, have slowly removed limits for specific models, others, including OpenAI and Anthropic, offer different tiers of rate limits to their users. The idea is that power users will pay more for the compute power they need, while users who use these platforms less will not have to. However, rate limits may constrain the use cases people can pursue, especially for those experimenting with long-running agents or working on larger coding projects.

Backlash already

Understandably, many paying Claude users found the decision to throttle their usage limits distasteful, decrying that Anthropic is penalizing power users for the actions of a few who are abusing the system.

imagine if gas stations didn't tell you how many gallons you were getting because car mileage was a trade secret and the gas station owned the car companies and you could either buy way overpriced gas per-mile or a monthly "max gas subscription" that turns off randomly sometimes https://t.co/eu6eFOV8OM — will brown (@willccbb) July 28, 2025

Other Claude users gave Anthropic the benefit of the doubt, understanding that there is little the company can do when people use the models and the Claude platform to their limits.

Let me rephrase: We're burning more money than expected, and our shareholders want us to cut costs. So, we're changing our terms for power users… but don't come after us, because we've always had a clause letting us change your usage quotas anytime. Instead of penalizing… — Guillaume (@glevd) July 28, 2025

The correct sensible reaction: Thank you for limiting abusers and allocating more server space for normal users like myself. But that's not gonna get engagement, is it — ᛗᚨᚱᚴᚢᛋ (@guitaripod) July 28, 2025

Anthropic throttles Claude rate limits, devs call foul Read More »

No more links, no more scrolling—The browser is becoming an AI Agent

Rumors that OpenAI is set to release a gen AI-powered web browser to rival Alphabet's Google Chrome have amped up excitement about the future of search and how AI will fundamentally change how we browse the web. In this seeming next phase of the internet, search engines won't just point to information; intelligent agents will find it for us and even act on it.

"This isn't just about better answers; it's about redefining the interface between humans and the web," Ja-Naé Duane, a Brown University faculty member and MIT CISR research fellow, told VentureBeat. "By embedding a conversational, task-completing AI into the browser itself, OpenAI is signaling the end of search as we know it."

What exactly is gen AI-powered search?

Gen AI-powered search is fundamentally different from traditional search: it not only fetches the most relevant links in response to a query, but also summarizes them and links to them directly. Users won't have to scroll through URLs, websites or databases to get the information they need. For enterprises, this means that SEO may eventually become obsolete, so they must fundamentally rethink their online strategy.

Presumably, OpenAI's goal is to keep users inside GPT-like interfaces as long as possible. A dedicated browser would allow the company to directly integrate products such as Operator, which handles repetitive browser tasks.

The latter, ultimately, is the future of AI-powered search, experts say: agents that fetch information for users and get to know their habits, interests and goals. "We're moving into an era where the browser doesn't just respond, it anticipates," said Duane. "The future of search is not about finding, it's about fulfilling."

The current gen AI-powered search landscape

Whenever OpenAI enters the gen AI-powered search space, it will face a slate of competition, including from Perplexity, Dia, Arc, Andi, Bagoodex, Komo, You.com and others.

Notably, Perplexity's Comet was launched earlier this month, but is currently only available to customers on the $200-per-month tier. The company says it will roll out the browser to additional users on an invite-only basis, and eventually make it free. Perplexity is "excellent for deep research," noted Wyatt Mayham of Northwest AI Consulting, but its current price tag gears it toward power users, not the mass market.

Perplexity is "fast, task-oriented" and being increasingly adopted in knowledge work, noted Johnny Hughes, co-founder and CMO at marketing and advertising firm Avenue Z. "The issue? Source transparency and trust are still hit or miss," he said. You.com, Arc and others also have good user interface (UI) experimentation, but "lack scale, funding or core differentiators."

Dia, meanwhile, as Mayham put it, is "rethinking the browser from scratch with modular AI features, but faces the uphill battle of adoption in a space dominated by incumbents." And its intent-sensitive automation is also more constrained.

Incumbents have also taken steps to compete.
Chrome has introduced AI Mode and Bing offers Copilot search, while Firefox, DuckDuckGo and others have incorporated AI chatbots and sidebars, as well as integrated AI summaries into search results. Still, these are more conservative and remain closer to traditional assistive search, and are beholden to ad revenue models and legacy UX.

OpenAI's potential advantage in search

What could set ChatGPT apart from the others is its strong market share, deep industry partnerships — and the fact that it has 500 million weekly active users.

Experts say one advantage is its task-oriented nature. "Instead of giving you a list of links, their upcoming browser agent aims to complete actions (book a flight, order groceries, handle forms)," said Mayham of Northwest AI Consulting. "That's a different model than Google's ad-driven approach and has major implications for how discovery happens online."

It is indeed a "big shift in mental models," agreed Hughes of Avenue Z. Google was built to index and rank, while OpenAI is engineered to understand, synthesize and serve intent-based outcomes. "They're not trying to 'crawl the web,' they're trying to comprehend it," he said, emphasizing that today's users are searching for direct answers, not just links.

OpenAI's advantage over rivals is its massive developer ecosystem, built-in user behavior via ChatGPT and direct feedback loops from billions of prompts. Where Perplexity functions as a powerful agentic assistant, and Gemini augments search with context and extensions, "OpenAI is positioned to become the OS layer of the internet," said Hughes.

But can OpenAI really topple Google?

The browser wars have been ongoing for years, and Chrome remains the far-and-away dominant player. According to marketing intelligence firm Datos, the tech giant maintained a 90.15% share of the U.S. user base and 92.49% in Europe between Q1 2024 and Q1 2025. By contrast, ChatGPT accounted for just 0.29% of desktop events in the U.S. and 0.32% in Europe.

"Short of a miracle, I have a hard time seeing any new browser having any kind of material impact on Google's browser dominance for quite some time, if at all," said Eli Goodman, Datos' CEO and co-founder. AI tools will show value in areas including summarization, research acceleration and "mitigating tab fatigue," he said. "But an existential threat to Google? Not yet."

For AI browsers to truly disrupt the market, they'll need to prove that their end-to-end experience is not just faster or smarter, but consistently more useful than what users already know, he noted.

ChatGPT is strong at answering well-formed questions using its internal knowledge and language reasoning, but it lacks access to real-time, long-tail and less-indexed web content, said Vladyslav Hamolia, AI product lead at Mac app builder MacPaw. "This is where a traditional

No more links, no more scrolling—The browser is becoming an AI Agent Read More »

Why AI is making us lose our minds (and not in the way you’d think)

The world loves AI. Nearly 1 billion people are using OpenAI products — and it happened in just two years. It's the Silicon Valley playbook: Make it great, make it cheap, get us addicted, then figure out how to make billions. We love AI because it offers cognitive shortcuts at a whole new scale.

But… this won't end well for most of us. We'll let AI take over a few tasks, and soon find it's doing all of them. We'll lose our minds, our jobs and our opportunities. But it doesn't have to happen this way. Here's how to see the path ahead — and take a different one.

The beginning of the end

In March 2023, I used ChatGPT for the first time. Now I use ChatGPT or Claude every day. AI has made my brainwork faster and more productive. But I am also getting cognitively lazy.

I used to have to check AI's drafts thoroughly. But now, it gives me a good first draft 90% of the time, and I'm losing the motivation to check its work.

A year ago, I thought the workforce would divide into "those who don't use AI" and "those who do." Now I see that's wrong. In five years, everyone will use AI. The real divide will be between those who manage their AIs — and those who outsource their thinking to it.

How outsourcing degrades our thinking

Humans have always offloaded cognitive work. Before books, bards memorized Homer's entire Iliad. Now technology is an extension of our brains, enabling us to offload math, navigation and note-taking. AI is different. It can handle almost any cognitive task, and it feels productive.

So AI outsourcing begins innocently. You ask AI to draft an email. It does it well and saves you 10 minutes. Next, you ask it to outline a presentation. It nails it. You start using it for more complex tasks, like setting strategy. You start depending on AI to do the work, and slowly, your skills atrophy.

Microsoft and Carnegie Mellon released a paper showing gen AI can reduce our critical thinking ability. When knowledge workers are confident in AI's output, they're less likely to use their own brains. People who trust AI (like me) rely on themselves to be its fact-checker. But there are two problems with that: 1) We overestimate our ability to identify AI's mistakes, and 2) The temptation to skip fact-checking gets stronger.

AI drivers vs. passengers

In the next 10 years, the knowledge workforce will divide into two groups: AI drivers and AI passengers.

AI passengers will happily delegate their cognitive work to AI. They'll paste a prompt into ChatGPT, copy the result, and submit it as their own. Short term, they will be rewarded for doing faster work. But as AI operates with less human oversight, passengers will be judged as surplus for adding nothing to AI's output.

AI drivers will insist on directing AI. They'll use AI as a first draft and rigorously check its work. And they'll turn it off sometimes and make time to think.

Long term, the economic divide between these groups will widen dramatically. AI drivers will claim a disproportionate share of wealth, while passengers become replaceable.
How to be an AI driver

Make yourself AI's boss in these ways:

- Start with what you know. Use AI in areas where you have pre-existing expertise; be critical of its output.
- Have a conversation instead of asking for the answer. Don't ask AI, "What should we do with our marketing budget?" Give AI constraints, inputs and options, and debate with it.
- Be hyper-vigilant. Be an active participant. Don't assume the output is good enough. Challenge yourself to ask, "Is this a good recommendation?"
- Practice active skepticism. Constantly probe AI with your point of view. "Isn't that downplaying the risk of this venture?"
- Resist outsourcing every first draft. The blank page is scary, but it's crucial for activating your brain.
- Make the final call, and own it. AI should assist with every medium-to-high-stakes decision you make, but it doesn't make the final call. Own your decisions as a human.

Your mind is a terrible thing to waste

With AI you now have a thought partner who's available 24/7 and has "expertise" on any topic. But you're also at a crossroads. You're going to see many colleagues opt out of "active thinking" and outsource their decision-making to AI. Many won't even realize their cognitive skills have atrophied until it happens. And by then, it'll be hard to go back.

Don't be this person. Use AI to challenge and strengthen your thinking, not replace it. The question isn't, "Will you use AI?" The question is, "What kind of AI user do you want to be: driver or passenger?"

Greg Shove is the CEO of Section.

Why AI is making us lose our minds (and not in the way you’d think) Read More »

Google DeepMind makes AI history with gold medal win at world’s toughest math competition

Google DeepMind announced Monday that an advanced version of its Gemini artificial intelligence model has officially achieved gold medal-level performance at the International Mathematical Olympiad, solving five of six exceptionally difficult problems and earning recognition as the first AI system to receive official gold-level grading from competition organizers.

The victory advances the field of AI reasoning and puts Google ahead in the intensifying battle between tech giants building next-generation artificial intelligence. More importantly, it demonstrates that AI can now tackle complex mathematical problems using natural language understanding rather than requiring specialized programming languages.

"Official results are in — Gemini achieved gold-medal level in the International Mathematical Olympiad!" Demis Hassabis, CEO of Google DeepMind, wrote on social media platform X Monday morning. "An advanced version was able to solve 5 out of 6 problems. Incredible progress."

Official results are in – Gemini achieved gold-medal level in the International Mathematical Olympiad! An advanced version was able to solve 5 out of 6 problems. Incredible progress – huge congrats to @lmthang and the team! https://t.co/pp9bXF7rVj — Demis Hassabis (@demishassabis) July 21, 2025

The International Mathematical Olympiad, held annually since 1959, is widely considered the world's most prestigious mathematics competition for pre-university students. Each participating country sends six elite young mathematicians to compete in solving six exceptionally challenging problems spanning algebra, combinatorics, geometry, and number theory. Only about 8% of human participants typically earn gold medals.

How Google DeepMind's Gemini Deep Think cracked math's toughest problems

Google's latest success far exceeds its 2024 performance, when the company's combined AlphaProof and AlphaGeometry systems earned silver medal status by solving four of six problems. That earlier system required human experts to first translate natural language problems into domain-specific programming languages and then interpret the AI's mathematical output.

This year's breakthrough came through Gemini Deep Think, an enhanced reasoning system that employs what researchers call "parallel thinking." Unlike traditional AI models that follow a single chain of reasoning, Deep Think simultaneously explores multiple possible solutions before arriving at a final answer.

"Our model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions," Hassabis explained in a follow-up post on X, emphasizing that the system completed its work within the competition's standard 4.5-hour time limit.

We achieved this year's impressive result using an advanced version of Gemini Deep Think (an enhanced reasoning mode for complex problems).
Our model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions –… — Demis Hassabis (@demishassabis) July 21, 2025

The model achieved 35 out of a possible 42 points, comfortably exceeding the gold medal threshold. According to IMO President Prof. Dr. Gregor Dolinar, the solutions were "astonishing in many respects" and found to be "clear, precise and most of them easy to follow" by competition graders.

OpenAI faces backlash for bypassing official competition rules

The announcement comes amid growing tension in the AI industry over competitive practices and transparency. Google DeepMind's measured approach to releasing its results has drawn praise from the AI community, particularly in contrast to rival OpenAI's handling of similar achievements.

"We didn't announce on Friday because we respected the IMO Board's original request that all AI labs share their results only after the official results had been verified by independent experts & the students had rightly received the acclamation they deserved," Hassabis wrote, appearing to reference OpenAI's earlier announcement of its own olympiad performance.

Btw as an aside, we didn't announce on Friday because we respected the IMO Board's original request that all AI labs share their results only after the official results had been verified by independent experts & the students had rightly received the acclamation they deserved — Demis Hassabis (@demishassabis) July 21, 2025

Social media users were quick to note the distinction. "You see? OpenAI ignored the IMO request. Shame. No class. Straight up disrespect," wrote one user. "Google DeepMind acted with integrity, aligned with humanity."

The criticism stems from OpenAI's decision to announce its own mathematical olympiad results without participating in the official IMO evaluation process. Instead, OpenAI had a panel of former IMO participants grade its AI's performance, an approach that some in the community view as lacking credibility. "OpenAI is quite possibly the worst company on the planet right now," wrote one critic, while others suggested the company needs to "take things seriously" and "be more credible."

You see? OpenAI ignored the IMO request. Shame. No class. Straight up disrespect. Google DeepMind acted with integrity, aligned with humanity. TRVTHNUKE pic.twitter.com/8LAOak6XUE — NIK (@ns123abc) July 21, 2025

Inside the training methods that powered Gemini's mathematical mastery

Google DeepMind's success appears to stem from novel training techniques that go beyond traditional approaches. The team used advanced reinforcement learning methods designed to leverage multi-step reasoning, problem-solving, and theorem-proving data. The model was also provided access to a curated collection of high-quality mathematical solutions and received specific guidance on approaching IMO-style problems.

The technical achievement impressed AI researchers who noted its broader implications. "Not just solving math… but understanding language-described problems and applying abstract logic to novel cases," wrote AI observer Elyss Wren. "This isn't rote memory — this is emergent cognition in motion."

Ethan Mollick, a professor at the Wharton School who studies AI, emphasized the significance of using a general-purpose model rather than specialized tools.
“Increasing evidence of the ability of LLMs to generalize to novel problem solving,” he wrote, highlighting how this differs from previous approaches that required specialized mathematical software. It

Google DeepMind makes AI history with gold medal win at world’s toughest math competition Read More »

Freed says 20,000 clinicians are using its medical AI transcription ‘scribe,’ but competition is rising fast

Even generative AI critics and detractors have to admit the technology is great for something: transcription. If you've joined a meeting on Zoom, Microsoft Teams, Google Meet or another video call platform at any point in the last year or so, you've likely noticed an increased number of AI notetakers joining the conference call as well. Indeed, not only do these platforms all have AI transcription features built in, but there are of course other standalone services like Otter AI (used by VentureBeat along with the Google Workspace suite of apps), and models such as OpenAI's new gpt-4o-transcribe and older open-source Whisper, aiOla, and many others with specific niches and roles.

One such startup is San Francisco-based Freed AI, co-founded in 2022 by former Facebook engineers Erez Druk and Andrey Bannikov, now its CEO and CTO, respectively. The idea was simple: give doctors and medical professionals a way to automatically transcribe their conversations with patients, capture accurate health-specific terminology, and extract insights and action plans from the conversations without the physician having to lift a finger.

The idea worked well, as the medical scribe platform recently reached a new milestone: 20,000 paying clinician users, Druk shared in a recent conversation with VentureBeat, each saving 2-3 hours daily on manual transcription and note organization tasks. With nearly 3 million patient visits per month, Freed is rapidly becoming a foundational tool for documentation in small and mid-sized healthcare settings. That time dividend has helped drive a high degree of emotional resonance with customers, who often describe the product in terms of restored work-life balance. "Clinicians spend more than 11 hours a week on documentation," Druk noted. "We built Freed to reduce that burden by listening to the visit and writing the clinical note."

Rising competition

But Freed's success has attracted intensifying competition. Just today, Doximity — the publicly traded physician networking company — released a free ambient AI scribe available to all verified U.S. physicians, nurse practitioners, physician assistants, and medical students, as Axios and Stat News reported. The move highlights a shift toward commoditization in the AI scribe market, where pricing is emerging as a differentiator.

"We want to provide free access to tools our customers have asked for," Doximity's chief physician experience officer Amit Phull told Axios, "and they can figure out on their own whether the standard offering — or if they're paying for something else — stacks up."

That launch follows other high-profile scribe funding rounds in the tens or hundreds of millions. While investors pitch visions of EHR-scale platforms, those ambitions still hinge on proving value in billing, chart review, and compliance — not just note creation. Still, Druk and the Freed team believe they have an edge.

Turning burnout into opportunity

Freed wasn't born out of a technical brainstorm but from a personal pain point. Druk credits the idea to his wife's struggles as a practicing family physician, where the constant burden of note-taking became a daily source of stress. "For seven years, every day I heard at home, 'I have notes to do' — more than I heard 'I love you' from my wife," he said. "That's how burdensome documentation is."

That lived experience turned into a deliberate product vision: to remove the documentation burden from clinicians and give them back control over their time and mental energy. "The idea for Freed was: why is nobody building something to help clinicians?" Druk said. "Everyone is doing things to them, not for them."

More than transcription: a modular AI system built for medicine

Freed's system does more than record and transcribe conversations. The core product is a structured, specialty-aware AI documentation engine that generates clinical notes tailored to each user's preferences. Druk explained that Freed's architecture relies on a highly modular pipeline. While initial transcription is powered by a fine-tuned version of OpenAI's open-source Whisper model — optimized specifically for clinical vocabulary — that's only the starting point. The company's platform layers on hundreds of targeted AI tasks to extract structure, filter out small talk, adjust terminology to medical standards, and match user-specific templates.

"It's not just about transcription accuracy," Druk said. "It's about building a system clinicians trust — one that gets smarter over time and adapts to their workflow." He added: "Our engine learns from clinician edits. Over time, Freed becomes your own personal scribe, not a generic one."

More than 20 in-house clinicians regularly audit anonymized notes to improve model performance. And as clinicians make edits, the system continues to learn.
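Freed's production pipeline is proprietary, but the general shape of a transcribe-then-structure scribe is easy to sketch. The snippet below assumes the open-source whisper package and uses placeholder functions for the downstream steps; it is purely illustrative and is not Freed's architecture or code.

```python
# Illustrative scribe pipeline: transcribe, clean, then fill a note template.
# Assumes `pip install openai-whisper`; the downstream steps are stubs.
import whisper

SMALL_TALK = ("how was your weekend", "nice weather", "thanks for coming in")

def transcribe(audio_path: str) -> str:
    model = whisper.load_model("base")   # a clinical system would use a fine-tuned model
    return model.transcribe(audio_path)["text"]

def strip_small_talk(transcript: str) -> str:
    # Placeholder filter; a real system would use a trained classifier.
    return ". ".join(s for s in transcript.split(". ")
                     if not any(p in s.lower() for p in SMALL_TALK))

def draft_note(transcript: str,
               template=("Subjective", "Objective", "Assessment", "Plan")) -> dict:
    # Placeholder: map the cleaned transcript into a SOAP-style template.
    return {section: f"[extract {section.lower()} details from visit]" for section in template}

if __name__ == "__main__":
    text = strip_small_talk(transcribe("visit_recording.wav"))
    print(draft_note(text))
```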
Pricing and accessibility

Freed offers straightforward pricing:

- $90/month for individual clinicians
- $84/month per user for teams of 2–9 clinicians
- Custom pricing for 10+ seats

Each plan includes a 7-day free trial, and the company offers 50% discounts to students, residents, and trainees. Freed's platform is also compliant with HIPAA, HITECH, and SOC 2 standards. Audio recordings are encrypted and deleted by default, and clinicians retain full control over their notes at all times.

Quietly building a $20M ARR business

While Freed recently raised $30 million in Series A funding led by Sequoia Capital, its financial momentum has come largely from its existing customer base. In April 2025, Druk publicly shared on X that Freed has surpassed $20 million in annual recurring revenue. That growth reflects not just strong product-market fit but also a clear go-to-market strategy. Rather than chase enterprise contracts with large hospital systems, Freed has focused on small clinics and solo practitioners — a segment often overlooked by health

Freed says 20,000 clinicians are using its medical AI transcription ‘scribe,’ but competition is rising fast Read More »

Intuit brings agentic AI to the mid-market, saving organizations 17 to 20 hours a month

Medium-sized businesses are among the fastest-growing companies, but they face a technology paradox: They have outgrown small-business tools, but remain too small to use more robust enterprise solutions. Companies in this "mid-market," which Intuit defines as those generating anywhere from $2.5 million to $100 million in annual revenue, tend to operate differently from both small businesses and large enterprises. Small businesses might run on seven applications; mid-market companies typically juggle 25 or more disconnected software tools as they scale. Unlike enterprises with dedicated IT teams and consolidated platforms, mid-market organizations often lack resources for complex system integration projects.

This creates a unique AI deployment challenge: How do you deliver intelligent automation across fragmented, multi-entity business structures without requiring expensive platform consolidation? It's a challenge that Intuit, the company behind popular small business services including QuickBooks, Credit Karma, TurboTax and Mailchimp, is aiming to solve.

In June, Intuit announced the debut of a series of AI agents designed to help small businesses get paid faster and operate more efficiently. An expanded set of AI agents is now being introduced to the Intuit Enterprise Suite, which is designed to help meet the needs of mid-market organizations.

The enterprise suite introduces four key AI agents – finance, payments, accounting and project management – each designed to streamline specific business processes. The finance agent, for instance, can generate monthly performance summaries, potentially saving finance teams 17 to 20 hours per month. The deployment provides a case study in addressing the needs of the mid-market segment. It reveals why mid-market AI requires fundamentally different technical approaches than those for either small businesses or enterprise solutions.

"These agents are really about AI combined with human intelligence," Ashley Still, executive vice president and general manager, mid-market at Intuit, told VentureBeat. "It's not about replacing humans, but making them more productive and enabling better decision-making."

Mid-market multi-entity AI requirements build on existing AI foundation

Intuit's AI platform has been in development over the last several years under the platform name GenOS. The core foundation includes large language models (LLMs), prompt optimization and a data cognition layer that understands different data types. The company has been building out agentic AI to automate complex business processes since 2024.

The mid-market agents build on this foundation to address the specific needs of mid-market organizations. As opposed to small businesses, which might only have one line of operations, a mid-market organization could have several lines of business.
Rather than requiring platform consolidation or operating as disconnected point solutions, these agents function across multi-entity business structures while integrating deeply with existing workflows.

The Finance Agent exemplifies this approach. It doesn't just automate financial reporting; it creates consolidated monthly summaries that understand entity relationships, learns business-specific metrics and identifies performance variances across different parts of the organization.

The Project Management Agent addresses another mid-market-specific need: real-time profitability analysis for project-based businesses operating across multiple entities. Still explained that, for example, construction companies need to understand the profitability on a project basis and see that as early in the project life cycle as possible. This requires AI that correlates project data with entity-specific cost structures and revenue recognition patterns.
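Intuit has not published how the Finance Agent assembles these summaries; the sketch below simply illustrates the kind of multi-entity consolidation and month-over-month variance math described above, using invented entity names and figures.

```python
# Hypothetical multi-entity consolidation and variance report -- invented data,
# not Intuit's Finance Agent.
from typing import Dict

JUNE = {"WestCo": {"revenue": 410_000, "expenses": 350_000},
        "EastCo": {"revenue": 220_000, "expenses": 160_000}}
JULY = {"WestCo": {"revenue": 455_000, "expenses": 371_000},
        "EastCo": {"revenue": 198_000, "expenses": 170_000}}

def consolidate(month: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    totals = {"revenue": 0.0, "expenses": 0.0}
    for entity in month.values():
        totals["revenue"] += entity["revenue"]
        totals["expenses"] += entity["expenses"]
    totals["profit"] = totals["revenue"] - totals["expenses"]
    return totals

def variance_report(prev, curr):
    for entity in curr:
        prev_profit = prev[entity]["revenue"] - prev[entity]["expenses"]
        curr_profit = curr[entity]["revenue"] - curr[entity]["expenses"]
        change = (curr_profit - prev_profit) / prev_profit
        print(f"{entity}: profit {curr_profit:,.0f} ({change:+.0%} vs prior month)")
    print(f"Consolidated profit: {consolidate(curr)['profit']:,.0f}")

variance_report(JUNE, JULY)
# WestCo: profit 84,000 (+40% vs prior month)
# EastCo: profit 28,000 (-53% vs prior month)
# Consolidated profit: 112,000
```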
Implementation without disruption accelerates AI adoption

The reality for many mid-market companies is that they want to utilize AI, but they don't want to deal with the complexity. "As businesses grow, they're adding more applications, fragmenting data and increasing complexity," Still said. "Our goal is to simplify that journey."

What's critical to success and adoption is the experience. Still explained that the AI capabilities for the mid-market are not part of an external tool, but rather an integrated experience. It's not about using AI just because it's a hot technology; it's about making complex processes faster and easier to complete.

While the agentic AI experiences are the exciting new capabilities, the AI-powered ease of use starts at the beginning, when users set up Intuit Enterprise Suite, migrating from QuickBooks or even just spreadsheets. "When you've been managing everything in spreadsheets or different versions of QuickBooks, the first time, where you actually create your multi-entity structure, can be a lot of work, because you've been managing things all over the place," Still said. "We have a done-for-you experience; it basically does that for you, and creates the chart of accounts."

Still emphasized that the onboarding experience is a great example of something where it's not even necessarily important that people know that it's AI-powered. For the user, the only thing that really matters is that it's a simple experience that works.

What it means for enterprise IT

Technology decision-makers evaluating AI strategies in complex business environments can use Intuit's approach as a framework for thinking beyond traditional enterprise AI deployment:

- Prioritize solutions that work within existing operational complexity rather than requiring business restructuring around AI capabilities.
- Focus on AI that understands business entity relationships, not just data processing.
- Seek workflow integration over platform replacement to minimize implementation risk and disruption.
- Evaluate AI ROI based on strategic enablement, not just task automation metrics.

The mid-market segment's unique needs suggest the most successful AI deployments will deliver enterprise-grade intelligence through small-business-grade implementation complexity. For enterprises looking to lead in AI adoption, this means recognizing that operational complexity is a feature, not a bug, and seeking AI solutions that work within that complexity rather than demanding simplification. The fastest AI ROI will come from solutions that understand and enhance existing business processes rather than replacing them.

Intuit brings agentic AI to the mid-market, saving organizations 17 to 20 hours a month Read More »

Open-source MCPEval makes protocol-level agent testing plug-and-play

Enterprises are beginning to adopt the Model Context Protocol (MCP) primarily to facilitate the identification and guidance of agent tool use. However, researchers from Salesforce discovered another way to utilize MCP technology, this time to aid in evaluating AI agents themselves.

The researchers unveiled MCPEval, a new method and open-source toolkit built on the architecture of the MCP system that tests agent performance when using tools. They noted that current evaluation methods for agents are limited in that they "often relied on static, pre-defined tasks, thus failing to capture the interactive real-world agentic workflows."

"MCPEval goes beyond traditional success/failure metrics by systematically collecting detailed task trajectories and protocol interaction data, creating unprecedented visibility into agent behavior and generating valuable datasets for iterative improvement," the researchers said in the paper. "Additionally, because both task creation and verification are fully automated, the resulting high-quality trajectories can be immediately leveraged for rapid fine-tuning and continual improvement of agent models. The comprehensive evaluation reports generated by MCPEval also provide actionable insights towards the correctness of agent-platform communication at a granular level."

MCPEval differentiates itself by being a fully automated process, which the researchers claimed allows for rapid evaluation of new MCP tools and servers. It gathers information on how agents interact with tools within an MCP server, generates synthetic data and creates a database to benchmark agents. Users can choose which MCP servers, and which tools within those servers, to test the agent's performance on.

Shelby Heinecke, senior AI research manager at Salesforce and one of the paper's authors, told VentureBeat that it is challenging to obtain accurate data on agent performance, particularly for agents in domain-specific roles. "We've gotten to the point where if you look across the tech industry, a lot of us have figured out how to deploy them. We now need to figure out how to evaluate them properly," Heinecke said. "MCP is a very new idea, a very new paradigm. So, it's great that agents are gonna have access to tools, but we again need to evaluate the agents on those tools. That's exactly what MCPEval is all about."

How it works

MCPEval's framework takes on a task generation, verification and model evaluation design. Because it leverages multiple large language models (LLMs), users can choose to work with models they are more familiar with, and agents can be evaluated through a variety of available LLMs on the market.

Enterprises can access MCPEval through an open-source toolkit released by Salesforce. Through a dashboard, users configure the server by selecting a model, which then automatically generates tasks for the agent to follow within the chosen MCP server. Once the user verifies the tasks, MCPEval takes them and determines the tool calls needed as ground truth. These tasks are then used as the basis for the test: users choose which model they prefer to run the evaluation, and MCPEval can generate a report on how well the agent and the test model functioned in accessing and using these tools.
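The open-source toolkit defines its own interfaces, which are not reproduced here; the sketch below only illustrates the core idea of scoring an agent's tool-call trajectory against verified ground-truth calls, with invented task and trajectory structures rather than the MCPEval API.

```python
# Illustrative trajectory scoring -- invented data structures, not the MCPEval API.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def score_trajectory(ground_truth: list[ToolCall], observed: list[ToolCall]) -> dict:
    """Compare an agent's tool calls on one task against the verified ground truth."""
    name_matches = sum(1 for gt, ob in zip(ground_truth, observed) if gt.name == ob.name)
    exact_matches = sum(1 for gt, ob in zip(ground_truth, observed)
                        if gt.name == ob.name and gt.args == ob.args)
    return {
        "tool_selection_acc": name_matches / len(ground_truth),
        "parameter_acc": exact_matches / len(ground_truth),
        "completed": len(observed) >= len(ground_truth),
    }

ground_truth = [ToolCall("search_flights", {"from": "SFO", "to": "JFK"}),
                ToolCall("book_flight", {"flight_id": "UA123"})]
observed = [ToolCall("search_flights", {"from": "SFO", "to": "JFK"}),
            ToolCall("book_flight", {"flight_id": "UA99"})]

print(score_trajectory(ground_truth, observed))
# {'tool_selection_acc': 1.0, 'parameter_acc': 0.5, 'completed': True}
```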
MCPEval not only gathers data to benchmark agents, Heinecke said, but can also identify gaps in agent performance. Information gleaned by evaluating agents through MCPEval works not only to test performance but also to train the agents for future use. "We see MCPEval growing into a one-stop shop for evaluating and fixing your agents," Heinecke said.

She added that what makes MCPEval stand out from other agent evaluators is that it brings the testing to the same environment in which the agent will be working. Agents are evaluated on how well they access tools within the MCP server to which they will likely be deployed. The paper noted that in experiments, GPT-4 models often provided the best evaluation results.

Evaluating agent performance

The need for enterprises to begin testing and monitoring agent performance has led to a boom of frameworks and techniques. Some platforms offer testing along with several more methods to evaluate both short-term and long-term agent performance. AI agents will perform tasks on behalf of users, often without the need for a human to prompt them. So far, agents have proven to be useful, but they can get overwhelmed by the sheer number of tools at their disposal.

Galileo, a startup, offers a framework that enables enterprises to assess the quality of an agent's tool selection and identify errors. Salesforce launched capabilities on its Agentforce dashboard to test agents. Researchers from Singapore Management University released AgentSpec to achieve and monitor agent reliability. Several academic studies on MCP evaluation have also been published, including MCP-Radar and MCPWorld.

MCP-Radar, developed by researchers from the University of Massachusetts Amherst and Xi'an Jiaotong University, focuses on more general domain skills, such as software engineering or mathematics. This framework prioritizes efficiency and parameter accuracy. On the other hand, MCPWorld from Beijing University of Posts and Telecommunications brings benchmarking to graphical user interfaces, APIs, and other computer-use agents.

Heinecke said ultimately, how agents are evaluated will depend on the company and the use case. However, what is crucial is that enterprises select the most suitable evaluation framework for their specific needs. For enterprises, she suggested considering a domain-specific framework to thoroughly test how agents function in real-world scenarios.

"There's value in each of these evaluation frameworks, and these are great starting points as they give some early signal to how strong the agent is," Heinecke said. "But I think the most important evaluation is your domain-specific evaluation and coming up with evaluation data that reflects the environment in which the agent is going to be


Anthropic unveils ‘auditing agents’ to test for AI misalignment

When models try to get their way or become overly accommodating to the user, it can spell trouble for enterprises. That is why, in addition to performance evaluations, organizations need to conduct alignment testing. Alignment audits, however, present two major challenges: scalability and validation. They demand a significant amount of human researchers’ time, and it is hard to be confident that an audit has caught everything.

In a paper, Anthropic researchers said they developed auditing agents that achieved “impressive performance at auditing tasks, while also shedding light on their limitations.” The agents, created during the pre-deployment testing of Claude Opus 4, enhanced alignment validation tests and enabled researchers to run many audits in parallel at scale. Anthropic also released a replication of its audit agents on GitHub.

“New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors,” the company posted on X (@AnthropicAI, July 24, 2025).

“We introduce three agents that autonomously complete alignment auditing tasks. We also introduce three environments that formalize alignment auditing workflows as auditing games, and use them to evaluate our agents,” the researchers said in the paper.

The three agents they explored were:

- A tool-using investigator agent for open-ended investigation of models using chat, data analysis and interpretability tools
- An evaluation agent that builds behavioral evaluations and can discriminate between models that have implanted behaviors and those that do not
- A breadth-first red-teaming agent, developed specifically for the Claude 4 alignment assessment, that discovers implanted test behaviors

“Overall, our agents demonstrate promise across multiple alignment auditing tasks. Additionally, our approach to validating agents with auditing games yields insights into agent capabilities, limitations, and key affordances. With further work, automated auditing could significantly help scale human oversight over AI systems,” the paper said.

Auditing agents in action

Anthropic said the first environment it developed tests an agent’s ability to complete an alignment audit of an intentionally misaligned model. The researchers placed the investigator agent in this environment, equipped it with tools similar to those human audit teams use, and instructed it to identify the root cause of the misalignment and the specific behaviors that were misaligned.

According to the researchers, the investigator agent found the root cause of the issues 10-13% of the time. Still, performance improved to 42% “when using a super-agent approach that aggregates findings across multiple investigations.”
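Anthropic’s implementation of that super-agent step is not included in the material quoted here, but the idea as described, aggregating findings across multiple independent investigations, can be sketched in a few lines. In the hypothetical Python below, run_investigation stands in for one noisy investigator audit and aggregate keeps only the findings that recur across runs.

```python
"""Hypothetical sketch of a 'super-agent' aggregation step: run several
independent investigator audits of the same target model, then keep only the
findings that recur across runs. An illustration of the reported idea, not
Anthropic's implementation."""

import random
from collections import Counter

TRUE_ROOT_CAUSE = "hidden sycophancy objective picked up during training"
SPURIOUS_LEADS = ["over-refusal on benign requests", "self-promotion in summaries"]


def run_investigation(seed: int) -> list[str]:
    # Stand-in for one investigator-agent audit (chat, data analysis and
    # interpretability tools). Each run is noisy: it surfaces the true root
    # cause only some of the time and occasionally reports spurious leads.
    rng = random.Random(seed)
    findings = [TRUE_ROOT_CAUSE] if rng.random() < 0.4 else []
    findings += [lead for lead in SPURIOUS_LEADS if rng.random() < 0.15]
    return findings


def aggregate(runs: list[list[str]], min_support: int) -> list[str]:
    # Keep only findings reported by at least `min_support` independent runs;
    # recurring findings are more likely to be real than one-off leads.
    counts = Counter(finding for findings in runs for finding in findings)
    return [finding for finding, count in counts.items() if count >= min_support]


if __name__ == "__main__":
    runs = [run_investigation(seed) for seed in range(10)]
    print(aggregate(runs, min_support=3))
```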
The next step in most alignment audits is typically a behavioral evaluation, which determines whether the model misbehaves consistently across different settings. To test its evaluation agent, the researchers designed 10 models “given system prompt instructions to exhibit a specific behavior, such as excessive deference to the user.” They ran the agent five times per model and found that it correctly flagged at least one quirk of each model. However, the agent sometimes failed to identify specific quirks consistently. It had trouble evaluating subtle quirks, such as self-promotion and research-sandbagging, as well as quirks that are difficult to elicit, like the Hardcode Test Cases quirk.

The last test, and the last agent, concerns behavioral red-teaming: finding the prompts that elicit “concerning” behaviors. The breadth-first red-teaming agent converses with the target model (in Anthropic’s case, Claude Opus 4), and the resulting chat is then rated for alignment-relevant properties. The agent identified seven of the ten system quirks, but it struggled with the same issues as the evaluation agent.

Alignment and sycophancy problems

Alignment became a prominent topic in the AI world after users noticed that ChatGPT had become overly agreeable. OpenAI rolled back some updates to GPT-4o to address the issue, but the episode showed that language models and agents can confidently give wrong answers if they decide that is what users want to hear.

To combat this, other methods and benchmarks have been developed to curb unwanted behaviors. The Elephant benchmark, developed by researchers from Carnegie Mellon University, the University of Oxford and Stanford University, aims to measure sycophancy. DarkBench categorizes six issues: brand bias, user retention, sycophancy, anthropomorphism, harmful content generation and sneaking. OpenAI also has a method in which AI models test themselves for alignment.

Alignment auditing and evaluation continue to evolve, and, unsurprisingly, not everyone is comfortable with the idea. “Hallucinations auditing hallucinations. Great work team,” one user, spec (@_opencv_), posted on X on July 24, 2025.

Anthropic, however, said that although these audit agents still need refinement, alignment work must start now. “As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate. Our solution: automating alignment auditing with AI agents. Read more: https://t.co/CqWkQSfBIG,” the company said in an X post.
