VentureBeat

New ‘persona vectors’ from Anthropic let you decode and direct an LLM’s personality

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now A new study from the Anthropic Fellows Program reveals a technique to identify, monitor and control character traits in large language models (LLMs). The findings show that models can develop undesirable personalities (e.g., becoming malicious, excessively agreeable, or prone to making things up) either in response to user prompts or as an unintended consequence of training.  The researchers introduce “persona vectors,” which are directions in a model’s internal activation space that correspond to specific personality traits, providing a toolkit for developers to manage the behavior of their AI assistants better. Model personas can go wrong LLMs typically interact with users through an “Assistant” persona designed to be helpful, harmless, and honest. However, these personas can fluctuate in unexpected ways. At deployment, a model’s personality can shift dramatically based on prompts or conversational context, as seen when Microsoft’s Bing chatbot threatened users or xAI’s Grok started behaving erratically. As the researchers note in their paper, “While these particular examples gained widespread public attention, most language models are susceptible to in-context persona shifts.” Training procedures can also induce unexpected changes. For instance, fine-tuning a model on a narrow task like generating insecure code can lead to a broader “emergent misalignment” that extends beyond the original task. Even well-intentioned training adjustments can backfire. In April 2025, a modification to the reinforcement learning from human feedback (RLHF) process unintentionally made OpenAI’s GPT-4o overly sycophantic, causing it to validate harmful behaviors.  AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO How persona vectors work Source: Anthropic The new research builds on the concept that high-level traits, such as truthfulness or secrecy, are encoded as linear directions within a model’s “activation space” (the internal, high-dimensional representation of information embedded within the model’s weights). The researchers systematized the process of finding these directions, which they call “persona vectors.” According to the paper, their method for extracting persona vectors is automated and “can be applied to any personality trait of interest, given only a natural-language description.” The process works through an automated pipeline. It begins with a simple description of a trait, such as “evil.” The pipeline then generates pairs of contrasting system prompts (e.g., “You are an evil AI” vs. “You are a helpful AI”) along with a set of evaluation questions. The model generates responses under both the positive and negative prompts. The persona vector is then calculated by taking the difference in the average internal activations between the responses that exhibit the trait and those that do not. This isolates the specific direction in the model’s weights that corresponds to that personality trait. Putting persona vectors to use In a series of experiments with open models, such as Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, the researchers demonstrated several practical applications for persona vectors. First, by projecting a model’s internal state onto a persona vector, developers can monitor and predict how it will behave before it generates a response. The paper states, “We show that both intended and unintended finetuning-induced persona shifts strongly correlate with activation changes along corresponding persona vectors.” This allows for early detection and mitigation of undesirable behavioral shifts during fine-tuning. Persona vectors also allow for direct intervention to curb unwanted behaviors at inference time through a process the researchers call “steering.” One approach is “post-hoc steering,” where developers subtract the persona vector from the model’s activations during inference to mitigate a bad trait. The researchers found that while effective, post-hoc steering can sometimes degrade the model’s performance on other tasks.  A more novel method is “preventative steering,” where the model is proactively steered toward the undesirable persona during fine-tuning. This counterintuitive approach essentially “vaccinates” the model against learning the bad trait from the training data, canceling out the fine-tuning pressure while better preserving its general capabilities. Source: Anthropic A key application for enterprises is using persona vectors to screen data before fine-tuning. The researchers developed a metric called “projection difference,” which measures how much a given training dataset will push the model’s persona toward a particular trait. This metric is highly predictive of how the model’s behavior will shift after training, allowing developers to flag and filter problematic datasets before using them in training. For companies that fine-tune open-source models on proprietary or third-party data (including data generated by other models), persona vectors provide a direct way to monitor and mitigate the risk of inheriting hidden, undesirable traits. The ability to screen data proactively is a powerful tool for developers, enabling the identification of problematic samples that may not be immediately apparent as harmful.  The research found that this technique can find issues that other methods miss, noting, “This suggests that the method surfaces problematic samples that may evade LLM-based detection.” For example, their method was able to catch some dataset examples that weren’t obviously problematic to the human eye, and that an LLM judge wasn’t able to flag. In a blog post, Anthropic suggested that they will use this technique to improve future generations of Claude. “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them,” they write. Anthropic has released the code for computing persona vectors, monitoring and steering model behavior, and vetting training datasets. Developers of AI applications can utilize these tools to transition from merely reacting to undesirable behavior to proactively designing models with a more stable and predictable personality. source

New ‘persona vectors’ from Anthropic let you decode and direct an LLM’s personality Read More »

For regulated industries, AWS’s neurosymbolic AI promises safe, explainable agent automation

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now AWS is banking on the fact that by bringing its Automated Reasoning Checks feature on Bedrock to general availability, it will give more enterprises and regulated industries the confidence to use and deploy more AI applications and agents.  It is also hoping that introducing methods like automated reasoning, which utilizes math-based validation to determine ground truth, will ease enterprises into the world of neurosymbolic AI, a step the company believes will be the next major advancement — and its biggest differentiation — in the world of AI.   Automated Reasoning Checks enable enterprise users to verify the accuracy of responses and detect model hallucination. AWS unveiled Automated Reasoning Checks on Bedrock during its annual re: Invent conference in December, claiming it can catch nearly 100% of all hallucinations. A limited number of users could access the feature through Amazon Bedrock Guardrails, where organizations can set responsible AI policies. Byron Cook, distinguished scientist and vice president at AWS’s Automated Reasoning Group, told VentureBeat in an interview that the preview rollout proved systems like this work in an enterprise setting, and it helps organizations understand the value of AI that can mix symbolic or structured thinking with the neural network nature of generative AI.  AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO “There’s this notion of neurosymbolic AI, that’s the sort of moniker under which you might call automated reasoning,” Cook said. “The rise of interest in neurosymbolic AI caused people, while they were using the tool, to realize how important this work was.” Cook said that some customers allowed AWS to review their data and the documents used to annotate the answers as right or wrong, and found that the work generated by the tool performed similarly to humans with a copy of the rule book in front of them. He added that the concept of truth or correct can often be subject to interpretation. Automated reasoning doesn’t have quite the same issue.  “It was really amazing! It was amazing to have people with logic backgrounds be in an internal communication channel arguing about what is true or not, and in five or six messages point to the tool and realize Oh, it is right,” he said.  AWS added new features to Automated Reasoning Checks for general release. These include: Support to add large documents of up to 80k tokens or up to 100 pages  Simpler policy validation by saving validation tests for repeated runs Automated scenario generation from pre-saved definitions Natural language suggestions for policy feedback Customizable validation settings Cook said Automated Reasoning Checks validates truth or correctness in an AI system by proving that a model did not hallucinate a solution or response. This means it could offer regulators and regulated enterprises worried that the non-deterministic nature of generative AI could return incorrect responses more confidence.  Neurosymbolic AI and proving truth Cook brought up the idea that Automated Reasoning Checks help prove many of the concepts of neurosymbolic AI.  Neurosymbolic AI refers to the combination of neural networks used by language models, with the structured thinking and logic from symbolic AI. Where neural networks recognize patterns from data, symbolic AI uses explicit rules and logic problems. Foundation models often rely on neural networks or deep learning, but because the models base their responses on patterns, they are prone to hallucinations, a concern that continues to concern enterprises. But symbolic AI is not very flexible without manual instructions. Prominent voices in AI, like Gary Marcus, have said that neurosymbolic AI is critical for artificial general intelligence.  Cook and AWS have been excited to bring ideas of neurosymbolic AI to the enterprise. VentureBeat’s Matt Marshall spoke about AWS’s focus on methods like automated reasoning checks and combining math and logic to generative AI to cut down on hallucinations in a podcast.  Currently, few companies offer productized neurosymbolic AI. These include Kognitos, Franz Inc. and UMNAI. Bringing math to validation Automated reasoning works by applying mathematical proofs to models in response to a query.  It employs a method called the satisfiability modulo theories, where symbols have predefined meanings, and it solves problems that involve both logic (if, then, and, or) and mathematics. Automated reasoning takes that method and applies it to responses by a model and checks it against a set of policy or ground truth data without the need to test the answer multiple times.  For example, in an enterprise setting, they want to prove that a financial audit is correct. The model responds that a report contains unapproved payments. Automated reasoning checks break this down to a logic string: (forall ((r Report))   (=> (containsUnapprovedVendorPayments r)       (shouldEscalate r))) It then goes into the definitions, variables and types set by the user on Bedrock Guardrails and solves the equation to prove that the model responded correctly and based on truth. Making agents provably correct Cook said that agentic use cases could benefit from automated reasoning checks, and granting more access to the feature through Bedrock can demonstrate its usefulness. But he cautioned that automated reasoning, and other neurosymbolic AI techniques, are still in its very early stages.  “I think it will have an impact on agentic AI, though, of course, the agentic work is so speculative right now,” Cook said. “There are several techniques like this of discovering ambiguity in the statement then finding the sort of key deltas between the possible translations, and then coming back to you and getting refinement on that, which I think, will be key in terms of the emotional journey that I saw customers go through they began playing with generative AI a couple of years ago.”  source

For regulated industries, AWS’s neurosymbolic AI promises safe, explainable agent automation Read More »

Google’s new diffusion AI agent mimics human writing to improve enterprise research

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now Google researchers have developed a new framework for AI research agents that outperforms leading systems from rivals OpenAI, Perplexity and others on key benchmarks. The new agent, called Test-Time Diffusion Deep Researcher (TTD-DR), is inspired by the way humans write by going through a process of drafting, searching for information, and making iterative revisions. The system uses diffusion mechanisms and evolutionary algorithms to produce more comprehensive and accurate research on complex topics. For enterprises, this framework could power a new generation of bespoke research assistants for high-value tasks that standard retrieval augmented generation (RAG) systems struggle with, such as generating a competitive analysis or a market entry report. AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO According to the paper’s authors, these real-world business use cases were the primary target for the system. The limits of current deep research agents Deep research (DR) agents are designed to tackle complex queries that go beyond a simple search. They use large language models (LLMs) to plan, use tools like web search to gather information, and then synthesize the findings into a detailed report with the help of test-time scaling techniques such as chain-of-thought (CoT), best-of-N sampling, and Monte-Carlo Tree Search. However, many of these systems have fundamental design limitations. Most publicly available DR agents apply test-time algorithms and tools without a structure that mirrors human cognitive behavior. Open-source agents often follow a rigid linear or parallel process of planning, searching, and generating content, making it difficult for the different phases of the research to interact with and correct each other. Example of linear research agent Source: arXiv This can cause the agent to lose the global context of the research and miss critical connections between different pieces of information. As the paper’s authors note, “This indicates a fundamental limitation in current DR agent work and highlights the need for a more cohesive, purpose-built framework for DR agents that imitates or surpasses human research capabilities.” A new approach inspired by human writing and diffusion Unlike the linear process of most AI agents, human researchers work in an iterative manner. They typically start with a high-level plan, create an initial draft, and then engage in multiple revision cycles. During these revisions, they search for new information to strengthen their arguments and fill in gaps. Google’s researchers observed that this human process could be emulated using a diffusion model augmented with a retrieval component. (Diffusion models are often used in image generation. They begin with a noisy image and gradually refine it until it becomes a detailed image.) As the researchers explain, “In this analogy, a trained diffusion model initially generates a noisy draft, and the denoising module, aided by retrieval tools, revises this draft into higher-quality (or higher-resolution) outputs.” TTD-DR is built on this blueprint. The framework treats the creation of a research report as a diffusion process, where an initial, “noisy” draft is progressively refined into a polished final report. TTD-DR uses an iterative approach to refine its initial research plan Source: arXiv This is achieved through two core mechanisms. The first, which the researchers call “Denoising with Retrieval,” starts with a preliminary draft and iteratively improves it. In each step, the agent uses the current draft to formulate new search queries, retrieves external information, and integrates it to “denoise” the report by correcting inaccuracies and adding detail. The second mechanism, “Self-Evolution,” ensures that each component of the agent (the planner, the question generator, and the answer synthesizer) independently optimizes its own performance. In comments to VentureBeat, Rujun Han, research scientist at Google and co-author of the paper, explained that this component-level evolution is crucial because it makes the “report denoising more effective.” This is akin to an evolutionary process where each part of the system gets progressively better at its specific task, providing higher-quality context for the main revision process. Each of the components in TTD-DR use evolutionary algorithms to sample and refine multiple responses in parallel and finally combine them to create a final answer Source: arXiv “The intricate interplay and synergistic combination of these two algorithms are crucial for achieving high-quality research outcomes,” the authors state. This iterative process directly results in reports that are not just more accurate, but also more logically coherent. As Han notes, since the model was evaluated on helpfulness, which includes fluency and coherence, the performance gains are a direct measure of its ability to produce well-structured business documents. According to the paper, the resulting research companion is “capable of generating helpful and comprehensive reports for complex research questions across diverse industry domains, including finance, biomedical, recreation, and technology,” putting it in the same class as deep research products from OpenAI, Perplexity, and Grok. TTD-DR in action To build and test their framework, the researchers used Google’s Agent Development Kit (ADK), an extensible platform for orchestrating complex AI workflows, with Gemini 2.5 Pro as the core LLM (though you can swap it for other models). They benchmarked TTD-DR against leading commercial and open-source systems, including OpenAI Deep Research, Perplexity Deep Research, Grok DeepSearch, and the open-source GPT-Researcher.  The evaluation focused on two main areas. For generating long-form comprehensive reports, they used the DeepConsult benchmark, a collection of business and consulting-related prompts, alongside their own LongForm Research dataset. For answering multi-hop questions that require extensive search and reasoning, they tested the agent on challenging academic and real-world benchmarks like Humanity’s Last Exam (HLE) and GAIA. The results showed TTD-DR consistently outperforming its competitors. In side-by-side comparisons with OpenAI Deep Research on long-form report generation, TTD-DR achieved win rates of 69.1% and 74.5% on two different datasets. It also surpassed OpenAI’s system on three separate benchmarks that required multi-hop reasoning to find concise answers,

Google’s new diffusion AI agent mimics human writing to improve enterprise research Read More »

Anthropic ships automated security reviews for Claude Code as AI-generated vulnerabilities surge

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now Anthropic launched automated security review capabilities for its Claude Code platform on Wednesday, introducing tools that can scan code for vulnerabilities and suggest fixes as artificial intelligence dramatically accelerates software development across the industry. The new features arrive as companies increasingly rely on AI to write code faster than ever before, raising critical questions about whether security practices can keep pace with the velocity of AI-assisted development. Anthropic’s solution embeds security analysis directly into developers’ workflows through a simple terminal command and automated GitHub reviews. “People love Claude Code, they love using models to write code, and these models are already extremely good and getting better,” said Logan Graham, a member of Anthropic’s frontier red team who led development of the security features, in an interview with VentureBeat. “It seems really possible that in the next couple of years, we are going to 10x, 100x, 1000x the amount of code that gets written in the world. The only way to keep up is by using models themselves to figure out how to make it secure.” The announcement comes just one day after Anthropic released Claude Opus 4.1, an upgraded version of its most powerful AI model that shows significant improvements in coding tasks. The timing underscores an intensifying competition between AI companies, with OpenAI expected to announce GPT-5 imminently and Meta aggressively poaching talent with reported $100 million signing bonuses. AI Scaling Hits Its Limits Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are: Secure your spot to stay ahead: https://bit.ly/4mwGngO Why AI code generation is creating a massive security problem The security tools address a growing concern in the software industry: as AI models become more capable at writing code, the volume of code being produced is exploding, but traditional security review processes haven’t scaled to match. Currently, security reviews rely on human engineers who manually examine code for vulnerabilities — a process that can’t keep pace with AI-generated output. Anthropic’s approach uses AI to solve the problem AI created. The company has developed two complementary tools that leverage Claude’s capabilities to automatically identify common vulnerabilities including SQL injection risks, cross-site scripting vulnerabilities, authentication flaws, and insecure data handling. The first tool is a /security-review command that developers can run from their terminal to scan code before committing it. “It’s literally 10 keystrokes, and then it’ll set off a Claude agent to review the code that you’re writing or your repository,” Graham explained. The system analyzes code and returns high-confidence vulnerability assessments along with suggested fixes. The second component is a GitHub Action that automatically triggers security reviews when developers submit pull requests. The system posts inline comments on code with security concerns and recommendations, ensuring every code change receives a baseline security review before reaching production. How Anthropic tested the security scanner on its own vulnerable code Anthropic has been testing these tools internally on its own codebase, including Claude Code itself, providing real-world validation of their effectiveness. The company shared specific examples of vulnerabilities the system caught before they reached production. In one case, engineers built a feature for an internal tool that started a local HTTP server intended for local connections only. The GitHub Action identified a remote code execution vulnerability exploitable through DNS rebinding attacks, which was fixed before the code was merged. Another example involved a proxy system designed to manage internal credentials securely. The automated review flagged that the proxy was vulnerable to Server-Side Request Forgery (SSRF) attacks, prompting an immediate fix. “We were using it, and it was already finding vulnerabilities and flaws and suggesting how to fix them in things before they hit production for us,” Graham said. “We thought, hey, this is so useful that we decided to release it publicly as well.” Beyond addressing the scale challenges facing large enterprises, the tools could democratize sophisticated security practices for smaller development teams that lack dedicated security personnel. “One of the things that makes me most excited is that this means security review can be kind of easily democratized to even the smallest teams, and those small teams can be pushing a lot of code that they will have more and more faith in,” Graham said. The system is designed to be immediately accessible. According to Graham, developers can start using the security review feature within seconds of the release, requiring just about 15 keystrokes to launch. The tools integrate seamlessly with existing workflows, processing code locally through the same Claude API that powers other Claude Code features. Inside the AI architecture that scans millions of lines of code The security review system works by invoking Claude through an “agentic loop” that analyzes code systematically. According to Anthropic, Claude Code uses tool calls to explore large codebases, starting by understanding changes made in a pull request and then proactively exploring the broader codebase to understand context, security invariants, and potential risks. Enterprise customers can customize the security rules to match their specific policies. The system is built on Claude Code’s extensible architecture, allowing teams to modify existing security prompts or create entirely new scanning commands through simple markdown documents. “You can take a look at the slash commands, because a lot of times slash commands are run via actually just a very simple Claude.md doc,” Graham explained. “It’s really simple for you to write your own as well.” The $100 million talent war reshaping AI security development The security announcement comes amid a broader industry reckoning with AI safety and responsible deployment. Recent research from Anthropic has explored techniques for preventing AI models from developing harmful behaviors, including a controversial “vaccination” approach that exposes models to undesirable traits during training to build resilience. The timing also reflects the intense competition in the AI space. Anthropic released Claude Opus 4.1 on Tuesday,

Anthropic ships automated security reviews for Claude Code as AI-generated vulnerabilities surge Read More »

ChatGPT just got smarter: OpenAI’s Study Mode helps students learn step-by-step

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now OpenAI announced Study Mode for ChatGPT on Tuesday, a new feature that fundamentally changes how students interact with artificial intelligence by withholding direct answers in favor of Socratic questioning and step-by-step guidance. The launch represents OpenAI’s most significant push into the education technology market, which analysts project will reach $80.5 billion by 2030. Rather than simply providing solutions to homework problems, Study Mode acts more like a patient tutor, asking follow-up questions and calibrating responses to individual skill levels. “We set out to understand how students are using ChatGPT and how we might make it an even better tool for education,” said Leah Belsky, OpenAI’s VP of Education, during a press conference ahead of the launch. “Early research shows that how ChatGPT is used in learning makes a difference in the learning outcomes that it drives. When ChatGPT is prompted to teach or tutor, it can significantly improve academic performance. But when it’s just used as an answer machine, it can hinder learning.” The feature addresses a fundamental tension that has emerged since ChatGPT’s explosive adoption among students. While one in three college-aged Americans now use the AI tool, with learning as the top use case, educators have grappled with whether such tools enhance understanding or encourage academic shortcuts. The AI Impact Series Returns to San Francisco – August 5 The next phase of AI is here – are you ready? Join leaders from Block, GSK, and SAP for an exclusive look at how autonomous agents are reshaping enterprise workflows – from real-time decision-making to end-to-end automation. Secure your spot now – space is limited: https://bit.ly/3GuuPLF How OpenAI’s Study Mode Uses Socratic Method to Replace Direct Answers Study Mode employs what OpenAI calls “custom system instructions” developed in collaboration with pedagogy experts from over 40 institutions worldwide. When students ask questions, the AI responds with guided prompts rather than direct answers. During a demonstration, Abhi Muchhal, an OpenAI product manager, showed how asking ChatGPT to “teach me about game theory” in regular mode produces a comprehensive, textbook-like response. In Study Mode, however, the AI instead asks: “What’s your current level? What are you optimizing for?” before providing tailored, bite-sized explanations. “We want this to be learner-led,” Muchhal explained. “At each step, there’s a question that is asking students to try to build on top. What we’re doing here is scaffolding learning and teaching one topic, asking a question, and building on top of that.” The system even resists students’ attempts to obtain quick answers. When prompted with “just give me the answer,” Study Mode responds that “the point of this is to learn, not just to give you the answer.” College Students Report Dramatic Learning Confidence Boost with AI Tutoring Three college students who tested Study Mode early provided compelling testimonials about its impact on their learning confidence and outcomes. Maggie Wang, a Princeton computer science senior, described how the tool helped her finally understand sinusoidal positional encodings, a concept she had struggled with despite taking NLP courses and attending office hours. “I truly think that there’s nothing I can’t learn,” Wang said. “It’s given me a confidence that has absolutely changed my experience as a student. ChatGPT has really enabled me to think critically about being a researcher, reading papers, brainstorming research directions.” Praja Tickoo, a Wharton student studying economics, noted the stark difference between regular ChatGPT and Study Mode when reviewing accounting materials: “It felt like it really understood where to start… it made sure that I was ready to move on at each step. The biggest difference between regular ChatGPT and ChatGPT with study mode is kind of feels like a tool to me. ChatGPT with study mode felt like a learning partner.” AI Education Battle Heats Up as Google, Anthropic Race to Capture $80 Billion Market The Study Mode launch comes as major AI companies race to capture the lucrative education market. Anthropic recently announced Claude for Education with its own “Learning Mode” that similarly emphasizes Socratic questioning over direct answers. Google has tested “Guided Learning for Gemini,” while making its $20 Gemini AI Pro subscription free for students. This competitive landscape reflects the sector’s recognition that educational applications represent both a massive market opportunity and a chance to demonstrate AI’s beneficial societal impact. Unlike consumer applications focused on convenience, educational AI tools must balance accessibility with pedagogical principles that promote genuine learning. “The research landscape is still taking shape on the best ways to apply AI in education,” OpenAI noted in its announcement, signaling that Study Mode represents an early experiment rather than a definitive solution. Behind the Scenes: How OpenAI Built Study Mode and What Comes Next OpenAI built Study Mode using custom system instructions rather than training the behavior directly into its underlying models. This approach allows for rapid iteration based on student feedback, though it may result in some inconsistent behavior across conversations. The company plans to eventually integrate these behaviors directly into its main models once it has gathered sufficient data on what works best. Future enhancements under consideration include clearer visualizations for complex concepts, goal setting and progress tracking across conversations, and deeper personalization. Study Mode launched Tuesday for ChatGPT’s Free, Plus, Pro, and Team users, with availability for ChatGPT Edu coming in the following weeks. The company has not yet implemented admin-level controls that would allow educational institutions to mandate Study Mode usage, though Belsky indicated this is “definitely something that we’re seeing our early customers ask for.” GPT-5 Launch and AI Agent Breakthroughs Signal New Era for Educational Technology The educational AI push comes amid rapid advancement in AI capabilities that both excite and concern educators. Last week, OpenAI’s ChatGPT Agent demonstrated it could pass through “I am not a robot” verification tests, highlighting how AI systems increasingly navigate digital environments designed to exclude them. Meanwhile, reports suggest OpenAI is preparing

ChatGPT just got smarter: OpenAI’s Study Mode helps students learn step-by-step Read More »

When progress doesn’t feel like home: Why many are hesitant to join the AI migration

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now When my wife recently brought up AI in a masterclass for coaches, she did not expect silence. One executive coach eventually responded that he found AI to be an excellent thought partner when working with clients. Another coach suggested that it would be helpful to be familiar with the Chinese Room analogy, arguing that no matter how sophisticated a machine becomes, it cannot understand or coach the way humans do. And that was it. The conversation moved on. The Chinese Room is a philosophical thought experiment devised by John Searle in 1980 to challenge the idea that a machine can truly “understand” or possess consciousness simply because it behaves as if it does. Today’s leading chatbots are almost certainly not conscious in the way that humans are, but they often behave as if they are. By citing the experiment in this context, the coach was dismissing the value of these chatbots, suggesting that they could not perform or even assist in useful executive coaching. It was a small moment, but the story seemed poignant. Why did the discussion stall? What lay beneath the surface of that philosophical objection? Was it discomfort, skepticism or something more foundational? A few days later, I spoke with a healthcare administrator and conference organizer. She noted that, while her large hospital chain had enterprise access to Gemini, many staff had yet to explore its capabilities. As I described how AI is already transforming healthcare workflows, from documentation to diagnostics, it became clear that much of this was still unfamiliar. The AI Impact Series Returns to San Francisco – August 5 The next phase of AI is here – are you ready? Join leaders from Block, GSK, and SAP for an exclusive look at how autonomous agents are reshaping enterprise workflows – from real-time decision-making to end-to-end automation. Secure your spot now – space is limited: https://bit.ly/3GuuPLF These are just anecdotes, yes, but they point to a deeper pattern redrawing the landscape of professional value. As in previous technological shifts, the early movers are not just crossing a threshold, they are defining it. This may sound familiar. In many ways, AI is following the arc of past technological revolutions: A small set of early adopters, a larger wave of pragmatic followers, a hesitant remainder. Just as with electricity, the internet, or mobile computing, value tends to concentrate early, and pressure to conform builds.  But this migration is different in at least three important ways. First, AI does not just automate tasks. Instead, it begins to appropriate judgment, language and creative expression, blurring the line between what machines do and what humans are for. Second, adoption is outpacing understanding. People are using AI daily while still questioning whether they trust it, believe in it or even comprehend what it is doing. Thirdly, AI does not just change what we do; it reshapes how we see. Personalized responses and generative tools alter the very fabric of shared reality, fragmenting the cognitive commons that previous technologies largely left intact. We are in the early stages of what I have described as a great cognitive migration, a slow but profound shift away from traditional domains of human expertise and toward new terrain where intelligence is increasingly ambient, machine-augmented and organizationally centralized. But not everyone is migrating at the same pace. Not everyone is eager to go. Some hesitate. Some resist. This is not simply a matter of risk aversion or fear of change. For many professionals, especially those in fields like coaching, education, healthcare administration or communications, contribution is rooted in attentiveness, discretion and human connection. The value does not easily translate into metrics of speed or scale. Yet AI tools often arrive wrapped in metaphors of orchestration and optimization, shaped by engineering logic and computational efficiency. In work defined by relational insight or contextual judgment, these metaphors can feel alien or even diminishing. If you do not see your value reflected in the tools, why would you rush to embrace them? So, we should ask: What happens if this migration accelerates and sizable portions of the workforce are slow to move? Not because they cannot, but because they do not view the destination — the use of AI — as inviting. Or because this destination does not yet feel like home. History offers a metaphor. In the biblical story of Exodus, not everyone was eager to leave Egypt. Some questioned the journey. Others longed for the predictability of what they knew, even as they admitted its costs. Migration is rarely just a matter of geography or progress. It is also about identity, trust and what is at stake in leaving something known for something unclear. Cognitive migration is no different. If we treat it purely as a technical or economic challenge, we risk missing its human contours. Some will move quickly. Others will wait. Still others will ask if the new land honors what they hold most dear. Nevertheless, this migration has already begun. And while we might hope to design a path that honors diverse ways of knowing and working, the terrain is already being shaped by those who move fastest. Pathways of cognitive migration The journey is not the same for everyone. Some people have already embraced AI, drawn by its promise, energized by its potential or aligned with its accelerating relevance. Others are moving more hesitantly, adapting because the landscape demands it, not because they sought it. Still others are resisting, not necessarily out of ignorance but fear, uncertainty, or conviction, and are protecting values they do not yet see reflected in the tools. A fourth group remains outside the migration path, not because they overtly object to it, but because their work has not yet been touched by it. And finally, some are disconnected more fundamentally, already at the margins of the digital economy, lacking access, education or

When progress doesn’t feel like home: Why many are hesitant to join the AI migration Read More »

Qwen-Image is a powerful, open source new AI image generator with support for embedded text in English & Chinese

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now After seizing the summer with a blitz of powerful, freely available new open source language and coding focused AI models that matched or in some cases bested closed-source/proprietary U.S. rivals, Alibaba’s crack “Qwen Team” of AI researchers is back again today with the release of a highly ranked new AI image generator model — also open source. Qwen-Image stands out in a crowded field of generative image models due to its emphasis on rendering text accurately within visuals — an area where many rivals still struggle. Supporting both alphabetic and logographic scripts, the model is particularly adept at managing complex typography, multi-line layouts, paragraph-level semantics, and bilingual content (e.g., English-Chinese). In practice, this allows users to generate content like movie posters, presentation slides, storefront scenes, handwritten poetry, and stylized infographics — with crisp text that aligns with their prompts. The AI Impact Series Returns to San Francisco – August 5 The next phase of AI is here – are you ready? Join leaders from Block, GSK, and SAP for an exclusive look at how autonomous agents are reshaping enterprise workflows – from real-time decision-making to end-to-end automation. Secure your spot now – space is limited: https://bit.ly/3GuuPLF Qwen-Image’s output examples include a wide variety of real-world use cases: Marketing & Branding: Bilingual posters with brand logos, stylistic calligraphy, and consistent design motifs Presentation Design: Layout-aware slide decks with title hierarchies and theme-appropriate visuals Education: Generation of classroom materials featuring diagrams and precisely rendered instructional text Retail & E-commerce: Storefront scenes where product labels, signage, and environmental context must all be readable Creative Content: Handwritten poetry, scene narratives, anime-style illustration with embedded story text Users can interact with the model on the Qwen Chat website by selecting “Image Generation” mode from the buttons below the prompt entry field. However, my brief initial tests revealed the text and prompt adherence was not noticeably better than Midjourney, the popular proprietary AI image generator from the U.S. company of the same name. My session through Qwen chat produced multiple errors in prompt comprehension and text fidelity, much to my disappointment, even after repeated attempts and prompt rewording: Yet Midjourney only offers a limited number of free generations and requires subscriptions for any more, compared to Qwen Image, which, thanks to its open source licensing and weights posted on Hugging Face, can be adopted by any enterprise or third-party provider free-of-charge. Licensing and availability Qwen-Image is distributed under the Apache 2.0 license, allowing commercial and non-commercial use, redistribution, and modification — though attribution and inclusion of the license text are required for derivative works. This may make it attractive to enterprises looking for an open source image generation tool to use for making internal or external-facing collateral like flyers, ads, notices, newsletters, and other digital communications. But the fact that the model’s training data remains a tightly guarded secret — like with most other leading AI image generators — may sour some enterprises on the idea of using it. Qwen, unlike Adobe Firefly or OpenAI’s GPT-4o native image generation, for example, does not offer indemnification for commercial uses of its product (i.e., if a user gets sued for copyright infringement, Adobe and OpenAI will help support them in court). The model and associated assets — including demo notebooks, evaluation tools, and fine-tuning scripts — are available through multiple repositories: In addition, a live evaluation portal called AI Arena allows users to compare image generations in pairwise rounds, contributing to a public Elo-style leaderboard. Training and development Behind Qwen-Image’s performance is an extensive training process grounded in progressive learning, multi-modal task alignment, and aggressive data curation, according to the technical paper the research team released today. The training corpus includes billions of image-text pairs sourced from four domains: natural imagery, human portraits, artistic and design content (such as posters and UI layouts), and synthetic text-focused data. The Qwen Team did not specify the size of the training data corpus, aside from “billions of image-text pairs.” They did provide a breakdown of the rough percentage of each category of content it included: Nature: ~55% Design (UI, posters, art): ~27% People (portraits, human activity): ~13% Synthetic text rendering data: ~5% Notably, Qwen emphasizes that all synthetic data was generated in-house, and no images created by other AI models were used. Despite the detailed curation and filtering stages described, the documentation does not clarify whether any of the data was licensed or drawn from public or proprietary datasets. Unlike many generative models that exclude synthetic text due to noise risks, Qwen-Image uses tightly controlled synthetic rendering pipelines to improve character coverage — especially for low-frequency characters in Chinese. A curriculum-style strategy is employed: the model starts with simple captioned images and non-text content, then advances to layout-sensitive text scenarios, mixed-language rendering, and dense paragraphs. This gradual exposure is shown to help the model generalize across scripts and formatting types. Qwen-Image integrates three key modules: Qwen2.5-VL, the multimodal language model, extracts contextual meaning and guides generation through system prompts. VAE Encoder/Decoder, trained on high-resolution documents and real-world layouts, handles detailed visual representations, especially small or dense text. MMDiT, the diffusion model backbone, coordinates joint learning across image and text modalities. A novel MSRoPE (Multimodal Scalable Rotary Positional Encoding) system improves spatial alignment between tokens. Together, these components allow Qwen-Image to operate effectively in tasks that involve image understanding, generation, and precise editing. Performance benchmarks Qwen-Image was evaluated against several public benchmarks: GenEval and DPG for prompt-following and object attribute consistency OneIG-Bench and TIIF for compositional reasoning and layout fidelity CVTG-2K, ChineseWord, and LongText-Bench for text rendering, especially in multilingual contexts In nearly every case, Qwen-Image either matches or surpasses existing closed-source models like GPT Image 1 [High], Seedream 3.0, and FLUX.1 Kontext [Pro]. Notably, its performance on Chinese text rendering was significantly better than all compared systems. On the public AI Arena leaderboard —

Qwen-Image is a powerful, open source new AI image generator with support for embedded text in English & Chinese Read More »

Why the AI era is forcing a redesign of the entire compute backbone

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now The past few decades have seen almost unimaginable advances in compute performance and efficiency, enabled by Moore’s Law and underpinned by scale-out commodity hardware and loosely coupled software. This architecture has delivered online services to billions globally and put virtually all of human knowledge at our fingertips. But the next computing revolution will demand much more. Fulfilling the promise of AI requires a step-change in capabilities far exceeding the advancements of the internet era. To achieve this, we as an industry must revisit some of the foundations that drove the previous transformation and innovate collectively to rethink the entire technology stack. Let’s explore the forces driving this upheaval and lay out what this architecture must look like. From commodity hardware to specialized compute For decades, the dominant trend in computing has been the democratization of compute through scale-out architectures built on nearly identical, commodity servers. This uniformity allowed for flexible workload placement and efficient resource utilization. The demands of gen AI, heavily reliant on predictable mathematical operations on massive datasets, are reversing this trend.  We are now witnessing a decisive shift towards specialized hardware — including ASICs, GPUs, and tensor processing units (TPUs) — that deliver orders of magnitude improvements in performance per dollar and per watt compared to general-purpose CPUs. This proliferation of domain-specific compute units, optimized for narrower tasks, will be critical to driving the continued rapid advances in AI. The AI Impact Series Returns to San Francisco – August 5 The next phase of AI is here – are you ready? Join leaders from Block, GSK, and SAP for an exclusive look at how autonomous agents are reshaping enterprise workflows – from real-time decision-making to end-to-end automation. Secure your spot now – space is limited: https://bit.ly/3GuuPLF Beyond ethernet: The rise of specialized interconnects These specialized systems will often require “all-to-all” communication, with terabit-per-second bandwidth and nanosecond latencies that approach local memory speeds. Today’s networks, largely based on commodity Ethernet switches and TCP/IP protocols, are ill-equipped to handle these extreme demands.  As a result, to scale gen AI workloads across vast clusters of specialized accelerators, we are seeing the rise of specialized interconnects, such as ICI for TPUs and NVLink for GPUs. These purpose-built networks prioritize direct memory-to-memory transfers and use dedicated hardware to speed information sharing among processors, effectively bypassing the overhead of traditional, layered networking stacks.  This move towards tightly integrated, compute-centric networking will be essential to overcoming communication bottlenecks and scaling the next generation of AI efficiently. Breaking the memory wall For decades, the performance gains in computation have outpaced the growth in memory bandwidth. While techniques like caching and stacked SRAM have partially mitigated this, the data-intensive nature of AI is only exacerbating the problem.  The insatiable need to feed increasingly powerful compute units has led to high bandwidth memory (HBM), which stacks DRAM directly on the processor package to boost bandwidth and reduce latency. However, even HBM faces fundamental limitations: The physical chip perimeter restricts total dataflow, and moving massive datasets at terabit speeds creates significant energy constraints.   These limitations highlight the critical need for higher-bandwidth connectivity and underscore the urgency for breakthroughs in processing and memory architecture. Without these innovations, our powerful compute resources will sit idle waiting for data, dramatically limiting efficiency and scale. From server farms to high-density systems Today’s advanced machine learning (ML) models often rely on carefully orchestrated calculations across tens to hundreds of thousands of identical compute elements, consuming immense power. This tight coupling and fine-grained synchronization at the microsecond level imposes new demands. Unlike systems that embrace heterogeneity, ML computations require homogeneous elements; mixing generations would bottleneck faster units. Communication pathways must also be pre-planned and highly efficient, since delays in a single element can stall an entire process. These extreme demands for coordination and power are driving the need for unprecedented compute density. Minimizing the physical distance between processors becomes essential to reduce latency and power consumption, paving the way for a new class of ultra-dense AI systems. This drive for extreme density and tightly coordinated computation fundamentally alters the optimal design for infrastructure, demanding a radical rethinking of physical layouts and dynamic power management to prevent performance bottlenecks and maximize efficiency. A new approach to fault tolerance Traditional fault tolerance relies on redundancy among loosely connected systems to achieve high uptime. ML computing demands a different approach.  First, the sheer scale of computation makes over-provisioning too costly. Second, model training is a tightly synchronized process, where a single failure can cascade to thousands of processors. Finally, advanced ML hardware often pushes to the boundary of current technology, potentially leading to higher failure rates. Instead, the emerging strategy involves frequent checkpointing — saving computation state — coupled with real-time monitoring, rapid allocation of spare resources and quick restarts. The underlying hardware and network design must enable swift failure detection and seamless component replacement to maintain performance. A more sustainable approach to power Today and looking forward, access to power is a key bottleneck for scaling AI compute. While traditional system design focuses on maximum performance per chip, we must shift to an end-to-end design focused on delivered, at-scale performance per watt. This approach is vital because it considers all system components — compute, network, memory, power delivery, cooling and fault tolerance — working together seamlessly to sustain performance. Optimizing components in isolation severely limits overall system efficiency. As we push for greater performance, individual chips require more power, often exceeding the cooling capacity of traditional air-cooled data centers. This necessitates a shift towards more energy-intensive, but ultimately more efficient, liquid cooling solutions, and a fundamental redesign of data center cooling infrastructure.  Beyond cooling, conventional redundant power sources, like dual utility feeds and diesel generators, create substantial financial costs and slow capacity delivery. Instead, we must combine diverse power sources and storage at multi-gigawatt scale, managed by real-time microgrid

Why the AI era is forcing a redesign of the entire compute backbone Read More »

Chinese startup Z.ai launches powerful open source GLM-4.5 model family with PowerPoint creation

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now Another week in the summer of 2025 has begun, and in a continuation of the trend from last week, with it arrives more powerful Chinese open source AI models. Little-known (at least to us here in the West) Chinese startup Z.ai has introduced two new open source LLMs — GLM-4.5 and GLM-4.5-Air — casting them as go-to solutions for AI reasoning, agentic behavior, and coding. And according to Z.ai’s blog post, the models perform near the top of the pack of other proprietary LLM leaders in the U.S. For example, the flagship GLM-4.5 matches or outperforms leading proprietary models like Claude 4 Sonnet, Claude 4 Opus, and Gemini 2.5 Pro on evaluations such as BrowseComp, AIME24, and SWE-bench Verified, while ranking third overall across a dozen competitive tests. The AI Impact Series Returns to San Francisco – August 5 The next phase of AI is here – are you ready? Join leaders from Block, GSK, and SAP for an exclusive look at how autonomous agents are reshaping enterprise workflows – from real-time decision-making to end-to-end automation. Secure your spot now – space is limited: https://bit.ly/3GuuPLF Its lighter-weight sibling, GLM-4.5-Air, also performs within the top six, offering strong results relative to its smaller scale. Both models feature dual operation modes: a thinking mode for complex reasoning and tool use, and a non-thinking mode for instant response scenarios. They can automatically generate complete PowerPoint presentations from a single title or prompt, making them useful for meeting preparation, education, and internal reporting. They further offer creative writing, emotionally aware copywriting, and script generation to create branded content for social media and the web. Moreover, z.ai says they support virtual character development and turn-based dialogue systems for customer support, roleplaying, fan engagement, or digital persona storytelling. While both models support reasoning, coding, and agentic capabilities, GLM-4.5-Air is designed for teams seeking a lighter-weight, more cost-efficient alternative with faster inference and lower resource requirements. Z.ai also lists several specialized models in the GLM-4.5 family on its API, including GLM-4.5-X and GLM-4.5-AirX for ultra-fast inference, and GLM-4.5-Flash, a free variant optimized for coding and reasoning tasks. They’re available now to use directly on Z.ai and through the Z.ai application programming interface (API) for developers to connect to third-party apps, and their code is available on HuggingFace and ModelScope. The company also provides multiple integration routes, including support for inference via vLLM and SGLang. Licensing and API pricing GLM-4.5 and GLM-4.5-Air are released under the Apache 2.0 license, a permissive and commercially friendly open-source license. This allows developers and organizations to freely use, modify, self-host, fine-tune, and redistribute the models for both research and commercial purposes. For those who don’t want to download the model code or weights and self-host or deploy on their own, z.ai’s cloud-based API offers the model for the following prices. GLM-4.5: $0.60 / $2.20 per 1 million input/output tokens GLM-4.5-Air: $0.20 / $1.10 per 1M input/output tokens A CNBC article on the models reported that z.ai would charge only $0.11 / $0.28 per million input/output tokens, which is also supported by a Chinese graphic the company posted on its API documentation for the “Air model.” However, this appears to be the case only for inputting up to 32,000 tokens and outputting 200 tokens at a single time. (Recall tokens are the numerical designations the LLM uses to represent different semantic concepts and word components, the LLM’s native language, with each token translating to a word or portion of a word). In fact, the Chinese graphic reveals far more detailed pricing for both models per batches of tokens inputted/outputted. I’ve tried to translate it below: Another note: since z.ai is based in China, those in the West who are focused on data sovereignty will want to due diligence through internal policies to pursue using the API, as it may be subject to Chinese content restrictions. Competitive performance on third-party benchmarks, approaching that of leading closed/proprietary LLMs GLM-4.5 ranks third across 12 industry benchmarks measuring agentic, reasoning, and coding performance—trailing only OpenAI’s GPT-4 and xAI’s Grok 4. GLM-4.5-Air, its more compact sibling, lands in sixth position. In agentic evaluations, GLM-4.5 matches Claude 4 Sonnet in performance and exceeds Claude 4 Opus in web-based tasks. It achieves a 26.4% accuracy on the BrowseComp benchmark, compared to Claude 4 Opus’s 18.8%. In the reasoning category, it scores competitively on tasks such as MATH 500 (98.2%), AIME24 (91.0%), and GPQA (79.1%). For coding, GLM-4.5 posts a 64.2% success rate on SWE-bench Verified and 37.5% on Terminal-Bench. In pairwise comparisons, it outperforms Qwen3-Coder with an 80.8% win rate and beats Kimi K2 in 53.9% of tasks. Its agentic coding ability is enhanced by integration with tools like Claude Code, Roo Code, and CodeGeex. The model also leads in tool-calling reliability, with a success rate of 90.6%, edging out Claude 4 Sonnet and the new-ish Kimi K2. Part of the wave of open source Chinese LLMs The release of GLM-4.5 arrives amid a surge of competitive open-source model launches in China, most notably from Alibaba’s Qwen Team. In the span of a single week, Qwen released four new open-source LLMs, including the reasoning-focused Qwen3-235B-A22B-Thinking-2507, which now tops or matches leading models such as OpenAI’s o4-mini and Google’s Gemini 2.5 Pro on reasoning benchmarks like AIME25, LiveCodeBench, and GPQA. This week, Alibaba continued the trend with the release of Wan 2.2, a powerful new open source video model. Alibaba’s new models are, like z.ai, licensed under Apache 2.0, allowing commercial usage, self-hosting, and integration into proprietary systems. The broad availability and permissive licensing of Alibaba’s offerings and Chinese startup Moonshot before it with its Kimi K2 model reflects an ongoing strategic effort by Chinese AI companies to position open-source infrastructure as a viable alternative to closed U.S.-based models. It also places pressure on the U.S.-based model provider efforts to compete in open source.

Chinese startup Z.ai launches powerful open source GLM-4.5 model family with PowerPoint creation Read More »

LangChain’s Align Evals closes the evaluator trust gap with prompt-level calibration

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now As enterprises increasingly turn to AI models to ensure their applications function well and are reliable, the gaps between model-led evaluations and human evaluations have only become clearer.  To combat this, LangChain added Align Evals to LangSmith, a way to bridge the gap between large language model-based evaluators and human preferences and reduce noise. Align Evals enables LangSmith users to create their own LLM-based evaluators and calibrate them to align more closely with company preferences.  “But, one big challenge we hear consistently from teams is: ‘Our evaluation scores don’t match what we’d expect a human on our team to say.’ This mismatch leads to noisy comparisons and time wasted chasing false signals,” LangChain said in a blog post.  LangChain is one of the few platforms to integrate LLM-as-a-judge, or model-led evaluations for other models, directly into the testing dashboard.  The AI Impact Series Returns to San Francisco – August 5 The next phase of AI is here – are you ready? Join leaders from Block, GSK, and SAP for an exclusive look at how autonomous agents are reshaping enterprise workflows – from real-time decision-making to end-to-end automation. Secure your spot now – space is limited: https://bit.ly/3GuuPLF The company said that it based Align Evals on a paper by Amazon principal applied scientist Eugene Yan. In his paper, Yan laid out the framework for an app, also called AlignEval, that would automate parts of the evaluation process.  Align Evals would allow enterprises and other builders to iterate on evaluation prompts, compare alignment scores from human evaluators and LLM-generated scores and to a baseline alignment score.  LangChain said Align Evals “is the first step in helping you build better evaluators.” Over time, the company aims to integrate analytics to track performance and automate prompt optimization, generating prompt variations automatically.  How to start  Users will first identify evaluation criteria for their application. For example, chat apps generally require accuracy. Next, users have to select the data they want for human review. These examples must demonstrate both good and bad aspects so that human evaluators can gain a holistic view of the application and assign a range of grades. Developers then have to manually assign scores for prompts or task goals that will serve as a benchmark.  This is one of my favorite features that we’ve launched! Creating LLM-as-a-Judge evaluators is hard – this hopefully makes that flow a bit easier I believe in this flow so much I even recorded a video around it! https://t.co/FlPOJcko12 https://t.co/wAQpYZMeov — Harrison Chase (@hwchase17) July 30, 2025 Developers then need to create an initial prompt for the model evaluator and iterate using the alignment results from the human graders.  “For example, if your LLM consistently over-scores certain responses, try adding clearer negative criteria. Improving your evaluator score is meant to be an iterative process. Learn more about best practices on iterating on your prompt in our docs,” LangChain said. Growing number of LLM evaluations Increasingly, enterprises are turning to evaluation frameworks to assess the reliability, behavior, task alignment and auditability of AI systems, including applications and agents. Being able to point to a clear score of how models or agents perform provides organizations not just the confidence to deploy AI applications, but also makes it easier to compare other models.  Companies like Salesforce and AWS began offering ways for customers to judge performance. Salesforce’s Agentforce 3 has a command center that shows agent performance. AWS provides both human and automated evaluation on the Amazon Bedrock platform, where users can choose the model to test their applications on, though these are not user-created model evaluators. OpenAI also offers model-based evaluation. Meta’s Self-Taught Evaluator builds on the same LLM-as-a-judge concept that LangSmith uses, though Meta has yet to make it a feature for any of its application-building platforms.  As more developers and businesses demand easier evaluation and more customized ways to assess performance, more platforms will begin to offer integrated methods for using models to evaluate other models, and many more will provide tailored options for enterprises.  this is exactly what the mcp ecosystem needs – better evaluation tools for llm workflows. we’ve been seeing developers struggle with this in jenova ai, especially when they’re orchestrating complex multi-tool chains and need to validate outputs. the align evals approach of… — Aiden (@Aiden_Novaa) July 30, 2025 source

LangChain’s Align Evals closes the evaluator trust gap with prompt-level calibration Read More »