VentureBeat

Mistral just updated its open source Small model from 3.1 to 3.2: here’s why

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more French AI darling Mistral is keeping the new releases coming this summer. Just days after announcing its own domestic AI-optimized cloud service Mistral Compute, the well-funded company has released an update to its 24B parameter open source model Mistral Small, jumping from a 3.1 release to 3.2-24B Instruct-2506. The new version builds directly on Mistral Small 3.1, aiming to improve specific behaviors such as instruction following, output stability, and function calling robustness. While overall architectural details remain unchanged, the update introduces targeted refinements that affect both internal evaluations and public benchmarks. According to Mistral AI, Small 3.2 is better at adhering to precise instructions and reduces the likelihood of infinite or repetitive generations — a problem occasionally seen in prior versions when handling long or ambiguous prompts. Similarly, the function calling template has been upgraded to support more reliable tool-use scenarios, particularly in frameworks like vLLM. At the same time, it can run on a setup with a single Nvidia A100/H100 80GB GPU, drastically opening up the options for businesses with tight compute resources and/or budgets.

An updated model after only 3 months

Mistral Small 3.1 was announced in March 2025 as a flagship open release in the 24B parameter range. It offered full multimodal capabilities, multilingual understanding, and long-context processing of up to 128K tokens. The model was explicitly positioned against proprietary peers like GPT-4o Mini, Claude 3.5 Haiku, and Gemma 3-it — and, according to Mistral, outperformed them across many tasks. Small 3.1 also emphasized efficient deployment, with claims of running inference at 150 tokens per second and support for on-device use with 32 GB RAM. That release came with both base and instruct checkpoints, offering flexibility for fine-tuning across domains such as legal, medical, and technical fields. In contrast, Small 3.2 focuses on surgical improvements to behavior and reliability. It does not aim to introduce new capabilities or architecture changes. Instead, it acts as a maintenance release: cleaning up edge cases in output generation, tightening instruction compliance, and refining system prompt interactions.

Small 3.2 vs. Small 3.1: what changed?

Instruction-following benchmarks show a small but measurable improvement. Mistral’s internal accuracy rose from 82.75% in Small 3.1 to 84.78% in Small 3.2. Similarly, performance on external datasets like Wildbench v2 and Arena Hard v2 improved significantly—Wildbench increased by nearly 10 percentage points, while Arena Hard more than doubled, jumping from 19.56% to 43.10%. Internal metrics also suggest reduced output repetition. The rate of infinite generations dropped from 2.11% in Small 3.1 to 1.29% in Small 3.2, a relative reduction of nearly 40%. This makes the model more reliable for developers building applications that require consistent, bounded responses. Performance across text and coding benchmarks presents a more nuanced picture. Small 3.2 showed gains on HumanEval Plus (88.99% to 92.90%), MBPP Pass@5 (74.63% to 78.33%), and SimpleQA. It also modestly improved MMLU Pro and MATH results. Vision benchmarks remain mostly consistent, with slight fluctuations. ChartQA and DocVQA saw marginal gains, while AI2D and Mathvista dropped by less than two percentage points.
Average vision performance decreased slightly from 81.39% in Small 3.1 to 81.00% in Small 3.2. This aligns with Mistral’s stated intent: Small 3.2 is not a model overhaul, but a refinement. As such, most benchmarks are within expected variance, and some regressions appear to be trade-offs for targeted improvements elsewhere. However, as AI power user and influencer @chatgpt21 posted on X: “It got worse on MMLU,” meaning the Massive Multitask Language Understanding benchmark, a multidisciplinary test spanning 57 subjects designed to assess broad LLM performance across domains. Indeed, Small 3.2 scored 80.50%, slightly below Small 3.1’s 80.62%.

Open source license will make it more appealing to cost-conscious and customization-focused users

Both Small 3.1 and 3.2 are available under the Apache 2.0 license and can be accessed via the popular AI code sharing repository Hugging Face (itself a startup based in France and NYC). Small 3.2 is supported by frameworks like vLLM and Transformers and requires roughly 55 GB of GPU RAM to run in bf16 or fp16 precision. For developers seeking to build or serve applications, system prompts and inference examples are provided in the model repository. While Mistral Small 3.1 is already integrated into platforms like Google Cloud Vertex AI and is scheduled for deployment on NVIDIA NIM and Microsoft Azure, Small 3.2 currently appears limited to self-serve access via Hugging Face and direct deployment.

What enterprises should know when considering Mistral Small 3.2 for their use cases

Mistral Small 3.2 may not shift competitive positioning in the open-weight model space, but it represents Mistral AI’s commitment to iterative model refinement. With noticeable improvements in reliability and task handling — particularly around instruction precision and tool usage — Small 3.2 offers a cleaner user experience for developers and enterprises building on the Mistral ecosystem. The fact that it is made by a French startup and compliant with EU rules and regulations such as GDPR and the EU AI Act also makes it appealing for enterprises working in that part of the world. Still, for those seeking the biggest jumps in benchmark performance, Small 3.1 remains a reference point—especially given that in some cases, such as MMLU, Small 3.2 does not outperform its predecessor. That makes the update more of a stability-focused option than a pure upgrade, depending on the use case. source
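For teams that want to try the update locally, the sketch below shows one way a vLLM deployment might look. It is a minimal example rather than Mistral's official recipe: the Hugging Face repo id, the sampling settings and the reduced context length are assumptions, and the roughly 55 GB of GPU memory cited above still applies to the bf16 weights.

```python
# Minimal sketch: running Mistral Small 3.2 locally with vLLM.
# Assumptions (not stated in the article): the Hugging Face repo id
# "mistralai/Mistral-Small-3.2-24B-Instruct-2506", bf16 weights that fit
# in roughly 55 GB of GPU memory, and a trimmed context length.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    dtype="bfloat16",      # the article cites bf16/fp16 precision
    max_model_len=32768,   # reduce the context window to fit one 80 GB card
)

params = SamplingParams(temperature=0.15, max_tokens=512)
outputs = llm.generate(
    ["Summarize the main differences between Mistral Small 3.1 and 3.2."],
    params,
)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be loaded through the Transformers library; the system prompts and inference examples in the model repository remain the authoritative starting point.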

Mistral just updated its open source Small model from 3.1 to 3.2: here’s why Read More »

From fear to fluency: Why empathy is the missing ingredient in AI rollouts

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more While many organizations are eager to explore how AI can transform their business, its success will hinge not on tools, but on how well people embrace them. This shift requires a different kind of leadership rooted in empathy, curiosity and intentionality. Technology leaders must guide their organizations with clarity and care. People use technology to solve human problems, and AI is no different, which means adoption is as emotional as it is technical, and must be inclusive to your organization from the start. Empathy and trust are not optional. They are essential for scaling change and encouraging innovation. Why this AI moment feels different Over the past year alone, we’ve seen AI adoption accelerate at breakneck speed.  First, it was generative AI, then Copilots; now we’re in the era of AI agents. With each new wave of AI innovation, businesses rush to adopt the latest tools, but the most important part of technological change that is often overlooked? People. In the past, teams had time to adapt to new technologies. Operating systems or enterprise resource planning (ERP) tools evolved over years, giving users more room to learn these platforms and acquire the skills to use them. Unlike previous tech shifts, this one with AI doesn’t come with a long runway. Change arrives overnight, and expectations follow just as fast. Many employees feel like they’re being asked to keep pace with systems they haven’t had time to learn, let alone trust. A recent example would be ChatGPT reaching 100 million monthly active users just two months after launch. This creates friction — uncertainty, fear and disengagement — especially when teams feel left behind. It’s no surprise that 81% of staff still don’t use AI tools in their daily work. This underlines the emotional and behavioral complexity of adoption. Some people are naturally curious and quick to experiment with new technology while others are skeptical, risk-averse or anxious about job security.  To unlock the full value of AI, leaders must meet people where they are and understand that adoption will look different across every team and individual. The 4 E’s of AI adoption Successful AI adoption requires a carefully thought-out framework, which is where the “four E’s” come in.  Evangelism – inspiring through trust and vision Before employees adopt AI, they need to understand why it matters to them. Evangelism isn’t about hype. It’s about helping people care by showing them how AI can make their work more meaningful, not just more efficient. Leaders must connect the dots between the organization’s goals and individual motivations. Remember, people prioritize stability and belonging before transformation. The priority is to show how AI supports, not disrupts, their sense of purpose and place. Use meaningful metrics like DORA or cycle time improvements to demonstrate value without pressure. When done with transparency, this builds trust and fosters a high-performance culture grounded in clarity, not fear. Enablement – empowering people with empathy Successful adoption depends as much on emotional readiness as it does on technical training. Many people process disruption in personal and often unpredictable ways. Empathetic leaders recognize this and build enablement strategies that give teams space to learn, experiment and ask questions without judgment. 
The AI talent gap is real; organizations must actively support people in bridging it with structured training, learning time or internal communities to share progress.  When tools don’t feel relevant, people disengage. If they can’t connect today’s skills to tomorrow’s systems, they tune out. That’s why enablement must feel tailored, timely and transferable. Enforcement – aligning people around shared goals Enforcement doesn’t mean command and control. It is about creating alignment through clarity, fairness and context.  People need to understand not just what is expected of them in an AI-driven environment, but why. Skipping straight to results without removing blockers only creates friction. As Chesterton’s Fence suggests, if you don’t understand why something exists, you shouldn’t rush to remove it. Instead, set realistic expectations, define measurable goals and make progress visible across the organization. Performance data can motivate, but only when it’s shared transparently, framed with context and used to lift people up, not call them out. Experimentation – creating safe spaces for innovation Innovation thrives when people feel safe to try, fail and learn. This is  especially true with AI, where the pace of change can be overwhelming. When perfection is the bar, creativity suffers. Leaders must model a mindset of progress over perfection. In my own teams, we’ve seen that progress, not polish, builds momentum. Small experiments lead to big breakthroughs. A culture of experimentation values curiosity as much as execution. Empathy and experimentation go hand in hand. One empowers the other. Leading the change, human first Adopting AI is not just a technical initiative, it’s a cultural reset, one that challenges leaders to show up with more empathy and not just expertise. Success depends on how well leaders can inspire trust and empathy across their organizations. The 4 E’s of adoption offer more than a framework. They reflect a leadership mindset rooted in inclusion, clarity and care.  By embedding empathy into structure and using metrics to illuminate progress rather than pressure outcomes, teams become more adaptable and resilient. When people feel supported and empowered, change becomes not only possible, but scalable. That’s where AI’s true potential begins to take shape. Rukmini Reddy is SVP of Engineering at PagerDuty. source

From fear to fluency: Why empathy is the missing ingredient in AI rollouts Read More »

MiniMax-M1 is a new open source model with 1 MILLION TOKEN context and new, hyper efficient reinforcement learning

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more Chinese AI startup MiniMax, perhaps best known in the West for its hit realistic AI video model Hailuo, has released its latest large language model, MiniMax-M1 — and in great news for enterprises and developers, it’s completely open source under an Apache 2.0 license, meaning businesses can take it and use it for commercial applications and modify it to their liking without restriction or payment. M1 is an open-weight offering that sets new standards in long-context reasoning, agentic tool use, and efficient compute performance. It’s available today on the AI code sharing community Hugging Face and Microsoft’s rival code sharing community GitHub, the first release of what the company has dubbed “MiniMaxWeek” on its social account on X — with further product announcements expected.

MiniMax-M1 distinguishes itself with a context window of 1 million input tokens and up to 80,000 tokens in output, positioning it as one of the most expansive models available for long-context reasoning tasks. The “context window” in large language models (LLMs) refers to the maximum number of tokens the model can process at one time — including both input and output. Tokens are the basic units of text, which may include entire words, parts of words, punctuation marks, or code symbols. These tokens are converted into numerical vectors that the model uses to represent and manipulate meaning through its parameters (weights and biases). They are, in essence, the LLM’s native language. For comparison, OpenAI’s GPT-4o has a context window of only 128,000 tokens — enough to exchange about a novel’s worth of information between the user and the model in a single back-and-forth interaction. At 1 million tokens, MiniMax-M1 could exchange a small collection or book series’ worth of information. Google Gemini 2.5 Pro offers a token context upper limit of 1 million, as well, with a reported 2 million window in the works.

But M1 has another trick up its sleeve: it’s been trained using an innovative, resource-efficient reinforcement learning technique. The model is trained using a hybrid Mixture-of-Experts (MoE) architecture with a lightning attention mechanism designed to reduce inference costs. According to the technical report, MiniMax-M1 consumes only 25% of the floating point operations (FLOPs) required by DeepSeek R1 at a generation length of 100,000 tokens.

Architecture and variants

The model comes in two variants—MiniMax-M1-40k and MiniMax-M1-80k—referring to their “thinking budgets” or output lengths. The architecture is built on the company’s earlier MiniMax-Text-01 foundation and includes 456 billion parameters, with 45.9 billion activated per token. A standout feature of the release is the model’s training cost. MiniMax reports that the M1 model was trained using large-scale reinforcement learning (RL) at an efficiency rarely seen in this domain, with a total cost of $534,700. This efficiency is credited to a custom RL algorithm called CISPO, which clips importance sampling weights rather than token updates, and to the hybrid attention design that helps streamline scaling. That’s an astonishingly “cheap” amount for a frontier LLM, as DeepSeek trained its hit R1 reasoning model at a reported cost of $5-$6 million, while the training cost of OpenAI’s GPT-4 — a more than two-year-old model now — was said to exceed $100 million.
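To make the token arithmetic above concrete, here is a small, hedged sketch of how a team might check whether a long document actually fits in a 1-million-token input window before sending it to the model. The Hugging Face repo id and the file path are illustrative assumptions; use the tokenizer of whichever model you actually deploy.

```python
# Hedged sketch: estimating whether a document fits MiniMax-M1's advertised
# 1-million-token input window. The repo id "MiniMaxAI/MiniMax-M1-80k" and
# the file path are assumptions for illustration only.
from transformers import AutoTokenizer

INPUT_WINDOW = 1_000_000   # advertised input budget for MiniMax-M1
OUTPUT_BUDGET = 80_000     # the M1-80k variant's maximum output length

tokenizer = AutoTokenizer.from_pretrained(
    "MiniMaxAI/MiniMax-M1-80k", trust_remote_code=True
)

with open("quarterly_reports.txt", encoding="utf-8") as f:  # hypothetical long document
    text = f.read()

n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens:,} input tokens; fits in the input window: {n_tokens <= INPUT_WINDOW}")
print(f"Generation is separately capped at {OUTPUT_BUDGET:,} output tokens.")
```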
This cost comes from both the price of graphics processing units (GPUs), the massively parallel computing hardware primarily manufactured by companies like Nvidia, which can cost $20,000–$30,000 or more per module, and from the energy required to run those chips continuously in large-scale data centers.

Benchmark performance

MiniMax-M1 has been evaluated across a series of established benchmarks that test advanced reasoning, software engineering, and tool-use capabilities. On AIME 2024, a mathematics competition benchmark, the M1-80k model scores 86.0% accuracy. It also delivers strong performance in coding and long-context tasks, achieving:

65.0% on LiveCodeBench
56.0% on SWE-bench Verified
62.8% on TAU-bench
73.4% on OpenAI MRCR (4-needle version)

These results place MiniMax-M1 ahead of other open-weight competitors such as DeepSeek-R1 and Qwen3-235B-A22B on several complex tasks. While closed-weight models like OpenAI’s o3 and Gemini 2.5 Pro still top some benchmarks, MiniMax-M1 narrows the performance gap considerably while remaining freely accessible under an Apache-2.0 license. For deployment, MiniMax recommends vLLM as the serving backend, citing its optimization for large model workloads, memory efficiency, and batch request handling. The company also provides deployment options using the Transformers library. MiniMax-M1 includes structured function calling capabilities and is packaged with a chatbot API featuring online search, video and image generation, speech synthesis, and voice cloning tools. These features aim to support broader agentic behavior in real-world applications.

Implications for technical decision-makers and enterprise buyers

MiniMax-M1’s open access, long-context capabilities, and compute efficiency address several recurring challenges for technical professionals responsible for managing AI systems at scale. For engineering leads responsible for the full lifecycle of LLMs — such as optimizing model performance and deploying under tight timelines — MiniMax-M1 offers a lower operational cost profile while supporting advanced reasoning tasks. Its long context window could significantly reduce preprocessing efforts for enterprise documents or log data that span tens or hundreds of thousands of tokens. For those managing AI orchestration pipelines, the ability to fine-tune and deploy MiniMax-M1 using established tools like vLLM or Transformers supports easier integration into existing infrastructure. The hybrid-attention architecture may help simplify scaling strategies, and the model’s competitive performance on multi-step reasoning and software engineering benchmarks offers a high-capability base for internal copilots or agent-based systems. From a data platform perspective, teams responsible for maintaining efficient, scalable infrastructure can benefit from M1’s support for structured function calling and its compatibility with automated pipelines. Its open-source nature allows teams to tailor performance to their stack without vendor lock-in. Security leads may also find value in evaluating M1’s potential for secure, on-premises deployment of a high-capability model that doesn’t rely on transmitting sensitive data to third-party endpoints. Taken together, MiniMax-M1 presents a flexible option for organizations looking to experiment with or
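Because the article highlights vLLM serving and structured function calling, here is a hedged sketch of what tool use could look like against an OpenAI-compatible endpoint such as a locally hosted vLLM server. The base URL, served model name and tool schema are illustrative assumptions, not MiniMax's documented configuration.

```python
# Hedged sketch: structured function calling via an OpenAI-compatible endpoint
# (e.g. a local vLLM server). Base URL, model name and tool schema are
# illustrative assumptions for this example only.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M1-80k",   # assumed served model name
    messages=[{"role": "user", "content": "Where is order 1138?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:                  # the model chose to call the tool
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:                                   # or it answered directly
    print(message.content)
```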

MiniMax-M1 is a new open source model with 1 MILLION TOKEN context and new, hyper efficient reinforcement learning Read More »

The mirage of control: Privacy in the age of agentic AI

Presented by Zscaler We used to think of privacy as a perimeter problem: about walls and locks, permissions, and policies. But in a world where artificial agents are becoming autonomous actors — interacting with data, systems, and humans without constant oversight — privacy is no longer about control. It’s about trust. And trust, by definition, is about what happens when you’re not looking. Agentic AI — AI that perceives, decides, and acts on behalf of others — isn’t theoretical anymore. It’s routing our traffic, recommending our treatments, managing our portfolios, and negotiating our digital identity across platforms. These agents don’t just handle sensitive data — they interpret it. They make assumptions, act on partial signals, and evolve based on feedback loops. In essence, they build internal models not just of the world, but of us. And that should give us pause. Because once an agent becomes adaptive and semi-autonomous, privacy isn’t just about who has access to the data; it’s about what the agent infers, what it chooses to share, suppress, or synthesize, and whether its goals remain aligned with ours as contexts shift. Take a simple example: an AI health assistant designed to optimize wellness. It starts by nudging you to drink more water and get more sleep. But over time, it begins triaging your appointments, analyzing your tone of voice for signs of depression, and even withholding notifications it predicts will cause stress. You haven’t just shared your data — you’ve ceded narrative authority. That’s where privacy erodes, not through a breach, but through a subtle drift in power and purpose. This is no longer just about Confidentiality, Integrity, and Availability, the classic CIA triad. We must now factor in authenticity (can this agent be verified as itself?) and veracity (can we trust its interpretations and representations?). These aren’t merely technical qualities — they’re trust primitives. And trust is brittle when intermediated by intelligence. If I confide in a human therapist or lawyer, there are assumed boundaries — ethical, legal, psychological. We have expected norms of behavior on their part and limited access and control. But when I share with an AI assistant, those boundaries blur. Can it be subpoenaed? Audited? Reverse-engineered? What happens when a government or corporation queries my agent for its records? We have no settled concept yet of AI-client privilege. And if jurisprudence finds there isn’t one, then all the trust we place in our agents becomes retrospective regret. Imagine a world where every intimate moment shared with an AI is legally discoverable — where your agent’s memory becomes a weaponized archive, admissible in court. It won’t matter how secure the system is if the social contract around it is broken. Today’s privacy frameworks — GDPR, CCPA — assume linear, transactional systems. But agentic AI operates in context, not just computation. It remembers what you forgot. It intuits what you didn’t say. It fills in blanks that might be none of its business, and then shares that synthesis — potentially helpfully, potentially recklessly — with systems and people beyond your control. So we must move beyond access control and toward ethical boundaries. That means building agentic systems that understand the intent behind privacy, not just the mechanics of it. We must design for legibility; AI must be able to explain why it acted. And for intentionality. 
It must be able to act in a way that reflects the user’s evolving values, not just a frozen prompt history. But we also need to wrestle with a new kind of fragility: What if my agent betrays me? Not out of malice, but because someone else crafted better incentives — or passed a law that superseded its loyalties? In short: what if the agent is both mine and not mine? This is why we must start treating AI agency as a first-order moral and legal category. Not as a product feature. Not as a user interface. But as a participant in social and institutional life. Because privacy in a world of minds — biological and synthetic — is no longer a matter of secrecy. It’s a matter of reciprocity, alignment, and governance. If we get this wrong, privacy becomes performative — a checkbox in a shadow play of rights. If we get it right, we build a world where autonomy, both human and machine, is governed not by surveillance or suppression, but by ethical coherence. Agentic AI forces us to confront the limits of policy, the fallacy of control, and the need for a new social contract. One built for entities that think — and one that has the strength to survive when they speak back. Sam Curry is Global VP, CISO in Residence at Zscaler. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact [email protected]. source

The mirage of control: Privacy in the age of agentic AI Read More »

Announcing our 2025 VB Transform Innovation Showcase finalists

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more This year’s VB Innovation Showcase finalists will be at VB Transform, June 24-25, in San Francisco. They will take to the stage as we delve into what’s actually working in enterprise AI, from copilots to agents. Seven companies have been selected to showcase their generative AI products or features that are most likely to disrupt the enterprise. Those selected to present will do so in front of an invite-only audience of 600 industry decision-makers, and receive direct feedback from a panel of VC judges. The 2025 Innovation Showcase finalists are:

CTGT

Founded in 2024 by a team of Stanford and University of California, San Diego researchers, San Francisco-based CTGT is an AI risk management platform designed to change how enterprises deploy generative AI. Unlike conventional AI pipelines that require periodic offline updates, CTGT’s platform enables continuous real-time monitoring and automated refining of models in production, allowing AI systems to learn and adapt within live environments without ever going offline. This means models improve on the fly, closing the gap between development and deployment and ensuring maximum uptime and reliability for mission-critical AI applications. The company raised a $7.2 million seed round in February.

Catio

Palo Alto-based Catio is an AI-powered platform designed to assist companies in optimizing their tech stack architecture. The platform is a copilot for technical leaders and teams, providing data-driven insights and recommendations to evaluate, plan and evolve their technology infrastructure. Catio tackles these challenges by providing continuous architecture design, strategic tech stack planning, accurate analytics and evaluation of architectures, and personalized recommendations for each enterprise’s unique needs, all powered by advanced AI models and proprietary data. In March, Catio announced $3 million in additional funding. This is in addition to the $4 million raised between 2023-2024.

Kumo

Mountain View-based Kumo AI is focused on democratizing AI, particularly in the realm of predictive analytics, by leveraging cutting-edge Graph Neural Networks (GNNs) and Relational Deep Learning (RDL). They aim to make it easier for businesses to build and deploy highly accurate machine learning models directly from their relational data. Kumo has raised $37 million in funding in two rounds. Their latest round was an $18 million Series B in September 2022.

Solo.io

Cambridge, Mass.-based Solo.io is a cloud-native application networking company that provides solutions for connecting, securing and observing modern applications, particularly those built on Kubernetes and microservices. Founded in 2017, Solo.io aims to simplify the complexities of application networking in dynamic, multi-cloud environments. The company launched kagent – a first-of-its-kind cloud native framework that helps DevOps and platform engineers build and run AI agents in Kubernetes. A $135 million series C funding round in 2021 brought the company’s total funding to $175 million. The company is valued at $1 billion.

Superduper.io

Berlin-based Superduper.io is an AI company focused on simplifying the integration of AI models and workflows directly within existing databases, eliminating the need for complex data pipelines and separate AI infrastructure.
Their core offering is their Superduper Agents, which enable non-technical users to instantly answer even the trickiest questions about their data, documents and systems – and to set up AI workers for their tasks – just by chatting with ready-made AI agents. All without wrangling difficult tools, spreadsheets, dashboards and SQL or requiring help from engineering and analysts. Superduper.io has raised $1.75 million in one seed funding round. Notable investors include Hetz Ventures, session.vc and MongoDB. It was also part of the Intel Ignite accelerator program. Sutro Oakland-based Sutro is an AI-powered no-code platform that allows users to create full, production-ready mobile and web applications simply by describing their idea in plain text. It aims to democratize software development, making it accessible to individuals and businesses without coding expertise. Sutro was founded in 2021 and has raised around $6 million across two early funding rounds in 2023-2024. Qdrant Berlin-based Qdrant is a high-performance, massive-scale Vector Database and Vector Search Engine designed for the next generation of AI applications. It’s built in Rust, a language known for its safety and performance, making it a reliable choice for demanding AI workloads. Founded in 2020, Qdrant has raised $37.8 million over three rounds of funding, the latest of which was a $28 million series A in January 2024. Meet our Judging Panel Emily Zhao, principal at Salesforce Ventures Zhao is a principal at Salesforce Ventures focused on AI/ML. She also spends time on developer tooling, cybersecurity, vertical SaaS, and health tech. She joined Salesforce Ventures in the spring of 2022 and recently helped launch its $500 million generative AI fund. She has led several investments in the fund, including Hugging Face, RunwayML, Anthropic, Cohere and others. Before joining Salesforce Ventures, Zhao was an investor at Avenir Growth Capital, a venture-growth fund based in New York, where she spent most of her time on vertical SaaS, health tech and application software. Before Avenir, Emily was an associate in the Private Equity group at Blackstone and invested in corporate buyout transactions. Matt Kraning, partner at Menlo Ventures Kraning is one of Menlo‘s newest partners. He is focused on investing in AI, enterprise SaaS, national defense and cybersecurity. He’s a deeply technical former founder and a proven company builder with a Ph.D. in electrical engineering, specializing in AI and large-scale computing. Before Menlo, Kraning was the co-founder and CTO of Expanse, where he helped define the AI-driven attack surface management category. He’s also advised and invested in over 50 startups, including unicorns like Peregrine and Astranis, and recently led a $12 million round for Wispr Flow. Rebecca Li, investment director at Amex Ventures Li is an Investment Director at Amex Ventures. She joined in 2024 to focus on enterprise software investing, leading early-stage investments in infrastructure software, developer tools, data and AI. Before joining American Express, Li led fintech and software venture investments at Global Asset Capital. Her prior experience includes business development and partnerships at Credit Karma and technology

Announcing our 2025 VB Transform Innovation Showcase finalists Read More »

Meta’s new world model lets robots manipulate objects in environments they’ve never encountered before

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more While large language models (LLMs) have mastered text (and other modalities to some extent), they lack the physical “common sense” to operate in dynamic, real-world environments. This has limited the deployment of AI in areas like manufacturing and logistics, where understanding cause and effect is critical. Meta’s latest model, V-JEPA 2, takes a step toward bridging this gap by learning a world model from video and physical interactions. V-JEPA 2 can help create AI applications that require predicting outcomes and planning actions in unpredictable environments with many edge cases. This approach can provide a clear path toward more capable robots and advanced automation in physical environments.

How a ‘world model’ learns to plan

Humans develop physical intuition early in life by observing their surroundings. If you see a ball thrown, you instinctively know its trajectory and can predict where it will land. V-JEPA 2 learns a similar “world model,” which is an AI system’s internal simulation of how the physical world operates. The model is built on three core capabilities that are essential for enterprise applications: understanding what is happening in a scene, predicting how the scene will change based on an action, and planning a sequence of actions to achieve a specific goal. As Meta states in its blog, its “long-term vision is that world models will enable AI agents to plan and reason in the physical world.” The model’s architecture, called the Video Joint Embedding Predictive Architecture (V-JEPA), consists of two key parts. An “encoder” watches a video clip and condenses it into a compact numerical summary, known as an embedding. This embedding captures the essential information about the objects and their relationships in the scene. A second component, the “predictor,” then takes this summary and imagines how the scene will evolve, generating a prediction of what the next summary will look like.

V-JEPA is composed of an encoder and a predictor (source: Meta blog)

This architecture is the latest evolution of the JEPA framework, which was first applied to images with I-JEPA and now advances to video, demonstrating a consistent approach to building world models. Unlike generative AI models that try to predict the exact color of every pixel in a future frame — a computationally intensive task — V-JEPA 2 operates in an abstract space. It focuses on predicting the high-level features of a scene, such as an object’s position and trajectory, rather than its texture or background details, making it far more efficient than other, larger models at just 1.2 billion parameters. That translates to lower compute costs and makes it more suitable for deployment in real-world settings.

Learning from observation and action

V-JEPA 2 is trained in two stages. First, it builds its foundational understanding of physics through self-supervised learning, watching over one million hours of unlabeled internet videos. By simply observing how objects move and interact, it develops a general-purpose world model without any human guidance. In the second stage, this pre-trained model is fine-tuned on a small, specialized dataset. By processing just 62 hours of video showing a robot performing tasks, along with the corresponding control commands, V-JEPA 2 learns to connect specific actions to their physical outcomes.
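The encoder-predictor idea can be reduced to a few lines of toy PyTorch. This is a simplified sketch of the concept only, not Meta's released code: the tiny MLPs, tensor shapes and random data are stand-ins, and the real model uses video transformers over patch sequences. The point is that the loss is computed between predicted and target embeddings, not pixels.

```python
# Toy sketch of the JEPA idea: an encoder maps observations to embeddings and
# a predictor regresses the embedding of a future observation, so the loss
# lives in representation space rather than pixel space. All shapes and
# modules here are illustrative stand-ins.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, frame_dim=1024, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim, 512), nn.GELU(),
                                 nn.Linear(512, embed_dim))
    def forward(self, frames):              # frames: (batch, frame_dim)
        return self.net(frames)

class TinyPredictor(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 512), nn.GELU(),
                                 nn.Linear(512, embed_dim))
    def forward(self, context_embedding):
        return self.net(context_embedding)

encoder, predictor = TinyEncoder(), TinyPredictor()

current_frame = torch.randn(8, 1024)        # stand-in for encoded video input
future_frame = torch.randn(8, 1024)

pred = predictor(encoder(current_frame))
with torch.no_grad():                       # target embeddings are not backpropagated through
    target = encoder(future_frame)

loss = nn.functional.mse_loss(pred, target) # compare summaries, not pixels
loss.backward()
```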
This results in a model that can plan and control actions in the real world.

V-JEPA two-stage training pipeline (source: Meta)

This two-stage training enables a critical capability for real-world automation: zero-shot robot planning. A robot powered by V-JEPA 2 can be deployed in a new environment and successfully manipulate objects it has never encountered before, without needing to be retrained for that specific setting. This is a significant advance over previous models that required training data from the exact robot and environment where they would operate. The model was trained on an open-source dataset and then successfully deployed on different robots in Meta’s labs. For example, to complete a task like picking up an object, the robot is given a goal image of the desired outcome. It then uses the V-JEPA 2 predictor to internally simulate a range of possible next moves. It scores each imagined action based on how close it gets to the goal, executes the top-rated action, and repeats the process until the task is complete. Using this method, the model achieved success rates between 65% and 80% on pick-and-place tasks with unfamiliar objects in new settings.

Real-world impact of physical reasoning

This ability to plan and act in novel situations has direct implications for business operations. In logistics and manufacturing, it allows for more adaptable robots that can handle variations in products and warehouse layouts without extensive reprogramming. This can be especially useful as companies are exploring the deployment of humanoid robots in factories and assembly lines. The same world model can power highly realistic digital twins, allowing companies to simulate new processes or train other AIs in a physically accurate virtual environment. In industrial settings, a model could monitor video feeds of machinery and, based on its learned understanding of physics, predict safety issues and failures before they happen. This research is a key step toward what Meta calls “advanced machine intelligence (AMI),” where AI systems can “learn about the world as humans do, plan how to execute unfamiliar tasks, and efficiently adapt to the ever-changing world around us.” Meta has released the model and its training code and hopes to “build a broad community around this research, driving progress toward our ultimate goal of developing world models that can transform the way AI interacts with the physical world.”

What it means for enterprise technical decision-makers

V-JEPA 2 moves robotics closer to the software-defined model that cloud teams already recognize: pre-train once, deploy anywhere. Because the model learns general physics from public video and only needs a few dozen hours of task-specific footage, enterprises can slash the data-collection cycle that typically drags down pilot projects. In practical terms, you can prototype a pick-and-place robot
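The goal-image planning loop described above maps naturally onto a short sketch. Everything here is a hypothetical stand-in used only to show the structure of the loop (score imagined outcomes against the goal embedding, pick the best action, repeat); it is not Meta's released API.

```python
# Hedged sketch of goal-image planning: imagine each candidate action's
# outcome in embedding space, keep the one closest to the goal embedding.
# The encoder, predictor, and candidate actions are toy stand-ins.
import torch

def plan_step(encoder, predictor, current_obs, goal_image, candidate_actions):
    """Pick the candidate action whose predicted outcome embedding lands
    closest to the goal embedding."""
    goal_emb = encoder(goal_image)
    current_emb = encoder(current_obs)
    best_action, best_dist = None, float("inf")
    for action in candidate_actions:
        predicted_emb = predictor(current_emb, action)   # imagined next state
        dist = torch.norm(predicted_emb - goal_emb).item()
        if dist < best_dist:
            best_action, best_dist = action, dist
    return best_action

# Toy stand-ins so the sketch runs end to end; a real system would use the
# fine-tuned V-JEPA 2 encoder/predictor and a robot control stack.
embed = torch.nn.Linear(3 * 64 * 64, 128)
encoder = lambda img: embed(img.flatten())
predictor = lambda emb, act: emb + act       # pretend actions shift the state embedding
candidates = [torch.randn(128) * 0.1 for _ in range(16)]

obs, goal = torch.randn(3, 64, 64), torch.randn(3, 64, 64)
action = plan_step(encoder, predictor, obs, goal, candidates)
# In a real loop: execute `action`, observe again, and repeat until the task is done.
```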

Meta’s new world model lets robots manipulate objects in environments they’ve never encountered before Read More »

OpenAI moves forward with GPT-4.5 deprecation in API, triggering developer anguish and confusion

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more Word spread quickly across the machine learning and AI community on the social network X yesterday: OpenAI was sending developers an email notifying them that the company would be removing one of its largest and most powerful large language models (LLMs), GPT-4.5 Preview, from the official OpenAI application programming interface (API) on July 14, 2025. However, as an OpenAI spokesperson told VentureBeat via email, GPT-4.5 Preview will remain an option for individual ChatGPT users through the dropdown model selector menu at the top of the application. But it still means that any third-party developers who had built applications or workflows atop GPT-4.5 Preview (we’ll call it GPT-4.5 from here on out for simplicity’s sake, since a full GPT-4.5 was never made available on this platform) now need to switch over to another one of OpenAI’s nearly 40 (!!!) different model offerings still available through the API. The news quickly spread on X, where developers and AI enthusiasts posted reactions ranging from disappointment to confusion. Some described GPT-4.5 as a daily tool in their workflow, praising its tone and reliability. Others questioned the rationale behind launching the model in the first place if it was going to be short-lived. “This is sad — GPT-4.5 is one of my fav models,” wrote @BumrahBachi. Ben Hyak, the co-founder of AI observability and performance monitoring platform Raindrop.AI, called the move “tragic,” adding: “o3 + 4.5 are the models I use the most everyday.” Another user, @flowersslop, asked bluntly, “what was the purpose of this model all along?”

Deprecation had been planned since April

Despite the strong reaction, OpenAI had in fact already announced the plan to deprecate GPT-4.5 Preview back in April 2025 during the launch of GPT-4.1. At that time, the company stated that developers would have three months to transition away from 4.5. OpenAI framed the model as an experimental offering that provided insights for future development, and said it would carry forward learnings from GPT-4.5 into future iterations — particularly in areas like creativity and writing nuance. In a follow-up response to VentureBeat, OpenAI communications confirmed that the June email was simply a scheduled reminder and that there are currently no plans to remove GPT-4.5 from ChatGPT subscriptions, where the model remains available.

Community speculation on cost and model strategy

Still, the developer-facing deprecation leaves a gap for some users, especially those who had built workflows or products around GPT-4.5’s specific characteristics. Some in the community speculated that high compute costs might have influenced the move, noting that similar changes had occurred with prior models. Others referenced recent API pricing updates, including a major reduction in cost for OpenAI’s o3 reasoning model, which is now priced 80% lower than before. User @chatgpt21 commented that GPT-4.5 is the “best non reasoning model for open ai on all benchmarks and it’s obvious,” and predicted that once “they add test time compute it will blow o3 out of the water. In order to scale TTC you need to scale pre training.”

The end of the road for GPT-4.5 via API — developers encouraged to migrate to GPT-4.1

OpenAI has directed developers to its online forum for questions about migrating to GPT-4.1 or other models.
With the API shutdown for GPT-4.5 Preview set for mid-July, teams relying on the model now have less than a month to complete that transition. source
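For most API integrations the switch itself is a one-line change of the model identifier. Below is a minimal, hedged sketch using the official openai Python SDK; the model names match OpenAI's published identifiers at the time of writing, but teams should confirm current names, pricing and context limits before moving production traffic.

```python
# Minimal migration sketch: point existing chat completion calls at a new
# model id. Model identifiers should be verified against OpenAI's current
# model documentation before deploying.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODEL = "gpt-4.1"  # was: "gpt-4.5-preview", removed from the API on July 14, 2025

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a concise writing assistant."},
        {"role": "user", "content": "Rewrite this sentence in a warmer tone: ..."},
    ],
)
print(response.choices[0].message.content)
```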

OpenAI moves forward with GPT-4.5 deprecation in API, triggering developer anguish and confusion Read More »

Like humans, AI is forcing institutions to rethink their purpose

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more Cognitive migration is not just an individual journey; it is also a collective and institutional one. As AI reshapes the terrain of thought, judgment and coordination, the very foundations of our schools, governments, corporations and civic systems are being called into question. Institutions, like people, now face the challenge of rapid change: “Rethinking” their purpose, adapting their structures and rediscovering what makes them essential in a world where machines can increasingly think, decide and produce. Like people who are undergoing cognitive migration, institutions — and the people who run them — must reassess what they were made for.

Discontinuity

Institutions are designed to promote continuity. Their purpose is to endure, to offer structure, legitimacy and coherence across time. It is those very attributes that contribute to trust. We rely on institutions not only to deliver services and enforce norms, but to provide a sense of order in a complex world. They are the long-arc vessels of civilization, meant to hold steady as individuals come and go. Without viable institutions, society risks upheaval and an increasingly uncertain future. But today, many of our core institutions are reeling. Having long served as the scaffolding of modern life, they are being tested in ways that feel not only sudden, but systemic. Some of this pressure comes from AI, which is rapidly reshaping the cognitive terrain on which these institutions were built. But AI is not the only force. The past two decades have brought rising public distrust, partisan fragmentation and challenges to institutional legitimacy that predate the generative AI technological wave. From increasing income inequality, to attacks on scientific process and consensus, to politicized courts, to declining university enrollments, the erosion of trust in our institutions has multiple causes, as well as compounding effects. In this context, the arrival of increasingly capable AI systems is not merely another challenge. It is an accelerant, fuel to the fire of institutional disruption. This disruption demands that institutions adapt their operations and revisit foundational assumptions. What are institutions for in a world where credentialing, reasoning and coordination are no longer exclusively human domains? All this institutional reinvention needs to take place at a pace that defies their very purpose and nature. This is the institutional dimension of cognitive migration: A shift not just in how individuals find meaning and value, but in how our collective societal structures must evolve to support a new era. And as with all migrations, the journey will be uneven, contested and deeply consequential.

The architecture of the old regime

The institutions in place now were not designed for this moment. Most were forged in the Industrial Age and refined during the Digital Revolution. Their operating models reflect the logic of earlier cognitive regimes: stable processes, centralized expertise and the tacit assumption that human intelligence would remain preeminent. Schools, corporations, courts and government agencies are structured to manage people and information on a large scale. They rely on predictability, expert credentials and well-defined hierarchies of decision-making.
These are traditional strengths that — even when considered bureaucratic — have historically offered a foundation for trust, consistency and broad participation within complex societies. But the assumptions beneath these structures are under strain. AI systems now perform tasks once reserved for knowledge workers, including summarizing documents, analyzing data, writing legal briefs, performing research, creating lesson plans and teaching, coding applications and building and executing marketing campaigns. Beyond automation, a deeper disruption is underway: The people running these institutions are expected to defend their continued relevance in a world where knowledge itself is no longer as highly valued or even a uniquely human asset. The relevance of some institutions is called into question from outside challengers including tech platforms, alternative credentialing models and decentralized networks. This essentially means that the traditional gatekeepers of trust, expertise and coordination are being challenged by faster, flatter and often more digitally native alternatives. In some cases, even long-standing institutional functions such as adjudicating disputes are being questioned, ignored, or bypassed altogether. This does not mean institutional collapse is inevitable. But it does suggest that the current paradigm of stable, slow-moving and authority-based structures may not endure. At a minimum, institutions are under intense pressure to change. If institutions are to remain relevant and play a vital role in the age of AI, they must become more adaptive, transparent and attuned to the values that cannot readily be encoded in algorithms: human dignity, ethical deliberation and long-term stewardship. The choice ahead is not whether institutions will change, but how. Will they resist, ossify and fall into irrelevance? Will they be forcibly restructured to meet transient agendas? Or will they deliberately reimagine themselves as co-evolving partners in a world of shared intelligence and shifting value? First steps of institutional migration A growing number of institutions are beginning to adapt. These responses are varied and often tentative, signs of motion more than full transformation. These are green shoots; taken together, they suggest that the cognitive migration of institutions may already be underway. Yet there is a deeper challenge beneath these experiments: Many institutions are still bound by outdated methods of operating. The environment, however, has changed. AI and other factors are redrawing the landscape, and institutions are only beginning to recalibrate. One example of change comes from an Arizona-based charter school where AI plays a leading role in daily instruction. Branded as Unbound Academy, the school uses AI platforms to deliver core academic content in condensed, focused sessions tailored for each child. This shows promise to improve academic achievement while also allowing students time later in the day to work on life skills, project-based learning and interpersonal development. In this model, teachers are reframed as guides and mentors, not content deliverers. It is an early glimpse of what institutional migration might look like in education: Not just digitizing the old classroom, but redesigning its structure, human roles and priorities around

Like humans, AI is forcing institutions to rethink their purpose Read More »

From prompt chaos to clarity: How to build a robust AI orchestration layer

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more Editor’s note: Emilia will lead an editorial roundtable on this topic at VB Transform next week. Register today. AI agents seem like an inevitability these days. Most enterprises already use an AI application and may have deployed at least a single-agent system, with plans to pilot workflows with multiple agents. Managing all that sprawl, especially when attempting to build interoperability in the long run, can become overwhelming. Reaching that agentic future means creating a workable orchestration framework that directs the different agents. The demand for AI applications and orchestration has given rise to an emerging battleground, with companies focused on providing frameworks and tools gaining customers. Now, enterprises can choose between orchestration framework providers like LangChain, LlamaIndex, Crew AI, Microsoft’s AutoGen and OpenAI’s Swarm. Enterprises also need to consider the type of orchestration framework they want to implement. They can choose between a prompt-based framework, agent-oriented workflow engines, retrieval and indexed frameworks, or even end-to-end orchestration. As many organizations are just beginning to experiment with multiple AI agent systems or want to build out a larger AI ecosystem, specific criteria are at the top of their minds when choosing the orchestration framework that best fits their needs. This larger pool of options in orchestration pushes the space even further, encouraging enterprises to explore all potential choices for orchestrating their AI systems instead of forcing them to fit into something else. While it can seem overwhelming, there’s a way for organizations to look at the best practices in choosing an orchestration framework and figure out what works well for them. Orchestration platform Orq noted in a blog post that AI management systems include four key components: prompt management for consistent model interaction, integration tools, state management and monitoring tools to track performance.

Best practices to consider

For enterprises planning to embark on their orchestration journey or improve their current one, some experts from companies like Teneo and Orq note at least five best practices to start with:

1. Define your business goals.
2. Choose tools and large language models (LLMs) that align with your goals.
3. Lay out what you need out of an orchestration layer and prioritize these, i.e., integration, workflow design, monitoring and observability, scalability, security and compliance.
4. Know your existing systems and how to integrate them into the new layer.
5. Understand your data pipeline.

As with any AI project, organizations should take cues from their business needs. What do they need the AI application or agents to do, and how are these planned to support their work? Starting with this key step will help better inform their orchestration needs and the type of tools they require. Teneo said in a blog post that once that’s clear, teams must know what they need from their orchestration system and ensure these are the first features they look for. Some enterprises may want to focus more on monitoring and observability, rather than workflow design. Generally, most orchestration frameworks offer a range of features, and components such as integration, workflow, monitoring, scalability, and security are often the top priorities for businesses.
Understanding what matters most to the organization will better guide how they want to build out their orchestration layer.  In a blog post, LangChain stated that businesses should be aware of what information or work is passed to models.  “When using a framework, you need to have full control over what gets passed into the LLM, and full control over what steps are run and in what order (in order to generate the context that gets passed into the LLM). We prioritize this with LangGraph, which is a low-level orchestration framework with no hidden prompts, no enforced “cognitive architectures”. This gives you full control to do the appropriate context engineering that you require,” the company said.  Since most enterprises plan to add AI agents into existing workflows, it’s best practice to know which systems need to be part of the orchestration stack and find the platform that integrates best.  As always, enterprises need to know their data pipeline so they can compare the performance of the agents they are monitoring.  source
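To ground the LangGraph point above, here is a minimal sketch of an explicit two-step workflow. It assumes recent versions of the langgraph and langchain-openai packages, and the model name and node logic are illustrative placeholders; the takeaway is simply that every step, and every piece of context handed to the LLM, is visible in the graph definition.

```python
# Hedged sketch of a small, explicit LangGraph workflow: each node is a plain
# function that reads and updates a typed state, so the context passed to the
# LLM and the order of steps are fully under the developer's control.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI

class TicketState(TypedDict):
    ticket: str
    context: str
    answer: str

llm = ChatOpenAI(model="gpt-4.1-mini")   # illustrative model choice

def retrieve(state: TicketState) -> dict:
    # Placeholder retrieval step: you decide exactly what context the LLM sees.
    return {"context": f"Known issues related to: {state['ticket'][:200]}"}

def draft_answer(state: TicketState) -> dict:
    prompt = (f"Context:\n{state['context']}\n\n"
              f"Ticket:\n{state['ticket']}\n\nDraft a reply.")
    return {"answer": llm.invoke(prompt).content}

graph = StateGraph(TicketState)
graph.add_node("retrieve", retrieve)
graph.add_node("draft_answer", draft_answer)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "draft_answer")
graph.add_edge("draft_answer", END)

app = graph.compile()
result = app.invoke({"ticket": "VPN drops every 30 minutes on macOS",
                     "context": "", "answer": ""})
print(result["answer"])
```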

From prompt chaos to clarity: How to build a robust AI orchestration layer Read More »

The Interpretable AI playbook: What Anthropic’s research means for your enterprise LLM strategy

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more Anthropic CEO Dario Amodei made an urgent push in April for the need to understand how AI models think. This comes at a crucial time. As Anthropic battles in global AI rankings, it’s important to note what sets it apart from other top AI labs. Since its founding in 2021, when seven OpenAI employees broke off over concerns about AI safety, Anthropic has built AI models that adhere to a set of human-valued principles, a system they call Constitutional AI. These principles ensure that models are “helpful, honest and harmless” and generally act in the best interests of society. At the same time, Anthropic’s research arm is diving deep to understand how its models think about the world, and why they produce helpful (and sometimes harmful) answers. Anthropic’s flagship model, Claude 3.7 Sonnet, dominated coding benchmarks when it launched in February, proving that AI models can excel at both performance and safety. And the recent release of Claude 4.0 Opus and Sonnet again puts Claude at the top of coding benchmarks. However, in today’s rapid and hyper-competitive AI market, Anthropic’s rivals like Google’s Gemini 2.5 Pro and Open AI’s o3 have their own impressive showings for coding prowess, while they’re already dominating Claude at math, creative writing and overall reasoning across many languages. If Amodei’s thoughts are any indication, Anthropic is planning for the future of AI and its implications in critical fields like medicine, psychology and law, where model safety and human values are imperative. And it shows: Anthropic is the leading AI lab that focuses strictly on developing “interpretable” AI, which are models that let us understand, to some degree of certainty, what the model is thinking and how it arrives at a particular conclusion.  Amazon and Google have already invested billions of dollars in Anthropic even as they build their own AI models, so perhaps Anthropic’s competitive advantage is still budding. Interpretable models, as Anthropic suggests, could significantly reduce the long-term operational costs associated with debugging, auditing and mitigating risks in complex AI deployments. Sayash Kapoor, an AI safety researcher, suggests that while interpretability is valuable, it is just one of many tools for managing AI risk. In his view, “interpretability is neither necessary nor sufficient” to ensure models behave safely — it matters most when paired with filters, verifiers and human-centered design. This more expansive view sees interpretability as part of a larger ecosystem of control strategies, particularly in real-world AI deployments where models are components in broader decision-making systems. The need for interpretable AI Until recently, many thought AI was still years from advancements like those that are now helping Claude, Gemini and ChatGPT boast exceptional market adoption. While these models are already pushing the frontiers of human knowledge, their widespread use is attributable to just how good they are at solving a wide range of practical problems that require creative problem-solving or detailed analysis. As models are put to the task on increasingly critical problems, it is important that they produce accurate answers. 
Amodei fears that when an AI responds to a prompt, “we have no idea… why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate.” Such errors — hallucinations of inaccurate information, or responses that do not align with human values — will hold AI models back from reaching their full potential. Indeed, we’ve seen many examples of AI continuing to struggle with hallucinations and unethical behavior. For Amodei, the best way to solve these problems is to understand how an AI thinks: “Our inability to understand models’ internal mechanisms means that we cannot meaningfully predict such [harmful] behaviors, and therefore struggle to rule them out … If instead it were possible to look inside models, we might be able to systematically block all jailbreaks, and also characterize what dangerous knowledge the models have.” Amodei also sees the opacity of current models as a barrier to deploying AI models in “high-stakes financial or safety-critical settings, because we can’t fully set the limits on their behavior, and a small number of mistakes could be very harmful.” In decision-making that affects humans directly, like medical diagnosis or mortgage assessments, legal regulations require AI to explain its decisions. Imagine a financial institution using a large language model (LLM) for fraud detection — interpretability could mean explaining a denied loan application to a customer as required by law. Or a manufacturing firm optimizing supply chains — understanding why an AI suggests a particular supplier could unlock efficiencies and prevent unforeseen bottlenecks. Because of this, Amodei explains, “Anthropic is doubling down on interpretability, and we have a goal of getting to ‘interpretability can reliably detect most model problems’ by 2027.” To that end, Anthropic recently participated in a $50 million investment in Goodfire, an AI research lab making breakthrough progress on AI “brain scans.” Their model inspection platform, Ember, is an agnostic tool that identifies learned concepts within models and lets users manipulate them. In a recent demo, the company showed how Ember can recognize individual visual concepts within an image generation AI and then let users paint these concepts on a canvas to generate new images that follow the user’s design. Anthropic’s investment in Goodfire hints at the fact that developing interpretable models is difficult enough that Anthropic does not have the manpower to achieve interpretability on its own. Creating interpretable models requires new toolchains and skilled developers to build them.

Broader context: An AI researcher’s perspective

To break down Amodei’s perspective and add much-needed context, VentureBeat interviewed Kapoor, an AI safety researcher at Princeton. Kapoor co-authored the book AI Snake Oil, a critical examination of exaggerated claims surrounding the capabilities of leading AI models. He is also a co-author of “AI as Normal Technology,” in which he advocates for treating AI as a standard, transformational tool like the internet or electricity, and promotes

The Interpretable AI playbook: What Anthropic’s research means for your enterprise LLM strategy Read More »