VentureBeat

Teaching the model: Designing LLM feedback loops that get smarter over time

Large language models (LLMs) have dazzled with their ability to reason, generate and automate, but what separates a compelling demo from a lasting product isn’t just the model’s initial performance. It’s how well the system learns from real users.

Feedback loops are the missing layer in most AI deployments. As LLMs are integrated into everything from chatbots to research assistants to ecommerce advisors, the real differentiator lies not in better prompts or faster APIs, but in how effectively systems collect, structure and act on user feedback. Whether it’s a thumbs down, a correction or an abandoned session, every interaction is data — and every product has the opportunity to improve with it.

This article explores the practical, architectural and strategic considerations behind building LLM feedback loops. Drawing from real-world product deployments and internal tooling, we’ll dig into how to close the loop between user behavior and model performance, and why human-in-the-loop systems are still essential in the age of generative AI.

1. Why static LLMs plateau

The prevailing myth in AI product development is that once you fine-tune your model or perfect your prompts, you’re done. But that’s rarely how things play out in production.

LLMs are probabilistic: they don’t “know” anything in a strict sense, and their performance often degrades or drifts when applied to live data, edge cases or evolving content. Use cases shift, users introduce unexpected phrasing and even small changes to the context (like a brand voice or domain-specific jargon) can derail otherwise strong results.

Without a feedback mechanism in place, teams end up chasing quality through prompt tweaking or endless manual intervention, a treadmill that burns time and slows down iteration. Instead, systems need to be designed to learn from usage, not just during initial training, but continuously, through structured signals and productized feedback loops.

2. Types of feedback — beyond thumbs up/down

The most common feedback mechanism in LLM-powered apps is the binary thumbs up/down — and while it’s simple to implement, it’s also deeply limited. Feedback, at its best, is multi-dimensional. A user might dislike a response for many reasons: factual inaccuracy, tone mismatch, incomplete information or even a misinterpretation of their intent. A binary indicator captures none of that nuance. Worse, it often creates a false sense of precision for teams analyzing the data.

To improve system intelligence meaningfully, feedback should be categorized and contextualized. That might include:

Structured correction prompts: “What was wrong with this answer?” with selectable options (“factually incorrect,” “too vague,” “wrong tone”). Something like Typeform or Chameleon can be used to create custom in-app feedback flows without breaking the experience, while platforms like Zendesk or Delighted can handle structured categorization on the backend.

Freeform text input: Letting users add clarifying corrections, rewordings or better answers.

Implicit behavior signals: Abandonment rates, copy/paste actions or follow-up queries that indicate dissatisfaction.

Editor-style feedback: Inline corrections, highlighting or tagging (for internal tools). In internal applications, we’ve used Google Docs-style inline commenting in custom dashboards to annotate model replies, a pattern inspired by tools like Notion AI or Grammarly, which rely heavily on embedded feedback interactions.
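To make those categories concrete, here is a minimal sketch of what a structured feedback record might look like once it reaches the backend. The field names and categories are illustrative rather than taken from any of the tools mentioned above:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class FeedbackCategory(str, Enum):
    FACTUALLY_INCORRECT = "factually_incorrect"
    TOO_VAGUE = "too_vague"
    WRONG_TONE = "wrong_tone"
    MISUNDERSTOOD_INTENT = "misunderstood_intent"

@dataclass
class FeedbackRecord:
    """One structured feedback event tied to a single model response."""
    session_id: str
    prompt: str
    response: str
    category: Optional[FeedbackCategory] = None      # from a structured correction prompt
    freeform_comment: Optional[str] = None           # user's own words or a suggested rewrite
    implicit_signals: dict = field(default_factory=dict)  # e.g. {"abandoned_session": True}
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Example: a thumbs-down enriched with a category and an implicit signal
record = FeedbackRecord(
    session_id="sess_123",
    prompt="Summarize our refund policy",
    response="You can return items whenever you like.",
    category=FeedbackCategory.FACTUALLY_INCORRECT,
    freeform_comment="Refunds are only allowed within 30 days.",
    implicit_signals={"abandoned_session": True},
)
print(record)
```

The point is that a single event carries the explicit category, the user's own words and the implicit signals together, so downstream analysis doesn't have to reverse-engineer intent from a lone thumbs down.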
Each of these creates a richer training surface that can inform prompt refinement, context injection or data augmentation strategies.

3. Storing and structuring feedback

Collecting feedback is only useful if it can be structured, retrieved and used to drive improvement. And unlike traditional analytics, LLM feedback is messy by nature — it’s a blend of natural language, behavioral patterns and subjective interpretation. To tame that mess and turn it into something operational, try layering three key components into your architecture:

1. Vector databases for semantic recall

When a user provides feedback on a specific interaction — say, flagging a response as unclear or correcting a piece of financial advice — embed that exchange and store it semantically. Tools like Pinecone, Weaviate or Chroma are popular for this. They allow embeddings to be queried semantically at scale. For cloud-native workflows, we’ve also experimented with using Google Firestore plus Vertex AI embeddings, which simplifies retrieval in Firebase-centric stacks.

This allows future user inputs to be compared against known problem cases. If a similar input comes in later, we can surface improved response templates, avoid repeat mistakes or dynamically inject clarified context.

2. Structured metadata for filtering and analysis

Each feedback entry is tagged with rich metadata: user role, feedback type, session time, model version, environment (dev/test/prod) and confidence level (if available). This structure allows product and engineering teams to query and analyze feedback trends over time.

3. Traceable session history for root cause analysis

Feedback doesn’t live in a vacuum — it’s the result of a specific prompt, context stack and system behavior. Log complete session trails that map: user query → system context → model output → user feedback.

This chain of evidence enables precise diagnosis of what went wrong and why. It also supports downstream processes like targeted prompt tuning, retraining data curation or human-in-the-loop review pipelines.

Together, these three components turn user feedback from scattered opinion into structured fuel for product intelligence. They make feedback scalable — and continuous improvement part of the system design, not just an afterthought.

4. When (and how) to close the loop

Once feedback is stored and structured, the next challenge is deciding when and how to act on it. Not all feedback deserves the same response — some can be instantly applied, while others require moderation, context or deeper analysis.

Context injection: rapid, controlled iteration

This is often the first line of defense — and one of the most flexible. Based on feedback patterns, you can inject additional instructions, examples or clarifications directly into the system prompt or context.
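Here is a simplified sketch of that retrieval-and-injection pattern, using an in-memory index and a placeholder embedding function instead of Pinecone, Weaviate, Chroma or Vertex AI embeddings; with a real vector database the flow is the same, only the storage and embedding calls change:

```python
import math

def embed(text: str) -> list[float]:
    """Placeholder embedding; swap in a real embedding model or API in production."""
    vec = [0.0] * 64
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class FeedbackIndex:
    """In-memory stand-in for a vector database holding past problem cases."""

    def __init__(self) -> None:
        self.items: list[tuple[list[float], dict]] = []

    def add(self, user_query: str, clarification: str, metadata: dict) -> None:
        self.items.append((embed(user_query), {"clarification": clarification, **metadata}))

    def search(self, query: str, k: int = 3) -> list[dict]:
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), meta) for vec, meta in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [meta for _, meta in scored[:k]]

def build_system_prompt(base_prompt: str, index: FeedbackIndex, user_query: str) -> str:
    """Context injection: prepend clarifications learned from similar past failures."""
    hits = index.search(user_query)
    if not hits:
        return base_prompt
    notes = "\n".join(f"- {h['clarification']}" for h in hits)
    return f"{base_prompt}\n\nLessons from past feedback on similar requests:\n{notes}"

index = FeedbackIndex()
index.add(
    user_query="What's your refund window?",
    clarification="Refunds are only available within 30 days of purchase.",
    metadata={"feedback_type": "factually_incorrect", "model_version": "v3", "env": "prod"},
)
print(build_system_prompt("You are a support assistant.", index,
                          "How long do I have to return an item?"))
```

The metadata dictionary is where the structured tags (feedback type, model version, environment) ride along with the semantic payload, which is what makes later filtering and trend analysis possible.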


Hugging Face: 5 ways enterprises can slash AI costs without sacrificing performance

Enterprises seem to accept it as a basic fact: AI models require a significant amount of compute; they simply have to find ways to obtain more of it.

But it doesn’t have to be that way, according to Sasha Luccioni, AI and climate lead at Hugging Face. What if there’s a smarter way to use AI? What if, instead of striving for more (often unnecessary) compute and ways to power it, they can focus on improving model performance and accuracy?

Ultimately, model makers and enterprises are focusing on the wrong issue: They should be computing smarter, not harder or doing more, Luccioni says. “There are smarter ways of doing things that we’re currently under-exploring, because we’re so blinded by: We need more FLOPS, we need more GPUs, we need more time,” she said.

Here are five key learnings from Hugging Face that can help enterprises of all sizes use AI more efficiently.

1. Right-size the model to the task

Avoid defaulting to giant, general-purpose models for every use case. Task-specific or distilled models can match, or even surpass, larger models in terms of accuracy for targeted workloads — at a lower cost and with reduced energy consumption.

Luccioni, in fact, has found in testing that a task-specific model uses 20 to 30 times less energy than a general-purpose one. “Because it’s a model that can do that one task, as opposed to any task that you throw at it, which is often the case with large language models,” she said.

Distillation is key here; a full model could initially be trained from scratch and then refined for a specific task. DeepSeek R1, for instance, is “so huge that most organizations can’t afford to use it” because you need at least 8 GPUs, Luccioni noted. By contrast, distilled versions can be 10, 20 or even 30X smaller and run on a single GPU.

In general, open-source models help with efficiency, she noted, as they don’t need to be trained from scratch. That’s compared to just a few years ago, when enterprises were wasting resources because they couldn’t find the model they needed; nowadays, they can start out with a base model and fine-tune and adapt it. “It provides incremental shared innovation, as opposed to siloed, everyone’s training their models on their datasets and essentially wasting compute in the process,” said Luccioni.

It’s becoming clear that companies are quickly getting disillusioned with gen AI, as costs are not yet proportionate to the benefits. Generic use cases, such as writing emails or transcribing meeting notes, are genuinely helpful. However, task-specific models still require “a lot of work” because out-of-the-box models don’t cut it and are also more costly, said Luccioni. This is the next frontier of added value. “A lot of companies do want a specific task done,” Luccioni noted. “They don’t want AGI, they want specific intelligence. And that’s the gap that needs to be bridged.”

2. Make efficiency the default

Adopt “nudge theory” in system design, set conservative reasoning budgets, limit always-on generative features and require opt-in for high-cost compute modes.
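As a rough illustration of what “efficiency by default” can look like at the application layer, the sketch below routes simple tasks to a small task-specific model and only reaches for a large general-purpose model when the caller explicitly opts in. The model names and task categories are placeholders, not Hugging Face recommendations:

```python
from dataclasses import dataclass

@dataclass
class ModelChoice:
    name: str
    reasoning: bool

# Illustrative tiers: a small, task-specific (distilled) model by default,
# a large general-purpose model only on explicit opt-in.
DEFAULT_MODEL = ModelChoice(name="distilled-summarizer-1b", reasoning=False)
HEAVY_MODEL = ModelChoice(name="general-purpose-70b", reasoning=True)

SIMPLE_TASKS = {"summarize", "classify", "extract", "translate"}

def pick_model(task_type: str, opted_in_to_heavy_compute: bool = False) -> ModelChoice:
    """Efficiency as the default: callers must opt in to the expensive path."""
    if opted_in_to_heavy_compute:
        return HEAVY_MODEL
    if task_type in SIMPLE_TASKS:
        return DEFAULT_MODEL
    # Even unfamiliar tasks start on the cheap path; escalation is a deliberate choice.
    return DEFAULT_MODEL

print(pick_model("summarize"))                                           # small model, no reasoning
print(pick_model("multi_step_analysis", opted_in_to_heavy_compute=True)) # heavy model
```

The design choice mirrors the opt-in principle described next: the expensive mode is never the silent default.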
In cognitive science, “nudge theory” is a behavioral change management approach designed to influence human behavior subtly. The “canonical example,” Luccioni noted, is adding cutlery to takeout: Having people decide whether they want plastic utensils, rather than automatically including them with every order, can significantly reduce waste. “Just getting people to opt into something versus opting out of something is actually a very powerful mechanism for changing people’s behavior,” said Luccioni.

Such default mechanisms are often unnecessary, and they increase use and, therefore, costs, because models end up doing more work than they need to. For instance, with popular search engines such as Google, a gen AI summary automatically populates at the top by default. Luccioni also noted that, when she recently used OpenAI’s GPT-5, the model automatically worked in full reasoning mode on “very simple questions.”

“For me, it should be the exception,” she said. “Like, ‘what’s the meaning of life,’ then sure, I want a gen AI summary. But with ‘What’s the weather like in Montreal,’ or ‘What are the opening hours of my local pharmacy?’ I do not need a generative AI summary, yet it’s the default. I think that the default mode should be no reasoning.”

3. Optimize hardware utilization

Use batching, adjust precision and fine-tune batch sizes for the specific hardware generation to minimize wasted memory and power draw.

For instance, enterprises should ask themselves: Does the model need to be on all the time? Will people be pinging it in real time, 100 requests at once? In that case, always-on optimization is necessary, Luccioni noted. However, in many others, it’s not; the model can be run periodically to optimize memory usage, and batching can ensure optimal memory utilization. “It’s kind of like an engineering challenge, but a very specific one, so it’s hard to say, ‘Just distill all the models,’ or ‘change the precision on all the models,’” said Luccioni.

In one of her recent studies, she found that batch size depends on hardware, even down to the specific type or version. Going from one batch size to plus-one can increase energy use because models need more memory bars. “This is something that people don’t really look at, they’re just like, ‘Oh, I’m gonna maximize the batch size,’ but it really comes down to tweaking all these different things, and all of a sudden it’s super efficient, but it only works in your specific context,” Luccioni explained.

4. Incentivize energy transparency

It always helps when people are incentivized; to this end, Hugging Face earlier this year launched AI Energy Score. It’s a novel way to promote more energy efficiency,


CodeSignal’s new AI tutoring app Cosmo wants to be the ‘Duolingo for job skills’

CodeSignal Inc., the San Francisco-based skills assessment platform trusted by Netflix, Meta, and Capital One, launched Cosmo on Wednesday, a mobile learning application that transforms spare minutes into career-ready skills through artificial intelligence-powered micro-courses.

The app represents a strategic pivot for CodeSignal, which built its reputation assessing technical talent for major corporations but always harbored ambitions to revolutionize workplace education. Cosmo delivers over 300 bite-sized courses across generative AI, coding, marketing, finance, and leadership through an interactive chat interface powered by an AI tutor.

“Cosmo is like having an AI tutor in your pocket that can teach you anything from GenAI to coding to marketing to finance to leadership, and it does it through practice,” said Tigran Sloyan, CodeSignal’s co-founder and CEO, in an exclusive interview with VentureBeat. “Instead of watching a video or reading about something, you immediately start practicing.”

The launch comes as organizations grapple with massive skills gaps created by rapid AI adoption. According to the 2024 Stack Overflow Developer Survey, 76% of developers are now using or plan to use AI tools, yet most workers lack the practical knowledge to harness these tools effectively. Traditional corporate training programs, which can cost $20,000 to $40,000 per person for executive-level instruction, have proven inadequate for scaling AI literacy across entire workforces.

How CodeSignal pivoted from tech hiring platform to mobile education powerhouse

CodeSignal’s journey into mobile learning culminates a decade-long vision that took an unexpected detour through the hiring technology space. Sloyan originally founded the company in 2015 with educational ambitions but quickly realized that without skills-based hiring practices, alternative education would fail to gain traction.

“I started the company with that dream and mission: I want to help more humans achieve their true potential, which naturally leads to better education,” Sloyan explained in an interview. “But roughly two years into the company’s history, I realized that without knowing companies would actually care about the skills you build through alternative education — rather than just asking ‘where did you go to college?’ or ‘what did you major in?’ — it wouldn’t work.”

The company spent the next six years building what became the leading technical assessment platform, processing millions of coding evaluations for over 3,000 companies. This hiring-focused period provided CodeSignal with crucial intelligence about which skills employers actually value — data that now informs Cosmo’s curriculum development. “We know exactly what companies are looking for,” Sloyan said.
“Without that, I feel like you’re shooting in the dark when you’re trying to prepare people for what is going to help them get that job, what is going to help them advance their career.”

Why AI tutors could finally solve the personalized learning problem

Cosmo differentiates itself through what CodeSignal calls “practice-first learning,” where users immediately engage with realistic workplace scenarios rather than consuming passive video content. The app’s AI tutor, also named Cosmo, guides learners through conversational exchanges that adapt to individual knowledge levels and learning pace.

The platform addresses what educational psychologists call “Bloom’s two sigma problem” — a 1984 study showing that one-on-one tutoring produces learning outcomes two standard deviations above traditional classroom instruction. For four decades, this remained theoretically interesting but practically impossible to scale.

“We know one-on-one personalization and tutoring really makes a difference in learning, but it can’t be done at scale. How do you get a tutor for every human?” Sloyan said. “In 2023, when I saw early versions of generative AI, I thought: this is the moment. This technology, especially if it keeps getting better, can be uniquely used to help humans learn the way learning was meant to happen.”

The app combines predetermined course content with real-time personalization. Each lesson follows a structured curriculum, but learners can interrupt with questions that prompt immediate AI-generated explanations before returning to the main content thread.

Generative AI skills training takes center stage as workforce scrambles to adapt

Nearly one-third of Cosmo’s launch content focuses on generative AI applications, reflecting what CodeSignal identifies as the most critical skills gap in today’s market. The app offers role-specific AI training paths for sales professionals, marketers, engineers, healthcare workers, and other specialties.

“The biggest emphasis is on generative AI skills, because that’s the biggest career skills gap right now for both students and working adults,” Sloyan explained. “Everything from how to understand and use GenAI, how to think about its limitations, how to be better at prompting, and how to understand the entire landscape.”

This focus addresses a broader workforce transformation driven by AI adoption. While some fear job displacement, Sloyan predicts increased demand for skilled workers who can effectively collaborate with AI systems. “I don’t believe we’re going to reach a point where humans are no longer needed in the workforce. I think it’s going to be the opposite. We’re going to need more humans, because what an individual human can do in the age of AI is going to be so much bigger than what we could do before,” he said.

Mobile-first learning strategy targets both individual workers and corporate clients

CodeSignal positions Cosmo as fundamentally a consumer application that also serves enterprise customers — a reflection of how workplace learning actually occurs. The company already provides its GenAI Skills Academy to corporate clients, and Cosmo extends this training to mobile devices for on-the-go learning.

“Even though some of the largest educational companies, like Coursera and Udemy, are making the majority of their income, or at least half, from companies, at the end of the day, education is a consumer business,” Sloyan noted. “Who are you educating? You’re not educating a company — you’re educating individuals.”

The app launches free on iOS with premium


OpenAI is editing its GPT-5 rollout on the fly — here’s what’s changing in ChatGPT

OpenAI’s launch of its most advanced AI model GPT-5 last week has been a stress test for the world’s most popular chatbot platform with 700 million weekly active users — and so far, OpenAI is openly struggling to keep users happy and its service running smoothly.

The new flagship model GPT-5 — available in four variants of different speed and intelligence (regular, mini, nano, and pro), alongside longer-response and more powerful “thinking” modes for at least three of these variants — was said to offer faster responses, more reasoning power, and stronger coding ability. Instead, it was greeted with frustration: some users were vocally dismayed by OpenAI’s decision to abruptly remove the older underlying AI models from ChatGPT — ones users previously relied upon, and in some cases, forged deep emotional fixations with — and by GPT-5’s apparently worse performance than said older models on tasks in math, science, writing and other domains.

Indeed, the rollout has exposed infrastructure strain, user dissatisfaction, and a broader, more unsettling issue now drawing global attention: the growing emotional and psychological reliance some people form on AI, and the resulting break from reality some users experience, known as “ChatGPT psychosis.”

From bumpy debut to incremental fixes

The long-anticipated GPT-5 model family debuted Thursday, August 7 in a livestreamed event beset with chart errors and some voice mode glitches during the presentation. But worse than these cosmetic issues for many users was the fact that OpenAI automatically deprecated its older AI models that used to power ChatGPT — GPT-4o, GPT-4.1, o3, o4-mini and o4-high — forcing all users over to the new GPT-5 model and directing their queries to different versions of its “thinking” process without revealing why or which specific model version was being used.

Early adopters of GPT-5 reported basic math and logic mistakes, inconsistent code generation, and uneven real-world performance compared to GPT-4o. For context, the old models GPT-4o, o3, o4-mini and more have remained available to users of OpenAI’s paid application programming interface (API) since the launch of GPT-5 on Thursday.

By Friday, OpenAI co-founder and CEO Sam Altman conceded the launch was “a little more bumpy than we hoped for,” and blamed a failure in GPT-5’s new automatic “router” — the system that assigns prompts to the most appropriate variant. Altman and others at OpenAI claimed the “autoswitcher” went offline “for a chunk of the day,” making the model seem “way dumber” than intended.

The launch of GPT-5 was preceded just days prior by the launch of OpenAI’s new open source large language models (LLMs) named gpt-oss, which also received mixed reviews. These models are not available in ChatGPT; rather, they are free to download and run locally or on third-party hardware.

How to switch back from GPT-5 to GPT-4o in ChatGPT

Within 24 hours, OpenAI restored GPT-4o access for Plus subscribers (those on $20-per-month or higher subscription plans), pledged more transparent model labeling, and promised a UI update to let users manually trigger GPT-5’s “thinking” mode.
Already, users can manually select the older models on the ChatGPT website by finding their account name and icon in the lower left corner of the screen, clicking it, then clicking “Settings” and “General” and toggling on “Show legacy models.” There’s no indication from OpenAI that other old models will be returning to ChatGPT anytime soon.

Upgraded usage limits for GPT-5

Altman said that ChatGPT Plus subscribers will get twice as many messages using the GPT-5 “Thinking” mode that offers more reasoning and intelligence — up to 3,000 per week — and that engineers began fine-tuning decision boundaries in the message router. As Tibor Blaho (@btibor91) summarized on X: OpenAI is testing a 3,000-per-week limit for GPT-5 Thinking messages for Plus users, significantly increasing reasoning rate limits, and will soon raise all model-class rate limits above pre-GPT-5 levels.

By the weekend, GPT-5 was available to 100% of Pro subscribers and “getting close to 100% of all users.” Altman said the company had “underestimated how much some of the things that people like in GPT-4o matter to them” and vowed to accelerate per-user customization — from personality warmth to tone controls like emoji use.

Looming capacity crunch

Altman warned that OpenAI faces a “severe capacity challenge” this week as usage of reasoning models climbs sharply — from less than 1% to 7% of free users, and from 7% to 24% of Plus subscribers. He teased giving Plus subscribers a small monthly allotment of GPT-5 Pro queries and said the company will soon explain how it plans to balance capacity between ChatGPT, the API, research, and new user onboarding.

Altman: model attachment is real — and risky

In a post on X last night, Altman acknowledged a dynamic the company has tracked “for the past year or so”: users’ deep attachment to specific models. “It feels different and stronger than the kinds of attachment people have had to previous kinds of technology,” he wrote, admitting that suddenly deprecating older models “was a mistake.”

He tied this to a broader risk: some users treat ChatGPT as a therapist or life coach, which can be beneficial, but for a “small percentage” can reinforce delusion or undermine long-term well-being. While OpenAI’s


TensorZero nabs $7.3M seed to solve the messy world of enterprise LLM development

TensorZero, a startup building open-source infrastructure for large language model applications, announced Monday it has raised $7.3 million in seed funding led by FirstMark, with participation from Bessemer Venture Partners, Bedrock, DRW, Coalition, and dozens of strategic angel investors.

The funding comes as the 18-month-old company experiences explosive growth in the developer community. TensorZero’s open-source repository recently achieved the “#1 trending repository of the week” spot globally on GitHub, jumping from roughly 3,000 to over 9,700 stars in recent months as enterprises grapple with the complexity of building production-ready AI applications.

“Despite all the noise in the industry, companies building LLM applications still lack the right tools to meet complex cognitive and infrastructure needs, and resort to stitching together whatever early solutions are available on the market,” said Matt Turck, General Partner at FirstMark, who led the investment. “TensorZero provides production-grade, enterprise-ready components for building LLM applications that natively work together in a self-reinforcing loop, out of the box.”

The Brooklyn-based company addresses a growing pain point for enterprises deploying AI applications at scale. While large language models like GPT-5 and Claude have demonstrated remarkable capabilities, translating these into reliable business applications requires orchestrating multiple complex systems for model access, monitoring, optimization, and experimentation.

How nuclear fusion research shaped a breakthrough AI optimization platform

TensorZero’s approach stems from co-founder and CTO Viraj Mehta’s unconventional background in reinforcement learning for nuclear fusion reactors. During his PhD at Carnegie Mellon, Mehta worked on Department of Energy research projects where data collection cost “like a car per data point — $30,000 for 5 seconds of data,” he explained in a recent interview with VentureBeat.

“That problem leads to a huge amount of concern about where to focus our limited resources,” Mehta said. “We were going to only get to run a handful of trials total, so the question became: what is the marginally most valuable place we can collect data from?”

This experience shaped TensorZero’s core philosophy: maximizing the value of every data point to continuously improve AI systems. The insight led Mehta and co-founder Gabriel Bianconi, former chief product officer at Ondo Finance (a decentralized finance project with over $1 billion in assets under management), to reconceptualize LLM applications as reinforcement learning problems where systems learn from real-world feedback.

“LLM applications in their broader context feel like reinforcement learning problems,” Mehta explained. “You make many calls to a machine learning model with structured inputs, get structured outputs, and eventually receive some form of reward or feedback.
This looks to me like a partially observable Markov decision process.”

Why enterprises are ditching complex vendor integrations for unified AI infrastructure

Traditional approaches to building LLM applications require companies to integrate numerous specialized tools from different vendors — model gateways, observability platforms, evaluation frameworks, and fine-tuning services. TensorZero unifies these capabilities into a single open-source stack designed to work together seamlessly.

“Most companies didn’t go through the hassle of integrating all these different tools, and even the ones that did ended up with fragmented solutions, because those tools weren’t designed to work well with each other,” Bianconi said. “So we realized there was an opportunity to build a product that enables this feedback loop in production.”

The platform’s core innovation is creating what the founders call a “data and learning flywheel” — a feedback loop that turns production metrics and human feedback into smarter, faster, and cheaper models. Built in Rust for performance, TensorZero achieves sub-millisecond latency overhead while supporting all major LLM providers through a unified API.

Major banks and AI startups are already building production systems on TensorZero

The approach has already attracted significant enterprise adoption. One of Europe’s largest banks is using TensorZero to automate code changelog generation, while numerous AI-first startups from Series A to Series B stage have integrated the platform across diverse industries including healthcare, finance, and consumer applications.

“The surge in adoption from both the open-source community and enterprises has been incredible,” Bianconi said. “We’re fortunate to have received contributions from dozens of developers worldwide, and it’s exciting to see TensorZero already powering cutting-edge LLM applications at frontier AI startups and large organizations.”

The company’s customer base spans organizations from startups to major financial institutions, drawn by both the technical capabilities and the open-source nature of the platform. For enterprises with strict compliance requirements, the ability to run TensorZero within their own infrastructure provides crucial control over sensitive data.

How TensorZero outperforms LangChain and other AI frameworks at enterprise scale

TensorZero differentiates itself from existing solutions like LangChain and LiteLLM through its end-to-end approach and focus on production-grade deployments. While many frameworks excel at rapid prototyping, they often hit scalability ceilings that force companies to rebuild their infrastructure.

“There are two dimensions to think about,” Bianconi explained. “First, there are a number of projects out there that are very good to get started quickly, and you can put a prototype out there very quickly. But often companies will hit a ceiling with many of those products and need to churn and go for something else.”

The platform’s structured approach to data collection also enables more sophisticated optimization techniques. Unlike traditional observability tools that store raw text inputs and outputs, TensorZero maintains structured data about the variables that go into each inference, making it easier to retrain models and experiment with different approaches.
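The sketch below is not TensorZero’s actual API; it is a minimal illustration of that idea: log the structured variables behind each inference, then attach whatever feedback arrives later, so the pair can feed optimization and retraining.

```python
import json
import time
import uuid

def log_inference(store: list, function_name: str, variables: dict, output: str) -> str:
    """Record the structured inputs behind an inference, not just the rendered prompt text."""
    inference_id = str(uuid.uuid4())
    store.append({
        "inference_id": inference_id,
        "function_name": function_name,
        "variables": variables,   # e.g. {"diff": "...", "audience": "internal"}
        "output": output,
        "timestamp": time.time(),
        "feedback": None,         # filled in later, once a human or a metric weighs in
    })
    return inference_id

def log_feedback(store: list, inference_id: str, metric: str, value) -> None:
    """Attach downstream feedback (a rating, a conversion, a human edit) to the inference."""
    for row in store:
        if row["inference_id"] == inference_id:
            row["feedback"] = {"metric": metric, "value": value}

store: list = []
iid = log_inference(
    store,
    function_name="draft_changelog",
    variables={"diff": "fix: add null check in payment handler", "audience": "internal"},
    output="Added a null check to the payment handler to prevent crashes on empty carts.",
)
log_feedback(store, iid, metric="human_approved", value=True)
print(json.dumps(store, indent=2))
```

Because the variables are kept separate from the rendered prompt, the same records can later be re-templated, filtered by feedback, and reused for fine-tuning or prompt experiments.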
Rust-powered performance delivers sub-millisecond latency at 10,000+ queries per second

Performance has been a key design consideration. In benchmarks, TensorZero’s Rust-based gateway adds less than 1 millisecond of latency at the 99th percentile while handling over 10,000 queries per second. This compares favorably to Python-based alternatives like LiteLLM, which can add 25-100x more latency at much


Keychain raises $30M and launches AI operating system for CPG manufacturers

The next time you shop at a grocery store, you might want to thank this AI startup for keeping the shelves stocked with your favorite food products — and keeping them fresh and safe to eat.

That would be Keychain, an AI-powered marketplace for retailers to buy the consumer packaged goods (CPG) on their shelves, started back in 2023 by Oisin Hanrahan, former CEO of Angi (formerly Angie’s List) and co-founder and CEO of Handy. Today, Keychain is announcing a $30 million Series B funding round and unveiling KeychainOS, its new AI operating system designed to replace or integrate with legacy Enterprise Resource Planning (ERP) tools in manufacturing.

Promotional screenshot of KeychainOS. Credit: Keychain

An ERP is enterprise software that integrates essential business functions — such as finance, HR, manufacturing, procurement, and supply chain — into one unified platform, giving organizations a real-time, single source of information about what’s happening across the entire organization at any given moment.

The global ERP market is expansive and growing fast, valued at an estimated $81.15 billion in 2024 and forecast to reach $229.79 billion by 2032. It makes sense, then, that Keychain would try to take some of this market share, especially since it’s already assisting CPG manufacturers with their current product search and cataloging software. In contrast to the current ERP market, KeychainOS is being touted as a faster, CPG-specific alternative to systems like Oracle, QAD, and Plex, which often require months of setup and multiple add-ons before becoming fully usable.

The funding round was led by Wellington Management with participation from BoxGroup and other existing investors, bringing Keychain’s total capital raised to $68 million just 18 months after launch.

Building on a successful initial rollout

Keychain’s story began with a narrower entry point into the CPG supply chain. “We started with a big vision: build the operating system for CPG. Our first product was a search-and-discovery tool so brands and retailers could find manufacturers,” said Hanrahan in a recent video call interview with VentureBeat.

Keychain co-founding team, from L to R: Jordan Weitz, Oisin Hanrahan, and Umang Dua

That initial product has grown quickly, with more than 20,000 brands and retailers as customers, thousands of manufacturers — and more than a billion dollars of search-and-discovery volume a month. In fact, Keychain states it’s currently being used by 8 of the top 10 U.S. retailers and 7 of the top 10 CPG brands, including 7-Eleven, Whole Foods, and General Mills.

Expanding into ERP

KeychainOS extends beyond sourcing into the core functions of manufacturing operations. Hanrahan emphasized that the expansion builds on Keychain’s existing customer base. Unlike traditional ERP systems, KeychainOS is designed to be implemented in days and integrates seamlessly with Keychain’s sourcing platform.

The system responds to a need Hanrahan says he hears directly from the market. “Every four to eight weeks we host industry dinners with 80 to 200 people,” he told VentureBeat.
“A constant theme is how hard it is to customize non-CPG software to run a plant, and the lack of connectivity between food safety, procurement, planning, and cost accounting.”

KeychainOS was born to solve these difficulties far more quickly, efficiently, and smoothly. “We’re starting with customers who already use us for search and discovery—people paying us millions in aggregate. It’s a natural expansion,” Hanrahan offered. “Ultimately, it’ll be everywhere—like water. We’ve already had teams rip out existing food-safety software and replace it with Keychain OS.”

Using AI to augment food safety and manufacturing process checking

One of the ways KeychainOS differentiates itself from legacy ERP platforms is how it handles data entry. Traditional systems often require extensive manual input, which can slow operations and introduce errors. “Tools are fragmented and hard to use,” Hanrahan stated. “Today the expectation is natural-language interfaces and automated data ingestion—not smashing a keyboard to enter data.”

Promotional image of KeychainOS on a processing facility floor. Credit: Keychain

This reflects a design choice to minimize repetitive entry by enabling the system to capture and organize information in the background. The company is also expanding how workers interact with the software on the factory floor. At present, the primary interface is tapping on a screen through tablets placed in production environments. However, Keychain is building toward multimodal input. “On the floor, the primary interface is tapping on a screen—today it’s screen, tap, and type—while we add computer vision, connected scales, and voice,” Hanrahan explained. This means over time, facilities may be able to automatically record temperatures, weights, or other production data without manual input.

Another feature of KeychainOS is the use of adaptive checklists powered by AI. Instead of static, paper-based forms where workers tick boxes regardless of context, the system can adjust based on responses. For example, if an operator records that two batches of a product or component were mixed together, the software automatically prompts additional required steps, rather than leaving that compliance process to chance. This allows food-safety audits and quality checks to be both standardized and responsive, ensuring that no steps are missed in daily operations.

Competitive positioning

The shift puts Keychain in competition with large, established, legacy ERP vendors. But Hanrahan believes the new KeychainOS arrives at a moment when manufacturers are crying out for alternatives. “We think the moment is here to build a better AI-native ERP without the 14 dropdowns and 17 checkboxes people hate in traditional SaaS,” he said.

The company is betting that its vertical focus and AI capabilities will appeal to manufacturers weary of fragmented software. Instead of stitching together different tools, KeychainOS offers integrated modules for compliance, planning, and traceability, with the ability to share data across the supply chain.

Customer and investor perspective

Whole


GEPA optimizes LLMs without costly reinforcement learning

Researchers from the University of California, Berkeley, Stanford University and Databricks have introduced a new AI optimization method called GEPA that significantly outperforms traditional reinforcement learning (RL) techniques for adapting large language models (LLMs) to specialized tasks.

GEPA moves away from the popular paradigm of learning through thousands of trial-and-error attempts guided by simple numerical scores. Instead, it uses an LLM’s own language understanding to reflect on its performance, diagnose errors, and iteratively evolve its instructions. In addition to being more accurate than established techniques, GEPA is significantly more efficient, achieving superior results with up to 35 times fewer trial runs. For businesses building complex AI agents and workflows, this translates directly into faster development cycles, substantially lower computational costs, and more performant, reliable applications.

The high cost of optimizing modern AI systems

Modern enterprise AI applications are rarely a single call to an LLM. They are often “compound AI systems,” complex workflows that chain multiple LLM modules, external tools such as databases or code interpreters, and custom logic to perform sophisticated tasks, including multi-step research and data analysis.

A popular way to optimize these systems is through reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), a technique employed in popular reasoning models, including DeepSeek-R1. This method treats the system as a black box; it runs a task, gets a simple success metric (a “scalar reward,” like a score of 7/10), and uses this feedback to slowly nudge the model’s parameters in the right direction.

The major drawback of RL is its sample inefficiency. To learn effectively from these sparse numerical scores, RL methods often require tens of thousands, or even hundreds of thousands, of trial runs, known as “rollouts.” For any real-world enterprise application that involves expensive tool calls (e.g., API queries, code compilation) or uses powerful proprietary models, this process is prohibitively slow and costly.

As Lakshya A Agrawal, co-author of the paper and doctoral student at UC Berkeley, told VentureBeat, this complexity is a major barrier for many companies. “For many teams, RL is not practical due to its cost and complexity—and their go-to approach so far would often just be prompt engineering by hand,” Agrawal said. He noted that GEPA is designed for teams that need to optimize systems built on top-tier models that often can’t be fine-tuned, allowing them to improve performance without managing custom GPU clusters.

The researchers frame this challenge as follows: “How can we extract maximal learning signal from every expensive rollout to enable effective adaptation of complex, modular AI systems in low-data or budget-constrained settings?”

An optimizer that learns with language

GEPA framework. Source: arXiv

GEPA (Genetic-Pareto) is a prompt optimizer that tackles this challenge by replacing sparse rewards with rich, natural language feedback.
It leverages the fact that the entire execution of an AI system (including its reasoning steps, tool calls, and even error messages) can be serialized into text that an LLM can read and understand. GEPA’s methodology is built on three core pillars.

First is “genetic prompt evolution,” where GEPA treats a population of prompts like a gene pool. It iteratively “mutates” prompts to create new, potentially better versions. This mutation is an intelligent process driven by the second pillar: “reflection with natural language feedback.” After a few rollouts, GEPA provides an LLM with the full execution trace (what the system tried to do) and the outcome (what went right or wrong). The LLM then “reflects” on this feedback in natural language to diagnose the problem and write an improved, more detailed prompt. For instance, instead of just seeing a low score on a code generation task, it might analyze a compiler error and conclude the prompt needs to specify a particular library version.

The third pillar is “Pareto-based selection,” which ensures smart exploration. Instead of focusing only on the single best-performing prompt, which can lead to getting stuck in a suboptimal solution (a “local optimum”), GEPA maintains a diverse roster of “specialist” prompts. It tracks which prompts perform best on different individual examples, creating a list of top candidates. By sampling from this diverse set of winning strategies, GEPA ensures it explores more solutions and is more likely to discover a prompt that generalizes well across a wide range of inputs.

Selecting a single best candidate (left) can result in models getting stuck in local minima, while Pareto selection (right) can explore more options and find optimal solutions. Source: arXiv

The effectiveness of this entire process hinges on what the researchers call “feedback engineering.” Agrawal explains that the key is to surface the rich, textual details that systems already produce but often discard. “Traditional pipelines often reduce this detail to a single numerical reward, obscuring why particular outcomes occur,” he said. “GEPA’s core guidance is to structure feedback that surfaces not only outcomes but also intermediate trajectories and errors in plain text—the same evidence a human would use to diagnose system behavior.” For example, for a document retrieval system, this means listing which documents were retrieved correctly and which were missed, rather than just calculating a final score.

GEPA in action

The researchers evaluated GEPA across four diverse tasks, including multi-hop question answering (HotpotQA) and privacy-preserving queries (PUPA). They used both open-source (Qwen3 8B) and proprietary (GPT-4.1 mini) models, comparing GEPA against the RL-based GRPO and the state-of-the-art prompt optimizer MIPROv2. Across all tasks, GEPA substantially outperformed GRPO, achieving up to a 19% higher score while using up to 35 times fewer rollouts. Agrawal provided a concrete example of this efficiency gain: “We used GEPA to optimize a QA system in ~3 hours versus GRPO’s 24 hours—an 8x reduction in development time, while also achieving 20% higher performance.”
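The sketch below compresses that loop into a few functions: run a rollout, hand the textual trace and feedback to a reflection model, and keep per-example winners as a crude stand-in for GEPA’s Pareto bookkeeping. The llm, run_task and score callables are placeholders you would supply; this is an illustration of the idea, not the authors’ implementation.

```python
import random
from typing import Callable

def reflect_and_mutate(llm: Callable[[str], str], prompt: str, trace: str, feedback: str) -> str:
    """Ask a model to diagnose a failure in plain language and propose an improved prompt."""
    critique_request = (
        "You are optimizing an instruction prompt.\n"
        f"Current prompt:\n{prompt}\n\nExecution trace:\n{trace}\n\nFeedback:\n{feedback}\n\n"
        "Diagnose what went wrong and return only a revised prompt."
    )
    return llm(critique_request)

def optimize(llm, run_task, score, seed_prompt: str, examples: list, budget: int = 20) -> str:
    # Keep the best prompt per example (a crude stand-in for Pareto-based selection),
    # rather than a single global winner that can get stuck in a local optimum.
    best_per_example = {i: (seed_prompt, float("-inf")) for i in range(len(examples))}
    for _ in range(budget):
        i = random.randrange(len(examples))
        parent = best_per_example[i][0]
        trace, output = run_task(parent, examples[i])        # rollout with a full textual trace
        feedback = f"score={score(output, examples[i])}; expected={examples[i]['expected']}; got={output}"
        child = reflect_and_mutate(llm, parent, trace, feedback)
        _child_trace, child_output = run_task(child, examples[i])
        child_score = score(child_output, examples[i])
        if child_score > best_per_example[i][1]:
            best_per_example[i] = (child, child_score)
    candidates = [prompt for prompt, _ in best_per_example.values()]
    return max(set(candidates), key=candidates.count)        # most broadly useful candidate

if __name__ == "__main__":
    dummy_llm = lambda request: "Answer with a single number only."   # stands in for a real model call
    run_task = lambda prompt, ex: (f"prompt={prompt!r} input={ex['question']}", "42")
    score = lambda output, ex: 1.0 if output == ex["expected"] else 0.0
    print(optimize(dummy_llm, run_task, score, "Answer the question.",
                   [{"question": "What is 6*7?", "expected": "42"}]))
```

The essential difference from a scalar-reward loop is visible in the feedback string: the reflection model sees what was expected, what was produced, and the trace, not just a number.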


Gartner: GPT-5 is here, but the infrastructure to support true agentic AI isn’t (yet)

Here’s an analogy: Freeways didn’t exist in the U.S. until after 1956, when they were envisioned by President Dwight D. Eisenhower’s administration — yet super fast, powerful cars like Porsche, BMW, Jaguar, Ferrari and others had been around for decades.

You could say AI is at that same pivot point: While models are becoming increasingly capable, performant and sophisticated, the critical infrastructure they need to bring about true, real-world innovation has yet to be fully built out.

“All we have done is create some very good engines for a car, and we are getting super excited, as if we have this fully functional highway system in place,” Arun Chandrasekaran, Gartner distinguished VP analyst, told VentureBeat.

This is leading to a plateauing, of sorts, in the capabilities of models such as OpenAI’s GPT-5: While an important step forward, it only features faint glimmers of truly agentic AI.

“It is a very capable model, it is a very versatile model, it has made some very good progress in specific domains,” said Chandrasekaran. “But my view is it’s more of an incremental progress, rather than a radical progress or a radical improvement, given all of the high expectations OpenAI has set in the past.”

GPT-5 improves in three key areas

To be clear, OpenAI has made strides with GPT-5, according to Gartner, including in coding tasks and multi-modal capabilities. Chandrasekaran pointed out that OpenAI has pivoted to make GPT-5 “very good” at coding, clearly sensing gen AI’s enormous opportunity in enterprise software engineering and taking aim at competitor Anthropic’s leadership in that area. Meanwhile, GPT-5’s progress in modalities beyond text, particularly in speech and images, provides new integration opportunities for enterprises, Chandrasekaran noted.

GPT-5 also does, if subtly, advance AI agent and orchestration design, thanks to improved tool use; the model can call third-party APIs and tools and perform parallel tool calling (handle multiple tasks simultaneously). However, this means enterprise systems must have the capacity to handle concurrent API requests in a single session, Chandrasekaran points out. Multistep planning in GPT-5 allows more business logic to reside within the model itself, reducing the need for external workflow engines, and its larger context windows (8K for free users, 32K for Plus at $20 per month and 128K for Pro at $200 per month) can “reshape enterprise AI architecture patterns,” he said.

This means that applications that previously relied on complex retrieval-augmented generation (RAG) pipelines to work around context limits can now pass much larger datasets directly to the models and simplify some workflows. But this doesn’t mean RAG is irrelevant; “retrieving only the most relevant data is still faster and more cost-effective than always sending massive inputs,” Chandrasekaran pointed out. Gartner sees a shift to a hybrid approach with less stringent retrieval, with devs using GPT-5 to handle “larger, messier contexts” while improving efficiency.
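A minimal sketch of that hybrid pattern: still rank and retrieve, but with a loose cutoff, and let a large context window absorb the extra material. The token budget and the toy relevance score are illustrative assumptions, not Gartner guidance.

```python
def pack_context(query: str, documents: list[dict], relevance, token_budget: int = 100_000) -> str:
    """Loose retrieval: rank everything, then pack as much as the context window allows,
    instead of aggressively filtering down to a handful of chunks."""
    ranked = sorted(documents, key=lambda d: relevance(query, d["text"]), reverse=True)
    selected, used = [], 0
    for doc in ranked:
        cost = max(1, len(doc["text"]) // 4)   # rough token estimate (~4 characters per token)
        if used + cost > token_budget:
            break
        selected.append(doc["text"])
        used += cost
    return "\n\n---\n\n".join(selected)

def keyword_overlap(query: str, text: str) -> int:
    """Toy relevance score; an embedding-based similarity would replace this in practice."""
    return len(set(query.lower().split()) & set(text.lower().split()))

docs = [
    {"text": "Refund policy: purchases can be returned within 30 days with a receipt."},
    {"text": "Shipping times vary by region and carrier."},
]
print(pack_context("What is the refund window?", docs, keyword_overlap))
```

The trade-off is exactly the one Chandrasekaran describes: a bigger budget simplifies the pipeline, while a tighter budget keeps latency and token costs down.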
On the cost front, GPT-5 “significantly” reduces API usage fees; top-level costs are $1.25 per 1 million input tokens and $10 per 1 million output tokens, making it comparable to models like Gemini 2.5, but seriously undercutting Claude Opus. However, GPT-5’s input/output price ratio is higher than earlier models’, which AI leaders should take into account when considering GPT-5 for high-token-usage scenarios, Chandrasekaran advised.

Bye-bye previous GPT versions (sorta)

Ultimately, GPT-5 is designed to eventually replace GPT-4o and the o-series (they were initially sunset, then some were reintroduced by OpenAI due to user dissent). Three model sizes (pro, mini, nano) will allow architects to tier services based on cost and latency needs; simple queries can be handled by smaller models and complex tasks by the full model, Gartner notes. However, differences in output formats, memory and function-calling behaviors may require code review and adjustment, and because GPT-5 may render some previous workarounds obsolete, devs should audit their prompt templates and system instructions.

By eventually sunsetting previous versions, “I think what OpenAI is trying to do is abstract that level of complexity away from the user,” said Chandrasekaran. “Often we’re not the best people to make those decisions, and sometimes we may even make erroneous decisions, I would argue.”

Another factor behind the phase-outs: “We all know that OpenAI has a capacity problem,” he said, and thus has forged partnerships with Microsoft, Oracle (Project Stargate), Google and others to provision compute capacity. Running multiple generations of models would require multiple generations of infrastructure, creating new cost implications and physical constraints.

New risks, advice for adopting GPT-5

OpenAI claims it reduced hallucination rates by up to 65% in GPT-5 compared to previous models; this can help reduce compliance risks and make the model more suitable for enterprise use cases, and its chain-of-thought (CoT) explanations support auditability and regulatory alignment, Gartner notes. At the same time, these lower hallucination rates, as well as GPT-5’s advanced reasoning and multimodal processing, could amplify misuse such as advanced scam and phishing generation. Analysts advise that critical workflows remain under human review, even if with less sampling.

The firm also advises that enterprise leaders:

Pilot and benchmark GPT-5 in mission-critical use cases, running side-by-side evaluations against other models to determine differences in accuracy, speed and user experience.

Monitor practices like vibe coding that risk data exposure, defects or guardrail failures.

Revise governance policies and guidelines to address new model behaviors, expanded context windows and safe completions, and calibrate oversight mechanisms.

Experiment with tool integrations, reasoning parameters, caching and model sizing to optimize performance, and use inbuilt dynamic routing to determine the right model for the right task.

Audit and upgrade plans for GPT-5’s expanded capabilities. This includes validating API quotas, audit trails and multimodal data pipelines to support new features and increased throughput.


Ai2’s MolmoAct model ‘thinks in 3D’ to challenge Nvidia and Google in robotics AI

Physical AI, where robotics and foundation models come together, is a fast-growing space, with companies like Nvidia, Google and Meta releasing research and experimenting with melding large language models (LLMs) with robots.

New research from the Allen Institute for AI (Ai2) aims to challenge Nvidia and Google in physical AI with the release of MolmoAct 7B, a new open-source model that allows robots to “reason in space.” MolmoAct, based on Ai2’s open source Molmo, “thinks” in three dimensions. Ai2 is also releasing the model’s training data. The model carries an Apache 2.0 license, while the datasets are licensed under CC BY-4.0.

Ai2 classifies MolmoAct as an Action Reasoning Model, in which foundation models reason about actions within a physical, 3D space. What this means is that MolmoAct can use its reasoning capabilities to understand the physical world, plan how it occupies space and then take that action.

“MolmoAct has reasoning in 3D space capabilities versus traditional vision-language-action (VLA) models,” Ai2 told VentureBeat in an email. “Most robotics models are VLAs that don’t think or reason in space, but MolmoAct has this capability, making it more performant and generalizable from an architectural standpoint.”

Physical understanding

Since robots exist in the physical world, Ai2 claims MolmoAct helps robots take in their surroundings and make better decisions on how to interact with them. “MolmoAct could be applied anywhere a machine would need to reason about its physical surroundings,” the company said. “We think about it mainly in a home setting because that’s where the greatest challenge lies for robotics, because there things are irregular and constantly changing, but MolmoAct can be applied anywhere.”

MolmoAct can understand the physical world by outputting “spatially grounded perception tokens,” which are tokens pretrained and extracted using a vector-quantized variational autoencoder, a model that converts data inputs, such as video, into tokens. The company said these tokens differ from those used by VLAs in that they are not text inputs. These enable MolmoAct to gain spatial understanding and encode geometric structures. With these, the model estimates the distance between objects.

Once it has an estimated distance, MolmoAct then predicts a sequence of “image-space” waypoints, or points in the area where it can set a path. After that, the model will begin outputting specific actions, such as dropping an arm by a few inches or stretching out. Ai2’s researchers said they were able to get the model to adapt to different embodiments (i.e., either a mechanical arm or a humanoid robot) “with only minimal fine-tuning.”

Benchmark testing conducted by Ai2 showed MolmoAct 7B had a task success rate of 72.1%, beating models from Google, Microsoft and Nvidia.

A small step forward

Ai2’s research is the latest to take advantage of the unique benefits of LLMs and VLMs, especially as the pace of innovation in generative AI continues to grow. Experts in the field see work from Ai2 and other tech companies as building blocks.
Alan Fern, professor at the Oregon State University College of Engineering, told VentureBeat that Ai2’s research “represents a natural progression in enhancing VLMs for robotics and physical reasoning.”

“While I wouldn’t call it revolutionary, it’s an important step forward in the development of more capable 3D physical reasoning models,” Fern said. “Their focus on truly 3D scene understanding, as opposed to relying on 2D models, marks a notable shift in the right direction. They’ve made improvements over prior models, but these benchmarks still fall short of capturing real-world complexity and remain relatively controlled and toyish in nature.”

He added that while there’s still room for improvement on the benchmarks, he is “eager to test this new model on some of our physical reasoning tasks.”

Daniel Maturana, co-founder of the start-up Gather AI, praised the openness of the data, noting that “this is great news because developing and training these models is expensive, so this is a strong foundation to build on and fine-tune for other academic labs and even for dedicated hobbyists.”

Increasing interest in physical AI

It has been a long-held dream for many developers and computer scientists to create more intelligent, or at least more spatially aware, robots. However, building robots that quickly process what they can “see” and then move and react smoothly is difficult. Before the advent of LLMs, scientists had to code every single movement, which meant a great deal of work and little flexibility in the kinds of robotic actions that could be performed. Now, LLM-based methods allow robots (or at least robotic arms) to determine the next possible actions to take based on the objects they are interacting with.

Google Research’s SayCan helps a robot reason about tasks using an LLM, enabling the robot to determine the sequence of movements required to achieve a goal. Meta and New York University’s OK-Robot uses visual language models for movement planning and object manipulation. Hugging Face released a $299 desktop robot in an effort to democratize robotics development. Nvidia, which proclaimed physical AI to be the next big trend, released several models to fast-track robotic training, including Cosmos-Transfer1.

OSU’s Fern said there is more interest in physical AI even though demos remain limited. Still, the quest for general physical intelligence, which would eliminate the need to individually program actions for robots, is becoming easier to pursue.

“The landscape is more challenging now, with less low-hanging fruit. On the other hand, large physical intelligence models are still in their early stages and are much more ripe for rapid advancements, which makes this space particularly exciting,” he said.
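To make the three-stage decomposition described earlier concrete (perception tokens, then image-space waypoints, then low-level actions), here is a minimal, hypothetical Python sketch of that control flow. Every class and function name below is invented for illustration; none of this comes from Ai2’s actual MolmoAct code or API.

```python
# Illustrative sketch only: hypothetical names, not Ai2's MolmoAct API.
# Stage 1: encode observations into spatially grounded perception tokens.
# Stage 2: predict image-space waypoints from the estimated geometry.
# Stage 3: decode waypoints into low-level commands for a given embodiment.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PerceptionToken:
    """Discrete token carrying coarse depth/geometry info (stand-in for VQ-VAE codes)."""
    code: int
    approx_depth_m: float


def encode_observation(rgb_frame) -> List[PerceptionToken]:
    # Stand-in for the VQ-VAE encoder: a real system would quantize camera features.
    return [PerceptionToken(code=i, approx_depth_m=0.5 + 0.1 * i) for i in range(4)]


def plan_waypoints(tokens: List[PerceptionToken]) -> List[Tuple[float, float]]:
    # Stand-in planner: order candidate points by estimated distance.
    ordered = sorted(tokens, key=lambda t: t.approx_depth_m)
    return [(0.1 * t.code, t.approx_depth_m) for t in ordered]


def waypoints_to_actions(waypoints, embodiment: str = "arm") -> List[str]:
    # Stand-in action decoder: map each waypoint to an embodiment-specific command.
    return [f"{embodiment}: move toward x={x:.2f}, depth={d:.2f}m" for x, d in waypoints]


if __name__ == "__main__":
    tokens = encode_observation(rgb_frame=None)  # a real system would pass camera frames
    actions = waypoints_to_actions(plan_waypoints(tokens))
    print("\n".join(actions))
```

The point of the sketch is the staged structure: geometry is estimated first, a path is planned in image space second, and embodiment-specific commands come last, which is consistent with Ai2’s claim that the same model transfers across arms and humanoids with minimal fine-tuning.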


This researcher turned OpenAI’s open weights model gpt-oss-20b into a non-reasoning ‘base’ model with less alignment, more freedom

OpenAI’s new, powerful open weights AI large language model (LLM) family gpt-oss was released less than two weeks ago under a permissive Apache 2.0 license — the company’s first open weights model launch since GPT-2 in 2019 — but developers outside the company are already reshaping it.

One of the most striking examples comes from Jack Morris, a Cornell Tech PhD student, former Google Brain resident and current researcher at Meta, who this week unveiled gpt-oss-20b-base, his own reworked version of OpenAI’s smaller gpt-oss-20b model. The rework strips out the model’s “reasoning” behavior and returns it to a pretrained “base” version that offers faster, freer, less censored and less constrained responses. The model is available now on Hugging Face under a permissive MIT license, allowing it to be used for both additional research and commercial applications.

How gpt-oss-20b-base differs from OpenAI’s gpt-oss models

To understand what Morris did, it helps to know the difference between OpenAI’s release and what AI researchers call a “base model.”

Most LLMs offered by leading AI labs such as OpenAI, Anthropic and Google, and even open source players like Meta, DeepSeek and Alibaba’s Qwen team, are “post-trained.” This means they have gone through an additional phase where they are exposed to curated examples of desired behavior. For instruction-tuned models, that means providing many examples of instructions paired with ideal responses, so the model learns to respond more helpfully, politely or safely to natural language requests.

The gpt-oss models OpenAI put out on August 5 were “reasoning-optimized”: trained and fine-tuned not just to predict the next word, but to follow instructions in a safe, consistent way, often stepping through problems with structured “chain of thought” reasoning before producing a final answer. This is a trend that goes back to OpenAI’s o1 model, released almost a year ago in September 2024, and which numerous leading AI labs have since adopted: having models think longer over multiple steps and check their own work before outputting a well-reasoned response to the user. That makes them better suited for tasks like coding, solving math problems or answering factual questions with explanations — but it also means their responses are filtered and steered away from unsafe or undesirable content.

A base model is different. It’s the raw, pretrained version of a large language model before that reasoning-specific alignment is applied. Base models simply try to predict the next chunk of text given what’s come before, with no built-in guardrails, stylistic preferences or refusal behaviors. They’re prized by some researchers because they can produce more varied and less constrained output, and because studying their unaligned behavior can reveal how models store knowledge and patterns from their training data.

Morris’s goal was to “reverse” OpenAI’s alignment process and restore the smaller gpt-oss-20b to something much closer to its original pretrained state.
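In practice, the difference shows up in how each variant is prompted. Below is a minimal sketch using the Hugging Face transformers library; the repo IDs ("openai/gpt-oss-20b" for the post-trained release and "jxm/gpt-oss-20b-base" for Morris’s version) are assumptions here, and the hardware needed to run 20B-parameter checkpoints is glossed over. A base-style model simply continues raw text, while the post-trained model expects a chat-templated request.

```python
# Minimal sketch, assuming the repo ids below; requires transformers + accelerate.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Raw continuation with the base-style model: no chat template, no refusal behavior.
base_id = "jxm/gpt-oss-20b-base"  # assumed repo id for Morris's release
base_tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
inputs = base_tok("The history of the printing press begins", return_tensors="pt").to(base.device)
print(base_tok.decode(base.generate(**inputs, max_new_tokens=60)[0]))

# Instruction-style use of the post-trained gpt-oss model: wrap the request in a chat template.
chat_id = "openai/gpt-oss-20b"  # assumed repo id for OpenAI's release
chat_tok = AutoTokenizer.from_pretrained(chat_id)
chat = AutoModelForCausalLM.from_pretrained(chat_id, device_map="auto")
msgs = [{"role": "user", "content": "Explain the printing press in two sentences."}]
ids = chat_tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(chat.device)
print(chat_tok.decode(chat.generate(ids, max_new_tokens=120)[0]))
```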
“We basically reversed the alignment part of LLM training, so we have something that produces natural-looking text again,” he wrote in an X thread announcing the project. “It doesn’t engage in CoT anymore. It is back to a model that just predicts the next token on generic text.”

“OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only… or is it? turns out that underneath the surface, there is still a strong base model. so we extracted it. introducing gpt-oss-20b-base” — jack morris (@jxmnop) August 13, 2025

Rather than trying to jailbreak the model with clever prompts — which Morris said proved ineffective during his early experiments — he took a different tack after a conversation with OpenAI co-founder, former Anthropic researcher and current Thinking Machines chief scientist John Schulman. The key was to think of alignment reversal as a small optimization problem: if most of the model’s pretrained knowledge is still present in its weights, then only a tiny, low-rank update might be needed to nudge it back toward base model behavior.

Morris implemented that idea by applying a LoRA (low-rank adapter) update to just three layers of the model — the MLP layers at positions 7, 15 and 23 — with a rank of 16. That meant training about 60 million parameters, or 0.3% of the model’s 21 billion total. He used around 20,000 documents from the FineWeb dataset, keeping the format as close as possible to the original pretraining (“ ….” style) so the model wouldn’t learn anything new, just re-enable broad free-text generation.

Training took four days on eight Nvidia H200 GPUs, Morris told VentureBeat via direct message on X, with a learning rate of 2e-6, a batch size of 16 and a maximum sequence length of 8,192 tokens. Afterward, he merged the LoRA weights back into the model so users could run it as a standalone, fully fine-tuned artifact.

Morris also had to contend with the limitations of current open tools for fine-tuning mixture-of-experts (MoE) architectures like gpt-oss. He used Hugging Face’s framework, which he said crashes frequently and only supports certain training modes, and wrote his own harness to checkpoint often and skip over data batches that risked overloading GPU memory.

Importantly, in response to questions and criticism from the AI community on X, Morris has also clarified that he is not claiming to have recovered the base model “weights” — the internal settings of the artificial neurons that make up the model’s neural network and govern its behavior.

“The world of AI is crazy right now cause you can just claim to have extracted the base model from GPT-OSS while effectively you’ve just trained a lora on Fineweb lol https://t.co/oAnAWpMQ26” — Niels Rogge (@NielsRogge) August 15, 2025

Rather, Morris says that
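For readers who want a feel for how small the update is, here is a minimal sketch of the recipe described above using the peft library. It is not Morris’s actual training code: the target module names are assumptions (gpt-oss’s mixture-of-experts MLP layers may be named differently and may need special handling), lora_alpha is not stated in the article, and the data loading and training loop are omitted.

```python
# A minimal sketch of the reported recipe, not Morris's code: rank-16 LoRA on the
# MLP blocks of layers 7, 15 and 23, later merged back into the model weights.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,                                      # rank reported in the article
    lora_alpha=32,                             # assumption; not stated in the article
    target_modules=["up_proj", "down_proj"],   # assumed Linear projection names
    layers_to_transform=[7, 15, 23],           # the three layers Morris updated
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)
peft_model.print_trainable_parameters()  # should land around the reported ~0.3%

# Training (not shown) would run next-token prediction on raw FineWeb text with
# lr=2e-6, batch size 16 and 8,192-token sequences, per the article. Afterward the
# adapter is folded back in so the result ships as a standalone checkpoint:
# merged_model = peft_model.merge_and_unload()
```

The design point the sketch reflects is how small the intervention is: only three layers and roughly 0.3% of the parameters are touched, consistent with the premise that the pretrained knowledge was still sitting in the frozen weights.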
