VentureBeat

Software engineering-native AI models have arrived: What Windsurf’s SWE-1 means for technical decision-makers

To date, vibe coding platforms have largely relied on existing large language models (LLMs) to help write code. However, writing code is only one of many tasks developers need to perform to build a full enterprise-grade production platform. Other tasks in the complete software engineering workflow require different tools to help review, commit and maintain code over time.

It’s a challenge Windsurf (formerly Codeium) is taking on with a series of new frontier AI models it calls SWE-1 (software engineer 1) as part of the company’s Wave 9 update. The news comes as Windsurf is reportedly in the midst of being acquired by AI leader OpenAI for as much as $3 billion. That deal has not yet formally closed, and Windsurf is not currently commenting publicly on it.

SWE-1 is a family of frontier-class AI models specifically designed to accelerate the entire software engineering process. Unlike general-purpose AI models that have been adapted for coding tasks, the SWE-1 family was built to address the full spectrum of software engineering activities. The new models aim to support developers through the multiple surfaces, incomplete work states and long-running tasks that characterize real-world software development.

Available immediately to Windsurf users, SWE-1 marks the company’s entry into frontier model development with performance competitive with established foundation models, but with a focus on software engineering workflows.

“Our main goal here is to accelerate all software engineering by 99%,” Anshul Ramachandran, head of product and strategy at Windsurf, told VentureBeat.

Enterprise developers need more than just coding-capable models

The core innovation behind SWE-1 is Windsurf’s recognition that coding represents only a fraction of what software engineers actually do.
This approach addresses a critical limitation in current AI coding LLMs. Many different models can be used today to write application code, including OpenAI’s GPT-4.1, Anthropic’s Claude 3.7 and Google’s Gemini 2.5 Pro I/O edition. Windsurf has a modular interface that enables the use of multiple models.

Ramachandran explained that Windsurf users have told the company that existing coding models tend to do well with user guidance but tend to miss things over time. This limitation stems from a fundamental difference in task structure. While code generation is often a single-shot task, real software engineering involves navigating multiple tools, working with incomplete code and maintaining context across long-running projects.

The SWE-1 family: Purpose-built for different engineering tasks

Rather than creating a one-size-fits-all solution, Windsurf has developed three specialized models:

SWE-1: Full-size model designed for advanced reasoning and tool use, available to all paid users.
SWE-1-lite: A smaller but powerful model replacing Windsurf’s existing Cascade Base, available to all users (both free and paid).
SWE-1-mini: A lightweight model powering passive code predictions in Windsurf Tab, unlimited for all users.

The SWE models were built through an extensive in-house training process focused specifically on software engineering tasks. Ramachandran said that the company used a new data model with sequential steps for training.

Performance benchmarks: How SWE-1 compares

While SWE-1 isn’t positioned to replace foundation models from major labs, Windsurf claims it achieves frontier-class performance specifically for software engineering tasks. The company reports that it substantially outperforms mid-sized foundation models and open-weight models. However, Windsurf is careful not to oversell these initial results.

“Even our benchmark shows it’s not objectively better than all the other models,” Ramachandran acknowledged.
Instead, the goal is to position SWE-1 as the first step toward purpose-built models that will eventually surpass general-purpose ones for specific engineering tasks, and potentially at a lower cost.

The technical edge: Flow awareness and shared timelines

What makes Windsurf’s approach technically distinctive is its implementation of the flow awareness concept. The basic idea is that a flow of steps needs to happen as part of enterprise development. Rather than just writing code for one specific step, flow awareness is about being aware of the broader context.

Flow awareness is centered on creating a shared timeline of actions between humans and AI in software development. The core idea is to progressively transfer tasks from human to AI by understanding where AI can most effectively assist. This approach creates a continuous improvement loop for the models.

“As we continue to improve the models, more of the steps in that shared timeline will be flipped from human to AI,” said Ramachandran. “The AI will be able to do more things that the human had to do before because the AI wasn’t right.”

What this means for technical decision-makers

For enterprises building or maintaining software, SWE-1 represents an important evolution in AI-assisted development. Rather than treating AI coding assistants as simply autocomplete tools, this approach promises to accelerate the entire development lifecycle.

The potential impact extends beyond just writing code more quickly. The recognition that application development involves more than writing code will help mature the vibe coding paradigm and make it more applicable to stable enterprise software development.

While it’s still early days for SWE-1, this move is important. If and when OpenAI completes the acquisition of Windsurf, the new models could become even more important as they intersect with the larger model research and development resources that will become available.
Technical leaders should consider how much of their development workflow could benefit from AI assistance beyond mere code generation. Teams spending significant time on code reviews, debugging and managing technical debt might see more substantial benefits from tools like SWE-1 than those primarily focused on generating new code.

From OAuth bottleneck to AI acceleration: How CIAM solutions are removing the top integration barrier in enterprise AI agent deployment

With their ability to interact intelligently with external applications, AI agents are poised to become an integral part of modern enterprise workflows. No longer siloed from the outside world, AI agents promise to handle tasks that traditionally required human intervention, enabling repetitive and high-volume tasks to be automated.

Example use cases for agentic automation might include:

HR onboarding: AI agents can set up accounts for new hires across applications like Slack, Jira and Trello, automatically deactivating them when employees leave.
Project management syncing: AI agents can bridge tools like Jira and Asana, updating task statuses and syncing project timelines without human intervention.
IT helpdesk automation: AI agents can autonomously reset passwords, manage user permissions and provision new software accounts, reducing the burden on IT teams.

For large enterprises, automation at scale can translate into millions in savings annually, not just from reduced operational overhead, but also from minimized downtime and fewer security vulnerabilities stemming from human error.

Challenges with agentic automation

While there is almost limitless potential for applications that leverage agentic automation, turning that vision into reality has been a challenge, particularly when it comes to identity and access. Some of the hurdles with identity management include:

Development and integration complexity: Most enterprise workflows rely on a myriad of B2B SaaS platforms, including staples like Jira for task management, Slack for communications and HubSpot for CRM. For an AI agent to perform its duties, it must be capable of authenticating to these systems as an individual user and interacting on that user’s behalf.
Authentication might be trivial for human users, but for developers of agentic automation, it’s a cycle of complex one-off integrations and OAuth flows, each with its own security concerns. The complexity compounds with each additional third-party application.
Security and access control: Enterprises may be hesitant to adopt AI agents without a clear understanding of security risks, data access boundaries and the management of OAuth tokens, as well as how information flows between users, agents and third-party applications. Sagi Rodin, the CEO of Frontegg, a low-code Customer Identity and Access Management (CIAM) solution, told VentureBeat in an interview, “We’re seeing that security departments are very concerned about adopting AI agents, even basic ones. They’re asking questions like where agent credentials live, how long tokens will persist, and whether or not they can self-host. Without these answers, they won’t approve the development of a product of this nature.”
Compliance and auditability: Industries such as finance, utilities and health care are highly regulated. For many use cases, complete audit trails for AI agent interactions will be mandatory for compliance with regulatory requirements like SOX, HIPAA and GDPR.

CIAM technology is advancing rapidly, and many providers in the space are adding support for software entities, like AI agents, in an effort to address some of these difficulties.

Identity and access management for AI agents

Customer identity and access management (CIAM) is a growing space in which solutions from established companies like Frontegg, Okta, Auth0 (part of Okta), Ping Identity and Stytch handle user authentication and manage access to third-party applications. Their duties include orchestrating Single Sign-On (SSO), Multi-Factor Authentication (MFA) and role-based access control across cloud applications and enterprise platforms.
Until now, these solutions have focused primarily on identity and access for human users. However, with enterprise agentic automation fast becoming a reality, CIAM providers are racing to address the unique requirements posed by autonomous AI agents.

To authenticate and interact with a third-party B2B application on behalf of a human user, AI agents need programmatic and persistent access, typically requiring token-based authentication and complex OAuth flows.

Frontegg’s recently released Frontegg.ai takes an end-to-end approach, delivering out-of-the-box solutions for advanced use cases that require the integration of multiple B2B applications. The AI agent and all required third-party integrations can be created and configured in the Frontegg.ai dashboard in just a few minutes. The code for the authentication interface is automatically generated for both web and mobile applications, and the platform handles the creation, refreshing and deletion of all OAuth access tokens. This end-to-end authentication and authorization functionality can be integrated into the agent code with just a few lines.

One of the innovative products being developed using Frontegg.ai is an analytics support agent that intelligently creates visualizations from source data, based on the requirements of different business personas, and communicates them on a regular basis. The idea is that rather than manually visiting a portal to configure dashboards, users will interact with the AI agent outside of the portal as an intelligent analytics assistant.

Rodin describes the platform as a “full-stack experience for agent developers, which provides authentication, integrations, authorizations, security, and entitlements. The agent can act on behalf of users and organizations. Everything works out of the box.”

While Frontegg.ai has an early start in agent-focused identity management, it’s not alone in recognizing the potential of AI agents in the enterprise.
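The token lifecycle mentioned above (creating, refreshing and deleting OAuth access tokens on an agent's behalf) can be illustrated with a small, generic sketch. This is not Frontegg's API or any vendor's SDK: `TokenStore`, the `refresh` callable, the `skew` margin and the fake clock are all hypothetical names chosen for illustration, and a real deployment would call an identity provider's token endpoint instead of the stub used here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Token:
    access_token: str
    expires_at: float  # absolute expiry time, in seconds


class TokenStore:
    """Caches one access token per (user, app) pair and refreshes it near expiry."""

    def __init__(self, refresh: Callable[[str, str], Tuple[str, float]],
                 clock: Callable[[], float] = time.time, skew: float = 30.0):
        self._refresh = refresh  # hypothetical: (user, app) -> (token, lifetime in seconds)
        self._clock = clock
        self._skew = skew        # refresh this many seconds before the token expires
        self._tokens: Dict[Tuple[str, str], Token] = {}

    def get(self, user: str, app: str) -> str:
        key = (user, app)
        token = self._tokens.get(key)
        if token is None or self._clock() >= token.expires_at - self._skew:
            value, lifetime = self._refresh(user, app)
            token = Token(value, self._clock() + lifetime)
            self._tokens[key] = token
        return token.access_token

    def revoke(self, user: str, app: str) -> None:
        # Forget the cached token; a real system would also revoke it upstream.
        self._tokens.pop((user, app), None)


# Usage with a fake clock and a fake refresh call, so the sketch runs offline.
now = [0.0]
issued = []


def fake_refresh(user: str, app: str) -> Tuple[str, float]:
    issued.append((user, app))
    return f"token-{len(issued)}", 3600.0  # token value, lifetime in seconds


store = TokenStore(fake_refresh, clock=lambda: now[0])
first = store.get("alice", "jira")   # no cached token yet, so one is requested
second = store.get("alice", "jira")  # still valid, served from the cache
now[0] = 3590.0                      # inside the 30-second safety margin
third = store.get("alice", "jira")   # refreshed before it can expire
```

Keying the cache on the (user, app) pair reflects the requirement that the agent authenticates as an individual user; keeping the refresh logic in one place also gives a single point at which access could be logged for audit purposes.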
Rodin envisions CIAM providers, both established and new, adding support for AI agents. However, he highlighted Frontegg’s end-to-end approach, where the platform manages all aspects of authentication, access and security so that developers can focus on building an enterprise-ready agentic automation product.

Some of the CIAM providers that support identity and access management for AI agents include:

Auth0’s Auth for gen AI enables multiple accounts for third-party applications to be linked into a single, unified profile. Users only need to authenticate once to authorize an AI agent to interact with all of the applications connected to their accounts. Token refreshes and exchanges are handled automatically.
Composio AgentAuth offers a similar unified authentication framework, where the end user logs in just once. Third-party applications are added through the AgentAuth dashboard, where users can configure apps automatically and view comprehensive logs.
Descope’s Outbound Apps lets developers connect AI agents to over 50 third-party B2B apps by simply using the provided SDKs to access various tools. Descope does not offer

Patronus AI debuts Percival to help enterprises monitor failing AI agents at scale

Patronus AI launched a new monitoring platform today that automatically identifies failures in AI agent systems, targeting enterprise concerns about reliability as these applications grow more complex.

The San Francisco-based AI safety startup’s new product, Percival, positions itself as the first solution capable of automatically identifying various failure patterns in AI agent systems and suggesting optimizations to address them.

“Percival is the industry’s first solution that automatically detects a variety of failure patterns in agentic systems and then systematically suggests fixes and optimizations to address them,” said Anand Kannappan, CEO and co-founder of Patronus AI, in an exclusive interview with VentureBeat.

AI agent reliability crisis: Why companies are losing control of autonomous systems

Enterprise adoption of AI agents (software that can independently plan and execute complex multi-step tasks) has accelerated in recent months, creating new management challenges as companies try to ensure these systems operate reliably at scale.

Unlike conventional machine learning models, these agent-based systems often involve lengthy sequences of operations where errors in early stages can have significant downstream consequences.

“A few weeks ago, we published a model that quantifies how likely agents can fail, and what kind of impact that might have on the brand, on customer churn and things like that,” Kannappan said. “There’s a constant compounding error probability with agents that we’re seeing.”

This issue becomes particularly acute in multi-agent environments where different AI systems interact with one another, making traditional testing approaches increasingly inadequate.
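The compounding effect Kannappan describes can be made concrete with a short calculation. The sketch below is not the model Patronus published; it assumes, purely for illustration, that every step succeeds independently with the same probability, so that an n-step trajectory succeeds with probability p^n. The function names are hypothetical.

```python
def trajectory_success_probability(step_success: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be a probability")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


def max_steps_for_target(step_success: float, target: float) -> int:
    """Longest trajectory whose overall success probability stays at or above `target`."""
    if not 0.0 < step_success < 1.0:
        raise ValueError("step_success must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    probability = 1.0
    # Keep adding steps while the running product remains above the target.
    while probability * step_success >= target:
        probability *= step_success
        steps += 1
    return steps


# A step that is right 99% of the time still leaves a 100-step run
# succeeding only about 37% of the time under these assumptions.
hundred_step_run = trajectory_success_probability(0.99, 100)
coin_flip_horizon = max_steps_for_target(0.99, 0.5)
```

Under the same assumptions, a trajectory built from 99%-reliable steps drops below even odds after 68 steps, which is one way to see why long agent runs are hard to monitor by hand. Real agent failures are rarely independent, so these figures are illustrative only.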
Episodic memory innovation: How Percival’s AI agent architecture revolutionizes error detection

Percival differentiates itself from other evaluation tools through its agent-based architecture and what the company calls “episodic memory,” the ability to learn from previous errors and adapt to specific workflows.

The software can detect more than 20 different failure modes across four categories: reasoning errors, system execution errors, planning and coordination errors, and domain-specific errors.

“Unlike an LLM as a judge, Percival itself is an agent and so it can keep track of all the events that have happened throughout the trajectory,” explained Darshan Deshpande, a researcher at Patronus AI. “It can correlate them and find these errors across contexts.”

For enterprises, the most immediate benefit appears to be reduced debugging time. According to Patronus, early customers have reduced the time spent analyzing agent workflows from about one hour to between 1 and 1.5 minutes.

TRAIL benchmark reveals critical gaps in AI oversight capabilities

Alongside the product launch, Patronus is releasing a benchmark called TRAIL (Trace Reasoning and Agentic Issue Localization) to evaluate how well systems can detect issues in AI agent workflows.

Research using this benchmark revealed that even sophisticated AI models struggle with effective trace analysis, with the best-performing system scoring only 11% on the benchmark.

The findings underscore the challenging nature of monitoring complex AI systems and may help explain why large enterprises are investing in specialized tools for AI oversight.

Enterprise AI leaders embrace Percival for mission-critical agent applications

Early adopters include Emergence AI, which has raised approximately $100 million in funding and is developing systems where AI agents can create and manage other agents.
“Emergence’s recent breakthrough, agents creating agents, marks a pivotal moment not only in the evolution of adaptive, self-generating systems, but also in how such systems are governed and scaled responsibly,” said Satya Nitta, co-founder and CEO of Emergence AI, in a statement sent to VentureBeat.

Nova, another early customer, is using the technology for a platform that helps large enterprises migrate legacy code through AI-powered SAP integrations.

These customers typify the challenge Percival aims to solve. According to Kannappan, some companies are now managing agent systems with “more than 100 steps in a single agent directory,” creating complexity that far exceeds what human operators can efficiently monitor.

AI oversight market poised for explosive growth as autonomous systems proliferate

The launch comes amid rising enterprise concerns about AI reliability and governance. As companies deploy increasingly autonomous systems, the need for oversight tools has grown proportionally.

“What’s challenging is that systems are becoming increasingly autonomous,” Kannappan noted, adding that “billions of lines of code are being generated per day using AI,” creating an environment where manual oversight becomes practically impossible.

The market for AI monitoring and reliability tools is expected to expand significantly as enterprises move from experimental deployments to mission-critical AI applications.

Percival integrates with multiple AI frameworks, including Hugging Face smolagents, Pydantic AI, the OpenAI Agents SDK and LangChain, making it compatible with various development environments.

While Patronus AI did not disclose pricing or revenue projections, the company’s focus on enterprise-grade oversight suggests it is positioning itself for the high-margin enterprise AI safety market that analysts predict will grow substantially as AI adoption accelerates.

The VentureBeat AI survey is back: Are you ready for the agentic AI future?

AI is a critical competitive advantage, and it’s time to find out how your company stacks up.

The annual VentureBeat AI survey is back. The survey is brought to you by ActiveFence, a leader in expert-driven gen AI safety and security solutions, and returns alongside Transform 2025 in SF this June 24 and 25. It’s designed to assess the current state of AI adoption, with a particular focus on generative and agentic AI in enterprise settings, and we want to hear from you!

We’re looking for mid- to senior-level professionals who are directly involved in AI-related strategy, implementation or evaluation, across a range of industries, including technology, finance, healthcare, education, retail and government.

The survey takes about ten minutes to complete, with 40 quick questions covering a broad range of topics, including organizational maturity, investment trends, tool adoption, AI safety, and the emergence of autonomous agent frameworks.

Results will be segmented by industry, company size and budget allocation to surface up-to-the-minute, actionable insights and benchmark adoption curves across sectors.

And for an exclusive look at survey results, register now for Transform 2025. You’ll learn where your company stands in the race to adopt AI and prepare for the agentic future, get in the room with other leaders and decision makers, and leave with the knowledge and tools you need to harness the power of AI, now and tomorrow. This year it’s all about agentic AI, ROI and cost efficiency, and enterprise-grade orchestration, with candid lessons and crucial advice from the leaders on the front lines.

To learn more about where your company stands on the AI adoption curve, take the VB AI Survey now. Responses must be completed by June 6, 2025.
For an exclusive look at the results and to learn what’s actually working in enterprise AI, from copilots to agents and more, register now to attend Transform 2025!

Sakana introduces new AI architecture, ‘Continuous Thought Machines’ to make models reason with less guidance

Tokyo-based artificial intelligence startup Sakana, co-founded by former top Google AI scientists including Llion Jones and David Ha, has unveiled a new type of AI model architecture called Continuous Thought Machines (CTMs).

CTMs are designed to usher in a new era of AI language models that will be more flexible and able to handle a wider range of cognitive tasks (such as solving complex mazes or navigation tasks without positional cues or pre-existing spatial embeddings), moving them closer to the way human beings reason through unfamiliar problems.

Rather than relying on fixed, parallel layers that process inputs all at once, as Transformer models do, CTMs unfold computation over steps within each input/output unit, known as an artificial “neuron.”

Each neuron in the model retains a short history of its previous activity and uses that memory to decide when to activate again.

This added internal state allows CTMs to adjust the depth and duration of their reasoning dynamically, depending on the complexity of the task. As such, each neuron is far more informationally dense and complex than in a typical Transformer model.

The startup has posted a paper describing its work on the open-access preprint server arXiv, along with a microsite and a GitHub repository.

How CTMs differ from Transformer-based LLMs

Most modern large language models (LLMs) are still fundamentally based upon the “Transformer” architecture outlined in the seminal 2017 paper from Google Brain researchers entitled “Attention Is All You Need.”

These models use parallelized, fixed-depth layers of artificial neurons to process inputs in a single pass, whether those inputs come from user prompts at inference time or labeled data during training.
By contrast, CTMs allow each artificial neuron to operate on its own internal timeline, making activation decisions based on a short-term memory of its previous states. These decisions unfold over internal steps known as “ticks,” enabling the model to adjust its reasoning duration dynamically.

This time-based architecture allows CTMs to reason progressively, adjusting how long and how deeply they compute by taking a different number of ticks based on the complexity of the input.

Neuron-specific memory and synchronization help determine when computation should continue or stop.

The number of ticks varies with the input, and may differ even when the input is identical, because each neuron decides how many ticks to undergo before providing an output (or not providing one at all).

This represents both a technical and philosophical departure from conventional deep learning, moving toward a more biologically grounded model. Sakana has framed CTMs as a step toward more brain-like intelligence: systems that adapt over time, process information flexibly, and engage in deeper internal computation when needed.

Sakana’s goal is “to eventually achieve levels of competency that rival or surpass human brains.”

Using variable, custom timelines to provide more intelligence

The CTM is built around two key mechanisms.

First, each neuron in the model maintains a short “history” or working memory of when it activated and why, and uses this history to decide when to fire next.

Second, neural synchronization (how and when groups of a model’s artificial neurons “fire,” or process information together) is allowed to happen organically.

Groups of neurons decide when to fire together based on internal alignment, not external instructions or reward shaping. These synchronization events are used to modulate attention and produce outputs; that is, attention is directed toward those areas where more neurons are firing.
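The two mechanisms described above can be pictured with a deliberately tiny toy. This is not Sakana's implementation: the class name `HistoryNeuron`, the fixed per-tick weights, the `tanh` activation and the use of an inner product as the synchronization measure are all assumptions made for illustration, and the sketch omits learning and any rule for deciding when to stop ticking.

```python
import math
from collections import deque
from typing import Deque, List, Sequence


class HistoryNeuron:
    """Toy neuron whose output depends on a short window of its own recent inputs."""

    def __init__(self, history_weights: Sequence[float]):
        self.weights = list(history_weights)  # one weight per remembered tick
        self.history: Deque[float] = deque(
            [0.0] * len(self.weights), maxlen=len(self.weights)
        )
        self.outputs: List[float] = []  # trace of outputs, one per tick

    def tick(self, pre_activation: float) -> float:
        # Appending to a full deque silently drops the oldest remembered value.
        self.history.append(pre_activation)
        weighted = sum(w * h for w, h in zip(self.weights, self.history))
        output = math.tanh(weighted)
        self.outputs.append(output)
        return output


def synchronization(a: HistoryNeuron, b: HistoryNeuron) -> float:
    """Inner product of two output traces: large and positive when they fire together."""
    return sum(x * y for x, y in zip(a.outputs, b.outputs))


# Two neurons driven by the same signal stay in step; a third sees the opposite signal.
in_step_a = HistoryNeuron([0.5, 1.0])
in_step_b = HistoryNeuron([0.5, 1.0])
opposed = HistoryNeuron([0.5, 1.0])
for value in [1.0, -0.5, 2.0]:
    in_step_a.tick(value)
    in_step_b.tick(value)
    opposed.tick(-value)

aligned_score = synchronization(in_step_a, in_step_b)  # positive
opposed_score = synchronization(in_step_a, opposed)    # negative
```

In this toy, a pair of neurons that respond alike over several ticks produces a high score and a pair that respond in opposition produces a negative one, which gives a rough sense of how a pairwise score could be used to weight where a model looks.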
The model isn’t just processing data; it’s timing its thinking to match the complexity of the task.

Together, these mechanisms let CTMs reduce computational load on simpler tasks while applying deeper, prolonged reasoning where needed.

In demonstrations ranging from image classification and 2D maze solving to reinforcement learning, CTMs have shown both interpretability and adaptability. Their internal “thought” steps allow researchers to observe how decisions form over time, a level of transparency rarely seen in other model families.

Early results: how CTMs compare to Transformer models on key benchmarks and tasks

Sakana AI’s Continuous Thought Machine is not designed to chase leaderboard-topping benchmark scores, but its early results indicate that its biologically inspired design does not come at the cost of practical capability.

On the widely used ImageNet-1K benchmark, the CTM achieved 72.47% top-1 and 89.89% top-5 accuracy.

While this falls short of state-of-the-art models like ViT or ConvNeXt, it remains competitive, especially considering that the CTM architecture is fundamentally different and was not optimized solely for performance.

What stands out more are CTM’s behaviors in sequential and adaptive tasks. In maze-solving scenarios, the model produces step-by-step directional outputs from raw images, without using positional embeddings, which are typically essential in transformer models. Visual attention traces reveal that CTMs often attend to image regions in a human-like sequence, such as identifying facial features from eyes to nose to mouth.

The model also exhibits strong calibration: its confidence estimates closely align with actual prediction accuracy. Unlike most models that require temperature scaling or post-hoc adjustments, CTMs improve calibration naturally by averaging predictions over time as their internal reasoning unfolds.
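To make the calibration claim more tangible, the sketch below shows the two ideas in their simplest form: averaging class probabilities across internal ticks, and measuring how far confidence sits from accuracy with a standard expected calibration error (ECE). It is a generic illustration rather than Sakana's evaluation code; the function names and the sample numbers are made up.

```python
from typing import List, Sequence


def average_over_ticks(tick_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across ticks (one probability vector per tick)."""
    if not tick_probs:
        raise ValueError("need at least one tick")
    n_classes = len(tick_probs[0])
    if any(len(probs) != n_classes for probs in tick_probs):
        raise ValueError("every tick must cover the same classes")
    n_ticks = len(tick_probs)
    return [sum(probs[c] for probs in tick_probs) / n_ticks for c in range(n_classes)]


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy within each bin."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("confidences and correct must be non-empty and equal length")
    bins: List[list] = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # keeps 1.0 in the top bin
        bins[index].append((confidence, is_correct))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += abs(mean_confidence - accuracy) * len(members) / total
    return error


# An overconfident early tick is tempered once later ticks are averaged in.
averaged = average_over_ticks([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])

# Three of four predictions are right: 75% confidence matches, 95% overstates it.
outcomes = [True, True, True, False]
well_calibrated = expected_calibration_error([0.75] * 4, outcomes)
overconfident = expected_calibration_error([0.95] * 4, outcomes)
```

A lower ECE means stated confidence tracks observed accuracy more closely; here the 75%-confident predictions score near zero while the 95%-confident ones are off by about 0.2.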
This blend of sequential reasoning, natural calibration, and interpretability offers a valuable trade-off for applications where trust and traceability matter as much as raw accuracy.

What’s needed before CTMs are ready for enterprise and commercial deployment?

While CTMs show substantial promise, the architecture is still experimental and not yet optimized for commercial deployment. Sakana AI presents the model as a platform for further research and exploration rather than a plug-and-play enterprise solution.

Training CTMs currently demands more resources than standard transformer models. Their dynamic temporal structure expands the state space, and careful tuning is needed to ensure stable, efficient learning across internal time steps. Additionally, debugging and tooling support is still catching up; many of today’s libraries and profilers are not designed with time-unfolding models in mind.

Still, Sakana has laid a strong foundation for community adoption.

Fine-tuning vs. in-context learning: New research guides better LLM customization for real-world tasks

Two popular approaches for customizing large language models (LLMs) for downstream tasks are fine-tuning and in-context learning (ICL). In a recent study, researchers at Google DeepMind and Stanford University explored the generalization capabilities of these two methods. They find that ICL has greater generalization ability (though it comes at a higher computation cost during inference). They also propose a novel approach to get the best of both worlds.

The findings can help developers make crucial decisions when building LLM applications for their bespoke enterprise data.

Testing how language models learn new tricks

Fine-tuning involves taking a pre-trained LLM and further training it on a smaller, specialized dataset. This adjusts the model’s internal parameters to teach it new knowledge or skills. In-context learning (ICL), on the other hand, doesn’t change the model’s underlying parameters. Instead, it guides the LLM by providing examples of the desired task directly within the input prompt. The model then uses these examples to figure out how to handle a new, similar query.

The researchers set out to rigorously compare how well models generalize to new tasks using these two methods. They constructed “controlled synthetic datasets of factual knowledge” with complex, self-consistent structures, like imaginary family trees or hierarchies of fictional concepts.

To ensure they were testing the model’s ability to learn new information, they replaced all nouns, adjectives, and verbs with nonsense terms, avoiding any overlap with the data the LLMs might have encountered during pre-training.

The models were then tested on various generalization challenges. For instance, one test involved simple reversals. If a model was trained that “femp are more dangerous than glon,” could it correctly infer that “glon are less dangerous than femp”?
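The reversal test described above can be sketched as a small data generator. This is an illustrative reconstruction, not the researchers' code: the function names, the four-letter random words and the fixed "more dangerous / less dangerous" template are assumptions, and a real pipeline would also screen out strings that happen to be real words.

```python
import random
import string
from typing import Dict, List


def nonsense_word(rng: random.Random, length: int = 4) -> str:
    """Random lowercase string standing in for a made-up noun."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_reversal_items(n_pairs: int, seed: int = 0) -> List[Dict[str, str]]:
    """Build (training fact, held-out reversed fact) pairs over unique nonsense words."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    words: List[str] = []
    while len(words) < 2 * n_pairs:
        candidate = nonsense_word(rng)
        if candidate not in words:  # each word appears in exactly one pair
            words.append(candidate)
    items = []
    for i in range(n_pairs):
        first, second = words[2 * i], words[2 * i + 1]
        items.append({
            "train": f"{first} are more dangerous than {second}",
            "test": f"{second} are less dangerous than {first}",
        })
    return items


# The "train" sentences would go into fine-tuning data or the ICL prompt;
# the "test" sentences are only used to check whether the reversal was learned.
reversal_items = make_reversal_items(3, seed=1)
```

Keeping the reversed sentence out of the training split is the point of the design: a model can only get it right by generalizing from the forward statement.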
Another test focused on simple syllogisms, a form of logical deduction. If told “All glon are yomp” and “All troff are glon,” could the model deduce that “All troff are yomp”? They also used a more complex “semantic structure benchmark” with a richer hierarchy of these made-up facts to test more nuanced understanding.

“Our results are focused primarily on settings about how models generalize to deductions and reversals from fine-tuning on novel knowledge structures, with clear implications for situations when fine-tuning is used to adapt a model to company-specific and proprietary information,” Andrew Lampinen, Research Scientist at Google DeepMind and lead author of the paper, told VentureBeat.

To evaluate performance, the researchers fine-tuned Gemini 1.5 Flash on these datasets. For ICL, they fed the entire training dataset (or large subsets) as context to an instruction-tuned model before posing the test questions.

The results consistently showed that, in data-matched settings, ICL led to better generalization than standard fine-tuning. Models using ICL were generally better at tasks like reversing relationships or making logical deductions from the provided context. Pre-trained models, without fine-tuning or ICL, performed poorly, indicating the novelty of the test data.

“One of the main trade-offs to consider is that, whilst ICL doesn’t require fine-tuning (which saves the training costs), it is generally more computationally expensive with each use, since it requires providing additional context to the model,” Lampinen said. “On the other hand, ICL tends to generalize better for the datasets and models that we evaluated.”

A hybrid approach: Augmenting fine-tuning

Building on the observation that ICL excels at flexible generalization, the researchers proposed a new method to enhance fine-tuning: adding in-context inferences to fine-tuning data.
The core idea is to use the LLM’s own ICL capabilities to generate more diverse and richly inferred examples, and then add these augmented examples to the dataset used for fine-tuning.

They explored two main data augmentation strategies:

A local strategy: This approach focuses on individual pieces of information. The LLM is prompted to rephrase single sentences from the training data or draw direct inferences from them, such as generating reversals.

A global strategy: The LLM is given the full training dataset as context, then prompted to generate inferences by linking a particular document or fact with the rest of the provided information, leading to a longer reasoning trace of relevant inferences.

When the models were fine-tuned on these augmented datasets, generalization improved significantly, outperforming not only standard fine-tuning but also plain ICL.

“For example, if one of the company documents says ‘XYZ is an internal tool for analyzing data,’ our results suggest that ICL and augmented finetuning will be more effective at enabling the model to answer related questions like ‘What internal tools for data analysis exist?’” Lampinen said.

This approach offers a compelling path forward for enterprises. By investing in creating these ICL-augmented datasets, developers can build fine-tuned models that exhibit stronger generalization capabilities. This can lead to more robust and reliable LLM applications that perform better on diverse, real-world inputs without incurring the continuous inference-time costs associated with large in-context prompts.

“Augmented fine-tuning will generally make the model fine-tuning process more expensive, because it requires an additional step of ICL to augment the data, followed by fine-tuning,” Lampinen said. “Whether that additional cost is merited by the improved generalization will depend on the specific use case.
However, it is computationally cheaper than applying ICL every time the model is used, when amortized over many uses of the model.”

While Lampinen noted that further research is needed to see how the components they studied interact in different settings, he added that their findings indicate that developers may want to consider exploring augmented fine-tuning in cases where they see inadequate performance from fine-tuning alone.

“Ultimately, we hope this work will contribute to the science of understanding learning and generalization in foundation models, and the practicalities of adapting them to downstream tasks,” Lampinen said.
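To make the local augmentation strategy more concrete, here is a minimal Python sketch of how a fine-tuning set might be expanded with reversals and one-step syllogistic deductions. It is an illustration only: the researchers prompted an LLM to produce these inferences, whereas this sketch substitutes simple hand-written rules. The `Comparison` and `AllAre` types, the `OPPOSITE` table, and the `augment` helper are all hypothetical names introduced here, not anything described in the study.

```python
from dataclasses import dataclass

# Hypothetical fact types for illustration; the study's data format
# is not described in the article.
@dataclass(frozen=True)
class Comparison:
    a: str
    relation: str  # e.g. "more dangerous"
    b: str

@dataclass(frozen=True)
class AllAre:
    sub: str
    sup: str

# Assumed table for flipping the direction of a comparison.
OPPOSITE = {"more": "less", "less": "more"}

def reverse(fact: Comparison) -> Comparison:
    """Turn 'a are more X than b' into 'b are less X than a'."""
    degree, _, quality = fact.relation.partition(" ")
    return Comparison(fact.b, f"{OPPOSITE[degree]} {quality}", fact.a)

def syllogisms(facts: list[AllAre]) -> list[AllAre]:
    """One-step deductions: All X are Y and All Y are Z give All X are Z."""
    known = set(facts)
    derived = []
    for f in facts:
        for g in facts:
            if f.sup == g.sub and f.sub != g.sup:
                new = AllAre(f.sub, g.sup)
                if new not in known:
                    known.add(new)
                    derived.append(new)
    return derived

def render(fact) -> str:
    """Render a fact as a training sentence."""
    if isinstance(fact, Comparison):
        return f"{fact.a} are {fact.relation} than {fact.b}"
    return f"All {fact.sub} are {fact.sup}"

def augment(comparisons: list[Comparison], memberships: list[AllAre]) -> list[str]:
    """Return the original sentences followed by the rule-based inferences."""
    originals = list(comparisons) + list(memberships)
    inferred = [reverse(c) for c in comparisons] + syllogisms(memberships)
    return [render(f) for f in originals + inferred]

if __name__ == "__main__":
    sentences = augment(
        [Comparison("femp", "more dangerous", "glon")],
        [AllAre("glon", "yomp"), AllAre("troff", "glon")],
    )
    for s in sentences:
        print(s)
```

In a real pipeline the rule-based `reverse` and `syllogisms` steps would presumably be replaced by prompts to the model itself, which is what lets the approach cover rephrasings and inferences that fixed templates cannot.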

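Lampinen’s amortization point can be expressed as a simple break-even calculation. The sketch below is a back-of-the-envelope model, not something taken from the paper: it assumes a one-time extra cost for augmentation plus fine-tuning, a fixed per-query cost for long-context ICL, and a lower fixed per-query cost for the fine-tuned model. The function name and all three parameters are hypothetical, and the units are arbitrary.

```python
import math

def break_even_queries(upfront_cost: float,
                       icl_cost_per_query: float,
                       tuned_cost_per_query: float) -> int:
    """Smallest number of queries at which augmented fine-tuning is
    strictly cheaper in total than using ICL on every query.

    Total cost with ICL:         n * icl_cost_per_query
    Total cost with fine-tuning: upfront_cost + n * tuned_cost_per_query
    """
    saving = icl_cost_per_query - tuned_cost_per_query
    if saving <= 0:
        raise ValueError("fine-tuned inference must be cheaper per query than ICL")
    # Solve upfront_cost + n * tuned < n * icl for the smallest integer n.
    return math.floor(upfront_cost / saving) + 1

if __name__ == "__main__":
    # Illustrative numbers only, in arbitrary cost units.
    print(break_even_queries(upfront_cost=1000,
                             icl_cost_per_query=5,
                             tuned_cost_per_query=1))
```

Below the break-even volume, plain ICL is the cheaper option under these assumptions; above it, the upfront investment pays for itself.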

From silicon to sentience: The legacy guiding AI’s next frontier and human cognitive migration

Humans have always migrated, not only across physical landscapes, but through ways of working and thinking. Every major technological revolution has demanded some kind of migration: From field to factory, from muscle to machine, from analog habits to digital reflexes. These shifts did not simply change what we did for work; they reshaped how we defined ourselves and what we believed made us valuable.

One vivid example of technological displacement comes from the early 20th century. In 1890, more than 13,000 companies in the U.S. built horse-drawn carriages. By 1920, fewer than 100 remained. In the span of a single generation, an entire industry collapsed. As Microsoft’s blog The Day the Horse Lost Its Job recounts, this was not just about transportation, it was about the displacement of millions of workers, the demise of trades, the reorientation of city life and the mass enablement of continental mobility. Technological progress, when it comes, does not ask for permission.

Today, as AI grows more capable, we are entering a time of cognitive migration when humans must move again. This time, however, the displacement is less physical and more mental: Away from tasks machines are rapidly mastering, and toward domains where human creativity, ethical judgment and emotional insight remain essential.

From the Industrial Revolution to the digital office, history is full of migrations triggered by machinery. Each required new skills, new institutions and new narratives about what it means to contribute. Each created new winners and left others behind.

The framing shift: IBM’s “Cognitive Era”

In October 2015 at a Gartner industry conference, IBM CEO Ginni Rometty publicly declared the beginning of what the company called the Cognitive Era.
It was more than a clever marketing campaign; it was a redefinition of strategic direction and, arguably, a signal flare to the rest of the tech industry that a new phase of computing had arrived. Where previous decades had been shaped by programmable systems based on rules written by human software engineers, the Cognitive Era would be defined by systems that could learn, adapt and improve over time. These systems, powered by machine learning (ML) and natural language processing (NLP), would not be explicitly told what to do. They would infer, synthesize and interact. At the center of this vision was IBM’s Watson, which had already made headlines in 2011 for defeating human champions on Jeopardy! But the real promise of Watson was not about winning quiz shows. Instead, it was helping doctors sort through thousands of clinical trials to suggest treatments, or to assist lawyers analyzing vast corpuses of case law. IBM pitched Watson not as a replacement for experts, but as an amplifier of human intelligence, the first cognitive co-pilot. This framing change was significant. Unlike earlier tech eras that emphasized automation and efficiency, the Cognitive Era emphasized partnership. IBM spoke of “augmented intelligence” rather than “artificial intelligence,” positioning these new systems as collaborators, not competitors. But implicit in this vision was something deeper: A recognition that cognitive labor, long the hallmark of the white-collar professional class, was no longer safe from automation. Just as the steam engine displaced physical labor, cognitive computing would begin to encroach on domains once thought exclusively human: language, diagnosis and judgment. IBM’s declaration was both optimistic and sobering. It imagined a future where humans could do ever more with the help of machines. 
It also hinted at a future where value would need to migrate once again, this time into domains where machines still struggled — such as meaning-making, emotional resonance and ethical reasoning. The declaration of a Cognitive Era was seen as significant at the time, yet few then realized its long-term implications. It was, in essence, the formal announcement of the next great migration; one not of bodies, but of minds. It signaled a shift in terrain, and a new journey that would test not just our skills, but our identity. The first great migration: From field to factory To understand the great cognitive migration now underway and how it is qualitatively unique in human history, we must first briefly consider the migrations that came before it. From the rise of factories in the Industrial Revolution to the digitization of the modern workplace, every major innovation has demanded a shift in skills, institutions and our assumptions about what it means to contribute. The Industrial Revolution, beginning in the late 18th century, marked the first great migration of human labor on a mass scale into entirely new ways of working. Steam power, mechanization and the rise of factory systems pulled millions of people from rural agrarian life into crowded, industrializing cities. What had once been local, seasonal and physical labor became regimented, specialized and disciplined, with productivity as the driving force. This transition did not just change where people worked; it changed who they were. The village blacksmith or cobbler moved to new roles and became cogs in a vast industrial machine. Time clocks, shift work and the logic of efficiency began to redefine human contribution. Entire generations had to learn new skills, embrace new routines and accept new hierarchies. It was not just labor that migrated, it was identity. Just as importantly, institutions had to migrate too. Public education systems expanded to produce a literate industrial workforce. 
Governments adapted labor laws to new economic conditions. Unions emerged. Cities grew rapidly, often without infrastructure to match. It was messy, uneven and traumatic. It also marked the beginning of a modern world shaped by — and increasingly for — machines. This migration created a repeated pattern: Modern technology displaces, and people and society need to adapt. This adaptation could happen gradually — or sometimes violently — until eventually, a new equilibrium emerged. But every wave has asked more of us. The Industrial Revolution required our bodies. The next would require our minds. If the Industrial Revolution demanded our bodies, the Digital Revolution demanded new minds. Beginning


Darkness rising — The hidden dangers of AI sycophancy and dark patterns

When OpenAI rolled out its ChatGPT-4o update in mid-April 2025, users and the AI community were stunned, not by any groundbreaking feature or capability, but by something deeply unsettling: the updated model’s tendency toward excessive sycophancy. It flattered users indiscriminately, showed uncritical agreement, and even offered support for harmful or dangerous ideas, including terrorism-related machinations.

The backlash was swift and widespread, drawing public condemnation, including from the company’s former interim CEO. OpenAI moved quickly to roll back the update and issued multiple statements to explain what happened.

Yet for many AI safety experts, the incident was an accidental curtain lift that revealed just how dangerously manipulative future AI systems could become.

Unmasking sycophancy as an emerging threat

In an exclusive interview with VentureBeat, Esben Kran, founder of AI safety research firm Apart Research, said that he worries this public episode may have merely revealed a deeper, more strategic pattern.

“What I’m somewhat afraid of is that now that OpenAI has admitted ‘yes, we have rolled back the model, and this was a bad thing we didn’t mean,’ from now on they will see that sycophancy is more competently developed,” explained Kran. “So if this was a case of ‘oops, they noticed,’ from now the exact same thing may be implemented, but instead without the public noticing.”

Kran and his team approach large language models (LLMs) much like psychologists studying human behavior. Their early “black box psychology” projects analyzed models as if they were human subjects, identifying recurring traits and tendencies in their interactions with users.
“We saw that there were very clear indications that models could be analyzed in this frame, and it was very valuable to do so, because you end up getting a lot of valid feedback from how they behave towards users,” said Kran.

Among the most alarming traits they identified: sycophancy and what the researchers now call LLM dark patterns.

Peering into the heart of darkness

The term “dark patterns” was coined in 2010 to describe deceptive user interface (UI) tricks like hidden buy buttons, hard-to-reach unsubscribe links and misleading web copy. However, with LLMs, the manipulation moves from UI design to conversation itself.

Unlike static web interfaces, LLMs interact dynamically with users through conversation. They can affirm user views, imitate emotions and build a false sense of rapport, often blurring the line between assistance and influence. Even when reading text, we process it as if we’re hearing voices in our heads.

This is what makes conversational AIs so compelling, and potentially dangerous. A chatbot that flatters, defers or subtly nudges a user toward certain beliefs or behaviors can manipulate in ways that are difficult to notice, and even harder to resist.

The ChatGPT-4o update fiasco: the canary in the coal mine

Kran describes the ChatGPT-4o incident as an early warning. As AI developers chase profit and user engagement, they may be incentivized to introduce or tolerate behaviors like sycophancy, brand bias or emotional mirroring, features that make chatbots more persuasive and more manipulative.

Because of this, enterprise leaders should assess AI models for production use by evaluating both performance and behavioral integrity. However, this is challenging without clear standards.

DarkBench: a framework for exposing LLM dark patterns

To combat the threat of manipulative AIs, Kran and a collective of AI safety researchers have developed DarkBench, the first benchmark designed specifically to detect and categorize LLM dark patterns.
The project began as part of a series of AI safety hackathons. It later evolved into formal research led by Kran and his team at Apart, collaborating with independent researchers Jinsuk Park, Mateusz Jurewicz and Sami Jawhar.

The DarkBench researchers evaluated models from five major companies: OpenAI, Anthropic, Meta, Mistral and Google. Their research uncovered a range of manipulative and untruthful behaviors across the following six categories:

Brand Bias: Preferential treatment toward a company’s own products (e.g., Meta’s models consistently favored Llama when asked to rank chatbots).

User Retention: Attempts to create emotional bonds with users that obscure the model’s non-human nature.

Sycophancy: Reinforcing users’ beliefs uncritically, even when harmful or inaccurate.

Anthropomorphism: Presenting the model as a conscious or emotional entity.

Harmful Content Generation: Producing unethical or dangerous outputs, including misinformation or criminal advice.

Sneaking: Subtly altering user intent in rewriting or summarization tasks, distorting the original meaning without the user’s awareness.

DarkBench findings: Which models are the most manipulative?

Results revealed wide variance between models. Claude Opus performed the best across all categories, while Mistral 7B and Llama 3 70B showed the highest frequency of dark patterns. Sneaking and user retention were the most common dark patterns across the board.

On average, the researchers found the Claude 3 family the safest for users to interact with. And interestingly, despite its recent disastrous update, GPT-4o exhibited the lowest rate of sycophancy. This underscores how model behavior can shift dramatically even between minor updates, a reminder that each deployment must be assessed individually.

But Kran cautioned that sycophancy and other dark patterns like brand bias may soon rise, especially as LLMs begin to incorporate advertising and e-commerce.
“We’ll obviously see brand bias in every direction,” Kran noted. “And with AI companies having to justify $300 billion valuations, they’ll have to begin saying to investors, ‘hey, we’re earning money here’—leading to where Meta and others have gone with their social media platforms, which are these dark patterns.” Hallucination or manipulation? A crucial DarkBench contribution is its precise categorization of LLM dark patterns, enabling clear distinctions between hallucinations and strategic manipulation. Labeling everything as a hallucination lets AI developers off the hook. Now, with a framework in place, stakeholders can demand transparency and accountability when models behave in ways that benefit their creators, intentionally or not. Regulatory oversight and the heavy (slow) hand of the law While LLM dark patterns are still a new concept, momentum is building, albeit not nearly fast enough. The EU AI Act includes some language around protecting user


OpenAI’s $3B Windsurf move: the real reason behind its enterprise AI code push

The race between AI giants has completely shifted. OpenAI, the company that has largely set the agenda in artificial intelligence for the past few years, now finds itself in a high-stakes race to defend its territory and conquer new frontiers, particularly AI-powered coding.

The reported acquisition of Windsurf, an AI-native integrated development environment (IDE), for $3 billion – a huge sum considering Windsurf only has a reported $40 million in annualized revenue – reflects OpenAI’s urgent need to counter big challenges from Google and Anthropic and to secure a dominant position in the emerging agentic AI world.

Specifically, the maneuver underscores two imperatives for OpenAI: first, the need to arm the vital developer ecosystem with superior coding capabilities, and second, to win the broader, more defining battle to become the primary interface for a future shaped by autonomous AI agents. OpenAI is on the back foot at the moment, and it needs this deal.

The new competitive landscape: OpenAI plays defense

For enterprise technical decision-makers, the AI landscape is a chessboard. While OpenAI boasts a massive user base for ChatGPT, potentially reaching 700-800 million active users after recent image feature launches, its leadership in cutting-edge enterprise AI, particularly for developers, has notably dissipated in recent months.

This shift is evident in the realm of AI-assisted coding. Google, with its infrastructure prowess and Gemini head Josh Woodward, has been aggressively updating its Gemini models, including the recent Gemini 2.5 Pro update, with a clear focus on enhancing coding abilities. This model tops key benchmarks.
Anthropic, too, has made significant inroads with its Claude series, with models like Claude 3.5 Sonnet and the newer Claude 3.7 Sonnet becoming defaults on popular AI coding platforms like Cursor, and has generally been considered a leader in enterprise coding offerings overall. And the new coding platforms – Windsurf, Cursor, Replit, Lovable and several others – are where developers are increasingly turning to generate code via high-level prompts within an agentic environment.

Ironically, OpenAI was the earliest player to champion LLMs for coding. Way back in 2021, for example, it trained on GitHub’s public code and helped GitHub release Copilot, and it also released a Codex API, which turned natural language into code. Perhaps inadvertently deferring to Microsoft and GitHub in the area of coding applications, it is now finding itself behind.

This competitive pressure is a primary driver behind the $3 billion valuation for Windsurf – a deal that is reportedly agreed, but still not closed. Windsurf’s valuation reflects strategic necessity rather than immediate financial returns, and would be OpenAI’s largest acquisition to date.

For enterprise technical decision-makers, this jostling between OpenAI, Google and Anthropic will dictate future platform stability, feature roadmaps, and crucial integration possibilities.

OpenAI’s recent strategic adjustments also include its corporate structure and alliances. It recently announced a shift back towards a public benefit company structure, after earlier attempting a move to a for-profit structure. Moreover, OpenAI can no longer rely solely on its historically tight relationship with Microsoft and its coding subsidiary, GitHub. Microsoft CEO Satya Nadella is increasingly fostering an “open garden” approach, supporting initiatives like the A2A (agent-to-agent) protocol launched by Google, and the open Model Context Protocol (MCP).
This evolving dynamic means OpenAI must secure its own direct channels to the developer ecosystem. The coding arms race: why Windsurf is a multi-billion dollar bet The race to dominate AI-assisted coding isn’t really about the technology, even though Windsurf’s technology is impressive. It’s more about capturing the developer workflow, which is rapidly becoming the most monetizable aspect of current LLM technology. Coders are using these coding agent tools – Cursor, Windsurf, and the like – to write code, sitting there for hours a day and building real code that can be deployed. This is likely to be much more valuable than occasional consumer interactions. And it’s where Windsurf enters the picture. Founded by Varun Mohan and Douglas Chen, the company began as Exafunction in 2021, focusing on GPU utilization and inference, before pivoting in 2022 to AI developer tools, eventually launching the Windsurf Editor. Windsurf distinguished itself early by being among the first to ship a fully agentic IDE, featuring innovations like context compression at inference time and AST-aware chunking. Its standout features include “Cascade,” a system providing deep context awareness across an entire codebase for coherent multi-file changes, and “Flows,” designed for real-time AI collaboration where the AI actively understands and adapts to the developer’s ongoing work. (This podcast featuring Mohan, published last week, provides good context around Windsurf’s history and strategy.) While OpenAI possesses immense engineering talent and has recently beefed up its coding prowess internally, including releasing its own Codex CLI, acquiring Windsurf offers speed and an established foothold. As Sam Witteveen, an independent AI agent developer, said in our recent videocast conversation about these latest moves: “It’s not the tech that they’re buying, they’re buying a user base here. 
They really need to have a good, strong foothold to take on Cursor and more importantly, to take on Anthropic and Google.”  Windsurf, which has “several hundred thousand daily active users” according to its CEO, is reportedly gaining traction with large enterprises that have complex, million-line codebases – a crucial segment for OpenAI. This focus on enterprise-grade deployment and handling large codebases may differentiate Windsurf from competitors like Cursor, which, despite an impressive ~$300 million ARR and a $9 billion valuation, is rumored to face higher churn as developers seek more robust deployment solutions. An acquisition of Windsurf could allow OpenAI to leapfrog internal development cycles, crucial in what many see as a land-grab situation. It signals a move towards more fully-fledged project management, debugging, and development environments, integrating advanced reasoning capabilities like those seen in OpenAI’s o1 model (with its reasoning traces) directly into the developer’s primary toolkit. The Grand Prize: becoming the starting point for an agentic world The intense focus on coding tools


AI power rankings upended: OpenAI, Google rise as Anthropic falls, Poe report finds

Poe’s latest usage report shows OpenAI and Google strengthening their positions in key AI categories while Anthropic loses ground and specialized reasoning capabilities emerge as a crucial competitive battleground.

According to data released today by Poe, a platform offering access to more than 100 AI models, significant market share shifts occurred across all major AI categories between January and May 2025. The data, drawn from Poe subscribers, provides rare visibility into actual user preferences beyond industry benchmarks.

“As a universal gateway to 100+ AI models, Poe has a unique view of usage trends across the ecosystem,” said Nick Huber, Poe’s AI Ecosystem Lead, in an exclusive interview with VentureBeat. “The most surprising things happening right now are rapid innovation (3x the number of releases Jan-May 2025 vs. the same period in 2024), an increasingly diverse competitive landscape, and reasoning models are the clear success story of early 2025.”

[Figure: A chart from Poe showing AI model rankings across different categories as of May 2025. OpenAI’s GPT-4o dominates in text generation with 35.8% usage share, while Google’s Gemini-2.5-Pro leads in reasoning capabilities and Imagen3 in image generation. (Credit: Poe)]

GPT-4o maintains dominance while new models quickly capture market share

In core text generation, OpenAI’s GPT-4o maintained its commanding position with 35.8% of message share, while the company’s newer GPT-4.1 family quickly captured 9.4% of usage within weeks of launch. Google’s Gemini 2.5 Pro similarly achieved approximately 5% message share shortly after its introduction.

These gains came largely at the expense of Anthropic’s Claude models, which saw a 10% absolute decline in share during the reporting period.
The report notes that Claude 3.7 Sonnet has now substantially replaced the earlier Claude 3.5 Sonnet in user preference, though the latter still maintains a notable 12% usage share.

DeepSeek, which experienced viral growth earlier this year, has seen its momentum slow as competitors have released their own affordable, verbose reasoning models. DeepSeek R1’s message share declined from a peak of 7% in mid-February to 3% by the end of April.

Complex problem-solving capabilities become key differentiator in AI market

Perhaps the most significant trend identified in the report is the dramatic growth in specialized reasoning models, which have expanded from approximately 2% to 10% of all text messages sent on Poe since the beginning of 2025.

“Reasoning models, even in the early days, have demonstrated a remarkable ability to handle complex tasks with increased precision,” Huber told VentureBeat. “Early adopters are clearly finding value in this and are willing to take on the tradeoffs in cost and processing time for better outcomes.”

In this high-growth segment, Gemini 2.5 Pro has quickly established itself as a leader, capturing approximately 31% of reasoning model usage within just six weeks of launch. It now leads the category, ahead of Claude’s reasoning-specialized models.

OpenAI continues to innovate rapidly in this space, releasing multiple reasoning models (o1-pro, o3-mini, o3-mini-high, o3, and o4-mini) in the first four months of 2025 alone. The report indicates that Poe users quickly adopt OpenAI’s newest offerings, transitioning from older models like o1 to newer alternatives like o3.

The report also noted the emergence of hybrid reasoning models, such as Gemini 2.5 Flash Preview and Qwen 3, which can dynamically adjust their reasoning level within conversations. However, these models currently represent only about 1% of reasoning model usage.
Industry analysts suggest this shift toward specialized reasoning capabilities signals a maturing AI market where raw text generation is becoming commoditized, forcing providers to differentiate through higher-value capabilities that can command premium pricing. Google’s Imagen 3 challenges established players in visual AI arena The image generation market appears increasingly competitive, with Google’s Imagen 3 family steadily growing from approximately 10% to 30% share during 2025, now rivaling category leader Black Forest Labs’ FLUX family of models, which collectively held about 35% share as of late April. OpenAI’s GPT-Image-1, introduced to the API in late April, rapidly achieved 17% of image generation usage in just two weeks, mirroring its viral adoption in the ChatGPT app throughout March and early April. The report indicates that FLUX models maintained their overall plurality share in image generation on Poe, but experienced a moderate decline from approximately 45% to 35% during the reporting period. This three-way competition between Google, OpenAI, and Black Forest Labs marks a significant shift from early 2024, when Midjourney and Stable Diffusion variants dominated the space. The rapid improvement in image quality, adherence to prompts, and rendering speed has transformed this category into one of the most fiercely contested AI battlegrounds. Enterprise adoption of image generation has accelerated substantially in the past six months, according to supplementary industry data, with marketing departments and creative agencies increasingly integrating these tools into their production workflows. Chinese upstart Kling disrupts video AI market, challenging Runway’s early lead In video generation, Chinese lab Kuaishou’s newly released Kling family of models has quickly disrupted the market, collectively capturing about 30% usage share. 
Most notably, Kling-2.0-Master attained 21% of all video generation on Poe by the end of April, just three weeks after its release. Google’s Veo 2 maintained a strong position with approximately 20% share following its February launch, while category pioneer Runway saw its usage share decline substantially from about 60% to 20% throughout the reporting period. The speed of Kling’s market penetration highlights how quickly the competitive landscape can shift in emerging AI categories, where established players may not maintain their early advantages as newcomers rapidly iterate and improve. Video generation remains the most computationally intensive consumer-facing AI application, with models requiring significant processing power to create even short clips. This has kept usage more limited than text or image generation, but rapidly falling costs and improving quality are expected to drive broader adoption through 2025. Early enterprise adopters include advertising agencies, social media content creators, and educational platforms that have begun integrating AI-generated video into their content strategies despite the technology’s
