VentureBeat

Microsoft infuses enterprise agents with deep reasoning, unveils data Analyst agent that outsmarts competitors

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Microsoft has built the largest enterprise AI agent ecosystem, and is now extending its lead with powerful new capabilities that position the company ahead in one of enterprise tech’s most exciting segments. The company announced Tuesday evening two significant additions to its Copilot Studio platform: deep reasoning capabilities that enable agents to tackle complex problems through careful, methodical thinking, and agent flows that combine AI flexibility with deterministic business process automation. Microsoft also unveiled two specialized deep reasoning agents for Microsoft 365 Copilot: Researcher and Analyst. “We have customers with thousands of agents already,” Microsoft’s Corporate Vice President for Business and Industry Copilot Charles Lamanna, told VentureBeat in an exclusive interview on Monday. “You start to have this kind of agentic workforce where no matter what the job is, you probably have an agent that can help you get it done faster.” Microsoft’s distinctive Analyst agent While the Researcher agent mirrors capabilities from competitors like OpenAI’s Deep Research and Google’s Deep Research, Microsoft’s Analyst agent represents a more differentiated offering. Designed to function like a personal data scientist, the Analyst agent can process diverse data sources, including Excel files, CSVs, and embedded tables in documents, generating insights through code execution and visualization. “This is not a base model off the shelf,” Lamanna emphasized. “This is quite a bit of extensions and tuning and training on top of the core models.” Microsoft has leveraged its deep understanding of Excel workflows and data analysis patterns to create an agent that aligns with how enterprise users actually work with data. The Analyst can automatically generate Python code to process uploaded data files, produce visualizations, and deliver business insights without requiring technical expertise from users. This makes it particularly valuable for financial analysis, budget forecasting and operational reporting use cases that typically require extensive data preparation. Deep reasoning: Bringing critical thinking to enterprise agents Microsoft’s deep reasoning capability extends agents’ abilities beyond simple task completion to complex judgment and analytical work. By integrating advanced reasoning models like OpenAI’s o1 and connecting them to enterprise data, these agents can tackle ambiguous business problems more methodically. The system dynamically determines when to invoke deeper reasoning, either implicitly based on task complexity or explicitly when users include prompts like “reason over this” or “think really hard about this.” Behind the scenes, the platform analyzes instructions, evaluates context, and selects appropriate tools based on the task requirements. This enables scenarios that were previously difficult to automate. For example, one large telecommunications company uses deep reasoning agents to generate complex RFP responses by assembling information from across multiple internal documents and knowledge sources, Lamanna told VentureBeat. Similarly, Thomson Reuters employs these capabilities for due diligence in mergers and acquisition reviews, processing unstructured documents to identify insights, he said. 
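To make the Analyst's workflow concrete, here is a minimal sketch of the kind of Python an Analyst-style agent might generate for an uploaded budget file. It is a hypothetical illustration of the pattern Lamanna describes, not Microsoft's actual output; the file name, column names and chart are invented for the example:

```python
# Hypothetical sketch of the analysis code an Analyst-style agent might
# generate from an uploaded budget CSV (file and column names are assumed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("regional_budget.csv")  # file uploaded by the user

# Summarize actual vs. forecast spend by region and compute the variance.
summary = (
    df.groupby("region")[["forecast", "actual"]]
      .sum()
      .assign(variance=lambda d: d["actual"] - d["forecast"])
      .sort_values("variance", ascending=False)
)

# Chart the variance so the largest overruns stand out.
summary["variance"].plot(kind="bar", title="Budget variance by region")
plt.tight_layout()
plt.savefig("budget_variance.png")

print(summary.head(10))
```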
See an example of the agent reasoning at work in the video below: Agent flows: Reimagining process automation Microsoft has also introduced agent flows, which effectively evolve robotic process automation (RPA) by combining rule-based workflows with AI reasoning. This addresses customer demands for integrating deterministic business logic with flexible AI capabilities. “Sometimes they don’t want the model to freestyle. They don’t want the AI to make its own decisions. They want to have hard-coded business rules,” Lamanna explained. “Other times they do want the agent to freestyle and make judgment calls.” This hybrid approach enables scenarios like intelligent fraud prevention, where an agent flow might use conditional logic to route higher-value refund requests to an AI agent for deep analysis against policy documents. Pets at Home, a U.K.-based pet supplies retailer, has already deployed this technology for fraud prevention. Lamanna revealed the company has saved “over a million pounds” through the implementation. Similarly, Dow Chemical has realized “millions of dollars saved for transportation and freight management” through agent-based optimization. Below is a video showing the Agent Flows at work: The Microsoft Graph advantage Central to Microsoft’s agent strategy is its enterprise data integration through the Microsoft Graph, which is a comprehensive mapping of workplace relationships between people, documents, emails, calendar events, and business data. This provides agents with contextual awareness that generic models lack.  “The lesser known secret capability of the Microsoft graph is that we’re able to improve relevance on the graph based on engagement and how tightly connected some files are,” Lamanna revealed. The system identifies which documents are most referenced, shared, or commented on, ensuring agents reference authoritative sources rather than outdated copies. This approach gives Microsoft a significant competitive advantage over standalone AI providers. While competitors may offer advanced models, Microsoft combines these with workplace context and fine-tuning optimized explicitly for enterprise use cases and Microsoft tools. Microsoft can leverage the same web data and model technology that competitors can, Lamanna noted, “but we then also have all the content inside the enterprise.” This creates a flywheel effect where each new agent interaction further enriches the graph’s understanding of workplace patterns. Enterprise adoption and accessibility Microsoft has prioritized making these powerful capabilities accessible to organizations with varying technical resources, Lamanna said. The agents are exposed directly within Copilot, allowing users to interact through natural language without prompt engineering expertise. Meanwhile, Copilot Studio provides a low-code environment for custom agent development. “It’s in our DNA to have a tool for everybody, not just people who can boot up a Python SDK and make calls, but anybody can start to build these agents,” Lamanna emphasized. This accessibility approach has fueled rapid adoption. Microsoft previously revealed that over 100,000 organizations have used Copilot Studio and that more than 400,000 agents were created in the last quarter. The competitive landscape While Microsoft appears to lead enterprise agent deployment today, competition is intensifying. 
Google has expanded its Gemini capabilities for agents and agentic coding, while OpenAI’s o1 model and Agents SDK provide powerful reasoning and agentic tools for developers. Big enterprise application companies like Salesforce, Oracle, ServiceNow, SAP and others have all launched agentic platforms of their own.
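The fraud-prevention scenario described above, where hard-coded rules decide when to hand a case to an AI agent for judgment, reduces to a simple routing structure. The sketch below is a hypothetical illustration of that hybrid pattern, not Copilot Studio syntax; the threshold, function names and agent interface are invented:

```python
# Hypothetical sketch of an agent flow: deterministic rules run first, and
# AI judgment is invoked only where the rules say it is needed.
AUTO_APPROVE_LIMIT = 50.00  # assumed business rule, not a Microsoft default

def handle_refund(request, reasoning_agent, policy_docs):
    # Deterministic branch: small, unflagged refunds follow a hard-coded rule.
    if request["amount"] <= AUTO_APPROVE_LIMIT and not request["prior_flags"]:
        return {"decision": "approve", "route": "rule"}

    # Agentic branch: higher-value or flagged requests go to a deep reasoning
    # agent that weighs the case against policy documents.
    verdict = reasoning_agent.review(
        case=request,
        context=policy_docs,
        instruction="Reason over this request and flag likely fraud.",
    )
    return {"decision": verdict, "route": "agent"}
```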

AMD is powering AI success with smarter, right-sized compute

Presented by AMD As AI adoption accelerates, businesses are encountering compute bottlenecks that extend beyond just raw processing power. The challenge is not only about having more compute; it’s about having smarter, more efficient compute, customized to an organization’s needs, with the ability to scale alongside AI innovation. AI models are growing in size and complexity, requiring architectures that can process massive datasets, support continuous learning and provide the efficiency needed for real-time decision-making. From AI training and inference in hyperscale data centers to AI-driven automation in enterprises, the ability to deploy and scale compute infrastructure seamlessly is now a competitive differentiator. “It’s a tall order. Organizations are struggling to stay up-to-date with AI compute demands, scale AI workloads efficiently and optimize their infrastructure,” says Mahesh Balasubramanian, director, datacenter GPU product marketing at AMD. “Every company we talk to wants to be at the forefront of AI adoption and business transformation. The challenge is, they’ve never been faced before with such a massive, era-defining technology.” Launching a nimble AI strategy Where to start? Modernizing existing data centers is an essential first step to removing bottlenecks to AI innovation. This frees up space and power, improves efficiency and greens the data center, all of which helps the organization stay nimble enough to adapt to the changing AI environment. “You can upgrade your existing data center from a three-generation-old, Intel Xeon 8280 CPU, to the latest generation of AMD EPYC CPU and save up to 68% on energy while using 87% fewer servers3,” Balasubramanian says. “It’s not just a smart and efficient way of upgrading an existing data center, it opens up options for the next steps in upgrading a company’s compute power.” And as an organization evolves its AI strategy, it’s critical to have a plan for fast-growing hardware and computational requirements. It’s a complex undertaking, whether you’re working with a single model underlying organizational processes, customized models for each department or agentic AI. “If you understand your foundational situation – where AI will deployed, and what infrastructure is already available from a space, power, efficiency and cost perspective – you have a huge number of robust technology solutions to solve these problems,” Balasubramanian says. Beyond one-size-fits all compute A common perception in the enterprise is that AI solutions require a massive investment right out of the gate, across the board, on hardware, software and services. That has proven to be one of the most common barriers to adoption — and an easy one to overcome, Balasubramanian says. The AI journey kicks off with a look at existing tech and upgrades to the data center; from there, an organization can start scaling for the future by choosing technology that can be right-sized for today’s problems and tomorrow’s goals. “Rather than spending everything on one specific type of product or solution, you can now right-size the fit and solution for the organizations you have,” Balasubramanian says. “AMD is unique in that we have a broad set of solutions to meet bespoke requirements. We have solutions from cloud to data center, edge solutions, client and network solutions and more. 
This broad portfolio lets us provide the best performance across all solutions, and lets us offer in-depth guidance to enterprises looking for the solution that fits their needs.” That AI portfolio is designed to tackle the most demanding AI workloads — from foundation model training to edge inference. The latest AMD InstinctTM MI325X GPUs, powered by HBM3e memory and CDNA architecture, deliver superior performance for generative AI workloads, providing up to 1.3X better inference performance compared to competing solutions1,2​. AMD EPYC CPUs continue to set industry standards, delivering unmatched core density, energy efficiency and high-memory bandwidth critical for AI compute scalability​. Collaboration with a wide range of industry leaders — including OEMs like Dell, Supermicro, Lenovo, and HPE, network vendors like Broadcom and Marvell, and switching vendors like Arista and Cisco — maximizes the modularity of these data center solutions. It scales seamlessly from two or four servers to thousands, all built with next gen Ethernet-based AI networking and backed by industry-leading technology and expertise. Why open-source software is critical for AI advancement While both hardware and software are crucial for tackling today’s AI challenges, open-source software will drive true innovation. “We believe there’s no one company in this world that has the answers for every problem,” Balasubramanian says. “The best way to solve the world’s problems with AI is to have a united front, and to have a united front means having an open software stack that everyone can collaborate on. That’s a key part of our vision.” AMD’s open-source software stack, ROCmTM, is widely adopted by industry leaders like OpenAI, Microsoft, Meta, Oracle and more. Meta runs its largest and most complicated model on AMD Instinct GPUs. ROCm comes with standard support for PyTorch, the largest AI framework, and has more than a million models from Hugging Face’s premium model repository enabling customers begin their journey with seamless out of the box experience on ROCm software and Instinct GPUs. AMD works with vendors like PyTorch, Tensorflow, JAX, OpenAI’s Triton and others to ensure that no matter what the size of the model, small or large, applications and use cases can scale anywhere from a single GPU all the way to tens of thousands of GPUs — just as its AI hardware can scale to match any size workload. ROCm’s deep ecosystem engagement with continuous integration and continuous development ensures that new AI functions and features can be securely integrated into the stack. These features go through an automated testing and development process to ensure it fits in, it’s robust, it doesn’t break anything and it can provide support right away to the software developers and data scientists using it. And as AI evolves, ROCm is pivoting to offer new capabilities, rather than locking an organization into one particular vendor that might not offer the flexibility necessary to grow. “We want to give organizations an open-source software stack that is completely open
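The "out of the box" claim is concrete in practice: PyTorch's ROCm builds expose the same device API that CUDA users already write against, so most existing model code runs unchanged on Instinct GPUs. A minimal sketch, with a small Hugging Face model standing in for a production workload:

```python
# Minimal check that a ROCm build of PyTorch sees the AMD GPU and can run a
# Hugging Face model on it; the same code runs unchanged on CUDA systems.
import torch
from transformers import pipeline

# On ROCm builds, torch.cuda.* maps to the AMD GPU, so no code changes are needed.
device = 0 if torch.cuda.is_available() else -1
print("Accelerator available:", torch.cuda.is_available())

generator = pipeline("text-generation", model="gpt2", device=device)
print(generator("Right-sized AI compute means", max_new_tokens=20)[0]["generated_text"])
```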

‘Insane’: OpenAI introduces GPT-4o native image generation and it’s already wowing users

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More We’re coming up on the one year anniversary since OpenAI released its first “omni” or multimodal model, GPT-4o back in May 2024, but that old standby still has some tricks up its sleeve. Case-in-point, today OpenAI finally turned on the native multimodal image generation capabilities of GPT-4o for users of its hit chatbot ChatGPT on the Plus, Pro, Team, and Free usage tiers, though the company said it would also soon be made available for Enterprise, Edu, and through its application programming interface (API). Unlike the previous generative AI image model available in ChatGPT — OpenAI’s DALL-E 3, a classic diffusion transformer model that was trained to reconstruct images from text prompts by removing noise from pixels — this new image generator is part of the same model that spits out text and code, as OpenAI trained the entire model to understand all these forms of media at once. OpenAI president Greg Brockman had long ago previewed this native capability of GPT-4o back in May 2024, but for reasons that still remain unknown publicly, the company held onto it until now — following the public release of what many AI power users saw as a similar feature from Google AI Studio with its Gemini 2 Flash Experimental model. This has resulted in a much higher quality image generator that produces far more lifelike images and accurate text baked in, and it’s already impressing users — one of whom calls the quality “insane.” By the same token (pun intended), OpenAI still hasn’t said precisely what data GPT-4o’s image generation capabilities were trained on — and given the history of the company and other model providers, it likely includes many artworks scraped from the web, some of which are presumably copyrighted, which is likely to anger the artists behind them. Bringing Image Generation to ChatGPT and Sora OpenAI has long aimed to make image generation a core capability of its AI models. With GPT-4o, users can now generate images directly in ChatGPT, refining them through conversation and adjusting details on the fly. The model also integrates into Sora, OpenAI’s video-generation platform, further expanding multimodal capabilities. In an announcement on X, OpenAI confirmed that GPT-4o’s image generation is designed to: Accurately render text within images, allowing for the creation of signs, menus, invitations, and infographics. Follow complex prompts with precision, maintaining high fidelity even in detailed compositions. Build upon previous images and text, ensuring visual consistency across multiple interactions. Support various artistic styles, from photorealism to stylized illustrations. Users can describe an image in ChatGPT, specifying details such as aspect ratio, color schemes (hex codes), or transparency, and GPT-4o will generate it within a minute. As independent AI consultant Allie K. Miller wrote on X, it’s a “Huge leap in text generation,” and is “the best” AI image generation model she’s seen. Key capabilities and use cases GPT-4o is designed to make image generation not just visually stunning but also practical. Some of the key applications include: Design & Branding – Generate logos, posters, and advertisements with precise text placement. Education & Visualization – Create scientific diagrams, infographics, and historical imagery for learning. Game Development – Maintain character consistency across different design iterations. 
Marketing & Content Creation – Produce social media assets, event invitations, and digital illustrations tailored to brand needs. How GPT-4o improves generative images over DALL-E According to OpenAI’s official thread on X, GPT-4o introduces several improvements over previous models: Better text integration: Unlike past AI models that struggled with legible, well-placed text, GPT-4o can now accurately embed words within images. Enhanced contextual understanding: GPT-4o leverages chat history, allowing users to refine images interactively and maintain coherence across multiple generations. Improved multi-object binding: While previous models had difficulty correctly positioning many distinct objects in a scene, GPT-4o can now handle up to 10-20 objects at once. Versatile style adaptation: The model can generate or transform images into a variety of styles, from hand-drawn sketches to high-resolution photorealism. Limitations Despite its advancements, GPT-4o still has some known challenges: Cropping Issues: Large images, such as posters, may sometimes be cropped too tightly. Text Accuracy in Non-Latin Scripts: Some non-English characters may not render correctly. Detail Retention in Small Text: Highly detailed or small-font text may lose clarity. Editing Precision: Modifying specific parts of an image may inadvertently affect other elements. OpenAI is actively addressing these issues through ongoing model refinements. Safety and labeling measures As part of OpenAI’s commitment to responsible AI development, all GPT-4o-generated images include C2PA metadata, allowing users to verify their AI origin. Moreover, OpenAI has built an internal search tool to help detect AI-generated images. Strict safeguards are in place to block harmful content and prevent misuse, such as prohibiting explicit, deceptive, or harmful imagery. OpenAI also ensures that images featuring real people are subject to heightened restrictions. OpenAI CEO Sam Altman described the release as a “new high-water mark for creative freedom”, emphasizing that users will be able to create a wide range of visuals, with OpenAI observing and refining its approach based on real-world usage. As AI-generated images become more precise and accessible, GPT-4o represents a significant step forward in making text-to-image generation a mainstream tool for communication, creativity, and productivity. source
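OpenAI says API access is coming but has not published details, so the sketch below simply assumes the forthcoming capability will be exposed through the existing Images endpoint in the official Python SDK; the model identifier is a placeholder, not a name OpenAI has announced:

```python
# Sketch using the current OpenAI Images API; "gpt-4o-native-image" is a
# placeholder model name, not an identifier OpenAI has published.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-4o-native-image",  # placeholder until OpenAI announces the real name
    prompt=(
        "A storefront sign that reads 'GRAND OPENING SATURDAY' in bold red "
        "letters, photorealistic"
    ),
    n=1,
    size="1024x1024",
)
# The current endpoint returns a hosted URL for each generated image.
print(result.data[0].url)
```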

Anthropic just gave Claude a superpower: real-time web search. Here’s why it changes everything

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Anthropic announced today that its AI assistant Claude can now search and process information from the internet in real-time, addressing one of users’ most requested features and closing a critical competitive gap with OpenAI’s ChatGPT. The new web search capability, available immediately for paid Claude users in the United States, transforms the AI assistant from a tool limited by its training data cutoff to one that can access and synthesize the latest information across the web. “With web search, Claude has access to the latest events and information, boosting its accuracy on tasks that benefit from the most recent data,” Anthropic said in its announcement. The company emphasized that Claude will provide direct citations to sources, allowing users to fact-check information— a direct response to growing concerns about AI hallucinations and misinformation. AI arms race intensifies as Anthropic secures billions in funding This launch comes at an important moment in the rapidly evolving AI sector. Just three weeks ago, Anthropic secured $3.5 billion in Series E funding at a post-money valuation of $61.5 billion, underscoring the high stakes in the AI race. Major backers include Lightspeed Venture Partners, Google (which holds a 14% stake) and Amazon, which has integrated Claude into its Alexa+ service. The web search rollout also follows Anthropic’s recent release of Claude 3.7 Sonnet, which the company claims has set “a new high-water mark in coding abilities.” This focus on programming proficiency appears strategic, especially in light of CEO Dario Amodei’s recent prediction at a Council on Foreign Relations event that “in three to six months, AI will be writing 90% of the code” that software developers currently produce. The timing of this feature launch reveals Anthropic’s determination to challenge OpenAI’s dominance in the consumer AI assistant market. While Claude has gained popularity among technical users for its nuanced reasoning and longer context window, the lack of real-time information access has been a significant handicap in head-to-head comparisons with ChatGPT. This update effectively neutralizes that disadvantage. How Claude’s web search transforms enterprise decision-making Unlike traditional search engines that return a list of links, Claude processes search results and delivers them in a conversational format. Users simply toggle on web search in their profile settings, and Claude will automatically search the internet when needed to inform its responses. Anthropic highlighted several business use cases for the web-enabled Claude: sales teams analyzing industry trends, financial analysts assessing current market data, researchers building grant proposals and shoppers comparing products across multiple sources. This feature fundamentally changes how enterprise users can interact with AI assistants. Previously, professionals needed to toggle between search engines and AI tools, manually feeding information from one to the other. Claude’s integrated approach streamlines this workflow dramatically, potentially saving hours of research time for knowledge workers. For financial services firms in particular, the ability to combine historical training data with breaking news creates a powerful analysis tool that could provide genuine competitive advantages. 
Investment decisions often hinge on connecting disparate pieces of information quickly — exactly the kind of task this integration aims to solve. Behind the scenes: The technical infrastructure powering Claude’s new capabilities Behind this seemingly straightforward feature lies considerable technical complexity. Anthropic has likely spent months fine-tuning Claude’s ability to search effectively, understand context and determine when web search would improve its responses. The update integrates with other recent technical improvements to the Anthropic API, including cache-aware rate limits, simpler prompt caching, and token-efficient tool use. These enhancements, announced earlier this month, aim to help developers process more requests while reducing costs. For certain applications, these enhancements can reduce token usage by up to 90%. Anthropic has also upgraded its developer console to enable collaboration among teams working on AI implementations. The revised console allows developers to share prompts, collaborate on refinements and control extended thinking budgets — features particularly valuable for enterprise customers integrating Claude into their workflows. The investment in these backend capabilities suggests Anthropic is building for scale, anticipating rapid adoption as more companies integrate AI into their operations. By focusing on developer experience alongside user-facing features, Anthropic is creating an ecosystem rather than just a product — a strategy that has served companies like Microsoft well in enterprise markets. Voice mode: Anthropic’s next frontier in natural AI interaction A web search may be just the beginning of Anthropic’s feature expansion. According to a recent report in the Financial Times, the company is developing voice capabilities for Claude, potentially transforming how users interact with the AI assistant. Mike Krieger, Anthropic’s chief product officer, told the Financial Times that the company is working on experiences that would allow users to speak directly to Claude. “We are doing some work around how Claude for desktop evolves… if it is going to be operating your computer, a more natural user interface might be to [speak to it],” Krieger said. The company has reportedly held discussions with Amazon and voice-focused AI startup ElevenLabs about potential partnerships, though no deals have been finalized. Voice interaction would represent a significant leap forward in making AI assistants more accessible and intuitive. The current text-based interaction model creates friction that voice could eliminate, potentially expanding Claude’s appeal beyond tech-savvy early adopters to a much broader user base. How Anthropic’s safety-first approach shapes regulatory conversations As Anthropic expands Claude’s capabilities, the company continues to emphasize its commitment to responsible AI development. In response to California Governor Gavin Newsom’s Working Group on AI Frontier Models draft report released earlier this week, Anthropic expressed support for “objective standards and evidence-based policy guidance,” particularly highlighting transparency as “a low-cost, high-impact means of growing the evidence base around a new technology.” “Many of the report’s recommendations already reflect industry best practices which Anthropic adheres to,” the company stated, noting its Responsible Scaling Policy that outlines how it assesses models for misuse and autonomy risks. This focus on responsible development represents a core differentiator in
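Those API-side improvements are usable today. Below is a minimal sketch of prompt caching with the Anthropic Python SDK, where a large, rarely changing reference document is cached so repeated requests reuse it; the document and prompt are stand-ins, and actual savings depend on the workload:

```python
# Minimal sketch of prompt caching with the Anthropic Python SDK: a large,
# rarely changing system document is cached so later calls can reuse it.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
reference_doc = open("research_policy.txt").read()  # stand-in for a large shared context

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": reference_doc,
            # Marks this block as cacheable for subsequent requests.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize today's relevant market updates."}],
)

print(response.content[0].text)
# Usage metadata reports how many input tokens were served from the cache.
print("cached tokens read:", getattr(response.usage, "cache_read_input_tokens", 0))
```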

Small models as paralegals: LexisNexis distills models to build AI assistant

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More When legal research company LexisNexis created its AI assistant Protégé, it wanted to figure out the best way to leverage its expertise without deploying a large model.  Protégé aims to help lawyers, associates and paralegals write and proof legal documents and ensure that anything they cite in complaints and briefs is accurate. However, LexisNexis didn’t want a general legal AI assistant; they wanted to build one that learns a firm’s workflow and is more customizable.  LexisNexis saw the opportunity to bring the power of large language models (LLMs) from Anthropic and Mistral and find the best models that answer user questions the best, Jeff Reihl, CTO of LexisNexis Legal and Professional, told VentureBeat. “We use the best model for the specific use case as part of our multi-model approach. We use the model that provides the best result with the fastest response time,” Reihl said. “For some use cases, that will be a small language model like Mistral or we perform distillation to improve performance and reduce cost.” While LLMs still provide value in building AI applications, some organizations turn to using small language models (SLMs) or distilling LLMs to become small versions of the same model.  Distillation, where an LLM “teaches” a smaller model, has become a popular method for many organizations.  Small models often work best for apps like chatbots or simple code completion, which is what LexisNexis wanted to use for Protégé.  This is not the first time LexisNexis built AI applications, even before launching its legal research hub LexisNexis + AI in July 2024. “We have used a lot of AI in the past, which was more around natural language processing, some deep learning and machine learning,” Reihl said. “That really changed in November 2022 when ChatGPT was launched, because prior to that, a lot of the AI capabilities were kind of behind the scenes. But once ChatGPT came out, the generative capabilities, the conversational capabilities of it was very, very intriguing to us.” Small, fine-tuned models and model routing  Reihl said LexisNexis uses different models from most of the major model providers when building its AI platforms. LexisNexis + AI used Claude models from Anthropic, OpenAI’s GPT models and a model from Mistral.  This multimodal approach helped break down each task users wanted to perform on the platform. To do this, LexisNexis had to architect its platform to switch between models.  “We would break down whatever task was being performed into individual components, and then we would identify the best large language model to support that component. One example of that is we will use Mistral to assess the query that the user entered in,” Reihl said.  For Protégé, the company wanted faster response times and models more fine-tuned for legal use cases. So it turned to what Reihl calls “fine-tuned” versions of models, essentially smaller weight versions of LLMs or distilled models.  “You don’t need GPT-4o to do the assessment of a query, so we use it for more sophisticated work, and we switch models out,” he said.  When a user asks Protégé a question about a specific case, the first model it pings is a fine-tuned Mistral “for assessing the query, then determining what the purpose and intent of that query is” before switching to the model best suited to complete the task. 
Reihl said the next model could be an LLM that generates new queries for the search engine or another model that summarizes results. Right now, LexisNexis mostly relies on a fine-tuned Mistral model, though Reihl said it used a fine-tuned version of Claude “when it first came out; we are not using it in the product today but in other ways.” LexisNexis is also interested in using other OpenAI models, especially since the company came out with new reinforcement fine-tuning capabilities last year, and it is in the process of evaluating OpenAI’s reasoning models, including o3, for its platforms. Reihl added that it may also look at using Gemini models from Google. LexisNexis backs all of its AI platforms with its own knowledge graph to perform retrieval-augmented generation (RAG), especially as Protégé could help launch agentic processes later. The AI legal suite Even before the advent of generative AI, LexisNexis tested the possibility of putting chatbots to work in the legal industry. In 2017, the company tested an AI assistant that would compete with IBM’s Watson-powered Ross. Protégé now sits in the company’s LexisNexis + AI platform, which brings together the AI services of LexisNexis. Protégé helps law firms with tasks that paralegals or associates tend to do: it helps write legal briefs and complaints grounded in firms’ documents and data, suggest legal workflow next steps, suggest new prompts to refine searches, draft questions for depositions and discovery, link quotes in filings for accuracy, generate timelines and, of course, summarize complex legal documents. “We see Protégé as the initial step in personalization and agentic capabilities,” Reihl said. “Think about the different types of lawyers: M&A, litigators, real estate. It’s going to continue to get more and more personalized based on the specific task you do. Our vision is that every legal professional will have a personal assistant to help them do their job based on what they do, not what other lawyers do.” Protégé now competes against other legal research and technology platforms. Thomson Reuters customized OpenAI’s o1-mini model for its CoCounsel legal assistant. Harvey, which raised $300 million from investors including LexisNexis, also has a legal AI assistant.
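The routing pattern Reihl describes, with a small fine-tuned model assessing intent before handing the query to whichever model suits the task, can be sketched in a few lines. The code below is illustrative only, not LexisNexis code; the client objects, labels and model names are assumptions:

```python
# Illustrative sketch of intent-based model routing: a small distilled model
# classifies the query, then a task-specific model handles the work.
# The clients and model names are placeholders, not LexisNexis internals.

ROUTES = {
    "search":    "large-model-for-query-generation",
    "summarize": "mid-size-model-for-summaries",
    "draft":     "large-model-for-brief-drafting",
}

def answer(query, small_model, models):
    # Step 1: cheap, fast intent assessment with a fine-tuned small model.
    intent = small_model.classify(query, labels=list(ROUTES))

    # Step 2: hand the query to the model best suited for that intent.
    chosen = models[ROUTES[intent]]
    return chosen.run(query)
```

The appeal of the pattern is that the expensive model only sees work that actually needs it, which is how the company keeps both response times and costs down.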

Anthropic’s stealth enterprise coup: How Claude 3.7 is becoming the coding agent of choice

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More While consumer attention has focused on the generative AI battles between OpenAI and Google, Anthropic has executed a disciplined enterprise strategy centered on coding — potentially the most valuable enterprise AI use case. The results are becoming increasingly clear: Claude is positioning itself as the LLM that matters most for businesses. The evidence? Anthropic’s Claude 3.7 Sonnet, released just two weeks ago, set new benchmark records for coding performance. Simultaneously, the company launched Claude Code, a command-line AI agent that helps developers build applications faster. Meanwhile, Cursor — an AI-powered code editor that defaults to Anthropic’s Claude model — has surged to a reported $100 million in annual recurring revenue in just 12 months.  Anthropic’s deliberate focus on coding comes as enterprises increasingly recognize the power of AI coding agents, which enable both seasoned developers and non-coders to build applications with unprecedented speed and efficiency. “Anthropic continues to come out on top,” said Guillermo Rauch, CEO of Vercel, another fast-growing company that lets developers, including non-coders, deploy front-end applications. Last year, Vercel switched its lead coding model from OpenAI’s GPT to Anthropic’s Claude after evaluating the models’ performance on key coding tasks. Claude 3.7: Setting new benchmarks for AI coding Released February 24, Claude 3.7 Sonnet leads on nearly all coding benchmarks. It scored an impressive 70.3% on the respected SWE-bench benchmark, which measures an agent’s software development skills, handily outperforming nearest competitors OpenAI’s o1 (48.9%) and DeepSeek-R1 (49.2%). It also outperforms competitors on agentic tasks. Source: Anthropic. SWE-bench measures a model’s ability to solve real-world software issues. Developer communities have quickly verified these results in real-world testing. Reddit threads comparing Claude 3.7 with Grok 3, the newly released model from Elon Musk’s xAI, consistently favor Anthropic’s model for coding tasks. “Based on what I’ve tested, Claude 3.7 seems to be the best for writing code (at least for me),” said a top commenter. (Update: Even Manus, the new Chinese multi-purpose agent that took the world by storm earlier this week, when it launched saying it was better than Open AI’s Deep Research and other autonomous tasks, was largely built on Claude.) Alongside the 3.7 Sonnet release, Anthropic launched Claude Code, an AI coding agent that works directly through the command line. This complements the company’s October release of Computer Use, which enables Claude to interact with a user’s computer, including using a browser to search the web, opening applications, and inputting text. Source: Anthropic: TAU-bench is a framework that tests AI agents on complex real-world tasks with user and tool interactions. Most notable is what Anthropic hasn’t done. Unlike competitors that rush to match each other feature-for-feature, the company hasn’t even bothered to integrate web search functionality into its app — a basic feature most users expect. This calculated omission signals that Anthropic isn’t competing for general consumers but is laser-focused on the enterprise market, where coding capabilities deliver much higher ROI than search. 
Hands-on with Claude’s coding capabilities To test the real-world capabilities of these coding agents, I experimented with building a database to store VentureBeat articles using three different approaches: Claude 3.7 Sonnet through Anthropic’s app; Cursor’s coding agent; and Claude Code. Using Claude 3.7 directly through Anthropic’s app, I found the solution provided remarkable guidance for a non-coder like myself. It recommended several options, from very robust solutions using things like PostgreSQL database, to easier, lightweight ones like using Airtable. I chose the lightweight solution, and Claude methodically walked me through how to pull articles from the VentureBeat API into Airtable using Make.com for connections. The process took about two hours, including some authentication challenges, but resulted in a functional system. You could say that instead of doing all of the code for me, it showed me a master plan on how to do it. Cursor, which defaults to Claude’s models, is a full-fledged code editor and was more eager to automate the process. However, it required permission at every step, creating a somewhat tedious workflow. Claude Code offered yet another approach, running directly in the terminal and using SQLite to create a local database that pulled articles from our RSS feed. This solution was simpler and more reliable in terms of getting me to my end goal, but definitely less robust and feature-rich than the Airtable implementation. I’m now understanding the nature of these tradeoffs, and know that the coding agent I pick really depends on the specific project. The key insight: Even as a non-developer, I was able to build functional database applications using all three approaches — something that would have been unthinkable just a year ago. And they all relied on Claude under the hood. For a more detailed review of how to do this so-called “vibe coding,” where you rely on agents to code things while not doing any coding yourself, read this great piece by developer Simon Willison published yesterday. The process can be very buggy, and frustrating at times, but with the right concessions to this, you can go a long way. The strategy: Why coding is Anthropic’s enterprise play Anthropic’s singular focus on coding capabilities isn’t accidental. According to projections reportedly leaked to The Information, Anthropic aims to reach $34.5 billion in revenue by 2027 — an 86-fold increase from current levels. Approximately 67% of this projected revenue would come from API business, with enterprise coding applications as the primary driver. While Anthropic hasn’t released exact numbers for its revenue so far, it said its coding revenue surged 1,000% over the last quarter of 2024. Last week, Anthropic announced it had raised $3.5 billion more in funding at a $61.5 billion valuation. This coding bet is supported by Anthropic’s own Economic Index, which found that 37.2% of queries sent to Claude were in the “computer and mathematical” category, primarily covering software engineering tasks like code modification, debugging and network troubleshooting. Anthropic appears to be marching to its own beat
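For readers curious what the Claude Code result described above looked like in practice, the terminal approach boils down to a short script along these lines. This is a reconstruction of the general pattern, not the code Claude actually produced; the feed URL, schema and field handling are assumptions:

```python
# Rough sketch of the RSS-to-SQLite pattern described above; the feed URL,
# schema and field choices are assumptions, not Claude Code's actual output.
import sqlite3
import feedparser

FEED_URL = "https://venturebeat.com/feed/"

conn = sqlite3.connect("articles.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS articles (
           link TEXT PRIMARY KEY,
           title TEXT,
           published TEXT
       )"""
)

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # INSERT OR IGNORE keeps reruns from duplicating articles.
    conn.execute(
        "INSERT OR IGNORE INTO articles (link, title, published) VALUES (?, ?, ?)",
        (entry.link, entry.title, entry.get("published", "")),
    )

conn.commit()
print("stored", conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0], "articles")
conn.close()
```

Rerunning it only adds articles it has not seen before, which is part of why the terminal route felt simpler, if less feature-rich, than the Airtable build.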

Less is more: UC Berkeley and Google unlock LLM potential through simple sampling

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More A new paper by researchers from Google Research and the University of California, Berkeley, demonstrates that a surprisingly simple test-time scaling approach can boost the reasoning abilities of large language models (LLMs). The key? Scaling up sampling-based search, a technique that relies on generating multiple responses and using the model itself to verify them.  The core finding is that even a minimalist implementation of sampling-based search, using random sampling and self-verification, can elevate the reasoning performance of models like Gemini 1.5 Pro beyond that of o1-Preview on popular benchmarks. The findings can have important implications for enterprise applications and challenge the assumption that highly specialized training or complex architectures are always necessary for achieving top-tier performance. The limits of current test-time compute scaling The current popular method for test-time scaling in LLMs is to train the model through reinforcement learning to generate longer responses with chain-of-thought (CoT) traces. This approach is used in models such as OpenAI o1 and DeepSeek-R1. While beneficial, these methods usually require substantial investment in the training phase. Another test-time scaling method is “self-consistency,” where the model generates multiple responses to the query and chooses the answer that appears more often. Self-consistency reaches its limits when handling complex problems, as in these cases, the most repeated answer is not necessarily the correct one. Sampling-based search offers a simpler and highly scalable alternative to test-time scaling: Let the model generate multiple responses and select the best one through a verification mechanism. Sampling-based search can complement other test-time compute scaling strategies and, as the researchers write in their paper, “it also has the unique advantage of being embarrassingly parallel and allowing for arbitrarily scaling: simply sample more responses.” More importantly, sampling-based search can be applied to any LLM, including those that have not been explicitly trained for reasoning. How sampling-based search works The researchers focus on a minimalist implementation of sampling-based search, using a language model to both generate candidate responses and verify them. This is a “self-verification” process, where the model assesses its own outputs without relying on external ground-truth answers or symbolic verification systems. Search-based sampling Credit: VentureBeat The algorithm works in a few simple steps:  1—The algorithm begins by generating a set of candidate solutions to the given problem using a language model. This is done by giving the model the same prompt multiple times and using a non-zero temperature setting to create a diverse set of responses. 2—Each candidate’s response undergoes a verification process in which the LLM is prompted multiple times to determine whether the response is correct. The verification outcomes are then averaged to create a final verification score for the response. 3— The algorithm selects the highest-scored response as the final answer. If multiple candidates are within close range of each other, the LLM is prompted to compare them pairwise and choose the best one. The response that wins the most pairwise comparisons is chosen as the final answer. 
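In code, the minimalist version of that procedure is only a few dozen lines. The sketch below assumes a generic llm(prompt, temperature) completion function and uses simplified yes/no verification prompts rather than the paper's exact templates:

```python
# Minimal sketch of sampling-based search with self-verification.
# `llm(prompt, temperature)` is an assumed text-completion function; the
# prompts are simplified stand-ins for the paper's actual templates.

def sampling_based_search(llm, question, n_samples=16, n_verifications=8):
    # Step 1: sample candidate solutions at non-zero temperature for diversity.
    candidates = [llm(question, temperature=0.8) for _ in range(n_samples)]

    # Step 2: score each candidate by asking the model to verify it several
    # times and averaging the yes/no outcomes.
    def verify_score(answer):
        votes = []
        for _ in range(n_verifications):
            verdict = llm(
                f"Question:\n{question}\n\nProposed answer:\n{answer}\n\n"
                "Is this answer correct? Reply with exactly YES or NO.",
                temperature=0.7,
            )
            votes.append(1.0 if "YES" in verdict.upper() else 0.0)
        return sum(votes) / len(votes)

    scored = sorted(((verify_score(c), c) for c in candidates), reverse=True)

    # Step 3: if the top candidates are nearly tied, break ties with pairwise
    # comparisons and keep the answer that wins the most head-to-head matchups.
    top_score = scored[0][0]
    finalists = [c for s, c in scored if top_score - s < 0.05]
    if len(finalists) == 1:
        return finalists[0]

    wins = {c: 0 for c in finalists}
    for a in finalists:
        for b in finalists:
            if a is b:
                continue  # each unordered pair is judged twice, once per ordering
            choice = llm(
                f"Question:\n{question}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\n"
                "Which answer is better? Reply with exactly A or B.",
                temperature=0.0,
            )
            wins[a if choice.strip().upper().startswith("A") else b] += 1
    return max(finalists, key=wins.get)
```

The two knobs here, n_samples and n_verifications, map directly onto the two scaling axes the researchers consider next.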
The researchers considered two key axes for test-time scaling: Sampling: The number of responses the model generates for each input problem. Verification: The number of verification scores computed for each generated solution How sampling-based search compares to other techniques The study revealed that reasoning performance continues to improve with sampling-based search, even when test-time compute is scaled far beyond the point where self-consistency saturates.  At a sufficient scale, this minimalist implementation significantly boosts reasoning accuracy on reasoning benchmarks like AIME and MATH. For example, Gemini 1.5 Pro’s performance surpassed that of o1-Preview, which has explicitly been trained on reasoning problems, and Gemini 1.5 Flash surpassed Gemini 1.5 Pro. “This not only highlights the importance of sampling-based search for scaling capability, but also suggests the utility of sampling-based search as a simple baseline on which to compare other test-time compute scaling strategies and measure genuine improvements in models’ search capabilities,” the researchers write. It is worth noting that while the results of search-based sampling are impressive, the costs can also become prohibitive. For example, with 200 samples and 50 verification steps per sample, a query from AIME will generate around 130 million tokens, which costs $650 with Gemini 1.5 Pro. However, this is a very minimalistic approach to sampling-based search, and it is compatible with optimization techniques proposed in other studies. With smarter sampling and verification methods, the inference costs can be reduced considerably by using smaller models and generating fewer tokens. For example, by using Gemini 1.5 Flash to perform the verification, the costs drop to $12 per question. Effective self-verification strategies There is an ongoing debate on whether LLMs can verify their own answers. The researchers identified two key strategies for improving self-verification using test-time compute: Directly comparing response candidates: Disagreements between candidate solutions strongly indicate potential errors. By providing the verifier with multiple responses to compare, the model can better identify mistakes and hallucinations, addressing a core weakness of LLMs. The researchers describe this as an instance of “implicit scaling.” Task-specific rewriting: The researchers propose that the optimal output style of an LLM depends on the task. Chain-of-thought is effective for solving reasoning tasks, but responses are easier to verify when written in a more formal, mathematically conventional style. Verifiers can rewrite candidate responses into a more structured format (e.g., theorem-lemma-proof) before evaluation. “We anticipate model self-verification capabilities to rapidly improve in the short term, as models learn to leverage the principles of implicit scaling and output style suitability, and drive improved scaling rates for sampling-based search,” the researchers write. Implications for real-world applications The study demonstrates that a relatively simple technique can achieve impressive results, potentially reducing the need for complex and costly model architectures or training regimes. This is also a scalable technique, enabling enterprises to increase performance by allocating more compute resources to sampling and verification. It also enables developers to push frontier language models beyond their limitations on complex tasks. “Given that it complements other test-time

The open-source AI debate: Why selective transparency poses a serious risk

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More As tech giants declare their AI releases open — and even put the word in their names — the once insider term “open source” has burst into the modern zeitgeist. During this precarious time in which one company’s misstep could set back the public’s comfort with AI by a decade or more, the concepts of openness and transparency are being wielded haphazardly, and sometimes dishonestly, to breed trust.  At the same time, with the new White House administration taking a more hands-off approach to tech regulation, the battle lines have been drawn — pitting innovation against regulation and predicting dire consequences if the “wrong” side prevails.  There is, however, a third way that has been tested and proven through other waves of technological change. Grounded in the principles of openness and transparency, true open source collaboration unlocks faster rates of innovation even as it empowers the industry to develop technology that is unbiased, ethical and beneficial to society.  Understanding the power of true open source collaboration Put simply, open-source software features freely available source code that can be viewed, modified, dissected, adopted and shared for commercial and noncommercial purposes — and historically, it has been monumental in breeding innovation. Open-source offerings Linux, Apache, MySQL and PHP, for example, unleashed the internet as we know it.  Now, by democratizing access to AI models, data, parameters and open-source AI tools, the community can once again unleash faster innovation instead of continually recreating the wheel — which is why a recent IBM study of 2,400 IT decision-makers revealed a growing interest in using open-source AI tools to drive ROI. While faster development and innovation were at the top of the list when it came to determining ROI in AI, the research also confirmed that embracing open solutions may correlate to greater financial viability. Instead of short-term gains that favor fewer companies, open-source AI invites the creation of more diverse and tailored applications across industries and domains that might not otherwise have the resources for proprietary models.  Perhaps as importantly, the transparency of open source allows for independent scrutiny and auditing of AI systems’ behaviors and ethics — and when we leverage the existing interest and drive of the masses, they will find the problems and mistakes as they did with the LAION 5B dataset fiasco.  In that case, the crowd rooted out more than 1,000 URLs containing verified child sexual abuse material hidden in the data that fuels generative AI models like Stable Diffusion and Midjourney — which produce images from text and image prompts and are foundational in many online video-generating tools and apps.  While this finding caused an uproar, if that dataset had been closed, as with OpenAI’s Sora or Google’s Gemini, the consequences could have been far worse. It’s hard to imagine the backlash that would ensue if AI’s most exciting video creation tools started churning out disturbing content. Thankfully, the open nature of the LAION 5B dataset empowered the community to motivate its creators to partner with industry watchdogs to find a fix and release ​​RE-LAION 5B — which exemplifies why the transparency of true open-source AI not only benefits users, but the industry and creators who are working to build trust with consumers and the general public.  
The danger of open sourcery in AI While source code alone is relatively easy to share, AI systems are far more complicated than software. They rely on system source code, as well as the model parameters, dataset, hyperparameters, training source code, random number generation and software frameworks — and each of these components must work in concert for an AI system to work properly. Amid concerns around safety in AI, it has become commonplace to state that a release is open or open source. For this to be accurate, however, innovators must share all the pieces of the puzzle so that other players can fully understand, analyze and assess the AI system’s properties to ultimately reproduce, modify and extend its capabilities.  Meta, for example, touted Llama 3.1 405B as “the first frontier-level open-source AI model,” but only publicly shared the system’s pre-trained parameters, or weights, and a bit of software. While this allows users to download and use the model at will, key components like the source code and dataset remain closed — which becomes more troubling in the wake of the announcement that Meta will inject AI bot profiles into the ether even as it stops vetting content for accuracy.  To be fair, what is being shared certainly contributes to the community. Open weight models offer flexibility, accessibility, innovation and a level of transparency. DeepSeek’s decision to open source its weights, release its technical reports for R1 and make it free to use, for example, has enabled the AI community to study and verify its methodology and weave it into their work.  It is misleading, however, to call an AI system open source when no one can actually look at, experiment with and understand each piece of the puzzle that went into creating it. This misdirection does more than threaten public trust. Instead of empowering everyone in the community to collaborate, build and advance upon models like Llama X, it forces innovators using such AI systems to blindly trust the components that are not shared. Embracing the challenge before us As self-driving cars take to the streets in major cities and AI systems assist surgeons in the operating room, we are only at the beginning of letting this technology take the proverbial wheel. The promise is immense, as is the potential for error — which is why we need new measures of what it means to be trustworthy in the world of AI. Even as Anka Reuel and colleagues at Stanford University recently attempted to set up a new framework for the AI benchmarks used to assess how well models perform, for example, the review practice the industry and the

Nvidia debuts Llama Nemotron open reasoning models in a bid to advance agentic AI

Nvidia is getting into the open-source reasoning model market. At the Nvidia GTC event today, the AI giant made a series of hardware and software announcements. Buried amidst the big silicon announcements, the company announced a new set of open-source Llama Nemotron reasoning models to help accelerate agentic AI workloads. The new models are an extension of the Nvidia Nemotron models first announced in January at the Consumer Electronics Show (CES). The Llama Nemotron reasoning models are in part a response to the dramatic rise of reasoning models in 2025. Nvidia (and its stock price) were rocked to the core earlier this year when DeepSeek R1 came out, offering the promise of an open-source reasoning model with superior performance. The Llama Nemotron family is competitive with DeepSeek, offering business-ready AI reasoning models for advanced agents. “Agents are autonomous software systems designed to reason, plan, act and critique their work,” Kari Briski, vice president of Generative AI Software Product Management at Nvidia, said during a GTC pre-briefing with press. “Just like humans, agents need to understand context to break down complex requests, understand the user’s intent, and adapt in real time.” What’s inside Llama Nemotron for agentic AI As the name implies, Llama Nemotron is based on Meta’s open-source Llama models. With Llama as the foundation, Briski said, Nvidia algorithmically pruned the model to optimize compute requirements while maintaining accuracy. Nvidia also applied sophisticated post-training techniques using synthetic data. The training involved 360,000 H100 inference hours and 45,000 human annotation hours to enhance reasoning capabilities. All that training results in models with exceptional reasoning capabilities across key benchmarks for math, tool calling, instruction following and conversational tasks, according to Nvidia. The Llama Nemotron family includes three models targeting different deployment scenarios: Nemotron Nano: Optimized for edge and smaller deployments while maintaining high reasoning accuracy. Nemotron Super: Balanced for optimal throughput and accuracy on single data center GPUs. Nemotron Ultra: Designed for maximum “agentic accuracy” in multi-GPU data center environments. Nano and Super are available now as NIM microservices and can be downloaded from AI.NVIDIA.com; Ultra is coming soon. Hybrid reasoning helps to advance agentic AI workloads One of the key features of Nvidia Llama Nemotron is the ability to toggle reasoning on or off, an emerging capability in the AI market. Anthropic’s Claude 3.7 has somewhat similar functionality, though that model is closed and proprietary. In the open-source space, IBM Granite 3.2 also has a reasoning toggle, which IBM refers to as conditional reasoning. The promise of hybrid or conditional reasoning is that it allows systems to bypass computationally expensive reasoning steps for simple queries. In a demonstration, Nvidia showed how the model could engage complex reasoning when solving a combinatorial problem but switch to direct-response mode for simple factual queries.
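Nvidia's published examples drive the toggle through the system prompt rather than a separate API switch. The sketch below calls a hosted NIM endpoint through an OpenAI-compatible client; the base URL, model identifier and the "detailed thinking" system-prompt convention reflect Nvidia's documentation as best understood at the time of writing and should be checked against the current model card:

```python
# Sketch of toggling Nemotron's reasoning via an OpenAI-compatible NIM
# endpoint. Base URL, model id and the "detailed thinking" convention are
# assumptions to verify against Nvidia's current model card.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # Nvidia's hosted API catalog
    api_key="YOUR_NVIDIA_API_KEY",
)

def ask(question, reasoning=True):
    return client.chat.completions.create(
        model="nvidia/llama-3.1-nemotron-nano-8b-v1",  # verify exact id on AI.NVIDIA.com
        messages=[
            # The system prompt flips the model between reasoning and direct modes.
            {"role": "system",
             "content": "detailed thinking on" if reasoning else "detailed thinking off"},
            {"role": "user", "content": question},
        ],
        temperature=0.6,
        max_tokens=1024,
    ).choices[0].message.content

print(ask("How many ways can 8 rooks be placed on a chessboard so none attack each other?"))
print(ask("What year was the transistor invented?", reasoning=False))
```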
Nvidia Agent AI-Q blueprint provides an enterprise integration layer Recognizing that models alone aren't sufficient for enterprise deployment, Nvidia also announced the Agent AI-Q blueprint, an open-source framework for connecting AI agents to enterprise systems and data sources. "AI-Q is a new blueprint that enables agents to query multiple data types—text, images, video—and leverage external tools like web search and other agents," Briski said. "For teams of connected agents, the blueprint provides observability and transparency into agent activity, allowing developers to improve the system over time." The AI-Q blueprint is set to become available in April. Why this matters for enterprise AI adoption For enterprises considering advanced AI agent deployments, Nvidia's announcements address several key challenges. The open nature of the Llama Nemotron models allows businesses to deploy reasoning-capable AI within their own infrastructure. That's important, as it can address the data sovereignty and privacy concerns that have limited adoption of cloud-only solutions. By building the new models as NIMs, Nvidia is also making it easier for organizations to deploy and manage them, whether on-premises or in the cloud. The hybrid, conditional reasoning approach is also notable, as it gives organizations another option for this emerging capability. Hybrid reasoning allows enterprises to optimize for either thoroughness or speed, saving on latency and compute for simpler tasks while still enabling complex reasoning when needed. As enterprise AI moves beyond simple applications to more complex reasoning tasks, Nvidia's combined offering of efficient reasoning models and integration frameworks positions companies to deploy more sophisticated AI agents that can handle multi-step logical problems while maintaining deployment flexibility and cost efficiency.


Adobe previews AI generated PowerPoints from raw customer data with ‘Project Slide Wow’

Today at Adobe Summit, the company's annual digital innovation conference in Las Vegas, Adobe is unveiling Project Slide Wow, a generative AI-driven tool designed to streamline the creation of PowerPoint presentations directly from raw customer data. Presented as part of the "Adobe Sneaks" program, the innovation aims to solve a common challenge for marketers and analysts: transforming complex data into compelling, easy-to-digest presentations. From data to presentation, automatically Project Slide Wow integrates with Adobe Customer Journey Analytics (CJA) to automatically generate PowerPoint slides populated with relevant data visualizations and speaker notes. This allows marketers and business analysts to quickly build data-backed presentations without manually structuring content or designing slides. "It's analyzing all the charts in this project, generating captions for them, organizing them into a narrative, and creating the presentation slides," said Jane Hoffswell, a research scientist at Adobe and the creator of Project Slide Wow, in a video call interview with VentureBeat a few days ago. "It figures out the best way to focus on the most important pieces of data." A standout feature of the tool is its interactive AI agent within PowerPoint. Users can ask follow-up questions, request additional visualizations, or dynamically generate new slides on the fly, making it an adaptable solution for data-driven storytelling. One of the biggest advantages of Project Slide Wow is its ability to handle live data updates. Instead of static presentations that quickly become outdated, users can refresh their slides to reflect the latest analytics, which is particularly valuable for businesses that rely on real-time data insights. "We wanted this technology to be able to keep your data fresh and alive. If you give a presentation six months later, people will want to know how things have changed," Hoffswell told VentureBeat. The tech under the hood Unlike many recent AI-powered tools, Project Slide Wow does not rely on large language models (LLMs) like OpenAI's GPT or Adobe's Firefly. Instead, Adobe's research team developed a proprietary algorithmic ranking and scoring system to determine which insights are most important for a given dataset. The system prioritizes information based on: Data Structure in Adobe Customer Journey Analytics (CJA) – Insights that appear higher in the CJA workflow receive more emphasis. Relevance & Frequency – Data points that appear multiple times across different analyses are given greater weight in slide generation. Narrative Organization – The tool algorithmically arranges insights into a logical storytelling structure to ensure a smooth presentation flow. "Our ranking algorithm looks at the layout of the original Customer Journey Analytics project—content higher up is likely more important," said Hoffswell. "We also prioritize values that frequently appear in the data." Because the system is rules-based and deterministic rather than probabilistic, it avoids common LLM issues such as hallucinated data or unpredictable outputs. It also gives enterprises greater transparency and control over how presentations are structured.
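Adobe has not published the ranking code, but the two signals Hoffswell describes (position in the CJA layout and frequency across analyses) map naturally onto a small deterministic scoring function. The Python sketch below is purely illustrative, with a made-up Chart structure and made-up weights, to show how a rules-based ranker of this kind can order insights without calling an LLM.

```python
# Illustrative sketch only: Project Slide Wow's actual ranking system is
# proprietary. This mirrors the two signals described above, position in
# the CJA project layout and frequency across analyses, using hypothetical
# weights and a hypothetical Chart structure.
from dataclasses import dataclass

@dataclass
class Chart:
    caption: str
    position: int       # row order in the CJA project layout; 0 = top
    mention_count: int  # how often the underlying metric recurs across analyses

def score(chart: Chart, total_charts: int) -> float:
    # Charts higher in the CJA layout get a larger position score;
    # recurring metrics get a frequency boost (60/40 split is made up).
    position_score = (total_charts - chart.position) / total_charts
    frequency_score = chart.mention_count
    return 0.6 * position_score + 0.4 * frequency_score

charts = [
    Chart("Revenue by channel", position=0, mention_count=3),
    Chart("Returning visitors", position=1, mention_count=1),
    Chart("Cart abandonment", position=2, mention_count=2),
]

# Deterministic ordering: the top-ranked charts become the first slides.
for chart in sorted(charts, key=lambda c: score(c, len(charts)), reverse=True):
    print(f"{chart.caption}: {score(chart, len(charts)):.2f}")
```

Because the scoring is deterministic, the same CJA project always yields the same slide order, which is the transparency and predictability argument Adobe is making against probabilistic generation.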
What it means for the enterprise and decision-makers For CTOs, CIOs, team leaders and developer managers, Project Slide Wow represents a potential shift in how teams work with data visualization and presentations. Here's what it means for enterprise-level decision-making: Greater Efficiency for Data Teams – Analysts and marketers can rapidly generate insights in presentation-ready formats, reducing the manual labor involved in building slides from scratch. Scalability for Large Organizations – By integrating directly into existing workflows, large enterprises can standardize the way customer insights are presented across departments. Data Integrity & Control – Unlike AI tools that create content based on unpredictable generative models, Project Slide Wow works within the enterprise's existing datasets in CJA. This ensures data accuracy and minimizes compliance risks. Enhanced Collaboration Between Teams – By allowing presentations to be updated dynamically, multiple teams, such as marketing, analytics, and product development, can work with the latest insights in real time without duplicating efforts. Future Integration Potential – If Project Slide Wow becomes a full-fledged Adobe product, enterprise IT leaders may consider integrating it into their existing Microsoft 365 environments through the planned PowerPoint add-on. Will it become a full product? Adobe Sneaks is an annual showcase of experimental innovations. Around 40% of featured projects eventually become Adobe products. The fate of Project Slide Wow depends on user interest and engagement, as Adobe monitors social media conversations, customer feedback and direct inquiries to gauge demand. Eric Matisoff, Adobe's Digital Experience Evangelist and host of Adobe Sneaks, highlighted that the program serves as a testing ground for cutting-edge ideas. "We start by scouring the company for hundreds of technologies and ideas…and whittle them down to the seven most exciting, entertaining, and forward-looking innovations," Matisoff said. Looking ahead For businesses that rely on data-driven decision-making, Project Slide Wow could be a major step forward in simplifying the process of building presentations. If the tool gains traction, it may soon be available as an official Adobe product, potentially transforming how companies use customer data to inform strategy. Until then, CTOs, CIOs, team leads, and analysts should stay tuned for Adobe's Sneaks announcements to see whether Project Slide Wow makes the leap from an experimental demo to a real-world enterprise solution.
