VentureBeat

Anthropic’s Computer Use mode shows strengths and limitations in new study

Since Anthropic released the “Computer Use” feature for Claude in October, there has been a lot of excitement about what AI agents can do when given the power to imitate human interactions. A new study by Show Lab at the National University of Singapore provides an overview of what we can expect from the current generation of graphical user interface (GUI) agents.

Claude is the first frontier model that can interact with a device as a GUI agent through the same interfaces humans use. The model accesses only desktop screenshots and interacts by triggering keyboard and mouse actions. The feature promises to let users automate tasks through simple instructions, without needing API access to applications.

The researchers tested Claude on a variety of tasks, including web search, workflow completion, office productivity and video games. Web search tasks involve navigating and interacting with websites, such as searching for and purchasing items or subscribing to news services. Workflow tasks involve multi-application interactions, such as extracting information from a website and inserting it into a spreadsheet. Office productivity tasks test the agent’s ability to perform common operations such as formatting documents, sending emails and creating presentations. The video game tasks evaluate the agent’s ability to perform multi-step tasks that require understanding the game’s logic and planning actions.

Each task tests the model’s ability across three dimensions: planning, action and critic. First, the model must come up with a coherent plan to accomplish the task. It must then carry out the plan by translating each step into an action, such as opening a browser, clicking on elements and typing text.
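The perception-action loop this describes, screenshots in and keyboard or mouse events out, can be sketched in a few lines. The decision function below is a toy stand-in for the actual model call, and the action names are illustrative, not Anthropic’s API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "click", "type" or "done"
    payload: tuple   # coordinates for a click, text for typing, empty for done

def decide_next_action(screenshot: bytes, goal: str, history: list) -> Action:
    """Stand-in for the model call: given the latest screenshot and the goal,
    return the next keyboard/mouse action. A real agent would send the
    screenshot to the model here and parse its chosen action."""
    steps = [
        Action("click", (120, 80)),   # e.g. focus a search box
        Action("type", (goal,)),      # type the instruction-derived text
        Action("done", ()),           # stop when the goal looks satisfied
    ]
    return steps[min(len(history), len(steps) - 1)]

def run_agent(goal: str, max_steps: int = 10) -> list:
    """Screenshot-in, action-out loop: observe, act, repeat until done."""
    history = []
    for _ in range(max_steps):
        screenshot = b"<raw pixels>"  # a real loop would capture the screen here
        action = decide_next_action(screenshot, goal, history)
        history.append(action)
        if action.kind == "done":
            break
    return history
```

A production loop would also feed the critic’s self-assessment back into planning, which is exactly the component the study found weakest.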
Finally, the critic element determines whether the model can evaluate its progress and success in accomplishing the task. The model should be able to recognize when it has made errors along the way and correct course. And if the task is not possible, it should give a logical explanation. The researchers built a framework around these three components, with all tests reviewed and rated by humans.

In general, Claude did a great job of carrying out complex tasks. It was able to reason about and plan the multiple steps needed to carry out a task, perform the actions and evaluate its progress every step of the way. It could also coordinate between different applications, such as copying information from web pages and pasting it into spreadsheets. Moreover, in some cases, it revisited the results at the end of the task to make sure everything was aligned with the goal. The model’s reasoning trace shows that it has a general understanding of how different tools and applications work and can coordinate them effectively.

However, it also tends to make trivial mistakes that average human users would easily avoid. For example, in one task, the model failed to complete a subscription because it did not scroll down a webpage to find the corresponding button. In other cases, it failed at very simple and clear tasks, such as selecting and replacing text or changing bullet points to numbers. Moreover, the model either didn’t realize its error or made wrong assumptions about why it could not achieve the desired goal.

According to the researchers, the model’s misjudgments of its progress highlight “a shortfall in the model’s self-assessment mechanisms” and suggest that “a complete solution to this still may require improvements to the GUI agent framework, such as an internalized strict critic module.” From the results, it is also clear that GUI agents can’t yet replicate all the basic nuances of how humans use computers.

What does it mean for enterprises?
The promise of using basic text descriptions to automate tasks is very appealing. But at least for now, the technology is not ready for mass deployment. The behavior of the models is unstable and can lead to unpredictable results, which can have damaging consequences in sensitive applications. Performing actions through interfaces designed for humans is also not the fastest way to accomplish tasks that can be done through APIs. And we still have much to learn about the security risks of giving large language models (LLMs) control of the mouse and keyboard. For example, one study shows that web agents can easily fall victim to adversarial attacks that humans would easily spot and ignore.

Automating tasks at scale still requires robust infrastructure, including APIs and microservices that can be connected securely and served at scale. However, tools like Claude Computer Use can help product teams explore ideas and iterate over different solutions to a problem without investing time and money in developing new features or services to automate tasks. Once a viable solution is discovered, the team can focus on developing the code and components needed to deliver it efficiently and reliably.


Thomson Reuters’ CoCounsel redefines legal AI with OpenAI’s o1-mini model

Thomson Reuters launched testing today of a custom version of OpenAI’s newest language model in its CoCounsel legal assistant. The implementation marks the first enterprise customization of the o1-mini model and reveals how large companies are now transforming their artificial intelligence strategies.

The media and technology giant has implemented a strategic approach by deploying specialized AI models from OpenAI, Google, and Anthropic, with each optimized for specific legal tasks. Industry analysts believe this strategy, combined with the novel capabilities of o1-mini, could become a blueprint for enterprise AI deployment across industries.

“Each model—OpenAI, Google Gemini, and Anthropic—brings unique capabilities that are matched to the demands of specific workflows,” explained Joel Hron, Chief Technology Officer at Thomson Reuters, in an exclusive interview with VentureBeat. The company routes different legal tasks based on these capabilities. “OpenAI focuses on generative tasks like summarization and conversational AI within CoCounsel. Google’s Gemini is optimal for long-context tasks, enabling deep integration with large legal documents. Anthropic’s Claude is targeted at workflows requiring high sensitivity and customization, such as tax and compliance use cases.”

The new o1-mini model advances AI reasoning capabilities significantly, according to James Dyett, Head of Platform Sales at OpenAI, who also spoke to VentureBeat in an exclusive interview. “OpenAI o1-mini was designed for workflows that require professionals to spot very minor but potentially consequential terms and errors in legal briefs,” Dyett said.
“Compared to GPT-4, OpenAI o1-mini was trained to spend more time thinking through legal complexities.”

AI shows major gains in legal document analysis

Early testing has demonstrated meaningful performance improvements in real-world applications. Hron pointed to specific examples from their evaluation process. “In our testing of o1-mini for the detection of privileged emails, the model has shown a notable ability to identify situationally nuanced instances of privilege that were previously missed by even highly capable models like GPT-4,” he said. “This advancement is a direct reflection of o1-mini’s enhanced reasoning and contextual understanding.”

The strategy has produced significant results. Thomson Reuters reports a 1,400% increase in CoCounsel users over the past year. The system has transformed several key legal workflows, particularly in document management and analysis. “Document review, legal research, and drafting and revision have all seen significant improvements,” Hron noted. “These improvements have increased productivity and allow legal professionals to focus on higher-value tasks.”

From AI customer to AI developer: Thomson Reuters’ strategic expansion

The company’s AI strategy extends beyond using existing technology. Thomson Reuters recently acquired UK-based Safe Sign Technologies, a specialist in legal-focused language models, marking a significant move into AI development. “Our strategy for developing proprietary LLMs through Safe Sign Technologies complements our partnerships by giving us greater control over data security, customization, and cost efficiency,” Hron explained. “It allows us to leverage our greatest assets — our proprietary content and world-class domain experts — in a more direct way to create unique solutions that only we can deliver.”

The management of multiple AI models has required sophisticated infrastructure support.
Thomson Reuters partnered with Amazon Web Services to handle the computational demands, becoming an early customer for AWS SageMaker HyperPod. “We have deep and long-standing relationships with all of these providers and have the computational infrastructure needed to support demand for each of these models,” Hron said. “This actually allows us to optimize costs by allocating tasks strategically to the appropriate model.”

The development has drawn attention from both technology leaders and investors. Dyett emphasized the broader implications for enterprise AI deployment. “OpenAI works with large enterprises to understand the opportunities where frontier models like o1-mini or customized versions of o1-mini can power specific use cases,” he said. “These insights enable us to improve our model capabilities and identify additional legal tasks suited for OpenAI o1-mini reasoning customization.”

While enterprise AI has traditionally focused on broad capabilities, Thomson Reuters’ implementation of o1-mini signals a pivotal shift toward precision-engineered models that excel at highly specialized tasks. The model’s ability to catch nuanced legal distinctions that even GPT-4 missed suggests that the future of AI lies not in jack-of-all-trades systems, but in sophisticated networks of specialized models working in concert. For the legal industry, where a single missed detail can have million-dollar consequences, this precision-first approach could redefine the standards for AI deployment.
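The multi-model routing strategy Hron describes, matching each task category to the model best suited for it, can be illustrated with a minimal dispatch table. The task categories and model assignments below follow the article; the code itself is a hypothetical sketch, not Thomson Reuters’ implementation:

```python
# Hypothetical task-to-model routing table; the category/model pairings
# mirror the article, the identifiers themselves are invented for the sketch.
ROUTING_TABLE = {
    "summarization": "openai-o1-mini",        # generative tasks
    "conversation": "openai-o1-mini",
    "long_document_review": "google-gemini",  # long-context tasks
    "tax_compliance": "anthropic-claude",     # high-sensitivity workflows
}

def route_task(task_type: str) -> str:
    """Pick a model for a legal task, falling back to a default."""
    return ROUTING_TABLE.get(task_type, "openai-o1-mini")
```

In practice such routing would also weigh latency, cost and context-window limits, not just task category.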


Chinese researchers unveil LLaVA-o1 to challenge OpenAI’s o1 model

OpenAI‘s o1 model has shown that inference-time scaling—using more compute during inference—can significantly boost a language model’s reasoning abilities. LLaVA-o1, a new model developed by researchers from multiple universities in China, brings this paradigm to open-source vision language models (VLMs).

Early open-source VLMs typically use a direct prediction approach, generating answers without reasoning about the prompt and the steps required to solve it. Without a structured reasoning process, they are less effective at tasks that require logical reasoning. Advanced prompting techniques such as chain-of-thought (CoT) prompting, where the model is encouraged to generate intermediate reasoning steps, produce only marginal improvements, and VLMs still often produce errors or hallucinate.

The researchers observed that a key issue is that the reasoning process in existing VLMs is not sufficiently systematic and structured. The models do not generate reasoning chains and often get stuck in reasoning processes where they don’t know what stage they are at and what specific problem they must solve.

“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from a logical reasoning toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”

Multistage reasoning

OpenAI o1 uses inference-time scaling to address the problem of systematic and structured reasoning, allowing the model to pause and review its results as it gradually solves the problem.
While OpenAI has not released much detail about the underlying mechanism of o1, its results show promising directions for improving the reasoning abilities of foundational models. Inspired by o1, the researchers designed LLaVA-o1 to perform stage-by-stage reasoning. Instead of generating a direct reasoning chain, LLaVA-o1 breaks down the reasoning process into four distinct stages:

Summary: The model first provides a high-level summary of the question, outlining the core problem it needs to address.

Caption: If an image is present, the model describes the relevant parts, focusing on elements related to the question.

Reasoning: Building on the summary, the model performs structured, logical reasoning to derive a preliminary answer.

Conclusion: Finally, the model presents a concise summary of the answer based on the preceding reasoning.

Only the conclusion stage is visible to the user; the other three stages represent the model’s internal reasoning process, similar to the hidden reasoning trace of o1. This structured approach allows LLaVA-o1 to manage its reasoning process independently, leading to improved performance on complex tasks. “This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks,” the researchers write.

[Figure: Stage-level beam search (right) vs. other inference-time scaling techniques. Source: arXiv]

LLaVA-o1 also introduces a novel inference-time scaling technique called “stage-level beam search.” Stage-level beam search generates multiple candidate outputs at each reasoning stage, then selects the best candidate at each stage to continue the generation process. This is in contrast to the classic best-of-N approach, in which the model is prompted to generate multiple complete responses before selecting one.
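The contrast between the two inference-time scaling strategies can be sketched with toy stand-ins for the generator and the verifier. The stage names follow the paper; the generation and scoring logic below is purely illustrative:

```python
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate(stage: str, prefix: list, seed: int) -> str:
    """Stand-in for the model producing one candidate output for a stage."""
    return f"{stage}-v{seed}"

def score(candidate: str) -> int:
    """Stand-in verifier: here it simply prefers higher version numbers."""
    return int(candidate.rsplit("v", 1)[1])

def stage_level_beam_search(n_candidates: int = 2) -> list:
    """Generate several candidates per stage, keep only the best, move on."""
    output = []
    for stage in STAGES:
        candidates = [generate(stage, output, s) for s in range(n_candidates)]
        output.append(max(candidates, key=score))
    return output

def best_of_n(n: int = 2) -> list:
    """Classic approach: generate N complete responses, pick the best whole."""
    responses = [[generate(st, [], s) for st in STAGES] for s in range(n)]
    return max(responses, key=lambda r: sum(score(c) for c in r))
```

Stage-level beam search prunes after every stage, so a weak early stage can be discarded before it contaminates the rest of the response, whereas best-of-N only compares finished answers.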
“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference time scaling.”

Training LLaVA-o1

[Figure: LLaVA-o1 training data is annotated with GPT-4o. Source: arXiv]

To train LLaVA-o1, the researchers compiled a new dataset of around 100,000 image-question-answer pairs obtained from several widely used VQA datasets. The dataset covers a variety of tasks, from multi-turn question answering to chart interpretation and geometric reasoning. The researchers used GPT-4o to generate the detailed four-stage reasoning processes for each example, including the summary, caption, reasoning and conclusion stages. They then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. The researchers have not released the model but plan to release the dataset, called LLaVA-o1-100k.

LLaVA-o1 in action

The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Despite being trained on only 100,000 examples, LLaVA-o1 showed significant performance improvements over the base Llama model, with an average benchmark score increase of 6.9%.

[Figure: LLaVA-o1 vs. other open and closed models. Source: arXiv]

Furthermore, stage-level beam search led to additional performance gains, demonstrating the effectiveness of inference-time scaling. Due to computational resource constraints, the researchers were only able to test the technique with a beam size of 2, and they expect even greater improvements with larger beam sizes. Impressively, LLaVA-o1 outperformed not only other open-source models of the same size or larger but also some closed-source models like GPT-4o mini and Gemini 1.5 Pro.

“LLaVA-o1 establishes a new standard for multimodal reasoning in VLMs, offering robust performance and scalability, especially in inference time,” the researchers write.
“Our work paves the way for future research on structured reasoning in VLMs, including potential expansions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.”


Microsoft’s AI agents: 4 insights that could reshape the enterprise landscape

The era of AI agents has officially arrived, and Microsoft is leading the charge. At Ignite, the company made bold claims about its advancements in enterprise AI, including that 100,000 organizations are already deploying or editing AI agents. These announcements suggest Microsoft is about to disrupt how enterprises approach automation, as well as the startups competing in this space.

In my conversation with Sam Witteveen, CEO of Red Dragon AI and a machine learning developer expert who has become a leading AI educator and influencer through his technical content on YouTube, we dive into the most significant takeaways from Ignite. Why are enterprise AI leaders saying value is shifting from LLMs to the layers on top of them — specifically, to the enterprise governance layers where Microsoft can shine? How does Microsoft’s vision of a multi-agent “mesh” raise the stakes for technical decision-makers? And what does it mean when startups claim Microsoft is “steamrolling” entire verticals with its enterprise AI offerings?

Major takeaways from our conversation:

AI Agents Are Ready: Microsoft says the agent era is here, but what’s fueling this bold declaration?

Beyond LLMs: The value is shifting, but where—and why does it matter to your enterprise?

The Multi-Agent Mesh Vision: Could millions of agents working in tandem redefine enterprise AI architecture?

Microsoft’s Lead: With 100,000 organizations on board, how real is this advantage—and how much is marketing spin?

UPDATE: The takeaways video is actually the first of a three-part series. We’ve since released the other two videos in the series. The second covers the 10 autonomous agents Microsoft launched, and how they cover such obvious ground in the enterprise that they are likely startup killers.
The third covers how the Copilot Studio agent builder is differentiated from competitors. See below:


Salesforce launches Agentforce Testing Center to put agents through paces

The next phase of agentic AI may just be evaluation and monitoring, as enterprises want to make the agents they’re beginning to deploy more observable. While AI agent benchmarks can be misleading, there’s a lot of value in seeing whether an agent is working the way its operators want it to. To this end, companies are beginning to offer platforms where customers can sandbox AI agents or evaluate their performance.

Salesforce released its agent evaluation platform, Agentforce Testing Center, in a limited pilot Wednesday. General availability is expected in December. Testing Center lets enterprises observe and prototype AI agents to ensure they access the workflows and data they need.

Testing Center’s new capabilities include AI-generated tests for Agentforce, sandboxes for Agentforce and Data Cloud, and monitoring and observability for Agentforce. AI-generated tests allow companies to use AI models to generate “hundreds of synthetic interactions” to test how often agents answer the way companies want. As the name suggests, sandboxes offer an isolated environment to test agents while mirroring a company’s data, to better reflect how the agent will work for that company. Monitoring and observability let enterprises carry an audit trail with them from the sandbox when the agents go into production.

Patrick Stokes, executive vice president of product and industries marketing at Salesforce, told VentureBeat that the Testing Center is part of a new class of offerings the company calls Agent Lifecycle Management. “We are positioning what we think will be a big new subcategory of agents,” Stokes said.
“When we say lifecycle, we mean the whole thing from genesis to development all the way through deployment, and then iterations of your deployment as you go forward.”

Stokes said that right now, the Testing Center doesn’t offer workflow-specific insights where developers can see the specific API, data or model choices the agents made. However, Salesforce collects that kind of data in its Einstein Trust Layer. “What we’re doing is building developer tools to expose that metadata to our customers so that they can actually use it to better build their agents,” Stokes said.

Salesforce is hanging its hat on AI agents, focusing much of its energy on its agentic offering, Agentforce. Salesforce customers can use preset agents or build customized agents on Agentforce to connect to their instances.

Evaluating agents

AI agents touch many points in an organization, and since good agentic ecosystems aim to automate a big chunk of workflows, making sure they work well becomes essential. If an agent decides to tap the wrong API, it could spell disaster for a business. AI agents are stochastic in nature, like the models that power them, weighing probabilities before committing to an outcome.

Stokes said Salesforce tests agents by barraging them with versions of the same utterances or questions. Responses are scored as pass or fail, allowing the agent to learn and evolve within a safe environment that human developers can control.

Platforms that help enterprises evaluate AI agents are fast becoming a new type of product offering. In June, customer experience AI company Sierra launched an AI agent benchmark called TAU-bench to look at the performance of conversational agents. Automation company UiPath released its Agent Builder platform in October, which also offers a means to evaluate agent performance before full deployment.

Testing AI applications is nothing new.
Beyond benchmarking model performance, many AI model platforms, such as Amazon Bedrock and Microsoft Azure, already let customers test foundation models in a controlled environment to see which one works best for their use cases.
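The testing approach Stokes describes, firing many phrasings of the same request at an agent and scoring each response pass or fail, maps onto a simple evaluation harness. The sketch below uses a toy agent and a toy checker; none of it is Salesforce’s actual implementation:

```python
def toy_agent(utterance: str) -> str:
    """Stand-in agent: answers refund questions, misses everything else."""
    if "refund" in utterance.lower():
        return "You can request a refund within 30 days."
    return "Sorry, I don't understand."

def passes(response: str) -> bool:
    """Toy pass/fail check: the answer must state the refund window."""
    return "30 days" in response

def run_eval(variants: list[str]) -> float:
    """Score the agent across utterance variants; return the pass rate."""
    results = [passes(toy_agent(v)) for v in variants]
    return sum(results) / len(results)

variants = [
    "How do I get a refund?",
    "Can I have my money back?",   # phrasing the toy agent misses
    "What is your refund policy?",
    "I want a REFUND now.",
]
pass_rate = run_eval(variants)
```

The failing variant is the point of the exercise: barraging the agent with rephrasings surfaces exactly the utterances it cannot yet handle before it reaches production.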


Lightricks bets on open-source AI video to challenge Big Tech

Lightricks, the Israeli company behind the viral photo-editing app Facetune, is launching an ambitious effort to shake up the generative AI landscape. Today, the company announced the release of LTX Video (LTXV), an open-source AI model capable of generating five seconds of high-quality video in just four seconds. By making its video model freely available, Lightricks is taking direct aim at the growing dominance of proprietary AI systems from tech giants like OpenAI, Adobe and Google.

“We believe foundational models are going to be a commodity, and you can’t build an actual business around foundational models,” said Zeev Farbman, co-founder and CEO of Lightricks, in an exclusive interview with VentureBeat. “If startups want to have a serious chance to compete, the technology needs to be open, and you want to make sure that people in the top universities across the world have access to your model and add capabilities on top of it.”

With real-time processing, scalability for long-form video, and a compact architecture that runs efficiently even on consumer-grade hardware, LTXV is poised to make professional-grade generative video technology accessible to a broader audience—an approach that could disrupt the industry’s status quo.

[Video: We prompted LTXV to create a high-fashion scene. The model generated this cinematic sequence featuring a businesswoman in an urban setting—complete with consistent lighting, reflective surfaces, and professional-grade cinematography—all in four seconds. Credit: Lightricks/VentureBeat]

How Lightricks weaponizes open source to challenge AI giants

Lightricks’ decision to release LTXV as open source is a calculated gamble designed to differentiate the company in an increasingly crowded generative AI market.
With its two billion parameters, the model is designed to run efficiently on widely available GPUs, such as the Nvidia RTX 4090, while maintaining high visual fidelity and motion consistency. This move comes at a time when many leading AI models—from OpenAI’s DALL-E to Google’s Imagen—are locked behind APIs, requiring developers to pay for access. Lightricks, by contrast, is betting that openness will foster innovation and adoption.

Farbman compared LTXV’s launch to Meta’s release of its open-source Llama language models, which quickly gained traction in the AI community and helped Meta establish itself in a space dominated by OpenAI’s ChatGPT. “The business rationale is that if the community adopts it, if people in academia adopt it, we as a company are going to benefit a ton from it,” Farbman said.

Unlike Meta, which controls the infrastructure its models run on, Lightricks is focusing solely on the model itself, working with platforms like Hugging Face to make it accessible. “We’re not going to make any money out of this model at the moment,” Farbman emphasized. “Some people are going to deploy it locally on their hardware, like a gaming PC. It’s all about adoption.”

[Video: We tested Lightricks’ new LTXV video model with a simple prompt about a vintage IBM PC. Here’s what the AI generated in just four seconds. Credit: Lightricks/VentureBeat]

Lightning-fast AI video: Breaking speed records on consumer hardware

LTXV’s standout feature is its speed. The model can generate five seconds of video—121 frames at 768×512 resolution—in just four seconds on Nvidia’s H100 GPUs. Even on consumer-grade hardware, such as the RTX 4090, LTXV delivers near-real-time performance, making it one of the fastest models of its kind. This speed is achieved without compromising quality. The model’s Diffusion Transformer architecture ensures smooth motion and structural consistency between frames, addressing a key limitation of earlier video-generation models.
For smaller studios, independent creators and researchers, the ability to iterate quickly and generate high-quality results on affordable hardware is a game-changer. “When you’re waiting a couple of minutes to get a result, it’s a terrible user experience,” Farbman said. “But once you’re getting feedback quickly, you can experiment and iterate faster. You develop a mental model of what the system can do, and that unlocks creativity.”

Lightricks has also designed LTXV to support longer-form video production, offering creators greater flexibility and control. This scalability, combined with its rapid processing times, opens up new possibilities for industries ranging from gaming to e-commerce. In gaming, for example, LTXV could be used to upscale graphics in older games, transforming them into visually stunning experiences. In e-commerce, the model’s speed and efficiency could enable businesses to create thousands of ad variations for targeted A/B testing. “Imagine casting an actor—real or virtual—and tweaking the visuals in real time to find the best creative for a specific audience,” Farbman said.

From photo app to AI powerhouse: Lightricks’ bold market play

With LTXV, Lightricks is positioning itself as a disruptor in an industry increasingly dominated by a handful of tech giants. This is a bold move for a company that started as a mobile app maker and is best known for Facetune, a consumer photo-editing app that became a global hit. Lightricks has since expanded its offerings, acquiring the Chicago-based influencer marketing platform Popular Pays and launching LTX Studio, an AI-driven storytelling platform aimed at professional creators. The integration of LTXV into LTX Studio is expected to enhance the platform’s capabilities, allowing users to generate longer, more dynamic videos with greater speed and precision.

But Lightricks faces significant challenges.
Competing against industry heavyweights like Adobe and Autodesk, which have deeper pockets and established user bases, won’t be easy. Adobe, for example, has already integrated generative AI into its Creative Cloud suite, giving it a natural advantage among professional users. Farbman acknowledges the risks but believes that open-source innovation is the only viable path forward for smaller players. “If you want to have a fighting chance as a startup versus the giants, you need to ensure the technology is open and adopted by academia and the broader community,” he said.

Why open source could win the AI video generation race

The release of LTXV also highlights a growing tension in the AI industry between open-source and proprietary approaches. While closed models offer companies tighter control and monetization opportunities, they risk alienating developers and


Microsoft’s 10 new AI agents strengthen its enterprise automation lead

Microsoft made waves at Ignite 2024 with its announcement that 10 autonomous AI agents are now available for enterprise use. Microsoft effectively declared that AI agents are ready for prime time — achieving what others have yet to accomplish.

Microsoft’s pre-built agents target core enterprise operations – from CRM and supply chain management to financial reconciliation. While competitors like Salesforce and ServiceNow offer AI agent solutions in some limited areas, Microsoft has created an extensive agent ecosystem that reaches beyond its own platform. The system includes 1,400 third-party connectors and supports customization across 1,800+ large language models. The scale of adoption is equally significant: 100,000 organizations are already creating or modifying agents, Microsoft says, with deployment rates doubling last quarter – adoption numbers that dwarf those of competitors.

In my three-part video series with generative AI developer and expert Sam Witteveen, we explore what this move means for enterprises, why Microsoft is pulling ahead as a leader in agentic AI, and how these tools may transform the way companies handle workflows. Below, we break down the highlights and invite you to explore insights from the full series.

The big takeaways

Microsoft’s release of these 10 AI agents shows enterprise AI is moving from theoretical to practical, but Microsoft’s other statements about agents have other ramifications:

Pre-built enterprise value: Microsoft’s agents are pre-configured to tackle specific workflows, unlike traditional toolkits that require heavy customization. Whether it’s qualifying sales leads or optimizing supply chains, these agents are ready to deploy.
A decisive lead: By leveraging its ecosystem of productivity apps and customer reach, Microsoft is ahead of competitors like Salesforce, Google and AWS, offering enterprise-grade solutions at scale.

Redefining competition: The agents’ targeted capabilities, like CRM lead scoring and time management, challenge startups that previously dominated these niches.

The agentic AI vision: From pre-built agents to fully customized solutions, Microsoft’s ecosystem empowers enterprises to create, modify, and deploy agents seamlessly—lowering the barriers to adoption.

LLMs may no longer be the most valuable layer: Microsoft’s shift from “per token” to “per message” pricing — and toward “per outcome” value — signals a move beyond the raw output of language models.

But with competitors like Google, AWS and open-source frameworks hot on its heels, Microsoft’s lead may not last forever. In the video series, we talk about these alternative players too, and how Microsoft is differentiated from them.

Watch the series

In this three-part series, we dive deep into what Microsoft’s AI agents mean for enterprise leaders. Watch now to learn:

Part 1: The four biggest takeaways from Microsoft Ignite 2024.

Part 2: How Microsoft’s 10 autonomous agents cover key enterprise workflows (and could incidentally kill a lot of startups that had launched to cover similar workflows).

Part 3: How Microsoft stacks up against competitors like Google, OpenAI and AWS in the race for agentic AI leadership.

Explore the full series here.

Microsoft’s 10 new AI agents strengthen its enterprise automation lead

OpenScholar: The open-source A.I. that’s outperforming GPT-4o in scientific research

Scientists are drowning in data. With millions of research papers published every year, even the most dedicated experts struggle to stay updated on the latest findings in their fields. A new artificial intelligence system called OpenScholar promises to rewrite the rules for how researchers access, evaluate and synthesize scientific literature. Built by the Allen Institute for AI (Ai2) and the University of Washington, OpenScholar combines cutting-edge retrieval systems with a fine-tuned language model to deliver citation-backed, comprehensive answers to complex research questions.

“Scientific progress depends on researchers’ ability to synthesize the growing body of literature,” the OpenScholar researchers wrote in their paper. But that ability is increasingly constrained by the sheer volume of information. OpenScholar, they argue, offers a path forward, one that not only helps researchers navigate the deluge of papers but also challenges the dominance of proprietary AI systems like OpenAI’s GPT-4o.

How OpenScholar’s AI brain processes 45 million research papers in seconds

At OpenScholar’s core is a retrieval-augmented language model that taps into a datastore of more than 45 million open-access academic papers. When a researcher asks a question, OpenScholar doesn’t merely generate a response from pre-trained knowledge, as models like GPT-4o often do. Instead, it actively retrieves relevant papers, synthesizes their findings, and generates an answer grounded in those sources. This ability to stay “grounded” in real literature is a major differentiator. In tests using ScholarQABench, a new benchmark designed specifically to evaluate AI systems on open-ended scientific questions, OpenScholar excelled.
The system demonstrated superior performance on factuality and citation accuracy, even outperforming much larger proprietary models like GPT-4o. One particularly damning finding involved GPT-4o’s tendency to generate fabricated citations, hallucinations in AI parlance. When tasked with answering biomedical research questions, GPT-4o cited nonexistent papers in more than 90% of cases. OpenScholar, by contrast, remained firmly anchored in verifiable sources.

This grounding in real, retrieved papers is fundamental. The system uses what the researchers describe as their “self-feedback inference loop,” which “iteratively refines its outputs through natural language feedback, which improves quality and adaptively incorporates supplementary information.” The implications for researchers, policymakers and business leaders are significant. OpenScholar could become an essential tool for accelerating scientific discovery, enabling experts to synthesize knowledge faster and with greater confidence.

How OpenScholar works: The system begins by searching 45 million research papers (left), uses AI to retrieve and rank relevant passages, generates an initial response, and then refines it through an iterative feedback loop before verifying citations. This process allows OpenScholar to provide accurate, citation-backed answers to complex scientific questions. | Source: Allen Institute for AI and University of Washington

Inside the David vs. Goliath battle: Can open source AI compete with Big Tech?

OpenScholar’s debut comes at a time when the AI ecosystem faces a growing tension between closed, proprietary systems and the rise of open-source alternatives like Meta’s Llama. Models like OpenAI’s GPT-4o and Anthropic’s Claude offer impressive capabilities, but they are expensive, opaque and inaccessible to many researchers. OpenScholar flips this model on its head by being fully open source.
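The retrieve, rank, generate and refine loop described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not OpenScholar’s actual code: the corpus, the word-overlap ranking and the citation check below are stand-ins for the real retriever, reranker and language-model critic.

```python
# Toy sketch of a retrieve-then-refine answering loop. The corpus, the
# overlap-based ranking and the "feedback" check are stand-ins for a real
# retriever, reranker and language-model critic.

CORPUS = {
    "paper_a": "Retrieval-augmented generation grounds answers in sources",
    "paper_b": "Iterative self-feedback can improve citation accuracy",
    "paper_c": "Language models may hallucinate citations without grounding",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank papers by naive word overlap with the query."""
    q = {w.strip(".,?").lower() for w in query.split()}
    def score(pid: str) -> int:
        return len(q & {w.strip(".,?").lower() for w in CORPUS[pid].split()})
    return sorted(CORPUS, key=score, reverse=True)[:k]

def generate(query: str, sources: list[str]) -> str:
    """Stand-in for the language model: the answer quotes its cited snippets."""
    cited = "; ".join(f"{CORPUS[s]} [{s}]" for s in sources)
    return f"Q: {query} A: {cited}"

def answer_with_feedback(query: str, rounds: int = 2) -> str:
    """Draft an answer, then re-check that every retrieved source is cited."""
    sources = retrieve(query)
    draft = generate(query, sources)
    for _ in range(rounds):
        # Feedback step: a real critic would judge content quality too;
        # here we only verify that no citation was dropped from the draft.
        if all(f"[{s}]" in draft for s in sources):
            break
        draft = generate(query, sources)
    return draft

result = answer_with_feedback("How does grounding reduce hallucinated citations?")
```

The key design point the paper emphasizes is that the final answer is constructed from retrieved text rather than from parametric memory, which is what makes fabricated citations detectable at all.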
The OpenScholar team has released not only the code for the language model but also the entire retrieval pipeline, a specialized 8-billion-parameter model fine-tuned for scientific tasks, and a datastore of scientific papers. “To our knowledge, this is the first open release of a complete pipeline for a scientific assistant LM—from data to training recipes to model checkpoints,” the researchers wrote in their blog post announcing the system.

This openness is not just a philosophical stance; it’s also a practical advantage. OpenScholar’s smaller size and streamlined architecture make it far more cost-efficient than proprietary systems. For example, the researchers estimate that OpenScholar-8B is 100 times cheaper to operate than PaperQA2, a concurrent system built on GPT-4o. This cost-efficiency could democratize access to powerful AI tools for smaller institutions, underfunded labs, and researchers in developing countries.

Still, OpenScholar is not without limitations. Its datastore is restricted to open-access papers, leaving out paywalled research that dominates some fields. This constraint, while legally necessary, means the system might miss critical findings in areas like medicine or engineering. The researchers acknowledge this gap and hope future iterations can responsibly incorporate closed-access content.

How OpenScholar performs: Expert evaluations show OpenScholar (OS-GPT4o and OS-8B) competing favorably with both human experts and GPT-4o across four key metrics: organization, coverage, relevance and usefulness. Notably, both OpenScholar versions were rated as more “useful” than human-written responses. | Source: Allen Institute for AI and University of Washington

The new scientific method: When AI becomes your research partner

The OpenScholar project raises important questions about the role of AI in science. While the system’s ability to synthesize literature is impressive, it is not infallible.
In expert evaluations, OpenScholar’s answers were preferred over human-written responses 70% of the time, but the remaining 30% highlighted areas where the model fell short—such as failing to cite foundational papers or selecting less representative studies. These limitations underscore a broader truth: AI tools like OpenScholar are meant to augment, not replace, human expertise. The system is designed to assist researchers by handling the time-consuming task of literature synthesis, allowing them to focus on interpretation and advancing knowledge.

Critics may point out that OpenScholar’s reliance on open-access papers limits its immediate utility in high-stakes fields like pharmaceuticals, where much of the research is locked behind paywalls. Others argue that the system’s performance, while strong, still depends heavily on the quality of the retrieved data. If the retrieval step fails, the entire pipeline risks producing suboptimal results.

But even with its limitations, OpenScholar represents a watershed moment in scientific computing. While earlier AI models impressed with their ability to engage in conversation, OpenScholar demonstrates something more fundamental: the capacity to process, understand, and synthesize scientific literature with near-human accuracy. The numbers tell a compelling story. OpenScholar’s 8-billion-parameter model outperforms GPT-4o while being orders of magnitude smaller. It matches human experts in citation accuracy where other


H2O.ai improves AI agent accuracy with predictive models

Open-source AI platform provider H2O.ai believes a blend of generative and predictive AI models makes for the more consistent responses that enterprises want from an AI agent.

H2O.ai’s new multi-agent platform, h2oGPTe, which blends generative and predictive AI, is now generally available. The platform uses the company’s AI models Mississippi and Danube, but can also access other available large and small language models. The company said h2oGPTe works in air-gapped, on-premise and cloud systems. Sri Ambati, founder and CEO of H2O.ai, told VentureBeat that having both generative and predictive AI gives enterprises more confidence that the agents will work exactly as they need without compromising security.

“The number one problem with agents is consistency. Can I get a consistent response from an LLM [large language model] for the same prompt? I think you get multiple different responses right now,” Ambati said. “But you can bring multiple models that negotiate, plan and deliver an outcome. Think of it as humans can have a bit of variability with each other, but you still expect a consistent response, and that’s the domain of predictive AI combined with generative AI.”

Ambati explained that generative AI models are “decent at content generation and very good at code generation,” but predictive models bring more scenario simulation to the table. He said predictive models bring consistency to agentic responses because they do not just generate responses but learn from patterns in data.

The platform is built for finance, telecommunications, healthcare and government enterprises that need to manage multi-step tasks. H2O.ai’s agent works best for organizations that want to get insights into their business, not just a guide that runs through their workflows.
This is because agents within the h2oGPTe platform can read multimodal data like charts and craft answers to questions like “Should my company sell more dolls this year?” that consider the enterprise’s historical financial data or the market trend information it stores.

Multimodal agents

Like other AI agents, h2oGPTe automates workflow tasks so human employees don’t have to do those activities themselves. Ambati said the multimodal capabilities of H2O.ai’s agents open up more information that they can learn from to offer the best, most consistent answers to users. The company said the agents can also create PDF documents with charts and tables grounded in enterprise data to visualize information for the human user. H2O.ai ensured that the agents cite their sources for data traceability and offer customizable guardrails.

H2O.ai’s agentic platform builds in model testing, including automated question generation, where an AI model creates variations of a prompt and barrages the agent with questions to see if it responds consistently. It also has a dashboard where people can identify which type of database, model or part of the workflow the agents tapped.

Consistency and accuracy in agents

With the hype around AI agents predicted to continue into next year, there is a need to ensure agents provide value to enterprises, including performing consistently, reliably and accurately. Reliability is critical because AI agents are meant to automate a large portion of an enterprise’s workflow without human intervention. H2O.ai’s approach of blending generative and predictive models is one way; other companies are also looking at ways to ensure AI agents don’t cause trouble for enterprises. The startup xpander.ai introduced its Agent Graph System for multi-step agents. Salesforce also released its Agentforce Testing Center, which tests agent response consistency, to a limited preview.
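The automated consistency check described above can be sketched simply: paraphrase one prompt, query the agent with each variant, and score how often the answers agree. Everything in this sketch is a hypothetical stub under stated assumptions; a real harness like H2O.ai’s would call an LLM both to generate the paraphrases and to answer them.

```python
# Toy sketch of an automated agent-consistency check: generate variants of
# one prompt, query the "agent" with each, and score answer agreement.
# Both helper functions are hypothetical stubs standing in for LLM calls.

def make_variants(prompt: str) -> list[str]:
    """Stand-in for LLM-generated paraphrases of one prompt."""
    return [
        prompt,
        prompt.replace("sell more", "increase sales of"),
        f"Put briefly: {prompt}",
    ]

def agent(prompt: str) -> str:
    """Deterministic stub agent; a real agent may vary between runs."""
    return "yes" if "dolls" in prompt.lower() else "unknown"

def consistency_score(prompt: str) -> float:
    """Fraction of variants whose answer matches the most common answer."""
    answers = [agent(v) for v in make_variants(prompt)]
    modal = max(set(answers), key=answers.count)
    return answers.count(modal) / len(answers)

score = consistency_score("Should my company sell more dolls this year?")
```

A score well below 1.0 on many prompts would flag the kind of run-to-run variability Ambati describes as the number one problem with agents.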


Will Republicans continue to support subsidies for the chip industry?

The Chips & Science Act was a bipartisan law passed to provide $52 billion for the U.S. semiconductor industry. It was created in the name of ensuring national security and a secure supply chain for critical electronics goods at a time when relations with China were frosty. The act became law in part because it promised to bring high-value jobs back to the United States, decades after those jobs left for low-cost areas in Asia.

But Donald Trump is president-elect now, and the Republicans are firmly in control of the federal government. We’ll soon find out if the love for electronics, chips and the jobs they bring is still there. Under Trump, new leaders such as Vivek Ramaswamy and Elon Musk have been tapped to cut government spending via the Department of Government Efficiency (DOGE). Will they continue to support the Chips & Science Act? And do they see the value of investing further in semiconductor factories, with a second act to finish the job of completing the chip factories that have been started?

To answer these questions, I interviewed Scott Almassy, a partner with consulting and accounting firm PwC. He has run the company’s semiconductor practice for much of his 20-year stint at PwC. For that job, he has had to stay on top of the intricacies of the chip business, not only from the view of Silicon Valley but also in places like South Korea. Here’s an edited transcript of our interview.

Scott Almassy is a partner in charge of chip coverage at PwC.

VentureBeat: Could you start with some of your background?

Scott Almassy: I’m a partner with PwC. Obviously we’re one of the large accounting and advisory firms. I’ve been here 20 years. Currently I’m our U.S. semiconductor leader.
Our business is split between audit and advisory, audit being assurance, public companies, capital markets, audit opinions, and then advisory is consulting. I sit over both of those, but I’m an audit partner by background. In my 20 years I’ve been in the U.S., mostly in Silicon Valley, and also South Korea for three years. Virtually all my clients have been semiconductor companies, from foundries to the fabless guys at the end, putting the final products out there. I’ve seen the end to end throughout my career.

As far as perspectives go, our industry has been quite in the spotlight, especially starting with COVID. Now everyone is curious about shifts in the industry. You have the CHIPS Act. You have China. You have the rest of the world trying to onshore, reshore, whatever you want to call it. At the same time you still have the 30-plus years of muscle memory for Asia, moving everything there. Now people are figuring out how to bring it back and/or diversify.

VentureBeat: There was bipartisan support for the CHIPS Act. That’s why it passed. Where does it stand after the election in terms of what might be modified about it, or whether the money that’s there is going to get spent or allocated or not?

Almassy: A number of different perspectives. You’re right that it was bipartisan. In theory it would be harder to unwind, not only from an administrative perspective, but a political and emotional perspective. You have a number of states that were super excited that that funding was rolled out and that large players would build in their states. That makes it difficult to unwind.

Initially, and obviously we’re only seven days past the election, there was a bit of consternation. Are the funds going to get doled out? Some folks, including potentially Commerce, which is in charge of giving the money out, want to make sure they dot all the Is and cross the Ts.
Whether they needed to expedite that, whether the companies that were granted the money needed to work together to get that across the finish line and locked in before the change in administration. At least what I’ve heard and what I’ve read recently is that the initial CHIPS Act – the $51-52 billion, whatever the number in pure cash, and then the tax incentives would take it higher – probably isn’t at risk. That money will continue to be doled out.

An interesting thing to watch might be – I don’t know how familiar you are with the CHIPS Act, but effectively the money was earmarked, the $50 billion plus. Commerce then set out to figure out what it would look like and what they wanted people to do before they gave them the money. That whole thing was almost a clean sheet. Trying to figure out, is it limited on how you can expand in China? Or not necessarily China, but countries on the list. One thing to watch out for is if these contracts are signed prior to the new administration coming in, the money might still get doled out, but do they try to put additional restrictions on it, put a spin on it? I’m not sure there can be wholesale changes. It’s not restrictive. But the terms are written with preventing China’s growth in mind, making sure jobs are created, making sure you’re not doing buybacks. All that stuff is already in there.

VentureBeat: The other piece of the picture that seems new is the likelihood of tariffs happening. If there’s still a supply chain that exists outside the U.S. and it supplies parts into the semiconductor factories, are the costs going to go up for that reason? People were pointing out things like the cost of game consoles. A PS5 Pro costs $700 now, and it might go to $1,000 if it’s affected by tariffs. That’s something that is manufactured in China. AMD is the key supplier on that. But I don’t know which pieces of that are going to be affected by tariffs, if any.

Almassy: It’s an interesting point on
