VentureBeat

UiPath’s new orchestrator guides AI agents to follow your enterprise’s rules

By now, many enterprises have begun exploring AI agents and determining whether deploying them is a viable option for their business. But many still equate agents with something most companies have had for years: automation. Automation pioneer UiPath sees agents — and orchestrating the entire ecosystem around them — a little differently.

The company announced its new UiPath Platform for Agentic Automation. However, it made clear that agents are not a new version of robotic process automation (RPA); rather, they are another tool that enterprises can integrate with RPA to complete workflows. Daniel Dines, UiPath founder and CEO, told VentureBeat in an interview that agents cannot be fully automated as they are built today. “The big problem with LLMs today is that they are nondeterministic, so you cannot run them directly in an autonomous fashion,” Dines said. “If you look at most implementations of agents, these are actually chatbots. So we’re moving from chat in, chat out to an agent that is data in, action out, where we orchestrate between agents, humans and robots.”

Key to UiPath’s offering is its AI orchestration layer, Maestro. It oversees the flow of information from agents to the human employee to the automation layer. UiPath described Maestro as a centralized supervisor that “automates, models and optimizes complex business processes” and monitors performance.

Breaking down agents and automation

Maestro takes user prompts and breaks each process down into manageable steps. Instead of allowing agents to access information indiscriminately, Dines said Maestro works in three steps. First, the agent takes the prompt, analyzes it and recommends how to complete the query. Next, a human user approves the recommendation. Then, an RPA tool executes on that recommendation, completing the request.

Dines said Maestro makes the workflow more transparent and accountable because a human remains in the loop and a rules-based RPA finishes the task. For UiPath, separating the agents that take in data to make a recommendation from the automation that acts on that recommendation ensures enterprises don’t give agents unfettered access to their entire system.

“I think it’s a very powerful way for enterprises to adopt agents. And look, in many discussions with clients, I think they resonate very well because they are really concerned about the unlimited agency of agents,” Dines said. UiPath also integrates with the orchestration framework provider LangChain to offer open, multi-agent frameworks. The Platform for Agentic Automation also works with Anthropic and Microsoft frameworks, and UiPath is part of Google’s Agent2Agent protocol.

Not every agent is automation

Dines insists that treating agents as a complete stack solution, where the agents read the data and then take action, is too risky for enterprises. “Agents being nondeterministic in nature are transactional; they create effects on the underlying systems. No client I know will take risks on this,” Dines said. “Transactions should be 100% reliable, and only automations can offer this type of reliability. So our solution is the best of those worlds.” He added that “maybe in some future” agentic AI “will become more reliable, and some actions you can delegate to agents, but it should be a progression.”

Others in the industry believe that agents are the next evolution of automation.
In fact, the entire premise of agentic AI is to have a system that does things on the user’s behalf. A secondary goal for many is to have “ambient” agents, where AI agents run in the background, proactively act for the user and alert people to any changes that need their attention.

UiPath, however, still needs to make the case that its approach to agents is more effective than all-in-one agent offerings and cuts through the hype surrounding agents that do everything for users. Companies like ServiceNow, Salesforce, Writer and Microsoft have all released agentic platforms aimed at enterprise users. Writer’s new platform relies on self-evolving models for autonomous agents. Enterprises have also shown excitement around the idea that AI agents could streamline much of their work and automate many manual tasks.
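The “data in, action out” pattern Dines describes (an agent proposes, a human approves, deterministic automation executes) can be sketched roughly as follows. This is a minimal illustration: the function names and approval mechanism are hypothetical and do not represent UiPath’s Maestro API.

```python
# Rough sketch of the propose / approve / execute loop described in the
# article: the nondeterministic agent only recommends, a human signs off, and
# a deterministic RPA step performs the transaction. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Recommendation:
    summary: str
    steps: list[str]

def agent_analyze(prompt: str) -> Recommendation:
    # Stand-in for the LLM-based agent: data in, recommendation out, no side effects.
    return Recommendation(summary=f"Plan for: {prompt}",
                          steps=["look up invoice", "update record", "send confirmation"])

def human_approves(rec: Recommendation) -> bool:
    # Stand-in for the human-in-the-loop approval step.
    print("Proposed:", rec.summary, rec.steps)
    return True  # in practice, a real approval UI or decision

def rpa_execute(steps: list[str]) -> None:
    # Deterministic, rules-based execution layer: the only place side effects happen.
    for step in steps:
        print("executing:", step)

def maestro_like_flow(prompt: str) -> None:
    rec = agent_analyze(prompt)
    if human_approves(rec):
        rpa_execute(rec.steps)
    else:
        print("rejected; nothing executed")

maestro_like_flow("Process the overdue invoice for customer 42")
```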


The ‘era of experience’ will unleash self-learning AI agents across the web—here’s how to prepare

David Silver and Richard Sutton, two renowned AI scientists, argue in a new paper that artificial intelligence is about to enter a new phase, the “Era of Experience.” This is a phase in which AI systems rely less and less on human-provided data and improve themselves by gathering data from, and interacting with, the world. While the paper is conceptual and forward-looking, it has direct implications for enterprises that aim to build with and for future AI agents and systems.

Both Silver and Sutton are seasoned scientists with a track record of making accurate predictions about the future of AI, and the validity of their predictions can be seen directly in today’s most advanced AI systems. In 2019, Sutton, a pioneer in reinforcement learning, wrote the famous essay “The Bitter Lesson,” in which he argues that the greatest long-term progress in AI consistently arises from leveraging large-scale computation with general-purpose search and learning methods, rather than relying primarily on incorporating complex, human-derived domain knowledge. David Silver, a senior scientist at DeepMind, was a key contributor to AlphaGo, AlphaZero and AlphaStar, all important achievements in deep reinforcement learning. He was also the co-author of a 2021 paper that claimed that reinforcement learning and a well-designed reward signal would be enough to create very advanced AI systems.

The most advanced large language models (LLMs) leverage both of those concepts. The wave of new LLMs that has conquered the AI scene since GPT-3 has primarily relied on scaling compute and data to internalize vast amounts of knowledge. The most recent wave of reasoning models, such as DeepSeek-R1, has demonstrated that reinforcement learning and a simple reward signal are sufficient for learning complex reasoning skills.

What is the era of experience?

The “Era of Experience” builds on the same concepts that Sutton and Silver have been discussing in recent years, and adapts them to recent advances in AI. The authors argue that the “pace of progress driven solely by supervised learning from human data is demonstrably slowing, signalling the need for a new approach.” That approach requires a new source of data, which must be generated in a way that continually improves as the agent becomes stronger. “This can be achieved by allowing agents to learn continually from their own experience, i.e., data that is generated by the agent interacting with its environment,” Sutton and Silver write. They argue that eventually, “experience will become the dominant medium of improvement and ultimately dwarf the scale of human data used in today’s systems.”

According to the authors, in addition to learning from their own experiential data, future AI systems will “break through the limitations of human-centric AI systems” across four dimensions:

Streams: Instead of working across disconnected episodes, AI agents will “have their own stream of experience that progresses, like humans, over a long time-scale.” This will allow agents to plan for long-term goals and adapt to new behavioral patterns over time. We can see glimmers of this in AI systems that have very long context windows and memory architectures that continuously update based on user interactions.

Actions and observations: Instead of focusing on human-privileged actions and observations, agents in the era of experience will act autonomously in the real world.
Examples of this are agentic systems that can interact with external applications and resources through tools such as computer use and the Model Context Protocol (MCP).

Rewards: Current reinforcement learning systems mostly rely on human-designed reward functions. In the future, AI agents should be able to design their own dynamic reward functions that adapt over time and match user preferences with real-world signals gathered from the agent’s actions and observations in the world. We’re seeing early versions of self-designing rewards in systems such as Nvidia’s DrEureka.

Planning and reasoning: Current reasoning models have been designed to imitate the human thought process. The authors argue that “More efficient mechanisms of thought surely exist, using non-human languages that may, for example, utilise symbolic, distributed, continuous, or differentiable computations.” AI agents should engage with the world, observe and use data to validate and update their reasoning process, and develop a world model.

The idea of AI agents that adapt themselves to their environment through reinforcement learning is not new. But previously, these agents were limited to very constrained environments such as board games. Today, agents that can interact with complex environments (e.g., AI computer use) and advances in reinforcement learning will overcome these limitations, bringing about the transition to the era of experience.

What does it mean for the enterprise?

Buried in Sutton and Silver’s paper is an observation that has important implications for real-world applications: “The agent may use ‘human-friendly’ actions and observations such as user interfaces, that naturally facilitate communication and collaboration with the user. The agent may also take ‘machine-friendly’ actions that execute code and call APIs, allowing the agent to act autonomously in service of its goals.”

The era of experience means that developers will have to build their applications not only for humans but also with AI agents in mind. Machine-friendly actions require building secure and accessible APIs that can easily be accessed directly or through interfaces such as MCP. It also means creating agents that can be made discoverable through protocols such as Google’s Agent2Agent. You will also need to design your APIs and agentic interfaces to provide access to both actions and observations, which will enable agents to gradually reason about and learn from their interactions with your applications.

If the vision that Sutton and Silver present becomes reality, there will soon be billions of agents roaming the web (and soon the physical world) to accomplish tasks. Their behaviors and needs will be very different from those of human users and developers, and having an agent-friendly way to interact with your application will improve your ability to leverage future AI systems (and also prevent the harms they can cause). “By building upon the foundations of RL and adapting its
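As a purely illustrative example of the “machine-friendly” API design discussed above, the sketch below exposes one action endpoint and one observation endpoint that an agent could call directly instead of scraping a UI. The endpoint names, fields and framework choice are assumptions for illustration, not something prescribed by the paper.

```python
# Hypothetical "agent-friendly" interface: every action an agent can take is a
# typed endpoint, and every observation it might need is readable without a UI.
# Names and fields are invented for illustration. Run with: uvicorn module:app
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Example agent-facing API")

class CreateOrder(BaseModel):
    sku: str
    quantity: int
    requested_by: str  # identifies the calling agent for auditing

@app.post("/actions/orders")
def create_order(order: CreateOrder) -> dict:
    # Machine-friendly action: the side effect is explicit, typed and auditable.
    order_id = f"order-{order.sku}-{order.quantity}"
    return {"order_id": order_id, "status": "pending_approval"}

@app.get("/observations/orders/{order_id}")
def get_order_status(order_id: str) -> dict:
    # Machine-friendly observation: the agent can check the outcome of its
    # action and use that signal to learn or re-plan.
    return {"order_id": order_id, "status": "pending_approval"}
```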


Meta unleashes Llama API running 18x faster than OpenAI: Cerebras partnership delivers 2,600 tokens per second

Meta today announced a partnership with Cerebras Systems to power its new Llama API, offering developers access to inference speeds up to 18 times faster than traditional GPU-based solutions. The announcement, made at Meta’s inaugural LlamaCon developer conference in Menlo Park, positions the company to compete directly with OpenAI, Anthropic, and Google in the rapidly growing AI inference service market, where developers purchase tokens by the billions to power their applications.

“Meta has selected Cerebras to collaborate to deliver the ultra-fast inference that they need to serve developers through their new Llama API,” said Julie Shin Choi, chief marketing officer at Cerebras, during a press briefing. “We at Cerebras are really, really excited to announce our first CSP hyperscaler partnership to deliver ultra-fast inference to all developers.”

The partnership marks Meta’s formal entry into the business of selling AI computation, transforming its popular open-source Llama models into a commercial service. While Meta’s Llama models have accumulated over one billion downloads, until now the company had not offered a first-party cloud infrastructure for developers to build applications with them.

“This is very exciting, even without talking about Cerebras specifically,” said James Wang, a senior executive at Cerebras. “OpenAI, Anthropic, Google — they’ve built an entire new AI business from scratch, which is the AI inference business. Developers who are building AI apps will buy tokens by the millions, by the billions sometimes. And these are just like the new compute instructions that people need to build AI applications.”

A benchmark chart shows Cerebras processing Llama 4 at 2,648 tokens per second, dramatically outpacing competitors SambaNova (747), Groq (600) and GPU-based services from Google and others — explaining Meta’s hardware choice for its new API. (Credit: Cerebras)

Breaking the speed barrier: How Cerebras supercharges Llama models

What sets Meta’s offering apart is the dramatic speed increase provided by Cerebras’ specialized AI chips. The Cerebras system delivers over 2,600 tokens per second for Llama 4 Scout, compared to approximately 130 tokens per second for ChatGPT and around 25 tokens per second for DeepSeek, according to benchmarks from Artificial Analysis.

“If you just compare on an API-to-API basis, Gemini and GPT, they’re all great models, but they all run at GPU speeds, which is roughly about 100 tokens per second,” Wang explained. “And 100 tokens per second is okay for chat, but it’s very slow for reasoning. It’s very slow for agents. And people are struggling with that today.”

This speed advantage enables entirely new categories of applications that were previously impractical, including real-time agents, conversational low-latency voice systems, interactive code generation, and instant multi-step reasoning — all of which require chaining multiple large language model calls that can now be completed in seconds rather than minutes.

The Llama API represents a significant shift in Meta’s AI strategy, transitioning the company from primarily being a model provider to becoming a full-service AI infrastructure company. By offering an API service, Meta is creating a revenue stream from its AI investments while maintaining its commitment to open models.
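To put that throughput gap in perspective, here is a rough back-of-the-envelope calculation of how long a chained, multi-step agent workflow would take at the token rates cited above. Only the tokens-per-second figures come from the article; the workload sizes (number of steps, tokens per step) are invented for illustration.

```python
# Rough latency estimate for a multi-step agent workflow, using the throughput
# figures quoted in the article. Workload numbers are hypothetical.
def chain_latency_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Total generation time if each step must finish before the next starts."""
    return steps * tokens_per_step / tokens_per_second

WORKLOAD = {"steps": 5, "tokens_per_step": 500}  # hypothetical agent pipeline

for label, tps in [("GPU-class (~100 tok/s)", 100), ("ChatGPT (~130 tok/s)", 130),
                   ("Cerebras Llama 4 Scout (~2,600 tok/s)", 2600)]:
    t = chain_latency_seconds(**WORKLOAD, tokens_per_second=tps)
    print(f"{label}: {t:.1f} s")
# GPU-class (~100 tok/s): 25.0 s
# ChatGPT (~130 tok/s): 19.2 s
# Cerebras Llama 4 Scout (~2,600 tok/s): 1.0 s
```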
“Meta is now in the business of selling tokens, and it’s great for the American kind of AI ecosystem,” Wang noted during the press conference. “They bring a lot to the table.”

The API will offer tools for fine-tuning and evaluation, starting with the Llama 3.3 8B model, allowing developers to generate data, train on it, and test the quality of their custom models. Meta emphasizes that it won’t use customer data to train its own models, and models built using the Llama API can be transferred to other hosts — a clear differentiation from some competitors’ more closed approaches.

Cerebras will power Meta’s new service through its network of data centers located throughout North America, including facilities in Dallas, Oklahoma, Minnesota, Montreal, and California. “All of our data centers that serve inference are in North America at this time,” Choi explained. “We will be serving Meta with the full capacity of Cerebras. The workload will be balanced across all of these different data centers.”

The business arrangement follows what Choi described as “the classic compute provider to a hyperscaler” model, similar to how Nvidia provides hardware to major cloud providers. “They are reserving blocks of our compute that they can serve their developer population,” she said. Beyond Cerebras, Meta has also announced a partnership with Groq to provide fast inference options, giving developers multiple high-performance alternatives beyond traditional GPU-based inference.

Meta’s entry into the inference API market with superior performance metrics could potentially disrupt the established order dominated by OpenAI, Google, and Anthropic. By combining the popularity of its open-source models with dramatically faster inference capabilities, Meta is positioning itself as a formidable competitor in the commercial AI space. “Meta is in a unique position with 3 billion users, hyper-scale datacenters, and a huge developer ecosystem,” according to Cerebras’ presentation materials. The integration of Cerebras technology “helps Meta leapfrog OpenAI and Google in performance by approximately 20x.”

For Cerebras, this partnership represents a major milestone and a validation of its specialized AI hardware approach. “We have been building this wafer-scale engine for years, and we always knew that the technology’s first-rate, but ultimately it has to end up as part of someone else’s hyperscale cloud. That was the final target from a commercial strategy perspective, and we have finally reached that milestone,” Wang said.

The Llama API is currently available as a limited preview, with Meta planning a broader rollout in the coming weeks and months. Developers interested in accessing the ultra-fast Llama 4 inference can request early access by selecting Cerebras from the model options within the Llama API. “If you imagine a developer who doesn’t know anything about Cerebras because we’re a relatively small company, they can just click two buttons on Meta’s standard software SDK, generate an API key, select the Cerebras flag, and then all of a sudden, their tokens are being processed on a giant wafer-scale engine,” Wang


Salesforce takes aim at ‘jagged intelligence’ in push for more reliable AI

Salesforce is tackling one of artificial intelligence’s most persistent challenges for business applications: the gap between an AI system’s raw intelligence and its ability to consistently perform in unpredictable enterprise environments — what the company calls “jagged intelligence.”

In a comprehensive research announcement today, Salesforce AI Research revealed several new benchmarks, models, and frameworks designed to make future AI agents more intelligent, trusted, and versatile for enterprise use. The innovations aim to improve both the capabilities and consistency of AI systems, particularly when deployed as autonomous agents in complex business settings.

“While LLMs may excel at standardized tests, plan intricate trips, and generate sophisticated poetry, their brilliance often stumbles when faced with the need for reliable and consistent task execution in dynamic, unpredictable enterprise environments,” said Silvio Savarese, Salesforce’s Chief Scientist and Head of AI Research, during a press conference preceding the announcement.

The initiative represents Salesforce’s push toward what Savarese calls “Enterprise General Intelligence” (EGI) — AI designed specifically for business complexity rather than the more theoretical pursuit of Artificial General Intelligence (AGI). “We define EGI as purpose-built AI agents for business optimized not just for capability, but for consistency, too,” Savarese explained. “While AGI may conjure images of superintelligent machines surpassing human intelligence, businesses aren’t waiting for that distant, illusory future. They’re applying these foundational concepts now to solve real-world challenges at scale.”

How Salesforce is measuring and fixing AI’s inconsistency problem in enterprise settings

A central focus of the research is quantifying and addressing AI’s inconsistency in performance. Salesforce introduced the SIMPLE dataset, a public benchmark featuring 225 straightforward reasoning questions designed to measure how jagged an AI system’s capabilities really are. “Today’s AI is jagged, so we need to work on that. But how can we work on something without measuring it first? That’s exactly what this SIMPLE benchmark is,” explained Shelby Heinecke, Senior Manager of Research at Salesforce, during the press conference.

For enterprise applications, this inconsistency isn’t merely an academic concern. A single misstep from an AI agent could disrupt operations, erode customer trust, or inflict substantial financial damage. “For businesses, AI isn’t a casual pastime; it’s a mission-critical tool that requires unwavering predictability,” Savarese noted in his commentary.

Inside CRMArena: Salesforce’s virtual testing ground for enterprise AI agents

Perhaps the most significant innovation is CRMArena, a novel benchmarking framework designed to simulate realistic customer relationship management scenarios. It enables comprehensive testing of AI agents in professional contexts, addressing the gap between academic benchmarks and real-world business requirements. “Recognizing that current AI models often fall short in reflecting the intricate demands of enterprise environments, we’ve introduced CRMArena: a novel benchmarking framework meticulously designed to simulate realistic, professionally grounded CRM scenarios,” Savarese said.
The framework evaluates agent performance across three key personas: service agents, analysts, and managers. Early testing revealed that even with guided prompting, leading agents succeed less than 65% of the time at function-calling for these personas’ use cases. “The CRM arena essentially is a tool that’s been introduced internally for improving agents,” Savarese explained. “It allows us to stress test these agents, understand when they’re failing, and then use these lessons we learn from those failure cases to improve our agents.”

New embedding models that understand enterprise context better than ever before

Among the technical innovations announced, Salesforce highlighted SFR-Embedding, a new model for deeper contextual understanding that leads the Massive Text Embedding Benchmark (MTEB) across 56 datasets. “SFR embedding is not just research. It’s coming to Data Cloud very, very soon,” Heinecke noted. A specialized version, SFR-Embedding-Code, was also introduced for developers, enabling high-quality code search and streamlining development. According to Salesforce, the 7B parameter version leads the Code Information Retrieval (CoIR) benchmark, while smaller models (400M, 2B) offer efficient, cost-effective alternatives.

Why smaller, action-focused AI models may outperform larger language models for business tasks

Salesforce also announced xLAM V2 (Large Action Model), a family of models specifically designed to predict actions rather than just generate text. These models start at just 1 billion parameters — a fraction of the size of many leading language models. “What’s special about our xLAM models is that if you look at our model sizes, we’ve got a 1B model, and we go all the way up to a 70B model. That 1B model, for example, is a fraction of the size of many of today’s large language models,” Heinecke explained. “This small model packs just so much power into the ability to take the next action.”

Unlike standard language models, these action models are specifically trained to predict and execute the next steps in a task sequence, making them particularly valuable for autonomous agents that need to interact with enterprise systems. “Large action models are LLMs under the hood, and the way we build them is we take an LLM and we fine-tune it on what we call action trajectories,” Heinecke added.

Enterprise AI safety: How Salesforce’s trust layer establishes guardrails for business use

To address enterprise concerns about AI safety and reliability, Salesforce introduced SFR-Guard, a family of models trained on both publicly available data and CRM-specialized internal data. These models strengthen the company’s Trust Layer, which provides guardrails for AI agent behavior. “Agentforce’s guardrails establish clear boundaries for agent behavior based on business needs, policies, and standards, ensuring agents act within predefined limits,” the company stated in its announcement.

The company also launched ContextualJudgeBench, a novel benchmark for evaluating LLM-based judge models in context — testing over 2,000 challenging response pairs for accuracy, conciseness, faithfulness, and appropriate refusal to answer.

Looking beyond text, Salesforce unveiled TACO, a multimodal action model family designed to tackle complex, multi-step problems through chains of thought-and-action (CoTA). This approach enables AI to interpret and respond to intricate queries involving multiple media types, with Salesforce claiming up to 20% improvement on the challenging MMVet benchmark.
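To make the idea of fine-tuning on “action trajectories” more concrete, here is a hedged sketch of what a single training example for an action model might look like: a task, the intermediate observations and tool calls, and the outcome, turned into next-action prediction pairs. The JSON structure and field names are assumptions for illustration, not Salesforce’s actual training format.

```python
# Illustrative shape of an "action trajectory" training example for a large
# action model (LAM). Field names are hypothetical, not Salesforce's schema.
import json

trajectory = {
    "task": "Refund order #1234 and notify the customer",
    "steps": [
        {"observation": "Order #1234 found, status=delivered",
         "action": {"tool": "issue_refund", "args": {"order_id": "1234"}}},
        {"observation": "Refund issued, confirmation id R-77",
         "action": {"tool": "send_email",
                    "args": {"to": "customer@example.com",
                             "template": "refund_confirmation"}}},
    ],
    "outcome": "success",
}

# Fine-tuning pairs: given the task plus the history so far, predict the next action.
for i, step in enumerate(trajectory["steps"]):
    prompt = {"task": trajectory["task"],
              "history": trajectory["steps"][:i],
              "observation": step["observation"]}
    target = step["action"]
    print(json.dumps({"prompt": prompt, "target": target})[:120], "...")
```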
Co-innovation in action: How customer feedback shapes Salesforce’s enterprise AI roadmap

Itai Asseo, Senior Director of Incubation and


Writer releases Palmyra X5, delivers near GPT-4.1 performance at 75% lower cost

Writer, the enterprise generative AI company valued at $1.9 billion, today released Palmyra X5, a new large language model (LLM) featuring an expansive 1-million-token context window that promises to accelerate the adoption of autonomous AI agents in corporate environments. The San Francisco-based company, which counts Accenture, Marriott, Uber, and Vanguard among its hundreds of enterprise customers, has positioned the model as a cost-efficient alternative to offerings from industry giants like OpenAI and Anthropic, with pricing set at $0.60 per million input tokens and $6 per million output tokens.

“This model really unlocks the agentic world,” said Matan-Paul Shetrit, Director of Product at Writer, in an interview with VentureBeat. “It’s faster and cheaper than equivalent large context window models out there like GPT-4.1, and when you combine it with the large context window and the model’s ability to do tool or function calling, it allows you to start really doing things like multi-step agentic flows.”

A comparison of AI model efficiency showing Writer’s Palmyra X5 achieving nearly 20% accuracy on OpenAI’s MRCR benchmark at approximately $0.60 per million tokens, positioning it favorably against more expensive models like GPT-4.1 and GPT-4o (right) that cost over $2.00 per million tokens. (Credit: Writer)

AI economics breakthrough: How Writer trained a powerhouse model for just $1 million

Unlike many competitors, Writer trained Palmyra X5 with synthetic data for approximately $1 million in GPU costs — a fraction of what other leading models require. This cost efficiency represents a significant departure from the prevailing industry approach of spending tens or hundreds of millions on model development. “Our belief is that tokens in general are becoming cheaper and cheaper, and the compute is becoming cheaper and cheaper,” Shetrit explained. “We’re here to solve real problems, rather than nickel and diming our customers on the pricing.”

The company’s cost advantage stems from proprietary techniques developed over several years. In 2023, Writer published research on “becoming self-instruct,” which introduced early stopping criteria for minimal instruct tuning. According to Shetrit, this allows Writer to “cut costs significantly” during the training process. “Unlike other foundational shops, our view is that we need to be effective. We need to be efficient here,” Shetrit said. “We need to provide the fastest, cheapest models to our customers, because ROI really matters in these cases.”

Million-token marvel: The technical architecture powering Palmyra X5’s speed and accuracy

Palmyra X5 can process a full million-token prompt in approximately 22 seconds and execute multi-turn function calls in around 300 milliseconds — performance metrics that Writer claims enable “agent behaviors that were previously cost- or time-prohibitive.” The model’s architecture incorporates two key technical innovations: a hybrid attention mechanism and a mixture-of-experts approach. “The hybrid attention mechanism…introduces an attention mechanism inside the model that allows it to focus on the relevant parts of the inputs when generating each output,” Shetrit said. This approach accelerates response generation while maintaining accuracy across the extensive context window.
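For a sense of what the published pricing and timing figures imply in practice, here is a small, purely illustrative cost calculation. Only the per-token prices come from the article; the workload sizes are assumptions chosen to show the arithmetic.

```python
# Back-of-the-envelope cost for calls at the published Palmyra X5 prices.
# The workload mix (prompt and completion sizes) is hypothetical.
INPUT_PRICE_PER_M = 0.60   # USD per million input tokens
OUTPUT_PRICE_PER_M = 6.00  # USD per million output tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# One full-context request: a 1M-token prompt with a 2,000-token answer.
print(f"Full-context call: ${call_cost(1_000_000, 2_000):.4f}")   # ≈ $0.6120
# A hypothetical agentic flow of 20 short tool calls (5k in / 500 out each).
print(f"20-step agent flow: ${20 * call_cost(5_000, 500):.4f}")   # ≈ $0.1200
```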
Palmyra X5’s hybrid attention architecture processes massive inputs through specialized decoder blocks, enabling efficient handling of million-token contexts. (Credit: Writer)

On benchmark tests, Palmyra X5 achieved notable results relative to its cost. On OpenAI’s MRCR 8-needle test — which challenges models to find eight identical requests hidden in a massive conversation — Palmyra X5 scored 19.1%, compared to 20.25% for GPT-4.1 and 17.63% for GPT-4o. It also places eighth in coding on the BigCodeBench benchmark with a score of 48.7. These benchmarks demonstrate that while Palmyra X5 may not lead every performance category, it delivers near-flagship capabilities at significantly lower costs — a trade-off that Writer believes will resonate with enterprise customers focused on ROI.

From chatbots to business automation: How AI agents are transforming enterprise workflows

The release of Palmyra X5 comes shortly after Writer unveiled AI HQ earlier this month — a centralized platform for enterprises to build, deploy, and supervise AI agents. This dual product strategy positions Writer to capitalize on growing enterprise demand for AI that can execute complex business processes autonomously. “In the age of agents, models offering less than 1 million tokens of context will quickly become irrelevant for business-critical use cases,” said Writer CTO and co-founder Waseem AlShikh in a statement.

Shetrit elaborated on this point: “For a long time, there’s been a large gap between the promise of AI agents and what they could actually deliver. But at Writer, we’re now seeing real-world agent implementations with major enterprise customers. And when I say real customers, it’s not like a travel agent use case. I’m talking about Global 2000 companies, solving the gnarliest problems in their business.”

Early adopters are deploying Palmyra X5 for various enterprise workflows, including financial reporting, RFP responses, support documentation, and customer feedback analysis. One particularly compelling use case involves multi-step agentic workflows, where an AI agent can flag outdated content, generate suggested revisions, share them for human approval, and automatically push approved updates to a content management system. This shift from simple text generation to process automation represents a fundamental evolution in how enterprises deploy AI — moving from augmenting human work to automating entire business functions.

Writer’s Palmyra X5 offers an 8x increase in context window size over its predecessor, allowing it to process the equivalent of 1,500 pages at once. (Credit: Writer)

Cloud expansion strategy: AWS partnership brings Writer’s AI to millions of enterprise developers

Alongside the model release, Writer announced that both Palmyra X5 and its predecessor, Palmyra X4, are now available in Amazon Bedrock, Amazon Web Services’ fully managed service for accessing foundation models. AWS becomes the first cloud provider to deliver fully managed models from Writer, significantly expanding the company’s potential reach. “Seamless access to Writer’s Palmyra X5 will enable developers and enterprises to build and scale AI agents and transform how they reason over vast amounts of enterprise data — leveraging the security, scalability, and performance of AWS,” said Atul Deo, Director of Amazon Bedrock at AWS, in the


More accurate coding: Researchers adapt Sequential Monte Carlo for AI-generated code

Coding with the help of AI models continues to gain popularity, but many have highlighted issues that arise when developers rely on coding assistants. However, researchers from MIT, McGill University, ETH Zurich, Johns Hopkins University, Yale and the Mila-Quebec Artificial Intelligence Institute have developed a new method for ensuring that AI-generated code is more accurate and useful. The method spans various programming languages and instructs the large language model (LLM) to adhere to the rules of each language.

The group found that, by adapting new sampling methods, AI models can be guided to follow programming language rules, and that this can even raise the performance of small language models (SLMs), which are typically used for code generation, above that of large language models. In the paper, the researchers used Sequential Monte Carlo (SMC) to “tackle a number of challenging semantic parsing problems, guiding generation with incremental static and dynamic analysis.” Sequential Monte Carlo refers to a family of algorithms that help figure out solutions to filtering problems.

João Loula, co-lead author of the paper, said in an interview with MIT’s campus paper that the method “could improve programming assistants, AI-powered data analysis and scientific discovery tools.” It can also cut compute costs and be more efficient than reranking methods.

The researchers noted that AI-generated code can be powerful, but it can also often lead to code that disregards the semantic rules of programming languages. Other methods to prevent this can distort models or are too time-consuming. Their method makes the LLM adhere to programming language rules by discarding, early in the process, code outputs that are unlikely to work, and by “allocat[ing] efforts towards outputs that are most likely to be valid and accurate.”

Adapting SMC to code generation

The researchers developed an architecture that brings SMC to code generation “under diverse syntactic and semantic constraints.” “Unlike many previous frameworks for constrained decoding, our algorithm can integrate constraints that cannot be incrementally evaluated over the entire token vocabulary, as well as constraints that can only be evaluated at irregular intervals during generation,” the researchers said in the paper.

Key features of adapting SMC sampling to model generation include a proposal distribution, in which token-by-token sampling is guided by cheap constraints; importance weights, which correct for biases; and resampling, which reallocates compute effort toward promising partial generations.

The researchers noted that while SMC can guide models toward more correct and useful code, the method may still have some problems. “While importance sampling addresses several shortcomings of local decoding, it too suffers from a major weakness: weight corrections and expensive potentials are not integrated until after a complete sequence has been generated from the proposal. This is even though critical information about whether a sequence can satisfy a constraint is often available much earlier and can be used to avoid large amounts of unnecessary computation,” they said.

Model testing

To prove their theory, Loula and his team ran experiments to see if using SMC to engineer more accurate code works.
These experiments were:

Python code generation on data science tasks, which used Llama 3 70B to code line by line and test early versions

Text-to-SQL generation with Llama 3 8B-Instruct

Goal inference in planning tasks to predict an agent’s goal condition, which also used Llama 3 8B

Molecular synthesis for drug discovery

They found that using SMC improved the accuracy and robustness of small language models, allowing them to outperform larger models.

Why is it important?

AI models have made engineers and other coders work faster and more efficiently. They have also given rise to a whole new kind of software engineer: the vibe coder. But there have been concerns over code quality, a lack of support for more complex coding and the compute costs of simple code generation. New methods, such as adapting SMC, may make AI-powered coding more useful and enable engineers to trust the code generated by models more.

Other companies have explored ways to improve AI-generated code. Together AI and Agentica released DeepCoder-14B, which harnesses fewer parameters. Google also improved its Code Assist feature to help enhance code quality.
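The proposal/importance-weight/resampling loop described above can be illustrated with a toy sketch. Everything below (the stand-in “language model,” the balanced-parenthesis constraint, the particle count) is invented for illustration; it shows the general shape of SMC-guided decoding, not the authors’ implementation.

```python
# Minimal, self-contained sketch of SMC-guided generation: sample particles
# token by token from a proposal that respects a cheap incremental constraint,
# reweight with importance weights, and resample to focus compute on promising
# prefixes. The "LM" and constraint are toys, not the paper's system.
import random

VOCAB = ["(", ")", "x", "+", "<eos>"]

def toy_lm_probs(prefix):
    # Stand-in for an LLM's next-token distribution (uniform here).
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def cheap_constraint_ok(prefix):
    # Incremental static check: never more ")" than "(" so far.
    depth = 0
    for tok in prefix:
        depth += tok == "("
        depth -= tok == ")"
        if depth < 0:
            return False
    return True

def expensive_potential(prefix):
    # Sparse/late signal: completed sequences score 1.0 only if parens balance.
    if prefix and prefix[-1] == "<eos>":
        body = prefix[:-1]
        return 1.0 if body.count("(") == body.count(")") else 0.0
    return 1.0  # neutral until the sequence is complete

def smc_generate(num_particles=8, max_len=6):
    particles = [[] for _ in range(num_particles)]
    weights = [1.0] * num_particles
    for _ in range(max_len):
        for i, prefix in enumerate(particles):
            if prefix and prefix[-1] == "<eos>":
                continue  # this particle is finished
            probs = toy_lm_probs(prefix)
            # Proposal: mask tokens that violate the cheap incremental constraint.
            allowed = {t: p for t, p in probs.items() if cheap_constraint_ok(prefix + [t])}
            total = sum(allowed.values())
            tok = random.choices(list(allowed), weights=[p / total for p in allowed.values()])[0]
            prefix.append(tok)
            # Importance weight corrects for the probability mass the proposal
            # discarded and folds in potentials that become available now.
            weights[i] *= total * expensive_potential(prefix)
        if sum(weights) == 0:
            break
        # Resample: clone high-weight prefixes, drop hopeless ones.
        particles = [list(p) for p in random.choices(particles, weights=weights, k=num_particles)]
        weights = [1.0] * num_particles
    return particles

print(smc_generate())
```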


DeepSeek’s success shows why motivation is key to AI innovation

January 2025 shook the AI landscape. The seemingly unstoppable OpenAI and the powerful American tech giants were shocked by what we can certainly call an underdog in the area of large language models (LLMs). DeepSeek, a Chinese firm not on anyone’s radar, suddenly challenged OpenAI. It is not that DeepSeek-R1 was better than the top models from American giants; it was slightly behind in terms of the benchmarks, but it suddenly made everyone think about efficiency in terms of hardware and energy usage.

Given the unavailability of the best high-end hardware, it seems that DeepSeek was motivated to innovate in the area of efficiency, which was a lesser concern for larger players. OpenAI has claimed it has evidence suggesting DeepSeek may have used its model for training, but we have no concrete proof to support this. So, whether it is true or OpenAI is simply trying to appease its investors is a topic of debate. However, DeepSeek has published its work, and people have verified that the results are reproducible, at least on a much smaller scale.

But how could DeepSeek attain such cost savings while American companies could not? The short answer is simple: They had more motivation. The long answer requires a little more technical explanation.

DeepSeek used KV-cache optimization

One important cost saving for GPU memory was optimization of the key-value (KV) cache used in every attention layer in an LLM. LLMs are made up of transformer blocks, each of which comprises an attention layer followed by a regular vanilla feed-forward network. The feed-forward network conceptually models arbitrary relationships, but in practice, it is difficult for it to always determine patterns in the data. The attention layer solves this problem for language modeling.

The model processes texts using tokens, but for simplicity, we will refer to them as words. In an LLM, each word gets assigned a high-dimensional vector (say, a thousand dimensions). Conceptually, each dimension represents a concept, like being hot or cold, being green, being soft, being a noun. A word’s vector representation is its meaning, with values along each dimension.

However, our language allows other words to modify the meaning of each word. For example, an apple has a meaning. But we can have a green apple as a modified version. A more extreme example of modification would be that an apple in an iPhone context differs from an apple in a meadow context. How do we let our system modify the vector meaning of a word based on another word? This is where attention comes in.

The attention model assigns two other vectors to each word: a key and a query. The query represents the qualities of a word’s meaning that can be modified, and the key represents the type of modifications it can provide to other words. For example, the word ‘green’ can provide information about color and green-ness. So, the key of the word ‘green’ will have a high value on the ‘green-ness’ dimension. On the other hand, the word ‘apple’ can be green or not, so the query vector of ‘apple’ would also have a high value for the green-ness dimension.
If we take the dot product of the key of ‘green’ with the query of ‘apple,’ the product should be relatively large compared to the product of the key of ‘table’ and the query of ‘apple.’ The attention layer then adds a small fraction of the value of the word ‘green’ to the value of the word ‘apple.’ This way, the value of the word ‘apple’ is modified to be a little greener.

When the LLM generates text, it does so one word after another. When it generates a word, all the previously generated words become part of its context. However, the keys and values of those words are already computed. When another word is added to the context, its value needs to be updated based on its query and the keys and values of all the previous words. That’s why all those values are stored in the GPU memory. This is the KV cache.

DeepSeek determined that the key and the value of a word are related. So, the meaning of the word green and its ability to affect greenness are obviously very closely related. So, it is possible to compress both as a single (and maybe smaller) vector and decompress while processing very easily. DeepSeek has found that it does affect its performance on benchmarks, but it saves a lot of GPU memory.

DeepSeek applied MoE

The nature of a neural network is that the entire network needs to be evaluated (or computed) for every query. However, not all of this is useful computation. Knowledge of the world sits in the weights or parameters of a network. Knowledge about the Eiffel Tower is not used to answer questions about the history of South American tribes. Knowing that an apple is a fruit is not useful while answering questions about the general theory of relativity. However, when the network is computed, all parts of the network are processed regardless. This incurs huge computation costs during text generation that should ideally be avoided.

This is where the idea of the mixture-of-experts (MoE) comes in. In an MoE model, the neural network is divided into multiple smaller networks called experts. Note that the ‘expert’ in the subject matter is not explicitly defined; the network figures it out during training. However, the network assigns some relevance score to each query and only activates the parts with higher matching scores. This provides huge cost savings in computation. Note that some questions need expertise in multiple areas to be answered properly, and the performance of such queries will be degraded. However, because the areas are figured out from the data, the number of such questions is minimised.

The importance of reinforcement learning

An LLM is taught to think
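To make the KV-cache mechanism described above concrete, here is a toy single-head attention loop with a cache: each generated token’s key and value are computed once, stored, and reused for every later token. All weights and sizes are random stand-ins; DeepSeek’s actual optimization, compressing keys and values into a smaller shared latent before caching, is only noted in a comment, not implemented.

```python
# Toy single-head attention with a KV cache. Dimensions and weights are random
# toys; DeepSeek's technique additionally compresses K and V into a smaller
# shared latent vector before caching, which is not shown here.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy embedding size
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []               # the KV cache: grows by one entry per token

def attend(new_token_embedding):
    """Process one new token, reusing cached keys/values of all previous tokens."""
    q = W_q @ new_token_embedding
    k_cache.append(W_k @ new_token_embedding)   # computed once, never recomputed
    v_cache.append(W_v @ new_token_embedding)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(K @ q / np.sqrt(d))        # dot products: how much each past word modifies this one
    return scores @ V                            # weighted mix of cached values

for step in range(4):                            # pretend we generate 4 tokens
    out = attend(rng.normal(size=d))
    print(f"step {step}: cache holds {len(k_cache)} keys/values, output shape {out.shape}")
```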


Structify raises $4.1M seed to turn unstructured web data into enterprise-ready datasets

A Brooklyn-based startup is taking aim at one of the most notorious pain points in the world of artificial intelligence and data analytics: the painstaking process of data preparation. Structify emerged from stealth mode today, announcing its public launch alongside $4.1 million in seed funding led by Bain Capital Ventures, with participation from 8VC, Integral Ventures and strategic angel investors. The company’s platform uses a proprietary visual language model called DoRa to automate the gathering, cleaning, and structuring of data — a process that typically consumes up to 80% of data scientists’ time, according to industry surveys.

“The volume of information available today has absolutely exploded,” said Ronak Gandhi, co-founder of Structify, in an exclusive interview with VentureBeat. “We’ve hit a major inflection point in data availability, which is both a blessing and a curse. While we have unprecedented access to information, it remains largely inaccessible because it’s so difficult to convert into the right format for making meaningful business decisions.”

Structify’s approach reflects a growing industry-wide focus on solving what data experts call “the data preparation bottleneck.” Gartner research indicates that inadequate data preparation remains one of the primary obstacles to successful AI implementation, with four of five businesses lacking the data foundations necessary to fully capitalize on generative AI.

How AI-powered data transformation is unlocking hidden business intelligence at scale

At its core, Structify allows users to create custom datasets by specifying the data schema, selecting sources, and deploying AI agents to extract that data. The platform can handle everything from SEC filings and LinkedIn profiles to news articles and specialized industry documents. What sets Structify apart, according to Gandhi, is its in-house model DoRa, which navigates the web like a human would. “It’s super high-quality. It navigates and interacts with stuff just like a person would,” Gandhi explained. “So we’re talking about human quality — that’s the first and foremost center of the principles behind DoRa. It reads the internet the way a human would.”

This approach allows Structify to support a free tier, which Gandhi believes will help democratize access to structured data. “The way in which you think about data now is, it’s this really precious object,” Gandhi said. “This really precious thing that you spend so much time finagling and getting and wrestling around, and when you have it, you’re like, ‘Oh, if someone was to delete it, I would cry.’” Structify’s vision is to “commoditize data” — making it something that can be easily recreated if lost.

From finance to construction: How businesses are deploying custom datasets to solve industry-specific challenges

The company has already seen adoption across multiple sectors. Finance teams use it to extract information from pitch decks, construction companies turn complex geotechnical documents into readable tables, and sales teams gather real-time organizational charts for their accounts.
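Conceptually, the workflow described above (define a schema, pick sources, send agents to fill it) might look something like the hypothetical sketch below. Every class, field, method and URL here is invented for illustration and is not Structify’s actual SDK.

```python
# Hypothetical illustration of a "schema + sources + agents" dataset spec.
# None of these names correspond to Structify's real API.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    description: str = ""

@dataclass
class DatasetSpec:
    name: str
    columns: list[Column]
    sources: list[str] = field(default_factory=list)

spec = DatasetSpec(
    name="geotech_reports",
    columns=[
        Column("site", "str", "Project site name"),
        Column("soil_type", "str"),
        Column("bearing_capacity_kpa", "float"),
    ],
    sources=["https://example.com/reports"],  # placeholder source
)

def run_extraction(spec: DatasetSpec) -> list[dict]:
    # In a real system an agent would read each source and fill the schema;
    # here we just return an empty, well-typed row.
    return [{c.name: None for c in spec.columns}]

print(run_extraction(spec))
```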
Slater Stich, partner at Bain Capital Ventures, highlighted this versatility in the funding announcement: “Every company I’ve ever worked with has a handful of data sources that are both extremely important and a huge pain to work with, whether that’s figures buried in PDFs, scattered across hundreds of web pages, hidden behind an enterprise SOAP API, etc.”

The diversity of Structify’s early customer base reflects the universal nature of data preparation challenges. According to TechTarget research, data preparation typically involves a series of labor-intensive steps: collection, discovery, profiling, cleansing, structuring, transformation, and validation — all before any actual analysis can begin.

Why human expertise remains crucial for AI accuracy: Inside Structify’s ‘quadruple verification’ system

A key differentiator for Structify is its “quadruple verification” process, which combines AI with human oversight. This approach addresses a critical concern in AI development: ensuring accuracy. “Whenever a user sees something that’s suspicious, or we identify some data as potentially suspicious, we can send it to an expert in that specific use case,” Gandhi explained. “That expert can act in the same way as [DoRa], navigate to the right piece of information, extract it, save it, and then verify if it’s right.”

This process not only corrects the data but also creates training examples that improve the model’s performance over time, especially in specialized domains like construction or pharmaceutical research. “Those things are so messy,” Gandhi noted. “I never thought in my life I would have a strong understanding of geology. But there we are, and that, I think, is a huge strength – being able to learn from these experts and put it directly into DoRa.”

As data extraction tools become more powerful, privacy concerns inevitably arise. Structify has implemented safeguards to address these issues. “We don’t do any authentication, anything that requires a login, anything that requires you to go behind some sense of information – our agent doesn’t do that because that’s a privacy concern,” Gandhi said. The company also prioritizes transparency by providing direct sourcing information. “If you’re interested in learning more about a particular piece of information, you go directly to that content and see it, as opposed to kind of legacy providers where it’s this black box.”

Structify enters a competitive landscape that includes both established players and other startups addressing various aspects of the data preparation challenge. Companies like Alteryx, Informatica, Microsoft, and Tableau all offer data preparation capabilities, while several specialists have been acquired in recent years. What differentiates Structify, according to CEO Alex Reichenbach, is its combination of speed and accuracy. A recent LinkedIn post by Reichenbach claimed the company had sped up its agent “10x while cutting cost ~16x” through model optimization and infrastructure improvements.

The company’s launch comes amid growing interest in AI-powered data automation. According to a TechTarget report, automating data preparation “is frequently cited as one of the major investment areas for data and analytics teams,” with augmented data preparation capabilities becoming increasingly important.

How frustrating data preparation experiences inspired two friends to revolutionize the industry

For Gandhi, Structify addresses problems he faced firsthand in previous roles.


Breaking the ‘intellectual bottleneck’: How AI is computing the previously uncomputible in healthcare

Whenever a patient gets a CT scan at the University of Texas Medical Branch (UTMB), the resulting images are automatically sent to the cardiology department, analyzed by AI and assigned a cardiac risk score. In just a few months, thanks to a simple algorithm, AI has flagged several patients at high cardiovascular risk. The CT scan doesn’t have to be related to the heart; the patient doesn’t have to have heart problems. Every scan automatically triggers an evaluation. It is straightforward preventative care enabled by AI, allowing the medical facility to finally start utilizing its vast amounts of data.

“The data is just sitting out there,” Peter McCaffrey, UTMB’s chief AI officer, told VentureBeat. “What I love about this is that AI doesn’t have to do anything superhuman. It’s performing a low intellect task, but at very high volume, and that still provides a lot of value, because we’re constantly finding things that we miss.” He acknowledged, “We know we miss stuff. Before, we just didn’t have the tools to go back and find it.”

How AI helps UTMB determine cardiovascular risk

Like many healthcare facilities, UTMB is applying AI across a number of areas. One of its first use cases is cardiac risk screening. Models have been trained to scan for incidental coronary artery calcification (iCAC), a strong predictor of cardiovascular risk. The goal is to identify patients susceptible to heart disease who may have otherwise been overlooked because they exhibit no obvious symptoms, McCaffrey explained.

Through the screening program, every CT scan completed at the facility is automatically analyzed using AI to detect coronary calcification. The scan doesn’t have to have anything to do with cardiology; it could be ordered due to a spinal fracture or an abnormal lung nodule. The scans are fed into an image-based convolutional neural network (CNN) that calculates an Agatston score, which represents the accumulation of plaque in the patient’s arteries. Typically, this would be calculated by a human radiologist, McCaffrey explained.

From there, the AI allocates patients with an iCAC score at or above 100 into three ‘risk tiers’ based on additional information (such as whether they are on a statin or have ever had a visit with a cardiologist). McCaffrey explained that this assignment is rules-based and can draw from discrete values within the electronic health record (EHR), or the AI can determine values by processing free text, such as clinical visit notes, using GPT-4o.

Patients flagged with a score of 100 or more, with no known history of cardiology visitation or therapy, are automatically sent digital messages. The system also sends a note to their primary physician. Patients identified as having more severe iCAC scores of 300 or higher also receive a phone call. McCaffrey explained that almost everything is automated, except for the phone call; however, the facility is actively piloting tools in the hopes of also automating voice calls. The only area where humans remain in the loop is in confirming the AI-derived calcium score and the risk tier before proceeding with automated notification.

Since launching the program in late 2024, the medical facility has evaluated approximately 450 scans per month, with five to ten of these cases identified as high-risk and requiring intervention each month, McCaffrey reported.
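Based on the workflow described above, here is a hedged sketch of the rules-based triage step. The 100 and 300 thresholds and the outreach actions come from the article; the record fields and notification names are hypothetical placeholders, and the real system confirms the score and tier with a human before any notification goes out.

```python
# Illustrative rules-based triage for AI-derived coronary calcium (Agatston)
# scores, following the thresholds described in the article. Record fields and
# notification actions are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Patient:
    icac_score: float            # AI-derived Agatston score from the CT scan
    on_statin: bool
    has_seen_cardiologist: bool

def triage(p: Patient) -> list[str]:
    actions = []
    if p.icac_score < 100:
        return actions                       # below threshold: no outreach
    if not (p.on_statin or p.has_seen_cardiologist):
        actions += ["send_digital_message_to_patient", "notify_primary_physician"]
        if p.icac_score >= 300:
            actions.append("place_phone_call")   # the most severe tier also gets a call
    return actions

print(triage(Patient(icac_score=350, on_statin=False, has_seen_cardiologist=False)))
# ['send_digital_message_to_patient', 'notify_primary_physician', 'place_phone_call']
```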
“The gist here is no one has to suspect you have this disease, no one has to order the study for this disease,” he noted.

Another critical use case for AI is in the detection of stroke and pulmonary embolism. UTMB uses specialized algorithms that have been trained to spot specific symptoms and flag care teams within seconds of imaging to accelerate treatment. As with the iCAC scoring tool, CNNs trained for stroke and pulmonary embolism, respectively, automatically receive CT scans and look for indicators such as obstructed blood flow or an abrupt blood vessel cutoff. “Human radiologists can detect these visual characteristics, but here the detection is automated and happens in mere seconds,” said McCaffrey.

Any CT ordered “under suspicion” of stroke or pulmonary embolism is automatically sent to the AI — for instance, a clinician in the ER may identify facial droop or slurring and issue a “CT stroke” order, triggering the algorithm. Both algorithms include a messaging application that notifies the entire care team as soon as a finding is made. The notification includes a screenshot of the image with a crosshair over the location of the lesion. “These are particular emergency use cases where how quickly you initiate treatment matters,” said McCaffrey. “We’ve seen cases where we’re able to gain several minutes of intervention because we had a quicker heads-up from AI.”

Reducing hallucinations, anchoring bias

To ensure models perform as optimally as possible, UTMB profiles them for sensitivity, specificity, F1 score, bias and other factors, both pre-deployment and recurrently post-deployment. For example, the iCAC algorithm is validated pre-deployment by running the model on a balanced set of CT scans while radiologists manually score the same scans — then the two are compared. In post-deployment review, meanwhile, radiologists are given a random subset of AI-scored CT scans and perform a full iCAC measurement that is blinded to the AI score. McCaffrey explained that this allows his team to calculate model error recurrently and to detect potential bias (which would be seen as a shift in the magnitude and/or directionality of error).

To help prevent anchoring bias — where AI and humans rely too heavily on the first piece of information they encounter, thereby missing important details when making a decision — UTMB employs a “peer learning” technique. A random subset of radiology exams is chosen, shuffled, anonymized and distributed to different radiologists, and their answers are compared. This not only helps to rate individual radiologist performance, but also detects whether the rate of missed findings was higher in studies in which AI was used to specifically highlight particular anomalies (thus leading to anchoring bias).

For instance, if


No more window switching: Mastercard’s Agent Pay transforms how enterprises use AI search

One criticism of computer use platforms and other search agents is that you cannot complete any transactions within the same window when you use them to find a product or a hotel. Mastercard aims to change that by integrating AI companies and platforms into its payments network, enabling users and enterprises to transact seamlessly within their respective ecosystems. Today, the company announced Agent Pay, a new payments program that brings the Mastercard payments system to AI chat platforms.

Greg Ulrich, Mastercard’s chief data and AI officer, told VentureBeat in an interview that Agent Pay “closes the loop” on agentic search. “You want to close the loop within the experience to enable the customer experience in the most effective way possible, which is what we’re trying to enable today,” Ulrich said. “You have to make sure that everybody in the ecosystem can identify the agents and authenticate the agent to handle the transaction safely and securely.”

OpenAI, Anthropic and Perplexity can join Mastercard’s payments network, allowing other network participants — merchants, card users and financial institutions — to trust the platforms’ transactions and check for any potential fraud. It also enables Mastercard to bring its fraud and transaction dispute systems to those companies. Mastercard partnered with Microsoft, IBM, Braintree, and Checkout.com to scale Agent Pay, orchestrate the system and enhance other features for merchants. It will also integrate Agent Pay with banks and other financial institutions.

Making agent searches more useful

Over the past year, we’ve seen AI-powered search evolve from simply listing information to actively using your computer on your behalf. These platforms enable users to ask AI models to search the internet for any information, and they can offer recommendations for products or places to visit. AI search has proved popular. This week, OpenAI even announced it’s adding shopping features to ChatGPT search, running on GPT-4o, in a bid to compete with Google’s long-standing dominance in product search. However, users have found that they still need to open a separate window if they want to act on a deal.

“We’re gonna work with the AI platform and agents so that they can get onboarded and can access the technology. But on the merchant side, they can now do things to recognize these transactions and more effectively manage the risk around this,” said Mastercard Chief Digital Officer Pablo Fourez.

Bringing these platforms into a payments system like Mastercard’s makes them more useful: they can serve not only as a place where people find information, but also as a platform where users can transact. When AI companies are part of the payment network, it could also improve any agentic workflow built by enterprises. Imagine an agentic workflow that includes searching for new suppliers, finding a suitable supplier, helping with negotiations, drafting a contract and setting up transactions through the platform. The company is discussing the integration of Agent Pay into Microsoft’s Copilot and Azure/OpenAI services.

Tokens rather than AI

Agent Pay, however, is not based on generative AI, even though Mastercard does leverage the technology for other products. Agent Pay relies on the company’s tokenization technology, which utilizes cryptography to help mask personally identifiable information (PII) during digital transactions.
“It’s a separate number that is useless if it’s not used within the context of the transaction that you authorized,” Fourez said. “That’s achieved through cryptography that makes each transaction unique, and if someone else gets this information, they can’t do anything with it.”

Mastercard utilizes generative AI and large language models for fraud detection, which Ulrich noted works in tandem with tokenization for Agent Pay, as its AI models verify transactions for fraud once they are initiated. Ulrich added that Agent Pay lets every company and person involved in the transaction trust that “rules work in this ecosystem.” “We’re making sure that we safely and securely identify these players in the ecosystem, that we have a way to capture and hold the credentials securely,” he said.
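As a conceptual illustration only, not Mastercard’s actual scheme, the sketch below shows the general idea behind a transaction-bound token: a surrogate credential cryptographically tied to one authorized transaction context, so it is useless if replayed anywhere else. All keys, field names and values are invented for the example.

```python
# Conceptual sketch of a transaction-bound payment token: the token commits to
# the specific context the cardholder authorized, so replaying it in any other
# context fails verification. A teaching toy, not Mastercard's protocol.
import hmac, hashlib, json

NETWORK_SECRET = b"demo-secret-held-by-the-network"   # placeholder key

def issue_token(card_surrogate: str, context: dict) -> str:
    msg = (card_surrogate + json.dumps(context, sort_keys=True)).encode()
    return hmac.new(NETWORK_SECRET, msg, hashlib.sha256).hexdigest()

def verify(token: str, card_surrogate: str, context: dict) -> bool:
    return hmac.compare_digest(token, issue_token(card_surrogate, context))

authorized = {"merchant": "example-hotel", "amount": "412.00", "agent": "travel-assistant"}
token = issue_token("surrogate-4111", authorized)

print(verify(token, "surrogate-4111", authorized))                                   # True
print(verify(token, "surrogate-4111", {**authorized, "merchant": "someone-else"}))   # False: different context
```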
