
Y Combinator’s hottest startup, Origami Agents, secures $2M seed round to supercharge sales teams with AI

Y Combinator-backed startup Origami Agents has raised $2 million in seed funding to build AI research agents that augment, rather than replace, human sales teams, breaking from the industry trend of AI avatars automating sales roles.

The San Francisco-based company, founded just four months ago, has already reached $50,000 in monthly recurring revenue during its eight-week beta period, making it the fastest-growing startup in Y Combinator’s current batch, according to its founders.

“Only humans can close big deals, but AI can make them much smarter and faster,” Kenson Chung told VentureBeat in an exclusive interview. The 22-year-old dropped out of University College London‘s computer science program to co-found Origami Agents. The company’s AI agents perform the tedious research work that typically consumes up to three hours of a sales representative’s day.

The startup’s early traction offers a stark contrast to heavily funded competitors like 11x and Artisan, which have raised tens of millions to develop AI avatars that attempt to fully automate sales outreach. Origami’s founders argue that this approach often results in spam that damages customer relationships.

A dashboard view of Origami Agents’ AI system analyzing companies’ digital presence to identify potential sales opportunities through job postings, social media and employee activity. (Credit: Origami Agents)

AI sales research: How Origami Agents is transforming B2B lead generation

One early customer seeing dramatic results is Stellar, a property maintenance marketplace that has grown its client base eight-fold in nine months, growth it partly attributes to Origami’s technology. “The quality of the leads that we get is extraordinarily high,” said Matt Wetrich, Stellar’s CEO and a former Uber executive. “My outbound email closes are four times industry average.”

Wetrich, who became an angel investor in Origami after experiencing the product’s impact, explained that the technology helps identify property management companies at the ideal size for Stellar’s services while filtering out poor fits — work that previously required significant manual effort from his team.

“Instead of it being like ground beef, and you gotta form it into something, it’s now like arriving as a steak,” Wetrich said, describing the quality of Origami’s leads compared to traditional methods.

The founding team brings relevant experience to the challenge. Prior to starting Origami, Finn Mallery built custom outbound solutions for more than 20 startups after working on go-to-market strategy at Fizz, while Chung served as CTO at an enterprise sales platform.

Industry experts suggest that the timing could be right for Origami’s approach. While AI has begun transforming various aspects of business operations, its application in B2B sales remains nascent. “You’re just not going to recognize the world in three years’ time with this,” predicted Wetrich. “The gravy train is just departing.”

The seed funding will help Origami expand beyond its initial customer base in property management and real estate into other B2B verticals. The company’s agents can already analyze everything from product reviews to social media engagement to identify potential customers at their moment of highest buying intent.

“There’s enough information on the internet to know exactly who your perfect customers are,” said Mallery. “We’re realizing the power of the entire internet’s unstructured data by building a generalized solution any company can make use of.”

The future of sales: AI that works with humans, not against them

As debates continue about AI’s role in sales, Origami’s rapid growth suggests there may be more value in augmenting human capabilities than in trying to replace them entirely. The company’s approach could offer a blueprint for how AI can enhance rather than eliminate human roles across other business functions.

“It’s not going to be a lot like a computer; it’s going to be a lot like electricity,” said Wetrich, comparing AI’s impact on sales to other transformative technologies. “Everybody lives and breathes and dies off of revenue. And if you can go find ways to go get more revenue… that’s what they empower you to do.”


ByteDance’s UI-TARS can take over your computer, outperforms GPT-4o and Claude

A new AI agent has emerged from the parent company of TikTok to take control of your computer and perform complex workflows. Much like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step action.

Trained on roughly 50B tokens and offered in 7B- and 72B-parameter versions, the PC/MacOS agent achieves state-of-the-art (SOTA) performance on 10-plus GUI benchmarks across performance, perception, grounding and overall agent capabilities, consistently beating out OpenAI’s GPT-4o, Claude and Google’s Gemini.

“Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention,” researchers from ByteDance and Tsinghua University write in a new research paper.

Source: Arxiv

How UI-TARS explains its thinking

UI-TARS works across desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand visual environments. Its UI features two tabs — one to the left showing its step-by-step “thinking,” and a larger one to the right where it pulls up files, websites and apps and automatically takes action.

For example, in a demo video released today, the model is prompted to “Find round trip flights from SEA to NYC on the 5th and return on the 10th next month and filter by price in ascending order.” In response, UI-TARS navigates to the website for Delta Airlines, fills in the “from” and “to” fields, clicks on the relevant dates and sorts and filters by price, explaining each step in its thinking box before taking action.

In another scenario, it is instructed to install an autoDocstring extension in VS Code. Here is its step-by-step thinking as it completes that task:

It first reports that “to proceed with installing the autoDocstring extension in VS Code, I first need to open the VS Code application.”

Recognizing that the VS Code window has not fully loaded, the model notes that “it is necessary to wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all required functionalities are available for the next steps.”

It then explains that it must access the Extensions view in VS Code, which can be done by clicking on a tab in the left sidebar.

Running into a small glitch, it reasons: “The previous attempt to click on the ‘Extensions’ tab may not have been precise enough, so I will try clicking on it again to ensure the Extensions view opens successfully.”

Once in the Extensions view, it types in ‘autoDocstring’ before concluding: “Since the installation process has already been initiated, there is no further action required at this moment. I will wait for the installation to complete and confirm its success once the process is finished.”

Outperforming its rivals

Across a variety of benchmarks, researchers report that UI-TARS consistently outranked OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For instance, in VisualWebBench — which measures a model’s ability to ground web elements, including webpage quality assurance and optical character recognition — UI-TARS 72B scored 82.8%, outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%).

It also did significantly better on WebSRC benchmarks (understanding of semantic content and layout in web contexts) and ScreenQA-short (comprehension of complex mobile screen layouts and web structure). UI-TARS-7B achieved a leading score of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, outperforming Qwen, Gemini, Claude 3.5 and GPT-4o.

“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers write. “Such perceptual ability lays the foundation for agent tasks, where accurate environmental understanding is crucial for task execution and decision-making.”

UI-TARS also showed impressive results in ScreenSpot Pro and ScreenSpot v2, which assess a model’s ability to understand and localize elements in GUIs. Further, researchers tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, and benchmarked it on OSWorld (which assesses open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 programmatic tasks across 20 mobile apps).

Source: Arxiv

Under the hood

To help it take step-by-step actions and recognize what it’s seeing, UI-TARS was trained on a large-scale dataset of screenshots that parsed metadata including element description and type, visual description, bounding boxes (position information), element function and text from various websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only elements but spatial relationships and overall layout.

The model also uses state transition captioning to identify and describe the differences between two consecutive screenshots and determine whether an action — such as a mouse click or keyboard input — has occurred. Meanwhile, set-of-mark (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image.

The model is equipped with both short-term and long-term memory to handle tasks at hand while also retaining historical interactions to improve later decision-making. Researchers trained the model to perform both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate) reasoning. This allows for multi-step decision-making, “reflection” thinking, milestone recognition and error correction.

Researchers emphasized that it is critical that the model be able to maintain consistent goals and engage in trial and error to hypothesize, test and evaluate potential actions before completing a task. They introduced two types of data to support this: error-correction and post-reflection data. For error correction, they identified mistakes and labeled corrective actions; for post-reflection, they simulated recovery steps.

“This strategy ensures that the agent not only learns to avoid errors but also adapts dynamically when they occur,” the researchers write.

Clearly, UI-TARS exhibits impressive capabilities, and it’ll be interesting to see its evolving use cases in the increasingly competitive AI agents space. As the researchers note: “Looking ahead, while native agents represent a significant leap forward, the future lies in the integration of active and lifelong learning, where agents autonomously…
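The paper does not ship a reference client, but the perceive-reason-act loop described above is easy to picture in code. Below is a deliberately simplified, hypothetical sketch: every function name is invented and the model call is faked, but it shows how a GUI agent alternates between a “thought” and a grounded action while keeping a history it can reflect on.

```python
def capture_screenshot() -> bytes:
    """Stand-in for a real screen-capture call (e.g. via an OS automation library)."""
    return b"<png bytes>"

def vlm_decide(goal: str, screenshot: bytes, history: list) -> dict:
    """Stand-in for the vision-language model: returns a 'thought' plus a
    grounded action. Here we fake a two-step episode for illustration."""
    if not history:
        return {"thought": "I need to open the Extensions view.",
                "action": "click", "x": 24, "y": 180}
    return {"thought": "Installation started; nothing left to do.",
            "action": "done"}

def execute(action: dict) -> None:
    """Stand-in for dispatching a real mouse or keyboard event."""
    print(f"executing: {action}")

def run_agent(goal: str, max_steps: int = 50) -> None:
    history: list = []          # short-term memory the model can reflect on
    for _ in range(max_steps):
        step = vlm_decide(goal, capture_screenshot(), history)
        print("thought:", step["thought"])
        if step["action"] == "done":
            break
        execute(step)
        history.append(step)    # retained so later steps can correct earlier errors

run_agent("Install the autoDocstring extension in VS Code")
```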


Trump’s $500 billion AI moonshot: Ambition meets controversy in ‘Project Stargate’

President Donald Trump unveiled an ambitious plan to reshape America’s artificial intelligence landscape this week, coupling a massive $500 billion private-sector initiative with sweeping executive actions that strip away regulatory barriers — while simultaneously sparking controversy over both funding claims and environmental concerns.

The centerpiece of Trump’s AI strategy, dubbed “Project Stargate,” brings together an unlikely alliance of tech giants: Sam Altman’s OpenAI, Larry Ellison’s Oracle and SoftBank under Masayoshi Son. The initiative aims to construct up to 20 massive AI data centers across the United States, with the first facility already under construction in Abilene, Texas.

“This is a resounding declaration of confidence in America’s potential,” Trump declared at the White House announcement. However, the bold initiative immediately faced skepticism from an unexpected quarter: Trump’s own adviser, tech billionaire Elon Musk.

Elon Musk questions Stargate’s $500 billion funding as OpenAI rivalry intensifies

“They don’t actually have the money,” Musk wrote on X.com (formerly Twitter), claiming SoftBank had secured “well under $10B.” This public clash between Musk and Altman, former collaborators turned rivals, highlights the complex dynamics within Trump’s tech coalition.

Altman swiftly countered Musk’s claim, inviting him to visit the Abilene site while pointedly noting that “what is great for the country isn’t always what’s optimal for your companies” — a reference to Musk’s competing AI ventures.

Industry analysts note that the funding structure remains opaque. While the initial commitment is $100 billion, the path to $500 billion appears to rely heavily on future fundraising and market conditions. Microsoft CEO Satya Nadella, whose company is notably absent from the main announcement despite its OpenAI partnership, offered measured support: “All I know is, I’m good for my $80 billion,” he told CNBC at Davos.

Emergency powers and deregulation: Trump’s strategy to fast-track AI infrastructure

The initiative arrives alongside an executive order that fundamentally reshapes the federal government’s approach to AI development. The order explicitly prioritizes speed over regulation, with Trump declaring he will use emergency powers to fast-track power plant construction for the energy-hungry data centers.

“I’m going to get the approval under emergency declaration. I can get the approvals done myself without having to go through years of waiting,” Trump told the World Economic Forum. This approach marks a sharp departure from the Biden administration’s emphasis on AI safety guidelines.

Environmental concerns loom large. While the Abilene facility plans to use renewable energy, Trump’s order allows the data centers to “use whatever fuel they want,” including coal for backup power. This has alarmed climate activists, who warn about the massive energy requirements of AI infrastructure.

Corporate DEI programs clash with White House policy as tech giants navigate Trump era

The initiative also faces potential contradictions with Trump’s other policy priorities. Many of the participating companies maintain diversity, equity and inclusion (DEI) programs that clash with Trump’s day-one executive order ending such initiatives in federal agencies.

The initiative represents a striking paradigm shift in how the U.S. approaches technological development. While previous administrations carefully balanced innovation with oversight, Trump’s approach essentially throws out the regulatory playbook in favor of a move-fast-and-fix-later strategy.

This creates an unprecedented experiment in AI development: Can Silicon Valley’s biggest players, freed from regulatory constraints but bound by new social restrictions, deliver on the promise of U.S. AI dominance?

The contradictions are difficult to ignore. Trump is simultaneously declaring AI development a national emergency while constraining the very companies building it through restrictions on their internal practices. Tech giants like OpenAI and Oracle must now thread an increasingly narrow needle — racing to build massive AI infrastructure while potentially dismantling DEI initiatives that have become deeply embedded in their corporate cultures and hiring practices.

More concerning for AI researchers is the absence of safety guidelines in this new framework. By prioritizing speed and scale over careful development, the administration risks repeating the mistakes of previous technological revolutions, where unforeseen consequences emerged only after systems became too entrenched to easily modify. The stakes with AI are arguably much higher.

America’s AI gamble: A race against China with uncertain odds

For now, the tech industry appears willing to navigate these contradictions in exchange for unprecedented support for AI infrastructure development. Whether this gamble pays off may determine not just the future of American AI, but also the shape of the global tech landscape for decades to come.

The stakes couldn’t be higher. As China continues its own aggressive AI development, Project Stargate represents America’s biggest bet yet on maintaining its technological edge. The question remains: Will this moonshot approach create the “golden age” Trump promises, or will regulatory rollbacks and internal conflicts undermine its ambitious goals?


Why everyone in AI is freaking out about DeepSeek

As of a few days ago, only the nerdiest of nerds (I say this as one) had ever heard of DeepSeek, a Chinese AI subsidiary of the equally evocatively named High-Flyer Capital Management, a quantitative analysis (or quant) firm that initially launched in 2015. Yet within the last few days, it’s been arguably the most discussed company in Silicon Valley.

That’s largely thanks to the release of DeepSeek-R1, a new large language model (LLM) that performs “reasoning” similar to OpenAI’s current best-available model o1 — taking multiple seconds or minutes to answer hard questions and solve complex problems as it reflects on its own analysis in a step-by-step, or “chain of thought,” fashion.

Not only that, but DeepSeek-R1 scored as high as or higher than OpenAI’s o1 on a variety of third-party benchmarks (tests to measure AI performance at answering questions on various subjects), and was reportedly trained at a fraction of the cost (around $5 million), with far fewer of the graphics processing units (GPUs) that are under a strict export embargo imposed by the U.S., OpenAI’s home turf.

But unlike o1, which is available only to paying ChatGPT subscribers of the Plus tier ($20 per month) and more expensive tiers (such as Pro at $200 per month), DeepSeek-R1 was released as a fully open-source model, which also explains why it has quickly rocketed up the charts of AI code-sharing community Hugging Face’s most downloaded and active models.

Also, because it is fully open source, people have already fine-tuned and trained many variations of the model for different task-specific purposes, such as making it small enough to run on a mobile device, or combining it with other open-source models. Even if you want to use it for development purposes, DeepSeek’s API costs are more than 90% lower than the equivalent o1 model from OpenAI.

Most impressively of all, you don’t even need to be a software engineer to use it: DeepSeek has a free website and mobile app, even for U.S. users, with an R1-powered chatbot interface very similar to OpenAI’s ChatGPT. Except, once again, DeepSeek undercut or “mogged” OpenAI by connecting this powerful reasoning model to web search — something OpenAI hasn’t yet done (web search is only available on the less powerful GPT family of models at present).

An open-and-shut irony

There’s a pretty delicious, or maybe disconcerting, irony to this, given OpenAI’s founding goal of democratizing AI for the masses. As Nvidia senior research manager Jim Fan put it on X: “We are living in a timeline where a non-US company is keeping the original mission of OpenAI alive — truly open, frontier research that empowers all. It makes no sense. The most entertaining outcome is the most likely.”

Or as X user @SuspendedRobot put it, referencing reports that DeepSeek appears to have been trained on question-answer outputs and other data generated by ChatGPT: “OpenAI stole from the whole internet to make itself richer, DeepSeek stole from them and give it back to the masses for free I think there is a certain british folktale about this”

But Fan isn’t the only one to sit up and take note of DeepSeek’s success.

The open-source availability of DeepSeek-R1, its high performance, and the fact that it seemingly “came out of nowhere” to challenge the former leader of generative AI has sent shockwaves throughout Silicon Valley and far beyond, based on my conversations with and readings of various engineers, thinkers and leaders. If not “everyone” is freaking out about it, as my hyperbolic headline suggests, it’s certainly the talk of the town in tech and business circles.

A message posted to Blind, the app for sharing anonymous gossip in Silicon Valley, has been making the rounds, suggesting Meta is in crisis over the success of DeepSeek because of how quickly it surpassed Meta’s own efforts to be the king of open-source AI with its Llama models.

‘This changes the whole game’

X user @tphuang wrote compellingly: “DeepSeek has commoditized AI outside of very top-end. Lightbulb moment for me in 1st photo. R1 is so much cheaper than US labor cost that many jobs will get automated away over next 5 yrs,” later noting why DeepSeek’s R1 is more enticing to users than even OpenAI’s o1: “3 huge issues w/ o1: 1) too slow 2) too expensive 3) lack of control for end user/reliance on OpenAI. R1 solves all of them. A company can buy their own Nvidia GPUs, run these models. Don’t have to worry about additional costs or slow/unresponsive OpenAI servers”

@tphuang also posed a compelling analogy as a question: “Will DeepSeek be to LLM what Android became to OS world?”

Web entrepreneur Arnaud Bertrand didn’t mince words about the startling implications of DeepSeek’s success either, writing on X: “There’s no overstating how profoundly this changes the whole game. And not only with regards to AI, it’s also a massive indictment of the US’s misguided attempt to stop China’s technological development, without which Deepseek may not have been possible (as the saying goes, necessity is the mother of inventions).”

The censorship issue

However, others have sounded cautionary notes on DeepSeek’s rapid rise, arguing that as a startup operated out of China, it is necessarily subject to that country’s laws and content censorship requirements. Indeed, in my own usage of DeepSeek on the iOS app here in the U.S., I found it would not answer questions about Tiananmen Square, the site of the 1989 pro-democracy student protests and uprising, and the subsequent violent crackdown by the Chinese military, which resulted in at least 200, possibly thousands, of deaths, earning it the nickname “Tiananmen Square Massacre” in Western media outlets.

Ben Hylak, a former Apple human interface designer and cofounder of AI product analytics platform Dawn, posted on X how asking about this subject caused DeepSeek-R1 to enter a circuitous loop. As a member of the press itself, I of course take…
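For developers, R1 is also reachable programmatically through an OpenAI-compatible API. A minimal sketch, assuming the endpoint and model name DeepSeek documented at launch (both should be verified against the current docs):

```python
# Minimal sketch of calling DeepSeek-R1 via its OpenAI-compatible API.
# The base_url and model name are assumptions from DeepSeek's launch docs;
# check the current documentation before relying on them.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",   # placeholder credential
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",         # the R1 reasoning model
    messages=[{"role": "user", "content": "What is 17 * 24? Think step by step."}],
)

print(response.choices[0].message.content)
```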


Pipeshift cuts GPU usage for AI inference by 75% with modular inference engine

DeepSeek’s release of R1 this week was a watershed moment in the field of AI. Nobody thought a Chinese startup would be the first to drop a reasoning model matching OpenAI’s o1 and open-source it (in line with OpenAI’s original mission) at the same time.

Enterprises can easily download R1’s weights via Hugging Face, but access has never been the problem — over 80% of teams are using or planning to use open models. Deployment is the real culprit. If you go with hyperscaler services, like Vertex AI, you’re locked into a specific cloud. On the other hand, if you go solo and build in-house, there’s the challenge of resource constraints, as you have to set up a dozen different components just to get started, let alone optimize or scale downstream.

To address this challenge, Y Combinator- and SenseAI-backed Pipeshift is launching an end-to-end platform that allows enterprises to train, deploy and scale open-source generative AI models — LLMs, vision models, audio models and image models — across any cloud or on-prem GPUs. The company is competing in a rapidly growing field that includes Baseten, Domino Data Lab, Together AI and Simplismart.

The key value proposition? Pipeshift uses a modular inference engine that can quickly be optimized for speed and efficiency, helping teams not only deploy 30 times faster but achieve more with the same infrastructure, leading to as much as 60% cost savings. Imagine running inferences worth four GPUs with just one.

The orchestration bottleneck

When you have to run different models, stitching together a functional MLOps stack in-house — from accessing compute, training and fine-tuning to production-grade deployment and monitoring — becomes the problem. You have to set up 10 different inference components and instances to get things up and running, and then put in thousands of engineering hours for even the smallest of optimizations.

“There are multiple components of an inference engine,” Arko Chattopadhyay, cofounder and CEO of Pipeshift, told VentureBeat. “Every combination of these components creates a distinct engine with varying performance for the same workload. Identifying the optimal combination to maximize ROI requires weeks of repetitive experimentation and fine-tuning of settings. In most cases, the in-house teams can take years to develop pipelines that can allow for the flexibility and modularization of infrastructure, pushing enterprises behind in the market alongside accumulating massive tech debts.”

While there are startups that offer platforms to deploy open models across cloud or on-premise environments, Chattopadhyay says most of them are GPU brokers, offering one-size-fits-all inference solutions. As a result, they maintain separate GPU instances for different LLMs, which doesn’t help when teams want to save costs and optimize for performance.

To fix this, Chattopadhyay started Pipeshift and developed a framework called modular architecture for GPU-based inference clusters (MAGIC), aimed at distributing the inference stack into different plug-and-play pieces. The work created a Lego-like system that allows teams to configure the right inference stack for their workloads, without the hassle of infrastructure engineering.

This way, a team can quickly add or interchange different inference components to piece together a customized inference engine that can extract more out of existing infrastructure to meet expectations for costs, throughput or even scalability. For instance, a team could set up a unified inference system where multiple domain-specific LLMs could run with hot-swapping on a single GPU, utilizing it to full benefit.

Running four GPU workloads on one

Since claiming to offer a modular inference solution is one thing and delivering on it is entirely another, Pipeshift’s founder was quick to point out the benefits of the company’s offering.

“In terms of operational expenses… MAGIC allows you to run LLMs like Llama 3.1 8B at >500 tokens/sec on a given set of Nvidia GPUs without any model quantization or compression,” he said. “This unlocks a massive reduction of scaling costs as the GPUs can now handle workloads that are an order of magnitude 20-30 times what they originally were able to achieve using the native platforms offered by the cloud providers.”

The CEO noted that the company is already working with 30 companies on an annual license-based model.

One of these is a Fortune 500 retailer that initially used four independent GPU instances to run four open fine-tuned models for its automated support and document processing workflows. Each of these GPU clusters was scaling independently, adding to massive cost overheads.

“Large-scale fine-tuning was not possible as datasets became larger, and all the pipelines were supporting single-GPU workloads while requiring you to upload all the data at once. Plus, there was no auto-scaling support with tools like AWS Sagemaker, which made it hard to ensure optimal use of infra, pushing the company to pre-approve quotas and reserve capacity beforehand for theoretical scale that only hit 5% of the time,” Chattopadhyay noted.

Interestingly, after shifting to Pipeshift’s modular architecture, all the fine-tunes were brought down to a single GPU instance that served them in parallel, without any memory partitioning or model degradation. This brought down the requirement to run these workloads from four GPUs to just one.

“Without additional optimizations, we were able to scale the capabilities of the GPU to a point where it was serving five-times-faster tokens for inference and could handle a four-times-higher scale,” the CEO added. In all, he said that the company saw a 30-times-faster deployment timeline and a 60% reduction in infrastructure costs.

With its modular architecture, Pipeshift wants to position itself as the go-to platform for deploying all cutting-edge open-source AI models, including DeepSeek R1. However, it won’t be an easy ride as competitors continue to evolve their offerings. For instance, Simplismart, which raised $7 million a few months ago, is taking a similar software-optimized approach to inference. Cloud service providers like Google Cloud and Microsoft Azure are also bolstering their respective offerings, although Chattopadhyay thinks these CSPs will be more like partners than competitors in the long run.
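Pipeshift has not published MAGIC’s internals, but the consolidation pattern in the retailer example, several fine-tunes sharing one GPU-resident base model, can be sketched with off-the-shelf tools. Here is a minimal illustration using Hugging Face PEFT, with placeholder adapter paths; it shows the general idea, not Pipeshift’s implementation.

```python
# General idea only (not Pipeshift's code): serve several fine-tunes from
# one GPU by hot-swapping LoRA adapters over a shared base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"   # example base model
tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Load the first fine-tune, then register the others as named adapters.
# The adapter paths are placeholders for your own fine-tuned checkpoints.
model = PeftModel.from_pretrained(base, "adapters/support", adapter_name="support")
model.load_adapter("adapters/docs", adapter_name="docs")

def generate(adapter: str, prompt: str) -> str:
    model.set_adapter(adapter)   # swap which fine-tune answers this request
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)

print(generate("support", "My order never arrived. What now?"))
print(generate("docs", "Summarize the attached contract clause."))
```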


Galileo launches ‘Agentic Evaluations’ to fix AI agent errors before they cost you

Galileo, a San Francisco-based startup, is betting that the future of artificial intelligence depends on trust. Today, the company launched a new product, Agentic Evaluations, to address a growing challenge in the world of AI: making sure the increasingly complex systems known as AI agents actually work as intended.

AI agents — autonomous systems that perform multi-step tasks like generating reports or analyzing customer data — are gaining traction across industries. But their rapid adoption raises a crucial question: How can companies verify these systems remain reliable after deployment? Galileo’s CEO, Vikram Chatterji, believes his company has found the answer.

“Over the last six to eight months, we started to see some of our customers trying to adopt agentic systems,” said Chatterji in an interview. “Now LLMs can be used as a smart router to pick and choose the right API calls towards actually completing a task. Going from just generating text to actually completing a task was a very big chasm that was unlocked.”

A diagram showing how Galileo evaluates AI agents at three key stages: tool selection, error detection and task completion. (Credit: Galileo)

AI agents show promise, but enterprises demand accountability

Major enterprises like Cisco and Ema (the latter founded by Coinbase’s former chief product officer) have already adopted Galileo’s platform. These companies use AI agents to automate tasks from customer support to financial analysis, and report significant productivity gains.

“A sales representative who’s trying to do outreach and outbounds would otherwise use maybe a week of their time to do that, versus with some of these AI-enabled agents, they’re doing that within two days or less,” Chatterji explained, highlighting the return on investment for enterprises.

Galileo’s new framework evaluates tool selection quality, detects errors in tool calls, and tracks overall session success. It also monitors essential metrics for large-scale AI deployment, including costs and latency.

A dashboard showing how Galileo evaluates AI agents at three key stages: tool selection, error detection and task completion. (Credit: Galileo)

$68 million in funding fuels Galileo’s push into enterprise AI

The launch builds on Galileo’s recent momentum. The company raised $45 million in series B funding led by Scale Venture Partners last October, bringing its total funding to $68 million. Industry analysts project the market for AI operations tools could reach $4 billion by 2025.

The stakes are high as AI deployment accelerates. Studies show even advanced models like GPT-4 can hallucinate about 23% of the time during basic question-and-answer tasks. Galileo’s tools help enterprises identify these issues before they impact operations.

“Before we launch this thing, we really, really need to know that this thing works,” Chatterji said, describing customer concerns. “The bar is really high. So that’s where we gave them this tool chain, such that they could just use our metrics as the basis for these tests.”

Addressing AI hallucinations and enterprise-scale challenges

The company’s focus on reliable, production-ready solutions positions it well in a market increasingly concerned with AI safety. For technical leaders deploying enterprise AI, Galileo’s platform provides essential guardrails for ensuring AI agents perform as intended while controlling costs.

As enterprises expand their use of AI agents, performance monitoring tools become crucial infrastructure. Galileo’s latest offering aims to help businesses deploy AI responsibly and effectively at scale.

“2025 will be the year of agents. It is going to be very prolific,” Chatterji noted. “However, what we’ve also seen is a lot of companies that are just launching these agents without good testing is leading to negative implications… The need for proper testing and evaluations is more than ever before.”
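The article does not describe Galileo’s SDK, but the three checks above (tool selection quality, tool-call errors and session success) are straightforward to compute over logged agent traces. A self-contained, hypothetical sketch, with all class and field names invented:

```python
# Illustrative sketch (not Galileo's API): computing agent-evaluation
# metrics over logged sessions, plus the cost/latency figures mentioned above.
from dataclasses import dataclass

@dataclass
class ToolCall:
    chosen_tool: str
    expected_tool: str   # label from a human-written test case
    errored: bool        # did the call raise or return malformed output?

@dataclass
class Session:
    calls: list
    completed_task: bool
    cost_usd: float
    latency_s: float

def evaluate(sessions: list) -> dict:
    calls = [c for s in sessions for c in s.calls]
    return {
        "tool_selection_acc": sum(c.chosen_tool == c.expected_tool for c in calls) / len(calls),
        "tool_error_rate": sum(c.errored for c in calls) / len(calls),
        "session_success": sum(s.completed_task for s in sessions) / len(sessions),
        "avg_cost_usd": sum(s.cost_usd for s in sessions) / len(sessions),
        "avg_latency_s": sum(s.latency_s for s in sessions) / len(sessions),
    }

# Toy demo data: two logged sessions with hand-labeled expectations.
demo = [
    Session([ToolCall("search", "search", False),
             ToolCall("sql", "calculator", True)], False, 0.04, 12.1),
    Session([ToolCall("calculator", "calculator", False)], True, 0.01, 3.2),
]
print(evaluate(demo))
```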


Microsoft just built an AI that designs materials for the future: Here’s how it works

Microsoft Research today introduced a powerful new AI system that generates novel materials with specific desired properties, potentially accelerating the development of better batteries, more efficient solar cells and other critical technologies.

The system, called MatterGen, represents a fundamental shift in how scientists discover new materials. Rather than screening millions of existing compounds — the traditional approach, which can take years — MatterGen directly generates novel materials based on desired characteristics, similar to how AI image generators create pictures from text descriptions.

“Generative models provide a new paradigm for materials design by directly generating entirely novel materials given desired property constraints,” said Tian Xie, principal research manager at Microsoft Research and lead author of the study published today in Nature. “This represents a major advancement towards creating a universal generative model for materials design.”

How Microsoft’s AI engine works differently than traditional methods

MatterGen uses a specialized type of AI called a diffusion model — similar to those behind image generators like DALL-E — but adapted to work with three-dimensional crystal structures. It gradually refines random arrangements of atoms into stable, useful materials that meet specified criteria.

The results surpass previous approaches. According to the research paper, materials produced by MatterGen are “more than twice as likely to be novel and stable, and more than 15 times closer to the local energy minimum” compared to previous AI approaches. This means the generated materials are both more likely to be useful and physically possible to create.

In one striking demonstration, the team collaborated with scientists at China’s Shenzhen Institutes of Advanced Technology to synthesize a new material, TaCr2O6, that MatterGen had designed. The real-world material closely matched the AI’s predictions, validating the system’s practical utility.

Real-world applications could transform energy storage and computing

The system is particularly notable for its flexibility. It can be “fine-tuned” to generate materials with specific properties — from particular crystal structures to desired electronic or magnetic characteristics. This could be invaluable for designing materials for specific industrial applications.

The implications could be far-reaching. New materials are crucial for advancing technologies in energy storage, semiconductor design and carbon capture. For instance, better battery materials could accelerate the transition to electric vehicles, while more efficient solar cell materials could make renewable energy more cost-effective.

“From an industrial perspective, the potential here is enormous,” Xie explained. “Human civilization has always depended on material innovations. If we can use generative AI to make materials design more efficient, it could accelerate progress in industries like energy, healthcare and beyond.”

Microsoft’s open source strategy aims to accelerate scientific discovery

Microsoft has released MatterGen’s source code under an open-source license, allowing researchers worldwide to build upon the technology. This move could accelerate the system’s impact across various scientific fields.

The development of MatterGen is part of Microsoft’s broader AI for Science initiative, which aims to accelerate scientific discovery using AI. The project integrates with Microsoft’s Azure Quantum Elements platform, potentially making the technology accessible to businesses and researchers through cloud computing services.

However, experts caution that while MatterGen represents a significant advance, the path from computationally designed materials to practical applications still requires extensive testing and refinement. The system’s predictions, while promising, need experimental validation before industrial deployment.

Nevertheless, the technology represents a significant step forward in using AI to accelerate scientific discovery. As Daniel Zügner, a senior researcher on the project, noted: “We’re deeply committed to research that can have a positive, real-world impact, and this is just the beginning.”
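As a conceptual aid, the reverse-diffusion idea described above can be sketched in a few lines. Everything here is a toy stand-in: the “denoiser” is a fake function, and a real system such as MatterGen jointly denoises atom types, coordinates and the lattice under learned, property-conditioned guidance.

```python
# Conceptual toy only (not MatterGen's code): property-conditioned diffusion
# sampling, where random atomic coordinates are gradually refined.
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x: np.ndarray, t: int, target_property: float) -> np.ndarray:
    """Hypothetical trained network: predicts the noise present in x at step t,
    steered by a desired property (e.g. a target band gap). Toy stand-in."""
    return 0.1 * x * (t / 50) * np.tanh(target_property)

def sample_structure(n_atoms: int = 8, steps: int = 50, target_property: float = 1.5):
    x = rng.normal(size=(n_atoms, 3))   # start from pure noise
    for t in range(steps, 0, -1):       # denoise step by step, as in image diffusion
        x = x - denoiser(x, t, target_property)
    return x                            # stand-in for fractional atomic coordinates

print(sample_structure().round(2))
```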


Hugging Face shrinks AI vision models to phone-friendly size, slashing computing costs

Hugging Face has achieved a remarkable breakthrough in AI, introducing vision-language models that run on devices as small as smartphones while outperforming their predecessors that require massive data centers.

The company’s new SmolVLM-256M model, requiring less than one gigabyte of GPU memory, surpasses the performance of its Idefics 80B model from just 17 months ago — a system 300 times larger. This dramatic reduction in size and improvement in capability marks a watershed moment for practical AI deployment.

“When we released Idefics 80B in August 2023, we were the first company to open-source a video language model,” Andrés Marafioti, machine learning research engineer at Hugging Face, said in an exclusive interview with VentureBeat. “By achieving a 300X size reduction while improving performance, SmolVLM marks a breakthrough in vision-language models.”

Performance comparison of Hugging Face’s new SmolVLM models shows the smaller versions (256M and 500M) consistently outperforming their 80-billion-parameter predecessor across key visual reasoning tasks. (Credit: Hugging Face)

Smaller AI models that run on everyday devices

The advancement arrives at a crucial moment for enterprises struggling with the astronomical computing costs of implementing AI systems. The new SmolVLM models — available in 256M and 500M parameter sizes — process images and understand visual content at speeds previously unattainable at their size class. The smallest version processes 16 examples per second while using only 15GB of RAM with a batch size of 64, making it particularly attractive for businesses looking to process large volumes of visual data.

“For a mid-sized company processing 1 million images monthly, this translates to substantial annual savings in compute costs,” Marafioti told VentureBeat. “The reduced memory footprint means businesses can deploy on cheaper cloud instances, cutting infrastructure costs.”

The development has already caught the attention of major technology players. IBM has partnered with Hugging Face to integrate the 256M model into Docling, its document processing software. “While IBM certainly has access to substantial compute resources, using smaller models like these allows them to efficiently process millions of documents at a fraction of the cost,” said Marafioti.

Processing speeds of SmolVLM models across different batch sizes, showing how the smaller 256M and 500M variants significantly outperform the 2.2B version on both A100 and L4 graphics cards. (Credit: Hugging Face)

How Hugging Face reduced model size without compromising power

The efficiency gains come from technical innovations in both the vision processing and language components. The team switched from a 400M-parameter vision encoder to a 93M-parameter version and implemented more aggressive token compression techniques. These changes maintain high performance while dramatically reducing computational requirements.

For startups and smaller enterprises, these developments could be transformative. “Startups can now launch sophisticated computer vision products in weeks instead of months, with infrastructure costs that were prohibitive mere months ago,” said Marafioti.

The impact extends beyond cost savings to enabling entirely new applications. The models are powering advanced document search capabilities through ColiPali, an algorithm that creates searchable databases from document archives. “They obtain very close performances to those of models 10X the size while significantly increasing the speed at which the database is created and searched, making enterprise-wide visual search accessible to businesses of all types for the first time,” Marafioti explained.

A breakdown of SmolVLM’s 1.7 billion training examples shows document processing and image captioning comprising nearly half of the dataset. (Credit: Hugging Face)

Why smaller AI models are the future of AI development

The breakthrough challenges conventional wisdom about the relationship between model size and capability. While many researchers have assumed that larger models were necessary for advanced vision-language tasks, SmolVLM demonstrates that smaller, more efficient architectures can achieve similar results. The 500M-parameter version achieves 90% of the performance of its 2.2B-parameter sibling on key benchmarks.

Rather than suggesting an efficiency plateau, Marafioti sees these results as evidence of untapped potential: “Until today, the standard was to release VLMs starting at 2B parameters; we thought that smaller models were not useful. We are proving that, in fact, models at 1/10 of the size can be extremely useful for businesses.”

This development arrives amid growing concerns about AI’s environmental impact and computing costs. By dramatically reducing the resources required for vision-language AI, Hugging Face’s innovation could help address both issues while making advanced AI capabilities accessible to a broader range of organizations.

The models are available open source, continuing Hugging Face’s tradition of increasing access to AI technology. This accessibility, combined with the models’ efficiency, could accelerate the adoption of vision-language AI across industries from healthcare to retail, where processing costs have previously been prohibitive.

In a field where bigger has long meant better, Hugging Face’s achievement suggests a new paradigm: The future of AI might not be found in ever-larger models running in distant data centers, but in nimble, efficient systems running right on our devices. As the industry grapples with questions of scale and sustainability, these smaller models might just represent the biggest breakthrough yet.
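Because the models are open, trying the 256M variant locally takes only a few lines with transformers. A minimal sketch; the checkpoint name follows Hugging Face’s SmolVLM release and should be verified on the Hub, and the image path is a placeholder.

```python
# Minimal sketch of running the 256M SmolVLM locally with transformers.
# Checkpoint name assumed from Hugging Face's release; verify on the Hub.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("invoice.png")   # placeholder input image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What is the invoice total?"}]}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```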


MiniMax unveils its own open-source LLM with industry-leading 4M token context

MiniMax is perhaps best known in the U.S. today as the Singaporean company behind Hailuo, a realistic, high-resolution generative AI video model that competes with Runway, OpenAI’s Sora and Luma AI’s Dream Machine.

But the company has far more tricks up its sleeve: Today, for instance, it announced the release and open-sourcing of the MiniMax-01 series, a new family of models built to handle ultra-long contexts and enhance AI agent development. The series includes MiniMax-Text-01, a foundation large language model (LLM), and MiniMax-VL-01, a visual multimodal model.

A massive context window

MiniMax-Text-01 is of particular note for enabling up to 4 million tokens in its context window — equivalent to a small library’s worth of books. The context window is how much information the LLM can handle in one input/output exchange, with words and concepts represented as numerical “tokens,” the LLM’s own internal mathematical abstraction of the data it was trained on. And, while Google previously led the pack with its Gemini 1.5 Pro model and its 2-million-token context window, MiniMax remarkably doubled that.

As MiniMax posted on its official X account today: “MiniMax-01 efficiently processes up to 4M tokens — 20 to 32 times the capacity of other leading models. We believe MiniMax-01 is poised to support the anticipated surge in agent-related applications in the coming year, as agents increasingly require extended context handling capabilities and sustained memory.”

The models are available now for download on Hugging Face and GitHub under a custom MiniMax license, for users to try directly on Hailuo AI Chat (a ChatGPT/Gemini/Claude competitor), and through MiniMax’s application programming interface (API), where third-party developers can link their own unique apps to them.

MiniMax is offering APIs for text and multimodal processing at competitive rates:

$0.2 per 1 million input tokens

$1.1 per 1 million output tokens

For comparison, OpenAI’s GPT-4o costs $2.50 per 1 million input tokens through its API, a staggering 12.5X more expensive.

MiniMax has also integrated a mixture-of-experts (MoE) framework with 32 experts to optimize scalability. This design balances computational and memory efficiency while maintaining competitive performance on key benchmarks.

Striking new ground with Lightning Attention Architecture

At the heart of MiniMax-01 is a Lightning Attention mechanism, an innovative alternative to transformer architecture. This design significantly reduces computational complexity. The models consist of 456 billion parameters, with 45.9 billion activated per inference.

Unlike earlier architectures, Lightning Attention employs a mix of linear and traditional SoftMax layers, achieving near-linear complexity for long inputs. SoftMax, for those like myself who are new to the concept, is the transformation of input numerals into probabilities that add up to 1, so that the LLM can approximate which meaning of the input is likeliest.

MiniMax has rebuilt its training and inference frameworks to support the Lightning Attention architecture. Key improvements include:

MoE all-to-all communication optimization: Reduces inter-GPU communication overhead.

Varlen ring attention: Minimizes computational waste for long-sequence processing.

Efficient kernel implementations: Tailored CUDA kernels improve Lightning Attention performance.

These advancements make MiniMax-01 models accessible for real-world applications, while maintaining affordability.
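To make that SoftMax step concrete: it maps a vector of raw attention scores to probabilities that sum to 1. The standard, numerically stable form in NumPy, with illustrative scores:

```python
# Standard softmax: raw attention scores -> probabilities summing to 1.
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    shifted = scores - scores.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

attn_scores = np.array([2.0, 1.0, 0.1])   # raw query-key scores for three tokens
print(softmax(attn_scores))               # approx. [0.659 0.242 0.099]
```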
Performance and benchmarks

On mainstream text and multimodal benchmarks, MiniMax-01 rivals top-tier models like GPT-4 and Claude-3.5, with especially strong results on long-context evaluations. Notably, MiniMax-Text-01 achieved 100% accuracy on the Needle-In-A-Haystack task with a 4-million-token context. The models also demonstrate minimal performance degradation as input length increases.

MiniMax plans regular updates to expand the models’ capabilities, including code and multimodal enhancements. The company views open-sourcing as a step toward building foundational AI capabilities for the evolving AI agent landscape. With 2025 predicted to be a transformative year for AI agents, the need for sustained memory and efficient inter-agent communication is increasing. MiniMax’s innovations are designed to meet these challenges.

Open to collaboration

MiniMax invites developers and researchers to explore the capabilities of MiniMax-01. Beyond open-sourcing, its team welcomes technical suggestions and collaboration inquiries at [email protected].

With its commitment to cost-effective and scalable AI, MiniMax positions itself as a key player in shaping the AI agent era. The MiniMax-01 series offers an exciting opportunity for developers to push the boundaries of what long-context AI can achieve.


DeepMind’s new inference-time scaling technique improves planning accuracy in LLMs

Inference-time scaling is one of the big themes of artificial intelligence in 2025, and AI labs are attacking it from different angles. In its latest research paper, Google DeepMind introduced the concept of “Mind Evolution,” a technique that optimizes responses of large language models (LLMs) for planning and reasoning tasks.

Inference-time scaling techniques try to improve LLMs’ performance by allowing them to “think” more when generating their answers. Practically, this means that instead of generating its answer in one go, a model is allowed to generate several answers, review and correct its answers, and explore different ways to solve the problem.

Evolving LLM responses

Mind Evolution relies on two key components: search and genetic algorithms. Search algorithms are a common component in many inference-time scaling techniques. They allow LLMs to find the best reasoning path for the optimal solution. Genetic algorithms are inspired by natural selection. They create and evolve a population of candidate solutions to optimize a goal, often referred to as the “fitness function.”

Mind Evolution algorithm (source: arXiv)

Mind Evolution starts by creating a population of candidate solutions expressed in natural language. The solutions are generated by an LLM that has been given a description of the problem along with useful information and instructions. The LLM then evaluates each candidate and improves it if it does not meet the criteria for the solution.

The algorithm then selects the parents for the next generation of solutions by sampling from the existing population, with higher-quality solutions having a greater chance of being selected. It next creates new solutions through crossover (choosing parent pairs and combining their elements to create a new solution) and mutation (making random changes to newly created solutions). It reuses the evaluation method to refine the new solutions. The cycle of evaluation, selection and recombination continues until the algorithm reaches the optimal solution or exhausts a preset number of iterations.

Refinement process for proposed solutions in the Mind Evolution algorithm (source: arXiv)

One of the important parts of Mind Evolution is the evaluation function. Evaluators of inference-time scaling techniques often require the problem to be formalized from natural language into a structured, symbolic representation that can be processed by a solver program. Formalizing a problem can require significant domain expertise and a deep understanding of the problem to identify all the key elements that need to be represented symbolically and how they relate to one another, which limits its applicability.

In Mind Evolution, the fitness function is designed to work with natural-language planning tasks where solutions are expressed in natural language. This allows the system to avoid formalizing problems, as long as a programmatic solution evaluator is available. It also provides textual feedback in addition to a numerical score, which allows the LLM to understand specific issues and make targeted improvements.

“We focus on evolving solutions in natural language spaces instead of formal spaces. This removes the requirement of task formalization, which requires significant effort and expert knowledge for each task instance,” the researchers write.
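The paper works in natural language end to end, but the loop it describes maps onto a classic genetic-algorithm skeleton. Below is a schematic, runnable sketch with every LLM call and the evaluator stubbed out as toy stand-ins; it illustrates the structure, not DeepMind’s implementation.

```python
# Schematic sketch of the Mind Evolution loop; all LLM calls and the fitness
# function are toy stand-ins, not DeepMind's code.
import random

def llm_propose(problem: str) -> str:
    """Stand-in for the LLM generating an initial natural-language plan."""
    return f"plan-{random.randint(0, 999)} for {problem}"

def evaluate(solution: str) -> tuple[float, str]:
    """Stand-in for the programmatic evaluator: numeric score plus the
    textual feedback Mind Evolution feeds back to the LLM."""
    score = random.random()
    return score, "ok" if score >= 0.9 else "feedback: violates a constraint"

def llm_refine(solution: str, feedback: str) -> str:
    """Stand-in for the LLM repairing a plan using textual feedback."""
    return solution + "+fix"

def crossover(a: str, b: str) -> str:
    """Stand-in for the LLM combining elements of two parent plans."""
    return f"merge({a}|{b})"

def mind_evolution(problem: str, pop_size: int = 8, generations: int = 5) -> str:
    population = [llm_propose(problem) for _ in range(pop_size)]
    best_score, best = -1.0, population[0]
    for _ in range(generations):
        scored = []
        for sol in population:
            score, feedback = evaluate(sol)
            if score < 0.9:                  # refine weak candidates using feedback
                sol = llm_refine(sol, feedback)
                score, _ = evaluate(sol)
            scored.append((score, sol))
            if score > best_score:
                best_score, best = score, sol
        if best_score >= 0.99:               # good enough: stop early
            break
        # Fitness-biased parent selection, then crossover; mutation is left
        # implicit in the stochastic stand-ins above.
        scored.sort(key=lambda p: p[0], reverse=True)
        parents = [s for _, s in scored[: pop_size // 2]]
        population = [crossover(*random.sample(parents, 2)) for _ in range(pop_size)]
    return best

print(mind_evolution("plan a 5-city trip"))
```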
Mind Evolution also uses an “island” approach to make sure it explores a diverse set of solutions. At each stage, the algorithm creates separate groups of solutions that evolve within themselves. It then “migrates” optimal solutions from one group to another to combine and create new ones.

Mind Evolution in planning tasks

The researchers tested Mind Evolution against baselines such as 1-pass, where the model generates only one answer; Best-of-N, where the model generates multiple answers and chooses the best one; and Sequential Revisions+, a revision technique where 10 candidate solutions are proposed independently, then revised separately for 80 turns. Sequential Revisions+ is the closest to Mind Evolution, though it does not have the genetic algorithm component to combine the best parts of the discovered solutions. For reference, they also include an additional 1-pass baseline that uses OpenAI o1-preview.

Performance on the Trip Planning benchmark. As the complexity of the task increases, the gap between Mind Evolution and other methods grows (source: arXiv).

The researchers carried out most tests on the fast and affordable Gemini 1.5 Flash. They also explored a two-stage approach, where the Gemini 1.5 Pro model is used when the Flash model can’t address the problem. This two-stage approach provides better cost-efficiency than using the Pro model on every problem instance.

The researchers tested Mind Evolution on several natural-language planning benchmarks for tasks such as trip and meeting planning. Previous research shows that LLMs can’t achieve good performance on these tasks without the aid of formal solvers. For example, Gemini 1.5 Flash and o1-preview achieve success rates of only 5.6% and 11.7% on TravelPlanner, a benchmark that simulates organizing a trip plan based on user preferences and constraints expressed in natural language. Even exploiting Best-of-N over 800 independently generated responses, Gemini 1.5 Flash only achieves 55.6% success on TravelPlanner.

Performance on the TravelPlanner benchmark. As the complexity of the task increases, Mind Evolution remains consistently high-performing while other methods falter (source: arXiv).

In all their tests, Mind Evolution outperformed the baselines by a wide margin, especially as the tasks got more difficult. For example, Mind Evolution achieves a 95% success rate on TravelPlanner. On the Trip Planning benchmark, which involves creating an itinerary of cities to visit with a number of days in each, Mind Evolution achieved 94.1% on the test instances, while other methods reached a maximum success rate of 77%. Interestingly, the gap between Mind Evolution and other techniques increases as the number of cities grows, indicating its ability to handle more complex planning tasks. With the two-stage process, Mind Evolution reached near-perfect success rates on all benchmarks.

Mind Evolution also proved a cost-effective approach for solving natural-language planning problems, using a fraction of the number of tokens used by Sequential Revisions+, the only other technique that comes close to its performance. “Overall, these results demonstrate a clear advantage of an evolutionary strategy that combines…
