VentureBeat

In the future, we will all manage our own AI agents

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

Jensen Huang, CEO of Nvidia, gave an eye-opening keynote talk at CES 2025 last week. It was highly appropriate, as Huang’s favorite subject of artificial intelligence has exploded across the world and Nvidia has, by extension, become one of the most valuable companies in the world. Apple recently passed Nvidia with a market capitalization of $3.58 trillion, compared to Nvidia’s $3.33 trillion.

The company is celebrating the 25th year of its GeForce graphics chip business, and it has been a long time since I did the first interview with Huang back in 1996, when we talked about graphics chips for a “Windows accelerator.” Back then, Nvidia was one of 80 3D graphics chip makers. Now it’s one of around three or so survivors, and it has made a huge pivot from graphics to AI. Huang hasn’t changed much.

For the keynote, Huang announced a video game graphics card, the Nvidia GeForce RTX 50 Series, but there were a dozen AI-focused announcements about how Nvidia is creating the blueprints and platforms to make it easy to train robots for the physical world. In fact, in a feature dubbed DLSS 4, Nvidia is now using AI to improve its graphics chips’ frame rates. And there are technologies like Cosmos, which helps robot developers use synthetic data to train their robots. A few of these Nvidia announcements were among my 13 favorite things at CES.

After the keynote, Huang held a free-wheeling Q&A with the press at the Fontainebleau hotel in Las Vegas. At first, he engaged in a hilarious discussion with the audio-visual team in the room about the sound quality, as he couldn’t hear questions up on stage. So he came down among the press and, after teasing the AV team guy named Sebastian, he answered all of our questions, and he even took a selfie with me. Then he took a bunch of questions from financial analysts.
I was struck by how technical Huang’s command of AI was during the keynote; it reminded me more of a Siggraph technology conference than a keynote speech for consumers at CES. I asked him about that, and you can see his answer below. I’ve included the whole Q&A from all of the press in the room. Here’s an edited transcript of the press Q&A.

Jensen Huang, CEO of Nvidia, at CES 2025 press Q&A.

Question: Last year you defined a new unit of compute, the data center, starting with the building and working down. You’ve done everything all the way up to the system now. Is it time for Nvidia to start thinking about infrastructure, power, and the rest of the pieces that go into that system?

Huang: As a rule, Nvidia only works on things that other people do not, or that we can do singularly better. That’s why we’re not in that many businesses. The reason why we do what we do: if we didn’t build NVLink72, who would have? Who could have? If we didn’t build switches like Spectrum-X, this Ethernet switch that has the benefits of InfiniBand, who could have? Who would have?

We want our company to be relatively small. We’re only 30-some-odd thousand people. We’re still a small company. We want to make sure our resources are highly focused on areas where we can make a unique contribution. We work up and down the supply chain now. We work with the people doing power delivery and power conditioning, cooling and so on. We try to work up and down the supply chain to get people ready for these AI solutions that are coming.

Hyperscale was about 10 kilowatts per rack. Hopper is 40 to 50 to 60 kilowatts per rack. Now Blackwell is about 120 kilowatts per rack. My sense is that that will continue to go up. We want it to go up, because power density is a good thing. We’d rather have computers that are dense and close by than computers that are disaggregated and spread out all over the place. Density is good. We’re going to see that power density go up.
We’ll do a lot better cooling inside and outside the data center, much more sustainable. There’s a whole bunch of work to be done. We try not to do things that we don’t have to.

HP EliteBook Ultra G1i 14-inch notebook, a next-gen AI PC.

Question: You made a lot of announcements about AI PCs last night. Adoption of those hasn’t taken off yet. What’s holding that back? Do you think Nvidia can help change that?

Huang: AI started in the cloud and was created for the cloud. If you look at all of Nvidia’s growth in the last several years, it’s been the cloud, because it takes AI supercomputers to train the models. These models are fairly large. It’s easy to deploy them in the cloud. They’re called endpoints, as you know. We think that there are still designers, software engineers, creatives, and enthusiasts who’d like to use their PCs for all these things.

One challenge is that because AI is in the cloud, and there’s so much energy and movement in the cloud, there are still very few people developing AI for Windows. It turns out that the Windows PC is perfectly adapted to AI. There’s this thing called WSL2. WSL2 is a virtual machine, a second operating system, Linux-based, that sits inside Windows. WSL2 was created to be essentially cloud-native. It supports Docker containers. It has perfect support for CUDA. We’re going to take the AI technology we’re creating for the cloud and now, by making sure that WSL2 can support it, we can bring the cloud down to the PC. I think that’s the right answer. I’m excited about it. All the PC OEMs are excited about it. We’ll get all these PCs ready with Windows and WSL2. All the energy


Microsoft’s new rStar-Math technique upgrades small models to outperform OpenAI’s o1-preview at math problems

Microsoft is doubling down on the potential of small language models (SLMs) with the unveiling of rStar-Math, a new reasoning technique that can be applied to small models to boost their performance on math problems — performance similar to, and in some cases exceeding, that of OpenAI’s o1-preview model.

While still in a research phase — as outlined in a paper published on the preprint site arXiv.org and credited to eight authors at Microsoft, Peking University and Tsinghua University in China — the technique was applied to several different smaller open-source models, including Microsoft’s own Phi-3 mini, Alibaba’s Qwen-1.5B (a 1.5-billion-parameter model) and Qwen-7B (a 7-billion-parameter model). It showed improved performance on all of them, even exceeding OpenAI’s previously most advanced model on MATH, a third-party benchmark of 12,500 word problems covering branches such as geometry and algebra at all levels of difficulty.

Ultimately, according to a post on Hugging Face, the researchers plan to make their code and data available on GitHub at https://github.com/microsoft/rStar, though one of the paper’s authors, Li Lyna Zhang, wrote in the comments on the Hugging Face post that the team is “still undergoing the internal review process for open-source release.” As such, “the repository remains private for now. Please stay tuned!”

Community members expressed enthusiasm, calling the innovations “impressive” and praising the blend of Monte Carlo Tree Search (MCTS) with step-by-step reasoning. One commenter highlighted the simplicity and utility of using Q-values for step scoring, while others speculated on future applications in geometric proofs and symbolic reasoning.
This news follows closely on the heels of the open-sourcing of Microsoft’s Phi-4 model, a smaller 14-billion-parameter AI system now available on Hugging Face under the permissive MIT license. While the Phi-4 release has expanded access to high-performance small models, rStar-Math showcases a specialized approach: using smaller AI systems to achieve state-of-the-art results in mathematical reasoning.

rStar-Math works by using several different models and components to help a target small model ‘self-evolve’

The key to rStar-Math is that it leverages Monte Carlo Tree Search (MCTS), a method that mimics human “deep thinking” by iteratively refining step-by-step solutions to mathematical problems. The researchers used MCTS because it “breaks down complex math problems into simpler single-step generation tasks, reducing the difficulty” for smaller models.

However, they didn’t just apply MCTS as other researchers have done. Instead, in a stroke of brilliance, they also asked the model they trained to always output its “chain-of-thought” reasoning steps as both natural language descriptions and Python code. They mandated that the model include the natural language responses as Python code comments, and only outputs that used Python would be used to train the model.

The researchers also trained a “policy model” to generate math reasoning steps and a process preference model (PPM) to select the most promising steps toward solving the problems, and improved them both over four rounds of “self-evolution,” with each model improving the other. For their starting data, the researchers said they used “747,000 math word problems from publicly available sources,” along with their solutions, but generated new steps for solving them with the two models described above.
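To make the search idea concrete, here is a toy, self-contained sketch of Q-value-based step scoring over reasoning steps. It is not the paper’s algorithm: the step names, the reward rule, and the exhaustive “rollouts” are all invented for illustration. In rStar-Math, candidate steps come from the policy SLM, rewards come from verified Python executions, and the PPM learns from the resulting Q-values.

```python
from itertools import product

# Hypothetical three-step vocabulary and fixed solution depth for the toy.
STEPS = ("expand", "simplify", "substitute")
DEPTH = 4

def rollout_reward(trajectory):
    # Toy verifier: rewards solutions that end "simplify" then "substitute".
    # rStar-Math instead executes the generated Python code to verify steps.
    return 1.0 if trajectory[-2:] == ["simplify", "substitute"] else 0.0

def q_value(trajectory):
    """Score a partial trajectory by the average reward of its completions
    (an exhaustive stand-in for MCTS random rollouts)."""
    remaining = DEPTH - len(trajectory)
    rewards = [rollout_reward(list(trajectory) + list(c))
               for c in product(STEPS, repeat=remaining)]
    return sum(rewards) / len(rewards)

def search():
    """Greedily extend the solution one step at a time, always taking the
    candidate step with the highest Q-value."""
    trajectory = []
    for _ in range(DEPTH):
        candidates = [trajectory + [s] for s in STEPS]
        trajectory = max(candidates, key=q_value)
    return trajectory

print(search())  # → ['expand', 'expand', 'simplify', 'substitute']
```

Early steps tie on Q-value (any prefix can still be completed well), so the first option wins; once a step materially changes the odds of a high-reward completion, the Q-values separate and steer the search — the intuition behind the commenter’s point about Q-values for step scoring.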
Record-breaking results

After four rounds of self-evolution, rStar-Math achieved significant milestones:

• On the MATH benchmark, the accuracy of the Qwen2.5-Math-7B model jumped from 58.8% to 90.0%, outperforming OpenAI o1-preview.

• On the American Invitational Mathematics Examination (AIME), it solved 53.3% of problems, placing among the top 20% of high school competitors.

These results highlight the power of SLMs in handling complex mathematical reasoning, traditionally dominated by larger systems.

Smaller is better?

In recent years, AI innovation has largely been driven by scaling up language models, with ever-larger parameter counts seen as the way to improve performance. Yet the high costs associated with these massive models, from computational resources to energy consumption, have raised questions about scalability. Microsoft is offering an alternative path, focusing on efficiency. The release of rStar-Math further underscores this commitment by demonstrating how SLMs can rival — and in some cases exceed — the capabilities of their larger counterparts.

Microsoft’s dual releases of Phi-4 and the rStar-Math paper suggest that compact, specialized models can provide powerful alternatives to the industry’s largest systems. Moreover, by outperforming larger competitors on key benchmarks, these models challenge the notion that bigger is always better. They open doors for mid-sized organizations and academic researchers to access cutting-edge capabilities without the financial or environmental burden of massive models.


Smoke, reflections and portals: Adobe’s TransPixar takes AI VFX to the next level

A team from Adobe Research and Hong Kong University of Science and Technology (HKUST) has developed an artificial intelligence system that could change how visual effects are made for films, games and interactive media. The technology, called TransPixar, adds a crucial feature to AI-generated videos: the ability to create see-through elements like smoke, reflections and ethereal effects that blend naturally into scenes. Current AI video tools typically can only generate solid images, making TransPixar a significant technical achievement.

“Alpha channels are crucial for visual effects, allowing transparent elements like smoke and reflections to blend seamlessly into scenes,” said Yijun Li, project leader at Adobe Research and one of the paper’s authors. “However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models.”

The breakthrough comes at a critical time, as demand for visual effects continues to surge across the entertainment, advertising and gaming industries. Traditional VFX work often requires painstaking manual effort by artists to create convincing transparent effects.

A demonstration of TransPixar’s transparency effects shows a photorealistic robot rendered with complex reflective surfaces and seamless alpha-channel blending, allowing the image to be integrated into any background. (Credit: Adobe Research)

TransPixar: Bringing transparency to AI visual effects

What makes TransPixar particularly notable is its ability to maintain high quality while working with very limited training data. The researchers accomplished this by developing a novel approach that extends existing video AI models rather than building one from scratch.
“We introduce new tokens for alpha channel generation, reinitializing their positional embeddings, and adding a zero-initialized domain embedding to distinguish them from RGB tokens,” explained Luozhou Wang, lead author and researcher at HKUST. “Using a LoRA-based fine-tuning scheme, we project alpha tokens into the qkv space while preserving RGB quality.”

In demonstrations, the system showed impressive results generating diverse effects from simple text prompts — from swirling storm clouds and magical portals to shattering glass and billowing smoke. The technology can also animate still images with transparency effects, opening up new creative possibilities for artists and designers. The research team has made its code publicly available on GitHub and deployed a demo on Hugging Face, allowing developers and researchers to experiment with the technology.

A red aircraft generated by TransPixar demonstrates the AI system’s ability to create objects with precise transparency effects, shown here against a checkered background that reveals the seamless alpha channel integration — a key technical advancement in AI-generated visual content. (Credit: Adobe)

Transforming VFX workflows for creators big and small

Early testing shows TransPixar could make visual effects production faster and simpler, especially for smaller studios that can’t afford expensive effects work. While the system still needs significant computing power to process longer videos, its potential impact on the creative industry is clear.

The technology matters far beyond technical improvements. As streaming services need more content and virtual production grows, AI-generated transparent effects could change how studios operate. Small teams could create effects that once required major studios, while bigger productions could finish projects much faster. TransPixar could be especially valuable for real-time uses.
Video games, AR applications and live production could create transparent effects instantly — something that today requires hours or days of work.

This advance comes at a key moment for Adobe, as companies like Stability AI and Runway compete to develop professional effects tools. Major studios are already looking to AI to reduce costs, making TransPixar’s timing ideal. The entertainment industry faces three growing challenges: viewers want more content, budgets are tight, and there aren’t enough effects artists. TransPixar offers a solution by making effects faster to create, less expensive, and more consistent in quality. The real question isn’t whether AI will transform visual effects — it’s whether traditional VFX workflows will even exist in five years.
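Wang’s quote packs in three mechanisms: alpha tokens that reuse the RGB positional embeddings, a zero-initialized domain embedding, and a LoRA update confined to the alpha pathway. Here is a toy NumPy sketch of those ideas under stated assumptions — random stand-in weights, tiny dimensions, and a single query projection — not the actual TransPixar code, which operates inside a diffusion-transformer video model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_rgb = 8, 4

# Pretrained RGB video tokens and their positional embeddings (random here).
rgb_tokens = rng.normal(size=(n_rgb, d_model))
pos_emb = rng.normal(size=(n_rgb, d_model))

# New alpha tokens, one per RGB token. Their positional embeddings are
# re-initialized from the RGB positions so each alpha token stays aligned
# with its RGB counterpart.
alpha_tokens = rng.normal(size=(n_rgb, d_model))
alpha_pos_emb = pos_emb.copy()

# Zero-initialized domain embedding: it adds nothing at the start of
# fine-tuning, leaving the pretrained model undisturbed, but can learn a
# shift that marks tokens as "alpha" rather than "RGB".
domain_emb = np.zeros(d_model)

# Extended sequence: RGB tokens followed by the new alpha tokens.
seq = np.concatenate([rgb_tokens + pos_emb,
                      alpha_tokens + alpha_pos_emb + domain_emb])

# LoRA-style low-rank update on the query projection, applied only to the
# alpha tokens so the pretrained RGB pathway is preserved exactly.
rank = 2
Wq = rng.normal(size=(d_model, d_model))
A = rng.normal(size=(d_model, rank)) * 0.01
B = np.zeros((rank, d_model))  # zero-init: the LoRA delta starts as a no-op

q_rgb = seq[:n_rgb] @ Wq                          # frozen pretrained path
q_alpha = seq[n_rgb:] @ Wq + seq[n_rgb:] @ A @ B  # pretrained + LoRA delta
print(q_rgb.shape, q_alpha.shape)  # (4, 8) (4, 8)
```

The zero initializations are the point: at step zero the extended model behaves identically to the pretrained one, and training only has to learn the delta needed for transparency — which is why the approach can work with very limited RGBA data.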


Nvidia’s AI agent play is here with new models, orchestration blueprints

The industry’s push into agentic AI continues, with Nvidia announcing several new services and models to facilitate the creation and deployment of AI agents.

Today, Nvidia launched Nemotron, a family of models based on Meta’s Llama and trained with the company’s techniques and datasets. The company also announced new AI orchestration blueprints to guide AI agents. These latest releases bring Nvidia, a company better known for the hardware that powers the generative AI revolution, to the forefront of agentic AI development.

Nemotron comes in three sizes: Nano, Super and Ultra. It also comes in two flavors: the Llama Nemotron models for language tasks and the Cosmos Nemotron vision models for physical AI projects. The Llama Nemotron Nano has 4B parameters, the Super 49B parameters and the Ultra 253B parameters. All three work best for agentic tasks including “instruction following, chat, function calling, coding and math,” according to the company.

Rev Lebaredian, VP of Omniverse and simulation technology at Nvidia, said in a briefing with reporters that the three sizes are optimized for different Nvidia computing resources: Nano for cost-efficient, low-latency applications on PCs and edge devices; Super for high accuracy and throughput on a single GPU; and Ultra for the highest accuracy at data center scale.

“AI agents are the digital workforce that will work for us and work with us, and so the Nemotron model family is for agentic AI,” said Lebaredian.

The Nemotron models are available as hosted APIs on Hugging Face and Nvidia’s website. Nvidia said enterprises can access the models through its AI Enterprise software platform.

Nvidia is no stranger to foundation models. Last year, it quietly released a version of Nemotron, Llama-3.1-Nemotron-70B-Instruct, that outperformed similar models from OpenAI and Anthropic.
It also unveiled NVLM 1.0, a family of multimodal language models.

More support for agents

AI agents became a big trend in 2024 as enterprises began exploring how to deploy agentic systems in their workflows. Many believe that momentum will continue this year. Companies like Salesforce, ServiceNow, AWS and Microsoft have all called agents the next wave of gen AI in enterprises. AWS has added multi-agent orchestration to Bedrock, while Salesforce released Agentforce 2.0, bringing more agents to its customers.

However, agentic workflows still need other infrastructure to work efficiently. One such piece of infrastructure is orchestration: managing multiple agents across different systems.

Orchestration blueprints

Nvidia has also entered the emerging field of AI orchestration with blueprints that guide agents through specific tasks. The company has partnered with several orchestration companies, including LangChain, LlamaIndex, CrewAI, Daily and Weights & Biases, to build blueprints on Nvidia AI Enterprise. Each orchestration framework has developed its own blueprint with Nvidia. For example, CrewAI created a blueprint for code documentation to ensure code repositories are easy to navigate. LangChain added Nvidia NIM microservices to its structured report generation blueprint to help agents return internet searches in different formats.

“Making multiple agents work together smoothly, or orchestration, is key to deploying agentic AI,” said Lebaredian. “These leading AI orchestration companies are integrating every Nvidia agentic building block, NIM, NeMo and Blueprints, with their open-source agentic orchestration platforms.”

Nvidia’s new PDF-to-podcast blueprint aims to compete with Google’s NotebookLM by converting information from PDFs to audio. Another new blueprint will help build agents to search for and summarize videos. Lebaredian said Blueprints aim to help developers quickly deploy AI agents.
To that end, Nvidia unveiled Nvidia Launchables, a platform that lets developers test, prototype and run blueprints in one click. Orchestration could be one of the bigger stories of 2025 as enterprises grapple with taking multi-agent systems into production.
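The orchestration idea the article describes — multiple agents coordinated by a managing layer that also monitors their work — can be sketched in a library-agnostic way. Every class and function name below is hypothetical; this is not Nvidia’s, LangChain’s, or CrewAI’s API, just the bare pattern those platforms build on.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    """A minimal agent: a name plus a function from input text to output text.
    Real agents would wrap an LLM endpoint and tools."""
    name: str
    run: Callable[[str], str]

@dataclass
class Orchestrator:
    """Routes a task through a fixed pipeline of agents, passing each agent's
    output to the next, and records a trace — the managing/monitoring 'glue'
    an orchestration layer provides."""
    agents: list
    trace: list = field(default_factory=list)

    def execute(self, task: str) -> str:
        result = task
        for agent in self.agents:
            result = agent.run(result)
            self.trace.append((agent.name, result))  # monitoring hook
        return result

# A toy three-agent pipeline: research, write, review.
pipeline = Orchestrator([
    Agent("researcher", lambda t: f"notes on: {t}"),
    Agent("writer", lambda notes: f"report from {notes}"),
    Agent("reviewer", lambda draft: draft + " [approved]"),
])
print(pipeline.execute("agentic AI"))
# → report from notes on: agentic AI [approved]
```

Production orchestration layers add the hard parts this sketch omits — branching and retries, parallel agents, shared state, and observability — which is why the blueprints pair frameworks like LangGraph or CrewAI with hosted model microservices rather than hand-rolled pipelines.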


Nvidia launches agentic AI blueprints to automate work for enterprises

Nvidia and its partners have launched agentic AI blueprints to automate work for enterprises. Developers can now build and deploy custom AI agents that can reason, plan and take action with Nvidia AI blueprints, which include Nvidia NIM microservices, Nvidia NeMo, and agentic AI frameworks from leading providers. The new Nvidia AI blueprints for building agentic AI applications are poised to help enterprises everywhere automate work. Jensen Huang, CEO of Nvidia, made the announcement as part of his CES 2025 opening keynote.

With the blueprints, developers can build and deploy custom AI agents that quickly analyze large quantities of data, including summarizing and distilling real-time insights from video, PDFs and images. CrewAI, Daily, LangChain, LlamaIndex and Weights & Biases are among the leading providers of agentic AI orchestration and management tools that have worked with Nvidia to build blueprints integrating the Nvidia AI Enterprise software platform, including Nvidia NIM microservices and Nvidia NeMo, with their platforms. These five blueprints — comprising a new category of partner blueprints for agentic AI — provide the building blocks for developers to create the next wave of AI applications that will transform every industry.

In addition to the partner blueprints, Nvidia is introducing its own new PDF-to-podcast AI blueprint, and another to build AI agents for video search and summarization. These are joined by four additional Nvidia Omniverse blueprints that make it easier for developers to build simulation-ready digital twins for physical AI. Huang noted in his keynote that every programmer will need agents that create code to keep up.
To help enterprises rapidly take AI agents into production, Accenture is announcing AI Refinery for Industry, built with Nvidia AI Enterprise, including Nvidia NeMo, Nvidia NIM microservices and AI Blueprints. The AI Refinery for Industry solutions — powered by Accenture AI Refinery with Nvidia — can help enterprises rapidly launch agentic AI across fields like automotive, technology, manufacturing and consumer goods.

AI agents are a thing

Agentic AI represents the next wave in the evolution of generative AI. It enables applications to move beyond simple chatbot interactions and tackle complex, multi-step problems through sophisticated reasoning and planning. As explained in Huang’s CES keynote, enterprise AI agents will become a centerpiece of AI factories that generate tokens to create unprecedented intelligence and productivity across industries.

Agentic AI orchestration is a system designed to manage, monitor and coordinate multiple AI agents working together — key to developing reliable enterprise agentic AI systems. The agentic AI orchestration layer from Nvidia partners provides the glue needed for AI agents to work together effectively. The new partner blueprints, now available from agentic AI orchestration leaders, offer integrations with Nvidia AI Enterprise software, including NIM microservices and Nvidia NeMo Retriever, to boost retrieval accuracy and reduce latency. For example:

● CrewAI is using the new Llama 3.3 70B Nvidia NIM microservice and the Nvidia NeMo Retriever embedding NIM microservice for its code documentation blueprint for software development. The blueprint helps ensure code repositories remain comprehensive and easy to navigate.

● Daily’s voice agent blueprint, powered by the company’s open-source Pipecat framework, uses the Nvidia Riva automatic speech recognition and text-to-speech NIM microservices, along with the Llama 3.3 70B NIM microservice, to achieve real-time conversational AI.
● LangChain is adding Llama 3.3 70B Nvidia NIM microservices to its structured report generation blueprint. Built on LangGraph, the blueprint allows users to define a topic and specify an outline to guide an agent in searching the web for relevant information, so it can return a report in the requested format.

● LlamaIndex’s document research assistant for blog creation blueprint harnesses Nvidia NIM microservices and NeMo Retriever to help content creators produce high-quality blogs. It can tap into agent-driven retrieval-augmented generation with NeMo Retriever to automatically research, outline and generate compelling content with source attribution.

● Weights & Biases is adding its W&B Weave capability to the AI blueprint for AI virtual assistants, which features the Llama 3.1 70B NIM microservice. The blueprint can streamline the process of debugging, evaluating, iterating, tracking production performance and collecting human feedback, supporting seamless integration and faster iterations for building and deploying agentic AI applications.

Summarize many complex PDFs while keeping proprietary data secure

The age of agentic AI is here. With trillions of PDF files — from financial reports to technical research papers — generated every year, it’s a constant challenge to stay up to date with information. Nvidia’s PDF-to-podcast AI blueprint provides a recipe developers can use to turn multiple long and complex PDFs into AI-generated readouts that can help professionals, students and researchers efficiently learn about virtually any topic and quickly understand key takeaways. The blueprint — built on NIM microservices and text-to-speech models — allows developers to build applications that extract images, tables and text from PDFs and convert the data into easily digestible audio content, all while keeping data secure.
For example, developers can build AI agents that understand context, identify key points and generate a concise summary as a monologue or a conversation-style podcast, narrated in a natural voice. This offers users an engaging, time-efficient way to absorb information at their desired speed.

Test, prototype and run agentic AI blueprints in one click

Nvidia blueprints are aimed at empowering the world’s more than 25 million software developers to easily integrate AI into their applications. These blueprints simplify the process of building and deploying agentic AI applications, making advanced AI integration more accessible than ever. With just a single click, developers can now build and run the new agentic AI blueprints as Nvidia Launchables. These Launchables provide on-demand access to developer environments with predefined configurations, enabling quick workflow setup. Since they contain all the necessary components for development, Launchables support consistent and reproducible setups without manual configuration or overhead — streamlining the entire development process, from prototyping to deployment. Enterprises can also
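The PDF-to-readout flow described above — extract text, identify key points, render them as a conversation script — can be sketched end to end with stubs. Every function here is hypothetical and deliberately naive (the summarizer just ranks sentences by length); in the blueprint, extraction, summarization and narration would each be an NIM microservice or text-to-speech model call.

```python
def extract_text(pdf_pages):
    """Stub for PDF extraction; real code would parse the PDF binary."""
    return " ".join(pdf_pages)

def key_points(text, limit=3):
    """Naive stand-in for an LLM summarizer: longest sentences first."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:limit]

def to_script(points):
    """Format key points as a two-host, conversation-style script."""
    lines = ["HOST A: Welcome — here are today's takeaways."]
    for i, p in enumerate(points):
        speaker = "HOST B" if i % 2 else "HOST A"
        lines.append(f"{speaker}: {p}.")
    return "\n".join(lines)

pages = ["Revenue grew 12% year over year.",
         "Margins improved. The new product line shipped on time."]
print(to_script(key_points(extract_text(pages))))
```

The structure, not the stubs, is the takeaway: each stage has a narrow text-in/text-out contract, which is what lets a blueprint swap hosted microservices into the same pipeline while the documents themselves stay inside the deployment boundary.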


Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That’s because, even though many LLMs achieve similarly high scores on these benchmarks, understanding which ones to use for specific software development projects and enterprises can be difficult.

A new paper by Yale University and Tsinghua University presents a novel method to test the ability of models to tackle “self-invoking code generation” problems, which require reasoning, generating code, and reusing existing code in problem-solving. Self-invoking code generation is much more similar to realistic programming scenarios than what current benchmarks test, and it provides a better understanding of current LLMs’ ability to solve real-world coding problems.

Self-invoking code generation

Two popular benchmarks used to evaluate the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of handcrafted problems that require the model to write code for simple tasks. However, these benchmarks only cover a subset of the challenges software developers face in the real world. In practical scenarios, software developers don’t just write new code — they must also understand and reuse existing code and create reusable components to solve complex problems.

“The ability to understand and subsequently leverage one’s own generated code, [in other words] self-invoking code generation, plays an important role for LLMs to leverage their reasoning capabilities to code generation that current benchmarks fail to capture,” the researchers write.

To test the ability of LLMs in self-invoking code generation, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets.
Each problem in HumanEval Pro and MBPP Pro builds on an existing example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke that solution to solve a more complex problem.

Self-invoking code generation (source: arXiv)

For example, the original problem can be something simple, like writing a function that replaces all occurrences of a given character in a string with a new character. The extended problem would be to write a function that replaces occurrences of multiple characters in a string with their given replacements. This would require the model to write a new function that invokes the previous function it generated for the simple problem.

“This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, extending beyond the scope of single-problem code generation,” the researchers write.

LLMs perform poorly at self-invoking code generation

The researchers tested HumanEval Pro and MBPP Pro on more than 20 open and private models, including GPT-4o, OpenAI o1-mini and Claude 3.5 Sonnet, as well as the Qwen, DeepSeek and Codestral series. Their findings show a significant disparity between traditional coding benchmarks and self-invoking code generation tasks. “While frontier LLMs excel at generating individual code snippets, they often struggle to effectively [utilize] their own generated code for solving more complex problems,” the researchers write. For example, with a single generation (pass@1), o1-mini achieves 96.2% on HumanEval but only 76.2% on HumanEval Pro.

Another interesting finding is that, while instruction fine-tuning provides significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation.
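The paper’s character-replacement example can be sketched directly. The function names below are illustrative, not taken from the benchmark; the point is the shape of the task — the second function must invoke the first:

```python
def replace_char(s: str, old: str, new: str) -> str:
    """Base problem: replace every occurrence of one character."""
    return s.replace(old, new)

def replace_chars(s: str, replacements: dict) -> str:
    """Extended problem: apply several replacements by invoking the base
    solution once per mapping — the 'self-invoking' step."""
    for old, new in replacements.items():
        s = replace_char(s, old, new)
    return s

print(replace_chars("banana", {"a": "o", "n": "m"}))  # → bomomo
```

Even this tiny extension has a compositional wrinkle: applying replacements sequentially means an earlier substitution can feed a later one (mapping a→n and then n→m would turn the original a’s into m’s too). Reasoning about such interactions between a generated helper and the code that calls it is exactly what self-invoking problems probe and what single-snippet benchmarks miss.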
The researchers note that “current instruction-based fine-tuning approaches are insufficiently effective for more complex self-invoking code generation tasks,” suggesting that we need to rethink how we train base models for coding and reasoning tasks.

To help advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems. The researchers then generate candidate solutions and verify their correctness by executing the code and running test cases on them. The pipeline minimizes the need for manual code review, helping to generate more examples with less effort.

Automatically generating self-invoking code generation problems (source: arXiv)

A complex landscape

This new family of benchmarks comes at a time when old coding benchmarks are quickly being conquered by frontier models. Current frontier models such as GPT-4o, o1, and Claude 3.5 Sonnet already have very high scores on HumanEval and MBPP, as well as their more advanced versions, HumanEval+ and MBPP+.

At the same time, there are more complex benchmarks such as SWE-Bench, which evaluate models’ capabilities in end-to-end software engineering tasks that require a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance. For example, OpenAI o1 is inconsistent on SWE-Bench Verified.

“Surprising find: OpenAI’s O1 – reasoning-high only hit 30% on SWE-Bench Verified – far below their 48.9% claim. Even more interesting: Claude achieves 53% in the same framework. Something’s off with O1’s ‘enhanced reasoning’…” — Alejandro Cuadron (@Alex_Cuadron), January 5, 2025

Self-invoking code generation sits somewhere between the simple benchmarks and SWE-Bench.
It helps evaluate a very specific type of reasoning ability: using existing code within a module to tackle complex problems. Self-invoking code benchmarks can prove to be a very practical proxy for the usefulness of LLMs in real-world settings, where human programmers are in control and AI copilots help them accomplish specific coding tasks in the software development process. “HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and to inspire future LLM development by shedding light on current model shortcomings and encouraging innovation in training methodologies,” the researchers write.

Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

Listen to your technology users — they have led to the most disruptive innovations in history

In 1971, Advanced Research Projects Agency Network (ARPANET), the precursor to the modern internet, had about 1,000 users. The @ sign was an obscure symbol. Then, engineer Ray Tomlinson changed everything by creating a system to send messages to other computers on the ARPANET network, using the @ sign to indicate who each message was for. Email was born. One of the biggest inventions of the digital era wasn’t created by a company looking for a product to sell. It was cooked up by a user with a problem to solve. Tomlinson said he didn’t even fully realize what a big deal his invention was until almost 25 years later, in 1993. Users were also behind the invention of the dishwasher (a socialite looking to make dinner party cleanup easier), the telephone (an engineer who wanted to talk to his wife upstairs from his basement lab), the plastic contact lens (an optometrist tired of wearing thick, heavy glasses) and even modern tech companies like Airbnb (the founders rented an air mattress in their living room to help make rent on their San Francisco apartment). Users are a major source of disruptive innovation, yet they are often overlooked. We recently published an analysis of 60 cases of disruptive innovation in the Journal of Product Innovation Management, from LASIK surgery to electric power tools. Our goal was to understand where disruptive innovation originates. We were surprised to find that nearly half the innovations we identified came from users, rather than producers.

Combining ‘need knowledge’ and ‘solution knowledge’

Users have a unique, close-up view of a problem — and know where current solutions fall short. Technical experts and existing producers have a clearer sense of what potential solutions could look like, but they aren’t as close to the need.
By combining users’ “need knowledge” with their own “solution knowledge,” companies can unlock a wealth of opportunities for growth and competitive advantage. Disruptive ideas for B2C products and services often arise from individual consumers looking to meet their own needs. Disruptive innovation in the B2B space can come from professionals looking for new tools or systems to do their jobs more effectively. For instance, physician John H. Gibbon and his wife Mary developed the heart-lung machine and used it to perform one of the first successful open-heart surgeries. Our study found that products offering dramatically new functionalities are more likely to be developed by users and often arise in times when customer needs are changing rapidly. On the other hand, innovations with high technological novelty are more likely to be generated by producers, who have the necessary technical expertise. These tend to originate in moments of rapid technological change. Our research calls into question existing thinking about disruptive innovation. The narrative going back to businessman Clayton Christensen has been that disruption comes from startups and other new players in a market, while large incumbents generally lag behind. Users are seen as part of the problem: When your customers keep asking for the same thing over and over, there isn’t much room to innovate. But our research shows that there isn’t just one template for disruptive innovation, and users can be a source of ingenious ideas rather than a barrier. While companies often look to users for input on how to tweak existing products and innovate around the margins, we found that they can also generate disruptive, game-changing innovation.

Tips to support disruptive innovation

So, how can your company surface truly disruptive innovation from users? First, create a culture of open innovation that values insights from outside the organization.
While the technical geniuses in your R&D department are experts in how to build something new, they aren’t the only authorities on what you should build. Our research suggests that it’s especially important to seek out user-generated disruption at times when customer needs are changing rapidly. Talk to your customers and create channels for dialogue and engagement. Most companies regularly survey users and conduct focus groups. But to identify truly disruptive ideas, you need to go beyond reactions to existing products and plumb unmet needs and pain points. Customer complaints also offer insight into how existing solutions fall short. AI tools make it easier to monitor user communities online and analyze customer feedback, reviews and complaints. Keep a finger on the pulse of social media and online user communities where people share innovative ways to adapt existing products and wish lists for new functionalities. Users also congregate offline: At sporting events, you may find athletes DIYing custom solutions to unmet needs. Mountain bikes were invented in the 1970s by riders who cobbled together custom bikes, called clunkers, to explore beautiful off-road landscapes in California. Focus on lead users who are ahead of the trends. Lead users are often the first to see rising consumer needs that will become dominant in the future, and they stand to benefit from new solutions. Research shows that lead user ideas are much more valuable commercially than those from the average customer. However, take their input with a grain of salt, as lead users sometimes value niche functionalities that mainstream customers won’t care about. You can also look for lead users embedded within your organization — for instance, employees who work for a car company because they are auto aficionados. Lastly, explore co-creation initiatives that foster direct collaboration with user innovators.
For instance, run a contest where customers submit ideas for new products or features, some of which could turn out to be truly disruptive. Or sponsor hackathons that bring together users with needs and technical experts to design solutions. Companies are always looking for an innovation edge, but they often miss one of the most powerful sources of groundbreaking ideas — their own users. By tapping into the vast pool of existing users and customers, you can harness their creativity and expertise to fuel truly disruptive innovation. Christina Raasch is Professor


Nvidia using GenAI to integrate Omniverse virtual creations into physical AI apps

Nvidia unveiled generative AI models and blueprints that expand Nvidia Omniverse integration further into physical AI applications such as robotics, autonomous vehicles and vision AI. As part of the CES 2025 opening keynote by Nvidia CEO Jensen Huang, the company said global leaders in software development and professional services are using Omniverse to develop new products and services that will accelerate the next era of industrial AI. Accenture, Altair, Ansys, Cadence, Foretellix, Microsoft and Neural Concept are among the first to integrate Omniverse into their next-generation software products and professional services. Siemens, a leader in industrial automation, announced today at the CES trade show the availability of Teamcenter Digital Reality Viewer — the first Siemens Xcelerator application powered by Nvidia Omniverse libraries. “Physical AI will revolutionize the $50 trillion manufacturing and logistics industries. Everything that moves — from cars and trucks to factories and warehouses — will be robotic and embodied by AI,” said Huang, in a statement. “Nvidia’s Omniverse digital twin operating system and Cosmos physical AI serve as the foundational libraries for digitalizing the world’s physical industries.”

New models and frameworks accelerate world-building for physical AI

Creating 3D worlds for physical AI simulation requires three steps: world-building, labeling the world with physical attributes and making it photoreal. Nvidia offers generative AI models that accelerate each step. The USD Code and USD Search Nvidia NIM microservices are now generally available. They let developers use text prompts to generate or search for OpenUSD assets.
A new Nvidia Edify SimReady generative AI model unveiled today can automatically label existing 3D assets with attributes like physics or materials, enabling developers to process 1,000 3D objects in minutes instead of over 40 hours manually. Nvidia Omniverse, paired with new Nvidia Cosmos world foundation models, creates a synthetic data multiplication engine that lets developers easily generate massive amounts of controllable, photoreal synthetic data. Developers can compose 3D scenarios in Omniverse and render images or videos as outputs. These can then be used with text prompts to condition Cosmos models to generate countless synthetic virtual environments for physical AI training.

Nvidia Omniverse blueprints speed up industrial, robotic workflows

Cosmos generates synthetic driving data.

During the CES keynote, Nvidia also announced four new blueprints that make it easier for developers to build Universal Scene Description (OpenUSD)-based Omniverse digital twins for physical AI. The blueprints are:

- Mega, powered by Omniverse Sensor RTX APIs, for developing and testing robot fleets at scale in an industrial factory or warehouse digital twin before deployment in real-world facilities
- Autonomous Vehicle (AV) Simulation, also powered by Omniverse Sensor RTX APIs, which lets AV developers replay driving data, generate new ground-truth data and perform closed-loop testing to accelerate their development pipelines
- Omniverse spatial streaming to Apple Vision Pro, which helps developers create applications for immersive streaming of large-scale industrial digital twins to Apple Vision Pro
- Real-time digital twins for computer-aided engineering (CAE), a reference workflow built on Nvidia CUDA-X acceleration, physics AI and Omniverse libraries that enables real-time physics visualization

New, free “Learn OpenUSD” courses are also now available to help developers build OpenUSD-based worlds faster than ever.
Market leaders supercharge industrial AI using Nvidia Omniverse

Global leaders in software development and professional services are using Omniverse to develop new products and services that are poised to accelerate the next era of industrial AI. Building on its adoption of Omniverse libraries in its Reality Digital Twin data center digital twin platform, Cadence, a leader in electronic systems design, announced further integration of Omniverse into Allegro, its leading electronic computer-aided design application used by the world’s largest semiconductor companies. Altair, a leader in computational intelligence, is adopting the Omniverse blueprint for real-time CAE digital twins for interactive computational fluid dynamics (CFD). Ansys is adopting Omniverse into Ansys Fluent, a leading CAE application. And Neural Concept is integrating Omniverse libraries into its next-generation software products, enabling real-time CFD and enhancing engineering workflows. Accenture, a leading global professional services company, is using Mega to help German supply chain solutions leader Kion build next-generation autonomous warehouses and robotic fleets for its network of global warehousing and distribution customers. AV toolchain provider Foretellix, a leader in data-driven autonomy development, is using the AV simulation blueprint to enable full 3D sensor simulation for optimized AV testing and validation. Research organization MITRE is also deploying the blueprint, in collaboration with the University of Michigan’s Mcity testing facility, to create an industry-wide AV validation platform. Katana Studio is using the Omniverse spatial streaming workflow to create custom car configurators for Nissan and Volkswagen, allowing them to design and review car models in an immersive experience while improving the customer decision-making process.
Innoactive, an XR streaming platform for enterprises, leveraged the workflow to add platform support for spatial streaming to Apple Vision Pro. The solution enables Volkswagen Group to conduct design and engineering project reviews at human-eye resolution. Innoactive also collaborated with Syntegon, a provider of processing and packaging technology solutions for pharmaceutical production, to enable Syntegon’s customers to walk through and review digital twins of custom installations before they are built.


Researchers improved AI agent performance on unfamiliar tasks using ‘Dungeons and Dragons’

Organizations interested in deploying AI agents must first fine-tune them, especially in workflows that often feel rote. While some organizations want agents that only perform one kind of task in one workflow, sometimes agents need to be brought into new environments with the hope that they adapt. Researchers from the Beijing University of Posts and Telecommunications have unveiled a new method, AgentRefine. It teaches agents to self-correct, leading to more generalized and adaptive AI agents. The researchers said that current tuning methods limit agents to the same tasks as their training dataset, or “held-in” tasks, and do not perform as well on “held-out,” or new, environments. By following only the rules laid out in the training data, agents trained with these frameworks would have trouble “learning” from their mistakes and cannot be made into general agents brought into new workflows. To combat that limitation, AgentRefine aims to create more generalized agent training datasets that enable the model to learn from mistakes and fit into new workflows. In a new paper, the researchers said that AgentRefine’s goal is “to develop generalized agent-tuning data and establish the correlation between agent generalization and self-refinement.” If agents self-correct, they will not perpetuate the errors they learned or bring those same mistakes to other environments they’re deployed in. “We find that agent-tuning on the self-refinement data enhances the agent to explore more viable actions while meeting bad situations, thereby resulting in better generalization to new agent environments,” the researchers write.

AI agent training inspired by D&D

Taking their cue from the tabletop roleplaying game Dungeons & Dragons, the researchers created personas, scripts for the agent to follow and challenges.
And yes, there is a Dungeon Master (DM). They divided data construction for AgentRefine into three areas: script generation, trajectory generation and verification. In script generation, the model creates a script, or guide, with information on the environment, tasks and actions personas can take. (The researchers tested AgentRefine using Llama-3-8B-Instruct, Llama-3-70B-Instruct, Mistral-7B-Instruct-v0.3, GPT-4o-mini and GPT-4o.) The model then generates agent data that contains errors, acting as both a DM and a player during the trajectory stage: It assesses the actions it can take and then checks whether those actions contain errors. The last stage, verification, checks the script and trajectory, allowing the agents it trains the potential to self-correct.

Better and more diverse task abilities

The researchers found that agents trained using the AgentRefine method and dataset performed better on diverse tasks and adapted to new scenarios. These agents self-correct more often to redirect their actions and decision-making to avoid errors, and they become more robust in the process. In particular, AgentRefine improved the performance of all the models on held-out tasks. Enterprises must make agents more task-adaptable so that they don’t just repeat what they’ve learned and can become better decision-makers. Orchestrator agents not only “direct traffic” for multiple agents but also determine whether agents have completed tasks based on user requests. OpenAI’s o3 offers “program synthesis,” which could improve task adaptability. Other orchestration and training frameworks, like Magentic-One from Microsoft, set actions for supervisor agents to learn when to move tasks to different agents.
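The propose-verify-refine cycle that this data construction pipeline revolves around can be pictured with a small schematic sketch. Everything here, from the function names to the feedback format, is our own stand-in for illustration, not the paper’s implementation:

```python
from typing import Callable, Optional

def refine_until_valid(
    propose: Callable[[str], str],           # model proposing an action trajectory
    verify: Callable[[str], Optional[str]],  # returns an error message, or None if valid
    task: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Propose a trajectory, verify it, and feed any error back into the
    next proposal -- the self-refinement pattern the training data is
    built around. Returns a verified trajectory, or None on failure."""
    feedback = ""
    for _ in range(max_rounds):
        trajectory = propose(task + feedback)
        error = verify(trajectory)
        if error is None:
            return trajectory
        feedback = f"\n[previous attempt failed: {error}]"
    return None
```

Training on traces of this loop, mistakes and corrections included, rather than only on flawless trajectories, is what the researchers credit for the improved generalization to held-out environments.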


Global VC investments rose 5.4% to $368.5B in 2024, but deals fell 17%

Global venture capital investments rose to $368.5 billion in 2024, up 5.4% from $349.4 billion a year earlier, according to the first look at the Q4 2024 Pitchbook-NVCA Venture Monitor report. But the number of global deals in 2024 fell 17% to 35,686 from 43,320 a year earlier in 2023. AI deals as a percentage of all deals rose for the year, as you can see in the chart below. The 2024 global deal value is down 50.9% from $751.5 billion in the peak year of 2021, and deal count is down 37% from 57,068 in 2021. AI deals are a big part of the picture now. There were 8,343 global AI deals in 2024, down 3.6% from 8,661 in 2023 and down 16.6% from 10,007 in 2021. AI’s share of all global VC deals is at a new high. The value of those global AI deals in 2024 was $131.5 billion, up 52% from $86.3 billion in 2023 and down 6% from $140.2 billion in 2021. AI and machine learning were 35.7% of global deal value in 2024, up from 24.7% in 2023. And AI and machine learning were 23.4% of the global deal count in 2024, up from 20% in 2023. In 2021, AI was 18.7% of global deal value and 17.5% of global deal count.

Q4 global numbers

On the global level in Q4, Asia Pacific’s venture market has struggled through the last few years, something that didn’t change in 2024, Pitchbook lead VC analyst Kyle Stanford said. Compared with Europe and the U.S., the amount of dry powder built up within the various markets across APAC was much smaller, further pressuring dealmaking over the past year. China, which has driven around half of the annual deal activity for APAC, has seen a material decline in activity, due to both economic challenges within the country and tensions with the U.S. government, which have curtailed activity by U.S.-headquartered firms. Just 20.4% of deal count occurred in Asia, the lowest proportion in the past decade.
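As a quick sanity check, the year-over-year changes quoted above can be recomputed from the reported figures. (Small discrepancies, such as 5.5% here versus the article’s 5.4%, come from the published inputs themselves being rounded.)

```python
def yoy_change(current: float, prior: float) -> float:
    """Percentage change from the prior period, rounded to one decimal place."""
    return round((current - prior) / prior * 100, 1)

# Global VC deal value: $368.5B in 2024 vs. $349.4B in 2023
print(yoy_change(368.5, 349.4))    # roughly +5.5%
# Global deal count: 35,686 in 2024 vs. 43,320 in 2023
print(yoy_change(35_686, 43_320))  # roughly -17.6%
# Global AI deal value: $131.5B in 2024 vs. $86.3B in 2023
print(yoy_change(131.5, 86.3))     # roughly +52.4%
```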
Globally, AI has continued to dominate the headlines and the focus of investors, despite some noting that the investment activity is not sustainable long term. Whether or not that is true is beside the point in the current moment. Just over half of all VC invested globally during Q4 went to an AI-focused company. It’s true that amount was heavily influenced by the likes of OpenAI, Databricks, xAI and other well-known companies raising for share buybacks and investment into chips and computing energy needs, but the most important factor is the level of capital availability for AI compared with other sectors, Stanford said. The proportion of total deals going to AI companies has consistently increased over the past couple of years as large corporates and investors alike move to harness the expected efficiencies of the next tech wave, he said.

Global VC investments and deal counts by year.

“VC-backed exits have not been strong historically for APAC, though many markets are still too young to develop a healthy exit environment,” he said. “The lack of exits across many of the regions has kept many foreign investors wary of increased activity during the market slowdown. Japan has been an outlier in terms of count, as many IPOs within the country have helped drive returns to investors. In 2024, 19% of the global VC-backed exits originated in Asia-based companies.” Fundraising has been slow globally, as new commitments dropped just over 20% year over year. The lack of exits has had a large impact on fundraising for Asia, as LPs have been less inclined to re-up commitments at this time. 2024 marked the lowest year for new commitments since 2018, and was the lowest year for closed funds in the market in the past decade. North America and Europe similarly struggled to secure new commitments to venture funds.

Q4 U.S. deals

U.S. dealmaking remained relatively robust in the fourth quarter of 2024 from a count perspective, increasing slightly by 3.7% compared to a year earlier, Pitchbook and the NVCA said. In the quarter, AI deals accounted for nearly half (46.4%) of total U.S. deal value. Stanford said this seems counterintuitive to the narrative in the market over the past few years, but it is indicative of a holdover of certain mechanics of venture from a few years ago. “What has happened is that the excess of dry powder from the high fundraising years of 2021 and 2022 have kept many investors active in the market despite the lack of returns,” Stanford said. “With the slow fundraising years of 2023 and 2024, we should likely see this relative robustness start to deteriorate as funds run through their available capital and aren’t able to raise a subsequent fund.”

AI deals by year have been rising sharply.

Artificial intelligence continues to be the story of the market, and it drove a near majority of dollars for VC in 2024, he said. OpenAI, xAI, Anthropic and others have become synonymous with outsized deals in venture, and they seemingly operate in a different funding environment than most VC-backed companies, which continue to struggle with lower capital availability, Stanford said. But the lack of exits remains the story of the venture market, even as the outlook is more hopeful, he said. Just $149.2 billion in exit value was created during 2024, largely coming from a handful of IPOs. Unicorns, which hold around two-thirds of the U.S. VC market value, have held tight as private companies, creating pressure on investors and limited partners with the lack of distributions. Mergers and acquisitions were also “silent in 2024,” with few large deals to note, Stanford said. A more acquisition-friendly environment in 2025 could set the stage for a renewed M&A market, especially if a soft landing for the economy can be fully engineered, he said.
In the U.S., fundraising was dominated by large, established firms. Thirty firms accounted for more than 68% of total
