VentureBeat

Nvidia advances robot learning and humanoid development with AI and simulation tools

Nvidia revealed new AI and simulation tools that will advance robot learning and humanoid development. The world’s biggest tech company by valuation (worth $3.432 trillion) said the tools will enable robotics developers to greatly accelerate their work on AI-enabled robots. They were revealed this week at the Conference on Robot Learning (CoRL) in Munich, Germany.

The lineup includes the general availability of the Nvidia Isaac Lab robot learning framework; six new humanoid robot learning workflows for Project GR00T, an initiative to accelerate humanoid robot development; and new world-model development tools for video data curation and processing, including the Nvidia Cosmos tokenizer and Nvidia NeMo Curator for video processing.

The open-source Cosmos tokenizer provides robotics developers with superior visual tokenization by breaking down images and videos into high-quality tokens with exceptionally high compression rates. It runs up to 12 times faster than current tokenizers, while NeMo Curator provides video curation up to seven times faster than unoptimized pipelines.

Also timed with CoRL, Nvidia released 23 papers, presented nine workshops related to robot learning and published training and workflow guides for developers. Further, Hugging Face and Nvidia announced they’re collaborating to accelerate open-source robotics research with LeRobot, Nvidia Isaac Lab and Nvidia Jetson for the developer community.

Accelerating robot development with Isaac Lab

(Image: Nvidia Isaac Lab and Project GR00T models)

Nvidia Isaac Lab is an open-source robot learning framework built on Nvidia Omniverse, a platform for developing OpenUSD applications for industrial digitalization and physical AI simulation. Developers can use Isaac Lab to train robot policies at scale. This unified robot learning framework applies to any embodiment — from humanoids to quadrupeds and collaborative robots — to handle increasingly complex movements and interactions.

Leading commercial robot makers, robotics application developers and robotics research entities around the world are adopting Isaac Lab, including 1X, Agility Robotics, The AI Institute, Berkeley Humanoid, Boston Dynamics, Field AI, Fourier, Galbot, Mentee Robotics, Skild AI, Swiss-Mile, Unitree Robotics and Xpeng Robotics.

Project GR00T: Foundations for general-purpose humanoid robots

The humanoids are coming. Building advanced humanoids is extremely difficult, demanding multilayer technological and interdisciplinary approaches to make the robots perceive, move and learn skills effectively for human-robot and robot-environment interactions. Project GR00T is an initiative to develop accelerated libraries, foundation models and data pipelines to accelerate the global humanoid robot developer ecosystem. Six new Project GR00T workflows provide humanoid developers with blueprints to realize the most challenging humanoid robot capabilities, including GR00T-Gen for building generative AI-powered, OpenUSD-based 3D environments.

“Humanoid robots are the next wave of embodied AI,” said Jim Fan, senior research manager of embodied AI at Nvidia, in a statement.
“Nvidia research and engineering teams are collaborating across the company and our developer ecosystem to build Project GR00T to help advance the progress and development of global humanoid robot developers.”

Today, robot developers are building world models — AI representations of the world that can predict how objects and environments respond to a robot’s actions. Building these world models is incredibly compute- and data-intensive, with models requiring thousands of hours of real-world, curated image or video data.

Nvidia Cosmos tokenizers provide efficient, high-quality encoding and decoding to simplify the development of these world models. They set a new standard for minimal distortion and temporal instability, enabling high-quality video and image reconstructions. Providing high-quality compression and up to 12 times faster visual reconstruction, the Cosmos tokenizer paves the path for scalable, robust and efficient development of generative applications across a broad spectrum of visual domains.

1X, a humanoid robot company, has updated the 1X World Model Challenge dataset to use the Cosmos tokenizer. “Nvidia Cosmos tokenizer achieves really high temporal and spatial compression of our data while still retaining visual fidelity,” said Eric Jang, vice president of AI at 1X Technologies, in a statement. “This allows us to train world models with long horizon video generation in an even more compute-efficient manner.” Other humanoid and general-purpose robot developers, including Xpeng Robotics and Hillbot, are developing with the Nvidia Cosmos tokenizer to manage high-resolution images and videos.

NeMo Curator

NeMo Curator now includes a video processing pipeline. This enables robot developers to improve their world-model accuracy by processing large-scale text, image and video data. Curating video data poses challenges due to its massive size, requiring scalable pipelines and efficient orchestration for load balancing across GPUs. Additionally, models for filtering, captioning and embedding need optimization to maximize throughput. NeMo Curator overcomes these challenges by streamlining data curation with automatic pipeline orchestration, reducing processing time significantly. It supports linear scaling across multi-node, multi-GPU systems, efficiently handling over 100 petabytes of data. This simplifies AI development, reduces costs and accelerates time to market.

Availability

Nvidia Isaac Lab 1.2 is available now and is open source on GitHub. The Nvidia Cosmos tokenizer is available now on GitHub and Hugging Face. NeMo Curator for video processing will be available at the end of the month. The new Nvidia Project GR00T workflows are coming soon to help robot companies build humanoid robot capabilities with greater ease. For researchers and developers learning to use Isaac Lab, new getting-started developer guides and tutorials are now available, including an Isaac Gym to Isaac Lab migration guide.
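The article does not include code, but the core idea behind a visual tokenizer, mapping continuous image patches to discrete codebook indices so a world model can operate on compact tokens instead of raw pixels, can be illustrated with a minimal generic sketch. This is not Nvidia’s Cosmos implementation; the patch size, codebook size and random data below are arbitrary assumptions for illustration.

```python
import numpy as np

# Minimal illustration of discrete visual tokenization (not the Cosmos tokenizer):
# split an image into patches, then map each patch to the index of its nearest
# codebook vector. A world model would then consume these token ids.

rng = np.random.default_rng(0)

PATCH = 8            # assumed patch size (pixels)
CODEBOOK_SIZE = 512  # assumed number of discrete visual tokens

image = rng.random((64, 64, 3))                             # stand-in for a real frame
codebook = rng.random((CODEBOOK_SIZE, PATCH * PATCH * 3))   # learned in practice

def tokenize(img: np.ndarray) -> np.ndarray:
    """Return a grid of token ids, one per PATCH x PATCH patch."""
    h, w, _ = img.shape
    ids = np.empty((h // PATCH, w // PATCH), dtype=np.int64)
    for i in range(0, h, PATCH):
        for j in range(0, w, PATCH):
            patch = img[i:i + PATCH, j:j + PATCH].reshape(-1)
            # nearest codebook entry by Euclidean distance
            ids[i // PATCH, j // PATCH] = np.argmin(
                np.linalg.norm(codebook - patch, axis=1)
            )
    return ids

tokens = tokenize(image)
print(tokens.shape)                                   # (8, 8): 64 tokens per frame
print("compression:", image.size / tokens.size, "values per token")
```

In a real tokenizer the codebook and encoder are learned jointly and a decoder reconstructs frames from the token grid; the “exceptionally high compression rates” the article cites refer to exactly this many-pixels-per-token mapping.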


Meet the startup that just won the Pentagon’s first AI defense contract

The Department of Defense has awarded its first generative AI defense contract to Jericho Security, marking a strategic shift in military cybersecurity. The $1.8 million Small Business Technology Transfer (STTR) Phase II contract, announced through AFWERX, tasks the New York-based startup with developing advanced cybersecurity solutions for the Department of the Air Force. “This is one of the first generative AI contracts awarded in defense, marking a major milestone in how seriously our military is addressing AI-based threats,” Sage Wohns, CEO of Jericho Security, told VentureBeat in an exclusive interview.

How AI-powered phishing attacks target military personnel

The company’s approach centers on simulating complex, multi-channel phishing attacks that mirror real-world scenarios. “In today’s landscape, phishing campaigns aren’t limited to just emails—they involve coordinated attempts across multiple platforms like text messages, phone calls, and even video calls,” Wohns explained, describing attacks that chain together multiple forms of communication to deceive targets. What sets Jericho’s technology apart is its focus on human vulnerability — widely considered the weakest link in cybersecurity. The company claims that up to 95% of data breaches stem from human error. Their platform creates personalized security training programs based on individual risk profiles, using generative AI to simulate sophisticated attacks, including deepfake impersonations and AI-generated malware.

Deepfake attacks and drone pilot targeting: The new frontier of military cybersecurity

The contract comes at a critical time, as military personnel face increasingly targeted attacks. “There was a highly publicized spear-phishing attack targeting Air Force drone pilots using fake user manuals,” Wohns revealed, highlighting how the company helped evaluate vulnerabilities through attack simulation and specialized training. For a young company competing in the crowded cybersecurity market, landing a Defense Department contract represents a major validation. The deal positions Jericho Security to expand beyond its commercial roots into the lucrative government sector, where cybersecurity spending continues to grow amid escalating threats. Military contracts often require stringent security measures. Wohns emphasized that Jericho maintains “military-grade cybersecurity standards,” including end-to-end encryption and isolated secure environments for handling sensitive military data.

The next generation of AI defense: Predator and prey model

Unlike traditional cybersecurity approaches that react to known threats, Jericho Security employs what Wohns calls a “predator and prey” model. “We started with attack simulation, giving us a continuous stream of real-time data to enhance both offensive and defensive capabilities,” he said. This dual approach allows their AI systems to evolve alongside emerging threats rather than merely responding to them. The Air Force contract, executed through AFWERX — the innovation arm of the Department of the Air Force — is part of a broader initiative to accelerate military adoption of private sector technology. AFWERX has awarded over 6,200 contracts worth more than $4.7 billion since 2019, working to strengthen the U.S. defense industrial base and speed up technology deployment.


Meta unveils AI tools to give robots a human touch in physical world

Meta made several major announcements for robotics and embodied AI systems this week, including the release of benchmarks and artifacts for better understanding and interacting with the physical world. Sparsh, Digit 360 and Digit Plexus, the three research artifacts released by Meta, focus on touch perception, robot dexterity and human-robot interaction. Meta is also releasing PARTNR, a new benchmark for evaluating planning and reasoning in human-robot collaboration.

The release comes as advances in foundation models have renewed interest in robotics, and AI companies are gradually expanding their race from the digital realm to the physical world. There is renewed hope in the industry that, with the help of foundation models such as large language models (LLMs) and vision-language models (VLMs), robots can accomplish more complex tasks that require reasoning and planning.

Tactile perception

Sparsh, which was created in collaboration with the University of Washington and Carnegie Mellon University, is a family of encoder models for vision-based tactile sensing. It is meant to provide robots with touch perception capabilities. Touch perception is crucial for robotics tasks, such as determining how much pressure can be applied to a certain object to avoid damaging it. The classic approach to incorporating vision-based tactile sensors in robot tasks is to use labeled data to train custom models that can predict useful states. This approach does not generalize across different sensors and tasks.

Meta Sparsh architecture (Credit: Meta)

Meta describes Sparsh as a general-purpose model that can be applied to different types of vision-based tactile sensors and various tasks. To overcome the challenges faced by previous generations of touch perception models, the researchers trained Sparsh models through self-supervised learning (SSL), which obviates the need for labeled data. The model has been trained on more than 460,000 tactile images consolidated from different datasets. According to the researchers’ experiments, Sparsh gains an average 95.1% improvement over task- and sensor-specific end-to-end models under a limited labeled-data budget. The researchers have created different versions of Sparsh based on various architectures, including Meta’s I-JEPA and DINO models.

Touch sensors

In addition to leveraging existing data, Meta is also releasing hardware to collect rich tactile information from the physical world. Digit 360 is an artificial finger-shaped tactile sensor with more than 18 sensing features. The sensor has over 8 million taxels for capturing omnidirectional and granular deformations on the fingertip surface. Digit 360 captures various sensing modalities to provide a richer understanding of the environment and object interactions. Digit 360 also has on-device AI models to reduce reliance on cloud-based servers. This enables it to process information locally and respond to touch with minimal latency, similar to the reflex arc in humans and animals.

Meta Digit 360 (Credit: Meta)

“Beyond advancing robot dexterity, this breakthrough sensor has significant potential applications from medicine and prosthetics to virtual reality and telepresence,” Meta researchers write. Meta is publicly releasing the code and designs for Digit 360 to stimulate community-driven research and innovation in touch perception.
But as with its release of open-source models, Meta has much to gain from the potential adoption of its hardware and models. The researchers believe that the information captured by Digit 360 can help in the development of more realistic virtual environments, which could prove important for Meta’s metaverse projects in the future.

Meta is also releasing Digit Plexus, a hardware-software platform that aims to facilitate the development of robotic applications. Digit Plexus can integrate various fingertip and skin tactile sensors onto a single robot hand, encode the tactile data collected from the sensors, and transmit it to a host computer through a single cable. Meta is releasing the code and design of Digit Plexus to enable researchers to build on the platform and advance robot dexterity research. Meta will manufacture Digit 360 in partnership with tactile sensor manufacturer GelSight Inc. It will also partner with South Korean robotics company Wonik Robotics to develop a fully integrated robotic hand with tactile sensors on the Digit Plexus platform.

Evaluating human-robot collaboration

Meta is also releasing Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR), a benchmark for evaluating the effectiveness of AI models when collaborating with humans on household tasks. PARTNR is built on top of Habitat, Meta’s simulated environment. It includes 100,000 natural-language tasks in 60 houses and involves more than 5,800 unique objects. The benchmark is designed to evaluate the performance of LLMs and VLMs in following instructions from humans.

Meta’s new benchmark joins a growing number of projects that are exploring the use of LLMs and VLMs in robotics and embodied AI settings. In the past year, these models have shown great promise to serve as planning and reasoning modules for robots in complex tasks. Startups such as Figure and Covariant have developed prototypes that use foundation models for planning. At the same time, AI labs are working on creating better foundation models for robotics. An example is Google DeepMind’s RT-X project, which brings together datasets from various robots to train a vision-language-action (VLA) model that generalizes to various robotics morphologies and tasks.
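The article describes training small, task-specific heads on top of a frozen, self-supervised encoder under a limited labeled-data budget. The PyTorch sketch below illustrates that general pattern only; it is not Meta’s Sparsh code, and the stand-in encoder, tensor shapes and force-regression task are assumptions made for illustration.

```python
import torch
from torch import nn

# Generic linear-probe pattern described in the article: freeze a pretrained
# self-supervised encoder and train only a small task head on limited labels.
# The encoder here is a stand-in, NOT the actual Sparsh model.

class DummyTactileEncoder(nn.Module):
    """Stand-in for a pretrained vision-based tactile encoder."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

encoder = DummyTactileEncoder()
encoder.requires_grad_(False)            # freeze the SSL backbone
encoder.eval()

probe = nn.Linear(256, 3)                # small head, e.g. 3D contact force
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Tiny fake labeled set standing in for the "limited labeled data budget"
images = torch.rand(64, 3, 64, 64)       # tactile sensor images
forces = torch.rand(64, 3)               # ground-truth labels

for step in range(100):
    with torch.no_grad():
        feats = encoder(images)           # frozen features
    pred = probe(feats)
    loss = loss_fn(pred, forces)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final probe loss:", loss.item())
```

The appeal of this setup, as the article notes, is that one general-purpose encoder can be reused across sensors and tasks, with only the lightweight head retrained each time.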


Pika 1.5 updates with three new Halloween-themed video AI Pikaffects

Of all the AI video models out there, arguably the one to see the most success among mainstream creators and viewers — those outside the pro and amateur filmmaking community — is Pika. The Palo Alto, California-based startup, co-founded by two Stanford AI PhD dropouts, Demi Guo and Chenlin Meng, and funded to the tune of $135 million so far, unveiled its new Pika 1.5 text-to-video and image-to-video AI generation model at the start of this month (October 2024) with a collection of six physics-defying special effects (explode, squish, melt, crush, inflate and “cake-ify”) for its web app that users could easily add to their own photos, turning them into surreal and bizarrely captivating videos.

Following this, brands with accounts on the social networks Instagram and TikTok — especially brands in cosmetics, skincare and wellness — began using the effects, especially the “squish,” to advertise their services. It even sparked a whole trend of creators trying the “Squish It” effect — or Pikaffect, as the company calls its AI presets — on their own videos. Pika added four more Pikaffects two weeks later.

Now, the company is hoping to continue building upon its success cracking through to the mainstream by releasing three new Pikaffects in time for Halloween: levitate, eye pop and decapitate — all of which do what they sound like.

They’re ALIIIIIIIVE! Our freaky new Pikaffects are ready just in time for spooky season. Try levitating, eye-popping, and decapitating the whole family. pic.twitter.com/wx0fddcP91 — Pika (@pika_labs) October 29, 2024

“We’re trying to put fun at the forefront of AI—making it accessible not just for creators, but for anyone, from kids to grandparents,” said Matan Cohen-Grumi, Pika’s founding creative director, in a video call interview with VentureBeat earlier this week.

To use the new and prior Pikaffects, users follow the same simple steps: visit Pika.art; sign in with a Google account, Discord account, Facebook/Meta account or email address; and then navigate to the bottom menu bar to add a new image. After tapping the Image button marked with a paperclip icon, the user can take a new image or add a previously uploaded one from their device or cloud photo library. Then, tapping the Pikaffects button marked by a magic wand, the user can pull up all 13 preset Pikaffects. Finally, the user can generate a video based on the image by tapping the star button.

“What I would suggest is for everyone to go to our website and try it out,” advocated Cohen-Grumi. “It’s so, so accessible.” The creative director asserted that Pika’s effects take only a few seconds to generate a new video from a still image. However, in VentureBeat’s limited tests, the site appeared overloaded with traffic and stalled for a while, with some images failing to generate videos so far on the company’s free tier, which offers 150 credits to the user each month — enough for 10 videos (one video costs 15 credits on Pika’s scale). There are also Standard, Pro and Unlimited tiers for $10, $35 and $95 per month (20% discount when paid annually) with gradually increasing numbers of credits. Asked about the timeouts we experienced, Cohen-Grumi noted that Pika’s newfound success with Pikaffects had come with load-bearing challenges.
“We had a lot, a lot of traffic, more than created on the launch, but everything was resolved very quickly,” he told VentureBeat. And seeking to dispel notions that Pika is competing on novelty over realism, he also asserted that Pika 1.5 “can deliver extremely realistic results with natural movement.”

As for what’s next for Pika — more Pikaffects for every major holiday or season of the year? — Cohen-Grumi played coy. “We’re always working on the next thing, ensuring everything we release is fun and accessible for everyone,” he said.


OpenAI expands Realtime API with new voices and cuts prices for developers

OpenAI today updated its Realtime API, which is currently in beta. The update adds new voices for speech-to-speech applications to the platform and cuts costs for developers through prompt caching.

Beta users of the Realtime API will now have five new voices they can use to build their applications. OpenAI showcased three of the new voices — Ash, Verse and the British-sounding Ballad — in a post on X.

Two Realtime API updates: – You can now build speech-to-speech experiences with five new voices—which are much more expressive and steerable. – We’re lowering the price by using prompt caching. Cached text inputs are discounted 50% and cached audio inputs are discounted… pic.twitter.com/jLzZDBrR7l — OpenAI Developers (@OpenAIDevs) October 30, 2024

The company said in its API documentation that the native speech-to-speech feature “skip[s] an intermediate text format,” which “means low latency and nuanced output,” while the new voices are easier to steer and more expressive than its previous ones. However, OpenAI warns it cannot offer client-side authentication for the API for now, as it’s still in beta. It also said there may be issues with processing real-time audio. “Network conditions heavily affect real-time audio, and delivering audio reliably from a client to a server at scale is challenging when network conditions are unpredictable,” the company shared.

OpenAI’s history with AI-powered speech and voices has been controversial. In March, it released Voice Engine, a voice cloning platform to rival ElevenLabs, but it limited access to only a few researchers. In May, after the company demoed GPT-4o and Voice Mode, it paused the use of one of the voices, Sky, after the actress Scarlett Johansson spoke out about its similarity to her own voice. The company rolled out ChatGPT Advanced Voice Mode for paying subscribers (those using ChatGPT Plus, Enterprise, Teams and Edu) in the U.S. in September.

Speech-to-speech AI would ideally let enterprises build more real-time responses using a voice. Suppose a customer calls a company’s customer service platform. In that case, the speech-to-speech capability can take the person’s voice, understand what they are asking, and respond using an AI-generated voice with lower latency. Speech-to-speech also lets users generate voice-overs, with a user speaking their lines while the voice output is not theirs. Platforms that offer this include Replica and, of course, ElevenLabs.

OpenAI released the Realtime API this month during its Dev Day. The API aims to speed up the building of voice assistants.

Lowering costs

Using speech-to-speech features, though, could get expensive. When the Realtime API launched, pricing was set at $0.06 per minute of audio input and $0.24 per minute of audio output, which is not cheap. However, the company plans to lower Realtime API prices with prompt caching. Cached text inputs will drop by 50%, and cached audio inputs will be discounted by 80%. OpenAI also announced Prompt Caching during Dev Day; it keeps frequently requested contexts and prompts in the model’s memory, reducing the cost of repeatedly processing the same input tokens when generating responses. Lowering input prices could encourage more interested developers to connect to the API. OpenAI is not the only company to roll out prompt caching. Anthropic launched prompt caching for Claude 3.5 Sonnet in August.
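The pricing in the article ($0.06 per minute of audio input, $0.24 per minute of audio output, with 50% off cached text inputs and 80% off cached audio inputs) lends itself to a quick back-of-the-envelope cost estimate. The sketch below is illustrative only: the session mix of input and output minutes, and how much of a real session would actually be served from cache, are assumptions rather than OpenAI figures.

```python
# Rough cost estimate for a Realtime API voice session, using the per-minute
# audio prices and cache discounts cited in the article. The traffic mix and
# cache-hit fraction below are illustrative assumptions, not OpenAI figures.

AUDIO_INPUT_PER_MIN = 0.06    # USD per minute of audio input
AUDIO_OUTPUT_PER_MIN = 0.24   # USD per minute of audio output
CACHED_AUDIO_DISCOUNT = 0.80  # cached audio inputs discounted 80%

def session_cost(input_minutes: float, output_minutes: float,
                 cached_input_fraction: float = 0.0) -> float:
    """Estimate session cost, assuming the cache discount applies only to the
    cached share of audio input minutes (output is billed in full)."""
    cached = input_minutes * cached_input_fraction
    fresh = input_minutes - cached
    input_cost = (fresh * AUDIO_INPUT_PER_MIN
                  + cached * AUDIO_INPUT_PER_MIN * (1 - CACHED_AUDIO_DISCOUNT))
    output_cost = output_minutes * AUDIO_OUTPUT_PER_MIN
    return input_cost + output_cost

# A hypothetical 10-minute support call: 6 min of caller audio, 4 min of replies.
print(round(session_cost(6, 4), 3))                              # no caching: 1.32 USD
print(round(session_cost(6, 4, cached_input_fraction=0.5), 3))   # half cached: 1.176 USD
```

Even under these assumptions, the output side dominates the bill, which is why the input-side cache discounts matter most for long, repetitive prompts rather than for generated speech.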


DevRev raises $100.8 million in Series A funding and becomes an AI unicorn at a $1.15 billion valuation

Palo Alto, CA — October 2024 — Following its successful Series A funding round in August 2024, in which DevRev secured $100.8 million and reached a $1.15 billion valuation, the company continues to drive forward its mission to revolutionize customer support and product development. Led by Khosla Ventures with participation from Mayfield Fund, Param Hansa Values, U First Capital, and several accelerators, family offices and angel investors, the investment highlights the growing potential of AI-native enterprise software. Fueling this mission is DevRev’s AgentOS platform, which is rapidly advancing GenAI adoption in enterprises. By offering seamless one-click data migration from legacy systems and deploying lightweight AI agents, DevRev is setting a new standard for how businesses integrate and benefit from AI.

A visionary approach to developer-customer interaction

DevRev, founded in October 2020 by Dheeraj Pandey, former co-founder and CEO of Nutanix, and Manoj Agarwal, former SVP of engineering at Nutanix, aims to remodel how businesses connect developers directly with customers and revenue. The company was born out of a simple yet powerful realization:

“Today, every company is a software company, yet we isolate developers from customers and revenue… Our mission is to break down these barriers and empower developers to create customer-conscious products and businesses.” — Dheeraj Pandey, CEO of DevRev

DevRev’s knowledge graph powers its AgentOS, delivering AI-native solutions that streamline customer service, product management and software engineering. The platform is already trusted by customers across all major geographies, various industries and company sizes, including many of the leading global players in AI, SaaS and financial services. By analyzing structured and unstructured data — from customer conversations to session analytics — the platform’s unique approach allows developers to connect their code directly to production issues and customer interactions. From there, DevRev’s AI-driven agents are able to automate enterprise workflows to reduce manual effort, enhance operational efficiency and accelerate response times.

“We have invested heavily in the generative AI sector. We’ve noticed that to fully harness the potential of AI, the underlying data and knowledge infrastructure must be reimagined and rebuilt. DevRev is at the forefront of enabling AI adoption in enterprises, thanks to its innovative product architecture. Furthermore, DevRev is pioneering a new vision for organizational structure by breaking down internal silos, fostering greater collaboration and efficiency across the company.” — Dr. Ekta Dang, CEO of U First Capital

AI agents on knowledge graphs

Organizations today suffer from technology complexity that silos around departments and their respective apps, data and workflows, which results in poor customer experiences, delays in product development and, often, building the wrong software. DevRev believes that this complexity can be meaningfully resolved by AI-on-Knowledge Graphs, which combines the emerging power of GenAI with an organization’s own systems mapped into Knowledge Graphs. While AI is proving to be powerful, organizations are realizing that without Knowledge Graphs, they either end up with AI copilots on single apps or AI copilots on vast data lakes with little-to-no context or definition.
The solution begins with creating an organization’s Knowledge Graph by ingesting data from two-way, real-time integrations with the organization’s CRM, support and engineering applications, along with the underlying code repositories. By doing so, the Knowledge Graph understands the product (software), the customers (users), the people (employees) and the workflows involved, along with elements unique to the organization, such as security and customizations. Once mapped, customers and employees can run queries through AI agents to not only return more accurate search results but also power systems of action quickly across the organization. This is the productivity promise GenAI holds, which is only enabled by the contextual mapping that Knowledge Graphs provide.

With data from major system-of-record applications ingested in real time into its Knowledge Graph platform, DevRev creates an interdependent network of customer, user, product, employee, work and usage records. Put simply, DevRev comprises both the front-end applications and the back-end Knowledge Graphs to analyze, contextualize and act on enterprise data, enabling organizations to:

Gain deep organizational insights: spot emerging trends and linkages across customers, products and employees to better inform strategic planning

Increase focus: connect the dots between product/engineering roadmaps and customer impact to better prioritize and allocate resources

Boost operational efficiency: streamline operations by identifying bottlenecks, eliminating redundancies and automating workflows across the organization

Enhance customer experience: gain a comprehensive understanding of customer interactions and feedback, leading to more personalized and effective service

About DevRev

DevRev’s mission is to help build the world’s most customer-centric companies, embracing the principle that “less is better.” Founded in October 2020 by Dheeraj Pandey and Manoj Agarwal, DevRev is headquartered in Palo Alto, California, with offices in seven global locations. For more information, visit DevRev’s website.

About U First Capital

Led by two technical PhDs based in Silicon Valley for over two decades, U First Capital focuses on investing in stellar founders. The firm has invested in over twenty-five category-leading companies, including Anthropic, Cohere AI, Rubrik, Worldcoin, Pensando, Palantir, Uniphore and Nile. For more information, visit U First Capital’s website.
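The release describes linking customer, product, work and code records into one queryable graph that agents can traverse. A minimal, generic sketch of that idea using the networkx library is shown below; the node types, edge labels and triage query are illustrative assumptions, not DevRev’s actual schema or AgentOS API.

```python
import networkx as nx

# Toy knowledge graph linking the record types the release mentions:
# customers, support tickets, product features, code changes and employees.
# Schema and names are invented for illustration, not DevRev's data model.

g = nx.MultiDiGraph()

g.add_edge("customer:AcmeCo", "ticket:T-101", label="reported")
g.add_edge("ticket:T-101", "feature:checkout", label="affects")
g.add_edge("commit:abc123", "feature:checkout", label="modifies")
g.add_edge("employee:dana", "commit:abc123", label="authored")
g.add_edge("customer:AcmeCo", "feature:checkout", label="uses")

def who_should_triage(ticket: str) -> set[str]:
    """Walk ticket -> affected feature -> recent commits -> authors.
    An AI agent could use this kind of traversal to route work with context."""
    owners = set()
    for _, feature, data in g.out_edges(ticket, data=True):
        if data["label"] != "affects":
            continue
        for commit, _, d in g.in_edges(feature, data=True):
            if d["label"] != "modifies":
                continue
            for employee, _, d2 in g.in_edges(commit, data=True):
                if d2["label"] == "authored":
                    owners.add(employee)
    return owners

print(who_should_triage("ticket:T-101"))   # {'employee:dana'}
```

The point of the sketch is the contrast the release draws: an agent answering over this connected structure has the customer-to-code context that a copilot sitting on a single app or an undifferentiated data lake lacks.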


The power of business semantics: Turning data into actionable AI insights

Presented by SAP

There’s no question the future of business data and decision-making is driven by AI. And with steady advancements in AI, organizations across industries feel the pressure to drive innovation across the end-to-end process. But the foundational challenge to achieving success with AI is data fragmentation. “The reality for most of our customers is that every part of their business is deeply connected, but when they need to make decisions based on insights, that’s when data feels like a fragmented experience,” says Tony Truong, director of product marketing, data and analytics at SAP.

The misalignment between IT and business is due to inconsistencies in how they view the business — each takes quite a different approach to the balance between data agility and data governance. Bringing these together is a time-consuming task for IT, because when data is extracted from its source, the business context — an understanding of that data in relationship to the processes it was originally associated with — is completely wiped out. For the data to be usable, all of the metadata and logic must be rebuilt from scratch. And by the time that lengthy, redundant process is completed, the data is already getting stale.

There’s also a mismatch in data definitions across different departments or systems — each department may look at the same data point in a different way. For instance, what the sales team considers a “customer” might differ from the marketing team’s definition. The discrepancy in semantics can affect how business leaders view the impact of a marketing campaign for the business development team. The inconsistency can lead to significant inefficiencies and delay decision-making, Truong says.

“This fragmented experience leads to missed opportunities and disconnects between integrated solutions,” Truong says. “Managing data and applications across different platforms is complex, requiring specialized skills and tooling that could increase the operational and complexity costs if not done right. And when data is shipped to users without the context necessary for it to be useful, collaboration becomes significantly limited, and the organization loses the power of shared decision-making insights and collective expertise.”

Without context, large language models (LLMs), other applications downstream and business users aren’t working with enough domain-specific knowledge to deliver the business insights organizations are chasing. And to ensure data consumers can leverage this data thoroughly, organizations need to prioritize business semantics, data literacy and self-service capabilities.

The importance of business semantics

As organizations integrate data across multiple business processes, they need a new way to maintain the accuracy of that data. That’s where business semantics comes in. AI models and applications require semantically rich data in order to produce reliable business outputs. A semantic layer is an abstraction layer between the underlying data storage and analytics tools. It translates metadata (or business context) into natural language so that users can interact using terms they understand, and it hides the complex underlying data infrastructure, which dramatically simplifies data exploration and analysis. This provides business users with a way to discover and understand relationships between data, enabling them to answer complex questions and uncover hidden insights that traditional databases might miss.
It also offers secure, truly self-service access to data and analytics, which is a major step forward for business decision-making. When teams have streamlined access to the same contextual data, it takes far less time and effort to generate insights, dramatically speeding up data-powered decision-making for users at every level and in every department. “When data products are enriched with domain-specific knowledge and are accessible, this gives ownership of the data back to the business and makes them infinitely usable across the organization, since the value of a data asset is proportional to its usage,” Truong says.

How business data fabric unlocks business semantics and self-service

A business data fabric is key to delivering an integrated, semantically rich data layer over underlying data landscapes. It’s a data management architecture that provides seamless and scalable access to data without duplication, differing from a standard data fabric in that it keeps the business context and logic intact. It creates a single source of truth, offering agile self-service access to trusted data, accelerated and accurate decisions, real-time data for in-the-moment insights, and a simpler, more flexible data landscape. That maximizes the potential of your data and current infrastructure investments, while comprehensive data governance assures every stakeholder that private data stays private.

IT can federate access and security so that teams have self-service access without needing to rebuild systems and processes or make offline copies, and the data is secured from unauthorized access. The data modeling and semantic layer creates a common language for data across systems: a model that describes the data, plus a semantic layer that offers a business-friendly interface to data consumers.

“When business processes are integrated, you can take advantage of your existing investments and future investments,” Truong says. “Data is harmonized and ready to use. All your lines of business can have a single system to power their cross-organizational decision-making systems.”

Dig deeper: Learn more about how a business data fabric can transform your AI capabilities.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact [email protected].
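The article’s running example, in which sales and marketing define “customer” differently, can be made concrete with a tiny sketch of a semantic layer: a shared mapping from business terms to one agreed-upon piece of query logic, so every consumer gets the same definition. This is a generic illustration, not SAP’s product or API; the table and column names are invented.

```python
# Minimal sketch of a semantic layer: each business term maps to one shared
# definition (source table + filter), so "customer" means the same thing to
# every team. Table and column names here are invented for illustration.

SEMANTIC_LAYER = {
    "customer": {
        "table": "crm.accounts",
        "filter": "status = 'active' AND first_order_date IS NOT NULL",
        "description": "An account with at least one completed order.",
    },
    "marketing_lead": {
        "table": "crm.accounts",
        "filter": "status = 'prospect'",
        "description": "A prospect that has not yet placed an order.",
    },
}

def build_count_query(business_term: str) -> str:
    """Translate a business term into SQL using the shared definition."""
    spec = SEMANTIC_LAYER[business_term]
    return f"SELECT COUNT(*) FROM {spec['table']} WHERE {spec['filter']}"

# Both the sales and marketing dashboards call the same definition of
# "customer" instead of each hard-coding their own filter.
print(build_count_query("customer"))
print(build_count_query("marketing_lead"))
```

In a real deployment the definitions would live in a governed catalog rather than a Python dictionary, but the principle is the same: downstream tools, and any LLM querying the data, consume the business term and inherit its context instead of reinventing it.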


AI on your smartphone? Hugging Face’s SmolLM2 brings powerful models to the palm of your hand

Hugging Face today released SmolLM2, a new family of compact language models that achieve impressive performance while requiring far fewer computational resources than their larger counterparts. The new models, released under the Apache 2.0 license, come in three sizes — 135M, 360M and 1.7B parameters — making them suitable for deployment on smartphones and other edge devices where processing power and memory are limited. Most notably, the 1.7B-parameter version outperforms Meta’s Llama 1B model on several key benchmarks.

Performance comparison shows SmolLM2-1B outperforming larger rival models on most cognitive benchmarks, with particularly strong results in science reasoning and commonsense tasks. (Credit: Hugging Face)

Small models pack a powerful punch in AI performance tests

“SmolLM2 demonstrates significant advances over its predecessor, particularly in instruction following, knowledge, reasoning and mathematics,” according to Hugging Face’s model documentation. The largest variant was trained on 11 trillion tokens using a diverse dataset combination that includes FineWeb-Edu and specialized mathematics and coding datasets.

This development comes at a crucial time, when the AI industry is grappling with the computational demands of running large language models (LLMs). While companies like OpenAI and Anthropic push the boundaries with increasingly massive models, there’s growing recognition of the need for efficient, lightweight AI that can run locally on devices.

The push for bigger AI models has left many potential users behind. Running these models requires expensive cloud computing services, which come with their own problems: slow response times, data privacy risks and high costs that small companies and independent developers simply can’t afford. SmolLM2 offers a different approach by bringing powerful AI capabilities directly to personal devices, pointing toward a future where advanced AI tools are within reach of more users and companies, not just tech giants with massive data centers.

A comparison of AI language models shows SmolLM2’s superior efficiency, achieving higher performance scores with fewer parameters than larger rivals like Llama 3.2 and Gemma; the horizontal axis represents model size and the vertical axis shows accuracy on benchmark tests. (Credit: Hugging Face)

Edge computing gets a boost as AI moves to mobile devices

SmolLM2’s performance is particularly noteworthy given its size. On the MT-Bench evaluation, which measures chat capabilities, the 1.7B model achieves a score of 6.13, competitive with much larger models. It also shows strong performance on mathematical reasoning tasks, scoring 48.2 on the GSM8K benchmark. These results challenge the conventional wisdom that bigger models are always better, suggesting that careful architecture design and training-data curation may be more important than raw parameter count.

The models support a range of applications, including text rewriting, summarization and function calling. Their compact size enables deployment in scenarios where privacy, latency or connectivity constraints make cloud-based AI solutions impractical. This could prove particularly valuable in healthcare, financial services and other industries where data privacy is non-negotiable. Industry experts see this as part of a broader trend toward more efficient AI models.
The ability to run sophisticated language models locally on devices could enable new applications in areas like mobile app development, IoT devices and enterprise solutions where data privacy is paramount.

The race for efficient AI: Smaller models challenge industry giants

However, these smaller models still have limitations. According to Hugging Face’s documentation, they “primarily understand and generate content in English” and may not always produce factually accurate or logically consistent output.

The release of SmolLM2 suggests that the future of AI may not solely belong to increasingly large models, but rather to more efficient architectures that can deliver strong performance with fewer resources. This could have significant implications for democratizing AI access and reducing the environmental impact of AI deployment. The models are available immediately through Hugging Face’s model hub, with both base and instruction-tuned versions offered for each size variant.
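Because the models are published on the Hugging Face hub, they can be tried with the standard transformers text-generation workflow. The sketch below makes a couple of assumptions: that the instruction-tuned 1.7B checkpoint lives at the repository id “HuggingFaceTB/SmolLM2-1.7B-Instruct” with a standard chat template, and that recent versions of transformers, torch and accelerate are installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal local-inference sketch for the instruction-tuned 1.7B variant.
# The repo id below is assumed to follow Hugging Face's usual naming.
MODEL_ID = "HuggingFaceTB/SmolLM2-1.7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # small enough for many laptops and edge boxes
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize why small on-device LLMs matter."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The same code runs the 135M and 360M variants by swapping the model id, which is the practical appeal of the family: one workflow from workstation prototyping down to constrained edge hardware.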


Why multi-agent AI tackles complexities LLMs can’t

The introduction of ChatGPT has brought large language models (LLMs) into widespread use across both tech and non-tech industries. This popularity is primarily due to two factors:

LLMs as a knowledge storehouse: LLMs are trained on a vast amount of internet data and are updated at regular intervals (GPT-3, GPT-3.5, GPT-4, GPT-4o and so on).

Emergent abilities: As LLMs grow, they display abilities not found in smaller models.

Does this mean we have already reached human-level intelligence, which we call artificial general intelligence (AGI)? Gartner defines AGI as a form of AI that possesses the ability to understand, learn and apply knowledge across a wide range of tasks and domains. The road to AGI is long, with one key hurdle being the auto-regressive nature of LLM training, which predicts words based on past sequences. As Yann LeCun, one of the pioneers of AI research, points out, LLMs can drift away from accurate responses due to their auto-regressive nature. Consequently, LLMs have several limitations:

Limited knowledge: While trained on vast data, LLMs lack up-to-date world knowledge.

Limited reasoning: LLMs have limited reasoning capability. As Subbarao Kambhampati points out, LLMs are good knowledge retrievers but not good reasoners.

No dynamicity: LLMs are static and unable to access real-time information.

To overcome these challenges, a more advanced approach is required. This is where agents become crucial.

Agents to the rescue

The concept of the intelligent agent in AI has evolved over two decades, with implementations changing over time. Today, agents are discussed in the context of LLMs. Simply put, an agent is like a Swiss Army knife for LLM challenges: It can help us reason, provide means to get up-to-date information from the internet (solving the dynamicity issue) and achieve tasks autonomously. With an LLM as its backbone, an agent formally comprises tools, memory, reasoning (or planning) and action components.

Components of an agent (Image credit: Lilian Weng)

Components of AI agents

Tools enable agents to access external information — whether from the internet, databases or APIs — allowing them to gather necessary data.

Memory can be short- or long-term. Agents use scratchpad memory to temporarily hold results from various sources, while chat history is an example of long-term memory.

The reasoner allows agents to think methodically, breaking complex tasks into manageable subtasks for effective processing.

Actions: Agents perform actions based on their environment and reasoning, adapting and solving tasks iteratively through feedback. ReAct is one of the common methods for iteratively performing reasoning and action.

What are agents good at?

Agents excel at complex tasks, especially when in a role-playing mode, leveraging the enhanced performance of LLMs. For instance, when writing a blog, one agent may focus on research while another handles writing — each tackling a specific sub-goal. This multi-agent approach applies to numerous real-life problems. Role-playing helps agents stay focused on specific tasks to achieve larger objectives, reducing hallucinations by clearly defining parts of a prompt — such as role, instruction and context. Since LLM performance depends on well-structured prompts, various frameworks formalize this process.
One such framework, CrewAI, provides a structured approach to defining role-playing, as we’ll discuss next.

Multi-agent vs. single agent

Take the example of retrieval-augmented generation (RAG) using a single agent. It’s an effective way to empower LLMs to handle domain-specific queries by leveraging information from indexed documents. However, single-agent RAG comes with its own limitations, such as retrieval performance or document ranking. Multi-agent RAG overcomes these limitations by employing specialized agents for document understanding, retrieval and ranking. In a multi-agent scenario, agents collaborate in different ways, similar to distributed computing patterns: sequential, centralized, decentralized or shared message pools. Frameworks like CrewAI, AutoGen and LangGraph + LangChain enable complex problem-solving with multi-agent approaches. In this article, I have used CrewAI as the reference framework to explore autonomous workflow management.

Workflow management: A use case for multi-agent systems

Most industrial processes are about managing workflows, be it loan processing, marketing campaign management or even DevOps. Steps, either sequential or cyclic, are required to achieve a particular goal. In a traditional approach, each step (say, loan application verification) requires a human to perform the tedious and mundane task of manually processing each application and verifying it before moving to the next step. Each step requires input from an expert in that area.

In a multi-agent setup using CrewAI, each step is handled by a crew consisting of multiple agents. For instance, in loan application verification, one agent may verify the user’s identity through background checks on documents like a driving license, while another agent verifies the user’s financial details (see the code sketch at the end of this article). This raises the question: Can a single crew (with multiple agents in sequence or hierarchy) handle all loan processing steps? While possible, it complicates the crew, requiring extensive temporary memory and increasing the risk of goal deviation and hallucination. A more effective approach is to treat each loan processing step as a separate crew, viewing the entire workflow as a graph of crew nodes (using tools like LangGraph) operating sequentially or cyclically.

Since LLMs are still in their early stages of intelligence, full workflow management cannot be entirely autonomous. A human-in-the-loop is needed at key stages for end-user verification. For instance, after the crew completes the loan application verification step, human oversight is necessary to validate the results. Over time, as confidence in AI grows, some steps may become fully autonomous. Currently, AI-based workflow management functions in an assistive role, streamlining tedious tasks and reducing overall processing time.

Production challenges

Bringing multi-agent solutions into production can present several challenges.

Scale: As the number of agents grows, collaboration and management become challenging. Various frameworks offer scalable solutions — for example, LlamaIndex takes an event-driven-workflow approach to managing multi-agents at scale.

Latency: Agent performance often incurs latency, as tasks are executed iteratively, requiring multiple LLM calls. Managed LLMs (like GPT-4o) are slow because of implicit guardrails and network delays.
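Returning to the loan-verification example above, a minimal crew might look like the sketch below. It is illustrative rather than production code: it assumes the crewai package is installed and an LLM API key (such as OPENAI_API_KEY) is configured in the environment, and the roles, goals and application text are invented for the example.

```python
# Minimal CrewAI-style sketch of the loan-verification crew described above.
# Assumes `pip install crewai` and an LLM API key in the environment; the
# roles, goals and application details are invented for illustration.
from crewai import Agent, Task, Crew

identity_checker = Agent(
    role="Identity verification specialist",
    goal="Verify the applicant's identity from the submitted documents",
    backstory="You carefully cross-check names, dates and document numbers.",
)

finance_checker = Agent(
    role="Financial verification specialist",
    goal="Verify the applicant's declared income and liabilities",
    backstory="You review financial details for consistency and red flags.",
)

application = "Applicant: Jane Doe, driving license DL-123, declared income $85k."

identity_task = Task(
    description=f"Check the identity details in this application: {application}",
    expected_output="A short identity-verification report with a pass/fail verdict.",
    agent=identity_checker,
)

finance_task = Task(
    description=f"Check the financial details in this application: {application}",
    expected_output="A short financial-verification report with a pass/fail verdict.",
    agent=finance_checker,
)

# One crew = one workflow step (loan application verification). Other steps,
# such as underwriting or disbursal, would be separate crews wired into a
# graph with a tool like LangGraph, with a human reviewing each step's output.
verification_crew = Crew(
    agents=[identity_checker, finance_checker],
    tasks=[identity_task, finance_task],
)

result = verification_crew.kickoff()
print(result)
```

Each crew's result is exactly the artifact the human-in-the-loop reviews before the workflow graph advances to the next node, which is how the assistive, step-by-step automation described in the article stays auditable.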


Google’s AI system could change the way we write: InkSight turns handwritten notes digital

A centuries-old technology — pen and paper — is getting a dramatic digital upgrade. Google Research has developed an artificial intelligence system that can accurately convert photographs of handwritten notes into editable digital text, potentially transforming how millions of people capture and preserve their thoughts.

The new system, called InkSight, represents a significant breakthrough in the long-running effort to bridge the divide between traditional handwriting and digital text. While digital note-taking has offered clear advantages for decades — searchability, cloud storage, easy editing and integration with other digital tools — traditional pen-and-paper note-taking remains widely preferred, according to the researchers.

A page from “Alice in Wonderland” shown in its original form (left) and after digital conversion by Google’s InkSight AI (right), demonstrating the system’s ability to preserve the natural character of handwritten text while making it digital. (Credit: Google)

How Google’s new AI system understands human handwriting better than ever before

“Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in the vectorized form,” Andrii Maksai, the project lead at Google Research, explained in the paper. “However, a substantial gap remains between this way of note-taking and traditional pen-and-paper note-taking, a practice still favored by a vast majority.”

What makes InkSight revolutionary is its approach to understanding handwriting. Previous attempts to convert handwritten text to digital format relied heavily on analyzing the geometric properties of written strokes — essentially trying to trace the lines on the page. InkSight instead combines two sophisticated AI capabilities: the ability to read and understand text, and the ability to reproduce it naturally.

The results are remarkable. In human evaluations, 87% of the samples produced by InkSight were considered valid tracings of the input text, and 67% were indistinguishable from human-generated digital handwriting. The system can handle real-world scenarios that would confound earlier systems: poor lighting, messy backgrounds, even partially obscured text. “To our knowledge, this is the first work that effectively de-renders handwritten text in arbitrary photos with diverse visual characteristics and backgrounds,” the researchers explain in their paper published on arXiv. The system can even handle simple sketches and drawings, though with some limitations.

The same multilingual birthday note shown in three stages: the original handwriting (left), InkSight’s word-level analysis with color-coded processing (center), and the final digitized version with preserved character strokes (right). The system maintains the personal style of handwriting across Chinese, English and French text. (Credit: Google)

Why handwriting still matters in our digital age, and how AI could help preserve it

The technology arrives at a crucial moment in the evolution of human-computer interaction. Despite decades of digital advancement, handwriting remains deeply ingrained in human cognition and learning. Studies have consistently shown that writing by hand improves memory retention and understanding compared to typing. This has created a persistent challenge for technology adoption in education and professional settings.
“Our work aims to make physical notes, particularly handwritten text, available in the form of digital ink, capturing the stroke-level trajectory details of handwriting,” Maksai says. “This allows paper note-takers to enjoy the benefits of digital medium without the need to use a stylus.”

The implications extend far beyond simple convenience. In academic settings, students could maintain their preferred handwritten note-taking style while gaining the ability to search, share and organize their notes digitally. Professionals who sketch ideas or take meeting notes by hand could seamlessly integrate them into digital workflows. Researchers and historians could more easily digitize and analyze handwritten documents.

Perhaps most significantly, InkSight could help preserve and digitize handwritten content in languages that historically have had limited digital representation. “Our work could allow access to the digital ink underlying the physical notes, potentially enabling the training of better online handwriting recognizers for languages that are historically low-resource in the digital ink domain,” notes Dr. Claudiu Musat, one of the project’s researchers.

From breakthrough to real-world application: The technical architecture and future of digital note-taking

The technology’s architecture is notably elegant. Built from widely available components, including Google’s Vision Transformer (ViT) and mT5 language model, InkSight demonstrates how sophisticated AI capabilities can be achieved through the clever combination of existing tools rather than building everything from scratch. Google has released a public version of the model, though with important ethical safeguards: the system cannot generate handwriting from scratch — a crucial limitation that prevents potential misuse for forgery or impersonation.

Current limitations do exist. The system processes text word by word rather than handling entire pages at once, and it occasionally struggles with very wide stroke widths or significant variations in stroke width. However, these limitations seem minor compared to the system’s achievements. The technology is available for public testing through a Hugging Face demo, allowing users to experience firsthand how their handwritten notes might translate to digital form. Early feedback has been overwhelmingly positive, with users particularly noting the system’s ability to maintain the personal character of handwriting while providing digital benefits.

While most AI systems seek to automate human tasks, InkSight takes a different path. It preserves the cognitive benefits and personal intimacy of handwriting while adding the power of digital tools. This subtle but crucial distinction points to a future where technology amplifies rather than replaces human capabilities. In the end, InkSight’s greatest innovation might be its restraint — showing how AI can advance human practices without erasing what makes them human in the first place.
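The article describes “digital ink” as stroke-level trajectory data rather than pixels. A tiny, generic sketch of what that representation looks like, and why it stays editable and scalable, is below. It does not use InkSight’s actual model or output format; the stroke structure and the matplotlib rendering are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Generic "digital ink" representation: each stroke is an ordered list of
# (x, y) pen positions. This is the kind of vectorized, stroke-level output
# the article describes, not InkSight's actual format.
Stroke = list[tuple[float, float]]

ink: list[Stroke] = [
    [(0.0, 0.0), (0.2, 0.8), (0.4, 0.0)],   # rough "A" shape: left and right legs
    [(0.1, 0.35), (0.3, 0.35)],             # the crossbar
]

def render(strokes: list[Stroke]) -> None:
    """Re-draw vector ink at any resolution; unlike a photo, it never pixelates."""
    for stroke in strokes:
        xs, ys = zip(*stroke)
        plt.plot(xs, ys, linewidth=2, color="black")
    plt.gca().set_aspect("equal")
    plt.axis("off")
    plt.savefig("ink.png", dpi=200)

render(ink)
print(f"{len(ink)} strokes, {sum(len(s) for s in ink)} points total")
```

Because the notes live as ordered point sequences rather than bitmaps, downstream tools can rescale, re-style, edit or index them, which is the practical payoff of de-rendering a photo into ink.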
