VentureBeat

AI that clicks for you: Microsoft’s research points to the future of GUI automation

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More A comprehensive new survey from Microsoft researchers and academic partners reveals that artificial intelligence agents powered by large language models (LLMs) are becoming increasingly capable of controlling graphical user interfaces (GUIs), potentially changing how humans interact with software. The technology essentially gives AI systems the ability to see and manipulate computer interfaces just like humans do — clicking buttons, filling out forms, and navigating between applications. Rather than requiring users to learn complex software commands, these “GUI agents” can interpret natural language requests and automatically execute the necessary actions. “These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands,” the researchers write. “Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software.” Think of it as having a highly skilled executive assistant who can operate any software program on your behalf. You simply tell the assistant what you want to accomplish, and they handle all the technical details of making it happen. This timeline charts the rapid growth of AI agents capable of controlling software, with a surge of new models from researchers and tech companies emerging since 2023, categorized by their application across web, mobile, and computer platforms. (Credit: arxiv.org) The rise of enterprise AI assistants changes everything Major tech companies are already racing to incorporate these capabilities into their products. Microsoft’s Power Automate uses LLMs to help users create automated workflows across applications. The company’s Copilot AI assistant can directly control software based on text commands. 
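The observe-plan-act loop such a GUI agent runs can be sketched in a few lines of Python. Everything below is an illustrative stub, not Microsoft's implementation: the "model" is a hard-coded stand-in for a multimodal LLM call, and the action and element names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", or "done"
    target: str = ""   # UI element the action applies to
    text: str = ""     # text to enter, for "type" actions

def plan_next_action(goal: str, screen: str) -> Action:
    """Stand-in for a multimodal LLM call: maps the user's goal and the
    current screen description to one concrete UI action."""
    if "login form" in screen and "log in" in goal:
        return Action("type", target="username", text="alice")
    if "username filled" in screen:
        return Action("click", target="submit")
    return Action("done")

def run_agent(goal: str, screens: list[str]) -> list[Action]:
    """Observe-plan-act loop: one planned action per observed screen state."""
    trace = []
    for screen in screens:
        action = plan_next_action(goal, screen)
        trace.append(action)
        if action.kind == "done":
            break
    return trace

trace = run_agent("log in to the portal",
                  ["login form shown", "username filled", "dashboard shown"])
print([a.kind for a in trace])  # → ['type', 'click', 'done']
```

A production agent replaces the stub with a model that takes a screenshot plus the goal and emits a structured action, but the surrounding loop looks much the same.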
Anthropic’s Computer Use functionality for Claude enables the AI to interact with web interfaces and perform complex tasks. Google is reportedly developing Project Jarvis, an AI system that would use Chrome browser to carry out web-based tasks like research, shopping, and travel booking, though this capability is still in development and hasn’t been publicly released. “The advent of Large Language Models, particularly multimodal models, has ushered in a new era of GUI automation,” the paper notes. “They have demonstrated exceptional capabilities in natural language understanding, code generation, task generalization, and visual processing.” This represents a potential $68.9 billion market opportunity by 2028, according to analysts at BCC Research, as enterprises look to automate repetitive tasks and make their software more accessible to non-technical users. The market is projected to grow from $8.3 billion in 2022 to this figure, at a compound annual growth rate (CAGR) of 43.9% during the forecast period. The enterprise impact: Challenges and opportunities in AI automation However, significant hurdles remain before the technology sees widespread enterprise adoption. The researchers identify several key limitations, including privacy concerns when agents handle sensitive data, computational performance constraints, and the need for better safety and reliability guarantees. “While they are effective for predefined workflows, these methods lacked the flexibility and adaptability required for dynamic, real-world applications,” the paper states regarding earlier automation approaches. The research team provides a detailed roadmap for addressing these challenges, emphasizing the importance of developing more efficient models that can run locally on devices, implementing robust security measures, and creating standardized evaluation frameworks. 
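The BCC Research figures above can be sanity-checked with quick arithmetic. Assuming the forecast period runs 2022 to 2028 (six compounding years), the endpoints imply:

```python
start, end, years = 8.3, 68.9, 6  # $B, 2022 → 2028

# CAGR implied by the two endpoints: (end/start)^(1/years) - 1
implied_cagr = (end / start) ** (1 / years) - 1
print(f"implied CAGR: {implied_cagr:.1%}")  # → implied CAGR: 42.3%
```

The small gap between this and the cited 43.9% likely reflects rounding or a slightly different base year in BCC's forecast.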
“By incorporating safeguards and customizable actions, these agents ensure efficiency and security when handling intricate commands,” the researchers note, highlighting recent progress in making the technology enterprise-ready. For enterprise technology leaders, the emergence of LLM-powered GUI agents represents both an opportunity and a strategic consideration. While the technology promises significant productivity gains through automation, organizations will need to carefully evaluate the security implications and infrastructure requirements of deploying these AI systems. “The field of GUI agents is moving towards multi-agent architectures, multimodal capabilities, diverse action sets, and novel decision-making strategies,” the paper explains. “These innovations mark significant steps toward creating intelligent, adaptable agents capable of high performance across varied and dynamic environments.” Industry experts predict that by 2025, at least 60% of large enterprises will be piloting some form of GUI automation agents, potentially leading to massive efficiency gains but also raising important questions about data privacy and job displacement. The comprehensive survey suggests we’re at an inflection point where conversational AI interfaces could fundamentally change how humans interact with software — though realizing this potential will require continued advances in both the underlying technology and enterprise deployment practices. “These developments are laying the groundwork for more versatile and powerful agents capable of handling complex, dynamic environments,” the researchers conclude, pointing to a future where AI assistants become an integral part of how we work with computers.


Nvidia accelerates Google quantum AI design with quantum physics simulation

Nvidia is working with Google Quantum AI to accelerate the design of its next-generation quantum computing devices using simulations powered by Nvidia. Google Quantum AI is using the hybrid quantum-classical computing platform and the Nvidia Eos supercomputer to simulate the physics of its quantum processors. This will help overcome the current limitations of quantum computing hardware, which can only run a certain number of quantum operations before computations must cease, due to what researchers call “noise.” “The development of commercially useful quantum computers is only possible if we can scale up quantum hardware while keeping noise in check,” said Guifre Vidal, research scientist from Google Quantum AI, in a statement. “Using Nvidia accelerated computing, we’re exploring the noise implications of increasingly larger quantum chip designs.” Understanding noise in quantum hardware designs requires complex dynamical simulations capable of fully capturing how qubits within a quantum processor interact with their environment. These simulations have traditionally been prohibitively computationally expensive to pursue. Using the CUDA-Q platform, however, Google can employ 1,024 Nvidia H100 Tensor Core GPUs at the Nvidia Eos supercomputer to perform one of the world’s largest and fastest dynamical simulations of quantum devices — at a fraction of the cost. “AI supercomputing power will be helpful to quantum computing’s success,” said Tim Costa, director of quantum and HPC at Nvidia, in a statement. “Google’s use of the CUDA-Q platform demonstrates the central role GPU-accelerated simulations have in advancing quantum computing to help solve real-world problems.” With CUDA-Q and H100 GPUs, Google can perform fully comprehensive, realistic simulations of devices containing 40 qubits – the largest simulations of this kind performed to date.
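The kind of noise modeling described here can be illustrated in miniature: a single-qubit dephasing channel, applied repeatedly, destroys the off-diagonal coherence of the density matrix while preserving its trace. This toy NumPy version is a sketch of the underlying physics, not Google's CUDA-Q workflow, and the step count and error probability are arbitrary:

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)  # Pauli-Z operator

def dephase(rho: np.ndarray, p: float) -> np.ndarray:
    """Phase-flip channel: with probability p, Z is applied to the qubit,
    shrinking the off-diagonal terms by a factor (1 - 2p)."""
    return (1 - p) * rho + p * (Z @ rho @ Z)

# Start in |+><+|, which has maximal coherence (off-diagonals = 0.5).
rho = np.full((2, 2), 0.5, dtype=complex)

for _ in range(10):  # ten noisy steps
    rho = dephase(rho, 0.1)

print(round(abs(rho[0, 1]), 4))    # coherence decays: 0.5 * 0.8**10 ≈ 0.0537
print(round(rho.trace().real, 4))  # trace (total probability) stays 1.0
```

Scaling this brute-force approach to 40 interacting qubits means evolving matrices with ~2^40 rows, which is exactly the kind of workload that calls for a GPU supercomputer.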
The simulation techniques provided by CUDA-Q mean noisy simulations that would’ve taken a week can now run in minutes. The software powering these accelerated dynamic simulations will be publicly available in the CUDA-Q platform, allowing quantum hardware engineers to rapidly scale their system designs.


OpenAI faces critical test as Chinese models close the gap in AI leadership

In the fast-moving world of AI, competition is heating up—and nowhere is this more evident than in the battle over advanced reasoning models. In just the past few days, three new AI models from Chinese developers—DeepSeek R1 (High-Flyer Capital Management), Marco-o1 (Alibaba), and OpenMMLab’s hybrid model—have entered the fray, challenging OpenAI’s o1-preview in performance and accessibility. These releases highlight how quickly open-source innovation is catching up to proprietary giants like OpenAI, whose o1-preview model set a new benchmark for complex reasoning tasks when it was released in mid-September. With OpenAI expected to unveil its next release as early as next week, the pressure is mounting to prove its dominance isn’t slipping. This race has broader implications beyond model performance. OpenAI’s skyrocketing $157 billion valuation and ambitious timeline for artificial general intelligence (AGI) have put intense pressure on its leadership to maintain momentum, especially as competitors close the gap faster than ever. Last year, OpenAI’s GPT-4 had a five-month lead before Anthropic’s Claude 2 debuted. This year, OpenAI’s lead with o1-preview has shrunk to just two and a half months, underscoring the rapid pace of innovation across the industry. Meanwhile, Anthropic has upped the stakes by releasing its Model Context Protocol (MCP), which simplifies AI-data integration and paves the way for next-gen applications. This open-source initiative also signals how other players in the field, including open-source-focused labs like AI2 with its OLMo 2 model, and Nous Research’s Nous Forge, are broadening access to advanced AI capabilities with rival approaches to OpenAI.
For a detailed breakdown of these Chinese models – what they offer, how OpenAI and Google are likely to respond in the coming weeks, MCP, and OLMo 2 – check out our full discussion in the video below. You won’t want to miss the analysis from AI developer Sam Witteveen, who shares exclusive insights with me on why all of these developments matter. To my surprise, he was particularly bullish about MCP and its benefits – suggesting this could be significant for helping create our own personal agents.


Alibaba releases Qwen with Questions, an open reasoning model that beats o1-preview

Chinese e-commerce giant Alibaba has released the latest model in its ever-expanding Qwen family. This one is known as Qwen with Questions (QwQ), and serves as the latest open source competitor to OpenAI’s o1 reasoning model. Like other large reasoning models (LRMs), QwQ uses extra compute cycles during inference to review its answers and correct its mistakes, making it more suitable for tasks that require logical reasoning and planning, like math and coding. What is Qwen with Questions (QwQ) and can it be used for commercial purposes? Alibaba has released a 32-billion-parameter version of QwQ with a 32,000-token context. The model is currently in preview, which means a higher-performing version is likely to follow. According to Alibaba’s tests, QwQ beats o1-preview on the AIME and MATH benchmarks, which evaluate mathematical problem-solving abilities. It also outperforms o1-mini on GPQA, a benchmark for scientific reasoning. QwQ is inferior to o1 on the LiveCodeBench coding benchmarks but still outperforms other frontier models such as GPT-4o and Claude 3.5 Sonnet. Example output of Qwen with Questions QwQ does not come with an accompanying paper that describes the data or the process used to train the model, which makes it difficult to reproduce the model’s results. However, since the model is open, unlike OpenAI o1, its “thinking process” is not hidden and can be used to make sense of how the model reasons when solving problems. Alibaba has also released the model under an Apache 2.0 license, which means it can be used for commercial purposes.
‘We discovered something profound’ According to a blog post that was published along with the model’s release, “Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun… This process of careful reflection and self-questioning leads to remarkable breakthroughs in solving complex problems.” This is very similar to what we know about how reasoning models work. By generating more tokens and reviewing their previous responses, the models are more likely to correct potential mistakes. Marco-o1, another reasoning model recently released by Alibaba, might also contain hints of how QwQ works. Marco-o1 uses Monte Carlo Tree Search (MCTS) and self-reflection at inference time to create different branches of reasoning and choose the best answers. The model was trained on a mixture of chain-of-thought (CoT) examples and synthetic data generated with MCTS algorithms. Alibaba points out that QwQ still has limitations, such as mixing languages or getting stuck in circular reasoning loops. The model is available for download on Hugging Face, and an online demo can be found on Hugging Face Spaces. The LLM age gives way to LRMs: Large Reasoning Models The release of o1 has triggered growing interest in creating LRMs, even though not much is known about how the model works under the hood, aside from using inference-time scaling to improve the model’s responses. There are now several Chinese competitors to o1. Chinese AI lab DeepSeek recently released R1-Lite-Preview, its o1 competitor, which is currently only available through the company’s online chat interface. R1-Lite-Preview reportedly beats o1 on several key benchmarks.
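The "generate more, then review" recipe described above can be boiled down to a best-of-n loop: sample several candidate answers, run a verification pass over each, and keep one that checks out. The sketch below is a stubbed illustration of that idea (the "model" is a toy arithmetic sampler), not Alibaba's or OpenAI's actual method:

```python
def generate(prompt: str, seed: int) -> int:
    """Stand-in for one stochastic model sample: an attempt at 17 * 24,
    occasionally off by a fixed arithmetic slip."""
    slip = (-10, 0, 10, 0, 0)[seed % 5]
    return 17 * 24 + slip

def verify(prompt: str, answer: int) -> bool:
    """Stand-in for a self-check pass (e.g. re-deriving the result)."""
    return answer == 17 * 24

def best_of_n(prompt: str, n: int) -> int:
    """Inference-time scaling: sample n candidates, return one that verifies."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    for c in candidates:
        if verify(prompt, c):
            return c
    return candidates[0]  # fall back to the first sample

print(best_of_n("What is 17 * 24?", n=5))  # → 408
```

Spending extra compute on sampling and checking, rather than on a bigger model, is the common thread across QwQ, Marco-o1, and o1-style systems; MCTS extends the same idea from independent samples to a search tree over partial reasoning steps.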
Another recently released model is LLaVA-o1, developed by researchers from multiple universities in China, which brings the inference-time reasoning paradigm to open-source vision language models (VLMs). The focus on LRMs comes at a time of uncertainty about the future of model scaling laws. Reports indicate that AI labs such as OpenAI, Google DeepMind, and Anthropic are getting diminishing returns on training larger models. And creating larger volumes of quality training data is becoming increasingly difficult as models are already being trained on trillions of tokens gathered from the internet. Meanwhile, inference-time scaling offers an alternative that might provide the next breakthrough in improving the abilities of the next generation of AI models. There are reports that OpenAI is using o1 to generate synthetic reasoning data to train the next generation of its LLMs. The release of open reasoning models is likely to stimulate progress and make the space more competitive.


Trump’s AI Czar and the Wild West of AI regulation: Strategies for enterprises to navigate the chaos

AI is advancing at breakneck speed, but the regulatory landscape is in chaos. With the coming Trump administration vowing to take a hands-off approach to regulation, a lack of AI regulation at the federal level means that the U.S. is facing a fragmented patchwork of state-led rules – or in some cases no rules at all. Recent reports suggest that President-elect Trump is considering appointing an “AI czar” in the White House to coordinate federal policy and governmental use of artificial intelligence. While this move may indicate an evolving approach to AI oversight, it remains unclear how much regulation will actually be implemented. Though apparently not taking on the AI czar role, Tesla chief Elon Musk is expected to play a significant role in shaping future use cases and debates surrounding AI. But Musk is hard to read. While he espouses minimal regulation, he also has expressed fear around unrestrained AI – so if anything, his role injects even more uncertainty. Trump’s “efficiency” appointees Musk and Vivek Ramaswamy have vowed to take a chainsaw approach to the federal bureaucracy that could reduce it “25%” or more. So there doesn’t seem to be any reason to expect forceful regulation anytime soon. For executives like Wells Fargo’s Chintan Mehta, who at our AI Impact event in January called for regulation to create more certainty, this lack of regulation doesn’t make things easier. In fact, regulation around AI was already way behind, and delaying it further meant more headaches. The bank, which is already heavily regulated, faces an ongoing guessing game of what might be regulated in the future. This uncertainty forces it to spend significant engineering resources “building scaffolding around things,” Mehta said at the time, because it doesn’t know what to expect once applications go to market. That caution is well deserved.
Steve Jones, executive VP for gen AI at Capgemini, says that no federal AI regulation means that frontier model companies like OpenAI, Microsoft, Google and Anthropic face no accountability for any harmful or dubious content generated by their models. As a result, enterprise users are left to shoulder the risks: “You’re on your own,” Jones emphasized. Companies cannot easily hold model providers accountable if something goes wrong, increasing their exposure to potential liabilities. Moreover, Jones pointed out that if these model providers use data scraped without proper indemnification or leak sensitive information, enterprise users could become vulnerable to lawsuits. For example, he mentioned a large financial services company that has resorted to “poisoning” its data—injecting fictional data into its systems to identify any unauthorized use if it leaks. This uncertain environment poses significant risks and hidden opportunities for executive decision-makers. Join us at an exclusive event about AI regulation in Washington D.C. on Dec. 5, with speakers from Capgemini, Verizon, Fidelity and more, as we cut through the noise, providing clear strategies to help enterprise leaders stay ahead of compliance challenges, navigate the evolving patchwork of regulations and leverage the flexibility of the current landscape to innovate without fear. Hear from top experts in AI and industry as they share actionable insights to guide your enterprise through this regulatory Wild West. (Links to RSVP and full agenda here. Space is limited, so move quickly.) Navigating the Wild West of AI Regulation: The Challenge Ahead In the rapidly evolving landscape of AI, enterprise leaders face a dual challenge: harnessing AI’s transformative potential while encountering regulatory hurdles that are often unclear. The onus is increasingly on companies to be proactive; otherwise, they could end up in hot water, like SafeRent, DoNotPay and Clearview.
Capgemini’s Steve Jones notes that relying on model providers without clear indemnification agreements is risky—it’s not just the models’ outputs that can pose problems, but the data practices and potential liabilities as well. The lack of a cohesive federal framework, coupled with varying state regulations, creates a complex compliance landscape. For instance, the FTC’s actions against companies like DoNotPay signal a more aggressive stance on AI-related misrepresentations, while state-level initiatives, such as New York’s Bias Audit Law, impose additional compliance requirements. The potential appointment of an AI czar could centralize AI policy, but the impact on practical regulation remains uncertain, leaving companies with more questions than answers. Join the conversation: The future of AI regulation Enterprise leaders must adopt proactive strategies to navigate this environment: Implement robust compliance programs: Develop comprehensive AI governance frameworks that address potential biases, ensure transparency, and comply with existing and emerging regulations. Stay informed on regulatory developments: Regularly monitor both federal and state regulatory changes to anticipate and adapt to new compliance obligations, including potential federal efforts like the AI czar initiative. Engage with policymakers: Participate in industry groups and engage with regulators to influence the development of balanced AI policies that consider both innovation and ethical considerations. Invest in ethical AI practices: Prioritize the development and deployment of AI systems that adhere to ethical standards, thereby mitigating risks associated with bias and discrimination. Enterprise decision-makers must remain vigilant, adaptable and proactive to navigate the complexities of AI regulation successfully. 
By learning from the experiences of others and staying informed through studies and reports, companies can position themselves to leverage AI’s benefits while minimizing regulatory risks. We invite you to join us at the upcoming salon event in Washington D.C. on Dec. 5 to be part of this crucial conversation and gain the knowledge needed to stay ahead of the regulatory curve, and understand the implications of potential federal actions like the AI czar.


Luma expands Dream Machine AI video model into full creative platform, mobile app

The competition between startups and big companies such as Google and Meta to offer compelling AI video creation tools has entered a new phase. Luma AI, a startup founded by former Googlers and others, is dramatically expanding its Dream Machine AI video model with a new interface, mobile app and new image generation foundation model. The model, Luma Photon, combines personalization, efficiency and creative power to push the boundaries of image and video creation. Now available on web and iOS, the new Dream Machine aims to blend simplicity and sophistication across input devices through a unified intuitive conversational interface. Promotional image of new Luma desktop interface. Credit: Luma AI With over 25 million registered users since its launch in June 2024, Dream Machine is evolving into a subscription-based service for both casual creators and professionals in industries such as fashion, marketing and filmmaking. “We built Dream Machine as a visual thought partner, powered by a whole new image model called Luma Photon,” said Amit Jain, CEO and co-founder of Luma AI, during a video interview with VentureBeat. “It’s creative, intelligent, and designed for the people who build our world—designers, creators in fashion, media, and entertainment.” Promotional image of Dream Machine iOS. Credit: Luma AI A new approach to visual creation Dream Machine aims to remove the complexity traditionally associated with creative tools. Users can simply describe their ideas in natural language or provide reference images to guide the platform’s outputs. Unlike traditional prompt engineering, which requires precise and technical input, Dream Machine is built for intuitive interaction. “Unlike prompt engineering, where you have to carefully craft specific commands, Dream Machine lets you talk to it like you’re talking to a person.
This conversational interface makes editing and creating intuitive,” Jain explained. The platform’s new personalization features, including multi-image prompting and single-image character references, enable users to bring their visions to life with greater accuracy and detail. For instance, designers can upload textures, colors and other visual cues to guide the system’s outputs. “With Dream Machine, you can give it reference images—colors, structures, or textures—and it will intelligently combine and iterate until you get exactly what you want. It’s a game-changer for designers and creatives,” Jain added. New modes include a Brainstorm mode which allows users to apply different style influences to their imagery and video, as well as Boards of multiple images and videos that can be shared between team members and fellow creators, and “Concept Pills” that offer pre-set unified stylistic visuals users can apply to their video and image outputs. Consistent characters from a single image Luma AI envisions a future where visual creation is as seamless and accessible as typing text. Dream Machine bridges this gap, making advanced generative tools usable for everyone, from hobbyists to industry professionals. “Why should creating images and videos be as hard as using Adobe’s tools? Imagine if making text was that difficult—there’d be no digital revolution. Visual thought should be just as accessible,” Jain argued. Beyond accessibility, the platform introduces groundbreaking capabilities for video creation. Users can animate storylines with consistent characters derived from a single image, opening new doors for storytelling. “You can now create infinite variations of a person from just one image. This consistency allows for entire storylines in videos with the same character—something that hasn’t existed in video creation until now,” Jain said. 
New image generation model Photon At the core of these advancements is Luma Photon, the company’s latest image foundation model, which generates high-quality still images from text prompts — and includes “state-of-the-art” embedded text, something many other image generation models still struggle to accomplish reliably. Photon is built on Luma’s Universal Transformer architecture, which Luma says is eight times faster and more cost-efficient than comparable models. This efficiency enables rapid iteration without sacrificing quality, making it ideal for high-demand use cases. “Our new Photon model is the most creative and personalizable model available today. It adapts in real-time without training, using an advanced universal transform architecture,” Jain explained. Developers can also harness the power of Photon through the Luma AI API, which supports text-to-image, text-to-video and image-to-video transformations. The API ensures privacy for user data and offers scalability for products built on its platform. New subscription pricing The updated Dream Machine is offered in four pricing tiers: Hobbyists: $9.99/month Explorers: $29.99/month Professionals: $99.99/month Enterprise: Custom pricing for large teams These tiers provide flexibility for users with varying needs, whether they’re creating for personal projects or commercial ventures. A new era of possibilities Since its founding in 2021, Luma AI has raised $80M in funding from strategic investors including Andreessen Horowitz, Amplify Partners, Matrix Partners, General Catalyst, and South Park Commons. Anjney Midha, a general partner at Andreessen Horowitz, highlighted the platform’s potential to power industries ranging from photorealistic video generation to interactive 3D worldbuilding. As Dream Machine evolves, Luma AI is delivering on its mission to democratize creativity. “Dream Machine is where you come to visualize what’s in your head. 
It helps make sense of the increasingly complex world by leveraging AI’s ability to process and simplify vast amounts of information,” Jain concluded. With its blend of accessibility, personalization, and cutting-edge technology, Dream Machine is poised to redefine how people create and share their ideas in the digital age. Correction: this article originally misstated the amount Luma AI had raised and erroneously mentioned Amazon as an investor. The article has since been updated to the correct number and investor list.


Anthropic bets on personalization in the AI arms race with new ‘styles’ feature

Anthropic, a leading artificial intelligence company backed by major tech investors, announced today a significant update to its Claude AI assistant that allows users to customize how the AI communicates — a move that could reshape how businesses integrate AI into their workflows. The new “styles” feature, launching today on Claude.ai, enables users to preset how Claude responds to queries, offering formal, concise, or explanatory modes. Users can also create custom response patterns by uploading sample content that matches their preferred communication style. Customization becomes key battleground in enterprise AI race This development comes as AI companies race to differentiate their offerings in an increasingly crowded market dominated by OpenAI’s ChatGPT and Google’s Gemini. While most AI assistants maintain a single conversational style, Anthropic’s approach acknowledges that different business contexts require different communication approaches. “At the moment, many users don’t even know they can instruct AI to answer in a specific way,” an Anthropic spokesperson told VentureBeat. “Styles helps break through that barrier — it teaches users a new way to use AI and has the potential to open up knowledge they previously thought was inaccessible.” Early enterprise adoption suggests promising results. GitLab, an early customer, has already integrated the feature into various business processes. “Claude’s ability to maintain a consistent voice while adapting to different contexts allows our team members to use styles for various use cases including writing business cases, updating user documentation, and creating and translating marketing materials,” said Taylor McCaslin, Product Lead AI/ML at GitLab, in a statement sent to VentureBeat. Notably, Anthropic is taking a strong stance on data privacy with this feature.
“Unlike other AI labs, we don’t train our generative AI models on user-submitted data by default. Anything users upload will not be used to train our models,” the company spokesperson emphasized. This position contrasts with some competitors’ practices of using customer interactions to improve their models. AI customization signals shift in enterprise strategy While team-wide style sharing won’t be available at launch, Anthropic appears to be laying groundwork for broader enterprise features. “We’re striving to make Claude as efficient and user-friendly as possible across a range of industries, workflows, and individuals,” the spokesperson said, suggesting future expansions of the feature. The move comes as enterprise AI adoption accelerates, with companies seeking ways to standardize AI interactions across their organizations. By allowing businesses to maintain consistent communication styles across AI interactions, Anthropic is positioning Claude as a more sophisticated tool for enterprise deployment. The introduction of styles represents a crucial strategic pivot for Anthropic. While competitors have focused on raw performance metrics and model size, Anthropic is betting that the key to enterprise adoption lies in adaptability and user experience. This approach could prove particularly appealing to large organizations struggling to maintain consistent communication across diverse teams and departments. The feature also addresses a growing concern among enterprise customers: the need to maintain brand voice and corporate communication standards while leveraging AI tools. As the AI industry matures beyond its initial phase of technical one-upmanship, the battlefield is shifting toward practical implementation and user experience. Anthropic’s styles feature might seem like a modest update, but it signals a deeper understanding of what enterprises really need from AI: not just intelligence, but intelligence that speaks their language. 
And in the high-stakes world of enterprise AI, sometimes it’s not what you say, but how you say it that matters most.
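Styles is a Claude.ai interface feature, and Anthropic has not published its implementation. As a rough sketch, a similar effect can be approximated programmatically with system prompts; the preset texts and the `build_system_prompt` helper below are hypothetical illustrations, not Anthropic's API.

```python
# Illustrative sketch only: approximates "styles" with plain system prompts.
# Preset wording and the helper name are assumptions, not Anthropic's design.
STYLE_PRESETS = {
    "formal": "Respond in a formal, professional tone with complete sentences.",
    "concise": "Respond as briefly as possible; prefer bullet points over prose.",
    "explanatory": "Respond with step-by-step explanations and worked examples.",
}

def build_system_prompt(style: str, sample_text: str = "") -> str:
    """Compose a system prompt for a preset style, optionally imitating a sample."""
    if style not in STYLE_PRESETS:
        raise ValueError(f"unknown style: {style!r}")
    prompt = STYLE_PRESETS[style]
    if sample_text:
        # Mirrors the "upload sample content" idea: ask the model to match it.
        prompt += f"\n\nMatch the voice of this sample:\n{sample_text}"
    return prompt

print(build_system_prompt("concise"))
```

The resulting string would be passed as the system prompt of a chat request, so the same preset can be reused across every conversation in a team's workflow.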

Anthropic bets on personalization in the AI arms race with new ‘styles’ feature

Hugging Face’s SmolVLM could cut AI costs for businesses by a huge margin

Hugging Face has just released SmolVLM, a compact vision-language AI model that could change how businesses use artificial intelligence across their operations. The new model processes both images and text with remarkable efficiency while requiring just a fraction of the computing power needed by its competitors.

The timing couldn’t be better. As companies struggle with the skyrocketing costs of implementing large language models and the computational demands of vision AI systems, SmolVLM offers a pragmatic solution that doesn’t sacrifice performance for accessibility.

Small model, big impact: How SmolVLM changes the game

“SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs,” the research team at Hugging Face explains on the model card. What makes this significant is the model’s unprecedented efficiency: it requires only 5.02 GB of GPU RAM, while competing models like Qwen-VL 2B and InternVL2 2B demand 13.70 GB and 10.52 GB respectively.

This efficiency represents a fundamental shift in AI development. Rather than following the industry’s bigger-is-better approach, Hugging Face has shown that careful architecture design and innovative compression techniques can deliver enterprise-grade performance in a lightweight package. This could dramatically reduce the barrier to entry for companies looking to implement AI vision systems.

Visual intelligence breakthrough: SmolVLM’s advanced compression technology explained

The technical achievements behind SmolVLM are remarkable. The model introduces an aggressive image compression system that processes visual information more efficiently than any previous model in its class.
“SmolVLM uses 81 visual tokens to encode image patches of size 384×384,” the researchers explained, a method that allows the model to handle complex visual tasks while maintaining minimal computational overhead.

This innovative approach extends beyond still images. In testing, SmolVLM demonstrated unexpected capabilities in video analysis, achieving a 27.14% score on the CinePile benchmark. This places it competitively among larger, more resource-intensive models, suggesting that efficient AI architectures might be more capable than previously thought.

The future of enterprise AI: Accessibility meets performance

The business implications of SmolVLM are profound. By making advanced vision-language capabilities accessible to companies with limited computational resources, Hugging Face has essentially democratized a technology that was previously reserved for tech giants and well-funded startups.

The model comes in three variants designed to meet different enterprise needs. Companies can deploy the base version for custom development, use the synthetic version for enhanced performance, or implement the instruct version for immediate deployment in customer-facing applications.

Released under the Apache 2.0 license, SmolVLM builds on the shape-optimized SigLIP image encoder and SmolLM2 for text processing. The training data, sourced from The Cauldron and Docmatix datasets, ensures robust performance across a wide range of business use cases.

“We’re looking forward to seeing what the community will create with SmolVLM,” the research team stated. This openness to community development, combined with comprehensive documentation and integration support, suggests that SmolVLM could become a cornerstone of enterprise AI strategy in the coming years.

The implications for the AI industry are significant.
As companies face mounting pressure to implement AI solutions while managing costs and environmental impact, SmolVLM’s efficient design offers a compelling alternative to resource-intensive models. This could mark the beginning of a new era in enterprise AI, where performance and accessibility are no longer mutually exclusive.

The model is available immediately through Hugging Face’s platform, with the potential to reshape how businesses approach visual AI implementation in 2024 and beyond.
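The quoted figure of 81 visual tokens per 384×384 patch allows a back-of-envelope budget calculation. The tiling-by-ceiling-division scheme below is an assumption for illustration, not SmolVLM's actual preprocessing pipeline:

```python
import math

# Assumption: images are tiled into 384x384 patches, each costing 81 visual
# tokens (the figure quoted for SmolVLM). Real preprocessing may resize or
# crop differently; this is only a rough budgeting sketch.
PATCH = 384
TOKENS_PER_PATCH = 81

def visual_token_estimate(width: int, height: int) -> int:
    """Estimate visual tokens for an image tiled into 384x384 patches."""
    patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    return patches * TOKENS_PER_PATCH

print(visual_token_estimate(384, 384))  # one patch  -> 81 tokens
print(visual_token_estimate(768, 768))  # four patches -> 324 tokens
```

Under this assumption, even a large 768×768 image stays in the low hundreds of tokens, which is part of why the model's memory footprint stays small relative to its competitors.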


Microsoft launches Azure AI Foundry with agent orchestration, management tools

Microsoft introduced AI agents to its Dynamics 365 platform in October. During Ignite, the company announced it would add more agentic AI capabilities to other Microsoft products, such as SharePoint and Microsoft 365 Copilot.

However, enterprises will need to manage the many AI agents they deploy and understand whether those agents accurately follow workflows and access only what they are supposed to access. Microsoft’s new capabilities in Azure AI aim to solve precisely those issues, giving developers evaluation tools and a way to manage AI agents at scale.

The software development kit for Azure AI Foundry offers a toolkit to customize, test, deploy and manage AI applications and agents. It gives developers control and customization over the many AI apps brought into their tech stack. Microsoft said the SDK, available in preview, has 25 templates developers can use with an integrated library of models to build at scale.

The Azure AI Foundry portal, formerly Azure AI Studio and also in preview, offers developers a visual dashboard for evaluating models and tools. The dashboard also lets people manage who can use certain apps. The company said AI Foundry is integrated with other developer tools like GitHub, Visual Studio and Copilot Studio.

As agents become a more significant part of the AI ecosystem, enterprises want to figure out how to manage how these work together. Azure AI Agent Service lets companies establish orchestration frameworks for automated workflows. “With features like bring your own storage and private networking, it will ensure data privacy and compliance to help organizations protect their sensitive data,” the company said.

Agent management

AI agents emerged as one of the big trends in enterprise AI this year and are poised to grow in the next year as more companies begin experimenting with them.
Several providers, like Microsoft and Salesforce, offer customers access to agents or a no-code way of building agents. Microsoft is also researching how multiple agents can work together through a new framework called Magentic-One.

Ideally, AI agents would automate workflows without requiring human employees to keep prompting AI applications. Companies are beginning to use multiple agents, triggering each activity to build out their agent ecosystems. To ensure the agents do their jobs, some providers create orchestration agents that monitor and direct other agents.

However, enterprises still need to figure out the best way to deploy these agents across their entire organization without accidentally letting agents access data they shouldn’t. Determining safety and performance for agents can also be difficult, as current benchmarks don’t really capture agentic performance.
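The orchestration-agent pattern described above can be sketched in a framework-agnostic way. All names in this sketch are hypothetical — it is not the Azure AI Agent Service API, just the shape of the supervisor-routes-and-checks idea:

```python
# Framework-agnostic sketch of an orchestration agent: a supervisor routes a
# task to a worker agent, then verifies the result before letting it through.
# Everything here is illustrative; real systems would back workers with LLMs.
from typing import Callable

Agent = Callable[[str], str]

def orchestrate(task: str, workers: dict[str, Agent],
                route: Callable[[str], str],
                check: Callable[[str], bool]) -> str:
    """Pick a worker for the task, run it, and escalate if the check fails."""
    name = route(task)
    result = workers[name](task)
    if not check(result):
        return f"ESCALATE: worker {name!r} produced an unverified result"
    return result

# Toy workers standing in for LLM-backed agents.
workers = {
    "billing": lambda t: f"billing handled: {t}",
    "support": lambda t: f"support handled: {t}",
}
route = lambda t: "billing" if "invoice" in t else "support"
check = lambda r: r.startswith(("billing", "support"))

print(orchestrate("invoice #42 overdue", workers, route, check))
```

The design point is that routing and verification live outside any single worker, which is what lets an orchestration layer enforce scope and correctness across agents it did not write.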


Getting started with AI agents (part 2): Autonomy, safeguards and pitfalls

In our first installment, we outlined key strategies for leveraging AI agents to improve enterprise efficiency. I explained how, unlike standalone AI models, agents iteratively refine tasks using context and tools to enhance outcomes such as code generation. I also discussed how multi-agent systems foster communication across departments, creating a unified user experience and driving productivity, resilience and faster upgrades.

Success in building these systems hinges on mapping roles and workflows, as well as establishing safeguards such as human oversight and error checks to ensure safe operation. Let’s dive into these critical elements.

Safeguards and autonomy

Agents imply autonomy, so various safeguards must be built into an agent within a multi-agent system to reduce errors, waste, legal exposure or harm when agents are operating autonomously. Applying all of these safeguards to every agent may be overkill and pose a resource challenge, but I highly recommend considering each agent in the system and consciously deciding which of these safeguards it needs. An agent should not be allowed to operate autonomously when any one of these conditions is met.

Explicitly defined human intervention conditions

A set of predefined rules determines the conditions under which a human needs to confirm an agent’s behavior. These rules should be defined on a case-by-case basis and can be declared in the agent’s system prompt — or, in more critical use cases, be enforced using deterministic code external to the agent. One such rule, in the case of a purchasing agent, would be: “All purchasing should first be verified and confirmed by a human.
Call your ‘check_with_human’ function and do not proceed until it returns a value.”

Safeguard agents

A safeguard agent can be paired with an agent, with the role of checking for risky, unethical or noncompliant behavior. The agent can be forced to always check all or certain elements of its behavior against a safeguard agent, and not proceed unless the safeguard agent returns a go-ahead.

Uncertainty

Our lab recently published a paper on a technique that can provide a measure of uncertainty for what a large language model (LLM) generates. Given the propensity of LLMs to confabulate (commonly known as hallucination), preferring outputs with low uncertainty can make an agent much more reliable. Here, too, there is a cost to be paid. Assessing uncertainty requires us to generate multiple outputs for the same request so that we can rank-order them based on certainty and choose the behavior that has the least uncertainty. That can make the system slow and increase costs, so it should be considered only for the more critical agents within the system.

Disengage button

There may be times when we need to stop all autonomous agent-based processes. This could be because we need consistency, or we’ve detected behavior in the system that needs to stop while we figure out what is wrong and how to fix it. For more critical workflows and processes, it is important that this disengagement does not result in all processes stopping or becoming fully manual, so it is recommended that a deterministic fallback mode of operation be provisioned.

Agent-generated work orders

Not all agents within an agent network need to be fully integrated into apps and APIs. Full integration might take a while and usually takes a few iterations to get right. My recommendation is to add a generic placeholder tool to agents (typically leaf nodes in the network) that would simply issue a report or a work order containing suggested actions to be taken manually on behalf of the agent.
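A placeholder work-order tool of this kind can be very small. The field names and format below are hypothetical, chosen only to illustrate the idea:

```python
# Sketch of the generic placeholder tool described above: instead of a real
# API integration, the agent emits a structured work order for a human to
# execute. Field names and the JSON format are illustrative assumptions.
import json
from datetime import datetime, timezone

def issue_work_order(agent: str, action: str, details: dict) -> str:
    """Return a human-readable work order instead of executing the action."""
    order = {
        "issued_by": agent,
        "requested_action": action,
        "details": details,
        "issued_at": datetime.now(timezone.utc).isoformat(),
        "status": "pending_manual_execution",
    }
    return json.dumps(order, indent=2)

print(issue_work_order("procurement-agent", "renew_license",
                       {"vendor": "Acme", "seats": 25}))
```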
This is a great way to bootstrap and operationalize your agent network in an agile manner.

Testing

With LLM-based agents, we gain robustness at the cost of consistency. Also, given the opaque nature of LLMs, we are dealing with black-box nodes in a workflow, which means we need a different testing regime for agent-based systems than the one used for traditional software. The good news, however, is that we are used to testing such systems: we have been operating human-driven organizations and workflows since the dawn of industrialization.

While the examples I showed above have a single entry point, all agents in a multi-agent system have an LLM as their brain, so any of them can act as the entry point for the system. We should use divide and conquer, and first test subsets of the system by starting from various nodes within the hierarchy. We can also employ generative AI to come up with test cases to run against the network, analyzing its behavior and pushing it to reveal its weaknesses.

Finally, I’m a big advocate for sandboxing. Such systems should be launched at a smaller scale within a controlled and safe environment first, before gradually being rolled out to replace existing workflows.

Fine-tuning

A common misconception with gen AI is that it gets better the more you use it. This is obviously wrong: LLMs are pre-trained. Having said this, they can be fine-tuned to bias their behavior in various ways. Once a multi-agent system has been devised, we may choose to improve its behavior by taking the logs from each agent and labeling our preferences to build a fine-tuning corpus.

Pitfalls

Multi-agent systems can fall into a tailspin, meaning that occasionally a query might never terminate, with agents perpetually talking to each other. This requires some form of timeout mechanism.
For example, we can check the history of communications for the same query, and if it is growing too large or we detect repetitious behavior, we can terminate the flow and start over.

Another problem that can occur is a phenomenon I will call overloading: expecting too much of a single agent. The current state of the art for LLMs does not allow us to hand agents long and detailed instructions and expect them to follow them all, all the time. Also, did I mention these systems can be inconsistent?
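The history-based termination check described above can be sketched in a few lines. The thresholds and the exact repetition test are illustrative assumptions, not a prescription:

```python
# Sketch of a tailspin guard: terminate a multi-agent conversation when its
# history grows too long or the recent messages start repeating.
# max_turns and repeat_window are illustrative defaults.
def should_terminate(history: list[str], max_turns: int = 50,
                     repeat_window: int = 3) -> bool:
    """True when the conversation should be cut off and restarted."""
    if len(history) > max_turns:
        return True
    # Repetition check: any duplicate message within the recent window.
    recent = history[-repeat_window:]
    return len(recent) == repeat_window and len(set(recent)) < repeat_window

print(should_terminate(["plan", "retry", "retry", "retry"]))  # looping -> True
print(should_terminate(["plan", "search", "summarize"]))      # healthy -> False
```

In practice this check would run in the orchestration layer between turns, with the restart handled by whatever deterministic fallback the system provisions.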
