VentureBeat

OctoTools: Stanford’s open-source framework optimizes LLM reasoning through modular tool orchestration

OctoTools, a new open-source agentic platform released by scientists at Stanford University, can turbocharge large language models (LLMs) for reasoning tasks by breaking down tasks into subunits and enhancing the models with tools. While tool use has already become an important application of LLMs, OctoTools makes these capabilities much more accessible by removing technical barriers and allowing developers and enterprises to extend the platform with their own tools and workflows. Experiments show that OctoTools outperforms classic prompting methods and other LLM application frameworks, making it a promising tool for real-world uses of AI models.

LLMs often struggle with reasoning tasks that involve multiple steps, logical decomposition or specialized domain knowledge. One solution is to outsource specific steps of the solution to external tools such as calculators, code interpreters, search engines or image processing tools. In this scenario, the model focuses on higher-level planning while the actual calculation and reasoning are done through the tools.

However, tool use has its own challenges. For example, classic LLMs often require substantial training or few-shot learning with curated data to adapt to new tools, and once augmented, they will be limited to specific domains and tool types. Tool selection also remains a pain point. LLMs can become good at using one or a few tools, but when a task requires multiple tools, they can get confused and perform badly.

OctoTools framework (source: GitHub)

OctoTools addresses these pain points through a training-free agentic framework that can orchestrate multiple tools without the need to fine-tune or adjust the models. OctoTools uses a modular approach to tackle planning and reasoning tasks and can use any general-purpose LLM as its backbone.

Among the key components of OctoTools are "tool cards," which act as wrappers for the tools the system can use, such as Python code interpreters and web-search APIs. Tool cards include metadata such as input-output formats, limitations and best practices for each tool. Developers can add their own tool cards to the framework to suit their applications.

When a new prompt is fed into OctoTools, a "planner" module uses the backbone LLM to generate a high-level plan that summarizes the objective, analyzes the required skills, identifies relevant tools and includes additional considerations for the task. The planner determines a set of sub-goals that the system needs to achieve to accomplish the task and describes them in a text-based action plan.

For each step in the plan, an "action predictor" module refines the sub-goal to specify the tool required to achieve it and makes sure it is executable and verifiable. Once the plan is ready to be executed, a "command generator" maps the text-based plan to Python code that invokes the specified tools for each sub-goal, then passes the command to the "command executor," which runs the command in a Python environment. The results of each step are validated by a "context verifier" module, and the final result is consolidated by a "solution summarizer."

Example of OctoTools components (source: GitHub)

"By separating strategic planning from command generation, OctoTools reduces errors and increases transparency, making the system more reliable and easier to maintain," the researchers write.
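To make the division of labor concrete, below is an illustrative Python sketch of that planner-to-summarizer loop. It is not the actual OctoTools API; every name in it (ToolCard, run_python, the llm callable) is a stand-in invented for this example.

```python
# Illustrative sketch of an OctoTools-style tool-orchestration loop.
# This is NOT the actual OctoTools API; all names here are hypothetical.
import contextlib
import io
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCard:
    """Wrapper around a tool, carrying the metadata the planner reads."""
    name: str
    description: str  # purpose, input-output formats, limitations

def run_python(code: str) -> str:
    """Naive command executor: runs generated code and captures stdout.
    A production system would sandbox this step."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

def solve(query: str, tools: list[ToolCard], llm: Callable[[str], str]) -> str:
    """llm is any text-in/text-out backbone model (for example, an API call)."""
    catalog = "\n".join(f"- {t.name}: {t.description}" for t in tools)

    # Planner: summarize the objective and break it into sub-goals.
    plan = llm(f"Task: {query}\nAvailable tools:\n{catalog}\n"
               "List the sub-goals needed to solve the task, one per line.")

    trajectory = []
    for sub_goal in filter(None, plan.splitlines()):
        # Action predictor: refine the sub-goal into one executable step.
        action = llm(f"Choose one tool and concrete inputs for: {sub_goal}\n"
                     f"Tools:\n{catalog}")
        # Command generator: turn the text action into runnable Python.
        command = llm(f"Write a short Python snippet (printing its result) "
                      f"that performs: {action}")
        # Command executor: run the snippet.
        result = run_python(command)
        # Context verifier: check the step before moving on.
        verdict = llm(f"Sub-goal: {sub_goal}\nResult: {result}\n"
                      "Is this result valid and sufficient? Answer briefly.")
        trajectory.append({"sub_goal": sub_goal, "result": result, "check": verdict})

    # Solution summarizer: consolidate everything into a final answer.
    return llm(f"Question: {query}\nSteps taken: {trajectory}\n"
               "Write the final consolidated answer.")
```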
OctoTools also uses an optimization algorithm to select the best subset of tools for each task. This helps avoid overwhelming the model with irrelevant tools.

Agentic frameworks

There are several frameworks for creating LLM applications and agentic systems, including Microsoft AutoGen, LangChain and OpenAI API "function calling." OctoTools outperforms these platforms on tasks that require reasoning and tool use, according to its developers.

OctoTools vs other agentic frameworks (source: GitHub)

The researchers tested all frameworks on several benchmarks for visual, mathematical and scientific reasoning, as well as medical knowledge and agentic tasks. OctoTools achieved an average accuracy gain of 10.6% over AutoGen, 7.5% over GPT-Functions and 7.3% over LangChain when using the same tools. According to the researchers, the reason for OctoTools' better performance is its superior tool-usage distribution and its proper decomposition of queries into sub-goals.

OctoTools offers enterprises a practical solution for using LLMs on complex tasks. Its extendable tool integration will help overcome existing barriers to creating advanced AI reasoning applications. The researchers have released the code for OctoTools on GitHub.


Semantic understanding, not just vectors: How Intuit’s data architecture powers agentic AI with measurable ROI

Intuit — the financial software giant behind products like TurboTax and QuickBooks — is making significant strides using generative AI to enhance its offerings for small business customers. In a tech landscape flooded with AI promises, Intuit has built an agent-based AI architecture that's delivering tangible business outcomes for small businesses. The company has deployed what it calls "done for you" experiences that autonomously handle entire workflows and deliver quantifiable business impact.

Intuit has been building out its own AI layer, which it calls a generative AI operating system (GenOS). The company detailed some of the ways it is using gen AI to improve personalization at VB Transform 2024. In Sept. 2024, Intuit added agentic AI workflows, an effort that has improved operations for both the company and its users.

According to new Intuit data, QuickBooks Online customers are getting paid an average of five days faster, with overdue invoices 10% more likely to be paid in full. For small businesses where cash flow is king, these aren't just incremental improvements — they're potentially business-saving innovations.

The technical trinity: How Intuit's data architecture enables true agentic AI

What separates Intuit's approach from competitors is its sophisticated data architecture designed specifically to enable agent-based AI experiences. The company has built what CDO Ashok Srivastava calls "a trinity" of data systems:

• Data lake: The foundational repository for all data.
• Customer data cloud (CDC): A specialized serving layer for AI experiences.
• Event bus: A streaming data system enabling real-time operations.

"CDC provides a serving layer for AI experiences, then the data lake is kind of the repository for all such data," Srivastava told VentureBeat. "The agent is going to be interacting with data, and it has a set of data that it could look at in order to pull information."

Going beyond vector embeddings to power agentic AI

The Intuit architecture diverges from the typical vector database approach many enterprises are hastily implementing. While vector databases and embeddings are important for powering AI models, Intuit recognizes that true semantic understanding requires a more holistic approach.

"Where the key issue continues to be is essentially in ensuring that we have a good, logical and semantic understanding of the data," said Srivastava.

To achieve this semantic understanding, Intuit is building out a semantic data layer on top of its core data infrastructure. The semantic data layer helps provide context and meaning around the data, beyond just the raw data itself or its vector representations. It allows Intuit's AI agents to better comprehend the relationships and connections between different data sources and elements.

By building this semantic data layer, Intuit is able to augment the capabilities of its vector-based systems with a deeper, more contextual understanding of data. This allows AI agents to make more informed and meaningful decisions for customers.

Beyond basic automation: How agentic AI completes entire business processes autonomously

Unlike enterprises implementing AI for basic workflow automation or customer service chatbots, Intuit has focused on creating fully agentic "done for you" experiences. These are applications that handle complex, multi-step tasks while requiring only final human approval.
For QuickBooks users, the agentic system analyzes client payment history and invoice status to automatically draft personalized reminder messages, allowing business owners to simply review and approve before sending. The system's ability to personalize based on relationship context and payment patterns has directly contributed to measurably faster payments.

Intuit is applying identical agentic principles internally, developing autonomous procurement systems and HR assistants. "We have the ability to have an internal agentic procurement process that employees can use to purchase supplies and book travel," Srivastava explained, demonstrating how the company is eating its own AI dog food.

Designed for the reasoning model era

What potentially gives Intuit a competitive advantage over other enterprise AI implementations is how the system was designed with foresight about the emergence of advanced reasoning models like DeepSeek.

"We built gen runtime in anticipation of reasoning models coming up," Srivastava revealed. "We're not behind the eight ball … we're ahead of it. We built the capabilities assuming that reasoning would exist."

This forward-thinking design means Intuit can rapidly incorporate new reasoning capabilities into its agentic experiences as they emerge, without requiring architectural overhauls. According to Srivastava, Intuit's engineering teams are already using these capabilities to enable agents to reason across a large number of tools and data in ways that weren't previously possible.

Shifting from AI hype to business impact

Perhaps most significantly, Intuit's approach shows a clear focus on business outcomes rather than technological showmanship. "There's a lot of work and a lot of fanfare going on these days on AI itself, that it's going to revolutionize the world, and all of that, which I think is good," said Srivastava. "But I think what's a lot better is to show that it's actually helping real people do better."

The company believes deeper reasoning capabilities will enable even more comprehensive "done for you" experiences that cover more customer needs with greater depth. Each experience combines multiple atomic experiences, or discrete operations, that together create a complete workflow solution.

What this means for enterprises adopting AI

For enterprises looking to implement AI effectively, Intuit's approach offers several valuable lessons:

• Focus on outcomes over technology: Rather than showcasing AI for its own sake, target specific business pain points with measurable improvement goals.
• Build with future models in mind: Design architecture that can incorporate emerging reasoning capabilities without requiring a complete rebuild.
• Address data challenges first: Before rushing to implement agents, ensure your data foundation can support semantic understanding and cross-system reasoning.
• Create complete experiences: Look beyond simple automation to create end-to-end "done for you" workflows that deliver complete solutions.

As agentic AI continues to mature, enterprises that follow Intuit's example by focusing on complete solutions rather than isolated AI features may find themselves achieving similar concrete business results rather than simply generating tech buzz.
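As a purely illustrative sketch of what a semantic layer adds over raw vectors (this is not Intuit's implementation; every name and field below is invented), an agent can resolve a business concept to a governed definition and data source before it queries anything:

```python
# Hypothetical sketch of a semantic-layer lookup; not Intuit's implementation.
# The idea: agents resolve business concepts to governed definitions (source,
# meaning, relationships) instead of searching raw tables or vectors alone.
from dataclasses import dataclass

@dataclass
class SemanticEntity:
    name: str            # business concept, e.g. "overdue_invoice"
    source: str          # where the governed data lives (lake table, CDC view)
    definition: str      # what the concept means, in plain language
    related: list[str]   # other concepts the agent may need to join in

SEMANTIC_LAYER = {
    "overdue_invoice": SemanticEntity(
        name="overdue_invoice",
        source="cdc.invoices_serving_view",  # invented name
        definition="Invoice past its due date and not fully paid.",
        related=["customer", "payment_history"],
    ),
    "payment_history": SemanticEntity(
        name="payment_history",
        source="lake.payments",  # invented name
        definition="All recorded payments per customer.",
        related=["customer"],
    ),
}

def resolve(concept: str) -> SemanticEntity:
    """What an agent would consult before drafting, say, an invoice reminder:
    it gets meaning and provenance, not just rows that look similar."""
    return SEMANTIC_LAYER[concept]

if __name__ == "__main__":
    entity = resolve("overdue_invoice")
    print(entity.source, "->", entity.definition)
```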


Adobe launches first Photoshop mobile app and it has amazing Firefly AI-powered object detection, editing, and effects

Thirty-eight years after its creation, Adobe's Photoshop remains one of the most successful and widely used software products to this day — with reportedly more than 90% of the world's creative pros using the image editing and design tool. Now the company has introduced Photoshop on iPhone for free — the first time Photoshop is getting a dedicated mobile app outside of its stripped-down Elements version, launched back in late 2022. The app is available worldwide starting today, with an Android version expected later this year.

Alongside the mobile launch, Adobe is offering a new Photoshop Mobile and Web plan that enables seamless editing across devices. Pricing options include a limited free version of the app — with full on-device editing tools but no cross-platform support — and a premium upgrade available for $7.99 per month or $69.99 annually. Current Photoshop or Adobe Creative Cloud subscribers can access the mobile version at no extra cost as part of their existing plans.

The mobile app includes essential Photoshop tools such as layering, masking, Tap Select and the Spot Healing Brush. Users can also access Firefly-powered generative AI features like Generative Fill and Generative Expand. Designed specifically for mobile devices, the app allows both experienced professionals and newcomers to create high-quality visuals directly from their phones.

Ashley Still, senior vice president of digital media at Adobe, said the company is excited to bring Photoshop to mobile, making its design capabilities more accessible and intuitive. She emphasized that the app empowers creators to produce visually stunning content from anywhere.

New features and key capabilities

• Core Photoshop Tools: Layering, masking and blending tools tailored for mobile workflows
• Generative AI Tools: Firefly-powered Generative Fill and Generative Expand for quickly adding and editing image elements
• Tap Select Tool: Intuitive touch-based selection for removing, recoloring or replacing parts of an image
• Spot Healing Brush: Quick removal of unwanted elements and distractions from photos
• Object Select: Precise selection of people and objects with enhanced accuracy
• Direct Integration: Seamless workflows with Adobe Express, Adobe Fresco and Adobe Lightroom
• Access to Adobe Stock: A library of free assets for creating unique visuals
• Advanced Editing in Premium Plan: Magic Wand, Clone Stamp, Content-Aware Fill and advanced blend modes
• Typography Tools: Over 20,000 fonts with the option to import custom fonts
• Cross-Device Workflow: Start a project on mobile and continue editing with more precision on the web or desktop

Redesigned for mobile

The app is purpose-built for phones, with adjustments to the user interface that make tools easier to use with touch controls. Selection tools and touch targets have been optimized to ensure accuracy, even on smaller screens. Tap Select allows users to isolate objects with a single tap, while larger touch targets help prevent accidental selections. Object Select can identify subjects like people, animals and objects, streamlining the editing process.

Based on a demo Adobe gave during a video call yesterday, the app's interface is designed to be intuitive, with essential tools accessible within thumb reach and a clean layout that minimizes visual clutter.
Real-world examples

Adobe's press materials included several examples that highlight the app's functionality:

• A photo of a dog was edited using Generative Fill to remove unwanted elements like a leash and smartphone from the background, seamlessly blending the filled areas with the surrounding environment.
• In another example, the Remove Tool was used to eliminate distractions like hands and shadows from a photo of two people, creating a cleaner composition.
• Object Select was demonstrated by isolating a tennis racket from its background with a few taps, showcasing the tool's precision and speed.
• Additionally, compositing capabilities were shown through a GIF that combined vibrant elements like oversized flowers into a photo of two people, illustrating the app's creative potential.

The Photoshop Mobile and Web plan enables users to transition between devices without losing progress. However, editing Photoshop files in a web browser requires a Creative Cloud subscription, even if the file was originally created on the iOS app. This ensures that advanced editing features and cloud-based workflows are accessible to subscribers across platforms.

Impact on enterprise decision-makers

For enterprise teams, the expansion of Photoshop to mobile and web platforms offers greater flexibility and efficiency in creative workflows. Designers and marketers can now make quick edits, create visuals and collaborate on projects from anywhere, reducing reliance on desktop environments.

Adobe's Firefly AI models, which power many of the new mobile features, provide a competitive advantage over rivals like Midjourney, DALL·E and Krea AI by offering commercially safe generative AI tools. Firefly is trained on licensed Adobe Stock content and public domain materials, ensuring that AI-generated content can be used commercially without copyright concerns. This makes Photoshop a compelling option for businesses that need to create professional-grade visuals while maintaining compliance.

Additionally, the seamless integration with Adobe Creative Cloud and other Adobe tools streamlines workflows, enabling teams to maintain consistency and productivity across devices. The Android version of Photoshop is expected to launch later this year, further expanding Adobe's ecosystem of creative tools across mobile, web and desktop platforms.


GPT-4.5 for enterprise: Do its accuracy and knowledge justify the cost?

The release of OpenAI GPT-4.5 has been somewhat disappointing, with many pointing out its insane price point (about 10 to 20X more expensive than Claude 3.7 Sonnet and 15 to 30X more costly than GPT-4o). However, given that this is OpenAI's largest and most powerful non-reasoning model, it is worth considering its strengths and the areas where it shines.

Better knowledge and alignment

There is little detail about the model's architecture or training corpus, but we have a rough estimate that it has been trained with 10X more compute than GPT-4. And the model was so large that OpenAI needed to spread training across multiple data centers to finish in a reasonable time. Bigger models have a larger capacity for learning world knowledge and the nuances of human language (given that they have access to high-quality training data).

This is evident in some of the metrics presented by the OpenAI team. For example, GPT-4.5 has a record-high ranking on PersonQA, a benchmark that evaluates hallucinations in AI models. Practical experiments also show that GPT-4.5 is better than other general-purpose models at remaining true to facts and following user instructions. Users have pointed out that GPT-4.5's responses feel more natural and context-aware than those of previous models. Its ability to follow tone and style guidelines has also improved.

After the release of GPT-4.5, AI scientist and OpenAI co-founder Andrej Karpathy, who had early access to the model, said he "expect[ed] to see an improvement in tasks that are not reasoning-heavy, and I would say those are tasks that are more EQ (as opposed to IQ) related and bottlenecked by e.g. world knowledge, creativity, analogy making, general understanding, humor, etc."

However, evaluating writing quality is also very subjective. In a survey that Karpathy ran on different prompts, most people preferred the responses of GPT-4o over GPT-4.5. He wrote on X: "Either the high-taste testers are noticing the new and unique structure but the low-taste ones are overwhelming the poll. Or we're just hallucinating things. Or these examples are just not that great. Or it's actually pretty close and this is way too small sample size. Or all of the above."

Better document processing

In its experiments, Box, which has integrated GPT-4.5 into its Box AI Studio product, wrote that GPT-4.5 is "particularly potent for enterprise use-cases, where accuracy and integrity are mission critical… our testing shows that GPT-4.5 is one of the best models available both in terms of our eval scores and also its ability to handle many of the hardest AI questions that we have come across."

In its internal evaluations, Box found GPT-4.5 to be more accurate on enterprise document question-answering tasks — outperforming the original GPT-4 by about 4 percentage points on their test set.

Source: Box

Box's tests also indicated that GPT-4.5 excelled at math questions embedded in business documents, which older GPT models often struggled with. For example, it was better at answering questions about financial documents that required reasoning over data and performing calculations. GPT-4.5 also showed improved performance at extracting information from unstructured data. In a test that involved extracting fields from hundreds of legal documents, GPT-4.5 was 19% more accurate than GPT-4o.
Planning, coding, evaluating results

Given its improved world knowledge, GPT-4.5 can also be a suitable model for creating high-level plans for complex tasks. The broken-down steps can then be handed over to smaller but more efficient models to elaborate and execute. According to Constellation Research, "In initial testing, GPT-4.5 seems to show strong capabilities in agentic planning and execution, including multi-step coding workflows and complex task automation."

GPT-4.5 can also be useful in coding tasks that require internal and contextual knowledge. GitHub now provides limited access to the model in its Copilot coding assistant and notes that GPT-4.5 "performs effectively with creative prompts and provides reliable responses to obscure knowledge queries."

Given its deeper world knowledge, GPT-4.5 is also suitable for "LLM-as-a-Judge" tasks, where a strong model evaluates the output of smaller models. For example, a model such as GPT-4o or o3 can generate one or several responses, reason over the solution and pass the final answer to GPT-4.5 for revision and refinement.

Is it worth the price?

Given the huge costs of GPT-4.5, though, it is very hard to justify many of the use cases. But that doesn't mean it will remain that way. One of the constant trends we have seen in recent years is the plummeting cost of inference, and if this trend applies to GPT-4.5, it is worth experimenting with it and finding ways to put its power to use in enterprise applications.

It is also worth noting that this new model can become the basis for future reasoning models. Per Karpathy: "Keep in mind that that GPT4.5 was only trained with pretraining, supervised finetuning and RLHF [reinforcement learning from human feedback], so this is not yet a reasoning model. Therefore, this model release does not push forward model capability in cases where reasoning is critical (math, code, etc.)… Presumably, OpenAI will now be looking to further train with reinforcement learning on top of GPT-4.5 model to allow it to think, and push model capability in these domains."
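To make the LLM-as-a-Judge pattern above concrete, here is a hedged sketch using the OpenAI Python SDK: a cheaper model drafts candidate answers and the larger model selects and refines one. The model identifiers are placeholders to verify against OpenAI's current documentation, and given GPT-4.5's pricing, this is a pattern to evaluate carefully rather than adopt wholesale.

```python
# Minimal sketch of the "LLM-as-a-Judge" pattern described above, using the
# OpenAI Python SDK. Model names are placeholders; check current model IDs and
# pricing before running, since GPT-4.5 calls are expensive.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DRAFT_MODEL = "gpt-4o"           # cheaper model produces candidate answers
JUDGE_MODEL = "gpt-4.5-preview"  # placeholder ID for the stronger judge

def answer_with_judge(question: str, n_drafts: int = 3) -> str:
    # Step 1: a smaller model drafts several candidate answers.
    drafts = []
    for _ in range(n_drafts):
        resp = client.chat.completions.create(
            model=DRAFT_MODEL,
            messages=[{"role": "user", "content": question}],
            temperature=0.8,
        )
        drafts.append(resp.choices[0].message.content)

    # Step 2: the stronger model reviews the drafts and returns a refined answer.
    numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    judge = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n{numbered}\n\n"
                "Pick the most accurate draft, correct any errors, "
                "and return a single refined answer."
            ),
        }],
    )
    return judge.choices[0].message.content

print(answer_with_judge("Summarize the key revenue drivers in this quarter's report."))
```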


IBM Granite 3.2 uses conditional reasoning, time series forecasting and document vision to tackle challenging enterprise use cases

In the wake of the disruptive debut of DeepSeek-R1, reasoning models have been all the rage so far in 2025. IBM is now joining the party with the debut today of its Granite 3.2 large language model (LLM) family.

Unlike other reasoning approaches such as DeepSeek-R1 or OpenAI's o3, IBM is deeply embedding reasoning into its core open-source Granite models. It's an approach that IBM refers to as conditional reasoning, where step-by-step chain-of-thought (CoT) reasoning is an option within the models (as opposed to being a separate model). It's a flexible approach where reasoning can be conditionally activated with a flag, allowing users to control when to use more intensive processing.

The new reasoning capability builds on the performance gains IBM introduced with the release of the Granite 3.1 LLMs in Dec. 2024. IBM is also releasing a new vision model in the Granite 3.2 family specifically optimized for document processing. The model is particularly useful for digitizing legacy documents, a challenge many large organizations struggle with.

Another enterprise AI challenge IBM aims to solve with Granite 3.2 is predictive modelling. Machine learning (ML) has been used for predictions for decades, but it hasn't had the natural language interface and ease of use of modern gen AI. That's where IBM's Granite time series forecasting models fit in; they apply transformer technology to predict future values from time-based data.

"Reasoning is not something a model is, it's something a model does," David Cox, VP for AI models at IBM Research, told VentureBeat.

What IBM's reasoning actually brings to enterprise AI

While there has been no shortage of excitement and hype around reasoning models in 2025, reasoning for its own sake doesn't necessarily provide value to enterprise users. The ability to reason has, in many respects, long been part of gen AI: Simply prompting an LLM to answer in a step-by-step approach triggers a basic CoT reasoning output. Modern reasoning in models like DeepSeek-R1 and now Granite 3.2 goes a bit deeper by using reinforcement learning to train and enable reasoning capabilities.

While CoT prompts may be effective for certain tasks like mathematics, the reasoning capabilities in Granite 3.2 can benefit a wider range of enterprise applications. Cox noted that by encouraging the model to spend more time thinking, enterprises can improve complex decision-making processes. Reasoning can benefit software engineering tasks, IT issue resolution and other agentic workflows where the model can break down problems, make better judgments and recommend more informed solutions. IBM also claims that, with reasoning turned on, Granite 3.2 is able to outperform rivals including DeepSeek-R1 on instruction-following tasks.

Not every query needs more reasoning; why conditional thinking matters

Although Granite 3.2 has advanced reasoning capabilities, Cox stressed that not every query actually needs more reasoning. In fact, many types of common queries can actually be negatively impacted by more reasoning. For example, for a knowledge-based query, a standalone reasoning model like DeepSeek-R1 might spend up to 50 seconds on an internal monologue to answer a basic question like "Where is Rome?"

One of the key innovations in Granite 3.2 is the introduction of a conditional thinking feature, which allows developers to dynamically activate or deactivate the model's reasoning capabilities. This flexibility enables users to strike a balance between speed and depth of analysis, depending on the specific task at hand.
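To make that concrete, here is a rough sketch of what toggling reasoning on and off can look like with Hugging Face Transformers. The model ID and the thinking flag are assumptions drawn from IBM's published examples rather than confirmed details of the Granite 3.2 API, so verify them against the model card before use.

```python
# Illustrative sketch: toggling "thinking" on a Granite 3.2 instruct model via
# Hugging Face Transformers. The model ID and the `thinking` template flag are
# assumptions based on IBM's published examples; verify against the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ibm-granite/granite-3.2-8b-instruct"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def ask(question: str, reasoning: bool) -> str:
    messages = [{"role": "user", "content": question}]
    # Extra kwargs to apply_chat_template are forwarded to the chat template;
    # the assumed `thinking` flag switches the extended chain of thought on or off.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, thinking=reasoning
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024 if reasoning else 128)
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# Simple lookup: no need to pay the reasoning tax.
print(ask("Where is Rome?", reasoning=False))
# Multi-step judgment call: spend the extra tokens on deliberation.
print(ask("Our deploy failed after a config change; outline a triage plan.", reasoning=True))
```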
Going a step further, the Granite 3.2 models benefit from a method developed by IBM's Red Hat business unit that uses something called a "particle filter" to enable more flexible reasoning capabilities. This approach allows the model to dynamically control and manage multiple threads of reasoning, evaluating which ones are the most promising to arrive at the final result. It provides a more dynamic and adaptive reasoning process, rather than a linear CoT.

Cox explained that this particle filter technique gives enterprises even more flexibility in how they use the model's reasoning capabilities. In the particle filter approach, many threads of reasoning occur simultaneously; the particle filter prunes the less effective approaches, focusing on the ones that provide better outcomes. So, instead of just doing CoT reasoning, there are multiple approaches to solving a problem, and the model can intelligently navigate complex problems, selectively focusing on the most promising lines of reasoning.

How IBM is solving real enterprise use cases for documents

Large organizations tend to have equally large volumes of documents, many of which were scanned years ago and are now sitting in archives. All that data has been difficult to use with modern systems. The new Granite 3.2 vision model is designed to help solve that enterprise challenge.

While many multimodal models focus on general image understanding, Granite 3.2's vision capabilities are engineered specifically for document processing — reflecting IBM's focus on solving tangible enterprise problems rather than chasing benchmark scores. The system targets what Cox described as "irrational amounts of old scanned documents" sitting in enterprise archives, particularly in financial institutions. These represent opaque data stores that have remained largely untapped despite their potential business value.

For organizations with decades of paper records, the ability to intelligently process documents containing charts, figures and tables represents a substantial operational advantage over general-purpose multimodal models that excel at describing vacation photos but struggle with structured business documents. On enterprise benchmarks such as DocVQA and ChartQA, IBM Granite vision 3.2 shows strong results against rivals.

Time series forecasting addresses critical business prediction needs

Perhaps the most technically distinctive component of the release is IBM's "tiny time mixers" (TTM): specialized transformer-based models designed specifically for time series forecasting. Time series forecasting, which enables predictive analytics and modelling, is not new, however. Cox noted that, for various reasons, time series models have remained stuck in the older era of machine learning and have not benefited from the same attention as the newer, flashier gen AI models. The Granite TTM models apply the architectural innovations that powered


Microsoft’s new Phi-4 AI models pack big performance in small packages

Microsoft has introduced a new class of highly efficient AI models that process text, images and speech simultaneously while requiring significantly less computing power than other available systems. The new Phi-4 models, released today, represent a breakthrough in the development of small language models (SLMs) that deliver capabilities previously reserved for much larger AI systems.

Phi-4-multimodal, a model with just 5.6 billion parameters, and Phi-4-mini, with 3.8 billion parameters, outperform similarly sized competitors and, on certain tasks, even match or exceed the performance of models twice their size, according to Microsoft's technical report.

"These models are designed to empower developers with advanced AI capabilities," said Weizhu Chen, vice president, generative AI at Microsoft. "Phi-4-multimodal, with its ability to process speech, vision and text simultaneously, opens new possibilities for creating innovative and context-aware applications."

This technical achievement comes at a time when enterprises are increasingly seeking AI models that can run on standard hardware or at the "edge" — directly on devices rather than in cloud data centers — to reduce costs and latency while maintaining data privacy.

How Microsoft built a small AI model that does it all

What sets Phi-4-multimodal apart is its novel "Mixture of LoRAs" technique, enabling it to handle text, images and speech inputs within a single model. "By leveraging the Mixture of LoRAs, Phi-4-Multimodal extends multimodal capabilities while minimizing interference between modalities," the research paper states. "This approach enables seamless integration and ensures consistent performance across tasks involving text, images, and speech/audio."

The innovation allows the model to maintain its strong language capabilities while adding vision and speech recognition without the performance degradation that often occurs when models are adapted for multiple input types. The model has claimed the top position on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, outperforming specialized speech recognition systems like WhisperV3. It also demonstrates competitive performance on vision tasks like mathematical and scientific reasoning with images.

Compact AI, massive impact: Phi-4-mini sets new performance standards

Despite its compact size, Phi-4-mini demonstrates exceptional capabilities in text-based tasks. Microsoft reports the model "outperforms similar size models and is on-par with models twice [as large]" across various language-understanding benchmarks. Particularly notable is the model's performance on math and coding tasks. According to the research paper, "Phi-4-Mini consists of 32 Transformer layers with hidden state size of 3,072" and incorporates group query attention to optimize memory usage for long-context generation.

On the GSM-8K math benchmark, Phi-4-mini achieved an 88.6% score, outperforming most 8-billion-parameter models, while on the MATH benchmark it reached 64%, substantially higher than similar-sized competitors. "For the Math benchmark, the model outperforms similar sized models with large margins, sometimes more than 20 points. It even outperforms two times larger models' scores," the technical report notes.
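For readers who want to check these claims themselves, a minimal sketch of running Phi-4-mini locally with Hugging Face Transformers follows. The repository ID is an assumption based on Microsoft's Hugging Face releases, so confirm it (and the hardware requirements) on the model card first.

```python
# Minimal sketch: running Phi-4-mini locally with Hugging Face Transformers.
# The repository name is an assumption based on Microsoft's HF releases;
# confirm the exact ID and license terms on the model card before use.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",  # assumed model ID
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise math assistant."},
    {"role": "user", "content": "A train travels 180 km in 2.5 hours. What is its average speed?"},
]

# The pipeline applies the model's chat template to the message list and
# returns the full conversation, with the new assistant turn appended last.
result = generator(messages, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"][-1]["content"])
```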
Transformative deployments: Phi-4's real-world efficiency in action

Capacity, an AI "answer engine" that helps organizations unify diverse datasets, has already leveraged the Phi family to enhance its platform's efficiency and accuracy. Steve Frederickson, head of product at Capacity, said in a statement: "From our initial experiments, what truly impressed us about the Phi was its remarkable accuracy and the ease of deployment, even before customization. Since then, we've been able to enhance both accuracy and reliability, all while maintaining the cost-effectiveness and scalability we valued from the start." Capacity reported a 4.2x cost savings compared to competing workflows while achieving the same or better qualitative results for preprocessing tasks.

AI without limits: Microsoft's Phi-4 models bring advanced intelligence anywhere

For years, AI development has been driven by a singular philosophy: bigger is better — more parameters, larger models, greater computational demands. But Microsoft's Phi-4 models challenge that assumption, proving that power isn't just about scale — it's about efficiency.

Phi-4-multimodal and Phi-4-mini are designed not for the data centers of tech giants, but for the real world — where computing power is limited, privacy concerns are paramount, and AI needs to work seamlessly without a constant connection to the cloud. These models are small, but they carry weight. Phi-4-multimodal integrates speech, vision and text processing into a single system without sacrificing accuracy, while Phi-4-mini delivers math, coding and reasoning performance on par with models twice its size.

This isn't just about making AI more efficient; it's about making it more accessible. Microsoft has positioned Phi-4 for widespread adoption, making it available through Azure AI Foundry, Hugging Face and the Nvidia API Catalog. The goal is clear: AI that isn't locked behind expensive hardware or massive infrastructure, but rather can operate on standard devices, at the edge of networks and in industries where compute power is scarce.

Masaya Nishimaki, a director at the Japanese AI firm Headwaters Co., Ltd., sees the impact firsthand. "Edge AI demonstrates outstanding performance even in environments with unstable network connections or where confidentiality is paramount," he said in a statement. That means AI that can function in factories, hospitals and autonomous vehicles — places where real-time intelligence is required, but where traditional cloud-based models fall short.

At its core, Phi-4 represents a shift in thinking. AI isn't just a tool for those with the biggest servers and the deepest pockets. It's a capability that, if designed well, can work anywhere, for anyone. The most revolutionary thing about Phi-4 isn't what it can do — it's where it can do it.


Hugging Face launches FastRTC to simplify real-time AI voice and video apps

Hugging Face, the AI startup valued at over $4 billion, has introduced FastRTC, an open-source Python library that removes a major obstacle for developers building real-time audio and video AI applications.

"Building real-time WebRTC and Websocket applications is very difficult to get right in Python," Freddy Boulton, one of FastRTC's creators, said in an announcement on X.com. "Until now."

WebRTC technology enables direct browser-to-browser communication for audio, video and data sharing without plugins or downloads. Despite being essential for modern voice assistants and video tools, implementing WebRTC has remained a specialized skill set that most machine learning (ML) engineers simply don't possess.

The voice AI gold rush meets its technical roadblock

The timing couldn't be more strategic. Voice AI has attracted enormous attention and capital — ElevenLabs recently secured $180 million in funding, while companies like Kyutai, Alibaba and Fixie.ai have all released specialized audio models. Yet a disconnect persists between these sophisticated AI models and the technical infrastructure needed to deploy them in responsive, real-time applications.

As Hugging Face noted in its blog post, "ML engineers may not have experience with the technologies needed to build real-time applications, such as WebRTC." FastRTC addresses this problem with automated features handling the complex parts of real-time communication. The library provides voice detection, turn-taking capabilities, testing interfaces and even temporary phone number generation for application access.

"Want to build real-time apps with @GoogleDeepMind Gemini 2.0 Flash? FastRTC lets you build Python-based real-time apps using a Gradio UI," Philipp Schmid (@_philschmid) posted on X, pointing to the library's ability to turn Python functions into bidirectional audio and video streams with minimal code, with voice detection built in.

From complex infrastructure to five lines of code

The library's primary advantage is its simplicity. Developers can reportedly create basic real-time audio applications in just a few lines of code — a striking contrast to the weeks of development work previously required. This shift holds substantial implications for businesses: Companies that previously needed specialized communications engineers can now leverage their existing Python developers to build voice and video AI features.

"You can use any LLM/text-to-speech/speech-to-text API or even a speech-to-speech model," the announcement explains. "Bring the tools you love — FastRTC just handles the real-time communication layer."

The Gradio team put it more bluntly on X: "hot take: WebRTC should be ONE line of Python code … start now: pip install fastrtc," listing what developers get — the ability to call their AI from a real phone, automatic voice detection, compatibility with any model and an instant Gradio UI for testing. "This changes everything," the post concluded.
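In that spirit, here is a short echo-bot example modeled on FastRTC's documented quickstart. The class names and arguments (Stream, ReplyOnPause, stream.ui.launch()) reflect the library's published patterns, but verify them against the current FastRTC docs, since the API may evolve.

```python
# A minimal FastRTC-style echo app: stream microphone audio into Python and back.
# Modeled on the patterns in FastRTC's docs; verify class names and arguments
# against the current documentation before relying on them.
# Install first:  pip install fastrtc
import numpy as np
from fastrtc import ReplyOnPause, Stream

def echo(audio: tuple[int, np.ndarray]):
    """Called with (sample_rate, samples) once the caller pauses speaking.
    A real app would run speech-to-text, an LLM and text-to-speech here;
    this sketch simply plays the caller's audio back."""
    yield audio

# ReplyOnPause handles voice activity detection and turn-taking;
# Stream wraps the handler in a WebRTC connection plus a test UI.
stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")

if __name__ == "__main__":
    stream.ui.launch()  # opens a Gradio-based interface for testing
```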
The coming wave of voice and video innovation

The introduction of FastRTC signals a turning point in AI application development. By removing a significant technical barrier, the tool opens up possibilities that had remained theoretical for many developers. The impact could be particularly meaningful for smaller companies and independent developers. While tech giants like Google and OpenAI have the engineering resources to build custom real-time communication infrastructure, most organizations don't. FastRTC essentially provides access to capabilities that were previously reserved for those with specialized teams.

The library's "cookbook" already showcases diverse applications: voice chats powered by various language models, real-time video object detection and interactive code generation through voice commands.

What's particularly notable is the timing. FastRTC arrives just as AI interfaces are shifting away from text-based interactions toward more natural, multimodal experiences. The most sophisticated AI systems today can process and generate text, images, audio and video — but deploying these capabilities in responsive, real-time applications has remained challenging.

By bridging the gap between AI models and real-time communication, FastRTC doesn't just make development easier — it potentially accelerates the broader shift toward voice-first and video-enhanced AI experiences that feel more human and less computer-like. For users, this could mean more natural interfaces across applications. For businesses, it means faster implementation of features their customers increasingly expect.

In the end, FastRTC addresses a classic problem in technology: Powerful capabilities often remain unused until they become accessible to mainstream developers. By simplifying what was once complex, Hugging Face has removed one of the last major obstacles standing between today's sophisticated AI models and the voice-first applications of tomorrow.


Replit and Anthropic’s AI is helping non-coders bring software ideas to life

Replit is helping non-technical employees at Zillow contribute to software development. The real estate giant's customer routing system, which connects over 100,000 home shoppers to agents, now includes features built by team members who previously couldn't write code.

This breakthrough stems from Replit's new partnership with Anthropic and Google Cloud, which has enabled over 100,000 applications on Google Cloud Run. The collaboration integrates Anthropic's Claude AI model with Google Cloud's Vertex AI platform, allowing anyone with an idea to create custom software.

"Wow — Replit powers production routes on Zillow, which was built by a non-coder!!" Amjad Masad (@amasad) posted on X in November 2024.

"We're witnessing a transformation in how businesses create software solutions," said Michele Catasta, Replit's president, in an exclusive interview with VentureBeat. "Our platform is increasingly being adopted by teams across marketing, sales and operations who need custom solutions that pre-built software can't provide."

The initiative addresses the growing global developer shortage, which is expected to reach 4 million developers by 2025. Companies can now empower non-technical teams to build their own solutions rather than waiting for scarce developer resources.

Claude's sophisticated approach to code generation sets this partnership apart. "Claude excels at producing clean, maintainable code while understanding complex systems across multiple languages and frameworks," Michael Gerstenhaber, Anthropic's product VP, told VentureBeat. "It approaches problems strategically, often stepping back to analyze the bigger picture rather than rushing to add code."

Non-programmers building with Claude have echoed the sentiment. "Built 2 new internal systems for my team this week (leave requests/customer support) using code generated by Claude. Took me 1 day in total & saved us $5-10K in consultant costs. If an english/psychology grad like me can use code to build stuff, any wordcel can," Claire Lehmann (@clairlemon) posted on X in February 2025.

How Replit, Anthropic and Google Cloud are making AI coding secure and scalable

Replit handles security and reliability concerns through Google Cloud's enterprise infrastructure. "We've built our security framework on a foundation of enterprise-grade infrastructure through Google Cloud's Vertex AI platform," Catasta said. "This allows us to offer accessible AI development tools while maintaining stringent security standards."

The partnership demonstrates significant advances in AI capabilities. Claude 3.5 Sonnet improved performance on SWE-bench Verified from 33% to 49%, surpassing many publicly available models. These technical improvements enable users to create everything from personal productivity tools to enterprise applications.

Both companies emphasize AI augmentation over automation. "AI's biggest potential is to augment and enhance human capabilities, rather than simply replacing them," Gerstenhaber said. "For developer teams, Claude acts as an expert virtual assistant that can dramatically accelerate project timelines — reducing weeks-long projects to days."

"Almost paid $100/year for an app I needed (export 1000s of saved posts/bookmarks to a spreadsheet), then thought 'hmm I wonder if Claude could make this for me.' 10 minutes later, the app works and I have a CSV of everything I've ever saved. Wild!" Kevin Roose (@kevinroose) posted on X in February 2025.
The future of software development: Replit's AI puts coding in everyone's hands

Replit's tools could transform who gets to build and sell software. A teenager in rural India recently created an app using just their smartphone, earned enough to buy their first laptop, and now builds software for companies worldwide. Stories like this suggest a future where anyone with an internet connection can turn their ideas into working software — regardless of their technical background or location.

Challenges persist. The platform must balance accessibility with code quality and security while ensuring AI-generated solutions remain maintainable and scalable. Success could establish new standards for custom software development in the AI era.

The global custom software development market will reach more than $700 billion by 2028, according to industry analysts. Replit's AI-powered approach could determine who participates in this expanding market.

Early results show promise. Companies have built their own employee time-off trackers and help-desk systems within days, tasks that previously took months of development. Some independent developers have created and launched new applications using just their phones, showing how the platform makes software development accessible to more people.

In an industry known for high barriers to entry, this partnership between Replit, Anthropic and Google Cloud opens software development to anyone with an idea. The implications extend beyond traditional technology companies to reshape how businesses across industries build and deploy custom solutions. The next billion software creators may not know how to code — and that might be exactly the point.


Rebuilding Alexa: How Amazon is mixing models, agents and browser-use for smarter AI

Amazon is betting on agent interoperability and model mixing to make its new Alexa voice assistant more effective, retooling its flagship assistant with agentic capabilities and browser-use tasks.

The new Alexa has been rebranded Alexa+, and Amazon is emphasizing that this version "does more." For instance, it can now proactively tell users if a new book from their favorite author is available, or that their favorite artist is in town — and even offer to buy a ticket. Alexa+ reasons through instructions and taps "experts" in different knowledge bases to answer user questions and complete tasks like "Where is the nearest pizza place to the office? Will my coworkers like it? Make a reservation if you think they will." In other words, Alexa+ blends AI agents, computer-use capabilities and knowledge it learns from the larger Amazon ecosystem to be what Amazon hopes is a more capable and smarter home voice assistant.

Alexa+ currently runs on Amazon's Nova models and models from Anthropic. However, Daniel Rausch, Amazon's VP of Alexa and Echo, told VentureBeat that the device will remain "model agnostic" and that the company could introduce other models (at least models available on Amazon Bedrock) to find the best one for accomplishing tasks.

"[It's about] choosing the right integrations to complete a task, figuring out the right sort of instructions, what it takes to actually complete the task, then orchestrating the whole thing," said Rausch. "The big thing to understand about it is that Alexa will continue to evolve with the best models available anywhere on Bedrock."

What is model mixing?

Model mixing, or model routing, lets enterprises and other users choose the appropriate AI model to tap on a query-by-query basis. Developers increasingly turn to model mixing to cut costs: Not every prompt needs to be answered by a reasoning model, and some models perform certain tasks better.

Amazon's cloud and AI unit, AWS, has long been a proponent of model mixing. Recently, it announced a feature on Bedrock called Intelligent Prompt Routing, which directs prompts to the best model and model size to resolve the query. And it could be working: "I can tell you that I cannot say for any given response from Alexa on any given task what model it's using," said Rausch.

Agentic interoperability and orchestration

Rausch said Alexa+ brings agents together in three different ways. The first is the traditional API; the second is deploying agents that can navigate websites and apps, like Anthropic's Computer Use; the third is connecting agents to other agents.

"But at the center of it all, orchestrating across all those different kinds of experiences are these baseline, very capable, state-of-the-art LLMs," said Rausch. He added that if a third-party application already has its own agent, that agent can still talk to the agents working inside Alexa+ even if the external agent was built using a different model. Rausch emphasized that the Alexa team used Bedrock's tools and technology, including new multi-agent orchestration tools.

Anthropic CPO Mike Krieger told VentureBeat that even earlier versions of Claude would not have been able to accomplish what Alexa+ wants. "A really interesting 'Why now?' moment is apparent in the demo, because, of course, the models have gotten better," said Krieger.
"But if you tried to do this with 3.0 Sonnet or our 3.0-level models, I think you'd struggle in a lot of ways to use a lot of different tools all at once."

Although neither Rausch nor Krieger would confirm which specific Anthropic model Amazon used to build Alexa+, it's worth pointing out that Anthropic released Claude 3.7 Sonnet on Monday, and it is available on Bedrock.

Large investments in AI

Many users' first brush with AI came through voice assistants like Alexa, Google Home or even Apple's Siri, which let people outsource some tasks, like turning on lights. I do not own an Alexa or Google Home device, but I learned how convenient having one could be when staying at a hotel recently: I could tell Alexa to stop the alarm, turn on the lights and open a curtain while still under the covers.

But while Alexa, Google Home devices and Siri became ubiquitous in people's lives, they began showing their age when generative AI became popular. Suddenly, people wanted more real-time answers from AI assistants and demanded smarter task resolution, such as adding multiple meetings to calendars without the need for much prompting. Amazon admitted that the rise of gen AI, especially agents, has made it possible for Alexa to finally meet its potential.

"Until this moment, we were limited by the technology in what Alexa could be," Panos Panay, Amazon's devices and services SVP, said during a demo.

Rausch said the hope is that Alexa+ continues to improve, adds new models and makes more people comfortable with what the technology can do.
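For developers who want to experiment with the kind of query-by-query routing Rausch describes, Amazon Bedrock's Intelligent Prompt Routing is invoked through the standard Converse API, with a prompt-router ARN in place of a model ID. The sketch below is illustrative only: the ARN is a placeholder, availability varies by account and region, and it does not reflect how Alexa+ is wired internally.

```python
# Sketch: calling Amazon Bedrock Intelligent Prompt Routing with boto3.
# The router ARN below is a placeholder; router names and availability vary by
# account and region. This illustrates model mixing in general, not Alexa+.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# A prompt router ARN stands in for a single model ID; Bedrock picks the
# best model behind the router for each request.
ROUTER_ARN = "arn:aws:bedrock:us-east-1:123456789012:default-prompt-router/EXAMPLE"  # placeholder

response = client.converse(
    modelId=ROUTER_ARN,
    messages=[{
        "role": "user",
        "content": [{"text": "Is there a good pizza place near the office, and should I book it?"}],
    }],
)

print(response["output"]["message"]["content"][0]["text"])
# The response trace can indicate which underlying model handled the request.
```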


Industry observers say GPT-4.5 is an “odd” model, question its price

OpenAI has announced the release of GPT-4.5, which CEO Sam Altman previously said would be the company's last non-chain-of-thought (CoT) model. The company said the new model "is not a frontier model" but is still its biggest large language model (LLM), with more computational efficiency. Altman said that, even though GPT-4.5 does not reason the same way as OpenAI's other new offerings o1 or o3-mini, the new model still offers more human-like thoughtfulness.

Industry observers, many of whom had early access to the new model, have found GPT-4.5 to be an interesting move from OpenAI, tempering their expectations of what the model should be able to achieve.

Wharton professor and AI commentator Ethan Mollick posted on social media that GPT-4.5 is a "very odd and interesting model," noting it can get "oddly lazy on complex projects" despite being a strong writer.

OpenAI co-founder and former Tesla AI head Andrej Karpathy noted that GPT-4.5 made him remember when GPT-4 came out and he saw the model's potential. In a post to X, Karpathy said that, while using GPT-4.5, "everything is a little bit better, and it's awesome, but also not exactly in ways that are trivial to point to." Karpathy, however, warned that people shouldn't expect a revolutionary impact from the model, as it "does not push forward model capability in cases where reasoning is critical (math, code, etc.)."

Industry thoughts in detail

Here's what Karpathy had to say about the latest GPT iteration in a lengthy post on X:

"Today marks the release of GPT4.5 by OpenAI. I've been looking forward to this for ~2 years, ever since GPT4 was released, because this release offers a qualitative measurement of the slope of improvement you get out of scaling pretraining compute (i.e. simply training a bigger model). Each 0.5 in the version is roughly 10X pretraining compute.

"Now, recall that GPT1 barely generates coherent text. GPT2 was a confused toy. GPT2.5 was 'skipped' straight into GPT3, which was even more interesting. GPT3.5 crossed the threshold where it was enough to actually ship as a product and sparked OpenAI's 'ChatGPT moment'. And GPT4 in turn also felt better, but I'll say that it definitely felt subtle. I remember being a part of a hackathon trying to find concrete prompts where GPT4 outperformed 3.5. They definitely existed, but clear and concrete 'slam dunk' examples were difficult to find. It's that … everything was just a little bit better but in a diffuse way. The word choice was a bit more creative. Understanding of nuance in the prompt was improved. Analogies made a bit more sense. The model was a little bit funnier. World knowledge and understanding was improved at the edges of rare domains. Hallucinations were a bit less frequent. The vibes were just a bit better. It felt like the water that rises all boats, where everything gets slightly improved by 20%.

"So it is with that expectation that I went into testing GPT4.5, which I had access to for a few days, and which saw 10X more pretraining compute than GPT4. And I feel like, once again, I'm in the same hackathon 2 years ago. Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to. Still, it is incredible interesting and exciting as another qualitative measurement of a certain slope of capability that comes 'for free' from just pretraining a bigger model.
"Keep in mind that that GPT4.5 was only trained with pretraining, supervised finetuning and RLHF, so this is not yet a reasoning model. Therefore, this model release does not push forward model capability in cases where reasoning is critical (math, code, etc.). In these cases, training with RL and gaining thinking is incredibly important and works better, even if it is on top of an older base model (e.g. GPT4ish capability or so). The state of the art here remains the full o1. Presumably, OpenAI will now be looking to further train with reinforcement learning on top of GPT4.5 to allow it to think and push model capability in these domains.

"HOWEVER. We do actually expect to see an improvement in tasks that are not reasoning heavy, and I would say those are tasks that are more EQ (as opposed to IQ) related and bottlenecked by e.g. world knowledge, creativity, analogy making, general understanding, humor, etc. So these are the tasks that I was most interested in during my vibe checks.

"So below, I thought it would be fun to highlight 5 funny/amusing prompts that test these capabilities, and to organize them into an interactive 'LM Arena Lite' right here on X, using a combination of images and polls in a thread. Sadly X does not allow you to include both an image and a poll in a single post, so I have to alternate posts that give the image (showing the prompt, and two responses one from 4 and one from 4.5), and the poll, where people can vote which one is better. After 8 hours, I'll reveal the identities of which model is which. Let's see what happens 🙂"

Box CEO's thoughts on GPT-4.5

Other early users also saw potential in GPT-4.5. Box CEO Aaron Levie said on X that his company used GPT-4.5 to help extract structured data and metadata from complex enterprise content.

"The AI breakthroughs just keep coming. OpenAI just announced GPT-4.5, and we'll be making it available to Box customers later today in the Box AI Studio. We've been testing GPT4.5 in early access mode with Box AI for advanced enterprise unstructured data use-cases, and have seen strong results. With the Box AI enterprise eval, we test models against a variety of different scenarios, like Q&A accuracy, reasoning capabilities and more. In particular, to explore the capabilities of GPT-4.5, we focused on a key area with significant potential for enterprise impact: The extraction
