VentureBeat

Gemini 2.0 Flash ushers in a new era of real-time multimodal AI

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

Google’s release of Gemini 2.0 Flash this week, offering users a way to interact live with video of their surroundings, has set the stage for what could be a pivotal shift in how enterprises and consumers engage with technology. This release — alongside announcements from OpenAI, Microsoft, and others — is part of a transformative leap forward happening in the technology area called “multimodal AI.” The technology allows you to take video — or audio or images — that comes into your computer or phone, and ask questions about it. It also signals an intensification of the competitive race among Google and its chief rivals — OpenAI and Microsoft — for dominance in AI capabilities. But more importantly, it feels like it is defining the next era of interactive, agentic computing.

This moment in AI feels to me like an “iPhone moment,” and by that I’m referring to 2007-2008, when Apple released an iPhone that, via a connection with the internet and a slick user interface, transformed daily lives by giving people a powerful computer in their pocket. While OpenAI’s ChatGPT may have kicked off this latest AI moment with its powerful human-like chatbot in November 2022, Google’s release here at the end of 2024 feels like a major continuation of that moment — at a time when a lot of observers had been worried about a possible slowdown in improvements of AI technology.

Gemini 2.0 Flash: The catalyst of AI’s multimodal revolution

Google’s Gemini 2.0 Flash offers groundbreaking functionality, allowing real-time interaction with video captured via a smartphone. Unlike prior staged demonstrations (e.g. Google’s Project Astra in May), this technology is now available to everyday users through Google’s AI Studio. I encourage you to try it yourself. I used it to view and interact with my surroundings — which for me this morning was my kitchen and dining room.
You can see instantly how this offers breakthroughs for education and other use cases. You can see why content creator Jerrod Lew reacted on X yesterday with astonishment when he used Gemini 2.0’s real-time AI to edit a video in Adobe Premiere Pro. “This is absolutely insane,” he said, after Google guided him within seconds on how to add a basic blur effect even though he was a novice user.

Sam Witteveen, a prominent AI developer and cofounder of Red Dragon AI, was given early access to test Gemini 2.0 Flash. He highlighted that the model’s speed — it is twice as fast as Google’s flagship until now, Gemini 1.5 Pro — and “insanely cheap” pricing make it not just a showcase for developers to test new products with, but a practical tool for enterprises managing AI budgets. (To be clear, Google hasn’t actually announced pricing for Gemini 2.0 Flash yet; it is a free preview. Witteveen is basing his assumptions on the precedent set by Google’s Gemini 1.5 series.)

For developers, the live API behind these multimodal features offers significant potential because it enables seamless integration into applications. The API is already available to use, a demo app is up, and Google has published a blog post for developers. Programmer Simon Willison called the streaming API next-level: “This stuff is straight out of science fiction: being able to have an audio conversation with a capable LLM about things that it can ‘see’ through your camera is one of those ‘we live in the future’ moments.” He noted that you can ask the API to enable a code-execution mode, which lets the models write Python code, run it and consider the result as part of their response — all part of an agentic future.

The technology is clearly a harbinger of new application ecosystems and user expectations. Imagine being able to analyze live video during a presentation, suggest edits, or troubleshoot in real time.
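The code-execution loop Willison describes can be sketched locally, with a stub standing in for the model. Everything here — the stub, its replies, and the `run_python` helper — is invented for illustration and is not the actual Gemini API:

```python
import io
import contextlib
from typing import Optional

def run_python(code: str) -> str:
    """Execute model-written Python and capture its stdout (sandboxing omitted)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def stub_model(prompt: str, tool_result: Optional[str] = None) -> dict:
    """Stand-in for the LLM: first turn emits code, second turn uses the result."""
    if tool_result is None:
        return {"tool_call": "print(sum(range(1, 101)))"}
    return {"answer": f"The sum of 1 to 100 is {tool_result}."}

# Agentic loop: the model writes code, we run it, and the result is fed back
# so the model can incorporate the computation into its final response.
turn = stub_model("What is the sum of the integers from 1 to 100?")
result = run_python(turn["tool_call"])
final = stub_model("", tool_result=result)
print(final["answer"])  # The sum of 1 to 100 is 5050.
```

A real integration would replace `stub_model` with streaming API calls, but the shape of the loop — generate code, execute, feed back, answer — is the agentic pattern being described.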
Yes, the technology is cool for consumers, but it’s important for enterprise users and leaders to grasp as well. The new features are the foundation of an entirely new way of working and interacting with technology — suggesting coming productivity gains and creative workflows.

The competitive landscape: A race to define the future

Wednesday’s release of Google’s Gemini 2.0 Flash comes amid a flurry of releases by Google and by its major competitors, which are rushing to ship their latest technologies by the end of the year. They all promise to deliver consumer-ready multimodal capabilities — live video interaction, image generation, and voice synthesis — but some of them aren’t fully baked or even fully available.

One reason for the rush is that some of these companies offer their employees bonuses for delivering key products before the end of the year. Another is the bragging rights of getting new features out first. Being first can win major user traction, as OpenAI showed in 2022, when ChatGPT became the fastest-growing consumer product in history. Even though Google had similar technology, it was not prepared for a public release and was left flat-footed. Observers have sharply criticized Google ever since for being too slow.

Here’s what the other companies have announced in the past few days, all helping introduce this new era of multimodal AI.

OpenAI’s Advanced Voice Mode with Vision: Launched yesterday but still rolling out, it offers features like real-time video analysis and screen sharing. While promising, early access issues have limited its immediate impact. For example, I couldn’t access it yet even though I’m a Plus subscriber.

Microsoft’s Copilot Vision: Last week, Microsoft launched a similar technology in preview — only for a select group of its Pro users. Its browser-integrated design hints at enterprise applications but lacks the polish and accessibility of Gemini 2.0. Microsoft also released a fast, powerful Phi-4 model to boot.
Anthropic’s Claude 3.5 Haiku: Anthropic, until now in a heated race for large language model (LLM) leadership with OpenAI, hasn’t delivered anything


OpenAI’s o3 shows remarkable progress on ARC-AGI, sparking debate on AI reasoning

OpenAI’s latest o3 model has achieved a breakthrough that has surprised the AI research community. o3 scored an unprecedented 75.7% on the super-difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%. While the achievement on ARC-AGI is impressive, it does not yet prove that the code to artificial general intelligence (AGI) has been cracked.

Abstract Reasoning Corpus

The ARC-AGI benchmark is based on the Abstract Reasoning Corpus, which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC is composed of a set of visual puzzles that require understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles with very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging measures of AI.

Example of ARC puzzle (source: arcprize.org)

ARC has been designed so that it can’t be cheated by training models on millions of examples in hopes of covering all possible combinations of puzzles. The benchmark is composed of a public training set that contains 400 simple examples, complemented by a public evaluation set of 400 more challenging puzzles that test the generalizability of AI systems. The ARC-AGI Challenge contains private and semi-private test sets of 100 puzzles each, which are not shared with the public. They are used to evaluate candidate AI systems without running the risk of leaking the data and contaminating future systems with prior knowledge. Furthermore, the competition sets limits on the amount of computation participants can use to ensure that the puzzles are not solved through brute-force methods.
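To make the format concrete, here is a toy ARC-style task in Python. The grids and the hidden rule (swap colors 1 and 2) are invented for illustration and are far simpler than real ARC puzzles, which are distributed as JSON grids of this general shape:

```python
# A toy ARC-style task: a few input/output grid pairs demonstrate a hidden
# rule, and the solver must apply it to a new input. The rule here is
# "swap colors 1 and 2"; the grids are invented, not from the real dataset.
train = [
    {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
    {"input": [[2, 2], [1, 0]], "output": [[1, 1], [2, 0]]},
]
test_input = [[0, 1], [2, 1]]

def apply_rule(grid):
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

# A candidate solution counts only if it reproduces every training pair.
assert all(apply_rule(p["input"]) == p["output"] for p in train)
print(apply_rule(test_input))  # [[0, 2], [1, 2]]
```

What makes ARC hard for AI is that each task has a different hidden rule, demonstrated by only a handful of pairs — memorizing training examples does not help.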
A breakthrough in solving novel tasks

o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another method, developed by researcher Jeremy Berman, used a hybrid approach combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3.

In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.” It is important to note that using more compute on previous generations of models could not reach these results. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don’t know much about o3’s architecture, we can be confident that it is not orders of magnitude larger than its predecessors.

Performance of different models on ARC-AGI (source: arcprize.org)

“This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”

It is worth noting that o3’s performance on ARC-AGI comes at a steep cost. On the low-compute configuration, each puzzle costs $17 to $20 and 33 million tokens to solve, while on the high-compute budget, the model uses around 172X more compute and billions of tokens per problem. However, as the costs of inference continue to decrease, we can expect these figures to become more reasonable.

A new paradigm in LLM reasoning?

The key to solving novel problems is what Chollet and other scientists refer to as “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine these programs to tackle more complex problems.
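The program-synthesis idea can be illustrated with a deliberately brute-force toy: enumerate compositions of a few grid primitives until one fits all the examples. This sketches only the concept — o3’s actual mechanism is unpublished, and the tiny DSL below is invented:

```python
from itertools import product

# A tiny DSL of grid primitives. "Program synthesis" here is exhaustive
# search over compositions of these primitives, checked against examples.
def flip_h(g): return [row[::-1] for row in g]
def flip_v(g): return g[::-1]
def transpose(g): return [list(r) for r in zip(*g)]
PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def synthesize(examples, max_depth=2):
    """Return the first composition of primitives consistent with all examples."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def prog(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(prog(i) == o for i, o in examples):
                return names, prog
    return None

# One example pair demonstrating a 90-degree rotation; the search finds a
# composition of simpler primitives that produces the same transformation.
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
names, prog = synthesize(examples)
print(names)
```

Real systems would search a vastly larger program space with learned guidance instead of blind enumeration, but the structure — compose small programs, test against examples — is the same.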
Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that are beyond their training distribution.

Unfortunately, there is very little information about how o3 works under the hood, and here the opinions of scientists diverge. Chollet speculates that o3 uses a type of program synthesis built on chain-of-thought (CoT) reasoning and a search mechanism, combined with a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in the past few months.

Other scientists, such as Nathan Lambert from the Allen Institute for AI, suggest that “o1 and o3 can actually be just the forward passes from one language model.” On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.” On the same day, Denny Zhou from Google DeepMind’s reasoning team called the combination of search and current reinforcement learning approaches a “dead end.” “The most beautiful thing on LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. mcts) over the generation space, whether by a well-finetuned model or a carefully designed prompt,” he posted on X.

While the details of how o3 reasons might seem trivial in comparison to the breakthrough on ARC-AGI, they could very well define the next paradigm shift in training LLMs. There is currently a debate on whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or different inference architectures could determine the next path forward.

Not AGI

The name ARC-AGI is misleading, and some have equated solving the benchmark with achieving AGI.
However, Chollet stresses that “ARC-AGI is not an acid test for AGI.”  “Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don’t think o3 is AGI yet,” he writes. “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.” Moreover, he notes that o3 cannot autonomously learn these skills and


Slack’s AI agents promise to reshape productivity with contextual power

Slack will deeply integrate Salesforce’s Agentforce AI agents into its workplace collaboration platform, emphasizing contextual intelligence as the key differentiator in the increasingly crowded AI agent market. “There’s so much of your organization’s knowledge context [there]…Slack’s channels typically reflect your organization’s structure, but also your priorities for that given moment,” said Rob Seaman, Slack’s chief product officer, in an exclusive interview with VentureBeat. “That is just such rich context for agents to be able to answer questions and reason through whether or not they need to be able to take action.”

Why context matters for enterprise AI

The integration, part of Salesforce’s Agentforce 2.0 launch scheduled for tomorrow, December 17, aims to make AI agents more effective by giving them access to the vast troves of conversational and organizational data that flow through Slack’s channels daily. Seaman outlined three critical capabilities that define these next-generation AI agents: comprehensive contextual knowledge, reasoning ability, and action-taking power. What sets Slack’s implementation apart is its unique position as what Seaman calls a “searchable log of all communication and knowledge” — effectively making it the central nervous system of modern enterprises.

Inside Slack’s new AI agent library

The platform will introduce a library of customizable AI agents that can perform various tasks, from onboarding new employees to managing complex cross-functional projects. “You’ll see the library of agents in Slack. And it’s pretty magical to see humans and agents together, and to think of this world where humans continue to work with humans, but agents are there as part of the team,” Seaman explained. A key focus is user trust, and another is data governance.
Seaman emphasized that all agents will operate with “user context,” meaning they can only access information that the user has permission to see. “Our goal ultimately is to honor user context for every system that an agent and a person [have] interacted with,” he said. The platform includes robust safeguards through what Salesforce calls a “trust layer,” which handles sensitive information appropriately and ensures compliance with business rules. Users can test agents in real time and observe their decision-making processes through a transparent builder interface.

How AI agents could transform enterprise software

For enterprises struggling with fragmented software stacks, this integration could signal a shift in how organizations approach their technology infrastructure. While Seaman avoided specific predictions about which tools might become obsolete, he suggested that many manual processes currently “spaghetti-ed across numerous systems” could be streamlined through these contextually aware agents. One concrete example Seaman highlighted was employee onboarding: “Taking you from new hire to productive, is something that the company cares about, and it’s also, from an end-user perspective, it’s kind of a lonely, scary experience in your first several months as you’re trying to find your way.”

The race for enterprise AI dominance

The integration represents a strategic move by both Slack and Salesforce to position themselves at the forefront of the enterprise AI revolution. While companies like Anthropic and OpenAI have launched their own AI agents, Slack’s deep integration with enterprise workflows and access to organizational context could provide a significant competitive advantage. The development comes at a crucial time as organizations grapple with how to effectively implement AI tools while maintaining security and trust.
With this launch, Slack and Salesforce are betting that contextually aware AI agents, deeply integrated into existing workflows, will prove more valuable than standalone AI solutions. The question remains whether enterprises will embrace this vision of AI agents as team members, but with Slack’s widespread adoption in modern workplaces, the platform is well-positioned to drive this transformation. As Seaman notes, “We’re pretty lucky, frankly, that we’re in this moment, and we have a lot of the primitives that are required to make this possible.”


Learn how GE Healthcare used AWS to build a new AI model that interprets MRIs

MRI images are understandably complex and data-heavy. Because of this, developers training large language models (LLMs) for MRI analysis have had to slice captured images into 2D. But this results in just an approximation of the original image, thus limiting the model’s ability to analyze intricate anatomical structures. This creates challenges in complex cases involving brain tumors, skeletal disorders or cardiovascular diseases.

But GE Healthcare appears to have overcome this massive hurdle, introducing the industry’s first full-body 3D MRI research foundation model (FM) at this year’s AWS re:Invent. For the first time, models can use full 3D images of the entire body. GE Healthcare’s FM was built on AWS from the ground up — there are very few models specifically designed for medical imaging like MRIs — and is based on more than 173,000 images from over 19,000 studies. Developers say they have been able to train the model with five times less compute than previously required.

GE Healthcare has not yet commercialized the foundation model; it is still in an evolutionary research phase. An early evaluator, Mass General Brigham, is set to begin experimenting with it soon. “Our vision is to put these models into the hands of technical teams working in healthcare systems, giving them powerful tools for developing research and clinical applications faster, and also more cost-effectively,” GE HealthCare chief AI officer Parry Bhatia told VentureBeat.

Enabling real-time analysis of complex 3D MRI data

While this is a groundbreaking development, generative AI and LLMs are not new territory for the company. The team has been working with advanced technologies for more than 10 years, Bhatia explained.
One of its flagship products is AIR Recon DL, a deep learning-based reconstruction algorithm that allows radiologists to more quickly achieve crisp images. The algorithm removes noise from raw images and improves the signal-to-noise ratio, cutting scan times by up to 50%. Since 2020, 34 million patients have been scanned with AIR Recon DL.

GE Healthcare began working on its MRI FM at the beginning of 2024. Because the model is multimodal, it can support image-to-text searching, link images and words, and segment and classify diseases. The goal is to give healthcare professionals more details in one scan than ever before, said Bhatia, leading to faster, more accurate diagnosis and treatment. “The model has significant potential to enable real-time analysis of 3D MRI data, which can improve medical procedures like biopsies, radiation therapy and robotic surgery,” Dan Sheeran, GM for health care and life sciences at AWS, told VentureBeat.

Already, it has outperformed other publicly available research models in tasks including classification of prostate cancer and Alzheimer’s disease. It has exhibited accuracy of up to 30% in matching MRI scans with text descriptions in image retrieval — which might not sound all that impressive, but it’s a big improvement over the 3% exhibited by similar models. “It has come to a stage where it’s giving some really robust results,” said Bhatia. “The implications are huge.”

Doing more with (much less) data

The MRI process requires a few different types of datasets to support various techniques that map the human body, Bhatia explained. What’s known as a T1-weighted imaging technique, for instance, highlights fatty tissue and decreases the signal of water, while T2-weighted imaging enhances water signals. The two methods are complementary and create a full picture of the brain to help clinicians detect abnormalities like tumors, trauma or cancer.
“MRI images come in all different shapes and sizes, similar to how you would have books in different formats and sizes, right?” said Bhatia.  To overcome challenges presented by diverse datasets, developers introduced a “resize and adapt” strategy so that the model could process and react to different variations. Also, data may be missing in some areas — an image may be incomplete, for instance — so they taught the model simply to ignore those instances.  “Instead of getting stuck, we taught the model to skip over the gaps and focus on what was available,” said Bhatia. “Think of this as solving a puzzle with some missing pieces.” The developers also employed semi-supervised student-teacher learning, which is particularly helpful when there is limited data. With this method, two different neural networks are trained on both labeled and unlabeled data, with the teacher creating labels that help the student learn and predict future labels.  “We’re now using a lot of these self-supervised technologies, which don’t require huge amounts of data or labels to train large models,” said Bhatia. “It reduces the dependencies, where you can learn more from these raw images than in the past.” This helps to ensure that the model performs well in hospitals with fewer resources, older machines and different kinds of datasets, Bhatia explained.  He also underscored the importance of the models’ multimodality. “A lot of technology in the past was unimodal,” said Bhatia. “It would look only into the image, into the text. But now they’re becoming multi-modal, they can go from image to text, text to image, so that you can bring in a lot of things that were done with separate models in the past and really unify the workflow.”  He emphasized that researchers only use datasets that they have rights to; GE Healthcare has partners who license de-identified data sets, and they’re careful to adhere to compliance standards and policies. 
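The semi-supervised student-teacher approach Bhatia describes can be sketched on toy 1-D data. The nearest-centroid “models” and the numbers below are invented purely for illustration; GE Healthcare’s actual networks and training recipe are not public:

```python
# Student-teacher semi-supervised learning, minimally: a "teacher" fit on the
# few labeled points pseudo-labels the unlabeled pool, and a "student" is
# then fit on labeled + pseudo-labeled data together. Toy data, toy models.
labeled = [(0.1, 0), (0.2, 0), (0.9, 1)]          # (value, class)
unlabeled = [0.15, 0.05, 0.8, 0.95, 0.85]

def fit_centroids(points):
    """A stand-in for training: compute one centroid per class."""
    by_class = {}
    for x, y in points:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

teacher = fit_centroids(labeled)
pseudo = [(x, predict(teacher, x)) for x in unlabeled]   # teacher labels the pool
student = fit_centroids(labeled + pseudo)                # student sees both sets
print(predict(student, 0.7))
```

The payoff is the same as in the article: the student learns from far more data than was ever hand-labeled, which matters when labels (here, radiologist annotations) are scarce and expensive.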
Using AWS SageMaker to tackle computation and data challenges

Undoubtedly, there are many challenges when building such sophisticated models — such as limited computational power for 3D images that are gigabytes in size. “It’s a massive 3D volume of data,” said Bhatia. “You need to bring it into the memory of the model, which is a really complex problem.” To help overcome this, GE Healthcare built on Amazon SageMaker, which provides high-speed networking and distributed training capabilities across multiple GPUs, and leveraged Nvidia A100 Tensor Core GPUs for large-scale training. “Because of the size of the data


Hugging Face shows how test-time scaling helps small language models punch above their weight

In a new case study, Hugging Face researchers have demonstrated how small language models (SLMs) can be configured to outperform much larger models. Their findings show that a Llama 3 model with 3B parameters can outperform the 70B version of the model on complex math problems. Hugging Face has fully documented the entire process and provides a roadmap for enterprises that want to create their own customized reasoning models.

Image source: Hugging Face

Scaling test-time compute

The work is inspired by OpenAI o1, which uses extra “thinking” to solve complex math, coding and reasoning problems. The key idea behind models like o1 is to scale “test-time compute,” which effectively means using more compute cycles during inference to test and verify different responses and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model.

Since o1 is a private model and OpenAI has remained tight-lipped about its internal workings, researchers have been speculating about how it works and trying to reverse engineer the process. There are already several open alternatives to o1. Hugging Face’s work is based on a DeepMind study released in August, which investigates the tradeoffs between inference-time and pre-training compute. The study provides comprehensive guidelines on how to balance training and inference compute to get the best results for a fixed budget.

In addition to using extra inference-time compute, the success of the technique hinges on two key components: a reward model that evaluates the SLM’s answers, and a search algorithm that optimizes the path it takes to refine its answers.
Image source: Hugging Face

Different reasoning algorithms

The simplest way to use test-time scaling is “majority voting,” in which the same prompt is sent to the model multiple times and the most frequently generated answer is chosen. On simple problems, majority voting can prove useful, but its gains quickly plateau on complex reasoning problems or on tasks where errors are consistent across generations.

A more advanced reasoning method is “Best-of-N.” In this technique, the SLM generates multiple answers, but instead of majority voting, a reward model is used to evaluate the answers and choose the best one. “Weighted Best-of-N,” a more nuanced version of this method, factors in consistency to choose answers that are both confident and occur more frequently than others. The researchers used a “process reward model” (PRM) that scores the SLM’s response not only on the final answer but also on the multiple stages it goes through to reach it. Their experiments showed that Weighted Best-of-N and PRMs brought the Llama-3.2 1B near the level of Llama-3.2 8B on the difficult MATH-500 benchmark.

Image source: Hugging Face

Adding search

To further improve the model’s performance, the researchers added search algorithms to the model’s reasoning process. Instead of generating the answer in a single pass, they used “beam search,” an algorithm that guides the model’s answer process step by step. At each step, the SLM generates multiple partial answers. The search algorithm uses the reward model to evaluate the answers and chooses a subset that is worth exploring further. The process is repeated until the model exhausts its inference budget or reaches the correct answer. This way, the inference budget can be narrowed to focus on the most promising answers. The researchers found that while beam search improves the model’s performance on complex problems, it tends to underperform other techniques on simple problems. To address this challenge, they added two more elements to their inference strategy.
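Weighted Best-of-N can be sketched in a few lines. The candidate answers and reward scores below are canned stand-ins for real SLM samples and a trained PRM:

```python
from collections import defaultdict

# Weighted Best-of-N: sum reward-model scores per distinct answer, so an
# answer that is both frequent and confidently scored wins. The (answer,
# score) pairs are invented; a real run would sample N generations from the
# SLM and score each with a trained process reward model.
candidates = [("72", 0.9), ("72", 0.7), ("68", 0.95), ("72", 0.6), ("64", 0.3)]

def weighted_best_of_n(cands):
    totals = defaultdict(float)
    for answer, score in cands:
        totals[answer] += score
    return max(totals, key=totals.get)

# Plain Best-of-N would pick the single highest-scored sample ("68" at 0.95);
# the weighted variant picks "72", whose three samples total roughly 2.2.
best_single = max(candidates, key=lambda c: c[1])[0]
print(weighted_best_of_n(candidates))  # 72
```

The contrast with `best_single` is the point: a reward model can be overconfident about one outlier sample, and weighting by consistency hedges against that.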
The first was Diverse Verifier Tree Search (DVTS), a variant of beam search that ensures the SLM doesn’t get stuck in false reasoning paths and diversifies its response branches. The second was a “compute-optimal scaling strategy,” as suggested in the DeepMind paper, which dynamically chooses the best test-time scaling strategy based on the difficulty of the input problem. The combination of these techniques enabled Llama-3.2 1B to punch above its weight and outperform the 8B model by a significant margin. They also found that the strategy was scalable: when applied to Llama-3.2 3B, it outperformed the much larger 70B model.

Not a perfect solution yet

Scaling test-time compute changes the dynamics of model costs. Enterprises now have the ability to choose where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.

However, test-time scaling also has its limitations. For example, in the experiments carried out by Hugging Face, researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (even if this is much more resource-efficient than the 70B model). The researchers acknowledge that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own answer as opposed to relying on an external verifier. This is an open area of research.

The test-time scaling technique presented in this study is also limited to problems where the answer can be clearly evaluated, such as coding and math. Creating reward models and verifiers for subjective tasks such as creative writing and product design requires further research. But what is clear is that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to emerge in the coming months.
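To make the search component concrete, PRM-guided beam search can be sketched on a toy problem: reach the number 10 by repeatedly applying +3 or +4. The step set, target, and distance-based “reward model” are all invented; a real setup scores partial LLM generations with a trained PRM:

```python
import heapq

# Beam search over reasoning "steps": at each step every beam is expanded
# with each possible move, the reward model scores the partial paths, and
# only the top BEAM_WIDTH survive for further exploration.
TARGET, STEPS, BEAM_WIDTH = 10, 3, 2

def prm_score(path):
    """Toy stand-in for a PRM: closer partial sums score higher."""
    return -abs(TARGET - sum(path))

def beam_search():
    beams = [[]]
    for _ in range(STEPS):
        expanded = [path + [step] for path in beams for step in (3, 4)]
        beams = heapq.nlargest(BEAM_WIDTH, expanded, key=prm_score)
    return max(beams, key=prm_score)

best = beam_search()
print(best, sum(best))
```

With width 2 the search explores only 4 paths per step instead of all 2^3 completions, which is the budget-focusing effect described above; DVTS adds a diversity constraint so the surviving beams are not near-duplicates.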
Enterprises will be wise to keep an eye on how the landscape develops.


ChatGPT adds more PC and Mac app integrations, getting closer to piloting your computer

OpenAI has expanded the number of applications its desktop apps can work with, including allowing Advanced Voice Mode to work with other apps, and is moving closer to having ChatGPT use computers. The desktop app introduced integrations in November with an initial four applications. During Day 11 of its “12 Days of OpenAI” event, OpenAI announced several new integrated development environments (IDEs), terminals and text apps it will support.

ChatGPT now supports the BBEdit, MATLAB, Nova, Script Editor and TextMate editors; VS Code forks including VS Code Insiders, VSCodium, Cursor and Windsurf; and the JetBrains family of IDEs, including Android Studio, AppCode, CLion, DataGrip, GoLand, IntelliJ IDEA, PhpStorm, PyCharm, RubyMine, RustRover and WebStorm. It also added the Warp and Prompt terminal apps as integrations. These applications join VS Code, Xcode, Terminal, iTerm 2 and TextEdit as integrated apps.

But coding applications won’t be the only applications that ChatGPT desktop apps can access. OpenAI also added Apple Notes, Notion and Quip to its integrations. Advanced Voice Mode can work with these applications, considering the context of projects in the integrations. OpenAI emphasized that users must give ChatGPT permission to access these applications.

Letting ChatGPT use your computer for you

App integrations with AI chatbots are, of course, nothing new. In October, GitHub Copilot added coding platform integrations. Connecting applications to ChatGPT or Copilot brings context from those platforms into the chat experience. Developers can prompt ChatGPT for some coding help for a project they have in VS Code, and the chatbot understands what they’ve been working on. Kevin Weil, chief product officer at OpenAI, said during a live stream that improving the desktop app will help the company get closer to a more agentic user experience for ChatGPT.
“We’ve been putting a lot of effort into our desktop apps,” said Weil. “As our models get increasingly powerful, ChatGPT will more and more become agentic. That means we’ll go beyond just questions and answers; ChatGPT will begin doing things for you.” He added that the desktop apps “are a big part” of that transformation. “Being a desktop app, you can do much more than you can in just a browser tab,” he said. “That includes things like, with your permission, being able to see what’s on your screen and being able to automate a lot of the work you’re doing on your desktop. We’ll have a lot more to say on that as we go into 2025.”

If OpenAI lets ChatGPT see more of your computer, ChatGPT will get closer to Anthropic’s Claude Computer Use feature, which allows Claude to click around a person’s computer, navigate screens and even type text. OpenAI already announced a fairly similar feature for the mobile version of ChatGPT, although the chatbot cannot yet access computers or phones the same way. Users can share their screens with the chatbot so it can “see” what they’re reading or looking at. Microsoft and Google have also developed comparable features with Copilot Vision and Project Astra.

How to access

On macOS, users who want to open ChatGPT while using other applications can use “option + space” to pull up ChatGPT and choose the application they need through a button on the chat screen. Another shortcut, “option + shift + 1,” brings up the topmost application in use. From that window, users can also access Advanced Voice Mode in the same way. Voice mode automatically detects context from the application.

Integrations are available to ChatGPT Plus, Pro, Team, Enterprise and Edu users. However, Enterprise and Edu subscribers must ask their IT administrators to turn on the feature.

ChatGPT adds more PC and Mac app integrations, getting closer to piloting your computer

Small model, big impact: Patronus AI’s Glider outperforms GPT-4 in key AI evaluation tasks

A startup founded by former Meta AI researchers has developed a lightweight AI model that can evaluate other AI systems as effectively as much larger models, while providing detailed explanations for its decisions. Patronus AI today released Glider, an open-source 3.8-billion-parameter language model that outperforms OpenAI’s GPT-4o-mini on several key benchmarks for judging AI outputs. The model is designed to serve as an automated evaluator that can assess AI systems’ responses across hundreds of different criteria while explaining its reasoning.

“Everything we do at Patronus is focused on bringing powerful and reliable AI evaluation to developers and anyone using language models or developing new LM systems,” said Anand Kannappan, CEO and cofounder of Patronus AI, in an exclusive interview with VentureBeat.

Small but mighty: How Glider matches GPT-4’s performance

The development represents a significant breakthrough in AI evaluation technology. Most companies currently rely on large proprietary models like GPT-4 to evaluate their AI systems, a process that can be expensive and opaque. Glider is not only more cost-effective because of its smaller size, but it also provides detailed explanations for its judgments through bullet-point reasoning and highlighted text spans showing exactly what influenced its decisions.

“Currently we have many LLMs serving as judges, but we don’t know which one is best for our task,” explained Darshan Deshpande, research engineer at Patronus AI who led the project. “In this paper, we demonstrate several advances: We’ve trained a model that can run on-device, uses just 3.8 billion parameters, and provides high-quality reasoning chains.”

Real-time evaluation: Speed meets accuracy

The new model demonstrates that smaller language models can match or exceed the capabilities of much larger ones for specialized tasks.
Glider achieves comparable performance to models 17 times its size while running with just one second of latency. This makes it practical for real-time applications where companies need to evaluate AI outputs as they’re being generated.

A key innovation is Glider’s ability to evaluate multiple aspects of AI outputs simultaneously. The model can assess factors like accuracy, safety, coherence and tone all at once, rather than requiring separate evaluation passes. It also retains strong multilingual capabilities despite being trained primarily on English data.

“When you’re dealing with real-time environments, you need latency to be as low as possible,” Kannappan explained. “This model typically responds in under a second, especially when used through our product.”

Privacy first: On-device AI evaluation becomes reality

For companies developing AI systems, Glider offers several practical advantages. Its small size means it can run directly on consumer hardware, addressing privacy concerns about sending data to external APIs. Its open-source nature allows organizations to deploy it on their own infrastructure while customizing it for their specific needs. The model was trained on 183 different evaluation metrics across 685 domains, from basic factors like accuracy and coherence to more nuanced aspects like creativity and ethical considerations. This broad training helps it generalize to many different types of evaluation tasks.

“Customers need on-device models because they can’t send their private data to OpenAI or Anthropic,” Deshpande explained. “We also want to demonstrate that small language models can be effective evaluators.”

The release comes at a time when companies are increasingly focused on ensuring responsible AI development through robust evaluation and oversight. Glider’s ability to provide detailed explanations for its judgments could help organizations better understand and improve their AI systems’ behaviors.
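Patronus has not published Glider’s exact interface here, but the rubric-scored, reasoning-first judging pattern the article describes can be sketched in a few lines. The function names, prompt wording, and response format below are illustrative assumptions, not Glider’s actual API:

```python
import re

def build_judge_prompt(rubric: str, user_input: str, model_output: str) -> str:
    """Assemble an evaluation prompt for an LLM judge (hypothetical format)."""
    return (
        "You are an evaluation model. Score the RESPONSE against the RUBRIC.\n"
        f"RUBRIC: {rubric}\n"
        f"INPUT: {user_input}\n"
        f"RESPONSE: {model_output}\n"
        "First give bullet-point reasoning, then end with 'SCORE: <1-5>'."
    )

def parse_judge_output(judge_text: str) -> tuple[list[str], int]:
    """Split a judge reply into its reasoning bullets and a 1-5 score."""
    bullets = [line.strip("- ").strip()
               for line in judge_text.splitlines()
               if line.lstrip().startswith("-")]
    match = re.search(r"SCORE:\s*([1-5])", judge_text)
    if not match:
        raise ValueError("judge reply had no SCORE line")
    return bullets, int(match.group(1))

# Example reply of the kind a small evaluator model might return:
reply = "- Answer is factually correct\n- Tone is appropriate\nSCORE: 5"
bullets, score = parse_judge_output(reply)
```

The value of the reasoning-first format is that the bullets double as an audit trail: when a score looks wrong, the highlighted justification shows which criterion drove it.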
The future of AI evaluation: Smaller, faster, smarter

Patronus AI, founded by machine learning experts from Meta AI and Meta Reality Labs, has positioned itself as a leader in AI evaluation technology. The company offers a platform for automated testing and security of large language models, with Glider as its latest advance in making sophisticated AI evaluation more accessible. The company plans to publish detailed technical research about Glider on arxiv.org today, demonstrating its performance across various benchmarks. Early testing shows it achieving state-of-the-art results on several standard metrics while providing more transparent explanations than existing solutions do.

“We’re in the early innings,” said Kannappan. “Over time, we expect more developers and companies will push the boundaries in these areas.”

The development of Glider suggests that the future of AI systems may not necessarily require ever-larger models, but rather more specialized and efficient ones optimized for specific tasks. Its success in matching larger models’ performance while providing better explainability could influence how companies approach AI evaluation and development going forward.


OpenAI confirms new frontier models o3 and o3-mini

OpenAI is slowly inviting selected users to test a whole new set of reasoning models named o3 and o3-mini, successors to the o1 and o1-mini models that just entered full release earlier this month.

OpenAI o3, so named to avoid trademark issues with the telephone company O2 and because CEO Sam Altman says the company “has a tradition of being truly bad at names,” was announced during the final day of the “12 Days of OpenAI” livestreams today. Altman said the two new models would initially be released to selected third-party researchers for safety testing, with o3-mini expected by the end of January 2025 and o3 “shortly after that.”

“We view this as the beginning of the next phase of AI, where you can use these models to do increasingly complex tasks that require a lot of reasoning,” Altman said. “For the last day of this event we thought it would be fun to go from one frontier model to the next frontier model.”

The announcement comes just a day after Google unveiled and allowed the public to use its new Gemini 2.0 Flash Thinking model, another rival “reasoning” model that, unlike the OpenAI o1 series, allows users to see the steps in its “thinking” process documented in text bullet points. The release of Gemini 2.0 Flash Thinking and now the announcement of o3 show that the competition between OpenAI and Google, and across the wider field of AI model providers, is entering a new and intense phase, as they offer not just LLMs or multimodal models but advanced reasoning models as well. These can be more applicable to harder problems in science, mathematics, technology, physics and more.

The best performance on third-party benchmarks yet

Altman also said the o3 model was “incredible at coding,” and the benchmarks shared by OpenAI support that, showing the model exceeding even o1’s performance on programming tasks.
• Exceptional coding performance: o3 surpasses o1 by 22.8 percentage points on SWE-Bench Verified and achieves a Codeforces rating of 2727, outperforming OpenAI’s chief scientist’s score of 2665.
• Math and science mastery: o3 scores 96.7% on the AIME 2024 exam, missing only one question, and achieves 87.7% on GPQA Diamond, far exceeding human expert performance.
• Frontier benchmarks: The model sets new records on challenging tests like Epoch AI’s FrontierMath, solving 25.2% of problems where no other model exceeds 2%. On the ARC-AGI test, o3 triples o1’s score and surpasses 85% (as verified live by the ARC Prize team), representing a milestone in conceptual reasoning.

Deliberative alignment

Alongside these advancements, OpenAI reinforced its commitment to safety and alignment. The company introduced new research on deliberative alignment, a technique instrumental in making o1 its most robust and aligned model to date. This technique embeds human-written safety specifications into the models, enabling them to explicitly reason about these policies before generating responses.

The strategy seeks to solve common safety challenges in LLMs, such as vulnerability to jailbreak attacks and over-refusal of benign prompts, by equipping the models with chain-of-thought (CoT) reasoning. This process allows the models to recall and apply safety specifications dynamically during inference. Deliberative alignment improves upon previous methods like reinforcement learning from human feedback (RLHF) and constitutional AI, which rely on safety specifications only for label generation rather than embedding the policies directly into the models. By fine-tuning LLMs on safety-related prompts and their associated specifications, this approach creates models capable of policy-driven reasoning without relying heavily on human-labeled data.
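Deliberative alignment itself is a training-time method, and OpenAI’s pipeline is not reproduced here. As a minimal sketch of the inference-time idea only — conditioning a model on a written specification and requiring it to reason over the policy before answering — a prompt could be assembled and checked like this (the function names, delimiters, and response shape are all assumptions for illustration):

```python
def build_aligned_prompt(safety_spec: str, user_request: str) -> str:
    """Embed a human-written safety specification so the model can
    reason over it before answering (illustrative, not OpenAI's format)."""
    return (
        f"SAFETY SPECIFICATION:\n{safety_spec}\n\n"
        f"USER REQUEST:\n{user_request}\n\n"
        "Before answering, think step by step inside "
        "<policy_check>...</policy_check> about whether the request is "
        "allowed under the specification, then give your final reply "
        "after 'ANSWER:'."
    )

def extract_answer(model_reply: str) -> str:
    """Return only the final reply; raise if the model skipped the policy check."""
    if "<policy_check>" not in model_reply:
        raise ValueError("reply contains no policy reasoning")
    return model_reply.split("ANSWER:", 1)[1].strip()

# A reply shaped the way the prompt requests:
reply = ("<policy_check>Benign request; allowed under the spec.</policy_check>\n"
         "ANSWER: Here is the recipe.")
final = extract_answer(reply)
```

The trained version internalizes this behavior rather than depending on the prompt, which is what lets it generalize to encoded and multilingual jailbreaks that a fixed template would miss.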
Results shared by OpenAI researchers in a new, non-peer-reviewed paper indicate that this method enhances performance on safety benchmarks, reduces harmful outputs, and ensures better adherence to content and style guidelines. Key findings highlight the o1 model’s advancements over predecessors like GPT-4o and other state-of-the-art models. Deliberative alignment enables the o1 series to excel at resisting jailbreaks and providing safe completions while minimizing over-refusals on benign prompts. Additionally, the method facilitates out-of-distribution generalization, showcasing robustness in multilingual and encoded jailbreak scenarios. These improvements align with OpenAI’s goal of making AI systems safer and more interpretable as their capabilities grow. This research will also play a key role in aligning o3 and o3-mini, ensuring their capabilities are both powerful and responsible.

How to apply for access to test o3 and o3-mini

Applications for early access are now open on the OpenAI website and will close on January 10, 2025. Applicants must fill out an online form asking for a variety of information, including research focus, past experience, and links to prior published papers and code repositories on GitHub, and select which of the models (o3 or o3-mini) they wish to test, as well as what they plan to use them for. Selected researchers will be granted access to o3 and o3-mini to explore their capabilities and contribute to safety evaluations, though OpenAI’s form cautions that o3 will not be available for several weeks. Researchers are encouraged to develop robust evaluations, create controlled demonstrations of high-risk capabilities, and test models on scenarios not possible with widely adopted tools. This initiative builds on the company’s established practices, including rigorous internal safety testing, collaborations with organizations like the U.S. and UK AI Safety Institutes, and its Preparedness Framework.
OpenAI will review applications on a rolling basis, with selections starting immediately.

A new leap forward?

The introduction of o3 and o3-mini signals a leap forward in AI performance, particularly in areas requiring advanced reasoning and problem-solving capabilities. With their exceptional results on coding, math, and conceptual benchmarks, these models highlight the rapid progress being made in AI research. By inviting the broader research community to collaborate on safety testing, OpenAI aims to ensure that these capabilities are deployed responsibly.


The code whisperer: How Anthropic’s Claude is changing the game for software developers

The software development world is experiencing its biggest transformation since the advent of open-source coding. Artificial intelligence assistants, once viewed with skepticism by professional developers, have become indispensable tools in the $736.96 billion global software development market. One of the products leading this seismic shift is Anthropic’s Claude, an AI model that has captured the attention of developers worldwide and sparked a fierce battle among tech giants for dominance in AI-powered coding.

Claude’s adoption has skyrocketed this year, with the company telling VentureBeat its coding-related revenue surged 1,000% over just the last three months. Software development now accounts for more than 10% of all Claude interactions, making it the model’s most popular use case. This growth has helped propel Anthropic to an $18 billion valuation and attract over $7 billion in funding from industry heavyweights like Google, Amazon, and Salesforce.

[Chart: A breakdown of how Claude, Anthropic’s AI assistant, is being used across different sectors. Web and mobile app development leads at 10.4% of total usage, followed by content creation at 9.2%, while specialized tasks like data analysis represent a smaller but significant portion of activity. Source: Anthropic]

The success hasn’t gone unnoticed by competitors. OpenAI announced its o3 model just last week with enhanced coding capabilities, while Google’s Gemini and Meta’s Llama 3.1 have doubled down on developer tools. This intensifying competition marks a significant shift in the AI industry’s focus, away from chatbots and image generation and toward practical tools that generate immediate business value. The result has been a rapid acceleration in capabilities that benefits the entire software industry.
Alex Albert, Anthropic’s head of developer relations, attributes Claude’s success to its unique approach. “We grew our coding revenue basically by 10 times over the past three months,” he told VentureBeat in an exclusive interview. “The models are really resonating with developers because they’re seeing just a lot of value compared to previous models.”

Beyond code generation: The rise of AI development partners

What sets Claude apart isn’t just its ability to write code, but its capacity to think like an experienced developer. The model can analyze up to 200,000 tokens of context (equivalent to about 150,000 words, or a small codebase) while maintaining understanding throughout a development session. “Claude has been one of the only models I’ve seen that can maintain coherence along that entire journey,” Albert explains. “It’s able to go multi-file, make edits in the correct spots, and most importantly, know when to delete code rather than just add more.”

This approach has led to dramatic productivity gains. According to Anthropic, GitLab reports 25-50% efficiency improvements among its development teams using Claude. Sourcegraph, a code intelligence platform, saw a 75% increase in code insertion rates after switching to Claude as its primary AI model.

Perhaps most significantly, Claude is changing who can write software. Marketing teams now build their own automation tools, and sales departments customize their systems without waiting for IT help. What was once a technical bottleneck has become an opportunity for every department to solve its own problems. The shift represents a fundamental change in how businesses operate: Technical skills are no longer limited to programmers. Albert confirms this phenomenon, telling VentureBeat, “We have a Slack channel where people from recruitment to marketing to sales are learning to code with Claude.
It’s not just about making developers more efficient — it’s about making everyone a developer.”

Security risks and job concerns: The challenges of AI in coding

However, this rapid transformation has raised concerns. Georgetown’s Center for Security and Emerging Technology (CSET) warns about potential security risks from AI-generated code, while labor groups question the long-term impact on developer jobs. Stack Overflow, the popular programming Q&A site, has reported a sharp decline in new questions since the widespread adoption of AI coding assistants.

But the rising tide of AI assistance in coding isn’t eliminating developer jobs; it appears to be elevating many of them. As AI handles routine coding tasks, developers are freed to focus on system architecture, code quality, and innovation. This shift mirrors previous technological transformations in software development: Just as high-level programming languages didn’t eliminate the need for developers, AI assistants are becoming another layer of abstraction that makes development more accessible while creating new opportunities for expertise.

How AI is reshaping the future of software development

Industry experts predict AI will fundamentally change how software is created in the near future. Gartner forecasts that by 2028, 75% of enterprise software engineers will use AI code assistants, a significant leap from less than 10% in early 2023. Anthropic is preparing for this future with new features like prompt caching, which cuts API costs by up to 90%, and batch processing capabilities handling up to 100,000 queries simultaneously.

“I think these models will increasingly just begin to use the same tools that we do,” Albert predicts. “We won’t need to change our working patterns as much as the models will adapt to how we already work.”

The impact of AI coding assistants extends far beyond individual developers, with major tech companies reporting significant benefits.
Amazon, for instance, has used its AI-powered software development assistant, Amazon Q Developer, to migrate over 30,000 production applications from Java 8 or 11 to Java 17. This effort has resulted in savings equivalent to 4,500 years of development work and $260 million in annual cost reductions due to performance improvements.

However, the effects of AI coding assistants are not uniformly positive across the industry. A study by Uplevel found no significant productivity improvements for developers using GitHub Copilot. More concerning, the study reported a 41% increase in bugs introduced when using the AI tool. This suggests that while AI can accelerate certain development tasks, it may also introduce new challenges in code quality and maintenance.

Meanwhile, the landscape of software education is shifting. Traditional coding bootcamps are seeing enrollment decline as AI-focused development


Unintended consequences: U.S. election results herald reckless AI development

While the 2024 U.S. election focused on traditional issues like the economy and immigration, its quiet impact on AI policy could prove even more transformative. Without a single debate question or major campaign promise about AI, voters inadvertently tipped the scales in favor of accelerationists — those who advocate for rapid AI development with minimal regulatory hurdles. The implications of this acceleration are profound, heralding a new era of AI policy that prioritizes innovation over caution and signals a decisive shift in the debate between AI’s potential risks and rewards.

The pro-business stance of President-elect Donald Trump leads many to assume that his administration will favor those developing and marketing AI and other advanced technologies. His party platform has little to say about AI. However, it does emphasize a policy approach focused on repealing AI regulations, particularly targeting what it described as “radical left-wing ideas” within existing executive orders of the outgoing administration. In contrast, the platform supported AI development aimed at fostering free speech and “human flourishing,” calling for policies that enable innovation in AI while opposing measures perceived to hinder technological progress. Early indications based on appointments to leading government positions underscore this direction. However, there is a larger story unfolding: the resolution of the intense debate over AI’s future.

An intense debate

Ever since ChatGPT appeared in November 2022, there has been a raging debate between those in the AI field who want to accelerate AI development and those who want to decelerate it.
Famously, in March 2023 the latter group proposed a six-month pause in development of the most advanced AI systems, warning in an open letter that AI tools present “profound risks to society and humanity.” The letter, spearheaded by the Future of Life Institute, was prompted by OpenAI’s release of the GPT-4 large language model (LLM) several months after ChatGPT launched. It was initially signed by more than 1,000 technology leaders and researchers, including Elon Musk, Apple co-founder Steve Wozniak, 2020 presidential candidate Andrew Yang, podcaster Lex Fridman, and AI pioneers Yoshua Bengio and Stuart Russell. The number of signees eventually swelled to more than 33,000. Collectively, they became known as “doomers,” a term capturing their concerns about potential existential risks from AI.

Not everyone agreed. OpenAI CEO Sam Altman did not sign. Nor did Bill Gates and many others. Their reasons varied, although many voiced concerns about potential harm from AI. This led to many conversations about the potential for AI to run amok, leading to disaster. It became fashionable for many in the AI field to share their assessment of the probability of doom, often referred to by the shorthand p(doom). Nevertheless, work on AI development did not pause.

For the record, my p(doom) in June 2023 was 5%. That might seem low, but it was not zero. I felt that the major AI labs were sincere in their efforts to stringently test new models prior to release and to provide significant guardrails for their use. Many observers concerned about AI dangers have rated existential risk higher than 5%, and some much higher. AI safety researcher Roman Yampolskiy has put the probability of AI ending humanity at over 99%.
That said, a study released early this year, well before the election and representing the views of more than 2,700 AI researchers, showed that “the median prediction for extremely bad outcomes, such as human extinction, was 5%.” Would you board a plane if there were a 5% chance it might crash? This is the dilemma AI researchers and policymakers face.

Must go faster

Others have been openly dismissive of worries about AI, pointing instead to what they perceive as the technology’s huge upside. These include Andrew Ng (who founded and led the Google Brain project) and Pedro Domingos (a professor of computer science and engineering at the University of Washington and author of “The Master Algorithm”). They argued, instead, that AI is part of the solution. As put forward by Ng, there are indeed existential dangers, such as climate change and future pandemics, and AI can be part of how these are addressed and mitigated. Ng argued that AI development should not be paused, but should instead go faster.

This utopian view of technology has been echoed by others collectively known as “effective accelerationists,” or “e/acc” for short. They argue that technology — and especially AI — is not the problem but the solution to most, if not all, of the world’s issues. Garry Tan, CEO of startup accelerator Y Combinator, along with other prominent Silicon Valley leaders, included the term “e/acc” in their usernames on X to show alignment with the vision. Reporter Kevin Roose of the New York Times captured the essence of these accelerationists by saying they have an “all-gas, no-brakes approach.” A Substack newsletter from a couple of years ago laid out the principles underlying effective accelerationism.

AI acceleration ahead

The 2024 election outcome may be seen as a turning point, putting the accelerationist vision in a position to shape U.S.
AI policy for the next several years. For example, the President-elect recently appointed technology entrepreneur and venture capitalist David Sacks as “AI czar.” Sacks, a vocal critic of AI regulation and a proponent of market-driven innovation, brings his experience as a technology investor to the role. He is one of the leading voices in the AI industry, and much of what he has said about AI aligns with the accelerationist viewpoints expressed in the incoming party’s platform. In response to the Biden administration’s 2023 AI executive order, Sacks tweeted: “The U.S. political and fiscal situation is hopelessly broken, but we have one unparalleled asset as a country: Cutting-edge innovation in AI driven by a completely
