VentureBeat

Medical training’s AI leap: How agentic RAG, open-weight LLMs and real-time case insights are shaping a new generation of doctors at NYU Langone

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

Patient data records can be convoluted and sometimes incomplete, meaning doctors don’t always have all the information they need readily available. Added to this is the fact that medical professionals can’t possibly keep up with the barrage of case studies, research papers, trials and other cutting-edge developments coming out of the industry.

New York City-based NYU Langone Health has come up with a novel approach to tackle these challenges for the next generation of doctors. The academic medical center — which comprises NYU Grossman School of Medicine and NYU Grossman Long Island School of Medicine, as well as six inpatient hospitals and 375 outpatient locations — has developed a large language model (LLM) that serves as a trusted research companion and medical advisor.

Every night, the model processes electronic health records (EHR), matching them with relevant research, diagnosis tips and essential background information that it then delivers in concise, tailored emails to residents the following morning. This is an elemental part of NYU Langone’s pioneering approach to medical schooling — what it calls “precision medical education” — which uses AI and data to provide highly customized student journeys.

“This concept of ‘precision in everything’ is needed in healthcare,” Marc Triola, associate dean for educational informatics and director of the Institute for Innovations in Medical Education at NYU Langone Health, told VentureBeat.
“Clearly the evidence is emerging that AI can overcome many of the cognitive biases, errors, waste and inefficiencies in the healthcare system, that it can improve diagnostic decision-making.”

How NYU Langone is using Llama to enhance patient care

NYU Langone is using an open-weight model built on the latest version of Llama-3.1-8B-instruct and the open-source Chroma vector database for retrieval-augmented generation (RAG). But it’s not just accessing documents — the model goes beyond basic RAG, actively employing search and other tools to discover the latest research.

Each night, the model connects to the facility’s EHR database and pulls medical data for patients seen at Langone the previous day. It then searches for basic background information on diagnoses and medical conditions. Using a Python API, the model also searches the related medical literature in PubMed, which has “millions and millions of papers,” Triola explained. The LLM sifts through reviews, deep-dive papers and clinical trials, selects a couple of the seemingly most relevant, and “puts it all together in a nice email.”

Early the following morning, medical students and internal medicine, neurosurgery and radiation oncology residents receive a personalized email with detailed patient summaries. For instance, if a patient with congestive heart failure had been in for a checkup the previous day, the email will provide a refresher on the basic pathophysiology of heart conditions and information about the latest treatments. It also offers self-study questions and AI-curated medical literature. Further, it may give pointers about steps the residents could take next, or actions and details they may have overlooked.

“We’ve gotten great feedback from students, from residents and from the faculty about how this is frictionlessly keeping them up to date, how they’re incorporating this in the way they make choices about a patient’s plan of care,” said Triola.
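The nightly flow described above (pull the previous day's cases from the EHR, search the literature, assemble a tailored email) can be sketched roughly as follows. This is a minimal illustration rather than NYU Langone's actual code: the function names and sample data are hypothetical stand-ins, and the real system uses Llama 3.1 for summarization plus Chroma and PubMed for retrieval.

```python
from dataclasses import dataclass

@dataclass
class PatientCase:
    resident: str   # who saw the patient
    diagnosis: str  # condition to build background material around

# Hypothetical stand-in for the nightly EHR pull.
def fetch_yesterdays_cases():
    return [PatientCase(resident="r.lee@example.edu",
                        diagnosis="congestive heart failure")]

# Hypothetical stand-in for the PubMed literature search step.
def search_pubmed(diagnosis, max_results=2):
    return [f"Recent review on {diagnosis}",
            f"Clinical trial update: {diagnosis}"][:max_results]

def build_morning_email(case):
    # Combine a case refresher with AI-curated reading suggestions.
    papers = search_pubmed(case.diagnosis)
    lines = [
        f"Case refresher: {case.diagnosis}",
        "Suggested reading:",
        *[f"- {p}" for p in papers],
    ]
    return "\n".join(lines)

# One tailored email per resident, ready to send before morning rounds.
emails = {c.resident: build_morning_email(c) for c in fetch_yesterdays_cases()}
```

In the production pipeline, the retrieval calls would hit real databases and the email body would be drafted by the LLM rather than templated.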
A key success metric for him personally was when a system outage halted the emails for a few days — and faculty members and students complained they weren’t receiving the morning nudges they had come to rely on.

“Because we’re sending these emails right before our doctors start rounds — which is among the craziest and busiest times of the day for them — and for them to notice that they weren’t getting these emails and miss them as a part of their thinking was awesome,” he said.

Transforming the industry with precision medical education

This sophisticated AI retrieval system is fundamental to NYU Langone’s precision medical education model, which Triola explained is based on “higher density, frictionless” digital data, AI and strong algorithms. The institution has collected vast amounts of data over the past decade about students — their performance, the environments they’re taking care of patients in, the EHR notes they’re writing, the clinical decisions they’re making and the way they reason through patient interactions and care. Further, NYU Langone has a vast catalog of all the resources available to medical students, whether those be videos, self-study or exam questions, or online learning modules.

The success of the project is also thanks to the medical facility’s streamlined architecture: It boasts centralized IT, a single data warehouse on the healthcare side and a single data warehouse for education, allowing Langone to marry its various data resources.
Chief medical information officer Paul Testa noted that great AI/ML systems aren’t possible without great data, but “it’s not the easiest thing to do if you’re sitting on unwarehoused data in silos across your system.” The medical system may be large, but it operates as “one patient, one record, one standard.”

Gen AI allowing NYU Langone to move away from ‘one-size-fits-all’ education

As Triola put it, the main question his team has been looking to address is: “How do they link the diagnosis, the context of the individual student and all of these learning materials?”

“All of a sudden we’ve got this great key to unlock that: generative AI,” he said.

This has enabled the school to move away from a “one-size-fits-all” model that has been the norm, whether students intended to become, for example, a neurosurgeon or a psychiatrist — vastly different disciplines that require unique approaches. It’s important that students get tailored education throughout their schooling, as well as “educational nudges” that adapt to their needs, he said. But you can’t just tell faculty to “spend more time with each individual student” — that’s humanly impossible.

“Our students have been hungry for this, because they recognize that this is a high-velocity period of change in medicine and generative AI,” said Triola. “It absolutely will change…what it means to be a


Anthropic’s Claude 3.7 Sonnet takes aim at OpenAI and DeepSeek in AI’s next big battle

Anthropic just fired a warning shot at OpenAI, DeepSeek and the entire AI industry with the launch of Claude 3.7 Sonnet, a model that gives users unprecedented control over how much time an AI spends “thinking” before generating a response. The release, alongside the debut of Claude Code, a command-line AI coding agent, signals Anthropic’s aggressive push into the enterprise AI market — a push that could reshape how businesses build software and automate work.

The stakes couldn’t be higher. Last month, DeepSeek stunned the tech world with an AI model that matched the capabilities of U.S. systems at a fraction of the cost, sending Nvidia’s stock down 17% and raising alarms about America’s AI leadership. Now Anthropic is betting that precise control over AI reasoning — not just raw speed or cost savings — will give it an edge.

Claude 3.7 Sonnet introduces a ‘thinking mode’ toggle, allowing users to optimize the AI’s response time based on task complexity. (Credit: Anthropic)

“We just believe that reasoning is a core part and core component of an AI, rather than a separate thing that you have to pay separately to access,” said Dianne Penn, who leads product management for research at Anthropic, in an interview with VentureBeat. “Just like humans, the AI should handle both quick responses and complex thinking. For a simple question like ‘what time is it?’, it should answer instantly. But for complex tasks — like planning a two-week Italy trip while accommodating gluten-free dietary needs — it needs more extensive processing time.”

“We don’t see reasoning, planning and self-correction as separate capabilities,” she added.
“So this is essentially our way of expressing that philosophical difference…Ideally, the model itself should recognize when a problem requires more intensive thinking and adjust, rather than requiring users to explicitly select different reasoning modes.”

A comparison of AI models shows Claude 3.7 Sonnet’s performance across various tasks, with notable gains in extended thinking capabilities compared to its predecessor. (Credit: Anthropic)

The benchmark data backs up Anthropic’s ambitious vision. In extended thinking mode, Claude 3.7 Sonnet achieves 78.2% accuracy on graduate-level reasoning tasks, challenging OpenAI’s latest models and outperforming DeepSeek-R1. But the more revealing metrics come from real-world applications. The model scores 81.2% on retail-focused tool use and shows marked improvements in instruction-following (93.2%) — areas where competitors have either struggled or haven’t published results. While DeepSeek and OpenAI lead in traditional math benchmarks, Claude 3.7’s unified approach demonstrates that a single model can effectively switch between quick responses and deep analysis, potentially eliminating the need for businesses to maintain separate AI systems for different types of tasks.

How Anthropic’s hybrid AI could reshape enterprise computing

The timing of the release is crucial. DeepSeek’s emergence last month sent shockwaves through Silicon Valley, demonstrating that sophisticated AI reasoning could be achieved with far less computing power than previously thought. This challenged fundamental assumptions about AI development costs and infrastructure requirements. When DeepSeek published its results, Nvidia’s stock dropped 17% in a single day, with investors suddenly questioning whether expensive chips were truly essential for advanced AI.

For businesses, the stakes couldn’t be higher. Companies are spending millions integrating AI into their operations, betting on which approach will dominate.
Anthropic’s hybrid model offers a compelling middle path: the ability to fine-tune AI performance based on the task at hand, from instant customer service responses to complex financial analysis. The system maintains Anthropic’s previous pricing of $3 per million input tokens and $15 per million output tokens, even with the added reasoning features.

“Our customers are trying to achieve outcomes for their customers,” explained Michael Gerstenhaber, Anthropic’s head of platform. “Using the same model and prompting the same model in different ways allows somebody like Thomson Reuters to do legal research, allows our coding partners like Cursor or GitHub to be able to develop applications and meet those goals.”

Anthropic’s hybrid approach represents both a technical evolution and a strategic gambit. While OpenAI maintains separate models for different capabilities and DeepSeek focuses on cost efficiency, Anthropic is pursuing unified systems that can handle both routine tasks and complex reasoning. It’s a philosophy that could reshape how businesses deploy AI and eliminate the need to juggle multiple specialized models.

Meet Claude Code: AI’s new developer assistant

Anthropic today also unveiled Claude Code, a command-line tool that allows developers to delegate complex engineering tasks directly to AI. The system requires human approval before committing code changes, reflecting the growing industry focus on responsible AI development.

Claude Code’s terminal interface, part of Anthropic’s new developer tools suite, emphasizes simplicity and direct interaction. (Credit: Anthropic)

“You actually still have to accept the changes Claude makes. You are a reviewer with hands on [the] wheel,” Penn noted.
“There is essentially a sort of checklist that you have to essentially accept for the model to take certain actions.”

The announcements come amid intense competition in AI development. Stanford researchers recently created an open-source reasoning model for under $50, while Microsoft just integrated OpenAI’s o3-mini model into Azure. DeepSeek’s success has also spurred new approaches to AI development, with some companies exploring model distillation techniques that could further reduce costs.

The command-line interface of Claude Code allows developers to delegate complex engineering tasks while maintaining human oversight. (Credit: Anthropic)

From Pokémon to enterprise: Testing AI’s new intelligence

Penn illustrated the dramatic progress in AI capabilities with an unexpected example: “We’ve been asking different versions of Claude to play Pokémon…This version has made it all the way to Vermilion City, captured multiple Pokémon, and even grinds to level-up. It has the right Pokémon to battle against rivals.”

“I think you’ll see us continue to innovate and push on the quality of reasoning, push towards things like dynamic reasoning,” Penn explained. “We have always thought of


Voltron Data just partnered with Accenture to solve one of AI’s biggest headaches

As artificial intelligence drives unprecedented demand for data processing, Mountain View startup Voltron Data is offering a solution to one of AI’s least discussed but most critical challenges: moving and transforming massive datasets quickly enough to keep up.

Voltron Data, which announced a strategic partnership with Accenture today, has developed a GPU-accelerated analytics engine that could help enterprises overcome the data preparation bottleneck hampering AI initiatives. The company’s core product, Theseus, enables organizations to process petabyte-scale data using graphics processing units (GPUs) instead of traditional central processing units (CPUs).

“Everyone’s focused on the flashy new stuff that you can touch and feel, but it’s that dataset foundation underneath that is going to be key,” said Michael Abbott, who leads Accenture’s banking and capital markets practice, in an exclusive interview with VentureBeat. “To make AI work, you’ve got to move data around at a speed and pace you just never had to before.”

Building for the AI tsunami: Why traditional data processing won’t cut it

The partnership comes as companies rushing to adopt generative AI are discovering that their existing data infrastructure isn’t equipped to handle the volume and velocity of data required. This challenge is expected to intensify as AI agents become more prevalent in enterprise operations.

“Agents will probably write more SQL queries than humans did in a very short time horizon,” said Rodrigo Aramburu, Voltron Data’s field CTO and co-founder.
“If CIOs and CTOs are already saying they spend way too much on data analytics and cloud infrastructure, and the demand is about to step function higher, then we need a step function down in the cost of running those queries.”

Unlike traditional database vendors that have retrofitted GPU support onto existing systems, Voltron Data built its engine from the ground up for GPU acceleration. “What most companies have done when they’ve tried to do GPU acceleration is they’ll shoehorn GPUs onto an existing system,” Aramburu told VentureBeat. “By building from the ground up…we’re able to get 10x, 20x, 100x depending on the performance profile of a particular workload.”

From 1,400 servers to 14: Early adopters see dramatic results

The company positions Theseus as complementary to established platforms like Snowflake and Databricks, leveraging the Apache Arrow framework for efficient data movement. “It’s really an accelerator to all those databases, rather than competition,” Abbott said. “It’s still using the same SQL that was written to get the same answer, but get there a lot faster and quicker in a parallel fashion.”

Early adoption has focused on data-intensive industries like financial services, where use cases include fraud detection, risk modeling and regulatory compliance. One large retailer reduced its server count from 1,400 CPU machines to just 14 GPU servers after implementing Theseus, according to Aramburu. Since launching at Nvidia’s GTC conference last March, Voltron Data has secured about 14 enterprise customers, including two large government agencies. The company plans to release a “test drive” version that will allow potential customers to experiment with GPU-accelerated queries on terabyte-scale datasets.

Turning the GPU shortage into an opportunity

The current GPU shortage sparked by AI demand has been both challenging and beneficial for Voltron Data.
While new deployments face hardware constraints, many enterprises possess underutilized GPU infrastructure originally purchased for AI workloads, assets that could be repurposed for data processing during idle periods. “We actually saw it as a boon in that there’s just so many GPUs out in the market that previously weren’t there,” Aramburu noted, adding that Theseus can run effectively on older GPU generations that might otherwise be deprecated.

The technology could be particularly valuable for banks dealing with what Abbott calls “trapped data” — information locked in formats like PDFs and documents that could be valuable for AI training but is difficult to access and process at scale. “You’ve seen some of the data that Voltron would show you is potentially 90% more effective and efficient to move data using this technology than standard CPUs,” Abbott said. “That’s the power.”

As enterprises grapple with the data demands of AI, solutions that can accelerate data processing and reduce infrastructure costs are likely to become increasingly critical. The partnership with Accenture could help Voltron Data reach more organizations facing these challenges, while giving Accenture’s clients access to technology that could significantly improve their AI initiatives’ performance and efficiency.
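The Apache Arrow framework mentioned above rests on a simple idea: columnar, contiguous, typed buffers that engines can scan and hand to one another without per-row overhead or serialization. A toy contrast in plain Python, using the stdlib array module as a rough stand-in for an Arrow buffer (the transaction data is invented for illustration):

```python
from array import array

# Row-oriented storage: each record is a dict, so even a single-column
# aggregate has to walk every heterogeneous record object.
rows = [{"amount": float(i), "flagged": i % 7 == 0, "note": "x"}
        for i in range(1000)]

# Columnar (Arrow-style) storage: one contiguous typed buffer per column,
# so an aggregate over "amount" streams a single packed float array, the
# access pattern GPUs and vectorized engines are built for.
amounts = array("d", (r["amount"] for r in rows))

total_rows = sum(r["amount"] for r in rows)  # scans 1,000 dicts
total_cols = sum(amounts)                    # scans one packed buffer
```

Both sums agree, of course; the difference is the memory layout being scanned, which is where engines like Theseus claim their speedups.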


DeepSeek’s R1 and OpenAI’s Deep Research just redefined AI — RAG, distillation, and custom models will never be the same

Things are moving quickly in AI — and if you’re not keeping up, you’re falling behind.

Two recent developments are reshaping the landscape for developers and enterprises alike: DeepSeek’s R1 model release and OpenAI’s new Deep Research product. Together, they’re redefining the cost and accessibility of powerful reasoning models, which has been well reported on. Less talked about, however, is how they’ll push companies to use techniques like distillation, supervised fine-tuning (SFT), reinforcement learning (RL) and retrieval-augmented generation (RAG) to build smarter, more specialized AI applications.

After the initial excitement around DeepSeek’s achievements begins to settle, developers and enterprise decision-makers need to consider what it means for them. From pricing and performance to hallucination risks and the importance of clean data, here’s what these breakthroughs mean for anyone building AI today.

Cheaper, transparent, industry-leading reasoning models – but through distillation

The headline with DeepSeek-R1 is simple: It delivers an industry-leading reasoning model at a fraction of the cost of OpenAI’s o1. Specifically, it’s about 30 times cheaper to run, and unlike many closed models, DeepSeek offers full transparency around its reasoning steps. For developers, this means you can now build highly customized AI models without breaking the bank — whether through distillation, fine-tuning or simple RAG implementations.

Distillation, in particular, is emerging as a powerful tool. By using DeepSeek-R1 as a “teacher model,” companies can create smaller, task-specific models that inherit R1’s superior reasoning capabilities. These smaller models, in fact, are the future for most enterprise companies.
The full R1 reasoning model can be too much for what companies need — thinking too much, and not taking the decisive action companies need for their specific domain applications. “One of the things that no one is really talking about, certainly in the mainstream media, is that, actually, reasoning models are not working that well for things like agents,” said Sam Witteveen, a machine learning (ML) developer who works on AI agents that are increasingly orchestrating enterprise applications.

As part of its release, DeepSeek distilled its own reasoning capabilities onto a number of smaller models, including open-source models from Meta’s Llama family and Alibaba’s Qwen family, as described in its paper. It’s these smaller models that can then be optimized for specific tasks. This trend toward smaller, fast models to serve custom-built needs will accelerate: Eventually there will be armies of them.

“We are starting to move into a world now where people are using multiple models. They’re not just using one model all the time,” said Witteveen. And this includes the low-cost, smaller closed-source models from Google and OpenAI as well. “That means that models like Gemini Flash, GPT-4o Mini and these really cheap models actually work really well for 80% of use cases.”

If you work in an obscure domain and have resources: Use SFT…

After the distilling step, enterprise companies have a few options to make sure the model is ready for their specific application. If you’re a company in a very specific domain, where details are not on the web or in books — which large language models (LLMs) typically train on — you can inject your own domain-specific datasets with SFT. One example would be the ship container-building industry, where specifications, protocols and regulations are not widely available.

DeepSeek showed that you can do this well with “thousands” of question-answer datasets.
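A question-answer set of that kind is typically serialized as JSONL, one chat-formatted record per line, before being handed to an SFT toolchain. A minimal sketch (the domain examples are hypothetical, and the exact record schema varies by toolchain):

```python
import json

# Hypothetical domain Q-A pairs for a niche field (ship-container specs).
qa_pairs = [
    {"question": "What is the nominal external length of a 1AA container?",
     "answer": "40 feet (12.192 m)."},
    {"question": "Which ISO standard covers container corner fittings?",
     "answer": "ISO 1161."},
]

def to_jsonl(pairs):
    # One {"messages": [...]} record per line, the common SFT input shape.
    lines = []
    for p in pairs:
        record = {"messages": [
            {"role": "user", "content": p["question"]},
            {"role": "assistant", "content": p["answer"]},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_jsonl(qa_pairs)
```

Scaling this to the "thousands" of pairs DeepSeek describes is mostly a data-collection problem; the format stays this simple.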
For an example of how others can put this into practice, IBM engineer Chris Hay demonstrated how he fine-tuned a small model using his own math-specific datasets to achieve lightning-fast responses — outperforming OpenAI’s o1 on the same tasks. (View the hands-on video here.)

…and a little RL

Additionally, companies wanting to train a model with additional alignment to specific preferences — for example, making a customer support chatbot sound empathetic while being concise — will want to do some RL. This is also good if a company wants its chatbot to adapt its tone and recommendations based on user feedback. As every model gets good at everything, “personality” is going to be increasingly big, Wharton AI professor Ethan Mollick said on X.

These SFT and RL steps can be tricky for companies to implement well, however. Feed the model with data from one specific domain area, or tune it to act a certain way, and it suddenly becomes useless for doing tasks outside of that domain or style.

For most companies, RAG will be good enough

For most companies, however, RAG is the easiest and safest path forward. RAG is a relatively straightforward process that allows organizations to ground their models with proprietary data contained in their own databases — ensuring outputs are accurate and domain-specific. Here, an LLM feeds a user’s prompt into vector and graph databases to search for information relevant to that prompt. RAG processes have gotten very good at finding only the most relevant content.

This approach also helps counteract some of the hallucination issues associated with DeepSeek, which currently hallucinates 14% of the time compared to 8% for OpenAI’s o3 model, according to a study done by Vectara, a vendor that helps companies with the RAG process.

This distillation of models plus RAG is where the magic will happen for most companies. It has become so incredibly easy to do, even for those with limited data science or coding expertise.
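At its core, the retrieval step is just similarity search over embedded documents followed by prompt assembly. The sketch below uses bag-of-words cosine similarity as a stand-in for a real embedding model, and an in-memory dict as a stand-in for a vector database; the documents are invented for illustration, and a production setup would swap in proper embeddings and a store like Chroma:

```python
import math
from collections import Counter

# Toy proprietary knowledge base (invented examples).
docs = {
    "leave_policy": "Employees accrue 1.5 vacation days per month of service.",
    "expense_policy": "Meal expenses over $50 require a receipt and manager approval.",
}

def embed(text):
    # Bag-of-words "embedding": token -> count.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)
    return [docs[d] for d in ranked[:k]]

def build_prompt(query):
    # Ground the model: only retrieved context is allowed into the prompt.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How many vacation days do employees accrue?")
```

The grounding effect is visible even in this toy: the assembled prompt contains the vacation policy and nothing from the unrelated expense document, which is the mechanism that keeps outputs accurate and domain-specific.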
I personally downloaded the DeepSeek distilled 1.5B Qwen model, the smallest one, so that it could fit nicely on my MacBook Air. I then loaded some PDFs of job applicant resumes into a vector database and asked the model to look over the applicants to tell me which ones were qualified to work at VentureBeat. (In all, this took me 74 lines of code, which I basically borrowed from others doing the same.) I loved that the DeepSeek distilled model showed its thinking process behind why or why not it recommended each applicant —


Breaking down Grok 3: The AI model that could redefine the industry

Less than two years since its launch, xAI has shipped what could arguably be the most advanced AI model to date. Grok 3 matches or beats the most advanced models on all key benchmarks as well as the user-evaluated Chatbot Arena, and its training has not even been completed yet.

We still don’t have a lot of details about Grok 3, as the team has not yet released a paper or technical report. But from what xAI has shared in a presentation, and based on different experiments AI experts have run on the model, we can guess how Grok 3 might affect the AI industry in the coming months.

Faster launches

With competition increasing between AI labs (just look at the release of DeepSeek-R1), we can expect model release cycles to become shorter. In the Grok 3 presentation, xAI founder Elon Musk said that users may “notice improvements almost every day because we’re continuously improving the model.”

“Competitive pressure from DeepSeek and Grok integrated into a shifting political environment for AI — both domestic and international — will make the established leading labs ship sooner,” writes Nathan Lambert, machine learning scientist at the Allen Institute for AI. “Increased competition and decreased regulation make it likely that we, the users, will be given far more powerful AI on far faster timelines.”

On the one hand, this can be a good thing for users, as they constantly get access to the latest and greatest models as opposed to waiting for month-long rollouts. On the other, it can have a destabilizing effect on developers who expect consistent behavior from the model. Previous research and empirical evidence from users has shown that different versions of a model can react differently to the same prompt. Enterprises should develop custom evaluations and run them regularly to make sure new updates do not break their applications.
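A custom evaluation suite of that kind needs no heavy tooling: a fixed set of prompts, each paired with a pass/fail check, rerun against every model update, already catches most behavioral regressions. A minimal sketch, where the stub function stands in for a real model endpoint:

```python
# Each case pairs a prompt with a predicate the response must satisfy.
EVAL_CASES = [
    ('Return the JSON {"status": "ok"} and nothing else.',
     lambda out: out.strip().startswith("{")),
    ("Reply with exactly one word: yes or no. Is 7 prime?",
     lambda out: out.strip().lower() in {"yes", "no"}),
]

def run_evals(model_fn):
    # Run every case against the model; return the prompts that failed.
    failures = []
    for prompt, check in EVAL_CASES:
        if not check(model_fn(prompt)):
            failures.append(prompt)
    return failures

# Stub standing in for a call to a real model endpoint.
def stub_model(prompt):
    return '{"status": "ok"}' if "JSON" in prompt else "yes"

failures = run_evals(stub_model)
```

Wiring `run_evals` into CI and running it against each new model version turns "the update changed behavior" from a production surprise into a failed check.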
Scaling laws

The recent release of DeepSeek-R1 undermined the massive spending big companies are making to create large compute clusters. But xAI’s sudden rise is a vindication of the massive investments tech companies have been making in AI accelerators. Grok 3 was trained in record time thanks to xAI’s Colossus supercluster in Memphis.

“We don’t have specifics, but it’s reasonably safe to take a datapoint for scaling still helps for performance (but maybe not on costs),” Lambert writes. “xAI’s approach and messaging has been to get the biggest cluster online as soon as possible. The Occam’s Razor explanation until we have more details is that scaling helped, but it is possible that most of Grok’s performance comes from techniques other than naive scaling.”

Other analysts have pointed out that xAI’s ability to scale its compute cluster has been key to the success of Grok 3. However, Musk has hinted that there is more than just scaling at work here. We’ll have to wait for the paper to get the full details.

Open source culture

There is a growing shift toward open-sourcing large language models (LLMs). xAI has already open-sourced Grok 1, and according to Musk, the company’s general policy is to open source every model except the latest version. So, when Grok 3 is fully released, Grok 2 will be open-sourced. (Sam Altman has also been entertaining the idea of open sourcing some of OpenAI’s models.)

xAI will also refrain from showing the full chain-of-thought (CoT) tokens of Grok 3’s reasoning to prevent competitors from copying it. It will instead show a detailed overview of the model’s reasoning trace (as OpenAI has done with o3-mini). The full CoT will only be available once xAI open sources Grok 3, which will probably come after the release of Grok 4.

Do your own vibe check

Despite the impressive benchmark results, reactions to Grok 3 have been mixed.
Former OpenAI and Tesla AI scientist Andrej Karpathy placed its reasoning capabilities at “around state-of-the-art,” along with o1-Pro, but also pointed out that it lags behind other state-of-the-art models on some tasks, such as creating compositional scalable vector graphics or navigating ethical issues. Other users have pointed out flaws in Grok 3’s coding abilities in comparison to other models, although there are also many instances of Grok 3 pulling off impressive coding feats.

Based on my own experience with leading models, I advise you to do your own vibe check and research. Never judge a model based on a one-shot prompt. Have a set of tests that reflect the kind of tasks you accomplish in your organization (see a few examples here). Chances are, with the right approach, you can get the most out of these advanced models.


Too many models, too much confusion: OpenAI pledges to simplify its product line

OpenAI plans to “simplify” its model offerings, even as it prepares to ship its last non-reasoning model, GPT-4.5. This is a rare admission from a tech company that its product releases haven’t been differentiated enough for customers.

OpenAI CEO Sam Altman took to X to give product roadmap updates and acknowledge that some of the company’s recent releases have muddied the waters. “We want to do a better job of sharing our intended roadmap, and a much better job simplifying our product offerings,” Altman wrote. “We want AI to ‘just work’ for you; we realize how complicated our model and product offerings have gotten.”

Altman took particular issue with ChatGPT’s model picker, where users tap a drop-down to choose from released models. The number of models a user can choose from depends on their subscription tier. Altman said his team “hates the model picker as much as you do” and will bring back “magic unified intelligence,” presumably referring to a single model powering the platform. Altman said the company plans to integrate its technology, including the o-series of models, into GPT-5 when it comes out on ChatGPT and its APIs.

Over the past year, OpenAI has released the GPT-4o, o1, o3 and o3-mini models. It has also launched new ChatGPT subscription tiers, including the $200/month ChatGPT Pro. On top of all that, the company has released the Operator agent, Deep Research, ChatGPT tasks and a host of other new features that are sometimes only available through specific models or paid levels.

Awaiting the release of OpenAI’s GPT-4.5 and GPT-5

Altman said OpenAI’s next step is releasing GPT-4.5, its last non-chain-of-thought (CoT) model. It has been internally called Orion, although he did not explain what new capabilities GPT-4.5 will have.
“After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not and generally be useful for a very wide range of tasks,” said Altman. When the company releases GPT-5, it will stop shipping the reasoning-focused o3 as a standalone model. OpenAI understands that some users may not need the full capabilities of its newest models, Altman conceded, which is why the company created a model picker in the first place. GPT-5’s capabilities will be attuned to the subscription level, replacing the drop-down, he said. “The free tier of ChatGPT will get unlimited chat access to GPT-5 at the standard intelligence setting (!!), subject to abuse thresholds,” said Altman. “Plus subscribers will be able to run GPT-5 at a higher level of intelligence, and Pro subscribers will be able to run GPT-5 at an even higher level of intelligence. These models will incorporate voice, canvas, search, deep research and more.” Altman did not say when GPT-4.5 or GPT-5 will be released.

Customer confusion

OpenAI’s decision to simplify how it presents models points to a continuing problem: Not everyone understands what each model is supposed to do or what makes it different. OpenAI is not alone in this. If I weren’t reporting on the industry day in and day out, I wouldn’t be able to differentiate between the Gemini models or tell you why Claude’s Sonnet is better than Haiku. While this confusion is more common among consumers, simplifying products makes things much more manageable, even for power users like AI engineers. Increasingly, you must keep a working encyclopedia of which models can do what and which are most up to date. Integrating the latest capabilities into one model, and gating the more complex abilities by usage, could make it easier for many developers to experiment faster.
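The tiered access Altman describes implies a simple routing layer in place of the model picker: one model name, with an intelligence setting chosen from the user's subscription. A hypothetical sketch: the tier names follow the quote, while the setting values and routing logic are invented for illustration.

```python
# Hypothetical sketch of tier-based routing replacing a manual model
# picker. Tier names mirror Altman's quote; everything else is invented.

INTELLIGENCE_BY_TIER = {
    "free": "standard",   # "unlimited chat access ... at the standard intelligence setting"
    "plus": "higher",     # "a higher level of intelligence"
    "pro": "highest",     # "an even higher level of intelligence"
}

def route_request(tier: str) -> dict:
    """Return the model configuration for a subscription tier,
    defaulting unknown tiers to the standard setting."""
    setting = INTELLIGENCE_BY_TIER.get(tier, "standard")
    return {"model": "gpt-5", "intelligence": setting}

route_request("free")  # {'model': 'gpt-5', 'intelligence': 'standard'}
```

The user never chooses a model; the platform does, which is the "magic unified intelligence" idea in miniature.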


Aomni just raised $4M to prove AI can boost sales without replacing humans

Aomni, an AI platform that helps sales teams conduct deep research on potential customers, has raised $4 million in seed funding led by Decibel, with participation from Sancus Ventures and Ride Home Fund. The company is taking a distinctly human-centric approach in a market saturated with AI tools focused on automating away sales roles. Founded by David Zhang, whose journey from a small Chinese village to launching satellites into space mirrors the ambitious scope of his latest venture, Aomni uses AI agents to transform how enterprise sales teams research and engage with prospects. “If you look at a typical sales rep, they only spend 30-40% of their time in front of customers,” Zhang told VentureBeat in an exclusive interview. “The other 60-70% is spent on tedious knowledge work — filling out CRM systems, lead prospecting, and account research. Our goal is to make all that tedious work disappear.”

Aomni’s AI dashboard surfaces real-time intelligence on Snowflake, including a $120,000 deal opportunity and key decision-makers. (Credit: Aomni)

AI agents that research in real time, giving sales teams an edge

Unlike traditional sales intelligence platforms that rely on static databases and basic firmographic data, Aomni deploys AI agents that conduct real-time web research across dozens of data sources. When researching a prospect, the system spins up multiple Chrome browsers to actively browse recent company announcements, social media posts, product launches and other public information. “Current sales research products are based more on a static database,” Zhang told VentureBeat. “They’re great if you want to find companies that have raised more than $10 million or have 200 employees, but that’s very high-level.
Because we do agentic research with real-time web browsing, you can ask very specific questions.” Early customers include major technology companies like Nvidia and AMD, which are using Aomni to navigate the rapidly evolving AI chip market. The platform has helped sales teams improve close rates by up to 40% by enabling more targeted and informed customer conversations.

An AI-generated sales strategy showing how Aomni cuts research time from hours to minutes for enterprise accounts. (Credit: Aomni)

Why investors are betting big on AI that supports — not replaces — sales teams

Jessica Leao, partner at Decibel, said the firm was drawn to Aomni’s human-centric vision in a market dominated by automation plays. “So many AI sales tools focus on replacing reps, but sales will always be human-driven at the highest levels,” she explained. “What I love about Aomni is that the team has a unique combination of deep technical expertise and a compelling vision for the future that is unapologetically team human.” The technology builds on recent advances in large language models and AI agents, but Zhang emphasizes that the key innovation is in the orchestration layer rather than the underlying models. Aomni combines multiple AI providers, including OpenAI, Anthropic and AI21 Labs, with proprietary systems for parsing company-specific information. Zhang’s approach appears prescient given the recent explosion of interest in AI agents. An open-source version of Aomni’s research system, released shortly after OpenAI’s similar Deep Research feature, garnered over 8,000 GitHub stars within days.

“Introducing deep-research – my own open source implementation of OpenAI’s new Deep Research agent. Get the same capability without paying $200. You can even tweak the behavior of the agent with adjustable breadth and depth. Run it for 5 min or 5 hours, it’ll auto adjust.”
— David (@dzhng) February 4, 2025

The future of sales AI: smarter, faster and more human-centric

Looking ahead, Zhang plans to expand Aomni’s capabilities with multimodal features that will allow sales teams to interact with the AI through voice and images. The company is betting heavily on continued improvements in AI model performance and declining compute costs. “Our development philosophy is: Don’t bet against the model,” Zhang said. “A lot of companies spend time on optimization and guardrails and cost savings. Then in two months, there’s a new model that’s half the price and all that work becomes useless.” The funding will help Aomni pursue its vision of turning every sales representative into a top performer through AI augmentation rather than replacement. In a field as relationship-driven as enterprise sales, Zhang believes the future belongs to solutions that enhance rather than automate human capabilities. “Ultimately, sales is still about solving customer problems and building relationships,” Zhang said. “That’s not going to change in 10 years. You can add as much automation as you want on top, but those fundamentals remain the same.”
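The adjustable "breadth and depth" Zhang mentions suggests a recursive research loop: at each level, generate a handful of follow-up queries and recurse a fixed number of levels down. The stub below sketches that recursion with a fake search function; the real open-source repository's API will differ.

```python
# Sketch of a breadth/depth research recursion. `fake_search` stands in
# for real-time web search; names here are invented for illustration.

def fake_search(query: str, breadth: int) -> list[str]:
    """Stand-in for web search: returns `breadth` follow-up queries."""
    return [f"{query} / follow-up {i}" for i in range(breadth)]

def research(query: str, breadth: int, depth: int, notes: list[str]) -> None:
    """Recursively explore follow-up queries, collecting each one visited."""
    notes.append(query)
    if depth == 0:
        return
    for follow_up in fake_search(query, breadth):
        research(follow_up, breadth, depth - 1, notes)

notes: list[str] = []
research("AI chip market", breadth=2, depth=2, notes=notes)
len(notes)  # 7 queries explored: 1 + 2 + 4
```

Widening breadth fans the search out; deepening depth chases each thread further, which is why the same loop can run "for 5 min or 5 hours."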


Like it or not, AI is learning how to influence you

When I was a kid there were four AI agents in my life. Their names were Inky, Blinky, Pinky and Clyde, and they tried their best to hunt me down. This was the 1980s and the agents were the four colorful ghosts in the iconic arcade game Pac-Man. By today’s standards they weren’t particularly smart, yet they seemed to pursue me with cunning and intent. This was decades before neural networks were used in video games, so their behaviors were controlled by simple algorithms called heuristics that dictated how they would chase me around the maze. Most people don’t realize this, but the four ghosts were designed with different “personalities.” Good players can observe their actions and learn to predict their behaviors. For example, the red ghost (Blinky) was programmed with a “pursuer” personality that charges directly toward you. The pink ghost (Pinky), on the other hand, was given an “ambusher” personality that predicts where you’re going and tries to get there first. As a result, if you rush directly at Pinky, you can use her personality against her, causing her to actually turn away from you. I reminisce because in 1980 a skilled human could observe these AI agents, decode their unique personalities and use those insights to outsmart them. Now, 45 years later, the tides are about to turn. Like it or not, AI agents will soon be deployed that are tasked with decoding your personality so they can use those insights to optimally influence you.

The future of AI manipulation

In other words, we are all about to become unwitting players in “the game of humans,” and it will be the AI agents trying to earn the high score. I mean this literally — most AI systems are designed to maximize a “reward function” that earns points for achieving objectives. This allows AI systems to quickly find optimal solutions.
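For a sense of how simple the 1980s version was, the two ghost "personalities" described above were fixed targeting rules. A minimal sketch, using grid coordinates and the commonly cited four-tile lookahead (treat the details as an approximation of the original game's logic):

```python
# Blinky the "pursuer" targets Pac-Man's current tile; Pinky the
# "ambusher" targets a point ahead of Pac-Man's direction of travel.
# Positions are (x, y) grid tiles; direction is a unit vector.

def blinky_target(pacman_pos, pacman_dir):
    """Pursuer: head straight for the player's current tile."""
    return pacman_pos

def pinky_target(pacman_pos, pacman_dir):
    """Ambusher: aim four tiles ahead of the player to cut them off."""
    x, y = pacman_pos
    dx, dy = pacman_dir
    return (x + 4 * dx, y + 4 * dy)

# Player at (10, 10) moving right:
blinky_target((10, 10), (1, 0))  # (10, 10): chases directly
pinky_target((10, 10), (1, 0))   # (14, 10): cuts off the path ahead
```

Knowing that Pinky aims ahead of you rather than at you is exactly the insight that let skilled players turn her personality against her.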
Unfortunately, without regulatory protections, we humans will likely become the objective that AI agents are tasked with optimizing. I am most concerned about the conversational agents that will engage us in friendly dialog throughout our daily lives. They will speak to us through photorealistic avatars on our PCs and phones and, soon, through AI-powered glasses that will guide us through our days. Unless there are clear restrictions, these agents will be designed to conversationally probe us for information so they can characterize our temperaments, tendencies, personalities and desires, and use those traits to maximize their persuasive impact when working to sell us products, pitch us services or convince us to believe misinformation. This is called the “AI manipulation problem,” and I’ve been warning regulators about the risk since 2016. Thus far, policymakers have not taken decisive action, viewing the threat as too far in the future. But now, with the release of DeepSeek-R1, the final barrier to widespread deployment of AI agents — the cost of real-time processing — has been greatly reduced. Before this year is out, AI agents will become a new form of targeted media that is so interactive and adaptive, it can optimize its ability to influence our thoughts, guide our feelings and drive our behaviors.

Superhuman AI ‘salespeople’

Of course, human salespeople are interactive and adaptive too. They engage us in friendly dialog to size us up, quickly finding the buttons they can press to sway us. AI agents will make them look like amateurs, able to draw information out of us with such finesse that it would intimidate a seasoned therapist. And they will use these insights to adjust their conversational tactics in real time, working to persuade us more effectively than any used-car salesman. These will be asymmetric encounters in which the artificial agent has the upper hand (virtually speaking).
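The "reward function" framing can be made concrete. An agent that adjusts its tactics in real time is, at its simplest, a bandit algorithm: try tactics, observe responses, reinforce whatever works. The tactic names and response rates below are invented; this is a toy epsilon-greedy sketch, not anyone's actual system.

```python
import random

# Toy illustration: persuasion as a bandit problem. Nothing in this
# loop models *why* a tactic works, only that it does.

def pick_tactic(counts, totals, epsilon, rng):
    """Exploit the tactic with the best observed mean reward,
    exploring a random one with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(sorted(counts))
    return max(counts, key=lambda t: totals[t] / counts[t] if counts[t] else 0.0)

def update(counts, totals, tactic, reward):
    counts[tactic] += 1
    totals[tactic] += reward

rng = random.Random(0)
tactics = ["flattery", "scarcity", "social_proof"]
counts = {t: 0 for t in tactics}
totals = {t: 0.0 for t in tactics}

# Simulated target: each tactic "lands" with a fixed probability.
true_rates = {"flattery": 0.20, "scarcity": 0.50, "social_proof": 0.35}

for _ in range(2000):
    t = pick_tactic(counts, totals, epsilon=0.1, rng=rng)
    update(counts, totals, t, 1.0 if rng.random() < true_rates[t] else 0.0)
```

After a few thousand simulated exchanges the agent has settled on whichever tactic moves you most; the unsettling part is that the loop needs no understanding of you at all, only a score.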
After all, when you engage a human who is trying to influence you, you can usually sense their motives and honesty. It will not be a fair fight with AI agents. They will be able to size you up with superhuman skill, but you won’t be able to size them up at all. That’s because they will look, sound and act so human that we will unconsciously trust them when they smile with empathy and understanding, forgetting that their facial affect is just a simulated façade. In addition, their voice, vocabulary, speaking style, age, gender, race and facial features are likely to be customized for each of us personally to maximize our receptiveness. And, unlike human salespeople who need to size up each customer from scratch, these virtual entities could have access to stored data about our backgrounds and interests. They could then use this personal data to quickly earn your trust, asking you about your kids, your job or maybe your beloved New York Yankees, easing you into subconsciously letting down your guard.

When AI achieves cognitive supremacy

To educate policymakers on the risk of AI-powered manipulation, I helped in the making of an award-winning short film entitled Privacy Lost that was produced by the Responsible Metaverse Alliance, Minderoo and the XR Guild. The quick three-minute narrative depicts a young family eating in a restaurant while wearing augmented reality (AR) glasses. Instead of human servers, avatars take each diner’s orders, using the power of AI to upsell them in personalized ways. The film was considered sci-fi when released in 2023 — yet only two years later, big tech is engaged in an all-out arms race to make AI-powered eyewear that could easily be used in these ways. In addition, we need to consider the psychological impact that will occur when we humans start to believe that the AI agents giving us advice are smarter than us on nearly every front.
When AI achieves a perceived state of “cognitive supremacy” with respect to the average person, it will likely cause us to blindly accept its guidance rather than using our own critical thinking. This deference to a perceived superior intelligence (whether truly superior


AI can fix bugs—but can’t find them: OpenAI’s study highlights limits of LLMs in software engineering

Large language models (LLMs) may have changed software development, but enterprises will need to think twice about entirely replacing human software engineers with LLMs, despite OpenAI CEO Sam Altman’s claim that models can replace “low-level” engineers. In a new paper, OpenAI researchers detail how they developed an LLM benchmark called SWE-Lancer to test how much foundation models can earn from real-life freelance software engineering tasks. The test found that, while the models can solve bugs, they can’t see why the bug exists and continue to make more mistakes. The researchers tasked three LLMs — OpenAI’s GPT-4o and o1 and Anthropic’s Claude 3.5 Sonnet — with 1,488 freelance software engineering tasks from the freelance platform Upwork, amounting to $1 million in payouts. They divided the tasks into two categories: individual contributor tasks (resolving bugs or implementing features) and management tasks (where the model roleplays as a manager who chooses the best proposal to resolve issues). “Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models,” the researchers write. The test shows that foundation models cannot fully replace human engineers. While they can help solve bugs, they’re not quite at the level where they can start earning freelancing cash by themselves.

Benchmarking freelancing models

The researchers and 100 other professional software engineers identified potential tasks on Upwork and, without changing any words, fed these to a Docker container to create the SWE-Lancer dataset. The container does not have internet access and cannot access GitHub “to avoid the possibility of models scraping code diffs or pull request details,” they explained.
The team identified 764 individual contributor tasks, totaling about $414,775, ranging from 15-minute bug fixes to weeklong feature requests. The management tasks, which included reviewing freelancer proposals and job postings, would pay out the remaining $585,225. All of the tasks came from Expensify, the expensing platform. The researchers generated prompts based on the task title and description and a snapshot of the codebase. If there were additional proposals to resolve the issue, “we also generated a management task using the issue description and list of proposals,” they explained. From here, the researchers moved to end-to-end test development. They wrote Playwright tests for each task that apply these generated patches, which were then “triple-verified” by professional software engineers. “Tests simulate real-world user flows, such as logging into the application, performing complex actions (making financial transactions) and verifying that the model’s solution works as expected,” the paper explains.

Test results

After running the test, the researchers found that none of the models earned the full $1 million value of the tasks. Claude 3.5 Sonnet, the best-performing model, earned only $208,050 and resolved 26.2% of the individual contributor issues. However, the researchers point out, “the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment.” The models performed well across most individual contributor tasks, with Claude 3.5 Sonnet performing best, followed by o1 and GPT-4o. “Agents excel at localizing, but fail to root cause, resulting in partial or flawed solutions,” the report explains. “Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions — often far faster than a human would.
However, they often exhibit a limited understanding of how the issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive. We rarely find cases where the agent aims to reproduce the issue or fails due to not finding the right file or location to edit.” Interestingly, the models all performed better on manager tasks that required reasoning to evaluate technical understanding. These benchmark tests showed that AI models can solve some “low-level” coding problems but can’t replace “low-level” software engineers yet. The models still took time, often made mistakes and couldn’t chase a bug around to find the root cause of coding problems. Many “low-level” engineers still do the job better, but the researchers said this may not be the case for very long.
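The dollar-weighted scoring described above reduces to summing the payouts of tasks whose end-to-end tests pass. A minimal sketch with invented task data:

```python
# Sketch of SWE-Lancer-style scoring: a model "earns" a task's Upwork
# payout only if the verification tests for its patch pass. The task
# records below are invented for illustration.

def score(tasks: list[dict]) -> float:
    """Sum payouts over tasks whose verification tests passed."""
    return sum(t["payout"] for t in tasks if t["tests_passed"])

results = [
    {"id": "fix-login-bug", "payout": 250.0, "tests_passed": True},
    {"id": "add-export-feature", "payout": 4000.0, "tests_passed": False},
    {"id": "fix-currency-rounding", "payout": 750.0, "tests_passed": True},
]
score(results)  # 1000.0
```

Weighting by payout rather than task count is what makes the $208,050 figure meaningful: resolving many cheap bug fixes while failing the expensive feature work still leaves most of the $1 million on the table.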


Together AI’s $305M bet: Reasoning models like DeepSeek-R1 are increasing, not decreasing, GPU demand

When DeepSeek-R1 first emerged, the prevailing fear that shook the industry was that advanced reasoning could be achieved with less infrastructure. As it turns out, that’s not necessarily the case. At least according to Together AI, the rise of DeepSeek and open-source reasoning has had the exact opposite effect: Instead of reducing the need for infrastructure, it is increasing it. That increased demand has helped fuel the growth of Together AI’s platform and business. Today the company announced a $305 million series B round of funding, led by General Catalyst and co-led by Prosperity7. Together AI first emerged in 2023 with an aim to simplify enterprise use of open-source large language models (LLMs). The company expanded in 2024 with the Together enterprise platform, which enables AI deployment in virtual private cloud (VPC) and on-premises environments. In 2025, Together AI is growing its platform once again with reasoning clusters and agentic AI capabilities. The company claims that its AI deployment platform has more than 450,000 registered developers and that the business has grown 6X overall year-over-year. The company’s customers include enterprises as well as AI startups such as Krea AI, Captions and Pika Labs. “We are now serving models across all modalities: language and reasoning and images and audio and video,” Vipul Prakash, CEO of Together AI, told VentureBeat.

The huge impact DeepSeek-R1 is having on AI infrastructure demand

DeepSeek-R1 was hugely disruptive when it first debuted, for a number of reasons — one of which was the implication that a leading-edge open-source reasoning model could be built and deployed with less infrastructure than a proprietary model. However, Prakash explained, Together AI has grown its infrastructure in part to support increased demand for DeepSeek-R1-related workloads.
“It’s a fairly expensive model to run inference on,” he said. “It has 671 billion parameters and you need to distribute it over multiple servers. And because the quality is higher, there’s generally more demand on the top end, which means you need more capacity.” Additionally, he noted that DeepSeek-R1 generally has longer-lived requests that can last two to three minutes. Tremendous user demand for DeepSeek-R1 is further driving the need for more infrastructure. To meet that demand, Together AI has rolled out a service it calls “reasoning clusters” that provision dedicated capacity, ranging from 128 to 2,000 chips, to run models at the best possible performance.

How Together AI is helping organizations use reasoning AI

There are a number of specific areas where Together AI is seeing usage of reasoning models. These include:

- Coding agents: Reasoning models help break down larger problems into steps.
- Reducing hallucinations: The reasoning process helps verify the outputs of models, which is important for applications where accuracy is crucial.
- Improving non-reasoning models: Customers are distilling and improving the quality of non-reasoning models.
- Enabling self-improvement: Reinforcement learning with reasoning models allows models to recursively self-improve without relying on large amounts of human-labeled data.

Agentic AI is also driving increased demand for AI infrastructure

Together AI is also seeing increased infrastructure demand as its users embrace agentic AI. Prakash explained that agentic workflows, where a single user request results in thousands of API calls to complete a task, are putting more compute demand on Together AI’s infrastructure. To help support agentic AI workloads, Together AI recently acquired CodeSandbox, whose technology provides lightweight, fast-booting virtual machines (VMs) to execute arbitrary, secure code within the Together AI cloud, where the language models also reside.
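Hosted models like DeepSeek-R1 are typically consumed through an OpenAI-compatible chat-completions API, which Together AI exposes. A minimal sketch of building such a request; the endpoint path and model identifier below are assumptions for illustration (check the provider's documentation for exact names), and no network call is made.

```python
import json

# Sketch of an OpenAI-style chat-completions request for a hosted
# reasoning model. Endpoint and model id are assumed, not confirmed.

ENDPOINT = "https://api.together.xyz/v1/chat/completions"  # assumed path

def build_request(prompt: str, max_tokens: int = 2048) -> dict:
    """Build a chat-completions payload; reasoning models produce long
    traces, so the token budget is kept generous."""
    return {
        "model": "deepseek-ai/DeepSeek-R1",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = json.dumps(build_request("Summarize the tradeoffs of MoE models."))
```

An agentic workflow of the kind Prakash describes would issue thousands of such requests per user task, which is why co-locating code-execution VMs with the models matters for latency.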
This allows Together AI to reduce the latency between the agentic code and the models that need to be called, improving the performance of agentic workflows.

Nvidia Blackwell is already having an impact

All AI platforms are facing increased demand. That’s one of the reasons Nvidia keeps rolling out new silicon that provides more performance. Nvidia’s latest chip is the Blackwell GPU, which is now being deployed at Together AI. Prakash said Nvidia Blackwell chips cost around 25% more than the previous generation but provide 2X the performance. The GB200 platform with Blackwell chips is particularly well suited for training and inference of mixture-of-experts (MoE) models, which are trained across multiple InfiniBand-connected servers. He noted that Blackwell chips are also expected to provide a bigger performance boost for inference of larger models, compared with smaller models.

The competitive landscape of agentic AI

The market for AI infrastructure platforms is fiercely competitive. Together AI faces competition from both established cloud providers and AI infrastructure startups. All the hyperscalers, including Microsoft, AWS and Google, have AI platforms. There is also an emerging category of AI-focused players, such as Groq and SambaNova, that are all aiming for a slice of the lucrative market. Together AI has a full-stack offering, including GPU infrastructure with software platform layers on top. This allows customers to easily build with open-source models or develop their own models on the Together AI platform. The company also focuses on research, developing optimizations and accelerated runtimes for both inference and training. “For instance, we serve the DeepSeek-R1 model at 85 tokens per second and Azure serves it at 7 tokens per second,” said Prakash. “There is a fairly widening gap in the performance and cost that we can provide to our customers.”
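Prakash's 85 vs. 7 tokens-per-second comparison is easy to sanity-check. For the multi-minute reasoning traces R1 produces, the gap is the difference between a response in under half a minute and one closer to five minutes (the 2,000-token trace length is an illustrative assumption):

```python
# Back-of-envelope check on the cited throughput gap: wall-clock time
# to stream a 2,000-token reasoning trace at each rate.

def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate `tokens` at a steady decode rate."""
    return tokens / tokens_per_second

fast = seconds_for(2000, 85)  # ~23.5 seconds
slow = seconds_for(2000, 7)   # ~285.7 seconds
```

For agentic workloads that chain thousands of such calls, a roughly 12x per-request speedup compounds into the "fairly widening gap" Prakash claims.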
