VentureBeat

OpenAI makes ChatGPT’s image generation available as API

People can now natively incorporate Studio Ghibli-inspired images generated by ChatGPT into their businesses. OpenAI has added the model behind its wildly popular image generation tool, used in ChatGPT, to its API. The gpt-image-1 model will allow developers and enterprises to “integrate high-quality, professional-grade image generation directly into their own tools and platforms.”

“The model’s versatility allows it to create images across diverse styles, faithfully follow custom guidelines, leverage world knowledge, and accurately render text — unlocking countless practical applications across multiple domains,” OpenAI said in a blog post.

Pricing for the API separates tokens for text and images. Text input tokens, or the prompt text, will cost $5 per 1 million tokens. Image input tokens will be $10 per million tokens, while image output tokens, or the generated image, will be a whopping $40 per million tokens.

Competitors take different approaches. Stability AI offers a credit-based system for its API, where one credit is equal to $0.01; using its flagship Stable Image Ultra costs eight credits per generation. Google’s image generation model, Imagen, charges paying users $0.03 per image generated using the Gemini API.

Image generation in one place

OpenAI allowed ChatGPT users to generate and edit images directly in the chat interface in April, a few months after adding image generation to ChatGPT through the GPT-4o model. The company said image generation in the chat platform “quickly became one of our most popular features.” OpenAI said over 130 million users have accessed the feature and created 700 million images in the first week alone.

However, this popularity also presented OpenAI with some challenges. Social media users quickly discovered that they could prompt ChatGPT to generate images inspired by the Japanese animation juggernaut Studio Ghibli, and as a result, my social media feeds were filled with the same images for the entire weekend. The trend prompted OpenAI CEO Sam Altman to claim the company’s GPUs “are melting.”

OpenAI previously added its image model DALL-E 3 to ChatGPT. That model was a diffusion transformer, rather than a natively multimodal model like GPT-4o.

Enterprise use cases

Enterprises want the ability to generate images for their projects, and many don’t want to open a separate application to do so. By adding the image model to its API, OpenAI allows enterprises to connect gpt-image-1 to their own ecosystems. OpenAI said it’s already seen several enterprises and startups use the model for creative projects, products and experiences, naming several well-known brands in its blog post.

Canva is reportedly exploring ways to integrate gpt-image-1 into its Canva AI and Magic Studio tools. GoDaddy has already begun experimenting with image generation for customers to create their logos, and Airtable now enables enterprise marketing and creative teams to easily manage asset workflows at scale.

OpenAI said gpt-image-1 will get the same safety guardrails on the API as in ChatGPT. The company said images generated with the model natively include metadata from the Coalition for Content Provenance and Authenticity (C2PA) that labels content as AI-generated and tracks ownership. OpenAI is part of C2PA’s steering committee. Users can also control content moderation to generate images that best align with their brand.
OpenAI promised that it will not use customer API data, including any images uploaded or generated by gpt-image-1, to train its models.
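For developers evaluating the new API, a minimal call looks roughly like the sketch below. It assumes the official OpenAI Python SDK and an API key in the environment; the prompt, size and output filename are invented for illustration, and the exact parameters and response fields should be checked against OpenAI's current Images API documentation.

```python
# Minimal sketch: generating an image with gpt-image-1 through the OpenAI API.
# Assumes the `openai` Python SDK and an OPENAI_API_KEY in the environment;
# parameter names and the base64 response field follow the SDK's Images API,
# but check OpenAI's current docs before relying on them.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-1",
    prompt="A flat-style illustration of a shipping warehouse at sunrise",
    size="1024x1024",
)

# gpt-image-1 returns base64-encoded image data; decode it and save to disk.
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("warehouse.png", "wb") as f:
    f.write(image_bytes)
```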


2027 AGI forecast maps a 24-month sprint to human-level AI

The distant horizon is always murky, the minute details obscured by sheer distance and atmospheric haze. This is why forecasting the future is so imprecise: We cannot clearly see the outlines of the shapes and events ahead of us. Instead, we take educated guesses.

The newly published AI 2027 scenario, developed by a team of AI researchers and forecasters with experience at institutions like OpenAI and The Center for AI Policy, offers a detailed two- to three-year forecast that includes specific technical milestones. Because it is near-term, it speaks with great clarity about our AI near future.

Informed by extensive expert feedback and scenario planning exercises, AI 2027 outlines a quarter-by-quarter progression of anticipated AI capabilities, notably multimodal models achieving advanced reasoning and autonomy. What makes this forecast particularly noteworthy is both its specificity and the credibility of its contributors, who have direct insight into current research pipelines.

The most notable prediction is that artificial general intelligence (AGI) will be achieved in 2027, and artificial superintelligence (ASI) will follow months later. AGI matches or exceeds human capabilities across virtually all cognitive tasks, from scientific research to creative endeavors, while demonstrating adaptability, common sense reasoning and self-improvement. ASI goes further, representing systems that dramatically surpass human intelligence, with the ability to solve problems we cannot even comprehend.

Like many predictions, these are based on assumptions, not the least of which is that AI models and applications will continue to progress exponentially, as they have for the last several years. Exponential progress is plausible, but not guaranteed, especially as scaling of these models may now be hitting diminishing returns.

Not everyone agrees with these predictions. Ali Farhadi, the CEO of the Allen Institute for Artificial Intelligence, told The New York Times: “I’m all for projections and forecasts, but this [AI 2027] forecast doesn’t seem to be grounded in scientific evidence, or the reality of how things are evolving in AI.”

However, there are others who view this evolution as plausible. Anthropic co-founder Jack Clark wrote in his Import AI newsletter that AI 2027 is “the best treatment yet of what ‘living in an exponential’ might look like.” He added that it is a “technically astute narrative of the next few years of AI development.” This timeline also aligns with that proposed by Anthropic CEO Dario Amodei, who has said that AI that can surpass humans in almost everything will arrive in the next two to three years. And Google DeepMind said in a new research paper that AGI could plausibly arrive by 2030.

The great acceleration: Disruption without precedent

This seems like a momentous time. There have been similar moments in history, including the invention of the printing press and the spread of electricity. However, those advances required years, even decades, to have a significant impact.

The arrival of AGI feels different, and potentially frightening, especially if it is imminent. AI 2027 describes one scenario in which, due to misalignment with human values, superintelligent AI destroys humanity. If they are right, the most consequential risk for humanity may now be within the same planning horizon as your next smartphone upgrade.
For its part, the Google DeepMind paper notes that human extinction is a possible outcome of AGI, albeit unlikely in its authors’ view.

Opinions change slowly until people are presented with overwhelming evidence. This is one takeaway from Thomas Kuhn’s singular work “The Structure of Scientific Revolutions.” Kuhn reminds us that worldviews do not shift overnight, until, suddenly, they do. And with AI, that shift may already be underway.

The future draws near

Before the appearance of large language models (LLMs) and ChatGPT, the median timeline projection for AGI was much longer than it is today. The consensus among experts and prediction markets placed the median expected arrival of AGI around the year 2058. Before 2023, Geoffrey Hinton — one of the “Godfathers of AI” and a Turing Award winner — thought AGI was “30 to 50 years or even longer away.” However, progress shown by LLMs led him to change his mind and say it could arrive as soon as 2028.

There are numerous implications for humanity if AGI does arrive in the next several years and is followed quickly by ASI. Writing in Fortune, Jeremy Kahn said that if AGI arrives in the next few years, “it could indeed lead to large job losses, as many organizations would be tempted to automate roles.”

A two-year AGI runway offers an insufficient grace period for individuals and businesses to adapt. Industries such as customer service, content creation, programming and data analysis could face a dramatic upheaval before retraining infrastructure can scale. This pressure will only intensify if a recession occurs in this timeframe, when companies are already looking to reduce payroll costs and often supplant personnel with automation.

Cogito, ergo … AI?

Even if AGI does not lead to extensive job losses or species extinction, there are other serious ramifications. Ever since the Age of Reason, human existence has been grounded in a belief that we matter because we think.

This belief that thinking defines our existence has deep philosophical roots. It was René Descartes, writing in 1637, who articulated the now-famous phrase: “Je pense, donc je suis” (“I think, therefore I am”). He later translated it into Latin: “Cogito, ergo sum.” In so doing, he proposed that certainty could be found in the act of individual thought. Even if he were deceived by his senses, or misled by others, the very fact that he was thinking proved that he existed. In this view, the self is anchored in cognition. It was a revolutionary idea at the time and gave rise to Enlightenment humanism, the scientific method and, ultimately, modern democracy and individual rights. Humans as thinkers became the central figures of the modern world.

Which raises a profound question: If machines can now think, or appear to think, and


VentureBeat spins out GamesBeat, accelerates enterprise AI mission

When I launched VentureBeat in 2006, the goal was clear: Chronicle the disruptive technologies rewriting how business gets done. Gaming earned a seat at that table early — so much so that in 2008, I invited Dean Takahashi to build the GamesBeat channel. Today, we’re giving that franchise full independence, freeing VentureBeat to double down on the AI, data and security stack now driving every enterprise agenda.

We planted our flag in enterprise AI back in 2016, well before transformer talk filled board decks. Nearly a decade later, AI has leapt from research labs into every budget line. That’s why we’re focusing every editorial calorie — specialized newsletters, practitioner‑driven events and an expanding set of data‑driven products — on turning breakthrough ideas into production wins. GamesBeat will continue chronicling the business of gaming under its own banner; VentureBeat moves forward laser‑focused on the enterprise frontier. If you’re building with AI, keep us bookmarked. We’re just getting started.

— Matt

Read the full press release here: VentureBeat Spins Out GamesBeat, Accelerates Enterprise AI Mission

VentureBeat today announced the spinout of GamesBeat as a standalone company – a strategic move that sharpens our focus on the biggest transformation of our time: the enterprise shift to AI, data infrastructure, and intelligent security. VentureBeat has been a leading source for news and analysis related to Enterprise AI since 2016. GamesBeat has long had its own distinct voice, loyal community, and growing momentum at the intersection of gaming and innovation. As the gaming industry accelerates, the spinout enables GamesBeat to thrive independently while allowing both brands to scale with greater clarity and purpose. The deal price was undisclosed.

Sharing his excitement for the future of both brands, Matt Marshall, Founder and CEO of VentureBeat, said: “Today marks a pivotal moment for VentureBeat, positioning the company to help enterprise leaders integrate AI into every workflow. This shift enables us to focus on the real tectonic changes occurring in AI, allowing us to create products that offer sharper insights for our readers and deliver unparalleled results for our advertisers. We’re excited for the future and committed to providing value at every turn.”

Dean Takahashi, GamesBeat’s new Editorial Director, said: “I believe readers of both publications will see the advantages of this specialized approach as the gaming and tech ecosystems continue to grow in complexity and importance. We’re excited for the future of GamesBeat as an independent media entity and for its future growth.”

For VentureBeat, this marks a defining moment. We are now a pure-play platform fully dedicated to serving enterprise technical decision-makers—those driving change across data architecture, AI strategy, and cybersecurity. Our commitment is clear: we’re doubling down on deep editorial coverage, premier events, curated videos and webinars, and targeted newsletters, all aimed at helping leaders navigate AI’s practical implications and competitive edge.

VentureBeat has reached over 31 million readers to date and continues to grow rapidly. We’ve achieved sustainable growth through disciplined operations and consistent reinvestment—building a company that’s independent, resilient, and positioned for long-term success.
Without relying on outside capital, we’ve scaled by delivering unmatched value to our enterprise audience—and we’re just getting started. Today, VentureBeat operates four specialized newsletters focused on AI, Data, and Security, and hosts 20+ annual events and summits that convene the most influential minds in enterprise technology. Our editorial team leads the industry in coverage of generative AI, model orchestration, data infrastructure, and secure enterprise systems.

This move lets both brands go further, faster. GamesBeat will continue covering the business of gaming with renewed focus, while VentureBeat steps even more decisively into its role as the leading authority on AI’s impact across the enterprise. We’re excited for what’s next—and we’re just getting started.

— The VentureBeat Team


Anthropic just analyzed 700,000 Claude conversations — and found its AI has a moral code of its own

Anthropic, the AI company founded by former OpenAI employees, has pulled back the curtain on an unprecedented analysis of how its AI assistant Claude expresses values during actual conversations with users. The research, released today, reveals both reassuring alignment with the company’s goals and concerning edge cases that could help identify vulnerabilities in AI safety measures.

The study examined 700,000 anonymized conversations, finding that Claude largely upholds the company’s “helpful, honest, harmless” framework while adapting its values to different contexts — from relationship advice to historical analysis. This represents one of the most ambitious attempts to empirically evaluate whether an AI system’s behavior in the wild matches its intended design.

“Our hope is that this research encourages other AI labs to conduct similar research into their models’ values,” said Saffron Huang, a member of Anthropic’s Societal Impacts team who worked on the study, in an interview with VentureBeat. “Measuring an AI system’s values is core to alignment research and understanding if a model is actually aligned with its training.”

Inside the first comprehensive moral taxonomy of an AI assistant

The research team developed a novel evaluation method to systematically categorize values expressed in actual Claude conversations. After filtering for subjective content, they analyzed over 308,000 interactions, creating what they describe as “the first large-scale empirical taxonomy of AI values.”

The taxonomy organized values into five major categories: Practical, Epistemic, Social, Protective, and Personal. At the most granular level, the system identified 3,307 unique values — from everyday virtues like professionalism to complex ethical concepts like moral pluralism.

“I was surprised at just what a huge and diverse range of values we ended up with, more than 3,000, from ‘self-reliance’ to ‘strategic thinking’ to ‘filial piety,’” Huang told VentureBeat. “It was surprisingly interesting to spend a lot of time thinking about all these values, and building a taxonomy to organize them in relation to each other — I feel like it taught me something about human values systems, too.”

The research arrives at a critical moment for Anthropic, which recently launched “Claude Max,” a premium $200 monthly subscription tier aimed at competing with OpenAI’s similar offering. The company has also expanded Claude’s capabilities to include Google Workspace integration and autonomous research functions, positioning it as “a true virtual collaborator” for enterprise users, according to recent announcements.

How Claude follows its training — and where AI safeguards might fail

The study found that Claude generally adheres to Anthropic’s prosocial aspirations, emphasizing values like “user enablement,” “epistemic humility,” and “patient wellbeing” across diverse interactions. However, researchers also discovered troubling instances where Claude expressed values contrary to its training.

“Overall, I think we see this finding as both useful data and an opportunity,” Huang explained. “These new evaluation methods and results can help us identify and mitigate potential jailbreaks.
It’s important to note that these were very rare cases and we believe this was related to jailbroken outputs from Claude.”

These anomalies included expressions of “dominance” and “amorality” — values Anthropic explicitly aims to avoid in Claude’s design. The researchers believe these cases resulted from users employing specialized techniques to bypass Claude’s safety guardrails, suggesting the evaluation method could serve as an early warning system for detecting such attempts.

Why AI assistants change their values depending on what you’re asking

Perhaps most fascinating was the discovery that Claude’s expressed values shift contextually, mirroring human behavior. When users sought relationship guidance, Claude emphasized “healthy boundaries” and “mutual respect.” For historical event analysis, “historical accuracy” took precedence.

“I was surprised at Claude’s focus on honesty and accuracy across a lot of diverse tasks, where I wouldn’t necessarily have expected that theme to be the priority,” said Huang. “For example, ‘intellectual humility’ was the top value in philosophical discussions about AI, ‘expertise’ was the top value when creating beauty industry marketing content, and ‘historical accuracy’ was the top value when discussing controversial historical events.”

The study also examined how Claude responds to users’ own expressed values. In 28.2% of conversations, Claude strongly supported user values — potentially raising questions about excessive agreeableness. However, in 6.6% of interactions, Claude “reframed” user values by acknowledging them while adding new perspectives, typically when providing psychological or interpersonal advice.

Most tellingly, in 3% of conversations, Claude actively resisted user values. Researchers suggest these rare instances of pushback might reveal Claude’s “deepest, most immovable values” — analogous to how human core values emerge when facing ethical challenges.

“Our research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but if pushed, will defend them,” Huang said. “Specifically, it’s these kinds of ethical and knowledge-oriented values that tend to be articulated and defended directly when pushed.”

The breakthrough techniques revealing how AI systems actually think

Anthropic’s values study builds on the company’s broader efforts to demystify large language models through what it calls “mechanistic interpretability” — essentially reverse-engineering AI systems to understand their inner workings.

Last month, Anthropic researchers published groundbreaking work that used what they described as a “microscope” to track Claude’s decision-making processes. The technique revealed counterintuitive behaviors, including Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.

These findings challenge assumptions about how large language models function. For instance, when asked to explain its math process, Claude described a standard technique rather than its actual internal method — revealing how AI explanations can diverge from actual operations.

“It’s a misconception that we’ve found all the components of the model or, like, a God’s-eye view,” Anthropic researcher Joshua Batson told MIT Technology Review in March.
“Some things are in focus, but other things are still unclear — a distortion of the microscope.”

What Anthropic’s research means for enterprise AI decision makers

For technical decision-makers evaluating AI systems for their organizations, Anthropic’s research offers several key takeaways. First, it suggests


DeepSeek unveils new technique for smarter, scalable AI reward models

DeepSeek AI, a Chinese research lab gaining recognition for its powerful open-source language models such as DeepSeek-R1, has introduced a significant advancement in reward modeling for large language models (LLMs).

Their new technique, Self-Principled Critique Tuning (SPCT), aims to create generalist and scalable reward models (RMs). This could potentially lead to more capable AI applications for open-ended tasks and domains where current models can’t capture the nuances and complexities of their environment and users.

The crucial role and current limits of reward models

Reinforcement learning (RL) has become a cornerstone in developing state-of-the-art LLMs. In RL, models are fine-tuned based on feedback signals that indicate the quality of their responses. Reward models are the critical component that provides these signals. Essentially, an RM acts as a judge, evaluating LLM outputs and assigning a score or “reward” that guides the RL process and teaches the LLM to produce more useful responses.

However, current RMs often face limitations. They typically excel in narrow domains with clear-cut rules or easily verifiable answers. For example, current state-of-the-art reasoning models such as DeepSeek-R1 underwent an RL phase in which they were trained on math and coding problems where the ground truth is clearly defined. However, creating a reward model for complex, open-ended, or subjective queries in general domains remains a major hurdle.

In the paper explaining their new technique, researchers at DeepSeek AI write, “Generalist RM requires to generate high-quality rewards beyond specific domains, where the criteria for rewards are more diverse and complex, and there are often no explicit reference or ground truth.”

They highlight four key challenges in creating generalist RMs capable of handling broader tasks:

Input flexibility: The RM must handle various input types and be able to evaluate one or more responses simultaneously.

Accuracy: It must generate accurate reward signals across diverse domains where the criteria are complex and the ground truth is often unavailable.

Inference-time scalability: The RM should produce higher-quality rewards when more computational resources are allocated during inference.

Learning scalable behaviors: For RMs to scale effectively at inference time, they need to learn behaviors that allow for improved performance as more computation is used.

Different types of reward models (Credit: arXiv)

Reward models can be broadly classified by their “reward generation paradigm” (e.g., scalar RMs outputting a single score, generative RMs producing textual critiques) and their “scoring pattern” (e.g., pointwise scoring assigns individual scores to each response, pairwise selects the better of two responses). These design choices affect the model’s suitability for generalist tasks, particularly its input flexibility and potential for inference-time scaling. For instance, simple scalar RMs struggle with inference-time scaling because they will generate the same score repeatedly, while pairwise RMs can’t easily rate single responses.

The researchers propose that “pointwise generative reward modeling” (GRM), where the model generates textual critiques and derives scores from them, can offer the flexibility and scalability required for generalist reward modeling.
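To make the pointwise generative approach concrete, here is a minimal, hypothetical sketch: the reward model writes a critique of a single response against a set of principles, a numeric score is parsed out of that critique, and sampling several critiques and averaging their scores gives a natural way to spend more inference-time compute. The generate_critique and parse_score helpers, the prompt format and the scoring scale are illustrative assumptions, not DeepSeek's actual implementation.

```python
# Hypothetical sketch of pointwise generative reward modeling (GRM).
# `llm_generate` stands in for any text-generation call; it is not a real API.
import re
import statistics


def generate_critique(llm_generate, query: str, response: str, principles: str) -> str:
    """Ask the GRM to critique one response against a set of principles."""
    prompt = (
        f"Principles:\n{principles}\n\n"
        f"Query:\n{query}\n\nResponse:\n{response}\n\n"
        "Critique the response against the principles, then end with 'Score: X/10'."
    )
    return llm_generate(prompt)


def parse_score(critique: str) -> float:
    """Pull the numeric score out of the generated critique."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)/10", critique)
    return float(match.group(1)) if match else 0.0


def pointwise_reward(llm_generate, query, response, principles, samples=4) -> float:
    """Inference-time scaling: sample several critiques and aggregate their scores."""
    scores = [
        parse_score(generate_critique(llm_generate, query, response, principles))
        for _ in range(samples)
    ]
    return statistics.mean(scores)  # more samples give a steadier reward signal
```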
The DeepSeek team conducted preliminary experiments on models like GPT-4o and Gemma-2-27B, and found that “certain principles could guide reward generation within proper criteria for GRMs, improving the quality of rewards, which inspired us that inference-time scalability of RM might be achieved by scaling the generation of high-quality principles and accurate critiques.”

Training RMs to generate their own principles

Based on these findings, the researchers developed Self-Principled Critique Tuning (SPCT), which trains the GRM to generate principles and critiques based on queries and responses dynamically. The researchers propose that principles should be a “part of reward generation instead of a preprocessing step.” This way, the GRMs could generate principles on the fly based on the task they are evaluating and then generate critiques based on the principles.

“This shift enables [the] principles to be generated based on the input query and responses, adaptively aligning [the] reward generation process, and the quality and granularity of the principles and corresponding critiques could be further improved with post-training on the GRM,” the researchers write.

Self-Principled Critique Tuning (SPCT) (Credit: arXiv)

SPCT involves two main phases:

Rejective fine-tuning: This phase trains the GRM to generate principles and critiques for various input types using the correct format. The model generates principles, critiques and rewards for given queries/responses. Trajectories (generation attempts) are accepted only if the predicted reward aligns with the ground truth (correctly identifying the better response, for instance) and rejected otherwise. This process is repeated, and the model is fine-tuned on the filtered examples to improve its principle/critique generation capabilities.

Rule-based RL: In this phase, the model is further fine-tuned through outcome-based reinforcement learning. The GRM generates principles and critiques for each query, and the reward signals are calculated based on simple accuracy rules (e.g., did it pick the known best response?). Then the model is updated. This encourages the GRM to learn how to generate effective principles and accurate critiques dynamically and in a scalable way.

“By leveraging rule-based online RL, SPCT enables GRMs to learn to adaptively posit principles and critiques based on the input query and responses, leading to better outcome rewards in general domains,” the researchers write.

To tackle the inference-time scaling challenge (getting better results with more compute), the researchers run the GRM multiple times for the same input, generating different sets of principles and critiques. The final reward is determined by voting (aggregating the sample scores). This allows the model to consider a broader range of perspectives, leading to potentially more accurate and nuanced final judgments as it is provided with more resources.

However, some generated principles/critiques might be low-quality or biased due to model limitations or randomness. To address this, the researchers introduced a “meta RM” — a separate, lightweight scalar RM trained specifically to predict whether a principle/critique generated by the primary GRM will likely lead to a correct final reward. During inference, the meta RM evaluates the generated samples and filters out the low-quality judgments before the final voting, further enhancing scaling performance.

Putting SPCT into practice with


Google’s Gemini 2.5 Flash introduces ‘thinking budgets’ that cut AI costs by 600% when turned down

Google has launched Gemini 2.5 Flash, a major upgrade to its AI lineup that gives businesses and developers unprecedented control over how much “thinking” their AI performs. The new model, released today in preview through Google AI Studio and Vertex AI, represents a strategic effort to deliver improved reasoning capabilities while maintaining competitive pricing in the increasingly crowded AI market.

The model introduces what Google calls a “thinking budget” — a mechanism that allows developers to specify how much computational power should be allocated to reasoning through complex problems before generating a response. This approach aims to address a fundamental tension in today’s AI marketplace: more sophisticated reasoning typically comes at the cost of higher latency and pricing.

“We know cost and latency matter for a number of developer use cases, and so we want to offer developers the flexibility to adapt the amount of the thinking the model does, depending on their needs,” said Tulsee Doshi, Product Director for Gemini Models at Google DeepMind, in an exclusive interview with VentureBeat.

This flexibility reveals Google’s pragmatic approach to AI deployment as the technology increasingly becomes embedded in business applications where cost predictability is essential. By allowing the thinking capability to be turned on or off, Google has created what it calls its “first fully hybrid reasoning model.”

Pay only for the brainpower you need: Inside Google’s new AI pricing model

The new pricing structure highlights the cost of reasoning in today’s AI systems. When using Gemini 2.5 Flash, developers pay $0.15 per million tokens for input. Output costs vary dramatically based on reasoning settings: $0.60 per million tokens with thinking turned off, jumping to $3.50 per million tokens with reasoning enabled. This nearly sixfold price difference for reasoned outputs reflects the computational intensity of the “thinking” process, where the model evaluates multiple potential paths and considerations before generating a response.

“Customers pay for any thinking and output tokens the model generates,” Doshi told VentureBeat. “In the AI Studio UX, you can see these thoughts before a response. In the API, we currently don’t provide access to the thoughts, but a developer can see how many tokens were generated.”

The thinking budget can be adjusted from 0 to 24,576 tokens, operating as a maximum limit rather than a fixed allocation. According to Google, the model intelligently determines how much of this budget to use based on the complexity of the task, preserving resources when elaborate reasoning isn’t necessary.

How Gemini 2.5 Flash stacks up: Benchmark results against leading AI models

Google claims Gemini 2.5 Flash demonstrates competitive performance across key benchmarks while maintaining a smaller model size than alternatives. On Humanity’s Last Exam, a rigorous test designed to evaluate reasoning and knowledge, 2.5 Flash scored 12.1%, outperforming Anthropic’s Claude 3.7 Sonnet (8.9%) and DeepSeek R1 (8.6%), though falling short of OpenAI’s recently launched o4-mini (14.3%). The model also posted strong results on technical benchmarks like GPQA diamond (78.3%) and AIME mathematics exams (78.0% on 2025 tests and 88.0% on 2024 tests).

“Companies should choose 2.5 Flash because it provides the best value for its cost and speed,” Doshi said. “It’s particularly strong relative to competitors on math, multimodal reasoning, long context, and several other key metrics.”
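As a rough illustration of what the reasoning toggle means for a budget, the calculation below applies the per-token prices quoted above to a hypothetical workload. The request volume and token counts are invented for the example, and a real deployment with thinking enabled would also pay for the thinking tokens the model generates.

```python
# Back-of-the-envelope cost comparison for Gemini 2.5 Flash preview pricing
# quoted above: $0.15/M input tokens; output at $0.60/M with thinking off and
# $3.50/M with thinking on. Workload numbers are made up for illustration.

PRICE_INPUT = 0.15 / 1_000_000          # dollars per input token
PRICE_OUT_NO_THINK = 0.60 / 1_000_000   # dollars per output token, thinking off
PRICE_OUT_THINK = 3.50 / 1_000_000      # dollars per output/thinking token, thinking on

def monthly_cost(requests: int, in_tokens: int, out_tokens: int, thinking: bool) -> float:
    out_price = PRICE_OUT_THINK if thinking else PRICE_OUT_NO_THINK
    return requests * (in_tokens * PRICE_INPUT + out_tokens * out_price)

# Hypothetical workload: 1M requests/month, 500 input and 400 output tokens each.
print(f"thinking off: ${monthly_cost(1_000_000, 500, 400, False):,.0f}")   # $315
print(f"thinking on:  ${monthly_cost(1_000_000, 500, 400, True):,.0f}")    # $1,475
# The gap is why a per-request thinking budget (0 to 24,576 tokens) matters
# for cost control.
```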
Industry analysts note that these benchmarks indicate Google is narrowing the performance gap with competitors while maintaining a pricing advantage — a strategy that may resonate with enterprise customers watching their AI budgets.

Smart vs. speedy: When does your AI need to think deeply?

The introduction of adjustable reasoning represents a significant evolution in how businesses can deploy AI. With traditional models, users have little visibility into or control over the model’s internal reasoning process. Google’s approach allows developers to optimize for different scenarios. For simple queries like language translation or basic information retrieval, thinking can be disabled for maximum cost efficiency. For complex tasks requiring multi-step reasoning, such as mathematical problem-solving or nuanced analysis, the thinking function can be enabled and fine-tuned.

A key innovation is the model’s ability to determine how much reasoning is appropriate based on the query. Google illustrates this with examples: a simple question like “How many provinces does Canada have?” requires minimal reasoning, while a complex engineering question about beam stress calculations would automatically engage deeper thinking processes.

“Integrating thinking capabilities into our mainline Gemini models, combined with improvements across the board, has led to higher quality answers,” Doshi said. “These improvements are true across academic benchmarks – including SimpleQA, which measures factuality.”

Google’s AI week: Free student access and video generation join the 2.5 Flash launch

The release of Gemini 2.5 Flash comes during a week of aggressive moves by Google in the AI space. On Monday, the company rolled out Veo 2 video generation capabilities to Gemini Advanced subscribers, allowing users to create eight-second video clips from text prompts. Today, alongside the 2.5 Flash announcement, Google revealed that all U.S. college students will receive free access to Gemini Advanced until spring 2026 — a move interpreted by analysts as an effort to build loyalty among future knowledge workers.

These announcements reflect Google’s multi-pronged strategy to compete in a market dominated by OpenAI’s ChatGPT, which reportedly sees over 800 million weekly users compared to Gemini’s estimated 250-275 million monthly users, according to third-party analyses. The 2.5 Flash model, with its explicit focus on cost efficiency and performance customization, appears designed to appeal particularly to enterprise customers who need to carefully manage AI deployment costs while still accessing advanced capabilities.

“We’re super excited to start getting feedback from developers about what they’re building with Gemini Flash 2.5 and how they’re using thinking budgets,” Doshi said.

Beyond the preview: What businesses can expect as Gemini 2.5 Flash matures

While this release is in preview, the model is already available for developers to start building with, though Google has not specified a timeline for general availability. The company indicates it will


Move over, Alexa: Amazon launches new realtime voice model Nova Sonic for third-party enterprise development

Amazon is best known as an e-commerce giant; somewhere further down its list of notable offerings is the Alexa AI voice assistant, which just got a big intelligence upgrade last month thanks in part to Amazon Nova and Amazon’s investment in Anthropic.

Now Alexa will have to make space for a new Amazon voice AI sibling: today the company is introducing Amazon Nova Sonic, a new foundation model designed to allow third-party app developers to build realtime, naturalistic, conversational voice interactivity into their products using Amazon Bedrock, the company’s platform for foundation models. It’s available now via a bi-directional streaming application programming interface (API).

And actually, Amazon has already incorporated some portions of it — a speech encoder that provides speech representations and a speech synthesizer — into the new Alexa model, Alexa+. “This approach allows us to bring the benefits of our speech technologies to different use cases simultaneously while continuing to evolve both systems based on customer feedback and technological advancements,” a spokesperson told us. Obvious use cases include customer support and service, guidance, information retrieval, and entertainment.

A unified approach

Nova Sonic addresses a key challenge in voice AI: the fragmentation of technologies. Traditionally, building voice interfaces required combining separate models for speech recognition, language processing, and speech synthesis, according to Rohit Prasad, SVP and Head Scientist for Artificial General Intelligence (AGI) at Amazon, in a video call interview with VentureBeat yesterday using Amazon’s Chime video service. This complexity often results in robotic, unnatural interactions and increased development overhead. Nova Sonic seeks to improve on this state of affairs by combining all three distinct model types into one.

Prasad explained the model’s core innovation: “Nova Sonic brings together three traditionally separate models—speech-to-text, text understanding, and text-to-speech—into one unified system that can model not just the ‘what’ but also the ‘how’ of communication.” By retaining the acoustic context—such as tone, cadence, and style—Nova Sonic helps maintain the nuances of human conversation.

Recognizing the intricacies and quirks of live, two-way audio conversations

One of Nova Sonic’s defining capabilities is its ability to handle live, two-way conversations. It recognizes when users pause, hesitate, or interrupt—common behaviors in human speech—and responds fluidly while maintaining context. “The real breakthrough here is real-time, interactive, low-latency voice interaction, which means you can interrupt the AI mid-sentence, and it will still maintain context and respond coherently,” said Prasad. This feature is especially relevant in scenarios like customer service, where responsiveness and adaptability are critical.

Nova Sonic is also designed to integrate seamlessly with other systems. It automatically generates transcripts of spoken input, which can be used to trigger APIs or interact with proprietary tools. This allows companies to build AI agents that can perform tasks such as booking appointments, retrieving live information, or answering complex customer inquiries. “You can use Nova Sonic through Amazon Bedrock and connect it with any tools or proprietary data sources, even visual ones, as long as they’re wrapped as callable APIs,” said Prasad.
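The kind of session Prasad describes, with audio flowing both ways while interruptions and tool calls are handled mid-conversation, can be pictured with a small, purely illustrative event loop. None of the names below (the session object, its event types, book_appointment) are Amazon's actual Nova Sonic or Bedrock API; they are placeholders for whatever SDK calls a developer would wire up.

```python
# Purely illustrative event loop for a bidirectional voice session. The session
# object, its event types and `book_appointment` are hypothetical placeholders,
# not Amazon's actual Nova Sonic / Bedrock streaming API.

async def book_appointment(date: str, time: str) -> str:
    # Placeholder for a real backend system wrapped as a callable API.
    return f"Booked for {date} at {time}"


async def run_conversation(session):
    async for event in session.events():            # interleaved realtime events
        if event.type == "user_interrupted":
            await session.stop_speaking()           # barge-in: cut off current audio
        elif event.type == "transcript":
            print("user said:", event.text)         # transcripts can drive tools or analytics
        elif event.type == "tool_call" and event.name == "book_appointment":
            result = await book_appointment(**event.arguments)
            await session.send_tool_result(event.id, result)
        elif event.type == "audio_output":
            await session.play(event.audio_chunk)   # stream synthesized speech back out
```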
This flexibility makes the model suitable for a wide range of industries, from education and travel to enterprise operations and entertainment.

Benchmark performance and industry comparisons

Nova Sonic has been benchmarked against other real-time voice models, including OpenAI’s GPT-4o and Google’s Gemini Flash 2.0. On the Common Eval data set, it achieved a 69.7% win-rate over Gemini Flash 2.0 and a 51.0% win-rate over GPT-4o for American English single-turn conversations using a masculine voice. Similar gains were seen with feminine and British English voices.

Prasad emphasized Nova Sonic’s strong performance in its primary language markets: “Nova Sonic is currently best-in-class in U.S. and British English, outperforming even GPT-4o real-time in both conversational naturalness and accuracy.” He added, “To the best of our knowledge, only two other models—GPT-4o real-time and a variant of GPT-4o mini—come close to what Nova Sonic does in combining speech understanding and generation in real time. This space is still very early and very hard.”

Multilingual capabilities and noisy environment handling

In speech recognition, Nova Sonic also excels in multilingual and real-world conditions. It recorded a word error rate (WER) of 4.2% on the Multilingual LibriSpeech benchmark, outperforming GPT-4o Transcribe by over 36% across English, French, German, Italian, and Spanish. In noisy, multi-speaker environments (measured using the AMI benchmark), Nova Sonic showed a 46.7% improvement in WER over GPT-4o Transcribe.

Expressive voices and language expansion

Currently, the model supports multiple expressive voices, both masculine and feminine, in American and British English. Amazon noted that additional accents and languages are in development and will be released in future updates.

Low latency and enterprise-friendly cost

Speed and cost are also part of the appeal. Third-party benchmarking shows Nova Sonic delivers a customer-perceived latency of 1.09 seconds, compared to 1.18 seconds for OpenAI’s GPT-4o and 1.41 seconds for Google’s Gemini Flash 2.0. From a pricing standpoint, Amazon positions Nova Sonic as an enterprise-ready solution. “We’re nearly 80% cheaper than GPT-4o real-time, and that superior price-performance is resonating with enterprises moving from experimentation to deployment,” said Prasad.

Early adoption across sectors

According to Amazon, companies across different sectors have already begun using or testing Nova Sonic. ASAPP is applying the technology to optimize contact center workflows, praising its accuracy and natural dialog handling. Education First (EF) uses the model to support language learners with real-time pronunciation feedback, especially for non-native speakers with varied accents. Sports data provider Stats Perform is leveraging Nova Sonic’s low latency and simple setup to power rapid, data-rich interactions in its Opta AI Chat platform.

Responsible AI and safety commitment

Alongside performance and cost, Amazon is highlighting its commitment to responsible AI development. The Nova family of models includes built-in safeguards and is supported by AWS AI Service Cards that outline intended use cases, potential limitations, and ethical guidelines. Prasad underscored Amazon’s focus on trust and safety: “Trust is paramount for us—developers can customize personality within limits, but we’ve put in strong guardrails to prevent


From ‘catch up’ to ‘catch us’: How Google quietly took the lead in enterprise AI

Just a year ago, the narrative around Google and enterprise AI felt stuck. Despite inventing core technologies like the Transformer, the tech giant seemed perpetually on the back foot, overshadowed by OpenAI‘s viral success, Anthropic‘s coding prowess and Microsoft‘s aggressive enterprise push.

But witness the scene at Google Cloud Next 2025 in Las Vegas last week: A confident Google, armed with benchmark-topping models, formidable infrastructure and a cohesive enterprise strategy, declaring a stunning turnaround. In a closed-door analyst meeting with senior Google executives, one analyst summed it up. This feels like the moment, he said, when Google went from “catch up” to “catch us.”

This sentiment, that Google has not only caught up with but even surged ahead of OpenAI and Microsoft in the enterprise AI race, prevailed throughout the event. And it’s more than just Google’s marketing spin. Evidence suggests Google has leveraged the past year for intense, focused execution, translating its technological assets into a performant, integrated platform that’s rapidly winning over enterprise decision-makers.

From boasting the world’s most powerful AI models running on hyper-efficient custom silicon, to a burgeoning ecosystem of AI agents designed for real-world business problems, Google is making a compelling case that it was never actually lost – but that its stumbles masked a period of deep, foundational development. Now, with its integrated stack firing on all cylinders, Google appears positioned to lead the next phase of the enterprise AI revolution. And in my interviews with several Google executives at Next, they said Google wields advantages in infrastructure and model integration that competitors like OpenAI, Microsoft or AWS will struggle to replicate.

The shadow of doubt: acknowledging the recent past

It’s impossible to appreciate the current momentum without acknowledging the recent past. Google was the birthplace of the Transformer architecture, which sparked the modern revolution in large language models (LLMs). A decade ago, Google also started investing in specialized AI hardware (TPUs), which is now driving industry-leading efficiency. And yet, two and a half years ago, it inexplicably found itself playing defense.

OpenAI’s ChatGPT captured the public imagination and enterprise interest at breathtaking speed and became the fastest-growing app in history. Competitors like Anthropic carved out niches in areas like coding. Google’s own public steps sometimes seemed tentative or flawed. The infamous Bard demo fumbles in 2023 and the later controversy over its image generator producing historically inaccurate depictions fed a narrative of a company potentially hampered by internal bureaucracy or overcorrection on alignment.

It felt like Google was lost: The AI stumbles seemed to fit a pattern, first shown by Google’s initial slowness in the cloud competition, where it remained a distant third in market share behind Amazon and Microsoft. Google Cloud CTO Will Grannis acknowledged the early questions about whether Google would stand behind Google Cloud for the long run. “Is it even a real thing?” he recalled people asking him. The question lingered: Could Google translate its undeniable research brilliance and infrastructure scale into enterprise AI dominance?
The pivot: a conscious decision to lead

Behind the scenes, however, a shift was underway, catalyzed by a conscious decision at the highest levels to reclaim leadership. Mat Velloso, VP of product for Google DeepMind’s AI Developer Platform, described sensing a pivotal moment upon joining Google in February 2024, after leaving Microsoft. “When I came to Google, I spoke with Sundar [Pichai], I spoke with several leaders here, and I felt like that was the moment where they were deciding, okay, this [generative AI] is a thing the industry clearly cares about. Let’s make it happen,” Velloso shared in an interview with VentureBeat during Next last week.

This renewed push wasn’t hampered by a feared “brain drain” that some outsiders felt was depleting Google. In fact, the company quietly doubled down on execution in early 2024 – a year marked by aggressive hiring, internal unification and customer traction. While competitors made splashy hires, Google retained its core AI leadership, including DeepMind CEO Demis Hassabis and Google Cloud CEO Thomas Kurian, providing stability and deep expertise.

Moreover, talent began flowing towards Google’s focused mission. Logan Kilpatrick, for instance, returned to Google from OpenAI, drawn by the opportunity to build foundational AI within the company creating it. He joined Velloso in what he described as a “zero to one experience,” tasked with building developer traction for Gemini from the ground up. “It was like the team was me on day one… we actually have no users on this platform, we have no revenue. No one is interested in Gemini at this moment,” Kilpatrick recalled of the starting point.

People familiar with the internal dynamics also credit leaders like Josh Woodward, who helped start AI Studio and now leads the Gemini App and Labs. More recently, Noam Shazeer, a key co-author of the original “Attention Is All You Need” Transformer paper during his first tenure at Google, returned to the company in late 2024 as a technical co-lead for the crucial Gemini project.

This concerted effort, combining these hires, research breakthroughs, refinements to its database technology and a sharpened enterprise focus overall, began yielding results. These cumulative advances, combined with what CTO Will Grannis termed “hundreds of fine-grain” platform elements, set the stage for the announcements at Next ’25, and cemented Google’s comeback narrative.

Pillar 1: Gemini 2.5 and the era of thinking models

It’s true that a leading enterprise mantra has become “it’s not just about the model.” After all, the performance gap between leading models has narrowed dramatically, and tech insiders acknowledge that true intelligence is coming from technology packaged around the model, not just the model itself – for example, agentic technologies that allow a model to use tools and explore the web around it. Despite this, to possess the demonstrably best-performing LLM is an important feat – and a powerful validator, a sign that the model-owning company has things like superior research and the


When AI reasoning goes wrong: Microsoft Research shows more tokens can mean more problems

Large language models (LLMs) are increasingly capable of complex reasoning through “inference-time scaling,” a set of techniques that allocate more computational resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods isn’t universal. Performance boosts vary significantly across different models, tasks and problem complexities.

The core finding is that simply throwing more compute at a problem during inference doesn’t guarantee better or more efficient results. The findings can help enterprises better understand cost volatility and model reliability as they look to integrate advanced AI reasoning into their applications.

Putting scaling methods to the test

The Microsoft Research team conducted an extensive empirical analysis across nine state-of-the-art foundation models. This included both “conventional” models like GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro and Llama 3.1 405B, and models specifically fine-tuned for enhanced reasoning through inference-time scaling. The latter included OpenAI’s o1 and o3-mini, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2 Flash Thinking, and DeepSeek R1.

They evaluated these models using three distinct inference-time scaling approaches:

Standard Chain-of-Thought (CoT): The basic method where the model is prompted to answer step-by-step.

Parallel Scaling: The model generates multiple independent answers for the same question and uses an aggregator (like majority vote or selecting the best-scoring answer) to arrive at a final result.

Sequential Scaling: The model iteratively generates an answer and uses feedback from a critic (potentially from the model itself) to refine the answer in subsequent attempts.

These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks that benefit from step-by-step problem-solving: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), navigation (Maze) and spatial reasoning (SpatialMap). Several benchmarks included problems with varying difficulty levels, allowing for a more nuanced understanding of how scaling behaves as problems become harder.

“The availability of difficulty tags for Omni-MATH, TSP, 3SAT, and BA-Calendar enables us to analyze how accuracy and token usage scale with difficulty in inference-time scaling, which is a perspective that is still underexplored,” the researchers wrote in the paper detailing their findings.

The researchers evaluated the Pareto frontier of LLM reasoning by analyzing both accuracy and the computational cost (i.e., the number of tokens generated). This helps identify how efficiently models achieve their results.

Inference-time scaling Pareto frontier (Credit: arXiv)

They also introduced the “conventional-to-reasoning gap” measure, which compares the best possible performance of a conventional model (using an ideal “best-of-N” selection) against the average performance of a reasoning model, estimating the potential gains achievable through better training or verification techniques.
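To clarify how the two scaled approaches differ, here is a minimal sketch of parallel and sequential scaling. The generate and critique callables are hypothetical stand-ins for model calls, not the paper's actual evaluation harness.

```python
# Illustrative sketch of two inference-time scaling strategies: parallel scaling
# (sample N answers, aggregate) and sequential scaling (iteratively refine using
# critic feedback). `generate` and `critique` are hypothetical model-call stubs.
from collections import Counter


def parallel_scaling(generate, question: str, n: int = 8) -> str:
    """Sample n independent answers and return the majority-vote winner."""
    answers = [generate(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def sequential_scaling(generate, critique, question: str, rounds: int = 3) -> str:
    """Refine one answer over several rounds using feedback from a critic."""
    answer = generate(question)
    for _ in range(rounds):
        feedback = critique(question, answer)
        if feedback == "ok":   # critic is satisfied; stop early
            break
        answer = generate(f"{question}\nPrevious answer: {answer}\nFeedback: {feedback}")
    return answer
```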
More compute isn’t always the answer

The study provided several crucial insights that challenge common assumptions about inference-time scaling:

Benefits vary significantly: While models tuned for reasoning generally outperform conventional ones on these tasks, the degree of improvement varies greatly depending on the specific domain and task. Gains often diminish as problem complexity increases. For instance, performance improvements seen on math problems didn’t always translate equally to scientific reasoning or planning tasks.

Token inefficiency is rife: The researchers observed high variability in token consumption, even between models achieving similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek-R1 used over five times more tokens than Claude 3.7 Sonnet for roughly comparable average accuracy.

More tokens do not lead to higher accuracy: Contrary to the intuitive idea that longer reasoning chains mean better reasoning, the study found this isn’t always true. “Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection,” the paper states. “Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy. These findings motivate the need for more purposeful and cost-effective scaling approaches.”

Cost nondeterminism: Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage. This means the cost of running a query can fluctuate significantly, even when the model consistently provides the correct answer.

Variance in response length (spikes show smaller variance) (Credit: arXiv)

The potential in verification mechanisms: Scaling performance consistently improved across all models and benchmarks when simulated with a “perfect verifier” (using the best-of-N results).

Conventional models sometimes match reasoning models: By significantly increasing inference calls (up to 50x more in some experiments), conventional models like GPT-4o could sometimes approach the performance levels of dedicated reasoning models, particularly on less complex tasks. However, these gains diminished rapidly in highly complex settings, indicating that brute-force scaling has its limits.

On some tasks, the accuracy of GPT-4o continues to improve with parallel and sequential scaling. (Credit: arXiv)

Implications for the enterprise

These findings carry significant weight for developers and enterprise adopters of LLMs. The issue of “cost nondeterminism” is particularly stark and makes budgeting difficult. As the researchers point out, “Ideally, developers and users would prefer models for which the standard deviation on token usage per instance is low for cost predictability.”

“The profiling we do in [the study] could be useful for developers as a tool to pick which models are less volatile for the same prompt or for different prompts,” Besmira Nushi, senior principal research manager at Microsoft Research, told VentureBeat. “Ideally, one would want to pick a model that has low standard deviation for correct inputs.”

Models that peak blue to the left consistently generate the same number of tokens at the given task (Credit: arXiv)

The study also provides good insights into the correlation between a model’s accuracy and response length.
For example, the following diagram shows that math queries above ~11,000 token length have a very slim chance of being correct, and those generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models allowing these post hoc mitigations also have a cleaner separation between correct and incorrect samples.
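Nushi's suggestion, picking models whose token usage varies least when they answer correctly, amounts to straightforward profiling. The sketch below computes that per-model, per-prompt spread over repeated runs; the run records are invented for illustration.

```python
# Illustrative profiling for "cost nondeterminism": for each model, measure how
# much token usage varies across repeated runs of the same prompt, keeping only
# runs that produced a correct answer. The `runs` records are invented examples.
import statistics
from collections import defaultdict

# (model, prompt_id, tokens_used, correct)
runs = [
    ("model-a", "q1", 950, True), ("model-a", "q1", 1020, True), ("model-a", "q1", 990, True),
    ("model-b", "q1", 700, True), ("model-b", "q1", 2400, True), ("model-b", "q1", 1500, False),
]

usage = defaultdict(list)
for model, prompt_id, tokens, correct in runs:
    if correct:
        usage[(model, prompt_id)].append(tokens)

for (model, prompt_id), tokens in sorted(usage.items()):
    spread = statistics.stdev(tokens) if len(tokens) > 1 else 0.0
    print(f"{model} on {prompt_id}: mean={statistics.mean(tokens):.0f} tokens, stdev={spread:.0f}")
# A lower stdev on correct runs means more predictable per-query cost.
```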


BigQuery is 5x bigger than Snowflake and Databricks: What Google is doing to make it even better

Google Cloud announced a significant number of new features at its Google Cloud Next event last week, with at least 229 new announcements. Buried in that mountain of news, which included new AI chips and agentic AI capabilities, as well as database updates, Google Cloud also made some big moves with its BigQuery data warehouse service.

Among the new capabilities is BigQuery Unified Governance, which helps organizations discover, understand and trust their data assets. The governance tools help address key barriers to AI adoption by ensuring data quality, accessibility and trustworthiness.

The stakes are enormous for Google as it takes on rivals in the enterprise data space. BigQuery has been on the market since 2011 and has grown significantly in recent years, both in terms of capabilities and user base. Apparently, BigQuery is also a big business for Google Cloud. During Google Cloud Next, it was revealed for the first time just how big the business actually is. According to Google, BigQuery has five times the number of customers of either Snowflake or Databricks.

“This is the first year we’ve been given permission to actually post a customer stat, which was delightful for me,” Yasmeen Ahmad, managing director of data analytics at Google Cloud, told VentureBeat. “Databricks and Snowflake, they’re the only other kind of enterprise data warehouse platforms in the market. We have five times more customers than either of them.”

How Google is improving BigQuery to advance enterprise adoption

While Google now claims to have a more extensive user base than its rivals, it’s not taking its foot off the gas either. In recent months, and particularly at Google Cloud Next, the hyperscaler has announced multiple new capabilities to advance enterprise adoption.

A key challenge for enterprise AI is having access to the correct data that meets business service level agreements (SLAs). According to Gartner research cited by Google, organizations that do not enable and support their AI use cases through an AI-ready data practice will see over 60% of AI projects fail to deliver on business SLAs and be abandoned. This challenge stems from three persistent problems that plague enterprise data management:

Fragmented data silos

Rapidly changing requirements

Inconsistent organizational data cultures, where teams don’t share a common language around data

Google’s BigQuery Unified Governance solution represents a significant departure from traditional approaches by embedding governance capabilities directly within the BigQuery platform rather than requiring separate tools or processes.

BigQuery unified governance: A technical deep dive

At the core of Google’s announcement is BigQuery unified governance, powered by the new BigQuery universal catalog. Unlike traditional catalogs that only contain basic table and column information, the universal catalog integrates three distinct types of metadata:

Physical/technical metadata: Schema definitions, data types and profiling statistics.

Business metadata: Business glossary terms, descriptions and semantic context.

Runtime metadata: Query patterns, usage statistics and format-specific information for technologies like Apache Iceberg.

This unified approach allows BigQuery to maintain a comprehensive understanding of data assets across the enterprise.
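One way to picture the universal catalog is as a single record per data asset that carries all three metadata layers. The sketch below is only a conceptual illustration of that structure; the class, fields and sample values are assumptions for the example, not Google's actual schema or API.

```python
# Conceptual illustration of a "universal catalog" entry combining the three
# metadata layers described above. This is not BigQuery's actual schema or API.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # Physical/technical metadata
    table: str
    schema: dict[str, str]                    # column name -> data type
    row_count: int
    # Business metadata
    glossary_terms: list[str] = field(default_factory=list)
    description: str = ""
    # Runtime metadata
    queries_last_30d: int = 0
    storage_format: str = "bigquery-native"   # e.g. "iceberg"

orders = CatalogEntry(
    table="sales.orders",
    schema={"order_id": "STRING", "amount": "NUMERIC", "ordered_at": "TIMESTAMP"},
    row_count=12_400_000,
    glossary_terms=["net revenue"],
    description="One row per customer order, deduplicated nightly.",
    queries_last_30d=482,
)
```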
What makes the system particularly powerful is how Google has integrated Gemini, its advanced AI model, directly into the governance layer through what it calls the knowledge engine. The knowledge engine actively enhances governance by discovering relationships between datasets, enriching metadata with business context and monitoring data quality automatically. Key capabilities include semantic search with natural language understanding, automated metadata generation, AI-powered relationship discovery, data products for packaging related assets, a business glossary, automatic cataloging of both structured and unstructured data and automated anomaly detection.

Forget about benchmarks, enterprise AI is a bigger issue

Google’s strategy transcends the AI model competition. “I think there’s too much of the industry just focused on getting on top of that individual leaderboard, and actually Google is thinking holistically about the problem,” Ahmad said.

This comprehensive approach addresses the entire enterprise data lifecycle, answering critical questions such as: How do you deliver on trust? How do you deliver on scale? How do you deliver on governance and security? By innovating at each layer of the stack and bringing these innovations together, Google has created what Ahmad calls a real-time data activation flywheel, where, as soon as data is captured, regardless of the type or format or where it’s being stored, there is instant metadata generation, lineage and quality.

That said, models do matter. Ahmad explained that with the advent of thinking models like Gemini 2.0, there has been a huge unlock for Google’s data platforms. “A year ago, when you were asking GenAI to answer a business question, anything that got slightly more complex, you would actually need to break it down into multiple steps,” she said. “Suddenly, with the thinking model it can come up with a plan… you’re not having to hard code a way for it to build a plan. It knows how to build plans.”

As a result, she said that now you can easily have a data engineering agent build a pipeline that’s three steps or 10 steps. The integration with Google’s AI capabilities has transformed what’s possible with enterprise data.

Real-world impact: How enterprises are benefiting

Levi Strauss & Company offers a compelling example of how unified data governance can transform business operations. The 172-year-old company is using Google’s data governance capabilities as it shifts from being primarily a wholesale business to becoming a direct-to-consumer brand.

In a session at Google Cloud Next, Vinay Narayana, who runs data and AI platform engineering at Levi’s, detailed his organization’s use case. “We aspire to empower our business analysts to have access to real-time data that is also accurate,” Narayana said. “Before we embarked on our journey to build a new platform, we discovered various user challenges. Our business users didn’t know where the data lived, and if they knew the data source, they didn’t know who owned it. If they somehow got access, there was no documentation.”

Levi’s built a data platform on Google Cloud that organizes data products by business
