VentureBeat

OpenAI’s new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds

OpenAI’s voice AI models have gotten it into trouble before with actor Scarlett Johansson, but that isn’t stopping the company from continuing to advance its offerings in this category. Today, the ChatGPT maker has unveiled three new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. These models will initially be available through the company’s application programming interface (API) for third-party software developers to build their own apps. They will also be available on a custom demo site, OpenAI.fm, that individual users can access for limited testing and fun.

Moreover, the gpt-4o-mini-tts model voices can be customized from several pre-sets via text prompt to change their accents, pitch, tone and other vocal qualities — including conveying whatever emotions the user asks them to, which should go a long way to addressing any concerns that OpenAI is deliberately imitating any particular user’s voice (the company previously denied that was the case with Johansson, but pulled down the ostensibly imitative voice option anyway). Now, it’s up to the user to decide how they want their AI voice to sound when speaking back. In a demo with VentureBeat delivered over a video call, OpenAI technical staff member Jeff Harris showed how, using text alone on the demo site, a user could get the same voice to sound like a cackling mad scientist or a zen, calm yoga teacher.

Discovering and refining new capabilities within GPT-4o base

The models are variants of the existing GPT-4o model OpenAI launched back in May 2024, which currently powers the ChatGPT text and voice experience for many users; the company took that base model and post-trained it with additional data to make it excel at transcription and speech. The company didn’t specify when the models might come to ChatGPT. “ChatGPT has slightly different requirements in terms of cost and performance trade-offs, so while I expect they will move to these models in time, for now, this launch is focused on API users,” Harris said.

The gpt-4o-transcribe family is meant to supersede OpenAI’s two-year-old Whisper open-source speech-to-text model, offering lower word error rates across industry benchmarks and improved performance in noisy environments, with diverse accents, and at varying speech speeds across 100+ languages. The company posted a chart on its website showing just how much lower the gpt-4o-transcribe models’ error rates are at identifying words across 33 languages compared to Whisper — with an impressively low 2.46% in English. “These models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy,” said Harris.

Harris told VentureBeat that the new gpt-4o-transcribe model family is not designed to offer “diarization,” or the capability to label and differentiate between different speakers. Instead, it is designed primarily to receive one (or possibly multiple) voices as a single input channel and respond to all inputs with a single output voice in that interaction, however long it takes.
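For developers, the models are reached through OpenAI’s existing audio endpoints. Below is a minimal, hedged sketch of what calling the new transcription and text-to-speech models might look like with the OpenAI Python SDK; the file names, the voice choice, and the instructions parameter for steering vocal style are illustrative assumptions rather than details confirmed by the article.

```python
# Hedged sketch: calling the new audio models via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set in the environment; the file names, the voice
# name, and the `instructions` field used to steer tone are illustrative.
from openai import OpenAI

client = OpenAI()

# Speech-to-text with gpt-4o-transcribe
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
print(transcript.text)

# Text-to-speech with gpt-4o-mini-tts, steering delivery by prompt
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # assumed preset voice name
    input="Your order shipped this morning and should arrive Thursday.",
    instructions="Speak like a calm, reassuring yoga teacher.",  # assumed parameter
)
with open("reply.mp3", "wb") as f:
    f.write(speech.content)
```

This mirrors the customization Harris demonstrated on OpenAI.fm: the same voice preset can be pushed toward very different personalities purely through the text instructions.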
The company is also hosting a competition for the general public to find the most creative examples of using its demo voice site OpenAI.fm and share them online by tagging the @openAI account on X. The winner will receive a custom Teenage Engineering radio with the OpenAI logo, which OpenAI Head of Product, Platform Olivier Godement said is one of only three in the world.

An audio applications gold mine

The enhancements make the models particularly well-suited for applications such as customer call centers, meeting note transcription, and AI-powered assistants. Impressively, the company’s newly launched Agents SDK from last week also allows developers who have already built apps atop its text-based large language models, like the regular GPT-4o, to add fluid voice interactions with only about “nine lines of code,” according to a presenter during an OpenAI YouTube livestream announcing the new models. For example, an e-commerce app built atop GPT-4o could now respond to turn-based user questions like “Tell me about my last orders” in speech with just seconds of tweaking the code by adding these new models.

“For the first time, we’re introducing streaming speech-to-text, allowing developers to continuously input audio and receive a real-time text stream, making conversations feel more natural,” Harris said. Still, for those devs looking for low-latency, real-time AI voice experiences, OpenAI recommends using its speech-to-speech models in the Realtime API.

Pricing and availability

The new models are available immediately via OpenAI’s API, with pricing as follows:

• gpt-4o-transcribe: $6.00 per 1M audio input tokens (~$0.006 per minute)
• gpt-4o-mini-transcribe: $3.00 per 1M audio input tokens (~$0.003 per minute)
• gpt-4o-mini-tts: $0.60 per 1M text input tokens, $12.00 per 1M audio output tokens (~$0.015 per minute)

However, they arrive at a time of fiercer-than-ever competition in the AI transcription and speech space, with dedicated speech AI firms such as ElevenLabs offering its new Scribe model, which supports diarization and boasts a similarly reduced, though not as low, error rate of 3.3% in English. It is priced at $0.40 per hour of input audio (or $0.006 per minute, roughly equivalent). Another startup, Hume AI, offers a new model, Octave TTS, with sentence-level and even word-level customization of pronunciation and emotional inflection — based entirely on the user’s instructions, not any pre-set voices. The pricing of Octave TTS isn’t directly comparable, but there is a free tier offering 10 minutes of audio, and costs increase from there across paid plans. Meanwhile, more advanced audio and speech models are also coming to the open source community, including one called Orpheus 3B, which is available with a permissive Apache 2.0 license, meaning developers don’t have to pay any licensing fees to run it — provided they have the right hardware or cloud servers.

Industry adoption and early results

According to testimonials shared by OpenAI with VentureBeat, several companies have already integrated OpenAI’s new audio models into their platforms, reporting significant improvements in voice AI performance. EliseAI, a company focused on property management automation, found that OpenAI’s text-to-speech model enabled more natural and


DeepSeek-V3 now runs at 20 tokens per second on Mac Studio, and that’s a nightmare for OpenAI

Chinese AI startup DeepSeek has quietly released a new large language model that’s already sending ripples through the artificial intelligence industry — not just for its capabilities, but for how it’s being deployed. The 641-gigabyte model, dubbed DeepSeek-V3-0324, appeared on AI repository Hugging Face today with virtually no announcement, continuing the company’s pattern of low-key but impactful releases. What makes this launch particularly notable is the model’s MIT license — making it freely available for commercial use — and early reports that it can run directly on consumer-grade hardware, specifically Apple’s Mac Studio with M3 Ultra chip.

“The new DeepSeek-V3-0324 in 4-bit runs at > 20 tokens/second on a 512GB M3 Ultra with mlx-lm!” wrote AI researcher Awni Hannun on social media. While the $9,499 Mac Studio might stretch the definition of “consumer hardware,” the ability to run such a massive model locally is a major departure from the data center requirements typically associated with state-of-the-art AI.

DeepSeek’s stealth launch strategy disrupts AI market expectations

The 685-billion-parameter model arrived with no accompanying whitepaper, blog post, or marketing push — just an empty README file and the model weights themselves. This approach contrasts sharply with the carefully orchestrated product launches typical of Western AI companies, where months of hype often precede actual releases.

Early testers report significant improvements over the previous version. AI researcher Xeophon proclaimed in a post on X.com: “Tested the new DeepSeek V3 on my internal bench and it has a huge jump in all metrics on all tests. It is now the best non-reasoning model, dethroning Sonnet 3.5.” This claim, if validated by broader testing, would position DeepSeek’s new model above Claude Sonnet 3.5 from Anthropic, one of the most respected commercial AI systems. And unlike Sonnet, which requires a subscription, DeepSeek-V3-0324‘s weights are freely available for anyone to download and use.

How DeepSeek V3-0324’s breakthrough architecture achieves unmatched efficiency

DeepSeek-V3-0324 employs a mixture-of-experts (MoE) architecture that fundamentally reimagines how large language models operate. Traditional models activate their entire parameter count for every task, but DeepSeek’s approach activates only about 37 billion of its 685 billion parameters during specific tasks. This selective activation represents a paradigm shift in model efficiency. By activating only the most relevant “expert” parameters for each specific task, DeepSeek achieves performance comparable to much larger fully-activated models while drastically reducing computational demands. The model incorporates two additional breakthrough technologies: Multi-Head Latent Attention (MLA) and Multi-Token Prediction (MTP). MLA enhances the model’s ability to maintain context across long passages of text, while MTP generates multiple tokens per step instead of the usual one-at-a-time approach.
Together, these innovations boost output speed by nearly 80%. Simon Willison, a developer tools creator, noted in a blog post that a 4-bit quantized version reduces the storage footprint to 352GB, making it feasible to run on high-end consumer hardware like the Mac Studio with M3 Ultra chip.

This represents a potentially significant shift in AI deployment. While traditional AI infrastructure typically relies on multiple Nvidia GPUs consuming several kilowatts of power, the Mac Studio draws less than 200 watts during inference. This efficiency gap suggests the AI industry may need to rethink assumptions about infrastructure requirements for top-tier model performance.

China’s open source AI revolution challenges Silicon Valley’s closed garden model

DeepSeek’s release strategy exemplifies a fundamental divergence in AI business philosophy between Chinese and Western companies. While U.S. leaders like OpenAI and Anthropic keep their models behind paywalls, Chinese AI companies increasingly embrace permissive open-source licensing. This approach is rapidly transforming China’s AI ecosystem. The open availability of cutting-edge models creates a multiplier effect, enabling startups, researchers, and developers to build upon sophisticated AI technology without massive capital expenditure. This has accelerated China’s AI capabilities at a pace that has shocked Western observers.

The business logic behind this strategy reflects market realities in China. With multiple well-funded competitors, maintaining a proprietary approach becomes increasingly difficult when competitors offer similar capabilities for free. Open-sourcing creates alternative value pathways through ecosystem leadership, API services, and enterprise solutions built atop freely available foundation models. Even established Chinese tech giants have recognized this shift. Baidu announced plans to make its Ernie 4.5 model series open-source by June, while Alibaba and Tencent have released open-source AI models with specialized capabilities. This movement stands in stark contrast to the API-centric strategy employed by Western leaders.

The open-source approach also addresses unique challenges faced by Chinese AI companies. With restrictions on access to cutting-edge Nvidia chips, Chinese firms have emphasized efficiency and optimization to achieve competitive performance with more limited computational resources. This necessity-driven innovation has now become a potential competitive advantage.

DeepSeek V3-0324: The foundation for an AI reasoning revolution

The timing and characteristics of DeepSeek-V3-0324 strongly suggest it will serve as the foundation for DeepSeek-R2, an improved reasoning-focused model expected within the next two months. This follows DeepSeek’s established pattern, where its base models precede specialized reasoning models by several weeks. “This lines up with how they released V3 around Christmas followed by R1 a few weeks later. R2 is rumored for April so this could be it,” noted Reddit user mxforest.

The implications of an advanced open-source reasoning model cannot be overstated. Current reasoning models like OpenAI’s o1 and DeepSeek’s R1 represent the cutting edge of AI capabilities, demonstrating unprecedented problem-solving abilities in domains from mathematics to coding. Making this technology freely available would democratize access to AI systems currently limited to those with substantial budgets. The potential R2 model arrives amid significant revelations about
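For readers curious what the Mac Studio workflow described above looks like in practice, Hannun’s demonstration reportedly used Apple’s mlx-lm package. The sketch below shows roughly what that entails in Python; the Hugging Face repository name for the 4-bit quantized weights is an assumption, the API details vary slightly across mlx-lm versions, and you would need the roughly 352GB of storage and unified memory headroom mentioned above.

```python
# Hedged sketch: running a 4-bit quantized DeepSeek-V3-0324 locally with mlx-lm
# on Apple silicon. The repo id below is an assumed community conversion; check
# Hugging Face for the actual 4-bit MLX weights before running.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V3-0324-4bit")  # assumed repo id

prompt = "Summarize the trade-offs of mixture-of-experts models in three sentences."
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```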


The watchful AI that never sleeps: Hakimo’s $10.5M bet on autonomous security

Hakimo announced today it has secured $10.5 million in Series A funding to expand its autonomous security monitoring platform. The Menlo Park-based AI security startup’s round was led by Vertex Ventures and Zigg Capital, with participation from RXR Arden Digital Ventures and existing investors Defy.vc and Gokul Rajaram, bringing the company’s total funding to $20.5 million. The company’s flagship product, AI Operator — an autonomous security agent that monitors existing security systems, detects threats in real-time, and executes response protocols with minimal human intervention — arrives as the physical security industry struggles with staffing shortages, rising costs, and false alarm fatigue.

“Rising costs and the severe shortage of quality security guards are driving this shift,” explained Sam Joseph, Hakimo’s co-founder and CEO, in an exclusive interview with VentureBeat. “Companies save approximately $125,000 per year relative to a guard, so switching to Hakimo is a no-brainer to customers.”

AI security agents that detect any threat describable in words

Unlike conventional security monitoring services, Hakimo combines computer vision with generative AI to create a system that can detect any anomaly or threat that can be described in words. “Hakimo’s AI Operator continuously ingests live video feeds from existing security cameras and other sensors such as alarm systems and door sensors,” Joseph said. “Our AI Operator is trained to identify anomalies and threats—such as unauthorized personnel, tailgating, loitering, intrusion into restricted zones, or unusual behaviors.”

When the system detects a potential threat, it reasons through pre-programmed response protocols, issues real-time audio warnings through on-site speakers, and escalates to human operators only when necessary. This multi-layered approach allows for comprehensive security coverage at a fraction of the cost of traditional guard services. The system’s ability to work with existing hardware serves as another differentiator. “Companies can deploy Hakimo in just minutes, directly integrating with their existing cameras and hardware—no new infrastructure required,” Joseph explained.

How 24/7 AI monitoring outperforms traditional security guards

Hakimo’s growth addresses significant challenges facing traditional security approaches, including police departments declining to respond to unverified alarms and a labor crisis in the manned guarding industry. “A guard physically cannot be at all places at all times, especially at places like campuses and large warehouses or even car dealerships,” Joseph said. “Hakimo, on the other hand, monitors cameras giving complete 100% coverage as long as there are enough cameras present at the site.” The CEO highlighted additional advantages: “Guards have to take mandatory breaks during their shifts whereas Hakimo’s AI Operator never takes breaks. Liability risk: Guards have a high risk of getting injured on the job, which causes a huge liability risk. Hakimo, with no on-site human presence, has no such risk whatsoever.”

The company has tripled its customer base over the past year, now serving more than 100 clients across industries including multifamily apartments, car dealerships, construction sites, and Fortune 500 enterprises.
In 2024 alone, Hakimo’s technology prevented thousands of security incidents, assisted law enforcement in multiple arrests, and even saved a life. “Our AI Operator detected a person collapsing on-site in the middle of the night with no one around. It was then escalated to our human operators who immediately dispatched emergency services (911), ultimately saving the individual’s life,” Joseph explained.

Why investors are betting millions on AI-powered physical security

The investment in Hakimo reflects growing recognition that physical security requires technological reinvention, as traditional approaches like guard services and alarm monitoring increasingly appear antiquated and cost-prohibitive. With the new funding, Joseph plans to “expand and enhance our AI platform capabilities and scale operations to new markets and industries.” The company targets businesses across multiple sectors that need efficient physical security solutions. “Any business needing protection for physical assets or people benefits significantly, especially property management, car dealerships, construction sites, and campuses,” Joseph said, citing examples including a car dealership that eliminated monthly break-ins, a high-rise apartment building that saved nearly a million dollars on security guard spending, and a student housing property that saw incidents drop by 90% while reducing security costs to a third of traditional expenses.

Creating the universal platform for intelligent video monitoring

Hakimo was founded in early 2020 by Sam Joseph and Sagar Honnungar, both Stanford-trained AI experts with backgrounds in enterprise SaaS. They identified converging trends: exploding camera deployments, falling hardware costs, and rapid advances in computer vision. Their vision extends beyond current applications: “In five years, Hakimo aims to become the universal platform for intelligent video monitoring and understanding, effectively creating and leading a new category within the physical security industry,” Joseph said.

As AI continues to advance, Hakimo’s approach may offer a preview of how intelligent systems will increasingly augment — and in some cases replace — human workers in security and beyond. “Our vision is that Hakimo will be the platform for all kinds of video monitoring,” Joseph said. “Anyone should be able to connect any camera to Hakimo and say ‘look out for X and do Y when you see X happening.’”


Beyond transformers: Nvidia’s MambaVision aims to unlock faster, cheaper enterprise computer vision

Transformer-based large language models (LLMs) are the foundation of the modern generative AI landscape. Transformers aren’t the only way to do gen AI, though. Over the course of the last year, Mamba, an approach that uses Structured State Space Models (SSM), has also picked up adoption as an alternative approach from multiple vendors, including AI21 and AI silicon giant Nvidia.

Nvidia first discussed the concept of Mamba-powered models in 2024 when it initially released the MambaVision research and some early models. This week, Nvidia is expanding on its initial effort with a series of updated MambaVision models available on Hugging Face. MambaVision, as the name implies, is a Mamba-based model family for computer vision and image recognition tasks. The promise of MambaVision for enterprise is that it could improve the efficiency and accuracy of vision operations, at potentially lower costs, thanks to lower computational requirements.

What are SSMs and how do they compare to transformers?

SSMs are a neural network architecture class that processes sequential data differently from traditional transformers. While transformers use attention mechanisms to process all tokens in relation to each other, SSMs model sequence data as a continuous dynamic system. Mamba is a specific SSM implementation developed to address the limitations of earlier SSM models. It introduces selective state space modeling that dynamically adapts to input data, and hardware-aware design for efficient GPU utilization. Mamba aims to provide comparable performance to transformers on many tasks while using fewer computational resources.

Nvidia uses hybrid architecture with MambaVision to revolutionize computer vision

Traditional Vision Transformers (ViT) have dominated high-performance computer vision for the last several years, but at significant computational cost. Pure Mamba-based approaches, while more efficient, have struggled to match Transformer performance on complex vision tasks requiring global context understanding. MambaVision bridges this gap by adopting a hybrid approach. Nvidia’s MambaVision is a hybrid model that strategically combines Mamba’s efficiency with the Transformer’s modeling power.

The architecture’s innovation lies in its redesigned Mamba formulation specifically engineered for visual feature modeling, augmented by strategic placement of self-attention blocks in the final layers to capture complex spatial dependencies. Unlike conventional vision models that rely exclusively on either attention mechanisms or convolutional approaches, MambaVision’s hierarchical architecture employs both paradigms simultaneously. The model processes visual information through sequential scan-based operations from Mamba while leveraging self-attention to model global context — effectively getting the best of both worlds.

MambaVision now has 740 million parameters

The new set of MambaVision models released on Hugging Face is available under the Nvidia Source Code License-NC, which is an open license. The initial variants of MambaVision released in 2024 include the T and T2 variants, which were trained on the ImageNet-1K library. The new models released this week include the L/L2 and L3 variants, which are scaled-up models.
“Since the initial release, we’ve significantly enhanced MambaVision, scaling it up to an impressive 740 million parameters,” Ali Hatamizadeh, Senior Research Scientist at Nvidia, wrote in a Hugging Face discussion post. “We’ve also expanded our training approach by utilizing the larger ImageNet-21K dataset and have introduced native support for higher resolutions, now handling images at 256 and 512 pixels compared to the original 224 pixels.”

According to Nvidia, the improved scale in the new MambaVision models also improves performance. Independent AI consultant Alex Fazio explained to VentureBeat that the new MambaVision models’ training on larger datasets makes them much better at handling more diverse and complex tasks. He noted that the new models include high-resolution variants perfect for detailed image analysis. Fazio said that the lineup has also expanded with advanced configurations offering more flexibility and scalability for different workloads. “In terms of benchmarks, the 2025 models are expected to outperform the 2024 ones because they generalize better across larger datasets and tasks,” Fazio said.

Enterprise implications of MambaVision

For enterprises building computer vision applications, MambaVision’s balance of performance and efficiency opens new possibilities:

• Reduced inference costs: The improved throughput means lower GPU compute requirements for similar performance levels compared to Transformer-only models.
• Edge deployment potential: While still large, MambaVision’s architecture is more amenable to optimization for edge devices than pure Transformer approaches.
• Improved downstream task performance: The gains on complex tasks like object detection and segmentation translate directly to better performance for real-world applications like inventory management, quality control, and autonomous systems.
• Simplified deployment: Nvidia has released MambaVision with Hugging Face integration, making implementation straightforward with just a few lines of code for both classification and feature extraction (see the sketch below).

What this means for enterprise AI strategy

MambaVision represents an opportunity for enterprises to deploy more efficient computer vision systems that maintain high accuracy. The model’s strong performance means that it can potentially serve as a versatile foundation for multiple computer vision applications across industries. MambaVision is still somewhat of an early effort, but it does represent a glimpse into the future of computer vision models. MambaVision highlights how architectural innovation—not just scale—continues to drive meaningful improvements in AI capabilities. Understanding these architectural advances is becoming increasingly crucial for technical decision-makers to make informed AI deployment choices.
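To give a sense of the Hugging Face integration mentioned above, here is a rough sketch of loading a MambaVision checkpoint for classification with the transformers library. The model id, the trust_remote_code requirement, and the output format are assumptions based on how Nvidia typically publishes research models; consult the official model cards for exact usage and preprocessing.

```python
# Hedged sketch: loading a MambaVision checkpoint from Hugging Face for image
# classification. The model id and output keys are assumptions; check the
# official Nvidia model card for exact usage and input normalization.
import torch
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "nvidia/MambaVision-L3-512-21K",  # assumed id for one of the new L3 variants
    trust_remote_code=True,           # custom architecture code ships with the repo
)
model.eval()

# Dummy batch at the 512-pixel resolution the new variants reportedly support;
# real inputs need the resizing and normalization described in the model card.
pixels = torch.rand(1, 3, 512, 512)

with torch.no_grad():
    outputs = model(pixels)

logits = outputs["logits"] if isinstance(outputs, dict) else outputs.logits
print(logits.shape)  # (1, num_classes)
```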


Agentic AI is changing online meeting platforms: Moving from silent observer to active participant

Online meetings used to be much the same as their physical world counterparts. With the introduction of generative AI, online meeting platforms began to add new insights, including voice transcription services. Platform vendors, including Microsoft Teams, Google Meet and Cisco WebEx, have steadily integrated capabilities that go beyond what in-person physical meetings can provide. Now, in the emerging era of agentic AI, online meetings are poised to diverge even further with a new wave of innovations.

Zoom recently announced its agentic AI efforts, which aim to create a paradigm shift from meetings to milestones. Microsoft has added Copilot actions that integrate with its Microsoft Teams service, helping users inside of meetings get insights and connect with other Microsoft services. Cisco has been steadily expanding its AI capabilities in Webex, and announced a Webex AI agent at the end of 2024 that helps with contact center deployments.

Another firm that has been particularly active in the space is Otter AI, which is somewhat differentiated in that it is not directly tethered to any specific online meeting vendor platform. While Otter first made its mark as an AI-powered voice transcription service, it has added an AI assistant called Otter Pilot, an AI chat assistant, and a series of meeting capabilities known as Meeting GenAI. Today, Otter is going a step further with its foray into agentic AI. While some of the agentic AI features that Otter is adding are not unique, it is doing at least one thing that isn’t yet part of every agentic AI meeting technology: Otter AI is now being integrated as an entity inside of a meeting that can actually respond by voice to queries. No longer is the AI an external participant accessible just via a chat window; AI is now a live entity that is literally part of the meeting.

The rise of AI meeting agents: Beyond silent observers

For the last several years, AI meeting assistants have been passive observers—transcribing conversations, creating summaries and allowing post-meeting queries. Otter is now changing this dynamic with its AI Meeting Agent, which can actively participate in conversations when summoned. “The new AI meeting agent we’re building will be able to help you with voice in real-time meetings,” Sam Liang, CEO of Otter AI, told VentureBeat. “During the meeting, you can say, ‘Hey Otter,’ and ask it questions.”

In a live demonstration with VentureBeat, Liang showed how the agent could answer factual questions, provide meeting summaries and even schedule follow-up meetings—all through voice commands during an active conversation. What makes this particularly powerful is the agent’s ability to connect to a company’s knowledge ecosystem. “This agent can become a domain expert,” Liang noted. “When you’re having a meeting, it has almost infinite knowledge from the internet, but this agent also has knowledge about your enterprise.”

Autonomous agents: When AI runs the meeting

Taking agentic capability a step further, Otter is also launching an autonomous SDR (sales development representative) agent that can independently conduct entire meetings without human intervention. This agent greets website visitors, conducts product demonstrations and schedules follow-up meetings with human sales representatives.
“We cannot hire a million human agents to answer questions, but we built this Otter SDR agent that functions like a sales development representative who can greet every single visitor and give them a live demo,” Liang said. The use of chatbots and avatar-based systems is not new. Liang argued that what distinguishes his company’s technology from existing avatar-based solutions is its ability to conduct multimedia product demonstrations in real time. The autonomous agent can share screens, demonstrate product features and respond to specific questions about functionality and pricing.

The technical architecture of agentic AI for meetings

Agentic AI is an overloaded and somewhat overhyped term in the industry today. Functionally, agentic AI is about enabling actions, which can be done by combining multiple models with a tool like LangChain or by using the function-calling capabilities present in many models. Liang has an even more nuanced definition for agentic AI. “Agents in general are a more sophisticated AI system that can break down a large and complicated task into smaller tasks,” he said. “It can do reasoning and it can do some planning to perform a task.”

Otter is not using LangChain but has developed its own custom technology specifically designed for the challenges of multi-speaker voice environments. The technical architecture combines both public knowledge retrieval and proprietary enterprise information through a custom retrieval-augmented generation (RAG) implementation. This enables the agent to understand company-specific information like employee names, project terminology and internal acronyms.

The future of meeting intelligence isn’t just agentic AI

What agentic AI is bringing to online meeting platforms represents a powerful new set of capabilities for organizational efficiency. For decades, organizational efficiency experts have warned about the risks of wasted time in meetings. Modern AI-powered platforms are changing that risk. Last week, Zoom’s CTO told me that his goal was to move the technology from meetings to milestones, where the output of a meeting isn’t just another meeting but actionable things that will benefit the organization.

While agentic AI can create workflows, there is still also benefit in regular AI assistants that are not actually agentic. There will still be standalone AI assistants and fully agentic ones, and that’s a good thing, according to Anurag Dhingra, SVP and GM, Enterprise Connectivity and Collaboration at Cisco. “While AI agents act as autonomous do-ers and AI assistants serve as prompted helpers, both offer benefits in boosting productivity and enhancing overall collaboration,” Dhingra told VentureBeat. “It’s not a matter of choosing one over the other but rather leveraging their combined strengths to create environments where teams can focus on innovation and strategic decision-making.”

What is also starting to happen is more interoperability across different platforms. For example, Cisco’s AI Assistant will work with workflow applications such as Salesforce, ServiceNow and Outlook. Strategic


Baidu delivers new LLMs ERNIE 4.5 and ERNIE X1 undercutting DeepSeek, OpenAI on cost — but they’re not open source (yet)

Over the weekend, Chinese web search giant Baidu announced the launch of two new AI models, ERNIE 4.5 and ERNIE X1, a multimodal language model and a reasoning model, respectively. Baidu claims they offer state-of-the-art performance on a variety of metrics, besting DeepSeek’s non-reasoning V3 and OpenAI’s GPT-4.5 (how do you like the close name match Baidu chose, as well?) on several third-party benchmark tests such as C-Eval (assessing Chinese LLM performance on knowledge and reasoning across 52 subjects), CMMLU (massive multitask language understanding in Chinese), and GSM8K (math word problems). It also claims to undercut the cost of fellow Chinese wunderkind DeepSeek’s R1 reasoning model with ERNIE X1 by 50%, and of US AI juggernaut OpenAI’s GPT-4.5 with ERNIE 4.5 by 99%.

Yet both have some important limitations, including a lack of open-source licensing in the former case (which DeepSeek R1 offers) and a far reduced context window compared to the latter: 8,000 tokens instead of 128,000, frankly an astonishingly low amount in this age of million-token-plus context windows. (Tokens are how a large AI model represents information, with more meaning more information. A 128,000-token window is akin to a 250-page novel.) As X user @claudeglass noted in a post, the small context window makes it perhaps only suitable for customer service chatbots. Baidu posted on X that it did plan to make the ERNIE 4.5 model family open source on June 30, 2025.

Baidu has enabled access to the models through its application programming interface (API) and its Chinese-language chatbot rival to ChatGPT, known as “ERNIE Bot” — it answers questions, generates text, produces creative writing, and interacts conversationally with users — and made ERNIE Bot free to access.

ERNIE 4.5: A new generation of multimodal AI

ERNIE 4.5 is Baidu’s latest foundation model, designed as a native multimodal system capable of processing and understanding text, images, audio, and video, and is a clear competitor to OpenAI’s GPT-4.5 model released back in February 2025. The model has been optimized for better comprehension, generation, reasoning, and memory. Enhancements include improved hallucination prevention, logical reasoning, and coding capabilities. According to Baidu, ERNIE 4.5 outperforms GPT-4.5 in multiple benchmarks while maintaining a significantly lower cost. The model’s advancements stem from several key technologies, including FlashMask Dynamic Attention Masking, Heterogeneous Multimodal Mixture-of-Experts, and Self-feedback Enhanced Post-Training.

ERNIE X1 introduces advanced deep-thinking reasoning capabilities, emphasizing understanding, planning, reflection, and evolution. Unlike standard multimodal AI models, ERNIE X1 is specifically designed for complex reasoning and tool use, enabling it to perform tasks such as advanced search, document-based Q&A, AI-generated image interpretation, code execution, and web page analysis. The model supports a range of tools, including Baidu’s academic search, business information search, and franchise research tools. Its development is based on Progressive Reinforcement Learning, End-to-End Training integrating Chains of Thought and Action, and a Unified Multi-Faceted Reward System.

Access and API availability

Users can now access both ERNIE 4.5 and ERNIE X1 via the official ERNIE Bot website.
For enterprise users and developers, ERNIE 4.5 is now available through Baidu AI Cloud’s Qianfan platform via API access. ERNIE X1 is expected to be available soon. Pricing for API access:

• ERNIE 4.5: $0.55 per 1M input tokens, $2.20 per 1M output tokens
• ERNIE X1: $0.28 per 1M input tokens, $1.10 per 1M output tokens
• For comparison, DeepSeek R1: $0.55 per 1M input tokens, $2.19 per 1M output tokens

Baidu has also announced plans to integrate ERNIE 4.5 and ERNIE X1 into its broader ecosystem, including Baidu Search and the Wenxiaoyan app.

Considerations for enterprise decision-makers

For CIOs, CTOs, IT leaders, and DevOps teams, the launch of ERNIE 4.5 and ERNIE X1 presents both opportunities and considerations:

• Performance vs. cost: With pricing significantly lower than competing models, organizations evaluating AI solutions may see cost savings by integrating ERNIE models via API (a rough comparison appears in the sketch below). However, further benchmarking and real-world testing may be necessary to assess performance for specific business applications.
• Multimodal and reasoning capabilities: The ability to process and understand text, images, audio, and video could be valuable for businesses in industries such as customer support, content generation, legal tech, and finance.
• Tool integration: ERNIE X1’s ability to work with tools like advanced search, document-based Q&A, and code interpretation could provide automation and efficiency gains in enterprise environments.
• Ecosystem and localization: As Baidu’s AI models are optimized for Chinese-language processing and regional knowledge, enterprises working in China or targeting Chinese-speaking markets may find ERNIE models more effective than global alternatives.
• Licensing and data privacy: While Baidu has indicated that ERNIE 4.5 will be made open source on June 30, 2025, that’s still three months away, so enterprises should at least wait until that time to assess whether it’s worth deploying locally or on US-hosted cloud services. Enterprise users should review Baidu’s policies regarding data privacy, compliance, and model usage before integrating these AI solutions.

AI expansion and future outlook

As AI development accelerates in 2025, Baidu is positioning itself as a leader in multimodal and reasoning-based AI technologies. The company plans to continue investing in artificial intelligence, data centers, and cloud infrastructure to enhance the capabilities of its foundation models. By offering a combination of powerful performance and lower costs, Baidu’s latest AI models aim to provide businesses and individual users with more accessible and advanced AI tools. For more details, visit ERNIE Bot’s official website.
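As a quick sanity check on the cost claims above, the snippet below works out what a hypothetical monthly workload would cost at the published ERNIE X1 and DeepSeek R1 rates. The token volumes are made-up assumptions purely for illustration.

```python
# Hedged sketch: comparing monthly API cost at the per-token prices quoted above.
# The workload numbers are illustrative assumptions, not benchmarks.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "ERNIE X1":    (0.28, 1.10),
    "DeepSeek R1": (0.55, 2.19),
}

input_tokens_per_month = 500_000_000    # assumed: 500M input tokens
output_tokens_per_month = 100_000_000   # assumed: 100M output tokens

for model, (in_price, out_price) in PRICES.items():
    cost = (input_tokens_per_month / 1e6) * in_price \
         + (output_tokens_per_month / 1e6) * out_price
    print(f"{model}: ${cost:,.2f} per month")

# ERNIE X1:    $250.00 per month
# DeepSeek R1: $494.00 per month -> ERNIE X1 comes out at roughly half the cost,
# consistent with Baidu's claim of undercutting DeepSeek R1 by about 50%.
```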


‘Gradually then suddenly’: Is AI job displacement following this pattern?

Whether by automating tasks, serving as copilots or generating text, images, video and software from plain English, AI is rapidly altering how we work. Yet, for all the talk about AI revolutionizing jobs, widespread workforce displacement has yet to happen. It seems likely that this could be the lull before the storm.

According to a recent World Economic Forum (WEF) survey, 40% of employers anticipate reducing their workforce between 2025 and 2030 in areas where AI can automate tasks. This statistic dovetails well with earlier predictions. For example, Goldman Sachs said in a research report two years ago that generative AI could “expose the equivalent of 300 million full-time jobs to automation,” leading to “significant disruption” in the labor market. According to the International Monetary Fund (IMF), “almost 40% of global employment is exposed to AI.” Brookings said last fall in another report that “more than 30% of all workers could see at least 50% of their occupation’s tasks disrupted by gen AI.” Several years ago, Kai-Fu Lee, one of the world’s foremost AI experts, said in a 60 Minutes interview that AI could displace 40% of global jobs within 15 years.

If AI is such a disruptive force, why aren’t we seeing large layoffs?

Some have questioned those predictions, especially as job displacement from AI so far appears negligible. For example, an October 2024 Challenger Report that tracks job cuts said that in the 17 months between May 2023 and September 2024, fewer than 17,000 jobs in the U.S. had been lost due to AI. On the surface, this contradicts the dire warnings. But does it? Or does it suggest that we are still in a gradual phase before a possible sudden shift?

History shows that technology-driven change does not always happen in a steady, linear fashion. Rather, it builds up over time until a sudden shift reshapes the landscape. In a recent Hidden Brain podcast on inflection points, researcher Rita McGrath of Columbia University referenced Ernest Hemingway’s 1926 novel The Sun Also Rises. When one character was asked how they went bankrupt, they answered: “Two ways. Gradually, then suddenly.” This could be an allegory for the impact of AI on jobs.

This pattern of change — slow and nearly imperceptible at first, then suddenly undeniable — has been experienced across business, technology and society. Malcolm Gladwell calls this a “tipping point,” or the moment when a trend reaches critical mass, then dramatically accelerates. In cybernetics — the study of complex natural and social systems — a tipping point can occur when recent technology becomes so widespread that it fundamentally changes the way people live and work. In such scenarios, the change becomes self-reinforcing. This often happens when innovation and economic incentives align, making change inevitable.

Gradually, then suddenly

While employment impacts from AI are (so far) nascent, that is not true of AI adoption. In a new survey by McKinsey, 78% of respondents said their organizations use AI in at least one business function, up more than 40% from 2023. Other research found that 74% of enterprise C-suite executives are now more confident in AI for business advice than colleagues or friends. The research also revealed that 38% trust AI to make business decisions for them, while 44% defer to AI reasoning over their own insights.
It is not only business executives who are increasing their use of AI tools. A new chart from the investment firm Evercore depicts increased use among all age groups over the last nine months, regardless of application (source: Business Insider). This data reveals both broad and growing adoption of AI tools. However, true enterprise AI integration remains in its infancy — just 1% of executives describe their gen AI rollouts as mature, according to another McKinsey survey. This suggests that while AI adoption is surging, companies have yet to fully integrate it into core operations in a way that might displace jobs at scale. But that could change quickly. If economic pressures intensify, businesses may not have the luxury of gradual AI adoption and may feel the need to automate fast.

Canary in the coal mine

One of the first job categories likely to be hit by AI is software development. Numerous AI tools based on large language models (LLMs) exist to augment programming, and soon the function could be entirely automated. Anthropic CEO Dario Amodei said recently, in remarks shared on Reddit, that “we’re 3 to 6 months from a world where AI is writing 90% of the code. And then in 12 months, we may be in a world where AI is writing essentially all of the code.”

This trend is becoming clear, as evidenced by startups in the winter 2025 cohort of incubator Y Combinator. Managing partner Jared Friedman said that 25% of this startup batch have 95% of their codebases generated by AI. He added: “A year ago, [the companies] would have built their product from scratch — but now 95% of it is built by an AI.”

The LLMs underlying code generation, such as Claude, Gemini, Grok, Llama and ChatGPT, are all advancing rapidly and increasingly perform well on an array of quantitative benchmark tests. For example, reasoning model o3 from OpenAI missed only one question on the 2024 American Invitational Mathematics Exam, scoring 97.7%, and achieved 87.7% on GPQA Diamond, which has graduate-level biology, physics and chemistry questions. Even more striking is a qualitative impression of the new GPT-4.5, as described in a Confluence post. GPT-4.5 correctly answered a broad and vague prompt that other models could not. This might not seem remarkable, but the authors noted: “This insignificant exchange was the first conversation with an LLM where we walked away thinking, ‘Now that feels like general intelligence.’” Did OpenAI just cross a threshold with GPT-4.5?

Tipping points

While software engineering may be among the first knowledge-worker professions to face widespread AI automation, it will not be the last. Many other white-collar


The new best AI image generation model is here: say hello to Reve Image 1.0!

Reve AI, Inc., an AI startup based in Palo Alto, California, has officially launched Reve Image 1.0, an advanced text-to-image generation model designed to excel at prompt adherence, aesthetics, and typography. This marks the company’s first release, with future tools expected to follow. Reve Image is currently available for free preview at preview.reve.art, allowing users to generate images from text descriptions without requiring advanced prompt engineering. The company has not yet announced API access or long-term pricing plans, nor is it clear if the model will be proprietary or made open source, and if so, under what license.

A new approach to AI imagery

Reve Image differentiates itself by aiming for a deeper understanding of user intent. It allows users to not only generate images from text but also modify existing images with simple language commands. Example modifications include changing colors, adjusting text, and altering perspectives. The model also supports uploading reference images, enabling users to create visuals that match a specific style or inspiration. One of the model’s standout capabilities is its strong text rendering performance, addressing a common challenge in AI-generated imagery — and making it more directly competitive with text-focused image models such as Ideogram, which are more valuable to those designing logos and branding. Additionally, early user tests suggest that Reve Image handles multi-character prompts more effectively than previous models.

Already topping the third-party benchmark charts

Reve Image has already been evaluated by third-party AI model testing service Artificial Analysis. In Artificial Analysis’s Image Arena, which ranks various image generation models based on user reviews and other quantitative metrics, Reve is currently in the lead at #1 for “image generation quality,” outperforming competitors such as Midjourney v6.1, Google’s Imagen 3, Recraft V3, and Black Forest Labs’ FLUX.1.1 [pro]. The benchmarking group highlighted Reve Image’s ability to generate clear and readable text within images, a historically difficult task for AI models. Before its official unveiling, Reve Image was known under the code name “Halfmoon” on social media, generating speculation and anticipation within the AI community.

Merging human and AI understanding to create better, higher quality, more lifelike images

Reve describes itself as a “small team of passionate researchers, builders, designers, and storytellers with big ideas.” The company is focused on developing creative tooling that enhances how users interact with AI-powered visuals. On X, Michaël Gharbi, Co-Founder and Research Scientist at Reve, shared insights into the company’s long-term vision, emphasizing the goal of building AI models that understand creative intent rather than merely generating visually plausible outputs. “Capturing creative intent requires advanced machine understanding of natural language and other interactions,” Gharbi said. “Our vision is to build a new semantic intermediate representation that both a human and a machine can understand, reason about, and operate on.” Other team members, including engineer Hunter Loftis and researcher Taesung Park, echoed the importance of bringing logic to AI-generated visuals.
Park compared current text-to-image models to early large language models (LLMs), stating that they often produce visually appealing but logically inconsistent results.

Early user reports show promise and limitations

Early user feedback on the AI-heavy subreddit r/singularity (on Reddit) has been largely positive, with many praising the model’s accurate prompt following, high-quality text rendering, and rapid generation speed. Some users have reported success in generating multi-character scenes and complex environments, areas where previous models often struggled. However, some challenges remain. Users have noted that Reve Image:

• Struggles with certain complex objects (e.g., transparent materials like a full wine glass).
• Has difficulty recognizing specific fictional characters (e.g., users trying to generate characters from video games found the model produced more generic results).
• Occasionally misplaces details in multi-object compositions.

Despite these hurdles, the team at Reve has been actively engaging with the user community and incorporating feedback into ongoing improvements.

In my own brief hands-on usage while drafting and creating the header image for this very article, I found Reve to be fairly intuitive and easy to use, with impressive visuals and prompt adherence. Like many AI image generators, there’s a prompt entry textbox, though unlike Midjourney and Ideogram, Reve puts it at the bottom of the website and leaves your generated content up top to fill the majority of the space. In addition, the prompt entry textbox contains four buttons below it for further fine adjustments to the image generation prompt sequence: an aspect ratio adjuster (with standard sizes ranging from 16:9 widescreen landscape to 9:16 portrait, like a smartphone); a selector for how many images you want to produce from each prompt (1, 2, 4 or 8); a toggle for prompt text enhancement (on by default, meaning Reve will automatically edit the text you type in based on what it thinks you want to see in your image, adding far more rich detail and visual language than you might initially include); and a “seed” button for choosing whether to use a specific numeric string from a previously generated image to guide the generations going forward. It’s far fewer settings and doesn’t include any visual-based editors like Midjourney, but the basics are there, and it should be more than enough for most casual AI image users to get started. My brief tests also showed it was on par with or better than Ideogram at rendering legible text baked into images (and far surpassing Midjourney), as well as on par with or exceeding Grok at rendering recognizable public figures (again, Midjourney and many other image generators prohibit this).

What’s next for Reve Image?

While the model is currently only available via the company’s website, there is growing anticipation for API access or potential open-source options. Users have also expressed interest in additional features like custom model training, control tools for animation, and integration with creative software. For now, Reve Image remains freely accessible at preview.reve.art, allowing users to explore its capabilities.


Google releases ‘most intelligent model to date,’ Gemini 2.5 Pro

Just a few months after releasing Gemini 2.0, and following the rise of DeepSeek, Google has announced its “most intelligent model” yet, Gemini 2.5, capable of reasoning and offering better performance and accuracy. Gemini 2.5 comes three months after Google released its previously most intelligent model family, Gemini 2.0, which introduced reasoning and agentic use cases. The new model is available as Gemini 2.5 Pro (experimental) on Google’s AI Studio and for Gemini Advanced users on the Gemini chat interface. It will be available on Vertex AI soon.

Koray Kavukcuoglu, CTO at Google DeepMind, said in a blog post that Gemini 2.5 represents the next step in Google’s goal of making “AI smarter and more capable of reasoning.” “Now, with Gemini 2.5, we’ve achieved a new level of performance by combining a significantly enhanced base model with improved post-training,” Kavukcuoglu wrote. “Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.”

More context and comprehension

Like Gemini 2.0 and Gemini 2.0 Flash Thinking, Gemini 2.5 Pro “thinks” before it responds. The new model can handle multimodal input from text, audio, images, videos and large datasets. Gemini 2.5 Pro can also understand entire code repositories for coding projects. Gemini 2.5 Pro offers some of the largest context windows available for experimental models on Gemini: it ships with a 1 million token context window and will expand to 2 million tokens soon. Google AI Studio product manager Logan Kilpatrick posted on X that Gemini 2.5 Pro is “the first experimental model with higher rate limits + billing.” Google plans to release pricing for Gemini 2.5 models soon.

Enhanced coding and reasoning performance

Google said the model leads in advanced reasoning benchmark tests. The company said Gemini 2.5 Pro “leads in math and science benchmarks like GPQA and AIME 2025.” Kavukcuoglu said the model also scored “a state-of-the-art 18.8% across models without tool use on Humanity’s Last Exam,” a dataset aiming to capture human knowledge and reasoning. Gemini 2.5 Pro also performs strongly on coding tasks and scored better than Gemini 2.0 in specific benchmarks. Google noted the new model “excels at creating visually compelling web apps and agentic code applications, along with code transformation and editing.”

A more competitive market

Gemini 2.5 Pro enters the reasoning model fray in a significantly changed environment from the one Gemini 2.0 launched into in December. The release of DeepSeek’s reasoning large language model (LLM) DeepSeek-R1 showed that powerful models can perform well at a fraction of the training and compute cost. Furthermore, DeepSeek showed that open-source models can compete with more closed-source LLMs, such as OpenAI’s o1 and o3 models. Besides DeepSeek’s ever-expanding model offerings, Google has to compete with OpenAI’s reasoning models. While the newest model from OpenAI is GPT-4.5—not a reasoning model—the company is still expected to develop more reasoning models soon.

Gemini 2.5 is Google’s second new model this month. In March, the company also released the latest version of its small language model, Gemma 3, which offers a 128,000-token context window and is positioned for use on on-the-go devices.
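Since Gemini 2.5 Pro is already exposed through Google AI Studio, here is a rough sketch of calling the experimental model through the google-genai Python SDK. The exact model identifier string is an assumption and may differ from what Google ultimately publishes, and pricing has not yet been announced.

```python
# Hedged sketch: calling Gemini 2.5 Pro (experimental) via the google-genai SDK.
# The model id below is an assumption; check Google AI Studio for the current name.
import os
from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

response = client.models.generate_content(
    model="gemini-2.5-pro-exp-03-25",  # assumed experimental model id
    contents="Read this repository layout and suggest a refactoring plan: ...",
)
print(response.text)
```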


Midjourney’s surprise: new research on making LLMs write more creatively

Midjourney is best known as one of the leading AI image generators — with nearly 20 million users on its Discord channel, according to third-party trackers, and presumably more atop that on its website — but its ambitions are beginning to expand. Following the news in late summer 2024 that it was building its own computing and AI hardware, the company this week released a new research paper alongside machine learning experts at New York University (NYU) on training text-based large language models (LLMs) such as Meta’s open source Llama and Mistral’s eponymous source models to write more creatively.

The collaboration, documented in a new research paper published on AI code community Hugging Face, introduces two new techniques — Diversified Direct Preference Optimization (DDPO) and Diversified Odds Ratio Preference Optimization (DORPO) — designed to expand the range of possible outputs while maintaining coherence and readability. For a company that is best known for its diffusion AI image generating models, Midjourney’s new approach to rethinking creativity in text-based LLMs shows that it is not limiting its ambitions to visuals, and that a picture may not actually be worth a thousand words.

Could a Midjourney-native LLM or fine-tuned version of an existing LLM be in the cards from the small, bootstrapped startup? I reached out to Midjourney founder David Holz but have yet to hear back. Regardless of a first-party Midjourney LLM offering, the implications of its new research go beyond academic exercises and could be used to help fuel a new wave of LLM training among enterprise AI teams, product developers, and content creators looking to improve AI-generated text. It also shows that despite recent interest and investment among AI model providers in new multimodal and reasoning language models, there’s still a lot of juice left to be squeezed, cognitively and performance-wise, from classic Transformer-based, text-focused LLMs.

The problem: AI-generated writing collapses around homogenous outputs

In domains like fact-based Q&A or coding assistance, LLMs are expected to generate a single best response. However, creative writing is inherently open-ended, meaning there are many valid responses to a single prompt. For an example provided by the Midjourney researchers, given a prompt like “Write a story about a dog on the moon,” the LLM could explore multiple diverse paths, such as:

• An astronaut’s pet dog accidentally left behind after a lunar mission.
• A dog who finds itself in a futuristic canine space colony.
• A stranded dog that befriends an alien species.

Despite this range of possibilities, instruction-tuned LLMs often converge on similar storylines and themes. This happens because:

• Post-training techniques prioritize user preference over originality, reinforcing popular but repetitive responses.
• Instruction tuning often smooths out variation, making models favor “safe” responses over unique ones.
• Existing diversity-promoting techniques (like temperature tuning) operate only at inference time, rather than being baked into the model’s learning process.

This leads to homogenized storytelling, where AI-generated creative writing feels repetitive and lacks surprise or depth.

The solution: modifying post-training methods to prioritize diversity

To overcome these limitations, the researchers introduced DDPO and DORPO, two extensions of existing preference optimization methods.
The core innovation in these approaches is the use of deviation—a measure of how much a response differs from others—to guide training. Here’s how it works (see the code sketch below for a rough illustration of this weighting):

• During training, the model is given a writing prompt and multiple possible responses.
• Each response is compared to others for the same prompt, and a deviation score is calculated.
• Rare but high-quality responses are weighted more heavily in training, encouraging the model to learn from diverse examples.

By incorporating deviation into Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), the model learns to produce high-quality but more varied responses. This method ensures that AI-generated stories do not converge on a single predictable structure, but instead explore a wider range of characters, settings, and themes—just as a human writer might.

What Midjourney’s researchers did to achieve this

The study involved training LLMs on creative writing tasks using a dataset from the subreddit r/writingPrompts, a Reddit community where users post prompts and respond with short stories. The researchers used two base models for their training:

• Meta’s Llama-3.1-8B (an 8-billion-parameter model from the Llama 3 series).
• Mistral-7B-v0.3 (a 7-billion-parameter model from Mistral AI).

Then, they took these models through the following processes:

• Supervised fine-tuning (SFT): The models were first fine-tuned using LoRA (Low-Rank Adaptation) to adjust parameters efficiently.
• Preference optimization: DPO and ORPO were used as baselines—these standard methods focus on improving response quality based on user preference signals. DDPO and DORPO were then applied, introducing deviation-based weighting to encourage more unique responses.
• Evaluation: Automatic evaluation measured semantic and stylistic diversity using embedding-based techniques, while human judges assessed whether outputs were diverse and engaging compared to GPT-4o and Claude 3.5.

Key training findings:

• DDPO significantly outperformed standard DPO in terms of output diversity while maintaining quality.
• Llama-3.1-8B with DDPO achieved the best balance of quality and diversity, producing responses that were more varied than GPT-4o while maintaining coherence.
• When dataset size was reduced, DDPO models still maintained diversity, though they required a certain number of diverse training samples to be fully effective.

Enterprise implications: what does it mean for those using AI to produce creative responses — such as in marketing copywriting, corporate storytelling, and film/TV/video game scripting?

For AI teams managing LLM deployment, enhancing output diversity while maintaining quality is a critical challenge. These findings have significant implications for organizations that rely on AI-generated content in applications such as:

• Conversational AI and chatbots (ensuring varied and engaging responses).
• Content marketing and storytelling tools (preventing repetitive AI-generated copy).
• Game development and narrative design (creating diverse dialogue and branching storylines).

For professionals responsible for fine-tuning and deploying models in an enterprise setting, this research provides:

• A new approach to LLM post-training that enhances creativity without sacrificing quality.
• A practical alternative to inference-time diversity tuning (such as temperature adjustments) by integrating diversity into the learning process itself.
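The paper’s exact loss formulation isn’t reproduced in this article, but the deviation-weighting idea described above can be sketched roughly as follows: score each candidate response by how far it sits from its siblings in embedding space, then scale a standard DPO-style preference loss by that score. Everything below (the cosine-distance deviation measure, the specific weighting scheme, the toy numbers) is an illustrative assumption, not the authors’ implementation.

```python
# Hedged sketch of deviation-weighted preference optimization (DDPO-style).
# Assumes response embeddings are precomputed with any sentence encoder;
# the paper's actual deviation measure and weighting may differ.
import torch
import torch.nn.functional as F

def deviation_scores(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (n_responses, dim) for ONE prompt.
    Deviation = mean cosine distance of each response to all the others."""
    sims = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
    n = embeddings.size(0)
    mean_sim = (sims.sum(dim=1) - 1.0) / (n - 1)  # exclude self-similarity
    return 1.0 - mean_sim                          # higher = more unusual response

def ddpo_style_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    dev_chosen, beta=0.1):
    """Standard DPO logits, scaled by the deviation of the chosen (preferred)
    response so that rare-but-preferred completions contribute more gradient."""
    logits = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_pair = -F.logsigmoid(logits)
    return (dev_chosen * per_pair).mean()

# Toy usage with random numbers: 4 candidate stories for one prompt.
emb = torch.randn(4, 384)
dev = deviation_scores(emb)
loss = ddpo_style_loss(
    logp_chosen=torch.tensor([-12.0]), logp_rejected=torch.tensor([-15.0]),
    ref_chosen=torch.tensor([-13.0]), ref_rejected=torch.tensor([-14.0]),
    dev_chosen=dev[:1], beta=0.1,
)
print(loss)
```

The design choice mirrors the article’s description: the preference signal still rewards quality, but the deviation multiplier keeps the optimizer from collapsing onto the single most popular storyline.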
