VentureBeat

Beyond RAG: How cache-augmented generation reduces latency, complexity for smaller workloads

Retrieval-augmented generation (RAG) has become the de facto way of customizing large language models (LLMs) for bespoke information. However, RAG comes with upfront technical costs and can be slow. Now, thanks to advances in long-context LLMs, enterprises can bypass RAG by inserting all the proprietary information in the prompt.

A new study by the National Chengchi University in Taiwan shows that by using long-context LLMs and caching techniques, you can create customized applications that outperform RAG pipelines. Called cache-augmented generation (CAG), this approach can be a simple and efficient replacement for RAG in enterprise settings where the knowledge corpus can fit in the model’s context window.

Limitations of RAG

RAG is an effective method for handling open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents that are relevant to the request and adds them as context to enable the LLM to craft more accurate responses.

However, RAG introduces several limitations to LLM applications. The added retrieval step introduces latency that can degrade the user experience. The result also depends on the quality of the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken down into smaller chunks, which can harm the retrieval process. And in general, RAG adds complexity to the LLM application, requiring the development, integration and maintenance of additional components. The added overhead slows the development process.

Cache-augmented retrieval

RAG (top) vs CAG (bottom) (source: arXiv)

The alternative to developing a RAG pipeline is to insert the entire document corpus into the prompt and have the model choose which bits are relevant to the request.
This approach removes the complexity of the RAG pipeline and the problems caused by retrieval errors. However, there are three key challenges with front-loading all documents into the prompt. First, long prompts slow down the model and increase the cost of inference. Second, the length of the LLM’s context window limits the number of documents that fit in the prompt. And finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its answers. So simply stuffing all your documents into the prompt instead of choosing the most relevant ones can end up hurting the model’s performance.

The proposed CAG approach leverages three key trends to overcome these challenges.

First, advanced caching techniques are making it faster and cheaper to process prompt templates. The premise of CAG is that the knowledge documents will be included in every prompt sent to the model. Therefore, you can compute the attention values of their tokens in advance instead of doing so when receiving requests. This upfront computation reduces the time it takes to process user requests. Leading LLM providers such as OpenAI, Anthropic and Google offer prompt caching for the repetitive parts of your prompt, which can include the knowledge documents and instructions that you insert at the beginning of your prompt. With Anthropic, you can reduce costs by up to 90% and latency by up to 85% on the cached parts of your prompt. Equivalent caching features have been developed for open-source LLM-hosting platforms.

Second, long-context LLMs are making it easier to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, GPT-4o supports 128,000 tokens and Gemini supports up to 2 million tokens. This makes it possible to include multiple documents or entire books in the prompt.

And finally, advanced training methods are enabling models to do better retrieval, reasoning and question-answering over very long sequences.
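The precomputation idea behind CAG can be sketched with a toy single-head attention example (all weights and token embeddings below are random stand-ins, not any real model): the keys and values for the knowledge documents are computed once offline, so each incoming request only has to project its own handful of query tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy embedding/head dimension

# Stand-ins for a trained model's query/key/value projection weights
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def kv(tokens):
    """Project token embeddings to keys and values."""
    return tokens @ W_k, tokens @ W_v

# --- Offline step: cache K/V for the knowledge documents once ---
doc_tokens = rng.standard_normal((100, d))   # stand-in for embedded corpus tokens
K_doc, V_doc = kv(doc_tokens)                # computed once, reused for every request

def answer(query_tokens, K_cache, V_cache):
    """Attend query tokens over cached document K/V plus their own K/V."""
    K_q, V_q = kv(query_tokens)              # only the short query is processed online
    K = np.vstack([K_cache, K_q])
    V = np.vstack([V_cache, V_q])
    Q = query_tokens @ W_q
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

query = rng.standard_normal((5, d))
out_cached = answer(query, K_doc, V_doc)

# Sanity check: the cached path matches recomputing document K/V from scratch
K_fresh, V_fresh = kv(doc_tokens)
out_fresh = answer(query, K_fresh, V_fresh)
assert np.allclose(out_cached, out_fresh)
```

The saving is exactly the offline step: in a real serving stack the cached quantities are the transformer's per-layer key/value tensors for the document prefix, which is what provider prompt-caching features store.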
In the past year, researchers have developed several LLM benchmarks for long-sequence tasks, including BABILong, LongICLBench and RULER. These benchmarks test LLMs on hard problems such as multiple retrieval and multi-hop question-answering. There is still room for improvement in this area, but AI labs continue to make progress.

As newer generations of models continue to expand their context windows, they will be able to process larger knowledge collections. Moreover, we can expect models to continue improving in their ability to extract and use relevant information from long contexts.

“These two trends will significantly extend the usability of our approach, enabling it to handle more complex and diverse applications,” the researchers write. “Consequently, our methodology is well-positioned to become a robust and versatile solution for knowledge-intensive tasks, leveraging the growing capabilities of next-generation LLMs.”

RAG vs CAG

To compare RAG and CAG, the researchers ran experiments on two widely recognized question-answering benchmarks: SQuAD, which focuses on context-aware Q&A from single documents, and HotPotQA, which requires multi-hop reasoning across multiple documents. They used a Llama-3.1-8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the basic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model itself determine which passages to use to answer the question.

Their experiments show that CAG outperformed both RAG systems in most situations.

CAG outperforms both sparse RAG (BM25 retrieval) and dense RAG (OpenAI embeddings) (source: arXiv)

“By preloading the entire context from the test set, our system eliminates retrieval errors and ensures holistic reasoning over all relevant information,” the researchers write.
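The "sparse RAG" baseline in these experiments, BM25, is simple enough to sketch from scratch. The toy scorer below (illustrative three-document corpus and standard k1/b parameters, not the paper's exact setup) ranks documents by term frequency weighted by inverse document frequency:

```python
import math
from collections import Counter

# Minimal BM25 scorer, pure stdlib; k1 and b are the usual default-ish values.
K1, B = 1.5, 0.75

docs = [
    "cache augmented generation preloads documents into the context window",
    "retrieval augmented generation fetches relevant chunks at query time",
    "long context models support hundreds of thousands of tokens",
]
tokenized = [d.split() for d in docs]
N = len(docs)
avgdl = sum(len(d) for d in tokenized) / N
df = Counter(t for d in tokenized for t in set(d))  # document frequency per term

def bm25(query, doc):
    """BM25 score of one tokenized document against a query string."""
    tf = Counter(doc)
    score = 0.0
    for t in query.split():
        if t not in tf:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        norm = K1 * (1 - B + B * len(doc) / avgdl)
        score += idf * tf[t] * (K1 + 1) / (tf[t] + norm)
    return score

query = "retrieval of relevant chunks"
ranked = sorted(range(N), key=lambda i: bm25(query, tokenized[i]), reverse=True)
```

Running this, the retrieval-centric second document ranks first for the query; a real RAG pipeline would feed the top-ranked chunks to the LLM, which is the step CAG removes.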
“This advantage is particularly evident in scenarios where RAG systems might retrieve incomplete or irrelevant passages, leading to suboptimal answer generation.”

CAG also significantly reduces the time to generate the answer, particularly as the reference text length increases.

Generation time for CAG is much smaller than RAG (source: arXiv)

That said, CAG is not a silver bullet and should be used with caution. It is well suited for settings where the knowledge base does not change often and is small enough to fit within the context window of the model. Enterprises should also be careful of cases where their documents contain conflicting facts based on the context of the documents, which might confound the


Early days for AI: Only 25% of enterprises have deployed, few reap rewards

2025 is anticipated to be the year AI gets real, bringing specific, tangible benefits to enterprises. However, according to a new State of AI Development Report from AI development platform Vellum, we’re not quite there yet: Just 25% of enterprises have deployed AI into production, and a quarter of those have yet to see measurable impact. This seems to indicate that many enterprises have not yet identified viable use cases for AI, keeping them (at least for now) in a pre-build holding pattern.

“This reinforces that it’s still pretty early days, despite all the hype and discussion that’s been happening,” Akash Sharma, Vellum CEO, told VentureBeat. “There’s a lot of noise in the industry, new models and model providers coming out, new RAG techniques; we just wanted to get a lay of the land on how companies are actually deploying AI to production.”

Enterprises must identify specific use cases to see success

Vellum interviewed more than 1,250 AI developers and builders to get a true sense of what’s happening in the AI trenches. Companies are in various stages of their AI journeys: building out and evaluating strategies and proofs of concept (53%), beta testing (14%) and, at the lowest level, talking to users and gathering requirements (7.9%).

By far the most enterprises are focused on building document parsing and analysis tools and customer service chatbots, according to Vellum. But they are also interested in applications incorporating analytics with natural language, content generation, recommendation systems, code generation and automation, and research automation.

So far, developers report competitive advantage (31.6%), cost and time savings (27.1%) and higher user adoption rates (12.6%) as the biggest impacts they’ve seen. Interestingly, though, 24.2% have yet to see any meaningful impact from their investments.
Sharma emphasized the importance of prioritizing use cases from the very start. “We’ve anecdotally heard from people that they just want to use AI for the sake of using AI,” he said. “There’s an experimental budget associated with that.”

While this makes Wall Street and investors happy, it doesn’t mean AI is actually contributing anything, he pointed out. “Something generally everyone should be thinking about is, ‘How do we find the right use cases?’” Usually, once companies are able to identify those use cases, get them into production and see a clear ROI, they get more momentum and get past the hype, he said. That results in more internal expertise and more investment.

OpenAI still at the top, but a mixture of models will be the future

When it comes to models used, OpenAI maintains the lead (no surprise there), notably its GPT-4o and GPT-4o mini. But Sharma pointed out that 2024 offered more options, either directly from model creators or through platform solutions like Azure or AWS Bedrock. And providers hosting open-source models such as Llama 3.2 70B are gaining traction, too, among them Groq, Fireworks AI and Together AI. “Open-source models are getting better,” said Sharma. “Closed-source competitors to OpenAI are catching up in terms of quality.”

Ultimately, though, enterprises aren’t going to stick with just one model; they will increasingly lean on multi-model systems, he forecasted. “People will choose the best model for each task at hand,” said Sharma. “While building an agent, you might have multiple prompts, and for each individual prompt the developer will want to get the best quality, lowest cost and lowest latency, and that may or may not come from OpenAI.”

Similarly, the future of AI is undoubtedly multimodal, with Vellum seeing a surge in adoption of tools that can handle a variety of tasks. Text is the undisputed top use case, followed by file creation (PDF or Word), images, audio and video.
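The per-prompt routing Sharma describes can be sketched as a simple policy (the model names, quality scores and prices below are made up for illustration): pick the cheapest model that clears the quality bar a given task requires.

```python
# Hypothetical model catalog: (name, quality score 0-1, $ per 1M tokens, latency s)
MODELS = [
    ("small-open-model", 0.70, 0.10, 0.4),
    ("mid-tier-model",   0.80, 0.60, 0.8),
    ("frontier-model",   0.95, 5.00, 1.5),
]

def route(min_quality):
    """Return the cheapest model whose quality meets the task's bar, else None."""
    candidates = [m for m in MODELS if m[1] >= min_quality]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m[2])[0]

simple_task = route(min_quality=0.65)  # e.g., a classification-style prompt
hard_task = route(min_quality=0.90)    # e.g., a complex reasoning step
```

A real router would also weigh latency and measured task accuracy rather than a single static quality score, but the structure (a per-prompt decision rather than a single fixed provider) is the point of the multi-model argument.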
Also, retrieval-augmented generation (RAG) is a go-to when it comes to information retrieval, and more than half of developers are using vector databases to simplify search. Top open-source and proprietary options include Pinecone, MongoDB, Qdrant, Elasticsearch, pgvector, Weaviate and Chroma.

Everyone’s getting involved (not just engineering)

Interestingly, AI is moving beyond just IT and becoming democratized across enterprises (akin to the old “it takes a village”). Vellum found that while engineering was most involved in AI projects (82.3%), they are being joined by leadership and executives (60.8%), subject matter experts (57.5%), product teams (55.4%) and design departments (38.2%). This is largely due to the ease of use of AI (as well as the general excitement around it), Sharma noted.

“This is the first time we’re seeing software being developed in a very, very cross-functional way, especially because prompts can be written in natural language,” he said. “Traditional software usually tends to be more deterministic. This is non-deterministic, which brings more people into the development fold.”

Still, enterprises continue to face big challenges — notably around AI hallucinations and prompts; model speed and performance; data access and security; and getting buy-in from important stakeholders. At the same time, while more non-technical users are getting involved, there is still a lack of pure technical expertise in-house, Sharma pointed out. “The way to connect all the different moving parts is still a skill that not that many developers have today,” he said. “So that’s a common challenge.” However, many existing challenges can be overcome by tooling, or platforms and services that help developers evaluate complex AI systems, he added.
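Under the hood, the vector databases mentioned above answer queries by nearest-neighbor search over embeddings. A minimal in-memory sketch (random toy vectors and made-up document IDs, not any product's actual API) shows the core operation:

```python
import numpy as np

rng = np.random.default_rng(1)

class VectorIndex:
    """Toy in-memory vector store: normalized embeddings + cosine-similarity search."""

    def __init__(self, dim):
        self.vecs = np.empty((0, dim))
        self.ids = []

    def add(self, doc_id, vec):
        v = vec / np.linalg.norm(vec)          # normalize so dot product == cosine
        self.vecs = np.vstack([self.vecs, v])
        self.ids.append(doc_id)

    def search(self, vec, k=1):
        q = vec / np.linalg.norm(vec)
        sims = self.vecs @ q                    # cosine similarity to every stored vector
        top = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i])) for i in top]

index = VectorIndex(dim=8)
for doc_id in ["pricing", "returns", "shipping"]:
    index.add(doc_id, rng.standard_normal(8))

# Probe with a vector close to the stored "returns" embedding
probe = index.vecs[1] + 0.01 * rng.standard_normal(8)
hits = index.search(probe, k=2)
```

Production systems replace the brute-force `argsort` with approximate nearest-neighbor indexes (HNSW and the like), but the store-normalized-vectors, query-by-similarity contract is the same.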
Developers can build tooling internally or adopt third-party platforms or frameworks; however, Vellum found that nearly 18% of developers are defining prompts and orchestration logic without any tooling at all. Sharma pointed out that “lack of technical expertise becomes [less of a problem] when you have proper tooling that can guide you through the development journey.” In addition to Vellum, frameworks and platforms used by survey participants include LangChain, LlamaIndex, Langfuse, CrewAI and Voiceflow.

Evaluations and ongoing monitoring are critical

Another way to overcome common problems (including hallucinations) is to perform evaluations, or use specific


You can now fine-tune your own version of AI image maker Flux with just 5 images

Black Forest Labs has quickly made a name for itself as the premier high-quality open-source AI image generation startup — even surpassing the quality of models offered by Stability AI, where Black Forest Labs’ founders previously worked. It briefly served as the default image generator in xAI’s Grok language model, too.

Credit: Artificial Analysis

Today, Black Forest Labs is taking this a step further, announcing the release of the FLUX Pro Finetuning API, a tool that empowers creators to customize generative AI models using their own images and concepts. Designed for professionals in marketing, branding, storytelling and other creative industries, the API enables the personalization of the company’s flagship FLUX Pro and FLUX Ultra models with a user-friendly approach.

Customization at scale

The FLUX Pro Finetuning API allows users to fine-tune generative text-to-image models with five to 20 training images, optionally accompanied by text descriptions. This process results in customized models that maintain the generative versatility of the base FLUX Pro models while aligning outputs with specific creative visions. The tool supports multiple modes, including “character,” “product,” “style” and “general,” making it adaptable for a wide variety of use cases.

The trained models can seamlessly integrate with endpoints such as FLUX.1 Fill, Depth, Canny and Redux, as well as with high-resolution generation capabilities of up to four megapixels. Whether for creating brand-consistent marketing visuals or detailed character art, the API enhances precision and adaptability in AI-generated content.

Practical applications and use cases for brands, marketers and more

Using the FLUX Pro Finetuning API, professionals can create customized models that preserve essential design elements, character consistency or brand properties.
A study conducted by Black Forest Labs showed that 68.9% of users preferred FLUX Pro’s fine-tuned results over competing services. Some highlighted applications include:

• Inpainting: Using FLUX.1 Fill for iterative edits to refine images
• Structural control: Integrating with FLUX.1 Depth to enhance image generation with precise structural adjustments
• Visual branding: Ensuring consistency across marketing materials and campaigns

Partnership with BurdaVerlag

Black Forest Labs has partnered with BurdaVerlag, a leading German media and entertainment company, to demonstrate the potential of the FLUX Pro Finetuning API. BurdaVerlag’s creative teams are using the tool to develop customized FLUX models tailored to their brands, such as the children’s publication Lissy PONY. With this integration, BurdaVerlag’s design teams can create visuals that reflect each brand’s identity while exploring new creative directions. The API has accelerated their production workflows, enabling high-quality content generation at scale.

Accessible pricing and availability

The FLUX Pro Finetuning API is now available via API endpoints through the FLUX.1 [dev] model. Pricing for all FLUX models on Black Forest Labs’ API is as follows:

• FLUX 1.1 [pro] Ultra: $0.06 per image
• FLUX 1.1 [pro]: $0.04 per image
• FLUX.1 [pro]: $0.05 per image
• FLUX.1 [dev]: $0.025 per image

Getting started is a cinch

The fine-tuning process requires minimal input from users. Training images in supported formats (JPG, JPEG, PNG or WebP) are uploaded, with resolutions capped at one megapixel for optimal results. Advanced configuration options allow for fine control over the training process, including iteration counts, learning rates and trigger words for precise prompt integration. Black Forest Labs has also provided extensive resources, including a Finetuning Beta Guide and Python scripts for easy implementation.
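As a practical illustration of the documented constraints (the helper below is hypothetical, not part of Black Forest Labs' scripts), preparing a training set amounts to checking file formats and scaling image dimensions under the one-megapixel cap while preserving aspect ratio:

```python
import math

MAX_PIXELS = 1_000_000  # the "one megapixel" cap mentioned in the finetuning docs
SUPPORTED = {".jpg", ".jpeg", ".png", ".webp"}

def fit_to_cap(width, height, max_pixels=MAX_PIXELS):
    """Scale (width, height) so width*height <= max_pixels, keeping aspect ratio."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    return max(1, int(width * scale)), max(1, int(height * scale))

def is_supported(filename):
    """True if the file extension is one of the documented training formats."""
    return any(filename.lower().endswith(ext) for ext in SUPPORTED)

w, h = fit_to_cap(4000, 3000)  # a 12-megapixel source image, scaled under the cap
```

An actual preprocessing script would then resize the pixels themselves (e.g., with an imaging library) before upload; this sketch only computes the target dimensions.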
Users can monitor progress, adjust parameters and test results directly via API endpoints, ensuring a smooth and efficient workflow. By combining versatility, ease of use and professional-grade outputs, the FLUX Pro Finetuning API sets a new benchmark for customized content creation in generative AI. With the tool now available, Black Forest Labs aims to transform how individuals and organizations approach personalized media generation, unlocking creative possibilities at an unprecedented scale.


Google’s new neural-net LLM architecture separates memory components to control exploding costs of capacity and compute

A new neural-network architecture developed by researchers at Google might solve one of the great challenges for large language models (LLMs): extending their memory at inference time without exploding the costs of memory and compute. Called Titans, the architecture enables models to find and store, during inference, small bits of information that are important in long sequences.

Titans combines traditional LLM attention blocks with “neural memory” layers that enable models to handle both short- and long-term memory tasks efficiently. According to the researchers, LLMs that use neural long-term memory can scale to millions of tokens and outperform both classic LLMs and alternatives such as Mamba while having many fewer parameters.

Attention layers and linear models

The classic transformer architecture used in LLMs employs the self-attention mechanism to compute the relations between tokens. This is an effective technique that can learn complex and granular patterns in token sequences. However, as the sequence length grows, the computing and memory costs of calculating and storing attention increase quadratically.

More recent proposals involve alternative architectures that have linear complexity and can scale without exploding memory and computation costs. However, the Google researchers argue that linear models do not show competitive performance compared to classic transformers, as they compress their contextual data and tend to miss important details.

The ideal architecture, they suggest, should have different memory components that can be coordinated to use existing knowledge, memorize new facts and learn abstractions from their context.
“We argue that in an effective learning paradigm, similar to [the] human brain, there are distinct yet interconnected modules, each of which is responsible for a component crucial to the learning process,” the researchers write.

Neural long-term memory

“Memory is a confederation of systems — e.g., short-term, working, and long-term memory — each serving a different function with different neural structures, and each capable of operating independently,” the researchers write.

To fill the gap in current language models, the researchers propose a “neural long-term memory” module that can learn new information at inference time without the inefficiencies of the full attention mechanism. Instead of storing information during training, the neural memory module learns a function that can memorize new facts during inference and dynamically adapt the memorization process based on the data it encounters. This solves the generalization problem that other neural network architectures suffer from.

To decide which bits of information are worth storing, the neural memory module uses the concept of “surprise.” The more a sequence of tokens differs from the kind of information stored in the model’s weights and existing memory, the more surprising it is and thus worth memorizing. This enables the module to make efficient use of its limited memory and only store pieces of data that add useful information to what the model already knows.

To handle very long sequences of data, the neural memory module has an adaptive forgetting mechanism that allows it to remove information that is no longer needed, which helps manage the memory’s limited capacity.

The memory module can be complementary to the attention mechanism of current transformer models, which the researchers describe as “short-term memory modules, attending to the current context window size.
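As a loose illustration of the surprise-plus-forgetting idea (a toy heuristic only; the paper's actual module is a learned neural network updated at inference time, not this hand-written rule), a memory can write a new item only when it is poorly explained by what is already stored, and decay old entries to free capacity:

```python
import numpy as np

class SurpriseMemory:
    """Toy memory: store a vector only if it is 'surprising', decay old entries."""

    def __init__(self, dim, threshold=0.5, decay=0.95):
        self.slots = np.empty((0, dim))
        self.threshold = threshold
        self.decay = decay

    def surprise(self, x):
        """Relative reconstruction error of x from the closest stored slot (1.0 if empty)."""
        if len(self.slots) == 0:
            return 1.0
        sims = self.slots @ x / (np.linalg.norm(self.slots, axis=1) * np.linalg.norm(x))
        recon = self.slots[np.argmax(sims)]
        return float(np.linalg.norm(x - recon) / np.linalg.norm(x))

    def write(self, x):
        self.slots *= self.decay                  # adaptive forgetting: old items fade
        if self.surprise(x) > self.threshold:     # only novel items are worth a slot
            self.slots = np.vstack([self.slots, x])
            return True
        return False

mem = SurpriseMemory(dim=4)
stored_first = mem.write(np.array([1.0, 0.0, 0.0, 0.0]))   # novel, so it is stored
stored_dup = mem.write(np.array([1.0, 0.01, 0.0, 0.0]))    # near-duplicate, skipped
```

The analogy to the paper: "surprise" there is measured through the memory network's prediction error on the incoming tokens, and forgetting is a learned gating term rather than a fixed decay constant.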
On the other hand, our neural memory with the ability to continuously learn from data and store it in its weights can play the role of a long-term memory.”

Titan architecture

Example of Titan architecture (source: arXiv)

The researchers describe Titans as a family of models that incorporate existing transformer blocks with neural memory modules. The model has three key components: the “core” module, which acts as the short-term memory and uses the classic attention mechanism to attend to the current segment of the input tokens that the model is processing; a “long-term memory” module, which uses the neural memory architecture to store information beyond the current context; and a “persistent memory” module, the learnable parameters that remain fixed after training and store time-independent knowledge.

The researchers propose different ways to connect the three components. But in general, the main advantage of this architecture is enabling the attention and memory modules to complement each other. For example, the attention layers can use the historical and current context to determine which parts of the current context window should be stored in the long-term memory. Meanwhile, long-term memory provides historical knowledge that is not present in the current attention context.

The researchers ran small-scale tests on Titan models, ranging from 170 million to 760 million parameters, on a diverse range of tasks, including language modeling and long-sequence language tasks. They compared the performance of Titans against various transformer-based models, linear models such as Mamba, and hybrid models such as Samba.

Titans (red line) outperforms other models, including GPT-4, on long-sequence tasks in both few-shot and fine-tuned settings (source: arXiv)

Titans demonstrated strong performance in language modeling compared to other models, outperforming both transformers and linear models of similar sizes.
The performance difference is especially pronounced in tasks on long sequences, such as “needle in a haystack,” where the model must retrieve bits of information from a very long sequence, and BABILong, where the model must reason across facts distributed in very long documents. In fact, in these tasks, Titans outperformed models with orders of magnitude more parameters, including GPT-4 and GPT-4o mini, and a Llama-3 model enhanced with retrieval-augmented generation (RAG).

Moreover, the researchers were able to extend the context window of Titans up to 2 million tokens while keeping memory costs modest. The models still need to be tested at larger sizes, but the results from the paper show that the researchers have still not hit the ceiling of Titans’ potential.

What does it mean for enterprise applications?

With Google being at the forefront of long-context models, we can expect this technique to find its way into private and open models such as Gemini and Gemma. With LLMs supporting longer context windows,


Runway’s new AI image generator Frames is here, and it looks fittingly cinematic

The AI media tech provider Runway has announced the release of Frames, its newest text-to-image generation model, and it’s winning early praise from users for producing highly cinematic visuals — a fitting compliment given Runway is known primarily as an AI video model provider. Could Frames dethrone Midjourney as the go-to choice for AI filmmakers and artists?

Announced back in November 2024, Frames was initially made available to selected Runway Creators Program ambassadors and power users over the last few weeks. As of today, it’s available to all through Runway’s Unlimited and Enterprise subscription plans, which cost $95 per month ($912 when paid annually) or, in the case of the Enterprise plan, $1,500 annually. The company has also posted a showcase of new images generated by users of Frames on its website, under the name “Worlds of Frames.”

Users can generate still images with it on Runway’s website at app.runwayml.com — if they have subscribed to the appropriate plan — and then, with one click, use the images as the basis for movies made with Runway’s image-to-video models such as Gen-3 Alpha Turbo.

According to Runway, Frames provides an advanced level of stylistic control and visual fidelity, making it a versatile tool for industries like editorial, art direction, pre-visualization, brand development and production. As Cristóbal Valenzuela, Runway’s cofounder and CEO, wrote in a post on the social network X: “Frames has been engineered from the ground up for professional creative work. If you’re in editorial, art direction, pre-vis, brand development, production, etc., this model is for you.” Valenzuela further noted that the model’s prompting system allows for precision and depth, enabling users to achieve nuanced, naturalistic and cinematically composed results.
Users seem to agree, with @GenMagic writing on X: “Very high quality, lots of control with style, and you can animate your images right in runway super quick. I felt like seeing some 90s aesthetic mall nostalgia, and Frames did not disappoint.” Or as user @AIandDesign put it: “Damn @runwayml Frames is completely insane. I can’t stop.”

Creativity meets consistency

Frames allows users to design worlds with specific points of view and aesthetic characteristics. Its ability to maintain stylistic consistency while offering broad creative exploration sets it apart from previous models, which are often more random and “roll the dice” in terms of user experience. With Frames, users can establish a distinct visual identity for their projects and reliably generate variations that remain true to that specific style.

In addition to enabling custom stylistic designs, Frames at launch comes with a library of 19 preset visual styles that users can select and further customize:

1. Vivid
2. Vivid Warm
3. Vivid Cool
4. High Contrast
5. High Contrast Warm
6. High Contrast Cool
7. B&W
8. B&W Contrast
9. Muted Pastel
10. Dreamscape
11. Nordic Minimal
12. Light Anime
13. Dark Anime
14. Painted Anime
15. 3D Cartoon
16. Sketch
17. Low Angle
18. In Motion
19. Terracotta

These presets demonstrate the wide range of creative possibilities offered by Frames, making it an ideal tool for artists, designers and filmmakers seeking stylistic precision.

Designed for creative pros

Frames offers several enhancements tailored for professional users. The model excels in rendering advanced textures, natural lighting and complex compositions, providing more flexibility and a departure from the rigid outputs of earlier image generation models. This release marks the first version of Frames, and Runway has outlined a clear roadmap for future updates, including more style tools and controls.
Safety and ethical considerations

Runway continues to prioritize safety and ethical responsibility in its generative AI tools. As part of the company’s Foundations for Safe Generative Media initiative, Frames includes robust content moderation features to prevent misuse. The company’s in-house visual moderation system detects and blocks harmful or inappropriate content, balancing creative freedom with safety.

To address concerns about misinformation and misuse, Frames embeds invisible watermarks in all AI-generated content. These watermarks comply with provenance standards set by the Coalition for Content Provenance and Authenticity (C2PA), allowing users to trace whether a media item is AI-generated.

Additionally, Runway is committed to improving fairness and representation in its generative models. Efforts have been made to reduce bias in visual outputs and to support diverse demographics and languages.

Nonetheless, Runway remains party to a lawsuit from human artists who accuse the company and others, such as Stability AI and Midjourney, of training on their artwork without permission, in violation of copyright. The case, Andersen v. Stability AI Ltd. (3:23-cv-00201), remains unresolved but is working its way through the courts.


Luma AI releases Ray2 generative video model with ‘fast, natural’ motion and better physics

Luma AI made waves with the launch of its Dream Machine generative AI video creation platform last summer. While that was only seven short months ago, the AI video space has advanced rapidly since, with the release of many new AI video creation models from rival startups in the U.S. and China, including Runway, Kling, Pika 2.0, OpenAI’s Sora, Google’s Veo 2, MiniMax’s Hailuo and open-source alternatives such as Hotshot and Genmo’s Mochi 1, to name but a few. Even Luma itself recently updated its Dream Machine platform to include new still image generation and brainstorming boards, and also debuted an iOS app.

But the updates continue: Today, the San Francisco-based Luma released Ray2, its newest AI video generation model, available now through its Dream Machine website and mobile apps for paying subscribers (to start). The model offers “fast, natural coherent motion and physics,” according to Luma AI cofounder and CEO Amit Jain on his X account, and was trained with 10 times more compute than the original Luma AI video model, Ray1. “This skyrockets the success rate of usable production-ready generations and makes video storytelling accessible to a lot more people,” Jain added.

Luma’s Dream Machine web platform offers a free tier with 720p generations capped at a variable number each month. Paid plans begin at $6.99 per month for “Lite,” which offers 1080p visuals, and run up through Plus ($20.99/month), Unlimited ($66.49/month) and Enterprise ($1,672.92/year).

A leap forward in video gen

Right now, Luma’s Ray2 is limited to text-to-video, allowing users to type in descriptions that are transformed into five- or 10-second video clips. The model can generate new videos in a matter of seconds, although right now it can take minutes at a time due to a crush of demand from new users.
Examples shared by Luma and early testers in its Creators program showcase the model’s versatility, including a man running through an Antarctic snowstorm surrounded by explosions, and a ballerina performing on an ice floe in the Arctic. Impressively, the motions in the example videos appear lifelike and fluid, often with subjects moving much faster and more naturally than videos from rival AI generators, which frequently appear to generate in slow motion. The model can even create realistic versions of surreal ideas such as a giraffe surfing, as X user @JeffSynthesized demonstrated. “Ray 2 is the real deal,” he wrote on X. Other AI video creators who have tried the new model seem to largely agree, with Jerrod Lew posting on X: “Improved cinematography, lighting and realism has arrived and it’s awesome.” “…it’s so good!” AI video artist Heather Cooper chimed in. My own tests were a mixed bag, with some more complex prompts creating unnatural and glitchy results. But when it did produce clips that resembled what I had in mind in my prompts, such as fencers crossing swords aboard a space station orbiting Jupiter, it was undeniably impressive. Jain said Luma will also add image-to-video, video-to-video and editing capabilities to Ray2 in the future, further expanding the tool’s creative possibilities. To celebrate the launch of Ray2, Luma Labs is hosting the Ray2 Awards, offering creators the chance to win up to $7,000 in prizes. These include: A large-scale award: The creator whose Ray2 content garners the most views on a single platform during the first week of launch will win $5,000. Submissions are due by January 22, 2025. A raffle for $3,000: Creators can enter by sharing Ray2 content on social media and engaging with Luma AI’s launch video. The deadline for participation is also January 22. Winners of both awards will be announced on January 27. 
Submissions can be uploaded via forms provided by Luma Labs, and creators are encouraged to use hashtags #Ray2 and #DreamMachine when sharing their work. Additionally, Luma Labs has launched an affiliate program, allowing participants to earn commissions by promoting its tools.

Luma AI releases Ray2 generative video model with ‘fast, natural’ motion and better physics Read More »

Edge computing’s rise will drive cloud consumption, not replace it

This article is part of VentureBeat’s special issue, “AI at Scale: From Vision to Viability.” Read more from this special issue here. The signs are everywhere that edge computing is about to transform AI as we know it. As AI moves beyond centralized data centers, we’re seeing smartphones running sophisticated language models locally, smart devices processing computer vision at the edge and autonomous vehicles making split-second decisions without cloud connectivity.  “A lot of attention in the AI space right now is on training, which makes sense in traditional hyperscale public clouds,” said Rita Kozlov, VP of product at Cloudflare. “You need a bunch of powerful machines close together to do really big workloads, and those clusters of machines are what are going to predict the weather, or model a new pharmaceutical discovery. But we’re right on the cusp of AI workloads shifting from training to inference, and that’s where we see edge becoming the dominant paradigm.” Kozlov predicts that inference will move progressively closer to users, either running directly on devices, as with autonomous vehicles, or at the network edge. “For AI to become a part of a regular person’s daily life, they’re going to expect it to be instantaneous and seamless, just like our expectations for web performance changed once we carried smartphones in our pockets and started to depend on it for every transaction,” she explained. “And because not every device is going to have the power or battery life to do inference, the edge is the next best place.” Yet this shift toward edge computing won’t necessarily reduce cloud usage as many predicted. Instead, the proliferation of edge AI is driving increased cloud consumption, revealing an interdependency that could reshape enterprise AI strategies. 
In fact, edge inference represents only the final step in a complex AI pipeline that depends heavily on cloud computing for data storage, processing and model training.  New research from Hong Kong University of Science and Technology and Microsoft Research Asia demonstrates just how deep this dependency runs, and why the cloud’s role may actually grow more vital as edge AI expands. The researchers’ extensive testing reveals the intricate interplay required between cloud, edge and client devices to make AI tasks work more effectively. How edge and cloud complement each other in AI deployments To understand exactly how this cloud-edge relationship works in practice, the research team constructed a test environment mirroring real-world enterprise deployments. Their experimental setup included Microsoft Azure cloud servers for orchestration and heavy processing, a GeForce RTX 4090 edge server for intermediate computation and Jetson Nano boards representing client devices. This three-layer architecture revealed the precise computational demands at each level. The key test involved processing user requests expressed in natural language. When a user asked the system to analyze a photo, GPT running on the Azure cloud server first interpreted the request, then determined which specialized AI models to invoke. For image classification tasks, it deployed a vision transformer model, while image captioning and visual questions used Bootstrapping Language-Image Pre-training (BLIP). This demonstrated how cloud servers must handle the complex orchestration of multiple AI models, even for seemingly simple requests. The team’s most significant finding came when they compared three different processing approaches. Edge-only inference, which relied solely on the RTX 4090 server, performed well when network bandwidth exceeded 300 KB/s, but faltered dramatically as speeds dropped. 
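The orchestration pattern the researchers describe, a cloud-hosted LLM interpreting each request and then routing it to a specialized model, can be sketched in a few lines. This is an illustrative sketch only; the keyword-based classifier and the dispatch table below are assumptions for demonstration, not the study's actual code.

```python
# Illustrative sketch of cloud-side model dispatch: a planner interprets
# the user's request, then routes it to a specialized model. The task
# labels and keyword rules are assumptions, not the paper's implementation.

def classify_request(user_request: str) -> str:
    """Stand-in for the cloud LLM that interprets the request."""
    text = user_request.lower()
    if "caption" in text or "describe" in text:
        return "image_captioning"
    if "?" in text:
        return "visual_question_answering"
    return "image_classification"

# Dispatch table: task -> specialized model (a vision transformer for
# classification; BLIP for captioning and visual QA, per the study).
DISPATCH = {
    "image_classification": "vit",
    "image_captioning": "blip",
    "visual_question_answering": "blip",
}

def route(user_request: str) -> str:
    return DISPATCH[classify_request(user_request)]
```

In the study's setup the planner was GPT running on Azure and the specialized models were a vision transformer and BLIP; here both are reduced to string labels to show the control flow only.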
Client-only inference running on the Jetson Nano boards avoided network bottlenecks but couldn’t handle complex tasks like visual question answering. The hybrid approach, splitting computation between edge and client, proved most resilient, maintaining performance even when bandwidth fell below optimal levels. These limitations drove the team to develop new compression techniques specifically for AI workloads. Their task-oriented method achieved remarkable efficiency: maintaining 84.02% accuracy on image classification while reducing data transmission from 224KB to just 32.83KB per instance. For image captioning, they preserved high-quality results (bilingual evaluation understudy, or BLEU, scores of 39.58 vs. 39.66) while slashing bandwidth requirements by 92%. These improvements demonstrate how edge-cloud systems must evolve specialized optimizations to work effectively. But the team’s federated learning experiments revealed perhaps the most compelling evidence of edge-cloud symbiosis. Running tests across 10 Jetson Nano boards acting as client devices, they explored how AI models could learn from distributed data while maintaining privacy. The system operated with real-world network constraints: 250 KB/s uplink and 500 KB/s downlink speeds, typical of edge deployments. Through careful orchestration between cloud and edge, the system achieved approximately 68% accuracy on the CIFAR10 dataset while keeping all training data local to the devices. CIFAR10 is a widely used dataset in machine learning (ML) and computer vision for image classification tasks. It consists of 60,000 color images, each 32×32 pixels in size, divided into 10 different classes. The dataset includes 6,000 images per class, with 5,000 for training and 1,000 for testing.  
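The federated setup described above, local training on each device with cloud-side aggregation of model updates, follows the standard federated averaging pattern. A minimal sketch under that assumption (plain Python lists stand in for model weights; this is not the researchers' implementation):

```python
# Minimal federated-averaging sketch: each client trains locally, the
# server averages the resulting weights without ever seeing raw data.
# This mirrors the pattern described in the study, not its actual code.

def local_update(weights, local_delta):
    """Stand-in for one client's local training step."""
    return [w + d for w, d in zip(weights, local_delta)]

def federated_average(client_weights):
    """Cloud-side aggregation: average each parameter across clients."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

# Example: a global model with two parameters, three simulated clients.
global_weights = [0.0, 0.0]
client_deltas = [[0.3, -0.1], [0.1, 0.2], [0.2, 0.5]]
updates = [local_update(global_weights, d) for d in client_deltas]
new_global = federated_average(updates)
```

In the real system, each update would also pass through the compression stage mentioned above before transmission, which is where the bandwidth savings come from.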
This success required an intricate dance: edge devices running local training iterations, the cloud server aggregating model improvements without accessing raw data and a sophisticated compression system to minimize network traffic during model updates. This federated approach proved particularly significant for real-world applications. For visual question-answering tasks under bandwidth constraints, the system maintained 78.22% accuracy while requiring only 20.39KB per transmission, nearly matching the 78.32% accuracy of implementations that required 372.58KB. The dramatic reduction in data transfer requirements, combined with strong accuracy preservation, demonstrated how cloud-edge systems could maintain high performance even in challenging network conditions. Architecting for edge-cloud The research findings present a roadmap for organizations planning AI deployments, with implications that cut across network architecture, hardware requirements and privacy frameworks. Most critically, the results suggest that attempting to deploy AI solely at the edge or solely in the cloud leads to significant compromises in performance and reliability. Network architecture emerges as a critical consideration. While the study showed that high-bandwidth tasks like visual question answering need up to

Edge computing’s rise will drive cloud consumption, not replace it Read More »

The path forward for gen AI-powered code development in 2025

This article is part of VentureBeat’s special issue, “AI at Scale: From Vision to Viability.” Read more from this special issue here. Three years ago, AI-powered code development was mostly just GitHub Copilot.  GitHub’s AI-powered developer tool amazed developers with its ability to help with code completion and even generate new code. Now, at the start of 2025, a dozen or more generative AI coding tools and services are available from vendors big and small. AI-powered coding tools now provide sophisticated code generation and completion features, and support an array of programming languages and deployment patterns.  The new class of software development tools has the potential to completely revolutionize how applications are built and delivered, or so many vendors claim. Some observers have worried that these new tools will spell the end of professional coding as we know it. What’s the reality? How are the tools actually making an impact today? Where do they fall short, and where is the market headed in 2025? “This past year, AI tools have become increasingly essential for developer productivity,” Mario Rodriguez, chief product officer at GitHub, told VentureBeat.  The enterprise efficiency promise of gen AI-powered code development So what can gen AI-powered code development tools do now? Rodriguez said that tools like GitHub Copilot can already generate 30-50% of code in certain workflows. The tools can also help automate repetitive tasks and assist with debugging and learning. They can even serve as a thought partner to help developers go from idea to application in minutes. “We’re also seeing that AI tools not only help developers write code faster, but also write better quality code,” Rodriguez said. 
“In our latest controlled developer study we found that code written with Copilot is not only easier to read but also more functional — it’s 56% more likely to pass unit tests.” While GitHub Copilot is an early pioneer in the space, other more recent entrants are seeing similar gains. One of the hottest vendors in the space is Replit, which has developed an AI-agent approach to accelerate software development. According to Amjad Masad, CEO of Replit, gen AI-powered coding tools can make coding anywhere from 10% to 40% faster for professional engineers. “The biggest beneficiaries are front-end engineers, where there is so much boilerplate and repetition in the work,” Masad told VentureBeat. “On the other hand, I think it’s having less impact on low-level software engineers where you have to be careful with memory management and security.” What’s more exciting for Masad isn’t the impact of gen AI coding on existing developers, but rather the impact it can have on others. “The most exciting thing, at least from the perspective of Replit, is that it can make non-engineers into junior engineers,” Masad said. “Suddenly, anyone can create software with code. This can change the world.” Certainly, gen AI-powered coding tools have the potential to democratize development and improve professional developers’ efficiency. That said, they aren’t a panacea and do have some limitations, at least for now. “For simple, isolated projects, AI has made remarkable progress,” Itamar Friedman, cofounder and CEO of Qodo, told VentureBeat. Qodo (formerly Codium AI) is building out a series of AI agent-driven enterprise application development tools. Friedman said that using automated AI tools, anyone can now create basic websites faster and with more personalization than traditional website builders can.  “However, for complex enterprise software that powers Fortune 5000 companies, AI isn’t yet capable of full end-to-end automation,” Friedman noted. 
“It excels at specific tasks, like question-answering on complex code, line completion, test generation and code reviews.” Friedman argued that the core challenge is in the complexity of enterprise software. In his view, pure large language model (LLM) capabilities on their own can’t handle this complexity.  “Simply using AI to generate more lines of code could actually worsen code quality — which is already a significant problem in enterprise settings,” Friedman said. “So the reason that we don’t see huge adoption yet is because there are still more advances in technology, engineering and machine learning that need to be achieved in order for AI solutions to fully understand complicated enterprise software.” Friedman said that Qodo is addressing that issue by focusing on understanding complex code, indexing it, categorizing it and understanding organizational best practices to generate meaningful tests and code reviews. Another barrier to broader adoption and deployment is legacy code. Brandon Jung, VP of ecosystem at gen AI development vendor Tabnine, told VentureBeat that he sees a lack of quality data preventing wider adoption of AI coding tools.  “For enterprises, many have large, old code bases and that code is not well understood,” Jung said. “Data has always been critical to machine learning and that is no different with gen AI for code.” Towards fully agentic AI-driven code development in 2025 No single LLM can handle everything required for modern enterprise software development. That’s why leading vendors have embraced an agentic AI approach. Qodo’s Friedman expects that in 2025 the features that seemed revolutionary in 2022 — like autocomplete and simple code chat functions — will become commoditized.  “The real evolution will be towards specialized agentic workflows — not one universal agent, but many specialized ones each excelling at specific tasks,” Friedman said. 
“In 2025 we’re going to see many of these specialized agents developed and deployed until eventually, when there are enough of these, we’re going to see the next inflection point, where agents can collaborate to create complex software.” It’s a direction that GitHub’s Rodriguez sees as well. He expects that throughout 2025, AI tools will continue to evolve to assist developers throughout the entire software lifecycle. That’s more than just writing code; it’s also building, deploying, testing, maintaining and even fixing software. Humans will not be replaced in this process, they will be augmented with AI that will make things faster and more efficient. “This is going to be accomplished with the

The path forward for gen AI-powered code development in 2025 Read More »

4 bold AI predictions for 2025

This article is part of VentureBeat’s special issue, “AI at Scale: From Vision to Viability.” Read more from this special issue here. As we wrap up 2024, we can look back and acknowledge that artificial intelligence has made impressive and groundbreaking advances. At the current pace, predicting what kind of surprises 2025 has in store for AI is virtually impossible. But several trends paint a compelling picture of what enterprises can expect in the coming year and how they can prepare themselves to take full advantage. The plummeting costs of inference In the past year, the costs of frontier models have steadily decreased. The price per million tokens of OpenAI’s top-performing large language model (LLM) has dropped more than 200-fold in the past two years.  One key factor driving down the price of inference is growing competition. For many enterprise applications, most frontier models will be suitable, which makes it easy to switch from one to another, shifting the competition to pricing. Improvements in accelerator chips and specialized inference hardware are also making it possible for AI labs to provide their models at lower costs.  To take advantage of this trend, enterprises should start experimenting with the most advanced LLMs and build application prototypes around them even if the costs are currently high. The continued reduction in model prices means that many of these applications will soon be scalable. At the same time, the models’ capabilities continue to improve, which means you can do a lot more with the same budget than you could in the past year.  The rise of large reasoning models The release of OpenAI o1 has triggered a new wave of innovation in the LLM space. 
The trend of letting models “think” for longer and review their answers is making it possible for them to solve reasoning problems that were impossible with single-inference calls. Even though OpenAI has not released o1’s details, its impressive capabilities have triggered a new race in the AI space. There are now many open-source models that replicate o1’s reasoning abilities and are extending the paradigm to new fields, such as answering open-ended questions. Advances in o1-like models, which are sometimes referred to as large reasoning models (LRMs), can have two important implications for the future. First, given the immense number of tokens that LRMs must generate for their answers, we can expect hardware companies to be more incentivized to create specialized AI accelerators with higher token throughput.  Second, LRMs can help address one of the important bottlenecks of the next generation of language models: high-quality training data. There are already reports that OpenAI is using o1 to generate training examples for its next generation of models. We can also expect LRMs to help spawn a new generation of small specialized models that have been trained on synthetic data for very specific tasks. To take advantage of these developments, enterprises should allocate time and budget to experimenting with the possible applications of frontier LRMs. They should always test the limits of frontier models, and think about what kinds of applications would be possible if the next generation of models overcome those limitations. Combined with the ongoing reduction in inference costs, LRMs can unlock many new applications in the coming year. Transformer alternatives are picking up steam The memory and compute bottleneck of transformers, the main deep learning architecture used in LLMs, has given rise to a field of alternative models with linear complexity. The most popular of these architectures, the state-space model (SSM), has seen many advances in the past year. 
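To make the contrast with transformers concrete, an SSM processes a sequence with a linear recurrence, one state update per token, which is why its cost grows linearly with sequence length rather than quadratically as with attention. A scalar toy version follows; real SSMs such as Mamba use learned, structured matrices, and the coefficients here are arbitrary illustrations:

```python
# Toy scalar state-space recurrence: x_t = a*x_{t-1} + b*u_t, y_t = c*x_t.
# Real SSMs use learned matrices and structured parameterizations; this
# just illustrates the linear-time scan that replaces quadratic attention.

def ssm_scan(inputs, a=0.5, b=1.0, c=2.0):
    x = 0.0                      # hidden state carried across the sequence
    outputs = []
    for u in inputs:             # one pass: O(sequence length)
        x = a * x + b * u        # state update
        outputs.append(c * x)    # readout
    return outputs

print(ssm_scan([1.0, 0.0, 0.0]))  # impulse response decays: [2.0, 1.0, 0.5]
```

Because the state is a fixed-size summary of everything seen so far, memory stays constant no matter how long the input grows, which is the efficiency advantage the linear-complexity models trade on.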
Other promising models include liquid neural networks (LNNs), which use new mathematical equations to do a lot more with far fewer artificial neurons and compute cycles.  In the past year, researchers and AI labs have released pure SSM models as well as hybrid models that combine the strengths of transformers and linear models. Although these models have yet to perform at the level of cutting-edge transformer-based models, they are catching up fast and are already orders of magnitude faster and more efficient. If progress in the field continues, many simpler LLM applications can be offloaded to these models and run on edge devices or local servers, where enterprises can use bespoke data without sending it to third parties. Changes to scaling laws The scaling laws of LLMs are constantly evolving. The release of GPT-3 in 2020 proved that scaling model size would continue to deliver impressive results and enable models to perform tasks for which they were not explicitly trained. In 2022, DeepMind released the Chinchilla paper, which set a new direction in data scaling laws. Chinchilla showed that by training a model on a dataset containing several times more tokens than the model has parameters, you can continue to gain improvements. This development enabled smaller models to compete with frontier models with hundreds of billions of parameters. Today, there is fear that both of those scaling laws are nearing their limits. Reports indicate that frontier labs are experiencing diminishing returns on training larger models. At the same time, training datasets have already grown to tens of trillions of tokens, and obtaining quality data is becoming increasingly difficult and costly.  Meanwhile, LRMs are promising a new vector: inference-time scaling. Where model and dataset size fail, we might be able to break new ground by letting the models run more inference cycles and fix their own mistakes. 
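The inference-time scaling idea, spending extra model calls to review and repair an answer instead of accepting the first draft, can be approximated at the application level with a generate-critique-revise loop. The sketch below uses a stubbed model call; the prompts and the stub's behavior are assumptions for illustration, and real LRMs perform this kind of review internally during decoding.

```python
# Toy generate-critique-revise loop illustrating inference-time scaling:
# spend extra inference cycles reviewing and fixing an answer rather
# than accepting the first draft. model() is a stub standing in for an
# LLM call; its canned behavior exists only to make the loop observable.

def model(prompt: str) -> str:
    # Stub: returns a draft first, and a "revised" answer when asked to revise.
    return "revised answer" if prompt.startswith("Revise") else "draft answer"

def answer_with_review(question: str, max_rounds: int = 3) -> str:
    answer = model(question)
    for _ in range(max_rounds):               # the extra inference cycles
        critique = model(f"Critique: {answer}")
        answer = model(f"Revise '{answer}' using critique '{critique}'")
        if answer == "revised answer":        # stop once a revision is accepted
            break
    return answer
```

The trade-off is exactly the one the article describes: each round of review multiplies the number of tokens generated, which is why LRMs put new pressure on accelerator throughput.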
As we enter 2025, the AI landscape continues to evolve in unexpected ways, with new architectures, reasoning capabilities, and economic models reshaping what’s possible. For enterprises willing to experiment and adapt, these trends represent not just technological advancement, but a fundamental shift in how we can harness AI to solve real-world problems.

4 bold AI predictions for 2025 Read More »

Nvidia tackles agentic AI safety and security with new NeMo Guardrails NIMs

As the use of agentic AI continues to grow, so too does the need for safety and security. Today, Nvidia announced a series of updates to its NeMo Guardrails technology designed specifically to address the needs of agentic AI. The basic idea behind guardrails is to provide some form of policy and control for large language models (LLMs) to help prevent unauthorized and unintended outputs. The guardrails concept has been broadly embraced in recent years by multiple vendors, including AWS. The new NeMo Guardrails updates from Nvidia are designed to make it easier for organizations to deploy and to provide more granular types of controls. NeMo Guardrails are now available as NIMs (Nvidia Inference Microservices), which are optimized for Nvidia’s GPUs. Additionally, there are three new specific NIM services that enterprises can deploy for content safety, topic control and jailbreak detection. The guardrails have been optimized for agentic AI deployments, rather than just singular LLMs. “It’s not just about guard-railing a model anymore,” Kari Briski, VP for enterprise AI models, software and services at Nvidia, said in a press briefing. “It’s about guard-railing a total system.” What the new NeMo Guardrails bring to enterprise agentic AI Agentic AI use is expected to be a dominant trend in 2025.  While agentic AI has plenty of benefits, it also brings new challenges, particularly around security, data privacy and governance requirements, which can create significant barriers to deployment. The three new NeMo Guardrails NIMs are intended to help solve some of those challenges. They include: Content Safety NIM: Trained on Nvidia’s Aegis content safety dataset with 35,000 human-annotated samples, this service blocks harmful, toxic and unethical content. 
Topic Control NIM: Helps ensure that AI interactions remain within predefined topical boundaries, preventing conversation drift and unauthorized information disclosure. Jailbreak Detection NIM: Helps prevent security bypasses through clever hacks, leveraging training data from 17,000 known successful jailbreaks. Complexity of safeguarding agentic AI systems The complexity of safeguarding agentic AI systems is significant, as they can involve multiple interconnected agents and models.  Briski provided an example of a retail customer service agent scenario. Consider a person interacting with at least three agents: a reasoning LLM, a retrieval-augmented generation (RAG) agent and a customer service assistant agent. All are required to enable the live agent.  “Depending on the user interaction, many different LLMs or interactions can be made, and you have to guardrail each one of them,” said Briski. While there is complexity, she noted that a key goal with NeMo Guardrails NIMs is to make it easier for enterprises. As part of today’s rollout, Nvidia is also providing blueprints to demonstrate how the different guardrail NIMs can be deployed for varying scenarios, including customer service and retail. How Nvidia guardrails impact agentic AI performance Another primary concern for enterprises deploying agentic AI is performance.  Briski said that as enterprises deploy agentic AI, there can be concern about introducing latency by adding guardrails.  “I think as people were initially trying to add guardrails in the past, they were applying larger LLMs to try and guardrail,” she explained.  The latest NeMo Guardrails NIMs have been fine-tuned and optimized to address latency concerns. Nvidia’s early testing shows that organizations can get 50% better protection with guardrails that add only approximately half a second of latency. 
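The layered pattern described here, running several independent checks on a request before it ever reaches the model, can be sketched generically. To be clear, this is not Nvidia's NeMo Guardrails API; the three check functions below are toy keyword-based stand-ins that mirror the content safety, topic control and jailbreak detection categories.

```python
# Generic guardrails pipeline sketch: run every check on the input and
# refuse the request if any check fails, otherwise call the model.
# The check logic is a toy stand-in, not Nvidia's NeMo Guardrails API.

BLOCKED_TOPICS = {"payroll data"}                    # topic-control example
JAILBREAK_PHRASES = {"ignore previous instructions"} # jailbreak example
TOXIC_TERMS = {"toxic-example-term"}                 # content-safety example

def content_safety(text: str) -> bool:
    return not any(t in text.lower() for t in TOXIC_TERMS)

def topic_control(text: str) -> bool:
    return not any(t in text.lower() for t in BLOCKED_TOPICS)

def jailbreak_detection(text: str) -> bool:
    return not any(p in text.lower() for p in JAILBREAK_PHRASES)

GUARDS = [content_safety, topic_control, jailbreak_detection]

def guarded_call(model, prompt: str) -> str:
    if not all(guard(prompt) for guard in GUARDS):
        return "Request refused by guardrails."
    return model(prompt)

# Usage with a stub model:
echo = lambda p: f"model output for: {p}"
```

In an agentic system, as Briski notes, a pipeline like this would wrap every agent-to-agent and agent-to-LLM hop, not just the user-facing call, which is why per-check latency matters so much.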
“This is really important when deploying agents, because as we know, it’s not just one agent, there are multiple agents that could be within an agentic system,” said Briski. Nvidia NeMo Guardrails NIMs for agentic AI are available under the Nvidia AI enterprise license, which currently costs $4,500 per GPU per year. Developers can try them out for free under an open source license, as well as on build.nvidia.com.

Nvidia tackles agentic AI safety and security with new NeMo Guardrails NIMs Read More »