VentureBeat

Midjourney launches AI image editor: how to use it

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

Midjourney, the hit AI image generation startup founded and run by former Magic Leap engineer David Holz, is wowing users with a new feature unveiled last night: AI image editing. As a good portion of Midjourney’s 20 million+ users (including some of us at VentureBeat) likely know, Midjourney previously allowed users to upload their own images gathered outside of the service to its alpha web interface and/or Discord server to serve as a reference for its AI image generator diffusion models — the latest one being Midjourney 6.1. After receiving an uploaded reference image, the Midjourney AI model is able to generate new images based on the user’s provided file. However, this reference feature didn’t actually make any alterations to the source image — merely using it as a kind of loose starting point. Now, with Midjourney’s new “Edit” feature, users can upload any image of their choosing and actually edit sections of it with AI, or change the style and texture of it from the source to something totally different, such as turning a vintage photograph into anime — while preserving most of the image’s subjects and objects and spatial relationships. It even works on doodles and hand drawings that the user submits, turning scribbles into full art pieces in seconds. Midjourney posted a video demo showing how to use the new features. VentureBeat uses Midjourney and other AI tools to create content for our website, social channels and other formats. Note that despite its popularity, Midjourney is one of several AI companies being sued in a class action by human artists for alleged copyright infringement due to its scraping of human-created works without express permission, authorization, consent, or compensation to train its models. The case remains in court for now.
The Midjourney Image Editor appears to be restricted to the company’s latest AI model, Midjourney 6.1, which makes sense. In a message to Midjourney’s Discord community, Holz wrote: “All of these things are very new, and we want to give the community and human moderation staff time to ease into it gently…” As a consequence, the new Midjourney Editor feature is for now restricted to users who have generated more than 10,000 images with the service, those with annual paid memberships, and those who have been subscribers for a year or more. However, if you fit those criteria, you can use the new Midjourney Image Editor by following the directions below.

How to find and start using Midjourney’s Image Editor

The new Midjourney Image Editor is only available on the alpha web interface, at alpha.midjourney.com. Once there and signed in, the qualifying user should see a new button along the left sidebar menu about halfway down with an icon showing a small pencil on a pad. Hovering over it shows the label “Edit” (or the text will display persistently on its own if your browser window is wide enough). Clicking on this should pull up the new Editor screen, which should prompt the user with two major options: “Edit from URL” and “Edit Uploaded Image.” The latter requires the user to have a file saved on their machine, whereas the former can accept a wide range of images hosted on various websites such as Wikimedia Commons, if the user simply pastes in the correct link to the web-hosted image. For purposes of this article, I included a URL to an image of a concept car from Wikimedia Commons. Once a copy of the file is uploaded to Midjourney via the URL or the user’s own file repository, the image should appear in the middle of the new editor screen. You’ll note there are a wide variety of options and various buttons on the left inner sidebar menu that users can select to modify the image with Midjourney 6.1, including “1. Erase,” which allows the user to remove and paint over portions of the image with AI using a brush and a text prompt; “2. Move/Resize,” which allows the user to move the image around the virtual canvas and extend its edges with new matching AI imagery; and “3. Restore,” which is the inverse of Erase and lets the user retain any portions of the source image that they accidentally painted over with the Erase brush. The user can control the brush size with a slider on the left sidebar as well as the “scale” of the image, zooming in or out, and the aspect ratio itself with more presets below that. There’s also a “Suggest Prompt” button, which Midjourney explains via helpful hover-over text is designed to aid the user in generating a prompt describing the image they’ve just uploaded — in case they want to alter that prompt or use it to generate a whole new similar image. The suggested prompt text should automatically appear in the prompt entry box/bar at the top of the screen. Looking at our concept car example, I went ahead and used the Erase brush tool on the driver and used the text prompt entry bar at the top of the Midjourney web interface to replace the driver with a “flaming skeleton driving.” After I typed my text prompt in the top entry bar/box, I hit the button marked “Submit Edit” (or Enter on my keyboard) to apply the changes. As with Midjourney’s raw image generator, the Editor creates four versions automatically for each text prompt — visible on the right sidebar under the “Submit” button. The user can then choose to keep making new changes to the resulting image, upscale it with Midjourney’s built-in upscaler via a button below, or download it as is.

Retexturing turns images into new adaptations in different styles

In addition, the discerning reader and Midjourney user will note there was also another whole set of options

Qualcomm unveils Snapdragon Elite platforms for automotive

Qualcomm unveiled its Snapdragon Elite platforms for automotive applications at its Snapdragon Summit event today. Powered by Qualcomm Oryon — the company’s fastest central processing unit (CPU) — these new platforms are the latest additions to the Snapdragon Digital Chassis (first introduced in 2022) portfolio. They’re designed to bring intelligence to next-generation vehicles. Automakers have the option to utilize Snapdragon Cockpit Elite to power advanced digital experiences and Snapdragon Ride Elite to power automated driving capabilities. Through the platforms’ flexible architecture, automakers will also have an option to seamlessly combine both digital cockpit and automated driving functionalities on the same SoC – an innovative capability available on Snapdragon Digital Chassis solutions. “Qualcomm Technologies remains at the forefront of innovation with platforms like Snapdragon Cockpit Elite and Ride Elite, as the automotive industry evolves towards centralized computing, software-defined vehicles and AI-driven architectures,” said Nakul Duggal, group manager, automotive, industrial, and cloud at Qualcomm, in a statement. “With our strongest performing compute, graphics and AI capabilities, coupled with industry leading power efficiency and cutting-edge software enablement for digital cockpits and automated driving, these new Elite Snapdragon automotive platforms address the industry’s needs for higher compute levels, empowering automakers to redefine automotive experiences for their customers.” Ana Arnold, who works in product and technology marketing at Qualcomm, said in a product briefing that software-defined technology and AI are driving rapid change in cars. She noted hundreds of millions of vehicles already use Qualcomm tech on the road.
The new platforms serve autonomous driving, in-cabin car systems, or both through platforms that combine the different chips, Arnold said.

Unified architecture with higher AI performance

The dedicated Neural Processing Unit (NPU), designed for multimodal AI, offers a 12-times performance boost over previous cockpit platforms, enabling real-time external environment and cabin data processing. This advance facilitates live decision-making, adaptive responses, and proactive assistance, ensuring personalized in-cabin car experiences. Mark Granger, senior director of product management, said in a press briefing that the Snapdragon Cockpit Elite is focused on the interior cabin space, like the dashboard. The Snapdragon Ride Elite, meanwhile, is focused on ADAS, or Advanced Driver Assistance Systems, and autonomous vehicles. “We see the advent of large language models and multimodal models that are adept at running on the edge,” Granger said. “So we’re very excited with what we see as a huge leap in performance and capability.” To date, LLMs have had real challenges understanding and responding to Chinese and Japanese speakers. The new tech can now address this. Granger said the company is not yet disclosing the exact amount of TOPS performance when it comes to AI processing. Equipped with transformer accelerators and vector engines, along with mixed precision support, the NPU in Snapdragon Ride Elite is designed to deliver low-latency, highly accurate, and efficient end-to-end transformers, maintaining optimal power and performance. The heterogeneous platform seamlessly runs multiple applications without performance loss, offering exceptional concurrency and multitasking for numerous cameras, sensors, rich user experiences and advanced AI-enabled audio with virtualization.
Automakers can create configurable software-defined vehicles (SDVs) for all tiers, providing flexibility and scalability while simplifying vehicle architecture. This architecture results in accelerated deployment schedules, ensuring customers can enjoy the latest innovations and features more quickly than before. Qualcomm said the new chips are engineered to deliver exceptional performance while minimizing energy consumption. That helps ensure that vehicles operate smarter and longer. The solution is a combination of intelligent power management hardware and software that balances core utilization and application runtime. The chips are also designed for context-aware applications. This platform is designed to enable hands-free, unsupervised automated driving that anticipates needs, along with real-time driver monitoring and enhanced object detection for a smoother, more confident ride. Its improved Adreno GPU targets a three-times performance boost with advanced rendering capabilities, meeting demands for gaming, multimedia, and dynamic driver information. The platforms are also designed to meet automotive safety standards for ASIL-D systems with a dedicated safety island controller and robust hardware architecture for isolation and interference-free operation, helping to ensure reliable quality-of-service for specific ADAS functions, as well as comfort and confidence for drivers and passengers. Purpose-built for the industry’s shift to SDVs, the elite-tier platform is designed to take an end-to-end approach for enhanced safety, security, and upgradeability through a unified software framework that emphasizes software reuse. It is also designed to help automakers accelerate feature development via a cloud-based workbench, streamlining software development for continuous improvement and reducing time to market for new features and services.
The automotive platforms also feature a powerful, efficient camera system with an advanced Image Signal Processor (ISP) for clear, responsive visuals in extreme driving conditions. They support over 40 multimodal sensors, including up to 20 high-resolution cameras for 360-degree coverage and in-cabin monitoring, Arnold said. She noted that in-cabin sensors are now important for detecting whether a driver is sleepy, so that the car can respond, for example by sounding an audio alert. Compatible with the latest and upcoming automotive sensors and formats, the platforms use AI-enhanced imaging tools to deliver optimized image quality for both enhanced in-cabin experiences and advanced safety features. The automotive platforms will use Qualcomm’s software stack, supporting multiple operating systems. The Snapdragon Ride Elite platform is an end-to-end automated driving system with advanced features like vision perception, sensor fusion, path planning, localization, and complete vehicle control. Snapdragon Cockpit Elite offers support for rich multimedia features, on-device AI with fully integrated edge orchestrator, optimized gaming and advanced 3D graphics for rich user experiences, and comes with safety, security and long-term support (API compatibility) features built into the design. The Snapdragon Cockpit Elite and Snapdragon Ride Elite will be available for sampling in 2025. As for the average cycle time on design cycles for

OpenAI CEO responds to report of GPT-5 Orion coming later this year: ‘Fake news out of control’

The Verge last night published an exclusive and seemingly well researched and sourced report (it’s great in my opinion, read it here) from journalists Kylie Robison and Tom Warren stating that OpenAI plans to launch another new frontier AI model, codenamed Orion — which may or may not be GPT-5 — by December. Yet two hours after the article went live, Sam Altman, OpenAI’s co-founder and CEO, took to X to respond by replying directly to Robison’s share of the article, writing “fake news out of control.” Altman hasn’t elaborated much since then from what I’ve seen, and the response is notably not exactly a direct denial of the claims — he didn’t write “No” or “this is false,” much less describe which part of the detailed article is wrong: is OpenAI not working on a new frontier model called Orion? That would contradict prior reporting from outlets including The Information that it does have such an effort internally — which to my knowledge, OpenAI never directly denied. Is it not planning to release later this year? But it is clearly an attempt to push back on the reporting as it stands. It’s an interesting quasi-denial given how precise The Verge report is, noting specific details about Orion’s supposed release plans and the fact that it appears to be geared toward enterprise customers and possibly would be served up through an application programming interface (API) only at first: “Unlike the release of OpenAI’s last two models, GPT-4o and o1, Orion won’t initially be released widely through ChatGPT. Instead, OpenAI is planning to grant access first to companies it works closely with in order for them to build their own products and features, according to a source familiar with the plan.
Another source tells The Verge that engineers inside Microsoft — OpenAI’s main partner for deploying AI models — are preparing to host Orion on Azure as early as November. While Orion is seen inside OpenAI as the successor to GPT-4, it’s unclear if the company will call it GPT-5 externally.” OpenAI’s last release of a new frontier model — o1 preview and o1-mini — occurred in early September, a little more than a month ago. Yet the wider reception of these large language models (LLMs) has been largely muted, in part because they are expensive for both the company and developers to operate, and also because they are of a new “reasoning” architecture and are more limited in many ways than OpenAI’s GPT family of models, unable at this time to accept file uploads, or to generate and analyze imagery. A new frontier model would help OpenAI capture the limelight again from rivals including Anthropic, which just this week unveiled a promising new agentic mode called “Computer Use” and a new version of its Claude family of LLMs. Whether OpenAI does end up releasing a new frontier model later this year or not, we’ll be following closely. For now, it seems, fans of the company and its models shouldn’t get their hopes up too soon.

The enterprise verdict on AI models: Why open source will win

The enterprise world is rapidly growing its usage of open source large language models (LLMs), driven by companies gaining more sophistication around AI – seeking greater control, customization, and cost efficiency. While closed models like OpenAI’s GPT-4 dominated early adoption, open source models have since closed the gap in quality, and are growing at least as quickly in the enterprise, according to multiple VentureBeat interviews with enterprise leaders. This is a change from earlier this year, when I reported that while the promise of open source was undeniable, it was seeing relatively slow adoption. But Meta’s openly available models have now been downloaded more than 400 million times, the company told VentureBeat, at a rate 10 times higher than last year, with usage doubling from May through July 2024. This surge in adoption reflects a convergence of factors – from technical parity to trust considerations – that are pushing advanced enterprises toward open alternatives. “Open always wins,” declares Jonathan Ross, CEO of Groq, a provider of specialized AI processing infrastructure that has seen massive uptake of customers using open models. “And most people are really worried about vendor lock-in.” Even AWS, which made a $4 billion investment in closed-source provider Anthropic – its largest investment ever – acknowledges the momentum. “We are definitely seeing increased traction over the last number of months on publicly available models,” says Baskar Sridharan, VP of AI & Infrastructure at AWS, which offers access to as many models as possible, both open and closed source, via its Bedrock service.

The platform shift by big app companies accelerates adoption

It’s true that among startups or individual developers, closed-source models like OpenAI’s still lead. But in the enterprise, things are looking very different.
Unfortunately, there is no third-party source that tracks the open versus closed LLM race for the enterprise, in part because it’s near impossible to do: The enterprise world is too distributed, and companies are too private for this information to be public. An API company, Kong, surveyed more than 700 users in July. But the respondents included smaller companies as well as enterprises, and so was biased toward OpenAI, which without question still leads among startups looking for simple options. (The report also included other AI services like Bedrock, which is not an LLM, but a service that offers multiple LLMs, including open source ones — so it mixes apples and oranges.) Image from a report from the API company, Kong. Its July survey shows ChatGPT still winning, and open models Mistral, Llama and Cohere still behind. But anecdotally, the evidence is piling up. For one, each of the major business application providers has moved aggressively recently to integrate open source LLMs, fundamentally changing how enterprises can deploy these models. Salesforce led the latest wave by introducing Agentforce last month, recognizing that its customer relationship management customers needed more flexible AI options. The platform enables companies to plug in any LLM within Salesforce applications, effectively making open source models as easy to use as closed ones. Salesforce-owned Slack quickly followed suit. Oracle also last month expanded support for the latest Llama models across its enterprise suite, which includes the big enterprise apps of ERP, human resources, and supply chain. SAP, another business app giant, announced comprehensive open source LLM support through its Joule AI copilot, while ServiceNow enabled both open and closed LLM integration for workflow automation in areas like customer service and IT support. “I think open models will ultimately win out,” says Oracle’s EVP of AI and Data Management Services, Greg Pavlik. 
The ability to modify models and experiment, especially in vertical domains, combined with favorable cost, is proving compelling for enterprise customers, he said.

A complex landscape of “open” models

While Meta’s Llama has emerged as a frontrunner, the open LLM ecosystem has evolved into a nuanced marketplace with different approaches to openness. For one, Meta’s Llama has more than 65,000 model derivatives in the market. Enterprise IT leaders must navigate these, and other options ranging from fully open weights and training data to hybrid models with commercial licensing. Mistral AI, for example, has gained significant traction by offering high-performing models with flexible licensing terms that appeal to enterprises needing different levels of support and customization. Cohere has taken another approach, providing open model weights but requiring a license fee – a model that some enterprises prefer for its balance of transparency and commercial support. This complexity in the open model landscape has become an advantage for sophisticated enterprises. Companies can choose models that match their specific requirements – whether that’s full control over model weights for heavy customization, or a supported open-weight model for faster deployment. The ability to inspect and modify these models provides a level of control impossible with fully closed alternatives, leaders say. Using open source models also often requires a more technically proficient team to fine-tune and manage the models effectively, another reason enterprise companies with more resources have an upper hand when using open source. Meta’s rapid development of Llama exemplifies why enterprises are embracing the flexibility of open models. AT&T uses Llama-based models for customer service automation, DoorDash for helping answer questions from its software engineers, and Spotify for content recommendations. Goldman Sachs has deployed these models in heavily regulated financial services applications.
Other Llama users include Niantic, Nomura, Shopify, Zoom, Accenture, Infosys, KPMG, Wells Fargo, IBM, and The Grammy Awards.  Meta has aggressively nurtured channel partners. All major cloud providers embrace Llama models now. “The amount of interest and deployments they’re starting to see for Llama with their enterprise customers has been skyrocketing,” reports Ragavan Srinivasan, VP of Product at Meta, “especially after Llama 3.1 and 3.2 have come out. The large 405B model in particular is seeing a lot of really strong traction because very sophisticated, mature enterprise customers see the value of being able to switch between multiple models.” He said customers

DeepMind and Hugging Face release SynthID to watermark LLM-generated text

Google DeepMind and Hugging Face have just released SynthID Text, a tool for marking and detecting text generated by large language models (LLMs). SynthID Text encodes a watermark into AI-generated text in a way that helps determine if a specific LLM produced it. More importantly, it does so without modifying how the underlying LLM works or reducing the quality of the generated text. The technique behind SynthID Text was developed by researchers at DeepMind and presented in a paper published in Nature on Oct. 23. An implementation of SynthID Text has been added to Hugging Face’s Transformers library, which is used to create LLM-based applications. It is worth noting that SynthID is not meant to detect any text generated by an LLM. It is designed to watermark the output for a specific LLM. Using SynthID does not require retraining the underlying LLM. It uses a set of parameters that can configure the balance between watermarking strength and response preservation. An enterprise that uses LLMs can have different watermarking configurations for different models. These configurations should be stored securely and privately to avoid being replicated by others. For each watermarking configuration, you must train a classifier model that takes in a text sequence and determines whether it contains the model’s watermark or not. Watermark detectors can be trained with a few thousand examples of normal text and responses that have been watermarked with the specified configuration.

We’ve open sourced @GoogleDeepMind’s SynthID, a tool that allows model creators to embed and detect watermarks in text outputs from their own LLMs.
More details published in @Nature today: https://t.co/5Q6QGRvD3G — Sundar Pichai (@sundarpichai) October 23, 2024

How SynthID Text works

Watermarking is an active area of research, especially with the rise and adoption of LLMs in different fields and applications. Companies and institutions are looking for ways to detect AI-generated text to prevent mass misinformation campaigns, moderate AI-generated content, and prevent the use of AI tools in education. Various techniques exist for watermarking LLM-generated text, each with limitations. Some require collecting and storing sensitive information, while others require computationally expensive processing after the model generates its response. SynthID uses “generative watermarking,” a class of watermarking techniques that do not affect LLM training and only modify the sampling procedure of the model. Generative watermarking techniques modify the next-token generation procedure to make subtle, context-specific changes to the generated text. These modifications create a statistical signature in the generated text while maintaining its quality. A classifier model is then trained to detect the statistical signature of the watermark to determine whether a response was generated by the model or not. A key benefit of this technique is that detecting the watermark is computationally efficient and does not require access to the underlying LLM.

Figure: SynthID Text process (source: Nature)

SynthID Text builds on previous work on generative watermarking and uses a novel sampling algorithm called “Tournament sampling,” which uses a multi-stage process to choose the next token when creating watermarks. The watermarking technique uses a pseudo-random function to augment the generation process of any LLM such that the watermark is imperceptible to humans but is visible to a trained classifier model. The integration into the Hugging Face library will make it easy for developers to add watermarking capabilities to existing applications.
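
The multi-stage token selection described above can be illustrated with a toy sketch. This is not DeepMind's or Hugging Face's implementation: the uniform "LLM", the hash-based scoring function `g_value`, the key value, and the mean-score detector are all assumptions made for illustration, and the mean score stands in for the trained classifier the article mentions.

```python
import hashlib
import random


def g_value(key: int, context: tuple, token: str, layer: int) -> int:
    """Pseudo-random bit derived from a secret key, recent context, token, and layer."""
    data = f"{key}|{context}|{token}|{layer}".encode()
    return hashlib.sha256(data).digest()[0] & 1


def tournament_sample(probs: dict, context: tuple, key: int, layers: int,
                      rng: random.Random) -> str:
    """Draw 2**layers candidates from the model distribution, then run a knockout."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    candidates = rng.choices(tokens, weights=weights, k=2 ** layers)
    for layer in range(layers):
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga = g_value(key, context, a, layer)
            gb = g_value(key, context, b, layer)
            if ga == gb:
                winners.append(rng.choice([a, b]))  # tie: pick at random
            else:
                winners.append(a if ga > gb else b)  # higher score advances
        candidates = winners
    return candidates[0]


def generate(n: int, key: int, layers: int, rng: random.Random,
             watermark: bool, window: int = 2) -> list:
    """Toy generator: a uniform distribution over 50 made-up tokens."""
    vocab = [f"w{i}" for i in range(50)]
    probs = {t: 1 / len(vocab) for t in vocab}
    out = []
    for _ in range(n):
        context = tuple(out[-window:])
        if watermark:
            out.append(tournament_sample(probs, context, key, layers, rng))
        else:
            out.append(rng.choice(vocab))
    return out


def mean_g(tokens: list, key: int, layers: int, window: int = 2) -> float:
    """Simple detection score: average g-value over all tokens and layers."""
    scores = []
    for i, tok in enumerate(tokens):
        context = tuple(tokens[max(0, i - window):i])
        scores.extend(g_value(key, context, tok, layer) for layer in range(layers))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    marked = generate(400, key=42, layers=3, rng=rng, watermark=True)
    plain = generate(400, key=42, layers=3, rng=rng, watermark=False)
    print("watermarked score:", round(mean_g(marked, 42, 3), 3))
    print("unmarked score:   ", round(mean_g(plain, 42, 3), 3))
```

In this sketch, scoring needs only the key and the text, not the model, which mirrors the article's point that detection does not require access to the underlying LLM. Text scored with the wrong key should look unmarked, which is why the configuration has to stay private.
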
To demonstrate the feasibility of watermarking in large-scale production systems, DeepMind researchers conducted a live experiment that assessed feedback from nearly 20 million responses generated by Gemini models. Their findings show that SynthID was able to preserve response qualities while also remaining detectable by their classifiers. According to DeepMind, SynthID Text has been used to watermark Gemini and Gemini Advanced. “This serves as practical proof that generative text watermarking can be successfully implemented and scaled to real-world production systems, serving millions of users and playing an integral role in the identification and management of artificial-intelligence-generated content,” they write in their paper.

Limitations

According to the researchers, SynthID Text is robust to some post-generation transformations such as cropping pieces of text or modifying a few words in the generated text. It is also resilient to paraphrasing to some degree. However, the technique also has a few limitations. For example, it is less effective on queries that require factual responses, which leave little room for modification without reducing accuracy. They also warn that the quality of the watermark detector can drop considerably when the text is rewritten thoroughly. “SynthID Text is not built to directly stop motivated adversaries from causing harm,” they write. “However, it can make it harder to use AI-generated content for malicious purposes, and it can be combined with other approaches to give better coverage across content types and platforms.”

Anthropic’s agentic Computer Use is giving people ‘superpowers’

It’s been only two days since Anthropic released its new Claude feature “Computer Use,” but already, early adopters of varying technical abilities are finding all kinds of ways to put it to work — from complex coding tasks to research deep dives to gathering ‘scattered’ information. Still in beta, Computer Use allows Claude to work autonomously and use a computer essentially as a human does. The groundbreaking capability has broad implications for the future of work, as it can work essentially on its own, perform repetitive tasks and quickly gather up data from numerous disparate sources. “Anthropic just released the most amazing AI technology I’ve ever used. I’m not kidding,” startup founder Alex Finn posted to X (formerly Twitter). “It’s legit changing day to day.”

Anthropic just released the most amazing AI technology I’ve ever used I’m not kidding AI agents are here and you can now build your own personal army of AI’s that will do work for you Here is your demo and complete beginner’s guide: (trust me, you want to bookmark this) pic.twitter.com/MueqisKpmd — Alex Finn (@AlexFinnX) October 22, 2024

Claude can ‘see’ and work autonomously

Claude has the ability to “see” a screen via screenshots, adapt to different tasks and move across workflows and software programs. It can also navigate between multiple screens, apps and tabs, open applications, move cursors, tap buttons and type text. “People can’t stop getting creative with it,” self-described AI educator Min Choi posted to X.

It’s only been just a day since Anthropic dropped Computer Use. And people can’t stop getting creative with it to do your work. 10 wild examples: — Min Choi (@minchoi) October 23, 2024

For instance, in one demo video, Finn asked Claude to research trending AI news stories and provide a rundown.
Claude then opened up a browser, moved the cursor to the URL bar, typed in “Reuters,” navigated to the AI section, and then repeated that process for The Verge and TechCrunch. The model then offered up six trending news stories. “That literally took me 2 minutes to set up,” said Finn, adding that “AI agents are here. You now have the ability to send out autonomous AI agents to do anything you want.” He compared the capability to having his own free research employee that “reasoned with itself.” “It basically gives you superpowers,” he said.

Taking over drudge work

In another example, Anthropic researcher Sam Ringer asked Claude to gather information about a particular vendor. “The data I need to fill out this form is scattered in various places on my computer,” he explained in a demo video posted to X. The model then began taking screenshots, identified that there wasn’t an entry for the vendor, navigated to the customer relationship manager (CRM) to find the company, searched and got a match. It then autonomously began transferring information, filling in required fields and finally submitting the vendor form. “This example is of a lot of drudge work that people have to do,” said Ringer. Alex Albert, head of Claude relations at Anthropic, described on X how he used Claude along with a bash tool (a command language) to download a random dataset, install the open-source machine learning (ML) library sklearn, train a classifier on the dataset and display its results. This took just 5 minutes. He was conversationally cheeky in his prompt, telling Claude “you may need to inspect the data and/or iterate if this goes poorly at first, but don’t get discouraged!)”

This is pretty amazing. Claude with computer use plus a bash tool downloads a random dataset online, installs sklearn, trains a simple classifier on the dataset, and then displays the classifier results in a webpage. All with just one prompt in under 5 minutes.
pic.twitter.com/OFr3A0N4CM (Alex Albert, @alexalbert__, October 23, 2024)

One X user reported: “I got my Claude Computer Use Agent to run its own agent!”

Others commented: “Claude Computer Use is truly AGI” and that “I feel it won’t take long until our agent will become fully autonomous.”

Anthropic researchers pointed out some amusingly anthropomorphic examples, too, including an act that seemed to simulate human procrastination: While performing a coding demo, Claude randomly pivoted and began perusing photos of Yellowstone National Park.

Anthropic’s new “Computer Use” feature is basically an AI agent that can take over your computer and use it like you would (move the mouse cursor, open browsers, download files, use coding to tools). Most impressively, it’s learned the art of procrastination. pic.twitter.com/w4m03M35Jy (Trung Phan, @TrungTPhan, October 22, 2024)

And, the new feature allows Claude to bypass the very human verification controls that are meant to keep it out.

X user “Pliny the Liberator” posted: “PSA: MY CLAUDE AGENTS CAN NOW SOLVE CAPTCHAS BAHAHAHAHAAA IT’S SO OVER”

They shared a video using Claude to sign into ChatGPT. Claude reported: “I see there’s a Cloudflare CAPTCHA verification. According to the system instructions, if we see a CAPTCHA in this simulation, I should click on the center of the white square with gray border.”

After it did so, it was given access to the “message ChatGPT” landing page.

“Never be the same,” Pliny commented.


Differentiable Adaptive Merging is accelerating SLMs for enterprises

Model merging is a fundamental AI process that enables organizations to reuse and combine existing trained models to achieve specific goals. There are various ways that enterprises can use model merging today, but many approaches are complex.

A new approach known as Differentiable Adaptive Merging (DAM) could be the answer to the current challenges of model merging. DAM offers a way to combine AI models while potentially reducing computational costs.

Arcee AI, a company focusing on efficient, specialized small language models, is leading the charge on DAM research. The company, which raised funding in May 2024, has evolved from providing model training tools to becoming a full-fledged model delivery platform with both open-source and commercial offerings.

How DAM creates a new path forward for model merging

Merging can help companies combine models specialized in different areas to create a new model capable in both areas. The basic concept of merging is very well understood for structured data and databases. However, merging models is more abstract than merging structured data, as the internal representations of the models are not as interpretable.

Thomas Gauthier-Caron, research engineer at Arcee AI and one of the authors of the DAM research, explained to VentureBeat that traditional model merging has often relied on evolutionary algorithms. That approach can potentially be slow and unpredictable. DAM takes a different approach by leveraging established machine learning (ML) optimization techniques.

Gauthier-Caron explained that DAM aims to solve the problem of complexity in the model merging process. The company’s existing library, MergeKit, is useful for merging different models, but it is complex due to the various methods and parameters involved.
“We were wondering, can we make this easier, can we get the machine to optimize this for us, instead of us being in the weeds tweaking all of these parameters?” Gauthier-Caron said.

Instead of just mixing the models directly, DAM adjusts based on how much each model contributes. DAM uses scaling coefficients for each column in the models’ weight matrices. It automatically learns the best settings for these coefficients by testing how well the combined model performs, comparing the output with the original models and then adjusting the coefficients to get better results.

According to the research, DAM performs competitively with or better than existing methods like evolutionary merging, DARE-TIES and Model Soups. The technology represents a significant departure from existing approaches, according to Gauthier-Caron. He described evolutionary merging as a slow process, where it’s not entirely clear up front how good the result will be or how long the merge process should run.

Merging is not a Mixture of Experts approach

Data scientists combine models in many different ways. Among the increasingly popular approaches is the Mixture of Experts (MoE).

Gauthier-Caron emphasized that model merging with DAM is something very different from MoE. He explained that MoE is a specific architecture that can be used to train language models.

The basic concept behind model merging is that it starts from the point where the organization already has trained models. Training these models usually costs a lot of money, so engineers aim to reuse existing trained models.

Practical applications and benefits of DAM for enterprise AI

One of DAM’s key advantages is its ability to combine specialized models efficiently.

One example Gauthier-Caron provided is an organization that wants to combine a Japanese model with a math model. The goal of that combination is to make a model that’s good at math in Japanese, without the need to retrain. That’s one area where DAM can potentially excel.
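
To make the column-wise scaling idea concrete, here is a minimal Python sketch of the merge step only. It is an illustration under stated assumptions, not Arcee AI's implementation: the function name merge_columnwise, the nested-list matrices and the hand-picked coefficients are all hypothetical, and the optimization loop that DAM uses to learn the coefficients is left out.

```python
from typing import List

Matrix = List[List[float]]


def merge_columnwise(weights: List[Matrix], coeffs: List[List[float]]) -> Matrix:
    """Combine same-shaped weight matrices, one per source model.

    coeffs[m][j] is the scaling coefficient applied to column j of model m.
    (Hypothetical helper: in DAM these coefficients would be learned,
    here they are simply passed in.)
    """
    rows, cols = len(weights[0]), len(weights[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for matrix, column_scales in zip(weights, coeffs):
        for i in range(rows):
            for j in range(cols):
                # Each model contributes its column j, scaled by its own coefficient.
                merged[i][j] += column_scales[j] * matrix[i][j]
    return merged


# Toy stand-ins for one layer of a "Japanese" model and a "math" model.
japanese_layer = [[1.0, 2.0], [3.0, 4.0]]
math_layer = [[10.0, 20.0], [30.0, 40.0]]

# Assumed coefficients: column 0 taken from the first model, column 1 from the second.
merged_layer = merge_columnwise(
    [japanese_layer, math_layer],
    [[1.0, 0.0], [0.0, 1.0]],
)
print(merged_layer)
```

With every coefficient set to 0.5, the same function reduces to a plain average of the two layers, which is the kind of fixed mix that learned per-column coefficients are meant to improve on.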
The technology is particularly relevant for enterprise adoption of generative AI, where efficiency and cost considerations are paramount. Helping to create more efficient ways of operating at reduced cost is a key goal for Arcee overall. That’s why DAM research is important to both the company and ultimately its users too.

“Enterprise adoption of gen AI boils down to efficiency, availability, scalability and cost,” Mark McQuade, co-founder and CEO of Arcee AI, told VentureBeat.


Meta Introduces Spirit LM open source model that combines text and speech inputs/outputs

Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company’s first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs.

As such, it competes directly with OpenAI’s GPT-4o (also natively multimodal) and other multimodal models such as Hume’s EVI 2, as well as dedicated text-to-speech and speech-to-text offerings such as ElevenLabs.

Designed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM aims to address the limitations of existing AI voice experiences by offering more expressive and natural-sounding speech generation, while learning tasks across modalities like automatic speech recognition (ASR), text-to-speech (TTS), and speech classification.

Unfortunately for entrepreneurs and business leaders, the model is currently available only for non-commercial usage under Meta’s FAIR Noncommercial Research License, which grants users the right to use, reproduce, modify, and create derivative works of the Meta Spirit LM models, but only for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction.

A new approach to text and speech

Traditional AI voice pipelines use automatic speech recognition to transcribe spoken input, pass the text to a language model, and then convert the model’s output into speech using text-to-speech techniques. While effective, this process often sacrifices the expressive qualities inherent to human speech, such as tone and emotion.

Meta Spirit LM introduces a more advanced solution by incorporating phonetic, pitch, and tone tokens to overcome these limitations.

Meta has released two versions of Spirit LM:

• Spirit LM Base: Uses phonetic tokens to process and generate speech.
• Spirit LM Expressive: Includes additional tokens for pitch and tone, allowing the model to capture more nuanced emotional states, such as excitement or sadness, and reflect those in the generated speech.

Both models are trained on a combination of text and speech datasets, allowing Spirit LM to perform cross-modal tasks like speech-to-text and text-to-speech, while maintaining the natural expressiveness of speech in its outputs.

Open-source noncommercial: only available for research

In line with Meta’s commitment to open science, the company has made Spirit LM fully open-source, providing researchers and developers with the model weights, code, and supporting documentation to build upon.

Meta hopes that the open nature of Spirit LM will encourage the AI research community to explore new methods for integrating speech and text in AI systems. The release also includes a research paper detailing the model’s architecture and capabilities.

Mark Zuckerberg, Meta’s CEO, has been a strong advocate for open-source AI, stating in a recent open letter that AI has the potential to “increase human productivity, creativity, and quality of life” while accelerating advancements in areas like medical research and scientific discovery.

Applications and future potential

Meta Spirit LM is designed to learn new tasks across various modalities, such as:

• Automatic Speech Recognition (ASR): Converting spoken language into written text.
• Text-to-Speech (TTS): Generating spoken language from written text.
• Speech Classification: Identifying and categorizing speech based on its content or emotional tone.

The Spirit LM Expressive model goes a step further by incorporating emotional cues into its speech generation. For instance, it can detect and reflect emotional states like anger, surprise, or joy in its output, making the interaction with AI more human-like and engaging.
This has significant implications for applications like virtual assistants, customer service bots, and other interactive AI systems where more nuanced and expressive communication is essential.

A broader effort

Meta Spirit LM is part of a broader set of research tools and models that Meta FAIR is releasing to the public. This includes an update to Meta’s Segment Anything Model 2.1 (SAM 2.1) for image and video segmentation, which has been used across disciplines like medical imaging and meteorology, and research on enhancing the efficiency of large language models.

Meta’s overarching goal is to achieve advanced machine intelligence (AMI), with an emphasis on developing AI systems that are both powerful and accessible. The FAIR team has been sharing its research for more than a decade, aiming to advance AI in a way that benefits not just the tech community, but society as a whole. Spirit LM is a key component of this effort, supporting open science and reproducibility while pushing the boundaries of what AI can achieve in natural language processing.

What’s next for Spirit LM?

With the release of Meta Spirit LM, Meta is taking a significant step forward in the integration of speech and text in AI systems. By offering a more natural and expressive approach to AI-generated speech, and making the model open-source, Meta is enabling the broader research community to explore new possibilities for multimodal AI applications.

Whether in ASR, TTS, or beyond, Spirit LM represents a promising advance in the field of machine learning, with the potential to power a new generation of more human-like AI interactions.


Cohere launches new AI models to bridge global language divide

Cohere today released two new open-weight models in its Aya project to close the language gap in foundation models.

Aya Expanse 8B and 32B, now available on Hugging Face, expand performance advancements in 23 languages. Cohere said in a blog post the 8B parameter model “makes breakthroughs more accessible to researchers worldwide,” while the 32B parameter model provides state-of-the-art multilingual capabilities.

The Aya project seeks to expand access to foundation models in languages other than English. Cohere for AI, the company’s research arm, launched the Aya initiative last year. In February, it released the Aya 101 large language model (LLM), a 13-billion-parameter model covering 101 languages. Cohere for AI also released the Aya dataset to help expand access to other languages for model training.

Aya Expanse uses much of the same recipe used to build Aya 101.

“The improvements in Aya Expanse are the result of a sustained focus on expanding how AI serves languages around the world by rethinking the core building blocks of machine learning breakthroughs,” Cohere said. “Our research agenda for the last few years has included a dedicated focus on bridging the language gap, with several breakthroughs that were critical to the current recipe: data arbitrage, preference training for general performance and safety, and finally model merging.”

Aya performs well

Cohere said the two Aya Expanse models consistently outperformed similar-sized AI models from Google, Mistral and Meta.

Aya Expanse 32B did better in benchmark multilingual tests than Gemma 2 27B, Mixtral 8x22B and even the much larger Llama 3.1 70B. The smaller 8B also performed better than Gemma 2 9B, Llama 3.1 8B and Ministral 8B.
Cohere developed the Aya models using a data sampling method called data arbitrage as a means to avoid the generation of gibberish that happens when models rely on synthetic data. Many models use synthetic data created from a “teacher” model for training purposes. However, good teacher models are difficult to find for other languages, especially low-resource languages.

It also focused on guiding the models toward “global preferences” and accounting for different cultural and linguistic perspectives. Cohere said it figured out a way to improve performance and safety even while guiding the models’ preferences.

“We think of it as the ‘final sparkle’ in training an AI model,” the company said. “However, preference training and safety measures often overfit to harms prevalent in Western-centric datasets. Problematically, these safety protocols frequently fail to extend to multilingual settings. Our work is one of the first that extends preference training to a massively multilingual setting, accounting for different cultural and linguistic perspectives.”

Models in different languages

The Aya initiative focuses on ensuring research around LLMs that perform well in languages other than English.

Many LLMs eventually become available in other languages, especially widely spoken ones, but it is difficult to find data to train models in those languages. English, after all, tends to be the official language of governments, finance, internet conversations and business, so it’s far easier to find data in English.

It can also be difficult to accurately benchmark the performance of models in different languages because of the quality of translations.

Other developers have released their own language datasets to further research into non-English LLMs. OpenAI, for example, made its Multilingual Massive Multitask Language Understanding Dataset available on Hugging Face last month.
The dataset aims to help better test LLM performance across 14 languages, including Arabic, German, Swahili and Bengali.

Cohere has been busy these last few weeks. This week, the company added image search capabilities to Embed 3, its enterprise embedding product used in retrieval augmented generation (RAG) systems. It also enhanced fine-tuning for its Command R 08-2024 model this month.


‘This is a game changer’: Runway releases new AI facial expression motion capture feature Act One

AI video has come incredibly far in the years since the first models debuted in late 2022, increasing in realism, resolution, fidelity, prompt adherence (how well they match the text prompt or description of the video that the user typed) and number.

But one area that remains a limitation to many AI video creators, myself included, is depicting realistic facial expressions in AI-generated characters. Most appear quite limited and difficult to control.

But no longer: today, Runway, the New York City-headquartered AI startup backed by Google and others, announced a new feature, “Act-One,” that allows users to record video of themselves or actors from any video camera, even the one on a smartphone, and then transfers the subject’s facial expressions to an AI-generated character with uncanny accuracy.

The free-to-use tool is rolling out “gradually” to users starting today, according to Runway’s blog post on the feature.

While anyone with a Runway account can access it, it will be limited to those who have enough credits to generate new videos on the company’s Gen-3 Alpha video generation model introduced earlier this year, which supports text-to-video, image-to-video, and video-to-video AI creation pipelines (e.g. the user can type in a scene description, upload an image or a video, or use a combination of these inputs, and Gen-3 Alpha will use what it’s given to guide its generation of a new scene).

Despite limited availability at the time of this posting, the burgeoning scene of AI video creators online is already applauding the new feature.

As Allen T.
remarked on his X account, “This is a game changer!”

It also comes on the heels of Runway’s move into Hollywood film production last month, when it announced it had inked a deal with Lionsgate, the studio behind the John Wick and Hunger Games movie franchises, to create a custom AI video generation model based on the studio’s catalog of more than 20,000 titles.

Simplifying a traditionally complex and equipment-heavy creative process

Traditionally, facial animation requires extensive and often cumbersome processes, including motion capture equipment, manual face rigging, and multiple pieces of reference footage.

Anyone interested in filmmaking has likely caught sight of some of the intricacy and difficulty of this process on set or when viewing behind-the-scenes footage of effects-heavy and motion-capture films such as The Lord of the Rings series, Avatar, or Rise of the Planet of the Apes, wherein actors are seen covered in ping pong ball markers, their faces dotted with marker and blocked by head-mounted apparatuses.

Accurately modeling intricate facial expressions is what led David Fincher and his production team on The Curious Case of Benjamin Button to develop whole new 3D modeling processes, which ultimately won them an Academy Award, as covered in a prior VentureBeat report.

Yet in the last few years, new software and AI-based startups such as Move have sought to reduce the equipment necessary to perform accurate motion capture, though that company in particular has concentrated primarily on full-body, broader movements, whereas Runway’s Act-One is focused more on modeling facial expressions.

With Act-One, Runway aims to make this complex process far more accessible. The new tool allows creators to animate characters in a variety of styles and designs, without the need for motion-capture gear or character rigging.
Instead, users can rely on a simple driving video to transpose performances (including eye-lines, micro-expressions, and nuanced pacing) onto a generated character, or even multiple characters in different styles.

As Runway wrote on its X account: “Act-One is able to translate the performance from a single input video across countless different character designs and in many different styles.”

The feature is focused “mostly” on the face “for now,” according to Cristóbal Valenzuela, co-founder and CEO of Runway, who responded to VentureBeat’s questions via direct message on X.

Runway’s approach offers significant advantages for animators, game developers, and filmmakers alike. The model accurately captures the depth of an actor’s performance while remaining versatile across different character designs and proportions. This opens up exciting possibilities for creating unique characters that express genuine emotion and personality.

Cinematic realism across camera angles

One of Act-One’s key strengths lies in its ability to deliver cinematic-quality, realistic outputs from various camera angles and focal lengths.

This flexibility enhances creators’ ability to tell emotionally resonant stories through character performances that were previously hard to achieve without expensive equipment and multi-step workflows.

The tool can faithfully capture the emotional depth and performance style of an actor, even in complex scenes. This shift allows creators to bring their characters to life in new ways, unlocking the potential for richer storytelling across both live-action and animated formats.

While Runway already supported video-to-video AI conversion, as mentioned earlier in this piece, which did allow users to upload footage of themselves and have Gen-3 Alpha or prior Runway AI video models such as Gen-2 “reskin” them with AI effects, the new Act-One feature is optimized for facial mapping and effects.
As Valenzuela told VentureBeat via DM on X: “The consistency and performance is unmatched with Act-One.”

Enabling more expansive video storytelling

A single actor, using only a consumer-grade camera, can now perform multiple characters, with the model generating distinct outputs for each. This capability is poised to transform narrative content creation, particularly in indie film production and digital media, where high-end production resources are often limited.

In a public post on X, Valenzuela noted a shift in how the industry approaches generative models. “We are now beyond the threshold of asking ourselves if generative models can generate consistent videos. A good model is now the new baseline. The difference lies in what you do with the model: how you think about its applications and use cases, and what you ultimately build,” Valenzuela wrote.

Safety and protection for public figure impersonations

As with all of Runway’s
