VentureBeat

LlamaV-o1 is the AI model that explains its thought process—here’s why that matters

Researchers at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) have announced the release of LlamaV-o1, a state-of-the-art artificial intelligence model capable of tackling some of the most complex reasoning tasks across text and images.

By combining cutting-edge curriculum learning with advanced optimization techniques like Beam Search, LlamaV-o1 sets a new benchmark for step-by-step reasoning in multimodal AI systems.

“Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential,” the researchers wrote in their technical report, published today.

Fine-tuned for reasoning tasks that require precision and transparency, the AI model outperforms many of its peers on tasks ranging from interpreting financial charts to diagnosing medical images.

In tandem with the model, the team also introduced VRC-Bench, a benchmark designed to evaluate AI models on their ability to reason through problems in a step-by-step manner. With over 1,000 diverse samples and more than 4,000 reasoning steps, VRC-Bench is already being hailed as a game-changer in multimodal AI research.

LlamaV-o1 outperforms competitors like Claude 3.5 Sonnet and Gemini 1.5 Flash in identifying patterns and reasoning through complex visual tasks, as demonstrated in this example from the VRC-Bench benchmark. The model provides step-by-step explanations, arriving at the correct answer, while other models fail to match the established pattern. (credit: arxiv.org)

How LlamaV-o1 stands out from the competition

Traditional AI models often focus on delivering a final answer, offering little insight into how they arrived at their conclusions. LlamaV-o1, however, emphasizes step-by-step reasoning, a capability that mimics human problem-solving.
This approach allows users to see the logical steps the model takes, making it particularly valuable for applications where interpretability is essential.

The researchers trained LlamaV-o1 using LLaVA-CoT-100k, a dataset optimized for reasoning tasks, and evaluated its performance using VRC-Bench. The results are impressive: LlamaV-o1 achieved a reasoning step score of 68.93, outperforming well-known open-source models like LLaVA-CoT (66.21) and even some closed-source models like Claude 3.5 Sonnet.

“By leveraging the efficiency of Beam Search alongside the progressive structure of curriculum learning, the proposed model incrementally acquires skills, starting with simpler tasks such as [a] summary of the approach and question derived captioning and advancing to more complex multi-step reasoning scenarios, ensuring both optimized inference and robust reasoning capabilities,” the researchers explained.

The model’s methodical approach also makes it faster than its competitors. “LlamaV-o1 delivers an absolute gain of 3.8% in terms of average score across six benchmarks while being 5X faster during inference scaling,” the team noted in its report. Efficiency like this is a key selling point for enterprises looking to deploy AI solutions at scale.

AI for business: Why step-by-step reasoning matters

LlamaV-o1’s emphasis on interpretability addresses a critical need in industries like finance, medicine and education. For businesses, the ability to trace the steps behind an AI’s decision can build trust and ensure compliance with regulations.

Take medical imaging as an example. A radiologist using AI to analyze scans doesn’t just need the diagnosis; they need to know how the AI reached that conclusion. This is where LlamaV-o1 shines, providing transparent, step-by-step reasoning that professionals can review and validate.

The model also excels in fields like chart and diagram understanding, which are vital for financial analysis and decision-making.
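The Beam Search mentioned in the researchers' quote is a general search technique: at each step it expands several candidate paths, scores them, and keeps only the best few. The sketch below illustrates that general idea on a toy problem. It is not the LlamaV-o1 implementation; the `beam_search` and `toy_expand` functions, the step names and the probabilities are all hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def beam_search(
    start: Path,
    expand: Callable[[Path], Sequence[Tuple[str, float]]],
    beam_width: int,
    max_steps: int,
) -> Path:
    """Return the highest-scoring path found with a beam of the given width.

    expand(path) returns (next_step, probability) pairs. A path's score is
    the sum of the log-probabilities of its steps.
    """
    beams: List[Tuple[float, Path]] = [(0.0, start)]
    for _ in range(max_steps):
        candidates: List[Tuple[float, Path]] = []
        for score, path in beams:
            options = expand(path)
            if not options:
                # Finished path: carry it forward unchanged.
                candidates.append((score, path))
                continue
            for step, prob in options:
                candidates.append((score + math.log(prob), path + (step,)))
        # Keep only the beam_width best partial paths.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


# Hypothetical toy "reasoning steps" with made-up probabilities.
TOY_STEPS = {
    (): [("read chart", 0.6), ("guess", 0.4)],
    ("read chart",): [("compare bars", 0.5), ("stop early", 0.5)],
    ("guess",): [("answer", 0.9)],
}


def toy_expand(path: Path) -> Sequence[Tuple[str, float]]:
    return TOY_STEPS.get(path, [])


greedy = beam_search((), toy_expand, beam_width=1, max_steps=2)
wider = beam_search((), toy_expand, beam_width=2, max_steps=2)
print("beam width 1:", greedy)
print("beam width 2:", wider)
```

In this toy setup, a beam of width 1 commits to the locally best first step, while a beam of width 2 keeps the second option alive long enough to find a path with a higher overall score. The trade-off is that a wider beam evaluates more candidates per step.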
In tests on VRC-Bench, LlamaV-o1 consistently outperformed competitors in tasks requiring interpretation of complex visual data.

But the model isn’t just for high-stakes applications. Its versatility makes it suitable for a wide range of tasks, from content generation to conversational agents. The researchers specifically tuned LlamaV-o1 to excel in real-world scenarios, leveraging Beam Search to optimize reasoning paths and improve computational efficiency.

Beam Search allows the model to generate multiple reasoning paths in parallel and select the most logical one. This approach not only boosts accuracy but reduces the computational cost of running the model, making it an attractive option for businesses of all sizes.

LlamaV-o1 excels in diverse reasoning tasks, including visual reasoning, scientific analysis and medical imaging, as shown in this example from the VRC-Bench benchmark. Its step-by-step explanations provide interpretable and accurate outcomes, outperforming competitors in tasks such as chart comprehension, cultural context analysis and complex visual perception. (credit: arxiv.org)

What VRC-Bench means for the future of AI

The release of VRC-Bench is as significant as the model itself. Unlike traditional benchmarks that focus solely on final answer accuracy, VRC-Bench evaluates the quality of individual reasoning steps, offering a more nuanced assessment of an AI model’s capabilities.

“Most benchmarks focus primarily on end-task accuracy, neglecting the quality of intermediate reasoning steps,” the researchers explained.
“[VRC-Bench] presents a diverse set of challenges with eight different categories ranging from complex visual perception to scientific reasoning with over [4,000] reasoning steps in total, enabling robust evaluation of LLMs’ abilities to perform accurate and interpretable visual reasoning across multiple steps.”

This focus on step-by-step reasoning is particularly critical in fields like scientific research and education, where the process behind a solution can be as important as the solution itself. By emphasizing logical coherence, VRC-Bench encourages the development of models that can handle the complexity and ambiguity of real-world tasks.

LlamaV-o1’s performance on VRC-Bench speaks volumes about its potential. On average, the model scored 67.33% across benchmarks like MathVista and AI2D, outperforming other open-source models like LLaVA-CoT (63.50%). These results position LlamaV-o1 as a leader in the open-source AI space, narrowing the gap with proprietary models like GPT-4o, which scored 71.8%.

AI’s next frontier: Interpretable multimodal reasoning

While LlamaV-o1 represents a major breakthrough, it’s not without limitations. Like all AI models, it is constrained by the quality of its training data and may struggle with highly technical or adversarial prompts. The researchers also caution against using the model in high-stakes decision-making scenarios, such as healthcare or financial predictions, where errors could have serious consequences.

Despite these challenges, LlamaV-o1 highlights the growing importance of multimodal AI systems that can seamlessly integrate text, images


OpenAI’s agentic era begins: ChatGPT Tasks offers job scheduling, reminders and more

ChatGPT is taking a significant step toward becoming a full-blown personal assistant with the release of a new feature called Tasks. This could signal that OpenAI will release more agents in the future.

Currently in beta, Tasks lets ChatGPT Plus, Team and Pro users schedule actions ahead of time. For example, if someone wants to receive project reminders or daily weather updates, they can prompt ChatGPT, which will notify them at the chosen date and time. Tasks can be recurring or one-time reminders.

To set up a task, users should toggle to “4o with scheduled tasks” on the model picker and write a reminder prompt. ChatGPT can also suggest tasks from previous conversations. Tasks work with all versions of ChatGPT and send notifications through desktop, web and mobile. However, users can access the task manager only on the web version of ChatGPT.

OpenAI said the beta period will help its researchers “understand how people use Tasks and refine the feature before making it available to all ChatGPT users.”

Tasks join other assistant-like features for ChatGPT. During its “12 Days of OpenAI” event in December, OpenAI launched screen sharing, letting users open ChatGPT while reading a text message and asking it to help them respond.

OpenAI’s first agent?

Rumors around OpenAI releasing an AI agent swirled when some users caught ChatGPT providing access to scheduled tasks as far back as December. The agent, called Operator, would be the company’s first agent.

People believed Tasks could be the precursor to Operator. X user @kimmonismus, aka “Chubby,” said, “The Information seems to be right and everything is being prepared for the release of ‘Operator,’ i.e. OpenAI’s agent. The ‘Tasks’ function found by Tibor seems to be the first significant step in the preparation of Operator.
It is questionable whether we will get a release this month or just a preview.”

Testing Catalog News on X theorized Tasks could eventually enable ChatGPT to search for specific information, summarize data, open websites, access documents and think through problems.

VentureBeat reached out to OpenAI about Tasks and Operator. The company declined to answer the question and only said Tasks “will be an important step toward making ChatGPT a more helpful AI companion.”

OpenAI has already made its first foray into the agentic space with Swarm, a framework it released to help orchestrate AI agents.

Making ChatGPT a better assistant

If you’re like me, you probably use the Reminders app and Google Calendar too much to remind yourself to text a relative Happy Birthday or of your brunch plans, or to alert you to news embargoes and planned interviews.

There are many reminder, calendar and productivity apps available for both consumers and enterprises. These include Google Calendar, Outlook Calendar, Asana, Trello and Notion. The productivity assistant space is not short of applications that remind individuals and teams of tasks they must accomplish.

OpenAI’s big play in such a crowded space is interesting, especially since most people don’t think of chatbots as scheduling assistants. But ChatGPT already streamlines the process for users to migrate their coding or writing tasks to the platform or search the web without leaving the chat interface. ChatGPT even opens up developers’ IDE nearly automatically for them.

As ChatGPT adds more actions to its platform, setting up scheduled tasks and reminders is not so far-fetched. This makes ChatGPT a viable competitor to many productivity and scheduling apps.


AI is set to transform education — what enterprise leaders can learn from this development

After six decades of dreaming and experimenting, we might be on the cusp of a technology-enabled revolution in education. The Arizona State Board for Charter Schools recently approved the application by Unbound Academy for a new online school that will replace traditional teachers with AI teaching assistants, promising to deliver 2.4 times the academic growth for students compared to results from conventional schools.

This advance is not the result of another incremental tech experiment. Instead, it represents the latest chapter in a 60-year quest in computer assisted instruction (CAI) to transform education through technology. This time, the evidence suggests a true breakthrough might be near. If this Academy and similar initiatives are successful, it will mark the fulfillment of a long-held dream.

The idea of using computers to assist student learning dates to the 1950s, with the first application, Programmed Logic for Automatic Teaching Operations (PLATO), appearing in 1961. PLATO offered interactive lessons and real-time feedback using terminals connected by telephone lines to a time-share computer system. Like other time-share systems, PLATO ultimately failed due to the high expenses required.

Other attempts at immersive, experimental learning famously included Second Life, a virtual world accessible through the Internet where people participated as avatars, in the early 2000s. Although not explicitly a CAI tool, Second Life demonstrated the potential for immersive virtual learning environments. At one point at least 300 universities around the world, including Stanford and Harvard, taught courses or conducted research on the platform. Ultimately, Second Life struggled due to a poor user interface (UI), demanding technical requirements, a steep learning curve and an inability to scale.
The advent of generative AI in 2017 marked a turning point in CAI, with tools like Writable and Photomath enhancing both teaching and learning. Writable, for example, uses AI to provide feedback on student writing, helping teachers manage large workloads. As reported by Axios, Writable uses ChatGPT to produce comments and observations that are sent to the teacher, who is expected to review and tweak them before providing the feedback to the students.

Such tools highlight AI’s growing role in addressing the long-standing resource constraints of traditional education. In some school districts in the U.S., primary class sizes exceed 40 students. If a teacher spent 10 minutes reading and critiquing a writing assignment from each student, that would be 400 minutes, or more than 6.6 hours outside of class time, to provide feedback for one assignment. That seems untenable, especially in combination with evaluating other student assignments. A boost from technology will help to address this challenge.

AI-powered tutoring at scale

In a more comprehensive approach, the Khan Academy, led by founder Sal Khan, has been offering free online education tutorials since 2008. In 2023, the company launched Khanmigo, an interactive AI tutor for students that incorporates ChatGPT.

In a 2023 TED Talk, Khan talked about the potential of Khanmigo for improving student performance. In the talk, he discussed a 1984 paper titled “The 2 Sigma Problem” by education professor Benjamin Bloom, then at the University of Chicago and Northwestern University.

Caption: Khan Academy founder Sal Khan discusses AI-powered tutoring in a 2023 TED Talk.

The often-cited paper argued that students receiving individualized tutoring performed two standard deviations better than those receiving only traditional classroom instruction. However, Bloom was aware that this level of tutoring was impractical due to resource constraints, including the costs of obtaining human tutors.
Bloom believed the solution was to devise more economical interventions that could approach the benefits of tutoring.

Khan argues that through the application of AI-infused technology, Khanmigo effectively overcomes the resource constraints. As noted in a Harvard Business School case study, Khan said that Khanmigo might be “that holy grail we’ve all been reading about in science fiction for years, about an AI that could emulate a human tutor.”

Students who receive 1:1 human tutoring tested two standard deviations better than those who did not have individual tutoring. Source: https://web.mit.edu/5.95/www/readings/bloom-two-sigma.pdf

Some have pointed to flaws in the Bloom paper, questioning the evidence supporting his conclusion and dismissing the claims as being far-fetched. In an effort to “separate science fiction from science fact,” Paul von Hippel, a professor and associate dean for research in the LBJ School of Public Affairs at the University of Texas, opined that the two standard deviation claim is both “exaggerated and oversimplified.” Nevertheless, there is little question that the application of technology tools could improve educational outcomes.

Balancing efficiency and human connection

While AI tools show immense promise in addressing resource constraints, their adoption raises broader questions about the role of human connection in learning.

Which brings us back to Unbound Academy. Students will spend two hours online each school morning working through AI-driven lessons in math, reading, and science. Tools like Khanmigo and IXL will personalize the instruction and analyze progress, adjusting the difficulty and content in real-time to optimize learning outcomes. The charter application asserts that “this ensures that each student is consistently challenged at their optimal level, preventing boredom or frustration.”

Unbound Academy’s model significantly reduces the role of human teachers.
Instead, human “guides” provide emotional support and motivation while also leading workshops on life skills. What will students lose by spending most of their learning time with AI instead of human instructors, and how might this model reshape the teaching profession?

The Unbound Academy model is already used in several private schools and the results they have obtained are used to substantiate the advantages it claims. Yet, it is not clear how a computer-based model will impact a student’s ability to foster human connections outside of a traditional school setting. These issues and questions highlight the complex trade-offs schools like Unbound Academy must navigate as they redefine the educational landscape.

Is the revolution here?

The Academy is not the only instance of AI being used in schools. Khanmigo is being piloted


Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for specific and highly detailed responses.

It’s a challenge data scientists have struggled to overcome, and now, researchers from Google DeepMind say they have come a step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs’ ability to generate factually accurate responses based on long-form documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers to prompts.

Along with the new benchmark, the researchers have released a FACTS leaderboard to the Kaggle data science community.

As of this week, Gemini 2.0 Flash topped the leaderboard, with a factuality score of 83.6%. Others in the top 9 include Google’s Gemini 1.5 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. These all ranked above 61.7% in terms of accuracy.

The researchers say the leaderboard will be actively maintained and continually updated to include new models and their different iterations.

“We believe that this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, in comparison to benchmarks that focus on narrower use cases…such as summarization alone,” the researchers write in a technical paper published this week.

Weeding out inaccurate responses

Ensuring factual accuracy in LLM responses is difficult because of modeling (architecture, training and inference) and measuring (evaluation methodologies, data and metrics) factors.
Typically, researchers point out, pre-training focuses on predicting the next token given previous tokens.

“While this objective may teach models salient world knowledge, it does not directly optimize the model towards the various factuality scenarios, instead encouraging the model to generate generally plausible text,” the researchers write.

To address this, the FACTS dataset incorporates 1,719 examples (860 public and 859 private), each requiring long-form responses based on context in provided documents. Each example includes a system prompt (system_instruction) with general directives and the order to answer only based on provided context; a task (user_request) that includes a specific question to be answered; and a long document (context_document) with the necessary information.

To succeed and be labeled “accurate,” the model must process the long-form document and create a subsequent long-form response that is both comprehensive and fully attributable to the document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document or are not highly relevant or useful.

For example, a user may ask a model to summarize the main reasons why a company’s revenue decreased in Q3, and provide it with detailed information including a company’s annual financial report discussing quarterly earnings, expenses, planned investments and market analysis.

If a model then, say, returned: “The company faced challenges in Q3 that impacted its revenue,” it would be deemed inaccurate.

“The response avoids specifying any reasons, such as market trends, increased competition or operational setbacks, which would likely be in the document,” the researchers point out.
“It doesn’t demonstrate an attempt to engage with or extract relevant details.”

By contrast, if a user prompted, “What are some tips on saving money?” and provided a compilation of categorized money-saving tips for college students, a correct response would be highly detailed: “Utilize free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”

DeepMind uses LLMs to judge LLMs

To allow for diverse inputs, researchers included documents of varying lengths, up to 32,000 tokens (or the equivalent of 20,000 words). These cover areas including finance, technology, retail, medicine and law. User requests are also broad, including Q&A generation, requests for summarization and rewriting.

Each example is judged in two phases. First, responses are evaluated for eligibility: If they don’t satisfy user requests, they are disqualified. Second, responses must be hallucination-free and fully grounded in the documents provided.

These factuality scores are calculated by three different LLM judges (specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet) that determine individual scores based on the percentage of accurate model outputs. Subsequently, the final factuality determination is based on an average of the three judges’ scores.

Researchers point out that models are often biased towards other members of their model family, at a mean increase of around 3.23%, so the combination of different judges was critical to help ensure responses were indeed factual.

Ultimately, the researchers emphasize that factuality and grounding are key factors to the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems,” they write.
However, they also concede: “We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”
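The two-phase judging and three-judge averaging described in the article can be sketched in a few lines. This is only an illustration under stated assumptions, not DeepMind's evaluation code: the `Verdict` class, the function names and the sample verdicts are hypothetical, and the sketch assumes a disqualified response simply counts as not accurate.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Verdict:
    eligible: bool  # phase 1: does the response satisfy the user request?
    grounded: bool  # phase 2: is every claim supported by the document?


def judge_score(verdicts: List[Verdict]) -> float:
    """Fraction of responses one judge accepts as eligible and grounded."""
    accurate = sum(1 for v in verdicts if v.eligible and v.grounded)
    return accurate / len(verdicts)


def factuality_score(per_judge: Dict[str, List[Verdict]]) -> float:
    """Average the per-judge scores into one final score."""
    return mean(judge_score(v) for v in per_judge.values())


# Made-up verdicts for four responses from three hypothetical judges.
sample = {
    "judge_a": [Verdict(True, True), Verdict(True, True),
                Verdict(True, True), Verdict(False, True)],
    "judge_b": [Verdict(True, True), Verdict(True, False),
                Verdict(True, True), Verdict(False, True)],
    "judge_c": [Verdict(True, True), Verdict(True, True),
                Verdict(True, True), Verdict(True, False)],
}

score = factuality_score(sample)
print(f"final factuality score: {score:.1%}")
```

Averaging across judges from different model families is what dampens the self-preference bias the researchers describe: a single judge that favors its own family moves the final score by only a third of its bias.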


The 13 best ideas, products and services of CES 2025

The CES 2025 tech trade show is finally over in Las Vegas, after a grueling six days for me and crowds that numbered 141,000. As one of 6,000 journalists at the show, I walked around a lot to find the coolest tech. At or ahead of CES 2025, I recorded dozens of press events, interviews, and sessions.

My conclusion? The robots are coming. And I, for one, welcome our new robot overlords. OK, maybe just a few of them.

This year, I walked 46.79 miles, or 105,433 steps, over 5.5 days. Last year, I walked 46.78 miles, or 105,407 steps, over six days compared with 38.81 miles (or 87,447 steps) in 2023. My feet hurt and I managed to catch a mild version of conference crud. At least it wasn’t COVID.

I wrote 65 stories (67 last year) ahead of and during CES and I moderated one panel on AI and gaming. I have a lot more stories to write based on interviews I did at the show. This story is about the coolest tech I saw in Las Vegas. If you had some FOMO from not going to the show, maybe this list of 13 items will help you feel better.

CES Unveiled 2025 party. There were times it was not so easy to dive in.

As I said going into the show, I felt like this was a magical year for new technology, triggered by the gift of generative AI. That was so apparent in the opening keynote about AI by Nvidia CEO Jensen Huang, though he acknowledged in a Q&A that he might have communicated the big picture message better.

Now it’s time to analyze and make some sense of this. I hope you like these ideas, whether they are concepts or finished products. Here’s my list from last year at CES 2024, CES 2023 and the year before at CES 2022. This is the first time in a while I was able to get a selfie with Huang.

Dean Takahashi and Nvidia CEO Jensen Huang.
This year featured nearly 4,500 exhibitors, up from 4,300 exhibitors in 2024 and 3,000 exhibitors in 2023, and compared with 4,000 (in-person) in 2020. Instead of rolling around my roller bag, I heeded the no roller bag warning and wore a backpack. Feeling a bit stiff from that.

Here are the things that caught my eye. In some cases, like smart mirrors, I realize there are many players coming to the market, but these are the ones I saw up close.

Nvidia is marrying tech for AI in the physical world with digital twins.

This was a significant announcement that was fairly difficult to grasp, and I did not see it up close. In his opening keynote speech, Nvidia CEO Jensen Huang unveiled the Cosmos world foundation model platform to accelerate physical AI development. This means it will be far easier to create robots for the real world with smarter AI based on synthetic data. That data, generated by running simulations and learning from video, teaches the robot about the physical world in a way that just can’t be done in real-life testing.

The platform includes state-of-the-art generative world foundation models, advanced tokenizers, guardrails and an accelerated video processing pipeline built to advance the development of physical AI systems such as autonomous vehicles (AVs) and robots.

Physical AI models are costly to develop, and require vast amounts of real-world data and testing. Cosmos world foundation models, or WFMs, (trained on 20 million hours of video) offer developers an easy way to generate massive amounts of photoreal, physics-based synthetic data to train and evaluate their existing models. Developers can also build custom models by fine-tuning Cosmos WFMs.

Cosmos models will be available under an open model license to accelerate the work of the robotics and AV community. Developers can preview the first models on the Nvidia API catalog, or download the family of models and fine-tuning framework from the Nvidia NGC catalog or Hugging Face.
Companies like Agility Robotics are benefiting from such models in training their robots. Agility Robotics showed a robot that could take boxes and stack them on a conveyor belt.

With robots coming faster and smarter, we’ll have to prepare society for what this will mean. Many fear robots will replace human workers, while companies say they’re addressing chronic shortages in the most difficult physical jobs. I think the most important task ahead is to create these innovative technologies that can help so many people while finding satisfying outcomes for the displaced.

Nvidia Blackwell GeForce RTX 50 series graphics card

Nvidia ate its own dogfood when it came to artificial intelligence. Nvidia took the wraps off DLSS 4, which uses AI to predict the next pixel that needs to be drawn and then preemptively renders the pixel based on that prediction.

This is possible in part because the AI TOPS (a measure of AI performance) will be up to 4,000. DLSS 4 can now generate multiple frames at once thanks to advanced AI technology. That makes for much better frame rates.

Nvidia’s DLSS 4 AI tech is paying off.

Nvidia showed that one scene could be rendered at 27 frames per second with DLSS turned off, with a 71 millisecond PC latency. DLSS 2 can do that scene with its super resolution tech at 71 FPS and PC latency of 34 milliseconds. DLSS 3.5 can do the scene at 140 FPS and 33 milliseconds. But DLSS 4 comes in at a whopping 247 FPS and 34 milliseconds. DLSS 4 delivers more than eight times the performance of systems that aren’t using AI for the predictive processing.

Do we need frame rates like this? Maybe not. But game developers will figure out how to make use of this technology.

DLSS 4’s Multi Frame Generation can boost frame rates by using AI to generate up to three frames per rendered frame. It works


Building Colossus: Supermicro’s groundbreaking AI supercomputer built for Elon Musk’s xAI

Presented by Supermicro

The team at xAI, partnering with Supermicro and NVIDIA, is building the largest liquid-cooled GPU cluster deployment in the world. It’s a massive AI supercomputer that encompasses over 100,000 NVIDIA HGX H100 GPUs, exabytes of storage and lightning-fast networking, all built to train and power Grok, a generative AI chatbot developed by xAI.

The multi-billion-dollar data facility, based in Memphis, TN, went from an empty building, without any of the necessary power generators, transformers or multi-hall structure, to a production AI supercomputer in just 122 days.

To help the world understand the extraordinary achievement of the xAI Colossus cluster, VentureBeat is excited to share this exclusive detailed video tour, made possible by Supermicro, and produced by ServeTheHome.

Video: Supermicro (@Supermicro_SMCI), October 25, 2024, pic.twitter.com/eVdrVdi3b8

Here is a run-down of the highlights for this massive undertaking.

Inside the data hall

When you set out to build the largest AI supercomputer, it is clear from the start that a massive amount of computing power will be needed. It must be ready to install and become operational on day 1. And the overall solution will need to be customized to xAI’s unique requirements.

The design starts out using a fairly common raised floor data hall, with power located above, and liquid cooling pipes leading to the facility chiller below. Each of the four compute halls has about 25,000 NVIDIA GPUs, plus all the storage, fiber optic high-speed networking, and power built in.

XAI Colossus Data Center Supermicro Liquid Cooled Nodes Low Angle, Courtesy ServeTheHome

From there things get more specialized. Every cluster contains the basic building block for Colossus: the Supermicro liquid-cooled rack.

Each rack contains eight Supermicro 4U Universal GPU systems which include liquid-cooled NVIDIA HGX H100 8-GPUs and two liquid-cooled x86 CPUs. Each rack thus contains 64 NVIDIA Hopper GPUs.
Eight of these GPU servers, plus a Supermicro coolant distribution unit (CDU) and coolant distribution manifolds (CDM), make up one of the racks. The racks are arranged in groups of eight with 512 GPUs, plus a networking rack to provide miniclusters within the much larger system.

The xAI Colossus Data Center Supermicro 4U Universal GPU Liquid-Cooled Servers are the most dense and advanced AI servers available today, with a sophisticated liquid cooling system and the ability to be serviced without removing the systems from the rack.

XAI Colossus Data Center Supermicro 4U Universal GPU Liquid Cooled Server Close, courtesy ServeTheHome

Next-level liquid-cooled server and rack design

The horizontal 1U rack coolant distribution manifold (CDM) above each server brings in cool liquid and transports out the warmed liquid; quick disconnects make it fast and simple to remove or reinstall the liquid cooling equipment one-handed to reveal the two bottom trays.

The rack features eight of the Supermicro 4U Universal GPU Systems for Liquid-Cooled NVIDIA HGX H100 and HGX H200 Tensor Core GPUs. Each system’s top tray holds the NVIDIA HGX H100 8-GPU complex and the cold plates on the NVIDIA HGX board to cool the GPUs. The bottom tray holds the motherboard, CPU, RAM and PCIe switches, and cold plates on dual socket CPUs.

Uniquely, Supermicro’s motherboard in the bottom tray integrates the four Broadcom PCIe switches used in almost every NVIDIA HGX server today on the right-hand side of the board, instead of putting these switches on a separate board. And unlike other AI servers in the industry, which add liquid-cooling to an air-cooled design after it’s manufactured, Supermicro’s servers are designed from the ground up to be liquid-cooled with a custom liquid-cooling block.

This kind of compact power, accessibility and serviceability make these systems incredibly scalable and singularly set Supermicro apart in the industry.
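The GPU counts quoted for the data hall follow from simple multiplication, which the short sketch below reproduces. The constants come from the figures in the article; the rack estimate at the end is only a back-of-the-envelope calculation that treats "over 100,000" as a round 100,000 and is not a number the article reports.

```python
# Figures quoted in the article.
GPUS_PER_SERVER = 8    # one NVIDIA HGX H100 8-GPU board per 4U system
SERVERS_PER_RACK = 8
RACKS_PER_GROUP = 8    # GPU racks only; the networking rack is extra
TOTAL_GPUS = 100_000   # "over 100,000", treated here as a round figure
COMPUTE_HALLS = 4

gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK
gpus_per_group = gpus_per_rack * RACKS_PER_GROUP
gpus_per_hall = TOTAL_GPUS // COMPUTE_HALLS

# Rough estimate (an assumption, not a reported figure): GPU racks needed
# for 100,000 GPUs, rounded up to whole racks.
racks_needed = -(-TOTAL_GPUS // gpus_per_rack)

print(f"GPUs per rack:      {gpus_per_rack}")
print(f"GPUs per 8 racks:   {gpus_per_group}")
print(f"GPUs per hall:      {gpus_per_hall}")
print(f"GPU racks (approx): {racks_needed}")
```

The first three results match the article's 64 GPUs per rack, 512 GPUs per group of eight racks and about 25,000 GPUs per compute hall.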
Image: Supermicro 4U Universal GPU System for Liquid-Cooled NVIDIA HGX H100 and HGX H200, demonstrated at SC23. Courtesy ServeTheHome

Plus, each of the CDUs has its own monitoring system to keep tabs on flow rate, temperature and other critical functions, which ties into the central management interface. Each CDU has redundant pumps and power supplies so that if one fails, it can be serviced or replaced in minutes without interrupting the running system. The Supermicro servers still use system fans to cool components like DIMMs, power supplies, low-power baseboard management controllers, NICs and other electronics. To keep each rack cooling-neutral, the server fans pull cooler air from the front and exhaust the warmer air through liquid-cooled rear-door heat exchangers. Extra heat is removed from Supermicro’s liquid-cooled GPU servers as well as from storage, CPU compute clusters and networking components. The fans use far less power than those in an air-cooled server, which lowers the power each server needs.

Networking the Colossus

The data center’s gargantuan networks run on the NVIDIA Spectrum-X Ethernet networking platform, which is used to scale its massive AI clusters to an extent no other technology can touch. Spectrum-X is a cutting-edge networking platform that provides fast and reliable data transfer designed to handle the high demands of AI workloads. It offers features like smarter routing of data, reduced delays and better control of network traffic. It also includes enhanced AI fabric visibility and monitoring, making it ideal for large AI projects in shared infrastructure environments. Each cluster uses NVIDIA BlueField-3 SuperNICs, which provide 400-gigabit-per-second networking. It’s the same base technology that any desktop Ethernet cable uses, but in the data center it’s 400GbE, or 400 times faster per optical connection than a typical 1GbE desktop link.
Nine links per system offer 3.6Tbps of bandwidth per GPU compute server. The RDMA (Remote Direct Memory Access) network for the GPUs makes up the majority of this bandwidth. Each GPU is paired with its own NVIDIA BlueField-3 SuperNIC and Spectrum-X networking technology.

Image: xAI Colossus data center, switch fiber 1. Courtesy ServeTheHome

Beyond the GPU RDMA network, the CPUs also get a 400GbE connection, which uses a different switch fabric entirely. xAI is running a network for its GPUs and one for the rest of the cluster,
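
As a rough cross-check, the rack and network figures quoted in this article can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the constant names are made up for this example, the values are the ones stated in the article, and the four-hall total is an approximation because the article gives "about 25,000" GPUs per hall.

```python
# Figures as stated in the article; the constant names are illustrative only.
GPUS_PER_SERVER = 8      # one NVIDIA HGX H100 8-GPU board per 4U system
SERVERS_PER_RACK = 8     # eight 4U systems per liquid-cooled rack
RACKS_PER_GROUP = 8      # racks are arranged in groups of eight
LINKS_PER_SERVER = 9     # nine 400GbE links per GPU compute server
LINK_GBPS = 400          # bandwidth of each link, in gigabits per second
HALLS = 4                # compute halls in the facility
GPUS_PER_HALL = 25_000   # "about 25,000" per hall, so the total is approximate

gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK
gpus_per_group = gpus_per_rack * RACKS_PER_GROUP
server_tbps = LINKS_PER_SERVER * LINK_GBPS / 1000  # Gbps -> Tbps
approx_total_gpus = HALLS * GPUS_PER_HALL

print(f"GPUs per rack:            {gpus_per_rack}")         # 64
print(f"GPUs per group of racks:  {gpus_per_group}")        # 512
print(f"Bandwidth per server:     {server_tbps} Tbps")      # 3.6 Tbps
print(f"Approximate GPUs overall: {approx_total_gpus:,}")   # 100,000
```

The results line up with the article's numbers: 64 GPUs per rack, 512 per group of eight racks, 3.6Tbps per server and roughly 100,000 GPUs across the four halls.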

Diffbot’s AI model doesn’t guess — it knows, thanks to a trillion-fact knowledge graph

Diffbot, a small Silicon Valley company best known for maintaining one of the world’s largest indexes of web knowledge, announced today the release of a new AI model that promises to address one of the biggest challenges in the field: factual accuracy. The new model, a fine-tuned version of Meta’s Llama 3.3, is the first open-source implementation of a system known as graph retrieval-augmented generation, or GraphRAG. Unlike conventional AI models, which rely solely on vast amounts of preloaded training data, Diffbot’s LLM draws on real-time information from the company’s Knowledge Graph, a constantly updated database containing more than a trillion interconnected facts. “We have a thesis: that eventually general-purpose reasoning will get distilled down into about 1 billion parameters,” said Mike Tung, Diffbot’s founder and CEO, in an interview with VentureBeat. “You don’t actually want the knowledge in the model. You want the model to be good at just using tools so that it can query knowledge externally.”

How it works

Diffbot’s Knowledge Graph is a sprawling, automated database that has been crawling the public web since 2016. It categorizes web pages into entities such as people, companies, products and articles, extracting structured information using a combination of computer vision and natural language processing. Every four to five days, the Knowledge Graph is refreshed with millions of new facts, ensuring it remains up-to-date. Diffbot’s AI model leverages this resource by querying the graph in real time to retrieve information, rather than relying on static knowledge encoded in its training data. For example, when asked about a recent news event, the model can search the web for the latest updates, extract relevant facts, and cite the original sources.
This process is designed to make the system more accurate and transparent than traditional LLMs. “Imagine asking an AI about the weather,” Tung said. “Instead of generating an answer based on outdated training data, our model queries a live weather service and provides a response grounded in real-time information.”

How Diffbot’s Knowledge Graph beats traditional AI at finding facts

In benchmark tests, Diffbot’s approach appears to be paying off. The company reports its model achieves an 81% accuracy score on FreshQA, a Google-created benchmark for testing real-time factual knowledge, surpassing both ChatGPT and Gemini. It also scored 70.36% on MMLU-Pro, a more difficult version of a standard test of academic knowledge. Perhaps most significantly, Diffbot is making its model fully open-source, allowing companies to run it on their own hardware and customize it for their needs. This addresses growing concerns about data privacy and vendor lock-in with major AI providers. “You can run it locally on your machine,” Tung noted. “There’s no way you can run Google Gemini without sending your data over to Google and shipping it outside of your premises.”

Open-source AI could transform how enterprises handle sensitive data

The release comes at a pivotal moment in AI development. Recent months have seen mounting criticism of large language models’ tendency to “hallucinate” or generate false information, even as companies continue to scale up model sizes. Diffbot’s approach suggests an alternative path forward, one focused on grounding AI systems in verifiable facts rather than attempting to encode all human knowledge in neural networks. “Not everyone’s going after just bigger and bigger models,” Tung said.
“You can have a model that has more capability than a big model with kind of a non-intuitive approach like ours.” Industry experts note that Diffbot’s Knowledge Graph-based approach could be particularly valuable for enterprise applications where accuracy and auditability are crucial. The company already provides data services to major firms including Cisco, DuckDuckGo and Snapchat. The model is available immediately through an open-source release on GitHub and can be tested through a public demo at diffy.chat. For organizations wanting to deploy it internally, Diffbot says the smaller 8-billion-parameter version can run on a single Nvidia A100 GPU, while the full 70-billion-parameter version requires two H100 GPUs. Looking ahead, Tung believes the future of AI lies not in ever-larger models, but in better ways of organizing and accessing human knowledge: “Facts get stale. A lot of these facts will be moved out into explicit places where you can actually modify the knowledge and where you can have data provenance.” As the AI industry grapples with challenges around factual accuracy and transparency, Diffbot’s release offers a compelling alternative to the dominant bigger-is-better paradigm. Whether it succeeds in shifting the field’s direction remains to be seen, but it has certainly demonstrated that when it comes to AI, size isn’t everything.

OpenAI has begun building out its robotics team

OpenAI is best known for its AI models, which to date exist primarily on cloud servers, its website and in its apps for PCs and mobile devices. However, the company is not limiting its ambitions to the software realm: Today on X, Caitlin Kalinowski, a member of the technical staff at OpenAI and previously head of AR glasses at Meta, posted a message announcing that the firm is hiring for its first hardware robotics roles. These include a systems integration electrical engineer “to help us design the sensor suite for our robots,” and a mechanical robotics product engineer to create “gears, actuators, motors and linkages for robots,” as well as a technical program manager (TPM) on the robotics team “responsible for spearheading the logistics of our data-gathering lab, including ensuring contract workers have what they need to gather data from our test set-ups.” The listings also include a description of the new effort: “Our robotics team is focused on unlocking general-purpose robotics and pushing towards AGI-level intelligence in dynamic, real-world settings. Working across the entire model stack, we integrate cutting-edge hardware and software to explore a broad range of robotic form factors. We strive to seamlessly blend high-level AI capabilities with the physical constraints of physical.” Kalinowski herself announced her hiring at OpenAI a little more than two months ago “to lead robotics and consumer hardware.” Previously, OpenAI had reportedly been working on hardware with former Apple lead designer Jony Ive, and the company has partnered with robotics startup Figure to provide the models powering the latter’s humanoid robots. The new post from Kalinowski and the job listings signal the company’s most serious and heaviest investment in building out its own robotics division to date, and could ultimately mean it competes with Figure.
It wouldn’t be a new spot for OpenAI to be in, though, given it is also competing with and taking investment money from Microsoft.

Meta retreats from fact-checking content: what it means for businesses

Facebook creator and Meta CEO Mark “Zuck” Zuckerberg shook the world again today when he announced sweeping changes to the way his company moderates and handles user-generated posts and content in the U.S. Citing the “recent elections” as a “cultural tipping point,” Zuck explained in a roughly five-minute-long video posted to his Facebook and Instagram accounts this morning (Tuesday, January 7) that Meta would cease using independent third-party fact checkers and fact-checking organizations to help moderate and append notes to user posts shared across the company’s suite of social networking and messaging apps, including Facebook, Instagram, WhatsApp and Threads. Instead, Zuck said that Meta would rely on a “Community Notes” style approach, crowdsourcing information from the users across Meta’s apps to give context and veracity to posts, similar to (and Zuck acknowledged this in his video) the rival social network X (formerly Twitter). Zuck cast the changes as a return to Facebook’s “roots” in free expression, and a reduction in over-broad “censorship.” See the full transcript of his remarks at the bottom of this article.

Why this policy change matters to businesses

With more than 3 billion users across its services and products worldwide, Meta remains the largest social network to date. In addition, as of 2022, more than 200 million businesses worldwide, most of them small, used the company’s apps and services, and 10 million were active paying advertisers on the platform, according to one executive. Meta’s new chief global affairs officer Joel Kaplan, a former deputy chief of staff for Republican President George W.
Bush, who recently took on the role in what many viewed as a signal to lawmakers and the wider world of Meta’s willingness to work with the GOP-led Congress and White House following the 2024 election, also published a note to Meta’s corporate website describing some of the changes in greater detail. Already, some business executives such as Shopify’s CEO Tobi Lutke have seemingly embraced the announcement. As Lutke wrote on X today: “Huge and important change.” Founders Fund chief marketing officer and tech influencer Mike Solana also hailed the move, writing in a post on X: “There’s already been a dramatic decrease in censorship across the [M]eta platforms. but a public statement of this kind plainly speaking truth (the ‘fact checkers’ were biased, and the policy was immoral) is really and finally the end of a golden age for the worst people alive.” However, others are less optimistic about and less receptive to the changes, viewing them as less about freedom of expression and more about currying favor with the incoming administration of President-elect Donald J. Trump (returning for a second, non-consecutive term) and the GOP-led Congress, as other business executives and firms have seemingly moved to do. “More free expression on social media is a good thing,” wrote the nonprofit Freedom of the Press Foundation on the social network BlueSky (disclosure: my wife is a board member of the nonprofit). “But based on Meta’s track record, it seems more likely that this is about sucking up to Donald Trump than it is about free speech.” George Washington University political communication professor Dave Karpf seemed to agree, writing on BlueSky: “Two salient facts about Facebook replacing its fact-checking program with community notes: (1) community notes are cheaper. (2) the incoming political regime dislikes fact-checking. So community notes are less trouble. The rest is just framing.
Zuck’s sole principle is to do what’s best for Zuck.” And Kate Starbird, professor at the University of Washington and cofounder of the UW Center for an Informed Public, wrote on BlueSky that: “Meta is dropping its support for fact-checking, which, in addition to degrading users’ ability to verify content, will essentially defund all of the little companies that worked to identify false content online. But our FB feeds are basically just AI slop at this point, so?” Reached by email, Damian Rollison, Director of Market Insights at AI marketing firm SOCi, also noted that Zuck and Meta appeared to be emulating a more libertine approach toward online content moderation championed by X owner Elon Musk: “I think it’s safe to say that no one predicted Elon Musk’s chaotic takeover of Twitter would become a trend other tech platforms would follow, and yet here we are. We can see now in retrospect that Musk established a standard for a newly conservative approach to the loosening of online content moderation, one that Meta has now embraced in advance of the incoming Trump administration. What this will likely mean is that Facebook and Instagram will see a spike in political speech and posts on controversial topics. As with Musk’s X, where ad revenues are down by half, this change may make the platform less attractive to advertisers. It may also cement a trend whereby Facebook is becoming the social network for older, more conservative users and ceding Gen Z to TikTok, with Instagram occupying a middle ground between them.”

When will the changes take place?

Both Zuck and Kaplan stated in their respective video and text posts that the changes to Meta’s content moderation policies and practices would be coming to the U.S. in “the next couple of months.” Meta will discontinue its independent fact-checking program in the United States, launched in 2016, in favor of a community notes model inspired by X (formerly Twitter).
This system will rely on users to write and rate notes, requiring agreement across diverse perspectives to ensure balance and prevent bias. According to its website, Meta had been working with a variety of organizations “certified through the non-partisan International Fact-Checking Network (IFCN) or European Fact-Checking Standards Network (EFCSN) to identify, review and take action” on content deemed “misinformation.” However, as Zuck opined in his video post, “after Trump first got elected in 2016 the legacy media wrote non-stop about how misinformation was a threat to democracy. We tried, in good faith, to address those concerns without becoming the arbiters of

Cybersecurity at AI speed: How agentic AI is supercharging SOC teams in 2025

Security operations centers (SOCs) are under siege by a new wave of automated adversarial attacks. These attacks move at unprecedented speed and are proving difficult to detect, decipher and defend against. With adversaries achieving breakout times of just two minutes and seven seconds, it’s not a question of if an SOC is going to be attacked, it’s when. And 77% of enterprises have already been victims of adversarial AI attacks. For an SOC to protect itself and its company infrastructure, speed is crucial.

Enter agentic AI

Agentic AI helps SOCs automate decision-making, adapt to evolving threats, and streamline workflows, including alert triage and incident response. It’s proven effective at improving efficiency and strengthening security by identifying risks while reducing the manual effort needed to track them. Leading cybersecurity providers offering agentic AI solutions for SOCs include Arcanna.ai, Cato Networks, Cisco Security Cloud, CrowdStrike (Falcon platform with Charlotte AI), Dropzone AI, Google Cloud Security AI Workbench, Microsoft Security Copilot, Palo Alto Networks and Zscaler. “The speed of today’s cyberattacks requires security teams to rapidly analyze massive amounts of data to detect, investigate and respond faster. Adversaries are setting records, with breakout times of just over two minutes, leaving no room for delay,” George Kurtz, president, CEO and cofounder of CrowdStrike, told VentureBeat during a recent interview.

Plan for SOC teams and agentic AI to strengthen each other

For any agentic AI or broader SOC AI implementation to be successful, human-in-the-middle workflows are essential. Gartner’s recent report, “Predicts 2025: There Will Never Be an Autonomous SOC,” reinforces VentureBeat’s observation of how SOCs are piloting and adopting agentic AI and broader AI apps and platforms.
“Security leaders and senior operational staff need to identify where human-led SOC functions persist and how to transition SOC analysts to roles that require more human-in-the-loop decision-making,” advises Gartner. The report predicts that by 2026, AI will increase SOC efficiency by 40% compared to 2024 efficiency, beginning a shift in SOC expertise toward AI development, maintenance and protection. To integrate agentic AI effectively, SOCs need a clear framework that balances technology with human expertise. Gartner’s expanded SOC model below illustrates how roles, capabilities and objectives align to enhance efficiency and adaptability.

Source: Gartner, SOC Model Guide, October 18, 2023

SOC challenges are a perfect use case for agentic AI

SOCs need agentic AI that matches the speed and insight of attackers if they’re going to stand a chance of thwarting an intrusion or breach attempt. Many SOCs are understaffed. Many also find it challenging to make sense of data from legacy security information and event management (SIEM) systems that lack visualization techniques or the ability to use graph databases to map threats. The need to get beyond thinking in lists, and think more in graphs like attackers do when they plan a breach, is one of several factors driving a strong graph database arms race across the industry. Struggling to keep up with the torrent of alerts, false positives and ongoing maintenance work, SOC teams face these challenges daily:

Legacy systems leave SOCs exposed to growing AI threats. SOCs remain burdened by outdated SIEM systems, legacy endpoint detection and response (EDR), firewalls, and intrusion detection systems (IDS/IPS) that aren’t equipped to address the speed and complexity of AI-driven threats. Shlomo Kramer, CEO of Cato Networks, told VentureBeat during a recent interview, “The greatest threat to organizations is their security infrastructure complexity.
Point products create gaps in their security posture, leaving them prime targets for threat actors.” Kramer added, “Over the next five years, I see cyber threats evolving across three dimensions: tactically, with AI-versus-AI battles; operationally, through infrastructure complexity; and strategically, shaped by geopolitical conflicts. Organizations relying on fragmented legacy tools will struggle to defend against these escalating threats.”

Chronic alert fatigue leads to missed intrusion attempts and high staff turnover. SOC analysts struggle to keep up with the thousands of alerts, false alarms and incompatible reports from multiple legacy SIEM and SOAR systems across their centers. CISOs report seeing up to 10,000 events a day coming across their operations center’s broad base of systems. They question whether it’s the best use of their analysts’ time to find the three or four that are actual threats when AI has already proven itself capable of detecting anomalous events.

Organizations face staffing shortages for key SOC roles. It’s nearly impossible for many organizations to scale their SOC teams with internal talent only. While hiring from the outside is always an option, SOC teams need to invest in their team’s continual training and career development to retain business expertise while strengthening cyber expertise.

A growing tidal wave of security data risk threatens to overwhelm SOC teams. Kurtz echoed the gravity of the challenge in a recent interview, “One of the main problems in security is a data problem, and it’s one of the reasons why I started CrowdStrike.
It’s why I created the architecture that we have, and it’s incredibly difficult for SOC teams to sort through this massive amount of data and volumes to find threats.”

Where agentic AI is making an impact

The most significant payoff from agentic AI will come from augmenting SOC analysts and teams with automation of routine tasks while giving them more cutting-edge intelligence tools to learn with. VentureBeat is seeing agentic AI impacting the following areas:

Achieving efficiency gains at scale for the most routine, repetitive tasks. Agentic AI pilot and production systems are delivering improved efficiencies by automating routine tasks at scale. Vasu Jakkal, corporate vice president at Microsoft, shared with VentureBeat in a recent interview the results of research her company completed on Security Copilot productivity gains. “The study showed that early career professionals using Security Copilot were 26% faster and 35% more accurate. Seasoned professionals using the tool were 22% faster and 7% more accurate, with 90% expressing a desire to use it again,” Jakkal said.

Threat detection, analytics and intelligence
