VentureBeat

Georgia Tech joins Apple’s new silicon engineering initiative

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

Georgia Tech has joined Apple’s initiative aimed at preparing students for careers in hardware technology, computer architecture and silicon chip design. Georgia Tech, based in Atlanta, Georgia, said that its electrical and computer engineering students will now benefit from an expanded tapeout-to-silicon curriculum and have access to Apple engineers to better prepare for careers in hardware engineering.

Let’s set aside the possibility that such jobs may be eliminated by AI in the future. For now, they are extremely skill-intensive, and it has been hard to attract enough American students to pursue these careers in recent years. Programs like this have to happen if the U.S. is to achieve the political aim of designing and manufacturing technology products on American shores.

The Georgia Tech School of Electrical and Computer Engineering (ECE) is expanding its collaboration with Apple by joining the company’s New Silicon Initiative (NSI). As part of the program, ECE students will receive various types of support to enhance their skills in microelectronic circuits and hardware design, including scholarship and fellowship opportunities, along with expanded coursework for both undergraduate and graduate students. Students will also have the opportunity to connect with Apple engineers through mentorships, guest lectures and networking events. The school enrolls more than 2,500 students.

The expanded curriculum support will benefit integrated circuit (IC) design and tapeout-to-silicon courses that prepare students for careers in hardware engineering across different focus areas, including circuit technology, electronic devices, and computing hardware and emerging architectures.
“Working with Apple as part of its New Silicon Initiative allows us to bridge the skills gap for a workforce in IC design and computer architecture by preparing students with the technical abilities and skills to enter a rapidly evolving, always in-demand industry,” said Arijit Raychowdhury, professor and chair of ECE at Georgia Tech. “Offering students the ability to learn directly from Apple engineers gives them a leg up and helps them gear up for the next chapter of their careers. We’re grateful and excited to expand our partnership with Apple to offer students unique learning opportunities.”

Apple engineer and Georgia Tech graduate Fernando Mujica addressing Georgia Tech students.

Apple engineers will work closely with ECE faculty members to present guest lectures across a range of integrated system design courses. The engineers will also participate in reviews for projects in several IC design courses and provide practical feedback to help students improve their designs throughout the tapeout process.

“We’re thrilled to bring the New Silicon Initiative to Georgia Tech, expanding our relationship with its School of Electrical and Computer Engineering,” said Jared Zerbe, director of hardware technologies at Apple, in a statement. “Integrated circuits power countless products and services in every aspect of our world today, and we can’t wait to see how Georgia Tech students will help enable and invent the future.”

As part of the NSI program, graduate students can pursue Apple Ph.D. fellowships, including a Ph.D. Fellowship in Integrated Circuits and Systems announced this October. The expanded collaboration between Apple and ECE builds upon the 2022 launch of a digital circuit design course introduced with Apple’s support, which offers undergraduate students a hands-on theory-to-tapeout course for very large-scale integrated (VLSI) digital circuits. Apple launched NSI in 2019 and expanded the effort in 2021 to include Colleges of Engineering at several historically Black colleges and universities (HBCUs).
Georgia Tech is now the eighth university to be part of the program, giving students access to cutting-edge technologies and world-class experts. Apple and ECE held a kickoff event at Georgia Tech last month to share the NSI news with students. During the event, Apple experts and ECE faculty members highlighted how the program will be integrated into the school’s hardware curriculum. Over 600 students attended, enjoying networking opportunities, burritos, and bubble tea.

For more information about the Apple NSI at ECE, visit https://ece.gatech.edu/apple-new-silicon-initiative-nsi.

As a leading technological university, Georgia Tech is an engine of economic development for Georgia, the Southeast, and the nation, conducting more than $1 billion in research annually for government, industry, and society. More than 2,500 students are enrolled in ECE.


How to get started with AI agents (and do it right)

Due to the fast-moving nature of AI and fear of missing out (FOMO), generative AI initiatives are often top-down driven, and enterprise leaders tend to get overly excited about the groundbreaking technology. But when companies rush to build and deploy, they often run into all the typical issues that arise with other technology implementations. AI is complex and requires specialized expertise, meaning some organizations quickly get in over their heads.

In fact, Forrester predicts that nearly three-quarters of organizations that attempt to build AI agents in-house will fail.

“The challenge is that these architectures are convoluted, requiring multiple models, advanced RAG (retrieval augmented generation) stacks, advanced data architectures and specialized expertise,” write Forrester analysts Jayesh Chaurasia and Sudha Maheshwari.

So how can enterprises choose when to adopt third-party models or open source tools, and when to build custom, in-house fine-tuned models? Experts weigh in.

AI architecture is far more complex than enterprises think

Organizations that attempt to build agents on their own often struggle with retrieval augmented generation (RAG) and vector databases, Forrester senior analyst Rowan Curran told VentureBeat. It can be a challenge to get accurate outputs in expected time frames, and organizations don’t always understand the process — or importance — of re-ranking, which helps ensure that the model is working with the highest-quality data.

For instance, a user might input 10,000 documents and the model may return the 100 most relevant to the task at hand, Curran pointed out. But short context windows limit what can be fed in for re-ranking, so a human user may have to make a judgment call and choose, say, 10 documents, reducing model accuracy.
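The retrieve-then-re-rank pattern Curran describes can be sketched in a few lines. This is a toy illustration, not a production RAG stack: both scoring functions are stand-ins for real embedding and cross-encoder models, and the corpus is invented for the example.

```python
# Toy two-stage retrieval: a cheap first pass narrows a large corpus to a
# candidate pool, then a more expensive re-ranker picks the handful of
# documents that actually fit the model's context window.

def first_pass_score(query: str, doc: str) -> float:
    """Cheap lexical overlap: fraction of query words present in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def rerank_score(query: str, doc: str) -> float:
    """Pretend 'cross-encoder': overlap score penalized by document length."""
    return first_pass_score(query, doc) / (1 + len(doc.split()) / 100)

def retrieve(query: str, corpus: list[str], pool: int = 100, top: int = 10) -> list[str]:
    """First pass keeps `pool` candidates; re-ranking keeps the final `top`."""
    candidates = sorted(corpus, key=lambda d: first_pass_score(query, d), reverse=True)[:pool]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top]

corpus = [
    "quarterly revenue report for the retail division",
    "employee handbook: vacation policy",
    "retail division marketing plan",
    "revenue forecast models and assumptions",
]
print(retrieve("retail division revenue", corpus, pool=3, top=2))
```

In a real system, the first pass would be a vector-database similarity search and the re-ranker a learned model; the two-stage shape, and the pool/top cutoffs, are the part that carries over.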
Curran noted that RAG systems may take six to eight weeks to build and optimize. For example, the first iteration may have a 55% accuracy rate before any tweaking; the second release may reach 70%, and the final deployment will ideally get closer to 100%.

Developers need to understand data availability (and quality) and how to re-rank, iterate, evaluate and ground a model (that is, match model outputs to relevant, verifiable sources). Additionally, turning the temperature up or down determines how creative a model will be — but some organizations are “really tight” with creativity, thus constraining things, said Curran.

“There’s been a perception that there’s an easy button around this stuff,” he noted. “There just really isn’t.”

A lot of human effort is required to build AI systems, said Curran, emphasizing the importance of testing, validation and ongoing support. All of this requires dedicated resources.

“It can be complex to get an AI agent successfully deployed,” agreed Naveen Rao, VP of AI at Databricks and founder and former CEO of MosaicAI. Enterprises need access to various large language models (LLMs) as well as the ability to govern and monitor not only agents and models but also the underlying data and tools. “This is not a simple problem, and as time goes on there will be ever-increasing scrutiny over what and how data is being accessed by AI systems.”

Factors to consider when exploring AI agents

When looking at options for deploying AI agents — third party, open source or custom — enterprises should take a controlled, tactical approach, experts advise. Start by considering several important questions and factors, recommended Andreas Welsch, founder and chief AI strategist at consulting company Intelligence Briefing. These include:

Where does your team spend the majority of their time?
Which tasks or steps in this process take up the most time?
How complex are these tasks? Do they involve IT systems and accessible data?
What would being faster or more cost-effective allow your enterprise to do?
Can you measure benchmarks, and if so, how?

It’s also important to factor in existing licenses and subscriptions, Welsch pointed out. Talk to software sales reps to understand whether your enterprise already has access to agent capabilities, and if so, what it would take to use them (such as add-ons or higher-tier subscriptions). From there, look for opportunities in one business function. For instance: “Where does your team spend time on several manual steps that cannot be described in code?” Later, when exploring agents, learn about their potential and “triage” any gaps.

Also, be sure to enable and educate teams by showing them how agents can help with their work. “And don’t be afraid to mention the agents’ limitations as well,” said Welsch. “This will help you manage expectations.”

Build a strategy, take a cross-functional approach

When developing an enterprise AI strategy, it is important to take a cross-functional approach, Curran emphasized. Successful organizations involve several departments in the process, including business leadership, software development and data science teams, user experience managers and others.

Build a roadmap based on the business’ core principles and objectives, he advised: “What are our goals as an organization, and how will AI allow us to achieve those goals?” It can be difficult, no doubt, because the technology is moving so fast, Curran acknowledged. “There’s not a set of best practices, frameworks,” he said. Not many developers have experience with post-release integrations and DevOps when it comes to AI agents.
“The skills to build these things haven’t really been developed and quantified in a broad-based way.” As a result, organizations struggle to get AI projects (of all kinds) off the ground, and many eventually switch to a consultancy or one of their existing tech vendors that have the resources and capability to build on top of their tech stacks. Ultimately, organizations will be most successful when they work closely with their partners.  “Third-party providers will likely have the bandwidth to keep up with the latest technologies and architecture to build this,” said Curran.  That’s not to say that it’s impossible to build custom agents in-house; quite the contrary, he noted. For instance, if an enterprise has a robust internal development team and RAG and machine learning (ML) architecture,


You can now run the most powerful open source AI models locally on Mac M4 computers, thanks to Exo Labs

When it comes to generative AI, Apple’s efforts have seemed largely concentrated on mobile — namely Apple Intelligence running on iOS 18, the latest operating system for the iPhone. But as it turns out, the new Apple M4 chip — available in the new Mac Mini and MacBook Pro models announced at the end of October 2024 — is excellent hardware for running the most powerful open source large language models (LLMs) yet released, including Meta’s Llama 3.1 405B, Nvidia’s Nemotron 70B and Alibaba’s Qwen 2.5 Coder-32B.

In fact, Alex Cheema, co-founder of Exo Labs, a startup founded in March 2024 to (in his words) “democratize access to AI” through open source multi-device computing clusters, has already done it. As he shared on the social network X recently, the Dubai-based Cheema connected four Mac Mini M4 devices (retail value of $599 each) plus a single MacBook Pro M4 Max (retail value of $1,599) with Exo’s open source software to run the software developer-optimized Qwen 2.5 Coder-32B. With the total cost of Cheema’s cluster around $5,000 retail, it is still significantly cheaper than even a single coveted Nvidia H100 GPU (retail of $25,000-$30,000).

The value of running AI on local compute clusters rather than the web

While many AI consumers are used to visiting websites such as OpenAI’s ChatGPT, or mobile apps that connect to the web, there are significant cost, privacy, security and behavioral benefits to running AI models locally on devices the user or enterprise controls and owns — without a web connection. Cheema said Exo Labs is still working on building out its enterprise-grade software offerings, but he’s aware of several companies already using Exo software to run local compute clusters for AI inference — and he believes the practice will spread from individuals to enterprises in the coming years.
For now, anyone with coding experience can get started by visiting Exo’s GitHub repository and downloading the software themselves.

“The way AI is done today involves training these very large models that require immense compute power,” Cheema explained to VentureBeat in a video call interview. “You have GPU clusters costing tens of billions of dollars, all connected in a single data center with high interconnects, running six-month-long training sessions. Training large AI models is highly centralized, limited to a few companies that can afford the scale of compute required. And even after the training, running these models effectively is another centralized process.”

By contrast, Exo hopes to allow “people to own their models and control what they’re doing. If models are only running on servers in massive data centers, you lose transparency and control over what’s happening.”

As an example, he noted that he fed his own direct and private messages into a local LLM so he could ask it questions about those conversations, without fear of them leaking onto the open web. “Personally, I wanted to use AI on my own messages to do things like ask, ‘Do I have any urgent messages today?’ That’s not something I want to send to a service like GPT,” he noted.

Using the M4’s speed and low power consumption to AI’s advantage

Exo’s recent success has been thanks to Apple’s M4 chip — available in regular, Pro and Max variants — which offers what Apple calls “the world’s fastest CPU core” and strong performance on single-threaded tasks (those operating on a single CPU core, whereas the M4 series has 10 or more). Because the M4’s specs had been teased and leaked earlier, and a version was already offered in the iPad, Cheema was confident that the M4 would work well for his purposes. “I already knew, ‘We’re going to be able to run these models,’” Cheema told VentureBeat.
Indeed, according to figures shared on X, Exo Labs’ Mac Mini M4 cluster runs Qwen 2.5 Coder 32B at 18 tokens per second and Nemotron-70B at 8 tokens per second. (Tokens are the numerical representations of letter, word and numeral strings — the AI’s native language.) Exo also saw success with earlier Mac hardware, connecting two MacBook Pro M3 computers to run the Llama 3.1-405B model at more than 5 tokens per second.

These demonstrations show how AI inference workloads can be handled efficiently without relying on cloud infrastructure, making AI more accessible for privacy- and cost-conscious consumers and enterprises alike. For enterprises working in highly regulated industries, or even those simply conscious of cost, who still want to leverage the most powerful AI models, Exo Labs’ demos show a viable path forward. For enterprises with a high tolerance for experimentation, Exo is offering bespoke services, including installing and shipping its software on Mac equipment. A full enterprise offering is expected in the next year.

The origins of Exo Labs: trying to speed up AI workloads without Nvidia GPUs

Cheema, a University of Oxford physics graduate who previously worked in distributed systems engineering for web3 and crypto companies, was motivated to launch Exo Labs in March 2024 after finding himself stymied by the slow progress of machine learning research on his own computer.

“Initially, it just started off as a curiosity,” Cheema told VentureBeat. “I was doing some machine learning research and I wanted to speed up my research. It was taking a long time to run stuff on my old MacBook, so I was like, ‘OK, I have a few other devices laying around. Maybe old devices from a few friends here…is there any way I can use their devices?’ And instead of it taking a day to run this thing, ideally, it takes a few hours. So then, that kind of turned into this more general system that allows you to distribute any AI workload over multiple machines.
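As a sanity check on those throughput figures, a little arithmetic shows what they mean in practice. The rates below are the ones quoted above; the 1,000-token response length is an arbitrary example chosen for illustration.

```python
# Back-of-envelope check on the quoted throughput figures:
# how long would a 1,000-token response take on each cluster setup?
rates = {
    "Qwen 2.5 Coder 32B (4x Mac Mini M4 + MacBook Pro M4 Max)": 18,
    "Nemotron-70B (same cluster)": 8,
    "Llama 3.1-405B (2x MacBook Pro M3)": 5,
}
response_tokens = 1_000
for setup, tok_per_sec in rates.items():
    seconds = response_tokens / tok_per_sec
    print(f"{setup}: ~{seconds:.0f} s ({seconds / 60:.1f} min)")
```

At 18 tokens per second, a long answer arrives in under a minute; at 5 tokens per second it takes a few minutes. Slow compared with a data-center GPU, but workable for private, local use.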
Usually you would run basically something on just one device, but if you want to get the speed up, and deliver more tokens per second
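For illustration, here is one plausible way a system like the one Cheema describes might split work across unequal machines: assigning contiguous slices of a model's layers to each device in proportion to its memory. This is a hedged toy sketch, not Exo's actual scheduler; the device names and memory sizes are made up.

```python
# Toy sketch: partition an LLM's layers across heterogeneous devices in
# proportion to each device's memory. Illustrative assumption only.

def partition_layers(num_layers: int, device_memory_gb: dict) -> dict:
    """Map each device name to a (start, end) slice of layer indices,
    sized proportionally to that device's memory."""
    total_mem = sum(device_memory_gb.values())
    assignments, start = {}, 0
    devices = list(device_memory_gb.items())
    for i, (name, mem) in enumerate(devices):
        if i == len(devices) - 1:
            count = num_layers - start  # last device absorbs rounding leftovers
        else:
            count = round(num_layers * mem / total_mem)
        assignments[name] = (start, start + count)
        start += count
    return assignments

# Example: a 32-layer model across four 16 GB Mac Minis and one 64 GB MacBook Pro
cluster = {"mini-1": 16, "mini-2": 16, "mini-3": 16, "mini-4": 16, "macbook-pro": 64}
print(partition_layers(32, cluster))
```

Each Mini gets a 4-layer slice and the MacBook Pro takes the remaining half of the model; during inference, activations would be passed from one device's slice to the next.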


Why AI won’t make you a better writer

The literary world is rife with constant controversy, from the Bad Art Friend to the BookForum comeuppance of long-lauded critic Lauren Oyler. A recent point of contention, however, is no interpersonal drama or nitpicky review. Rather, it’s a Zendesk article from the minds behind NaNoWriMo — National Novel Writing Month — stating that the organization will permit AI usage as part of the event this year (and presumably for all future years).

Needless to say, this ruffled a few feathers. And to be fair, it would be one thing to simply look the other way while people “write” novels using AI and use them to “win” NaNo… but to outright sanction the practice is another matter entirely. What do we lose (or gain?) by ceding to the onslaught of AI in these creative contexts? Can AI truly be a valuable tool for authors — and exert a net positive influence on the literary world as a whole?

A number of writers and artists (in real life, on social media, and in major media outlets) have attempted to answer these questions recently. As a writer, creator, and avid fiction enthusiast, I have a few thoughts of my own.

AI’s poor track record in writing circles

The NaNo controversy is not the first time that AI usage in creative writing has come under fire from writers, educators, and other invested parties. One incident that comes to mind is when Clarkesworld, a long-running science fiction/fantasy magazine, had to close submissions because it was receiving too many AI-generated stories. I also recall a micro-debate in writing circles earlier this year about whether AI should be used to write “filler” descriptions in a novel; on one hand, it saves time for the writer, but on the other, does that mean they wouldn’t necessarily even know what’s in their own book?
And if you’re someone (like me) who gets a lot of suggested posts from teachers on X, you’ll know that AI policies in course syllabi have become an extremely hot topic. Most teachers do seem to prefer a blanket ban on AI for coursework, not least because if students are given an inch, they’ll take a mile — but also because, more crucially, “the purpose of education isn’t to pass exams, [but] to become someone who can read deeply, communicate, and think.” (Another now-deleted X post raised concerns that so many people “[seem to] believe that the purpose of assigning [student essays] is to increase the number of essays in the world.”) But what, indeed, of an event like NaNoWriMo — where participation is purely voluntary and purportedly to hone one’s individual process, rather than to provide a framework for group learning to a classroom of children (or very young adults)? For those of us with fully developed prefrontal cortexes, shouldn’t we be fine to discern our own limits regarding AI? In theory, the answer is yes. Yet in practice, we all fall victim to temptations of convenience — even when that convenience is detrimental to our practical skills. Obviously, this is not always a bad thing. Many people have drawn comparisons between AI and other historical developments in technology — the flour mill, the printing press, the washing machine, etc. — which automated human grunt work and revolutionized productivity. The fact that most of us can’t grind our own grain for bread (COVID sourdough hobbyists notwithstanding) is no great loss for society. But there’s one key difference between these extraordinary machines and AI: each of them was built with a specific purpose in mind. And while their technologies may have improved over time, they were never applied to situations beyond their intended purpose. What is AI’s intended purpose? Arguably, it has too many. 
When it comes to using AI for creative writing, it definitely has too many; the NaNoWriMo debacle is proof of that. You can’t just put out a statement about AI usage in writing (or in any context!) without specifying optimal use cases versus poor ones — no matter how much hemming and hawing about ableism you do to justify it.

And this is where I’ll note that, in my view, AI can be helpful in the creative process… just not with the core writing itself. You might use AI as you would a thesaurus, a mind map, or a spell-check tool. It might assist you with very early brainstorming, or with the particulars of a phrase that you’re struggling to get right. But in order to hone your creative skills rather than harm them, you need to enter this process with substantive ideas and a vision of your own.

NaNoWriMo’s mistake — and the mistake of so many others regarding AI — is to imply that it can and should be used for whatever the user desires. But while this might feel gratifying — even creatively progressive! — in the short term, the long-term results will inevitably disappoint.

AI should facilitate our creativity — and therefore our joy in it

There’s also the question of not just whether AI depletes important skills, but whether it actually compromises the emotional satisfaction of creating something ourselves. To return to the phenomenon of technology automating grunt work, you could make the argument that AI — at least, the way most people use it — often does the exact opposite. AI now frequently “accomplishes” the creative work that humans have long found fulfilling, while we humans are relegated to the administrative hassles of perfecting our AI prompts and aligning our AI-generated images just so. One recent X post was the perfect microcosm of this for me.
Someone was proudly showing off his AI-generated images of Kermit the Frog, having replaced all his default iPhone icons with Kermit ones — only for another user to counter that he’d given AI the “fun, creative job” of drawing Kermit, while giving himself the “boring, labor-intensive job” of arranging the apps. The second user proceeded to hit the nail on the head with a follow-up comment: “Seems we’ve approached this technology backwards. It should be handling the dry data entry and organizational tasks so we can spend


Four things enterprises need to think about for effective agents

Agentic AI continues to grow as enterprises explore its potential. However, there can be pitfalls when building an AI agent workflow.

May Habib, co-founder and CEO of full-stack AI platform Writer, said there are four things enterprises should consider when thinking about autonomous AI and the automated workflows that AI agents enable.

“If you don’t focus on the capabilities that are right for you to create self-sufficiency, you’ll never get to a generative AI program that is scaling,” Habib said.

For Habib, enterprises need to think about these four things when approaching AI workflows that offer value to them:

Understanding your use cases and the mission-critical business logic connected to those use cases
Knowing your data, and being able to keep the data associated with those use cases fresh
Knowing who on the team can build those use cases
Managing your organization’s capacity to absorb change

Know your process and build a pipeline

When it comes to understanding use cases, Habib said many enterprises don’t need an AI that will tell them how to grow their business. They need AI that streamlines the work they already do and supports the processes they already have, assuming, of course, that the organization knows what those processes are.

“Never forget that the nodes of the workflow are the hardest part, and not to get overly excited about the hype of agentic until you’ve nailed that workflow, because you are just moving inaccurate information or bad outputs from the system,” Habib said.

Business processes cannot work without good data, and Habib said businesses should also build a data pipeline that brings in fresh data for the specific business use case.
Habib said it’s equally important to know who in an organization can build the AI applications, and who understands the workflows involved in the use cases best. She said AI does not dictate processes; the enterprises dictate the processes AI should follow. All of this culminates in the fourth tenet of effective generative AI: knowing how much change the organization can absorb and understanding how the actual users of the applications can find value in the technology.

Envisioning automated AI workflows

Writer has built AI agents and other applications on its full-stack AI platform, which includes its Palmyra family of models designed specifically for enterprises. Its latest model release, Palmyra X 004, excels in function calling and workflow execution, which helps in building AI agents. Its models have also proved successful in healthcare and finance use cases, and Writer offers RAG frameworks for enterprises.

Habib said her vision for agentic AI — though she personally does not like the word “agents” because it means too many different things — is one that involves “AI that is able to respond to a command and then go use Writer apps, know how to interact with each other and use third-party applications.”

Writer’s agentic AI workflow framework relies on a series of Writer apps embedded in enterprise workflows. For example, suppose a customer wants to bring a product to market. A user can tell their catalog platform, running on Writer’s models and applications, to pull up the specific product they want, indicate that it needs to be posted on e-commerce sites like Amazon and Macy’s, and include other product information. The agentic workflow will then pull up the product, connect to the Amazon and Macy’s APIs and post the product for sale.

“If it has a GUI, if it has a UI, AI will become a power agent. To us, agentic AI is the ability for AI to use AI plus third-party software and be able to reason its way through,” she said.
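The product-listing flow described above could be orchestrated roughly like this. Every function below is a hypothetical stub invented for illustration: Writer's actual platform and the marketplace APIs are not shown, and only the shape of the orchestration is the point.

```python
# Sketch of the product-listing workflow described above. All names and
# functions here are hypothetical stubs, not a real Writer or marketplace API.

def fetch_product(catalog: dict, sku: str) -> dict:
    """Stub for 'pull up the specific product' from the catalog platform."""
    return catalog[sku]

def post_listing(marketplace: str, product: dict) -> str:
    """Stub for a marketplace API call that would create the listing."""
    return f"posted '{product['title']}' to {marketplace}"

def run_listing_workflow(catalog: dict, sku: str, marketplaces: list[str]) -> list[str]:
    """Orchestration: fetch once, then fan out to each marketplace."""
    product = fetch_product(catalog, sku)
    return [post_listing(m, product) for m in marketplaces]

catalog = {"wool-sweater": {"title": "Wool sweater", "price": 79.00}}
results = run_listing_workflow(catalog, "wool-sweater", ["Amazon", "Macy's"])
for r in results:
    print(r)
```

In an agentic version, an LLM would decide which of these steps to invoke and with what arguments; the deterministic functions stay the same, which is what makes each node of the workflow testable on its own.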
Moving agentic AI forward

To help facilitate the expansion of its agentic AI vision, Writer announced it raised $200 million in Series C funding, bringing its valuation to $1.9 billion. Premji Invest, Radical Ventures and ICONIQ Growth led the funding round. Other investors included Salesforce Ventures, Adobe Ventures, B Capital, Citi Ventures, IBM Ventures and Workday Ventures, along with existing investors in the company.

Habib said the new round allows the company to continue building on Writer’s existing work with design partners and other customers to bring these automated workflows to life.


How custom evals get consistent results from LLM applications

Advances in large language models (LLMs) have lowered the barriers to creating machine learning applications. With simple instructions and prompt engineering techniques, you can get an LLM to perform tasks that would have otherwise required training custom machine learning models. This is especially useful for companies that don’t have in-house machine learning talent and infrastructure, or for product managers and software engineers who want to create their own AI-powered products.

However, the benefits of easy-to-use models are not without tradeoffs. Without a systematic approach to tracking the performance of LLMs in their applications, enterprises can end up getting mixed and unstable results.

Public benchmarks vs. custom evals

The current popular way to evaluate LLMs is to measure their performance on general benchmarks such as MMLU, MATH and GPQA. AI labs often market their models’ performance on these benchmarks, and online leaderboards rank models based on their evaluation scores. But while these evals measure models’ general capabilities on tasks such as question-answering and reasoning, most enterprise applications need to measure performance on very specific tasks.

“Public evals are primarily a method for foundation model creators to market the relative merits of their models,” Ankur Goyal, co-founder and CEO of Braintrust, told VentureBeat. “But when an enterprise is building software with AI, the only thing they care about is does this AI system actually work or not. And there’s basically nothing you can transfer from a public benchmark to that.”

Instead of relying on public benchmarks, enterprises need to create custom evals based on their own use cases.
Evals typically involve presenting the model with a set of carefully crafted inputs or tasks, then measuring its outputs against predefined criteria or human-generated references. These assessments can cover various aspects, such as task-specific performance.

The most common way to create an eval is to capture real user data and format it into tests. Organizations can then use these evals to backtest their application and the changes they make to it.

“With custom evals, you’re not testing the model itself. You’re testing your own code that maybe takes the output of a model and processes it further,” Goyal said. “You’re testing their prompts, which is probably the most common thing that people are tweaking and trying to refine and improve. And you’re testing the settings and the way you use the models together.”

How to create custom evals

Image source: Braintrust

To make a good eval, every organization must invest in three key components. First is the data used to create the examples to test the application. The data can be handwritten examples created by the company’s staff, synthetic data created with the help of models or automation tools, or data collected from end users, such as chat logs and tickets.

“Handwritten examples and data from end users are dramatically better than synthetic data,” Goyal said. “But if you can figure out tricks to generate synthetic data, it can be effective.”

The second component is the task itself. Unlike the generic tasks that public benchmarks represent, the custom evals of enterprise applications are part of a broader ecosystem of software components. A task might be composed of several steps, each of which has its own prompt engineering and model selection techniques. There might also be other non-LLM components involved.
For example, you might first classify an incoming request into one of several categories, then generate a response based on the category and content of the request, and finally make an API call to an external service to complete the request. It is important that the eval covers the entire framework. “The important thing is to structure your code so that you can call or invoke your task in your evals the same way it runs in production,” Goyal said. The final component is the scoring function you use to grade the results of your framework. There are two main types of scoring functions. Heuristics are rule-based functions that can check well-defined criteria, such as testing a numerical result against the ground truth. For more complex tasks such as text generation and summarization, you can use LLM-as-a-judge methods, which prompt a strong language model to evaluate the result. LLM-as-a-judge requires advanced prompt engineering. “LLM-as-a-judge is hard to get right and there’s a lot of misconception around it,” Goyal said. “But the key insight is that just like it is with math problems, it’s easier to validate whether the solution is correct than it is to actually solve the problem yourself.” The same rule applies to LLMs. It’s much easier for an LLM to evaluate a produced result than it is to do the original task. It just requires the right prompt. “Usually the engineering challenge is iterating on the wording or the prompting itself to make it work well,” Goyal said.

Innovating with strong evals

The LLM landscape is evolving quickly and providers are constantly releasing new models. Enterprises will want to upgrade or change their models as old ones are deprecated and new ones become available. One of the key challenges is making sure that your application remains consistent when the underlying model changes. With good evals in place, changing the underlying model becomes as straightforward as running the new models through your tests.
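The three components described above (data, a task invoked the same way as in production, and a scoring function) can be sketched in a few lines. Everything here is hypothetical: the `answer_support_ticket` task and the exact-match scorer are illustrative stand-ins, not Braintrust's API.

```python
# Minimal sketch of a custom eval harness with a heuristic scorer.
# The task and scorer are hypothetical stand-ins, not a specific product's API.

def answer_support_ticket(ticket: str) -> str:
    """Stand-in for the production task: classify, then respond."""
    category = "billing" if "invoice" in ticket.lower() else "general"
    return f"[{category}] Thanks for reaching out, we are looking into it."

def exact_category_scorer(output: str, expected_category: str) -> float:
    """Heuristic scorer: check a well-defined criterion against ground truth."""
    return 1.0 if output.startswith(f"[{expected_category}]") else 0.0

def run_eval(cases):
    """Invoke the task exactly as production would, then grade each output."""
    scores = []
    for ticket, expected_category in cases:
        output = answer_support_ticket(ticket)  # same entry point as production
        scores.append(exact_category_scorer(output, expected_category))
    return sum(scores) / len(scores)

cases = [
    ("Where is my invoice for March?", "billing"),
    ("The app crashes on startup.", "general"),
]
print(run_eval(cases))  # 1.0 on this toy set
```

For open-ended outputs, the `exact_category_scorer` would be swapped for an LLM-as-a-judge call, but the harness shape stays the same.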
“If you have good evals, then switching models feels so easy that it’s actually fun. And if you don’t have evals, then it is awful. The only solution is to have evals,” Goyal said. Another issue is the changing data that the model faces in the real world. As customer behavior changes, companies will need to update their evals. Goyal recommends implementing a system of “online scoring” that continuously runs evals on real customer data. This approach allows companies to automatically evaluate their model’s performance on the most current data and incorporate new, relevant examples into their evaluation sets, ensuring the continued relevance and effectiveness of their evals.
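The online-scoring idea can be sketched as a sampling loop over production traffic. The log format and the length-based scorer below are illustrative assumptions, not any vendor's API.

```python
# Illustrative sketch of "online scoring": continuously grade a sample of real
# traffic and promote failures into the eval set. All names are hypothetical.
import random

def length_scorer(response: str) -> float:
    """Cheap heuristic: penalize empty or truncated responses."""
    return 1.0 if len(response) >= 20 else 0.0

def online_score(production_logs, sample_rate=0.5, seed=0):
    """Sample recent traffic, score it, and surface fresh eval cases."""
    rng = random.Random(seed)
    sampled = [log for log in production_logs if rng.random() < sample_rate]
    scored = [(log, length_scorer(log["response"])) for log in sampled]
    # Failing examples become new test cases, keeping the eval set current.
    new_eval_cases = [log for log, score in scored if score == 0.0]
    avg = sum(score for _, score in scored) / len(scored) if scored else None
    return avg, new_eval_cases
```

In a real deployment this loop would run on a schedule against fresh logs, with the failing cases reviewed before joining the backtest suite.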

How custom evals get consistent results from LLM applications

‘Unrestricted’ AI group Nous Research launches first chatbot

Nous Research, the AI research group dedicated to creating “personalized, unrestricted” AI models as an alternative to more buttoned-up corporate outfits such as OpenAI, Anthropic, Google, Meta and others, has previously released several open-source models in its Hermes family, along with new, more efficient AI training methods. But before today, if researchers and users wanted to actually deploy these models, they’d need to download and run the code on their own machines — a time-consuming, finicky and potentially costly endeavor — or use them on partner websites. No longer: Nous just announced its first user-facing chatbot interface, Nous Chat, which gives users access to its large language model (LLM) Hermes 3-70B, a fine-tuned variant of Meta’s Llama 3.1, in the familiar format of ChatGPT, HuggingChat and other popular AI chatbot tools, with a text entry box at the bottom for the user to type prompts and a large space up top for the chatbot to return outputs. As Nous wrote in a post on the social network X: “Since our first version of Hermes was released over a year ago, many people have asked for a place to experience it. Today we’re happy to announce Nous Chat, a new user interface to experience Hermes 3 70B and beyond. https://hermes.nousresearch.com We have reasoning enhancements, new models, and experimental capabilities planned for the future, and this will become the best place to experience Hermes and much more.”

Initial impressions of Nous Chat

Nous’s design language is right up my alley, using vintage fonts and characters evoking early PC terminals. It offers dark and light modes the user can toggle between in the upper right-hand corner.
Interestingly, like OpenAI eventually did with ChatGPT, and as many other AI model providers do, Nous Chat also offers suggested or example prompts at the bottom of the screen above the prompt entry textbox, including “Knowledge & Analysis,” “Creative Writing,” “Problem Solving,” and “Research & Synthesis.” Clicking any of these sends a pre-written prompt to the underlying model through the chatbot and has it respond, such as serving up a summary of research on “intermittent fasting.” In my brief tests of the chatbot, it was speedy, serving up answers in single-digit seconds, and it was able to produce links back to URLs on the web for sources it cited, though it seemed to hallucinate these on occasion — and the chatbot itself claimed it could not access the web. Despite its previously stated aims of enabling people to deploy and control their own AI models without content restrictions, Nous Chat itself does appear to have some guardrails set, including against making illegal narcotics such as methamphetamine. When I emailed the Nous Research team to ask about this, Shivani Mitra responded: “Nous Chat hosts Hermes 3 in its full form; no modifications have been made. The sentences you screenshotted when prompting more sensitive topics are part of the model’s original system prompt; they act as common sense warnings rather than hard-stop rails.” Indeed, going back and trying in a longer conversation, I was able to convince Nous Chat and the underlying Hermes 3 model to provide something close to a full methamphetamine recipe by asking it for a descriptive fictional novel scene. Moreover, AI jailbreakers such as Pliny the Prompter (@elder_plinius on X) have already cracked the chatbot and gotten fully past the guardrails.
In addition, the underlying Hermes 3-70B model told me that its knowledge cutoff date was April 2023, making it less useful for obtaining current information, an area where OpenAI is now competing directly against Google and startups such as Perplexity.

Where Nous goes next

Because it lacks many of the advanced features of other leading chatbots, such as file attachments, image analysis and generation, and interactive code display canvases or trays, Nous Chat is unlikely to replace these rivals for many business users. Yet at least some of these features are coming, according to Mitra, who wrote to me via email: “We’re planning to add more features in the coming months, as we stated in the announcement tweet. These include reasoning enhancements (what we’re focused on currently) and more classic chat bot features like web search and file analysis.” But as an experiment it’s certainly interesting and worth playing around with, in my opinion, and as new features are added, it could make for a compelling alternative to corporate chatbots and AI models.


Large enterprises embrace hybrid compute to retain control of their own intelligence

Presented by Inflection AI

Public, centralized large language model (LLM) services from providers like OpenAI have undoubtedly catalyzed the GenAI revolution, offering an accessible way for enterprises to experiment with and deploy AI capabilities quickly. But as the technology matures, large enterprises, particularly those investing heavily in AI, are beginning to run a mix of publicly available cloud models and private compute with local models, leading to a hybrid environment. We’d go so far as to say that if you are spending more than $10 million a year on total AI spend and you don’t have some investment in models — open source or otherwise — that you own or at least control, plus some private compute resources, you are headed in the wrong direction. We see this need to Own Your Own Intelligence as especially acute for organizations with significant security concerns, regulatory requirements or specific scalability needs. The future points to an increasing preference for “private compute” solutions: deployment approaches that leverage virtual private clouds (VPCs) or even on-premise infrastructure for vital tasks and processes as part of your intelligence platform. New vendors such as Cohere, Inflection AI and SambaNova Systems are meeting this growing demand, offering solutions that align with the needs of companies for whom public cloud solutions alone may no longer be sufficient. The large models from OpenAI and Anthropic promise private environments, but their employees can still access log and transaction data when needed, and companies do not believe that “just trust the contract” is sufficient to protect critical data. Let’s explore why private compute is gaining traction and what the trade-offs look like for large enterprises.

Centralized, public LLMs started the GenAI revolution

Public LLM services have been instrumental in getting companies up to speed with GenAI.
Providers such as OpenAI offer cutting-edge models that are easy to access and deploy via cloud-based APIs. This has made it possible for organizations of any size to begin integrating advanced AI capabilities into their workflows without the need for complex infrastructure or in-house AI expertise. The five key issues we hear about public LLMs from large enterprises in production are:

Security and confidentiality risks: Large enterprises often handle sensitive data, ranging from proprietary product roadmaps to confidential customer information. While public cloud providers implement stringent security protocols, some organizations are reluctant to trust non-company employees or third parties with their most valuable data. This concern is heightened when discussing future product roadmaps, which, in the wrong hands, could benefit competitors.

Loss of pricing power: As companies grow more dependent on GenAI, they may find themselves vulnerable to price increases from hyperscalers. Public cloud services typically operate under a pay-per-use model, which can become more expensive as usage scales. Companies relying on public LLM services could find themselves without leverage as prices increase over time.

Trust issues with future AI developments: While current contracts may seem sufficient, large enterprises may worry about the future. In a hypothetical future with true artificial general intelligence (AGI), a form of AI that could theoretically outthink humans, companies may be hesitant to trust a third party to manage such powerful technologies, even with seemingly airtight contracts. After all, the potential risks of a malfunction or misuse of AGI, even if improbable, carry significant weight.

Control over features and updates: Public LLM services typically push updates and feature changes centrally, meaning companies using these services cannot control when or how updates happen.
This can lead to disruptions, as enterprises must continually re-test their systems and workflows whenever new versions of models are introduced.

Cost efficiency as token consumption grows: Token-based pricing models used by public LLM services are convenient for low- to moderate-use cases. However, for enterprises using these models at scale, the costs can become prohibitive. We estimate that the break-even point for cost-efficiency occurs around 500,000 tokens per day with current options and pricing. Beyond that, the per-token costs start to outweigh the convenience of not managing your own infrastructure.

Key buyer benefits of public LLM clouds

Easy and cost-effective for testing: Public clouds offer an extremely low barrier to entry. Companies can experiment with different models, features and applications without a significant upfront investment in infrastructure or technical talent.

No/low capital outlay: Using a public cloud service, companies are spared the hefty capital expenses required for building or maintaining high-performance compute clusters.

No need to manage on-premise infrastructure: When relying on a public cloud provider, there’s no need for enterprises to develop, maintain and secure their own on-premise infrastructure, which can be costly and time-consuming.

Leading companies are heading to hybrid environments with private compute

We have seen at least two very different types of organizations in GenAI/AI adoption. The first are what we call “toe dippers.” They’ve tried some isolated applications and allow only one or two vendors providing standard tools like Copilot or ChatGPT. They may have islands of automation built in different divisions. The second group is what we call “productivity orchestrators”: firms with significant systems in production.
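The break-even point described above is simple arithmetic once you fix a metered price and an amortized fixed cost. The prices in this sketch are placeholder assumptions chosen to land at the 500,000-tokens-per-day figure, not quotes from any provider.

```python
# Toy break-even model: pay-per-token API vs. fixed-cost private compute.
# Both prices below are illustrative assumptions, not real vendor pricing.

API_PRICE_PER_MILLION_TOKENS = 10.0  # hypothetical blended $/1M tokens
PRIVATE_COMPUTE_PER_DAY = 5.0        # hypothetical amortized $/day for owned infra

def daily_api_cost(tokens_per_day: int) -> float:
    """Metered cost of running a day's traffic through a public API."""
    return tokens_per_day / 1_000_000 * API_PRICE_PER_MILLION_TOKENS

def breakeven_tokens_per_day() -> int:
    """Token volume at which the fixed daily cost equals the metered cost."""
    return int(PRIVATE_COMPUTE_PER_DAY / API_PRICE_PER_MILLION_TOKENS * 1_000_000)

print(breakeven_tokens_per_day())  # 500000 with these placeholder prices
```

Above that volume the metered bill exceeds the fixed cost every day, which is the shape of the argument the orchestrators are making.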
This latter group has a combination of public cloud services and private compute, with solutions they have built and/or assembled to meet their current needs in production. These solutions allow companies to deploy GenAI models either on their own on-premise infrastructure or within their own virtual private cloud, bringing AI capabilities closer to their “trust boundaries.” Here are the benefits we hear from the orchestrators:

Pros of private compute solutions

Enhanced security and confidentiality: By deploying LLMs in a private cloud or on-premise environment, enterprises keep their data within their own infrastructure, minimizing the risk of unauthorized access or accidental exposure. This is particularly important for companies in industries such as finance, healthcare and defense, where data privacy is paramount.

Cost efficiency at scale: While the initial setup costs are higher, private compute solutions become more cost-effective as usage scales. Enterprises with high token consumption can avoid the variable costs of public cloud services, eventually lowering their overall spend.


Anthropic’s new AI tools promise to simplify prompt writing and boost accuracy by 30%

Anthropic has launched a new suite of tools designed to automate and improve prompt engineering in its developer console, a move expected to enhance the efficiency of enterprise AI development. The new features, including a “prompt improver” and advanced example management, aim to help developers create more reliable AI applications by refining the instructions—known as prompts—that guide AI models like Claude in generating responses. At the core of these updates is the Prompt Improver, a tool that applies best practices in prompt engineering to automatically refine existing prompts. This feature is especially valuable for developers working across different AI platforms, as prompt engineering techniques can vary between models. Anthropic’s new tools aim to bridge that gap, allowing developers to adapt prompts originally designed for other AI systems to work seamlessly with Claude. “Writing effective prompts remains one of the most challenging aspects of working with large language models,” said Hamish Kerr, product lead at Anthropic, in an exclusive interview with VentureBeat. “Our new prompt improver directly addresses this pain point by automating the implementation of advanced prompt engineering techniques, making it significantly easier for developers to achieve high-quality results with Claude.” Kerr added that the tool is particularly beneficial for developers migrating workloads from other AI providers, as it “automatically applies best practices that might otherwise require extensive manual refinement and deep expertise with different model architectures.” Anthropic’s new prompt improvement tool allows developers to refine existing prompts with automated suggestions, helping to enhance the accuracy and efficiency of AI models like Claude.
This feature is part of a broader suite of tools designed to streamline AI development for enterprise use. (Credit: Anthropic)

Anthropic’s new tools directly respond to the growing complexity of prompt engineering, which has become a critical skill in AI development. As companies increasingly rely on AI models for tasks like customer service and data analysis, the quality of prompts plays a key role in determining how well these systems perform. Poorly written prompts can lead to inaccurate outputs, making it difficult for enterprises to trust AI in crucial workflows. The Prompt Improver enhances prompts through multiple techniques, including chain-of-thought reasoning, which instructs Claude to tackle problems step by step before generating a response. This method can significantly boost the accuracy and reliability of outputs, particularly for complex tasks. The tool also standardizes examples in prompts, rewrites ambiguous sections, and adds prefilled instructions to better guide Claude’s responses. “Our testing shows significant improvements in accuracy and consistency,” Kerr said, noting that the prompt improver increased accuracy by 30% in a multilabel classification test and achieved 100% adherence to word count in a summarization task.

Anthropic’s new prompt engineering tools, shown here in the developer console, include features such as example management and prompt improvement. These tools are designed to help developers refine AI instructions and increase accuracy for enterprise applications. (Credit: Anthropic)

AI training made simple: Inside Anthropic’s new example management system

Anthropic’s new release also includes an example management feature, which allows developers to manage and edit examples directly in the Anthropic Console. This feature is particularly useful for ensuring Claude follows specific output formats, a necessity for many business applications that require consistent and structured responses.
If a prompt lacks examples, developers can use Claude to generate synthetic examples automatically, further simplifying the development process. “Humans and Claude alike learn very well from examples,” Kerr explained. “Many developers use multi-shot examples to demonstrate ideal behavior to Claude. The prompt improver will use the new chain-of-thought section to take your ideal inputs/outputs and ‘fill in the blanks’ between the input and output with high-quality reasoning to show the model how it all fits together.” Anthropic’s release of these tools comes at a pivotal time for enterprise AI adoption. As businesses increasingly integrate AI into their operations, they face the challenge of fine-tuning models to meet their specific needs. Anthropic’s new tools aim to ease this process, enabling enterprises to deploy AI solutions that work reliably and efficiently right out of the box. Anthropic’s focus on feedback and iteration allows developers to refine prompts and request changes, such as shifting output formats from JSON to XML, without the need for extensive manual intervention. This flexibility could be a key differentiator in the competitive AI landscape, where companies like OpenAI and Google are also vying for dominance. Kerr pointed to the tool’s impact on enterprise-level workflows, particularly for companies like Kapa.ai, which used the prompt improver to migrate critical AI workflows to Claude. “Anthropic’s prompt improver streamlined our migration to Claude 3.5 Sonnet and enabled us to get to production faster,” said Finn Bauer, co-founder of Kapa.ai, in a statement.
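The multi-shot pattern Kerr describes can be sketched as plain prompt assembly: ideal input/output pairs interleaved ahead of the real query. The `build_multishot_prompt` helper below is hypothetical and independent of Anthropic's Console or SDK.

```python
# Sketch of assembling a multi-shot prompt from managed examples.
# The helper name and prompt layout are illustrative assumptions.

def build_multishot_prompt(instruction: str, examples: list[tuple[str, str]],
                           query: str) -> str:
    """Interleave ideal input/output pairs ahead of the real query."""
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The trailing "Output:" invites the model to complete the final pair.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

prompt = build_multishot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I love this!", "positive"), ("Terrible support.", "negative")],
    "The update broke everything.",
)
```

A prompt-improver-style tool would then rewrite or augment a prompt like this one, for example by inserting reasoning between each example's input and output.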
Beyond better prompts: Anthropic’s master plan for enterprise AI dominance

Beyond improving prompts, Anthropic’s latest tools signal a broader ambition: securing a leading role in the future of enterprise AI. The company has built its reputation on responsible AI, championing safety and reliability, two pillars that align with the needs of businesses navigating the complexities of AI adoption. By lowering the barriers to effective prompt engineering, Anthropic is helping enterprises integrate AI into their most critical operations with fewer headaches. “We’re delivering quantifiable improvements—like a 30% boost in accuracy—while giving technical teams the flexibility to adapt and refine as needed,” said Kerr. As competition in the enterprise AI space grows, Anthropic’s approach stands out for its practical focus. Its new tools don’t just help businesses adopt AI: they aim to make AI work better, faster and more reliably. In a crowded market, that could be the edge enterprises are looking for.


2025: The year ‘invisible’ AI agents will integrate into enterprise hierarchies

In the enterprise of the future, human workers are expected to work closely alongside sophisticated teams of AI agents. According to McKinsey, generative AI and other technologies have the potential to automate 60% to 70% of employees’ work. And already, an estimated one-third of American workers are using AI in the workplace — oftentimes unbeknownst to their employers. However, experts predict that 2025 will be the year these so-called “invisible” AI agents begin to come out of the shadows and take a more active role in enterprise operations. “Agents will likely fit into enterprise workflows much like specialized members of any given team,” said Naveen Rao, VP of AI at Databricks and founder and former CEO of MosaicAI.

Solving what RPA couldn’t

AI agents go beyond question-answer chatbots: they are assistants that use foundation models to execute more complex tasks previously not considered possible. These natural language-powered agents can handle multiple tasks and, when empowered to do so by humans, act on them. “Agents are goal-based and make independent decisions based on context,” explained Ed Challis, head of AI strategy at business automation platform UiPath. “Agents will have varying degrees of autonomy.” Ultimately, AI agents will be able to perceive (process and interpret data), plan, act (with or without a human in the loop), reflect, learn from feedback and improve over time, said Raj Shukla, CTO of AI SaaS company SymphonyAI. “At a high level, AI agents are expected to fulfill the long-awaited dream of automation in enterprises that robotic process automation (RPA) was supposed to solve,” he said. With large language models (LLMs) as their “planning and reasoning brain,” agents will eventually begin to mimic human-like behavior.
“The wow factor of a good AI agent is similar to sitting in a self-driving car and seeing it steer through crowded roads.”

What will AI agents look like?

However, AI agents are still in their formative stages, with use cases still being fleshed out and explored. “It’s going to be a broad spectrum of capabilities,” Forrester senior analyst Rowan Curran told VentureBeat. The most basic level is what he called “RAG plus”: a retrieval-augmented generation system that takes some action after the initial retrieval. For instance, it might detect a potential maintenance issue in an industrial setting, outline a maintenance procedure, generate a draft work order request, and send it to the end (human) user, who makes the final call. “We’re already seeing a lot of that these days,” said Curran. “It essentially amounts to an anomaly detection algorithm.” In more complex scenarios, agents could retrieve information and take action across multiple systems. For instance, a user might prompt: “I’m a wealth advisor, I need to update all of my high net worth individuals with an issue that occurred — can you help develop personalized emails that give insights on the impact on their specific portfolio?” The AI agent would then access various databases, run analytics, generate customized emails and push them out via an API call to an email marketing system. Going further beyond that will be sophisticated, multi-agent ecosystems, said Curran. For example, on a factory floor, a predictive algorithm may trigger a maintenance request that goes to an agent that identifies different options, weighing cost and availability, all while going back and forth with a third-party agent. It could then place an order as it interacts with different independent systems, machine learning (ML) models, API integrations and enterprise middleware. “That’s the next generation on the horizon,” said Curran.
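The wealth-advisor flow described above (retrieve, analyze, draft, then hand off) can be sketched with every external system replaced by a stub. All names and numbers here are invented for illustration; in a real agent, the draft step would call an LLM and the fetch step a real database.

```python
# Toy sketch of a multi-system agent flow: retrieve -> analyze -> draft -> review.
# Every function is a stub standing in for a real system (CRM, analytics, LLM).

def fetch_portfolios(advisor_id: str) -> list[dict]:
    """Stub for a CRM/database lookup of the advisor's clients."""
    return [{"client": "A. Chen", "exposure": 0.4},
            {"client": "B. Osei", "exposure": 0.1}]

def assess_impact(portfolio: dict, issue_severity: float) -> float:
    """Stub analytics step: impact scales with portfolio exposure."""
    return portfolio["exposure"] * issue_severity

def draft_email(portfolio: dict, impact: float) -> dict:
    """Stub generation step; a real agent would call an LLM here."""
    body = (f"Dear {portfolio['client']}, the recent issue affects roughly "
            f"{impact:.0%} of your portfolio.")
    return {"to": portfolio["client"], "body": body}

def run_agent(advisor_id: str, issue_severity: float) -> list[dict]:
    """Chain the systems; drafts go to a human for review before any send."""
    drafts = []
    for portfolio in fetch_portfolios(advisor_id):
        impact = assess_impact(portfolio, issue_severity)
        drafts.append(draft_email(portfolio, impact))
    return drafts  # handed to the email system only after human approval
```

Returning drafts instead of sending them is the human-in-the-loop boundary the analysts describe: the agent does the cross-system legwork, and a person makes the final call.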
For now, though, agents aren’t likely to be fully or even mostly autonomous, he pointed out. Most use cases will involve a human in the loop, whether for training, safety or regulatory reasons. “Autonomous agents are going to be very rare, at least in the short term.” Challis agreed, emphasizing that “one of the most important things to recognize about any AI implementation is that AI on its own is not enough. We see that all business processes are going to be best solved by a combination of traditional automation, AI agents and humans working in concert to best support a business function.”

Helping with HR, sales (and other functions)

One example use case for AI agents that nearly every industry can relate to is onboarding new employees, Challis noted. This typically involves many people, including HR, payroll, IT and others. AI agents could streamline and speed up the process as they receive and handle contracts, collect documents, and set up payroll, IT access and security approvals. In another scenario, imagine a sales rep using AI. That agent can collaborate with procurement and supply chain agents to work up pricing and delivery terms for a proposal, explained Andreas Welsch, founder and chief AI strategist at consulting company Intelligence Briefing. The procurement agent will then gather information about available finished goods and raw materials, while the supply chain agent will calculate manufacturing and shipping times and report back to the procurement agent, he noted. Or, a customer service rep can ask an agent to gather relevant information about a given customer. The agent takes into account the inquiry, history and recent purchases, potentially pulling from different systems and documents. It then creates a response and presents it to a team member, who can review and further edit the draft before sending it along to the customer. “Agents carry out steps in a workflow based on a goal that the user has provided,” said Welsch.
“The agent breaks this goal into subgoals and tasks and then tries to complete them.”

How FactSet put AI agents to work

While agent frameworks are relatively new, some companies have been using what Rao called compound AI systems. For instance, business data and analytics company FactSet runs a finance platform that allows analysts to query large amounts of financial data to make timely investments and financial decisions. The company created a compound AI system that
