5 ways to overcome the barriers of AI infrastructure deployments

Presented by Penguin Solutions


Today, organizations are under intense pressure to leverage AI as a competitive advantage, but we’re still in the early stages. Only about 40% of large-scale enterprises have actively deployed AI in their business, but barriers keep another 40% in the exploration and experimentation phases. Although there is massive interest, 38% of IT professionals admit that a lack of technology infrastructure is a major barrier to AI success.

Why are so many organizations falling behind in the race to implement AI? The Harvard Business Review estimates the failure rate is as high as 80% — about twice the rate of other corporate IT project failures. One of the top barriers preventing successful AI deployments is limited AI skills and expertise. In fact, 9 out of 10 organizations suffer from a shortage of IT skills, which exposes execution gaps in AI system design, deployment and ongoing cluster management. Without the necessary insight, software tools and expertise, 83% of organizations admit they are unable to fully utilize their GPU and AI hardware, even after the system is deployed.

Managing AI infrastructure is a whole new ballgame, one that requires a significantly different approach from traditional IT infrastructure, says Jonathan Ha, senior director of product management – AI systems at Penguin Solutions.

“Tuning the cost, performance, data and operational model for a specific use case and workload starts with a solid AI infrastructure, managed intelligently,” Ha says. “You cannot and will not move from proof of concept to production at scale until you’ve established that foundation.”

Here’s a look at the five most common challenges when building out your AI architecture and how enterprises can approach and overcome them.

Challenge #1: IT organizations are not AI-ready

IT has decades’ worth of tools, processes and experience monitoring and managing general-purpose and high-performance computing (HPC) workloads at the CPU level. However, today’s AI infrastructure requires significant enhancements in monitoring and management capabilities. With the addition of new technologies like high-powered GPUs, high-performance interconnects, low-latency network fabrics and even the addition of liquid-cooling infrastructure, IT organizations are challenged with building the expertise to monitor and manage these AI clusters, especially at scale.

Designing compute and storage cluster architectures, building network topologies and then tuning it all for maximum performance on your AI workloads takes specialized skills and hands-on experience.

The solution: Invest in AI infrastructure expertise

Many organizations approach this challenge with a false sense of confidence, believing their extensive IT infrastructure expertise equips them with the know-how to succeed. Unfortunately, that often means they struggle to get their infrastructure up and running, or to achieve the results they expect. The success of an AI strategy hinges on the very first decisions made: use cases, project design, hardware needs, costs and more. That takes practical, up-to-the-minute experience in designing, deploying and managing today’s AI infrastructure.

Unfortunately, the explosion of AI has far outpaced the talent pool, making that expertise hard to find. In such a tight market, it is critical to get the right talent in place, whether through training existing staff, hiring externally or selecting the right AI infrastructure partner.

Challenge #2: Building for today and tomorrow’s needs

Even before designing a system, organizations need to map out their AI use cases, models and data sets to scope the scale of the required AI infrastructure. It’s important to consider factors such as model parameters, users supported and performance needs, while anticipating how those needs will grow and change as adoption spreads. At the same time, organizations must also account for rapidly expanding data demands and a constantly evolving technology landscape. How can an organization stay agile, scale easily and deliver the expected performance, security and stability when managing profoundly complex AI architecture?

The solution: Plan from the ground up

First, an organization should develop a comprehensive AI roadmap that identifies the resources required at each stage of the AI journey and the timeline for their deployment. For example, starting the design with the data center is crucial, as its power and cooling capabilities will determine the feasibility of the AI cluster and its future scalability. Second comes selecting and integrating validated, modular architectures that can be easily reconfigured to meet changing compute demands while providing high availability and performance, even as workloads and use cases evolve over time.

Challenge #3: Data management and governance just got even more important

AI depends on the efficient management of large datasets across the entire pipeline. Securing that data, ensuring it is clean, accurate and unbiased, and keeping it aligned with internal policies and external compliance regulations is an ongoing risk and a continuous responsibility.

“Every piece of data becomes valuable in an AI initiative, but it is also more vulnerable once it’s released from an organization’s silos. Plus, bias often creeps in, introduced by tagging and labeling when training an AI model,” Ha says. “Establishing the appropriate processes, controls and governance to use data in a safe and equitable manner is something that must be a top priority.”

The solution: Putting guardrails in place

Leaders must invest time in understanding the potential pitfalls – including leaks, misuse and miscategorization of data, as well as bias – before touching the data and beginning the AI initiative. They should then establish processes and tools to safeguard the data wherever it lives. It is also important to map out which roles get which kinds of access, and to be vigilant in tracking and monitoring that activity.

Challenge #4: Managing AI infrastructure requires a new approach

Misconfigured networks, node failures or loss of GPUs can disrupt operations, delaying new product launches or hindering the discovery of critical insights. Addressing these challenges is difficult due to the complexity of the architecture and the need for skilled talent. Expertise is required for optimal cluster design and intelligent cluster management. Additionally, continuous tuning and refinement of your model throughout the pipeline is essential for success.

The solution: Embracing new operations strategies

Keeping an AI initiative on track and continuously optimized requires an AIOps approach, which combines big data, analytics and ML into an automated, intelligent IT platform. This ensures complete visibility and control over all aspects of an AI pipeline. It automates the sorting and integration of organizational data, identifies application performance and availability issues, diagnoses root causes and then addresses them to minimize slowdowns and outages. In doing so, it uncovers ways to optimize workloads and enhance efficiency.
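The detect-diagnose-remediate loop described above can be illustrated with a minimal sketch. Everything here — the metric names, thresholds, node labels and remediation actions — is a hypothetical illustration, not Penguin Solutions’ actual tooling; production AIOps platforms do this with full telemetry pipelines and ML-driven anomaly detection.

```python
# Toy AIOps-style loop: poll node health, flag anomalies, auto-remediate.
# All metrics, thresholds and actions below are illustrative assumptions.

HEALTH_THRESHOLDS = {
    "gpu_ecc_errors": 5,      # uncorrectable GPU memory errors per interval
    "nic_retransmits": 1000,  # fabric congestion signal
    "temp_celsius": 85,       # thermal ceiling
}

def diagnose(metrics: dict) -> list:
    """Return the metrics that breached their threshold on this node."""
    return [name for name, limit in HEALTH_THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

def remediate(node: str, faults: list) -> str:
    """Pick an automated action; only healthy nodes need no action."""
    if "gpu_ecc_errors" in faults:
        return f"drain {node} and swap in a hot spare"
    if faults:
        return f"restart services on {node}"
    return f"{node} healthy, no action"

# Simulated telemetry from two nodes
fleet = {
    "node-01": {"gpu_ecc_errors": 12, "temp_celsius": 80},
    "node-02": {"gpu_ecc_errors": 0, "temp_celsius": 70},
}
for node, metrics in fleet.items():
    print(remediate(node, diagnose(metrics)))
```

The point of the sketch is the shape of the loop, not the specifics: detection and diagnosis are automated so that remediation can happen without waiting on a human operator.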

Challenge #5: ROI hinges on availability and performance

AI is a demanding and costly undertaking that cannot afford inefficient systems or unnecessary downtime – and yet many organizations grapple with both daily. For example, a recent Meta paper detailed the company’s experience training its Llama 3 model on a cluster of 16,000 GPUs. The team saw a GPU-related failure in the cluster roughly every three hours, and in a tightly synchronized parallel workload, that can lead to delays, job restarts or even incorrect results.

“We’ve heard from customers and other large-scale AI infrastructure providers that at any given time their AI clusters may only have between 30% and 70% of their GPU nodes available,” Ha says. “If you only have 70% of your GPU nodes available and are achieving only 70% of your target performance from your system, you are only realizing 49% of the potential value of your AI infrastructure investment. The 51% of lost value will have a significant negative impact on your ROI.”
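Ha’s arithmetic is worth spelling out: availability and performance compound multiplicatively, so the realized value is the product of the two, not the lower of the two. A quick sketch with his illustrative numbers:

```python
def effective_value(availability: float, performance: float) -> float:
    """Fraction of the infrastructure investment actually realized.

    Availability and achieved performance compound multiplicatively:
    a cluster with 70% of GPU nodes up, each hitting 70% of target
    performance, delivers only 49% of its potential value.
    """
    return availability * performance

# Ha's example: 70% availability x 70% of target performance
realized = effective_value(0.70, 0.70)
lost = 1.0 - realized
print(f"realized: {realized:.0%}, lost: {lost:.0%}")  # realized: 49%, lost: 51%
```

The multiplication is the key intuition: a modest shortfall on two independent dimensions erases more than half the value of the investment.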

The solution: Automation is key

Being able to monitor, manage and create processes that automate and predict failures is the best way to mitigate a great deal of the risk, Ha says. When Meta implemented automated tools and processes, it saw one training run with 400-plus interruptions – and all but three of those interruptions were handled automatically, with no human intervention and without stalling the job.

“That’s the secret sauce that comes with having over 2 billion man-hours of experience managing these large AI clusters – having the tools, insights and automated processes to keep them up and running,” he says.

Looking forward and launching an AI strategy

Launching an AI strategy takes time, effort and a great deal of specialized skill and understanding. Tackling these challenges while keeping pace with competitors launching their own initiatives is increasingly risky, especially with a rapidly evolving technology. But there are ways to strengthen and safeguard AI initiatives, Ha says.

“The challenge isn’t just the complexity, or even the skill set,” he says. “It’s about evolving your organization along with the technology.”

To ensure a successful AI initiative, organizations must stay abreast of the latest technological advancements and foster an internal culture proficient in AI. By leveraging AIOps and MLOps, they can integrate AI seamlessly into workflows across teams and domains. Continuously optimizing AI models requires breaking down departmental silos and fostering collaboration. A culture of experimentation, iteration and learning from both successes and failures, supported by partnerships with AI experts, is fundamental to long-term AI strategy success.

The most important piece of advice for a successful AI initiative?

“Solid investments in the right tools, partners and expertise,” Ha says. “AI is a huge undertaking, but developing the foundation and those capabilities right from the start helps you deliver return on investment and faster time to value, significantly reduces the risk to the business and offers the competitive advantage you need to succeed in the marketplace.”

Visit Penguin Solutions to learn more about fool-proofing your AI architecture and launching successful AI initiatives with a trusted partner. With 25 years of HPC experience and more than 75,000 GPUs deployed since 2017, Penguin Solutions is the trusted strategic partner for AI and HPC solutions and services for leading organizations such as Meta, the U.S. Navy, Sandia Labs and Georgia Tech. Its OriginAI solution provides assured infrastructure for critical, demanding AI workloads.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact [email protected].
