The project delivers a comprehensive full-stack solution for the Intel® Enterprise AI Foundation on the OpenShift platform, applicable across data center, cloud, and edge environments. It utilizes innovative General Operators technology to provision AI accelerators, including the Intel Gaudi processor, Flex and Max GPUs, and Xeon CPU accelerators such as QAT, SGX, and DSA. Additionally, the project introduces solutions for integrating Gaudi software or oneAPI-based AI software into OpenShift AI. Key AI workload integrations, such as LLM inferencing, fine-tuning, and post-training for enterprise AI, are under development. Plans also include GPU network provisioning and full-stack integration with OpenShift.
The RoCE Networking-Based LLM Post-Training Solution with Intel Enterprise AI Foundation for OpenShift delivers an end-to-end solution for large language model (LLM) post-training workloads in the enterprise AI space. By harnessing Intel's advanced hardware, such as Gaudi accelerators and their built-in RoCE networking, it provides a cost-efficient, scalable, and high-performance solution.
The seamless integration with Red Hat OpenShift and OpenShift AI provides a production-grade platform for deploying and managing AI workloads, offering enterprises a robust, flexible, and user-friendly environment.
The Intel Enterprise AI Foundation for OpenShift integrates and optimizes a production-ready AI software stack for Intel’s distributed AI compute and networking solutions. This paper explores the implementation of RDMA over Converged Ethernet (RoCE) within the Intel® Gaudi® AI architecture, specifically addressing scale-up and scale-out networking requirements for Transformer-based Large Language Models (LLMs).
Building on a deep understanding of the Transformer architecture and parallel computing for collective communication in AI networks, as well as the unique "3-ply" networking topology facilitated by Gaudi's integrated RoCE engines, this paper provides an in-depth examination of the Habana Collective Communications Library (HCCL)-centric full-stack software that enables the AI network, detailing the software-hardware co-design necessary to maximize throughput and mitigate latency. By comparing Gaudi's architecture with industry standards such as NVIDIA's NCCL and Meta's Grand Teton, the paper highlights the performance benefits and trade-offs inherent in Gaudi's innovative built-in networking design. Ultimately, it serves as a comprehensive reference for partners and developers to evaluate, deploy, and optimize high-performance AI networking on the Intel Enterprise AI platform.
Featuring an integrated RoCE v2 engine, the Intel Gaudi AI platform provides a unified RDMA networking fabric that supports both scale-up and scale-out networking. This standardized approach to RDMA is consistent with other industry-leading AI infrastructures, such as Meta's Grand Teton, which also rely on InfiniBand and RoCE for high-performance interconnectivity. To effectively develop and optimize the Gaudi AI network stack for the Intel Enterprise AI Foundation, specifically the Collective Communication Library (CCL), RDMA-core, and the Linux kernel's InfiniBand subsystem, it is essential to take a deep dive into RDMA networking principles. In this analysis, we also reference industry standards such as the open-source NCCL framework to provide a comprehensive overview of the RDMA network stack.
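As a concrete illustration of the bottom of this stack, the short Python sketch below (not part of the project code) walks the kernel InfiniBand subsystem's sysfs interface to list RDMA devices and their port state; a link_layer value of Ethernet indicates a RoCE port. It assumes the device driver registers with the kernel's InfiniBand subsystem and that sysfs is mounted at the usual location.

```python
# Quick look at the kernel InfiniBand subsystem that RDMA-core and the CCLs
# sit on top of: every RDMA-capable device whose driver registers with the
# subsystem appears under /sys/class/infiniband.
from pathlib import Path

SYSFS_IB = Path("/sys/class/infiniband")

def list_rdma_devices():
    if not SYSFS_IB.exists():
        print("No RDMA devices registered with the kernel InfiniBand subsystem.")
        return
    for dev in sorted(SYSFS_IB.iterdir()):
        for port in sorted((dev / "ports").iterdir()):
            state = (port / "state").read_text().strip()            # e.g. "4: ACTIVE"
            link_layer = (port / "link_layer").read_text().strip()  # "Ethernet" indicates RoCE
            print(f"{dev.name} port {port.name}: state={state}, link_layer={link_layer}")

if __name__ == "__main__":
    list_rdma_devices()
```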
Understanding Transformer Architecture and Parallel Computing for Collective Communication in AI Networks
This paper establishes a foundation by clarifying LLM data-stream concepts that are often inconsistently defined in the industry. Using GPT-3 as a primary example, we introduce the "decoder-only" Transformer architecture, followed by an analysis of modern Mixture-of-Experts (MoE) models such as DeepSeek. Finally, we provide a deep dive into the parallel computing algorithms that form the core workload for AI networking. For those working on Intel Enterprise AI networking, a consolidated understanding of these fields is essential, and this paper can serve as an entry point into them.
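To make the decoder-only data flow concrete, the following minimal PyTorch sketch (illustrative only, not the paper's implementation) shows single-head causal self-attention, the operation at the heart of GPT-style models; the tensor sizes are toy values chosen for readability.

```python
# Minimal single-head causal self-attention, the core block of a
# decoder-only Transformer such as GPT-3. Dimensions are illustrative
# assumptions, not a real model configuration.
import math
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_*: (d_model, d_head) projection weights."""
    seq_len, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to queries, keys, values
    scores = (q @ k.T) / math.sqrt(q.shape[-1])  # scaled dot-product scores
    # Causal mask: a token may only attend to itself and earlier tokens.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v     # weighted sum of values

d_model, d_head, seq_len = 64, 64, 8             # toy sizes for illustration
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)                                 # torch.Size([8, 64])
```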
Collective Communication Libraries (CCLs), such as NCCL and HCCL, are foundational components of AI networking. They provide the essential collective communication primitives required for parallel computing across thousands of GPUs or AI accelerators. While some CCLs offer reasonable out-of-the-box performance, tuning their parameters to meet specific workload requirements remains a significant challenge. True optimization is impossible without a comprehensive understanding of how AI computing and networking software and hardware function as a unified system. This paper provides a deep dive into the mechanisms of resource allocation, explaining how to coordinate compute and network resources to achieve peak performance for specific AI workloads.
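For a sense of what such a primitive looks like from the framework side, here is a minimal, hedged sketch of an all-reduce using torch.distributed. The backend is left configurable: on Gaudi systems the hccl backend is typically registered by importing habana_frameworks.torch.distributed.hccl (per Habana's documentation), while nccl or gloo serve elsewhere; the CCL_BACKEND environment variable used here is an illustrative convention, not a standard one. The script is meant to be started with a launcher such as torchrun, which sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.

```python
# Sketch of the collective primitive a CCL implements: an all-reduce across
# ranks via torch.distributed. Backend and environment handling are
# assumptions; adapt them to your launcher and hardware.
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])                # set by torchrun (or your launcher)
    world_size = int(os.environ["WORLD_SIZE"])
    backend = os.environ.get("CCL_BACKEND", "gloo")  # e.g. "hccl" on Gaudi, "nccl" on NVIDIA
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # Each rank contributes its own tensor; after all_reduce every rank holds
    # the element-wise sum, as with gradient averaging in data parallelism.
    t = torch.ones(4) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```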
The Intel Enterprise AI Foundation for OpenShift project provides feature provisioning for Intel AI computing and networking hardware, enables and optimizes the key AI computing and networking software stacks, and delivers a production-ready AI platform on the Red Hat OpenShift Container Platform (RHOCP). The project also includes the technology to deploy and manage Intel Enterprise AI end-to-end (E2E) solutions, along with reference workloads for these features.
Fast GPU Provisioning technology enables GPU provisioning in less than one second, with no reboots, by using pre-built driver containers. The feature eliminates any dependency on machine configuration changes, which trigger a reboot, an expensive operation; instead, the required operations are performed at runtime. This leads to a simplified and accelerated deployment process.
When containers need access to device files, they usually have to run as root (UID/GID 0/0): the device plugins expose device files that are owned by root, so the workload containers must run as root to use them. This is not good security practice, and it is always preferable to run containers rootless. Below is a short tutorial on how to run the Intel device plugins so that the workload containers can run rootless; by default, this is not turned on.
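As one possible shape of the workload side, the sketch below uses the kubernetes Python client to create a Pod that requests an Intel GPU through the device plugin while running as a non-root user. The resource name gpu.intel.com/i915, the UID/GID 1000, the image name, and the namespace are illustrative assumptions; whether the Pod can actually open the device files as non-root depends on the device-plugin configuration described in the tutorial.

```python
# Sketch of a rootless workload Pod that requests an Intel GPU via the
# device plugin, built with the kubernetes Python client. Resource name,
# UID/GID, image, and namespace are illustrative assumptions.
from kubernetes import client, config

def build_rootless_gpu_pod() -> client.V1Pod:
    security = client.V1SecurityContext(
        run_as_non_root=True,            # refuse to start if the image resolves to UID 0
        run_as_user=1000,                # non-root UID (assumption)
        run_as_group=1000,               # non-root GID (assumption)
        allow_privilege_escalation=False,
    )
    container = client.V1Container(
        name="gpu-workload",
        image="registry.example.com/my-ai-workload:latest",  # placeholder image
        security_context=security,
        resources=client.V1ResourceRequirements(
            limits={"gpu.intel.com/i915": "1"},  # resource advertised by the Intel GPU device plugin
        ),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="rootless-gpu-demo"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )

if __name__ == "__main__":
    config.load_kube_config()            # or load_incluster_config() when running in-cluster
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=build_rootless_gpu_pod())
```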