A dynamic parallel method for performance optimization on hybrid CPUs

Luo Yu, Liu Yucheng, Shen Haihao

2024-12-04

Summary

This paper introduces a dynamic parallel method that improves AI model inference performance on hybrid CPUs.

What's the problem?

Hybrid CPUs, which combine cores of differing capability, are increasingly used to run AI models on client devices. However, current AI inference frameworks overlook the imbalance in capability between these cores, which leads to low inference performance.

What's the solution?

To solve this problem, the researchers developed a dynamic parallel method that balances the workload for each core of a hybrid CPU before the parallel work starts. Balancing the work this way makes better use of memory bandwidth and significantly increases large language model (LLM) inference performance. With this method, Neural Speed achieved more than 90% of memory bandwidth (on average) on two Intel hybrid CPUs, and processing speed improved by up to 3.7 times compared to previous methods.
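
The idea of balancing work before a parallel region starts can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_work` and `update_ratios`, the smoothing factor `alpha`, and the example ratios are all assumptions made for illustration. The sketch splits a fixed number of work units across cores in proportion to a per-core performance ratio, then refreshes those ratios from measured execution times so the next split tracks how fast each core actually ran.

```python
# Illustrative sketch only: names, ratios, and the smoothing factor are
# hypothetical and not taken from the paper or from Neural Speed.

def split_work(total_units, perf_ratios):
    """Split total_units across cores in proportion to perf_ratios."""
    total_ratio = sum(perf_ratios)
    # Give each core the rounded-down proportional share first.
    shares = [int(total_units * r / total_ratio) for r in perf_ratios]
    # Hand the leftover units to the fastest cores, one unit each.
    leftover = total_units - sum(shares)
    fastest_first = sorted(range(len(perf_ratios)), key=lambda i: -perf_ratios[i])
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


def update_ratios(perf_ratios, shares, elapsed, alpha=0.5):
    """Blend the old ratios with the observed per-core throughput.

    Both the old ratios and the observed throughput are normalized to
    sum to 1 before blending, so only relative speed matters.
    """
    throughput = [s / t if t > 0 else 0.0 for s, t in zip(shares, elapsed)]
    total_throughput = sum(throughput)
    if total_throughput == 0:
        return list(perf_ratios)  # nothing measured, keep the old ratios
    total_ratio = sum(perf_ratios)
    return [
        (1 - alpha) * r / total_ratio + alpha * tp / total_throughput
        for r, tp in zip(perf_ratios, throughput)
    ]


# Assumed example: two faster cores and two slower cores.
ratios = [2.0, 2.0, 1.0, 1.0]
print(split_work(100, ratios))  # [34, 34, 16, 16]
```

In this sketch an even split would leave the faster cores idle while the slower ones finish, whereas the proportional split aims for all cores to finish at about the same time. Updating the ratios from measured times, rather than fixing them by core type, lets the split adapt when a core's effective speed changes at run time.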

Why does it matter?

This research is important because it enhances the efficiency of AI systems running on hybrid CPUs, making them faster and more capable. As AI continues to grow in importance across various fields, optimizing how these systems operate will lead to better performance in applications such as natural language processing, computer vision, and more. This can ultimately improve user experiences and enable more advanced AI functionalities.

Abstract

The AIPC concept is gaining popularity, and more and more hybrid CPUs will be running AI models on client devices. However, the current AI inference framework overlooks the imbalanced hardware capability of hybrid CPUs, leading to low inference performance. To address this issue, we have introduced a dynamic parallel method for hybrid CPUs, which significantly increases LLM inference performance by balancing the workload for each core of a hybrid CPU before the parallel work starts. This method has enabled Neural Speed to achieve more than 90% (on average) of memory bandwidth on two hybrid Intel CPUs.
