When it comes to training large AI models, most people think of thousands of GPUs and enormous training costs that only a handful of tech giants can afford. Smaller AI users, such as start-ups and university researchers, have been left to watch from the sidelines.
Now, a personal computer with a single GPU can train GPT models with up to 18 billion parameters, and even a laptop can train models with more than 1 billion parameters. Compared with existing mainstream solutions, this increases the trainable parameter capacity by more than ten times!
This significant improvement comes from Colossal-AI, a universal and efficient training system for large AI models. Best of all, it is completely open source and requires only minimal modifications to let existing deep learning projects train much larger models on a single consumer-grade graphics card, so everyone can train large AI models at home. In particular, it makes downstream tasks and application deployment, such as fine-tuning and inference of large AI models, far easier.
By providing a variety of popular and efficient parallelization methods, Colossal-AI also helps users easily scale their existing projects to large computing clusters.
Since Google introduced the 300-million-parameter BERT model in 2018, the parameter record for large models has been broken repeatedly in just a few years, from OpenAI's GPT-3 with 175 billion parameters to the MT-NLG with 530 billion parameters jointly launched by Microsoft and NVIDIA.
Dense models have already passed the 100-billion-parameter mark, while sparse mixture-of-experts (MoE) models, such as the Switch Transformer released by Google in 2021, have pushed the parameter count to the trillion level.
However, training such large models from scratch is extremely expensive. It usually requires hundreds or even thousands of professional high-performance GPUs, such as the NVIDIA A100, running at the same time. When a supercomputing cluster is built with dedicated InfiniBand high-speed networking, the training cost can reach 10 million dollars.
Train large models with just one consumer-grade GPU#
Obviously, AI users such as university students and individual developers cannot afford such costs to train large models. In the AI community, the most popular computing resource for these users is the NVIDIA RTX series of GPUs.
To improve AI productivity, bring the benefits of large models to more developers, and truly realize our vision of making large AI models "fast and affordable," Colossal-AI needs only a few lines of code to increase model training capacity more than tenfold, allowing everyone to train large AI models on a single ordinary GPU.
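For a sense of what those few lines look like, here is a minimal sketch following the launch-and-initialize pattern from the project's early tutorials. The details (function signatures, config contents) are assumptions based on that era of the library and may have changed, so consult the repository for the current API.

```python
# Minimal sketch of the Colossal-AI launch-and-initialize pattern.
# Assumed entry points: launch_from_torch / initialize, as in the
# project's early tutorials. Run with: torchrun --nproc_per_node=1 train.py
import colossalai
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

colossalai.launch_from_torch(config={})  # config dict/file selects features such as ZeRO

# An ordinary PyTorch model, optimizer, loss, and dataloader.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(64, 512), torch.randint(0, 10, (64,)))
loader = DataLoader(dataset, batch_size=8)

# The "few lines": wrap the standard objects; the returned engine takes
# over memory management and any configured parallelism.
engine, loader, _, _ = colossalai.initialize(model, optimizer, criterion, loader)

engine.train()
for data, label in loader:
    engine.zero_grad()
    output = engine(data.cuda())
    loss = engine.criterion(output, label.cuda())
    engine.backward(loss)
    engine.step()
```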
Colossal-AI outperforms both plain PyTorch and mainstream distributed solutions such as Microsoft's DeepSpeed across all of the hardware tested.
Take GPT, a representative large model, as an example: Colossal-AI can train a GPT model with up to 1.5 billion parameters on a gaming laptop with an RTX 2060 6GB. On a computer with an RTX 3090 24GB, Colossal-AI enables users to train GPT models with 18 billion parameters. It also brings significant improvements on high-performance graphics cards such as the Tesla V100.
Colossal-AI has also implemented PaLM (Pathways Language Model), recently released by Google, and shows excellent performance improvements on a range of hardware, while Microsoft's DeepSpeed has not yet released an official PaLM implementation.
Key technology: Enhanced heterogeneous training#
The biggest problem with training large AI models on a single consumer-grade GPU is that GPU memory capacity is extremely limited, which severely restricts the number of model parameters that can be accommodated. The ZeRO-offload method proposed by Microsoft DeepSpeed attempts to partition the model and make use of CPU memory, which has larger capacity and lower cost, and several modified versions of DeepSpeed for heterogeneous training now exist. However, as shown on the left side of the figure below, when GPU memory is insufficient for its statically assigned share of the model, the system crashes even if spare CPU memory is still available.
Colossal-AI differs from methods derived from DeepSpeed: it rebuilds core technologies such as ZeRO from scratch to solve DeepSpeed's problems. DeepSpeed partitions model data statically between CPU and GPU memory and uses a fixed memory layout across different training configurations. Colossal-AI instead makes many improvements to use GPU and CPU memory more efficiently; after all, CPU memory is much cheaper than a high-performance graphics card with large memory.
Colossal-AI's Gemini mechanism efficiently manages and utilizes the heterogeneous memory of GPU and CPU, dynamically allocating tensors to CPU or GPU memory during training and thereby breaking through the GPU memory limitation.
We take advantage of the iterative nature of deep learning training and divide the process, based on the iteration count, into a warm-up stage and a non-warm-up stage. In the warm-up stage, memory usage is monitored; in the non-warm-up stage, the collected information is used to move tensors efficiently and minimize CPU-GPU data movement.
This sounds simple, but it is challenging to implement. First, the lifecycle of non-model data is not managed by the user, and existing deep learning frameworks expose no interfaces for tracking non-model data, so its memory usage is difficult to obtain. Second, non-framework overhead, such as the CUDA context, must also be accounted for.
Colossal-AI therefore samples CPU and GPU memory usage during the warm-up stage. The usage of non-model data can be derived by comparing the peak system memory usage with the model's memory usage between two moments; the model's usage is known by querying the memory manager, as shown by the black solid line in the figure below.
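Concretely, the warm-up bookkeeping reduces to a subtraction: peak non-model memory is the peak total memory minus the memory attributed to model tensors. Below is a plain-Python sketch of the idea, not Colossal-AI's actual implementation; `model_mem_bytes`, `run_iteration`, and `place_tensors` are hypothetical placeholders.

```python
import torch

WARMUP_ITERS = 10       # assumed length of the warm-up stage
peak_nonmodel_mem = 0   # worst-case non-model memory observed so far

def training_step(step, model_mem_bytes, run_iteration, place_tensors):
    """Sketch of Gemini-style two-stage heterogeneous memory management."""
    global peak_nonmodel_mem
    if step < WARMUP_ITERS:
        # Warm-up: run normally and sample memory to learn how much
        # headroom non-model data (activations, buffers, ...) needs.
        torch.cuda.reset_peak_memory_stats()
        run_iteration()
        peak_total = torch.cuda.max_memory_allocated()
        peak_nonmodel_mem = max(peak_nonmodel_mem, peak_total - model_mem_bytes)
    else:
        # Non-warm-up: reserve the learned headroom on the GPU and fit
        # only as many model tensors there as the remainder allows.
        total = torch.cuda.get_device_properties(0).total_memory
        place_tensors(gpu_budget_bytes=total - peak_nonmodel_mem)
        run_iteration()
```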
All model tensors are managed by the memory manager, and each tensor is tagged with state information such as HOLD, COMPUTE, or FREE. Based on the dynamically queried memory usage, Colossal-AI continuously updates tensor states and adjusts tensor placement, ultimately achieving efficient use of GPU and CPU memory, maximizing model capacity, and balancing training speed under extremely limited hardware. This is of great significance for the democratization of AI and for low-cost fine-tuning of large models on downstream tasks.
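A minimal sketch of such a tensor state machine follows. The three states come from the description above, while everything else (the `TrackedTensor` wrapper, the spill-HOLD-tensors-first rule) is an illustrative assumption rather than Colossal-AI's actual policy.

```python
from enum import Enum, auto
import torch

class TensorState(Enum):
    HOLD = auto()     # resident but not needed by the current operator
    COMPUTE = auto()  # needed on GPU for the upcoming computation
    FREE = auto()     # no longer needed; memory can be reclaimed

class TrackedTensor:
    """Hypothetical wrapper pairing a tensor with its placement state."""
    def __init__(self, data: torch.Tensor):
        self.data = data
        self.state = TensorState.HOLD

def adjust_placement(tensors, gpu_budget_bytes):
    """Keep COMPUTE tensors on GPU; spill HOLD tensors to CPU when over budget."""
    used = 0
    for t in tensors:
        if t.state is TensorState.COMPUTE:
            t.data = t.data.cuda()          # must be on GPU to be computed
            used += t.data.element_size() * t.data.nelement()
    for t in tensors:
        if t.state is TensorState.HOLD:
            size = t.data.element_size() * t.data.nelement()
            if used + size <= gpu_budget_bytes:
                t.data = t.data.cuda()      # fits: keep it close to compute
                used += size
            else:
                t.data = t.data.cpu()       # over budget: offload to CPU
```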
In addition, distributed parallelism is an important way to further accelerate model training. Colossal-AI uses techniques such as multi-dimensional parallelism and heterogeneous parallelism to address the pain points of existing solutions, including limited parallel dimensions, low efficiency, poor generality, difficult deployment, and lack of maintenance, so that users can deploy large AI models efficiently and quickly with only minor code changes, as the sketch below shows.
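As one illustration, the project's early tutorials configured parallelism declaratively in the config file; the snippet below assumes that convention, and the exact keys may differ across releases.

```python
# config.py: sketch of a multi-dimensional parallel configuration.
# On 8 GPUs, this asks for 2 pipeline stages with 4-way 2D tensor
# (intra-layer) parallelism inside each stage.
parallel = dict(
    pipeline=2,
    tensor=dict(size=4, mode="2d"),
)
```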
For super-large AI models such as GPT-3, Colossal-AI needs only half the computing resources of NVIDIA's solution to start training; with the same resources, it is 11% faster, which can reduce the training cost of GPT-3 by over one million dollars.
For AlphaFold, which is used for protein structure prediction, our team released FastFold, built on Colossal-AI's acceleration. FastFold surpassed solutions from Google and Columbia University, reducing AlphaFold's training time from 11 days to 67 hours while lowering the overall cost, and achieving speedups of 9.3x to 11.6x in long-sequence inference.
In addition, Colossal-AI attaches great importance to building its open-source community, providing tutorials in English and Chinese and supporting cutting-edge applications such as PaLM and AlphaFold. New innovative features are released regularly. We always welcome suggestions and discussion from the community and are happy to help if you run into problems: you can ask questions here or open a discussion topic in our forum. Recently, Colossal-AI ranked first on GitHub's trending projects list, ahead of many projects with as many as 10K stars.
About the original authors#
The original authors are all core members of HPC-AI Tech, coming from renowned universities such as the University of California, Berkeley, Stanford University, Tsinghua University, Peking University, the National University of Singapore, and Nanyang Technological University, and have worked at tech giants such as Google Brain, IBM, Intel, Microsoft, and NVIDIA. The company has received seed funding from top venture capital firms such as Innovation Works and ZhenFund.
Prof. Yang You, Founder of HPC-AI Tech
Ph.D., University of California, Berkeley
IPDPS/ICPP Best Paper Author
ACM/IEEE CS George Michael Memorial HPC Fellowship
Forbes 30 Under 30 (Asia 2021)
IEEE-CS Outstanding Newcomer Award in Supercomputing
UC Berkeley EECS Lotfi A. Zadeh Prize
Prof. James Demmel, CSO of HPC-AI Tech
Distinguished Professor, University of California, Berkeley
ACM/IEEE Fellow
Member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences
Check out the project over here: https://github.com/hpcaitech/ColossalAI
Original article (this post is a translation): https://medium.com/@hpcaitech/train-18-billion-parameter-gpt-models-with-a-single-gpu-on-your-personal-computer-8793d08332dc