Hi, I am Hyunwoong Ko, a machine learning engineer at TUNiB. Recently, TUNiB publicly released Parallelformers, an efficient model parallelization library for web server deployments. In this article, I’d like to discuss how Parallelformers came into being and how it works. If you have any questions, feel free to contact me at kevin.ko@tunib.ai.
Introducing the creation of Parallelformers
TUNiB makes artificial intelligence that can engage in complex conversations with people. Before starting development, we needed to test various existing models and think about what makes a good conversation.
Scaling up model size is the current trend in AI, and conversation models are no exception. The models we decided to test were also very large, and deploying them to a server required GPUs with a lot of memory. In particular, Blenderbot 9B, the largest model we tested, took 17.6GB for its weights alone and required approximately 22GB of GPU memory to serve.
[Figure 1] Blenderbot 9B’s immense size... 😲
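As a rough back-of-the-envelope check (a sketch, not the exact figure we measured), the weight memory of a model can be estimated from its parameter count and numeric precision:

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate raw weight memory in GB (fp16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1024 ** 3

# Blenderbot 9B has roughly 9.4 billion parameters; in fp16 the weights alone
# come to about 17.6GB, before activations, caches, and framework overhead.
print(f"{estimate_weight_memory_gb(9.4e9):.1f} GB")  # ~17.5 GB
```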
The engineers at TUNiB use Google Cloud Platform (GCP). We found that the only GPU on GCP with more than 22GB of memory was the A100 (40GB). In addition to Blenderbot 9B, we planned to test various other conversation models, so we actually needed three A100s. The problem was that three A100s cost about $11 per hour, roughly $264 per day, and about $8,000 per month.
While we were worrying about this cost, we found that it is much more affordable to deploy with several smaller GPUs. For example, suppose we need 120GB of GPU memory in total. Instead of three A100s (40GB each), eight T4s (15GB each) can cover it and reduce the cost from about $8,000 to about $1,600 per month. For this reason, we decided to deploy the models by splitting them across smaller GPUs.
But parallelization is never easy... 😅
Splitting a model up between multiple GPUs is called ‘model parallelization’. Model parallelization lets large models run on multiple small GPUs, so it was a good solution to our cost problem. However, because model parallelization is an advanced technique, we tried to use existing tools instead of implementing parallelization ourselves.
[Figure 2] Model Parallelization
We tested Microsoft’s DeepSpeed and Nvidia’s Faster Transformer, the most well-known parallelization tools. Their inference engines provide model parallelization based on Megatron-LM and accelerate inference through kernel fusion. Kernel fusion is a technique that combines multiple GPU operations into one, which greatly improves execution speed.
[Figure 3] Example of kernel fusion
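To make the idea concrete, here is a minimal, hypothetical illustration of kernel fusion in PyTorch (not code from DeepSpeed or Faster Transformer): two element-wise operations that would normally run as separate GPU kernels are expressed so that TorchScript can fuse them into one.

```python
import torch

def bias_gelu_unfused(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Separate kernels: one for the addition, one for the GELU.
    return torch.nn.functional.gelu(x + bias)

@torch.jit.script
def bias_gelu_fused(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # TorchScript can fuse these element-wise ops into a single GPU kernel,
    # avoiding extra kernel launches and intermediate memory traffic.
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y / 1.41421356237))
```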
However, there were some challenges in actually using these parallelization tools for deployment. The biggest problem was that (1) they only supported a very small number of models. At the time, DeepSpeed and Faster Transformer supported only three models each, and unfortunately Blenderbot, the model we wanted to parallelize, was not among them.
In addition, DeepSpeed (2) could not deploy models to a web server because of how it controls the process flow, and (3) could only start parallelization after the model had already been loaded onto the GPU. Faster Transformer, meanwhile, was (4) quite complicated to use. For these reasons, we concluded that we could not use the existing parallelization tools.
Developing our own parallelization tool 🔥
After much consideration, we decided to develop our own parallelization tool. We needed a tool that (1) supports various models, (2) can be used for web server deployment, (3) does not need the model on the GPU to start parallelization, and (4) is easy to use. Focusing on these four factors, we started developing the tool step by step.
Step 1: Parallelizing various models - No Kernel Fusion
As mentioned above, existing tools use kernel fusion to parallelize models and speed them up at the same time. Kernel fusion certainly improves computational speed, but it also limits model diversity.
[Figure 4] So many diverse language models... 😂
As shown in the figure above, a wide variety of language models exist, and most of them have their own operations. For example, a model called ‘BigBird’ has its own operation called Random Attention. When using kernel fusion, these unique operations must also be fused, which means every model-specific operation has to be reimplemented as a fused kernel. Because there are so many different types of language models, reimplementing all of these operations is extremely time-consuming. That is why existing tools only supported a handful of frequently used models.
For us, parallelizing many different models mattered more than speeding up operations. So we chose to reuse the code of existing models instead of relying on kernel fusion: we kept the model parallelization mechanism of existing tools but ran the models’ existing operations instead of fusing them, as sketched below. We gave up some speed but gained diversity.
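A simplified, hypothetical sketch of the idea (not Parallelformers’ actual internals): instead of rewriting each model’s attention and feed-forward blocks as fused kernels, the existing `nn.Linear` weights are sliced Megatron-LM style, so each GPU keeps its own shard while the original forward code runs unchanged.

```python
import torch
import torch.nn as nn

def slice_linear_column_wise(linear: nn.Linear, rank: int, world_size: int) -> nn.Linear:
    """Give each rank a column shard of an existing Linear layer.

    The sharded layer keeps its original forward() implementation,
    so no model-specific kernel needs to be rewritten.
    """
    out_features = linear.out_features // world_size
    shard = nn.Linear(linear.in_features, out_features, bias=linear.bias is not None)
    # nn.Linear stores weight as (out_features, in_features), so slicing rows of
    # the stored weight corresponds to slicing output columns of y = xW^T + b.
    start, end = rank * out_features, (rank + 1) * out_features
    with torch.no_grad():
        shard.weight.copy_(linear.weight[start:end, :])
        if linear.bias is not None:
            shard.bias.copy_(linear.bias[start:end])
    return shard
```

Row-wise slicing and the all-reduce that stitches the partial results back together follow the same pattern described in the Megatron-LM paper.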
[Figure 5] Types of models supported by Parallelformers
As a result, we successfully parallelized 68 of the 70 models offered by Huggingface Transformers. Furthermore, we could parallelize not only language models but also speech and vision models.
Step 2: Deploying parallelized models on web servers - Inversion of Process Control
Because we didn’t use kernel fusion in step 1, we could parallelize various models like Blenderbot. However, it was still a problem that these models could not be deployed on the web server.
[Figure 6] How DeepSpeed Launcher works
DeepSpeed’s launcher is designed to execute the user’s entire script multiple times simultaneously, because each piece of the parallelized model needs to run on its own GPU at the same time. However, repeating the entire script caused problems that made it impossible to deploy parallelized models to web servers.
The root issue was that even code which should run only once was executed multiple times. As a result, (1) the model was loaded repeatedly and exceeded CPU memory, and (2) the server tried to bind to the same port multiple times, causing an error. Also, (3) we could not stop parallelization from the user’s code, because the model was parallelized inside the framework code (a parent-child process problem).
[Figure 7] How inversion of process control works
To solve this problem, we reversed the traditional flow: instead of the framework running the user’s code, the user’s code launches the framework code several times, simultaneously. We call this method ‘inversion of process control’.
[Figure 8] In Parallelformers, only certain parts operate in parallel unlike in DeepSpeed, where all parts operate in parallel
By applying inversion of process control, only the parts of the user’s code that actually need to run multiple times are replicated. This prevents the same model from being loaded repeatedly and the server from being opened multiple times on the same port. As a result, we were able to successfully deploy the parallelized models to a web server. And because parallelization is started from the user’s code, we can also stop it whenever we want.
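A rough sketch of the difference, with hypothetical names: instead of a launcher re-running the whole script on every GPU, the user’s script runs once and only the model workers are forked from inside it, so the web server and any one-time setup run exactly once in the parent process.

```python
import torch.multiprocessing as mp

def model_worker(rank: int, world_size: int, request_queue, response_queue):
    """Only this function is replicated across GPUs (hypothetical sketch)."""
    # ... load this rank's model shard and serve requests from the queue ...
    pass

def main():
    world_size = 4
    ctx = mp.get_context("spawn")
    request_queue, response_queue = ctx.Queue(), ctx.Queue()

    # The parent process spawns workers for the parallel part only.
    procs = [
        ctx.Process(target=model_worker,
                    args=(rank, world_size, request_queue, response_queue))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()

    # The web server is created once, in the parent, so the port is bound once
    # and the model is never loaded twice in the same process.
    # app.run(...)  # e.g. a Flask app that pushes requests into request_queue

if __name__ == "__main__":
    main()
```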
Step 3: Solving GPU memory problems - Lazy GPU Allocation
We could deploy various models to the web server through ‘Inversion of process control’. However, DeepSpeed had another problem.
[Figure 9] DeepSpeed starts parallelization with the model on the GPU
The problem was that parallelization started on the GPU. You normally cannot fit a large model onto a small GPU; that is precisely why we parallelize it in the first place. However, DeepSpeed requires the whole model to be loaded onto the GPU before parallelization, so parallelization is only possible when you already have enough GPU memory. Ironically, it becomes impossible exactly when you need it most, because GPU memory is lacking.
[Figure 10] Parallelformers starts parallelizing with models on the CPU
We solved this problem by letting parallelization start on the CPU. Most machines have far more CPU memory than GPU memory, so we relieve the pressure on GPU capacity by completing parallelization in CPU memory and only then uploading the split parts to the GPUs.
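In sketch form (a toy stand-in for a full Transformers model, assuming two GPUs are available), the weights are built and sliced entirely in CPU memory, and each shard only touches a GPU at the very end:

```python
import torch
import torch.nn as nn

world_size = 2

# 1. Build (or load) the model on the CPU, where memory is plentiful.
model = nn.Linear(4096, 4096)  # stand-in for a full Transformers model

# 2. Slice the weights into per-rank shards while everything is still on the CPU.
shards = torch.chunk(model.weight.data, world_size, dim=0)

# 3. Only now does each shard move to its (much smaller) GPU.
devices = [f"cuda:{rank}" for rank in range(world_size)]
gpu_shards = [shard.to(device) for shard, device in zip(shards, devices)]
```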
Step 4: Making it easy to use - Method Hijacking
By designing and implementing these different techniques, we were able to create a parallelization mechanism that perfectly met our needs. However, if users had to deal with all of these techniques themselves, it would be too difficult to use. We introduced an additional mechanism called ‘Method Hijacking’ to make it easier for users to utilize these mechanisms.
[Figure 11] Method Hijacking
Method hijacking intercepts the call flow when the user invokes a function they have always been using, and performs the complex work behind the scenes. This allows users to write code exactly as they did before, even if they are not familiar with these complicated mechanisms.
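A minimal, hypothetical sketch of the idea (not Parallelformers’ actual implementation): the familiar `generate` method is replaced with a proxy that forwards the call to the parallel worker processes.

```python
def hijack_generate(model, request_queue, response_queue):
    """Replace model.generate with a proxy that hands the call to the
    parallel worker processes (hypothetical sketch of the idea)."""
    def parallel_generate(*args, **kwargs):
        request_queue.put((args, kwargs))   # hand the inputs to the workers
        return response_queue.get()         # wait for the gathered outputs
    model.generate = parallel_generate      # the user-facing API is unchanged
    return model
```

For the end user, this boils down to something like the example in the Parallelformers README: load a Transformers model as usual, call `parallelize(model, num_gpus=2, fp16=True)`, and keep calling `model.generate(...)` exactly as before.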
Conclusion
After a lot of hard work, we were able to create a great parallelization tool called Parallelformers, and we succeeded in deploying the web demo application below at an affordable price. As planned, we deployed large models on eight inexpensive T4s without using any A100s, saving a significant part of our budget.
Parallelformers was a technically challenging project, but we learned and experienced a lot while developing it. In particular, we felt it was unfortunate that people cannot use popular big models unless they have access to expensive GPUs. To address some of these difficulties, we decided to release the tool as open source. We have also begun discussing collaboration with the Huggingface Transformers team and the Microsoft DeepSpeed team so that our work can help more people. We hope that in the near future more people will be able to access big models easily.
Well, that’s all we have today. If you’d like to know more about the theoretical background and usage instructions, please refer to our GitHub repository or the official Huggingface document introducing Parallelformers. Thank you.