Speed-up LLM inference : Inference optimization methods

On ways to optimize LLM inference

Aug 11, 2023

Speed-up LLM inference :

7 Ways to Speed Up Inference of Your Hosted LLMs

TLDR; techniques to speed up inference of LLMs to increase token generation speed and reduce memory consumption

https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47

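“In the future, every 1% speedup on LLM inference will have similar economic value as 1% speedup on Google Search infrastructure.” (Jim Fan, NVIDIA senior AI scientist) = These days, inference speed matters the way page rendering time does on the web, or the way search latency does for Google.

Summary

Precision Reduction : using float16 or bfloat16 can make the model up to about 20% faster and cuts memory consumption in half.

PyTorch uses 32-bit floats by default

There are several further techniques here as well, and libraries exist that make this easy

Use 8-bit or 4-bit quantization : cuts memory consumption by 2x or even 3x. Quality drops in exchange. Best used when memory is the main concern

Fine-tune with adapters (LoRA, QLoRA). This can improve performance on your own data.

Speed things up across multiple GPUs with tensor parallelism

Batch Inference

So this is how it is done..?? Inference also uses batching as much as possible, and it proceeds much like CPU scheduling

Do the libraries below take care of this?

If possible, use libraries such as Text Generation Inference, DeepSpeed, or vLLM.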

These already include various optimization techniques: tensor parallelism, quantization, continuous batching of incoming requests, optimized CUDA kernels, and more.

Always evaluate carefully, fix bugs, and try things out before going to production

Based on what is known about the model, Falcon architecture is very similar to GPT-3 and LLaMA, except for using multiquery attention (Shazeer 2019) and RefinedWeb corpus as a training dataset (which can be a key to success).

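Falcon uses an attention variant aimed at memory efficiency (the key and value are shared across the heads of multi-head attention) and was trained on a specific dataset (RefinedWeb, which is said to be good. Good data matters.)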

Multiquery attention is a concept where the same key and value tensors are shared for efficiency across different attention heads, as illustrated for a multihead attention block below.

Every head shares and uses the same key and value.
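To make the memory effect concrete, here is a rough sketch of how the key/value cache for one sequence shrinks when all heads share a single key/value pair. The layer count, head count, head dimension, and sequence length are illustrative assumptions, not Falcon's actual configuration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: each layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape (assumed for illustration only).
N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN = 32, 64, 64, 2048

# Multi-head attention: every head keeps its own key/value.
multi_head = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
# Multi-query attention: one key/value shared by all heads.
multi_query = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

print(f"multi-head : {multi_head / 2**20:.0f} MiB")
print(f"multi-query: {multi_query / 2**20:.0f} MiB")
print(f"reduction  : {multi_head // multi_query}x")
```

The reduction factor equals the number of heads, since only the number of cached key/value heads changes.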

Experiment setup

Download the falcon-7B model with the Lit-GPT library → convert it to the Lit-GPT format

Lit-GPT uses Lightning Fabric, a framework for PyTorch : Fabric is the fast and lightweight way to scale PyTorch models without boilerplate code.

Hardware: an A100 GPU with 40GB of VRAM.

Methods to Accelerate the LLM Inference

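Precision Reduction : using float16 or bfloat16 can make the model up to about 20% faster and cuts memory consumption in half.

PyTorch uses 32-bit floats by default

There are several further techniques here as well, and libraries exist that make this easy

Structure of float32

In general, more bits let you compute at a finer numeric resolution, so less error creeps into the calculation. Even so, dropping to 16 bits halves the memory and runs faster.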
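A small sketch of that trade-off, using only Python's standard `struct` module rather than PyTorch: it prints bytes per weight and the rounding error each format introduces for the same number. The 7 billion parameter count is an assumed round figure for a "7B" model, and `to_fp16` / `to_fp32` are hypothetical helper names.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest float32 value and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


N_PARAMS = 7_000_000_000  # assumed round number for a "7B" model

for name, fmt in (("float32", "<f"), ("float16", "<e")):
    size = struct.calcsize(fmt)  # bytes used by one value in this format
    print(f"{name}: {size} bytes/weight, {N_PARAMS * size / 1e9:.0f} GB of weights")

x = 0.1
print("float32 rounding error:", abs(to_fp32(x) - x))
print("float16 rounding error:", abs(to_fp16(x) - x))
```

The weights take half the space in float16, at the cost of a visibly larger rounding error per value.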

Because Lit-GPT uses Fabric, this is supposedly possible with a single line of code..

Mixed-precision training

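During training, 16-bit is not used throughout. The computation switches between 16 and 32 bits as needed.

This makes it possible to train while preserving accuracy and stability.

In more detail (source: the link above)

Convert the weights to lower precision (FP16),

run the forward pass and the backward pass (= compute the gradients) →

convert the gradients back to higher precision, for numerical stability and avoiding issues such as vanishing or exploding gradients that can occur when using lower-precision arithmetic,

then multiply by the learning rate and update the weights.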
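The reason for keeping a higher-precision copy can be seen in a toy sketch. It is not a real training loop: the gradient is a fixed constant, a Python float stands in for the FP32 master weight, and `train` and `to_fp16` are hypothetical names. It only shows what happens to small updates when they are applied directly in float16.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def train(steps: int, lr: float, grad: float, keep_master_copy: bool) -> float:
    master = 1.0           # higher-precision weight (a Python float stands in for FP32)
    w16 = to_fp16(master)  # low-precision copy used for the forward and backward pass
    for _ in range(steps):
        # A real forward/backward pass would use w16 here; the gradient is
        # fixed to a constant so the sketch stays tiny.
        update = lr * grad
        if keep_master_copy:
            master -= update       # apply the update in higher precision
            w16 = to_fp16(master)  # refresh the low-precision copy
        else:
            w16 = to_fp16(w16 - update)  # apply the update directly in float16
    return master if keep_master_copy else w16


print("float16 only      :", train(10, 1e-4, 1.0, keep_master_copy=False))
print("with master weight:", train(10, 1e-4, 1.0, keep_master_copy=True))
```

With these numbers each update is smaller than half the spacing between float16 values near 1.0, so the float16-only weight never moves from 1.0, while the master copy accumulates all ten updates.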

Brain floating point

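A format proposed by Google. Here

It was created for ML/DL, specifically for use on TPUs.

Unlike the standard float16, it gives 8 bits to the exponent.

How a float16 value is computed

1 bit : sign (0 is positive, 1 is negative)

5 bits : exponent : the power of 2 that multiplies the mantissa

10 bits : fraction : the mantissa part

This may not be entirely clear at first, but the calculation goes like this

To be precise, the value above is 7.252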
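The sign / exponent / fraction layout can be decoded by hand, as in the sketch below. `decode_float16` is a hypothetical helper, and the bit pattern is one chosen for illustration, not necessarily the one in the original figure. The result is checked against Python's own float16 decoding in `struct`.

```python
import struct


def decode_float16(bits: int) -> float:
    """Decode a 16-bit pattern using the sign / exponent / fraction layout."""
    sign = (bits >> 15) & 0x1       # 1 bit
    exponent = (bits >> 10) & 0x1F  # 5 bits, stored with a bias of 15
    fraction = bits & 0x3FF         # 10 bits
    if exponent == 0x1F:            # all ones: infinity or NaN
        value = float("inf") if fraction == 0 else float("nan")
    elif exponent == 0:             # all zeros: zero or subnormal, no implicit leading 1
        value = (fraction / 1024) * 2.0 ** -14
    else:                           # normal number: implicit leading 1
        value = (1 + fraction / 1024) * 2.0 ** (exponent - 15)
    return -value if sign else value


# Illustrative pattern: sign 0, exponent 10001 (17), fraction 1101000000 (832).
bits = 0b0_10001_1101000000
decoded = decode_float16(bits)
reference = struct.unpack("<e", bits.to_bytes(2, "little"))[0]
print(decoded, reference)
```

For this pattern the arithmetic is (1 + 832/1024) x 2^(17 - 15) = 1.8125 x 4 = 7.25.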

Based on research findings that neural networks are more sensitive to the size of the exponent, bfloat16 allocates more bits to the exponent. It has the same number of exponent bits as float32.
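Because bfloat16 keeps float32's 8 exponent bits, a bfloat16 value is essentially the top 16 bits of the float32 encoding. The sketch below relies on that: `to_bfloat16` is a hypothetical helper that simply truncates the low 16 bits (real hardware typically rounds instead), and `fits_in_float16` checks whether a value overflows the standard float16 range.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Keep the top 16 bits of the float32 encoding (truncation, no rounding)."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits32 & 0xFFFF0000))[0]


def fits_in_float16(x: float) -> bool:
    """True if x is within the range of IEEE float16."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


big = 70000.0
print("fits in float16:", fits_in_float16(big))
print("as bfloat16    :", to_bfloat16(big))
print("pi as bfloat16 :", to_bfloat16(3.14159))
```

The large value overflows float16 but survives in bfloat16, only more coarsely: the wider exponent buys range, and the shorter fraction costs precision.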

Result after switching to bfloat16

Quantization

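If you want to push performance further, you can go beyond lower-precision floating point and move to quantization.

Quantization converts the model weights from floats to low-bit integer representations, for example, 8-bit integers (and, recently, even 4-bit integers). ?? Makes me wonder whether the information really survives..

What quantization means in deep learning : https://gaussian37.github.io/dl-concept-quantization/

Turning real numbers into integers.

Naturally, this can only be done under the assumption that the weights or activation values lie within some range.

Suppose the weight values lie in the range -10 to 30.

Mapping the minimum, -10, to 0 in uint8 and 30 to 255 turns a 32-bit data type into an 8-bit one. (As a rule of thumb, when the number of bits shrinks by a factor of N, complexity shrinks by a factor of N^2.) Integer types are also somewhat more hardware-friendly. (The reason is a guess?)

1) Smaller model size, 2) less computation, 3) more efficient use of hardware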
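The -10 to 30 example can be written as a minimal sketch of min/max (affine) quantization to uint8. The helper names are hypothetical, and real quantization schemes differ in how they choose ranges and group weights.

```python
def make_uint8_quantizer(w_min: float, w_max: float):
    """Build quantize/dequantize functions mapping [w_min, w_max] onto 0..255."""
    scale = (w_max - w_min) / 255  # real-valued width of one integer step

    def quantize(w: float) -> int:
        q = round((w - w_min) / scale)
        return max(0, min(255, q))  # clamp values outside the assumed range

    def dequantize(q: int) -> float:
        return q * scale + w_min

    return quantize, dequantize, scale


quantize, dequantize, scale = make_uint8_quantizer(-10.0, 30.0)

for w in (-10.0, 0.0, 12.34, 30.0):
    q = quantize(w)
    print(f"{w:7.2f} -> {q:3d} -> {dequantize(q):8.4f}")
```

Each weight comes back within half a step (`scale / 2`) of its original value, which is the information lost in exchange for storing one byte instead of four.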

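But you cannot just do it carelessly, right?

Looking at the results, a large network quantized well (2-bit) ended up both smaller and better-performing than a small network quantized roughly (4-bit).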

FP32 → INT8 : model size = 1/4, inference speed = 2-4x, memory bandwidth = 2-4x

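On quantization in deep learning

It is used only at inference time. Quantization has nothing to do with training.
Not every deep learning layer can be quantized. Some aspects differ depending on the characteristics of the layer.
Not every layer has to be quantized, either. Sometimes several layers are grouped and quantized together.
Because it depends on layer characteristics, the payoff from applying quantization differs from model to model.

But this article says there are these two kinds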

Post-Training Quantization (PTQ): A model is first trained to converge, then we convert its weights to a lower precision without more training. It is usually quite cheap to implement in comparison to training.
Quantization-Aware Training (QAT): Quantization is applied during pre-training or further fine-tuning. QAT can perform better but requires extra computation resources and access to representative training data.

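So: one method applied during training (QAT) and one applied only afterward, for inference (PTQ).

The quantization method to apply also differs by model. As for LLaMA

Fine-tuning with adapters.

This means keeping the base model fixed and training only the added adapters. LoRA in particular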

QLoRA : which added quantization and a few other optimizations to LoRA
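A minimal sketch of the adapter idea behind LoRA: the frozen weight matrix W is left untouched and a low-rank product B·A is added on top, so only the two small factors are trained. The matrix sizes and rank below are assumptions for illustration, and the pure-Python helpers are hypothetical stand-ins for real tensor code.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Tiny forward pass: a 2x2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # r x d_in
B = [[0.5], [0.0]]  # d_out x r
y = lora_forward(W, A, B, [1.0, 2.0])
print(y)

# Parameter count for a hypothetical 4096 x 4096 layer with rank 8.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 8)
print(f"full: {full:,}  adapter: {lora:,}  ratio: {lora / full:.2%}")
```

For the assumed layer, the adapter holds well under 1% of the parameters of the full matrix, which is why training it is so much cheaper than full fine-tuning.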

Pruning as well?

LLM-Pruner: On the Structural Pruning of Large Language Models

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which...

https://arxiv.org/abs/2305.11627

A Simple and Effective Pruning Approach for Large Language Models

As their size increases, Large Languages Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance....

https://arxiv.org/abs/2306.11695

You can prune to whatever degree you want (up to 50%)

The two papers above are about two different pruning criteria

Isn't this an outright trade-off? I don't understand the details yet
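To get a feel for what "prune 50%" means, here is a sketch of plain magnitude pruning: zero out the weights with the smallest absolute values. This is the simplest baseline criterion, not the method of either paper above, and `magnitude_prune` is a hypothetical helper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6, -0.02, 0.4]
pruned = magnitude_prune(weights, 0.5)
print(pruned)
print("zeros:", pruned.count(0.0), "of", len(pruned))
```

The trade-off is visible even here: half the weights are gone, and how much quality survives depends on how well the criterion picks weights the model can do without.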

Batch Inference

As in the figure above, inference also benefits from being processed in batches (naturally)

After careful evaluation, I opted for vLLM as my preferred choice. vLLM utilizes PagedAttention, the new attention algorithm that effectively manages attention keys and values: it delivers up to 24x higher throughput than HuggingFace Transformers without requiring any model architecture changes.

Advantages of vLLM

State-of-the-art serving throughput

Efficient management of attention key and value memory with PagedAttention

Continuous batching of incoming requests

Optimized CUDA kernels
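Continuous batching, listed above, can be illustrated with a toy scheduler simulation. This is not vLLM's implementation: it assumes every request is already waiting, that each decoding step produces one token per running request, and it ignores prompt processing. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request right away."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each request currently in the batch
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decoding step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request produced one token; drop the finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps


# Number of tokens each request needs to generate (made-up values).
requests = [8, 2, 2, 2]
print("static    :", static_batching_steps(requests, batch_size=2))
print("continuous:", continuous_batching_steps(requests, batch_size=2))
```

With static batching the short request sits idle until the long one in its batch finishes. With continuous batching the freed slot is reused immediately, much like a CPU scheduler handing a free core to the next task.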

Other alternative libraries for the LLM Inference:

Accelerate lets you offload part of the model onto the CPU. Offloading helps you optimize the throughput of an inference service, even when the whole model fits on a GPU.

DeepSpeed Inference helps you serve transformer-based models more efficiently when: (a) The model fits on a GPU and (b) The model’s kernels are supported by the DeepSpeed library. This is your go-to solution if latency is your main concern.

DeepSpeed MII is a library that quickly sets up a GRPC endpoint for the inference model, with the option to use either the ZeRO-Inference or DeepSpeed Inference technology.

OpenLLM is an open platform for operating large language models (LLMs) in production. Fine-tune, serve, deploy, and monitor any LLMs with ease.

Aviary — a new open source project that simplifies and enables easy self-hosted serving of multiple LLM models efficiently
