Mastering LLM Techniques: Inference Optimization
- Overview: This post focuses on inference optimization for large language models (LLMs). Inference proceeds in two phases: a prefill phase that processes the input tokens to compute the intermediate states needed for generation, and a decode phase that generates subsequent tokens autoregressively, one at a time.
- Efficient attention modules: One challenge is that the two phases stress the GPU differently. Prefill attention over the full input is a matrix-matrix operation that parallelizes well, while generating new tokens one at a time reduces attention to matrix-vector operations that leave much of the compute idle (see the first sketch after this list).
- Managing keys and values: To avoid recomputing the keys and values of all previous tokens at each time step, they are cached in GPU memory. Each layer of the model maintains its own key-value (KV) cache, which grows by one entry per generated token (sketched below).
- LLM memory requirement: The major contributors to GPU memory requirements are the model weights and the KV cache. The weights scale with parameter count and numeric precision, while the KV cache grows with batch size, sequence length, layer count, and hidden size, and can reach a significant share of the footprint (a worked estimate follows the list).
- Parallelizing the model: Models too large for a single device can be parallelized by splitting the weights over multiple devices. Pipeline parallelism executes contiguous subsets of layers on separate devices and shards the input batch into microbatches so that the devices can operate concurrently (sketched below).
- Multi-head attention: In multi-head attention blocks, each head or group of heads can be assigned to a different device and computed independently and in parallel, which is the basis of tensor-parallel attention (see the final sketch below).
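
To make the attention point concrete, here is a minimal NumPy sketch contrasting the two phases; the tensor names and sizes are assumptions for illustration, not values from the post. Prefill attention is a matrix-matrix product over the whole prompt, while a single decode step degenerates to matrix-vector work with far less arithmetic per byte of weights and cache read:

```python
import numpy as np

d_model, prompt_len = 64, 128           # assumed toy sizes
Wq = np.random.randn(d_model, d_model)  # assumed single-head projection weights
Wk = np.random.randn(d_model, d_model)

# Prefill: all prompt tokens are projected and attended to at once.
# Q @ K^T is a (prompt_len x d_model) @ (d_model x prompt_len) matrix-matrix
# product -- large, dense, and easy to parallelize on a GPU.
prompt = np.random.randn(prompt_len, d_model)
Q, K = prompt @ Wq, prompt @ Wk
prefill_scores = Q @ K.T / np.sqrt(d_model)   # shape (prompt_len, prompt_len)

# Decode: only one new token is processed per step, so the same attention
# becomes a (1 x d_model) @ (d_model x prompt_len) matrix-vector product --
# little arithmetic per byte loaded, hence memory-bandwidth bound.
new_token = np.random.randn(1, d_model)
q = new_token @ Wq
decode_scores = q @ K.T / np.sqrt(d_model)    # shape (1, prompt_len)

print(prefill_scores.shape, decode_scores.shape)
```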
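The key-value caching bullet can be sketched as follows; the class and function names are hypothetical, and the softmax is written out by hand to keep the example self-contained. Each decode step projects only the newly generated token and appends its key and value to the layer's cache rather than recomputing them for the whole sequence:

```python
import numpy as np

class LayerKVCache:
    """Per-layer cache of past keys and values, assumed layout (seq, d_head)."""
    def __init__(self, d_head: int):
        self.keys = np.empty((0, d_head))
        self.values = np.empty((0, d_head))

    def append(self, k: np.ndarray, v: np.ndarray):
        # Store this step's key/value so later steps can reuse them.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(x, Wq, Wk, Wv, cache: LayerKVCache):
    """One decode step for one layer: only the new token is projected."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    cache.append(k, v)                                # grow the cache by one entry
    scores = q @ cache.keys.T / np.sqrt(q.shape[-1])  # attend over all cached keys
    weights = np.exp(scores - scores.max())           # hand-rolled softmax
    weights /= weights.sum()
    return weights @ cache.values                     # weighted sum of cached values

# Usage: each transformer layer would own its own LayerKVCache instance.
d = 64
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
cache = LayerKVCache(d)
for _ in range(5):                                    # generate 5 tokens
    out = decode_step(np.random.randn(1, d), Wq, Wk, Wv, cache)
print(cache.keys.shape)                               # (5, 64): one entry per step
```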
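For the memory bullet, a back-of-the-envelope estimator under common assumptions: weights take roughly parameter count times bytes per parameter, and the KV cache stores two tensors (keys and values) of hidden size per layer, per token, per sequence. The concrete numbers below are assumed, roughly 7B-parameter-class values, not figures from the post:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights, e.g. 2 bytes per parameter for FP16."""
    return n_params * bytes_per_param / 1e9

def kv_cache_memory_gb(batch_size: int, seq_len: int, n_layers: int,
                       hidden_size: int, bytes_per_value: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token, each of hidden_size."""
    return (2 * n_layers * hidden_size * bytes_per_value
            * batch_size * seq_len) / 1e9

# Assumed, roughly 7B-class configuration (illustrative only).
print(weight_memory_gb(7e9))                               # ~14 GB of FP16 weights
print(kv_cache_memory_gb(batch_size=16, seq_len=4096,
                         n_layers=32, hidden_size=4096))   # ~34 GB of KV cache
```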
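The pipeline-parallelism bullet, sketched with plain Python callables standing in for transformer layers and with the devices left implicit; a real schedule would overlap microbatches across GPUs, which this data-flow sketch does not attempt:

```python
import numpy as np

def split_into_stages(layers, n_stages):
    """Assign contiguous subsets of layers to pipeline stages (one per device)."""
    per_stage = (len(layers) + n_stages - 1) // n_stages
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

def pipeline_forward(batch, stages, n_microbatches):
    """Shard the batch into microbatches and push each through every stage."""
    microbatches = np.array_split(batch, n_microbatches)
    outputs = []
    for mb in microbatches:
        x = mb
        for stage in stages:          # in practice each stage runs on its own GPU
            for layer in stage:
                x = layer(x)
        outputs.append(x)
    return np.concatenate(outputs)

# Toy "layers": simple linear maps standing in for transformer blocks.
d = 8
layers = [lambda x, W=np.random.randn(d, d) / np.sqrt(d): x @ W for _ in range(8)]
stages = split_into_stages(layers, n_stages=4)    # 2 layers per device
out = pipeline_forward(np.random.randn(32, d), stages, n_microbatches=4)
print(out.shape)                                  # (32, 8)
```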
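Finally, a sketch of splitting attention heads across devices; here a "device" is just the list of head weights it owns, an assumption made purely for illustration. Each shard of heads is computed independently, and the per-head outputs are concatenated at the end (the output projection is omitted):

```python
import numpy as np

def make_head_weights(d_model, d_head):
    """Per-head Q/K/V projection weights (toy, randomly initialized)."""
    return {name: np.random.randn(d_model, d_head) for name in ("Wq", "Wk", "Wv")}

def head_attention(x, w):
    """Standard scaled dot-product attention for a single head."""
    q, k, v = x @ w["Wq"], x @ w["Wk"], x @ w["Wv"]
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d_model, n_heads, n_devices = 64, 8, 4
d_head = d_model // n_heads
heads = [make_head_weights(d_model, d_head) for _ in range(n_heads)]

# Head-level parallelism: each device owns n_heads // n_devices heads and
# computes them independently; no communication is needed until the per-head
# outputs are concatenated (and then projected) at the end.
device_shards = [heads[i::n_devices] for i in range(n_devices)]

x = np.random.randn(16, d_model)                   # a toy sequence of 16 tokens
partial = [head_attention(x, w) for shard in device_shards for w in shard]
output = np.concatenate(partial, axis=-1)          # (16, d_model) before the output projection
print(output.shape)
```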