Achieving Top Inference Performance with the NVIDIA H100 Tensor Core GPU and NVIDIA TensorRT-LLM

- NVIDIA has released the open-source NVIDIA TensorRT-LLM, which includes the latest kernel optimizations for the NVIDIA H100 Tensor Core GPU.
- AMD recently shared benchmark results comparing the inference performance of the H100 GPU to its MI300X chip, but those results were not based on optimized software. When benchmarked with optimized software (TensorRT-LLM), the H100 GPU delivers 2x the performance implied by AMD's comparison.
- Actual measured performance of a single NVIDIA DGX H100 server with eight NVIDIA H100 GPUs running the Llama 2 70B model shows that the server can complete a single inference in 1.7 seconds at a batch size of one.
- With a fixed 2.5-second response-time budget, batching requests together lets an 8-GPU DGX H100 server process over five Llama 2 70B inferences per second, compared to less than one per second at batch size one (see the sketch after this list).
- NVIDIA continuously optimizes its software to improve AI performance and encourages users to check its performance pages and GitHub repositories for the latest updates.
- The DGX H100 "AMD footnote" measurement was performed by NVIDIA using vLLM, following the configuration AMD provided in its footnotes.
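
The throughput gain in the latency-budget comparison above comes from batching: serving several requests together raises per-request latency somewhat, but as long as the batch still finishes inside the response-time budget, requests completed per second go up. The following is a minimal arithmetic sketch of that trade-off. Only the 1.7-second batch-1 latency and the 2.5-second budget come from the measurements above; the larger batch sizes and their latencies are hypothetical values chosen purely to illustrate the effect, not measured numbers.

```python
# Illustrative batching arithmetic under a fixed response-time budget.
# Only the batch-1 latency (1.7 s) and the 2.5 s budget come from the post;
# the batched latencies below are hypothetical.

def throughput(batch_size: int, batch_latency_s: float) -> float:
    """Requests completed per second when batch_size requests
    finish together after batch_latency_s seconds."""
    return batch_size / batch_latency_s

BUDGET_S = 2.5  # fixed response-time budget

# (batch size, end-to-end latency in seconds)
scenarios = [
    (1, 1.7),    # measured batch-1 latency reported above
    (8, 2.2),    # hypothetical: larger batch, latency grows slightly
    (16, 2.5),   # hypothetical: largest batch that still meets the budget
]

for batch, latency in scenarios:
    within = latency <= BUDGET_S
    print(f"batch={batch:2d}  latency={latency:.1f}s  "
          f"throughput={throughput(batch, latency):.2f} req/s  "
          f"{'within budget' if within else 'over budget'}")
```

Running this prints roughly 0.6 req/s at batch 1 versus about 6.4 req/s at the hypothetical batch of 16, which is the same shape as the "less than one per second" versus "over five per second" result quoted above.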