
Commit 1f3a421: Add Latest News section (NVIDIA#361)

1 parent 71a5b97

File tree: 5 files changed, +63 −4 lines changed

README.md

Lines changed: 6 additions & 4 deletions
@@ -17,13 +17,15 @@ TensorRT-LLM
 <div align="left">

 ## Latest News
-* [2023/11/03] [TensorRT-LLM is up to **4.6x faster on H100 than A100**, achieving **10,000 tok/s at 100ms to first token.**](./docs/source/blogs/H100vsA100.md)
+* [2023/11/13] [**H200** achieves nearly **12,000 tok/sec on Llama2-13B**](./docs/source/blogs/H200launch.md)
+
+<img src="./docs/source/blogs/media/H200launch_Llama70B_tps.png" alt="H200 Llama2 70B" width="250" height="auto">
+<img src="./docs/source/blogs/media/H200launch_GPT175B_tps.png" alt="H200 GPT3 175B" width="250" height="auto">

-<img src="./docs/source/blogs/media/TRT_LLM_v0-5-0_H100vA100_tps.png" alt="max throughput" width="450" height="auto">
-<img src="./docs/source/blogs/media/TRT_LLM_v0-5-0_H100vA100_1st.png" alt="1st token latency" width="450" height="auto">
+H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x faster than H100.

-H100 FP8 increases max throughput, decreases 1st token latency, and reduces memory consumption. At peak, TensorRT-LLM on H100 can achieve >10K token/s or <10ms to first token. See full [performance data](#performance).

+* [2023/11/03] [TensorRT-LLM is up to **4.6x faster on H100 than A100**, achieving **10,000 tok/s at 100ms to first token.**](./docs/source/blogs/H100vsA100.md)
 * [2023/10/22] [🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙](https://github.com/NVIDIA/trt-llm-rag-windows#readme)
 * [2023/10/19] Getting Started Guide - [Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/)

docs/source/blogs/H200launch.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM

TensorRT-LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news/nvidia-supercharges-hopper-the-worlds-leading-ai-computing-platform) achieves **11,819 tokens/s on Llama2-13B** on a single GPU. H200 is up to **1.9x faster** than H100. This performance is enabled by H200's larger, faster [HBM3e memory](#latest-hbm-memory).

**H200 FP8 Max throughput**

| Model     | Batch Size<sup>(1)</sup> | TP<sup>(2)</sup> | Input Length | Output Length | Throughput (out tok/s) |
|:----------|:-------------------------|:-----------------|:-------------|:--------------|-----------------------:|
| llama_13b | 1024                     | 1                | 128          | 128           | 11,819                 |
| llama_13b | 128                      | 1                | 128          | 2048          | 4,750                  |
| llama_13b | 64                       | 1                | 2048         | 128           | 1,349                  |
| llama_70b | 512                      | 1                | 128          | 128           | 3,014                  |
| llama_70b | 512                      | 4                | 128          | 2048          | 6,616                  |
| llama_70b | 64                       | 2                | 2048         | 128           | 682                    |
| llama_70b | 32                       | 1                | 2048         | 128           | 303                    |

<sub>Preliminary measured performance, subject to change. TensorRT-LLM v0.5.0, TensorRT v9.1.0.4 | H200, H100 FP8.</sub>

<sup>*(1) Largest power-of-2 batch size supported for the given TP configuration.*</sup> <sup>*(2) TP = Tensor Parallelism.*</sup>
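To make the table's metric concrete, the minimal Python sketch below shows how an output-tokens-per-second figure relates to batch size, output length, and elapsed generation time. The elapsed time is a made-up placeholder chosen only to land near the headline number; this is not NVIDIA's benchmarking harness.

```python
# Minimal sketch of the "Throughput (out tok/s)" metric: total generated output
# tokens across the batch divided by end-to-end generation time.
# The elapsed time below is a hypothetical placeholder, not a reported measurement.

def output_tokens_per_second(batch_size: int, output_len: int, elapsed_s: float) -> float:
    """Generated tokens across the whole batch per wall-clock second."""
    return batch_size * output_len / elapsed_s

# Example: 1024 sequences x 128 output tokens generated in ~11.1 s
# works out to roughly 11,800 output tok/s (close to the headline 11,819).
print(f"{output_tokens_per_second(1024, 128, 11.1):,.0f} tok/s")
```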
Additional performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, and soon in [TensorRT-LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/performance.html).

### H200 vs H100
H200's larger-capacity, faster HBM3e memory enables up to **1.9x** LLM performance over H100. Max throughput depends on memory capacity and bandwidth, so it benefits directly from the new HBM3e. First token latency is compute bound for most ISLs, so H200's time to first token remains similar to H100's.

For practical examples of H200's performance:

**Max Throughput TP1:**
An offline summarization scenario (ISL/OSL=2048/128) with Llama-70B on a single H200 is 1.9x more performant than H100.

**Max Throughput TP8:**
An online chat agent scenario (ISL/OSL=80/200) with GPT3-175B on a full HGX (TP8) H200 is 1.6x more performant than H100.

<img src="media/H200launch_Llama70B_tps.png" alt="max throughput llama TP1" width="250" height="auto">
<img src="media/H200launch_GPT175B_tps.png" alt="max throughput GPT TP8" width="250" height="auto">

<sub>Preliminary measured performance, subject to change.
TensorRT-LLM v0.5.0, TensorRT v9.1.0.4. | Llama-70B: H100 FP8 BS 8, H200 FP8 BS 32 | GPT3-175B: H100 FP8 BS 64, H200 FP8 BS 128</sub>
**Max Throughput across TP/BS:**
Max throughput<sup>(3)</sup> on H200 vs H100 varies by model, sequence lengths, BS, and TP. The results below show the maximum throughput per GPU across all of these variables.

<img src="media/H200launch_H200vsH100_tps.png" alt="max throughput llama sweep" width="500" height="auto">

<sub>Preliminary measured performance, subject to change.
TensorRT-LLM v0.5.0, TensorRT v9.1.0.4 | H200, H100 FP8.</sub>

<sup>*(3) Max Throughput per GPU is defined as the highest tok/s per GPU, swept across TP configurations and power-of-2 batch sizes.*</sup>
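As a concrete reading of footnote (3), this short sketch normalizes each swept (TP, batch size) measurement to per-GPU throughput and keeps the best value. The measurements are placeholder numbers for illustration only, not data from this page.

```python
# Sketch of the "Max Throughput per GPU" definition in footnote (3):
# normalize each (TP, batch size) result to tok/s per GPU, then take the best.
# Placeholder measurements, not values from this page.

measurements = [
    # (tensor_parallelism, batch_size, total_output_tok_per_s)
    (1, 64, 1200.0),
    (2, 128, 2000.0),
    (4, 256, 3600.0),
    (8, 512, 5600.0),
]

def max_throughput_per_gpu(points):
    """Highest output tok/s per GPU across the swept TP/BS configurations."""
    return max(tok_s / tp for tp, _bs, tok_s in points)

# Here the TP=1 configuration wins on a per-GPU basis (1200.0 tok/s per GPU),
# even though larger TP delivers higher total throughput.
print(max_throughput_per_gpu(measurements))
```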
### Latest HBM Memory

H200 is the newest addition to NVIDIA's data center GPU portfolio. To keep its compute fed, H200 is the first GPU with HBM3e memory, delivering 4.8 TB/s of memory bandwidth, a 1.4x increase over H100. H200 also expands GPU memory capacity nearly 2x, to 141 gigabytes (GB). The combination of faster and larger HBM memory accelerates LLM inference, improving throughput in tokens per second. These results are measured but preliminary; more updates are expected as optimizations for H200 continue in TensorRT-LLM.
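As a quick sanity check on the ratios quoted above, the sketch below divides H200's figures from this page by H100's commonly cited specs; the H100 numbers (3.35 TB/s bandwidth, 80 GB capacity) are assumptions taken from NVIDIA's public H100 SXM specifications, not from this page.

```python
# Back-of-the-envelope check of the HBM ratios quoted above.
# H200 figures are from this page; the H100 figures (3.35 TB/s, 80 GB) are
# assumed from NVIDIA's public H100 SXM specs, not stated here.

h200_bw_tb_s, h100_bw_tb_s = 4.8, 3.35
h200_cap_gb, h100_cap_gb = 141, 80

print(f"bandwidth ratio: {h200_bw_tb_s / h100_bw_tb_s:.2f}x")  # ~1.43x, i.e. the ~1.4x quoted
print(f"capacity ratio:  {h200_cap_gb / h100_cap_gb:.2f}x")    # ~1.76x, i.e. 'nearly 2x'
```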
docs/source/blogs/media: 3 binary image files added (13.8 KB, 41.3 KB, 13.5 KB)
