
Commit 9b3e12d (parent 8dd9c91)

Update TensorRT-LLM (NVIDIA#546)

File tree: 8 files changed (+80 lines, -21 lines)

docs/source/2023-05-19-how-to-debug.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -58,7 +58,7 @@ print(outputs.keys())
 print(outputs['inter'])
 ```

-Here is the [full example](../../tests/test_debugging_api.py).
+Here is the [full example](source:tests/test_debugging_api.py).


 ## Debug on E2E models
````

docs/source/blogs/H100vsA100.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -2,7 +2,7 @@



-# H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token.
+# H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token

 TensorRT-LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 is able to achieve over 10,000 output tok/s at [peak throughput](https://nvidia.github.io/TensorRT-LLM/performance.html#h100-gpus-fp8) for 64 concurrent requests, while maintaining a 1st token latency of 100ms. For [min-latency](https://nvidia.github.io/TensorRT-LLM/performance.html#id1) applications, TRT-LLM H100 can achieve less than 10ms to 1st token latency.

@@ -32,10 +32,10 @@ The full data behind these charts & tables and including larger models with high

 Stay tuned for a highlight on Llama coming soon!

-#### MLPerf on H100 with FP8
+## MLPerf on H100 with FP8
 In the most recent MLPerf results, NVIDIA demonstrated up to 4.5x speedup in model inference performance on the NVIDIA H100 compared to previous results on the NVIDIA A100 Tensor Core GPU. Using the same data types, the H100 showed a 2x increase over the A100. Switching to FP8 resulted in yet another 2x increase in speed.

-#### What is H100 FP8?
+## What is H100 FP8?
 H100 is NVIDIA's next-generation, highest-performing data center GPU. Based on the NVIDIA Hopper GPU architecture, H100 accelerates AI training and inference, HPC, and data analytics applications in cloud data centers, servers, systems at the edge, and workstations. Providing native support for FP8 data types, H100 can double performance and halve memory consumption compared to 16-bit floating-point options on H100.

 The FP8 specification, introduced in the paper [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433), can be used to speed up training as well as inference with post-training quantization of models trained using 16-bit formats. The specification consists of two encodings: E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). The recommended use of FP8 encodings is E4M3 for weight and activation tensors, and E5M2 for gradient tensors.
````
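The E4M3/E5M2 trade-off described in this blog post can be made concrete with a few lines of arithmetic. The sketch below is illustrative only (the helper functions are ours, not part of TensorRT-LLM); the constants follow the FP8 paper's definitions, where E4M3 keeps the all-ones exponent for finite values and E5M2 reserves it for infinities and NaNs, IEEE style.

```python
# Illustrative only: dynamic range of the two FP8 encodings from the paper.
# E4M3 uses exponent bias 7 and only S.1111.111 encodes NaN, so the largest
# finite value is 1.75 * 2**8 = 448. E5M2 uses bias 15 with IEEE-style
# inf/NaN, so the largest finite value is 1.75 * 2**15 = 57344.

def max_finite_e4m3() -> float:
    return (1 + 6 / 8) * 2 ** (15 - 7)    # mantissa 0b110, exponent field 0b1111

def max_finite_e5m2() -> float:
    return (1 + 3 / 4) * 2 ** (30 - 15)   # mantissa 0b11, exponent field 0b11110

def min_normal(bias: int) -> float:
    return 2.0 ** (1 - bias)              # exponent field 1, mantissa 0

print("E4M3:", min_normal(7), "to", max_finite_e4m3())    # 0.015625 to 448.0
print("E5M2:", min_normal(15), "to", max_finite_e5m2())   # ~6.1e-05 to 57344.0
```

The extra exponent bit of E5M2 buys dynamic range, while E4M3 spends it on mantissa precision, which matches the recommendation above of E4M3 for weights and activations and E5M2 for gradients.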

docs/source/gpt_runtime.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -173,7 +173,7 @@ MPI_Init(&argc, &argv);
 // Get the number of ranks (size of the world).
 int worldSize;
 MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
-
+
 // Get the unique identifier for each rank.
 int rank;
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
````
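For readers who work from Python rather than C++, roughly the same world-size/rank handshake can be written with mpi4py. This is only a sketch for orientation, not code from the TensorRT-LLM documentation; the C++ snippet above is what the GPT runtime doc actually shows.

```python
# Illustrative Python counterpart of the C++ MPI snippet above (uses mpi4py;
# not part of the documentation being changed in this commit).
from mpi4py import MPI

comm = MPI.COMM_WORLD          # MPI is initialized when mpi4py.MPI is imported
world_size = comm.Get_size()   # number of ranks (size of the world)
rank = comm.Get_rank()         # unique identifier for this rank

print(f"rank {rank} of {world_size}")
```

Launched with, for example, `mpirun -np 2 python3 script.py`.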

docs/source/index.rst

Lines changed: 13 additions & 0 deletions
````diff
@@ -15,6 +15,7 @@ Welcome to TensorRT-LLM's documentation!
    batch_manager.md
    gpt_attention.md
    precision.md
+   installation.md
    performance.md
    2023-05-19-how-to-debug.md
    2023-05-17-how-to-add-a-new-model.md
@@ -65,3 +66,15 @@ Indices and tables
 * :ref:`genindex`
 * :ref:`modindex`
 * :ref:`search`
+
+
+Blogs
+----------
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Blogs
+   :hidden:
+
+   blogs/H100vsA100.md
+   blogs/H200launch.md
````

docs/source/installation.md

Lines changed: 6 additions & 6 deletions
````diff
@@ -1,4 +1,4 @@
-# Table of Contents
+# Build TensorRT-LLM

 - [Overview](#overview)
 - [Fetch the Sources](#fetch-the-sources)
@@ -153,8 +153,8 @@ The list of supported architectures can be found in the

 ### Build the Python Bindings for the C++ Runtime

-The C++ Runtime, in particular, [`GptSession`](../../cpp/include/tensorrt_llm/runtime/gptSession.h) can be exposed to
-Python via [bindings](../../cpp/tensorrt_llm/pybind/bindings.cpp). This is currently an opt-in feature which needs to be
+The C++ Runtime, in particular, [`GptSession`](source:cpp/include/tensorrt_llm/runtime/gptSession.h) can be exposed to
+Python via [bindings](source:cpp/tensorrt_llm/pybind/bindings.cpp). This is currently an opt-in feature which needs to be
 explicitly activated during compilation time. The corresponding option `--python_bindings` can be specified
 to `build_wheel.py` in the standard way:

@@ -164,7 +164,7 @@ python3 ./scripts/build_wheel.py --python_bindings --trt_root /usr/local/tensorr

 After installing the resulting wheel as described above, the C++ Runtime bindings will be available in
 package `tensorrt_llm.bindings`. Running `help` on this package in a Python interpreter will provide an overview of the
-relevant classes. The [associated unit tests](../../tests/bindings) should also be consulted for understanding the API.
+relevant classes. The [associated unit tests](source:tests/bindings) should also be consulted for understanding the API.

 ### Link with the TensorRT-LLM C++ Runtime

@@ -209,5 +209,5 @@ headers contained under `cpp` should not be included directly since they might
 change in future versions.

 For examples of how to use the C++ runtime, see the unit tests in
-[gptSessionTest.cpp](cpp/tests/runtime/gptSessionTest.cpp) and the related
-[CMakeLists.txt](cpp/tests/CMakeLists.txt) file.
+[gptSessionTest.cpp](source:cpp/tests/runtime/gptSessionTest.cpp) and the related
+[CMakeLists.txt](source:cpp/tests/CMakeLists.txt) file.
````
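As a quick check of the opt-in bindings described in this file, the package can be inspected from a Python interpreter exactly as the text suggests. A minimal sketch, assuming a wheel built with `--python_bindings` is installed in the current environment:

```python
# Minimal sketch: inspect the C++ runtime bindings after installing a wheel
# built with `python3 ./scripts/build_wheel.py --python_bindings ...`.
import tensorrt_llm.bindings as bindings

help(bindings)        # overview of the relevant classes exposed from the C++ runtime
print(dir(bindings))  # quick listing of the exported names
```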

docs/source/memory.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -75,8 +75,8 @@ The Python runtime allocates KV cache tensors based on the parameters of the `Ge

 ## Memory pool

-The TensorRT-LLM C++ runtime uses a stream-ordered memory allocator to allocate and free buffers, see [BufferManager::initMemoryPool](cpp/tensorrt_llm/runtime/bufferManager.cpp), which uses the default memory pool managed by the CUDA driver. When a `GptSession` object is destroyed, memory is returned to the memory pool and can be reused by the next instance of a `GptSession` object. Memory will be released from the pool if it is required for other memory allocations.
-However, `nvidia-smi` may still show high memory occupation after memory is returned to the CUDA driver's memory pool. This should not be a concern and is intended behavior. The amount of reserved and free memory in the pool can be inspected with [BufferManager::memoryPoolReserved()](cpp/tensorrt_llm/runtime/bufferManager.cpp) and [BufferManager::memoryPoolFree()](cpp/tensorrt_llm/runtime/bufferManager.cpp), respectively.
+The TensorRT-LLM C++ runtime uses a stream-ordered memory allocator to allocate and free buffers, see [BufferManager::initMemoryPool](source:cpp/tensorrt_llm/runtime/bufferManager.cpp), which uses the default memory pool managed by the CUDA driver. When a `GptSession` object is destroyed, memory is returned to the memory pool and can be reused by the next instance of a `GptSession` object. Memory will be released from the pool if it is required for other memory allocations.
+However, `nvidia-smi` may still show high memory occupation after memory is returned to the CUDA driver's memory pool. This should not be a concern and is intended behavior. The amount of reserved and free memory in the pool can be inspected with [BufferManager::memoryPoolReserved()](source:cpp/tensorrt_llm/runtime/bufferManager.cpp) and [BufferManager::memoryPoolFree()](source:cpp/tensorrt_llm/runtime/bufferManager.cpp), respectively.

 ## Known Issues

````
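The `nvidia-smi` caveat above comes from the fact that `nvidia-smi` reports the driver-level view of device memory, which still includes memory held in the CUDA memory pool. The snippet below only queries that driver-level view through NVML; it is an illustration, not TensorRT-LLM code, and assumes the `pynvml` package is installed. The pool-internal numbers come from the `BufferManager` helpers named in the text.

```python
# Driver-level memory as reported by nvidia-smi, queried via NVML.
# Illustration only (assumes `pip install pynvml`); the pool-internal
# reserved/free numbers come from BufferManager::memoryPoolReserved()
# and BufferManager::memoryPoolFree() in the C++ runtime.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"total: {info.total >> 20} MiB, used: {info.used >> 20} MiB, free: {info.free >> 20} MiB")
pynvml.nvmlShutdown()
```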

docs/source/performance.md

Lines changed: 12 additions & 8 deletions
````diff
@@ -10,7 +10,7 @@ performance that can be delivered by TensorRT-LLM.
 ## Methodology

 The different performance numbers below were collected using the methodology
-described in the benchmarks [folder](../../benchmarks/).
+described in the benchmarks [folder](source:benchmarks/).

 ## High Throughput

@@ -145,6 +145,7 @@ include a more efficient implementation that runs single Matmul + SwiGLU fused k
 ## Reproducing Benchmarked Results

 ### Building the TensorRT-LLM Container
+
 ---
 In order to benchmark TensorRT-LLM, you will need to follow the [Quick Start](../../README.md#quick-start)
 build process to create a baseline container for building a wheel. Additionally, the development
@@ -231,7 +232,8 @@ in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).

 ## Benchmarking per Model

-#### GPT-J 6B
+### GPT-J 6B
+
 ---
 ```shell
 python examples/gptj/build.py \
@@ -255,7 +257,7 @@ python examples/gptj/build.py \
     --enable_two_optimization_profiles
 ```

-##### Throughput Benchmark
+#### Throughput Benchmark

 ```shell
 in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "64:2048,2048")
@@ -269,7 +271,7 @@ do
 done
 ```

-##### First Token Latency Benchmark
+#### First Token Latency Benchmark

 ```shell
 in_out_sizes=("64:128,1" "64:2048,1")
@@ -285,6 +287,7 @@ done


 ### Llama2-7b
+
 ---
 ```shell
 pip install -r examples/llama/requirements.txt
@@ -313,7 +316,7 @@ python examples/llama/build.py \
     --hidden_act silu
 ```

-##### Throughput Benchmark
+#### Throughput Benchmark

 ```shell
 in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "32:2048,2048")
@@ -326,7 +329,7 @@ do
   ./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir /tmp/engines/llama/7b --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims
 done
 ```
-##### First Token Latency Benchmark
+#### First Token Latency Benchmark

 ```shell
 in_out_sizes=("64:128,1" "32:2048,1")
@@ -372,7 +375,7 @@ python examples/llama/build.py \
     --multiple_of 4096
 ```

-##### Throughput Benchmark
+#### Throughput Benchmark

 ```shell
 in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "64:2048,2048")
@@ -386,7 +389,7 @@ do
 done
 ```

-##### First Token Latency Benchmark
+#### First Token Latency Benchmark

 ```shell
 in_out_sizes=("64:128,1" "64:128,1")
@@ -402,6 +405,7 @@ done


 ### Falcon-180B
+
 ---

 Benchmarking Falcon-180B requires a custom engine per batch size, input/output sequence length due
````
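The `in_out_sizes` entries used throughout the benchmark loops above pack a batch size and an input,output length pair into one `batch:in,out` string. As a purely hypothetical convenience (this helper is not part of the repository), the same sweep can be expanded in Python, which also makes the exact `gptSessionBenchmark` invocations easy to audit before running them:

```python
# Hypothetical helper (not part of TensorRT-LLM): expand the "batch:in,out"
# strings used by the shell loops above into gptSessionBenchmark commands.
def benchmark_commands(model, engine_dir, in_out_sizes):
    cmds = []
    for spec in in_out_sizes:
        batch_size, in_out_dims = spec.split(":")   # e.g. "64:128,2048" -> "64", "128,2048"
        cmds.append(
            "./cpp/build/benchmarks/gptSessionBenchmark"
            f" --model {model} --engine_dir {engine_dir}"
            f" --warm_up 1 --batch_size {batch_size} --duration 0"
            f" --num_runs 5 --input_output_len {in_out_dims}"
        )
    return cmds

for cmd in benchmark_commands("llama", "/tmp/engines/llama/7b",
                              ["64:128,128", "64:128,2048", "64:2048,128", "32:2048,2048"]):
    print(cmd)
```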

examples/gpt/README.md

Lines changed: 42 additions & 0 deletions
````diff
@@ -535,3 +535,45 @@ python3 build.py --model_dir=./c-model/gpt2/2-gpu --dtype bfloat16 --world_size=

 mpirun -np 2 python3 ../summarize.py --engine_dir trt_engine/gpt2/bfloat16/2-gpu --hf_model_dir gpt2 --batch_size 10 --test_trt_llm --check_accuracy --tensorrt_llm_rouge1_threshold=14 --dataset_path ./dataset --no_add_special_tokens
 ```
+
+### Run LoRA with the Nemo checkpoint
+
+```bash
+git clone https://huggingface.co/nvidia/GPT-2B-001
+python3 nemo_ckpt_convert.py -i GPT-2B-001/GPT-2B-001_bf16_tp1.nemo -o /tmp/c-model/gpt-next-2B --tensor-parallelism 1 --storage-type bfloat16
+
+python3 build.py --model_dir=/tmp/c-model/gpt-next-2B/1-gpu/ \
+    --dtype bfloat16 \
+    --remove_input_padding \
+    --use_gpt_attention_plugin \
+    --output_dir /tmp/gpt-next-2B/ \
+    --use_lora_plugin \
+    --max_batch_size 4 \
+    --max_input_len 512 \
+    --max_output_len 50 \
+    --lora_target_modules "attn_qkv"
+
+python3 nemo_lora_convert.py -i tmp_nemo_ckpt/gpt2b_lora-900.nemo -o /tmp/gpt-next-2B/ -t bf16 # Assume LoRA weights are in tmp_nemo_ckpt/gpt2b_lora-900.nemo
+
+python3 ../run.py --max_output_len=20 \
+    --vocab_file=/tmp/c-model/gpt-next-2B/1-gpu/tokenizer.model \
+    --engine_dir /tmp/gpt-next-2B/ \
+    --lora_dir /tmp/gpt-next-2B/ \
+    --lora_task_uids "lora" \
+    --no_add_special_tokens \
+    --input_text "After Washington had returned to Williamsburg, Dinwiddie ordered him to lead a larger force to assist Trent in his work. While en route, Washington learned of Trent's retreat. Since Tanaghrisson had promised support to the British, Washington continued toward Fort Duquesne and met with the Mingo leader. Learning of a French scouting party in the area, Washington, with Tanaghrisson and his party, surprised the Canadians on May 28 in what became known as the Battle of Jumonville Glen. They killed many of the Canadians, including their commanding officer, Joseph Coulon de Jumonville, whose head was reportedly split open by Tanaghrisson with a tomahawk. The historian Fred Anderson suggests that Tanaghrisson was acting to gain the support of the British and regain authority over his own people. They had been inclined to support the French, with whom they had long trading relationships. One of Tanaghrisson's men told Contrecoeur that Jumonville had been killed by British musket fire. Question: Upon learning of a French scounting party in the area, what did Washington do? Answer:"
+```
+
+Users who want to skip the LoRA module may pass uid -1 with `--lora_task_uids -1`.
+In that case, the model will not run the LoRA module and the results will be
+different.
+
+```bash
+python3 ../run.py --max_output_len=20 \
+    --vocab_file=/tmp/c-model/gpt-next-2B/1-gpu/tokenizer.model \
+    --engine_dir /tmp/gpt-next-2B/ \
+    --lora_dir /tmp/gpt-next-2B/ \
+    --lora_task_uids "-1" \
+    --no_add_special_tokens \
+    --input_text "After Washington had returned to Williamsburg, Dinwiddie ordered him to lead a larger force to assist Trent in his work. While en route, Washington learned of Trent's retreat. Since Tanaghrisson had promised support to the British, Washington continued toward Fort Duquesne and met with the Mingo leader. Learning of a French scouting party in the area, Washington, with Tanaghrisson and his party, surprised the Canadians on May 28 in what became known as the Battle of Jumonville Glen. They killed many of the Canadians, including their commanding officer, Joseph Coulon de Jumonville, whose head was reportedly split open by Tanaghrisson with a tomahawk. The historian Fred Anderson suggests that Tanaghrisson was acting to gain the support of the British and regain authority over his own people. They had been inclined to support the French, with whom they had long trading relationships. One of Tanaghrisson's men told Contrecoeur that Jumonville had been killed by British musket fire. Question: Upon learning of a French scounting party in the area, what did Washington do? Answer:"
+```
````
