This repository was archived by the owner on Jul 4, 2025. It is now read-only.

Commit 59f41c0

Update TensorRT-LLM (NVIDIA#708)
* Update TensorRT-LLM
* update
* Bump version to 0.7.0
1 parent 0268914 commit 59f41c0

658 files changed: +2884866 / -425530 lines


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ cpp/cmake-build-*
 cpp/.ccache/
 tensorrt_llm/libs
 tensorrt_llm/bindings.pyi
+tensorrt_llm/bindings/*.pyi
 
 # Testing
 .coverage.*
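
The new pattern ignores the auto-generated Python stub files under the `tensorrt_llm/bindings/` package. To confirm the rule matches, plain `git check-ignore` works; the stub filename below is a hypothetical example, not a file named in this commit:

```bash
# Show which .gitignore rule matches a generated stub (hypothetical filename).
git check-ignore -v tensorrt_llm/bindings/executor.pyi
```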

.pre-commit-config.yaml

Lines changed: 3 additions & 4 deletions
@@ -15,7 +15,7 @@ repos:
     rev: v4.1.0
     hooks:
       - id: check-added-large-files
-        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin'
+        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
       - id: check-merge-conflict
       - id: check-symlinks
       - id: detect-private-key
@@ -33,9 +33,7 @@ repos:
       - id: clang-format
         types_or: [c++, c, cuda]
         exclude: |
-          (?x)^(
-            cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/.*
-          )$
+          (?x)^(.*cubin.cpp$ | .*fmha_cubin.h)$
   - repo: https://github.com/cheshirekow/cmake-format-precommit
     rev: v0.6.10
     hooks:
@@ -46,4 +44,5 @@ repos:
       - id: codespell
         args:
           - --skip=".git,3rdparty"
+          - --exclude-file=examples/whisper/tokenizer.py
           - --ignore-words-list=rouge,inout,atleast,strat
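
With the exclude patterns rewritten, a quick way to sanity-check them locally is to re-run just the affected hooks; this is generic `pre-commit` usage, not anything this commit adds:

```bash
# Re-run only the hooks whose exclude patterns changed, against every file.
# Requires `pip install pre-commit` and a checkout of this repository.
pre-commit run check-added-large-files --all-files
pre-commit run clang-format --all-files
pre-commit run codespell --all-files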

README.md

Lines changed: 27 additions & 15 deletions
@@ -8,7 +8,7 @@ TensorRT-LLM
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
 [![trt](https://img.shields.io/badge/TRT-9.2-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.6.1-green)](./setup.py)
+[![version](https://img.shields.io/badge/release-0.7.0-green)](./setup.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
 
 [Architecture](./docs/source/architecture.md)   |   [Results](./docs/source/performance.md)   |   [Examples](./examples/)   |   [Documentation](./docs/source/)
@@ -17,24 +17,29 @@ TensorRT-LLM
 <div align="left">
 
 ## Latest News
-* [2023/11/13] [**H200** achieves nearly **12,000 tok/sec on Llama2-13B**](./docs/source/blogs/H200launch.md)
+* [2023/12/04] [**Falcon-180B** on a **single H200** GPU with INT4 AWQ, and **6.7x faster Llama-70B** over A100](./docs/source/blogs/Falcon180B-H200.md)
 
-<img src="./docs/source/blogs/media/H200launch_tps.png" alt="H200 TPS" width="500" height="auto">
+<img src="./docs/source/blogs/media/Falcon180B-H200_H200vA100.png" alt="H200 TPS" width="400" height="auto">
 
-H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x faster than H100.
+H200 with INT4 AWQ, runs Falcon-180B on a _single_ GPU.
 
+H200 is now 2.4x faster on Llama-70B with recent improvements to TensorRT-LLM GQA; up to 6.7x faster than A100.
 
-* [2023/11/03] [TensorRT-LLM is up to **4.6x faster on H100 than A100**, achieving **10,000 tok/s at 100ms to first token.**](./docs/source/blogs/H100vsA100.md)
+* [2023/11/27] [SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
+* [2023/11/13] [H200 achieves nearly 12,000 tok/sec on Llama2-13B](./docs/source/blogs/H200launch.md)
 * [2023/10/22] [🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙](https://github.com/NVIDIA/trt-llm-rag-windows#readme)
 * [2023/10/19] Getting Started Guide - [Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available
 ](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/)
 * [2023/10/17] [Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows
 ](https://blogs.nvidia.com/blog/2023/10/17/tensorrt-llm-windows-stable-diffusion-rtx/)
-* [2023/9/9] [NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/)
 
-[2023/10/31 - Phind](https://www.phind.com/blog/phind-model-beats-gpt4-fast) ; [2023/10/12 - Databricks (MosaicML)](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices) ;
-[2023/10/4 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
-[2023/9/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);
+
+[2023/11/27 - Amazon Sagemaker](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
+[2023/11/17 - Perplexity](https://blog.perplexity.ai/blog/turbocharging-llama-2-70b-with-nvidia-h100) ;
+[2023/10/31 - Phind](https://www.phind.com/blog/phind-model-beats-gpt4-fast) ;
+[2023/10/12 - Databricks (MosaicML)](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices) ;
+[2023/10/04 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
+[2023/09/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);
 
 ## Table of Contents
 
@@ -145,10 +150,13 @@ mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m
 ```
 ***2. Build the engine***
 
-```python
+```bash
 # Single GPU on BLOOM 560M
-python build.py --model_dir ./bloom/560M/ \
+python convert_checkpoint.py --model_dir ./bloom/560M/ \
                 --dtype float16 \
+                --output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/
+# May need to add trtllm-build to PATH, export PATH=/usr/local/bin:$PATH
+trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/ \
                 --use_gemm_plugin float16 \
                 --use_gpt_attention_plugin float16 \
                 --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
@@ -161,7 +169,7 @@ See the BLOOM [example](examples/bloom) for more details and options regarding t
 The `../summarize.py` script can be used to perform the summarization of articles
 from the CNN Daily dataset:
 
-```python
+```bash
 python ../summarize.py --test_trt_llm \
                        --hf_model_dir ./bloom/560M/ \
                        --data_type fp16 \
@@ -239,10 +247,12 @@ the models listed in the [examples](examples/.) folder.
 The list of supported models is:
 
 * [Baichuan](examples/baichuan)
+* [BART](examples/enc_dec)
 * [Bert](examples/bert)
 * [Blip2](examples/blip2)
 * [BLOOM](examples/bloom)
 * [ChatGLM](examples/chatglm)
+* [FairSeq NMT](examples/nmt)
 * [Falcon](examples/falcon)
 * [Flan-T5](examples/enc_dec)
 * [GPT](examples/gpt)
@@ -252,7 +262,8 @@ The list of supported models is:
 * [InternLM](examples/internlm)
 * [LLaMA](examples/llama)
 * [LLaMA-v2](examples/llama)
-* [Mistral](examples/llama)
+* [mBART](examples/enc_dec)
+* [Mistral](examples/llama#mistral-v01)
 * [MPT](examples/mpt)
 * [mT5](examples/enc_dec)
 * [OPT](examples/opt)
@@ -261,9 +272,10 @@ The list of supported models is:
 * [SantaCoder](examples/gpt)
 * [StarCoder](examples/gpt)
 * [T5](examples/enc_dec)
+* [Whisper](examples/whisper)
 
 Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder
-support that contains many encoder-decoder models such as T5, Flan-T5, etc. We
+functionality that supports many encoder-decoder models such as T5 family, BART family, Whisper family, NMT family, etc. We
 unroll the exact model names in the list above to let users find specific
 models easier.
 
@@ -367,7 +379,7 @@ For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
 
 ### Change Log
 
-#### Versions 0.6.0 / 0.6.1
+#### Version 0.6.1
 
 * Models
   * ChatGLM3
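
The build-engine hunk above is the headline workflow change in 0.7.0: the one-shot `build.py` step is replaced by checkpoint conversion followed by `trtllm-build`. Stitched together from the diff itself (the clone command comes from the hunk's context line; the clone destination and running from the BLOOM example directory are assumptions that match the paths used in the surrounding README), the new flow reads roughly as:

```bash
# Fetch the HF checkpoint (destination directory assumed, matching the
# paths used below).
mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M

# Step 1: convert the Hugging Face checkpoint to a TensorRT-LLM checkpoint.
python convert_checkpoint.py --model_dir ./bloom/560M/ \
                             --dtype float16 \
                             --output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/

# Step 2: compile the converted checkpoint into a TensorRT engine.
# trtllm-build may need to be on PATH: export PATH=/usr/local/bin:$PATH
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/ \
             --use_gemm_plugin float16 \
             --use_gpt_attention_plugin float16 \
             --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```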

benchmarks/cpp/README.md

Lines changed: 5 additions & 1 deletion
@@ -18,9 +18,13 @@ instead, and be sure to set DLL paths as specified in
 
 ### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)
 
+#### Prepare TensorRT-LLM engine(s)
+
 Before you launch C++ benchmarking, please make sure that you have already built engine(s) using TensorRT-LLM API, C++ benchmarking code cannot generate engine(s) for you.
 
-You can reuse the engine built by benchmarking code for Python Runtime, please see that [`document`](../python/README.md).
+You can use the [`build.py`](../python/build.py) script to build the engine(s). Alternatively, if you have already benchmarked Python Runtime, you can reuse the engine(s) built by benchmarking code, please see that [`document`](../python/README.md).
+
+#### Launch benchmarking
 
 For detailed usage, you can do the following
 ```
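
Per the revised text, engines must exist before the C++ benchmark runs. A minimal sketch of that preparation step follows; the model name, flags, and binary path are illustrative assumptions, not taken from this commit:

```bash
# Build an engine with the referenced script (flags are illustrative
# assumptions; consult benchmarks/python/README.md for the real options).
python benchmarks/python/build.py --model gpt_350m

# The C++ benchmark only consumes pre-built engines. The binary location
# below assumes a default CMake build tree (also an assumption).
./cpp/build/benchmarks/gptSessionBenchmark --help
```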
