This repository was archived by the owner on Jul 4, 2025. It is now read-only.

Commit 8dd9c91

kaiyux and Shixiaowei02 authored
Update TensorRT-LLM (NVIDIA#539)
* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
1 parent 119e216 commit 8dd9c91

File tree

97 files changed: +1293 -399 lines


.pre-commit-config.yaml

Lines changed: 4 additions & 2 deletions
@@ -15,7 +15,7 @@ repos:
     rev: v4.1.0
     hooks:
       - id: check-added-large-files
-        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
+        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin'
       - id: check-merge-conflict
       - id: check-symlinks
       - id: detect-private-key
@@ -33,7 +33,9 @@ repos:
       - id: clang-format
         types_or: [c++, c, cuda]
         exclude: |
-          (?x)^(.*cubin.cpp$ | .*fmha_cubin.h)$
+          (?x)^(
+            cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/.*
+          )$
   - repo: https://github.com/cheshirekow/cmake-format-precommit
     rev: v0.6.10
     hooks:
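
The new clang-format exclude above uses Python's verbose-regex syntax, which pre-commit matches against each staged file path. A minimal, self-contained sketch (not part of the commit; the sample paths are only illustrative) for checking locally which paths the new pattern skips:

```python
import re

# The verbose-mode (?x) pattern from the updated clang-format exclude above,
# copied here so it can be sanity-checked outside of pre-commit.
EXCLUDE = re.compile(r"""(?x)^(
    cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/.*
)$""")

# Hypothetical paths, chosen only to illustrate the matching behaviour.
paths = [
    "cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_cubin.cpp",
    "cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h",
]
for path in paths:
    # Only files under the cubin/ directory are skipped by clang-format.
    status = "skipped by clang-format" if EXCLUDE.match(path) else "formatted"
    print(f"{path}: {status}")
```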

README.md

Lines changed: 119 additions & 7 deletions
@@ -8,7 +8,7 @@ TensorRT-LLM
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
 [![trt](https://img.shields.io/badge/TRT-9.1-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.5.0-green)](./setup.py)
+[![version](https://img.shields.io/badge/release-0.6.1-green)](./setup.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

 [Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@@ -36,7 +36,6 @@ H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x
 [2023/10/4 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
 [2023/9/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);

-
 ## Table of Contents

 - [TensorRT-LLM Overview](#tensorrt-llm-overview)
@@ -186,7 +185,8 @@ TensorRT-LLM is rigorously tested on the following GPUs:

 * [H100](https://www.nvidia.com/en-us/data-center/h100/)
 * [L40S](https://www.nvidia.com/en-us/data-center/l40s/)
-* [A100](https://www.nvidia.com/en-us/data-center/a100/)/[A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
+* [A100](https://www.nvidia.com/en-us/data-center/a100/)
+* [A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
 * [V100](https://www.nvidia.com/en-us/data-center/v100/) (experimental)

 If a GPU is not listed above, it is important to note that TensorRT-LLM is
@@ -254,14 +254,18 @@ The list of supported models is:
 * [LLaMA-v2](examples/llama)
 * [Mistral](examples/llama)
 * [MPT](examples/mpt)
+* [mT5](examples/enc_dec)
 * [OPT](examples/opt)
 * [Qwen](examples/qwen)
 * [Replit Code](examples/mpt)
 * [SantaCoder](examples/gpt)
 * [StarCoder](examples/gpt)
 * [T5](examples/enc_dec)

-Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder support that contains many encoder-decoder models such as T5, Flan-T5, etc. We unroll the exact model names in the list above to let users find specific models easier.
+Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder
+support that contains many encoder-decoder models such as T5, Flan-T5, etc. We
+unroll the exact model names in the list above to let users find specific
+models easier.

 ## Performance

@@ -325,7 +329,11 @@ enable plugins, for example: `--use_gpt_attention_plugin`.

 * MPI + Slurm

-TensorRT-LLM is a [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might encounter interferences:
+TensorRT-LLM is a
+[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package
+that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are
+running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might
+encounter interferences:
 ```
 --------------------------------------------------------------------------
 PMI2_Init failed to initialize. Return code: 14
@@ -347,19 +355,123 @@ SLURM, depending upon the SLURM version you are using:
 Please configure as appropriate and try again.
 --------------------------------------------------------------------------
 ```
-As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a dedicated MPI environment, not the one provided by your Slurm allocation.
+As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm
+node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a
+dedicated MPI environment, not the one provided by your Slurm allocation.
+
 For example: `mpirun -n 1 python3 examples/gpt/build.py ...`

 ## Release notes

-* TensorRT-LLM requires TensorRT 9.1.0.4 and 23.08 containers.
+* TensorRT-LLM requires TensorRT 9.2 and 23.10 containers.

 ### Change Log

+#### Versions 0.6.0 / 0.6.1
+
+* Models
+  * ChatGLM3
+  * InternLM (contributed by @wangruohui)
+  * Mistral 7B (developed in collaboration with Mistral.AI)
+  * MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
+  * Qwen (contributed by @Tlntin and @zhaohb)
+  * Replit Code V-1.5 3B (external contribution)
+  * T5, mT5, Flan-T5 (Python runtime only)
+
+* Features
+  * Add runtime statistics related to active requests and KV cache
+    utilization from the batch manager (see
+    the [batch manager](docs/source/batch_manager.md) documentation)
+  * Add `sequence_length` tensor to support proper lengths in beam-search
+    (when beam-width > 1 - see
+    [tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+  * BF16 support for encoder-decoder models (Python runtime - see
+    [examples/enc_dec](examples/enc_dec/README.md))
+  * Improvements to memory utilization (CPU and GPU - including memory
+    leaks)
+  * Improved error reporting and memory consumption
+  * Improved support for stop and bad words
+  * INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see
+    [examples/baichuan](examples/baichuan/README.md))
+  * INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only
+    support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
+  * INT4 AWQ support for the Falcon models
+    (see [examples/falcon](examples/falcon/README.md))
+  * LoRA support (functional preview only - limited to the Python runtime,
+    only QKV support and not optimized in terms of runtime performance) for
+    the GPT model (see the
+    [Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint)
+    in the GPT example)
+  * Multi-GPU support for encoder-decoder models (Python runtime - see
+    [examples/enc_dec](examples/enc_dec/README.md))
+  * New heuristic for launching the Multi-block Masked MHA kernel (similar
+    to FlashDecoding - see
+    [decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
+  * Prompt-Tuning support for GPT and LLaMA models (see the
+    [Prompt-tuning](examples/gpt/README.md#Prompt-tuning) Section in the GPT example)
+  * Performance optimizations in various CUDA kernels
+  * Possibility to exclude input tokens from the output (see `excludeInputInOutput` in
+    [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+  * Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
+  * Support for different micro batch sizes for context and generation
+    phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and
+    `GptSession::Config::genMicroBatchSize` in
+    [tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
+  * Support for "remove input padding" for encoder-decoder models (see
+    [examples/enc_dec](examples/enc_dec/README.md))
+  * Support for context and generation logits (see `mComputeContextLogits` and
+    `mComputeGenerationLogits` in
+    [tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
+  * Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and
+    `"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+  * Update to CUTLASS 3.x
+
+* Bug fixes
+  * Fix for ChatGLM2 #93 and #138
+  * Fix tensor names error "RuntimeError: Tensor names
+    (`host_max_kv_cache_length`) in engine are not the same as expected in
+    the main branch" #369
+  * Fix weights split issue in BLOOM when `world_size = 2` ("array split
+    does not result in an equal division") #374
+  * Fix SmoothQuant multi-GPU failure with tensor parallelism is 2 #267
+  * Fix a crash in GenerationSession if stream keyword argument is not None
+    #202
+  * Fix a typo when calling PyNVML API [BUG] code bug #410
+  * Fix bugs related to the improper management of the `end_id` for various
+    models [C++ and Python]
+  * Fix memory leaks [C++ code and Python models]
+  * Fix the std::alloc error when running the gptManagerBenchmark -- issue
+    gptManagerBenchmark std::bad_alloc error #66
+  * Fix a bug in pipeline parallelism when beam-width > 1
+  * Fix a bug with Llama GPTQ due to improper support of GQA
+  * Fix issue #88
+  * Fix an issue with the Huggingface Transformers version #16
+  * Fix link jump in windows readme.md #30 - by @yuanlehome
+  * Fix typo in batchScheduler.h #56 - by @eltociear
+  * Fix typo #58 - by @RichardScottOZ
+  * Fix Multi-block MMHA: Difference between `max_batch_size` in the engine
+    builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
+  * Fix the log message to be more accurate on KV cache #224
+  * Fix Windows release wheel installation: Failed to install the release
+    wheel for Windows using pip #261
+  * Fix missing torch dependencies: [BUG] The batch_manage.a choice error
+    in --cpp-only when torch's cxx_abi version is different with gcc #151
+  * Fix linking error during compiling google-test & benchmarks #277
+  * Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by
+    the lack of bfloat16 #335
+  * Minor bug fixes
+
+#### Version 0.5.0
+
 * TensorRT-LLM v0.5.0 is the first public release.

 ### Known Issues

+* The hang reported in issue
+  [#149](https://github.com/triton-inference-server/tensorrtllm_backend/issues/149)
+  has not been reproduced by the TensorRT-LLM team. If it is caused by a bug
+  in TensorRT-LLM, that bug may be present in that release
+
 ### Report Issues

 You can use GitHub issues to report issues with TensorRT-LLM.
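
Regarding the MPI + Slurm note rewrapped in the README hunk above: the `mpirun -n 1` prefix gives the script its own single-rank MPI world instead of the PMI environment Slurm tries to provide. A tiny sketch (an illustration only, not repository code) of what a TensorRT-LLM-style script sees through mpi4py when launched that way:

```python
# Save as check_mpi.py and run on a Slurm node as:
#   mpirun -n 1 python3 check_mpi.py
# mpi4py then initializes a private single-rank communicator rather than
# attaching to the PMI setup of the surrounding Slurm allocation.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")  # expect: rank 0 of 1
```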

benchmarks/cpp/gptSessionBenchmark.cpp

Lines changed: 0 additions & 6 deletions
@@ -40,14 +40,8 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
     std::shared_ptr<nvinfer1::ILogger> const& logger, int warmUp, int numRuns, int duration,
     GptSession::Config& sessionConfig, bool cudaGraphMode, bool printAllLogits)
 {
-
     std::string modelNameHyphen = modelName;
     std::filesystem::path jsonFileName = dataPath / "config.json";
-    if (tc::strStartsWith(modelName, "chatglm"))
-    {
-        std::replace(modelNameHyphen.begin(), modelNameHyphen.end(), '_', '-');
-        jsonFileName = dataPath / (modelNameHyphen + std::string("-config.json"));
-    }
     auto const json = GptJsonConfig::parse(jsonFileName);
     auto const modelConfig = json.getModelConfig();
     auto const inputPacked = modelConfig.usePackedInput();

benchmarks/python/all_reduce.py

Lines changed: 3 additions & 1 deletion
@@ -15,8 +15,10 @@

 from argparse import ArgumentParser

-import tensorrt as trt
+# isort: off
 import torch
+import tensorrt as trt
+# isort: on
 from cuda import cuda, cudart
 from mpi4py import MPI
 from polygraphy.backend.trt import CreateConfig, EngineFromNetwork
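
The import reorder above is fenced with isort action comments so automated import sorting does not undo it. A minimal sketch of the idiom; the commit does not state why torch must come first, so the motivation in the comment below is only a plausible assumption:

```python
# isort: off
import torch            # presumably imported first so its bundled CUDA/C++ runtime
import tensorrt as trt  # libraries are already loaded when TensorRT is imported
# isort: on
# Between "isort: off" and "isort: on", isort leaves the ordering untouched,
# so this deliberate sequence survives future automated re-sorting.
```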

benchmarks/python/allowed_configs.py

Lines changed: 2 additions & 0 deletions
@@ -376,6 +376,7 @@ class ModelConfig(BaseModel):
         build_config=BuildConfig(
             num_layers=28,
             num_heads=32,
+            num_kv_heads=2,
             hidden_size=4096,
             vocab_size=65024,
             hidden_act='swiglu',
@@ -393,6 +394,7 @@ class ModelConfig(BaseModel):
         build_config=BuildConfig(
             num_layers=28,
             num_heads=32,
+            num_kv_heads=2,
             hidden_size=4096,
             vocab_size=65024,
             hidden_act='swiglu',
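
The `num_kv_heads=2` entries added above describe grouped-query attention shapes for these benchmark configs. As a rough illustration (a generic PyTorch sketch, not TensorRT-LLM kernel code) of what 32 query heads sharing 2 KV heads means for the tensors involved:

```python
import torch

# Minimal grouped-query attention (GQA) shape demo with num_heads=32 and
# num_kv_heads=2: every group of 16 query heads shares one K/V head, which
# shrinks the KV cache by a factor of 16 relative to full multi-head attention.
batch, seq, head_dim = 1, 8, 128
num_heads, num_kv_heads = 32, 2
group = num_heads // num_kv_heads  # 16 query heads per KV head

q = torch.randn(batch, num_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)
v = torch.randn(batch, num_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads, then run standard attention.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
attn = torch.softmax(q @ k.transpose(-1, -2) / head_dim**0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 32, 8, 128])
```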

benchmarks/python/base_benchmark.py

Lines changed: 3 additions & 1 deletion
@@ -45,7 +45,9 @@ def get_engine_name(model, dtype, tp_size, rank):

 def serialize_engine(engine, path):
     with open(path, 'wb') as f:
-        f.write(bytearray(engine))
+        # engine object is already complies with python buffer protocol, no need to
+        # convert it to bytearray before write, converting to bytearray consumes lots of memory
+        f.write(engine)


 class BaseBenchmark(object):
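
The `serialize_engine` change above writes the engine object straight to the file because it already exposes the Python buffer protocol; wrapping it in `bytearray` first materializes a second full-size copy in host memory. A small standalone sketch of the difference, using a plain bytes buffer as a stand-in for the serialized engine:

```python
# Stand-in for a serialized engine: any object exposing the buffer protocol.
engine_blob = bytes(16 * 1024 * 1024)  # 16 MiB of zeros, purely illustrative

with open("/tmp/engine.bin", "wb") as f:
    f.write(engine_blob)               # new path: write consumes the buffer directly
    # f.write(bytearray(engine_blob))  # old path: allocates an extra 16 MiB copy first
```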

benchmarks/python/bert_benchmark.py

Lines changed: 3 additions & 1 deletion
@@ -16,8 +16,10 @@
 import time
 from collections import OrderedDict

-import tensorrt as trt
+# isort: off
 import torch
+import tensorrt as trt
+#isort: on
 from allowed_configs import get_build_config
 from base_benchmark import BaseBenchmark, serialize_engine


(unnamed Git LFS pointer file)

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:81a5a7da16e6d2c5bc50c0d77aedc26e8cbb7eac1e94b7df95df364fe8d404c1
-size 1703538
+oid sha256:5fe5cd33bfbd9b6c96417e9cb8d005916b73a8239a44360d5b32d8c1d41bd475
+size 1701210
(unnamed Git LFS pointer file)

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:72fa54cfed6f0fa0b9c4a2590909290d3fa1498e5882210a8d354b027f251e9b
-size 1713878
+oid sha256:86bf2ceb58342221f56416b9d469f2bfbb1e7e145bd9a4be95efdb80f8ea92c7
+size 1711238
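
The two hunks above only touch Git LFS pointer files: the actual binaries live in LFS storage, and the repository tracks a three-line pointer (spec version, SHA-256 object id, size in bytes). A small sketch, using the new pointer from the first hunk as sample input, of how such a pointer can be read:

```python
# Parse a Git LFS pointer file into a dict; the text below is the new pointer
# committed in the first LFS hunk above.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5fe5cd33bfbd9b6c96417e9cb8d005916b73a8239a44360d5b32d8c1d41bd475
size 1701210
"""

pointer = dict(line.split(" ", 1) for line in pointer_text.splitlines())
print(pointer["oid"])        # sha256:5fe5...bd475, the object stored in LFS
print(int(pointer["size"]))  # 1701210 bytes in LFS storage
```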
(unnamed file: batch manager static library checksums)

Lines changed: 3 additions & 3 deletions
@@ -1,3 +1,3 @@
-ab705e21b4d4527a1177e1bff0c303e5 libtensorrt_llm_batch_manager_static.a
-c6947ed12be84196549e86305da71844 libtensorrt_llm_batch_manager_static.pre_cxx11.a
-fba9849f41452995d351e7030d68c98c1f3b3230 commit
+c70c75df9894675075a7a8e61d827017 libtensorrt_llm_batch_manager_static.a
+26f8306f855b80f5f47600b1fd476f34 libtensorrt_llm_batch_manager_static.pre_cxx11.a
+c1df7c3bc2ffd9233c0adf1b09229cde24385f10 commit
