This repository was archived by the owner on Jul 4, 2025. It is now read-only.

Commit 8dd9c91

kaiyux and Shixiaowei02 authored
Update TensorRT-LLM (NVIDIA#539)
* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
1 parent 119e216 commit 8dd9c91

File tree

97 files changed: +1293 -399 lines


.pre-commit-config.yaml

Lines changed: 4 additions & 2 deletions
@@ -15,7 +15,7 @@ repos:
     rev: v4.1.0
     hooks:
       - id: check-added-large-files
-        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
+        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin'
       - id: check-merge-conflict
       - id: check-symlinks
       - id: detect-private-key
@@ -33,7 +33,9 @@ repos:
       - id: clang-format
         types_or: [c++, c, cuda]
         exclude: |
-          (?x)^(.*cubin.cpp$ | .*fmha_cubin.h)$
+          (?x)^(
+            cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/.*
+          )$
   - repo: https://github.com/cheshirekow/cmake-format-precommit
     rev: v0.6.10
     hooks:
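
The new clang-format exclude above uses Python's verbose-regex syntax, which pre-commit matches against each staged file path. A minimal, self-contained sketch (not part of the commit; the sample paths are only illustrative) for checking locally which paths the new pattern skips:

```python
import re

# The verbose-mode (?x) pattern from the updated clang-format exclude above,
# copied here so it can be sanity-checked outside of pre-commit.
EXCLUDE = re.compile(r"""(?x)^(
    cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/.*
)$""")

# Hypothetical paths, chosen only to illustrate the matching behaviour.
paths = [
    "cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_cubin.cpp",
    "cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h",
]
for path in paths:
    # Only files under the cubin/ directory are skipped by clang-format.
    status = "skipped by clang-format" if EXCLUDE.match(path) else "formatted"
    print(f"{path}: {status}")
```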

README.md

Lines changed: 119 additions & 7 deletions
@@ -8,7 +8,7 @@ TensorRT-LLM
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
 [![trt](https://img.shields.io/badge/TRT-9.1-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.5.0-green)](./setup.py)
+[![version](https://img.shields.io/badge/release-0.6.1-green)](./setup.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

 [Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@@ -36,7 +36,6 @@ H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x
 [2023/10/4 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
 [2023/9/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);

-
 ## Table of Contents

 - [TensorRT-LLM Overview](#tensorrt-llm-overview)
@@ -186,7 +185,8 @@ TensorRT-LLM is rigorously tested on the following GPUs:

 * [H100](https://www.nvidia.com/en-us/data-center/h100/)
 * [L40S](https://www.nvidia.com/en-us/data-center/l40s/)
-* [A100](https://www.nvidia.com/en-us/data-center/a100/)/[A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
+* [A100](https://www.nvidia.com/en-us/data-center/a100/)
+* [A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
 * [V100](https://www.nvidia.com/en-us/data-center/v100/) (experimental)

 If a GPU is not listed above, it is important to note that TensorRT-LLM is
@@ -254,14 +254,18 @@ The list of supported models is:
 * [LLaMA-v2](examples/llama)
 * [Mistral](examples/llama)
 * [MPT](examples/mpt)
+* [mT5](examples/enc_dec)
 * [OPT](examples/opt)
 * [Qwen](examples/qwen)
 * [Replit Code](examples/mpt)
 * [SantaCoder](examples/gpt)
 * [StarCoder](examples/gpt)
 * [T5](examples/enc_dec)

-Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder support that contains many encoder-decoder models such as T5, Flan-T5, etc. We unroll the exact model names in the list above to let users find specific models easier.
+Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder
+support that contains many encoder-decoder models such as T5, Flan-T5, etc. We
+unroll the exact model names in the list above to let users find specific
+models easier.

 ## Performance

@@ -325,7 +329,11 @@ enable plugins, for example: `--use_gpt_attention_plugin`.

 * MPI + Slurm

-TensorRT-LLM is a [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might encounter interferences:
+TensorRT-LLM is a
+[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package
+that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are
+running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might
+encounter interferences:
 ```
 --------------------------------------------------------------------------
 PMI2_Init failed to initialize. Return code: 14
@@ -347,19 +355,123 @@ SLURM, depending upon the SLURM version you are using:
 Please configure as appropriate and try again.
 --------------------------------------------------------------------------
 ```
-As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a dedicated MPI environment, not the one provided by your Slurm allocation.
+As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm
+node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a
+dedicated MPI environment, not the one provided by your Slurm allocation.
+
 For example: `mpirun -n 1 python3 examples/gpt/build.py ...`

 ## Release notes

-* TensorRT-LLM requires TensorRT 9.1.0.4 and 23.08 containers.
+* TensorRT-LLM requires TensorRT 9.2 and 23.10 containers.

 ### Change Log

+#### Versions 0.6.0 / 0.6.1
+
+* Models
+  * ChatGLM3
+  * InternLM (contributed by @wangruohui)
+  * Mistral 7B (developed in collaboration with Mistral.AI)
+  * MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
+  * Qwen (contributed by @Tlntin and @zhaohb)
+  * Replit Code V-1.5 3B (external contribution)
+  * T5, mT5, Flan-T5 (Python runtime only)
+
+* Features
+  * Add runtime statistics related to active requests and KV cache
+    utilization from the batch manager (see
+    the [batch manager](docs/source/batch_manager.md) documentation)
+  * Add `sequence_length` tensor to support proper lengths in beam-search
+    (when beam-width > 1 - see
+    [tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+  * BF16 support for encoder-decoder models (Python runtime - see
+    [examples/enc_dec](examples/enc_dec/README.md))
+  * Improvements to memory utilization (CPU and GPU - including memory
+    leaks)
+  * Improved error reporting and memory consumption
+  * Improved support for stop and bad words
+  * INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see
+    [examples/baichuan](examples/baichuan/README.md))
+  * INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only
+    support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
+  * INT4 AWQ support for the Falcon models
+    (see [examples/falcon](examples/falcon/README.md))
+  * LoRA support (functional preview only - limited to the Python runtime,
+    only QKV support and not optimized in terms of runtime performance) for
+    the GPT model (see the
+    [Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint)
+    in the GPT example)
+  * Multi-GPU support for encoder-decoder models (Python runtime - see
+    [examples/enc_dec](examples/enc_dec/README.md))
+  * New heuristic for launching the Multi-block Masked MHA kernel (similar
+    to FlashDecoding - see
+    [decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
+  * Prompt-Tuning support for GPT and LLaMA models (see the
+    [Prompt-tuning](examples/gpt/README.md#Prompt-tuning) Section in the GPT example)
+  * Performance optimizations in various CUDA kernels
+  * Possibility to exclude input tokens from the output (see `excludeInputInOutput` in
+    [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+  * Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
+  * Support for different micro batch sizes for context and generation
+    phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and
+    `GptSession::Config::genMicroBatchSize` in
+    [tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
+  * Support for "remove input padding" for encoder-decoder models (see
+    [examples/enc_dec](examples/enc_dec/README.md))
+  * Support for context and generation logits (see `mComputeContextLogits` and
+    `mComputeGenerationLogits` in
+    [tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
+  * Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and
+    `"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
+  * Update to CUTLASS 3.x
+
+* Bug fixes
+  * Fix for ChatGLM2 #93 and #138
+  * Fix tensor names error "RuntimeError: Tensor names
+    (`host_max_kv_cache_length`) in engine are not the same as expected in
+    the main branch" #369
+  * Fix weights split issue in BLOOM when `world_size = 2` ("array split
+    does not result in an equal division") #374
+  * Fix SmoothQuant multi-GPU failure with tensor parallelism is 2 #267
+  * Fix a crash in GenerationSession if stream keyword argument is not None
+    #202
+  * Fix a typo when calling PyNVML API [BUG] code bug #410
+  * Fix bugs related to the improper management of the `end_id` for various
+    models [C++ and Python]
+  * Fix memory leaks [C++ code and Python models]
+  * Fix the std::alloc error when running the gptManagerBenchmark -- issue
+    gptManagerBenchmark std::bad_alloc error #66
+  * Fix a bug in pipeline parallelism when beam-width > 1
+  * Fix a bug with Llama GPTQ due to improper support of GQA
+  * Fix issue #88
+  * Fix an issue with the Huggingface Transformers version #16
+  * Fix link jump in windows readme.md #30 - by @yuanlehome
+  * Fix typo in batchScheduler.h #56 - by @eltociear
+  * Fix typo #58 - by @RichardScottOZ
+  * Fix Multi-block MMHA: Difference between `max_batch_size` in the engine
+    builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
+  * Fix the log message to be more accurate on KV cache #224
+  * Fix Windows release wheel installation: Failed to install the release
+    wheel for Windows using pip #261
+  * Fix missing torch dependencies: [BUG] The batch_manage.a choice error
+    in --cpp-only when torch's cxx_abi version is different with gcc #151
+  * Fix linking error during compiling google-test & benchmarks #277
+  * Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by
+    the lack of bfloat16 #335
+  * Minor bug fixes
+
+#### Version 0.5.0
+
 * TensorRT-LLM v0.5.0 is the first public release.

 ### Known Issues

+* The hang reported in issue
+  [#149](https://github.com/triton-inference-server/tensorrtllm_backend/issues/149)
+  has not been reproduced by the TensorRT-LLM team. If it is caused by a bug
+  in TensorRT-LLM, that bug may be present in that release
+
 ### Report Issues

 You can use GitHub issues to report issues with TensorRT-LLM.
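
Regarding the MPI + Slurm note rewrapped in the README hunk above: the `mpirun -n 1` prefix gives the script its own single-rank MPI world instead of the PMI environment Slurm tries to provide. A tiny sketch (an illustration only, not repository code) of what a TensorRT-LLM-style script sees through mpi4py when launched that way:

```python
# Save as check_mpi.py and run on a Slurm node as:
#   mpirun -n 1 python3 check_mpi.py
# mpi4py then initializes a private single-rank communicator rather than
# attaching to the PMI setup of the surrounding Slurm allocation.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")  # expect: rank 0 of 1
```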

benchmarks/cpp/gptSessionBenchmark.cpp

Lines changed: 0 additions & 6 deletions
@@ -40,14 +40,8 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
     std::shared_ptr<nvinfer1::ILogger> const& logger, int warmUp, int numRuns, int duration,
     GptSession::Config& sessionConfig, bool cudaGraphMode, bool printAllLogits)
 {
-
     std::string modelNameHyphen = modelName;
     std::filesystem::path jsonFileName = dataPath / "config.json";
-    if (tc::strStartsWith(modelName, "chatglm"))
-    {
-        std::replace(modelNameHyphen.begin(), modelNameHyphen.end(), '_', '-');
-        jsonFileName = dataPath / (modelNameHyphen + std::string("-config.json"));
-    }
     auto const json = GptJsonConfig::parse(jsonFileName);
     auto const modelConfig = json.getModelConfig();
     auto const inputPacked = modelConfig.usePackedInput();

benchmarks/python/all_reduce.py

Lines changed: 3 additions & 1 deletion
@@ -15,8 +15,10 @@

 from argparse import ArgumentParser

-import tensorrt as trt
+# isort: off
 import torch
+import tensorrt as trt
+# isort: on
 from cuda import cuda, cudart
 from mpi4py import MPI
 from polygraphy.backend.trt import CreateConfig, EngineFromNetwork
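
The import reorder above is fenced with isort action comments so automated import sorting does not undo it. A minimal sketch of the idiom; the commit does not state why torch must come first, so the motivation in the comment below is only a plausible assumption:

```python
# isort: off
import torch            # presumably imported first so its bundled CUDA/C++ runtime
import tensorrt as trt  # libraries are already loaded when TensorRT is imported
# isort: on
# Between "isort: off" and "isort: on", isort leaves the ordering untouched,
# so this deliberate sequence survives future automated re-sorting.
```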

benchmarks/python/allowed_configs.py

Lines changed: 2 additions & 0 deletions
@@ -376,6 +376,7 @@ class ModelConfig(BaseModel):
         build_config=BuildConfig(
             num_layers=28,
             num_heads=32,
+            num_kv_heads=2,
             hidden_size=4096,
             vocab_size=65024,
             hidden_act='swiglu',
@@ -393,6 +394,7 @@ class ModelConfig(BaseModel):
         build_config=BuildConfig(
             num_layers=28,
             num_heads=32,
+            num_kv_heads=2,
             hidden_size=4096,
             vocab_size=65024,
             hidden_act='swiglu',
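
The `num_kv_heads=2` entries added above describe grouped-query attention shapes for these benchmark configs. As a rough illustration (a generic PyTorch sketch, not TensorRT-LLM kernel code) of what 32 query heads sharing 2 KV heads means for the tensors involved:

```python
import torch

# Minimal grouped-query attention (GQA) shape demo with num_heads=32 and
# num_kv_heads=2: every group of 16 query heads shares one K/V head, which
# shrinks the KV cache by a factor of 16 relative to full multi-head attention.
batch, seq, head_dim = 1, 8, 128
num_heads, num_kv_heads = 32, 2
group = num_heads // num_kv_heads  # 16 query heads per KV head

q = torch.randn(batch, num_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)
v = torch.randn(batch, num_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads, then run standard attention.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
attn = torch.softmax(q @ k.transpose(-1, -2) / head_dim**0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 32, 8, 128])
```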

benchmarks/python/base_benchmark.py

Lines changed: 3 additions & 1 deletion
@@ -45,7 +45,9 @@ def get_engine_name(model, dtype, tp_size, rank):

 def serialize_engine(engine, path):
     with open(path, 'wb') as f:
-        f.write(bytearray(engine))
+        # engine object is already complies with python buffer protocol, no need to
+        # convert it to bytearray before write, converting to bytearray consumes lots of memory
+        f.write(engine)


 class BaseBenchmark(object):
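
The `serialize_engine` change above writes the engine object straight to the file because it already exposes the Python buffer protocol; wrapping it in `bytearray` first materializes a second full-size copy in host memory. A small standalone sketch of the difference, using a plain bytes buffer as a stand-in for the serialized engine:

```python
# Stand-in for a serialized engine: any object exposing the buffer protocol.
engine_blob = bytes(16 * 1024 * 1024)  # 16 MiB of zeros, purely illustrative

with open("/tmp/engine.bin", "wb") as f:
    f.write(engine_blob)               # new path: write consumes the buffer directly
    # f.write(bytearray(engine_blob))  # old path: allocates an extra 16 MiB copy first
```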

benchmarks/python/bert_benchmark.py

Lines changed: 3 additions & 1 deletion
@@ -16,8 +16,10 @@
 import time
 from collections import OrderedDict

-import tensorrt as trt
+# isort: off
 import torch
+import tensorrt as trt
+#isort: on
 from allowed_configs import get_build_config
 from base_benchmark import BaseBenchmark, serialize_engine


(unnamed Git LFS pointer file)

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:81a5a7da16e6d2c5bc50c0d77aedc26e8cbb7eac1e94b7df95df364fe8d404c1
-size 1703538
+oid sha256:5fe5cd33bfbd9b6c96417e9cb8d005916b73a8239a44360d5b32d8c1d41bd475
+size 1701210
(unnamed Git LFS pointer file)

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:72fa54cfed6f0fa0b9c4a2590909290d3fa1498e5882210a8d354b027f251e9b
-size 1713878
+oid sha256:86bf2ceb58342221f56416b9d469f2bfbb1e7e145bd9a4be95efdb80f8ea92c7
+size 1711238
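
The two hunks above only touch Git LFS pointer files: the actual binaries live in LFS storage, and the repository tracks a three-line pointer (spec version, SHA-256 object id, size in bytes). A small sketch, using the new pointer from the first hunk as sample input, of how such a pointer can be read:

```python
# Parse a Git LFS pointer file into a dict; the text below is the new pointer
# committed in the first LFS hunk above.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5fe5cd33bfbd9b6c96417e9cb8d005916b73a8239a44360d5b32d8c1d41bd475
size 1701210
"""

pointer = dict(line.split(" ", 1) for line in pointer_text.splitlines())
print(pointer["oid"])        # sha256:5fe5...bd475, the object stored in LFS
print(int(pointer["size"]))  # 1701210 bytes in LFS storage
```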
(unnamed file: batch manager static library checksums)

Lines changed: 3 additions & 3 deletions
@@ -1,3 +1,3 @@
-ab705e21b4d4527a1177e1bff0c303e5 libtensorrt_llm_batch_manager_static.a
-c6947ed12be84196549e86305da71844 libtensorrt_llm_batch_manager_static.pre_cxx11.a
-fba9849f41452995d351e7030d68c98c1f3b3230 commit
+c70c75df9894675075a7a8e61d827017 libtensorrt_llm_batch_manager_static.a
+26f8306f855b80f5f47600b1fd476f34 libtensorrt_llm_batch_manager_static.pre_cxx11.a
+c1df7c3bc2ffd9233c0adf1b09229cde24385f10 commit
