@@ -16,9 +16,11 @@ how it returns completed requests to the user.
 ## The Batch Manager API
 
 A software component (called the client in the text that follows) can interact
-with the batch manager using two main callbacks. Their signatures are defined
+with the batch manager using two mandatory and several optional callbacks. Their signatures are defined
 in the [`callbacks.h`](source:cpp/include/tensorrt_llm/batch_manager/callbacks.h) file.
 
+These callbacks are invoked in the generation loop at regular intervals and serve a variety of functions described below.
+
 ### Get and Send Callbacks
 
 The entry point to pass new requests to the batch manager is a callback of type
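As a hedged illustration (not the library's prescribed pattern), a client-side implementation of `GetInferenceRequestsCallback` might drain a server-owned queue. This sketch assumes the alias in `callbacks.h` is a function taking the maximum number of requests to return and yielding a list of `InferenceRequest` pointers; the queue and helper names are hypothetical, and the header is authoritative:

```cpp
#include <cstdint>
#include <list>
#include <memory>
#include <mutex>

#include "tensorrt_llm/batch_manager/InferenceRequest.h"

using tensorrt_llm::batch_manager::InferenceRequest;

// Hypothetical server-side queue, filled by the threads that accept client traffic.
std::mutex gQueueMutex;
std::list<std::shared_ptr<InferenceRequest>> gPendingRequests;

// Conforming to GetInferenceRequestsCallback: return at most maxNumRequests new
// requests. An empty list simply means no new work arrived this iteration.
std::list<std::shared_ptr<InferenceRequest>> getInferenceRequests(int32_t maxNumRequests)
{
    std::list<std::shared_ptr<InferenceRequest>> batch;
    std::lock_guard<std::mutex> lock(gQueueMutex);
    while (!gPendingRequests.empty() && static_cast<int32_t>(batch.size()) < maxNumRequests)
    {
        batch.push_back(gPendingRequests.front());
        gPendingRequests.pop_front();
    }
    return batch;
}
```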
@@ -42,7 +44,7 @@ tensor. See
 [`InferenceRequest.h`](source:cpp/include/tensorrt_llm/batch_manager/InferenceRequest.h)
 for more details.
 
-The responses are delivered to the client through a callback of type
+Responses are delivered to the client through a callback of type
 `SendResponseCallback`. A conforming callback must accept the 64-bit
 request ID that uniquely identifies the request, the list of output tensors,
 a boolean (identifying the last response for the request when set to
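A matching sketch for the response side, with the parameter list inferred from the description above (request ID, output tensors, final-response flag) plus a trailing error string as carried by recent headers; treat `callbacks.h` as the source of truth. The tensors are only logged here to keep the sketch self-contained; a real server would hand them to its transport layer:

```cpp
#include <cstdint>
#include <iostream>
#include <list>
#include <string>

#include "tensorrt_llm/batch_manager/NamedTensor.h"

using tensorrt_llm::batch_manager::NamedTensor;

// Conforming to SendResponseCallback: deliver the output tensors for requestId
// back to the client that owns the request.
void sendResponse(uint64_t requestId, std::list<NamedTensor> const& outputTensors,
    bool isFinalResponse, std::string const& errMsg)
{
    if (!errMsg.empty())
    {
        std::cerr << "request " << requestId << " failed: " << errMsg << '\n';
        return;
    }
    std::cout << "request " << requestId << ": " << outputTensors.size() << " output tensors"
              << (isFinalResponse ? " (final response, work item can be retired)" : "") << '\n';
}
```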
@@ -94,9 +96,10 @@ The statistics are packaged as a JSON string. That string contains three fields:
 
 ### GptManager Design
 
-GptManager is designed to integrate into an inference server that's managing a pool of
+Batch Manager is designed to integrate into an inference server that's executing a pool of
 active work items populated by a stream of requests actively received
-by the server. GptManager spawns a worker thread in its constructor that then
+by the server. GptManager assumes a GPT-style autoregressive model architecture.
+GptManager spawns a worker thread in its constructor that then
 persistently runs the token generation loop. The worker thread invokes `GetInferenceRequestsCallback`
 at the start of each loop iteration, which is intended to read new
 requests. It invokes `SendResponseCallback` at the end of each iteration when one or
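To make the loop ownership concrete, here is a minimal sketch of wiring both callback sketches into a `GptManager`. The constructor arguments and enum names shown (engine path, `TrtGptModelType`, beam width, `SchedulerPolicy`) follow the public headers at the time of writing and may differ between releases:

```cpp
#include <cstdint>
#include <list>
#include <memory>
#include <string>

#include "tensorrt_llm/batch_manager/GptManager.h"
#include "tensorrt_llm/batch_manager/InferenceRequest.h"
#include "tensorrt_llm/batch_manager/NamedTensor.h"

using namespace tensorrt_llm::batch_manager;

// Declarations of the callback sketches shown earlier.
std::list<std::shared_ptr<InferenceRequest>> getInferenceRequests(int32_t maxNumRequests);
void sendResponse(uint64_t, std::list<NamedTensor> const&, bool, std::string const&);

int main()
{
    // The constructor spawns the worker thread that runs the generation loop.
    GptManager manager("/path/to/engine_dir", TrtGptModelType::InflightFusedBatching,
        /*maxBeamWidth=*/1, batch_scheduler::SchedulerPolicy::GUARANTEED_NO_EVICT,
        getInferenceRequests, sendResponse);

    // Block the main thread until the manager shuts down; the destructor joins
    // the worker thread once there are no more active requests.
    manager.waitUntilTerminate();
    return 0;
}
```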
@@ -138,16 +141,13 @@ even in the worst case of KV cache consumption. That mode corresponds to a
 `schedulerPolicy` set to `GUARANTEED_NO_EVICT`.
 
 The `GptManager`'s worker thread terminates when the `GptManager` destructor is
-called and there are no more active requests. Alternatively, a special request
-with a `requestID` of `-1` can be sent to the `GptManager`, it will be
-interpreted as a `TERMINATE` signal. It leads to the invocation of
-`waitUntilTerminate` which returns when the worker thread has terminated.
+called and there are no more active requests.
 
 ### Multi-GPU execution
 
 When running on multiple GPUs using either tensor or pipeline parallelism, it
 is assumed that the server launches as many processes as GPU ranks, and each
-process runs its own copy of `GptManager`. The number of GPUs visible on a given
+process runs its own instance of `GptManager`. The number of GPUs visible on a given
 node can be controlled using the `CUDA_VISIBLE_DEVICES` environment variable.
 
 Care must be taken to ensure all ranks see the same inputs at each iteration of
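One hedged way to keep every rank's inputs identical, assuming an MPI launch with one process per GPU rank: broadcast the request batch from a single rank before returning it from `GetInferenceRequestsCallback`. How requests are packed into a flat buffer is left to the server; the packing itself is not shown here:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <mpi.h>

// Broadcast a packed request batch from rank 0 so that every rank's
// GetInferenceRequestsCallback can return identical inputs this iteration.
// Only rank 0 needs to pass real data; other ranks receive it.
std::vector<int64_t> broadcastPackedRequests(std::vector<int64_t> packed, int myRank)
{
    int64_t numWords = (myRank == 0) ? static_cast<int64_t>(packed.size()) : 0;
    MPI_Bcast(&numWords, 1, MPI_INT64_T, /*root=*/0, MPI_COMM_WORLD);
    packed.resize(static_cast<std::size_t>(numWords));
    MPI_Bcast(packed.data(), static_cast<int>(numWords), MPI_INT64_T, 0, MPI_COMM_WORLD);
    return packed;
}
```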