
Commit 26f797c

Update docs/source/batch_manager.md (#40)
1 parent 7295ce2 commit 26f797c

1 file changed

docs/source/batch_manager.md

Lines changed: 9 additions & 9 deletions
@@ -16,9 +16,11 @@ how it returns completed requests to the user.
## The Batch Manager API

A software component (called the client in the text that follows) can interact
-with the batch manager using two main callbacks. Their signatures are defined
+with the batch manager using two mandatory and several optional callbacks. Their signatures are defined
in the [`callbacks.h`](source:cpp/include/tensorrt_llm/batch_manager/callbacks.h) file.

+These callbacks are invoked in the generation loop at regular intervals and serve a variety of functions described below.
+
### Get and Send Callbacks

The entry point to pass new requests to the batch manager is a callback of type
@@ -42,7 +44,7 @@ tensor. See
[`InferenceRequest.h`](source:cpp/include/tensorrt_llm/batch_manager/InferenceRequest.h)
for more details.

-The responses are delivered to the client through a callback of type
+Responses are delivered to the client through a callback of type
`SendResponseCallback`. A conforming callback must accept the 64-bit
request ID that uniquely identifies the request, the list of output tensors,
a boolean (identifying the last response for the request when set to
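
For illustration, a minimal sketch of how a client might implement the two mandatory callbacks is shown below. It assumes the `InferenceRequest` header referenced above; the authoritative signatures are those in `callbacks.h` and may carry additional parameters not shown here, and the request queue and completion handling are hypothetical.

```cpp
// A minimal sketch, assuming the InferenceRequest header referenced above.
// The real callback signatures are defined in callbacks.h; the request queue
// below is a hypothetical stand-in for the serving layer's own bookkeeping.
#include <cstdint>
#include <cstdio>
#include <list>
#include <memory>

#include "tensorrt_llm/batch_manager/InferenceRequest.h"

using tensorrt_llm::batch_manager::InferenceRequest; // namespace assumed

// Hypothetical queue filled by the serving layer as requests arrive.
static std::list<std::shared_ptr<InferenceRequest>> pendingRequests;

// GetInferenceRequestsCallback sketch: hand over at most maxNumRequests new
// requests at the start of a generation-loop iteration.
auto getInferenceRequestsCb = [](int32_t maxNumRequests)
{
    std::list<std::shared_ptr<InferenceRequest>> batch;
    while (!pendingRequests.empty()
        && static_cast<int32_t>(batch.size()) < maxNumRequests)
    {
        batch.push_back(pendingRequests.front());
        pendingRequests.pop_front();
    }
    return batch;
};

// SendResponseCallback sketch: receives the 64-bit request ID, the list of
// output tensors (type left generic here), and a flag that is true for the
// last response of the request.
auto sendResponseCb = [](uint64_t requestId, auto const& outputTensors, bool isFinal)
{
    // A real server would route the tensors back to the caller; this sketch
    // only records completion.
    if (isFinal)
    {
        std::printf("request %llu completed with %zu output tensors\n",
            static_cast<unsigned long long>(requestId), outputTensors.size());
    }
};
```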
@@ -94,9 +96,10 @@ The statistics are packaged as a JSON string. That string contains three fields:

### GptManager Design

-GptManager is designed to integrate into an inference server that's managing a pool of
+Batch Manager is designed to integrate into an inference server that's executing a pool of
active work items populated by a stream of requests actively received
-by the server. GptManager spawns a worker thread in its constructor that then
+by the server. GptManager assumes a GPT-style autoregressive model architecture.
+GptManager spawns a worker thread in its constructor that then
persistently runs the token generation loop. The worker thread invokes `GetInferenceRequestsCallback`
at the start of each loop iteration, which is intended to read new
requests. It invokes `SendResponseCallback` at the end of each iteration when one or
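
The loop described above can be pictured with the rough sketch below. It is not the actual GptManager implementation; the placeholder types and the step function only stand in for scheduling plus one pass of token generation.

```cpp
#include <cstdint>
#include <functional>
#include <list>
#include <memory>

// Placeholder types standing in for the real request/response objects.
struct Request {};
struct Response { uint64_t requestId; std::list<int> outputTensors; bool isFinal; };

// Rough sketch of the loop the worker thread runs persistently; all names
// here are placeholders rather than GptManager internals.
void workerLoop(
    std::function<std::list<std::shared_ptr<Request>>(int32_t)> getInferenceRequestsCb,
    std::function<void(uint64_t, std::list<int> const&, bool)> sendResponseCb,
    std::function<std::list<Response>(std::list<std::shared_ptr<Request>> const&)> runOneGenerationStep,
    bool const& shutdownRequested, int32_t maxNumRequests)
{
    while (!shutdownRequested)
    {
        // Start of each iteration: pull new work from the client.
        auto newRequests = getInferenceRequestsCb(maxNumRequests);

        // Admit requests subject to the scheduler policy and generate the
        // next token for every in-flight request.
        auto completed = runOneGenerationStep(newRequests);

        // End of the iteration: hand completed requests back to the client.
        for (auto const& response : completed)
        {
            sendResponseCb(response.requestId, response.outputTensors, response.isFinal);
        }
    }
}
```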
@@ -138,16 +141,13 @@ even in the worst case of KV cache consumption. That mode corresponds to a
`schedulerPolicy` set to `GUARANTEED_NO_EVICT`.

The `GptManager`'s worker thread terminates when the `GptManager` destructor is
-called and there are no more active requests. Alternatively, a special request
-with a `requestID` of `-1` can be sent to the `GptManager`, it will be
-interpreted as a `TERMINATE` signal. It leads to the invocation of
-`waitUntilTerminate` which returns when the worker thread has terminated.
+called and there are no more active requests.

### Multi-GPU execution

When running on multiple GPUs using either tensor or pipeline parallelism, it
is assumed that the server launches as many processes as GPU ranks, and each
-process runs its own copy of `GptManager`. The number of GPUs visible on a given
+process runs its own instance of `GptManager`. The number of GPUs visible on a given
node can be controlled using the `CUDA_VISIBLE_DEVICES` environment variable.

Care must be taken to ensure all ranks see the same inputs at each iteration of
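
As a rough per-process sketch of the launch pattern described above: the rank environment variable shown is only one example of how an MPI-style launcher exposes the rank, and the GptManager construction is left as a comment because its interface is not part of this page.

```cpp
#include <cstdio>
#include <cstdlib>

int main()
{
    // Each server process corresponds to one GPU rank. MPI-style launchers
    // commonly expose the rank through an environment variable;
    // OMPI_COMM_WORLD_RANK is used here purely as an example.
    char const* rankEnv = std::getenv("OMPI_COMM_WORLD_RANK");
    int const rank = rankEnv ? std::atoi(rankEnv) : 0;

    // GPU visibility for this process is controlled externally, for example
    // by exporting CUDA_VISIBLE_DEVICES before launch.
    char const* visible = std::getenv("CUDA_VISIBLE_DEVICES");
    std::printf("rank %d, CUDA_VISIBLE_DEVICES=%s\n", rank,
        visible ? visible : "(unset)");

    // This process would now construct its own GptManager instance with the
    // same callbacks as every other rank, so that all ranks are fed the same
    // inputs at each iteration of the generation loop.
    return 0;
}
```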
