```mermaid
graph TD
    %% ...earlier nodes and edges elided in this excerpt...
    server_response --> server_routes
```

### Batching
The server context maintains a single batch shared across all slots. When `update_slots()` is invoked, the system iterates through all active slots to populate this batch. For each slot, either a generated token from the previous decoding step or available prompt tokens are added to the batch.
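
Below is a condensed, illustrative sketch of this loop, not the actual implementation: field names such as `slot.sampled`, `slot.n_past`, and `slot.prompt_tokens` are simplified stand-ins, while `common_batch_clear`/`common_batch_add` are the batch helpers from `common/common.h`.

```cpp
// Illustrative sketch of the batch-filling phase of update_slots().
// The real logic in tools/server/server.cpp handles many more cases
// (context shift, truncation, embeddings, ...).
common_batch_clear(batch);

for (server_slot & slot : slots) {
    if (slot.state == SLOT_STATE_GENERATING) {
        // add the token sampled in the previous decoding step
        common_batch_add(batch, slot.sampled, slot.n_past, { slot.id }, true);
        slot.n_past++;
    } else if (slot.state == SLOT_STATE_PROCESSING_PROMPT) {
        // add as many remaining prompt tokens as the batch can hold
        while (slot.n_past < (int) slot.prompt_tokens.size() && batch.n_tokens < n_batch) {
            common_batch_add(batch, slot.prompt_tokens[slot.n_past], slot.n_past, { slot.id }, false);
            slot.n_past++;
        }
        if (slot.n_past == (int) slot.prompt_tokens.size()) {
            // the last prompt token must request logits so that the first
            // generated token can be sampled after this decode
            batch.logits[batch.n_tokens - 1] = true;
        }
    }
}
```
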
Batching constraints apply: slots can only be batched together if they share compatible configurations. For instance, slots using a specific LoRA adapter can be batched with each other, but not with slots using a different LoRA adapter or no adapter at all.
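
As an illustration, such a check amounts to comparing the slots' adapter configurations. The helper below is hypothetical; the real server performs an equivalent test when selecting slots for a batch.

```cpp
// Hypothetical helper: two slots may share a batch only if they apply
// exactly the same set of LoRA adapters (including "no adapter").
bool can_batch_together(const server_slot & a, const server_slot & b) {
    return a.lora == b.lora;
}
```
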
Once the batch reaches capacity or all slots have been processed, `llama_decode` is called to execute the inference. This operation represents the primary computational bottleneck in `update_slots()`.

Following decoding, the system either retrieves embeddings or samples the next token using `common_sampler_sample`. If a slot has remaining prompt tokens to process, it yields until the next `update_slots()` iteration.
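
A rough sketch of this phase, continuing the example above (error handling and embedding retrieval are omitted; `slot.i_batch` is assumed to hold the index of the slot's logits within the batch):

```cpp
// Sketch of the decode/sample phase: a single llama_decode() call
// serves every slot that contributed tokens to the shared batch.
if (llama_decode(ctx, batch) != 0) {
    return; // the real server retries with a smaller batch or fails the slots
}

for (server_slot & slot : slots) {
    if (slot.state != SLOT_STATE_GENERATING) {
        continue; // e.g. prompt tokens remain -- wait for the next update_slots()
    }
    const llama_token id = common_sampler_sample(slot.smpl, ctx, slot.i_batch);
    common_sampler_accept(slot.smpl, id, /*accept_grammar=*/true);
    slot.sampled = id; // fed into the next shared batch
}
```
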
### Thread Management
Each incoming HTTP request is handled by its own thread managed by the HTTP library.

- All JSON formatting and chat template logic must stay in the HTTP layer.
- Avoid passing raw JSON between the HTTP layer and `server_slot`. Instead, parse everything into native C++ types as early as possible (see the sketch below).
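
For example, a handler can convert the request body into a plain struct right at the boundary, so that nothing past the HTTP layer ever touches JSON. The types below are hypothetical, a sketch of the idea rather than the actual code:

```cpp
#include <string>
#include <nlohmann/json.hpp>

// Hypothetical request struct: native C++ types only, no JSON.
struct completion_params {
    std::string prompt;
    int         n_predict   = -1;
    float       temperature = 0.8f;
};

// Parse once at the HTTP boundary; everything downstream uses the struct.
static completion_params parse_completion_params(const nlohmann::json & body) {
    completion_params p;
    p.prompt      = body.at("prompt").get<std::string>();
    p.n_predict   = body.value("n_predict",   p.n_predict);
    p.temperature = body.value("temperature", p.temperature);
    return p;
}
```
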
### Example Trace of a Request
Here is an example trace of an API request for text completion (a condensed code sketch follows the list):
- A request arrives at the HTTP layer.
- The request is routed to the corresponding handler inside `server_routes`. In this case, `handle_completions_impl` is invoked.
- The handler parses the input request, constructs a new `server_task`, and passes it to `server_res_generator`.
- `server_res_generator` creates a new `task_result_state` for each task:
  - `task_result_state` stays in the HTTP layer and is responsible for keeping track of the current state of the response (e.g., parsing tool calls or thinking messages).
- `server_task` is moved into `server_queue` inside `server_context`.
- `server_context` launches the task by moving it into an available slot (see `launch_slot_with_task()`).
- `update_slots()` processes the task as described in the "Batching" section above.
- Results may be sent using `send_partial_response` or `send_final_response`, each of which creates a new `server_task_result` and pushes it to the response queue.
- At the same time, `server_res_generator` listens on the response queue and retrieves the response.
- Because the response object itself is stateless, `server_res_generator` calls `response->update()` to bring it up to date with the current `task_result_state`.
- `server_res_generator` then calls `response->to_json()` and passes the resulting JSON to the HTTP layer.
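
The same flow, condensed into a hypothetical handler-side sketch. Names such as `make_task_from_json`, `http_response`, and `res.write` are illustrative; the real interplay is spread across `server_routes`, `server_res_generator`, and `server_context`.

```cpp
// Heavily condensed, hypothetical view of the completion flow:
// parse -> enqueue task -> stream results back to the client.
void handle_completion(const nlohmann::json & body, http_response & res) {
    server_task task = make_task_from_json(body); // JSON stops at this boundary
    const int id_task = task.id;

    task_result_state state;                      // stays in the HTTP layer
    ctx_server.queue_tasks.post(std::move(task)); // picked up by update_slots()

    for (;;) {
        server_task_result_ptr result = ctx_server.queue_results.recv(id_task);
        result->update(state);        // apply the HTTP-layer state to the stateless result
        res.write(result->to_json()); // hand the JSON back to the HTTP layer
        if (result->is_stop()) {
            break;                    // final response received
        }
    }
}
```
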
### Testing
`llama-server` includes an automated test suite based on `pytest`.