[DRAFT] CUDA: Improve performance via less synchronizations between token #17795
base: master
Conversation
How about using existing
I believe the
Besides, I believe the current way of synchronizing both
For synchronizations at the beginning of this code section (
While there is some amount of inter-op between the two backends (
Ultimately, this is a matter of taste and what architectural trade-off seems best in this specific case.
Thank you for the explanation. I was not aware of the distinction and it totally makes sense.
I believe these changes make an implicit assumption that the async copies will enter a common execution queue/stream with all previous operations. This seems to be the case for CUDA, but I don't think it is a general enough assumption to make for all backends. For example, removing this backend synchronization here is, I think, incorrect, since semantically it means that copying of the input can start immediately, even if the backend is still doing work: llama.cpp/ggml/src/ggml-backend.cpp Lines 1458 to 1464 in 09519f7
With CUDA, it is apparently not a problem because the copying does not start immediately - it gets placed in a queue (is this correct? who owns this queue? what guarantees that the copy will wait for all previous CUDA operations to finish?). But for some other backend, a copy can start immediately, as soon as we make the call.

Another implicit assumption that I think is not very good in this proposal:

```c
ggml_backend_dummy_buffer_set_tensor_async(...) {
    start_thread_to_copy_data(...).detach();
}
```

Technically, this is an asynchronous call so it's perfectly fine from an API standpoint. But there isn't a common queue to manage this thread. The thread belongs to the buffer, not to the backend. Calling
JohannesGaessler left a comment:
The ultimate authority on what the ggml backend interface should or shouldn't be is Diego, but my opinion is that `ggml_backend_tensor_set` should be blocking and that an asynchronous equivalent should be an explicit opt-in. Otherwise user code can accidentally introduce use-after-free bugs depending on how fast or slow the memcpy is vs. the user code.
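To make that failure mode concrete, here is a minimal sketch (my illustration, not code from this PR or from ggml) of how user code could break if `ggml_backend_tensor_set` silently became non-blocking; the backend/tensor setup is elided and `upload_input` is a hypothetical helper:

```cpp
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical user code; `tensor` is assumed to live in a device (e.g. CUDA) buffer.
// With today's blocking ggml_backend_tensor_set this pattern is safe.
static void upload_input(struct ggml_tensor * tensor) {
    std::vector<float> host_data(ggml_nelements(tensor), 0.0f); // host-side staging buffer

    // If this call merely enqueued an async device copy and returned early...
    ggml_backend_tensor_set(tensor, host_data.data(), 0, ggml_nbytes(tensor));
}   // ...host_data is destroyed here while the copy may still be in flight:
    // a use-after-free that only shows up under unlucky timing.
```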
Re Georgi's first point: I think the PR as-is is already incorrect for CUDA. The reason there is a CPU synchronization for the memcpy is that the source pointer is under host control and the host could free the memory immediately after the function returns, resulting in a use-after-free bug. With this PR this could in principle still happen. It would only be safe if there is a precondition for user code that stipulates that memory may only be freed after calling

Re Georgi's second point: If I understood Diego's idea of what a ggml "backend" is supposed to be correctly, it was pretty much equivalent to the concept of a CUDA stream. And especially when the ggml backend API has a function
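For comparison, ggml already exposes an explicitly asynchronous upload path where the caller owns the synchronization point. A minimal sketch of that opt-in pattern (my illustration; `upload_input_async` is a hypothetical helper, and the lifetime rule shown is the kind of precondition being discussed, not a documented guarantee):

```cpp
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"

// Explicit opt-in: the caller requests an async upload and takes on the
// obligation to keep the host buffer alive until the backend is synchronized.
static void upload_input_async(ggml_backend_t backend, struct ggml_tensor * tensor) {
    std::vector<float> host_data(ggml_nelements(tensor), 0.0f);

    // Enqueue the copy; it may still be in flight when this call returns.
    ggml_backend_tensor_set_async(backend, tensor, host_data.data(), 0, ggml_nbytes(tensor));

    // ... enqueue further uploads or other backend work here ...

    // Only after this returns is it safe to let host_data go out of scope.
    ggml_backend_synchronize(backend);
}
```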
[DRAFT]
This PR suggests removing some superfluous synchronization calls between tokens to make the CUDA backend faster. I see between 1% and 2% performance gain depending on the model, GPU, and settings.
Mechanism
The performance impact is best explained visually. Below are the "before" and "after" Nsight Systems traces (not to scale). The relevant part is the row with the green and red bubbles. Both images show the overhead between the GPU execution of two tokens. The generation of the n-th token ends on the left-hand side of the screenshot, in the green bubble titled `cudaStreamSynchronize`. The calculation of the next (n+1st) token starts on the right-hand side, at the green bar titled `cudaGraphLaunch`. In between, there is CPU orchestration overhead. This PR aims to shrink the time spent in that middle part, between GPU token generation.

Original:

In the middle of the above image, red and green bubbles alternate. Here, the green bubbles are synchronization steps and the red bubbles are asynchronous copy calls from host to device. If async operations are immediately followed by synchronization calls, they effectively execute synchronously, which is not efficient. Removing the green synchronization operations between the asynchronous copy calls yields truly asynchronous copies and reduced overhead between GPU token generation, as the second ("after") trace shows.
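In plain CUDA terms, the pattern being removed corresponds roughly to the first loop below, and the pattern this PR moves towards to the second. This is an illustrative sketch (hypothetical helper names, pinned host memory assumed so the copies can actually overlap), not the real ggml code path:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Illustrative only: dsts/srcs/sizes stand in for the per-tensor input uploads
// of one token; `stream` is the backend's compute stream.

// Before: each async copy is immediately followed by a sync, so the copies
// are effectively synchronous and the CPU stalls once per input tensor.
void upload_inputs_before(cudaStream_t stream, void ** dsts, const void ** srcs,
                          const size_t * sizes, int n) {
    for (int i = 0; i < n; ++i) {
        cudaMemcpyAsync(dsts[i], srcs[i], sizes[i], cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
    }
}

// After: all copies are queued back-to-back on the same stream and the CPU
// blocks only once, after the last copy has been enqueued.
void upload_inputs_after(cudaStream_t stream, void ** dsts, const void ** srcs,
                         const size_t * sizes, int n) {
    for (int i = 0; i < n; ++i) {
        cudaMemcpyAsync(dsts[i], srcs[i], sizes[i], cudaMemcpyHostToDevice, stream);
    }
    cudaStreamSynchronize(stream);
}
```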
Performance
I benchmarked on an RTX Pro 6000 Blackwell using `./llama-bench -m $models -p 0 -n 128,256,512 -fa 1`. My testing shows around 1% improvement, with `gpt-oss-20b` gaining up to 1.4%. `llama 3B Q4_K - Medium` shows very high variance, prompting me to rerun the tests with `-r 100`. At `-r 100`, a clearer trend of improved performance for `gemma3n E2B Q8_0` is also visible.

Details with default `-r 5`
Baseline:
PR:
Speedup:
Details with `-r 100`
Baseline:
PR:
Speedup:
Implementation Concerns
The approach here aims to minimize changes to the generic backend code and to the other backends. However, the synchronization calls originate from the generic backend, so some changes there are unavoidable, as is retaining a synchronization call after the last copy to ensure correctness across backends.
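To make the intended change in the generic backend concrete, here is a rough sketch of the input-upload pattern, assuming a per-split list of input tensors and a placeholder `host_data_for` accessor; the real scheduler code in ggml-backend.cpp is more involved:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Simplified sketch: upload all graph inputs of one split asynchronously and
// synchronize once after the last copy instead of once per tensor.
// `inputs`, `n_inputs` and `host_data_for` are placeholders for scheduler state.
static void upload_split_inputs(ggml_backend_t backend,
                                struct ggml_tensor ** inputs, int n_inputs,
                                const void * (*host_data_for)(const struct ggml_tensor *)) {
    for (int i = 0; i < n_inputs; ++i) {
        struct ggml_tensor * t = inputs[i];
        // Queue the copy without forcing a CPU/GPU round trip per tensor.
        ggml_backend_tensor_set_async(backend, t, host_data_for(t), 0, ggml_nbytes(t));
    }
    // A single synchronization after the last copy preserves the lifetime
    // guarantees for the host-side input buffers across all backends.
    ggml_backend_synchronize(backend);
}
```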
Additionally, AFAIK there is no documentation of the functional guarantees of a function like `ggml_copy_tensor`, and it could be that the current design proposal violates existing assumptions, or practices around potentially breaking ABIs between ggml and llama.cpp. For this reason, this PR is a draft.

I also have not copy-pasted my `ggml_backend_buffer_i` interface changes (added `set_tensor_async` + whitespace) to the other backends just yet; this causes the unrelated tests to fail.

Please advise on the architectural choices of this implementation.
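For reference, the interface change amounts to adding one optional entry next to the existing `set_tensor` member of `ggml_backend_buffer_i`. The excerpt below is a sketch with the unrelated members elided; the signature simply mirrors `set_tensor`, and whether that is the right shape is part of the open question:

```cpp
// ggml-backend-impl.h (sketch; unrelated members elided)
struct ggml_backend_buffer_i {
    // ... existing members ...
    void (*set_tensor)      (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
                             const void * data, size_t offset, size_t size);
    // New and optional: backends that leave this NULL keep the blocking behavior.
    void (*set_tensor_async)(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
                             const void * data, size_t offset, size_t size);
    // ... existing members ...
};
```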
For example, we could instead make `set_tensor` in the CUDA backend asynchronous by default. This would avoid interface changes, but would make the behavior of similar functions differ between backends.

@ggerganov @JohannesGaessler