forked from ggml-org/llama.cpp
llama.cpp SYNC #45
Open: akapoor3518 wants to merge 2,030 commits into tsisw:llama.cpp-syn-sept2 from ggml-org:master
Conversation
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…400) (#17572)

* Make invalid schema a user error (400)
* Move invalid_argument exception handler to ex_wrapper
* Fix test
* Simplify test back to original pattern
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size.
- Pass the output row size as "ncols_output"; what used to be "ncols_output" is now "k".

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB.
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Removed apt prompt
* Added RISC-V specific tests with corrections:
  1. Changed the test names from debian to ubuntu, as it is more stable than Debian Trixie
  2. Added an explicit compiler in the cmake command, as GCC versions below 14 have been recorded to throw errors with rvv1.0 and some other extensions
  3. Added dependencies that are not installed by default on RISC-V Ubuntu 24.04
  4. Separate ccache directory for each job, as the ccache results differ between jobs and a shared directory may cause ccache to not work
* Resolved the merge conflict and cleaned up run.sh
* Update ci/run.sh
* Removed previously added build CI for RISC-V
* Removed trailing whitespace
* Corrected build name
* Cleanup
* Enabled build tests (1)
* Enabled build tests (2)
* Enable openssl

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* server: add --media-path for local media files
* remove unused fn
* common : restart grammar-based rejection sampling
* sampling : allow null samplers
* feat: initial support for gelu using sigmoid approximation
* snapshot: faster gelu using polynomial approximation
* test: disable l2-block prefetch in polynomial approximation
* Revert "test: disable l2-block prefetch in polynomial approximation" (reverts commit 7233999)
* Revert "snapshot: faster gelu using polynomial approximation" (reverts commit 2a787a6)
* debug: temporarily disable unnecessary log message for debug purpose
* feat: optimized unaligned sigmoid_f32
* feat: larger l2prefetch block
* feat: apply unaligned-load optimization on mul and mul_scalar
* Revert "debug: temporarily disable unnecessary log message for debug purpose" (reverts commit 84f2f23)
* refactor: cleanup commented unused code
* chore: reformat code with clang-formatter to pass CI test
* Revert "chore: reformat code with clang-formatter to pass CI test" (reverts commit 952877e)
* fix: fix loop overflow
* chore: fix formatting CI error
* webui: fix chat header width when sidebar is closed
* chore: add index.html.gz
* server/webui: add server-side WebUI config support

  Add CLI arguments --webui-config (inline JSON) and --webui-config-file (file path) to configure WebUI default settings from the server side.

  Backend changes:
  - Parse JSON once in server_context::load_model() for performance
  - Cache parsed config in webui_settings member (zero overhead on /props)
  - Add proper error handling in router mode with try/catch
  - Expose webui_settings in /props endpoint for both router and child modes

  Frontend changes:
  - Add 14 configurable WebUI settings via parameter sync
  - Add tests for webui settings extraction
  - Fix subpath support with base path in API calls

  Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
* snapshot: debug ggml-hexagon swiglu-oai
* fix: fix hvx_min_scalar_f32
* feat: working swiglu-oai
* chore: fix formatting issue
* Uncached model read
* Removing additional --mmap arg
* Removing trailing whitespace
* Adding fallback when O_DIRECT is not supported
* Remove branching in llama-model-loader.cpp and reduce code duplication in llama-mmap.cpp
* Adding maybe_unused keyword for Mac and Windows
* File seek aligned
* Removing all branches for direct_io in llama-model-loader.cpp
* Always use alignment from llama_file
* use_mmap=true
* keep file part order from model index
* treat index as authoritative
* sort index parts
* webui: fix chat screen shadow width
* chore: add index.html.gz
…18091)

* draft: incremental markdown rendering with stable blocks
* refactor: Logic improvements
* refactor: DRY Markdown post-processing logic
* refactor: ID generation improvements
* fix: Remove runes
* refactor: Clean up & add JSDocs
* chore: update webui static output
* fix: Add tick to prevent race conditions for rendering Markdown blocks (suggestion from @ServeurpersoCom)
* chore: Run `npm audit fix`
* chore: update webui static output
* feat: Improve performance using global counter & id instead of UUID
* refactor: Enhance Markdown rendering with link and code features
* chore: update webui static output
* fix: Code block content extraction
* chore: update webui static output
* chore: update webui static output

Co-authored-by: Pascal <admin@serveurperso.com>
Co-authored-by: zhang hui <you@example.com>
* cmake: add BF16 RVV flag for ggml-cpu
* ggml-cpu: add floating-point conversion kernels
* ggml: add floating-point kernels
* ggml-cpu: fix lmul in vec_dot_bf16
* ggml-cpu: change redsum to lmul 4, fix leftover

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
* webui: display prompt processing stats
* feat: Improve UI of Chat Message Statistics
* chore: update webui build output
* refactor: Post-review improvements
* chore: update webui build output

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* ASR with LFM2-Audio-1.5B
* Set rope_theta
* Fix comment
* Remove rope_theta setting
* Address PR feedback
* rename functions to conformer
* remove some redundant ggml_cont
* fix missing tensor
* add prefix "a." for conv tensors
* remove redundant reshape
* clean up
* add test model

Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
This implements a variation of the perf logger where, rather than timing each operation individually with effectively a barrier in between, we put the timing boundaries where we already synchronize and time the groups of work that normally overlap. This can be useful to help understand whether individual operations need to be optimized, or if the group is already running efficiently.

GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when GGML_VK_PERF_LOGGER is also set). GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile-time switch.
* Android basic sample app layout polish
* Add missing screenshots and polish Android README doc
* Replace file blobs with URLs served by GitHub Pages service
This commit adds a --verbose flag to the run-org-model.py script to enable or disable detailed debug output, such as input and output tensors for each layer. Debug utilities (summarize, debug_hook, setup_rope_debug) have been moved to utils/common.py.

The motivation for this is that the detailed debug output can be useful for diagnosing issues with model conversion or execution, but it can also produce a large amount of output that may not always be needed.

The script will also be further cleaned/refactored in follow-up commits.
* feat: Enable editing attachments in user messages
* feat: Improvements for data handling & UI
* docs: Update Architecture diagrams
* chore: update webui build output
* refactor: Exports
* chore: update webui build output
* feat: Add paste handling for Chat Message Edit Form
* chore: update webui build output
* refactor: Cleanup
* chore: update webui build output
…global section (#18169)

* presets: refactor, allow cascading presets from different sources
* update docs
* fix neg arg handling
* fix empty mmproj
* also filter out server-controlled args before to_ini()
* skip loading custom_models if not specified
* fix unset_reserved_args
* fix crash on windows
* llama-server: friendlier error msg when ctx < input

  This PR adds formatted strings to the server's send_error function
* llama-server: use string_format inline
* fix test
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
  - simplified to check first.length() <= last.length() only
  - fixed: --sampler-seq, --rerank, --draft ordering
  - note: middle positions in 3+ arg sets are not verified
* arg: update doc
…e accurate mixed-precision matmul operations (#17977)

* feat: implement real Q8_0
* feat: adding cmake option for configuring FP32 quantize group size
* typo: set() shall be used

Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
* remove non-windows zip artifacts
* add cuda dll links
Labels
android
Apple Metal
Ascend NPU
build
devops
documentation
examples
ggml
IBM zDNN
model
nix
Nvidia GPU
OpenCL
python
script
server
SYCL
testing
Vulkan