llama : add token matching support to llama-grammar #17816
Conversation
Very interesting. I'll see what I can build upon that.
ggerganov left a comment
Feel free to merge (use squash-merge).
One question: did you test whether there are any memory or performance problems with very long sequences, like {2000}? I remember there were some issues around that recently.
I didn't have any noticeable problems at 2000, and I didn't increase the limit in this PR. I do think we could increase it; from what I recall, the issue was mostly when it overflowed to MAX_UINT. I can test it out more thoroughly.
Implementation of an idea by @ngxson: #17750 (comment)
cc: @pwilkin @aviallon
Problem
The `llama-grammar` implementation doesn't have a way to accept tokens directly, which creates a few problems:

- A grammar can't distinguish between a special token (e.g. `<|end|>`) and the tokenized form `<|`, `end`, `|>` that may occur in content.
- Grammars have to resort to verbose patterns like `( [^<] | "<" [^|] | "<|" [^e] | ... | "<|end|" [^>] )*` to match chunks of characters that don't accumulate to the desired delimiter (`<|end|>`); see the expanded sketch below.
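For concreteness, here is a sketch of that workaround with the `...` expanded; each alternative accepts one more character of the delimiter prefix followed by a character that breaks it:

```
# character-level workaround: text that never accumulates to <|end|>
not-end ::= ( [^<]
            | "<"      [^|]
            | "<|"     [^e]
            | "<|e"    [^n]
            | "<|en"   [^d]
            | "<|end"  [^|]
            | "<|end|" [^>]
            )*
```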
Proposed Solution

Borrowing some ideas from llguidance, you can define a token by id `<[id]>` or as raw token text `<token>` if it is encased in `<`/`>`. I'm leaving out support for token id ranges/alternates since I don't see an immediate need for them. You can negate by prefixing the token with `!`, e.g. `!<|end|>`.

Example (gpt-oss)
By token id:
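A minimal sketch (the id 200007 is illustrative; use the actual `<|end|>` id from the model's vocab):

```
# sketch: allow any token except <|end|> (by id), then require it
root ::= ( !<[200007]> )* <[200007]>
```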
That's not very readable, but it is useful for tokens whose text isn't wrapped in `<`/`>`. If it is, you can use the token text directly:
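The same sketch using the raw token text:

```
# allow any token except <|end|>, then require it
root ::= ( !<|end|> )* <|end|>
```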
Use Case: Reasoning Budget Enforcement

Assuming the model's vocab has unique tokens for its thinking tags, adopting a reasoning budget is fairly trivial via grammar:
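A rough sketch, assuming the vocab has dedicated `<think>`/`</think>` tokens and an end-of-turn token named `<|end|>` (all hypothetical names); the 500-token budget is arbitrary:

```
# thinking is capped at 500 tokens, after which only </think> can follow
root   ::= <think> ( !</think> ){0,500} </think> answer
# the answer may then use any token until the end-of-turn token
answer ::= ( !<|end|> )* <|end|>
```

Since each `!<token>` matches exactly one token, the `{0,500}` bound counts tokens rather than characters, which is what makes the budget straightforward to express.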
Notes:
- `gpt-oss` may be a poor example since it has `reasoning_effort`, but the budget approach works pretty well.
To Do

- Extend `llama-grammar`'s `trigger_patterns` to collect tokens and replay them after a successful trigger. Support partial token matches by feeding only the matched piece to the grammar.
- Update the documentation in `grammars/`.

AI Disclosure: An LLM was used to help understand the grammar code, assist in writing documentation and test cases, and review implementations. All output generated by an LLM has been reviewed.