llama : add token matching support to llama-grammar #17816
Conversation
Very interesting. I'll see what I can build upon that.
ggerganov left a comment
Feel free to merge (use squash-merge).
One question: did you test whether there are any memory or performance problems with very long sequences, like {2000}? I remember there were some issues around that recently.
I didn't have any noticeable problems at 2000, and I didn't increase the limit in this PR. I do think we could increase it; from what I recall, the issue was mostly when it overflowed to MAX_UINT. I can test it out more thoroughly.
Implementation of an idea by @ngxson: #17750 (comment)
cc: @pwilkin @aviallon
Problem
The `llama-grammar` implementation doesn't have a way to accept tokens directly, which creates a few problems:

- A grammar can't distinguish between a special token (e.g. `<|end|>`) and the tokenized form `<|`, `end`, `|>` that may occur in content.
- Grammars have to resort to verbose patterns like `( [^<] | "<" [^|] | "<|" [^e] | ... | "<|end|" [^>] )*` to match chunks of characters that don't accumulate to the desired delimiter (`<|end|>`); see the expanded sketch below.
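For concreteness, here is a sketch of that workaround with the `...` expanded; each alternative accepts one more character of the delimiter prefix followed by a character that breaks it:

```
# character-level workaround: text that never accumulates to <|end|>
not-end ::= ( [^<]
            | "<"      [^|]
            | "<|"     [^e]
            | "<|e"    [^n]
            | "<|en"   [^d]
            | "<|end"  [^|]
            | "<|end|" [^>]
            )*
```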
Proposed Solution

Borrowing some ideas from llguidance, you can define a token by id `<[id]>` or as raw token text `<token>` if it is encased in `<`/`>`. I'm leaving out support for token id ranges/alternates since I don't see an immediate need for them. You can negate by prefixing the token with `!`, e.g. `!<|end|>`.

Example (gpt-oss)
By token id:
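A minimal sketch (the id 200007 is illustrative; use the actual `<|end|>` id from the model's vocab):

```
# sketch: allow any token except <|end|> (by id), then require it
root ::= ( !<[200007]> )* <[200007]>
```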
That's not very readable, but it is useful for tokens whose text isn't wrapped in `<`/`>`. If it is, you can use the token text directly:
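The same sketch using the raw token text:

```
# allow any token except <|end|>, then require it
root ::= ( !<|end|> )* <|end|>
```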
Use Case: Reasoning Budget Enforcement

Assuming the model's vocab has unique tokens for its thinking tags, adopting a reasoning budget is fairly trivial via grammar:
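A rough sketch, assuming the vocab has dedicated `<think>`/`</think>` tokens and an end-of-turn token named `<|end|>` (all hypothetical names); the 500-token budget is arbitrary:

```
# thinking is capped at 500 tokens, after which only </think> can follow
root   ::= <think> ( !</think> ){0,500} </think> answer
# the answer may then use any token until the end-of-turn token
answer ::= ( !<|end|> )* <|end|>
```

Since each `!<token>` matches exactly one token, the `{0,500}` bound counts tokens rather than characters, which is what makes the budget straightforward to express.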
Notes:
- `gpt-oss` may be a poor example since it has `reasoning_effort`, but the budget approach works pretty well.
To Do

- Extend `llama-grammar`'s `trigger_patterns` to collect tokens and replay them after a successful trigger. Support partial token matches by feeding only the matched piece to the grammar.
- Update the documentation in `grammars/`.

AI Disclosure: An LLM was used to help understand the grammar code, assist in writing documentation and test cases, and review implementations. All output generated by an LLM has been reviewed.