@traviscross traviscross commented Dec 14, 2025

The cut operator (`^`) is a backtracking fence. Once the expression to its left succeeds, we become committed to the alternative; the remainder of the expression must parse successfully or parsing will fail. See *Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space*, Mizushima et al., <https://kmizu.github.io/papers/paste513-mizushima.pdf>.
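To make the commitment semantics concrete, here is a minimal sketch using toy parser combinators. Everything in it is hypothetical and for illustration only: the names `lit`, `seq`, `seq_with_cut`, `choice`, and `CutFailure`, and the toy rule `"ab" ^ "c" | "abd"`, are not part of this PR or of the reference's grammar tooling.

```python
class CutFailure(Exception):
    """Failure after a cut. A choice must not recover from it."""


def lit(s):
    """Match the literal string `s`; return the new position or None."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def seq(first, second):
    """Plain sequence `first second`: any failure is an ordinary failure."""
    def parse(text, pos):
        mid = first(text, pos)
        return None if mid is None else second(text, mid)
    return parse


def seq_with_cut(before, after):
    """Models `before ^ after`: once `before` succeeds, we are committed."""
    def parse(text, pos):
        mid = before(text, pos)
        if mid is None:
            return None  # ordinary failure; the next alternative may be tried
        end = after(text, mid)
        if end is None:
            # Past the fence: fail the whole parse instead of backtracking.
            raise CutFailure(f"committed at {pos}, failed at {mid}")
        return end
    return parse


def choice(*alts):
    """Ordered choice: try alternatives in order; CutFailure propagates."""
    def parse(text, pos):
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return parse


# Toy rules (assumed for illustration):
#   with_cut    -> "ab" ^ "c" | "abd"
#   without_cut -> "ab"   "c" | "abd"
with_cut = choice(seq_with_cut(lit("ab"), lit("c")), lit("abd"))
without_cut = choice(seq(lit("ab"), lit("c")), lit("abd"))
```

On the input `abd`, `without_cut` backtracks out of the first alternative and matches the second, while `with_cut` raises `CutFailure` because `"ab"` already succeeded and the parser is committed to the first alternative.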

This operator solves a problem for us with C string literals. These literals cannot contain a null escape. But if the C string literal rule simply fails to match such a literal (e.g. `c"\0"`), we may instead lex the input successfully as two separate tokens (`c "\0"`), and that would be incorrect.
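The failure mode can be reproduced with a deliberately simplified lexer sketch. The token rules below are assumptions made for illustration, not the reference's actual grammar: identifiers are ASCII only, any backslash pair counts as an escape, and only the `\0` form of a null escape is rejected. The function `lex` and its `use_cut` flag are likewise hypothetical.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Simplified string literal: escape pairs or any char other than `"` and `\`.
STRING = re.compile(r'"(?:\\.|[^"\\])*"')
# Simplified C string body (after the `c"` prefix): same, minus the `\0` escape.
C_STRING_BODY = re.compile(r'(?:\\[^0]|[^"\\])*"')


def lex(src, use_cut):
    """Tokenize `src`. With `use_cut`, seeing `c"` commits to a C string."""
    tokens, pos = [], 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        if src.startswith('c"', pos):
            m = C_STRING_BODY.match(src, pos + 2)
            if m:
                tokens.append(("c_string", src[pos:m.end()]))
                pos = m.end()
                continue
            if use_cut:
                # Committed after `c"`: no fallback to other token rules.
                raise SyntaxError(f"invalid C string literal at offset {pos}")
            # Without the cut we fall through and try the other rules.
        m, kind = IDENT.match(src, pos), "ident"
        if m is None:
            m, kind = STRING.match(src, pos), "string"
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((kind, m.group()))
        pos = m.end()
    return tokens
```

Calling `lex('c"\\0"', use_cut=False)` yields an identifier token `c` followed by an ordinary string token `"\0"`, which is the incorrect result described above. With `use_cut=True` the same input raises `SyntaxError`, while a valid literal such as `c"hi"` still lexes as a single `c_string` token.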

As long as we only use cut to express constraints that can be expressed in a regular language and we keep our alternations disjoint, the grammar can still be mechanically converted to a CFG.

Let's add the cut operator to our grammar and use it for C string literals and some similar constructs.

In the railroad diagrams, we'll render the cut as a "no backtracking" box around the expression or sequence of expressions after the cut. The idea is that once you enter the box the only way out is forward.

(H/t to @ehuss for suggesting the cut operator to solve this problem.)

cc @ehuss


This is stacked on #2097 and should merge after it.

ehuss added 30 commits December 14, 2025 14:04
This is the beginning of a new mdbook book that will house all of the
guidelines for contributors. This is published via GitHub Pages.
This is just some light editing. I expect that this chapter will have
larger edits in the future, but I want to defer that till later.
This is just a stub, with the expectation that it will be
expanded/rewritten later.
This is just a stub, with the expectation that it will be
expanded/rewritten later.
This has been superseded by the contributor guide.
@traviscross traviscross force-pushed the TC/add-cut-to-grammar branch from 8b74468 to 24690d2 Compare December 14, 2025 17:09
@traviscross traviscross force-pushed the TC/add-cut-to-grammar branch from 24690d2 to fc646a1 Compare December 15, 2025 06:15