@jahorton
Contributor

@jahorton jahorton commented Nov 13, 2025

This PR aims to start an internal doc on the role of SearchSpace, SearchPath, and SearchCluster in the correction-search process.

At present, I don't claim it to be complete by any measure. But, "something" is better than "nothing" here, and this provides a chance to get some eyes on things early in order to determine what works as an explanation and what doesn't. Feedback appreciated, even while in draft mode.

Build-bot: skip
Test-bot: skip

@keymanapp-test-bot

User Test Results

Test specification and instructions

User tests are not required

@keymanapp-test-bot keymanapp-test-bot bot changed the title from "docs(web): starts internal doc on SearchSpace design, requirements, and analysis" to "docs(web): starts internal doc on SearchSpace design, requirements, and analysis 🚂" Nov 13, 2025
@keymanapp-test-bot keymanapp-test-bot bot added this to the A19S16 milestone Nov 13, 2025
@keyman-server keyman-server modified the milestones: A19S16, A19S17 Nov 22, 2025
Member

@mcdurdin mcdurdin left a comment

This helps a lot in understanding the SearchSpace, SearchPath, SearchCluster types.

But I do have some questions and suggestions:

  • It would help to describe the shape of these types (i.e. properties, methods, and particularly how Path differs from Cluster).
  • Include a concrete example at the top, of a short series of key events + resulting SearchSpace types to illustrate how the types are used. This should be a common case rather than a pathological one!
  • Even after reading, I am not really clear why SearchCluster exists; why do SearchPaths need grouping and how does this help?
  • I am a little unclear on the names of the types - Space vs Path. Why is a Path an implementation of a Space?
  • I am unclear how a SearchPath can 'extend' a SearchSpace given a SearchSpace is just an interface without implementation? Isn't the relationship between SearchPath and SearchSpace 'implements'?
  • I guess a Cluster could be called a PathCluster or a PathGroup to clarify the relationship?
  • It seems like a large part of the reason for these types is fat fingering at word boundaries. Is that right? It's never explicitly stated, just obliquely when defining the problem.

Formatting nit: we generally wrap our .md files at 80 chars

The `SearchSpace` interface exists to represent portions of the dynamically-generated graph used for correction-searching within the predictive-text engine. As new input is received, new extensions to previous `SearchSpace`s may be created to extend the graph's reach, appending newly-received input to the context token to be corrected. Loosely speaking, different instances of `SearchSpace` correspond to different potential tokenizations of the input and/or to different requirements for constructing and applying generated suggestions.

There are two implementations of this interface:
- `SearchPath`, which extends a `SearchSpace` by a single set of recent inputs affecting the range of represented text in the same manner.
Member

  • SearchPath is a single set of recent inputs -- how do you define the boundaries of this set? Is it 1 keystroke, 10, 100? And what does 'in the same manner' mean -- what are the actual effects?

<!-- - To complicate matters further, note that the letters `c`, `v`, and `n` are also close to `b`.
- Suppose this leads to `van errors`, `NaN errors`, etc..., but also `cannery`, `Vannessa`, etc. -->

2. Each individual `SearchSpace` should only model correction of inputs that result in tokens of the same codepoint length as each other.
Member

This confuses me -- you talk about an individual SearchSpace but then say 'the same codepoint length as each other' -- what is the 'other' here?


2. It is not possible to guarantee that one keystroke will only extend a previous `SearchSpace` in one way.
- If the incoming keystroke produces `Transform`s that have different `insert` length without varying the left-deletion count, this _must_ result in multiple `SearchSpace`s, as the total codepoint length will vary accordingly.
- Also of note: if left-deleting, it is possible for a left-deletion to erase the token adjacent to the text insertion point.
Member

I don't understand this point -- I would assume that a left-deletion would always be deleting the token adjacent to the text insertion point?

Comment on lines +74 to +76
For example, consider a case with two keystrokes, each of which has versions emitting insert strings of one and two characters. Taking two chars from one and one char from the other will result in a `SearchSpace` that models a total of two keystrokes that fully covers the two keys.

For such cases, any future keystrokes can extend both input sequences in the same manner. While the actual correction-text may differ, the net effect it has on the properties of a token necessary for correction and construction of suggestions is identical. The `SearchCluster` variant of `SearchSpace` exists for such cases, modeling the convergence of multiple `SearchPath`s and extending all of them together at once.
Member

This example doesn't make sense to me. I don't understand "each of which has versions emitting insert strings of one and two characters."

@@ -0,0 +1,76 @@
# The SearchSpace types
Contributor

Why is it called SearchSpace/Path/Cluster ? What do we search?

@@ -0,0 +1,76 @@
# The SearchSpace types
Contributor

I am not sure SearchSpace is a good name here. A search space, in my understanding, consists of a set of points (nodes, vertices) over which the search is carried out, looking for the best (or good enough) point by some metric. Consequently, the search path would consist of a series of steps (transitions) from point to point while searching for the best. Hence a search path is something in or through a search space, so an is-a relationship between SearchPath and SearchSpace doesn't seem appropriate.

Contributor Author

I hear you here... but I'm having trouble finding better nomenclature.

SearchSpace, SearchPath, and SearchCluster all represent a higher-level "branching" of matching-behavior nodes/edges from what came before. They all "condense" groups of nodes & edges into a single behavior shared by all; the inner layer (with SearchNode, etc) handles non-condensed cases matching that behavior.

SearchPath has a single inbound sequence of such behaviors; SearchCluster supports cases where two or more inbound behaviors total to the same cumulative behavior. (The actual Transforms applied will differ, but will still exhibit the same expected total behavior.)

Any suggestions on how to express this? I started looking up graph formal languages and terms, but to no avail.

Contributor Author

I am not sure SearchSpace is a good name here. A search space, in my understanding, consists of a set of points (nodes, vertices) over which the search is carried out, looking for the best (or good enough) point by some metric.

So, this actually is part of what SearchSpace does. Its instances do reflect a set of points (nodes, vertices) and related edges for the search. However, each instance is but a subset of the full, total SearchSpace under consideration; it's impractical to represent the total set of nodes, edges, etc for the search in a single instance without making "partitions" (or similar) of the space. (Sadly, "partition" is already an existing term for graph representation in literature, and what we're doing... isn't that.)

Before my recent work, we were fine with a single SearchSpace instance, and that instance would be mutated as new input came in. The first step then transformed it to "extend" the search space with a single input in a manner not unlike a linked list. The design assumes that it is safe to "factorize" the search space on each input - to treat each input keystroke as being independent of the others. (This is not strictly true, but it does simplify the design.) Factorizing the search in this way then lets us reuse the results for inputs 1 to N-1, then continue from that with input N to find the best results for 1 through to N.

To do whitespace fat-fingering adjustment, we no longer have a single linked-list representing the total possible search space. Instead, we have a diverging set of reusable intermediate searches. Sometimes, some entries may reconverge in a manner such that anything continuing the search results of one "net behavior" may continue the search results of any predecessor behavior sequence - and in the same way. We could almost say that we build a tree of potential behaviors; the only issue with that statement is that occasionally, some branches may reconverge after splitting. The search graph is certainly directed and acyclic, with a clearly-defined start node - whether or not we're looking on the level of individual possible inputs or on the "condensed" level of behaviors.

Contributor Author

To be clear, re "behaviors":

Suppose one keystroke may produce any of the following outputs:

  • a
  • b
  • c
  • de
  • fg
  • hi
  • delete 1 char, then emit jk
  • delete 1 char, then emit lm

This condenses to 3 behaviors:

  • insert 1 char (a, b, or c)
  • insert 2 chars (de, fg, or hi)
  • delete 1 char, then insert 2 chars (jk, lm)

The SearchPath for each of the three behaviors will search a space based on the prefix behaviors and any input keystroke that matches the SearchPath's behavior. A SearchCluster built from the three SearchPaths will search all three at once. All will be but a subset of the true overall search space for the token.
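
The condensing step above can be sketched in TypeScript. This is only a minimal illustration, not the engine's code: the `Transform` shape is reduced to the two fields used in this thread, and `condenseBehaviors` is a hypothetical helper name.

```typescript
// Hypothetical minimal shape; the engine's real Transform type has more fields.
interface Transform {
  insert: string;
  deleteLeft: number;
}

// Groups one keystroke's candidate outputs by their net "behavior":
// how many codepoints are deleted, then how many are inserted.
function condenseBehaviors(transforms: Transform[]): Map<string, Transform[]> {
  const groups = new Map<string, Transform[]>();
  for (const transform of transforms) {
    // Array.from iterates by codepoint, so a surrogate pair counts once.
    const insertLength = Array.from(transform.insert).length;
    const key = `delete ${transform.deleteLeft}, insert ${insertLength}`;
    const group = groups.get(key) ?? [];
    group.push(transform);
    groups.set(key, group);
  }
  return groups;
}

// The eight outputs listed above for a single keystroke.
const keystrokeOutputs: Transform[] = [
  { insert: 'a', deleteLeft: 0 },
  { insert: 'b', deleteLeft: 0 },
  { insert: 'c', deleteLeft: 0 },
  { insert: 'de', deleteLeft: 0 },
  { insert: 'fg', deleteLeft: 0 },
  { insert: 'hi', deleteLeft: 0 },
  { insert: 'jk', deleteLeft: 1 },
  { insert: 'lm', deleteLeft: 1 },
];

const behaviors = condenseBehaviors(keystrokeOutputs);
behaviors.forEach((members, key) => {
  console.log(key, members.map((m) => m.insert));
});
```

Each map entry corresponds to one behavior, and so to one candidate `SearchPath`; the individual insert strings inside an entry are the non-condensed cases left to the inner search layer.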

@@ -0,0 +1,76 @@
# The SearchSpace types
Contributor

I think it would probably be better to introduce the classes/types a little later in the document after at least the problem definition. Some detail on the classes and examples would be helpful too.

## The Underlying Problem

### Defining the Problem
It is easily possible for a user to fat-finger, accidentally typing a standard letter instead of the spacebar or similar when the latter is intended. For languages using standard whitespace-based wordbreaking, this implies that the word boundaries seen in the context should not be considered absolute; we should model cases where the word-boundaries land elsewhere due to fat-finger effects. Additionally, we have standing plans to support dictionary-based wordbreaking for languages that do not utilize whitespaces between words - this adds an extra case in which word-boundaries cannot be considered absolute.
Contributor

... do not utilize whitespaces between words (e.g. Khmer)

Keyman keyboard rules further complicate matters. They do not need to consider side-effects for predictive-text, and it's easily possible for a rule to output text changes that affect (or even _effect_) multiple text tokens within the context.
Contributor

Probably better not to use contractions (it's) to help non-English speakers

- it alters the end of the word currently at the end of context
- it also adds a whitespace token

There also exist keyboards like `khmer_angkor` that may perform character reordering, performing significant left-deletions and insertions in a single keystroke. Furthermore, there's little saying that a keyboard can't be written that deletes a full grapheme cluster, rather than an individual key - a process that would add multiple left-deletions without any insertions.
Contributor

What is a 'full grapheme cluster'?

- Also of note: if left-deleting, it is possible for a left-deletion to erase the token adjacent to the text insertion point.

3. When constructing and applying `Suggestion`s, it helps greatly to determine which `SearchSpace` led to it.
- This allows us to determine _which_ keystrokes are being replaced, as well as _what_ parts of the Context will be affected.
Contributor

Context -> context (unless Context is a class)


For example, consider a case with two keystrokes, each of which has versions emitting insert strings of one and two characters. Taking two chars from one and one char from the other will result in a `SearchSpace` that models a total of two keystrokes that fully covers the two keys.

For such cases, any future keystrokes can extend both input sequences in the same manner. While the actual correction-text may differ, the net effect it has on the properties of a token necessary for correction and construction of suggestions is identical. The `SearchCluster` variant of `SearchSpace` exists for such cases, modeling the convergence of multiple `SearchPath`s and extending all of them together at once.
Contributor

The idea of the extending of a SearchSpace has yet to be explained.

Contributor Author

@jahorton jahorton Nov 27, 2025

If you recall our discussion over Zoom, the correction-search process heavily utilizes dynamic programming principles.

So, for an alternate wording... "extending a SearchSpace" is the same as "treat the existing SearchSpace as the (dynamic-programming) subproblem for new SearchSpace(s)". Each uses all paths from the original, pre-extended SearchSpace as valid prefixes, extending them by one step: by at least one input from a newly-input keystroke.

As the diverging subspace behaviors can reconverge to the same net behavior, it is possible for certain steps to have "overlapping subproblems"; we have a scenario that truly qualifies for dynamic programming. As long as no delete-lefts occur, we also have true "optimal substructure". (Note that #14366 exists in part to address the delete-left cases and handle them more optimally.)

So, we're using dynamic programming principles on a graph - both to search it (and subspaces of it) and to represent it. This graph is iteratively built when new input is received, treating old subspaces as "subproblems" that may be referred to in a dynamic-programming style. These subspaces may also be searched, which is done via "divide and conquer" (as the lower level, with specific sampled inputs, does not truly have overlapping subproblems).

I have yet to find literature describing organization of graphs in this manner, or of reuse of prior pathfinding calculations in a similar manner at runtime. If there were clear existing literature for this, or clear existing nomenclature, that would likely facilitate much clearer documentation here.
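
As a rough illustration of "extending" in the dynamic-programming sense, the sketch below treats each already-solved layer as the subproblem for the next keystroke. Everything in it is an assumption made for illustration: `WeightedTransform`, `Layer`, and `extendLayer` are hypothetical names, the `cost` values are arbitrary additive numbers (how the engine actually measures cost is not specified in this thread), and the real search tracks far more than a length and a cost.

```typescript
// Hypothetical shapes for illustration only.
interface WeightedTransform {
  insert: string;
  deleteLeft: number;
  cost: number; // assumed additive and non-negative
}

// One solved subproblem layer: best known cost for each net codepoint length.
type Layer = Map<number, number>;

// Extends a solved layer by one keystroke, reusing every prior result as a prefix.
function extendLayer(prev: Layer, keystroke: WeightedTransform[]): Layer {
  const next: Layer = new Map<number, number>();
  prev.forEach((prefixCost, prefixLength) => {
    for (const transform of keystroke) {
      // Deleting past the start of the token is out of scope for this sketch.
      if (transform.deleteLeft > prefixLength) continue;
      const insertLength = Array.from(transform.insert).length;
      const length = prefixLength - transform.deleteLeft + insertLength;
      const cost = prefixCost + transform.cost;
      const best = next.get(length);
      if (best === undefined || cost < best) next.set(length, cost);
    }
  });
  return next;
}

const start: Layer = new Map<number, number>([[0, 0]]);
const key1: WeightedTransform[] = [
  { insert: 'a', deleteLeft: 0, cost: 1 },
  { insert: 'cd', deleteLeft: 0, cost: 2 },
];
const key2: WeightedTransform[] = [
  { insert: 'e', deleteLeft: 0, cost: 1 },
  { insert: 'gh', deleteLeft: 0, cost: 3 },
];

const afterKey1 = extendLayer(start, key1);
const afterKey2 = extendLayer(afterKey1, key2);
afterKey2.forEach((cost, length) => console.log(`length ${length}: best cost ${cost}`));
```

Length 3 is reached both as `a` + `gh` and as `cd` + `e`; that reconvergence is the "overlapping subproblem" case, and only the cheaper of the two totals is kept.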

Contributor

Yen's algorithm uses the term spur or spur path to denote a partial path that is then extended towards a goal.

@jahorton
Contributor Author

jahorton commented Dec 1, 2025

After much reading and searching, I've landed on this: https://en.wikipedia.org/wiki/Modular_decomposition

The new target form for the correction-search graph aligns pretty well with what is described there (after parsing all the formalization). To break it down:

  • Upon receiving a new keystroke, a new set of edges and destination nodes for those edges is constructed. The "destination nodes" are then conceptually grouped into modules.
    • On one hand, all transforms for the current keystroke's input may also be used to define a graph module - no transitions on the search-graph will consider duplication of the keystroke.
      • 'delete' edits of the keystroke are included within this outer "module".
      • 'insert' edits that apply after the keystroke's direct effects are also included within this outer "module".
    • On the other hand, we build partitions of this module such that there is only one outbound virtual node for each, which indicates the total length of the token and which keystrokes (and portions thereof) comprise the token. Different (module) partitions of the "outer module" end at different virtual nodes.

Also of note: https://en.wikipedia.org/wiki/Quotient_graph (which is referenced by the prior link)

  • Short version: it's a graph built out of modules comprising another graph, recognizing the connectivity amongst the modules.

We can also build paths on the quotient graph, starting from the root node until the final module(s) added by the incoming keystroke, to somewhat formalize what the current SearchSpace classes are representing.
- If the final transition on the quotient graph to a single "virtual node" only passes through a single keystroke-level module, a SearchPath instance is used to represent that virtual node.
- When multiple such keystroke-level modules terminate the paths to a single "virtual node", this is represented by constructing SearchPath instances for each such keystroke-level module path, then constructing a SearchCluster from that to represent the virtual node.
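
The two rules above can be sketched as a small selection function. The types below are hypothetical stand-ins, not the engine's real classes; `InboundEdge`, `represent`, and the node labels are invented for this illustration.

```typescript
// Hypothetical stand-ins; these are not the engine's real classes.
interface InboundEdge {
  fromNode: string; // id of the parent virtual node
  label: string;    // the keystroke-level module the edge passes through
}

interface PathRep {
  kind: 'SearchPath';
  edge: InboundEdge;
}

interface ClusterRep {
  kind: 'SearchCluster';
  paths: PathRep[];
}

// Chooses the representation for one virtual node from its inbound edges.
function represent(inbound: InboundEdge[]): PathRep | ClusterRep {
  if (inbound.length === 0) {
    throw new Error('a virtual node needs at least one inbound edge');
  }
  const paths: PathRep[] = inbound.map((edge) => ({ kind: 'SearchPath', edge }));
  if (paths.length === 1) {
    return paths[0];
  }
  return { kind: 'SearchCluster', paths };
}

// One inbound keystroke-level module: a lone SearchPath is enough.
const single = represent([{ fromNode: 'one-char prefix', label: 'insert 1' }]);

// Two inbound modules reaching the same total length: one SearchPath each,
// wrapped in a SearchCluster.
const merged = represent([
  { fromNode: 'one-char prefix', label: 'insert 2' },
  { fromNode: 'two-char prefix', label: 'insert 1' },
]);

console.log(single.kind, merged.kind);
```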

Obvious remaining clarifications needed:

  • The "outer module" - the module superset of all keystroke-level modules for a single keystroke.
    • Is not really relevant outside of formalization.
  • "Keystroke-level module" - is a quotient-graph path-terminating module representing a subset of the keystroke input effects, which all target the same single "virtual node"
  • "virtual node" - yep, I keep referring back to that.
  • The quotient-graph path, which, in combination with the "virtual nodes" alluded to above, lines up well with the current SearchSpace interface and implementing types.

@keyman-server keyman-server modified the milestones: A19S17, A19S18 Dec 6, 2025
@github-actions github-actions bot added docs and removed docs labels Dec 10, 2025
@jahorton jahorton force-pushed the docs/web/add-search-space-doc branch from 4888cbf to 0a4bc0b Compare December 12, 2025 22:20
@github-actions github-actions bot added docs and removed docs labels Dec 12, 2025
@github-actions github-actions bot added docs and removed docs labels Dec 15, 2025
@jahorton
Contributor Author

jahorton commented Dec 15, 2025

After a lot of work, and some diving through formal graph-theory references online, I think I've arrived at a more precise way to document how the correction-search graph is built and operates - see the new correction-search-graph.md file. It includes a number of Mermaid-based flowcharts representing graphs, so it may help to view that file in rich-diff mode (or just view the current commit's version of the file).

Please re-review and let me know how effective the new document is at documenting the upcoming correction-search graph design. Is it clearer for you than the other document? It's certainly more verbose, and it aims to be more precise in its language. If the prior document communicated any aspect more clearly, that'd be useful to know. Thanks!

Comment on lines +725 to +729
## The `SearchCluster` type

As there are many cases where more than one parent node + edge combination may
transition to a child submodule, `SearchCluster` exists to connect all such
combinations (and their corresponding `SearchPath` representations) together.
Contributor Author

To answer a previous question, the reason I originally landed on the SearchCluster name here is that it clusters multiple SearchPath instances that all land on the same graph submodule, giving them a single common representation and extension point.

Perhaps SearchQuotientDestination might be a better name?

Contributor

SearchQuotientDestination sounds too final, when it will be extended again - the main thing is the clustering - how about SearchQuotientCluster?

Comment on lines +713 to +719
## The `SearchPath` type

The transition from one submodule to another is marked by specific edge types
corresponding to received keystrokes or to `insert` or `delete` edit operations.
Whatever the edge type is, this transition is modeled by the `SearchPath` type,
extending all `SearchNode` paths passing through it via the specified edge type
in order to reach the next submodule.
Contributor Author

To answer a previous question, I originally landed on the SearchPath name here as it represents the paths through the quotient-graph needed to reach the desired destination submodule via a single edge (path) at the last quotient-graph transition.

Clearly, I lacked the formal graph language to explain it this way at the time... and I'm not certain that this explanation is sufficiently clear and/or accessible.

Perhaps SearchQuotientEdge might be a better name?

Contributor

Because you are carrying the path info up to this point, I think SearchQuotientPath or SearchQuotientSpur (after Yen) might be better.


<!-- TODO - everything after this point. -->

## The `SearchSpace` type
Contributor Author

Per other comments in this review, perhaps SearchQuotientNode may be a better name for the type?

It does remain a valid subset of the correction-search space, but the quotient-graph and module language does give us a more precise way to document the behaviors and roles of this interface and its implementing types.

Contributor

I think SearchQuotientNode sounds better than SearchSpace

@github-actions github-actions bot added docs and removed docs labels Dec 15, 2025
of the most-likely possible input corrections when suggesting words from the
active lexical-model. To do so, it dynamically builds portions of the search
graph as needed to generate corrections to the most recent token in the context.
This token lies immediately before the caret.
Contributor

Is 'caret' here the cursor or insertion point?

> [...] we assume that each keystroke's `Transform` is 100% independent from the
`Transform` selected for every other keystroke.

Therefore, we can find the cost of selecting a correction by using a
Contributor

What is meant in this context by cost? Is it edit distance? Probability?

Comment on lines +248 to +330
### Keystroke-Based Modules

Let us start with a simplified case - one without 'insert' or 'delete' edits.
Instead, the only edges result from correcting keystrokes and matching them
against the lexicon.

For a first example, suppose we have the following scenario:
- Keystroke 1: outputs one of the following:
  - `{insert: 'a', deleteLeft: 0}`
  - `{insert: 'b', deleteLeft: 0}`
- Keystroke 2: outputs one of the following:
  - `{insert: 'c', deleteLeft: 0}`
  - `{insert: 'd', deleteLeft: 0}`
- Keystroke 3: outputs one of the following:
  - `{insert: 'e', deleteLeft: 0}`
  - `{insert: 'f', deleteLeft: 0}`

Assuming that all possible combinations are valid prefixes, correction-search's
graph would then expand as follows:

```mermaid
---
title: Low-level Correction-Search graph expansion
---
flowchart LR;
subgraph Start
start{Empty token}
end

subgraph Keystroke 1: a or b
start --> a
start --> b
end

subgraph Keystroke 2: c or d
a --> ac
a --> ad
b --> bc
b --> bd
end

subgraph Keystroke 3: e or f
ac --> ace
ac --> acf
ad --> ade
ad --> adf
bc --> bce
bc --> bcf
bd --> bde
bd --> bdf
end
```

The figure above represents a [**quotient
graph**](https://en.wikipedia.org/wiki/Quotient_graph) of the search space for
this example case.
- Note how there is a clear ordering of events and how the correction-search
process goes through exactly four nodes in this scenario - the only point of
differentiation is _which four_.
- We know correction-search will go through up to one node from each column for
any path, and _exactly_ one for any completed path.

Furthermore, each _column_ represents a [**modular
partition**](https://en.wikipedia.org/wiki/Modular_decomposition#Modular_quotients_and_factors)
of the graph.
- Each column, then, represents a graph
[**module**](https://en.wikipedia.org/wiki/Modular_decomposition#Modules)
while also being a partition of the graph.
- Note that for every node on the graph not in a module (column), each other
node _either_ connects to _all_ of that module's nodes or to _none_ of them.

The graph can thus be condensed as follows:

```mermaid
---
title: Correction-Search graph expansion - Condensed
---
flowchart LR;
start{Empty token}
start --> a(Keystroke 1: a or b)
a --> b(Keystroke 2: c or d)
b --> c(Keystroke 3: e or f)
```
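
For readers who prefer code to diagrams, the expansion for this simplified scenario can be reproduced with a few lines of TypeScript. This is only an illustrative sketch under the same assumption that every combination is a valid prefix; `keystrokes` and `columns` are hypothetical names, and no lexicon matching is performed.

```typescript
// Each keystroke lists its candidate insert strings (no deletions in this scenario).
const keystrokes: string[][] = [['a', 'b'], ['c', 'd'], ['e', 'f']];

// columns[k] holds every expanded node reachable after k keystrokes;
// columns[0] is the empty starting token.
const columns: string[][] = [['']];
for (const options of keystrokes) {
  const previous = columns[columns.length - 1];
  const next: string[] = [];
  for (const prefix of previous) {
    for (const insert of options) {
      next.push(prefix + insert);
    }
  }
  columns.push(next);
}

columns.forEach((nodes, k) => {
  console.log(`after ${k} keystroke(s):`, nodes);
});
```

Each entry of `columns` corresponds to one column (module) of the expanded figure, and any completed path picks exactly one node from each entry.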
Contributor

This section/example is very clear

Comment on lines +332 to +504
### Handling Complex Transforms

Let us now examine a case with a bit more complexity. Suppose we have the
following scenario:
- Keystroke 1: outputs one of the following:
  - `{insert: 'a', deleteLeft: 0}`
  - `{insert: 'b', deleteLeft: 0}`
  - `{insert: 'cd', deleteLeft: 0}`
- Keystroke 2: outputs one of the following:
  - `{insert: 'e', deleteLeft: 0}`
  - `{insert: 'f', deleteLeft: 0}`
  - `{insert: 'gh', deleteLeft: 0}`
- Keystroke 3: outputs one of the following:
  - `{insert: 'i', deleteLeft: 0}`
  - `{insert: 'jk', deleteLeft: 0}`
  - `{insert: 'l', deleteLeft: 1}`

Assuming that all possible combinations are valid prefixes, correction-search's
graph would then expand as follows:

```mermaid
---
title: Heterogeneous keystroke correction-search graph (expanded)
config:
  flowchart:
    curve: basis
---
flowchart LR;
subgraph Start
start{Empty token}
end

subgraph After Key 1
subgraph Codepoint length 1
start --> a
start --> b
end

subgraph Codepoint length 2
start --> cd
end
end

subgraph After Key 2
subgraph Codepoint length 2
a --> ae
a --> af
b --> be
b --> bf
end

subgraph Codepoint length 3
a --> agh
b --> bgh
cd --> cde
cd --> cdf
end

subgraph Codepoint length 4
cd --> cdgh
end
end

subgraph After Key 3
subgraph Codepoint length 2
ae ----> al
af ----> al
be ----> bl
bf ----> bl
end

subgraph Codepoint length 3
ae ----> aei
af ----> afi
be ----> bei
bf ----> bfi
agh ----> agl
bgh ----> bgl
cde ----> cdl
cdf ----> cdl
end

subgraph Codepoint length 4
ae ----> aejk
af ----> afjk
be ----> bejk
bf ----> bfjk
agh ----> aghi
bgh ----> bghi
cde ----> cdei
cdf ----> cdfi
cdgh ----> cdgl
end

subgraph Codepoint length 5
agh ----> aghjk
bgh ----> bghjk
cde ----> cdejk
cdf ----> cdfjk
cdgh ---> cdghi
end

subgraph Codepoint length 6
cdgh ----> cdghjk
end
end
```

Note that each member of the "keystroke count" set of modules (i.e., each
column) comprises one or more sets of entries of specific codepoint lengths. It
is reasonable to consider each such subset (of equal codepoint length +
processed keystroke count) as its own module.

In this graph's condensed view, we get...

```mermaid
---
title: Heterogeneous keystroke correction-search graph (condensed)
---
flowchart LR;
subgraph Start
start{Empty token}
end

subgraph After Key 1
start -- [a, b] --> K1C1(Codepoint length 1)
start -- [cd] --> K1C2(Codepoint length 2)
end

subgraph After Key 2
K1C1 -- [e, f] --> K2C2(Codepoint length 2)

K1C1 -- [gh] --> K2C3(Codepoint length 3)
K1C2 -- [e, f] --> K2C3

K1C2 -- [gh] --> K2C4(Codepoint length 4)
end

subgraph After Key 3
K2C2 -- [ -1 + l ] --> K3C2(Codepoint length 2)

K2C2 -- [i] --> K3C3(Codepoint length 3)
K2C3 -- [ -1 + l ] --> K3C3

K2C2 -- [jk] --> K3C4(Codepoint length 4)
K2C3 -- [i] --> K3C4
K2C4 -- [ -1 + l ] --> K3C4

K2C3 -- [jk] --> K3C5(Codepoint length 5)
K2C4 -- [i] --> K3C5

K2C4 -- [jk] --> K3C6(Codepoint length 6)
end
```

Note that this quotient graph has an implied modular partition, with modules for
each keystroke containing (condensed) submodules for each codepoint length
resulting from following the search path through to that node. These condensed
submodules may represent multiple different internal nodes, each reachable by
slightly different paths that all exhibit the same critical qualities: they
produce the same codepoint length with the same set of processed keystrokes.

We maintain the graph in this manner in order to properly handle left-deletions
for all cases. If any input keystrokes include left-deletion effects, it is
possible to have paths that _decrease_ the total represented codepoint length.

Of particular note: should a later left-deletion erase _all_ of the search
path's codepoint length, or worse, drive it negative, special handling is
required. (This is the specific reason that the submodules require matching
codepoint lengths.) When left-deletions exceed the currently-modeled codepoint
length, the most straightforward model is to edit and correct the text that
lands before the caret after the final left-deletion is applied.
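
One possible way to track this case, shown only as a sketch: clamp the modeled length at zero and record how many left-deletions reach past it into the preceding text. The `LengthUpdate` shape and `applyKeystroke` function are hypothetical and assume that excess deletions are simply counted rather than resolved.

```typescript
// Hypothetical result of applying one keystroke to a modeled token length.
interface LengthUpdate {
  length: number;           // modeled codepoint length after the keystroke
  excessDeleteLeft: number; // deletions that reach past the modeled text
}

// Applies left-deletions first, then the insertion, as a keystroke would.
function applyKeystroke(
  length: number,
  deleteLeft: number,
  insertLength: number
): LengthUpdate {
  // Deletions beyond the modeled length affect text before the token.
  const excessDeleteLeft = Math.max(0, deleteLeft - length);
  const remaining = Math.max(0, length - deleteLeft);
  return { length: remaining + insertLength, excessDeleteLeft };
}

// The [ -1 + l ] edge from a length-2 submodule stays at length 2.
const withinToken = applyKeystroke(2, 1, 1);

// Three left-deletions against a length-1 token spill two codepoints
// into the text that precedes it.
const pastToken = applyKeystroke(1, 3, 1);
```

A nonzero `excessDeleteLeft` marks the paths that would need the special handling described above.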

Contributor

Again, the diagrams really help wrt understanding the merging of codepoint lengths

represent the complete path taken to reach the _expanded_ graph node they
represent _and_ the node itself. As it is possible for the node to be reached
by different paths, the `.resultKey` property may be used to determine if this
has occurred at a lower path cost. Should this occur, the instance may be

Contributor

You should explain the search objective earlier.

Contributor Author

Which part of this in particular are you referring to? I want to make sure I'm on the same page in this regard.

Contributor

@markcsinclair Dec 19, 2025

You say 'to determine if this has occurred at a lower path cost. Should this occur, the instance may be discarded, as the optimal version has already been evaluated'. Your search has an objective function, a measure of success. This is often referred to as 'cost' in the document (e.g. line 109ff), but the reader is not sure how you are measuring cost. Can you define it fairly early in the document?


`SearchPath` itself _also_ implements `SearchSpace`; for cases where only a
single parent node and edge exist that may transition to a new submodule,
`SearchPath` is sufficient to module the destination submodule.

Contributor

Suggested change:

Before: `SearchPath` is sufficient to module the destination submodule.

After: `SearchPath` is sufficient to model the destination submodule.

`SearchPath` has a single parent submodule, represented by an earlier
`SearchSpace` instance, whose paths are extended by an edge representing a
single keystroke input type (all with matching insertion codepoint length and
left-deletion count) or edit operation type.

Contributor

Overall, a very useful addition to the documentation, with the examples and their representation in diagrams being particularly helpful. I found the earlier section ('Correction-Search as Graph Path-Finding') a little more difficult to follow, but if I read it again in the light of the later section this would probably help.


<!-- TODO - everything after this point. -->

## The `SearchSpace` type

Contributor

I think SearchQuotientNode sounds better than SearchSpace

Comment on lines +713 to +719
## The `SearchPath` type

The transition from one submodule to another is marked by specific edge types
corresponding to received keystrokes or to `insert` or `delete` edit operations.
Whatever the edge type is, this transition is modeled by the `SearchPath` type,
extending all `SearchNode` paths passing through it via the specified edge type
in order to reach the next submodule.

Contributor

Because you are carrying the path info up to this point, I think SearchQuotientPath or SearchQuotientSpur (after Yen) might be better.

Comment on lines +725 to +729
## The `SearchCluster` type

As there are many cases where more than one parent node + edge combination may
transition to a child submodule, `SearchCluster` exists to connect all such
combinations (and their corresponding `SearchPath` representations) together.
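
To make the relationship between the three types more concrete, here is a rough sketch of how they might fit together. Only the names `SearchSpace`, `SearchPath`, and `SearchCluster` come from this document; every property, constructor signature, and the `emptyToken` value are assumptions made for illustration, and edit-operation edges are left out entirely.

```typescript
// Assumed minimal contract: a submodule is identified by its codepoint
// length and the number of keystrokes processed to reach it.
interface SearchSpace {
  readonly codepointLength: number;
  readonly inputCount: number;
}

// Hypothetical root submodule: the empty token before any keystroke.
const emptyToken: SearchSpace = { codepointLength: 0, inputCount: 0 };

// One parent submodule extended by one keystroke edge type.
class SearchPath implements SearchSpace {
  readonly codepointLength: number;
  readonly inputCount: number;

  constructor(
    readonly parent: SearchSpace,
    insertLength: number,
    deleteLeft: number
  ) {
    this.codepointLength = parent.codepointLength - deleteLeft + insertLength;
    this.inputCount = parent.inputCount + 1;
  }
}

// Several parent + edge combinations that reach the same submodule.
class SearchCluster implements SearchSpace {
  readonly codepointLength: number;
  readonly inputCount: number;

  constructor(readonly paths: SearchPath[]) {
    const first = paths[0];
    if (!first) {
      throw new Error('A cluster needs at least one path');
    }
    this.codepointLength = first.codepointLength;
    this.inputCount = first.inputCount;
    for (const path of paths) {
      if (
        path.codepointLength !== this.codepointLength ||
        path.inputCount !== this.inputCount
      ) {
        throw new Error('All paths in a cluster must reach the same submodule');
      }
    }
  }
}

// Node K2C3 from the condensed graph: reachable from K1C1 via [gh]
// and from K1C2 via [e, f].
const k1c1 = new SearchPath(emptyToken, 1, 0);
const k1c2 = new SearchPath(emptyToken, 2, 0);
const k2c3 = new SearchCluster([
  new SearchPath(k1c1, 2, 0),
  new SearchPath(k1c2, 1, 0),
]);
```

In this sketch a node with a single incoming edge type needs only a `SearchPath`, while a node such as K2C3 needs a `SearchCluster` because two different parent and edge combinations arrive at the same codepoint length and keystroke count.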

Contributor

SearchQuotientDestination sounds too final, when it will be extended again - the main thing is the clustering - how about SearchQuotientCluster?

@keyman-server keyman-server modified the milestones: A19S18, A19S19 Dec 21, 2025
@mcdurdin mcdurdin changed the title docs(web): starts internal doc on SearchSpace design, requirements, and analysis 🚂 docs(web): start internal doc on SearchSpace design, requirements, and analysis 🚂 Dec 22, 2025