docs(web): start internal doc on SearchSpace design, requirements, and analysis 🚂 #15161

jahorton · 2025-11-13T19:17:11Z

This PR aims to start an internal doc on the role of SearchSpace, SearchPath, and SearchCluster in the correction-search process.

At present, I don't claim it to be complete by any measure. But, "something" is better than "nothing" here, and this provides a chance to get some eyes on things early in order to determine what works as an explanation and what doesn't. Feedback appreciated, even while in draft mode.

Build-bot: skip
Test-bot: skip


keymanapp-test-bot · 2025-11-13T19:17:36Z

User Test Results

Test specification and instructions

User tests are not required

mcdurdin

This helps a lot in understanding the SearchSpace, SearchPath, SearchCluster types.

But I do have some questions and suggestions:

It would help to describe the shape of these types (i.e. properties, methods, and particularly how Path differs from Cluster).
Include a concrete example at the top, of a short series of key events + resulting SearchSpace types to illustrate how the types are used. This should be a common case rather than a pathological one!
Even after reading, I am not really clear why SearchCluster exists; why do SearchPaths need grouping and how does this help?
I am a little unclear on the names of the types - Space vs Path. Why is a Path an implementation of a Space?
I am unclear how a SearchPath can 'extend' a SearchSpace given a SearchSpace is just an interface without implementation? Isn't the relationship between SearchPath and SearchSpace 'implements'?
I guess a Cluster could be called a PathCluster or a PathGroup to clarify the relationship?
It seems like a large part of the reason for these types is fat fingering at word boundaries. Is that right? It's never explicitly stated, just obliquely when defining the problem.

Formatting nit: we generally wrap our .md files at 80 chars

mcdurdin · 2025-11-22T06:16:04Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+The `SearchSpace` interface exists to represent portions of the dynamically-generated graph used for correction-searching within the predictive-text engine.  As new input is received, new extensions to previous `SearchSpace`s may be created to extend the graph's reach, appending newly-received input to the context token to be corrected.  Loosely speaking, different instances of `SearchSpace` correspond to different potential tokenizations of the input and/or to different requirements for constructing and applying generated suggestions.
+
+There are two implementations of this interface:
+- `SearchPath`, which extends a `SearchSpace` by a single set of recent inputs affecting the range of represented text in the same manner.


SearchPath is a single set of recent inputs -- how do you define the boundaries of this set? Is it 1 keystroke, 10, 100? And what does 'in the same manner' mean -- what are the actual effects?

mcdurdin · 2025-11-22T06:21:36Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+<!--   - To complicate matters further, note that the letters `c`, `v`, and `n` are also close to `b`.
+    - Suppose this leads to `van errors`, `NaN errors`, etc..., but also `cannery`, `Vannessa`, etc. -->
+
+2.  Each individual `SearchSpace` should only model correction of inputs that result in tokens of the same codepoint length as each other.


This confuses me -- you talk about an individual SearchSpace but then say 'the same codepoint length as each other' -- what is the 'other' here?

mcdurdin · 2025-11-22T06:23:19Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+
+2.  It is not possible to guarantee that one keystroke will only extend a previous `SearchSpace` in one way.
+    - If the incoming keystroke produces `Transform`s that have different `insert` length without varying the left-deletion count, this _must_ result in multiple `SearchSpace`s, as the total codepoint length will vary accordingly.
+    - Also of note:  if left-deleting, it is possible for a left-deletion to erase the token adjacent to the text insertion point.


I don't understand this point -- I would assume that a left-deletion would always be deleting the token adjacent to the text insertion point?

mcdurdin · 2025-11-22T06:25:46Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+For example, consider a case with two keystrokes, each of which has versions emitting insert strings of one and two characters.  Taking two chars from one and one char from the other will result in a `SearchSpace` that models a total of two keystrokes that fully covers the two keys.
+
+For such cases, any future keystrokes can extend both input sequences in the same manner.  While the actual correction-text may differ, the net effect it has on the properties of a token necessary for correction and construction of suggestions is identical.  The `SearchCluster` variant of `SearchSpace` exists for such cases, modeling the convergence of multiple `SearchPath`s and extending all of them together at once.


This example doesn't make sense to me. I don't understand "each of which has versions emitting insert strings of one and two characters."

ermshiperete · 2025-11-25T11:01:17Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

@@ -0,0 +1,76 @@
+# The SearchSpace types


Why is it called SearchSpace/Path/Cluster ? What do we search?

markcsinclair · 2025-11-26T12:04:41Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

@@ -0,0 +1,76 @@
+# The SearchSpace types


I am not sure SearchSpace is a good name here. A search space, in my understanding, consists of a set of points (nodes, vertices) over which the search is carried out, looking for the best (or good enough) point by some metric. Consequently, the search path would consist of a series of steps (transitions) from point to point while searching for the best. Hence a search path is something in or through a search space, so an is-a relationship between SearchPath and SearchSpace doesn't seem appropriate.

I hear you here... but I'm having trouble finding better nomenclature.

SearchSpace, SearchPath, and SearchCluster all represent a higher-level "branching" of matching-behavior nodes/edges from what came before. They all "condense" groups of nodes & edges into a single behavior shared by all; the inner layer (with SearchNode, etc) handles non-condensed cases matching that behavior.

SearchPath has a single inbound sequence of such behaviors; SearchCluster supports cases where two or more inbound behaviors total to the same cumulative behavior. (The actual Transforms applied will differ, but will still exhibit the same expected total behavior.)

Any suggestions on how to express this? I started looking up graph formal languages and terms, but to no avail.

I am not sure SearchSpace is a good name here. A search space, in my understanding, consists of a set of points (nodes, vertices) over which the search is carried out, looking for the best (or good enough) point by some metric.

So, this actually is part of what SearchSpace does. Its instances do reflect a set of points (nodes, vertices) and related edges for the search. However, each instance is but a subset of the full, total SearchSpace under consideration; it's impractical to represent the total set of nodes, edges, etc for the search in a single instance without making "partitions" (or similar) of the space. (Sadly, "partition" is already an existing term for graph representation in literature, and what we're doing... isn't that.)

Before my recent work, we were fine with a single SearchSpace instance, and that instance would be mutated as new input came in. The first step then transformed it to "extend" the search space with a single input in a manner not unlike a linked list. The design assumes that it is safe to "factorize" the search space on each input - to treat each input keystroke as being independent of the others. (This is not strictly true, but it does simplify the design.) Factorizing the search in this way then lets us reuse the results for inputs 1 to N-1, then continue from that with input N to find the best results for 1 through to N.
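A rough sketch of that factorized, linked-list style of extension. This is illustrative only: `Alternative`, `ChainLink`, and the probability values are hypothetical stand-ins rather than engine types, and it enumerates every combination for clarity, which a real search would presumably avoid.

```typescript
// Hypothetical shape: one possible output of a keystroke plus its likelihood.
interface Alternative {
  text: string;
  p: number;
}

// One link per keystroke. Each link reuses the results already computed
// for inputs 1..N-1 and only combines them with the alternatives for input N.
class ChainLink {
  // All candidate sequences for inputs 1..N, most likely first.
  readonly results: Alternative[];

  constructor(previous: ChainLink | null, alternatives: Alternative[]) {
    // With no previous link, start from the empty sequence.
    const prefixes: Alternative[] = previous ? previous.results : [{ text: "", p: 1 }];
    const combined: Alternative[] = [];
    for (const prefix of prefixes) {
      for (const alt of alternatives) {
        // Independence assumption: the likelihoods simply multiply.
        combined.push({ text: prefix.text + alt.text, p: prefix.p * alt.p });
      }
    }
    combined.sort((a, b) => b.p - a.p);
    this.results = combined;
  }
}

const firstKey = new ChainLink(null, [{ text: "c", p: 0.75 }, { text: "v", p: 0.25 }]);
const secondKey = new ChainLink(firstKey, [{ text: "a", p: 0.875 }, { text: "s", p: 0.125 }]);
console.log(secondKey.results.map(r => r.text + " " + r.p));
```

The second link never revisits the first keystroke's alternatives directly; it only reads `firstKey.results`.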

To do whitespace fat-fingering adjustment, we no longer have a single linked-list representing the total possible search space. Instead, we have a diverging set of reusable intermediate searches. Sometimes, some entries may reconverge in a manner such that anything continuing the search results of one "net behavior" may continue the search results of any predecessor behavior sequence - and in the same way. We could almost say that we build a tree of potential behaviors; the only issue with that statement is that occasionally, some branches may reconverge after splitting. The search graph is certainly directed and acyclic, with a clearly-defined start node - whether or not we're looking on the level of individual possible inputs or on the "condensed" level of behaviors.

To be clear, re "behaviors":

Suppose one keystroke may produce any of the following outputs:

- a
- b
- c
- de
- fg
- hi
- delete 1 char, then emit jk
- delete 1 char, then emit lm

This condenses to 3 behaviors:

- insert 1 char (a, b, or c)
- insert 2 chars (de, fg, or hi)
- delete 1 char, then insert 2 chars (jk, lm)

The SearchPath for each of the three behaviors will search a space based on the prefix behaviors and any input keystroke that matches the SearchPath's behavior. A SearchCluster built from the three SearchPaths will search all three at once. All will be but a subset of the true overall search space for the token.
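To make the condensing step concrete, here is a minimal sketch that groups one keystroke's outputs by net effect. The `Transform` shape is a simplified stand-in with only `insert` and `deleteLeft`, and `codepointLength`, `behaviorKey`, and `condense` are hypothetical helper names, not engine code.

```typescript
// Simplified stand-in for one possible output of a keystroke.
interface Transform {
  insert: string;      // text emitted by the keystroke
  deleteLeft: number;  // characters removed before the insertion point
}

// Counts codepoints by skipping the trailing half of each surrogate pair.
function codepointLength(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit < 0xdc00 || unit > 0xdfff) {
      count++;
    }
  }
  return count;
}

// A behavior is identified by its net effect on the token, not by its text.
function behaviorKey(t: Transform): string {
  return "delete " + t.deleteLeft + ", insert " + codepointLength(t.insert);
}

// Groups outputs that share a behavior; each group would back one SearchPath.
function condense(outputs: Transform[]): Record<string, Transform[]> {
  const groups: Record<string, Transform[]> = {};
  for (const t of outputs) {
    const key = behaviorKey(t);
    (groups[key] = groups[key] || []).push(t);
  }
  return groups;
}

const outputs: Transform[] = [
  { insert: "a", deleteLeft: 0 },
  { insert: "b", deleteLeft: 0 },
  { insert: "c", deleteLeft: 0 },
  { insert: "de", deleteLeft: 0 },
  { insert: "fg", deleteLeft: 0 },
  { insert: "hi", deleteLeft: 0 },
  { insert: "jk", deleteLeft: 1 },
  { insert: "lm", deleteLeft: 1 },
];

const behaviors = condense(outputs);
console.log(Object.keys(behaviors));
```

For the eight outputs listed above, this yields the same three groups: three one-char inserts, three two-char inserts, and two delete-then-insert outputs.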

markcsinclair · 2025-11-26T15:38:14Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

@@ -0,0 +1,76 @@
+# The SearchSpace types


I think it would probably be better to introduce the classes/types a little later in the document after at least the problem definition. Some detail on the classes and examples would be helpful too.

markcsinclair · 2025-11-26T15:41:45Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+## The Underlying Problem
+
+### Defining the Problem
+It is easily possible for a user to fat-finger, accidentally typing a standard letter instead of the spacebar or similar when the latter is intended.  For languages using standard whitespace-based wordbreaking, this implies that the word boundaries seen in the context should not be considered absolute; we should model cases where the word-boundaries land elsewhere due to fat-finger effects.  Additionally, we have standing plans to support dictionary-based wordbreaking for languages that do not utilize whitespaces between words - this adds an extra case in which word-boundaries cannot be considered absolute.


... do not utilize whitespaces between words (e.g. Khmer)

markcsinclair · 2025-11-26T15:44:35Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+### Defining the Problem
+It is easily possible for a user to fat-finger, accidentally typing a standard letter instead of the spacebar or similar when the latter is intended.  For languages using standard whitespace-based wordbreaking, this implies that the word boundaries seen in the context should not be considered absolute; we should model cases where the word-boundaries land elsewhere due to fat-finger effects.  Additionally, we have standing plans to support dictionary-based wordbreaking for languages that do not utilize whitespaces between words - this adds an extra case in which word-boundaries cannot be considered absolute.
+
+Keyman keyboard rules further complicate matters.  They do not need to consider side-effects for predictive-text, and it's easily possible for a rule to output text changes that affect (or even _effect_) multiple text tokens within the context.


Probably better not to use contractions (it's) to help non-English speakers

markcsinclair · 2025-11-26T15:47:35Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+- it alters the end of the word currently at the end of context
+- it also adds a whitespace token
+
+There also exist keyboards like `khmer_angkor` that may perform character reordering, performing significant left-deletions and insertions in a single keystroke.  Furthermore, there's little saying that a keyboard can't be written that deletes a full grapheme cluster, rather than an individual key - a process that would add multiple left-deletions without any insertions.


What is a 'full grapheme cluster'?

markcsinclair · 2025-11-26T15:56:30Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+    - Also of note:  if left-deleting, it is possible for a left-deletion to erase the token adjacent to the text insertion point.
+
+3.  When constructing and applying `Suggestion`s, it helps greatly to determine which `SearchSpace` led to it.
+    - This allows us to determine _which_ keystrokes are being replaced, as well as _what_ parts of the Context will be affected.


Context -> context (unless Context is a class)

markcsinclair · 2025-11-26T16:00:11Z

web/src/engine/predictive-text/worker-thread/docs/search-spaces.md

+
+For example, consider a case with two keystrokes, each of which has versions emitting insert strings of one and two characters.  Taking two chars from one and one char from the other will result in a `SearchSpace` that models a total of two keystrokes that fully covers the two keys.
+
+For such cases, any future keystrokes can extend both input sequences in the same manner.  While the actual correction-text may differ, the net effect it has on the properties of a token necessary for correction and construction of suggestions is identical.  The `SearchCluster` variant of `SearchSpace` exists for such cases, modeling the convergence of multiple `SearchPath`s and extending all of them together at once.


The idea of the extending of a SearchSpace has yet to be explained.

If you recall our discussion over Zoom, the correction-search process heavily utilizes dynamic programming principles.

So, for an alternate wording... "extending a SearchSpace" is the same as "treat the existing SearchSpace as the (dynamic-programming) subproblem for new SearchSpace(s)". Each uses all paths from the original, pre-extended SearchSpace as valid prefixes, extending them by one step: by at least one input from a newly-input keystroke.

As the diverging subspace behaviors can reconverge to the same net behavior, it is possible for certain steps to have "overlapping subproblems"; we have a scenario that truly qualifies for dynamic programming. As long as no delete-lefts occur, we also have true "optimal substructure". (Note that #14366 exists in part to address the delete-left cases and handle them more optimally.)

So, we're using dynamic programming principles on a graph - both to search it (and subspaces of it) and to represent it. This graph is iteratively built when new input is received, treating old subspaces as "subproblems" that may be referred to in a dynamic-programming style. These subspaces may also be searched, which is done via "divide and conquer" (as the lower level, with specific sampled inputs, does not truly have overlapping subproblems).

I have yet to find literature describing organization of graphs in this manner, or of reuse of prior pathfinding calculations in a similar manner at runtime. If there were clear existing literature for this, or clear existing nomenclature, that would likely facilitate much clearer documentation here.
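A small sketch of the "overlapping subproblems" point, using the two-keystroke case from the doc (each keystroke has one-char and two-char variants). It assumes no left-deletions and that every variant consumes its whole keystroke, so net token length alone identifies a space. `InboundStep`, `SpaceSummary`, `extendAll`, and `kindOf` are hypothetical names for illustration.

```typescript
// One way of arriving at a space: which prior space it extends, and by how much.
interface InboundStep {
  parentLength: number;
  insertLength: number;
}

// A space is summarized here only by its net token length and its inbound steps.
interface SpaceSummary {
  codepointLength: number;
  inbound: InboundStep[];
}

// Treats every prior space as a subproblem and extends each one by each
// insert length; extensions with the same net length share one entry.
function extendAll(prior: SpaceSummary[], insertLengths: number[]): SpaceSummary[] {
  const result: SpaceSummary[] = [];
  for (const parent of prior) {
    for (const insertLength of insertLengths) {
      const total = parent.codepointLength + insertLength;
      let target = result.filter(s => s.codepointLength === total)[0];
      if (!target) {
        target = { codepointLength: total, inbound: [] };
        result.push(target);
      }
      target.inbound.push({ parentLength: parent.codepointLength, insertLength });
    }
  }
  return result;
}

// More than one inbound step means reconvergence: the cluster case.
function kindOf(space: SpaceSummary): string {
  return space.inbound.length > 1 ? "cluster" : "path";
}

const start: SpaceSummary = { codepointLength: 0, inbound: [] };
const afterFirst = extendAll([start], [1, 2]);
const afterSecond = extendAll(afterFirst, [1, 2]);
console.log(afterSecond.map(s => s.codepointLength + ":" + kindOf(s)));
```

After the second keystroke, length 3 is reachable both as 1+2 and as 2+1, so it is the one entry with two inbound steps; lengths 2 and 4 each have a single inbound step.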

Yen's algorithm uses the term spur or spur path to denote a partial path that is then extended towards a goal.

jahorton · 2025-12-01T19:28:57Z

After much reading and searching, I've landed on this: https://en.wikipedia.org/wiki/Modular_decomposition

The new target form for the correction-search graph aligns pretty well with what is described there (after parsing all the formalization). To break it down:

Upon receiving a new keystroke, a new set of edges and destination nodes for those edges is constructed. The "destination nodes" are then conceptually grouped into modules.
- On one hand, all transforms for the current keystroke's input may also be used to define a graph module - no transitions on the search-graph will consider duplication of the keystroke.
  - 'delete' edits of the keystroke are included within this outer "module".
  - 'insert' edits that apply after the keystroke's direct effects are also included within this outer "module".
- On the other hand, we build partitions of this module such that there is only one outbound virtual node for each, which indicates the total length of the token and which keystrokes (and portions thereof) comprise the token. Different (module) partitions of the "outer module" end at different virtual nodes.

Also of note: https://en.wikipedia.org/wiki/Quotient_graph (which is referenced by the prior link)

Short version: it's a graph built out of modules comprising another graph, recognizing the connectivity amongst the modules.

We can also build paths on the quotient graph, starting from the root node until the final module(s) added by the incoming keystroke, to somewhat formalize what the current SearchSpace classes are representing.
- If the final transition on the quotient graph to a single "virtual node" only passes through a single keystroke-level module, a SearchPath instance is used to represent that virtual node.
- When multiple such keystroke-level modules terminate the paths to a single "virtual node", this is represented by constructing SearchPath instances for each such keystroke-level module path, then constructing a SearchCluster from that to represent the virtual node.
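One possible shape for that mapping, sketched with stand-in names. `SpaceSketch`, `PathSketch`, and `ClusterSketch` are hypothetical and only loosely mirror SearchSpace, SearchPath, and SearchCluster; the members shown are assumptions, not the actual interface. Left-deletions are clamped at the token start here, which sidesteps the case where a deletion reaches into the adjacent token.

```typescript
// What every "virtual node" exposes in this sketch.
interface SpaceSketch {
  readonly codepointLength: number; // net token length
  readonly inputCount: number;      // keystrokes covered so far
}

// A single keystroke-level module extending one parent space.
class PathSketch implements SpaceSketch {
  readonly codepointLength: number;
  readonly inputCount: number;
  readonly parent: SpaceSketch;

  constructor(parent: SpaceSketch, insertLength: number, deleteLeft: number) {
    this.parent = parent;
    this.codepointLength = Math.max(0, parent.codepointLength - deleteLeft) + insertLength;
    this.inputCount = parent.inputCount + 1;
  }
}

// Several paths that arrive at the same virtual node.
class ClusterSketch implements SpaceSketch {
  readonly codepointLength: number;
  readonly inputCount: number;
  readonly paths: PathSketch[];

  constructor(paths: PathSketch[]) {
    const first = paths[0];
    if (!first) {
      throw new Error("a cluster needs at least one path");
    }
    for (const p of paths) {
      if (p.codepointLength !== first.codepointLength || p.inputCount !== first.inputCount) {
        throw new Error("clustered paths must share the same net behavior");
      }
    }
    this.paths = paths;
    this.codepointLength = first.codepointLength;
    this.inputCount = first.inputCount;
  }
}

const root: SpaceSketch = { codepointLength: 0, inputCount: 0 };
const one = new PathSketch(root, 1, 0);
const two = new PathSketch(root, 2, 0);
// 1+2 and 2+1 both reach length 3 after two keystrokes, so they are clustered.
const merged = new ClusterSketch([new PathSketch(one, 2, 0), new PathSketch(two, 1, 0)]);
// A later keystroke extends the cluster once, covering both inbound paths.
const next = new PathSketch(merged, 1, 0);
console.log(merged.codepointLength, next.codepointLength, next.inputCount);
```

In this sketch both classes implement the same interface, so a path can take either a path or a cluster as its parent without caring which one it is.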

Obvious remaining clarifications needed:

- The "outer module" - the module superset of all keystroke-level modules for a single keystroke.
  - Is not really relevant outside of formalization.
- "Keystroke-level module" - a quotient-graph path-terminating module representing a subset of the keystroke input effects, which all target the same single "virtual node".
- "Virtual node" - yep, I keep referring back to that.
- The quotient-graph path, which, in combination with the "virtual nodes" alluded to above, lines up well with the current SearchSpace interface and implementing types.


jahorton · 2025-12-15T20:57:12Z

After a lot of work, and some diving through formal graph-theory references online, I think I've arrived at a more precise way to document how the correction-search graph is built and operates - see the new correction-search-graph.md file. It includes a number of Mermaid-based flowcharts representing graphs, so it may help to view that file in rich-diff mode (or just view the current commit's version of the file).

Please re-review and let me know how effective the new document is at documenting the upcoming correction-search graph design. Is it clearer for you than the other document? It's certainly more verbose, and it aims to be more precise in its language. If the prior document communicated any aspect more clearly, that'd be useful to know. Thanks!

jahorton · 2025-12-15T20:58:55Z