|
| 1 | +# IRIS iFind Fuzzy Matching Research |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +IRIS **does support true Levenshtein-based fuzzy matching** through **iFind (InterSystems SQL Search)**. However, it requires an iFind index on the searched column and the `%FIND` predicate; a plain SQL `LIKE` query cannot provide it. |
| 6 | + |
| 7 | +## Key Findings |
| 8 | + |
| 9 | +### 1. iFind Fuzzy Matching Syntax |
| 10 | + |
| 11 | +**Current iFind Syntax (IRIS 2024.x)**: |
| 12 | +```sql |
| 13 | +-- Traditional syntax (requires index name) |
| 14 | +SELECT * FROM RAG.Entities |
| 15 | +WHERE %ID %FIND search_index(entity_name_idx, 'scott', 3) |
| 16 | + |
| 17 | +-- 3 = fuzzy search option with default edit distance threshold |
| 18 | +``` |
| 19 | + |
| 20 | +**Proposed New Syntax (Future IRIS versions)**: |
| 21 | +```sql |
| 22 | +-- Cleaner syntax (uses field name directly) |
| 23 | +SELECT * FROM RAG.Entities |
| 24 | +WHERE entity_name MATCHES fuzzy('scott', 2) -- max edit distance of 2 |
| 25 | +``` |
| 26 | + |
| 27 | +### 2. Requirements for iFind Fuzzy Matching |
| 28 | + |
| 29 | +To use iFind fuzzy matching, you need: |
| 30 | + |
| 31 | +1. **Create iFind Index** on the column: |
| 32 | + ```sql |
| 33 | + CREATE INDEX entity_name_idx ON RAG.Entities (entity_name) |
| 34 | + AS %iFind.Index.Basic |
| 35 | + ``` |
| 36 | + |
| 37 | +2. **Use %FIND or MATCHES operator** (not LIKE): |
| 38 | + ```sql |
| 39 | + -- Current syntax |
| 40 | + WHERE %ID %FIND search_index(entity_name_idx, 'search_term', 3) |
| 41 | + |
| 42 | + -- Future syntax |
| 43 | + WHERE entity_name MATCHES fuzzy('search_term', 2) |
| 44 | + ``` |
| 45 | + |
| 46 | +3. **iFind Index Types**: |
| 47 | + - `%iFind.Index.Minimal` - Word and word-phrase search; no co-occurrence search or highlighting |
| 48 | + - `%iFind.Index.Basic` - Adds co-occurrence search, positional phrase search, and highlighting |
| 49 | + - `%iFind.Index.Semantic` - Adds search on NLP (iKnow) entities |
| 50 | + - `%iFind.Index.Analytic` - Adds path, proximity, and dominance information |
| 51 | + |
| 52 | +### 3. Search Options |
| 53 | + |
| 54 | +The traditional `search_index()` function takes a search option as its third argument: |
| 55 | + |
| 56 | +- `0` - Syntax search with no transformations (default) |
| 57 | +- `1` - Stemmed search |
| 58 | +- `2` - Decompounding and stemming |
| 59 | +- `3` - **Fuzzy search** (Levenshtein distance-based; `'3:n'` sets the maximum edit distance to `n`) |
| 60 | +- `4` - Regular expression search |
| 61 | + |
| 62 | +### 4. Edit Distance Thresholds |
| 63 | + |
| 64 | +Fuzzy matching uses **Levenshtein edit distance** with configurable thresholds: |
| 65 | + |
| 66 | +- **Default threshold**: 2 (allows up to 2 single-character substitutions, insertions, or deletions) |
| 67 | +- **Range**: 1-3 (higher values = more permissive matching) |
| 68 | +- **Example**: `fuzzy('banona', 2)` matches "banana" (1 substitution) |
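To make the threshold concrete, here is a minimal pure-Python sketch of Levenshtein distance. It only illustrates the metric and is not how iFind computes it internally; the helper names `levenshtein` and `within_threshold` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between a and b."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        prev = curr
    return prev[-1]


def within_threshold(query: str, candidate: str, max_distance: int = 2) -> bool:
    """Case-insensitive check that candidate is within max_distance edits of query."""
    return levenshtein(query.lower(), candidate.lower()) <= max_distance
```

With a threshold of 2, `within_threshold("banona", "banana")` holds because the two strings differ by a single substitution.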
| 69 | + |
| 70 | +## Current Implementation Status |
| 71 | + |
| 72 | +### What We Have: SQL LIKE-based Substring Matching |
| 73 | + |
| 74 | +**File**: `iris_vector_rag/services/storage.py:526-631` |
| 75 | + |
| 76 | +```python |
| 77 | +def search_entities( |
| 78 | +    self, |
| 79 | +    query: str, |
| 80 | +    fuzzy: bool = True, |
| 81 | +    edit_distance_threshold: int = 2, |
| 82 | +    max_results: int = 10, |
| 83 | +    entity_types: Optional[List[str]] = None, |
| 84 | +    min_confidence: float = 0.0 |
| 85 | +) -> List[Dict[str, Any]]: |
| 86 | +    """ |
| 87 | +    Search for entities by name with substring matching. |
| 88 | + |
| 89 | +    NOTE: Despite the 'fuzzy' parameter name, this uses SQL LIKE |
| 90 | +    for case-insensitive substring matching, NOT true Levenshtein |
| 91 | +    edit distance matching. |
| 92 | +    """ |
| 93 | +    sql = f""" |
| 94 | +        SELECT entity_id, entity_name, entity_type, source_doc_id, description, confidence |
| 95 | +        FROM {self.entities_table} |
| 96 | +        WHERE LOWER(entity_name) LIKE LOWER(?) |
| 97 | +    """ |
| 98 | +    params = [f"%{query}%"] |
| 99 | +    # ... |
| 100 | +``` |
| 101 | + |
| 102 | +**Capabilities**: |
| 103 | +- ✅ Case-insensitive matching |
| 104 | +- ✅ Substring matching ("Scott" matches "Scott Derrickson") |
| 105 | +- ❌ NO edit distance tolerance ("Scoot" does NOT match "Scott"; note that "Scot" does match, because it is a substring of "Scott") |
| 106 | +- ❌ NO typo tolerance ("banona" does NOT match "banana") |
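The substring behavior listed above can be reproduced with a small sketch. It runs the same `LOWER(entity_name) LIKE LOWER(?)` predicate against an in-memory SQLite table, which is an assumption made purely so the example is runnable; the real query targets IRIS. The `like_search` helper and the sample names are hypothetical.

```python
import sqlite3


def like_search(names, query):
    """Return the names that contain query as a case-insensitive substring."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for IRIS here
    try:
        conn.execute("CREATE TABLE entities (entity_name TEXT)")
        conn.executemany("INSERT INTO entities VALUES (?)", [(n,) for n in names])
        rows = conn.execute(
            "SELECT entity_name FROM entities "
            "WHERE LOWER(entity_name) LIKE LOWER(?) "
            "ORDER BY entity_name",
            (f"%{query}%",),  # same wildcard wrapping as the current implementation
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()


names = ["Scott Derrickson", "banana", "Ed Wood"]
```

Here `like_search(names, "scott")` finds "Scott Derrickson", while a misspelled query such as "banona" finds nothing. Note that `%` or `_` inside the query would act as wildcards, because this sketch does not escape them.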
| 107 | + |
| 108 | +### What We Need: iFind Fuzzy Matching |
| 109 | + |
| 110 | +To implement **true fuzzy matching** with Levenshtein edit distance, we need: |
| 111 | + |
| 112 | +1. **Create iFind index** on `RAG.Entities.entity_name` |
| 113 | +2. **Update SQL query** to use `%FIND` or `MATCHES` operator |
| 114 | +3. **Apply the edit distance threshold** (the `edit_distance_threshold` parameter is currently ignored) |
| 115 | + |
| 116 | +## Implementation Options |
| 117 | + |
| 118 | +### Option 1: Require iFind Index (Recommended) |
| 119 | + |
| 120 | +**Pros**: |
| 121 | +- True Levenshtein-based fuzzy matching |
| 122 | +- Handles typos, misspellings, variations |
| 123 | +- Optimized performance with index |
| 124 | + |
| 125 | +**Cons**: |
| 126 | +- Requires schema change (create iFind index) |
| 127 | +- Requires SchemaManager updates |
| 128 | +- More complex deployment |
| 129 | + |
| 130 | +**Implementation**: |
| 131 | +```python |
| 132 | +def search_entities_ifind(self, query: str, edit_distance: int = 2, ...): |
| 133 | +    """Search entities using iFind fuzzy matching.""" |
| 134 | +    sql = f""" |
| 135 | +        SELECT entity_id, entity_name, entity_type, source_doc_id, description, confidence |
| 136 | +        FROM {self.entities_table} |
| 137 | +        WHERE %ID %FIND search_index(entity_name_idx, ?, 3) |
| 138 | +    """ |
| 139 | +    # 3 = fuzzy search option (default maximum edit distance of 2) |
| 140 | +    # Per the iFind docs, a '3:n' option string sets a different maximum edit distance |
| 141 | +    cursor.execute(sql, [query]) |
| 142 | +``` |
| 143 | + |
| 144 | +### Option 2: Hybrid Approach (Current + Future) |
| 145 | + |
| 146 | +**Pros**: |
| 147 | +- Works today without schema changes (LIKE-based) |
| 148 | +- Upgrade path to iFind when available |
| 149 | +- Graceful degradation |
| 150 | + |
| 151 | +**Cons**: |
| 152 | +- Two codepaths to maintain |
| 153 | +- Confusing API (claims "fuzzy" but delivers substring) |
| 154 | + |
| 155 | +**Implementation**: |
| 156 | +```python |
| 157 | +def search_entities(self, query: str, fuzzy: bool = True, ...): |
| 158 | +    """ |
| 159 | +    Search entities with optional fuzzy matching. |
| 160 | + |
| 161 | +    If iFind index exists: Use true Levenshtein fuzzy matching |
| 162 | +    If no iFind index: Fall back to SQL LIKE substring matching |
| 163 | +    """ |
| 164 | +    if self._has_ifind_index(): |
| 165 | +        return self._search_with_ifind(query, edit_distance_threshold) |
| 166 | +    else: |
| 167 | +        return self._search_with_like(query) |
| 168 | +``` |
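One way to soften the "confusing API" drawback is to report which matching method actually ran. The following is a minimal sketch of that idea, with the index check and both search paths injected as callables so it runs without a database. The `hybrid_search` name and its parameters are hypothetical and are not part of iris-vector-rag.

```python
from typing import Callable, Dict, List, Tuple

Results = List[Dict[str, str]]


def hybrid_search(
    query: str,
    has_ifind_index: Callable[[], bool],
    search_with_ifind: Callable[[str], Results],
    search_with_like: Callable[[str], Results],
) -> Tuple[str, Results]:
    """Return (method, results), where method names the path that was used."""
    if has_ifind_index():
        # True fuzzy matching is available.
        return "ifind", search_with_ifind(query)
    # No iFind index: degrade to case-insensitive substring matching.
    return "like", search_with_like(query)
```

Callers can then log or surface the method, so a "like" result is never mistaken for a fuzzy match.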
| 169 | + |
| 170 | +### Option 3: Keep LIKE, Document Limitations (Current) |
| 171 | + |
| 172 | +**Pros**: |
| 173 | +- No schema changes required |
| 174 | +- Simple implementation |
| 175 | +- Works today |
| 176 | + |
| 177 | +**Cons**: |
| 178 | +- NOT true fuzzy matching |
| 179 | +- Misleading API (parameter names claim "fuzzy") |
| 180 | +- Limited capability (no typo tolerance) |
| 181 | + |
| 182 | +**Status**: ✅ **Currently Implemented** |
| 183 | + |
| 184 | +## Recommendations |
| 185 | + |
| 186 | +### Short-term (iris-vector-rag 0.5.x) |
| 187 | + |
| 188 | +1. **Update docstring** to be honest about LIKE limitations: |
| 189 | + ```python |
| 190 | + """ |
| 191 | + Search for entities by name with case-insensitive substring matching. |
| 192 | + |
| 193 | + NOTE: This uses SQL LIKE for substring matching, NOT true fuzzy matching |
| 194 | + with edit distance. The 'fuzzy' and 'edit_distance_threshold' parameters |
| 195 | + are accepted for API compatibility with hipporag2 but are not implemented. |
| 196 | + |
| 197 | + Examples: |
| 198 | + - "Scott" WILL match "Scott Derrickson" ✅ |
| 199 | + - "Scoot" will NOT match "Scott" ❌ (no typo tolerance) |
| 200 | + - "banona" will NOT match "banana" ❌ (no edit distance) |
| 201 | + """ |
| 202 | + ``` |
| 203 | + |
| 204 | +2. **Add iFind tracking issue** for future implementation |
| 205 | + |
| 206 | +3. **Keep current LIKE-based implementation** (works for substring matching use case) |
| 207 | + |
| 208 | +### Long-term (iris-vector-rag 0.6.x+) |
| 209 | + |
| 210 | +1. **Add iFind index creation** to SchemaManager: |
| 211 | + ```python |
| 212 | + class SchemaManager: |
| 213 | +     def ensure_ifind_index(self, table_name: str, column_name: str): |
| 214 | +         """Create iFind index for fuzzy matching support.""" |
| 215 | +         sql = f""" |
| 216 | +             CREATE INDEX IF NOT EXISTS {table_name}_{column_name}_idx |
| 217 | +             ON {table_name} ({column_name}) |
| 218 | +             AS %iFind.Index.Basic |
| 219 | +         """ |
| 220 | +         cursor.execute(sql) |
| 221 | + ``` |
| 222 | + |
| 223 | +2. **Implement hybrid search** (Option 2 above) |
| 224 | + |
| 225 | +3. **Add configuration** to enable/disable iFind: |
| 226 | + ```yaml |
| 227 | + entity_extraction: |
| 228 | +   storage: |
| 229 | +     fuzzy_matching: |
| 230 | +       enabled: true |
| 231 | +       method: "ifind" # or "like" for substring |
| 232 | +       edit_distance: 2 |
| 233 | + ``` |
| 234 | + |
| 235 | +## References |
| 236 | + |
| 237 | +- **Confluence**: [iFind Syntax Revision](https://usconfluence.iscinternal.com/pages/viewpage.action?pageId=421659474) |
| 238 | +- **JIRA**: [DP-246668 - iFind Levenshtein Distance](https://usjira.iscinternal.com/browse/DP-246668) |
| 239 | +- **IRIS Docs**: [Using iFind](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GSQLSRCH) |
| 240 | +- **Search Options**: [iFind Search Options](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GSQLSRCH_txtsrch_select) |
| 241 | + |
| 242 | +## Related Issues |
| 243 | + |
| 244 | +- **hipporag2-pipeline Issue**: F1=0.000 score due to missing fuzzy matching |
| 245 | +- **Missing Entities**: 20 entities not found during retrieval (e.g., "Ed Wood", "Johnny Depp") |
| 246 | +- **Foreign Key Failures**: 30 orphaned relationships due to missing entity search |
| 247 | + |
| 248 | +## Next Steps |
| 249 | + |
| 250 | +1. ✅ **Document findings** (this file) |
| 251 | +2. ⏭️ **Update search_entities docstring** to clarify LIKE limitations |
| 252 | +3. ⏭️ **Create tracking issue** for iFind implementation |
| 253 | +4. ⏭️ **Test current LIKE implementation** with hipporag2 (validate it works for substring matching) |
| 254 | +5. ⏭️ **Plan iFind migration** for iris-vector-rag 0.6.0 |