Commit 38e06f1

chore: reorganize specs directory structure
1 parent c941501 commit 38e06f1

15 files changed (+506, -0 lines)

docs/IFIND_FUZZY_MATCHING.md

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
# IRIS iFind Fuzzy Matching Research

## Summary

IRIS **does support true Levenshtein-based fuzzy matching** through the **iFind (InterSystems IRIS Full-Text Search)** feature. However, it requires an iFind index on the searched column and the `%FIND` predicate, so it cannot be used from a plain SQL `LIKE` query.

## Key Findings

### 1. iFind Fuzzy Matching Syntax

**Current iFind Syntax (IRIS 2024.x)**:
```sql
-- Traditional syntax (requires index name)
SELECT * FROM RAG.Entities
WHERE %ID %FIND search_index(entity_name_idx, 'scott', 3)

-- 3 = fuzzy search option with default edit distance threshold
```

**Proposed New Syntax (Future IRIS versions)**:
```sql
-- Cleaner syntax (uses field name directly)
SELECT * FROM RAG.Entities
WHERE entity_name MATCHES fuzzy('scott', 2)  -- max edit distance of 2
```

### 2. Requirements for iFind Fuzzy Matching

To use iFind fuzzy matching, you need to:

1. **Create an iFind index** on the column (IRIS SQL attaches the index class with `AS`):
   ```sql
   CREATE INDEX entity_name_idx ON RAG.Entities (entity_name)
   AS %iFind.Index.Basic
   ```

2. **Use the `%FIND` predicate or the proposed `MATCHES` operator** (not `LIKE`):
   ```sql
   -- Current syntax
   WHERE %ID %FIND search_index(entity_name_idx, 'search_term', 3)

   -- Future syntax
   WHERE entity_name MATCHES fuzzy('search_term', 2)
   ```

3. **Choose an iFind index type** (each type extends the one before it):
   - `%iFind.Index.Minimal` - Word and phrase search with wildcards, fuzzy search, and regular expressions
   - `%iFind.Index.Basic` - Adds co-occurrence and positional phrase search, plus highlighting and ranking of results
   - `%iFind.Index.Semantic` - Adds NLP entity search
   - `%iFind.Index.Analytic` - Adds path, proximity, and dominance information

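
The index name inside `search_index()` is an SQL identifier, so it cannot be bound as a `?` parameter the way the search term can. The sketch below shows one way to build the fuzzy `%FIND` query defensively. `build_fuzzy_query` and its identifier whitelist are hypothetical helpers for illustration, not part of iris-vector-rag, and the sketch assumes only the `search_index(index, term, 3)` form shown above.

```python
import re

# Hypothetical, deliberately conservative whitelist for SQL identifiers such as
# "RAG.Entities" or "entity_name_idx": letters, digits, underscores, and at
# most one schema-qualifying dot.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")

FUZZY_OPTION = 3  # search_index() option for fuzzy search (see the syntax above)


def build_fuzzy_query(table: str, index_name: str) -> str:
    """Return a %FIND query with a single `?` placeholder for the search term."""
    for identifier in (table, index_name):
        if not _IDENTIFIER.fullmatch(identifier):
            raise ValueError(f"unsafe SQL identifier: {identifier!r}")
    return (
        f"SELECT * FROM {table} "
        f"WHERE %ID %FIND search_index({index_name}, ?, {FUZZY_OPTION})"
    )


sql = build_fuzzy_query("RAG.Entities", "entity_name_idx")
print(sql)
```

The search term itself stays a bound parameter (for example `cursor.execute(sql, ["scott"])`), so only the identifiers need this kind of check.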
### 3. Search Options

The traditional `search_index()` function takes an integer search option as its third argument:

- `0` - Regular search (no transformations; the default)
- `1` - Stemmed search
- `2` - Decompounded and stemmed search
- `3` - **Fuzzy search** (Levenshtein distance-based)
- `4` - Regular expression search

### 4. Edit Distance Thresholds

Fuzzy matching uses **Levenshtein edit distance** with a configurable threshold:

- **Default threshold**: 2 (allows up to 2 single-character substitutions, insertions, or deletions)
- **Range**: 1-3 (higher values = more permissive matching)
- **Example**: `fuzzy('banona', 2)` matches "banana" (1 substitution)

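
To make the threshold concrete, here is a minimal pure-Python sketch of the Levenshtein metric itself. It illustrates the distance calculation only; it is not how iFind evaluates its index, and `levenshtein` and `within_threshold` are illustrative names rather than IRIS or iris-vector-rag functions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # iterate the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # delete ch_a
                current[j - 1] + 1,       # insert ch_b
                previous[j - 1] + cost,   # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def within_threshold(query: str, candidate: str, threshold: int = 2) -> bool:
    """Case-insensitive check that candidate is within the edit-distance threshold."""
    return levenshtein(query.lower(), candidate.lower()) <= threshold


print(levenshtein("banona", "banana"))       # 1 (one substitution)
print(within_threshold("Skott", "Scott"))    # True
print(within_threshold("Scott", "Johnny"))   # False
```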
## Current Implementation Status

### What We Have: SQL LIKE-based Substring Matching

**File**: `iris_vector_rag/services/storage.py:526-631`

```python
def search_entities(
    self,
    query: str,
    fuzzy: bool = True,
    edit_distance_threshold: int = 2,
    max_results: int = 10,
    entity_types: Optional[List[str]] = None,
    min_confidence: float = 0.0
) -> List[Dict[str, Any]]:
    """
    Search for entities by name with substring matching.

    NOTE: Despite the 'fuzzy' parameter name, this uses SQL LIKE
    for case-insensitive substring matching, NOT true Levenshtein
    edit distance matching.
    """
    sql = f"""
        SELECT entity_id, entity_name, entity_type, source_doc_id, description, confidence
        FROM {self.entities_table}
        WHERE LOWER(entity_name) LIKE LOWER(?)
    """
    params = [f"%{query}%"]
    # ...
```

**Capabilities**:
- ✅ Case-insensitive matching
- ✅ Substring matching ("Scott" matches "Scott Derrickson")
- ❌ NO edit distance tolerance ("Skott" does NOT match "Scott"; a truncated query such as "Scot" still matches, but only because it is a substring)
- ❌ NO typo tolerance ("banona" does NOT match "banana")

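
A quick way to see what the `LIKE` predicate does and does not cover is to emulate it in plain Python. This is a sketch under the assumption that the query contains no `LIKE` wildcards (`%` or `_`) of its own; `like_substring_match` is a hypothetical helper, not the storage code.

```python
def like_substring_match(query: str, entity_name: str) -> bool:
    """Emulate LOWER(entity_name) LIKE LOWER('%' || query || '%') for plain text."""
    return query.lower() in entity_name.lower()


cases = [
    ("Scott", "Scott Derrickson"),  # query is a substring of the entity name
    ("Scot", "Scott"),              # truncated query is still a substring
    ("Skott", "Scott"),             # one substituted character
    ("banona", "banana"),           # one substituted character
]
for query, name in cases:
    print(f"{query!r} vs {name!r}: {like_substring_match(query, name)}")
```

The first two cases match and the last two do not, which is the gap that edit-distance matching is meant to close.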
### What We Need: iFind Fuzzy Matching

To implement **true fuzzy matching** with Levenshtein edit distance, we need to:

1. **Create an iFind index** on `RAG.Entities.entity_name`
2. **Update the SQL query** to use the `%FIND` predicate (or `MATCHES`, once available)
3. **Honor the edit distance threshold** (the `edit_distance_threshold` parameter is currently ignored)

## Implementation Options

### Option 1: Require iFind Index (Recommended)

**Pros**:
- True Levenshtein-based fuzzy matching
- Handles typos, misspellings, variations
- Optimized performance with index

**Cons**:
- Requires schema change (create iFind index)
- Requires SchemaManager updates
- More complex deployment

**Implementation**:
```python
def search_entities_ifind(self, query: str, edit_distance: int = 2, **kwargs):
    """Search entities using iFind fuzzy matching (remaining parameters elided)."""
    sql = f"""
        SELECT entity_id, entity_name, entity_type, source_doc_id, description, confidence
        FROM {self.entities_table}
        WHERE %ID %FIND search_index(entity_name_idx, ?, 3)
    """
    # 3 = fuzzy search option (default maximum edit distance of 2).
    # NOTE: edit_distance is not applied yet. A non-default distance is reportedly
    # passed through the search option itself (e.g. '3:1'), not through the index
    # definition; verify this against the target IRIS version.
    # `cursor` is an open DB-API cursor on the IRIS connection (setup elided).
    cursor.execute(sql, [query])
```

### Option 2: Hybrid Approach (Current + Future)

**Pros**:
- Works today without schema changes (LIKE-based)
- Upgrade path to iFind when available
- Graceful degradation

**Cons**:
- Two codepaths to maintain
- Confusing API (claims "fuzzy" but delivers substring)

**Implementation**:
```python
def search_entities(self, query: str, fuzzy: bool = True,
                    edit_distance_threshold: int = 2, **kwargs):
    """
    Search entities with optional fuzzy matching.

    If iFind index exists: Use true Levenshtein fuzzy matching
    If no iFind index: Fall back to SQL LIKE substring matching
    """
    if fuzzy and self._has_ifind_index():
        return self._search_with_ifind(query, edit_distance_threshold)
    return self._search_with_like(query)
```

### Option 3: Keep LIKE, Document Limitations (Current)

**Pros**:
- No schema changes required
- Simple implementation
- Works today

**Cons**:
- NOT true fuzzy matching
- Misleading API (parameter names claim "fuzzy")
- Limited capability (no typo tolerance)

**Status**: ✅ **Currently Implemented**

## Recommendations

### Short-term (iris-vector-rag 0.5.x)

1. **Update docstring** to be honest about LIKE limitations:
   ```python
   """
   Search for entities by name with case-insensitive substring matching.

   NOTE: This uses SQL LIKE for substring matching, NOT true fuzzy matching
   with edit distance. The 'fuzzy' and 'edit_distance_threshold' parameters
   are accepted for API compatibility with hipporag2 but are not implemented.

   Examples:
   - "Scott" WILL match "Scott Derrickson" ✅
   - "Skott" will NOT match "Scott" ❌ (no typo tolerance)
   - "banona" will NOT match "banana" ❌ (no edit distance)
   """
   ```

2. **Add iFind tracking issue** for future implementation

3. **Keep current LIKE-based implementation** (works for substring matching use case)

### Long-term (iris-vector-rag 0.6.x+)

1. **Add iFind index creation** to SchemaManager:
   ```python
   class SchemaManager:
       def ensure_ifind_index(self, table_name: str, column_name: str):
           """Create iFind index for fuzzy matching support."""
           # Drop the schema qualifier so "RAG.Entities" yields a valid index name.
           index_name = f"{table_name.split('.')[-1]}_{column_name}_idx"
           sql = f"""
               CREATE INDEX IF NOT EXISTS {index_name}
               ON {table_name} ({column_name})
               AS %iFind.Index.Basic
           """
           # `cursor` is an open DB-API cursor on the IRIS connection (setup elided).
           cursor.execute(sql)
   ```

2. **Implement hybrid search** (Option 2 above)

3. **Add configuration** to enable/disable iFind:
   ```yaml
   entity_extraction:
     storage:
       fuzzy_matching:
         enabled: true
         method: "ifind"  # or "like" for substring
         edit_distance: 2
   ```

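
One way to consume that configuration is to validate it once at load time, so a misspelled method or an out-of-range distance fails early. The sketch below assumes the YAML has already been parsed into a nested dict; `FuzzyMatchingConfig`, `load_fuzzy_config`, and the `"like"` fallback default are hypothetical, and the 1-3 bound mirrors the range given under Edit Distance Thresholds.

```python
from dataclasses import dataclass
from typing import Any, Mapping

ALLOWED_METHODS = {"ifind", "like"}


@dataclass(frozen=True)
class FuzzyMatchingConfig:
    enabled: bool = True
    method: str = "like"      # assumed default: the LIKE path that exists today
    edit_distance: int = 2

    def __post_init__(self) -> None:
        if self.method not in ALLOWED_METHODS:
            raise ValueError(f"unknown fuzzy_matching.method: {self.method!r}")
        if not 1 <= self.edit_distance <= 3:
            raise ValueError("fuzzy_matching.edit_distance must be between 1 and 3")


def load_fuzzy_config(config: Mapping[str, Any]) -> FuzzyMatchingConfig:
    """Read entity_extraction.storage.fuzzy_matching, applying defaults if absent."""
    section = (
        config.get("entity_extraction", {})
        .get("storage", {})
        .get("fuzzy_matching", {})
    )
    return FuzzyMatchingConfig(**section)


# Stand-in for the parsed YAML shown above.
parsed = {
    "entity_extraction": {
        "storage": {
            "fuzzy_matching": {"enabled": True, "method": "ifind", "edit_distance": 2}
        }
    }
}
cfg = load_fuzzy_config(parsed)
print(cfg)
```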
## References

- **Confluence**: [iFind Syntax Revision](https://usconfluence.iscinternal.com/pages/viewpage.action?pageId=421659474)
- **JIRA**: [DP-246668 - iFind Levenshtein Distance](https://usjira.iscinternal.com/browse/DP-246668)
- **IRIS Docs**: [Using iFind](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GSQLSRCH)
- **Search Options**: [iFind Search Options](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GSQLSRCH_txtsrch_select)

## Related Issues

- **hipporag2-pipeline Issue**: F1=0.000 score due to missing fuzzy matching
- **Missing Entities**: 20 entities not found during retrieval (e.g., "Ed Wood", "Johnny Depp")
- **Foreign Key Failures**: 30 orphaned relationships due to missing entity search

## Next Steps

1. ✅ **Document findings** (this file)
2. ⏭️ **Update search_entities docstring** to clarify LIKE limitations
3. ⏭️ **Create tracking issue** for iFind implementation
4. ⏭️ **Test current LIKE implementation** with hipporag2 (validate it works for substring matching)
5. ⏭️ **Plan iFind migration** for iris-vector-rag 0.6.0
