Conversation

Contributor

Copilot AI commented Oct 31, 2025

Memory files can be duplicated through symlinks or file copies, or may be near-identical in content, causing redundant content in generated prompts.

Changes

  • Symlink deduplication: Track canonical paths via filepath.EvalSymlinks() to detect files accessed through different symlinks
  • Content deduplication: Calculate SHA256 hash of raw file bytes to detect exact copies across directories
  • Similarity deduplication: Use github.com/adrg/strutil with Sorensen-Dice coefficient to detect very similar files (95%+ similarity threshold)
  • Early detection: Perform deduplication checks before parsing to avoid wasted work
  • Logging: Report skipped duplicates with reason, original file path, and similarity score (when applicable)

Implementation

// Track processed files to avoid duplicates
canonicalToOriginalPath := make(map[string]string)
contentHashToOriginalPath := make(map[string]string)
pathToContent := make(map[string]string)

// Resolve symlinks so the same file reached via different links is detected
canonicalPath, err := filepath.EvalSymlinks(path)
if err != nil {
    return err
}
if originalPath, exists := canonicalToOriginalPath[canonicalPath]; exists {
    slog.Info("Skipping duplicate (symlink)", "path", path, "original", originalPath)
    return nil
}
canonicalToOriginalPath[canonicalPath] = path

// Hash raw file content (including frontmatter) to catch exact copies
rawContent, err := os.ReadFile(path)
if err != nil {
    return err
}
contentHash := fmt.Sprintf("%x", sha256.Sum256(rawContent))
if originalPath, exists := contentHashToOriginalPath[contentHash]; exists {
    slog.Info("Skipping duplicate (identical content)", "path", path, "original", originalPath)
    return nil
}
contentHashToOriginalPath[contentHash] = path

// Check similarity using the Sorensen-Dice coefficient; similarityMetric is a
// *metrics.SorensenDice from github.com/adrg/strutil/metrics (bigram-based)
rawContentStr := string(rawContent)
for processedPath, processedRawContent := range pathToContent {
    similarity := strutil.Similarity(rawContentStr, processedRawContent, similarityMetric)
    if similarity >= 0.95 {
        slog.Info("Skipping duplicate (similar content)",
            "path", path, "original", processedPath, "similarity", similarity)
        return nil
    }
}
pathToContent[path] = rawContentStr

Files with different frontmatter but identical body content are treated as distinct, since both the hash and the similarity checks operate on raw file content, frontmatter included.

Similarity Detection

The similarity-based deduplication uses:

  • Sorensen-Dice coefficient with bigrams (2-character n-grams) for text comparison
  • 95% threshold to catch very similar files while avoiding false positives
  • Raw content comparison including frontmatter to preserve files with different metadata

This catches cases where files have minor wording differences (e.g., "Contact the team if you have any issues" vs "Contact the team if you encounter any problems") that would otherwise result in redundant content.
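The bigram-based coefficient the library computes can be sketched in plain Go. This is a minimal illustration, not the strutil API: the helper names `bigrams` and `sorensenDice` are hypothetical, and strutil's `metrics.SorensenDice` should be used in practice.

```go
package main

import "fmt"

// bigrams counts the 2-character n-grams in s.
func bigrams(s string) map[string]int {
	m := make(map[string]int)
	r := []rune(s)
	for i := 0; i+1 < len(r); i++ {
		m[string(r[i:i+2])]++
	}
	return m
}

// sorensenDice returns 2*|A∩B| / (|A|+|B|) over the bigram multisets of a and b.
func sorensenDice(a, b string) float64 {
	ba, bb := bigrams(a), bigrams(b)
	total, common := 0, 0
	for _, n := range ba {
		total += n
	}
	for g, n := range bb {
		total += n
		if m, ok := ba[g]; ok && m < n {
			common += m
		} else if ok {
			common += n
		}
	}
	if total == 0 {
		return 1 // strings too short to form bigrams are treated as identical
	}
	return 2 * float64(common) / float64(total)
}

func main() {
	// "night" and "nacht" share only the bigram "ht" out of 8 total: 2*1/8.
	fmt.Printf("%.2f\n", sorensenDice("night", "nacht")) // → 0.25
}
```

A score of 1.0 means every bigram is shared; the 95% threshold above therefore tolerates only a handful of differing bigrams between two memory files.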

Original prompt

Sometimes memory files will be duplicated (e.g. a symlink or copy). Sometimes they might be very similar. Research some libraries to de-dupe memory files.



Copilot AI and others added 2 commits October 31, 2025 22:50
…ection

Co-authored-by: alexec <1142830+alexec@users.noreply.github.com>
…e content

Co-authored-by: alexec <1142830+alexec@users.noreply.github.com>
Copilot AI changed the title [WIP] Research libraries for de-duplicating memory files Deduplicate memory files based on symlinks and content hash Oct 31, 2025
Copilot AI requested a review from alexec October 31, 2025 22:55
Contributor

@alexec alexec left a comment

@copilot Use https://github.com/adrg/strutil to dedupe based on similarity to existing memories.

Co-authored-by: alexec <1142830+alexec@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 31, 2025

@copilot Use https://github.com/adrg/strutil to dedupe based on similarity to existing memories.

Done! Added similarity-based deduplication using github.com/adrg/strutil with the Sorensen-Dice coefficient.

Implementation:

  • Uses bigram-based Sorensen-Dice similarity metric
  • 95% similarity threshold to catch very similar files while avoiding false positives
  • Compares raw file content (including frontmatter) to preserve files with different metadata
  • Logs similarity score when files are skipped

Example:
Two files with minor wording differences ("Contact the team if you have any issues" vs "Contact the team if you encounter any problems") are detected as 95% similar and deduplicated.

Commit: d5a709b

Copilot AI changed the title Deduplicate memory files based on symlinks and content hash Deduplicate memory files based on symlinks, content hash, and similarity Oct 31, 2025
Copilot AI requested a review from alexec October 31, 2025 23:11
@alexec alexec marked this pull request as ready for review October 31, 2025 23:19
@alexec alexec marked this pull request as draft October 31, 2025 23:20
@alexec
Copy link
Contributor

alexec commented Oct 31, 2025

Note to self: not sure this is a real problem.

@alexec alexec closed this Oct 31, 2025
@alexec alexec deleted the copilot/research-memory-file-deduplication branch November 1, 2025 04:59