Conversation

Contributor

Copilot AI commented Oct 31, 2025

Memory files can be duplicated through symlinks or file copies, or may be near-identical in content, causing redundant content in generated prompts.

Changes

  • Symlink deduplication: Track canonical paths via filepath.EvalSymlinks() to detect files accessed through different symlinks
  • Content deduplication: Calculate SHA256 hash of raw file bytes to detect exact copies across directories
  • Similarity deduplication: Use github.com/adrg/strutil with Sorensen-Dice coefficient to detect very similar files (95%+ similarity threshold)
  • Early detection: Perform deduplication checks before parsing to avoid wasted work
  • Logging: Report skipped duplicates with reason, original file path, and similarity score (when applicable)

Implementation

// Track processed files to avoid duplicates
canonicalToOriginalPath := make(map[string]string)
contentHashToOriginalPath := make(map[string]string)
pathToContent := make(map[string]string)

// Resolve symlinks so the same file reached via different links is detected
canonicalPath, err := filepath.EvalSymlinks(path)
if err != nil {
    return err
}
if originalPath, exists := canonicalToOriginalPath[canonicalPath]; exists {
    slog.Info("Skipping duplicate (symlink)", "path", path, "original", originalPath)
    return nil
}
canonicalToOriginalPath[canonicalPath] = path

// Hash raw file content (including frontmatter) to catch exact copies
rawContent, err := os.ReadFile(path)
if err != nil {
    return err
}
contentHash := fmt.Sprintf("%x", sha256.Sum256(rawContent))
if originalPath, exists := contentHashToOriginalPath[contentHash]; exists {
    slog.Info("Skipping duplicate (identical content)", "path", path, "original", originalPath)
    return nil
}
contentHashToOriginalPath[contentHash] = path

// Check similarity using the Sorensen-Dice coefficient; similarityMetric is a
// *metrics.SorensenDice from github.com/adrg/strutil/metrics (bigram-based)
rawContentStr := string(rawContent)
for processedPath, processedRawContent := range pathToContent {
    similarity := strutil.Similarity(rawContentStr, processedRawContent, similarityMetric)
    if similarity >= 0.95 {
        slog.Info("Skipping duplicate (similar content)",
            "path", path, "original", processedPath, "similarity", similarity)
        return nil
    }
}
pathToContent[path] = rawContentStr

Files with different frontmatter but identical body content are treated as distinct, since both the hash and the similarity checks operate on raw file content, frontmatter included.

Similarity Detection

The similarity-based deduplication uses:

  • Sorensen-Dice coefficient with bigrams (2-character n-grams) for text comparison
  • 95% threshold to catch very similar files while avoiding false positives
  • Raw content comparison including frontmatter to preserve files with different metadata

This catches cases where files have minor wording differences (e.g., "Contact the team if you have any issues" vs "Contact the team if you encounter any problems") that would otherwise result in redundant content.
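The bigram-based coefficient the library computes can be sketched in plain Go. This is a minimal illustration, not the strutil API: the helper names `bigrams` and `sorensenDice` are hypothetical, and strutil's `metrics.SorensenDice` should be used in practice.

```go
package main

import "fmt"

// bigrams counts the 2-character n-grams in s.
func bigrams(s string) map[string]int {
	m := make(map[string]int)
	r := []rune(s)
	for i := 0; i+1 < len(r); i++ {
		m[string(r[i:i+2])]++
	}
	return m
}

// sorensenDice returns 2*|A∩B| / (|A|+|B|) over the bigram multisets of a and b.
func sorensenDice(a, b string) float64 {
	ba, bb := bigrams(a), bigrams(b)
	total, common := 0, 0
	for _, n := range ba {
		total += n
	}
	for g, n := range bb {
		total += n
		if m, ok := ba[g]; ok && m < n {
			common += m
		} else if ok {
			common += n
		}
	}
	if total == 0 {
		return 1 // strings too short to form bigrams are treated as identical
	}
	return 2 * float64(common) / float64(total)
}

func main() {
	// "night" and "nacht" share only the bigram "ht" out of 8 total: 2*1/8.
	fmt.Printf("%.2f\n", sorensenDice("night", "nacht")) // → 0.25
}
```

A score of 1.0 means every bigram is shared; the 95% threshold above therefore tolerates only a handful of differing bigrams between two memory files.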

Original prompt

Sometimes memory files will be duplicated (e.g. a symlink or copy). Sometimes they might be very similar. Research some libraries to de-dupe memory files.



Copilot AI and others added 2 commits October 31, 2025 22:50
…ection

Co-authored-by: alexec <1142830+alexec@users.noreply.github.com>
…e content

Co-authored-by: alexec <1142830+alexec@users.noreply.github.com>
Copilot AI changed the title [WIP] Research libraries for de-duplicating memory files Deduplicate memory files based on symlinks and content hash Oct 31, 2025
Copilot AI requested a review from alexec October 31, 2025 22:55
Contributor

@alexec alexec left a comment

@copilot Use https://github.com/adrg/strutil to dedupe based on similarity to existing memories.

Co-authored-by: alexec <1142830+alexec@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 31, 2025

@copilot Use https://github.com/adrg/strutil to dedupe based on similarity to existing memories.

Done! Added similarity-based deduplication using github.com/adrg/strutil with the Sorensen-Dice coefficient.

Implementation:

  • Uses bigram-based Sorensen-Dice similarity metric
  • 95% similarity threshold to catch very similar files while avoiding false positives
  • Compares raw file content (including frontmatter) to preserve files with different metadata
  • Logs similarity score when files are skipped

Example:
Two files with minor wording differences ("Contact the team if you have any issues" vs "Contact the team if you encounter any problems") are detected as 95% similar and deduplicated.

Commit: d5a709b

Copilot AI changed the title Deduplicate memory files based on symlinks and content hash Deduplicate memory files based on symlinks, content hash, and similarity Oct 31, 2025
Copilot AI requested a review from alexec October 31, 2025 23:11
@alexec alexec marked this pull request as ready for review October 31, 2025 23:19
@alexec alexec marked this pull request as draft October 31, 2025 23:20
@alexec
Copy link
Contributor

alexec commented Oct 31, 2025

Note to self: not sure this is a real problem.

@alexec alexec closed this Oct 31, 2025
@alexec alexec deleted the copilot/research-memory-file-deduplication branch November 1, 2025 04:59