Skip to content

Conversation

@greysquirr3l
Copy link

Summary

Fixes runtime errors in EmbeddingsGrouper when tool embeddings cache contains invalid data.

Problem

The extension was crashing with two related errors in the embeddings system:

  1. TypeError: Cannot read properties of undefined (reading 'reduce') in normalizeVector
  2. TypeError: embeddings[0] is not iterable in computeCentroid

These occur when the tool embeddings cache contains undefined, null, or non-array values.

Solution

  • Added null/undefined guards in normalizeVector method
  • Added validation in computeCentroid to filter out invalid embeddings
  • Handle edge cases with mismatched dimensions and NaN values
  • Ensures tool grouping continues to work gracefully even with corrupted embeddings

Testing

  • Code compiles without errors
  • Unit tests pass
  • Verified fixes prevent the specific runtime errors

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Fixes console errors that prevent proper tool grouping functionality.

@greysquirr3l
Copy link
Author

@microsoft-github-policy-service agree

Copilot finished reviewing on behalf of greysquirr3l December 4, 2025 05:45
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds defensive guards to prevent runtime crashes in EmbeddingsGrouper when the tool embeddings cache contains corrupted or invalid data (null, undefined, or non-array values). The changes add validation in computeCentroid to filter invalid embeddings and handle dimension mismatches, plus guards in normalizeVector for null/undefined inputs.

Key Changes:

  • Added null/undefined validation guards in normalizeVector method
  • Added filtering of invalid embeddings in computeCentroid before processing
  • Added handling for embeddings with mismatched dimensions and NaN values

Comment on lines 444 to 445
const embeddingLength = Math.min(embedding.length, dimensions);
for (let i = 0; i < embeddingLength; i++) {
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dimension mismatch handling creates a subtle bug. When embeddings have mismatched dimensions, this logic sums only up to embeddingLength but then divides by validEmbeddings.length for all dimensions at line 452. This means dimensions beyond the shorter embedding won't be divided correctly, resulting in incorrect centroid values.

For example, if you have embeddings of length 4 and 2, dimensions [2] and [3] will only be summed from the first embedding but divided by 2, incorrectly halving those values.

Consider either:

  1. Filtering out embeddings with mismatched dimensions in the initial validation (line 428)
  2. Tracking per-dimension counts for the division step

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Fixed - Replaced the Math.min logic with proper dimension consistency filtering. Now all embeddings are filtered to have the same dimensions as the first valid embedding before processing, eliminating the division mismatch issue you identified.

if (embeddings.length === 1) {
return [...embeddings[0]]; // Copy to avoid mutations
// Filter out invalid embeddings
const validEmbeddings = embeddings.filter(embedding => embedding && Array.isArray(embedding) && embedding.length > 0);
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new validation logic for filtering invalid embeddings (null, undefined, empty arrays) lacks test coverage. Given that these changes fix runtime errors mentioned in the PR description, adding tests would prevent regressions.

Consider adding tests for:

  1. computeCentroid with null/undefined embeddings in the array
  2. computeCentroid with embeddings containing NaN values
  3. computeCentroid with embeddings of mismatched dimensions
  4. normalizeVector with null/undefined vectors

These can be added to src/platform/embeddings/test/node/embeddingsGrouper.spec.ts.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about test coverage. While this PR focuses on the immediate runtime crash fix, I agree that comprehensive tests would be valuable. The existing test file at src/platform/embeddings/test/node/embeddingsGrouper.spec.ts could be extended with the edge cases you mentioned (null/undefined embeddings, NaN values, dimension mismatches). Would you prefer I add those tests in this PR or as a follow-up?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied GitHub Copilot feedback suggestions and added comprehensive test coverage

Mathematical Robustness Improvements:

  • Enhanced NaN handling: Improved computeCentroid() and normalizeVector() with comprehensive NaN protection
  • Strengthened zero vector handling: Better division-by-zero prevention in normalization
  • Early validation: Added defensive filtering in addNode() and addNodes() methods to catch invalid embeddings before processing

Comprehensive Test Coverage:

Added 5 new robustness tests covering all edge cases mentioned in the Copilot review:

  • NaN values in embeddings (graceful handling without crashes)
  • Zero vector handling (no division by zero errors)
  • Dimension mismatches (proper validation and clustering)
  • Null/undefined embeddings (early filtering)
  • Empty embedding arrays (defensive validation)

Validation Results:

  • 35/35 tests pass
  • 0 TypeScript compilation errors
  • Extension builds and packages successfully

The embeddings system is now more robust with comprehensive defensive programming practices, as requested by the Copilot code review. All mathematical edge cases are properly handled without breaking existing functionality.

@sandy081 sandy081 assigned connor4312 and unassigned sandy081 Dec 4, 2025
@connor4312
Copy link
Member

We should validate embeddings when we receive them from the service not inside the grouper

greysquirr3l and others added 6 commits December 4, 2025 17:37
- Add null/undefined guards in normalizeVector to prevent 'Cannot read properties of undefined (reading 'reduce')' errors
- Add validation in computeCentroid to filter out invalid embeddings and prevent 'embeddings[0] is not iterable' errors
- Handle edge cases with mismatched dimensions and NaN values in embedding vectors
- Ensures tool grouping continues to work even with corrupted or missing embeddings

Fixes runtime errors in EmbeddingsGrouper when tool embeddings cache contains invalid data.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Fix || 0 replacing legitimate zero values with proper type/NaN checks
- Fix dimension mismatch causing incorrect centroid division
- Add comprehensive NaN handling in normalizeVector magnitude calculation
- Ensure all embeddings have consistent dimensions before processing

Addresses Copilot feedback on mathematical correctness and edge cases.
- Improve NaN handling in computeCentroid() and normalizeVector() methods
- Add early validation in addNode() and addNodes() with defensive filtering
- Strengthen zero vector handling to prevent division by zero errors
- Add 5 comprehensive robustness tests covering all edge cases:
  * NaN values in embeddings
  * Zero vector handling
  * Dimension mismatches
  * Null/undefined embeddings
  * Empty embedding arrays
- All 35 tests pass, validating mathematical correctness

Addresses feedback from GitHub Copilot code review for improved
mathematical stability and defensive programming practices.
- Add null checks in embeddingsGrouper.ts for undefined/null groupedEmbeddings
- Update embeddingsComputer.ts to handle null embeddings values
- Enhance remoteEmbeddingsComputer.ts with proper null handling
- Ensures robust handling of edge cases in embeddings processing
@greysquirr3l greysquirr3l force-pushed the fix/embeddings-grouper-null-handling branch from 9af9354 to aae3bc2 Compare December 4, 2025 22:38
Per architectural feedback, validation should occur at service boundaries
rather than in business logic. The service boundary validation in
remoteEmbeddingsComputer.ts provides the necessary robustness.

- Simplified computeCentroid() to match upstream clean implementation
- Removed null checks from normalizeVector()
- Maintains enhanced isValidEmbedding() with NaN/infinity detection
- Keeps service boundary validation with logging
@greysquirr3l
Copy link
Author

You're absolutely correct about the architectural principle. I've updated the PR to remove the defensive validation from the embeddings grouper and rely solely on service boundary validation.

Changes made:

✅ Removed defensive null checks from computeCentroid() and normalizeVector() in the grouper
✅ Kept service boundary validation in remoteEmbeddingsComputer.ts (both GitHub API and CAPI endpoints)
✅ Maintained enhanced isValidEmbedding() function with NaN/infinity detection
Architecture now follows proper boundaries:

Service layer validates embeddings when received from external APIs
Business logic (grouper) assumes valid data and focuses on its core responsibility
Enhanced validation catches mathematical edge cases (NaN/infinity) that basic validation missed
The PR now provides clean architectural separation while still adding the robustness improvements for mathematical edge cases and logging when external services return invalid data.

Key value retained:

Enhanced validation catches NaN/infinity values from external services
Logging helps debug when external APIs return corrupted data
Service boundary validation follows proper architectural patterns

@greysquirr3l
Copy link
Author

I see now this was a duplicate of microsoft/vscode#277290. Sorry about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants