-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix embeddings grouper null/undefined handling #2383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix embeddings grouper null/undefined handling #2383
Conversation
|
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds defensive guards to prevent runtime crashes in EmbeddingsGrouper when the tool embeddings cache contains corrupted or invalid data (null, undefined, or non-array values). The changes add validation in computeCentroid to filter invalid embeddings and handle dimension mismatches, plus guards in normalizeVector for null/undefined inputs.
Key Changes:
- Added null/undefined validation guards in
normalizeVectormethod - Added filtering of invalid embeddings in
computeCentroidbefore processing - Added handling for embeddings with mismatched dimensions and NaN values
| const embeddingLength = Math.min(embedding.length, dimensions); | ||
| for (let i = 0; i < embeddingLength; i++) { |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dimension mismatch handling creates a subtle bug. When embeddings have mismatched dimensions, this logic sums only up to embeddingLength but then divides by validEmbeddings.length for all dimensions at line 452. This means dimensions beyond the shorter embedding won't be divided correctly, resulting in incorrect centroid values.
For example, if you have embeddings of length 4 and 2, dimensions [2] and [3] will only be summed from the first embedding but divided by 2, incorrectly halving those values.
Consider either:
- Filtering out embeddings with mismatched dimensions in the initial validation (line 428)
- Tracking per-dimension counts for the division step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Fixed - Replaced the Math.min logic with proper dimension consistency filtering. Now all embeddings are filtered to have the same dimensions as the first valid embedding before processing, eliminating the division mismatch issue you identified.
| if (embeddings.length === 1) { | ||
| return [...embeddings[0]]; // Copy to avoid mutations | ||
| // Filter out invalid embeddings | ||
| const validEmbeddings = embeddings.filter(embedding => embedding && Array.isArray(embedding) && embedding.length > 0); |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new validation logic for filtering invalid embeddings (null, undefined, empty arrays) lacks test coverage. Given that these changes fix runtime errors mentioned in the PR description, adding tests would prevent regressions.
Consider adding tests for:
computeCentroidwith null/undefined embeddings in the arraycomputeCentroidwith embeddings containing NaN valuescomputeCentroidwith embeddings of mismatched dimensionsnormalizeVectorwith null/undefined vectors
These can be added to src/platform/embeddings/test/node/embeddingsGrouper.spec.ts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about test coverage. While this PR focuses on the immediate runtime crash fix, I agree that comprehensive tests would be valuable. The existing test file at src/platform/embeddings/test/node/embeddingsGrouper.spec.ts could be extended with the edge cases you mentioned (null/undefined embeddings, NaN values, dimension mismatches). Would you prefer I add those tests in this PR or as a follow-up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied GitHub Copilot feedback suggestions and added comprehensive test coverage
Mathematical Robustness Improvements:
- Enhanced NaN handling: Improved
computeCentroid()andnormalizeVector()with comprehensive NaN protection - Strengthened zero vector handling: Better division-by-zero prevention in normalization
- Early validation: Added defensive filtering in
addNode()andaddNodes()methods to catch invalid embeddings before processing
Comprehensive Test Coverage:
Added 5 new robustness tests covering all edge cases mentioned in the Copilot review:
- NaN values in embeddings (graceful handling without crashes)
- Zero vector handling (no division by zero errors)
- Dimension mismatches (proper validation and clustering)
- Null/undefined embeddings (early filtering)
- Empty embedding arrays (defensive validation)
Validation Results:
- 35/35 tests pass
- 0 TypeScript compilation errors
- Extension builds and packages successfully
The embeddings system is now more robust with comprehensive defensive programming practices, as requested by the Copilot code review. All mathematical edge cases are properly handled without breaking existing functionality.
|
We should validate embeddings when we receive them from the service not inside the grouper |
- Add null/undefined guards in normalizeVector to prevent 'Cannot read properties of undefined (reading 'reduce')' errors - Add validation in computeCentroid to filter out invalid embeddings and prevent 'embeddings[0] is not iterable' errors - Handle edge cases with mismatched dimensions and NaN values in embedding vectors - Ensures tool grouping continues to work even with corrupted or missing embeddings Fixes runtime errors in EmbeddingsGrouper when tool embeddings cache contains invalid data.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Fix || 0 replacing legitimate zero values with proper type/NaN checks - Fix dimension mismatch causing incorrect centroid division - Add comprehensive NaN handling in normalizeVector magnitude calculation - Ensure all embeddings have consistent dimensions before processing Addresses Copilot feedback on mathematical correctness and edge cases.
- Improve NaN handling in computeCentroid() and normalizeVector() methods - Add early validation in addNode() and addNodes() with defensive filtering - Strengthen zero vector handling to prevent division by zero errors - Add 5 comprehensive robustness tests covering all edge cases: * NaN values in embeddings * Zero vector handling * Dimension mismatches * Null/undefined embeddings * Empty embedding arrays - All 35 tests pass, validating mathematical correctness Addresses feedback from GitHub Copilot code review for improved mathematical stability and defensive programming practices.
- Add null checks in embeddingsGrouper.ts for undefined/null groupedEmbeddings - Update embeddingsComputer.ts to handle null embeddings values - Enhance remoteEmbeddingsComputer.ts with proper null handling - Ensures robust handling of edge cases in embeddings processing
9af9354 to
aae3bc2
Compare
Per architectural feedback, validation should occur at service boundaries rather than in business logic. The service boundary validation in remoteEmbeddingsComputer.ts provides the necessary robustness. - Simplified computeCentroid() to match upstream clean implementation - Removed null checks from normalizeVector() - Maintains enhanced isValidEmbedding() with NaN/infinity detection - Keeps service boundary validation with logging
|
You're absolutely correct about the architectural principle. I've updated the PR to remove the defensive validation from the embeddings grouper and rely solely on service boundary validation. Changes made: ✅ Removed defensive null checks from computeCentroid() and normalizeVector() in the grouper Service layer validates embeddings when received from external APIs Key value retained: Enhanced validation catches NaN/infinity values from external services |
|
I see now this was a duplicate of microsoft/vscode#277290. Sorry about that. |
Summary
Fixes runtime errors in
EmbeddingsGrouperwhen tool embeddings cache contains invalid data.Problem
The extension was crashing with two related errors in the embeddings system:
TypeError: Cannot read properties of undefined (reading 'reduce')innormalizeVectorTypeError: embeddings[0] is not iterableincomputeCentroidThese occur when the tool embeddings cache contains undefined, null, or non-array values.
Solution
normalizeVectormethodcomputeCentroidto filter out invalid embeddingsTesting
Type of Change
Fixes console errors that prevent proper tool grouping functionality.