Skip to content

Conversation

@mrproliu
Copy link
Contributor

  1. Support enabling/disabling the DNS node registry. If disabled, it will still use etcd as storage.
  2. Add a new gRPC protocol to support querying the self node info.
  3. Wrap the oberverbility as a sub package(service) to avoid package dependency issue.

@mrproliu mrproliu added this to the 0.10.0 milestone Dec 31, 2025
@mrproliu mrproliu requested a review from hanahmily December 31, 2025 17:55
@mrproliu mrproliu added the enhancement New feature or request label Dec 31, 2025
@codecov-commenter
Copy link

codecov-commenter commented Dec 31, 2025

Codecov Report

❌ Patch coverage is 62.33766% with 174 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.39%. Comparing base (3530dd9) to head (f6230d3).
⚠️ Report is 92 commits behind head on main.

Files with missing lines Patch % Lines
banyand/metadata/dns/dns.go 79.25% 51 Missing and 16 partials ⚠️
banyand/metadata/client.go 26.66% 38 Missing and 6 partials ⚠️
banyand/queue/sub/server.go 0.00% 20 Missing ⚠️
banyand/metadata/dns/metrics.go 0.00% 17 Missing ⚠️
banyand/measure/svc_data.go 0.00% 4 Missing ⚠️
banyand/queue/sub/node.go 0.00% 4 Missing ⚠️
banyand/internal/storage/disk_monitor.go 0.00% 3 Missing ⚠️
banyand/measure/svc_liaison.go 0.00% 2 Missing ⚠️
banyand/stream/svc_liaison.go 0.00% 2 Missing ⚠️
banyand/internal/storage/metrics.go 50.00% 1 Missing ⚠️
... and 10 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #920      +/-   ##
==========================================
+ Coverage   45.97%   46.39%   +0.41%     
==========================================
  Files         328      370      +42     
  Lines       55505    56901    +1396     
==========================================
+ Hits        25520    26400     +880     
- Misses      27909    27977      +68     
- Partials     2076     2524     +448     
Flag Coverage Δ
banyand 49.68% <62.47%> (?)
bydbctl 81.91% <ø> (?)
fodc 90.65% <ø> (?)
pkg 29.08% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@wu-sheng
Copy link
Member

wu-sheng commented Jan 1, 2026

  1. Support enabling/disabling the DNS node registry. If disabled, it will still use etcd as storage.

Please update document about service discovery, and indicate which is the default option.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds DNS-based node discovery as an alternative to the existing etcd-based node registry, enabling BanyanDB to support DNS SRV-based service discovery (e.g., in Kubernetes environments). The implementation includes:

  • Refactoring of the observability package: moved implementation to a services subpackage to avoid circular dependencies
  • Factory interface changed from pointer type to interface for better flexibility
  • New DNS discovery service with comprehensive test coverage

Reviewed changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
banyand/observability/type.go Defines Factory as interface instead of concrete type, moved implementation details to services package
banyand/observability/services/*.go Moved observability implementation from parent package to services subpackage
banyand/metadata/dns/*.go New DNS-based node discovery implementation with SRV record queries and gRPC node metadata fetching
banyand/queue/sub/node.go New gRPC NodeQueryService endpoint to expose current node information
banyand/metadata/client.go Integration of DNS discovery with configuration flags and lifecycle management
api/proto/banyandb/database/v1/rpc.proto New NodeQueryService gRPC API definition
docs/api-reference.md API documentation for the new NodeQueryService
Multiple test files Updated imports to use observability/services package

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 60 out of 60 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 382 to 401
s.cacheMutex.RLock()
_, exists := s.nodeCache[addr]
s.cacheMutex.RUnlock()

if !exists {
// fetch node metadata from gRPC
node, fetchErr := s.fetchNodeMetadata(ctx, addr)
if fetchErr != nil {
s.log.Warn().
Err(fetchErr).
Str("address", addr).
Msg("Failed to fetch node metadata")
addErrors = append(addErrors, fetchErr)
continue
}

// update cache and notify handlers
s.cacheMutex.Lock()
s.nodeCache[addr] = node
s.cacheMutex.Unlock()
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a potential race condition in the cache check. Between releasing the RLock on line 384 and acquiring the Lock on line 399, another goroutine could have added the same node to the cache. This could lead to duplicate fetchNodeMetadata calls and duplicate handler notifications for the same address. Consider using a check-lock-check pattern or using a single Lock for the entire operation.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the check-lock-check pattern

Comment on lines 382 to 401
s.cacheMutex.RLock()
_, exists := s.nodeCache[addr]
s.cacheMutex.RUnlock()

if !exists {
// fetch node metadata from gRPC
node, fetchErr := s.fetchNodeMetadata(ctx, addr)
if fetchErr != nil {
s.log.Warn().
Err(fetchErr).
Str("address", addr).
Msg("Failed to fetch node metadata")
addErrors = append(addErrors, fetchErr)
continue
}

// update cache and notify handlers
s.cacheMutex.Lock()
s.nodeCache[addr] = node
s.cacheMutex.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the check-lock-check pattern


// if there have any error occurred,
// then just return the query error to ignore the result to make sure the cache correct
if len(queryErrors) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should return resolved addresses, even if some cannot be resolved.

Expect(mockResolver.getCallCount("_grpc._tcp.test.local")).To(Equal(1))
})

It("should fail when any DNS query fails (🎯 critical scenario)", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not fail when part of the DNS query fails.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 60 out of 60 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 145 to 146
// Validate etcd endpoints (required when using etcd-based node discovery for schema storage)
if s.nodeDiscoveryMode == NodeDiscoveryModeEtcd && len(s.endpoints) == 0 {
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DNS mode still requires etcd for schema storage, but the validation comment on line 145 says "required when using etcd-based node discovery for schema storage" which is misleading. When DNS mode is used, etcd endpoints are still required for schema operations. The validation should enforce that etcd endpoints are always required (regardless of discovery mode), or the comment should be clarified.

Suggested change
// Validate etcd endpoints (required when using etcd-based node discovery for schema storage)
if s.nodeDiscoveryMode == NodeDiscoveryModeEtcd && len(s.endpoints) == 0 {
// Validate etcd endpoints (always required for schema storage, regardless of node discovery mode)
if len(s.endpoints) == 0 {

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bug, pls fix it.

Comment on lines +215 to 231
if s.nodeDiscoveryMode == NodeDiscoveryModeDNS {
l.Info().Strs("srv-addresses", s.dnsSRVAddresses).Msg("Initializing DNS-based node discovery")

var createErr error
s.dnsDiscovery, createErr = dns.NewService(dns.Config{
SRVAddresses: s.dnsSRVAddresses,
InitInterval: s.dnsFetchInitInterval,
InitDuration: s.dnsFetchInitDuration,
PollInterval: s.dnsFetchInterval,
GRPCTimeout: s.grpcTimeout,
TLSEnabled: s.dnsTLSEnabled,
CACertPaths: s.dnsCACertPaths,
})
if createErr != nil {
return fmt.Errorf("failed to create DNS discovery service: %w", createErr)
}
}
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DNS discovery service is created in PreRun but never started. Looking at the Serve method, s.dnsDiscovery.Start() is called there. However, if PreRun fails after creating the DNS discovery service (lines 219-230), the service and its associated resources (like TLS reloaders) are not cleaned up, leading to a resource leak.

Copilot uses AI. Check for mistakes.

// ListNode list all existing nodes from cache.
func (s *Service) ListNode(ctx context.Context, role databasev1.Role) ([]*databasev1.Node, error) {
// if the service is haven't begun/finished, then try to query and update DNS first
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 550 has a grammar error: "haven't begun/finished" should be "hasn't begun/finished" since "service" is singular.

Suggested change
// if the service is haven't begun/finished, then try to query and update DNS first
// if the service hasn't begun/finished, then try to query and update DNS first

Copilot uses AI. Check for mistakes.

- Watches etcd prefix `/nodes/` for PUT (add/update) and DELETE events.
- Revision-based streaming resumes from last known revision after reconnection.
- Periodic full sync every 5-35 minutes (randomized) to detect missed events.
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation states that periodic full sync happens "every 5-35 minutes (randomized)" but the default value for --etcd-full-sync-interval flag is 30 minutes (non-randomized). This creates a discrepancy between the documentation and the actual implementation.

Copilot uses AI. Check for mistakes.
@wu-sheng wu-sheng requested a review from hanahmily January 6, 2026 00:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants