-
Notifications
You must be signed in to change notification settings - Fork 104
Support node registry by DNS #920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #920 +/- ##
==========================================
+ Coverage 45.97% 46.39% +0.41%
==========================================
Files 328 370 +42
Lines 55505 56901 +1396
==========================================
+ Hits 25520 26400 +880
- Misses 27909 27977 +68
- Partials 2076 2524 +448
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Please update document about service discovery, and indicate which is the default option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds DNS-based node discovery as an alternative to the existing etcd-based node registry, enabling BanyanDB to support DNS SRV-based service discovery (e.g., in Kubernetes environments). The implementation includes:
- Refactoring of the observability package: moved implementation to a
servicessubpackage to avoid circular dependencies - Factory interface changed from pointer type to interface for better flexibility
- New DNS discovery service with comprehensive test coverage
Reviewed changes
Copilot reviewed 59 out of 59 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| banyand/observability/type.go | Defines Factory as interface instead of concrete type, moved implementation details to services package |
| banyand/observability/services/*.go | Moved observability implementation from parent package to services subpackage |
| banyand/metadata/dns/*.go | New DNS-based node discovery implementation with SRV record queries and gRPC node metadata fetching |
| banyand/queue/sub/node.go | New gRPC NodeQueryService endpoint to expose current node information |
| banyand/metadata/client.go | Integration of DNS discovery with configuration flags and lifecycle management |
| api/proto/banyandb/database/v1/rpc.proto | New NodeQueryService gRPC API definition |
| docs/api-reference.md | API documentation for the new NodeQueryService |
| Multiple test files | Updated imports to use observability/services package |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 60 out of 60 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
banyand/metadata/dns/dns.go
Outdated
| s.cacheMutex.RLock() | ||
| _, exists := s.nodeCache[addr] | ||
| s.cacheMutex.RUnlock() | ||
|
|
||
| if !exists { | ||
| // fetch node metadata from gRPC | ||
| node, fetchErr := s.fetchNodeMetadata(ctx, addr) | ||
| if fetchErr != nil { | ||
| s.log.Warn(). | ||
| Err(fetchErr). | ||
| Str("address", addr). | ||
| Msg("Failed to fetch node metadata") | ||
| addErrors = append(addErrors, fetchErr) | ||
| continue | ||
| } | ||
|
|
||
| // update cache and notify handlers | ||
| s.cacheMutex.Lock() | ||
| s.nodeCache[addr] = node | ||
| s.cacheMutex.Unlock() |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential race condition in the cache check. Between releasing the RLock on line 384 and acquiring the Lock on line 399, another goroutine could have added the same node to the cache. This could lead to duplicate fetchNodeMetadata calls and duplicate handler notifications for the same address. Consider using a check-lock-check pattern or using a single Lock for the entire operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the check-lock-check pattern
banyand/metadata/dns/dns.go
Outdated
| s.cacheMutex.RLock() | ||
| _, exists := s.nodeCache[addr] | ||
| s.cacheMutex.RUnlock() | ||
|
|
||
| if !exists { | ||
| // fetch node metadata from gRPC | ||
| node, fetchErr := s.fetchNodeMetadata(ctx, addr) | ||
| if fetchErr != nil { | ||
| s.log.Warn(). | ||
| Err(fetchErr). | ||
| Str("address", addr). | ||
| Msg("Failed to fetch node metadata") | ||
| addErrors = append(addErrors, fetchErr) | ||
| continue | ||
| } | ||
|
|
||
| // update cache and notify handlers | ||
| s.cacheMutex.Lock() | ||
| s.nodeCache[addr] = node | ||
| s.cacheMutex.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the check-lock-check pattern
banyand/metadata/dns/dns.go
Outdated
|
|
||
| // if there have any error occurred, | ||
| // then just return the query error to ignore the result to make sure the cache correct | ||
| if len(queryErrors) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should return resolved addresses, even if some cannot be resolved.
banyand/metadata/dns/dns_test.go
Outdated
| Expect(mockResolver.getCallCount("_grpc._tcp.test.local")).To(Equal(1)) | ||
| }) | ||
|
|
||
| It("should fail when any DNS query fails (🎯 critical scenario)", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not fail when part of the DNS query fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 60 out of 60 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
banyand/metadata/client.go
Outdated
| // Validate etcd endpoints (required when using etcd-based node discovery for schema storage) | ||
| if s.nodeDiscoveryMode == NodeDiscoveryModeEtcd && len(s.endpoints) == 0 { |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DNS mode still requires etcd for schema storage, but the validation comment on line 145 says "required when using etcd-based node discovery for schema storage" which is misleading. When DNS mode is used, etcd endpoints are still required for schema operations. The validation should enforce that etcd endpoints are always required (regardless of discovery mode), or the comment should be clarified.
| // Validate etcd endpoints (required when using etcd-based node discovery for schema storage) | |
| if s.nodeDiscoveryMode == NodeDiscoveryModeEtcd && len(s.endpoints) == 0 { | |
| // Validate etcd endpoints (always required for schema storage, regardless of node discovery mode) | |
| if len(s.endpoints) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bug, pls fix it.
| if s.nodeDiscoveryMode == NodeDiscoveryModeDNS { | ||
| l.Info().Strs("srv-addresses", s.dnsSRVAddresses).Msg("Initializing DNS-based node discovery") | ||
|
|
||
| var createErr error | ||
| s.dnsDiscovery, createErr = dns.NewService(dns.Config{ | ||
| SRVAddresses: s.dnsSRVAddresses, | ||
| InitInterval: s.dnsFetchInitInterval, | ||
| InitDuration: s.dnsFetchInitDuration, | ||
| PollInterval: s.dnsFetchInterval, | ||
| GRPCTimeout: s.grpcTimeout, | ||
| TLSEnabled: s.dnsTLSEnabled, | ||
| CACertPaths: s.dnsCACertPaths, | ||
| }) | ||
| if createErr != nil { | ||
| return fmt.Errorf("failed to create DNS discovery service: %w", createErr) | ||
| } | ||
| } |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DNS discovery service is created in PreRun but never started. Looking at the Serve method, s.dnsDiscovery.Start() is called there. However, if PreRun fails after creating the DNS discovery service (lines 219-230), the service and its associated resources (like TLS reloaders) are not cleaned up, leading to a resource leak.
banyand/metadata/dns/dns.go
Outdated
|
|
||
| // ListNode list all existing nodes from cache. | ||
| func (s *Service) ListNode(ctx context.Context, role databasev1.Role) ([]*databasev1.Node, error) { | ||
| // if the service is haven't begun/finished, then try to query and update DNS first |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on line 550 has a grammar error: "haven't begun/finished" should be "hasn't begun/finished" since "service" is singular.
| // if the service is haven't begun/finished, then try to query and update DNS first | |
| // if the service hasn't begun/finished, then try to query and update DNS first |
docs/operation/node-discovery.md
Outdated
|
|
||
| - Watches etcd prefix `/nodes/` for PUT (add/update) and DELETE events. | ||
| - Revision-based streaming resumes from last known revision after reconnection. | ||
| - Periodic full sync every 5-35 minutes (randomized) to detect missed events. |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation states that periodic full sync happens "every 5-35 minutes (randomized)" but the default value for --etcd-full-sync-interval flag is 30 minutes (non-randomized). This creates a discrepancy between the documentation and the actual implementation.
etcdas storage.service) to avoid package dependency issue.