-
Notifications
You must be signed in to change notification settings - Fork 161
K8SPSMDB-1296: improve readiness probe #1917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| "github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo" | ||
| ) | ||
|
|
||
| func getStatus(ctx context.Context, client mongo.Client) (ReplSetStatus, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we have all the mongo client-related functions together as part of the type Client interface? I understand that we are not committing to the interface segregation rule by doing that, but that interface is already containing everything (almost in terms of functionality).
Also the response type seems related to the generic mongo model and maybe can be moved to the mongo model file.
type ReplSetStatus struct {
...
}
This removes the need to have a utils file completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if err != nil { | ||
| log.Error(err, "Failed to get replset status") | ||
| return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if we should ignore all errors or only this node is not a member of replset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR improves the MongoDB readiness probe to verify replica set member states by checking the stateStr field from the replSetGetStatus command. Key improvements include:
- Enhanced readiness probe to validate MongoDB replica set member states (Primary, Secondary, or Arbiter)
- Graceful handling of invalid replica set configurations to allow initial deployment
- Addition of TLS/SSL arguments to readiness probe commands when TLS is enabled
- Context-aware MongoDB client connections with configurable timeouts
- Refactored healthcheck logic to reduce code complexity
Key Changes
- Modified
mongo.Dial()to accept context parameter and support configurable timeouts - Added
ErrInvalidReplsetConfigerror for replica set code 93 handling - Enhanced
MongodReadinessCheck()to validate replica set state after TCP connection check - Updated all TLS-enabled readiness probe configurations to include SSL arguments
Reviewed changes
Copilot reviewed 213 out of 213 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/psmdb/mongo/mongo.go | Added context parameter and timeout configuration to Dial function |
| pkg/psmdb/mongo/models.go | Added InitialSyncStatus field to Status model |
| pkg/psmdb/client.go | Updated all Dial calls to pass context |
| cmd/mongodb-healthcheck/healthcheck/readiness.go | Enhanced readiness check to validate replica set state |
| cmd/mongodb-healthcheck/healthcheck/health.go | Simplified health check logic and removed JSON marshaling workaround |
| cmd/mongodb-healthcheck/tool/tool.go | Updated readiness check to pass full config |
| cmd/mongodb-healthcheck/db/db.go | Updated Dial calls and fixed typo |
| cmd/mongodb-healthcheck/db/ssl.go | Added context parameter and fixed log message |
| pkg/apis/psmdb/v1/psmdb_defaults.go | Added SSL arguments to readiness probes with version gating |
| e2e-tests/**/compare/*.yml | Updated test expectations with SSL arguments |
| e2e-tests/upgrade-consistency-sharded-tls/run | Updated generation numbers for certificate renewal tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| compare_generation "7" "statefulset" "${CLUSTER}-rs0" | ||
| compare_generation "7" "statefulset" "${CLUSTER}-cfg" |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generation number jumps from 5 to 7, skipping generation 6. This appears to be an error. Based on the pattern in the test, after renewing the some-name-ssl-internal certificate, the generation should be 6, not 7. Similarly at line 102-103, the generation changes to 8 which is inconsistent with the expected sequential numbering (should be 7 after generation 6 at line 102).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 225 out of 225 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 221 out of 221 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func CheckState(rs ReplSetStatus, startupDelaySeconds int64, oplogSize int64) error { | ||
| func CheckState(rs mongo.Status, startupDelaySeconds int64, oplogSize int64) error { | ||
| if rs.GetSelf() == nil { | ||
| return errors.New("invalid replset status") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this error message right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| func CheckState(rs ReplSetStatus, startupDelaySeconds int64, oplogSize int64) error { | ||
| func CheckState(rs mongo.Status, startupDelaySeconds int64, oplogSize int64) error { | ||
| if rs.GetSelf() == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that on L126 we are using again rs.GetSelf, assigning here to a variable, then performing the nil check and then using it in the remaining function is better since that function is looping through the members and it is not needed for every invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| var d net.Dialer | ||
|
|
||
| addr := cnf.Hosts[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we ensure that hosts are not empty/nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| cnf.Timeout = time.Second | ||
| client, err := db.Dial(ctx, cnf) | ||
| if err != nil { | ||
| return nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we swallowing this error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return &rs, nil | ||
| }() | ||
| if err != nil || s == nil { | ||
| return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add wrap some context to this error, MongodReadinessCheck already returns multiple errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 221 out of 221 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
commit: 02b4bc7 |
https://perconadev.atlassian.net/browse/K8SPSMDB-1296
DESCRIPTION
This PR improves readiness probe by verifying the
stateStrfield in thereplSetGetStatusoutput. If it's not possible to execute the command, the readiness probe will not fail, because otherwise it wouldn't be possible to deploy a mongod statefulset. The readiness probe will fail if the value of thestateStris not equal toPrimary,SecondaryorArbiterCHECKLIST
Jira
Needs Doc) and QA (Needs QA)?Tests
compare/*-oc.yml)?Config/Logging/Testability