@pooknull pooknull commented May 12, 2025

K8SPSMDB-1296

https://perconadev.atlassian.net/browse/K8SPSMDB-1296

DESCRIPTION

This PR improves the readiness probe by verifying the stateStr field in the replSetGetStatus output. If the command cannot be executed, the readiness probe does not fail, because otherwise it would not be possible to deploy a mongod statefulset. The readiness probe fails if the value of stateStr is not Primary, Secondary, or Arbiter.
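The rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `checkReady` and `errStatusUnavailable` are hypothetical names, and the sketch assumes the upper-case spellings that MongoDB reports in `stateStr` (`PRIMARY`, `SECONDARY`, `ARBITER`).

```go
package main

import (
	"errors"
	"fmt"
)

// errStatusUnavailable is a hypothetical error standing in for any failure
// to run replSetGetStatus (for example, before the replset is initialized).
var errStatusUnavailable = errors.New("replSetGetStatus unavailable")

// readyStates holds the member states treated as ready. MongoDB reports
// stateStr in upper case.
var readyStates = map[string]bool{
	"PRIMARY":   true,
	"SECONDARY": true,
	"ARBITER":   true,
}

// checkReady returns nil when the readiness probe should pass.
func checkReady(stateStr string, statusErr error) error {
	if statusErr != nil {
		// The command could not be executed: do not fail the probe,
		// otherwise a new statefulset could never become ready.
		return nil
	}
	if !readyStates[stateStr] {
		return fmt.Errorf("member is not ready: stateStr is %q", stateStr)
	}
	return nil
}

func main() {
	fmt.Println(checkReady("PRIMARY", nil))
	fmt.Println(checkReady("STARTUP2", nil))
	fmt.Println(checkReady("", errStatusUnavailable))
}
```

Passing the command error in as an argument keeps the decision logic free of any database dependency, which makes it straightforward to unit test.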

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label May 12, 2025
@github-actions github-actions bot added the tests label May 12, 2025
@pooknull pooknull marked this pull request as ready for review May 26, 2025 12:00
"github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo"
)

func getStatus(ctx context.Context, client mongo.Client) (ReplSetStatus, error) {
Contributor

Why don't we keep all the mongo client-related functions together as part of the Client interface type? I understand that doing so departs from the interface segregation principle, but that interface already contains almost everything in terms of functionality.

Also, the response type seems related to the generic mongo model and could perhaps be moved to the mongo model file.

type ReplSetStatus struct {
...
}

This removes the need to have a utils file completely.
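One way to picture this suggestion is sketched below. It assumes a cut-down, hypothetical `Client` interface with a single `RSStatus` method and a hypothetical `Status` model; the operator's real interface and model contain far more than this, and their exact shape is not shown in this conversation.

```go
package main

import (
	"context"
	"fmt"
)

// Member and Status are hypothetical, trimmed-down models of the
// replSetGetStatus response, living next to the client.
type Member struct {
	Name     string
	StateStr string
}

type Status struct {
	Set     string
	Members []Member
}

// Client is a hypothetical, cut-down version of the client interface.
// Having the status call here means callers need no separate utils helper.
type Client interface {
	RSStatus(ctx context.Context) (Status, error)
}

// fakeClient is a test double that returns a canned status.
type fakeClient struct{ status Status }

func (f fakeClient) RSStatus(context.Context) (Status, error) {
	return f.status, nil
}

// memberCount depends only on the interface, so it works with any client.
func memberCount(ctx context.Context, c Client) (int, error) {
	st, err := c.RSStatus(ctx)
	if err != nil {
		return 0, fmt.Errorf("get replset status: %w", err)
	}
	return len(st.Members), nil
}

// demoCount runs memberCount against the fake client.
func demoCount() (int, error) {
	c := fakeClient{status: Status{
		Set: "rs0",
		Members: []Member{
			{Name: "rs0-0", StateStr: "PRIMARY"},
			{Name: "rs0-1", StateStr: "SECONDARY"},
		},
	}}
	return memberCount(context.Background(), c)
}

func main() {
	fmt.Println(demoCount())
}
```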

Contributor Author

Comment on lines 60 to 62
if err != nil {
	log.Error(err, "Failed to get replset status")
	return nil
Contributor

I wonder whether we should ignore all errors, or only the case where this node is not a member of the replset?
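A narrower alternative to ignoring every error could look like the sketch below. The Copilot overview in this PR mentions an `ErrInvalidReplsetConfig` error for code 93; its definition here, along with `commandError`, `classify`, and `shouldIgnore`, is an assumption made purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidReplsetConfig mirrors the sentinel named in the PR overview.
// Its definition here is an assumption.
var ErrInvalidReplsetConfig = errors.New("invalid replset config")

// commandError is a hypothetical stand-in for a server error that carries
// a numeric code.
type commandError struct {
	Code int32
	Msg  string
}

func (e commandError) Error() string {
	return fmt.Sprintf("(%d) %s", e.Code, e.Msg)
}

// classify maps code 93 to the sentinel and leaves other errors untouched.
func classify(err error) error {
	var ce commandError
	if errors.As(err, &ce) && ce.Code == 93 {
		return fmt.Errorf("%w: %s", ErrInvalidReplsetConfig, ce.Msg)
	}
	return err
}

// shouldIgnore reports whether the probe may pass despite err.
// Only the invalid-config case is tolerated; everything else is surfaced.
func shouldIgnore(err error) bool {
	return errors.Is(err, ErrInvalidReplsetConfig)
}

func main() {
	notMember := commandError{Code: 93, Msg: "not a member of the replset"}
	unauthorized := commandError{Code: 13, Msg: "unauthorized"}
	fmt.Println(shouldIgnore(classify(notMember)))
	fmt.Println(shouldIgnore(classify(unauthorized)))
}
```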

Contributor

@pooknull pooknull requested a review from valmiranogueira as a code owner May 27, 2025 05:21
@hors hors added this to the v1.21.0 milestone May 27, 2025
@gkech gkech requested review from egegunes, gkech and nmarukovich June 26, 2025 09:33
@hors hors removed this from the v1.21.0 milestone Sep 16, 2025
Copilot AI review requested due to automatic review settings December 12, 2025 10:36
Contributor

Copilot AI left a comment

Pull request overview

This PR improves the MongoDB readiness probe to verify replica set member states by checking the stateStr field from the replSetGetStatus command. Key improvements include:

  • Enhanced readiness probe to validate MongoDB replica set member states (Primary, Secondary, or Arbiter)
  • Graceful handling of invalid replica set configurations to allow initial deployment
  • Addition of TLS/SSL arguments to readiness probe commands when TLS is enabled
  • Context-aware MongoDB client connections with configurable timeouts
  • Refactored healthcheck logic to reduce code complexity

Key Changes

  • Modified mongo.Dial() to accept context parameter and support configurable timeouts
  • Added ErrInvalidReplsetConfig error for replica set code 93 handling
  • Enhanced MongodReadinessCheck() to validate replica set state after TCP connection check
  • Updated all TLS-enabled readiness probe configurations to include SSL arguments

Reviewed changes

Copilot reviewed 213 out of 213 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `pkg/psmdb/mongo/mongo.go` | Added context parameter and timeout configuration to Dial function |
| `pkg/psmdb/mongo/models.go` | Added InitialSyncStatus field to Status model |
| `pkg/psmdb/client.go` | Updated all Dial calls to pass context |
| `cmd/mongodb-healthcheck/healthcheck/readiness.go` | Enhanced readiness check to validate replica set state |
| `cmd/mongodb-healthcheck/healthcheck/health.go` | Simplified health check logic and removed JSON marshaling workaround |
| `cmd/mongodb-healthcheck/tool/tool.go` | Updated readiness check to pass full config |
| `cmd/mongodb-healthcheck/db/db.go` | Updated Dial calls and fixed typo |
| `cmd/mongodb-healthcheck/db/ssl.go` | Added context parameter and fixed log message |
| `pkg/apis/psmdb/v1/psmdb_defaults.go` | Added SSL arguments to readiness probes with version gating |
| `e2e-tests/**/compare/*.yml` | Updated test expectations with SSL arguments |
| `e2e-tests/upgrade-consistency-sharded-tls/run` | Updated generation numbers for certificate renewal tests |

Comment on lines 79 to 80
compare_generation "7" "statefulset" "${CLUSTER}-rs0"
compare_generation "7" "statefulset" "${CLUSTER}-cfg"
Copilot AI Dec 12, 2025

The generation number jumps from 5 to 7, skipping generation 6. This appears to be an error. Based on the pattern in the test, after renewing the some-name-ssl-internal certificate, the generation should be 6, not 7. Similarly, at lines 102-103 the generation changes to 8, which is inconsistent with the expected sequential numbering (it should be 7, after generation 6).

@egegunes egegunes added this to the v1.22.0 milestone Dec 15, 2025
Copilot AI review requested due to automatic review settings December 16, 2025 09:06
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 225 out of 225 changed files in this pull request and generated 1 comment.


Copilot AI review requested due to automatic review settings December 17, 2025 15:17
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 221 out of 221 changed files in this pull request and generated no new comments.


-func CheckState(rs ReplSetStatus, startupDelaySeconds int64, oplogSize int64) error {
+func CheckState(rs mongo.Status, startupDelaySeconds int64, oplogSize int64) error {
 	if rs.GetSelf() == nil {
 		return errors.New("invalid replset status")
Contributor

Is this error message right?

Contributor Author


-func CheckState(rs ReplSetStatus, startupDelaySeconds int64, oplogSize int64) error {
+func CheckState(rs mongo.Status, startupDelaySeconds int64, oplogSize int64) error {
 	if rs.GetSelf() == nil {
Contributor

Given that we call rs.GetSelf again on L126, it would be better to assign the result to a variable here, perform the nil check on it, and reuse that variable in the rest of the function. GetSelf loops through the members, so there is no need to repeat that work on every call.
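The pattern suggested here can be sketched as below. `Status`, `Member`, and `describeSelf` are hypothetical, simplified stand-ins (the real `mongo.Status` is not shown in this conversation); the point is only that the linear scan in `GetSelf` runs once.

```go
package main

import (
	"errors"
	"fmt"
)

// Member and Status are hypothetical, simplified models.
type Member struct {
	Name     string
	StateStr string
	Self     bool
}

type Status struct {
	Members []Member
}

// GetSelf scans the members and returns the one marked as self, or nil.
func (s Status) GetSelf() *Member {
	for i := range s.Members {
		if s.Members[i].Self {
			return &s.Members[i]
		}
	}
	return nil
}

// describeSelf calls GetSelf once, checks the result for nil, and reuses it.
func describeSelf(rs Status) (string, error) {
	self := rs.GetSelf() // single scan over the members
	if self == nil {
		return "", errors.New("replset status has no member marked as self")
	}
	return fmt.Sprintf("%s is %s", self.Name, self.StateStr), nil
}

func main() {
	rs := Status{Members: []Member{
		{Name: "rs0-0", StateStr: "SECONDARY"},
		{Name: "rs0-1", StateStr: "PRIMARY", Self: true},
	}}
	fmt.Println(describeSelf(rs))
}
```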

Contributor Author


var d net.Dialer

addr := cnf.Hosts[0]
Contributor

Should we ensure that hosts are not empty/nil?
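A guard of the kind asked about here might look like the following sketch. `firstHost` and `tcpCheck` are hypothetical helpers; the sketch uses the standard library's `net.Dialer.DialContext` and `context.WithTimeout` and is not the PR's actual readiness code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// firstHost returns the first configured host, or an error when none is set.
func firstHost(hosts []string) (string, error) {
	if len(hosts) == 0 {
		return "", errors.New("no hosts configured")
	}
	return hosts[0], nil
}

// tcpCheck dials the first host within the timeout and closes the
// connection again.
func tcpCheck(ctx context.Context, hosts []string, timeout time.Duration) error {
	addr, err := firstHost(hosts)
	if err != nil {
		return err
	}

	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return fmt.Errorf("dial %s: %w", addr, err)
	}
	return conn.Close()
}

func main() {
	// With no hosts the check fails fast instead of panicking on hosts[0].
	fmt.Println(tcpCheck(context.Background(), nil, time.Second))
}
```

Returning an error for an empty slice turns what would otherwise be an index-out-of-range panic into an ordinary probe failure with a readable message.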

Contributor Author

cnf.Timeout = time.Second
client, err := db.Dial(ctx, cnf)
if err != nil {
	return nil, nil
Contributor

Why are we swallowing this error?

Contributor Author

	return &rs, nil
}()
if err != nil || s == nil {
	return err
Contributor

Let's wrap this error with some context; MongodReadinessCheck already returns multiple errors.
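Wrapping with `%w` is one common way to add such context while keeping the original error inspectable. In the sketch below, `errNotReady`, `checkReplsetState`, and `readinessCheck` are hypothetical names used only to show the pattern.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady is a hypothetical sentinel returned by the state check.
var errNotReady = errors.New("member is not ready")

// checkReplsetState is a hypothetical stand-in for the replset state check.
func checkReplsetState() error {
	return errNotReady
}

// readinessCheck wraps the underlying error so the caller can tell which
// step of the readiness check failed.
func readinessCheck() error {
	if err := checkReplsetState(); err != nil {
		return fmt.Errorf("check replset state: %w", err)
	}
	return nil
}

// isNotReady shows that the wrapped error is still matchable.
func isNotReady(err error) bool {
	return errors.Is(err, errNotReady)
}

func main() {
	err := readinessCheck()
	fmt.Println(err)
	fmt.Println(isNotReady(err))
}
```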

Contributor Author

Copilot AI review requested due to automatic review settings December 18, 2025 09:28
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 221 out of 221 changed files in this pull request and generated 2 comments.


@pooknull pooknull requested a review from gkech December 18, 2025 10:32
@JNKPercona
Collaborator

| Test Name | Result | Time |
| --- | --- | --- |
| arbiter | passed | 00:11:19 |
| balancer | passed | 00:19:31 |
| cross-site-sharded | passed | 00:21:14 |
| custom-replset-name | passed | 00:10:10 |
| custom-tls | passed | 00:15:45 |
| custom-users-roles | passed | 00:10:21 |
| custom-users-roles-sharded | passed | 00:11:27 |
| data-at-rest-encryption | passed | 00:15:22 |
| data-sharded | failure | 00:10:46 |
| demand-backup | passed | 00:15:16 |
| demand-backup-eks-credentials-irsa | passed | 00:00:07 |
| demand-backup-fs | passed | 00:23:54 |
| demand-backup-if-unhealthy | passed | 00:09:56 |
| demand-backup-incremental | failure | 00:19:31 |
| demand-backup-incremental-sharded | passed | 00:58:54 |
| demand-backup-physical-parallel | passed | 00:10:04 |
| demand-backup-physical-aws | passed | 00:12:13 |
| demand-backup-physical-azure | passed | 00:11:41 |
| demand-backup-physical-gcp-s3 | passed | 00:12:20 |
| demand-backup-physical-gcp-native | passed | 00:11:42 |
| demand-backup-physical-minio | passed | 00:20:40 |
| demand-backup-physical-minio-native | passed | 00:19:57 |
| demand-backup-physical-sharded-parallel | passed | 00:11:01 |
| demand-backup-physical-sharded-aws | passed | 00:17:40 |
| demand-backup-physical-sharded-azure | passed | 00:17:16 |
| demand-backup-physical-sharded-gcp-native | passed | 00:17:18 |
| demand-backup-physical-sharded-minio | passed | 00:16:54 |
| demand-backup-physical-sharded-minio-native | passed | 00:17:18 |
| demand-backup-sharded | passed | 00:24:59 |
| expose-sharded | passed | 00:33:04 |
| finalizer | passed | 00:09:49 |
| ignore-labels-annotations | passed | 00:07:34 |
| init-deploy | passed | 00:12:58 |
| ldap | passed | 00:08:46 |
| ldap-tls | passed | 00:12:48 |
| limits | passed | 00:06:07 |
| liveness | passed | 00:07:59 |
| mongod-major-upgrade | passed | 00:13:29 |
| mongod-major-upgrade-sharded | passed | 00:21:02 |
| monitoring-2-0 | passed | 00:25:01 |
| monitoring-pmm3 | passed | 00:26:00 |
| multi-cluster-service | passed | 00:15:43 |
| multi-storage | passed | 00:19:21 |
| non-voting-and-hidden | passed | 00:17:29 |
| one-pod | passed | 00:07:26 |
| operator-self-healing-chaos | passed | 00:13:14 |
| pitr | passed | 00:32:40 |
| pitr-physical | passed | 01:00:39 |
| pitr-sharded | passed | 00:22:27 |
| pitr-to-new-cluster | passed | 00:24:45 |
| pitr-physical-backup-source | passed | 00:54:47 |
| preinit-updates | passed | 00:05:41 |
| pvc-resize | passed | 00:12:38 |
| recover-no-primary | passed | 00:27:49 |
| replset-overrides | passed | 00:17:34 |
| rs-shard-migration | passed | 00:13:40 |
| scaling | passed | 00:11:04 |
| scheduled-backup | passed | 00:16:58 |
| security-context | passed | 00:07:21 |
| self-healing-chaos | passed | 00:15:38 |
| service-per-pod | passed | 00:18:41 |
| serviceless-external-nodes | passed | 00:07:36 |
| smart-update | passed | 00:08:21 |
| split-horizon | passed | 00:07:55 |
| stable-resource-version | passed | 00:04:54 |
| storage | passed | 00:07:49 |
| tls-issue-cert-manager | passed | 00:28:40 |
| upgrade | passed | 00:09:26 |
| upgrade-consistency | passed | 00:07:30 |
| upgrade-consistency-sharded-tls | passed | 00:55:21 |
| upgrade-sharded | passed | 00:19:31 |
| upgrade-partial-backup | failure | 00:01:14 |
| users | passed | 00:17:13 |
| version-service | passed | 00:25:54 |

| Summary | Value |
| --- | --- |
| Tests Run | 74/74 |
| Job Duration | 04:13:30 |
| Total Test Time | 21:16:49 |

commit: 02b4bc7
image: perconalab/percona-server-mongodb-operator:PR-1917-02b4bc7d

7 participants