[Bug] Orchestration Deadlock: Check cluster readiness runs before clustermgtd restart, causing timeout on stale DynamoDB records #7166

@almightychang

Description

Required Info:

  • AWS ParallelCluster version [e.g. 3.1.1]: 3.14.0
  • Full cluster configuration without any credentials or personal data.
  • Cluster name: pcluster-prod
  • Output of pcluster describe-cluster command.
{
  "creationTime": "2025-12-03T00:53:22.894Z",
  "headNode": {
    "launchTime": "2025-12-03T00:58:05.000Z",
    "instanceId": "i-0353ad77176f3a317",
    "publicIpAddress": "18.224.95.13",
    "instanceType": "m5.xlarge",
    "state": "running",
    "privateIpAddress": "172.31.41.53"
  },
  "version": "3.14.0",
  "clusterConfiguration": {
    "url": "https://XXX"
  },
  "tags": [
    {
      "value": "3.14.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "pcluster-prod",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_COMPLETE",
  "clusterName": "pcluster-prod",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "XXX",
  "lastUpdatedTime": "2025-12-22T11:43:49.414Z",
  "region": "us-east-2",
  "clusterStatus": "UPDATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

  • [Optional] Arn of the cluster CloudFormation main stack:

Bug description and how to reproduce:

An orchestration deadlock occurs during pcluster update-cluster: the HeadNode stops clustermgtd and runs execute[Check cluster readiness] before restarting the daemon. The readiness check script (check_cluster_ready.py) requires every compute node record in the cluster's DynamoDB table to report the new cluster configuration version. If instances are terminated or replaced during the update, their records persist in DynamoDB as "ghost records", because the cleanup that removes them runs inside clustermgtd, which is stopped at that point. The check then keeps failing on records for instances that no longer exist, and the update eventually fails with a HeadNodeWaitCondition timeout.

Steps to reproduce:

  1. Initiate a cluster update (v3.14.0) that includes changes to AdditionalIamPolicies or CustomActions.
  2. Ensure some compute instances are terminated or fail during the update.
  3. Observe the HeadNode log /var/log/chef-client.log. The readiness check reports "wrong records" for instances that no longer exist in EC2.
  4. The HeadNode remains stuck because clustermgtd is in the STOPPED state and cannot prune the stale DynamoDB entries.

Evidence & Logs

From Head node (/var/log/chef-client.log):

[2025-12-22T03:23:29+00:00] FATAL: execute[Check cluster readiness] (aws-parallelcluster-slurm::update_head_node line 169)
ERROR:__main__:Some cluster readiness checks failed:
  * wrong records (5): [('i-07ebfd7ab147b8072', 'old-version-id'), ...]
[execute] clustermgtd: ERROR (not running)

Supervisor Status on HeadNode:

$ supervisorctl status
clustermgtd      STOPPED    Dec 22 03:42 AM  # Stuck after update failure

Root Cause Analysis

In the recipe aws-parallelcluster-slurm::update_head_node, the sequence is:

  1. execute[stop clustermgtd]
  2. execute[Check cluster readiness] (Deadlock occurs here)
  3. execute[start clustermgtd] (Never reached)

The readiness check expects every entry in the DynamoDB table to match the desired configuration version. With clustermgtd stopped, stale entries for terminated instances ("ghost records") are never removed, so the version mismatch never clears.
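
To make the failure mode concrete, here is a minimal sketch of how such a version comparison could classify records. This is not the actual check_cluster_ready.py code. The function name `classify_records` and the exact meaning given to "missing" and "incomplete" are assumptions for illustration. Only the item shape (`Id`, `Data.M.cluster_config_version.S`) and the three category names come from the log output in this report.

```python
from typing import Dict, List, Tuple

# Record key prefix as it appears in the logged DDB items.
PREFIX = "CLUSTER_CONFIG."


def classify_records(
    instance_ids: List[str],
    items: List[Dict],
    expected_version: str,
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Hypothetical classification into (missing, incomplete, wrong)."""
    # Index the DynamoDB items by instance ID.
    by_id = {}
    for item in items:
        key = item["Id"]["S"]
        if key.startswith(PREFIX):
            by_id[key[len(PREFIX):]] = item.get("Data", {}).get("M", {})

    missing, incomplete, wrong = [], [], []
    for iid in instance_ids:
        data = by_id.get(iid)
        if data is None:
            # Assumed meaning: no record at all for this instance.
            missing.append(iid)
            continue
        version = data.get("cluster_config_version", {}).get("S")
        if version is None:
            # Assumed meaning: record exists but carries no version.
            incomplete.append(iid)
        elif version != expected_version:
            # Record reports a version other than the one being deployed.
            wrong.append((iid, version))
    return missing, incomplete, wrong


if __name__ == "__main__":
    # One of the items from the log above, checked against the new version.
    item = {
        "Id": {"S": "CLUSTER_CONFIG.i-0ad608de89b8f13a0"},
        "Data": {"M": {
            "node_type": {"S": "ComputeFleet"},
            "cluster_config_version": {"S": "cXwyunc_wUD6bD2hgjjUfxToGqJM20y8"},
            "status": {"S": "DEPLOYED"},
        }},
    }
    print(classify_records(
        ["i-0ad608de89b8f13a0"], [item], "Y3fffTZocLDDtDwaqdhTPWaUxdHRMx5I"
    ))
```

Under this model, a record that still carries the old version stays in the "wrong" list on every retry unless something deletes or updates it, which matches the repeated failures in the log.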

Additional context

The only way to recover without a full rollback was to manually delete the stale instance records from the DynamoDB table parallelcluster-pcluster-prod. Once the ghost records were removed, the HeadNode immediately recognized the cluster as ready and signaled SUCCESS to CloudFormation.
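
For anyone hitting the same problem, the manual cleanup can be sketched as follows. This is an illustrative helper, not part of ParallelCluster. The name `stale_record_keys` is hypothetical, and it assumes that `Id` is the table's only key attribute and that the caller already has the scanned items and the set of instance IDs that still exist in EC2.

```python
from typing import Dict, Iterable, List

# Record key prefix as it appears in the logged DDB items.
PREFIX = "CLUSTER_CONFIG."


def stale_record_keys(items: Iterable[Dict], live_instance_ids: Iterable[str]) -> List[Dict]:
    """Return DynamoDB keys of CLUSTER_CONFIG records whose instance is gone."""
    live = set(live_instance_ids)
    keys = []
    for item in items:
        record_id = item["Id"]["S"]
        if not record_id.startswith(PREFIX):
            # Leave any non-node records in the table untouched.
            continue
        if record_id[len(PREFIX):] not in live:
            # Assumption: "Id" alone identifies the item.
            keys.append({"Id": {"S": record_id}})
    return keys


if __name__ == "__main__":
    items = [
        {"Id": {"S": "CLUSTER_CONFIG.i-07ebfd7ab147b8072"}},
        {"Id": {"S": "CLUSTER_CONFIG.i-0accf185802c0e3df"}},
    ]
    # Pretend only the second instance still exists in EC2.
    for key in stale_record_keys(items, ["i-0accf185802c0e3df"]):
        print(key)
```

Each returned key could then be passed to the boto3 DynamoDB client, for example `delete_item(TableName="parallelcluster-pcluster-prod", Key=key)`. Reviewing the list before deleting anything is advisable, since removing a record for a live node would hide a real mismatch.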

Part of the /var/log/chef-client.log

ubuntu@ip-172-31-41-53:~$ tail -f /var/log/chef-client.log
                  raise e
                File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 140, in check_cluster_ready
                  check_deployed_config_version(cluster_name, table_name, config_version, region)
                File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 116, in check_deployed_config_version
                  raise CheckFailedError(
              common.exceptions.CheckFailedError: Check failed due to the following erroneous records:
                * missing records (0): []
                * incomplete records (0): []
                * wrong records (5): [('i-07ebfd7ab147b8072', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0accf185802c0e3df', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09322aea55eb1ad50', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09cfc94d5b81ddc12', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0ad608de89b8f13a0', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8')]
[2025-12-22T10:25:40+00:00] INFO: Retrying execution of execute[Check cluster readiness], 2 attempts left


    [execute] INFO:__main__:Checking cluster readiness with arguments: cluster_name=pcluster-prod, table_name=parallelcluster-pcluster-prod, config_version=Y3fffTZocLDDtDwaqdhTPWaUxdHRMx5I, region=us-east-2
              INFO:__main__:Checking that cluster configuration deployed on cluster nodes for cluster pcluster-prod is Y3fffTZocLDDtDwaqdhTPWaUxdHRMx5I
              INFO:botocore.credentials:Found credentials from IAM Role: pcluster-prod-RoleHeadNode-S4jDDpgsWiSM
              INFO:__main__:Found batch of 5 cluster node(s): ['i-07ebfd7ab147b8072', 'i-0accf185802c0e3df', 'i-09322aea55eb1ad50', 'i-09cfc94d5b81ddc12', 'i-0ad608de89b8f13a0']
              INFO:__main__:Retrieved 5 DDB item(s):
              	{'Id': {'S': 'CLUSTER_CONFIG.i-0ad608de89b8f13a0'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:19:56 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
              	{'Id': {'S': 'CLUSTER_CONFIG.i-09322aea55eb1ad50'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:20:04 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
              	{'Id': {'S': 'CLUSTER_CONFIG.i-09cfc94d5b81ddc12'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:19:47 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
              	{'Id': {'S': 'CLUSTER_CONFIG.i-0accf185802c0e3df'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:19:43 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
              	{'Id': {'S': 'CLUSTER_CONFIG.i-07ebfd7ab147b8072'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:19:45 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
              ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
                * missing records (0): []
                * incomplete records (0): []
                * wrong records (5): [('i-07ebfd7ab147b8072', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0accf185802c0e3df', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09322aea55eb1ad50', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09cfc94d5b81ddc12', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0ad608de89b8f13a0', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8')]
              Traceback (most recent call last):
                File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 152, in <module>
                  check_cluster_ready()  # pylint: disable=no-value-for-parameter
                  ^^^^^^^^^^^^^^^^^^^^^
                File "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/click/core.py", line 1161, in __call__
                  return self.main(*args, **kwargs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/click/core.py", line 1082, in main
                  rv = self.invoke(ctx)
                       ^^^^^^^^^^^^^^^^
                File "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/click/core.py", line 1443, in invoke
                  return ctx.invoke(self.callback, **ctx.params)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/click/core.py", line 788, in invoke
                  return __callback(*args, **kwargs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 143, in check_cluster_ready
                  raise e
                File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 140, in check_cluster_ready
                  check_deployed_config_version(cluster_name, table_name, config_version, region)
                File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 116, in check_deployed_config_version
                  raise CheckFailedError(
              common.exceptions.CheckFailedError: Check failed due to the following erroneous records:
                * missing records (0): []
                * incomplete records (0): []
                * wrong records (5): [('i-07ebfd7ab147b8072', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0accf185802c0e3df', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09322aea55eb1ad50', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09cfc94d5b81ddc12', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0ad608de89b8f13a0', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8')]
[2025-12-22T10:27:11+00:00] INFO: Retrying execution of execute[Check cluster readiness], 1 attempt left
