Description
Required Info:
- AWS ParallelCluster version [e.g. 3.1.1]: 3.14.0
- Full cluster configuration without any credentials or personal data.
- Cluster name: pcluster-prod
- Output of the pcluster describe-cluster command:
{
"creationTime": "2025-12-03T00:53:22.894Z",
"headNode": {
"launchTime": "2025-12-03T00:58:05.000Z",
"instanceId": "i-0353ad77176f3a317",
"publicIpAddress": "18.224.95.13",
"instanceType": "m5.xlarge",
"state": "running",
"privateIpAddress": "172.31.41.53"
},
"version": "3.14.0",
"clusterConfiguration": {
"url": "https://XXX"
},
"tags": [
{
"value": "3.14.0",
"key": "parallelcluster:version"
},
{
"value": "pcluster-prod",
"key": "parallelcluster:cluster-name"
}
],
"cloudFormationStackStatus": "UPDATE_COMPLETE",
"clusterName": "pcluster-prod",
"computeFleetStatus": "RUNNING",
"cloudformationStackArn": "XXX",
"lastUpdatedTime": "2025-12-22T11:43:49.414Z",
"region": "us-east-2",
"clusterStatus": "UPDATE_COMPLETE",
"scheduler": {
"type": "slurm"
}
}
- [Optional] Arn of the cluster CloudFormation main stack:
Bug description and how to reproduce:
An orchestration deadlock occurs during pcluster update-cluster: the head node stops clustermgtd and runs execute[Check cluster readiness] before restarting the daemon. The readiness check script (check_cluster_ready.py) relies on the DynamoDB table being in a consistent state. If instances are terminated or replaced during the update, their records persist in DynamoDB as "Ghost Records", because the garbage collection mechanism runs inside clustermgtd, which is offline at that point. The head node therefore keeps failing the check against records for instances that no longer exist, and the update eventually ends in a HeadNodeWaitCondition timeout.
Steps to reproduce:
- Initiate a cluster update (v3.14.0) that includes changes to AdditionalIamPolicies or CustomActions.
- Ensure some compute instances are terminated or fail during the update.
- Observe the HeadNode log /var/log/chef-client.log. It will show wrong records for instances that no longer exist in EC2.
- The HeadNode remains stuck because clustermgtd is in STOPPED state and cannot prune the stale DynamoDB entries.
Evidence & Logs
From Head node (/var/log/chef-client.log):
[2025-12-22T03:23:29+00:00] FATAL: execute[Check cluster readiness] (aws-parallelcluster-slurm::update_head_node line 169)
ERROR:__main__:Some cluster readiness checks failed:
* wrong records (5): [('i-07ebfd7ab147b8072', 'old-version-id'), ...]
[execute] clustermgtd: ERROR (not running)
Supervisor Status on HeadNode:
$ supervisorctl status
clustermgtd STOPPED Dec 22 03:42 AM # Stuck after update failure
Root Cause Analysis
In the recipe aws-parallelcluster-slurm::update_head_node, the sequence is:
- execute[stop clustermgtd]
- execute[Check cluster readiness] (Deadlock occurs here)
- execute[start clustermgtd] (Never reached)
The readiness check expects all entries in the DynamoDB table to match the desired version. However, without an active clustermgtd, stale entries for terminated instances ("Ghost Records") are never removed, resulting in a permanent version mismatch.
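To make the mismatch concrete, here is a minimal sketch of the kind of comparison the readiness check appears to perform, reconstructed only from the log output in this report. It is not the actual code of check_cluster_ready.py: the function name classify_records is hypothetical, and the rule used for "incomplete" (a missing version or status field) is an assumption. The record shape and the two version strings are taken from the chef-client.log excerpt.

```python
from typing import Dict, Iterable, List, Tuple


def classify_records(
    node_ids: Iterable[str],
    ddb_items: Iterable[Dict],
    expected_version: str,
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Split nodes into (missing, incomplete, wrong) relative to expected_version.

    Hypothetical helper: mirrors the three categories printed in the log,
    not the real implementation.
    """
    # Index items by instance ID; record IDs look like "CLUSTER_CONFIG.<instance-id>".
    by_instance = {}
    for item in ddb_items:
        _, _, instance_id = item["Id"]["S"].partition(".")
        by_instance[instance_id] = item

    missing, incomplete, wrong = [], [], []
    for node_id in node_ids:
        item = by_instance.get(node_id)
        if item is None:
            missing.append(node_id)
            continue
        data = item.get("Data", {}).get("M", {})
        version = data.get("cluster_config_version", {}).get("S")
        status = data.get("status", {}).get("S")
        if version is None or status is None:
            # Assumption: a record lacking either field counts as incomplete.
            incomplete.append(node_id)
        elif version != expected_version:
            wrong.append((node_id, version))
    return missing, incomplete, wrong


# Values copied from the log excerpt in this report.
expected = "Y3fffTZocLDDtDwaqdhTPWaUxdHRMx5I"
stale = "cXwyunc_wUD6bD2hgjjUfxToGqJM20y8"
nodes = ["i-07ebfd7ab147b8072", "i-0accf185802c0e3df"]
items = [
    {
        "Id": {"S": f"CLUSTER_CONFIG.{node}"},
        "Data": {"M": {
            "node_type": {"S": "ComputeFleet"},
            "cluster_config_version": {"S": stale},
            "status": {"S": "DEPLOYED"},
        }},
    }
    for node in nodes
]

missing, incomplete, wrong = classify_records(nodes, items, expected)
print(f"missing={missing} incomplete={incomplete} wrong={wrong}")
```

With nothing pruning the table, the stale version string never changes between retries, so every attempt reports the same "wrong records" list.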
Additional context
The only way to recover without a full rollback was to manually delete the stale instance records from the DynamoDB table parallelcluster-pcluster-prod. Once the ghost records were removed, the HeadNode immediately recognized the cluster as ready and signaled SUCCESS to CloudFormation.
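The manual cleanup can be sketched roughly as follows. This is an illustration under stated assumptions, not an official recovery procedure: the helper names find_ghost_keys and delete_ghost_records are hypothetical, the CLUSTER_CONFIG. prefix is taken from the log excerpt, and the code assumes that Id is the table's only key attribute (worth confirming against the table's key schema before deleting anything). The set of live instance IDs must be supplied by the caller, and the client is expected to expose a boto3-style delete_item(TableName=..., Key=...) call.

```python
from typing import Dict, Iterable, List, Set

RECORD_PREFIX = "CLUSTER_CONFIG."  # prefix seen in the DDB items in the log


def find_ghost_keys(ddb_items: Iterable[Dict], live_instance_ids: Set[str]) -> List[Dict]:
    """Return DynamoDB keys of records whose instance is not in live_instance_ids."""
    keys = []
    for item in ddb_items:
        record_id = item["Id"]["S"]
        if not record_id.startswith(RECORD_PREFIX):
            continue  # leave unrelated records alone
        instance_id = record_id[len(RECORD_PREFIX):]
        if instance_id not in live_instance_ids:
            # Assumption: "Id" is the sole key attribute of the table.
            keys.append({"Id": {"S": record_id}})
    return keys


def delete_ghost_records(ddb_client, table_name: str, keys: List[Dict], dry_run: bool = True) -> int:
    """Delete the given keys, or only print them when dry_run is True."""
    for key in keys:
        if dry_run:
            print(f"[dry-run] would delete {key['Id']['S']} from {table_name}")
        else:
            ddb_client.delete_item(TableName=table_name, Key=key)
    return len(keys)


# Example with two of the records from the log and no live instances.
sample_items = [
    {"Id": {"S": "CLUSTER_CONFIG.i-07ebfd7ab147b8072"}},
    {"Id": {"S": "CLUSTER_CONFIG.i-0accf185802c0e3df"}},
]
ghost_keys = find_ghost_keys(sample_items, live_instance_ids=set())
delete_ghost_records(None, "parallelcluster-pcluster-prod", ghost_keys, dry_run=True)
```

For a real run, ddb_client would be something like boto3.client("dynamodb", region_name="us-east-2") and dry_run would be set to False only after reviewing the dry-run output.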
Excerpt from /var/log/chef-client.log:
ubuntu@ip-172-31-41-53:~$ tail -f /var/log/chef-client.log
raise e
File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 140, in check_cluster_ready
check_deployed_config_version(cluster_name, table_name, config_version, region)
File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 116, in check_deployed_config_version
raise CheckFailedError(
common.exceptions.CheckFailedError: Check failed due to the following erroneous records:
* missing records (0): []
* incomplete records (0): []
* wrong records (5): [('i-07ebfd7ab147b8072', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0accf185802c0e3df', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09322aea55eb1ad50', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09cfc94d5b81ddc12', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0ad608de89b8f13a0', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8')]
[2025-12-22T10:25:40+00:00] INFO: Retrying execution of execute[Check cluster readiness], 2 attempts left
[execute] INFO:__main__:Checking cluster readiness with arguments: cluster_name=pcluster-prod, table_name=parallelcluster-pcluster-prod, config_version=Y3fffTZocLDDtDwaqdhTPWaUxdHRMx5I, region=us-east-2
INFO:__main__:Checking that cluster configuration deployed on cluster nodes for cluster pcluster-prod is Y3fffTZocLDDtDwaqdhTPWaUxdHRMx5I
INFO:botocore.credentials:Found credentials from IAM Role: pcluster-prod-RoleHeadNode-S4jDDpgsWiSM
INFO:__main__:Found batch of 5 cluster node(s): ['i-07ebfd7ab147b8072', 'i-0accf185802c0e3df', 'i-09322aea55eb1ad50', 'i-09cfc94d5b81ddc12', 'i-0ad608de89b8f13a0']
INFO:__main__:Retrieved 5 DDB item(s):
{'Id': {'S': 'CLUSTER_CONFIG.i-0ad608de89b8f13a0'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:19:56 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
{'Id': {'S': 'CLUSTER_CONFIG.i-09322aea55eb1ad50'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:20:04 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
{'Id': {'S': 'CLUSTER_CONFIG.i-09cfc94d5b81ddc12'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:19:47 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
{'Id': {'S': 'CLUSTER_CONFIG.i-0accf185802c0e3df'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:19:43 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
{'Id': {'S': 'CLUSTER_CONFIG.i-07ebfd7ab147b8072'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'}, 'lastUpdateTime': {'S': '2025-12-22 10:19:45 UTC'}, 'status': {'S': 'DEPLOYED'}}}}
ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
* missing records (0): []
* incomplete records (0): []
* wrong records (5): [('i-07ebfd7ab147b8072', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0accf185802c0e3df', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09322aea55eb1ad50', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09cfc94d5b81ddc12', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0ad608de89b8f13a0', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8')]
Traceback (most recent call last):
File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 152, in <module>
check_cluster_ready() # pylint: disable=no-value-for-parameter
^^^^^^^^^^^^^^^^^^^^^
File "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/click/core.py", line 1161, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/click/core.py", line 1082, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 143, in check_cluster_ready
raise e
File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 140, in check_cluster_ready
check_deployed_config_version(cluster_name, table_name, config_version, region)
File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 116, in check_deployed_config_version
raise CheckFailedError(
common.exceptions.CheckFailedError: Check failed due to the following erroneous records:
* missing records (0): []
* incomplete records (0): []
* wrong records (5): [('i-07ebfd7ab147b8072', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0accf185802c0e3df', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09322aea55eb1ad50', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-09cfc94d5b81ddc12', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8'), ('i-0ad608de89b8f13a0', 'cXwyunc_wUD6bD2hgjjUfxToGqJM20y8')]
[2025-12-22T10:27:11+00:00] INFO: Retrying execution of execute[Check cluster readiness], 1 attempt left