
Conversation

@zxhe-sean
Collaborator

Description

Adding a flag `--multi-container` to support multi-container workloads.

With this flag enabled, running `workload create` will duplicate the container section in the final YAML file, so the resulting workload runs as two containers on one host.
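For illustration, here is a minimal sketch of what that duplication step could look like in Python, assuming the workload spec is a plain dict parsed from the YAML. The function name and the `-0`/`-1` renaming scheme are hypothetical, not xpk's actual implementation:

```python
import copy


def duplicate_container(pod_spec: dict) -> dict:
    """Sketch: turn a single-container pod spec into a two-container one.

    Assumes pod_spec["spec"]["containers"] holds exactly one container.
    Kubernetes requires container names to be unique within a pod, so the
    clone gets a distinct name suffix.
    """
    spec = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    containers = spec["spec"]["containers"]
    base = containers[0]
    clone = copy.deepcopy(base)
    name = base["name"]
    base["name"] = f"{name}-0"
    clone["name"] = f"{name}-1"
    containers.append(clone)
    return spec
```

Both containers inherit the same image, command, and resource requests; only the name differs.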

Issue

b/459820949

Testing

https://cloudlogging.app.goo.gl/F81QQQ4Ca9Gew7hs8

@github-actions

🤖 Hi @zxhe-sean, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@scaliby
Member

scaliby commented Nov 27, 2025

@zxhe-sean why do we need this change? Could you cover it with tests?

@scaliby scaliby self-assigned this Nov 27, 2025
@zxhe-sean zxhe-sean marked this pull request as draft December 6, 2025 07:29
@zxhe-sean zxhe-sean force-pushed the zxhe/multi-container branch 2 times, most recently from a62db5b to f21d198 Compare December 6, 2025 08:01
@zxhe-sean zxhe-sean marked this pull request as ready for review December 6, 2025 08:02
@zxhe-sean zxhe-sean added the release-features features label Dec 6, 2025
@zxhe-sean zxhe-sean force-pushed the zxhe/multi-container branch 8 times, most recently from 0e8f567 to cfb8a07 Compare December 8, 2025 03:49
@jamOne- jamOne- self-assigned this Dec 8, 2025
Collaborator

@jamOne- jamOne- left a comment


Thanks for the changes and the unit tests!
I left a couple of comments, PTAL.

@zxhe-sean zxhe-sean force-pushed the zxhe/multi-container branch 2 times, most recently from 48b43f5 to 0f07392 Compare December 9, 2025 07:12
@zxhe-sean zxhe-sean force-pushed the zxhe/multi-container branch from 0f07392 to 1fe5b6d Compare December 9, 2025 07:22
Collaborator

@jamOne- jamOne- left a comment


Thanks for the changes, looks good! Left one more comment

tpu_type_requires_workload_policy=True,
supports_accelerator_network_profile=False,
docker_platform=AMD_PLATFORM,
parallel_containers=2,
Collaborator


In the FR, Victor said that not all tpu7x topologies should support this; for instance, 2x2x1 should not? Or am I wrong?

Could we add support for that?

Collaborator Author


2x2x1 is 4 chips on a full host. I think it is the smallest topology where we can start using 2 containers @Obliviour

Collaborator


4 chips can support it, but fewer than 4 can't.

@jamOne- jamOne- requested a review from Obliviour December 9, 2025 09:31
@jamOne-
Collaborator

jamOne- commented Dec 9, 2025

@Obliviour @FIoannides PTAL

@Obliviour
Collaborator

So this PR gives us the ability, on the workload side, to enable two-container workloads.

There are two other dimensions that need to be covered, and I think we need the xpk team's help with both:

  1. For scheduling to be NUMA-aware, the GKE node pool's topology manager needs to be configured with the best-effort policy. The default is none, so when the two containers are scheduled, placement won't be NUMA-aware. GKE has an improvement on the way that makes best-effort the default, which will solve this; but in the short term, we need to create node pools with the topology manager explicitly set to best-effort.

  2. When NAP is used, we need the node pool to be created with the same topology manager setting through NAP.
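The node pool setting described in (1) maps to the kubelet's topology manager options. The fragment below uses the upstream KubeletConfiguration field names purely as an illustration; how GKE's node system config (or NAP) actually surfaces these fields is an assumption to verify, not confirmed here:

```yaml
# Upstream kubelet configuration sketch (not a confirmed GKE schema).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Default is "none"; "best-effort" asks the kubelet to try to align
# CPU/device allocations to the same NUMA node without rejecting pods
# it cannot align.
topologyManagerPolicy: best-effort
# "pod" scope aligns resources for all containers in the pod together,
# which is what a two-container-per-host workload wants; default is
# "container".
topologyManagerScope: pod
```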
