Skip to content

Conversation

@crusaderky
Copy link
Contributor

@crusaderky crusaderky commented Nov 30, 2022

When one significantly increases the cluster size (#549), the current hardcoded levels of parallelism can cause Coiled to fail to allocate all the necessary workers.

Additionally, pytest-xdist tends to eat the pytest output, making debugging more difficult.
Expose parallelisation levels as knobs in the A/B config file and optionally disable pytest-xdist completely.

@crusaderky crusaderky self-assigned this Nov 30, 2022
@ntabris
Copy link
Member

ntabris commented Nov 30, 2022

What sort of cluster size are you talking about? Do you have example of test runs where Coiled failed to provision the desired number of instances?

@crusaderky
Copy link
Contributor Author

@ntabris in #355 we ascertained that AWS DoS protection will kick in beyond a certain, unspecified threshold.

#483 is about having the option to play with 1,000 workers clusters.
This PR makes the difference between 1,000 instances at once and 20,000.

@ntabris
Copy link
Member

ntabris commented Nov 30, 2022

I don't know what you mean by AWS DoS, there are lots of limits and I don't see anything in #355 that describes what you were seeing.

Maybe let's talk before you start making multiple 1000 worker clusters. That's fine to do occasionally, but we should make sure we aren't doing this way more than necessary, and 20 1000 worker clusters is a lot (even if you're making them one at a time).

@crusaderky crusaderky marked this pull request as ready for review December 1, 2022 16:53
@crusaderky
Copy link
Contributor Author

My above comment also applies to 100 workers. This PR makes the difference between 100 workers and 2000 workers.

@ntabris
Copy link
Member

ntabris commented Dec 1, 2022

Sure, I'm not saying this PR has a problem, I'm just saying me/platform should be in the loop when you're thinking about doing larger scale testing.

Copy link
Contributor

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @crusaderky! I have one nit about naming, feel free to ignore.

# Number of parallel test_*.py modules per A/B test job.
# Each module typically spawns one Coiled cluster at a time.
# Set to 1 to disable pytest-xdist.
modules_per_job: 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I first ran into the variable name without the description and it took me a moment to figure out how the number of modules would impact parallelism. Would it make sense to rename this to something like pytest_workers_per_job to highlight that this is about pytest-xdist parallelism?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@ncclementi ncclementi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just left a small comment

- runtime
- benchmarks
- stability
max_parallel:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same way we explain opther parameters (e.g category, test_null_hypothesis, etc ) in the README, can we explain these two in the README as well?

I know there are some comments in the config.yml but it'll be useful having them explain in the readme too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@crusaderky crusaderky merged commit b4f894c into main Dec 14, 2022
@crusaderky crusaderky deleted the guido/config_parallelism branch December 14, 2022 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Configure number of workers and instance types in A/B tests

5 participants