Skip to content

Conversation

@kkaarreell
Copy link
Collaborator

The default retry session configuration has been updated to address frequent 503 error responses from ReportPortal:

  • Increased DEFAULT_RETRY_SESSION_RETRIES from 3 to 10
  • Increased DEFAULT_RETRY_SESSION_BACKOFF_FACTOR from 0.1 to 1

These values are now configurable via environment variables using the configure_constant() pattern:

  • TMT_RETRY_SESSION_RETRIES overrides retry count
  • TMT_RETRY_SESSION_BACKOFF_FACTOR overrides backoff factor

Added configure_float_constant() function to support float-type environment variable configuration, following the pattern of configure_constant() for integers.

This addresses errors like:
HTTPSConnectionPool: Max retries exceeded with url: /api/v1/... (Caused by ResponseError('too many 503 error responses'))

🤖 Generated with Claude Code

Pull Request Checklist

  • implement the feature
  • write the documentation
  • extend the test coverage
  • update the specification
  • adjust plugin docstring
  • modify the json schema
  • mention the version
  • include a release note

Copy link
Contributor

@LecrisUT LecrisUT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you run pre-commit. We don't have the bot to do that for you.

Comment on lines 270 to 276
DEFAULT_RETRY_SESSION_RETRIES: int = 10
RETRY_SESSION_RETRIES: int = configure_constant(
DEFAULT_RETRY_SESSION_RETRIES, 'TMT_RETRY_SESSION_RETRIES')

DEFAULT_RETRY_SESSION_BACKOFF_FACTOR: float = 1
RETRY_SESSION_BACKOFF_FACTOR: float = configure_float_constant(
DEFAULT_RETRY_SESSION_BACKOFF_FACTOR, 'TMT_RETRY_SESSION_BACKOFF_FACTOR')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default maximum is 17 min (of the last retry), which is even bigger than DEFAULT_BACKOFF_MAX. Some adjustments of the numbers is needed. Would have been good to have more control of the power as well since we probably shouldn't be doing the retry 10 times, but instead have a more aggressive power

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to discuss the actual values, for now I have just copied ENVFILE defaults below. The previous value allowed only the delays 0.1 s, 0.2 s, 0.4 s which are IMHO way too small if we are dealing with systems under a load.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So after a discussion and given the default BACKOFF_MAX value, I have decided to stick with 10 retries which result in a cumulative delay being 10 minutes and 7 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So after a discussion and given the default BACKOFF_MAX value, I have decided to stick with 10 retries which result in a cumulative delay being 10 minutes and 7 seconds.

How did you come up with this number. My maths is 1 * (2^11 - 1) [s] = 34.11 [min] . No bug opinions on what the correct value should be, just make sure you also adjust the backoff_max used in creating the retry. Given that this is user configurable I think it should do a similar mathematical evaluation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, I did the math wrong, the cumulated sleep time is 487 seconds, which is 8 minutes and 7 seconds.

The DEFAULT_BACKOFF_MAX is 120 seconds. Therefore, even though 2^7 > 120, the delay would be 120.
Therefore, following the formula backoff_factor×2^(retry_number−1) we are getting

 retry 1: 1s
 retry 2: 2s
 retry 3: 4s
 ...
 retry 7: 64
 retry 8: 120s (capped by default BACKOFF_MAX)
 retry 9: 120s (capped)
 retry 10: 120s (capped)
with cumulative delay being 487 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's (2 ^ (retry_number + 1) - 1) excluding capoff. But if the user alter the values from env-var, I do not expect them to account for the built-in backoff_max, thus the suggestion to eliminate it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not. Check backoff_factor documentation at https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html
It states {backoff factor} * (2 ** ({number of previous retries}))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not. Check backoff_factor documentation at https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html It states {backoff factor} * (2 ** ({number of previous retries}))

Aware of that, was off by one due to missing 0-index, should have been (2 ^ (retry_number) - 1) as in 2^0 + 2^1 + ... + 2^n = 2^(n+1) - 1

Comment on lines 278 to 281
# Defaults for HTTP/HTTPS retries for getting environment file
# Retry with exponential backoff, maximum duration ~511 seconds
ENVFILE_RETRY_SESSION_RETRIES: int = 10
ENVFILE_RETRY_SESSION_BACKOFF_FACTOR: float = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify? The constant is used in the very same file but since it was not concert I didn't touch unrelated settings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new values of DEFAULT_RETRY_* match the values for these so these can be replaced with the common one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, unless we want to keep those settings separated on purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most like we do not want separate setting. There is no good reason.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the motivation why the settings has been previously introduced specifically for the ENVFILE, most likely it seemed to be necessary to be more benevolent in this particular case. Given that the new settings could be overriden through environment variables, it could happen that someone will set the value too low which will reintroduce the issue with obtaining ENVFILE.

I mean, I can drop these lines, but if there is no reason to distinguish between the two, why it has been introduced as a separate settings at the first place?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In June 2022 it seemed like a good idea, but not anymore. One set to rule them all, envvar file isn't special enough.

@kkaarreell kkaarreell force-pushed the ks_retry_consts branch 2 times, most recently from 7e75c7f to 0e5bd6b Compare December 9, 2025 12:53
@psss psss added step | report Stuff related to the report step plugin | reportportal The reportportal report plugin labels Dec 9, 2025
@psss psss added this to planning Dec 9, 2025
@github-project-automation github-project-automation bot moved this to backlog in planning Dec 9, 2025
@psss
Copy link
Contributor

psss commented Dec 9, 2025

Checked with Ondrej that it's not super-urgent but let's fix as soon as possible. Proposing, together with the issue, for the next sprint.

The default retry session configuration has been updated to address
frequent 503 error responses from ReportPortal:
- Increased DEFAULT_RETRY_SESSION_RETRIES from 3 to 10
- Increased DEFAULT_RETRY_SESSION_BACKOFF_FACTOR from 0.1 to 1
resulting in the following delays:
 retry 1: 1s
 retry 2: 2s
 retry 3: 4s
 ...
 retry 8: 120s (capped by default BACKOFF_MAX)
 retry 9: 120s (capped)
 retry 10: 120s (capped)
with cumulative delay being 8 minutes and 7 seconds.

These values are now configurable via environment variables using
the configure_constant() pattern:
- TMT_RETRY_SESSION_RETRIES overrides retry count
- TMT_RETRY_SESSION_BACKOFF_FACTOR overrides backoff factor

Added configure_float_constant() function to support float-type
environment variable configuration, following the pattern of
configure_constant() for integers.

This addresses errors like:
HTTPSConnectionPool: Max retries exceeded with url: /api/v1/...
(Caused by ResponseError('too many 503 error responses'))

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
  1. DEFAULT_RETRY_SESSION_RETRIES from 9 back to 8 in /home/ksrot/devel/github/tmt/tmt/utils/__init__.py
  2. DEFAULT_RETRY_SESSION_BACKOFF_FACTOR from 1.0 to 2.0 in /home/ksrot/devel/github/tmt/tmt/utils/__init__.py
  3. Documentation in /home/ksrot/devel/github/tmt/docs/overview.rst to reflect 8 retries with backoff factor 2.0 and the correct delay
  sequence: 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s

  With these settings (8 retries, backoff factor 2.0), the cumulative sleep time is 510 seconds (8 minutes 30 seconds).

Assisted-by: Claude AI
@LecrisUT LecrisUT moved this from backlog to implement in planning Dec 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

plugin | reportportal The reportportal report plugin step | report Stuff related to the report step

Projects

Status: implement

Development

Successfully merging this pull request may close these issues.

Consider increasing default back-off factor for ReportPortal report plugin

5 participants