@strtgbb strtgbb commented Nov 26, 2025

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Fix crossout not working when the error occurs during teardown.
Cross out a few more known failures.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

github-actions bot commented Nov 26, 2025

Workflow [PR], commit [8126097]

@strtgbb strtgbb force-pushed the fix-integ-crossout-25.8 branch from bf987f5 to 8126097 Compare November 26, 2025 15:34
@strtgbb strtgbb added cicd Improvements and fixes to the CICD process antalya-25.8 labels Nov 26, 2025
@strtgbb strtgbb merged commit 591b199 into antalya-25.8 Dec 1, 2025
135 of 156 checks passed
@strtgbb strtgbb deleted the fix-integ-crossout-25.8 branch December 1, 2025 16:49
@CarlosFelipeOR

QA Validation

1 – Failure on teardown: ✅PASSED

Context

The integration test test_scheduler_cpu_preemptive/test.py::test_downscaling[cpu-slot-preemption-timeout-1ms] was intermittently failing on teardown with a 900000 ms timeout, and in those cases the crossout logic was not working correctly.
Report from 2025-11-24 14:41:11

Expected

  • The test should be marked as BROKEN even if it fails during teardown.
  • A warning message should appear in broken_tests_handler.log indicating that log extraction failed and the full log was used.

Validation

The fix was merged on Dec 1, 2025.

By querying the database, we can see that after the last failure on 2025-11-24, there is a subsequent run where the same test is marked as BROKEN, still with the 900000 ms timeout.
The latest validated run occurred on 2025-12-04 11:22:51.

Additionally, checking broken_tests_handler.log, we can see the new warning message:

WARNING: Test 'test_scheduler_cpu_preemptive/test.py::test_downscaling[cpu-slot-preemption-timeout-1ms]' has no logs among [], assuming log extraction failed, proceeding with full log

Conclusion

The fix is working as expected.
Even when the test fails during teardown, it is now correctly marked as BROKEN (when a crossout expression exists), and the new warning message is emitted when log extraction fails.
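
The fallback described above can be sketched as follows. This is a minimal illustration, not the handler's actual code: the function names `pick_log` and `is_known_broken`, the `per_test_logs` mapping, and the pattern list are all hypothetical. Only the wording of the warning is taken from the log line quoted above.

```python
import logging
import re

logger = logging.getLogger("broken_tests_handler")


def pick_log(test_name, per_test_logs, full_log):
    """Return the text that crossout patterns should be matched against.

    per_test_logs is a hypothetical mapping of test name -> extracted log.
    If the test has no extracted log (for example, because it failed during
    teardown), fall back to the full job log and emit a warning.
    """
    log = per_test_logs.get(test_name)
    if log:
        return log
    logger.warning(
        "Test '%s' has no logs among %s, assuming log extraction failed, "
        "proceeding with full log",
        test_name,
        list(per_test_logs),
    )
    return full_log


def is_known_broken(test_name, patterns, per_test_logs, full_log):
    """True if any crossout pattern (a regex string) matches the chosen log."""
    text = pick_log(test_name, per_test_logs, full_log)
    return any(re.search(pattern, text) for pattern in patterns)
```

With an empty `per_test_logs`, the match runs against the full log, so a teardown failure can still be marked as BROKEN when a crossout expression exists for it.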


2 - New BROKEN expressions

2.1 - 00024_random_counters: ❌FAILED

Historical data (database query)
Result: These tests are still being marked as FAIL instead of BROKEN. The root cause is that the failure message changed from:

Timeout! Killing process group

to:

Timeout! Processes left in process group

but the specific crossout expression was not updated accordingly.
As a result, the failure is now being caught by the generic sanitizer crossout instead of the intended BROKEN rule.
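
One way to keep a rule like this from breaking when the message changes is to match both wordings. The sketch below is only an illustration of that idea: the pattern and the `matches_timeout` helper are hypothetical and are not the actual expression in broken_tests.yaml.

```python
import re

# Hypothetical pattern covering both the old and the new timeout message.
TIMEOUT_PATTERN = re.compile(
    r"Timeout! (Killing process group|Processes left in process group)"
)


def matches_timeout(message):
    """Return True if the failure message is either timeout variant."""
    return TIMEOUT_PATTERN.search(message) is not None
```

A rule written against only the first alternative would miss the newer message, which is the mismatch described above.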

2.2 - test_storage_s3_queue/test_5.py::test_migration ✅PASSED

This test was failing intermittently, which makes it very difficult to reproduce locally.
By consulting the database, we can see that since it was added to the known broken tests, the test has passed in every subsequent run and has never been marked as BROKEN. As a result, the behavior under failure conditions could not be validated directly.

Validation:

  • The code and syntax were reviewed and compared against the error messages observed in previous failures.
  • The BROKEN test definition in broken_tests.yaml was reviewed and is correctly defined for this test.

Based on this, we can consider this change validated.
We will continue monitoring upcoming runs to ensure that, if this test fails again, it is correctly marked as BROKEN.

2.3 - test_storage_s3_queue/test_5.py::test_migration ✅PASSED

This test was failing intermittently, which makes it very difficult to reproduce locally.

Consulting the database for the variations [unordered-8], [ordered-1], and [ordered-2], we can see that since the test was added to the known broken tests, every subsequent run has passed and the test was never marked as BROKEN. Therefore, the BROKEN state could not be observed directly for validation.

Validation:

  • The code and syntax were reviewed.
  • The BROKEN test definition in broken_tests.yaml is correct.
  • The logic is consistent with other BROKEN test definitions.

Given this, we can consider this change validated, and the test will continue to be monitored in future runs to ensure that, if it fails again, it is correctly marked as BROKEN.

@CarlosFelipeOR

@strtgbb, as discussed above, could you please update the crossout rule to match the new error message for 00024_random_counters and link the PR here?

strtgbb commented Dec 18, 2025

The fix for 00024_random_counters is included in #1227

@CarlosFelipeOR

I checked PR #1227, and the change is correct. It updates the crossout rule to match the new error message.
The database also already shows this test marked as BROKEN after the fix.

So we can consider:

2.1 - 00024_random_counters: ✅ PASSED.

I’m adding the verified label to PR #1170.

@CarlosFelipeOR CarlosFelipeOR added the verified Verified by QA label Dec 18, 2025