SOLR-18025 Attempt to fix graceful shutdown of LeaderTragicEventTest #3965
base: main
dsmiley
left a comment
Was an analysis done as to the state of the various threads when the test timed out (I'm assuming test timeout was the ultimate symptom)? Hopefully it would show a clue as to a thread busy or waiting that is preventing the node it lives on from shutting down.
A few weeks ago, I noticed another test (ugh, I forget which) reliably taking a long time to shut down (I forget if it led to a failure or not) and partially root caused it in this way. I have a shelved change to ZkContainer.close() to call shutdownNowAndAwaitTermination (with the "Now" in there, which wasn't there before). I noticed a test trying to shut down had cores that were stuck registering in ZK for some reason. I suppose that's unrelated to the failure here but without seeing the threads -- who knows.
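To illustrate the difference the "Now" makes, here is a minimal sketch using only plain java.util.concurrent. It is not the shelved ZkContainer.close() change and not Solr's own executor utility; the class and helper names are hypothetical. A task that blocks forever keeps a pool alive after shutdown(), while shutdownNow() interrupts it so the pool can terminate.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; names are not taken from Solr.
class ShutdownNowSketch {

  /** Interrupts running tasks, then waits; returns true if the pool terminated in time. */
  static boolean shutdownNowAndAwait(ExecutorService pool, long timeout, TimeUnit unit)
      throws InterruptedException {
    pool.shutdownNow(); // interrupts workers and drains tasks that have not started
    return pool.awaitTermination(timeout, unit);
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch neverReleased = new CountDownLatch(1);

    // Stand-in for a task stuck waiting on something that never happens.
    pool.submit(() -> {
      neverReleased.await();
      return null;
    });

    pool.shutdown(); // graceful: already-submitted tasks are still allowed to run
    boolean graceful = pool.awaitTermination(200, TimeUnit.MILLISECONDS);
    System.out.println("terminated after shutdown(): " + graceful);

    boolean forced = shutdownNowAndAwait(pool, 5, TimeUnit.SECONDS);
    System.out.println("terminated after shutdownNow(): " + forced);
  }
}
```

This only helps when the stuck task responds to interruption; a thread blocked in non-interruptible I/O would still need a thread dump to diagnose.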
No, I have not dug into the cause of the hung nodes. I appreciate that all these failures may be a symptom of a real bug that prevents Solr from gracefully shutting down and giving up control / releasing ZK. I'll mark this as a draft and allow more time to fix the root cause instead of the symptom, then...
Analysis:
Root Cause
LeaderTragicEventTest fails during class-level shutdown when Jetty's Server.doStop() exceeds its internal timeout and throws ExecutionException(TimeoutException). After tragic events corrupt cores, shutdown naturally takes longer and can time out; this is expected behavior, not a test failure. See develocity logs here.
Fix
Added a shutdownTimeoutIsError configuration option to MiniSolrCloudCluster.
Implementation:
checkForExceptions() treats ExecutionException(TimeoutException) as a warning when shutdownTimeoutIsError=false.
https://issues.apache.org/jira/browse/SOLR-18025
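The behavior described above could look roughly like the following sketch. It is a stand-alone illustration, not the code from this PR: only the names shutdownTimeoutIsError and checkForExceptions come from the description, while the class, the method signature, the warnings list, and the helper isShutdownTimeout are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in; not the actual MiniSolrCloudCluster implementation.
class ShutdownCheckSketch {
  private final boolean shutdownTimeoutIsError;
  // Assumption: warnings are collected in a list here; real code would log them.
  final List<String> warnings = new ArrayList<>();

  ShutdownCheckSketch(boolean shutdownTimeoutIsError) {
    this.shutdownTimeoutIsError = shutdownTimeoutIsError;
  }

  /** True if the failure is an ExecutionException whose cause is a TimeoutException. */
  static boolean isShutdownTimeout(Throwable t) {
    return t instanceof ExecutionException && t.getCause() instanceof TimeoutException;
  }

  /** Rethrows shutdown failures, downgrading timeouts to warnings when configured to. */
  void checkForExceptions(String message, List<? extends Exception> failures) throws Exception {
    Exception combined = null;
    for (Exception e : failures) {
      if (!shutdownTimeoutIsError && isShutdownTimeout(e)) {
        warnings.add(message + ": " + e.getCause());
        continue; // tolerated: slow shutdown after a tragic event
      }
      if (combined == null) {
        combined = new Exception(message, e); // first real failure becomes the cause
      } else {
        combined.addSuppressed(e); // later failures are attached as suppressed
      }
    }
    if (combined != null) {
      throw combined;
    }
  }

  public static void main(String[] args) throws Exception {
    Exception timeout = new ExecutionException(new TimeoutException("stop timed out"));

    ShutdownCheckSketch lenient = new ShutdownCheckSketch(false);
    lenient.checkForExceptions("Error shutting down", List.of(timeout));
    System.out.println("lenient warnings: " + lenient.warnings.size());

    ShutdownCheckSketch strict = new ShutdownCheckSketch(true);
    try {
      strict.checkForExceptions("Error shutting down", List.of(timeout));
    } catch (Exception e) {
      System.out.println("strict mode rethrew: " + e.getMessage());
    }
  }
}
```

Keeping the strict behavior as the default would mean only tests that deliberately break cores, such as LeaderTragicEventTest, opt in to the lenient mode, so a genuine shutdown hang elsewhere still fails loudly.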