fix health req id gen when httpserver worker num > 1 #1137
Conversation
Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves a critical issue with request ID generation in environments with multiple HTTP server workers. It makes ID allocation more reliable by adding a safeguard that detects out-of-sequence ID generation, which can occur due to shared-memory inconsistencies during extreme startup conditions. It also streamlines ID generation for health checks by routing them through the primary ID management system, ensuring greater consistency and preventing potential ID conflicts across the application.
Code Review
This pull request aims to fix a bug in request ID generation within a multi-worker environment. The change in lightllm/utils/health_check.py to use a centralized ReqIDGenerator from the HttpServerManager is a good architectural improvement, as it removes a problematic global instance. However, the corresponding change in lightllm/server/req_id_generator.py only detects the symptom of the bug (ID regression) and logs an error, but still returns the invalid ID. This is a critical risk, as using a duplicate ID can lead to severe issues. My review suggests raising an exception to prevent this, and also points out that a more complete fix would address the root cause, which appears to be the re-initialization of shared memory.
lightllm/server/req_id_generator.py
Outdated
    # ... the problem of request ids becoming disordered.
    self.alloced_max_req_id = max(self.alloced_max_req_id, id)
    if id < self.alloced_max_req_id:
        logger.error(f"alloc req_id error, current alloc id {id} < max alloced id {self.alloced_max_req_id}")
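Read in context, the guarded allocation path looks roughly like the following self-contained sketch. This is an illustration only, not lightllm's actual implementation: the real generator backs its counter with shared memory so that multiple httpserver workers allocate from one sequence, whereas here a plain integer stands in for it.

```python
import logging

logger = logging.getLogger(__name__)


class ReqIDGenerator:
    """Illustrative stand-in: lightllm's real class backs the counter with
    shared memory so multiple httpserver workers share one ID sequence."""

    def __init__(self):
        self.current_id = 0          # stand-in for the shared-memory counter
        self.alloced_max_req_id = -1

    def generate_id(self) -> int:
        id = self.current_id
        self.current_id += 1
        # Track the largest id handed out so far; if a freshly allocated id is
        # smaller than that maximum, the shared counter must have been reset
        # (e.g. re-initialized by another worker during startup).
        self.alloced_max_req_id = max(self.alloced_max_req_id, id)
        if id < self.alloced_max_req_id:
            logger.error(
                f"alloc req_id error, current alloc id {id} < max alloced id {self.alloced_max_req_id}"
            )
        return id
```

Note that because `alloced_max_req_id` is updated with `max(...)` before the comparison, the check fires exactly when the new `id` is below the previous maximum, i.e. when the counter has gone backwards.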
This check correctly identifies a request ID regression, which is a critical issue. However, it only logs an error and then proceeds to return the invalid id. Returning a duplicate or older ID can lead to severe problems like request misrouting or data corruption in downstream systems.
Instead of just logging, an exception should be raised to halt the process and prevent the invalid ID from being used. This makes the failure explicit and prevents silent corruption.
While this change helps detect the issue, it's worth noting that the root cause is likely the re-initialization of the shared memory counter in the ReqIDGenerator's __init__ method by different worker processes. A more robust solution would involve ensuring the shared memory is initialized only once.
Suggested change:

-        logger.error(f"alloc req_id error, current alloc id {id} < max alloced id {self.alloced_max_req_id}")
+        raise RuntimeError(f"alloc req_id error, current alloc id {id} < max alloced id {self.alloced_max_req_id}")
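The initialize-once fix the reviewer alludes to could be sketched along these lines. This is a hypothetical illustration using Python's standard `multiprocessing.shared_memory` module, not lightllm's actual code; the segment name and function name are assumptions.

```python
import struct
from multiprocessing import shared_memory

SHM_NAME = "lightllm_req_id_counter_demo"  # assumed segment name


def attach_or_create_counter() -> shared_memory.SharedMemory:
    """Attach to the shared counter if it already exists; create and zero it
    only in the first process, so later workers never reset it."""
    try:
        # The first worker wins the create race and initializes the counter.
        shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=8)
        shm.buf[:8] = struct.pack("q", 0)
    except FileExistsError:
        # A sibling worker already created the segment: attach without
        # zeroing, which avoids the ID regression described in the review.
        shm = shared_memory.SharedMemory(name=SHM_NAME, create=False)
    return shm
```

With this pattern, whichever httpserver worker starts first creates and zeroes the counter, and every later worker merely attaches, so the shared value is never re-initialized mid-flight.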
No description provided.