🔔 added health check alerts #905
base: dev
    @property
    def memory_available_percent(self) -> float:
        return 100 - typing.cast("float", psutil.virtual_memory().percent)
Have you tested that these work on Windows?
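On the Windows question: as far as I recall from the psutil documentation, `virtual_memory()` is available on all supported platforms and its `percent` field is computed as `(total - available) / total * 100`, so `100 - percent` should equal the available share. The sketch below only checks that arithmetic with a stand-in object rather than psutil itself, so it is not a substitute for running on Windows. `FakeVirtualMemory` and `available_percent` are hypothetical names, not part of the PR.

```python
from typing import NamedTuple


class FakeVirtualMemory(NamedTuple):
    """Stand-in for the object returned by psutil.virtual_memory() (assumption)."""

    total: int
    available: int

    @property
    def percent(self) -> float:
        # Assumed to match psutil's documented formula for the used percentage.
        return (self.total - self.available) / self.total * 100


def available_percent(vm) -> float:
    """Mirror of the property under review: 100 minus the used percentage."""
    return 100 - float(vm.percent)


vm = FakeVirtualMemory(total=16_000, available=4_000)
print(available_percent(vm))
```

A real cross-platform check would still need a CI job on Windows that calls the actual property.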
    name="low_available_virtual_memory",
    metric=f"{RESOURCES_METRIC_PREFIX}/memory.virtual.available.percentage",
    threshold=5,
    aggregation="at least one",
What does this do?
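A sketch of how these fields might combine, assuming `aggregation="at least one"` means the alert fires when any single sample inside the evaluation window satisfies the rule (here, "is below" the `threshold`). This is a reading of the parameter names, not confirmed server behaviour, and `alert_fires` is a hypothetical helper.

```python
from collections.abc import Sequence


def alert_fires(samples: Sequence[float], threshold: float, aggregation: str) -> bool:
    """Evaluate an "is below" rule over the samples collected in one window."""
    below = [value < threshold for value in samples]
    if aggregation == "at least one":
        # Any single low sample in the window is enough to trigger.
        return any(below)
    if aggregation == "all":
        # Every sample in a non-empty window must be low.
        return bool(below) and all(below)
    if aggregation == "average":
        # The mean over a non-empty window must be low.
        return bool(samples) and sum(samples) / len(samples) < threshold
    raise ValueError(f"unsupported aggregation: {aggregation!r}")


window_samples = [40.0, 4.2, 38.0]  # available memory, in percent
print(alert_fires(window_samples, threshold=5, aggregation="at least one"))
print(alert_fires(window_samples, threshold=5, aggregation="all"))
```

Under this reading, a single brief dip below 5% is enough to trigger, which matters if the alert also aborts the run.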
    retention_period: str | None = None,
    timeout: int | None = 180,
    visibility: typing.Literal["public", "tenant"] | list[str] | None = None,
    terminate_on_low_system_health: bool = True,
Personally, I would default this to False.
    aggregation="at least one",
    window=2,
    rule="is below",
    trigger_abort=terminate_on_alert,
Could you also add an email notification option?
    def to_dict(self) -> dict[str, float]:
        """Create metrics dictionary for sending to a Simvue server."""
        _metrics: dict[str, float] = {
    attempts: int = 0

    while run._status == "terminated" and attemps < 5:
Typo: 'attemps' should be 'attempts'. As written, this will crash with a NameError.
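To make the failure concrete: the counter is defined as `attempts`, but the loop condition reads `attemps`, so Python raises `NameError` as soon as the left-hand side of the `and` is true and the right-hand side gets evaluated. Below is a minimal sketch of a bounded polling loop with the counter spelled consistently. `poll_until` and the status values are hypothetical, and the real loop body is not visible in this diff.

```python
import time
from typing import Callable


def poll_until(predicate: Callable[[], bool], max_attempts: int = 5, delay: float = 0.0) -> bool:
    """Call predicate up to max_attempts times; return True once it holds."""
    attempts: int = 0
    while attempts < max_attempts:
        if predicate():
            return True
        attempts += 1  # same spelling as the declaration above
        time.sleep(delay)
    return False


# Hypothetical sequence of statuses reported by a run.
statuses = iter(["running", "running", "terminated"])
print(poll_until(lambda: next(statuses) == "terminated"))
```

Keeping the counter and the limit inside one helper means a misspelling fails in a single, easily tested place.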
    import random
    import datetime
    import simvue
You need to add unit tests:
- Check that the metrics appear automatically
- Check that if you add a process that spikes RAM usage, or create a large tempfile, the available RAM / memory metrics change appropriately
- Check that the options for the alert (terminate, and email if you decide to add that) are added to the alert correctly (i.e. get the alert definition back once it's created and check that it matches)
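One possible shape for the second and third checks, written so it runs without a server: the memory reader is injected so a test can simulate a RAM spike, and the alert definition is a plain dictionary standing in for whatever the server returns. All names here (`ResourceSnapshot`, `build_low_memory_alert`) are hypothetical, and the field values are copied from the diff fragments above. Real tests would go through the Simvue client and the project's fixtures instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ResourceSnapshot:
    """Hypothetical holder that reads the used-memory percentage on demand."""

    read_used_percent: Callable[[], float]

    @property
    def memory_available_percent(self) -> float:
        return 100 - self.read_used_percent()


def build_low_memory_alert(terminate_on_alert: bool) -> dict:
    """Hypothetical alert definition using values that appear in the diff."""
    return {
        "name": "low_available_virtual_memory",
        "threshold": 5,
        "aggregation": "at least one",
        "window": 2,
        "rule": "is below",
        "trigger_abort": terminate_on_alert,
    }


def test_available_memory_tracks_usage() -> None:
    used = {"percent": 30.0}
    snapshot = ResourceSnapshot(read_used_percent=lambda: used["percent"])
    before = snapshot.memory_available_percent
    used["percent"] = 90.0  # simulate a process spiking RAM usage
    assert snapshot.memory_available_percent < before


def test_alert_definition_round_trip() -> None:
    for terminate in (True, False):
        definition = build_low_memory_alert(terminate)
        assert definition["trigger_abort"] is terminate
        assert definition["rule"] == "is below"
```

Injecting the reader keeps the test deterministic. A separate integration test could allocate real memory and compare readings before and after.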

Add System Health Alerts
Issue: #904
Python Version(s) Tested: 3.13.5
Operating System(s): Ubuntu 25.10
Documentation PR: Issue on Docs repo.
📝 Summary
Adds functionality to prevent run loss after system health failure.
🔄 Changes
Adds pre-defined alerts which trigger when the system is low on health:
✔️ Checklist