@0xbyt4 (Contributor) commented Jan 9, 2026

Summary

Add a centralized health monitoring system with automatic recovery. It tracks the status of providers across the OM1 system and attempts to restart those that become unhealthy.

Features

Health Monitoring

  • Thread-safe singleton for monitoring provider health
  • Heartbeat tracking with configurable timeout (default: 30s)
  • Error reporting with threshold-based degradation (default: 5 errors)
  • Background monitoring thread with smart logging (no spam when healthy)
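A minimal sketch of the ideas in the list above: one shared, lock-protected monitor that classifies a provider from its last heartbeat and its error count. The class name `MiniHealthMonitor`, its method names, and the `clock` parameter are hypothetical and exist only for illustration; they are not taken from the PR's `HealthMonitorProvider`. The sketch also assumes that reaching the error threshold (rather than exceeding it) marks a provider degraded, and that a heartbeat does not clear the error count.

```python
import threading
import time
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


class MiniHealthMonitor:
    """Hypothetical stand-in for illustration; not the PR's HealthMonitorProvider."""

    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls, heartbeat_timeout=30.0, error_threshold=5, clock=time.monotonic):
        # Create the shared instance once; later calls return it unchanged.
        with cls._instance_lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst.heartbeat_timeout = heartbeat_timeout
                inst.error_threshold = error_threshold
                inst._clock = clock
                inst._lock = threading.Lock()
                inst._last_heartbeat = {}
                inst._errors = {}
                cls._instance = inst
        return cls._instance

    def register(self, name):
        # Registration counts as the first heartbeat.
        with self._lock:
            self._last_heartbeat[name] = self._clock()
            self._errors[name] = 0

    def heartbeat(self, name):
        with self._lock:
            self._last_heartbeat[name] = self._clock()

    def report_error(self, name):
        with self._lock:
            self._errors[name] += 1

    def status(self, name):
        with self._lock:
            age = self._clock() - self._last_heartbeat[name]
            if age > self.heartbeat_timeout:
                return Status.UNHEALTHY  # heartbeat timed out
            if self._errors[name] >= self.error_threshold:
                return Status.DEGRADED  # assumed: threshold reached
            return Status.HEALTHY
```

Passing the clock in as a parameter is a design choice of this sketch: it lets tests advance time without sleeping.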

Auto-Recovery

  • Automatic recovery attempts when providers become unhealthy
  • Configurable max attempts (default: 3) and cooldown (default: 60s)
  • Recovery callbacks registered per-provider
  • RECOVERING status during recovery attempts
  • Recovery state resets on successful heartbeat
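The recovery rules above can be sketched as a small bookkeeping class: count attempts per provider, refuse to retry inside the cooldown window, stop after the maximum, and reset on a successful heartbeat. `RecoveryManager`, `maybe_recover`, `on_heartbeat`, and the string return values are hypothetical names for this sketch. It also assumes a recovery callback returns a truthy value on success and that an exception counts as a failed attempt; the PR may handle these cases differently.

```python
import time
from dataclasses import dataclass


@dataclass
class RecoveryState:
    attempts: int = 0
    last_attempt: float = float("-inf")  # no attempt made yet


class RecoveryManager:
    """Hypothetical sketch of per-provider recovery bookkeeping."""

    def __init__(self, max_attempts=3, cooldown=60.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.cooldown = cooldown
        self._clock = clock
        self._callbacks = {}
        self._state = {}

    def register(self, name, recovery_callback):
        self._callbacks[name] = recovery_callback
        self._state[name] = RecoveryState()

    def on_heartbeat(self, name):
        # A successful heartbeat clears the attempt counter.
        self._state[name] = RecoveryState()

    def maybe_recover(self, name):
        state = self._state[name]
        if state.attempts >= self.max_attempts:
            return "gave_up"
        now = self._clock()
        if now - state.last_attempt < self.cooldown:
            return "cooldown"
        state.attempts += 1
        state.last_attempt = now
        try:
            ok = bool(self._callbacks[name]())
        except Exception:
            ok = False  # assumed: an exception is a failed attempt
        if ok:
            self._state[name] = RecoveryState()
            return "recovered"
        return "failed"
```

With the defaults, a provider whose callback keeps failing is tried at most three times, with at least 60 seconds between tries, after which `maybe_recover` reports `"gave_up"` until a heartbeat resets the state.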

Integrated Providers (16 total)

Core Providers

| Provider | Type | Recovery |
| --- | --- | --- |
| ASRProvider | Speech | stop() + start() |
| GpsProvider | Sensor | stop() + start() |
| ElevenLabsTTSProvider | TTS | stop() + start() |
| VLMOpenAIProvider | Vision | stop() + start() |
| VLMGeminiProvider | Vision | stop() + start() |
| VLMVilaProvider | Vision | stop() + start() |
| RivaTTSProvider | TTS | stop() + start() |
| UbTtsProvider | TTS | health check ping |

Alternative/Extended Providers

| Provider | Type | Recovery |
| --- | --- | --- |
| ASRRTSPProvider | Speech (RTSP) | stop() + start() |
| UbtechASRProvider | Speech (Ubtech) | stop() + start() |
| UbtechVLMProvider | Vision (Ubtech) | stop() + start() |
| VLMOpenAIRTSPProvider | Vision (RTSP) | stop() + start() |
| VLMVilaRTSPProvider | Vision (RTSP) | stop() + start() |
| VLMVilaZenohProvider | Vision (Zenoh) | stop() + start() |
| RtkProvider | RTK GPS | stop() + start() |
| D435Provider | RealSense D435 | stop() + start() |

Runtime Integration

| Runtime | Integration |
| --- | --- |
| CortexRuntime | start_monitoring on run, stop on cleanup |
| ModeCortexRuntime | start_monitoring on run, stop on cleanup |
| InputOrchestrator | Auto-registers all input sensors |
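The "start on run, stop on cleanup" pattern for the runtimes can be sketched with a background thread and a `try`/`finally` block. `BackgroundMonitor` and `run_runtime` are hypothetical names used only here; the PR confirms the method names `start_monitoring()` and `stop_monitoring()`, but the internals below are an assumption about one reasonable way to implement them.

```python
import threading


class BackgroundMonitor:
    """Hypothetical sketch of a stoppable periodic checker."""

    def __init__(self, check_interval=10.0, check=lambda: None):
        self._interval = check_interval
        self._check = check
        self._stop = threading.Event()
        self._thread = None

    def start_monitoring(self):
        # Starting twice is a no-op while the thread is alive.
        if self._thread is not None and self._thread.is_alive():
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop_monitoring(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def _loop(self):
        # Event.wait returns False on timeout (run a check) and True
        # once stop_monitoring() sets the event (exit the loop).
        while not self._stop.wait(self._interval):
            self._check()


def run_runtime(monitor, work):
    """Run `work` with monitoring active, and always stop it afterwards."""
    monitor.start_monitoring()
    try:
        return work()
    finally:
        monitor.stop_monitoring()
```

Waiting on a `threading.Event` instead of calling `time.sleep` lets `stop_monitoring()` interrupt the wait immediately, so cleanup does not block for a full check interval.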

How It Works

Provider stops sending heartbeats
    └── Background thread detects timeout (10s check interval)
            └── Logs ERROR: "Unhealthy providers: GpsProvider (no heartbeat for 35.0s)"
                    └── If recovery_callback registered:
                            └── Attempt 1/3: Call _recover()
                                    ├── Success: Reset attempts, provider healthy
                                    └── Failure: Wait cooldown (60s), try again
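The detection step at the top of this flow can be sketched as a single check pass that stays silent when every provider is healthy and logs one ERROR line otherwise. The function names `find_stale` and `check_once` are hypothetical, and joining several providers with commas is an assumption; only the single-provider message format is taken from the example in this PR.

```python
import logging


def find_stale(last_heartbeat, now, heartbeat_timeout=30.0):
    """Return {provider: seconds since last heartbeat} for timed-out providers."""
    return {
        name: now - ts
        for name, ts in last_heartbeat.items()
        if now - ts > heartbeat_timeout
    }


def check_once(last_heartbeat, now, heartbeat_timeout=30.0,
               logger=logging.getLogger("health")):
    """Log one ERROR line if any provider is stale; otherwise log nothing."""
    stale = find_stale(last_heartbeat, now, heartbeat_timeout)
    if not stale:
        return None  # healthy: no log output
    details = ", ".join(
        f"{name} (no heartbeat for {age:.1f}s)"
        for name, age in sorted(stale.items())
    )
    message = f"Unhealthy providers: {details}"
    logger.error(message)
    return message
```

Calling `check_once({"GpsProvider": 0.0}, now=35.0)` produces the message `Unhealthy providers: GpsProvider (no heartbeat for 35.0s)`.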

Example Output

# Detection
ERROR: Unhealthy providers: GpsProvider (no heartbeat for 35.0s)

# Recovery attempt
INFO: Attempting recovery for 'GpsProvider' (attempt 1/3)
INFO: GpsProvider: Attempting recovery...
INFO: GpsProvider: Recovery successful
INFO: Recovery successful for 'GpsProvider'

# If recovery keeps failing
WARNING: Provider 'GpsProvider' exceeded max recovery attempts (3), giving up

Test Coverage

| Test Category | Tests |
| --- | --- |
| HealthMonitorProvider unit tests | 32 |
| Integration tests (runtime simulation) | 27 |
| Core provider tests | 45 |
| Extended provider tests | 65 |
| Total | 169+ |

Configuration

HealthMonitorProvider(
    heartbeat_timeout=30.0,    # Seconds before unhealthy
    error_threshold=5,          # Errors before degraded
    check_interval=10.0,        # Background check interval
    max_recovery_attempts=3,    # Max recovery tries
    recovery_cooldown=60.0,     # Seconds between attempts
    auto_recovery=True,         # Enable/disable recovery
)
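How these timing parameters interact can be worked out with a little arithmetic. The sketch below assumes checks run on a fixed schedule every `check_interval` seconds and that the first recovery attempt happens as soon as the timeout is detected; both are assumptions about the implementation, and the helper names are hypothetical.

```python
def worst_case_detection_delay(heartbeat_timeout, check_interval):
    """Longest time between the last heartbeat and detection.

    The provider becomes unhealthy once heartbeat_timeout has elapsed,
    and the next check may be up to one full check_interval later.
    """
    return heartbeat_timeout + check_interval


def min_recovery_window(max_recovery_attempts, recovery_cooldown):
    """Shortest time from the first recovery attempt to the last one."""
    return (max_recovery_attempts - 1) * recovery_cooldown


# With the defaults shown above:
detection = worst_case_detection_delay(30.0, 10.0)  # 40.0 seconds
window = min_recovery_window(3, 60.0)               # 120.0 seconds
```

Under these assumptions a silent provider is flagged between 30 and 40 seconds after its last heartbeat, which matches the "no heartbeat for 35.0s" example, and three failed attempts span at least two minutes.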

Test plan

  • All 800+ tests pass locally
  • 16 providers integrated with health monitoring
  • Auto-recovery tested with success/failure scenarios
  • Max attempts and cooldown verified
  • Manual verification with simulated providers
  • pre-commit hooks pass

Add centralized health monitoring for OM1 providers with:
- Provider registration with metadata
- Heartbeat tracking with configurable timeout
- Error reporting with threshold-based degradation
- System health summary and unhealthy provider detection
- Thread-safe singleton implementation
- 16 unit tests with full coverage

@0xbyt4 requested review from a team as code owners on January 9, 2026, 22:15
@github-actions bot added the robotics (Robotics code changes), python (Python code), and tests (Test files) labels on Jan 9, 2026

- Register all inputs with health monitor on orchestrator init
- Send heartbeat after each successful input event
- Report errors to health monitor when inputs fail
- Update tests to verify health monitoring integration

Add health monitoring to:
- ASRProvider (speech recognition)
- VLMOpenAIProvider, VLMGeminiProvider, VLMVilaProvider (vision)
- ElevenLabsTTSProvider, RivaTTSProvider, UbTtsProvider (text-to-speech)
- GpsProvider (sensor)

Each provider now registers with health monitor, sends heartbeats
on successful operations, and reports errors when failures occur.

- Add start_monitoring() and stop_monitoring() methods to HealthMonitorProvider
- Background thread periodically checks provider health status
- Only logs when issues detected (no spam when healthy)
- Logs ERROR for unhealthy providers (heartbeat timeout)
- Logs WARNING for degraded providers (error threshold exceeded)
- Integrate health monitoring into CortexRuntime (single-mode)
- Integrate health monitoring into ModeCortexRuntime (multi-mode)
- Add 4 new tests for monitoring functionality

Add 20 integration tests covering:
- Runtime lifecycle (start/stop monitoring)
- Provider lifecycle (registration, heartbeat, errors)
- InputOrchestrator integration (sensor registration, events)
- Realistic failure scenarios (timeout, degradation, recovery)
- Background monitoring log verification
- Singleton behavior across components

- Add recovery_callback parameter to register() method
- Automatic recovery attempts when provider becomes unhealthy
- Configurable max attempts (default: 3) and cooldown (default: 60s)
- RECOVERING status during recovery attempts
- Recovery state resets on successful heartbeat
- Add _recover() method to ASRProvider, GpsProvider, ElevenLabsTTSProvider
- Add 12 new tests for auto-recovery functionality

- Add _recover() method to VLMOpenAI, VLMGemini, VLMVila, RivaTTS, UbTts providers
- Add recovery tests to all provider test files
- Create new test_ub_tts_provider.py with full coverage
- Add TestAutoRecoveryIntegration class with 7 integration tests
- Total: 89 tests for recovery functionality

Add health monitoring with auto-recovery to:
- ASRRTSPProvider: RTSP-based speech recognition
- UbtechASRProvider: Ubtech robot ASR with error reporting
- UbtechVLMProvider: Ubtech robot vision
- VLMOpenAIRTSPProvider: OpenAI VLM with RTSP input
- VLMVilaRTSPProvider: Vila VLM with RTSP input
- VLMVilaZenohProvider: Vila VLM with Zenoh input
- RtkProvider: RTK GPS sensor
- D435Provider: Intel RealSense D435 depth camera

All providers include:
- Health monitor registration with recovery callbacks
- Heartbeat on successful operations
- Error reporting where applicable
- Comprehensive test coverage (65 new tests)

Total providers with health monitoring: 16
Total new tests: 65

@openminddev (Contributor)

Hi @0xbyt4, the idea is good, but why don't we develop this with Prometheus?

@0xbyt4 (Contributor, Author) commented Jan 10, 2026

Hi @openminddev, I will definitely give it a try. Separately, and unrelated to this PR, I would appreciate your feedback on a technical analysis I conducted by reverse engineering a mobile app grid and telemetry system. Thank you. I know you are very busy. (X article: https://x.com/eyeofquantum/status/2009412169289384366?s=20)
