Description
code:

```go
for _, device := range deviceInfos {
```
Currently, the `GenerateDeviceMetrics` and `GenerateContainerMetrics` functions in the exporter iterate over devices and containers serially. Each device-status query (e.g., via DCGM or other provider APIs) involves I/O latency.
In clusters with many devices (e.g., 100+ GPUs), or when individual device queries are slow, the total scrape duration grows linearly with device count (O(N)), potentially exceeding the Prometheus `scrape_timeout` (e.g., 10s). User reports indicate scrape times reaching 4-5 seconds in a 20-node environment.
Additionally, the `GenerateMetrics` function lacks synchronization, which can lead to race conditions if multiple Prometheus scrapes occur simultaneously, or if a scrape occurs while metrics are being reset or repopulated.
Proposed Changes
- Parallelization: Refactor the device iteration loops in `GenerateDeviceMetrics` and `GenerateContainerMetrics` to use goroutines and `sync.WaitGroup`. This allows device metrics to be collected concurrently, reducing the total scrape time from the sum of all device latencies to roughly the latency of the slowest single device (effectively O(1) in the number of devices).
- Concurrency Control: Introduce a `sync.Mutex` in `MetricsGenerator` to lock the critical section of the scrape cycle (Reset -> Collect -> Cache). This prevents data races and ensures that concurrent scrape requests wait for the ongoing collection to complete (or hit the cache) rather than corrupting the data.
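A minimal sketch of both changes. Names like `DeviceInfo`, `queryDevice`, and the `MetricsGenerator` fields here are hypothetical stand-ins for the exporter's actual types and provider calls; the real refactor would apply the same pattern inside the existing functions:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DeviceInfo is a hypothetical stand-in for the exporter's device record.
type DeviceInfo struct {
	ID string
}

// queryDevice simulates a slow, I/O-bound provider call (e.g., DCGM).
func queryDevice(d DeviceInfo) string {
	time.Sleep(10 * time.Millisecond)
	return "metrics-for-" + d.ID
}

// MetricsGenerator guards the scrape cycle with a mutex so concurrent
// scrapes cannot interleave the Reset -> Collect -> Cache sequence.
type MetricsGenerator struct {
	mu    sync.Mutex
	cache map[string]string
}

// GenerateDeviceMetrics queries all devices concurrently; total latency
// approaches the slowest single query instead of the sum of all queries.
func (g *MetricsGenerator) GenerateDeviceMetrics(devices []DeviceInfo) map[string]string {
	g.mu.Lock() // serialize whole scrape cycles against each other
	defer g.mu.Unlock()

	g.cache = make(map[string]string, len(devices)) // Reset

	var wg sync.WaitGroup
	var cacheMu sync.Mutex // protects g.cache from concurrent goroutine writes
	for _, device := range devices {
		wg.Add(1)
		go func(d DeviceInfo) {
			defer wg.Done()
			m := queryDevice(d) // I/O-bound work runs in parallel
			cacheMu.Lock()
			g.cache[d.ID] = m // Collect -> Cache
			cacheMu.Unlock()
		}(device)
	}
	wg.Wait()
	return g.cache
}

func main() {
	g := &MetricsGenerator{}
	devices := []DeviceInfo{{ID: "gpu0"}, {ID: "gpu1"}, {ID: "gpu2"}}
	out := g.GenerateDeviceMetrics(devices)
	fmt.Println(len(out)) // prints "3"
}
```

Note the two locks serve different purposes: the outer `sync.Mutex` makes each scrape cycle atomic with respect to other scrapes, while the inner mutex only protects the shared map from the worker goroutines (alternatively, each goroutine could write to its own slot in a pre-sized slice and avoid the inner lock entirely).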
Benefits
- Significantly reduced `/metrics` response time, especially for nodes with many devices.
- Improved stability and prevention of data races during concurrent scrapes.
- Better scalability for large-scale AI clusters.