perf: Parallelize metric generation to reduce scrape latency and fix race conditions #67

@grissomsh

Description

code:

for _, device := range deviceInfos {

Currently, the GenerateDeviceMetrics and GenerateContainerMetrics functions in the exporter iterate through devices and containers serially. When querying device status (e.g., via DCGM or other provider APIs), each operation involves I/O latency.

In clusters with a large number of devices (e.g., 100+ GPUs), or when individual device queries are slow, the total scrape duration grows linearly with the device count (O(N)) and can lead to timeouts (e.g., exceeding a Prometheus scrape_timeout of 10s). Users have reported scrape times of 4-5 seconds in a 20-node environment.

Additionally, the GenerateMetrics function lacks synchronization, which can lead to race conditions if multiple Prometheus scrapes occur simultaneously or if a scrape occurs while metrics are being reset/populated.

Proposed Changes

  1. Parallelization: Refactor the device iteration loops in GenerateDeviceMetrics and GenerateContainerMetrics to use goroutines and sync.WaitGroup. Device metrics are then collected concurrently, reducing the total scrape time from the sum of all device latencies to roughly the latency of the slowest single device query (close to constant in the number of devices, assuming the provider API handles concurrent queries without serializing them).
  2. Concurrency Control: Introduce a sync.Mutex in MetricsGenerator to lock the critical section of the scrape cycle (Reset -> Collect -> Cache). This prevents data races and ensures that concurrent scrape requests wait for the ongoing collection to complete (or hit the cache) rather than corrupting the data.
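
A minimal sketch of how the two changes could fit together, not the exporter's actual code. DeviceInfo, DeviceMetric, QueryFunc, and collectParallel are hypothetical stand-ins, and the real GenerateDeviceMetrics, GenerateContainerMetrics, and MetricsGenerator signatures may differ. Each goroutine writes only to its own slice index, so the fan-out itself needs no extra locking, while the mutex serializes whole Reset -> Collect -> Cache cycles.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DeviceInfo and DeviceMetric are hypothetical stand-ins for the exporter's types.
type DeviceInfo struct{ ID string }

type DeviceMetric struct {
	DeviceID string
	Value    float64
}

// QueryFunc is a hypothetical hook for the per-device provider call (e.g. DCGM).
type QueryFunc func(DeviceInfo) (DeviceMetric, error)

// MetricsGenerator here is an assumed shape: a mutex guarding a metric cache.
type MetricsGenerator struct {
	mu    sync.Mutex
	query QueryFunc
	cache []DeviceMetric
}

// collectParallel queries every device in its own goroutine and waits for all
// of them. Results keep the input order because goroutine i writes only to
// index i of the preallocated slices.
func collectParallel(devices []DeviceInfo, query QueryFunc) ([]DeviceMetric, []error) {
	metrics := make([]DeviceMetric, len(devices))
	errs := make([]error, len(devices))
	var wg sync.WaitGroup
	for i, device := range devices {
		wg.Add(1)
		go func(i int, device DeviceInfo) {
			defer wg.Done()
			metrics[i], errs[i] = query(device)
		}(i, device)
	}
	wg.Wait()
	return metrics, errs
}

// GenerateMetrics holds the mutex for the whole scrape cycle, so a concurrent
// scrape blocks until the current one finishes instead of seeing a half-reset cache.
func (g *MetricsGenerator) GenerateMetrics(devices []DeviceInfo) []DeviceMetric {
	g.mu.Lock()
	defer g.mu.Unlock()

	g.cache = g.cache[:0]                              // Reset
	metrics, errs := collectParallel(devices, g.query) // Collect
	for i, m := range metrics {
		if errs[i] == nil {
			g.cache = append(g.cache, m) // Cache (failed devices are skipped)
		}
	}
	// Return a copy so callers never alias the cache across scrapes.
	return append([]DeviceMetric(nil), g.cache...)
}

func main() {
	devices := make([]DeviceInfo, 8)
	for i := range devices {
		devices[i] = DeviceInfo{ID: fmt.Sprintf("gpu-%d", i)}
	}
	// Simulate a provider call with 50ms of I/O latency per device.
	g := &MetricsGenerator{query: func(d DeviceInfo) (DeviceMetric, error) {
		time.Sleep(50 * time.Millisecond)
		return DeviceMetric{DeviceID: d.ID, Value: 1}, nil
	}}
	start := time.Now()
	out := g.GenerateMetrics(devices)
	fmt.Printf("collected %d metrics in %v\n", len(out), time.Since(start))
}
```

The sketch starts one goroutine per device without a limit. If the provider library is not safe for concurrent calls, or the device count is very large, a bounded worker pool or semaphore may be the safer variant. Running the tests with `go test -race` would help confirm that the shared metric state is actually protected.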

Benefits

  • Significantly reduced /metrics response time, especially for nodes with many devices.
  • Improved stability and prevention of data race issues during concurrent scrapes.
  • Better scalability for large-scale AI clusters.
