@com6056 com6056 commented Jan 13, 2026

Some devices (e.g., CyberPower CP1500PFCLCD) have firmware bugs that cause random I/O errors on specific HID reports during normal polling. Rather than triggering expensive reconnection attempts that can fail in daemon mode, skip the failing report and continue with remaining polls.

Add a safety check to detect true device disconnection: if every poll fails during an update cycle (items_succeeded == 0), trigger a reconnect as before.

This improves stability, especially in daemon mode, while still detecting real disconnections via other error codes or complete poll failure.

Fixes #3116


More in-depth details from my investigation in case it is helpful:

Problem Description

CyberPower UPS devices (tested with CP1500PFCLCD, ProductID 0x0601) experience "Data stale" errors when running in daemon mode, but work perfectly when debug mode is enabled (-D flag or NUT_DEBUG_LEVEL=1).

Root Cause

The device has firmware bugs that cause transient LIBUSB_ERROR_IO failures on certain HID reports during normal polling operations. Affected reports include:

  • 0x1a (input.sensitivity)
  • 0x19 (ups.realpower)
  • Others during "Full update" cycles (every 12th poll with default pollfreq)

These reports exist and work most of the time, but randomly fail with I/O errors.

Current Behavior (Broken)

When LIBUSB_ERROR_IO occurs during polling:

  1. Driver immediately enters reconnect.trying state
  2. Calls reconnect_ups() to close and reopen USB device
  3. In daemon/background mode: USB reconnection fails (due to process being daemonized with setsid(), closed file descriptors, etc.)
  4. Driver calls dstate_datastale(), so upsd reports a "Data stale" error
  5. Loop repeats on next poll → continuous reconnect failures

Why Debug Mode "Worked"

The -D flag sets foreground = 1, preventing the driver from daemonizing:

  • Process doesn't fork
  • No setsid() call
  • Original file descriptors remain open
  • USB reconnection succeeds in this environment

So debug mode didn't fix the firmware bug; it only allowed the reconnection workaround to succeed. Even then, reconnecting on every transient error is expensive and unnecessary.

Solution

Skip transient errors instead of reconnecting:

  1. When LIBUSB_ERROR_IO occurs, log it and continue to the next poll item instead of triggering a reconnect
  2. Track successful polls with an items_succeeded counter
  3. Safety mechanism: if zero items succeed during an update cycle, trigger a reconnect (true device failure)
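
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: poll_cycle(), read_report_fn, the SKETCH_* constants and the three stub readers are hypothetical names invented for the example. The constants only mirror the numeric values of the corresponding libusb error codes so that the sketch compiles without libusb. Treating every other negative code as a reason to reconnect is a simplification; the real driver distinguishes more cases.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins with the same numeric values as the libusb error codes,
 * so this sketch builds without libusb headers (assumption for the demo). */
enum {
    SKETCH_ERROR_IO        = -1,  /* same value as LIBUSB_ERROR_IO */
    SKETCH_ERROR_NO_DEVICE = -4   /* same value as LIBUSB_ERROR_NO_DEVICE */
};

typedef enum { CYCLE_OK, CYCLE_RECONNECT } cycle_result_t;

/* Hypothetical callback: reads one HID report for a poll item and returns
 * a byte count (>= 0) on success or a negative error code on failure. */
typedef int (*read_report_fn)(int item);

/* Poll every item once; skip transient I/O errors, count successes,
 * and request a reconnect only if nothing succeeded at all. */
static cycle_result_t poll_cycle(read_report_fn read_report, int n_items,
                                 int *items_succeeded)
{
    int i;

    assert(read_report != NULL);
    assert(items_succeeded != NULL);

    *items_succeeded = 0;

    for (i = 0; i < n_items; i++) {
        int rc = read_report(i);

        if (rc == SKETCH_ERROR_IO) {
            /* Step 1: transient error, skip this item and keep polling. */
            continue;
        }
        if (rc < 0) {
            /* Any other error is treated as a real failure (simplified). */
            return CYCLE_RECONNECT;
        }
        /* Step 2: track successful polls. */
        (*items_succeeded)++;
    }

    /* Step 3: safety check, zero successes means the device is gone. */
    if (n_items > 0 && *items_succeeded == 0) {
        return CYCLE_RECONNECT;
    }
    return CYCLE_OK;
}

/* Stub readers for illustration only. */
static int flaky_reader(int item)     { return item == 1 ? SKETCH_ERROR_IO : 8; }
static int dead_reader(int item)      { (void)item; return SKETCH_ERROR_IO; }
static int unplugged_reader(int item) { (void)item; return SKETCH_ERROR_NO_DEVICE; }
```

With flaky_reader and four items, the cycle finishes with three successes and no reconnect. dead_reader (every report fails with an I/O error) and unplugged_reader (the device reports it is gone) both lead to a reconnect request.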

Why This Works

  • Transient errors on SOME reports = firmware bug → skip and continue with remaining polls
  • Errors on ALL reports = true disconnection → reconnect
  • Most polls succeed despite occasional failures → driver stays stable
  • True disconnections still detected via:
    • Other error codes (LIBUSB_ERROR_NO_DEVICE, LIBUSB_ERROR_ACCESS, etc.)
    • All polls failing (safety check)
    • upsd's MAXAGE timeout
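
As a rough illustration of two of these detection paths, the sketch below classifies a single poll result by its error code and checks data age the way a MAXAGE-style timeout would. All names here (classify_poll_error(), data_is_stale(), the SK_* constants) are hypothetical and are not taken from NUT or from this PR. The constants only copy the numeric values of the matching libusb error codes, and mapping every non-I/O error to a reconnect is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Local stand-ins with the same numeric values as the libusb error codes
 * (assumption: defined here so the sketch builds without libusb). */
enum sketch_usb_error {
    SK_ERROR_IO        = -1,  /* same value as LIBUSB_ERROR_IO */
    SK_ERROR_ACCESS    = -3,  /* same value as LIBUSB_ERROR_ACCESS */
    SK_ERROR_NO_DEVICE = -4   /* same value as LIBUSB_ERROR_NO_DEVICE */
};

typedef enum { ACT_USE_DATA, ACT_SKIP_ITEM, ACT_RECONNECT } poll_action_t;

/* Decide what to do with the result of one report read. */
static poll_action_t classify_poll_error(int rc)
{
    if (rc >= 0) {
        return ACT_USE_DATA;   /* report was read successfully */
    }
    if (rc == SK_ERROR_IO) {
        return ACT_SKIP_ITEM;  /* transient firmware hiccup: skip, keep polling */
    }
    return ACT_RECONNECT;      /* e.g. no device or access denied: reconnect */
}

/* Last line of defence: data older than max_age seconds counts as stale,
 * similar in spirit to a MAXAGE check on the upsd side. */
static bool data_is_stale(time_t last_update, time_t now, int max_age)
{
    assert(max_age > 0);
    return difftime(now, last_update) > (double)max_age;
}
```

For example, classify_poll_error(SK_ERROR_IO) yields ACT_SKIP_ITEM while classify_poll_error(SK_ERROR_NO_DEVICE) yields ACT_RECONNECT. With an illustrative max_age of 15 seconds, data last updated 20 seconds ago is stale and data updated 10 seconds ago is not.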

Testing

With this fix and pollonly=true:

  • Driver runs indefinitely in daemon mode
  • Occasional LIBUSB_ERROR_IO logged at debug level 3, but skipped
  • Data collection continues normally
  • No more reconnect loops or "Data stale" errors

Related

This is likely the same underlying issue reported in other contexts (see instantlinux/docker-tools#67) where CyberPower devices experience intermittent failures. The difference is our specific model (0x0601) fails predictably during full update cycles, making the root cause easier to identify.


General points

  • Described the changes in the PR submission or a separate issue, e.g.
    known published or discovered protocols, applicable hardware (expected
    compatible and actually tested/developed against), limitations, etc.

  • There may be multiple commits in the PR, aligned and commented with
    a functional change. Notably, coding style changes better belong in a
    separate PR, but certainly in a dedicated commit to simplify reviews
    of "real" changes in the other commits. Similarly for typo fixes in
    comments or text documents.

  • Please star NUT on GitHub, this helps with sponsorships! ;)

Frequent "underwater rocks" for driver addition/update PRs

  • Revised existing driver families and added a sub-driver if applicable
    (nutdrv_qx, usbhid-ups...) or added a brand new driver in the other
    case.

  • Did not extend obsoleted drivers with new hardware support features
    (notably blazer and other single-device family drivers for Qx protocols,
    except the new nutdrv_qx which should cover them all).

  • For updated existing device drivers, bumped the DRIVER_VERSION macro
    or its equivalent.

  • For USB devices (HID or not), revised that the driver uses unique
    VID/PID combinations, or raised discussions when this is not the case
    (several vendors do use same interface chips for unrelated protocols).

  • For new USB devices, built and committed the changes for the
    scripts/upower/95-upower-hid.hwdb file

  • Proposed NUT data mapping is aligned with existing docs/nut-names.txt
    file. If the device exposes useful data points not listed in the file, the
    experimental.* namespace can be used as documented there, and discussion
    should be raised on the NUT Developers mailing list to standardize the new
    concept.

  • Updated data/driver.list.in if applicable (new tested device info)

Frequent "underwater rocks" for general C code PRs

  • Did not "blindly assume" default integer type sizes and value ranges,
    structure layout and alignment in memory, endianness (layout of bytes and
    bits in memory for multi-byte numeric types), or use of generic int where
    language or libraries dictate the use of size_t (or ssize_t sometimes).
  • Progress and errors are handled with upsdebugx(), upslogx(),
    fatalx() and related methods, not with direct printf() or exit().
    Similarly, NUT helpers are used for error-checked memory allocation and
    string operations (except where customized error handling is needed,
    such as unlocking device ports, etc.)

  • Coding style (including whitespace for indentations) follows precedent
    in the code of the file, and examples/guide in docs/developers.txt file.

  • For newly added files, the Makefile.am recipes were updated and the
    make distcheck target passes.

General documentation updates

  • Updated docs/acknowledgements.txt (for vendor-backed device support)

  • Added or updated manual page information in docs/man/*.txt files
    and corresponding recipe lists in docs/man/Makefile.am for new pages

  • Passed make spellcheck, updated spell-checking dictionary in the
    docs/nut.dict file if needed (did not remove any words -- the make
    rule printout in case of changes suggests how to maintain it).

Additional work may be needed after posting this PR

  • Propose a PR for NUT DDL with detailed device data dumps from tests
    against real hardware (the more models, the better).

  • Address NUT CI farm build failures for the PR: testing on numerous
    platforms and toolkits can expose issues not seen on just one system.

  • Revise suggestions from LGTM.COM analysis about "new issues" with
    the changed codebase.

Signed-off-by: Jordan Rodgers <com6056@gmail.com>
@com6056 com6056 force-pushed the fix-cyberpower-eio-tolerance branch from cdee828 to badbed7 Compare January 13, 2026 03:41
@jimklimov jimklimov added the enhancement, CyberPower (CPS), USB, and Connection stability issues labels Jan 13, 2026
@jimklimov jimklimov added this to the 2.8.5 milestone Jan 13, 2026
@jimklimov
Member

Great analysis, thanks. That must have been a fun debugging session...

As for setsid() implications - I've re-checked, this method gets called from common::background() and indeed relatively late in general driver main.c startup (after device init and perhaps data dump, optionally before the update loop - unless deciding to remain foregrounded due to one or another reason).

This is probably a separate issue from this avoidance of reconnections in the first place, but did you have a chance to experiment whether reconnection in this driver actually works/fails? I wonder if the problem is coincidental, e.g. nothing wrong with the driver code, but it is the UPS firmware that gets stuck, reboots and is not responding just at the moment we try to connect back to it... I suppose pulling the USB cable and plugging it back while the driver is running can help in the investigation (unless this is one of those UPSes that power on the USB chip when it is connected, and power it off/cycle when not in use).

Successfully merging this pull request may close these issues.

CyberPower CP1500AVRLCD3 doesn't reliably work unless NUT_DEBUG_LEVEL set
