Skip to content

Conversation

@jackyzha0
Copy link
Member

@jackyzha0 jackyzha0 commented Sep 25, 2025

turns out #51 reduced the race rate but did not eliminate it completely. we still see the data race bug very rarely and it is exacerbated on high-numbers of repeats in the tests and/or high nodejs event loop utilization

my old theory was:

  • the user fd is O_NONBLOCK clear so any writes there should block until its made its way into the 4kb intermediary buffer
  • as a result, when the .wait returns, we know that the program output has at least made it fully into the the user fd input buffer
  • because we poll until controller_inq == 0 && controller_outq == 0 && user_inq == 0 && user_outq == 0, in theory we should never call the js side exit code until theres truly nothing left in the pipe (i.e. the readstream on the nodejs side has it in its own buffer)

HOWEVER, with some logging i saw a few cases where in fact controller_inq == 0 && controller_outq == 0 && user_inq == 0 && user_outq == 0, the program had exited, yet the nodejs side still missed some data

what i think is actually happening is that

  1. one of libuv's io threads reads the data from the controller side and queues a data event on the stream
  2. polling exits as now theres no more data in the queues we immediately drop(user_fd); which sets TTY_OTHER_CLOSED synchronously and we get an error event queued on the stream
  3. if the nodejs event loop happens to read error before data, we emit 'end' and mark the data as read even though technically nodejs hasnt processed the data event yet so we drop data :(

how we fix it:

  1. axe poll_pty_fds_until_read
  2. make Pty struct own user_fd and expose a method for the js side to drop this fd when its done
  3. when child.wait finishes, write a synthetic 'EOF' that is actually a custom OSC terminal control sequence (\x1B]7878\x1B\\, 7878 is RUST on the phonepad :)) to the user fd (a cursory search shows no results, it seems very unlikely for this sequence to appear randomly)
  4. on the nodejs wrapper side, create a transform stream that parses out the synthetic EOF and emits it as a custom event when it happens
  5. when the nodejs side hits this EOF, we know we are actually at the end of the data stream and can safely drop user_fd

node-pty had the same problem and did the :grug: brain thing and added a wait 250ms so im calling it slightly more ok

@jackyzha0 jackyzha0 changed the title try keeping user fd alive longer dont drop user fd until node has had more time to read Sep 25, 2025
@jackyzha0 jackyzha0 marked this pull request as ready for review September 25, 2025 23:07
@jackyzha0 jackyzha0 changed the title dont drop user fd until node has had more time to read dont drop user fd until node reads a synthetic EOF Sep 26, 2025
Copy link
Contributor

@lhchavez lhchavez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the Linux kernel this bad?

@jackyzha0 jackyzha0 merged commit db9ac0a into main Sep 29, 2025
9 checks passed
@jackyzha0 jackyzha0 deleted the jackyzha0/keep-user-fd-alive-longer branch September 29, 2025 19:24
@agniv-the-marker
Copy link

HELL YEAH

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants