Skip to content

Conversation

@matheus23
Copy link
Member

@matheus23 matheus23 commented Nov 20, 2025

Description

Depends on n0-computer/n0-future#23 for JoinSet::poll_join_next and JoinError::try_into_panic.

  • Removes guarded_channel.rs
  • Instead, uses normal mpsc channels, and once it closes them, collects all messages that were missed before the channel was closed, sends them back to the RemoteMap, which will immediately spawn a new RemoteStateActor in case it missed some messages during shutdown.
  • Spawns the RemoteStateActor in a JoinSet that's held in a Mutex inside the RemoteMap (effectively owned by the magic socket).

Notes & open questions

I think it's good to look at this right now, but some considerations/TODOs:

  • I don't like the mutex around the JoinSet in RemoteMap. Maybe only the magicsock::Actor should own the JoinSet, and the RemoteMap shouldn't actually store it itself.
  • I'm not sure if draining all "initial messages" in RemoteStateActor::run is the right way to do this.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.

@matheus23 matheus23 self-assigned this Nov 20, 2025
@n0bot n0bot bot added this to iroh Nov 20, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Nov 20, 2025
@matheus23 matheus23 changed the base branch from feat-multipath to Frando/mp-fix-remote-state-actor-loop November 20, 2025 09:09
@matheus23 matheus23 force-pushed the matheus23/no-guarded-channel branch from 5c66034 to de05f27 Compare November 20, 2025 09:16
@github-actions
Copy link

github-actions bot commented Nov 20, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3681/docs/iroh/

Last updated: 2026-01-05T16:50:01Z

@github-actions
Copy link

github-actions bot commented Nov 20, 2025

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 92545ef

@matheus23 matheus23 force-pushed the matheus23/no-guarded-channel branch from de05f27 to 4c4d11d Compare November 20, 2025 10:14
@matheus23 matheus23 marked this pull request as ready for review November 20, 2025 10:40
@matheus23 matheus23 marked this pull request as draft November 20, 2025 13:58
@matheus23
Copy link
Member Author

matheus23 commented Nov 20, 2025

This PR has the problem that we only run the remove_closed_remote_state_actors fn every 60s.

Also: The start_remote_actor fn is called from two places and might become racy.

Base automatically changed from Frando/mp-fix-remote-state-actor-loop to feat-multipath November 20, 2025 14:33
@flub flub added this to the iroh: v1.0.0-rc.0 milestone Dec 22, 2025
@flub
Copy link
Contributor

flub commented Dec 22, 2025

This would still be nice to have, please don't forget about it.

@matheus23 matheus23 force-pushed the matheus23/no-guarded-channel branch from 6e09bda to f9c98d1 Compare December 23, 2025 11:53
@matheus23 matheus23 changed the base branch from feat-multipath to main January 2, 2026 14:15
@matheus23 matheus23 force-pushed the matheus23/no-guarded-channel branch from f9c98d1 to 15cbd59 Compare January 2, 2026 14:18
@matheus23
Copy link
Member Author

Also: The start_remote_actor fn is called from two places and might become racy.

This is actually not possible. Calls to start_remote_state_actor are always behind an acquisition of the actor_senders lock in remote_state.rs.

Also: Make sure to always acquire locks in the same order to avoid deadlocks, and also avoid reentrant locking.
@matheus23 matheus23 marked this pull request as ready for review January 5, 2026 15:14
@matheus23
Copy link
Member Author

This PR has the problem that we only run the remove_closed_remote_state_actors fn every 60s.

Got around this problem by calling JoinSet::poll_join_next right after acquiring the lock for the join set, and calling that in the main magicsocket actor.


image

Yeah - fair point clippy.
If people have ideas for a struct or type alias encapsulating this, I'm all ears 😅

@matheus23 matheus23 requested a review from flub January 5, 2026 15:17
"oops, flub didn't think this would happen");
}

let sender = self.msock.remote_map.remote_state_actor(node_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was starting the RemoteStateActor if it's not running, whereas the new try_send only sends to running actors. Is this change intended?

Edit: Right, I think it's fine because sending to a just-spawned actor is a noop anyways.

/// The state kept for spawning new tasks for remote state actors and cleaning them up.
///
/// The lock for this needs to be acquired *after* the lock for the `actor_senders` is acquired.
actor_tasks: Mutex<Tasks>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put the senders and the tasks in the same mutex, to avoid the deadlock risk?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🏗 In progress

Development

Successfully merging this pull request may close these issues.

4 participants