
Conversation

@PalanQu commented Nov 2, 2025

No description provided.

@PalanQu force-pushed the feat/available-storage branch 2 times, most recently from 8258b0e to f8e6d87, on November 2, 2025 at 12:39
@PalanQu force-pushed the feat/available-storage branch from f8e6d87 to 53b4111 on November 2, 2025 at 12:40
@blacks1ne (Contributor) commented:

Adding @CassOnMars's comment from the TG Development channel:

The issue at hand is that, presently, workers aren't aware of the distinction between worker-local storage and cluster-wide storage: a worker can check the partition its store is on, but that partition is shared with the other workers on the machine (unless the storage itself is partitioned per store folder, which to my understanding nobody has done yet).

What I think is the right path forward is somewhat more involved and warrants further discussion; the idea would be either of the following:

  1. (Cheap, easy, not great) each worker takes the total storage available on the partition its store lives on, divides that by the total number of workers (with some buffer room for the master process if it too lives on that partition), and then measures the ratio of storage used as a proxy for available storage per worker (a rough sketch of this follows the list)

  2. (Foundationally more correct, more complicated) Diving into OS-specific syscalls to essentially self-containerize the worker process and limit its internal view of available storage
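A minimal sketch of option 1, assuming a Unix-like system where the worker already knows its store path and the number of workers sharing the partition. The `availablePerWorker` helper, the example path, and the buffer size are illustrative, not anything from this PR:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// availablePerWorker estimates a single worker's storage budget by dividing
// the free space on the store's partition by the number of workers sharing
// it, after reserving a fixed buffer for the master process.
func availablePerWorker(storePath string, totalWorkers int, masterBufferBytes uint64) (uint64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(storePath, &st); err != nil {
		return 0, err
	}
	// Bavail rather than Bfree: count only space available to unprivileged processes.
	free := uint64(st.Bavail) * uint64(st.Bsize)
	if free > masterBufferBytes {
		free -= masterBufferBytes
	} else {
		free = 0
	}
	if totalWorkers < 1 {
		totalWorkers = 1
	}
	return free / uint64(totalWorkers), nil
}

func main() {
	// Hypothetical values: store path, 8 workers on the partition, 10 GiB master buffer.
	per, err := availablePerWorker("/var/lib/node/store", 8, 10<<30)
	if err != nil {
		panic(err)
	}
	fmt.Printf("per-worker budget: %d bytes\n", per)
}
```

As the comments above note, this is only a proxy: every worker sees the same shared number, so one worker overusing the partition silently shrinks everyone else's budget.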

@CassOnMars (Contributor) commented:

I swear I remember writing this somewhere, but I can't recall if it was TG or Discord, so I'm adding my thoughts here so they don't get lost from the topic at hand: the thing I'm not sure about with this approach is that it still doesn't quite address the problem with clustered arrangements. Once a node's workers extend beyond the machine itself, or the node itself doesn't support the relevant syscall, it'll error out or report zero. I'm struggling to find an approach that avoids the more complicated path, i.e. workers actually containerizing themselves, using the relevant syscalls to scope what they have access rights to, etc. Part of this could be alleviated by having worker groups as a configuration option, so that when a worker is determining its own available space, it knows to count only the workers in its own group when dividing available storage.
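A sketch of how the worker-group idea might look, assuming a hypothetical group field in the worker configuration (the `WorkerConfig` fields and `groupShare` helper are illustrative): each worker splits the partition's free space only with peers that declare the same group.

```go
package main

import "fmt"

// WorkerConfig is a hypothetical config entry: workers that share a storage
// pool (e.g. the same partition on the same machine) declare the same Group.
type WorkerConfig struct {
	Name  string
	Group string
}

// groupShare returns the per-worker budget for `self`, counting only the
// workers that belong to the same group when dividing the free space.
func groupShare(freeBytes uint64, self WorkerConfig, all []WorkerConfig) uint64 {
	members := 0
	for _, w := range all {
		if w.Group == self.Group {
			members++
		}
	}
	if members == 0 {
		members = 1
	}
	return freeBytes / uint64(members)
}

func main() {
	workers := []WorkerConfig{
		{"w0", "machine-a"}, {"w1", "machine-a"}, {"w2", "machine-b"},
	}
	// w0 splits 1 TiB with w1 only; w2 is in a different group and doesn't count.
	fmt.Println(groupShare(1<<40, workers[0], workers))
}
```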

@blacks1ne (Contributor) commented Nov 17, 2025:

I was thinking more about having a kind of "worker manager" (or "worker supervisor") running on the remote (slave) nodes and supervising the local worker processes: (re)starting them, exposing centralized metrics including disk space usage, etc.
That would require adding a node launch flag for the supervised core list, e.g. -cores=X-Y,Z
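A rough sketch of parsing such a flag, assuming -cores accepts a comma-separated mix of single indices and inclusive ranges, as the example -cores=X-Y,Z suggests (the parser and the flag wiring are illustrative, not existing node code):

```go
package main

import (
	"flag"
	"fmt"
	"strconv"
	"strings"
)

// parseCores expands a spec like "2-5,8" into the list of supervised core
// indices. The syntax is assumed from the comment above.
func parseCores(spec string) ([]int, error) {
	if spec == "" {
		return nil, nil
	}
	var cores []int
	for _, part := range strings.Split(spec, ",") {
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			start, err1 := strconv.Atoi(lo)
			end, err2 := strconv.Atoi(hi)
			if err1 != nil || err2 != nil || end < start {
				return nil, fmt.Errorf("bad core range %q", part)
			}
			for c := start; c <= end; c++ {
				cores = append(cores, c)
			}
			continue
		}
		c, err := strconv.Atoi(part)
		if err != nil {
			return nil, fmt.Errorf("bad core %q", part)
		}
		cores = append(cores, c)
	}
	return cores, nil
}

func main() {
	spec := flag.String("cores", "", "supervised core list, e.g. 2-5,8")
	flag.Parse()
	cores, err := parseCores(*spec)
	if err != nil {
		panic(err)
	}
	fmt.Println("supervising cores:", cores)
}
```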

@tjsturos (Contributor) commented:

Hmm, this actually ties into what I'm doing right now: allowing a worker to register with the master node rather than needing to be defined in a config.

One of the things I was considering is adding support for a proxy node, or a pseudo-master, that relays start/stop commands but could also have additional features, like storage calculations and automatic worker generation based on the results of those calculations.

This would make things simpler in that you could avoid overages, but it still wouldn't make the worker aware of its own limitations. You'd probably need a hard limit passed from the proxy, or a param flag on individual workers (maybe determined by your deployment script).
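One shape such a hard limit could take on the worker side, assuming a hypothetical --max-store-bytes flag supplied by the proxy or a deployment script (the flag names and the usage check are illustrative only):

```go
package main

import (
	"flag"
	"fmt"
	"io/fs"
	"path/filepath"
)

// storeSize walks the worker's store directory and sums file sizes, so the
// worker can compare its own usage against the externally supplied cap.
func storeSize(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	return total, err
}

func main() {
	maxBytes := flag.Int64("max-store-bytes", 0, "hard per-worker storage cap in bytes (0 = unlimited)")
	storePath := flag.String("store", "./store", "worker store directory")
	flag.Parse()

	used, err := storeSize(*storePath)
	if err != nil {
		panic(err)
	}
	if *maxBytes > 0 && used >= *maxBytes {
		fmt.Printf("store %s at %d/%d bytes: refusing new writes\n", *storePath, used, *maxBytes)
		return
	}
	fmt.Printf("store %s at %d bytes, cap %d\n", *storePath, used, *maxBytes)
}
```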
