Use dense matrices in diffusion_nn #428
Open
I observed that the i-step connectivity matrices in diffusion_nn are actually fairly dense for i > 1, so the use of sparse matrices and operations contributes to the memory and time bottleneck. (And diffusion_nn accounts for most of the time and memory consumed by kBET.) I therefore introduced switches from sparse to dense matrices and operations once the matrices densify. Of course, there are approximations and subsampling strategies that also reduce memory and runtime, but it is nice to be able to run the full method for reproducibility and/or in pipelines that call scib specifically (e.g., OpenProblems). In those cases, these changes make kBET much more runnable.
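The idea, roughly: keep the graph sparse while it is sparse, and switch to dense NumPy operations once the accumulated powers densify. Below is a minimal sketch, assuming a scipy.sparse connectivity matrix `T`; the function name and the `densify_at` threshold are hypothetical illustrations, not the exact code in this PR or scib's API.

```python
import numpy as np
import scipy.sparse as sp


def connectivity_power_sum(T, n_iter, densify_at=0.3):
    """Accumulate T + T**2 + ... + T**n_iter, switching from sparse to
    dense operations once the running power passes a density threshold.

    ``densify_at`` is a hypothetical tuning knob for this sketch, not a
    parameter of scib's diffusion_nn.
    """
    M = T.copy()      # running power T**i, sparse at first
    acc = T.copy()    # running sum of powers
    T_dense = None
    for _ in range(2, n_iter + 1):
        if T_dense is None and M.nnz / (M.shape[0] * M.shape[1]) > densify_at:
            # The i-step connectivities are fairly dense for i > 1, so
            # dense BLAS matmuls beat sparse-sparse products from here on.
            M = M.toarray()
            acc = acc.toarray()
            T_dense = T.toarray()
        if T_dense is not None:
            M = M @ T_dense
            acc += M
        else:
            M = M @ T
            acc = acc + M
    return acc


# Example on a random sparse graph (stand-in for a kNN connectivity matrix):
A = sp.random(1000, 1000, density=0.01, format="csr")
C = connectivity_power_sum(A, n_iter=5)
```

The threshold trades a one-time densification cost against faster dense matmuls; since the matrices are already fairly dense for i > 1, the switch pays off almost immediately.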
For a representative embedding of hypomap (with ~220k cells of type neuron), peak memory usage was reduced from 1583 GB to 541 GB, and overall runtime was reduced by a factor of ~8. On another hypomap embedding, peak memory was reduced to 513 GB from more than 2 TB, the limit of the machine; that run could not finish at all before, so I was only able to run kBET and replicate that portion of the pipeline with these modifications. As kBET has by far the highest memory usage of any metric, this makes running the whole pipeline much more tractable.