Inconsistent preprocessing for scPRINT #81

@mumichae

Describe the bug
The feature selection step seems to be missing from the scPRINT preprocessing:

print("\n>>> Preprocessing data...", flush=True)
preprocessor = Preprocessor(
    min_valid_genes_id=min(0.9 * adata.n_vars, 10000),  # 90% of features up to 10,000
    # Turn off cell filtering to return results for all cells
    filter_cell_by_counts=False,
    min_nnz_genes=False,
    do_postp=False,
    # Skip ontology checks
    skip_validate=True,
)
adata = preprocessor(adata)

Expected behavior
Feature selection should be handled via a command-line argument:

- name: --n_hvg
  type: integer
  default: 2000
  description: Number of highly variable genes to use.

if par["n_hvg"]:
    print(f"Select top {par['n_hvg']} high variable genes", flush=True)
    idx = adata.var["hvg_score"].to_numpy().argsort()[::-1][:par["n_hvg"]]
    adata = adata[:, idx].copy()
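
For illustration, the selection logic in the snippet above can be sketched in plain Python so that it runs without anndata or numpy. The helper name `select_hvg_indices` and the toy scores are hypothetical and not part of the repository. A falsy `n_hvg` keeps all genes, which is how the full feature matrix could still be requested.

```python
from typing import List, Optional, Sequence


def select_hvg_indices(hvg_scores: Sequence[float], n_hvg: Optional[int]) -> List[int]:
    """Return the column indices of the n_hvg highest-scoring genes.

    A falsy n_hvg (None or 0) keeps every gene, mirroring the
    `if par["n_hvg"]:` guard in the snippet above.
    """
    all_indices = list(range(len(hvg_scores)))
    if not n_hvg:
        return all_indices
    # Rank indices by descending score, then keep the first n_hvg.
    ranked = sorted(all_indices, key=lambda i: hvg_scores[i], reverse=True)
    return ranked[:n_hvg]


# Toy scores standing in for adata.var["hvg_score"] (values are made up).
scores = [0.1, 0.9, 0.5, 0.7]
top_two = select_hvg_indices(scores, 2)      # indices of the two highest scores
keep_all = select_hvg_indices(scores, None)  # no selection requested
```

In the scPRINT script, the resulting indices would be applied as `adata[:, idx].copy()`. Doing this before the `Preprocessor` call is an assumption on my part, not something the current code confirms.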

Additional context
If the full feature matrix is desired for the model by default, this should be achieved by adjusting the --n_hvg parameter in, or adding a new variant to, the config here:

variants:
  scprint_large:
    model_name: "large"
  scprint_medium:
    model_name: "v2-medium"
  scprint_small:
    model_name: "small"

Labels: bug
