XGBoost BUG: #93

@MmasterT

Describe the bug

The final step of the pipeline (rank_results) fails with an XGBoost DataFrame dtype error, the same one described in this Stack Overflow question:

https://stackoverflow.com/questions/66491801/i-got-this-error-dataframe-dtypes-for-data-must-be-int-float-bool-or-categori

To Reproduce
I've cloned the repo and changed some of the configs to run in a Slurm context with no internet access. Everything is created and analyzed as expected except the final file.

sbatch -p ei-cb -J predector_test -o predector_test.%j.log -c 1 --mem 10G --wrap " source nextflow-22.04.0_CBG && nextflow run ~/singularity/predector/predector/main.nf --phibase /ei/cb/common/Databases/predector/phi-base_current.fas --pfam_hmm /ei/cb/common/Databases/predector/Pfam-A.hmm.gz --pfam_dat /ei/cb/common/Databases/predector/Pfam-A.hmm.dat.gz --dbcan /ei/cb/common/Databases/predector/dbCAN-HMMdb-V11.txt --effectordb /ei/cb/common/Databases/predector/effectordb.hmm.gz -profile test -with-singularity ~/singularity/predector/predector-1.2.7.sif -resume ~/singularity/predector/predector/ -c ~/singularity/predector/predector/nextflow.config -with-report"

Expected behavior
Expected to get the *rank_result.tsv file of the test.

Error Log
Error executing process > 'rank_results (test_set)'

Caused by:
Process rank_results (test_set) terminated with an error exit status (2)

Command executed:

predutils load_db --mem "2" tmp.db results.ldjson

predutils rank --mem "2" --dbcan dbcan.txt --pfam pfam.txt --outfile "test_set-ranked.tsv" --secreted-weight "2" --sigpep-good-weight "0.003" --sigpep-ok-weight "0.0001" --single-transmembrane-weight "-0.7" --multiple-transmembrane-weight "-1.0" --deeploc-extracellular-weight "1.3" --deeploc-intracellular-weight "-1.3" --deeploc-membrane-weight "-0.25" --targetp-mitochondrial-weight "-0.5" --effectorp1-weight "0.5" --effectorp2-weight "2.5" --effectorp3-apoplastic-weight "0.5" --effectorp3-cytoplasmic-weight "0.5" --effectorp3-noneffector-weight "-2.5" --deepredeff-fungi-weight "0.1" --deepredeff-oomycete-weight "0.0" --effector-homology-weight "2" --virulence-homology-weight "0.5" --lethal-homology-weight "-2" --tmhmm-first-60-threshold "10" tmp.db

rm -f tmp.db

Command exit status:
2

Command output:
(empty)

DataFrame.dtypes for data must be int, float, bool or category. When
categorical type is supplied, DMatrix parameter enable_categorical must
be set to True. Invalid columns:signalp3_nn_d
Traceback (most recent call last):
File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/main.py", line 253, in main
rank_runner(args)
File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/subcommands/rank.py", line 1577, in runner
raise e
File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/subcommands/rank.py", line 1575, in runner
inner(con, cur, args)
File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/subcommands/rank.py", line 1561, in inner
df["effector_score"] = run_ltr(df)
File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/subcommands/rank.py", line 1503, in run_ltr
dmat = xgb.DMatrix(df_features)
File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/core.py", line 532, in inner_f
return f(**kwargs)
File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/core.py", line 643, in __init__
handle, feature_names, feature_types = dispatch_data_backend(
File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/data.py", line 896, in dispatch_data_backend
return _from_pandas_df(data, enable_categorical, missing, threads,
File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/data.py", line 345, in _from_pandas_df
data, feature_names, feature_types = _transform_pandas_df(
File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/data.py", line 283, in _transform_pandas_df
_invalid_dataframe_dtype(data)
File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/data.py", line 247, in _invalid_dataframe_dtype
raise ValueError(msg)
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When
categorical type is supplied, DMatrix parameter enable_categorical must
be set to True. Invalid columns:signalp3_nn_d
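
The rule stated in the error message (every column must be int, float, bool or category) can be checked before the DataFrame reaches xgb.DMatrix. The following is a minimal sketch of such a check using pandas only. It is not the check XGBoost itself runs. The helper name invalid_columns, the column other_score and all values are made up for illustration, and it assumes signalp3_nn_d arrives as an object-dtype column (for example strings mixed with missing values), which the log does not confirm.

```python
import pandas as pd
from pandas.api import types as pdt


def invalid_columns(df: pd.DataFrame) -> list:
    """Return names of columns whose dtype is not int, float, bool or category.

    Hypothetical helper: it mirrors the rule in the error message and is
    not part of predectorutils or XGBoost.
    """
    bad = []
    for name, dtype in df.dtypes.items():
        ok = (
            pdt.is_bool_dtype(dtype)
            or pdt.is_numeric_dtype(dtype)
            or isinstance(dtype, pd.CategoricalDtype)
        )
        if not ok:
            bad.append(name)
    return bad


# Made-up data: the values are placeholders, not real pipeline output.
df_features = pd.DataFrame({
    # Assumption: the column holds strings and None, so its dtype is object.
    "signalp3_nn_d": pd.Series(["0.81", None, "0.12"], dtype=object),
    # Hypothetical second feature that is already float64.
    "other_score": [0.9, 0.1, 0.5],
})

print(df_features.dtypes)
print(invalid_columns(df_features))
```

Running a check like this on df_features just before the xgb.DMatrix call in rank.py would show whether signalp3_nn_d is the only affected column and what dtype it actually has.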

Operating system (please enter the following information as appropriate):

  • OS/Linux distribution: CentOS
  • Dependency management: Singularity
  • Linux HPC

Additional context
I think changing xgb.DMatrix(df_features) to xgb.DMatrix(df_features, enable_categorical=True) should fix it.
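
One caveat on the suggestion above: going by the wording of the error message, enable_categorical=True applies to columns that already have the pandas category dtype, so it may not be enough if signalp3_nn_d is an object column. An alternative is to coerce such columns to numeric before building the DMatrix. The sketch below assumes the column is object dtype and should be a numeric score. The helper name coerce_object_columns, the column other_score and the sample values are hypothetical, and this is not the fix adopted by the maintainers. The xgboost import is guarded so the pandas part runs on its own.

```python
import pandas as pd
from pandas.api import types as pdt


def coerce_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where non-numeric, non-category columns are made numeric.

    Values that cannot be parsed as numbers become NaN, which XGBoost
    treats as missing by default. Hypothetical helper for illustration.
    """
    out = df.copy()
    for name in out.columns:
        dtype = out[name].dtype
        if pdt.is_numeric_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype):
            continue  # already acceptable to xgb.DMatrix
        out[name] = pd.to_numeric(out[name], errors="coerce")
    return out


# Made-up data standing in for df_features in rank.py.
df_features = pd.DataFrame({
    # Assumption: object dtype caused by strings mixed with missing values.
    "signalp3_nn_d": pd.Series(["0.81", None, "0.12"], dtype=object),
    "other_score": [1.0, 2.0, 3.0],
})

fixed = coerce_object_columns(df_features)
print(fixed.dtypes)

try:
    import xgboost as xgb
except ImportError:
    xgb = None  # xgboost is optional for this sketch

if xgb is not None:
    # With only numeric columns, no enable_categorical flag is needed.
    dmat = xgb.DMatrix(fixed)
    print(dmat.num_row(), dmat.num_col())
```

If the column is meant to be categorical instead, converting it with astype("category") and passing enable_categorical=True would be the matching approach, but a model trained on numeric features would probably not expect that.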
