Nextflow removes data columns when no values are greater than 1 #58

@srcrowl

Describe the bug
The Nextflow implementation of KSTAR has a quirk: a data column is removed if none of its values are greater than or equal to 1. This causes an error downstream, in the binarize experiment step, which tries to binarize the dataset using a data column that is no longer there. The downstream functions do not report that the column was dropped, so the failure is effectively silent: the step cannot find the data column it needs, and nothing makes the cause obvious.

I have traced this error back to Lines 91-92 of nextflow/src/helpers.py (the check_data_columns function):

def check_data_columns(evidence, data_columns, logger=None):
        """
        Checks data columns to make sure column is in evidence and that evidence filtered on that data column 
        has at least one point of evidence. Removes all columns that do not meet criteria
        """
        
        if logger is None:
            logger = get_logger("check_data", "check_data.log")

        new_data_columns = []
        for col in data_columns:
            if col in evidence.columns:
                if len(evidence[evidence[col] >= 1]) > 0:   # issue is here: the cutoff is hard-coded to 1
                    new_data_columns.append(col)
                else:
                    logger.warning(f"{col} does not have any evidence")

            else:
                logger.warning(f"{col} not in evidence")
        data_columns = new_data_columns
        return data_columns

To fix this, we should just need to add a threshold parameter to the function, change the >= 1 to >= threshold, and make sure main() passes in the correct threshold.
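A minimal sketch of what the thresholded check could look like. The `threshold` parameter name, its default of 1.0, the toy column names, and the use of the standard `logging` module in place of the project's `get_logger` helper are all assumptions made for illustration. It also assumes the "greater than or equal" direction described above.

```python
import logging

import pandas as pd


def check_data_columns(evidence, data_columns, threshold=1.0, logger=None):
    """Return the data columns that are present in `evidence` and have at
    least one value at or above `threshold`."""
    if logger is None:
        # Assumption: stdlib logging stands in for the project's get_logger helper.
        logger = logging.getLogger("check_data")

    kept = []
    for col in data_columns:
        if col not in evidence.columns:
            logger.warning(f"{col} not in evidence")
        elif (evidence[col] >= threshold).any():
            kept.append(col)
        else:
            logger.warning(f"{col} does not have any evidence at threshold {threshold}")
    return kept


# Hypothetical toy data: one column entirely below 1, one with values above 1.
evidence = pd.DataFrame({"data:low": [0.2, 0.6, 0.9], "data:high": [0.5, 1.4, 2.0]})
columns = ["data:low", "data:high", "data:missing"]

default_kept = check_data_columns(evidence, columns)
lowered_kept = check_data_columns(evidence, columns, threshold=0.5)
print(default_kept)
print(lowered_kept)
```

With the default threshold the all-below-1 column is still dropped, while passing a lower threshold keeps it, so main() would need to forward the user's threshold for the change to take effect.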

However, this will still cause an error if any data columns are removed for lacking evidence at the given threshold. To fix this, we will also want to update the binarize experiment.py file in the nextflow implementation to actually use the updated data columns. In Line 97 of the main() function, change

    binary_evidence = create_binary_evidence(evidence, results.data_columns, results.activity_agg, results.threshold, greater)

to

    binary_evidence = create_binary_evidence(evidence, data_columns, results.activity_agg, results.threshold, greater)

To Reproduce
Steps to reproduce the behavior:

  1. Find a dataset in which the quantification is less than 1 for all sites
  2. Map the dataset and run the Nextflow implementation as normal
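For a quick check without running the full pipeline, the condition from the function quoted above can be applied to a toy table. This is only a sketch: the column names and values are made up, and it exercises the hard-coded `>= 1` test in isolation rather than the actual Nextflow workflow.

```python
import pandas as pd

# Hypothetical evidence table in which every quantification is below 1.
evidence = pd.DataFrame({
    "site": ["S1", "S2", "S3"],
    "data:ratio": [0.15, 0.40, 0.85],
})
data_columns = ["data:ratio"]

# Same test as the hard-coded check: keep a column only if some value is >= 1.
surviving = [col for col in data_columns if len(evidence[evidence[col] >= 1]) > 0]
print(surviving)
```

The list comes back empty, so the only data column is dropped before the binarize step runs.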
