Expected number of INDELs using genomic_distribution #85

@cutleraging

Hello,

Thanks for such a great package!

In the genomic_distribution function, I understand that the expected number of mutations for a region of interest is calculated as

n_muts / surveyed_length * surveyed_region_length
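To make the arithmetic concrete, here is a minimal Python sketch of that formula as I understand it. It is not the package's code; the function name and the example numbers are hypothetical.

```python
def expected_mutations(n_muts, surveyed_length, surveyed_region_length):
    """Expected mutation count in a region under a uniform mutation density.

    Assumes every surveyed base is equally likely to carry a mutation,
    which is the assumption behind the formula above.
    """
    if surveyed_length <= 0:
        raise ValueError("surveyed_length must be positive")
    density = n_muts / surveyed_length       # mutations per surveyed base
    return density * surveyed_region_length  # scale to the region of interest


# Hypothetical numbers: 1,000 mutations over 3 Mb surveyed,
# of which 300 kb falls inside the region of interest.
expected = expected_mutations(1_000, 3_000_000, 300_000)
```

With these made-up inputs the region covers a tenth of the surveyed bases, so the expected count is about a tenth of the mutations.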

However, does this provide an accurate estimate when dealing with INDELs? I would not think so, since n_muts is not equal to the total number of mutated bases (as it is for SNVs).

Any thoughts on a better way to calculate the expected number of INDELs?

One solution I have tried is to randomly shuffle the INDELs (accounting for sequence context) and then count how many land in the region of interest. When I do this, I get an observed/expected ratio of ~1, which is what I would expect. However, I am not sure how I would then test whether this is significant using the binomial_test function. Would it make sense to do something along these lines?

p = n_INDELs / surveyed_length
n = surveyed_region_length
x = observed_INDELs # number of INDELs observed to land in region of interest from the randomly shuffled files
binomial_test(p, n, x)
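As a possible alternative to the binomial test, the shuffled counts themselves could be turned into an empirical (permutation) p-value. The Python sketch below shows that different technique, not anything the package does; the function name and the counts are hypothetical, and it only tests for enrichment (one-sided).

```python
def empirical_p_value(observed, shuffled_counts):
    """One-sided permutation p-value for enrichment in the region.

    observed        -- INDELs actually seen in the region of interest
    shuffled_counts -- INDELs landing in the region, one count per shuffle
    The +1 terms keep the estimate from ever being exactly zero.
    """
    if not shuffled_counts:
        raise ValueError("need at least one shuffled count")
    n_extreme = sum(1 for count in shuffled_counts if count >= observed)
    return (n_extreme + 1) / (len(shuffled_counts) + 1)


# Hypothetical counts from 10 shuffles; a real analysis would use many more.
shuffled = [9, 11, 10, 12, 8, 10, 11, 9, 10, 13]
p_enrichment = empirical_p_value(12, shuffled)
```

The smallest p-value this can report is 1 / (number of shuffles + 1), so the number of shuffles sets the resolution of the test.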

Any input would be wonderful, thanks!
Ronnie
