diff --git a/README.md b/README.md index f689e2d..378fd56 100644 --- a/README.md +++ b/README.md @@ -17,12 +17,12 @@ under construction. # Requirements -Install the Python requirements from requirements.txt: +Install the Python requirements from `requirements.txt`: ``` $ python -m pip install -r requirements.txt ``` -You will also need a working R installation with reticulate and irkernel installed. +You will also need a working R installation with `reticulate` and `irkernel` installed. This command should do the trick: ``` $ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()' @@ -35,8 +35,8 @@ $ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()' [jupytext](https://github.com/mwouts/jupytext) to convert the notebook into the right format. - To build locally, run ``make``. The output tells you where to find the - built HTML. -- Pages rendered at https://tskit.dev/tutorials + built `HTML` file(s). +- Pages rendered at https://tskit.dev/tutorials. - Pages might take a while to be updated after a new tutorial is merged. If you have an idea for a tutorial, please open an issue to discuss. diff --git a/tskitr.md b/tskitr.md index 4e27f4c..cc878dc 100644 --- a/tskitr.md +++ b/tskitr.md @@ -19,13 +19,12 @@ kernelspec: # Tskit and R -To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate/) R package, which lets you call Python functions within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. If you haven't done so already, you'll need to install `reticulate` in your R session via `install.packages("reticulate")`. +To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate) R package, which lets you call `tskit`'s Python API within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. You'll need to install `reticulate` in your R session via `install.packages("reticulate")` and at a minimum state the required Python packages via `reticulate::py_require(c("tskit", "msprime"))`. This setting will use `reticulate`'s isolated Python virtual environment, but you can also use an alternative Python environment as described at https://rstudio.github.io/reticulate. We'll begin by simulating a small tree sequence using `msprime`. ```{code-cell} msprime <- reticulate::import("msprime") - ts <- msprime$sim_ancestry(80, sequence_length=1e4, recombination_rate=1e-4, random_seed=42) ts # See "Jupyter notebook tips", below for how to render this nicely ``` @@ -33,15 +32,16 @@ ts # See "Jupyter notebook tips", below for how to render this nicely ## Attributes and methods `reticulate` allows us to access a Python object's attributes via -the `$` operator. For example, we can access (and assign to a variable) the number of -samples in the tree sequence: +the `$` operator. For example, we can access (and assign to a native R object) +the number of samples in the tree sequence: ```{code-cell} n <- ts$num_samples n +is(n) # Show the type of the object ``` -The `$` operator can also be used to call methods, for example, the +The `$` operator can also be used to call methods, for example, the {meth}`~TreeSequence.simplify` method associated with the tree sequence. The method parameters are given as native R objects (but note that object IDs still use tskit's 0-based indexing system). @@ -58,33 +58,44 @@ paste( ### IDs and indexes Note that if a bare digit is provided to one of these methods, it will be treated as a -floating point number. This is useful to know when calling `tskit` methods that -require integers (e.g. object IDs). For example, the following will not work: +floating point number (numeric in R). This is useful to know when calling `tskit` methods that +require integers (such as object IDs). For example, the following will not work: ```{code-cell} :tags: [raises-exception, remove-output] -ts$node(0) # Will raise an error +ts$node(0) # Will raise a TypeError +is(0) ``` In this case, to force the `0` to be passed as an integer, you can either coerce it -using `as.integer` or simply prepend the letter `L`: +using `as.integer` or simply append the letter `L`: ```{code-cell} +is(as.integer(0)) ts$node(as.integer(0)) # or +is(0L) ts$node(0L) ``` Coercing in this way is only necessary when passing parameters to those underlying -`tskit` methods that expect integers. It is not needed e.g. to index into numeric arrays. +`tskit` methods that expect integers. It is not needed to index into numeric arrays. _However_, when using arrays, very careful attention must be paid to the fact that `tskit` IDs start at zero, whereas R indexes start at one: ```{code-cell} -root_id <- ts$first()$root -paste("Root time via tskit method:", ts$node(root_id)$time) -# When indexing into tskit arrays in R, add 1 to the ID -paste("Root time via array access:", ts$nodes_time[root_id + 1]) +root_id_int <- ts$first()$root +is(root_id_int) +root_id_num <- as.numeric(root_id_int) + +# Using integer ID will work +paste("Root time via tskit method:", ts$node(root_id_int)$time) +# Using numeric ID will not work +# paste("Root time via tskit method:", ts$node(root_id_num)$time) + +# When indexing into tskit arrays in R, add 1 to the ID (integer or numeric) +paste("Root time via array access:", ts$nodes_time[root_id_int + 1]) +paste("Root time via array access:", ts$nodes_time[root_id_num + 1]) ``` ## Analysis @@ -98,7 +109,6 @@ for each of the tree sequence's sample nodes: ```{code-cell} ts_mut = msprime$sim_mutations(reduced_ts, rate=1e-4, random_seed=321) - paste(ts_mut$num_mutations, "mutations, genetic diversity is", ts_mut$diversity()) ``` @@ -109,6 +119,7 @@ the tree sequence as a matrix object in R. ```{code-cell} G = ts_mut$genotype_matrix() G +is(G) ``` We can then use R functions directly on the genotype matrix: @@ -142,10 +153,9 @@ It also allows trees and tree sequences to be plotted inline: ts_mut$draw_svg(y_axis=TRUE, y_ticks=0:10) ``` - ## Interaction with R libraries -R has a number of libraries to deal with genomic data and trees. Below we focus on the +R has a number of libraries to deal with genomic data and trees. Below we demo the phylogenetic tree representation defined in the the popular [ape](http://ape-package.ird.fr) package, taking all the trees {meth}`exported in Nexus format`, or @@ -165,14 +175,24 @@ plot(tree, direction="downward", srt=90, adj=0.5) # or equivalently use trees[[ ``` Note that nodes are labelled with the prefix `n`, so that nodes `0`, `1`, `2`, ... -become `n0`, `n1`, `n2` ... etc. This helps to avoid +become `n0`, `n1`, `n2`, ... This helps to avoid confusion between the the zero-based counting system used natively by `tskit`, and the one-based counting system used in `R`. ## Further information -Be sure to check out the [reticulate](https://rstudio.github.io/reticulate/) +Be sure to check out the [reticulate](https://rstudio.github.io/reticulate) documentation, in particular on [Calling Python from R](https://rstudio.github.io/reticulate/articles/calling_python.html), which includes important information on how R data types are converted to their -equivalent Python types. +equivalent Python types. + +## Advanced usage through the tskit's C API + +While the approach with `reticulate` is very powerful and will be useful for +most analyses, it does have an overhead due to calling Python and conversion of objects. +This overhead will be minimal for some use cases but could be significant for others. +An alternative approach is to use the `tskit` C API from R, +via the standard [R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) or +via [Rcpp](https://www.rcpp.org), +but these are advanced topics beyond the scope of this tutorial.