I recently submitted a paper I'd like to talk about a little bit. We did a reanalysis of a previously published dataset from Pyron and Burbrink 2014. That paper assembled a really impressive dataset of squamate parity modes and performed some comparative methods analyses with it. We wanted to look at how phylogenetic uncertainty and model assumptions affect the conclusions reached.
Today, I'm writing a little about the reproducibility of the paper.
On writing this: A few people have pointed out recently that open science has kind of an image problem, and I fully agree. Reading Twitter chatter about open science, you'd get the impression that if you don't use the latest tools, your pipeline isn't perfect, and you can't just type 'make' and have the full paper spit back at you from the command line, then you aren't doing it right.
Not true, and that's a really alienating message. There's a very tech-y modality to a lot of the discussions of open science, but in my experience, a lot of people are quietly and contentedly practicing openness and are really happy to be nice to new practitioners. I came up in a lab where it was expected that you provide everything a person would need to reproduce the paper, even if the journal makes no provisions for that. Lab alums have gone on to release free and open source software prodigiously. Many are involved with generating permissively-licensed and fully archived teaching materials and tutorials. It never occurred to me to do science another way, really. You can do good work, and open work, without breaking your back. This post is fundamentally about striking the right balance for you.
On the tools I used: I had never written a Makefile before and wanted to try; that's all, really. I had read about using them for reproducibility, and they seemed useful. What I did is not full reproducibility in the sense that you type 'make' and are left with a figure; rather, you type 'make', get some numbers, and use an iPython notebook to plot them. As we simplified the paper, the Makefiles got simpler, so the ones attached to the final paper might actually be basic enough for an intro user to grasp. And I'm all about novice resources.
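The pattern really is that simple. Here's a minimal sketch of the kind of Makefile I mean - the file names and script here are hypothetical, just to show the shape, not the actual targets in our repository:

```make
# Hypothetical file names, for illustration only.
all: results.csv

# Re-run the BiSSE fitting script whenever the tree, the trait
# data, or the script itself changes.
results.csv: fit_bisse.R tree.nex traits.csv
	Rscript fit_bisse.R tree.nex traits.csv > results.csv

clean:
	rm -f results.csv

.PHONY: all clean
```

Typing 'make' rebuilds `results.csv` only when one of its listed dependencies has changed, which is most of what you need for a small analysis pipeline.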
The iPython Notebook is a really great tool for making code easy to use for collaborators and novices. Since I was working with my labmate, Katie Lyons, who was a total novice at computation, I wanted to make sure the scripts were easy for her to follow. In an iPython notebook, code is held in a cell. The user executes the cell and sees the output. This is especially nice for graphics: you run your code and your graph shows up in the window, in full color. You can make very accessible, easy-to-use interactive documents this way.
What we did
The paper had three main analyses:
1) Running BiSSE on four time-scaled phylogenetic trees to look at the effect of topology and branch lengths on the conclusions reached.
2) Running BiSSE on each of these trees with a variety of parameters to look at the effects of the exact model settings on the conclusions reached.
3) Using PAUP to count changes in parity mode on the tree.
There were also little ancillary analyses that we did, for example, looking at Robinson-Foulds distances between trees. These aren't the main thrust of the paper and are provided as iPython notebooks.
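Robinson-Foulds is conceptually simple - it's just a count of the bipartitions found in one tree but not the other. A toy pure-Python sketch, with the bipartitions written out by hand rather than parsed from Newick (real analyses would use a tree library):

```python
def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance: the size of the symmetric difference
    of the two trees' sets of nontrivial bipartitions."""
    return len(splits_a ^ splits_b)

# Each bipartition is represented by one side of the split,
# as a frozenset of taxon labels. Five hypothetical taxa A-E.
tree1_splits = {frozenset("AB"), frozenset("ABC")}
tree2_splits = {frozenset("AB"), frozenset("ABD")}

print(rf_distance(tree1_splits, tree2_splits))  # 2: {A,B,C} and {A,B,D} are each unique
```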
Goal One: Run BiSSE on a few trees
Easy enough. We made trees using ExaML and Garli. Estimating these trees takes quite a long time - about a week - so we didn't include this step in any of the Makefiles. I'm sure I could have beaten my head against a wall figuring out a solution to the long-estimation problem, but ... well, that really didn't seem worth it. The configuration files for the tree-building software can be found in our materials under PhylogenyMaking.
Then we time-scaled the trees. Pyron and Burbrink used treePL for this, and since we're reproducing their work, we did too. treePL is kind of finicky about where it compiles. I found that following the Linux instructions in a clean Vagrant box running Ubuntu 12 is the easiest way to do this.
If you don't want to bother with these two long steps, the trees are in the Trees directory of the repository.
You can run BiSSE on our trees from my repository in a couple of ways. The easiest is to open a terminal, change directories into the repository, and type 'make'. This will output the results of BiSSE's parameter fitting as CSV files, along with Newick strings of the trees with the marginal likelihoods of the ancestral states written onto the nodes. You can then visualize these trees in FigTree or your preferred tree-viewing software.
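Once you have the CSVs, pulling numbers out of them is a few lines of Python. A sketch with Python's built-in csv module - the column names here are hypothetical, not the actual headers our Makefile produces:

```python
import csv
import io

# Hypothetical BiSSE parameter-fit output; illustrative column names only.
raw = """tree,lambda0,lambda1,mu0,mu1,q01,q10
ExaML,0.31,0.12,0.05,0.02,0.011,0.004
Garli,0.29,0.14,0.06,0.03,0.010,0.005
"""

rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    # Net diversification for state 0: speciation minus extinction.
    net_div0 = float(row["lambda0"]) - float(row["mu0"])
    print(row["tree"], round(net_div0, 3))
```

With a real file you'd pass `open("results.csv")` to `csv.DictReader` instead of the inline string.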
Goal Two: Parameter Sampling
Pyron and Burbrink used a sampling parameter whose function we weren't sure we understood. So we did an exercise in which we sampled many values of this parameter and made a heatmap showing how varying it changed the probability of the ancestor of all squamates being viviparous.
If you type 'make -f Makefileheat', this analysis will run on your computer. However, it makes a lot of replicates and takes a long time; I wouldn't actually do it. You can find my heatmap data in the Data directory and the interactive heatmapping script among the iPython notebooks. This script is a good illustration of what I meant earlier about iPython notebooks being really nice for graphics.
Goal Three: Counting changes in PAUP
We just provided the PAUP file for this. Donezo.
How I feel about it
This paper was a massive effort. I didn't anticipate how big of an undertaking it would be just to do the research. But I was strongly motivated by pedagogy - a new student in my lab wanted to learn computation, and she really took ownership of this project. She stepped up, and that made it easy for me to step up on a non-dissertation project in my last year of grad school, while pregnant. See her post in the coming days about learning by doing.
About the reproducibility ... I intended to do more. I intended to pipe the R and Python together better. But, right now, I have a paper in review that I'm massively proud of. You can reproduce it. Katie learned a lot. I learned some skills I wanted (make, mostly) and made some resources I think will be helpful for novice users. We have a product that is reproducible enough for my needs and the needs of my colleagues. And that's what I'd like readers, especially novice readers, to take away: your project doesn't need to be perfectly, fully reproducible to be useful. You don't need to take time away from your other projects to build the perfect pipeline. Do good science. Do your best. Learn something new. Don't get bogged down in the fear that the Twitterverse will chuckle at your puny attempt at reproducibility.
Do good enough. I did good science, I made it accessible, I learned some new tricks and I helped a new student learn some foundational computing skills. I deserve a beer. Maybe next month:)
Many thanks to Greg Wilson for comments