Modeling character change heterogeneity in phylogenetic analyses of morphology through the use of priors

Wrighting April

Note: Post written with input from David Hillis

Graeme Lloyd, David Hillis and I have a new paper out. This paper, "Modeling character change heterogeneity in phylogenetic analyses of morphology through the use of priors", looks at using a prior in MrBayes, the symmetric Dirichlet hyperprior to model character change asymmetry. In the Lewis' Mk model of evolution for morphological characters, it is assumed that a character is as likely to transition forward and backwards - i.e., it is as likely to be lost as is to be gained. We can all think of cases where this is not true.

So we had a look using at a prior distribution for character change that allows us to relax this assumption. Relaxing this assumption for morphological data is trickier than with sequence data. With sequence data, we can define an expected rate of change from one nucleotide to another because we expect that a base at one site has pretty similar properties to that same base at another site. Not so with morphology. The relative rates of changes between states (gains versus losses, for example) may differ dramatically from one character to the next. In that light, it's very hard to define many commonalities across a whole data set. This is a big part of why we still use a Jukes-Cantor-like process to model morphology.

Because the paper is currently paywalled (Systematic Biology papers, unless an open option is paid for, are paywalled for a year. I'm happy to discuss why I chose to publish this way), I will refer to some slides in my defense talk, so every one can see the figures to which I am referring.

The symmetric Dirichelet (hyper)prior is a (hyper)prior on state frequencies. It allows characters in a matrix to have varying equilibrium character frequencies. This, in turn, allows those characters to have different transition rates due to the relationship between number of expected changes and equilibrium frequency (slides 55-59). In the case of binary characters, a discrete beta distribution can be used as a prior on the state frequency. This distribution is constrained to be symmetric, meaning that if you have some characters with a high 0 to 1 transition probability, it is assumed you also have some with a high 1 to 0 probability (so it is important to randomize the assignment of states to 0 and 1). You can see different shapes of this prior on slides 60-68.

The first thing we did was a little model-fitting on empirical datasets. We used fixed values for the symmetric Dirichlet - we did not specify any distributions from which the parameter of the distribution is drawn (i.e., we used this as a prior, not a hyperprior). We used Bayes Factor comparisons to see if some datasets favored a prior that allows asymmetrical transitions between states. About half of the datasets we looked at did (slides 81-83)! For many datasets, applying this change to the model made little or no difference to the tree estimated (slide 86-87). But for some, it mattered quite a lot. But with these empirical datasets, we don't know the true tree, so can't say that a different tree is necessariy better.

So we turned to simulations. We simulated data according to four different parameterizations of the symmetric Dirichlet: one that conforms to the default Mk assumption of equal rates, one that favors this assumption, one that assumes characters will come from all possible values of asymmetry, and one that assumes characters will favor asymmetric transitions. We simulated along two trees, a simple eight-taxon tree and one empirical tree.

Firstly, the generating model is highly detectable (slide 89), meaning we can use Bayes Factor model fitting to find the true model easily. This is good for empirical researchers! Detectable models are easier to not misspecify.

Secondly, when the problem is easy, such as when we do not have missing data, choosing the right prior matters (slides 92-93). If your characters demonstrate strictly symmetrical transition rates between states, using a prior assumes the opposite causes your estimations to be pretty bad. But specifying a prior that doesn't punish this assumption severely doesn't hurt you that much.

Lastly, when the problem is hard, getting the prior right matters a lot. On slides 96-98, what we see if that error is much higher when we have missing data (especially structured missing data) and model misspecification working in concert.

What's the bottom line?

Choosing the right prior matters for estimation of trees using the Mk model. How much it matters depends on how challenging your dataset is. But I highly encourage all of you to give using different values a whirl and to use Bayes Factor model fitting to choose the best-fit prior. I've made a website for my paper. In the zip that you download, or the github repo linked there, you can find sample MrBayes blocks showing different priors to try out.

Also in the repository (or zip file) is a script called test_symDir.py. This script can be called to calculate the likelihood of a character along a four-taxon with different values of the symmetric Dirichelt prior. For example, if you wanted to calculate the likleihood of this character under the symdirihyperpr = fixed(2), one of the simulation conditions, type:

python test_symDir.py 2 2

and you should receive the output

pi0 ~ Beta(2.00000, 2.00000) divided into 5 equal-volume categories
  categ 1: pi0 = 0.19580, pi1 = 0.80420, logLike = -3.51177
  categ 2: pi0 = 0.36326, pi1 = 0.63674, logLike = -3.32104
  categ 3: pi0 = 0.50000, pi1 = 0.50000, logLike = -3.29225
  categ 4: pi0 = 0.63674, pi1 = 0.36326, logLike = -3.32104
  categ 5: pi0 = 0.80420, pi1 = 0.19580, logLike = -3.51177
log-likelihood = -3.38676984928

This helpful script, written by Dr. Paul Lewis, illustrates how the prior affects character likelihoods. If you're still trying to wrap your head around the relationship between this prior and the results of an analysis, try playing with this script. The differences in likelihood score can be quite large for just this one character on a small tree. This script depends on the numpy and scipy libraries.