Choose Your Own Adventure

Pick Two. Do it and answer the questions.

Lognormally-distributed among-character rate variation

As we discussed, there is reason to believe a lognormal distribution may fit morphological data better than a gamma. RevBayes makes it very natural to discretize any distribution you might like to work with.

    rates_morpho := fnDiscretizeDistribution( dnLognormal(ln(alpha_morpho), 0.01), 4 )

Can you figure out which line you might replace with this code?

Create a copy of your previous Rev script, and call it mcmc_MkLog.Rev. You will need to modify the Rev code provided in this section in this file.

Make this change in your duplicated copy, and change the names of your output files to indicate that this is estimation uses a lognormal distribution. Run your new Rev script.

Questions

  1. Compare an estimation with the Gamma-distributed among character rate variaiton (ACRV) to the lognormal ACRV. Are they the same? Does the likelihood, posterior, or tree length change?

  2. Plot the Gamma and Lognormal distributions in R (See the lesson for examples on doing this). Do they look the same? How do they differ? What do you think this means for the underlying biology we’re trying to model?

  3. Try summarizing the tree and viewing in Fig Tree. Does it differ from the Gamma-distributed tree?

Ascertainment Bias

As discussed earlier in the section Ascertainment_Bias, we also need to correct for ascertainment bias.

Create a copy of your previous Rev script, and call it mcmc_Mkv.Rev. You will need to modify the Rev code provided in this section in this file.

In RevBayes it is actually very simple to add a correction for ascertainment bias. You only need to set the option coding="variable" in the dnPhyloCTMC. Coding specifies what type of ascertainment bias is expected. We are using the variable correction, as we have no invariant character in our matrix. If we also lacked parsimony non-informative characters, we would use the coding informative.

phyMorpho ~ dnPhyloCTMC(tree=phylogeny, siteRates=rates_morpho, Q=Q_morpho, type="Standard", coding="variable")
  1. Compare this estimation with one without correcting for ascertainment bias. Are they the same? Does the likelihood, posterior, or tree length change?

  2. Try both the “variable” and “informative” corrections. Which do you think is correct?

  3. Try summarizing the tree and viewing in Fig Tree. Does it differ from the ascertainment bias non-corrected tree?

Relaxing Character State Symmetry

The Mk model makes a number of assumptions, but one that may strike you as particularly unrealistic is the assumption that characters are equally likely to change from any one state to any other. That means that a trait is as likely to be gained as lost. While this may hold true for some traits, we expect that it may be untrue for many others.

RevBayes has functionality to allow us to relax this assumption. We do this by specifying a beta prior on state frequencies. Stationary frequencies impact how likely we are to see changes in a character. For example, it may be very likely, in a character, to change from 0 to 1. But if the frequency of 0 is very low, we will still seldom see this change.

We can think of a Q matrix as looking like so:

\[Q = \begin{pmatrix} -\mu_0\pi0 & \mu_{01}\pi0 \\ \mu_{10}\pi1 & -\mu_1\pi1 &\\ \end{pmatrix} \mbox{ ,}\]

In which the probability of changing states depends not solely on the transition probability, but also the frequency of the starting state. For example, if we have a rare character state, we do not expect to see many transitions from the rare state to another. \(\pi\) is the value chosen to represent state frequency commonly in phylogenetic models.

We can exploit the relationship between state frequencies and observed changes to allow for variable Q-matrices across characters. To do this, we generate a beta distribution on state frequencies, and use the state frequencies from that distribution to generate a series of Q-matrices used to evaluate our data [@Pagel2004, @Nylander2004, @Wright2016].

This type of model is called a mixture model. There are assumed to be subdivisions in the data, which may require different parameters (in this case, state frequencies). These subdivisions are not defined a priori. This model has previously been shown to be effective for a range of empirical and simulated datasets [Wright2016].

Modifying the Rev-script

Make a copy of the Rev script you made earlier. Call it mcmc_mk_discretized.Rev. This new script will contain the new model parameters and models.

We will use a discretized beta distribution to place a prior on the state frequencies. The beta distribution has two parameters, \(\alpha\) and \(\beta\). These two parameters specify the shape of the distribution. State frequencies will be evaluated according to this distribution, in the same way that rate variation is evaluated according to the gamma distribution. The discretized distribution is split into multiple classes, each with it’s own set of frequencies for the 0 and 1 characters. The number of classes can vary; we have chosen 4 for tractability. Note that we need to make sure that this discretization results in a symmetric model, therefore we will use only one parameter for the beta distribution: beta_scale such that \(\alpha = \beta\).

num_cats = 4
beta_scale ~ dnLognormal( 0.0, sd=2*0.587405 )
moves.append( mvScale(beta_scale, lambda=1, weight=5.0 ) )

Above, we initialized the number of categories, the parameters of the beta distribution, and the moves on these parameters.

Next, we set the categories to each represent a quadrant of the beta distribution specified by beta_scale.

cats := fnDiscretizeBeta(beta_scale, beta_scale, num_cats)

If you were to print the cats variable, you would see a list of state frequencies like so:

[ 0.011, 0.236, 0.764, 0.989 ]

Using these state frequencies, we will generate a new vector of Q-matrices. Because we are varying the state frequencies, we must use a Q-matrix generation function that allows for state frequencies to vary as a parameter. We will, therefore, use the fnF81 function.

for (i in 1:cats.size())
{
    Q[i] := fnF81(simplex(abs(1-cats[i]), cats[i]))
}

Additionally, in RevBayes we need to specify the probabilities that a site evolves according to one of the Q-matrices. For this model the probabilities must be equal because we need to guarantee that the model is symmetric. Thus, we use a simplex function to create a vector that sums to 1.0.

matrix_probs <- simplex( rep(1,num_cats) )

Questions

  1. Look at the output in tracer. There are several variables added to the trace. What are they and what do they mean?

  2. What do you think this model means for the underlying biology we’re trying to model?

  3. Try summarizing the tree and viewing in Fig Tree. Does it differ from the JC model tree?