Statsblogs.com

Hierarchical models for phylogeny: Here’s what everyone’s talking about

2016-02-14

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

The other day on the Stan users list, we had a long discussion on hierarchical models in phylogeny that I thought might be of general interest, so I’m reconstructing it here.

It started with this question from Ben Lambert:

I am hoping that you can help me settle a debate.

My collaborators and I have data for experiments structured by the following categories (from top to bottom): genus -> species -> individual time series.

I believe that the best way to approach this is to use a hierarchical model which has 3 levels; one for each of the categories. However, my collaborators (entomologists) argue that the different species within a particular genus are so incredibly different (they use the analogy that the species within a particular genus are more different than say, lions and elephants, at a genetic level), that it does not make any sense to group them in any way. Furthermore, they argue that any ‘genus-level’ parameters that are estimated would be meaningless biologically, since they are averages across a range of very heterogeneous entities.

I definitely do see their point, but can’t help thinking that species are categorised within a particular genus for a reason; some sort of similarity. I agree with them that the overall parameters probably won’t mean much. However, I always default to using hierarchies whenever I can, due to all the benefits (reduced variance, less overfit, more parsimony etc).

I suspect it won’t make all that much difference to the results (some preliminary analysis results hint at this), but wanted to see what others thought of this. Do you think that this model should be a three-level hierarchy, or independent analyses consisting of a two-level hierarchy within each genus?

I replied that if you use a grouping factor that’s irrelevant, then it shouldn’t matter much in the analysis. That is, you could include a variance component for genus, and it shouldn’t really hurt you if genus doesn’t really matter for your purposes. So you could include it, or you could do the analysis both ways and it probably won’t matter. But the one thing you wouldn’t want to do, I think, is a separate two-level model for each genus. There’s information to be shared between genuses. Especially if genus doesn’t matter—then you’d really want to be combining across them.

But let’s step back for a moment. Why do we do hierarchical models? Why include grouping in the analysis? Because if we don’t have enough local data (i.e., if we have noisy time series, in this example), we want to do partial pooling to get better, more reasonable estimates. So . . . if your colleagues think that partial pooling across species within genus is a waste of time, then that’s fine. But then maybe there’s another model that could be fit, using some other characteristics of the species. To the extent that you can group the species a priori into reasonable categories, or to the extent that you can construct good species-level predictors, your partial pooling will be more effective.
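To make the partial-pooling idea concrete, here is a minimal sketch in Python. It assumes a normal model in which the within-group variance (`sigma2`), the between-group variance (`tau2`), and the overall mean (`mu`) are all known. In a real analysis these would be estimated along with everything else. The function name and the numbers are hypothetical and are not taken from Lambert's data.

```python
def partial_pool(group_means, group_sizes, sigma2, tau2, mu):
    """Precision-weighted compromise between each group's own mean and mu.

    Assumes y_ij ~ Normal(theta_j, sigma2) and theta_j ~ Normal(mu, tau2),
    with sigma2, tau2 and mu treated as known (an assumption of this sketch).
    """
    estimates = []
    for ybar, n in zip(group_means, group_sizes):
        w_data = n / sigma2    # precision of the group's sample mean
        w_prior = 1.0 / tau2   # precision of the group-level distribution
        estimates.append((w_data * ybar + w_prior * mu) / (w_data + w_prior))
    return estimates


# Hypothetical example: one well-measured group, one with a single noisy observation.
pooled = partial_pool(group_means=[10.0, 2.0], group_sizes=[100, 1],
                      sigma2=4.0, tau2=1.0, mu=5.0)
print(pooled)
```

The group with 100 observations stays close to its own mean, while the group with a single observation is pulled most of the way toward `mu`. That is the sense in which noisy local data borrow strength from the rest.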

And three biologists responded along the same lines, but with more specifics.

Josh Rosenau:

If this is a well-understood taxonomic group, then the genus ought to consist of species that are all more closely related to one another than to any other species, as the members of each species are more closely related to one another than they are to any other species. That would tend to argue for treating it hierarchically. OTOH, many groups have not been revised thoroughly and the taxonomic structure may not reflect the phylogeny as well as one might like, which could argue against it in some situations. On the third hand, not all biologists have really embraced cladistic taxonomy (naming groups based on relatedness) and the importance of controlling for the effects of shared ancestry in such analyses. In that case, you either fight with the biologists or defer to their specialized knowledge.

If it’s a particularly speciose group with lots of phylogenetic structure, and there are good published estimates of phylogenetic distance, it might make sense to incorporate that into your model. There are a few common approaches to phylogenetically-informed analysis which should be adaptable to your model.

In terms of interpreting the parameter estimates at the genus level, in an analysis fully incorporating the effects of phylogeny, the genus-level estimate ought to be something like an estimate of the common ancestor’s character state. Whether such a thing is meaningful in this instance I couldn’t say, but in general it seems like it should be. In an analysis that doesn’t incorporate phylogenetic distance (i.e. just the taxonomic levels), the interpretation would indeed be unclear.

Lizzie Wolkovich:

In the past two weeks I have had two such similar debates. I have two conflicting views—on the first side I find it interesting that your colleagues are fully behind the species concept but seem to think genus is completely irrelevant. Though I can understand that view a little I suspect the systematists who came up with the arthropods’ genera may be offended by this (or for this particular group it could well be that the genus level classifications are a mess and poorly done currently). I wonder if there is some other taxonomic level they would be comfortable with such as family. Lions and elephants are actually pretty similar once you start comparing them with komodo dragons and earthworms, for example, and even I can tell a beetle from an insect. [Are beetles not a type of insect?? I had no idea. — ed.]

Then to echo Andrew and go in the other direction, if your colleagues really think genus is irrelevant, or worse, poorly defined for these groups just now such that species A is in genus 1 but really belongs in genus 2 then you would be pooling in a way that could give you less biologically accurate estimates. In line with Andrew I would wonder if there are other characteristics you could group by.

In my own experiences we sometimes (1) skip all taxonomy above species and actually don’t partially pool species together even because we don’t want one species influencing the other too much (and, in last week’s example we realized we had missing data for some species that varied by treatment which would make the pooling effect especially strong for some species with the most unique treatment effects), (2) use distances from a phylogenetic tree instead of categorical levels (but then I echo that phylogenies are their own balls of wax to start batting at and evaluating) and (3) worry about how species can often be confounded with other effects, like site or who did the study. It sounds like from the below you might be in the realm of (3) perhaps.

Simon Blomberg:

The modern way to compare species is to incorporate the evolutionary relationships as a prior covariance matrix in a mixed-effects model, either on random effects associated with species (G level) or residuals (R level). Species are real entities (distinct, non-interbreeding populations) but genera are not: they are subjective artificial classifications built by taxonomists that may or may not have any basis in biology. The solution is to work with the evolutionary relationships (the phylogeny) directly, and forget about the biological classification.

Lambert had some concerns:

My fear from pooling information across genuses is that you are really comparing apples with pears. Even if the data look similar (imagine that we are doing mark-release-recapture for elephants and insects, say), does it make sense to share information in this context?

I agree it probably won’t hurt to group by genus ultimately, but I fear I may run into problems trying to justify it biologically/ecologically.

In my case the measurements were taken over very different geographies, and it is likely that the organisms (here mosquitoes) across species/genuses, probably do not respond that similarly to environmental conditions. Hence, I suspect that it may not make sense to pool information?

Bob Carpenter interpolated to ask if anyone’s been using Gaussian processes in this area, and he got two responses.

Maxwell Joseph:

I have toyed around with Gaussian processes with phylogenetic distance as an input in order to allow for correlation among species as it relates to evolutionary history. For prediction, these GP models are outperforming models with hierarchical structure based only on taxonomy (genus, family, order, etc.). Conveniently, previous work (e.g., Hansen and Martins 1996) points to correlation functions with mechanistic support. I haven’t been accounting for uncertainty in the phylogeny, however, though such approaches do exist (e.g., de Villemereuil et al. 2012).

Simon Blomberg:

Gaussian processes are currently the standard approach to analysing cross-species data. Essentially, you need a model of evolution for your traits. The most common models are Brownian motion and the Ornstein-Uhlenbeck process (both Gaussian). In both cases there is a simple relationship between the branch lengths on the phylogeny and the covariance structure for the data. I don’t know about mixture models in this context.
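The "simple relationship" between branch lengths and covariance can be sketched in a few lines of Python. The sketch rests on two assumptions. Under Brownian motion, the covariance between two tips is the rate `sigma2` times the branch length they share from the root. Under a stationary Ornstein-Uhlenbeck process on an ultrametric tree, it is `sigma2 / (2 * alpha)` times `exp(-alpha * d)`, where `d` is the tip-to-tip path length. The three-species tree and all function names are hypothetical.

```python
import math

# Hypothetical ultrametric tree ((A:1, B:1):2, C:3), stored as the list of
# (branch_id, branch_length) pairs on the path from the root to each tip.
PATHS = {
    "A": [("AB", 2.0), ("A", 1.0)],
    "B": [("AB", 2.0), ("B", 1.0)],
    "C": [("C", 3.0)],
}


def shared_length(path_i, path_j):
    """Total length of the branches that both root-to-tip paths have in common."""
    ids_j = {branch_id for branch_id, _ in path_j}
    return sum(length for branch_id, length in path_i if branch_id in ids_j)


def depth(path):
    """Root-to-tip distance."""
    return sum(length for _, length in path)


def bm_cov(paths, sigma2):
    """Brownian motion: covariance = sigma2 * shared branch length."""
    tips = sorted(paths)
    return [[sigma2 * shared_length(paths[i], paths[j]) for j in tips] for i in tips]


def ou_cov(paths, sigma2, alpha):
    """Stationary OU: covariance decays exponentially with tip-to-tip distance."""
    tips = sorted(paths)
    cov = []
    for i in tips:
        row = []
        for j in tips:
            d = depth(paths[i]) + depth(paths[j]) - 2 * shared_length(paths[i], paths[j])
            row.append(sigma2 / (2 * alpha) * math.exp(-alpha * d))
        cov.append(row)
    return cov


print(bm_cov(PATHS, sigma2=1.0))
print(ou_cov(PATHS, sigma2=1.0, alpha=0.5))
```

Either matrix could then serve as the prior covariance of species-level effects in a multivariate normal. Under Brownian motion, species whose only common ancestor is the root are uncorrelated; under OU, the correlation fades with distance instead of dropping to zero.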

And Michael Betancourt wrote in to emphasize the value of hierarchical modeling from a purely predictive perspective:

Hierarchical models in absolutely no way imply a causal structure, so a phylogenetic relation is absolutely not necessary. It may be helpful, but in many cases it may not. For example, consider some model that depends on vision: if you assume that the eye evolved independently, then species in very different branches of phylogenetic trees may have similar responses to vision-based observables, which motivates an entirely different hierarchical structure.

The hierarchy represents similarities in the observations and consequently can only be justified as such. Even if genera are artificial classifications they can imply good hierarchical structure if the classification strategies consider observable behaviors. And phylogenetic clustering requires knowing the correct phylogenetic tree which is itself a huge uncertain problem.

And then Blomberg looped back in to connect these ideas back to the underlying biological models:

A phylogenetic model explicitly accounts for non-independence in the data due to common ancestry. Organisms can be similar due to common ancestry OR convergence. If convergence is your hypothesis, there are methods for testing for that, although how to do it correctly is a topic of current research.

If the genera are monophyletic, then you (Betancourt) may not be far wrong. But it completely ignores the intra-generic lack of independence, and it also treats each genus equally. Some genera will be more closely related to each other than others. Let’s be clear. You cannot avoid making a phylogenetic assumption in these kinds of models. The taxonomic approach implies a certain correlation structure for the data: All genera are monophyletic and assumed to arise at the same time, and independently of each other, and all species within a genus are assumed to arise at the same time and independently of each other. We know that this assumption is always wrong. Why build models which we know to be wrong on such an important aspect of the data? The only objective way to correctly model the phylogenetic non-independence is to use the phylogeny. Now, all models are wrong etc. But why be deliberately wrong by using a taxonomic hierarchical structure?

Bayesian methods are ideal for incorporating phylogenetic uncertainty. We have methods for that.
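The correlation structure that a taxonomic hierarchy implies can be written down directly. A minimal Python sketch, assuming the usual nested random-intercept model (species effect = genus effect + species-within-genus effect, with variances `tau2_genus` and `tau2_species`), follows. The species and genus labels are hypothetical placeholders.

```python
def taxonomy_cov(genus_of, tau2_genus, tau2_species):
    """Marginal covariance of species effects under a genus -> species hierarchy.

    Two species share covariance tau2_genus if and only if they are in the same
    genus; each species adds its own tau2_species on the diagonal.
    """
    species = sorted(genus_of)
    cov = []
    for a in species:
        row = []
        for b in species:
            value = 0.0
            if genus_of[a] == genus_of[b]:
                value += tau2_genus      # shared genus-level effect
            if a == b:
                value += tau2_species    # species-specific effect
            row.append(value)
        cov.append(row)
    return species, cov


# Hypothetical labels: two species in one genus, one species in another.
labels, cov = taxonomy_cov({"sp1": "genus1", "sp2": "genus1", "sp3": "genus2"},
                           tau2_genus=2.0, tau2_species=1.0)
for name, row in zip(labels, cov):
    print(name, row)
```

Every same-genus pair gets the same covariance and every cross-genus pair gets zero. This is the "all arise at the same time and independently" assumption in matrix form; a phylogeny with unequal branch lengths would instead give each pair its own entry.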

Now here’s Betancourt again:

And even if there are no similarities between the groups the hierarchical model will learn that and shouldn’t penalize you that much.

Blomberg:

It may be that the phylogeny might not add much to the analysis. But that is always an empirical question and not an argument for not using a phylogenetic model when you suspect that lack of independence due to phylogenetic relationships may be a problem. You can’t really know that in advance.

I’m not arguing against a hierarchical structure to the model. I’m just emphasising that using the taxonomy is the wrong way to do it. . . .

You could, for example classify species into ecological guilds (a guild is a group of possibly unrelated species that inhabit similar niches). But that doesn’t get around the problem of phylogenetic lack of independence in the data.

Not using any phylogenetic information at all, treating species as IID is perhaps the worst thing you could do. There is a strong analogy between time series and spatial data here. We would not treat time series as IID (no temporal autocorrelation), or spatial data without considering spatial autocorrelation. We should have the same respect for cross-species data. We should incorporate phylogenetic information on cross-species correlations, or at least entertain the idea that phylogenetic covariance could be a problem. There are some situations in which phylogenetic information could conceivably be ignored. For the time series analogy, we might pretend that our data are IID if the time series is short. Or for spatial data where the data are very far apart in space, making the IID assumption more plausible. So for small data sets or data sets where the phylogenetic covariance is thought to be extremely small, e.g. comparing phyla, it is possible that phylogenetic effects could be neglected. But that is an empirical question. . . .

At this point, Betancourt shot back:

You are missing the point entirely. As I said before, a phylogenetic model is a causal model—but the observational correlations between species need not be causal. Hierarchical structure is statistical so it doesn’t need nor really care about whether any correlations are causal, which is why they are so amazingly powerful and widely applicable.

And, once again, there is no real cost of adding hierarchical structure that’s not there other than increased computation (the hierarchical model will converge to an IID model if necessitated by the data). Another reason why they are so awesome. . . .

We might not care a lick about the phylogenetic structure! Taxonomic structure can capture useful correlations; if you think otherwise, then criticize the particular choice of taxonomy relative to a given observational model, but blanket criticizing taxonomies is equivalent to blanket criticizing hierarchical models in general, which are in general not causal relationships.

Can phylogenetics provide useful motivation for building statistical models? Absolutely. Are they necessary and sufficient for all statistical analyses? No way.

Again, we’re not talking about phylogenetics. The point is that even if you don’t like a chosen taxonomic structure then you don’t have to worry about the fit because the hierarchical model will learn the independence of the groups.
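The claim that a hierarchical model adapts to the data can be illustrated with a rough method-of-moments sketch in Python. This is not a full Bayesian fit and not anything run in the thread. It assumes equal-sized groups with known within-group variance `sigma2`; the function name and the inputs are hypothetical.

```python
from statistics import mean, variance


def shrink_toward_grand_mean(group_means, sigma2, n):
    """Method-of-moments partial pooling for equal-sized groups.

    sigma2 is the within-group variance (assumed known) and n the group size.
    """
    grand = mean(group_means)
    sampling_var = sigma2 / n
    # Observed spread of group means = true spread (tau2) + sampling noise.
    tau2_hat = max(0.0, variance(group_means) - sampling_var)
    # Weight placed on each group's own mean (0 = complete pooling).
    weight = tau2_hat / (tau2_hat + sampling_var)
    pooled = [grand + weight * (m - grand) for m in group_means]
    return tau2_hat, weight, pooled


# Groups that differ by no more than sampling noise: collapse to one common mean.
print(shrink_toward_grand_mean([5.1, 4.9, 5.0, 5.0], sigma2=4.0, n=4))

# Groups that clearly differ: estimates stay close to the separate group means.
print(shrink_toward_grand_mean([0.0, 10.0, 20.0, 30.0], sigma2=4.0, n=4))
```

In the first call the estimated between-group variance is zero, so the grouping is effectively ignored. In the second it is large, so almost no pooling occurs. In both cases the amount of pooling is set by the data and not by the analyst's choice of grouping.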

To connect back to the models, Betancourt wrote:

Yes, Bayesian methods are ideal. Or they will be once we have any idea how to effectively explore and sample from tree spaces with corresponding guarantees/validation methods/diagnostics that we can represent the true posterior uncertainty with any fidelity. Topological real talk.

Blomberg replied:

I disagree that phylogenetic models are necessarily causal models. It still makes sense to use the phylogeny as a hypothesis about covariance among species even when there is no notion of a variable having evolved along a tree. It’s the lack of independence in the data that is what is being modelled here. A hierarchical modelling approach is completely appropriate. It’s the structure of the hierarchy that is the issue. Taxonomy doesn’t cut it.

I agree that hierarchical models are good, awesome, whatever. And the model will converge to an IID model if there is really no phylogenetic “signal” in the data. But again it’s an empirical question. When dealing with cross-species data, your baseline assumption (your prior) should be that there is lack of independence in the data due to phylogenetic effects.

But you should care about the phylogenetic structure! You should care for the same reason that you would care about temporal autocorrelation in time series, spatial autocorrelation in spatial data, pedigree information in genetic models etc. The phylogeny is the only way to incorporate that information in cross-species data. I am criticising taxonomies in general because a) they don’t represent anything like phylogenetic hierarchical structure (to the extent that taxonomies are useful, it is only because of some, perhaps accidental, similarities to the underlying phylogeny) and b) they are not objective. I am not criticising hierarchical models in general. I use them all the time, and I think Bayes is the best way to implement these models. It’s just the structure of the hierarchy that I am arguing should be based on the phylogeny.

Can phylogenetics provide useful motivation for building statistical models? Absolutely. Are they necessary and sufficient for all statistical analyses? No way.

In other parts of the thread I have alluded to situations in which the phylogeny may not be useful. But I still maintain that your prior on the covariance structure of the data should be based on the phylogeny! Any other approach is a) wrong a priori and b) subjective. Are they necessary? No, for the reasons I have mentioned elsewhere in the thread. Are they sufficient? No, because there are other substantive questions that we are interested in about our data. The model should “learn” whether that phylogenetic prior has any relevance to the posterior parameter estimates. That’s great. But the model should be given the chance to learn that!

Back to Betancourt:

a) Taxonomical structure may be based on previous observations that may be compatible with new measurements, hence a quite good motivation for a hierarchical prior.

b) Even if the hierarchical structure is chosen poorly the model will adapt and inferences will largely remain valid.

c) Even if there is some “objective” phylogenetic structure, it need not manifest in the observables and hence need not be relevant to a hierarchical model.

d) Known phylogenetic structures depend on data models, models which are built out of assumptions, assumptions which are in no way “objective”.

e) On top of that, even state-of-the-art phylogenetic MCMC methods are extremely limited. So even if there was an “objective” model we wouldn’t be able to use it to construct the necessary inferences to pick out the corresponding “objective” phylogenetic trees compatible with the data.

So the statements that “phylogenetic trees are objective and known a priori” and “always the correct hierarchical structure” are both incorrect. In a real problem, using neither taxonomical structure nor phylogenetic structure will lead to poor inferences. One may certainly be better, but which is better depends very strongly on the details of a given model and hence no approach can be determined “correct” for all models.

Again, the original question was not “should I use a taxonomical hierarchy or phylogenetic hierarchy” but rather “will using a taxonomical hierarchy lead to poor inferences.” The answer to the latter is absolutely not. End of answer.

Last word on this from Blomberg:

Modern classifications are based on data. And most often they are now built to reflect some aspect of phylogenies. But the Linnean hierarchy cannot accurately reflect phylogeny. Life just doesn’t evolve according to the Linnean hierarchy. The Linnean hierarchy is a pre-Darwinian human construct that can only be an imperfect representation of evolutionary history. Estimates of phylogenies are also an imperfect representation of evolutionary history (a tentative model). But if they are based on good data and reasonable assumptions made explicit at every step of the analysis, and appropriate diagnostics are used and sensitivity to assumptions is examined, then it makes sense to use this information as a better tentative description of reality and use it as information informing further analyses.

Taxonomies work less well than known phylogenies in simulations (by frequentist criteria). Inferences using a “mostly correct” phylogeny have better properties than using a bad classification. Inferences based on a prior set of highly probable trees also have good frequentist properties.

There may be no “phylogenetic signal” in the data, this is true. But that is an empirical question dependent on the observables and the phylogeny. The phylogeny may not be relevant to the particular model, but then a taxonomic model will not either. There will be times when it does matter. For that you want the best estimates of among-species covariances that you can get. From the phylogeny.

Yes, phylogenies are estimates based on models. And they are almost always wrong in some regard. This is partly why modern phylogenetic comparative methods try to account for phylogenetic uncertainty. Systematists routinely publish their data sets and try different ways of analysing them to try to get the most robust estimate of the phylogeny. Methods sections in papers are usually very detailed and explicit about models and assumptions. This is becoming even more so with the new push for Open Science. And more data are always welcome. The genomics revolution has meant that we are getting better and better at estimating phylogenies. Science is built on these incremental advances. To say that the models and assumptions are not objective is not to criticise phylogenetic models. It is to criticise all of scientific practice. There is subjective choice in models and assumptions. But that is nowhere near the degree of subjectivism and assumption ladenness inherent in making up a Linnean hierarchy for a given set of taxa and then using it to model the hierarchical structure and consequent lack of independence in the data.

Phylogenetics is hard. It is NP-hard. ALL our methods are based on heuristics (assumptions) that seem to work in practice but we generally have no idea whether the best tree we have is the true course of evolution that really happened. Maybe another heuristic could find an even better tree. I don’t see this as a problem, just part of the scientific process. All statements about reality are tentative, until something better comes along.

We know from simulations how consistent the estimating methods are as the amount of data increases, under different conditions. This research has been going on since the “likelihood” versus “parsimony” wars of the 1970s. Now real data are more messy than simulated data, no question. But we are working on that! And again, there is One True Objective Phylogeny which we are trying to estimate: the real process of evolution that actually happened. That is the benchmark we are trying to achieve.

We are generally not trying to pick the corresponding “objective” tree that is compatible with the observed data. Trees just represent phylogenetic covariance. Some trees will necessarily fit the data better than others, and that may not be the true tree. But our best guess at the covariances (the prior) should come from our best guess at the true phylogeny.

Phylogenetic trees are not known a priori. They are estimates (statements, hypotheses, models) of the true pattern of evolution. Well-supported phylogenies aren’t necessarily the true pattern of evolution. But they are our current best guess. It’s not “always the correct hierarchical structure”. But I argue that it is our best least wrong guess, given our current knowledge of the study species. Further data may change the phylogeny. That’s OK. That’s science. It may mean that the analysis will have to be re-visited.

But if your starting point is to use the Linnean hierarchy, my view is that you are quite possibly shooting yourself in the foot. A good phylogeny is “least wrong” as I said above. I’m going to stick my neck out here and say that other researchers in my field agree. There is about 30 years of research, thinking about and analysing multi-species data sets to find ways to best account for the hierarchical nature of multi-species data. I really think we have progressed, and one of those milestones was to ditch the use of the Linnean hierarchy.

If I was to bet money (and as a statistician, I never do), I would bet on a good phylogeny over a Linnean hierarchy any time.

The funny thing is, from the tone of this discussion, it looks like Blomberg and Betancourt are having a big argument. But after reading more carefully, I think they’re basically in agreement:

– Hierarchical modeling is a good way to account for partial information in this sort of predictive setting.

– Hierarchical models are most effective when used in the context of substantive knowledge.

The interaction between statistical modeling and substantive concerns in this discussion is fascinating. This one’s not as important as football and elections, but I’ve heard that some people care about it nonetheless.

The post Hierarchical models for phylogeny: Here’s what everyone’s talking about appeared first on Statistical Modeling, Causal Inference, and Social Science.
