The problems with Maximum Likelihood (ML) and Bayesian analysis in Systematics.

Started by Kai, June 10, 2010, 05:19:48 PM

Kai

Written last night at 1 am in the spirit and edited this morning.

I've recently been called an overanalyzing, myopic stick-in-the-mud who can't get with the times, because I get upset when I see ML and Bayesian analysis used in systematics papers, presentations and posters. The point of this essay is to explain why.

In systematics, we infer relationships between taxa (groups), whether those groups be species or some higher groupings, by what is variously called homology, synapomorphy, or synapotypy. These words refer to a single concept: a specially shared character, an objective intrinsic aspect of a group of organisms that is shared by that group and not by other groups. Synapomorphies are hypotheses of origin. They stand opposed to symplesiomorphies, or simply plesiomorphies, which are more general characters shared by a larger array of species than the group in question. Synapomorphies are the signal by which a systematist can build relationships. The guiding principle, as in all science, is parsimony. Parsimony, also known as Occam's Razor, holds that the best explanation is the one with the fewest assumptions, which reduces the variables involved and allows for a more confident test of hypotheses. In systematics this equates to reducing the number of character changes to the fewest possible, rather than using wild guesses of character evolution that do not follow the simplest explanation based on the data at hand. This allows for scientific, trackable, empirical, and /reciprocally testable/ hypotheses of relationships.

So, in systematic analysis, we use the principle of parsimony to group species or higher taxa based upon their specially shared characters, in comparison and polarization with outgroup species that do not possess the specially shared characters of the group in question. These can then be visualized as nested groupings in a branching, tree-like diagram, which is a hypothesis of all of the relationships between the groups in question. The test of these relationships is the consistency of all the characters with the topology of the tree (as opposed to nonpattern) and the congruence of newly discovered characters with the ones already present. If there is a small number of species, and fairly straightforward patterns of synapomorphy, all of this can be done by hand. If there are more than 12 species or taxa involved, or if there is lots of noise or nonpattern (sometimes called homoplasy), then computer algorithms can be used to swap branches of the tree in many combinations to find the tree with the shortest length, that is, with the fewest changes between character states (from plesiomorphic to synapomorphic conditions, or "reversing"). As you can see, parsimony is not an algorithm or a model, but simply a limiting of assumptions for a more reciprocally testable result.
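To make "tree length" concrete, here is a minimal sketch that counts the fewest character-state changes a given tree requires. It uses Fitch's counting method for unordered characters, which is one standard way to do this and not necessarily what any particular software package runs. The nested-tuple tree format, the function names, and the data matrix are all made up for illustration.

```python
# Hypothetical representation: a tree is a nested 2-tuple, a leaf is a taxon name.
def fitch_steps(tree, states):
    """Return (possible_states, changes) for one character on a rooted binary tree."""
    if isinstance(tree, str):              # leaf: its observed state, no changes yet
        return {states[tree]}, 0
    left, right = tree
    lset, lsteps = fitch_steps(left, states)
    rset, rsteps = fitch_steps(right, states)
    shared = lset & rset
    if shared:                             # the two branches agree: no extra change
        return shared, lsteps + rsteps
    return lset | rset, lsteps + rsteps + 1  # disagreement costs one change


def tree_length(tree, matrix):
    """Sum the changes over all characters; matrix maps taxon -> list of states."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_steps(tree, {taxon: row[i] for taxon, row in matrix.items()})[1]
        for i in range(n_chars)
    )


# Made-up matrix: 0 = plesiomorphic state, 1 = apomorphic state
matrix = {
    "outgroup": [0, 0, 0],
    "A": [1, 0, 0],
    "B": [1, 1, 0],
    "C": [1, 1, 1],
}
tree_good = ("outgroup", ("A", ("B", "C")))
tree_bad = ("outgroup", ("B", ("A", "C")))

print(tree_length(tree_good, matrix))  # 3
print(tree_length(tree_bad, matrix))   # 4
```

The first tree needs one change per character, while the second needs an extra change for the middle character, so the first tree is the shorter and preferred hypothesis. Branch-swapping programs repeat this kind of count over many candidate trees.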

With the onset of DNA sequencing, many people have taken to using these newly discovered characters for inference of relationships. This is great: the more characters the better, especially if there are clear homologies in the sequence. They can be analyzed by the same method as morphological characters, except that they require some sequence alignment beforehand. However, a great deal of excitement on some people's part has led to several methods used in population genetics being applied widely to phylogenetic inference, and while they are quite good methods for the population geneticist, they are quite poor methods within the context of parsimony, synapomorphy and plesiomorphy.

Maximum Likelihood (ML) is a method of sequence comparison and analysis that uses any of countless models of gene evolution to apply weighting to different parts of the sequence, giving those portions higher priority when it comes to signal. In population genetics of a single species, where genealogy is well established, and multiple models can be tested repeatedly to determine actual modes of sequence mutation and evolution, ML can be very useful in determining gene flow, genetic drift, diversity, immigration and emigration, dispersal, and other factors within and between populations of a single species.

However, in systematics, where ML is often applied over a wide range of taxa, genealogies are completely unknown, as are the relationships between species. Models of evolution are variously employed with no explicit reference and no reason or rationale except the observation of pattern, and whether the groupings are based on symplesiomorphy or synapomorphy is completely unknown, removing the whole concept of homology, which is central to our understanding of systematic relationships and our ability to test them. The whole process of analysis becomes a giant black box of assumptions where it is unknown if the resulting tree represents natural groupings or polyphyletic groupings based on general characters (as if you put fleas, lobsters and lice together in a group because they lack wings; this wingless group is based on the absence of a character, which is actually a plesiomorphy or a reversal, and is not a natural group of organisms). This renders the results completely useless as a test of hypotheses, and they end up being a just-so story. Even if the models of evolution are explicitly stated, they are an untested assumption with no evidence within the group in question. It becomes the luck of the draw whether one actually comes up with a tree based on specially shared characters rather than more general characters.

Overall, ML is not suitable for systematics, due to multiple untested assumptions, often not explicit, and its lack of reference to homology and hypotheses of specially shared characters. Using it in this manner is not science, and leads to conclusions that are not reciprocally testable.

Bayesian analysis is a clustering algorithm often used in statistical analyses and sometimes used in population genetics. It takes data unit groups and compares them based on overall similarity, giving the amount of similarity a number. Higher similarity gives higher numbers, and those units with the highest overall similarity are clustered together. Clustering algorithms are used in many sciences, and can actually be utilized in systematics for the delineation of species; percent difference of gene sequences can be a useful tool for uncovering cryptic species and, when coupled with careful studies of morphology, can be effective in resolving these species complexes. Bayesian analysis can also be used to associate adults and larvae of the same species when one or the other is of unknown determination, and gene sequences, especially of the mitochondrial gene Cytochrome Oxidase I (COI), have been used in wholesale identification and referencing projects such as the International Barcode of Life (iBOL) and the Barcode of Life Data System (BOLD).
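The "percent difference of gene sequences" mentioned above can be sketched as a simple count of mismatched sites between two aligned sequences. This is only the raw pairwise comparison, not a statistical model, and the function name and the short sequences are invented for illustration (they are not real COI data). No species-level cutoff is assumed here.

```python
def percent_difference(seq_a, seq_b):
    """Percent of aligned sites that differ, skipping sites with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":       # alignment gap: site is not comparable
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return 100.0 * differing / compared


# Made-up 20-bp aligned fragments
larva = "ATGGCACTATTAGGAGATGA"
adult_1 = "ATGGCACTATTAGGAGATGA"
adult_2 = "ATGGCTCTTTTAGGTGACGA"

print(percent_difference(larva, adult_1))  # 0.0
print(percent_difference(larva, adult_2))  # 20.0
```

In this toy case the larva matches the first adult exactly and differs from the second at 4 of 20 sites, which is the kind of evidence used to associate life stages. Note that the number says nothing about which sites differ or whether the shared bases are plesiomorphic or synapomorphic.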

In phylogenetic inference, the data unit groups become sequences and the comparison yields the overall number of base pairs shared between the groups, clustering them by overall similarity. While this lacks the untestable assumptions of ML, it has the same problem of lacking reference to homology. With Bayesian analysis, sequence lengths are not analyzed individually for specially shared characters, but are simply grouped together and given a number. There is no reference to an outgroup to establish character polarity, and it is impossible to know what portions of the sequence the quantities of overall similarity are referring to. Therefore, it is impossible to go back and check conclusions for plesiomorphy and synapomorphy. It becomes completely unknown whether the resulting tree diagram reflects the true relationships or is a spattering of unnatural groups, and it is impossible to go back and reference the characters to test these conclusions.

Using overall similarity to establish a system of classification has a history in systematics, known as phenetics. The original pheneticists believed that true phylogenetic classification was nearly impossible and therefore did not include it in their classifications. Instead, they grouped their taxa based on overall similarity, using characters (such as pigmentation) that are known today to often have a basis in ecological similarity but not phylogenetic history. Bayesian analysis is the final holdout of pheneticists within systematics, and it is interesting that the very methods previously used with complete disregard for inferring phylogeny are now being used TO infer phylogeny. This method lacks any reference to homology, plesiomorphy, synapomorphy, polarization, reciprocal illumination or any other important steps in modern phylogenetics. It yields completely untestable hypotheses of relationships, without reference to individual characters, based on overall similarity or "distance" methods that have long been discredited. Overall, it is completely unsuitable for modern phylogenetic inference, and is in my opinion quite damaging when it is accepted as a valid technique.

So, I have explained my reasons for being upset with ML and Bayesian analysis when applied to inferring phylogenies. I understand they have some or much utility in other biological sciences, and even important uses in alpha taxonomy (species description). But due to the nature of information about relationships between species and how they are properly tested, using ML or Bayesian analysis to infer phylogenies, rather than bringing us closer to understanding evolutionary relationships, instead sets us back.
If there is magic on this planet, it is contained in water. --Loren Eiseley, The Immense Journey

Her Royal Majesty's Chief of Insect Genitalia Dissection
Grand Visser of the Six Legged Class
Chanticleer of the Holometabola Clade Church, Diptera Parish

LMNO


Kai

Quote from: LMNO on June 10, 2010, 06:21:26 PM
I think I actually understood that.  Cool.

I tried my best to define terms that I often use without thinking.

One of the really cool things about systematics is that while the subjects range across the entire diversity of life, and the different computer software used to work with large data sets can be difficult, the basic concepts, rationales and non-computer methods are quite simple and easy to understand. Philosophically, the basic concepts of species, homology, and the rest are extremely complex, but a complete understanding isn't required for a layperson to get what's going on. If you understand what I said above, you could, for example, look at any parsimony tree and, with a little information on how the data are presented, understand what is going on. For example, plesiomorphies are often marked by open circles and synapomorphies by dark circles. The more open circles, the more non-pattern and the less consistent the character data are with the topology of the tree. Characters are numbered above the circles, and the character states (usually just 0 for plesiomorphy and 1 for synapomorphy, but sometimes there are multiple states, so there can be 2, 3, etc.) are below the circles, so you can track the changes of any character down the tree.
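One common way to put a number on how consistent a single character is with a tree is the consistency index: the fewest changes the character could possibly need (number of states minus one) divided by the changes it actually needs on the tree. The sketch below is an illustration under assumptions, not the output of any real program; the tuple tree format, function names, and character codings are hypothetical, and changes are counted with Fitch's method for unordered characters.

```python
def fitch_steps(tree, states):
    """Return (possible_states, changes) for one character on a rooted binary tree."""
    if isinstance(tree, str):              # leaf: observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    lset, lsteps = fitch_steps(left, states)
    rset, rsteps = fitch_steps(right, states)
    shared = lset & rset
    if shared:
        return shared, lsteps + rsteps
    return lset | rset, lsteps + rsteps + 1


def consistency_index(tree, states):
    """Minimum conceivable changes divided by the changes required on this tree."""
    minimum = len(set(states.values())) - 1
    observed = fitch_steps(tree, states)[1]
    if observed == 0:
        raise ValueError("character does not vary; index is undefined")
    return minimum / observed


# Made-up tree and two made-up characters (0 = plesiomorphic, 1 = apomorphic)
tree = ("outgroup", (("A", "B"), ("C", "D")))
clean = {"outgroup": 0, "A": 1, "B": 1, "C": 0, "D": 0}
noisy = {"outgroup": 0, "A": 1, "B": 0, "C": 1, "D": 0}

print(consistency_index(tree, clean))  # 1.0
print(consistency_index(tree, noisy))  # 0.5
```

The first character changes once and marks the group A + B cleanly, so its index is 1.0. The second has to change twice on this tree, which is the kind of non-pattern that shows up as extra marks when characters are plotted on the branches.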

I think the reason people feel it is so esoteric is because of the philosophical bases, and the software that is often used. But a valid method is to take insect genitalia drawings and spread them out over a table, and then start sticking them in nested groups based on observed synapomorphy. My advisor does it this way, then just codes a whole bunch of characters and runs it through the software to check.

I remember talking to a colleague about mapping ecological characters to a phylogeny, to check for, say, habitat distributional congruence with the phylogeny. She asked how you would do that, and it was funny because the process is so incredibly simple: you just take the inferred phylogeny, stick the characters next to the names on the tree, and then look at whether there are any groups that have these ecologies as specially shared. She said, "That's really all there is to it?". Yeah, that's really all there is to it.
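The mapping step described above can be sketched in a few lines: attach a trait to each name on an already inferred tree, then ask which trait values are held by exactly the members of one group on the tree and by nobody else. The tree, taxon names, habitats, and function names below are all hypothetical, and the tree is assumed to be given as nested tuples.

```python
def leaves(tree):
    """Set of taxon names under a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every internal node, root included."""
    if isinstance(tree, str):
        return
    yield frozenset(leaves(tree))
    for child in tree:
        yield from clades(child)


def exclusive_clades(tree, trait):
    """Map each trait value to its taxa when those taxa form exactly one clade."""
    all_clades = set(clades(tree))
    by_value = {}
    for taxon, value in trait.items():
        by_value.setdefault(value, set()).add(taxon)
    return {
        value: sorted(taxa)
        for value, taxa in by_value.items()
        if frozenset(taxa) in all_clades
    }


# Made-up phylogeny and habitat data
tree = ("outgroup", (("A", "B"), ("C", "D")))
habitat = {"outgroup": "stream", "A": "spring", "B": "spring", "C": "stream", "D": "lake"}

print(exclusive_clades(tree, habitat))  # {'spring': ['A', 'B']}
```

Here the spring habitat is specially shared by the group A + B, while the stream habitat is scattered across the tree and the lake habitat belongs to a single species. This simple check treats a trait shared by every taxon as "exclusive" to the whole tree, so it is a first look and not a substitute for thinking about polarity.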


A discussion of "what is a species" can on the other hand take weeks.


Kai


Adios


Kai


Kai

Quote from: LMNO on June 10, 2010, 06:21:26 PM
I think I actually understood that.  Cool.

Quote from: Hawk on June 10, 2010, 07:46:46 PM
:?

Quote from: Kai on June 10, 2010, 08:12:14 PM
What would you like me to explain in more detail?

Quote from: Hawk on June 11, 2010, 01:15:45 AM
Galapagos vs un-isolated evolution?

Okay, I think I know what you are asking now.

First I'd like to define a few terms.

Species (at least for this discussion): a unit of diversity, which is an ecologically unique metapopulation of interreproducing individuals.

Speciation: the process by which new species are formed.

Anagenesis: a historical concept of speciation whereby one species turns into another by a linear path. This is not the commonly accepted speciation concept of today, because it is not only historically untestable, but a species that "turns into another" is IMO still the same lineage.

Cladogenesis: the commonly accepted modern speciation concept, by which populations of a single species diverge in characters and become ecologically unique and no longer interbreed.

Isolating mechanism: the event or character by which species originally arise and continue to be separated.

Vicariance: a historical geological event that causes a split in the landscape. IOW, formation of oceans, deserts, mountains, etc.

Dispersal: an event where individuals of a population leave that location and move elsewhere. This is as opposed to migration, where individuals return to the original location on a cyclical basis.

Sympatry: When two or more species occur at the same geographic location, or within the same ecosystem.

Allopatry: When two or more species occur at separate geographic locations or in different ecosystems.


Systematics is a science split in two parts. The first is the understanding of species, their metaphysical nature, what they are and how many there are, and their origin. The other is the understanding of the relationship between species.

The cool thing about systematics is that one does not need to know how species come about to study their relationships, since there are no assumptions of origin in the process of phylogenetic inference. The reverse is mostly true when studying species origins, since I only need to know a small sample of relationships.

If we define species as I have above, and we understand that species form during cladogenesis, the splitting of species, then the biggest question in origins is what is the isolating mechanism. There are three types of isolating mechanisms: geographic, biological/behavioral, and ecological.

In the case of geographic isolation, either a barrier cleaves a species in two, leaving some individuals on either side (vicariance) or individuals of a population cross a barrier that was already formed (dispersal). In the case of volcanic islands which were never connected to the mainland, the isolating mechanism is almost always dispersal. This is true for example in the Galapagos. In other cases, populations are separated by the formation of mountains, deserts, forests, and other large barriers.

Geographic isolation is an example of allopatric speciation. Sympatric speciation, where the new species are not geographically isolated from each other, has other isolating mechanisms.

One of the most studied of these is ecological niche partitioning, where competition within a population causes, over evolutionary time, some members to specialize on different foods/locations in an ecosystem. This could even affect the temporal distribution of organisms, where some insects may emerge as adults earlier or later, thus splitting the population into two parts where the intermediate conditions are lost, and eventually those two parts will become ecologically distinct and no longer interbreed.

The final type of isolating mechanism, biological/behavioral, is poorly understood in origin. Populations, and therefore species, have a range of variation. If a selective pressure causes two or more different physiological or behavioral conditions to become distinct to the point where they no longer interbreed, then these become the isolating mechanisms. Such things include courtship behavior, pheromones, genetic mutations which cause sterility, morphological changes which disallow the physical act of coupling, and others.


I hope I answered your question. I've been thinking about this for the last few hours and how to explain it. I am by no means an expert, but I think I hit the major points. It's important to understand that allopatric speciation may occur and then later the species may become sympatric once again but remain separate, so although the isolating mechanism was originally geographic, the current isolating mechanism would be ecological or biological/behavioral.


Kai

An excellent open access article from Zootaxa (Mooi and Gill 2010) that voices my concerns better than I could have ever said them:

http://www.mapress.com/zootaxa/2010/f/zt02450p040.pdf