**Computational phylogenetics** is the application of computational algorithms, methods, and programs to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes, species, or other taxa. For example, these techniques have been used to explore the family tree of hominid species^{[1]} and the relationships between specific genes shared by many types of organisms.^{[2]} Traditional phylogenetics relies on morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms, while molecular phylogenetics uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification. Many forms of molecular phylogenetics are closely related to and make extensive use of sequence alignment in constructing and refining phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The phylogenetic trees constructed by computational methods are unlikely to perfectly reproduce the evolutionary tree that represents the historical relationships between the species being analyzed. The historical species tree may also differ from the historical tree of an individual homologous gene shared by those species.

Producing a phylogenetic tree requires a measure of homology among the characteristics shared by the taxa being compared. In morphological studies, this requires explicit decisions about which physical characteristics to measure and how to encode them as distinct states for the input taxa. In molecular studies, a primary problem is producing a multiple sequence alignment (MSA) between the genes or amino acid sequences of interest. Progressive sequence alignment methods produce a phylogenetic tree by necessity, because they incorporate new sequences into the calculated alignment in order of genetic distance.

## Types of phylogenetic trees and networks

Phylogenetic trees generated by computational phylogenetics can be *rooted* or *unrooted* depending on the input data and the algorithm used. A rooted tree is a directed graph that explicitly identifies a most recent common ancestor (MRCA), usually an imputed sequence that is not represented in the input. Genetic distance measures can be used to plot a tree with the input sequences as leaf nodes and their distances from the root proportional to their genetic distance from the hypothesized MRCA. Identification of a root usually requires the inclusion in the input data of at least one "outgroup" known to be only distantly related to the sequences of interest.

By contrast, unrooted trees plot the distances and relationships between input sequences without making assumptions regarding their descent. An unrooted tree can always be produced from a rooted tree, but a root cannot usually be placed on an unrooted tree without additional data on divergence rates, such as the assumption of the molecular clock hypothesis.^{[3]}

The set of all possible phylogenetic trees for a given group of input sequences can be conceptualized as a discretely defined multidimensional "tree space" through which search paths can be traced by optimization algorithms. Although counting the total number of trees for a nontrivial number of input sequences can be complicated by variations in the definition of a tree topology, it is always true that there are more rooted than unrooted trees for a given number of inputs and choice of parameters.^{[4]}
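To give a sense of how quickly tree space grows, the sketch below counts strictly bifurcating trees with labeled leaves. It assumes the standard double-factorial formulas, (2n − 5)!! unrooted and (2n − 3)!! rooted topologies for n leaves; other definitions of a tree topology (for example, allowing multifurcations) give different counts. The function names are illustrative.

```python
def double_factorial(k):
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating topologies on n labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n_leaves - 5)


def count_rooted_trees(n_leaves):
    """Number of rooted bifurcating topologies on n labeled leaves."""
    if n_leaves < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n_leaves - 3)


for n in (4, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Under these assumptions there are always more rooted than unrooted topologies, because each unrooted tree on n leaves can be rooted on any of its 2n − 3 branches.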

Both rooted and unrooted phylogenetic trees can be further generalized to rooted or unrooted phylogenetic networks, which allow for the modeling of evolutionary phenomena such as hybridization or horizontal gene transfer.

## Coding characters and defining homology

### Morphological analysis

The basic problem in morphological phylogenetics is the assembly of a matrix representing a mapping from each of the taxa being compared to representative measurements for each of the phenotypic characteristics being used as a classifier. The types of phenotypic data used to construct this matrix depend on the taxa being compared; for individual species, they may involve measurements of average body size, lengths or sizes of particular bones or other physical features, or even behavioral manifestations. Since not every possible phenotypic characteristic can be measured and encoded for analysis, the selection of which features to measure is a major inherent obstacle to the method. The decision of which traits to use as a basis for the matrix necessarily represents a hypothesis about which traits of a species or higher taxon are evolutionarily relevant.^{[5]} Morphological studies can be confounded by examples of convergent evolution of phenotypes.^{[6]} A major challenge in constructing useful classes is the high likelihood of inter-taxon overlap in the distribution of the phenotype's variation. The inclusion of extinct taxa in morphological analysis is often difficult because fossil records are absent or incomplete, but it has been shown to have a significant effect on the trees produced; in one study, only the inclusion of extinct species of apes produced a morphologically derived tree that was consistent with that produced from molecular data.^{[1]}

Some phenotypic classifications, particularly those used when analyzing very diverse groups of taxa, are discrete and unambiguous; classifying organisms as possessing or lacking a tail, for example, is straightforward in the majority of cases, as is counting features such as eyes or vertebrae. However, the most appropriate representation of continuously varying phenotypic measurements is a controversial problem without a general solution. A common method is simply to sort the measurements of interest into two or more classes, rendering continuous observed variation as discretely classifiable (e.g., all examples with humerus bones longer than a given cutoff are scored as members of one state, and all examples whose humerus bones are shorter than the cutoff are scored as members of a second state). This results in an easily manipulated data set, but it has been criticized for poor reporting of the basis for the class definitions and for sacrificing information compared to methods that use a continuous weighted distribution of measurements.^{[7]}

Because morphological data is extremely labor-intensive to collect, whether from literature sources or from field observations, reuse of previously compiled data matrices is not uncommon, although this may propagate flaws in the original matrix into multiple derivative analyses.^{[8]}

### Molecular analysis

The character coding problem is very different in molecular analyses, because the characters in biological sequence data are immediate and discretely defined: distinct nucleotides in DNA or RNA sequences and distinct amino acids in protein sequences. However, defining homology can be challenging due to the inherent difficulties of multiple sequence alignment. For a given gapped MSA, several rooted phylogenetic trees can be constructed that vary in their interpretations of which changes are "mutations" versus ancestral characters, and which events are insertion mutations or deletion mutations. For example, given only a pairwise alignment with a gap region, it is impossible to determine whether one sequence bears an insertion mutation or the other carries a deletion. The problem is magnified in MSAs with unaligned and nonoverlapping gaps. In practice, sizable regions of a calculated alignment may be discounted in phylogenetic tree construction to avoid integrating noisy data into the tree calculation.

## Distance-matrix methods

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore require an MSA as input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches.^{[3]} Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignment. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.^{[4]}
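As a minimal sketch of the distance definition above, the following code computes the fraction of mismatches at aligned positions and assembles the all-to-all matrix. The function names, the `gaps_as_mismatch` switch, and the use of `-` as the gap symbol are assumptions made for illustration; columns where both sequences have a gap are skipped in either mode.

```python
def p_distance(seq_a, seq_b, gaps_as_mismatch=False, gap="-"):
    """Fraction of mismatching positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # column carries no information for this pair
        if a == gap or b == gap:
            if not gaps_as_mismatch:
                continue  # ignore one-sided gaps
            mismatches += 1
        elif a != b:
            mismatches += 1
        compared += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


def distance_matrix(alignment, **options):
    """Return (names, symmetric matrix) for a dict of name -> aligned sequence."""
    names = list(alignment)
    matrix = [[0.0] * len(names) for _ in names]
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = p_distance(alignment[names[i]], alignment[names[j]], **options)
            matrix[i][j] = matrix[j][i] = d
    return names, matrix


toy = {"A": "ACGTAC", "B": "ACGTTC", "C": "AC-TTG"}
names, matrix = distance_matrix(toy)
print(names, matrix)
```

The two gap treatments can give noticeably different distances for gap-rich alignments, which is one reason the choice should be reported alongside the tree.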

### UPGMA and WPGMA

The UPGMA (*Unweighted Pair Group Method with Arithmetic mean*) and WPGMA (*Weighted Pair Group Method with Arithmetic mean*) methods produce rooted trees and require a constant-rate assumption: that is, they assume an ultrametric tree in which the distances from the root to every branch tip are equal.^{[9]}
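The following is a compact sketch of UPGMA clustering, assuming a symmetric distance matrix given as a list of lists. It repeatedly merges the closest pair of clusters and recomputes distances as size-weighted averages; replacing the weighted average with a plain mean of the two distances would give WPGMA. The nested-tuple tree representation and the function name are illustrative choices.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA; return (nested-tuple tree, height of the root)."""
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (names[i], 1) for i in range(len(names))}
    # pairwise distances keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    height = 0.0
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # distance from the merged cluster to each remaining cluster:
        # average over all leaf pairs, i.e. weighted by cluster size
        merged = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        height = dij / 2  # ultrametric: the new node sits halfway between the two clusters
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height


names = ["A", "B", "C", "D"]
dist = [[0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0]]
print(upgma(names, dist))
```

Because each new node is placed at half the merge distance, every tip ends up equidistant from the root, which is exactly the constant-rate assumption described above.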

### Neighbor-joining

Neighbor-joining methods apply general cluster analysis techniques to sequence analysis using genetic distance as a clustering metric. The simple neighbor-joining method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages.^{[10]}
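A single joining step of the standard neighbor-joining procedure can be sketched as follows. The code assumes the usual Q-criterion, Q(i, j) = (n − 2)·d(i, j) − Σd(i, ·) − Σd(j, ·), and the usual branch-length and distance-update formulas; the function name and return format are hypothetical. A full implementation would repeat this step on the reduced matrix until only three nodes remain.

```python
def neighbor_joining_step(dist):
    """Pick the pair to join and return (i, j, branch_i, branch_j, distances_to_new_node)."""
    n = len(dist)
    if n < 3:
        raise ValueError("need at least 3 taxa")
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # branch lengths from i and j to the new interior node; they may differ,
    # so no constant rate across lineages is assumed
    branch_i = dist[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    # distances from the new node to every other taxon
    to_new = {k: (dist[i][k] + dist[j][k] - dist[i][j]) / 2
              for k in range(n) if k not in (i, j)}
    return i, j, branch_i, branch_j, to_new


dist = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
print(neighbor_joining_step(dist))
```

Unlike UPGMA, the two branches leading to the joined pair are allowed to have unequal lengths.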

### Fitch-Margoliash method

The Fitch-Margoliash method uses a weighted least squares method for clustering based on genetic distance.^{[11]} Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be linear; the linearity criterion for distances requires that the expected values of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances, a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through the use of a substitution matrix such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches.^{[4]} Another modification of the algorithm can be helpful, especially in the case of concentrated distances (see the concentration of measure phenomenon and the curse of dimensionality): that modification, described in,^{[12]} has been shown to improve the efficiency of the algorithm and its robustness.

The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-complete,^{[13]} so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space.
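The Jukes-Cantor correction mentioned above can be illustrated with the standard closed-form formula d = −(3/4)·ln(1 − (4/3)·p), where p is the observed fraction of mismatching sites. This is a sketch of that one correction only, not of the Fitch-Margoliash fitting procedure; the function name is an assumption.

```python
import math


def jukes_cantor_distance(p):
    """Convert an observed mismatch fraction p into expected substitutions per site."""
    if not 0 <= p < 0.75:
        # at p >= 0.75 the sequences look random under this model
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.01, 0.1, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is close to p for small p and grows increasingly larger than p as divergence increases, reflecting the growing number of hidden multiple and back substitutions.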

### Using outgroups

Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and to root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one outgroup sequence known to be only distantly related to the sequences of interest in the query set.^{[3]} This usage can be seen as a type of experimental control. If the outgroup has been appropriately chosen, it will have a much greater genetic distance, and thus a longer branch length, than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup, and too distant a relationship adds noise to the analysis.^{[3]} Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the sequences is highly conserved across lineages. Horizontal gene transfer, especially between otherwise divergent bacteria, can also confound outgroup usage.

## Maximum parsimony

Maximum parsimony (MP) is a method of identifying the potential phylogenetic tree that requires the smallest total number of evolutionary events to explain the observed sequence data. Some ways of scoring trees also include a "cost" associated with particular types of evolutionary events and attempt to locate the tree with the smallest total cost. This is a useful approach in cases where not every possible type of event is equally likely, for example when particular nucleotides or amino acids are known to be more mutable than others.

The most naive way of identifying the most parsimonious tree is simple enumeration: considering each possible tree in succession and searching for the tree with the smallest score. However, this is only possible for a relatively small number of sequences or species, because the problem of identifying the most parsimonious tree is known to be NP-hard;^{[4]} consequently, a number of heuristic search methods for optimization have been developed to locate a highly parsimonious tree, if not the best in the set. Most such methods involve a steepest descent-style minimization mechanism operating on a tree rearrangement criterion.
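Scoring a single candidate tree is the inner step of any such search. The sketch below uses Fitch's small-parsimony algorithm, which the text above does not name; it assumes a rooted binary tree written as nested tuples, unordered characters, and equal cost for every change. Enumerating or heuristically rearranging trees would call this scorer for each candidate.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):  # leaf: its observed state, no changes needed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # children can agree: no extra change at this node
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "GCT"}
for candidate in ((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))):
    print(candidate, parsimony_score(candidate, alignment))
```

In this toy alignment the grouping of A with B and C with D needs fewer changes than the alternative, so it would be preferred under maximum parsimony.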

### Branch and bound

The branch and bound algorithm is a general method used to increase the efficiency of searches for near-optimal solutions of NP-hard problems, first applied to phylogenetics in the early 1980s.^{[14]} Branch and bound is particularly well suited to phylogenetic tree construction because it inherently requires dividing a problem into a tree structure as it subdivides the problem space into smaller regions. As its name implies, it requires as input both a branching rule (in the case of phylogenetics, the addition of the next species or sequence to the tree) and a bound (a rule that excludes certain regions of the search space from consideration, thereby assuming that the optimal solution cannot occupy that region). Identifying a good bound is the most challenging aspect of the algorithm's application to phylogenetics. A simple way of defining the bound is a maximum number of assumed evolutionary changes allowed per tree. A set of criteria known as Zharkikh's rules^{[15]} severely limits the search space by defining characteristics shared by all candidate "most parsimonious" trees. The two most basic rules require the elimination of all but one redundant sequence (for cases where multiple observations have produced identical data) and the elimination of character sites at which two or more states do not occur in at least two species. Under ideal conditions these rules and their associated algorithm would completely define a tree.

### Sankoff-Morel-Cedergren algorithm

The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequences.^{[16]} The method uses a maximum parsimony calculation in conjunction with a scoring function that penalizes gaps and mismatches, thereby favoring the tree that introduces a minimal number of such events (an alternative view holds that the trees to be favored are those that maximize the amount of sequence similarity that can be interpreted as homology, a point of view that may lead to different optimal trees^{[17]}). The imputed sequences at the interior nodes of the tree are scored and summed over all the nodes in each possible tree. The lowest-scoring tree sum provides both an optimal tree and an optimal MSA given the scoring function. Because the method is highly computationally intensive, an approximate method is used in which initial guesses for the interior alignments are refined one node at a time. Both the full and the approximate version are in practice calculated by dynamic programming.^{[4]}

### MALIGN and POY

More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring, but not necessarily optimal, trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing a cladogram score, and its companion POY uses an iterative method that couples the optimization of the phylogenetic tree with improvements in the corresponding MSA.^{[18]} However, the use of these methods in constructing evolutionary hypotheses has been criticized as biased due to the deliberate construction of trees reflecting minimal evolutionary events.^{[19]} This, in turn, has been countered by the view that such methods should be seen as heuristic approaches to finding the trees that maximize the amount of sequence similarity that can be interpreted as homology.^{[17]}^{[20]}

## Maximum likelihood

The maximum likelihood method uses standard statistical techniques for inferring probability distributions to assign probabilities to particular possible phylogenetic trees. The method requires a substitution model to assess the probability of particular mutations; roughly, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as having a lower probability. This is broadly similar to the maximum-parsimony method, but maximum likelihood allows additional statistical flexibility by permitting varying rates of evolution across both lineages and sites. In fact, the method requires that evolution at different sites and along different lineages be statistically independent. Maximum likelihood is thus well suited to the analysis of distantly related sequences, but finding the maximum-likelihood tree is NP-hard and is therefore believed to be computationally intractable in general.^{[21]}

The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search space by efficiently calculating the likelihood of subtrees.^{[4]} The method calculates the likelihood for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is, the tips of the tree) and working backwards toward the "bottom" node in nested sets. However, the trees produced by the method are only rooted if the substitution model is irreversible, which is not generally true of biological systems. The search for the maximum-likelihood tree also includes a branch-length optimization component that is difficult to improve upon algorithmically; general global optimization tools such as the Newton-Raphson method are often used.
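The pruning recursion can be sketched for a single alignment site as below. The sketch assumes the Jukes-Cantor model with equal base frequencies, branch lengths measured in expected substitutions per site, and a tree written as nested tuples of `(child, branch_length)` pairs; all names are illustrative. It computes the likelihood of a fixed tree only and does not search over topologies or optimize branch lengths.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b along a branch of length t."""
    e = math.exp(-4 * t / 3)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial_likelihoods(node, site_states):
    """For each base at this node, the likelihood of the observed tips below it."""
    if isinstance(node, str):  # leaf: certainty about the observed base
        return {b: 1.0 if b == site_states[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:  # children are processed before their parent
        below = partial_likelihoods(child, site_states)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, t) * below[c] for c in BASES)
    return result


def site_likelihood(tree, site_states):
    """Likelihood of one site: partial likelihoods at the root weighted by base frequencies."""
    root = partial_likelihoods(tree, site_states)
    return sum(0.25 * root[b] for b in BASES)


tree = ((("x", 0.1), ("y", 0.2)), 0.05), ("z", 0.3)
print(site_likelihood(tree, {"x": "A", "y": "A", "z": "C"}))
```

Because the Jukes-Cantor model is reversible, moving the root along a branch leaves the likelihood unchanged, which is why the resulting tree is effectively unrooted. The total likelihood of an alignment is the product of the site likelihoods, usually accumulated as a sum of logarithms.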

## Bayesian inference

Bayesian inference can be used to produce phylogenetic trees in a manner closely related to the maximum likelihood methods. Bayesian methods assume a prior probability distribution of the possible trees, which may simply be the probability of any one tree among all the possible trees that could be generated from the data, or may be a more sophisticated estimate derived from the assumption that divergence events such as speciation occur as stochastic processes. The choice of prior distribution is a point of contention among users of Bayesian-inference phylogenetics methods.^{[4]}

Implementations of Bayesian methods generally use Markov chain Monte Carlo sampling algorithms, although the choice of move set varies; selections used in Bayesian phylogenetics include circularly permuting leaf nodes of a proposed tree at each step^{[22]} and swapping descendant subtrees of a random internal node between two related trees.^{[23]} The use of Bayesian methods in phylogenetics has been controversial, largely due to incomplete specification of the choice of move set, acceptance criterion, and prior distribution in published work.^{[4]} Bayesian methods are generally held to be superior to parsimony-based methods; they can be more prone to long-branch attraction than maximum likelihood techniques,^{[24]} although they are better able to accommodate missing data.^{[25]}

Whereas likelihood methods find the tree that maximizes the probability of the data, a Bayesian approach recovers a tree that represents the most likely clades by drawing on the posterior distribution. However, estimates of the posterior probability of clades (measuring their "support") can be quite wide of the mark, especially in clades that are not overwhelmingly likely. As such, other methods have been put forward to estimate posterior probability.^{[26]}

## Model selection

Molecular phylogenetics methods rely on a defined substitution model that encodes a hypothesis about the relative rates of mutation at various sites along the gene or amino acid sequences studied. At their simplest, substitution models aim to correct for differences in the rates of transitions and transversions in nucleotide sequences. The use of substitution models is necessitated by the fact that the genetic distance between two sequences increases linearly only for a short time after the two sequences diverge from each other (alternatively, the distance is linear only shortly before coalescence). The longer the time since divergence, the more likely it becomes that two mutations occur at the same nucleotide site. Simple genetic distance calculations will thus undercount the number of mutation events that have occurred in evolutionary history. The extent of this undercount increases with increasing time since divergence, which can lead to the phenomenon of long branch attraction, or the misassignment of two distantly related but convergently evolving sequences as closely related.^{[27]} The maximum parsimony method is particularly susceptible to this problem due to its explicit search for a tree representing a minimum number of distinct evolutionary events.^{[4]}

### Types of models

All substitution models assign a set of weights to each possible change of state represented in the sequence. The most common model types are implicitly reversible because they assign the same weight to, for example, a G→C nucleotide mutation as to a C→G mutation. The simplest possible model, the Jukes-Cantor model, assigns an equal probability to every possible change of state for a given nucleotide base. The rate of change between any two distinct nucleotides will be one-third of the overall substitution rate.^{[4]} More advanced models distinguish between transitions and transversions. The most general possible time-reversible model, called the GTR model, has six mutation rate parameters. An even more generalized model, known as the general 12-parameter model, breaks time-reversibility, at the cost of much additional complexity in calculating genetic distances that are consistent among multiple lineages.^{[4]} One possible variation on this theme adjusts the rates so that overall GC content varies over time.^{[28]}

Models may also allow for the variation of rates with positions in the input sequence. The most obvious example of such variation follows from the arrangement of nucleotides in protein-coding genes into three-base codons. If the location of the open reading frame (ORF) is known, rates of mutation can be adjusted for the position of a given site within a codon, since it is known that wobble base pairing can allow for higher mutation rates in the third nucleotide of a given codon without affecting the codon's meaning in the genetic code.^{[27]} A less hypothesis-driven example that does not rely on ORF identification simply assigns to each site a rate randomly drawn from a predetermined distribution, often the gamma distribution or log-normal distribution.^{[4]} Finally, a more conservative estimate of rate variations known as the covarion method allows autocorrelated variations in rates, so that the mutation rate of a given site is correlated across sites and lineages.^{[29]}
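The relationship between these models can be made concrete with a small sketch that builds an instantaneous rate matrix from six exchangeability parameters and four base frequencies, using the common parameterization q(a→b) = s(a,b)·π(b). This parameterization, the function name, and the example numbers are assumptions for illustration; with all six exchangeabilities equal and uniform base frequencies the matrix reduces to the Jukes-Cantor case.

```python
from itertools import combinations

BASES = "ACGT"


def gtr_rate_matrix(exchange, freqs):
    """Build q[a][b] = exchange[{a, b}] * freqs[b]; each diagonal makes its row sum to zero."""
    q = {a: {b: 0.0 for b in BASES} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                q[a][b] = exchange[frozenset((a, b))] * freqs[b]
        q[a][a] = -sum(q[a][b] for b in BASES if b != a)
    return q


# Jukes-Cantor as a special case: one shared exchangeability, uniform frequencies
jc = gtr_rate_matrix({frozenset(p): 1.0 for p in combinations(BASES, 2)},
                     {b: 0.25 for b in BASES})
print(jc["A"])

# A made-up GTR example with six distinct exchangeabilities
pairs = list(combinations(BASES, 2))
gtr = gtr_rate_matrix({frozenset(p): r for p, r in zip(pairs, (1.0, 4.0, 0.8, 1.2, 3.5, 0.9))},
                      {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3})
print(gtr["A"])
```

Time-reversibility shows up as the detailed-balance condition π(a)·q(a→b) = π(b)·q(b→a), which holds for any choice of the six symmetric exchangeabilities.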

### Choosing the best model

The selection of an appropriate model is critical for the production of good phylogenetic analyses, both because underparameterized or overly restrictive models may produce aberrant behavior when their underlying assumptions are violated, and because overly complex or overparameterized models are computationally expensive and their parameters may be overfit.^{[27]} The most common method of model selection is the likelihood ratio test (LRT), which produces a likelihood estimate that can be interpreted as a measure of "goodness of fit" between the model and the input data.^{[27]} However, care must be taken in using these results, since a more complex model with more parameters will always have a higher likelihood than a simplified version of the same model, which can lead to the naive selection of models that are overly complex.^{[4]} For this reason, model selection programs choose the simplest model that is not significantly worse than more complex substitution models. A significant disadvantage of the LRT is the necessity of making a series of pairwise comparisons between models; it has been shown that the order in which the models are compared has a major effect on the one that is eventually selected.^{[30]}

An alternative model selection method is the Akaike information criterion (AIC), formally an estimate of the Kullback-Leibler divergence between the true model and the model being tested. It can be interpreted as a likelihood estimate with a correction factor to penalize overparameterized models.^{[27]} The AIC is calculated on an individual model rather than a pair, so it is independent of the order in which models are assessed. A related alternative, the Bayesian information criterion (BIC), has a similar basic interpretation but penalizes complex models more heavily.^{[27]}

A comprehensive step-by-step protocol for constructing a phylogenetic tree, including DNA/amino acid contiguous sequence assembly, multiple sequence alignment, testing for the best-fitting substitution model, and phylogeny reconstruction using maximum likelihood and Bayesian inference, is available at Nature Protocols.^{[31]}

A non-traditional way of evaluating a phylogenetic tree is to compare it with a clustering result. One can use a multidimensional scaling technique called interpolative joining to reduce dimensionality and visualize the clustering result for the sequences in 3D, and then map the phylogenetic tree onto the clustering result. A better tree usually has a higher correlation with the clustering result.^{[32]}
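Both criteria are simple functions of a model's maximized log-likelihood and its number of free parameters. The sketch below uses the standard definitions AIC = 2k − 2·ln L and BIC = k·ln(n) − 2·ln L, with n taken here to be the number of alignment sites. The log-likelihood values and parameter counts in the example are invented for illustration and do not come from any real data set.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# hypothetical fits: model name -> (maximized log-likelihood, free substitution parameters)
fits = {"JC69": (-2500.0, 0), "HKY85": (-2450.0, 4), "GTR": (-2448.5, 8)}
n_sites = 1000  # assumed alignment length

for name, (lnl, k) in fits.items():
    print(name, aic(lnl, k), round(bic(lnl, k, n_sites), 2))

best_by_aic = min(fits, key=lambda m: aic(*fits[m]))
best_by_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
print(best_by_aic, best_by_bic)
```

In this made-up example the most complex model has the highest likelihood, but its small gain does not offset the penalty for the extra parameters, so both criteria prefer the intermediate model. Each score is computed per model, so the ranking does not depend on the order of comparison.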

## Evaluating tree support

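As with all statistical analysis, the estimation of phylogenies from character data requires an evaluation of confidence. A number of methods exist to test the amount of support for a phylogenetic tree, either by evaluating the support for each sub-tree in the phylogeny (nodal support) or by evaluating whether the phylogeny is significantly different from other possible trees (alternative tree hypothesis tests).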

### Nodal support

The most common method for assessing the support of a tree is to evaluate the statistical support for each node on the tree. Typically, a node with very low support is not considered valid in further analysis, and visually may be collapsed into a polytomy to indicate that relationships within a clade are unresolved.

#### Consensus tree

Many methods for assessing nodal support involve consideration of multiple phylogenies. The consensus tree summarizes the nodes that are shared among a set of trees.^{[33]} In a *strict consensus*, only nodes found in every tree are shown, and the rest are collapsed into an unresolved polytomy. Less conservative methods, such as the *majority-rule consensus* tree, retain nodes that are supported by a given percentage of the trees under consideration (such as at least 50%).

For example, in maximum parsimony analysis, there may be many trees with the same parsimony score. A strict consensus tree would show which nodes are found in all equally parsimonious trees, and which nodes differ. Consensus trees are also used to evaluate support for phylogenies reconstructed with Bayesian inference (see below).
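The bookkeeping behind both kinds of consensus can be sketched by listing the clades of each tree and counting how often each occurs. The code assumes rooted trees written as nested tuples of leaf names, treats a clade as the set of leaves below a node, and reports which clades would be kept rather than assembling the consensus tree itself; the function names are illustrative.

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def collect(node):
        if isinstance(node, str):  # a leaf is just its name
            return frozenset([node])
        members = frozenset().union(*(collect(child) for child in node))
        found.add(members)
        return members

    collect(tree)
    return found


def clade_support(trees):
    """Fraction of the input trees that contain each clade."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    return {clade: count / len(trees) for clade, count in counts.items()}


def strict_consensus_clades(trees):
    return {c for c, s in clade_support(trees).items() if s == 1.0}


def majority_rule_clades(trees):
    return {c for c, s in clade_support(trees).items() if s > 0.5}


trees = [(("A", "B"), ("C", "D")),
         (("A", "B"), ("C", "D")),
         ((("A", "B"), "C"), "D")]
print(sorted(map(sorted, strict_consensus_clades(trees))))
print(sorted(map(sorted, majority_rule_clades(trees))))
```

The same frequency count, applied to trees built from resampled data or to trees sampled from a posterior distribution, gives a per-node support value.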

#### Bootstrapping and jackknifing

In statistics, the bootstrap is a method for inferring the variability of data that has an unknown distribution using pseudoreplications of the original data. For example, given a set of 100 data points, a pseudoreplicate is a data set of the same size (100 points) randomly sampled from the original data, with replacement. That is, each original data point may be represented more than once in the pseudoreplicate, or not at all. Statistical support involves evaluating whether the original data has similar properties to a large set of pseudoreplicates.

In phylogenetics, bootstrapping is conducted using the columns of the character matrix. Each pseudoreplicate contains the same number of species (rows) and characters (columns) randomly sampled from the original matrix, with replacement. A phylogeny is reconstructed from each pseudoreplicate, with the same methods used to reconstruct the phylogeny from the original data. For each node on the phylogeny, the nodal support is the percentage of pseudoreplicates containing that node. ^{[34]}
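Generating one pseudoreplicate amounts to drawing column indices with replacement and rebuilding every row from those columns. The sketch below assumes the character matrix is a dict mapping taxon names to equal-length strings; the function name and the explicit random-number generator argument are illustrative choices. Tree reconstruction on each replicate is left to whatever method was used on the original data.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping all taxa and the same length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


alignment = {"t1": "ACGTAC", "t2": "ACGTTC", "t3": "TCGAAC"}
rng = random.Random(0)  # seeded so that runs are repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

The same column indices are applied to every taxon, so each resampled column is still a genuine column of the original matrix.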

The statistical rigor of the bootstrap test has been empirically evaluated using viral populations with known evolutionary histories,^{[35]} finding that 70% bootstrap support corresponds to a 95% probability that the clade exists. However, this was tested under ideal conditions (e.g., no change in evolutionary rates, symmetric phylogenies). In practice, values above 70% are generally regarded as supported, with the final evaluation of confidence left to the researcher or reader. Nodes with support lower than 70% are typically considered unresolved.

Jackknifing in phylogenetics is a similar procedure, except the columns of the matrix are sampled without replacement. Pseudoreplicates are generated by randomly subsampling the data; for example, a "10% jackknife" would involve randomly sampling 10% of the matrix many times to evaluate nodal support.
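A jackknife pseudoreplicate differs from a bootstrap one only in that each column can be drawn at most once. In the sketch below, `fraction` is the share of columns retained, following the wording above; some software instead specifies the share deleted, so the parameter meaning here is an assumption, as are the function name and the dict-of-strings matrix format.

```python
import random


def jackknife_alignment(alignment, fraction, rng):
    """Keep a random subset of columns, sampled without replacement, in original order."""
    n_sites = len(next(iter(alignment.values())))
    n_keep = max(1, round(fraction * n_sites))
    columns = sorted(rng.sample(range(n_sites), n_keep))
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


alignment = {"t1": "ACGTACGTAC", "t2": "ACGTTCGTAA", "t3": "TCGAACGTTC"}
print(jackknife_alignment(alignment, 0.5, random.Random(1)))
```

Each replicate is therefore shorter than the original matrix, and no column is ever duplicated within it.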

#### Posterior probability

Reconstruction of phylogenies using Bayesian inference generates a posterior distribution of highly probable trees given the data and evolutionary model, rather than a single "best" tree. The trees in the posterior distribution generally have many different topologies. Most Bayesian inference methods use Markov chain Monte Carlo iteration, and the initial steps of this chain are not considered reliable reconstructions of the phylogeny. Trees generated early in the chain are usually discarded as burn-in. The most common method of evaluating nodal support in a Bayesian phylogenetic analysis is to calculate the percentage of trees in the posterior distribution (post-burn-in) that contain the node.

The statistical support for a node in Bayesian inference is expected to reflect the probability that a clade really exists given the data and evolutionary model.^{[36]} Therefore, the threshold for accepting a node as supported is generally higher than for bootstrapping.

#### Step counting methods

Bremer support counts the number of extra steps needed to contradict a clade.

### Shortcomings

These measures each have their weaknesses. For example, smaller or larger clades tend to attract larger support values than mid-sized clades, simply as a result of the number of taxa in them.^{[37]}

Bootstrap support can provide high estimates of node support as a result of noise in the data rather than the true existence of a clade.^{[38]}

## Limitations and workarounds

Ultimately, there is no way to measure whether a particular phylogenetic hypothesis is accurate or not, unless the true relationships among the taxa being examined are already known (which may happen with bacteria or viruses under laboratory conditions). The best result an empirical phylogeneticist can hope to attain is a tree with branches that are well supported by the available evidence. Several potential pitfalls have been identified:

### Homoplasy

Some characters are more likely to evolve convergently than others; logically, such characters should be given less weight in the reconstruction of a tree.^{[39]} Weights in the form of a model of evolution can be inferred from sets of molecular data, so that maximum likelihood or Bayesian methods can be used to analyze them. For molecular sequences, this problem is exacerbated when the taxa under study have diverged substantially. As the time since the divergence of two taxa increases, so does the probability of multiple substitutions on the same site, or back mutations, all of which result in homoplasies. For morphological data, unfortunately, the only objective way to determine convergence is by the construction of a tree, a somewhat circular method. Even so, weighting homoplasious characters does indeed lead to better-supported trees.^{[39]} Further refinement can be brought by weighting changes in one direction higher than changes in another; for instance, the presence of thoracic wings almost guarantees placement among the pterygote insects because, although wings are often lost secondarily, there is no evidence that they have been gained more than once.^{[40]}

### Horizontal gene transfer

In general, organisms can inherit genes in two ways: vertical gene transfer and horizontal gene transfer. Vertical gene transfer is the passage of genes from parent to offspring, and horizontal (also called lateral) gene transfer occurs when genes jump between unrelated organisms, a common phenomenon especially in prokaryotes; a good example of this is acquired antibiotic resistance resulting from gene exchange between various bacteria, which leads to multi-drug-resistant bacterial species. There have also been well-documented cases of horizontal gene transfer between eukaryotes.

Horizontal gene transfer has complicated the determination of phylogenies of organisms, and inconsistencies in phylogeny have been reported among specific groups of organisms depending on the genes used to construct evolutionary trees. The only way to determine which genes have been acquired vertically and which horizontally is to parsimoniously assume that the largest set of genes that have been inherited together have been inherited vertically; this requires analyzing a large number of genes.

### Hybrids, speciation, introgression and incomplete lineage sorting

The basic assumption underlying the mathematical model of cladistics is a situation where species split neatly in bifurcating fashion. While such an assumption may hold on a larger scale (bar horizontal gene transfer, see above), speciation is often much less orderly. Research since the cladistic method was introduced has shown that hybrid speciation, once thought rare, is in fact quite common, particularly in plants.^{[41]}^{[42]} Paraphyletic speciation is also common, making the assumption of a bifurcating pattern unsuitable and leading to phylogenetic networks rather than trees.^{[43]}^{[44]} Introgression can also move genes between otherwise distinct species and sometimes even genera, complicating phylogenetic analysis based on genes.^{[45]} This phenomenon can contribute to "incomplete lineage sorting" and is thought to be common across a number of groups. In species-level analysis this can be dealt with by larger sampling or better whole-genome analysis.^{[46]} Often the problem is avoided by restricting the analysis to fewer, not closely related specimens.

### Taxon sampling

Owing to the development of advanced sequencing techniques in molecular biology, it has become feasible to gather large amounts of data (DNA or amino acid sequences) to infer phylogenetic hypotheses. For example, it is not rare to find studies with character matrices based on whole mitochondrial genomes (~16,000 nucleotides, in many animals). However, simulations have shown that it is more important to increase the number of taxa in the matrix than to increase the number of characters, because the more taxa there are, the more accurate and more robust the resulting phylogenetic tree is.^{[47]}^{[48]} This may be partly due to the breaking up of long branches.

### Phylogenetic signal

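Another important factor that affects the accuracy of tree reconstruction is whether the data analyzed actually contain a useful phylogenetic signal, a term that is used generally to denote whether a character evolves slowly enough to have the same state in closely related taxa, as opposed to varying randomly. Tests for phylogenetic signal exist.^{[49]}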

### Continuous characters

Morphological characters that sample a continuum may contain phylogenetic signal, but are hard to code as discrete characters. Several methods have been used, one of which is gap coding, and there are variations on gap coding. ^{[50]} In the original form of gap coding: ^{[50]}

group means for a character are first ordered by size. The pooled within-group standard deviation is calculated … and differences between adjacent means … are compared relative to this standard deviation. Any pair of adjacent means is considered different and given different integer scores … if the means are separated by a "gap" greater than the within-group standard deviation … times some arbitrary constant.

If more taxa are added to the analysis, the gaps between taxa can become so small that all information is lost. Generalized gap coding works around that problem by comparing individual taxa rather than considering one set that contains all of the taxa. ^{[50]}
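The original gap-coding procedure quoted above can be sketched as follows. The code assumes the group means and the pooled within-group standard deviation have already been computed, and it takes the arbitrary constant as a parameter; the function name and the example measurements are hypothetical. It implements only the original form, not generalized gap coding.

```python
def gap_code(group_means, pooled_sd, constant=1.0):
    """Assign integer states to taxa; a new state starts wherever adjacent means
    are separated by more than constant * pooled_sd."""
    ordered = sorted(group_means.items(), key=lambda item: item[1])
    codes = {}
    state = 0
    previous_mean = None
    for name, mean in ordered:
        if previous_mean is not None and mean - previous_mean > constant * pooled_sd:
            state += 1  # a gap: this taxon begins the next character state
        codes[name] = state
        previous_mean = mean
    return codes


means = {"taxon1": 1.0, "taxon2": 1.2, "taxon3": 3.0, "taxon4": 3.1, "taxon5": 6.0}
print(gap_code(means, pooled_sd=1.0))
```

Adding taxa with intermediate means to this example would shrink the gaps between neighbors until every taxon receives the same state, which is the loss of information described above.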

### Missing data

In general, the more data that are available when constructing a tree, the more accurate and reliable the resulting tree will be. Missing data are no more detrimental than simply having fewer data, although the impact is greatest when most of the missing data are in a small number of taxa. Concentrating the missing data in a small number of characters produces a more robust tree.^{[51]}

## The role of fossils

Because many characters involve embryological, soft-tissue, or molecular characters that (at best) hardly ever fossilize, and the interpretation of fossils is more ambiguous than that of living taxa, extinct taxa almost invariably have higher proportions of missing data than living ones. However, despite these limitations, the inclusion of fossils is invaluable, as they can provide information in sparse areas of trees, breaking up long branches and constraining intermediate character states; thus, fossil taxa contribute as much to tree resolution as modern taxa.^{[52]} Fossils can also constrain the age of lineages and thus demonstrate how consistent a tree is with the stratigraphic record;^{[53]} stratocladistics incorporates age information into data matrices for phylogenetic analyses.