Alignment-free sequence analysis

{{Short description|Methods in computational biology}}

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

The emergence and need for the analysis of different types of data generated through biological research has given rise to the field of bioinformatics.{{cite journal | vauthors = Rothberg J, Merriman B, Higgs G | title = Bioinformatics. Introduction | journal = The Yale Journal of Biology and Medicine | volume = 85 | issue = 3 | pages = 305–308 | date = September 2012 | pmid = 23189382 | pmc = 3447194 }} Molecular sequence and structure data of DNA, RNA, and proteins, gene expression profiles or microarray data, metabolic pathway data are some of the major types of data being analysed in bioinformatics. Among them sequence data is increasing at the exponential rate due to advent of next-generation sequencing technologies. Since the origin of bioinformatics, sequence analysis has remained the major area of research with wide range of applications in database searching, genome annotation, comparative genomics, molecular phylogeny and gene prediction. The pioneering approaches for sequence analysis were based on sequence alignment either global or local, pairwise or multiple sequence alignment.{{cite journal | vauthors = Batzoglou S | title = The many faces of sequence alignment | journal = Briefings in Bioinformatics | volume = 6 | issue = 1 | pages = 6–22 | date = March 2005 | pmid = 15826353 | doi = 10.1093/bib/6.1.6 | doi-access = free }}{{cite journal | vauthors = Mullan L | title = Pairwise sequence alignment--it's all about us! | journal = Briefings in Bioinformatics | volume = 7 | issue = 1 | pages = 113–115 | date = March 2006 | pmid = 16761368 | doi = 10.1093/bib/bbk008 | doi-access = }} Alignment-based approaches generally give excellent results when the sequences under study are closely related and can be reliably aligned, but when the sequences are divergent, a reliable alignment cannot be obtained and hence the applications of sequence alignment are limited. Another limitation of alignment-based approaches is their computational complexity and are time-consuming and thus, are limited when dealing with large-scale sequence data.{{cite journal | vauthors = Kemena C, Notredame C | title = Upcoming challenges for multiple sequence alignment methods in the high-throughput era | journal = Bioinformatics | volume = 25 | issue = 19 | pages = 2455–2465 | date = October 2009 | pmid = 19648142 | pmc = 2752613 | doi = 10.1093/bioinformatics/btp452 }} The advent of next-generation sequencing technologies has resulted in generation of voluminous sequencing data. The size of this sequence data poses challenges on alignment-based algorithms in their assembly, annotation and comparative studies.

Alignment-free methods

Alignment-free methods can broadly be classified into five categories: a) methods based on k-mer/word frequency, b) methods based on the length of common substrings, c) methods based on the number of (spaced) word matches, d) methods based on micro-alignments, e) methods based on information theory and f) methods based on graphical representation. Alignment-free approaches have been used in sequence similarity searches,{{cite journal | vauthors = Hide W, Burke J, Davison DB | title = Biological evaluation of d2, an algorithm for high-performance sequence comparison | journal = Journal of Computational Biology | volume = 1 | issue = 3 | pages = 199–215 | date = 1994 | pmid = 8790465 | doi = 10.1089/cmb.1994.1.199 }} clustering and classification of sequences,{{cite journal | vauthors = Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA | title = A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base | journal = Genome Research | volume = 9 | issue = 11 | pages = 1143–1155 | date = November 1999 | pmid = 10568754 | pmc = 310831 | doi = 10.1101/gr.9.11.1143 }} and more recently in phylogenetics{{cite journal | vauthors = Chan CX, Ragan MA | title = Next-generation phylogenomics | journal = Biology Direct | volume = 8 | pages = 3 | date = January 2013 | pmid = 23339707 | pmc = 3564786 | doi = 10.1186/1745-6150-8-3 | doi-access = free }} (Figure 1).

Such molecular phylogeny analyses employing alignment-free approaches are said to be part of next-generation phylogenomics. A number of review articles provide in-depth review of alignment-free methods in sequence analysis.{{cite journal | vauthors = Vinga S, Almeida J | title = Alignment-free sequence comparison-a review | journal = Bioinformatics | volume = 19 | issue = 4 | pages = 513–523 | date = March 2003 | pmid = 12611807 | doi = 10.1093/bioinformatics/btg005 | doi-access = free }}{{cite journal | vauthors = Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F | title = New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing | journal = Briefings in Bioinformatics | volume = 15 | issue = 3 | pages = 343–353 | date = May 2014 | pmid = 24064230 | pmc = 4017329 | doi = 10.1093/bib/bbt067 | author3-link = Gesine Reinert }}{{cite journal | vauthors = Haubold B | title = Alignment-free phylogenetics and population genetics | journal = Briefings in Bioinformatics | volume = 15 | issue = 3 | pages = 407–418 | date = May 2014 | pmid = 24291823 | doi = 10.1093/bib/bbt083 | doi-access = free }}{{cite journal | vauthors = Bonham-Carter O, Steele J, Bastola D | title = Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis | journal = Briefings in Bioinformatics | volume = 15 | issue = 6 | pages = 890–905 | date = November 2014 | pmid = 23904502 | pmc = 4296134 | doi = 10.1093/bib/bbt052 }}{{cite journal | vauthors = Zielezinski A, Vinga S, Almeida J, Karlowski WM | title = Alignment-free sequence comparison: benefits, applications, and tools | journal = Genome Biology | volume = 18 | issue = 1 | pages = 186 | date = October 2017 | pmid = 28974235 | pmc = 5627421 | doi = 10.1186/s13059-017-1319-7 | doi-access = free }}{{cite journal | vauthors = Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA | display-authors = 6 | title = Alignment-free inference of hierarchical and reticulate phylogenomic relationships | journal = Briefings in Bioinformatics | volume = 20 | issue = 2 | pages = 426–435 | date = March 2019 | pmid = 28673025 | pmc = 6433738 | doi = 10.1093/bib/bbx067 }}{{cite journal | vauthors = Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F | title = Alignment-Free Sequence Analysis and Applications | journal = Annual Review of Biomedical Data Science | volume = 1 | pages = 93–114 | date = July 2018 | pmid = 31828235 | pmc = 6905628 | doi = 10.1146/annurev-biodatasci-080917-013431 | arxiv = 1803.09727 | bibcode = 2018arXiv180309727R }}

The AFproject is an international collaboration to benchmark and compare software tools for alignment-free sequence comparison.{{cite journal | vauthors = Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM | display-authors = 6 | title = Benchmarking of alignment-free sequence comparison methods | journal = Genome Biology | volume = 20 | issue = 1 | pages = 144 | date = July 2019 | pmid = 31345254 | pmc = 6659240 | doi = 10.1186/s13059-019-1755-7 | doi-access = free }}

= Methods based on ''k''-mer/word frequency =

The popular methods based on k-mer/word frequencies include feature frequency profile (FFP),{{cite journal | vauthors = Sims GE, Jun SR, Wu GA, Kim SH | title = Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions | journal = Proceedings of the National Academy of Sciences of the United States of America | volume = 106 | issue = 40 | pages = 17077–17082 | date = October 2009 | pmid = 19805074 | pmc = 2761373 | doi = 10.1073/pnas.0909377106 | doi-access = free | bibcode = 2009PNAS..10617077S }}{{cite journal | vauthors = Sims GE, Kim SH | title = Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs) | journal = Proceedings of the National Academy of Sciences of the United States of America | volume = 108 | issue = 20 | pages = 8329–8334 | date = May 2011 | pmid = 21536867 | pmc = 3100984 | doi = 10.1073/pnas.1105168108 | doi-access = free | bibcode = 2011PNAS..108.8329S }} Composition vector (CV),{{cite journal | vauthors = Gao L, Qi J | title = Whole genome molecular phylogeny of large dsDNA viruses using composition vector method | journal = BMC Evolutionary Biology | volume = 7 | issue = 1 | pages = 41 | date = March 2007 | pmid = 17359548 | pmc = 1839080 | doi = 10.1186/1471-2148-7-41 | doi-access = free | bibcode = 2007BMCEE...7...41G }}{{cite journal | vauthors = Wang H, Xu Z, Gao L, Hao B | title = A fungal phylogeny based on 82 complete genomes using the composition vector method | journal = BMC Evolutionary Biology | volume = 9 | issue = 1 | pages = 195 | date = August 2009 | pmid = 19664262 | pmc = 3087519 | doi = 10.1186/1471-2148-9-195 | doi-access = free | bibcode = 2009BMCEE...9..195W }} Return time distribution (RTD),{{cite journal | vauthors = Kolekar P, Kale M, Kulkarni-Kale U | title = Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping | journal = Molecular Phylogenetics and Evolution | volume = 65 | issue = 2 | pages = 510–522 | date = November 2012 | pmid = 22820020 | doi = 10.1016/j.ympev.2012.07.003 }} frequency chaos game representation (FCGR).{{cite journal | vauthors = Hatje K, Kollmar M | title = A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method | journal = Frontiers in Plant Science | volume = 3 | pages = 192 | year = 2012 | pmid = 22952468 | pmc = 3429886 | doi = 10.3389/fpls.2012.00192 | doi-access = free }} and Spaced Words.

== Feature frequency profile (FFP) ==

The methodology involved in FFP based method starts by calculating the count of each possible k-mer (possible number of k-mers for nucleotide sequence: 4k, while that for protein sequence: 20k) in sequences. Each k-mer count in each sequence is then normalized by dividing it by total of all k-mers' count in that sequence. This leads to conversion of each sequence into its feature frequency profile. The pair wise distance between two sequences is then calculated Jensen–Shannon (JS) divergence between their respective FFPs. The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc.

== Composition vector (CV) ==

In this method frequency of appearance of each possible k-mer in a given sequence is calculated. The next characteristic step of this method is the subtraction of random background of these frequencies using Markov model to reduce the influence of random neutral mutations to highlight the role of selective evolution. The normalized frequencies are put a fixed order to form the composition vector (CV) of a given sequence. Cosine distance function is then used to compute pairwise distance between CVs of sequences. The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc. This method can be extended through resort to efficient pattern matching algorithms to include in the computation of the composition vectors: (i) all k-mers for any value of k, (ii) all substrings of any length up to an arbitrarily set maximum k value, (iii) all maximal substrings, where a substring is maximal if extending it by any character would cause a decrease in its occurrence count.{{cite journal | vauthors = Apostolico A, Denas O | title = Fast algorithms for computing sequence distances by exhaustive substring composition | journal = Algorithms for Molecular Biology | volume = 3 | pages = 13 | date = October 2008 | pmid = 18957094 | pmc = 2615014 | doi = 10.1186/1748-7188-3-13 | doi-access = free }}{{cite journal | vauthors = Apostolico A, Denas O, Dress A | title = Efficient tools for comparative substring analysis | journal = Journal of Biotechnology | volume = 149 | issue = 3 | pages = 120–126 | date = September 2010 | pmid = 20682467 | doi = 10.1016/j.jbiotec.2010.05.006 }}

== Return time distribution (RTD) ==

The RTD based method does not calculate the count of k-mers in sequences, instead it computes the time required for the reappearance of

k-mers. The time refers to the number of residues in successive appearance of particular k-mer. Thus the occurrence of each k-mer in a sequence is calculated in the form of RTD, which is then summarised using two statistical parameters mean (μ) and standard deviation (σ). Thus each sequence is represented in the form of numeric vector of size 2⋅4k containing μ and σ of 4k RTDs. The pair wise distance between sequences is calculated using Euclidean distance measure. The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc. A recent approach Pattern Extraction through Entropy Retrieval (PEER) provides direct detection of the k-mer length and summarised the occurrence interval using entropy.

== Frequency chaos game representation (FCGR) ==

The FCGR methods have evolved from chaos game representation (CGR) technique, which provides scale independent representation for genomic sequences.{{cite journal | vauthors = Jeffrey HJ | title = Chaos game representation of gene structure | journal = Nucleic Acids Research | volume = 18 | issue = 8 | pages = 2163–2170 | date = April 1990 | pmid = 2336393 | pmc = 330698 | doi = 10.1093/nar/18.8.2163 }} The CGRs can be divided by grid lines where each grid square denotes the occurrence of oligonucleotides of a specific length in the sequence. Such representation of CGRs is termed as Frequency Chaos Game Representation (FCGR). This leads to representation of each sequence into FCGR. The pair wise distance between FCGRs of sequences can be calculated using the Pearson distance, the Hamming distance or the Euclidean distance.{{cite journal | vauthors = Wang Y, Hill K, Singh S, Kari L | title = The spectrum of genomic signatures: from dinucleotides to chaos game representation | journal = Gene | volume = 346 | pages = 173–185 | date = February 2005 | pmid = 15716010 | doi = 10.1016/j.gene.2004.10.021 }}

== Spaced-word frequencies==

While most alignment-free algorithms compare the word-composition of sequences, Spaced Words uses a pattern of care and don't care positions. The occurrence of a spaced word in a sequence is then defined by the characters at the match positions only, while the characters at the don't care positions are ignored. Instead of comparing the frequencies of contiguous words in the input sequences, this approach compares the frequencies of the spaced words according to the pre-defined pattern. Note that the pre-defined pattern can be selected by analysis of the Variance of the number of matches,{{cite journal | vauthors = Hahn L, Leimeister CA, Ounit R, Lonardi S, Morgenstern B | title = rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison | journal = PLOS Computational Biology | volume = 12 | issue = 10 | pages = e1005107 | date = October 2016 | pmid = 27760124 | pmc = 5070788 | doi = 10.1371/journal.pcbi.1005107 | arxiv = 1511.04001 | doi-access = free | bibcode = 2016PLSCB..12E5107H }} the probability of the first occurrence on several models,{{cite journal | vauthors = Noé L | title = Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds | journal = Algorithms for Molecular Biology | volume = 12 | issue = 1 | pages = 1 | date = Feb 14, 2017 | pmid = 28289437 | pmc = 5310094 | doi = 10.1186/s13015-017-0092-1 | doi-access = free }} or the Pearson correlation coefficient between the expected word frequency and the true alignment distance.

= Methods based on length of common substrings =

The methods in this category employ the similarity and differences of substrings in a pair of sequences. These algorithms

were mostly used for string processing in computer science.{{cite book | vauthors = Gusfield D |title=Algorithms on strings, trees, and sequences: computer science and computational biology |year=1997 |publisher=Cambridge Univ. Press|location=Cambridge [u.a.]|isbn=9780521585194|edition=Reprinted (with corr.)}}

== Average common substring (ACS) ==

In this approach, for a chosen pair of sequences (A and B of lengths n and m respectively), longest substring starting at some position is identified in one sequence (A) which exactly matches in the other sequence (B) at any position. In this way, lengths of longest substrings starting at different positions in sequence A and having exact matches at some positions in sequence B are calculated. All these lengths are averaged to derive a measure L(A, B). Intuitively, larger the L(A, B), the more similar the two sequences are. To account for the differences in the length of sequences, L(A, B) is normalized [i.e. L(A, B)/\log(m)]. This gives the similarity measure between the sequences.

In order to derive a distance measure, the inverse of similarity measure is taken and a correction term is subtracted from it to assure that d(A, A) will be zero. Thus

: d(A, B) = \left [ \frac{\log m}{L(A, B)} \right] - \left[\frac{\log n}{L(A, A)} \right].

This measure d(A, B) is not symmetric, so one has to compute d_s(A, B) = d_s(B, A) = (d(A, B) + d(B, A))/2, which gives final ACS measure between the two strings (A and B).{{cite journal | vauthors = Ulitsky I, Burstein D, Tuller T, Chor B | title = The average common substring approach to phylogenomic reconstruction | journal = Journal of Computational Biology | volume = 13 | issue = 2 | pages = 336–350 | date = March 2006 | pmid = 16597244 | doi = 10.1089/cmb.2006.13.336 | citeseerx = 10.1.1.106.5122 }} The subsequence/substring search can be efficiently performed by

using suffix trees.{{cite book| vauthors = Weiner P |doi=10.1109/SWAT.1973.13|title=14th Annual Symposium on Switching and Automata Theory (swat 1973)|pages=1–11|year=1973|chapter=Linear pattern matching algorithms|citeseerx=10.1.1.474.9582}}{{cite book| vauthors = He D |title=2006 International Conference of the IEEE Engineering in Medicine and Biology Society |chapter=Using suffix tree to discover complex repetitive patterns in DNA sequences |year=2006|volume=1|pages=3474–7|pmid=17945779|doi=10.1109/IEMBS.2006.260445|isbn=978-1-4244-0032-4|s2cid=5953866}}{{cite journal | vauthors = Välimäki N, Gerlach W, Dixit K, Mäkinen V | title = Compressed suffix tree--a basis for genome-scale sequence analysis | journal = Bioinformatics | volume = 23 | issue = 5 | pages = 629–630 | date = March 2007 | pmid = 17237063 | doi = 10.1093/bioinformatics/btl681 | doi-access = free }}

== ''k''-mismatch average common substring approach (kmacs) ==

This approach is a generalization of the ACS approach. To define the distance between two DNA or protein sequences, kmacs estimates for each position i of the first sequence the longest substring starting at i and matching a substring of the second sequence with up to k mismatches. It defines the average of these values as a measure of similarity between the sequences and turns this into a symmetric distance measure. Kmacs does not compute exact k-mismatch substrings, since this would be computational too costly, but approximates such substrings.

== Mutation distances (Kr) ==

This approach is closely related to the ACS, which calculates the number of substitutions per site between two DNA sequences using the shortest

absent substring (termed as {{Not a typo|shustring}}).{{cite journal | vauthors = Haubold B, Pfaffelhuber P, Domazet-Loso M, Wiehe T | title = Estimating mutation distances from unaligned genomes | journal = Journal of Computational Biology | volume = 16 | issue = 10 | pages = 1487–1500 | date = October 2009 | pmid = 19803738 | doi = 10.1089/cmb.2009.0106 | hdl-access = free | hdl = 11858/00-001M-0000-000F-D624-D }}

== Length distribution of k-mismatch common substrings ==

This approach uses the program kmacs to calculate longest common substrings with up to k mismatches for a pair of DNA sequences. The phylogenetic distance between the sequences can then be estimated from a local maximum in the length distribution of the k-mismatch common substrings.{{cite journal | vauthors = Morgenstern B, Schöbel S, Leimeister CA | title = Phylogeny reconstruction based on the length distribution of k-mismatch common substrings | journal = Algorithms for Molecular Biology | volume = 12 | pages = 27 | date = 2017 | pmid = 29238399 | pmc = 5724348 | doi = 10.1186/s13015-017-0118-8 | doi-access = free }}

= Methods based on the number of (spaced) word matches =

== <math>D_2^S</math> and <math>D_2^*</math> ==

These approachese are variants of the D_2 statistics that counts the number of k-mer matches between two sequences. They improve the simple D_2 statistics by taking the background distribution of the compared sequences into account.{{cite journal | vauthors = Reinert G, Chew D, Sun F, Waterman MS | title = Alignment-free sequence comparison (I): statistics and power | journal = Journal of Computational Biology | volume = 16 | issue = 12 | pages = 1615–1634 | date = December 2009 | pmid = 20001252 | pmc = 2818754 | doi = 10.1089/cmb.2009.0198 }}

== MASH ==

This is an extremely fast method that uses the MinHash bottom sketch strategy for estimating the Jaccard index of the multi-sets of k-mers of two input sequences. That is, it estimates the ratio of k-mer matches to the total number of k-mers of the sequences. This can be used, in turn, to estimate the evolutionary distances between the compared sequences, measured as the number of substitutions per sequence position since the sequences evolved from their last common ancestor.{{cite journal | vauthors = Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM | title = Mash: fast genome and metagenome distance estimation using MinHash | journal = Genome Biology | volume = 17 | issue = 1 | pages = 132 | date = June 2016 | pmid = 27323842 | pmc = 4915045 | doi = 10.1186/s13059-016-0997-x | doi-access = free }}

== Slope-Tree ==

This approach calculates a distance value between two protein sequences based on the decay of the number of k-mer matches if k increases.{{cite journal | vauthors = Bromberg R, Grishin NV, Otwinowski Z | title = Phylogeny Reconstruction with Alignment-Free Method That Corrects for Horizontal Gene Transfer | journal = PLOS Computational Biology | volume = 12 | issue = 6 | pages = e1004985 | date = June 2016 | pmid = 27336403 | pmc = 4918981 | doi = 10.1371/journal.pcbi.1004985 | doi-access = free | bibcode = 2016PLSCB..12E4985B }}

== Slope-SpaM ==

This method calculates the number N_k of k-mer or spaced-word matches

(SpaM) for different values for the word length or number of match positions k in the underlying pattern, respectively. The slope of an affine-linear function F that depends on N_k is calculated to estimate the Jukes-Cantor distance between the input sequences .{{cite journal | vauthors = Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B | title = The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances | journal = PLOS ONE | volume = 15 | issue = 2 | pages = e0228070 | date = 2020 | pmid = 32040534 | pmc = 7010260 | doi = 10.1371/journal.pone.0228070 | doi-access = free | bibcode = 2020PLoSO..1528070R }}

== Skmer ==

Skmer calculates distances between species from unassembled sequencing reads. Similar to MASH, it uses the Jaccard index on the sets of k-mers from the input sequences. In contrast to MASH, the program is still accurate for low sequencing coverage, so it can be used for genome skimming.{{cite journal | vauthors = Sarmashghi S, Bohmann K, P Gilbert MT, Bafna V, Mirarab S | title = Skmer: assembly-free and alignment-free sample identification using genome skims | journal = Genome Biology | volume = 20 | issue = 1 | pages = 34 | date = February 2019 | pmid = 30760303 | pmc = 6374904 | doi = 10.1186/s13059-019-1632-4 | doi-access = free }}

= Methods based on micro-alignments =

Strictly spoken, these methods are not alignment-free. They are using simple gap-free micro-alignments where sequences are required to match at certain pre-defined positions. The positions aligned at the remaining positions of the micro-alignments where mismatches are allowed, are then used for phylogeny inference.

== Co-phylog ==

This method searches for so-called structures that are defined as pairs of k-mer matches between two DNA sequences that are one position apart in both sequences. The two k-mer matches are called the context, the position between them is called the object. Co-phylog then defines the distance between two sequences the fraction of such structures for which the two nucleotides in the object are different. The approach can be applied to unassembled sequencing reads.{{cite journal | vauthors = Yi H, Jin L | title = Co-phylog: an assembly-free phylogenomic approach for closely related organisms | journal = Nucleic Acids Research | volume = 41 | issue = 7 | pages = e75 | date = April 2013 | pmid = 23335788 | pmc = 3627563 | doi = 10.1093/nar/gkt003 }}

== andi ==

andi estimates phylogenetic distances between genomic sequences based on ungapped local alignments that are flanked by maximal exact word matches. Such word matches can be efficiently found using suffix arrays. The gapfree alignments between the exact word matches are then used to estimate phylogenetic distances between genome sequences. The resulting distance estimates are accurate for up to around 0.6 substitutions per position.{{cite journal | vauthors = Haubold B, Klötzl F, Pfaffelhuber P | title = andi: fast and accurate estimation of evolutionary distances between closely related genomes | journal = Bioinformatics | volume = 31 | issue = 8 | pages = 1169–1175 | date = April 2015 | pmid = 25504847 | doi = 10.1093/bioinformatics/btu815 | doi-access = free }}

== Filtered Spaced-Word Matches (FSWM) ==

FSWM uses a pre-defined binary pattern P representing so-called match positions and don't-care positions. For a pair of input DNA sequences, it then searches for spaced-word matches w.r.t. P, i.e. for local gap-free alignments with matching nucleotides at the match positions of P and possible mismatches at the don't-care positions. Spurious low-scoring spaced-word matches are discarded, evolutionary distances between the input sequences are estimated based on the nucleotides aligned to each other at the don't-care positions of the remaining, homologous spaced-word matches.{{cite journal | vauthors = Leimeister CA, Sohrabi-Jahromi S, Morgenstern B | title = Fast and accurate phylogeny reconstruction using filtered spaced-word matches | journal = Bioinformatics | volume = 33 | issue = 7 | pages = 971–979 | date = April 2017 | pmid = 28073754 | pmc = 5409309 | doi = 10.1093/bioinformatics/btw776 }} FSWM has been adapted to estimate distances based on unassembled NGS reads, this version of the program is called Read-SpaM.{{cite journal | vauthors = Lau AK, Dörrer S, Leimeister CA, Bleidorn C, Morgenstern B | title = Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage | journal = BMC Bioinformatics | volume = 20 | issue = Suppl 20 | pages = 638 | date = December 2019 | pmid = 31842735 | pmc = 6916211 | doi = 10.1186/s12859-019-3205-7 | doi-access = free }}

== Prot-SpaM ==

Prot-SpaM (Proteome-based Spaced-word Matches) is an implementation of the FSWM algorithm for partial or whole proteome sequences.{{cite journal | vauthors = Leimeister CA, Schellhorn J, Dörrer S, Gerth M, Bleidorn C, Morgenstern B | title = Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences | journal = GigaScience | volume = 8 | issue = 3 | pages = giy148 | date = March 2019 | pmid = 30535314 | pmc = 6436989 | doi = 10.1093/gigascience/giy148 }}

== Multi-SpaM ==

Multi-SpaM (MultipleSpaced-word Matches) is an approach to genome-based phylogeny reconstruction that extends the FSWM idea to multiple sequence comparison.{{cite journal | vauthors = Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B | title = 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees | journal = NAR Genomics and Bioinformatics | volume = 2 | issue = 1 | pages = lqz013 | date = March 2020 | pmid = 33575565 | pmc = 7671388 | doi = 10.1093/nargab/lqz013 | doi-access = free }} Given a binary pattern P of match positions and don't-care positions, the program searches for P-blocks, i.e. local gap-free four-way alignments with matching nucleotides at the match positions of P and possible mismatches at the don't-care positions. Such four-way alignments are randomly sampled from a set of input genome sequences. For each P-block, an unrooted tree topology is calculated using RAxML.{{cite journal | vauthors = Stamatakis A | title = RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models | journal = Bioinformatics | volume = 22 | issue = 21 | pages = 2688–2690 | date = November 2006 | pmid = 16928733 | doi = 10.1093/bioinformatics/btl446 | doi-access = free }} The program Quartet MaxCut is then used to calculate a supertree from these trees.

= Methods based on information theory =

Information Theory has provided successful methods for alignment-free sequence analysis and comparison. The existing applications of information theory include global and local characterization of DNA, RNA and proteins, estimating genome entropy to motif and region classification. It also holds promise in gene mapping, next-generation sequencing analysis and metagenomics.{{cite journal | vauthors = Vinga S | title = Information theory applications for biological sequence analysis | journal = Briefings in Bioinformatics | volume = 15 | issue = 3 | pages = 376–389 | date = May 2014 | pmid = 24058049 | pmc = 7109941 | doi = 10.1093/bib/bbt068 | doi-access = free }}

== Base–base correlation (BBC) ==

Base–base correlation (BBC) converts the genome sequence into a unique 16-dimensional numeric vector using the following equation,

: T_{ij}(K) = \sum_{\ell=1}^K P_{ij}(\ell) \cdot \log_2 \left ( \frac{P_{ij}(\ell)}{P_i P_j} \right)

The P_i and P_j denotes the probabilities of bases i and j in the genome. The P_{ij}(\ell) indicates the probability of bases i and j at distance in the genome. The parameter K indicates the maximum distance between the bases i and j. The variation in the values of 16 parameters reflect variation in the genome content and length.{{cite journal | vauthors = Liu Z, Meng J, Sun X | title = A novel feature-based method for whole genome phylogenetic analysis without alignment: application to HEV genotyping and subtyping | journal = Biochemical and Biophysical Research Communications | volume = 368 | issue = 2 | pages = 223–230 | date = April 2008 | pmid = 18230342 | doi = 10.1016/j.bbrc.2008.01.070 }}{{cite journal | vauthors = Liu ZH, Sun X | title = Coronavirus phylogeny based on base-base correlation | journal = International Journal of Bioinformatics Research and Applications | volume = 4 | issue = 2 | pages = 211–220 | year = 2008 | pmid = 18490264 | doi = 10.1504/ijbra.2008.018347 }}{{cite journal | vauthors = Cheng J, Zeng X, Ren G, Liu Z | title = CGAP: a new comprehensive platform for the comparative analysis of chloroplast genomes | journal = BMC Bioinformatics | volume = 14 | pages = 95 | date = March 2013 | pmid = 23496817 | pmc = 3636126 | doi = 10.1186/1471-2105-14-95 | doi-access = free }}

== Information correlation and partial information correlation (IC-PIC) ==

IC-PIC (information correlation and partial information correlation) based method employs the base correlation property of DNA sequence. IC and PIC were calculated using following formulas,

: IC_\ell = -2 \sum_i P_i \log_2 P_i + \sum_{ij} P_{ij} (\ell) \log_2 P_{ij} (\ell)

: PIC_{ij} (\ell) = (P_{ij} (\ell) - P_i P_j (\ell))^2

The final vector is obtained as follows:

: V = {IC_\ell \over PIC_{ij} (\ell)} \text{ where } \ell \isin \left \{ \ell_0, \ell_0 + 1, \ldots, \ell_0 + n \right \},

which defines the range of distance between bases.{{cite journal | vauthors = Gao Y, Luo L | title = Genome-based phylogeny of dsDNA viruses by a novel alignment-free method | journal = Gene | volume = 492 | issue = 1 | pages = 309–314 | date = January 2012 | pmid = 22100880 | doi = 10.1016/j.gene.2011.11.004 }}

The pairwise distance between sequences is calculated using Euclidean distance measure. The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA, etc..

== Compression ==

Examples are effective approximations to Kolmogorov complexity, for example Lempel-Ziv complexity. In general compression-based methods use the mutual information between the sequences. This is expressed in conditional Kolmogorov complexity, that is, the length of the shortest self-delimiting program required to generate a string given the prior knowledge of the other string. This measure has a relation to measuring k-words in a sequence, as they can be easily used to generate the sequence. It is sometimes a computationally intensive method. The theoretic basis for the Kolmogorov complexity approach was

laid by Bennett, Gacs, Li, Vitanyi, and Zurek (1998) by proposing the information distance.Bennett, C.H., Gacs, P., Li, M., Vitanyi, P. and Zurek, W., Information distance, IEEE Trans. Inform. Theory, 44, 1407--1423 The Kolmogorov complexity being incomputable it was approximated by compression algorithms. The better they compress the better they are. Li, Badger, Chen, Kwong,, Kearney, and Zhang (2001) used a non-optimal but normalized form of this approach,Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P. and

Zhang, H., (2001) An information-based sequence distance and

its application to whole mitochondrial genome phylogeny.

Bioinformatics, 17:(2001), 149--154 and the optimal normalized form by Li, Chen, Li, Ma, and Vitanyi (2003) appeared in M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitanyi.

The similarity metric, IEEE Trans. Inform. Th., 50:12(2004),

3250--3264 and more extensively and proven by Cilibrasi and Vitanyi (2005) in.R.L. Cilibrasi and P.M.B. Vitanyi, Clustering by compression,

IEEE Trans. Informat. Th., 51:4(2005), 1523--1545

Otu and Sayood (2003) used the Lempel-Ziv complexity method to construct five different distance measures for phylogenetic tree construction.{{cite journal | vauthors = Otu HH, Sayood K | title = A new sequence distance measure for phylogenetic tree construction | journal = Bioinformatics | volume = 19 | issue = 16 | pages = 2122–2130 | date = November 2003 | pmid = 14594718 | doi = 10.1093/bioinformatics/btg295 | doi-access = free }}

== Context modeling compression ==

In the context modeling complexity the next-symbol predictions, of one or more statistical models, are combined or competing to yield a prediction that is based on events recorded in the past. The algorithmic information content derived from each symbol prediction can be used to compute algorithmic information profiles with a time proportional to the length of the sequence. The process has been applied to DNA sequence analysis.{{cite journal | vauthors = Pinho AJ, Garcia SP, Pratas D, Ferreira PJ | title = DNA sequences at a glance | journal = PLOS ONE | volume = 8 | issue = 11 | pages = e79922 | date = Nov 21, 2013 | pmid = 24278218 | pmc = 3836782 | doi = 10.1371/journal.pone.0079922 | doi-access = free | bibcode = 2013PLoSO...879922P }}

= Methods based on graphical representation =

== Iterated maps ==

The use of iterated maps for sequence analysis was first introduced by HJ Jefferey in 1990 when he proposed to apply the Chaos Game to map genomic sequences into a unit square. That report coined the procedure as Chaos Game Representation (CGR). However, only 3 years later this approach was first dismissed as a projection of a Markov transition table by N Goldman.{{cite journal | vauthors = Goldman N | title = Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences | journal = Nucleic Acids Research | volume = 21 | issue = 10 | pages = 2487–2491 | date = May 1993 | pmid = 8506142 | pmc = 309551 | doi = 10.1093/nar/21.10.2487 }} This objection was overruled by the end of that decade when the opposite was found to be the case – that CGR bijectively maps Markov transition is into a fractal, order-free (degree-free) representation.{{cite journal | vauthors = Almeida JS, Carriço JA, Maretzek A, Noble PA, Fletcher M | title = Analysis of genomic sequences by Chaos Game Representation | journal = Bioinformatics | volume = 17 | issue = 5 | pages = 429–437 | date = May 2001 | pmid = 11331237 | doi = 10.1093/bioinformatics/17.5.429 | doi-access = free }} The realization that iterated maps provide a bijective map between the symbolic space and numeric space led to the identification of a variety of alignment-free approaches to sequence comparison and characterization. These developments were reviewed in late 2013 by JS Almeida in.{{cite journal | vauthors = Almeida JS | title = Sequence analysis by iterated maps, a review | journal = Briefings in Bioinformatics | volume = 15 | issue = 3 | pages = 369–375 | date = May 2014 | pmid = 24162172 | pmc = 4017330 | doi = 10.1093/bib/bbt072 }} A number of web apps such as https://github.com/usm/usm.github.com/wiki,{{cite journal | vauthors = Almeida JS, Grüneberg A, Maass W, Vinga S | title = Fractal MapReduce decomposition of sequence alignment | journal = Algorithms for Molecular Biology | volume = 7 | issue = 1 | pages = 12 | date = May 2012 | pmid = 22551205 | pmc = 3394223 | doi = 10.1186/1748-7188-7-12 | doi-access = free }} are available to demonstrate how to encode and compare arbitrary symbolic sequences in a manner that takes full advantage of modern MapReduce distribution developed for cloud computing.

Comparison of alignment based and alignment-free methods

class="wikitable"
Alignment-based methodsAlignment-free methods
These methods assume that homologous regions are contiguous (with gaps)Does not assume such contiguity of homologous regions
Computes all possible pairwise comparisons of sequences; hence computationally expensiveBased on occurrences of sub-sequences; composition; computationally inexpensive, can be memory-intensive
Well-established approach in phylogenomicsRelatively recent and application in phylogenomics is limited; needs further testing for robustness and scalability
Requires substitution/evolutionary modelsLess dependent on substitution/evolutionary models
Sensitive to stochastic sequence variation, recombination, horizontal (or lateral) genetic transfer, rate heterogeneity and sequences of varied lengths, especially when similarity lies in the "twilight zone"Less sensitive to stochastic sequence variation, recombination, horizontal (or lateral) genetic transfer, rate heterogeneity and sequences of varied lengths
Best practice uses inference algorithms with complexity at least O(n2); less time-efficientInference algorithms typically O(n2) or less; more time-efficient
Heuristic in nature; statistical significance of how alignment scores relate to homology is difficult to assessExact solutions; statistical significance of the sequence distances (and degree of similarity) can be readily assessed
Relies on dynamic programming (computationally expensive) to find alignment that has optimal score.

|side-steps computational expensive dynamic programming by indexing word counts or positions in fractal space.{{cite journal | vauthors = Vinga S, Carvalho AM, Francisco AP, Russo LM, Almeida JS | title = Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis | journal = Algorithms for Molecular Biology | volume = 7 | issue = 1 | pages = 10 | date = May 2012 | pmid = 22551152 | pmc = 3402988 | doi = 10.1186/1748-7188-7-10 | doi-access = free }}

Applications of alignment-free methods

  • Genomic rearrangements{{cite journal | vauthors = Pratas D, Silva RM, Pinho AJ, Ferreira PJ | title = An alignment-free method to find and visualise rearrangements between pairs of DNA sequences | journal = Scientific Reports | volume = 5 | pages = 10203 | date = May 2015 | pmid = 25984837 | pmc = 4434998 | doi = 10.1038/srep10203 | number = 10203 | bibcode = 2015NatSR...510203P }}{{cite journal | vauthors = Hosseini M, Pratas D, Morgenstern B, Pinho AJ | title = Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements | journal = GigaScience | volume = 9 | issue = 5 | pages = giaa048 | date = May 2020 | pmid = 32432328 | pmc = 7238676 | doi = 10.1093/gigascience/giaa048 | doi-access = free }}
  • Molecular phylogenetics{{cite journal | vauthors = Bernard G, Greenfield P, Ragan MA, Chan CX | title = k-mer Similarity, Networks of Microbial Genomes, and Taxonomic Rank | journal = mSystems | volume = 3 | issue = 6 | pages = e00257–18 | date = Nov 20, 2018 | pmid = 30505941 | pmc = 6247013 | doi = 10.1128/mSystems.00257-18 | ref = Bernard2 }}
  • Metagenomics{{cite journal | vauthors = Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F | title = New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing | journal = Briefings in Bioinformatics | volume = 15 | issue = 3 | pages = 343–353 | date = May 2014 | pmid = 24064230 | pmc = 4017329 | doi = 10.1093/bib/bbt067 }}{{cite journal | vauthors = Břinda K, Sykulski M, Kucherov G | title = Spaced seeds improve k-mer-based metagenomic classification | journal = Bioinformatics | volume = 31 | issue = 22 | pages = 3584–3592 | date = November 2015 | pmid = 26209798 | doi = 10.1093/bioinformatics/btv419 | arxiv = 1502.06256 | s2cid = 8626694 | bibcode = 2015Bioin..31.3584B }}{{cite journal | vauthors = Ounit R, Lonardi S | title = Higher classification sensitivity of short metagenomic reads with CLARK-S | journal = Bioinformatics | volume = 32 | issue = 24 | pages = 3823–3825 | date = December 2016 | pmid = 27540266 | doi = 10.1093/bioinformatics/btw542 | doi-access = free }}{{cite bioRxiv | vauthors = Pratas D, Pinho AJ, Silva RM, Rodrigues JM, Hosseini M, Caetano T, Ferreira PJ |title=FALCON: a method to infer metagenomic composition of ancient DNA |date=February 2018 |biorxiv= 10.1101/267179 }}{{cite journal | vauthors = Wood DE, Salzberg SL | title = Kraken: ultrafast metagenomic sequence classification using exact alignments | journal = Genome Biology | volume = 15 | issue = 3 | pages = R46 | date = March 2014 | pmid = 24580807 | pmc = 4053813 | doi = 10.1186/gb-2014-15-3-r46 | doi-access = free }}
  • Next generation sequence data analysis{{cite journal | vauthors = Noé L, Martin DE | title = A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances | journal = Journal of Computational Biology | volume = 21 | issue = 12 | pages = 947–963 | date = December 2014 | pmid = 25393923 | pmc = 4253314 | doi = 10.1089/cmb.2014.0173 | arxiv = 1412.2587 | bibcode = 2014arXiv1412.2587N }}
  • Epigenomics{{cite journal | vauthors = Pinello L, Lo Bosco G, Yuan GC | title = Applications of alignment-free methods in epigenomics | journal = Briefings in Bioinformatics | volume = 15 | issue = 3 | pages = 419–430 | date = May 2014 | pmid = 24197932 | pmc = 4017331 | doi = 10.1093/bib/bbt078 }}
  • Barcoding of species{{cite journal | vauthors = La Rosa M, Fiannaca A, Rizzo R, Urso A | title = Alignment-free analysis of barcode sequences by means of compression-based methods | journal = BMC Bioinformatics | volume = 14 | issue = Suppl 7 | pages = S4 | year = 2013 | pmid = 23815444 | pmc = 3633054 | doi = 10.1186/1471-2105-14-S7-S4 | doi-access = free }}
  • Population genetics
  • Horizontal gene transfer{{cite journal | vauthors = Domazet-Lošo M, Haubold B | title = Alignment-free detection of local similarity among viral and bacterial genomes | journal = Bioinformatics | volume = 27 | issue = 11 | pages = 1466–1472 | date = June 2011 | pmid = 21471011 | doi = 10.1093/bioinformatics/btr176 | doi-access = free }}
  • Sero/genotyping of viruses{{cite journal | vauthors = Kolekar P, Hake N, Kale M, Kulkarni-Kale U | title = WNV Typer: a server for genotyping of West Nile viruses using an alignment-free method based on a return time distribution | journal = Journal of Virological Methods | volume = 198 | pages = 41–55 | date = March 2014 | pmid = 24388930 | doi = 10.1016/j.jviromet.2013.12.012 | doi-access = free }}{{cite journal | vauthors = Struck D, Lawyer G, Ternes AM, Schmit JC, Bercoff DP | title = COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification | journal = Nucleic Acids Research | volume = 42 | issue = 18 | pages = e144 | date = October 2014 | pmid = 25120265 | pmc = 4191385 | doi = 10.1093/nar/gku739 }}
  • Allergenicity prediction{{cite journal | vauthors = Dimitrov I, Naneva L, Doytchinova I, Bangov I | title = AllergenFP: allergenicity prediction by descriptor fingerprints | journal = Bioinformatics | volume = 30 | issue = 6 | pages = 846–851 | date = March 2014 | pmid = 24167156 | doi = 10.1093/bioinformatics/btt619 | doi-access = free }}
  • SNP discovery{{cite journal | vauthors = Gardner SN, Hall BG | title = When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes | journal = PLOS ONE | volume = 8 | issue = 12 | pages = e81760 | date = Dec 9, 2013 | pmid = 24349125 | pmc = 3857212 | doi = 10.1371/journal.pone.0081760 | doi-access = free | bibcode = 2013PLoSO...881760G }}
  • Recombination detection{{cite journal | vauthors = Haubold B, Krause L, Horn T, Pfaffelhuber P | title = An alignment-free test for recombination | journal = Bioinformatics | volume = 29 | issue = 24 | pages = 3121–3127 | date = December 2013 | pmid = 24064419 | pmc = 5994939 | doi = 10.1093/bioinformatics/btt550 }}
  • Viral Classification {{cite journal | vauthors = Silva JM, Pratas D, Caetano T, Matos S | title = The complexity landscape of viral genomes | journal = GigaScience | volume = 11 | pages = 1–16 | date = August 2022 | pmid = 35950839 | pmc = 9366995 | doi = 10.1093/gigascience/giac079 }}
  • Archaea Taxonomic Identification{{Citation | vauthors = Silva JM, Pratas D, Caetano T, Matos S |title=Feature-Based Classification of Archaeal Sequences Using Compression-Based Methods |date=2022 |url=https://link.springer.com/10.1007/978-3-031-04881-4_25 |work=Pattern Recognition and Image Analysis |volume=13256 |pages=309–320 | veditors = Pinho AJ, Georgieva P, Teixeira LF, Sánchez JA |place=Cham |publisher=Springer International Publishing |language=en |doi=10.1007/978-3-031-04881-4_25 |isbn=978-3-031-04880-7 |access-date=2022-08-31 |series=Lecture Notes in Computer Science }}
  • Taxonomic Classification{{Cite journal |last1=Silva |first1=Jorge Miguel |last2=Almeida |first2=João Rafael |date=2024-10-01 |title=Enhancing metagenomic classification with compression-based features |url=https://linkinghub.elsevier.com/retrieve/pii/S0933365724001908 |journal=Artificial Intelligence in Medicine |volume=156 |pages=102948 |doi=10.1016/j.artmed.2024.102948 |pmid=39173422 |issn=0933-3657|doi-access=free }}
  • Temporal Analysis{{Cite journal |last1=Silva |first1=Jorge M |last2=Pinho |first2=Armando J |last3=Pratas |first3=Diogo |date=2024 |title=AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data |journal=GigaScience |volume=13 |doi=10.1093/gigascience/giae086 |issn=2047-217X |pmc=11590114 |pmid=39589438}}
  • Low-complexity Regions Identification{{cite journal | vauthors = Silva JM, Qi W, Pinho AJ, Pratas D | title = AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data | journal = GigaScience | volume = 12 | date = December 2022 | pmid = 38091509 | pmc = 10716826 | doi = 10.1093/gigascience/giad101 }}

List of web servers/software for alignment-free methods

class="wikitable"
NameDescriptionAvailabilityReference
ProtcompMost Expressed Features scoring approach[https://yadamp.unisa.it/PROTCOMP PROTCOMP]{{cite conference | vauthors = Di Biasi L, Piotto S | title = ARISE: Artificial Intelligence Semantic Search Engine | conference = WIVACE2021 }}
kmacsk-mismatch average common substring approach[http://kmacs.gobics.de/ kmacs]{{cite journal | vauthors = Leimeister CA, Morgenstern B | title = Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison | journal = Bioinformatics | volume = 30 | issue = 14 | pages = 2000–2008 | date = July 2014 | pmid = 24828656 | pmc = 4080746 | doi = 10.1093/bioinformatics/btu331 }}
Spaced wordsSpaced-word frequencies[http://spaced.gobics.de/ spaced-words]{{cite journal | vauthors = Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B | title = Fast alignment-free sequence comparison using spaced-word frequencies | journal = Bioinformatics | volume = 30 | issue = 14 | pages = 1991–1999 | date = July 2014 | pmid = 24700317 | pmc = 4080745 | doi = 10.1093/bioinformatics/btu177 }}
Co-phylogassembly-free micro-alignment approach[https://github.com/yhg926/co-phylog Co-phylog]
Prot-SpaMProteome-based spaced-word matches[https://github.com/jschellh/ProtSpaM Prot-SpaM]
FSWMFiltered Spaced-Word Matches[http://fswm.gobics.de/ FSWM]
FFPFeature frequency profile based phylogeny[http://sourceforge.net/projects/ffp-phylogeny/ FFP]
CVTreeComposition vector based server for phylogeny[http://tlife.fudan.edu.cn/cvtree/ CVTree]{{cite journal | vauthors = Xu Z, Hao B | title = CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes | journal = Nucleic Acids Research | volume = 37 | issue = Web Server issue | pages = W174–W178 | date = July 2009 | pmid = 19398429 | pmc = 2703908 | doi = 10.1093/nar/gkp278 }}
RTD PhylogenyReturn time distribution based server for phylogeny[http://bioinfo.net.in/RTD/home.html RTD Phylogeny]
AGPA multimethods web server for alignment-free genome phylogeny[https://web.archive.org/web/20140109135217/http://www.herbbol.org:8000/agp AGP]{{cite journal | vauthors = Cheng J, Cao F, Liu Z | title = AGP: a multimethods web server for alignment-free genome phylogeny | journal = Molecular Biology and Evolution | volume = 30 | issue = 5 | pages = 1032–1037 | date = May 2013 | pmid = 23389766 | pmc = 7574599 | doi = 10.1093/molbev/mst021 | doi-access = free }}
AlfyAlignment-free detection of local similarity among viral and bacterial genomes[http://guanine.evolbio.mpg.de/alfy/ Alfy]
decaf+pyDistancE Calculation using Alignment-Free methods in PYthon[http://acb.qfab.org/acb/decaf+py/ decaf+py]{{cite journal | vauthors = Höhl M, Rigoutsos I, Ragan MA | title = Pattern-based phylogenetic distance estimation and tree reconstruction | journal = Evolutionary Bioinformatics Online | volume = 2 | pages = 359–375 | date = February 2007 | pmid = 19455227 | pmc = 2674673 | bibcode = 2006q.bio.....5002H | arxiv = q-bio/0605002 }}
Dengue SubtyperGenotyping of Dengue viruses based on RTDDengue Subtyper
WNV TyperGenotyping of West nile viruses based on RTDWNV Typer
AllergenFPAllergenicity prediction by descriptor fingerprints[http://ddg-pharmfac.net/AllergenFP/ AllergenFP]
kSNP v2Alignment-Free SNP Discovery[http://sourceforge.net/projects/ksnp/ kSNP v2]
d2ToolsComparison of Metatranscriptomic Samples Based on k-Tuple Frequencies[https://code.google.com/p/d2-tools/ d2Tools]{{cite journal | vauthors = Wang Y, Liu L, Chen L, Chen T, Sun F | title = Comparison of metatranscriptomic samples based on k-tuple frequencies | journal = PLOS ONE | volume = 9 | issue = 1 | pages = e84348 | date = Jan 2, 2014 | pmid = 24392128 | pmc = 3879298 | doi = 10.1371/journal.pone.0084348 | doi-access = free | bibcode = 2014PLoSO...984348W }}
rushRecombination detection Using SHustrings[http://guanine.evolbio.mpg.de/rush/ rush]
smashGenomic rearrangements detection and visualisation[http://bioinformatics.ua.pt/software/smash/ smash]
Smash++Finding and visualizing genomic rearrangements[https://github.com/smortezah/smashpp Smash++]
GScompareOligonucleotide-based fast clustering of bacterial genomes[http://gscompare.ehu.eus/ GScompare]
COMETAlignment-free subtyping of HIV-1, HIV-2 and HCV viral sequences[http://comet.lih.lu/ COMET]
USM

|Fractal MapReduce decomposition of sequence alignment

|[https://usm.github.io usm.github.io]

|

FALCONAlignment-free method to infer metagenomic composition of ancient DNA[http://github.com/pratas/falcon FALCON]
KrakenTaxonomic classification using exact k-mer matches[http://github.com/DerrickWood/kraken2 Kraken 2]
EAGLEAn ultra-fast tool to find relative absent words in genomic data[https://github.com/cobilab/eagle EAGLE2]{{cite journal | vauthors = Pratas D, Silva JM | title = Persistent minimal sequences of SARS-CoV-2 | journal = Bioinformatics | volume = 36 | issue = 21 | pages = 5129–5132 | date = January 2021 | pmid = 32730589 | pmc = 7559010 | doi = 10.1093/bioinformatics/btaa686 | doi-access = free }}
CLCPhylogenetic trees using reference-free k-mer based matching[https://www.qiagenbioinformatics.com/products/clc-microbial-genomics-module/ CLC Microbial Genome Module]{{cite web |url=https://www.qiagenbioinformatics.com/products/clc-microbial-genomics-module/ |title= CLC Microbial Genomics Module |date=2019 |website=QIAGEN Bioinformatics}}
xgTaxonomy

|A tool for metagenomic classification that uses data compression algorithms to classify genomic sequences.

|[https://github.com/ieeta-pt/xgTaxonomy xgTaxonomy]

|

AlcoR

|An extremely efficient method for identifying and visualizing low-complexity regions in genomic and proteomic sequences.

|[https://cobilab.github.io/alcor/ AlcoR]

|

AltaiR

|A C toolkit for alignment-free and temporal analysis of multi-FASTA data.

|[https://github.com/cobilab/altair AltaiR]

|

See also

References