[Paleopsych] Gene Finding with Hidden Markov Models
shovland at mindspring.com
Wed Mar 30 12:14:40 UTC 2005
The application of phylogeny to HMMs is improving gene annotation | By
Karen Heyman <mailto:kheyman at the-scientist.com>
Haemophilus influenzae made history in 1995 when it became the first
free-living organism to have its genome completely sequenced. In the decade
since, some 180 or so organisms have followed suit.
For every one of these genomes, the sequence is only the beginning. The
challenge for the computational biologists charged with making sense of the
data: to find the gene sequences hidden within those strings, billions of
bases long, of As, Cs, Gs, and Ts. The genome annotation strategies these
computer scientists cum biologists have developed clearly have come a long
way. The most recent iteration (version 4.0) of the Drosophila genome
annotation, for instance, updated only 25 predictions out of 13,472
But improvements can still be made. "If they were 100% reliable, then they
would have been run on the April 2003 complete human sequence and that
would've been it. Those would have been your genes," says David Haussler,
who directs the Center for Biomolecular Science and Engineering at the
University of California, Santa Cruz. Instead, the human gene count has
been downsized by nearly one-third, from 35,000 genes to closer to 25,000.
Much of the field's slow progress stems directly from the contradiction at
the heart of the field of computational biology: the attempt to wed the
logic of math to the messiness of biology. Algorithms can be simple,
elegant, and fast, but they may not capture biology, which is often
anything but simple.
"I think what people overlooked was the complexity of biology," says
Zhirong Bao, coauthor with Sean Eddy of Recon, a software program for
identifying repeat sequences. "They start with simple and beautiful graph
theories and implement them in software, assuming that all their DNA
sequences are going to behave like their perfect nodes in their perfect
Now as more genomes are sequenced, researchers are looking to bring their
computations in line with the underlying biology. They are creating
software that incorporates phylogenetics, the descriptions of evolutionary
distance, into the field's favorite computational tool, the hidden Markov
LEGACY LEGOS Eddy, associate professor at Washington University in St.
Louis and coauthor of the textbook, Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids (Cambridge University
Press), has dubbed HMMs "the Legos of computational sequence analysis."1
HMMs are special instances of graphical models, which were originally
developed by computer scientists studying machine learning and speech
recognition. In technical parlance, says Eddy, HMMs "describe a probability
distribution over an infinite number of sequences." To the uninitiated,
they resemble a cross between a flow chart and a doodle. In order to
understand conceptually how HMMs work, consider their origin in speech r
ecognition, says Haussler. In that field, a computer is asked, given a
speech wave, what are the phonemes (sounds) that it encodes. The wave is
the measured signal; the phonemes are the "hidden" signals that give the
HMM its name.
"There is a probabilistic relationship between phonemes," Haussler
explains. "After a 'th' sound can easily come an 'r' or an 'ah' or several
other types of sounds, but not, for example, a 'k' sound. A hidden Markov
model for speech incorporates all possible phonemes, and for each phoneme
the probability that it's followed by any other phoneme."
Haussler says the HMM also "models the stochastic relationship between each
phoneme and the speech wave one might measure for it. In this way it can be
used to infer the sequence of phonemes that best fits a given segment of
recorded speech." Translating that to molecular biology, he explains, the
measured signal is the sequence of nucleotides, while the hidden signal is
their function. "Biology is trying to speak a language to us, and the HMM
model is helping us to distinguish the phonemes of that language."
Structurally, says Lincoln Stein of Cold Spring Harbor Laboratory, "HMMs
work when you have information that moves from left to right, [when] the
probability of an event to the right is dependent on previous events to the
left." That logical structure ensures that almost all annotation problems
can be cast as HMMs, says Eddy. "You've got an observed residue, and
there's one or more labels [or 'states'] that you can attach to it." He
continues, "You don't know what the states are. So as you move from left to
right, you ask, 'what state am I in,' which is 'what label do I attach to
this residue now?"' So, just as in speech recognition, there is a measured
signal (the residue) and a hidden one (its label or function).
That's not to say HMMs will solve every problem, says Stein. Noncoding RNA
genes, for instance, in which function is encoded both in sequence and in
secondary structure, present a particular problem. "The problem with
folding is that there's a kind of symmetry. You move left to right and then
right to left, depending on where you are in the fold," he says.
Taking their cue from computational linguistics, biologists address these
and similar problems using stochastic context-free grammars. SCFGs are
really "just HMMs extended to a second dimension," says Eddy. "A lot of the
concepts are the same in HMMs and SCFGs. SCFG algorithms are just a little
nastier to work with, much more memory and compute time needed, which is
the big thing limiting us from deploying SCFGs on every cool computational
RNA problem we can think of."
Courtesy of Sean Eddy PLAN 7 FROM OUTER SPACE? This diagram shows the
Plan7 architecture used in the gene finding software, HMMER. Squares
indicate match states (modeling consensus positions in the alignment).
Diamonds indicate insert states (modeling insertions relative to consensus)
and special random sequence emitting states. Circles indicate delete states
(modeling deletions relative to consensus) and special begin/end states.
Arrows indicate state transitions.
HMMS EVOLVE Molecular biologists use the term generically, yet many classes
or implementations of HMM exist, each of which is adapted to particular
annotation challenges. Eddy's own work concentrates on identifying
structural and catalytic RNAs. His lab uses a subclass of HMMs called
profile HMMs, which were presented in 1992 by Haussler and Anders Krogh as
an alternative to profiles, the then-popular approach to protein
"A profile is the technical term for the model used to represent a family
of related proteins," says Haussler. Essentially, he says, it is "a
computer trick" for deciding how to classify a given protein. But profiles
turned out to be incomplete models, and Krogh and Haussler's innovation was
to attack the same problem using HMMs. They were not the first researchers
to apply HMMs to biology, but Krogh and Haussler's work2 on profile HMMs
electrified the field. "Their technical report flew through the community,"
Eddy recalls. Today, profile HMMs drive the popular protein
sequence-analysis software, HMMER.
Now Haussler, along with other researchers in the United States and Europe,
is again generating excitement in the annotation community. This time the
hubbub surrounds so-called PhyloHMMs, implementations that combine HMMs
with phylogenetics to drive better gene prediction.3 Among their more
popular current incarnations is the "conservation graph" on the UCSC Genome
Browser. Haussler's graduate student, Adam Siepel, created the conservation
PhyloHMMs should help with the "sequence weighting" problem that has
plagued earlier attempts to use HMMs to extract information from multiple
sequences. "If you just make a naive kind of assumption about these
sequences all being independent, you introduce biases into your analyses,"
The original workaround was to assign arbitrary weights. "People came up
with these ad hoc ways of adding weights," says Siepel. For instance, if
the human genome was being compared to those of the chimp and the mouse,
the researcher might assign less weight to the chimp sequence than the
mouse, because the two primates are so closely related that it can be
difficult to identify which sequences are under evolutionary pressure. Eddy
says that these models have so little genuine statistical basis that he
calls them "sequence weighting crap."
? 2005 Nature Publishing Group WHERE'S THE SPLICE SITE? This "toy" HMM is
trying to find the 5' splice site (5) in a sequence containing an exon
(E)-intron (I) boundary. (Reprinted with permission from S. Eddy, Nat
Biotechnol, 22:1315-6, 2004.)
By including phylogeny in the model, though, the weighting begins to make
biological sense, says Eddy. "With sequence-to-sequence phylogenetic
correlation, because the species are actually related to each other, the
number of parameters goes down. There's a phylogenetic tree that now says
how these sequences are related to each other." Still he warns, "The
mathematics is a complicated beast, but this is where the field has to go."
Haussler and Siepel4 have used PhyloHMMs to create software they call
"ExoniPhy," which detects new, novel sequences that are conserved from
genomes by modeling how codons evolve. "We're trying to find genes that are
not supported by cDNA evidence, not supported by sequenced mRNAs or ESTs,"
says Siepel, "Many genes, we believe, are being missed by the experimental
techniques that are used to fish out mRNAs. One of the problems is that
these genes are very rarely expressed or they're expressed at very low
levels, so for example you might have a gene that's turned on in the ninth
day of development for a few hours and then afterwards it's turned off, and
so it never shows up in these libraries of mRNA."
A phylogenetic approach assumes that a critical gene, even a rarely
expressed one, will be highly conserved. "Even if it's hard to find the
mRNA, it's going to have an evolutionary signature in the genome, so we're
trying to use these evolutionary footprints to help find genes that haven't
been able to be found," concludes Siepel. He says his team believes it has
detected "at least several hundred genes" that are strongly supported by
the evolutionary/cross-species evidence, but they are not supported or only
very weakly supported by cDNA evidence. They are now working on validating
their predictions in the lab.
TWINSCAN 3.0 Other researchers, including Michael Brent, are trying similar
approaches. Brent, a colleague of Eddy's at Washington University, created
TWINSCAN, a gene-prediction tool that works by comparing two genomes. His
lab is now working on a TWINSCAN update dubbed N-SCAN.5 As the name
implies, N-SCAN allows comparisons between unlimited numbers of genomes.
Tests of the program have been encouraging, Brent says. "With N-SCAN, we're
now up to more than one-third of the known genes [in the human genome]
predicted exactly right. With more compact genomes like worms, we can get
more than 60% of the known genomes exactly right."
Courtesy of Adam Siepel A REAL-WORLD PHYLO-HMM: The conservation graph
from the University of California, Santa Cruz, human genome browser
[<http://genome.ucsc.edu>] is based on a PhyloHMM. This plot shows the
known gene structure alongside the computer gene predictions and genomic
comparisons with other species for the MET proto-oncogene. Boxes represent
exons, while lines with chevrons represent introns.
Like ExoniPhy, N-SCAN combines HMM work with phylogenetic concepts, but
Brent cites important distinctions. "The ExoniPhy model is based on an
improvement to the classical continuous-time Markov chain model of DNA
evolution. The key to that is that you have a phylogenetic tree which has
branch lengths that represent the amount of evolution that's gone on since
the common ancestor, but the pattern of evolution is the same throughout
the entire tree."
In other words, the model assumes that the tolerance for substituting a C
for a T in a particular position doesn't change throughout evolutionary
history. Instead, what changes is how much time, hence opportunity, there
has been for such a substitution to actually occur. (Haussler says Siepel
is currently writing code that relaxes that assumption.)
By contrast, says Brent, "Our model is more flexible; it doesn't have that
assumption. It allows for one branch to develop a separate pattern of
evolution at a given site from another branch." Nevertheless, he praises
ExoniPhy as "elegant" and says, "It has a higher specificity, a lower
false-positive rate on exons than anybody else."
Such subtle distinctions ensure that no one program will be best suited for
every application, says Lior Pachter, a mathematician at UC-Berkeley, and
author of SLAM,7 which does simultaneous gene finding and sequence align
ment. Indeed, the very idea of comparing various programs amuses Pachter.
"People like to write programs like SLAM or TWINSCAN and then they compare
them to other programs, and always in their comparisons they come out
getting better results," he says. "But of course, it's impossible that
everybody has the best program."
>From his experiences working on the mouse genome, Pachter observes, "There
are certain genes that have certain properties that make them easy to
predict and then we all agree, but there are a lot of cases where we
disagree." He advises biologists, "It's always a good idea to try all the
available tools. It's not always that one program is universally better
than the other in every case."
Consider phylogenetics-based programs, for instance. Roderic Guigo of the
Universitat Pompeu Fabra in Catalonia, Spain, developer of the comparative
gene-prediction program called SGP2,6 notes that the phylogenetics data
that make PhyloHMM-based programs so powerful are also their Achilles'
heel. "While in general more accurate than purely ab initio methods," he
writes in an E-mail, "a drawback of the PhyloHMM gene predictors is that
they require the sequence of one or more genomes at the appropriate
phylogenetic distance: C. elegans and C. brigasse, human and mouse, Fugu
and Tetraodon, etc. [On occasion], such a second sequence does not exist."
One thing is certain: Even when these computer tools are finally perfected,
biology will still be done at the bench, not at the keyboard. "Ultimately
all computer predictions must be tested in the laboratory," says Haussler,
"The more you predict, the more wet lab validation there is to be done."
While lay people may find it ego-deflating that the gene set keeps
shrinking, experimental biologists may find it something of a relief.
1. SR Eddy "What is a hidden Markov model?" Nat Biotechnol 2004, 22:
1315-6. [PubMed Abstract
<http://www.biomedcentral.com/pubmed/15470472>][Publisher Full Text
2. A Krogh et al, "Hidden Markov models in computational biology:
Applications to protein modeling," J Mol Biol 1994, 235: 1501-31. [PubMed
Abstract <http://www.biomedcentral.com/pubmed/8107089>][Publisher Full Text
3. A Siepel, D Haussler "Phylogenetic hidden Markov models," Statistical
Methods in Molecular Evolution (Edited by: Nielsen R). New York: Springer
4. A Siepel, D Haussler "Computational identification of evolutionarily
conserved exons," <http://www.cse.ucsc.edu/~acs/recomb2004.pdf> Proceedings
of the 8th Annual International Conference on Research in Computational
Biology (RECOMB) 2004.,
5. SS Gross, MR Brent "Using multiple alignments to improve gene
prediction," Proceedings of RECOMB 2005, in press. ,
6. M Alexandersson et al, "SLAM: Cross-species gene finding and alignment
with a generalized pair hidden Markov model," Genome Re 2003, 13: 496-502.
[Publisher Full Text <http://dx.doi.org/10.1101/gr.424203>]
7. G Parra et al, "Comparative gene prediction in human and mouse," Genome
Res 2003, 13: 108-17. [PubMed Abstract
<http://www.biomedcentral.com/pubmed/12529313>][Publisher Full Text
prlinks&retmode=ref&id=12529313>][PubMed Central Full Text
More information about the paleopsych