I wasn't the first (or the brightest) informatics geek to notice the similarity between the molecular structures of the various *omics disciplines and the concept of the textual corpus. A number of researchers here at Michigan are applying linguistic techniques in systems biology, so there's a natural curiosity about that similarity on my part, even if it's not where I spend my time.
Last week I came across a, dare I say, "textbook" example of the use of natural language processing (NLP) to advance genomics, in a news release from UC Berkeley entitled "Improved method for comparing genomes as well as written text." Here's a fairly large taste:
Taking a hint from the text comparison methods used to detect plagiarism in books, college papers and computer programs, University of California, Berkeley, researchers have developed an improved method for comparing whole genome sequences.
With nearly a thousand genomes partly or fully sequenced, scientists are jumping on comparative genomics as a way to construct evolutionary trees, trace disease susceptibility in populations, and even track down people's ancestry.
To date, the most common techniques have relied on comparing a limited number of highly conserved genes - no more than a couple dozen - in organisms that have all these genes in common.
The new method can be used to compare even distantly related organisms or organisms with genomes of vastly different sizes and diversity, and can compare the entire genome, not just a selected small fraction of the gene-containing portion known to code for proteins, which in the human genome is only 1 percent of the DNA.
The technique produces groupings of organisms largely consistent with current groupings, but with some interesting discrepancies, according to Sung-Hou Kim, professor of chemistry at UC Berkeley and faculty researcher at Lawrence Berkeley National Laboratory. However, the relative positions of the groups in the family tree - that is, how recently these groups evolved - are quite different from those based on conventional gene alignment methods.
The computational results have surprised scientists in being able to classify some bacteria and viruses that until now were enigmatic.
The real deal has appeared online in the Proceedings of the National Academy of Sciences; here's the citation:
I can't say I'm intimately familiar with the feature frequency profiles concept, but then I'm an NLP buff rather than an expert. The link above takes you to a page where you can download the preprint as a PDF, which has a few typos, but gives a fairly clear grad-student-reading-level description of the algorithm and how they optimize it for particular genomes. Cool stuff!
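For fellow NLP buffs, here's a rough sketch of how I understand the general idea: treat each genome like a document, count its short overlapping "words" (substrings of length k), normalize the counts into a frequency profile, and then measure how far apart two profiles are. To be clear, this is not the authors' code. The function names, the toy sequences, and the choice of Jensen-Shannon divergence as the distance measure are my own assumptions for illustration, and the step where the feature length is tuned for particular genomes is left out entirely.

```python
from collections import Counter
from math import log2


def feature_frequency_profile(sequence, k):
    """Return the relative frequency of every length-k substring in sequence.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if k < 1 or len(sequence) < k:
        raise ValueError("sequence must be at least k characters long")
    # Slide a window of width k along the sequence and count each feature.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles.

    Assumed distance measure: 0.0 for identical profiles, 1.0 for profiles
    that share no features at all.
    """
    divergence = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mf = (pf + qf) / 2  # frequency in the averaged profile
        if pf > 0:
            divergence += 0.5 * pf * log2(pf / mf)
        if qf > 0:
            divergence += 0.5 * qf * log2(qf / mf)
    return divergence


if __name__ == "__main__":
    # Toy sequences, made up for this example.
    a = feature_frequency_profile("ACGTACGTACGT", k=3)
    b = feature_frequency_profile("ACGTTCGTACGA", k=3)
    print(jensen_shannon_divergence(a, b))
```

Because the profiles are just normalized counts, two sequences of very different lengths can still be compared, which is the property the news release emphasizes. Swap the DNA strings for runs of characters or words and the same two functions work on ordinary text.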
I'm not sure how directly this will apply to other *omics disciplines, such as proteomics and metabolomics, which are where much of the real action is found in biomedical research. Still, it's an exciting development, and a harbinger of things to come, as the NLP community takes on the challenge of the greatest story ever told, the story of life as viewed through the lens of molecular biology.










