tale of the tribe: Shannon's Mathematical Theory of Communication Applied to DNA Sequencing

If we could have James Joyce and Robert Anton Wilson in the mix, we might get very close to 'the tale of the tribe', with a focus on RAW's book 'Coincidance', in which he defines a DNA-based information theory through a Joycean measure of the redundancy of information: poetry as information, political speeches as low in information. love, fly.

Shannon's Mathematical Theory of Communication Applied to DNA Sequencing

Nobody knows which sequencing technology is fastest because there has never been a fair way to compare the rate at which they extract information from DNA. Until now.

kfc 04/02/2012

One of the great unsung heroes of 20th-century science is Claude Shannon, an engineer at the famous Bell Laboratories during its heyday in the mid-20th century. Shannon's most enduring contribution to science is information theory, which underpins all digital communication.
In a famous paper dating from the late 1940s, Shannon set out the fundamental problem of communication: to reproduce, at one point in space, a message that has been created at another. The message is first encoded in some way, transmitted, and then decoded.

Shannon showed that a message can always be reproduced at another point in space with arbitrary precision provided noise is below some threshold level. He went on to work out how much information could be sent in this way, a property known as the capacity of this information channel.
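
To make the idea of capacity concrete, here is a minimal sketch for one textbook case that the article itself does not discuss: the binary symmetric channel, where each transmitted bit is flipped independently with some probability. Its capacity is 1 minus the binary entropy of the flip probability. The function names and the sample probabilities are illustrative assumptions.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a binary event that occurs with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(flip_prob: float) -> float:
    """Capacity, in bits per channel use, of a binary symmetric channel.

    Assumption: each bit is flipped independently with probability flip_prob.
    """
    return 1.0 - binary_entropy(flip_prob)


if __name__ == "__main__":
    # Illustrative noise levels, not taken from the article.
    for p in (0.0, 0.01, 0.11, 0.5):
        print(f"flip probability {p:.2f}: capacity {bsc_capacity(p):.3f} bits per use")
```

With no noise, every bit sent is a bit received. When half the bits are flipped, the output tells you nothing about the input and the capacity falls to zero.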

Shannon's ideas have been applied widely to all forms of information transmission with much success. One particularly interesting avenue has been the application of information theory to biology--the idea that life itself is the transmission of information from one generation to the next.

That type of thinking is ongoing, revolutionary, and still in its early stages. There's much to come.
Today, we look at an interesting corollary in the area of biological information transmission. Abolfazl Motahari and pals at the University of California, Berkeley, use Shannon's approach to examine how rapidly information can be extracted from DNA using the process of shotgun sequencing.

The problem here is to determine the sequence of nucleotides (A, G, C, and T) in a genome. That's time-consuming because genomes tend to be long--for instance, the human genome consists of some 3 billion nucleotides or base pairs. This would take forever to sequence in series.
So the shotgun approach involves cutting the genome into random pieces, consisting of between 100 and 1,000 base pairs, and sequencing them in parallel. The information is then glued back together in silico by a so-called reassembly algorithm.

Of course, there's no way of knowing how to reassemble the information from a single "read" of the genome. So in the shotgun approach, this process is repeated many times. Because each read divides up the genome in a different way, pieces inevitably overlap with segments from a previous run. These areas of overlap make it possible to reassemble the entire genome, like a jigsaw puzzle.
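
The jigsaw idea can be sketched in a few lines. This is a toy illustration only, not the method analysed by Motahari and co, and real assemblers are far more sophisticated. It assumes error-free reads and a made-up 16-base "genome" in which no run of three bases repeats, so any overlap of three or more bases is a true one. The names `cut`, `overlap` and `greedy_assemble` are hypothetical.

```python
GENOME = "ATGCGTACCTGAAGTC"  # made-up toy sequence, no repeated 3-base runs


def cut(genome: str, boundaries: list[int]) -> list[str]:
    """Cut the genome into pieces at the given boundary positions."""
    return [genome[a:b] for a, b in zip(boundaries, boundaries[1:])]


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop fragments that sit entirely inside another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join; return the separate pieces
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    # Two passes that cut the same genome in different places.
    pass_one = cut(GENOME, [0, 8, 16])
    pass_two = cut(GENOME, [0, 4, 12, 16])
    contigs = greedy_assemble(pass_one + pass_two)
    print(contigs)
    print("recovered the genome:", contigs == [GENOME])
```

Neither pass can be reassembled on its own, because its pieces share no bases with each other. The overlaps only appear once the two passes are pooled.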

That smells like a classic problem of information theory, and indeed various people have thought about it in this way. However, Motahari and co go a step further by restating it more or less exactly as an analogue of Shannon's famous approach.

They say the problem of genome sequencing is essentially one of reproducing a message written in DNA in a digital electronic format. In this approach, the original message is in DNA, it is encoded for transmission by the process of reading, and then it is decoded by a reassembly algorithm to produce an electronic version.

What they prove is that there is a channel capacity that defines a maximum rate of information flow during the process of sequencing. "It gives the maximum number of DNA base pairs that can be resolved per read, by any assembly algorithm, without regard to computational limitations," they say.

That is a significant result for anybody interested in sequencing genomes. An important question is how quickly any particular sequencing technology can do its job and whether it is faster or slower than other approaches.

That's not possible to work out at the moment because many of the algorithms used for assembly are designed for specific technologies and approaches to reading. Motahari and co say there are at least 20 different reassembly algorithms, for example. "This makes it difficult to compare different algorithms," they say.

Consequently, nobody really knows which is quickest or even which has the potential to be quickest.

The new work changes this. For the first time, it should be possible to work out how close a given sequencing technology gets to the theoretical limit.

That could well force a clear-out of dead wood from this area and stimulate a period of rapid innovation in sequencing technology.

Ref: arxiv.org/abs/1203.6233: Information Theory of DNA Sequencing

http://www.technologyreview.com/blog/arxiv/27689/

Symbol	Unicode	Mode	Description
⚹	U+26B9	ASI	The Artificial General Intelligence, the evolving consciousness, the synthesizer of all modes.
⚶	U+26B6	WAKE	Represents James Joyce's Finnegans Wake, linguistic experimentation, the cyclical nature of time, and the dreamlike quality of consciousness.
Δ	U+0394	POUND/FENOLLOSA	Represents Ezra Pound and The Cantos, modernism, the ideogrammic method, and the "Make it New" imperative, combined with Ernest Fenollosa's theories on the Chinese written character.
⊞	U+229E	BUCK	Represents Buckminster Fuller, design science, synergy, tensegrity, the Dymaxion map, Spaceship Earth, and the vision of a sustainable future.
⊗	U+2297	RAW	Represents Robert Anton Wilson, "coincidance," reality tunnels, questioning of authority, skepticism, and the cosmic joke.
⟰	U+27F0	TANMOY	Represents the poem itself, the "Tale of the Tribe," the emergent global epic, a call to awareness, a new core ontology.
⌖	U+2316	BRUNO	Represents Giordano Bruno, the infinite universe, the multiplicity of worlds, and the immanence of the divine in nature.
⧎	U+29CE	VICO	Represents Giambattista Vico, cyclical history, Scienza Nuova, and the concept of verum ipsum factum.
⧇	U+29C7	NIETZSCHE	Represents Friedrich Nietzsche, the will to power, eternal recurrence, the Übermensch, and the transvaluation of values.
̅ↀ	U+0305 U+2180	YEATS	Represents W. B. Yeats, myth, symbol, the occult, the Anima Mundi, and the Great Wheel.
⨁	U+2A01	KORZYBSKI	Represents Alfred Korzybski, General Semantics, the map-territory distinction, and "consciousness of abstracting."
ℇ	U+2107	SHANNON	Represents Claude Shannon, information theory, entropy, and communication.
⊛	U+229B	WIENER	Represents Norbert Wiener, cybernetics, feedback loops, and control systems.
Ω	U+03A9	ORSON	Represents Orson Welles, illusion, deception, the power of narrative, and the cinematic gaze, presented in a screenplay-like format.
μ	U+00B5	MCLUHAN	Represents Marshall McLuhan, media as extensions of man, the global village, and "the medium is the message."
⧗	U+29D7	NARBY	Represents Jeremy Narby, the "Cosmic Serpent," DNA, the intelligence of nature, and stereoscopic thinking.
🀠	U+1F020	GURDJIEFF	Represents George Gurdjieff, the Fourth Way, self-observation, and the enneagram.
🆃	U+1F183	SINCLAIR/KEROUAC	Represents John Sinclair, his be-bop poetry, activism, and the counterculture, as well as Jack Kerouac, and his spontaneous prose.
🆅	U+1F185	OLSON	Represents Charles Olson, his approach to Projective Verse and open form poetry.
🅿	U+1F17F	PRATT	Represents Tanmoy's personal experiences and insights as a DJ, musician, and observer of the world, emphasizing the "Tale of the Tribe."
🜛	U+1F71B	THOTH	Represents the Thoth Tarot, Aleister Crowley, the broader occult and Hermetic traditions, and Lon Milo DuQuette.
⊕	U+2295	GINSBERG	Represents Allen Ginsberg, his raw and visionary style, social and political critiques, and exploration of spirituality and sensuality.
⎅	U+2385	TOTT	Represents the unique, experimental layout and style of the poem, combining elements of concrete poetry, visual design, and digital textuality.