Author Topic: 700 terabytes of data fit in 1 gram of DNA  (Read 574 times)

HandsomeGirl

  • Known
  • *
  • Posts: 648
  • I masturbate furiously to hentai. all day. <- True
    • View Profile
700 terabytes of data fit in 1 gram of DNA
« on: August 25, 2012, 08:28:54 pm »
Looked and didn't see this here, thought it might be appreciated:

http://www.extremetech.com/extreme/134672-harvard-cracks-dna-storage-crams-700-terabytes-of-data-into-a-single-gram

I must admit I really don't understand it all. It led me to look up how much DNA one person's body contains, which is apparently between 6 and 60 grams.  I don't understand the maths on that either, so I'm utterly useless in discussing this topic until I find out more.  Blows my mind, though.  Learned a bit from the read.
I've nothing worthwhile to put here.

Golden Applesauce

  • Token Apologist
  • Deserved It
  • ****
  • Posts: 22298
  • Where does this text go?
    • View Profile
Re: 700 terabytes of data fit in 1 gram of DNA
« Reply #1 on: August 26, 2012, 04:28:14 am »
The first question you're going to ask is "If you can fit 700 TB of data into a gram of DNA, and the human body has 6-60g of DNA in it, how much data is encoded in human DNA?" According to Wikipedia, the source of all truth:

Quote from: https://en.wikipedia.org/wiki/Human_genome
The haploid human genome (23 chromosomes) is estimated to be about 3.2 billion base pairs long and to contain 20,000–25,000 distinct genes. Since every base pair can be coded by 2 bits, this is about 800 megabytes of data. Since individual genomes vary by less than 1% from each other, the variations of a given human's genome from a common reference can be losslessly compressed to roughly 4 megabytes.

There are two numbers here:
800 MB is approximately the upper bound on amount of information that 3.2 billion arbitrary base pairs of DNA can hold.
4 MB is an upper bound on the true "information content" of a given individual human's genome, given that we know it's a human genome.
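
The 800 MB figure can be checked with a few lines of arithmetic. This is a minimal sketch in Python; the variable names are mine, and the only inputs are the 3.2 billion base pairs and 2 bits per base pair from the Wikipedia quote. Note that the result only rounds to "800 MB" if a megabyte is counted as one million bytes.

```python
BASE_PAIRS = 3_200_000_000  # haploid human genome, per the Wikipedia quote
BITS_PER_BASE = 2           # four possible bases -> 2 bits each

total_bits = BASE_PAIRS * BITS_PER_BASE
total_bytes = total_bits // 8  # 8 bits per byte

print(total_bytes)            # 800000000 bytes
print(total_bytes / 10**6)    # 800.0 megabytes at 1,000,000 bytes each
print(total_bytes / 2**20)    # roughly 763 if a megabyte is 1024 * 1024 bytes
```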

To explain that, we need to back up to what megabytes and terabytes really are. A "bit" is the basic unit in which information is measured. 1 bit = the amount of information needed to describe the result of a fair coin flip (i.e., either "heads" or "tails.") 2 bits is the amount of information needed to describe two coin flips ("two heads", "heads then tails", "tails then heads", "two tails"). Actually, describing anything that has four equally probable outcomes is two bits' worth of information - so telling someone whether a DNA base is adenine, thymine, guanine, or cytosine is still two bits of information. That's what Wikipedia means when it says "every base pair can be coded by 2 bits." And from there, 8 bits = 1 byte, 1024 bytes = 1 kilobyte (kB), 1024 kB = 1 megabyte (MB), 1024 MB = 1 GB, and 1024 GB = 1 TB and so on.
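
To make "2 bits per base" concrete, here is a small Python sketch that packs four bases into each byte and unpacks them again. The mapping of letters to bit patterns is an arbitrary choice of mine (any one-to-one assignment works), and the function names are made up for illustration.

```python
import math

# Arbitrary assignment of the four bases to the four 2-bit patterns.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string matches the code above

def bits_for(outcomes):
    """Bits needed to name one of `outcomes` equally likely results."""
    return math.log2(outcomes)

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    if len(seq) % 4:
        raise ValueError("this sketch needs a length that is a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]  # shift in 2 bits per base
        out.append(byte)
    return bytes(out)

def unpack(data):
    """Reverse of pack: read four 2-bit codes out of each byte."""
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(byte >> shift) & 0b11])
    return "".join(seq)

print(bits_for(2), bits_for(4))   # 1.0 2.0
print(pack("GATTACAT").hex())     # 8f13
print(unpack(pack("GATTACAT")))   # GATTACAT
```

Eight bases fit in two bytes, which is the same ratio that turns 3.2 billion base pairs into 800 million bytes.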

The key here is the information needed to describe your coin flips or what have you. I could write a thousand page novel about the result of a coin toss, or I could grunt once for heads and not grunt for tails - it's still one bit of information either way. Just as mailing one pound of merchandise with ten pounds of packing peanuts doesn't give you eleven pounds of goods, being inefficient in communication/storage doesn't mean you are handling more information.

There are two reasons why the total informational content of a given human's DNA is so much lower than you'd predict by multiplying 700 TB/g * grams DNA. The first is that DNA is really, really redundant. Every cell* in your body has a copy of the same 3.2 billion base pair DNA. Remember, information content is only the absolute minimum needed to describe something. Since every cell has the same DNA, you can describe the whole body's DNA by describing one cell's DNA and then saying "...and then repeat that again for every cell in the body." That message doesn't get any longer or more complicated the more cells you add, so the information content of a whole human body's worth of DNA is the same as the information in a single cell. 1 cell has 3.2 billion base pairs which is at most 800 MB of information.
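
You can watch the redundancy argument play out with an ordinary compressor. The sketch below is only an illustration: the 1000-base "genome" is random made-up data, the 1000 "cells" are just repeated copies of it, and zlib stands in for "the shortest description you can find", which it only approximates.

```python
import random
import zlib

random.seed(0)  # fixed seed so the sketch is repeatable

# A made-up 1000-base "genome"; real genomes are far longer.
genome = "".join(random.choice("ACGT") for _ in range(1000)).encode()

one_cell = zlib.compress(genome, 9)
many_cells = zlib.compress(genome * 1000, 9)  # 1000 identical copies

print("one copy:   ", len(genome), "bytes raw ->", len(one_cell), "compressed")
print("1000 copies:", len(genome) * 1000, "bytes raw ->", len(many_cells), "compressed")
```

The raw size grows a thousandfold, but the compressed size grows far less, because each extra copy is little more than "same again."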

The second reason is that natural DNA is very predictable. The base pairs aren't random sequences; there are a lot of cases where if you were to read a sequence of base pairs to a biologist, she'd be able to guess with very high accuracy what the next set of base pairs are. For example, if you had the first half of the ALB gene, you'd know that the next set would have to be the rest of the ALB gene. Further, she'd know that we were discussing a human genome, since ALB is the human version of a gene found in all mammals, and therefore she'd know all of the other genes that are common to all humans were present, etc. Just saying "This DNA came from a real human" would tell her 99.5% of what she'd need to know to reconstruct that DNA.** All she needs after that is the ~4 MB of information which describes how this particular human's DNA differs from the theoretical "average human" DNA. Since we only need 4 MB of information to fully describe a human's DNA to the appropriate expert, the actual information content of the DNA in a human body is at most 4 MB.
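
The "differences from an average human" idea is the same thing as a diff against a reference. Here is a toy Python sketch; the sequences are made up, the function names are mine, and it only handles single-base substitutions (real genomes also differ by insertions and deletions, which this ignores).

```python
def diff_against_reference(reference, sample):
    """List (position, base) pairs where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles substitutions")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

def apply_diff(reference, diff):
    """Rebuild the sample from the shared reference plus its diff."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)

# Made-up stand-ins for the "average human" and one individual.
reference = "ACGTACGTACGTACGTACGT"
sample = "ACGTACGAACGTACGTCCGT"

diff = diff_against_reference(reference, sample)
print(diff)                                   # [(7, 'A'), (16, 'C')]
print(apply_diff(reference, diff) == sample)  # True
```

Anyone who already holds the reference only needs the short diff, which is why the per-person figure is so much smaller than the full 800 MB.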

*except sperm and eggs, which only have subsets of a normal cell's DNA. And damaged / mutated cells, of course.
**Humans, weirdly, have a lot less genetic diversity than other mammals. There's a theory that humans went very nearly extinct - down to thousands of individuals - before leaving Africa, and that maybe only a thousand people actually left Africa in the initial colonization of Eurasia. Which makes us all inbred as fuck.
« Last Edit: August 26, 2012, 04:55:35 am by Golden Applesauce »
Q: How regularly do you hire 8th graders?
A: We have hired a number of FORMER 8th graders.

Pæs

  • Grabby-Girl Squadron Commander for Lowland Operations™
  • Deserved It
  • ****
  • Posts: 31300
  • I ain't even mad.
    • View Profile
Re: 700 terabytes of data fit in 1 gram of DNA
« Reply #2 on: August 26, 2012, 04:35:06 am »
That was interesting but the first question I was going to ask was "how do I use DNA to store my pirated movies and pr0n?"

Golden Applesauce

Re: 700 terabytes of data fit in 1 gram of DNA
« Reply #3 on: August 26, 2012, 05:01:59 am »
Your biggest problem is that every time you want to watch your movie, you have to destroy your DNA storage in order to read it and then write it back out again afterwards. That's easy when you're flipping magnetic bits, hard and time consuming when it involves chemical synthesis. You could do the writing part ahead of time and make lots of one-use duplicates, I suppose, which gets you this neat beaker of DNA that slowly empties as you read data out of it.

HandsomeGirl

Re: 700 terabytes of data fit in 1 gram of DNA
« Reply #4 on: August 26, 2012, 07:24:17 pm »
Okay Golden Applesauce, you have been immensely helpful.  For that, I thank you.  I think I understand this far more than I did before.  I believe I am at least ready to start asking questions, based on what you've given me and what I've read.

Now, am I correct in thinking storing this data in living, organic DNA would fuck a person up severely?  As they did this with synthetic DNA, I would think so.  Even though we're using a minimum of information repeated over and over, subtracting or adding to that would cause mutation and rejection by the body, right?

Golden Applesauce

Re: 700 terabytes of data fit in 1 gram of DNA
« Reply #5 on: August 26, 2012, 09:19:42 pm »
If you were to replace the DNA in a cell nucleus wholesale with synthetic DNA that encodes a computer file or something, the worst thing that would happen would be that that particular cell would die. If you injected it somewhere else, like in blood plasma or some other extracellular space, I think it would either be ignored or broken down. It might even get eliminated by the same mechanisms that protect you from viruses.

If it were spliced into an existing cell's DNA very cleverly, the cell would probably be fine. There are two ways for something to go wrong, both avoidable. The first would be that you would destroy an existing vital function by scrambling it with the new synthetic DNA. You'd get around this by finding some section of unused DNA* and splicing the new DNA in the middle, so you don't break anything important. The other problem would be that your synthetic DNA might contain sub-sequences that are meaningful in the biology of the target cell. You'd need to avoid the sequence that says "What follows next is a gene; make a protein out of it, please" for starters. There are also proteins that bind to specific segments of DNA; if your synthetic DNA contained a large number of those segments, maybe all of that protein would stick to the fake DNA instead of to the real DNA where it's supposed to and something could go wrong? That's solvable, though: you just need to come up with a way to "quote" your input data so the cell machinery doesn't read it as meaning something else. Even if you did mess up, though, I wouldn't worry too much - cells are very good at self destructing when something goes wrong with their DNA to avoid messing up the rest of the body.
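
One way the "quoting" could work, sketched in Python under heavy assumptions: spend one base per bit instead of two bits per base, so every bit can be written as either of two bases, and use that freedom to dodge sequences you don't want. The forbidden motifs below are placeholders I made up, not real biological signals, and I'm not claiming this is how the Harvard group or anyone else actually does it. The cost is half the storage density.

```python
# Placeholder motifs standing in for "sequences the cell would react to".
FORBIDDEN = ("GAGA", "TATA")

CHOICES = {0: "AC", 1: "GT"}  # each bit can be written as either base
BIT_OF = {"A": 0, "C": 0, "G": 1, "T": 1}

def encode(bits):
    """Write bits as bases, never letting a forbidden motif appear."""
    seq = ""
    for bit in bits:
        for base in CHOICES[bit]:
            candidate = seq + base
            # A motif can only be completed at the end of the sequence so far.
            if not any(candidate.endswith(m) for m in FORBIDDEN):
                seq = candidate
                break
        else:
            raise ValueError("both bases would complete a forbidden motif")
    return seq

def decode(seq):
    """Read the bits back; which of the two bases was used doesn't matter."""
    return [BIT_OF[base] for base in seq]

bits = [1, 0] * 8
encoded = encode(bits)
print(encoded)
print(decode(encoded) == bits)  # True
```

With a longer or nastier motif list the greedy choice can paint itself into a corner, which is what the ValueError is for; a real scheme would need something smarter.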

The larger question, though, is why are you trying to store this in a person? What's cool and exciting about DNA as a storage medium is that it's very information dense. Bringing along all the baggage of a complex organism brings a lot of weight for no benefit. A culture of specially engineered bacteria to host the DNA I could see - whenever you wanted to copy the data, you'd just feed them some sugar and let them all divide and clone the DNA for you - but most likely I think you'd just keep it in beakers or pipettes or something. If you just want to carry information in your body, you can already do that; shrink wrap a USB stick and swallow it.

*You've probably heard about junk DNA. "Junk" isn't quite accurate - it was originally used to describe all the DNA that doesn't code for proteins, which is misleading because most of that DNA is still important, just for other reasons - but there still are stretches of DNA that don't do a whole lot. "Fossil" DNA (the remnants of old genes that got turned off permanently and are in the process of being scrambled by evolution), sanitized viral DNA**, and DNA that is literally just a spacer between other bits of important DNA are all good candidates.

**Viruses work by injecting DNA that does virus-y things into the host cell. You can beat this either by destroying the virus, or just by accepting the DNA and then refusing to execute its viral instructions, which can in some cases pass on to offspring and end up being transmitted like the animal's normal DNA. There's a fair bit of human DNA that is actually DNA of various viruses driven to extinction by our immune systems. Your DNA is a trophy case of humanity's defeated enemies!

HandsomeGirl

Re: 700 terabytes of data fit in 1 gram of DNA
« Reply #6 on: August 26, 2012, 09:33:42 pm »
In your two posts you answered every single question I had about this, and gave answers to questions that sprung from some of your explanations.  You also answered questions I had from things I'd read about on this topic by myself.

You rock like nothing else, and have my sincere thanks.

Nigel

  • in my oonlz
  • Deserved It
  • ****
  • Posts: 586168
  • v=1/3πr2h
    • View Profile
Re: 700 terabytes of data fit in 1 gram of DNA
« Reply #7 on: August 26, 2012, 10:22:25 pm »
This was AWESOME, GA, thanks for writing it up!
Tiny and Terrible Strap-On Fuckhorde of Tonight's Wrong Turn.

“I’m guessing it was January 2007, a meeting in Bethesda, we got a bag of bees and just started smashing them on the desk,” Charles Wick said. “It was very complicated.”

“People get used to anything. The less you think about your oppression, the more your tolerance for it grows. After a while, people just think oppression is the normal state of things. But to become free, you have to be acutely aware of being a slave.”
― Assata Shaku