How to interpret your own genome using (mostly) Python(如何用Python解析你自己的基因组） by Titus Brown

AirJD

没有录音文件

00:00/00:00

加收藏

How to interpret your own genome using (mostly) Python(如何用Python解析你自己的基因组） by Titus Brown

发布者 pyconf 简介 PyChina 大会

发布于 1445995055524 浏览 7486 关键词 Python, English

分享到

第1页

How to interpret your own genome.

C. Titus Brown
ctbrown@ucdavis.edu
@ctitusbrown
http://ivory.idyll.org/blog/

Second in my ongoing attempt to explain what I actually do to Terry Peppers.

第2页

Some basic facts about DNA

The primary DNA sequence consists of strings of A, C, G, and T.
Most human cells contain approximately 6 billion of these.
They are divided into 23 chromosome pairs.
These chromosomes are the primary unit of heredity.

http://classes.biology.ucsd.edu/bimm110.SP07/lectures_WEB/L08.05_Cytogenetics.htm

第3页

How DNA is interpreted –“It’s complicated.”

http://www.exploringnature.org/db/detail.php?dbID=106&detID=2454

第4页

How inheritance & generation of variation works

http://genetics.thetech.org/ask/ask435

+ approximately 300-600 mutations
per generation

第5页

If we knew a person’s genome sequence perfectly…

We still wouldn’t know all that much!

We could correlate variation between genomes with diseases.
We could identify parentage and genetic inheritance.
We could probably identify ethnic origin.
We could find known “mistakes” or problems.

第6页

But… why wouldn’t we know that muchjQuery110208689744296370732_1445995082543 Isn’t the genome the person?

Let’s ignore environmental factors, first of all…

第7页

Imagine…

…you’re locked in a room, with feral lawyers roaming around outside;
You have a bunch of source code on a stack of CDs to understand;
And you’ve been given a Windows 98 machine with Python installed.
(see David Beazley, “Discovering Python”, PyCon 2014)
This talk came partly from listening to his talk…

第8页

This “locked room” problem is a pretty good analogy to genomics!

“Here are 3 billion characters of DNA! Go figure out what it all means!”
It’s like the previous locked room problem, and:
The code is all written in Perl 8, for which neither a specification or software interpreter exists.
But you have access to the Internet and a world-wide collection of other scientists, and (some of) their data and papers.
Oh, and: the answers hold the keys to life and death.

第9页

Genomes are still useful! How do we find sequence?

Primary approach for human genomes is: spend a lot of money sequencing one, or a few; use that as reference.
Initial cost: $2.7 bn (in 1991)
Current human genome reference is from 13 anonymous volunteers in Buffalo, NY (Wikipedia ;)

Older technology: identify points of variation, then target for further investigation.
Current technology: sequence. (The rest of this talk.
Next technology: longer reads. (Sequence more, better.)

第10页

Working with short read sequencing - overview

第11页

Working with short read sequencing - sequencing

Need about 250 ng of DNA at 2 ng/ul.
“Under $1,000 dollars”
http://biome.biomedcentral.com/welcome-to-the-1000-genome/
…some up front investment required :)

第12页

Working with short read sequencing - sequencing

@D00360:18:H8VC6ADXX:1:1103:1434:46766/1
AACCCCCTCCCCATGCTTACAAGCAAGTACAGCAATCAACCCTCAACTATCACACATCAACTGCAACTCCAAAGCCACCCCTCACCCACTAGGATACCAACAAACCTACCCACCCTTAACAGCACATAGTACATAAAGCCATTTACCG
+
@@@DDDDDFHHFHHIIIBHGIIDGIA;EDGD@CG@FDDEFFB@DCGHGGIG8CHGDFGHCCDA>EEAAHEDFE?@@CEEBB?BBBB?<@<CCCC>CCCC>88ABBBCCCBAA@BBBCCCC@@<C??>C@CD>@@CACCC@9@C:CD@0

Raw data looks something like this (x 2 bn)

第13页

Mapping: locate sequences in reference

http://en.wikipedia.org/wiki/File:Mapping_Reads.png

=> BAM

FASTQ =>

第14页

第15页

Variant detection after mapping

http://www.kenkraaijeveld.nl/genomics/bioinformatics/

BAM =>

=> VCF

第16页

第17页

Working with short-read sequencing – annotate variants

Is it a variant known to have an effect?
Is it in a gene?
Is it in a gene and does it have some “obvious” effect (e.g. breaking the gene)?
Has it been associated with some effect?

第18页

Pipeline, approaches, formats, technologies.

Illumina

BWA
Samtools

FreeBayes

VEP
SNPedia

Gemini

bcbio 

See http://ivory.idyll.org/blog/2015-pycon-talk.html for details.

~1500 hours

~12 hours

~100 hours

第19页

An example data set

Sequences from a “trio” (son, father, mother) of Ashkenazi Jews are available, together with medical records (see links in blog post).
The Ashkenazim branched off from other Jews ~2500 years ago, flourished during Roman Empire, then “went through a 'severe bottleneck' as they dispersed, reducing a population of several million to just 400 families who left Northern Italy around the year 1000.”

http://en.wikipedia.org/wiki/Ashkenazi_Jews#Genetics

第20页

“Raw” human data:

BAM file: 108 GB
(contains sequences + quality scores)

+ human genome (~3 GB or so)
+ lots of databases of varying size.

Full instructions at:
http://ivory.idyll.org/blog/2015-pycon-talk.html

第21页

Working with short-read sequencing – mapping.

Software such as BWA takes in a reference genome and a set of reads and yields tab-delimited output:
D00360:37:HA3HMADXX:1:2104:14000:62852 163 chr22 16050001 15 87S8M1I10M1D41M1S = 16050476 621 CCA….3((…

This contains information about where each read maps, how well it maps, etc.

第22页

Most parts of the genome are sampled many times (~50, here)

HG002 data set

第23页

Calling variants w/FreeBayes

https://github.com/ekg/freebayes

第24页

Working with short-read sequencing – annotate variants

HG002 data set

Variants annotated with VEP using Gemini.

第25页

Most differences are ~uninterpretable!

(Only 2% of human genome makes genes; maybe ~5% of genome thought to be functional)

HG002 data set

第26页

OK, you’ve got your variants – now what??

HT to Slate Star Codex,
http://slatestarcodex.com/2014/11/12/how-to-use-23andme-irresponsibly/

第27页

Chasing down a disease-related variant: Canavan disease.

http://www.snpedia.com/index.php/Rs12948217

第28页

chr17:3397702 (hg19) in HG002 sample (son)

The son and both parents are heterozygous (1/2) for this – they are carriers, but not afflicted with disease.

¼ of their children would have homozygous allele and probably be affected by Canavan’s Disease:

“Children who inherit two copies of the gene appear normal at birth, but between three and nine months of age they begin to show symptoms ... These children cannot sit, crawl, or talk, and few live past age 10.”

http://www.snpedia.com/index.php/Canavan_disease

第29页

Challenges in actually interpreting – “version hell”.

Variant is actually a T.
Snpedia says A is the problematic variant, but that’s on hg38.
On hg19, which is what variants were called on, relevant gene is on reverse strand so T => A.

第30页

Human migrations into Europe (~40kya – fall of Roman Empire)

Veeramah and Novembre, doi:10.1101/cshperspect.a008516

第31页

Veeramah and Novembre, doi:10.1101/cshperspect.a008516

Human genetic comparisons overlayed on map of Europe.

第32页

Predicting new disease variants:

Can we find associations between variants and diseases?

“Genome Wide Association Study (GWAS)”

Wellcome Trust CCT, 2007, doi:10.1038/nature05911

第33页

…cautions of GWAS:

Need to account for relatedness in samples;
Large sample sizes needed;
Complex statistics needed & “multiple testing” issues;
Different identifier/database mixtures;
Correlation is not causation;
Large effects are rare – typically many small signals combined.
The data science problem from hell!

第34页

Where next?

Short-term: next 2-5 years

Medium-term: 10 years

Long-term: 20 years+

第35页

Short term

Lots more data! “Millions to billions of human genomes” coming.
Individual data – est 300,000 human genomes sequenced in 2014.
Tumor and somatic data.
Time course data (“narcissome”) - Mike Snyder
Newer sequencing data types – e.g. longer reads.

see: http://www.nature.com/news/the-rise-of-the-narciss-ome-1.10240

第36页

Short-term software problems

Increasingly many open source Python projects (bcbio, Gemini);
Help with integration between tools (dependency hell, versioning hell);

Optimization of specific approaches not so important.
Lack of concordance => technical problem.
General speed ~meh
Flexible and robust libraries still maturing.

第37页

Medium term

We’ll be sequencing everything all the time (but still won’t really know what it means); => data integration and data mining.
Large scale sequencing is rapidly being extended to agriculture, ecology, and veterinary medicine.
We will soon be able to “edit” whatever genomes we want (check out CRISPR), but will not have a good idea of what to actually edit (c.f. Perl8 analogy, above).
Read up on “gene drive” if you want the bejeezus scared out of you:
http://news.sciencemag.org/biology/2015/03/chain-reaction-spreads-gene-through-insects

第38页

Longer term

No one knows.
We’ve only had large scale sequencing & the human genome for ~15 years!!

Free associate the following:
cheap sequencing; quantified self; Internet of Things.

第39页

How to get involved?

A lot of the software is open source!
(bwa, samtools, etc. etc.)

…but:
Warning: genomics is large, and deep, and largely invisible, and has its own culture.
Sadly, your best bet is probably to come do a PhD with someone like me, for free.
(just kidding! …)

第40页

bcbio and Gemini

Help with:
Gemini: SQLite to PostgreSQL conversion;
Gemini: “bigwig” parsing performance;
bcbio: improving use & cleanliness of Cloud port
bcbio: moving to Common Workflow Language (note, reference implementation in Python)
See talk blog post at http://ivory.idyll.org/2015-pycon-talk.html for more info.

第41页

How can you sequence your own genome?

Most genetic testing services (23andme, etc.) don’t actually sequence your 6 billion bases of DNA; they instead use a more targeted approach and look at common variants or known disease variants.
If it costs < $1000, they’re not actually sequencing you :)
DNA extraction, etc, is fairly straightforward if you have access to a lab and the necessary expertise.
Main suggestion: see http://www.personalgenomes.org/

第42页

Thanks for coming!

Please see links to data, instructions, and more reading at
http://ivory.idyll.org/blog/2015-pycon-talk.html