Genomic Analysis In Spark

Tom White and Jonathan Keebler show off hail, a package to allow you to perform genomic analysis in Apache Spark:

One of the most important downstream analyses is finding genetic trait associations. Association studies look for statistical associations between genetic variation and phenotypic traits, that is, an observable characteristic of an individual, such as hair color or disease. With the increasing availability of whole-genome sequence data, it’s possible to look for variants from across the whole genome that may be associated with a disease, rather than heavily relying only on commonly known variants as in a traditional genome-wide association study (GWAS).

The challenge for downstream processing is scale. Tools that can cope with a few hundred or even a few thousand genomes, such as the well-known 1000 Genomes dataset, can’t handle datasets that are one or more orders of magnitude larger. These datasets are now becoming commonplace, thanks to the multiple sequencing efforts taking place around the world like the 100,000 Genomes Project in the UK and the Precision Medicine Initiative in the US.

Genomic analysis has been right in Hadoop’s wheelhouse for a while.

Related Posts

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers: We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different […]

Read More

SparkSession Versus SparkContext

Abhishek Baranwal explains the differences between the SparkSession object and the SparkContext object when writing Spark code: Prior to spark 2.0, SparkContext was used as a channel to access all spark functionality. The spark driver program uses sparkContext to connect to the cluster through resource manager. SparkConf is required to create the spark context object, […]

Read More

Categories