Genomic Analysis In Spark

Tom White and Jonathan Keebler show off hail, a package to allow you to perform genomic analysis in Apache Spark:

One of the most important downstream analyses is finding genetic trait associations. Association studies look for statistical associations between genetic variation and phenotypic traits, that is, an observable characteristic of an individual, such as hair color or disease. With the increasing availability of whole-genome sequence data, it’s possible to look for variants from across the whole genome that may be associated with a disease, rather than heavily relying only on commonly known variants as in a traditional genome-wide association study (GWAS).

The challenge for downstream processing is scale. Tools that can cope with a few hundred or even a few thousand genomes, such as the well-known 1000 Genomes dataset, can’t handle datasets that are one or more orders of magnitude larger. These datasets are now becoming commonplace, thanks to the multiple sequencing efforts taking place around the world like the 100,000 Genomes Project in the UK and the Precision Medicine Initiative in the US.

Genomic analysis has been right in Hadoop’s wheelhouse for a while.

Related Posts

Spark Streaming DStreams

Manish Mishra explains the fundamental abstraction of Spark Streaming: Before going into details of the operations available on the DStream API, let us look at the input sources from which we can start a Stream. There are multiple ways in which we can get the inputs from e.g. Kafka, Flume, etc. Or simple Idle files. […]

Read More

Diagnosing TCP SACKs-Related Slowdown in Databricks

Chris Stevens, et al, walk us through troubleshooting a slowdown after using Linux images which have been patched for the TCP SACKs vulnerabilities: In order to figure out why the straggler task took 15 minutes, we needed to catch it in the act. We reran the benchmark while monitoring the Spark UI, knowing that all […]

Read More

Categories