Querying Genomic Data With Athena

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to setup or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR 

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena as well as demonstrating how Athena is well-adapted to address common genomics query paradigms.  I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.

Related Posts

LISTAGG In Snowflake DB

Koen Verbeeck continues investigating Snowflake capabilities: Since SQL Server 2017, you have the STRING_AGG function, which has almost the exact same syntax as its Snowflake counterpart. There are two minor differences:– Snowflake has an optional DISTINCT– SQL Server has a default ascending sorting. If you want another sorting, you can specify one in the WITHIN GROUP clause. […]

Read More

The Forgotten Infrastructure Below Azure BI Architecture Diagrams

Meagan Longoria reminds us that there are several products which Azure BI projects need but which we tend to forget when building architectural diagrams: Let’s start with Azure Active Directory (AAD). In order to provision the resources in the diagram, your Azure subscription must already be associated with an Active Directory. AAD is Microsoft’s cloud-based […]

Read More


December 2016
« Nov Jan »