Taxi Rides And Amazon Athena

Mark Litwintschik looks at using Amazon Athena to process the New York City taxi rides data set:

It’s important to note that Athena is not a general purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto’s performance, given it can work on some of the world’s largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

Athena is Amazon’s response to Azure Data Lake Analytics.  Check out Mark’s blog post for a good way of getting started with Athena.

Querying Genomic Data With Athena

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena as well as demonstrating how Athena is well-adapted to address common genomics query paradigms.  I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.

AWS Data Lake

Nick Corbett announces that Amazon is rolling out their own data lake solution:

Separating storage from processing can also help to reduce the cost of your data lake. Until you choose to analyze your data, you need to pay only for S3 storage. This model also makes it easier to attribute costs to individual projects. With the correct tagging policy in place, you can allocate the costs to each of your analytical projects based on the infrastructure that they consume. In turn, this makes it easy to work out which projects provide most value to your organization.

The data lake stores metadata in both DynamoDB and Amazon ES. DynamoDB is used as the system of record. Each change of metadata that you make is saved, so you have a complete audit trail of how your package has changed over time. You can see this on the data lake console by choosing History in the package view.

Having a competitor in the data lake space is a good thing for us. Based on this intro post, though, Amazon and Microsoft seem to be taking different approaches: Microsoft wants you to stay in the data lake (e.g., writing U-SQL or Python statements to query it), whereas Amazon wants you to shop the data lake and check out the specific S3 buckets and files for your own processing.
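
The tag-based cost attribution that Nick describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record shape (a dict with "cost" and "tags" keys), the "project" tag name, and the cost_by_project function are hypothetical and are not the format of any AWS billing report.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_by_project(line_items: Iterable[Mapping], tag_key: str = "project") -> Dict[str, float]:
    """Sum cost per project tag; items without the tag land in "untagged".

    The line-item shape here is a hypothetical stand-in for whatever
    billing export you actually use.
    """
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        project = item.get("tags", {}).get(tag_key, "untagged")
        totals[project] += item["cost"]
    return dict(totals)


# Made-up line items, purely for illustration.
items = [
    {"service": "S3", "cost": 12.5, "tags": {"project": "genomics"}},
    {"service": "EMR", "cost": 4.25, "tags": {"project": "genomics"}},
    {"service": "Athena", "cost": 3.0, "tags": {"project": "taxi"}},
    {"service": "S3", "cost": 1.5, "tags": {}},
]
print(cost_by_project(items))
```

The "untagged" bucket is the useful part: it shows how much spend escapes attribution when the tagging policy is not followed.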

Python Support In Azure Data Lake

Saveen Reddy announces that Python is now a first-tier language in the Azure Data Lake:

This week, we are announcing even more support for Python. As of today, Python is a first-class language supported by our management SDKs. This enables you to develop applications or automate the Data Lake services. Check out our Getting Started articles, which now include many Python samples.

Saveen has a Jupyter notebook which demonstrates Python in Azure Data Lake Store.

Thinking About The Data Lake

Ust Oldfield gives architectural hints on Azure Data Lake Store:

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sales pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

This is how you get data swamps (a term which I’m so happy is catching on).  Read the whole thing.

Azure Data Lake Updates

Michael Rys has the October updates for Azure Data Lake:

We seem to be just cranking out new stuff :). Here are the October 2016 Updates for Azure Data Lake U-SQL!

The main takeaway is that the October refresh has now removed the old deprecated syntax of the items we have announced over the last couple of release notes!

Thanks to those who volunteered to test the new version of the more scalable file set. Please contact us if you want to try it and help us validate it.

Click through for the release notes.

Azure Data Lake Analytics Units

Yan Li explains the Azure Data Lake Analytics Unit:

An Azure Data Lake Analytics Unit, or AU, is a unit of computation resources made available to your U-SQL job. Each AU gives your job access to a set of underlying resources like CPU and memory. Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM. As we see how people want to use the service, we may change the definition of an AU or add more options for controlling CPU and memory usage.

How AUs are used during U-SQL Query Execution

When you submit a U-SQL script for execution, the U-SQL compiler parallelizes the U-SQL script into hundreds or even thousands of tasks called vertices. Each vertex is allocated to one AU. The AU is dynamically allocated to the task and released once that particular task is completed.

I appreciate the ADL team’s transparency in how they define a unit.  It’s much nicer to be able to tell someone that an AU is 2 CPU cores + 6 GB of RAM, rather than saying it’s some fuzzy measure of CPU + memory + I/O which has no direct bearing on your operations.
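
Because the definition is concrete, you can do back-of-the-envelope sizing. Here is a minimal Python sketch, assuming the 2-core / 6 GB figures quoted above (which the post says may change). The AuResources, resources_for, and min_waves names are my own, and min_waves is a deliberate simplification: it ignores dependencies between vertices and differences in vertex runtime, so treat it as a lower bound rather than how the scheduler actually behaves.

```python
import math
from dataclasses import dataclass

# Figures quoted in the post at the time of writing; subject to change.
CORES_PER_AU = 2
RAM_GB_PER_AU = 6


@dataclass(frozen=True)
class AuResources:
    aus: int
    cpu_cores: int
    ram_gb: int


def resources_for(aus: int) -> AuResources:
    """Total CPU cores and RAM that a given number of AUs represents."""
    if aus < 1:
        raise ValueError("a job needs at least one AU")
    return AuResources(aus, aus * CORES_PER_AU, aus * RAM_GB_PER_AU)


def min_waves(vertices: int, aus: int) -> int:
    """Lower bound on scheduling rounds when each vertex occupies one AU.

    Hypothetical simplification: every vertex is treated as independent
    and equally long, which real U-SQL jobs are not.
    """
    if vertices < 0 or aus < 1:
        raise ValueError("vertices must be >= 0 and aus must be >= 1")
    return math.ceil(vertices / aus)


print(resources_for(10))    # 10 AUs -> 20 cores, 60 GB of RAM
print(min_waves(1000, 10))  # 1000 vertices on 10 AUs -> at least 100 rounds
```

The second function shows why adding AUs has diminishing returns: once the AU count reaches the number of vertices that can run at once, extra AUs sit idle.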

Azure Data Lake Updates

Saveen Reddy points out a few updates to Azure Data Lake Store & the Azure Data Lake Analytics portal:

Use Custom Delimiters when Previewing Files

Previously, we had supported comma, colon, space, tab, ampersand, and bar delimiters. With the many different kinds of files used in Azure Data Lake Store and Azure Storage, we’ve added a “Custom” delimiter option for you to define your own delimiter.

To change the delimiter on the Azure Portal:

  1. Open the file you want to preview using Data Explorer.

  2. Click on Format.

  3. Under Delimiter, click the dropdown and change it to Custom.

  4. A new Custom Delimiter field will appear; type in your delimiter here.

  5. Click OK.
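
If you want the same kind of preview outside the portal, the steps above have a rough local equivalent. This is a minimal sketch using Python's standard csv module, not anything the portal exposes: preview_rows is a hypothetical helper, the tilde-delimited sample is made up, and csv.reader only accepts single-character delimiters.

```python
import csv
import io
from itertools import islice
from typing import List


def preview_rows(text: str, delimiter: str, limit: int = 5) -> List[List[str]]:
    """Split the first `limit` lines of `text` on a custom delimiter."""
    if len(delimiter) != 1:
        raise ValueError("csv.reader accepts a single-character delimiter")
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return list(islice(reader, limit))


# Made-up sample data using "~" as the custom delimiter.
sample = "id~name~city\n1~Ada~London\n2~Grace~New York\n"
for row in preview_rows(sample, "~"):
    print(row)
```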

Read on for more updates.

Data Lake Planning

Melissa Coates discusses some of the planning involved with creating a data lake:

Does a Data Lake Replace a Data Warehouse?

I’m biased here, and a firm believer that modern data warehousing is still very important. Therefore, I believe that a data lake, in and of itself, doesn’t entirely replace the need for a data warehouse (or data marts) which contain cleansed data in a user-friendly format. The data warehouse doesn’t absolutely have to be in a relational database anymore, but it does need a semantic layer which is easy to work with that most business users can access for the most common reporting needs.

On this question, my answer is “Absolutely not.”  Data warehouses are designed to answer specific, known business questions.  They’re great for regulatory reporting, quarterly reports to shareholders, weekly reports to management, etc.  Data lakes are designed for ad hoc analysis of information.  Read the whole thing.

Automatic Approval For Data Lake Analytics

Yan Li reports that Azure Data Lake Analytics no longer requires waiting for approval:

We’re happy to announce that we’ve made it much faster to get started with the Data Lake Store and Analytics services starting today. Before today, when you tried to sign up for these services you had to go through an approval process that introduced a delay of at least one hour.

Now, you no longer have to wait for approval, and you can simply create an account immediately.

Yan also has some “getting started” links to help you out, now that you don’t have to wait for an account.

