Press "Enter" to skip to content

Category: Data Lake

Python Support In Azure Data Lake

Saveen Reddy announces that Python is now a first-tier language in the Azure Data Lake:

This week, were are now making announcing even more support for Python. As of today Python is now a first-class language supported by our management SDKs. This enables you to develop applications or automate the Data Lake services. Check out or Getting Started articles that now include many python samples

Saveen has a Jupyter notebook which demonstrates Python in Azure Data Lake Store.

Comments closed

Thinking About The Data Lake

Ust Oldfield gives architectural hints on Azure Data Lake Store:

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

This is how you get data swamps (a term which I’m so happy is catching on).  Read the whole thing.

Comments closed

Azure Data Lake Updates

Michael Rys has the October updates for Azure Data Lake:

We seem to be just cranking out new stuff :). Here are the October 2016 Updates for Azure Data Lake U-SQL!

The main take away is that the October refresh has now removed the old deprecated syntax of the items we have announced over the last couple of release notes!

Thanks for those who volunteered to test the new version of more scalable file set. Please contact us if you want to try it and help us validate it.

Click through for the release notes.

Comments closed

Azure Data Lake Analytics Units

Yan Li explains the Azure Data Lake Analytics Unit:

An Azure Data Lake Analytics Unit, or AU, is a unit of computation resources made available to your U-SQL job. Each AU  gives your job access to a set of underlying resources like CPU and memory. Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM. As we see how people want to use the service, we may change the definition of an AU or more options for controlling CPU and memory usage.

How AUs are used during U-SQL Query Execution

When you submit a U-SQL script for execution, the U-SQL compiler parallelizes the U-SQL script into hundreds or even thousands of tasks called vertices. Each vertex is allocated to one AU. The AU is dynamically allocated to the task and released once that particular task is completed.

I appreciate the ADL team’s transparency in how they define a unit.  It’s much nicer to be able to tell someone that an AU is 2 CPU cores + 6 GB of RAM, rather than saying it’s some fuzzy measure of CPU + memory + I/O which has no direct bearing on your operations.

Comments closed

Azure Data Lake Updates

Saveen Reddy points out a few updates to Azure Data Lake Store & the Azure Data Lake Analytics portal:

Use Custom Delimeters when Previewing Files

Previously, we had supported comma, colon, space, tab, ampersand, and bar delimiters. With the many different kinds of files used in Azure Data Lake Store and Azure Storage, we’ve added a “Custom” delimiter options for you to define your own delimiter.

To change the delimiter on the Azure Portal:

  1. Open the file you want to preview using Data Explorer.

  2. Click on Format

  3. Under Delimiter, click the dropdown and change it to Custom

  4. A new Custom Delimiter field will appear, type in your delimiter here

  5. Click OK

Read on for more updates.

Comments closed

Data Lake Planning

Melissa Coates discusses some of the planning involved with creating a data lake:

Does a Data Lake Replace a Data Warehouse?

I’m biased here, and a firm believer that modern data warehousing is still very important. Therefore, I believe that a data lake, in an of itself, doesn’t entirely replace the need for a data warehouse (or data marts) which contain cleansed data in a user-friendly format. The data warehouse doesn’t absolutely have to be in a relational database anymore, but it does need a semantic layer which is easy to work with that most business users can access for the most common reporting needs.

On this question, my answer is “Absolutely not.”  Data warehouses are designed to answer specific, known business questions.  They’re great for regulatory reporting, quarterly reports to shareholders, weekly reports to management, etc.  Data lakes are designed for ad hoc analysis of information.  Read the whole thing.

Comments closed

Automatic Approval For Data Lake Analytics

Yan Li reports that Azure Data Lake Analytics no longer requires waiting for approval:

We’re happy to announce that we’ve made it much faster to get started with the Data Lake Store and Analytics services starting today. Before today, when you tried to sign up for these services you had to go through an approval process that introduced a delay of at least one hour.

Now, you no longer have to wait for approval, and you can simply create an account immediately.

Yan also has some “getting started” links to help you out, now that you don’t have to wait for an account.

Comments closed

Securing The Data Plane

Michael Schiebel gives an overview of security architecture inside a data lake:

Existing platform based Hadoop architectures make several implicit assumptions on how users interact with the platform such as developmental research versus production applications.  While this was perfectly good in a research mode, as we move to a modern data application architecture we need to bring back modern application concepts to the Hadoop ecosystem.  For example, existing Hadoop architectures tightly couple the user interface with the source of data.  This is done for good reasons that apply in a data discovery research context, but cause significant issues in developing and maintaining a production application.  We see this in some of the popular user interfaces such as Kibana, Banana, Grafana, etc.  Each user interface is directly tied to a specific type of data lake and imposes schema choices on that data.

Read the whole thing.  Also, “Securing the data plane” sounds like a terrible ’90s action film.

Comments closed

Azure Data Lake ACLs

Saveen Reddy introduces file and folder level Access Control Lists for Azure Data Lake storage:

We’ve emphasized that Azure Data Lake Store is compatible with WebHDFS. Now that ACLs are fully available, it’s important to understand the ACL model in WebHDFS/HDFS because they are POSIX-style ACLs and not Windows-style ACLs.  Before we five deep into the details on the ACL model, here are key points to remember.

  • POSIX-STYLE ACLs DO NOT ALLOW INHERITANCE. For those of you familiar with POSIX ACLs, this is not a surprise. For those coming from a Windows background this is very important to keep in mind. For example, if Alice can read files in folder /foo, it does not mean that she can rad files in /foo/bar. She must be granted explicit permission to /foo/bar. The POSIX ACL model is different in some other interesting ways, but this lack of inheritance is the most important thing to keep in mind.

  • ADDING A NEW USER TO DATA LAKE ANALYTICS REQUIRES A FEW NEW STEPS. Fortunately, a portal wizard automates the most difficult steps for you.

This is an interesting development.

Comments closed