Category: Data Lake

Until now, if you had to analyze data stored in ADLS with Excel, you would have to copy it into a relational data store like Azure SQL Data Warehouse or download the data onto a machine, and then use Excel to analyze that data. This was rather cumbersome involving additional cost and time. With this new support, you can now access files stored in ADLS with Excel in-place, without having to copy them to other stores or locations. You can quickly get advanced insights into raw or prepared data. Models and queries you have created using Excel that ran against local data, can be run seamlessly against data stored in ADLS.

Security capabilities of ADLS allow administrators to control access to the data stored in ADLS in a discretionary manner. With this you can limit the access that Excel users have for the data in ADLS. In this manner, data in the ADLS-based data lake continues to be the single source of truth with no redundant copies and can be analyzed by analytics tools of your own choice .

Click through for a demo video.

Comments closed

Diving Into The Data Lake

Published 2017-07-25 by Kevin Feasel

Jesse Gorter explains the data lake metaphor:

A data lake is a concept that opposes the idea of a data mart. Where a data mart is a silo with structured and cleansed data, a data lake is a huge data collection that is unstructured and raw. You could also say that a data mart is a bottle of clean water whereas the data lake is the lake with (not so clean) water. 🙂

Now why would you want a data lake? Imagine you are generating huge logfiles, for example in airplanes. Machines that track air pressure, temperature etc. If something goes wrong, you definitely want to be alerted. That is event-driven: “if A and B happen, alert pilot, or do C” and there are tools for dealing with that kind of streaming data. But what if the plane landed safely? What do you do with all that data? You do not need it anymore right?

Well, some people would say: “Wrong”. You might need that data later for reasons you do not know today. Google, Microsoft and Facebook are all hoarding data. Also data they are not sure they might need someday. This data could later prove to be valuable for AI, machine learning or for something else.

Read the whole thing. The data lake concept is powerful, but it requires at least as much data governance as prior models. Just because you can dump a bunch of files without thinking about it doesn’t mean you’ll get back something useful later.

Comments closed

Data Lake Tools For VS Code Updated

Published 2017-07-14 by Kevin Feasel

Jenny Jiang announces Azure Data Lake Tools for Visual Studio Code’s July update:

Local Debug enables you to debug your C# code behind, step through the code, and validate your script locally before submitting to ADLA.

Use command ADL: Start Local Run Service to start local run service and set a breakpoint in your code behind, then click command ADL: Local Debug to start local debug service. You can debug through the debug console and view parameter, variable, and call stack information.

Click through to see the other improvements.

Comments closed

Thinking About The Data Lake

Published 2017-07-03 by Kevin Feasel

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data. With the old way, you have to know the questions to ask. The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

Too much time spent modeling when you don’t know all of the questions your data needs to answer
Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here. My addition would be to reiterate the importance of a good data governance policy.

Comments closed

Managing Data Lake Analytics Compute

Published 2017-06-12 by Kevin Feasel

Yan Li has a three-part series looking at management of Azure Data Lake compute. First, an overview:

Scenario 2: Set One Specific Group to Different Limits

New members are joining and sharing the same ADLA account. To prevent any new members, who are just learning ADLA, from mistakenly submitting a job that consumes too much compute resource (increasing cost and blocking other jobs), customers want to set the maximum AU per job for new employees at 30 AUs while others can submit jobs with up to 100 AUs.

Default Policy:

Job AU limit: 100

Priority limit: 1

Exception Policy: New Employee Policy

Job AU limit: 30
Priority limit: 200
Group: New Employee Group

Next up is a look at job-level policies:

With job-level policies, you can control the maximum AUs and the maximum priority that individual users (or members of security groups) can set on the jobs that they submit. This allows you to not only control the costs incurred by your users but also control the impact they might have on high priority production jobs running in the same ADLA account.

There are two parts to a job level policy:

Default Policy: This is the policy that is applied to all users of the service.

Exceptions: The set of “exception” policies apply to specific users.

Submitted jobs that do not violate the job-level policies are still subject to the account level policies as described in Azure Data Lake Analytics Account Level Policy.

Finally, account-level policies:

ADLA supports three types of account-level policies:

Maximum AUs — Controls the maximum number of AUs that can be used by running jobs
Maximum Number of Running Jobs — Controls the maximum number of concurrently running jobs.
Days to Retain Job Queries — Controls how long detailed information about jobs are retained in the users ADLS account.

There’s a good amount of information here.

Comments closed

Jupyter + Pandas On Azure Data Lake Store

Published 2017-05-12 by Kevin Feasel

Amit Kulkarni demonstrates how to access data in Azure Data Lake Store within a Jupyter notebook:

For the rest of this post, I assume that you have some basic familiarity with Python, Pandas and Jupyter.

On your machine, you will need all of the following installed:

Python 2 or 3 with Pip
Pandas
Jupyter

Amit shows two separate methods for retrieving data, so check it out.

Comments closed

Azure Data Lake Tools For VS Code

Published 2017-05-12 by Kevin Feasel

Jenny Jiang announces that Azure Data Lake Tools for Visual Studio Code is now generally available:

ADLA Integration

The ADL Tools for VSCode integrate well with ADLA. Azure Data Lake includes the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. U-SQL on ADLA offers Job as a Service with the Microsoft invented U-SQL language. Customers do not have to manage deployment of clusters, but can simply submit their jobs to ADLA, an analytics platform managed by Microsoft.

Click through for the full announcement.

Comments closed

Cloudera Accessing Azure Data Lake Store

Published 2017-04-27 by Kevin Feasel

The Azure Data Lake team has announced that you can now access Azure Data Lake Store using a Cloudera cluster:

The Azure Data Lake (ADL) vision from the beginning has been to transform business data into intelligence by providing analytics on any data at cloud scale. ADL enterprise customers gain insights on their business data using a wide range of tools and platforms. Today’s release of Cloudera Enterprise 5.11 brings another very valuable and widely-used Hadoop computation platform to the set of platforms that can leverage ADLS. No matter what big data analytics platform you choose, Azure Data Lake Store provides a single high throughput enterprise-scale hierarchical file system data lake repository for big data.

Anyone with an Azure subscription can now deploy Cloudera clusters with ADLS. To get started, you can use the Cloudera Enterprise Data Hub template or the Cloudera Director template on Azure Marketplace to create a Cloudera cluster. Once the cluster is up, see here for more information on how to set up your Cloudera cluster with ADLS today!

That’s an interesting development.

Comments closed

Data Lake Zoning

Published 2017-04-26 by Kevin Feasel

Parth Patel, et al, explain that there ought to be several zones of data within a data lake:

Within a Data Lake, zones allow the logical and/or physical separation of data that keeps the environment secure, organized, and Agile. Typically, the use of 3 or 4 zones is encouraged, but fewer or more may be leveraged. A generic 4-zone system might include the following:

Transient Zone — Used to hold ephemeral data, such as temporary copies, streaming spools, or other short-lived data before being ingested.

Raw Zone – The zone in which raw data will be maintained. This is also the zone where sensitive data must be encrypted, tokenized, or otherwise secured.

Trusted Zone – After Data Quality, Validation, or other processing is performed on data in the Raw Zone, it becomes the “source of truth” in this zone for downstream systems.

Refined Zone – Manipulated and enriched data is kept in this zone. This is used to store the output from tools like Hive or external tools that will write into to the Data Lake.

Your particular situation may differ but I’d consider this to be good advice no matter where or how you’re storing data, such as a classical data warehouse or an ODS.

Comments closed

Azure Data Lake Store Best Practices

Published 2017-04-11 by Kevin Feasel

Ust Oldfield provides recommendations on how to size and lay out files in Azure Data Lake Store:

The format of the file has a huge implication for the storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelizable as data does not span extents. Non-splittable formats, however, – files what are not row oriented and data is often delivered in blocks, such as XML or JSON – cannot be parallelized as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

This is the type of thing that you can easily forget about, but it makes a huge difference down the line.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31