Press "Enter" to skip to content

Category: Data Lake

Data Lake 3.0

Vinod Kumar Vavilapalli describes the modern data lake:

During the past few years though, end-to-end business use-cases have evolved to another level.

  • The end-to-end business problems are now mostly solved by multiple applications working together.
  • As the platform has matured, users increasingly want to focus solely on the business application layers and are impatient to get on with developing their main business logic.
  • However, YARN, and for that matter any other related platform, hasn’t catered to this evolving need, leaving users to unwillingly get involved in the painstaking details of wiring applications together, keeping them up, manually scaling them as the need arises, and so on.

Manual plumbing of all these differently colored services is tiresome! Further, there is a clear need for seamless aggregate deployment, lifecycle management, and application wire-up. This is the gap that needs to be bridged between what these end-to-end business use-cases need from the platform and what the platform offers today. If these features are provided, then business use-case authors can focus solely on the business logic.

This is a higher-level “where are we at?” kind of post which could be helpful if you’re new to the data lake concept.


Using Azure Data Factory With Biml

Meagan Longoria has a multi-part series on using Biml to script Azure Data Factory tasks to migrate data from an on-prem SQL Server instance to Azure Data Lake Store.  Here’s part 1:

My Azure Data Factory is made up of the following components:

  • Gateway – allows ADF to retrieve data from an on-premises data source

  • Linked Services – define the connection string and other connection properties for each source and destination

  • Datasets – define a pointer to the data you want to process, sometimes defining the schema of the input and output data

  • Pipelines – combine the datasets and activities and define an execution schedule

Click through for the Biml.
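The Biml itself is in Meagan's posts, but to give a feel for what that Biml ultimately generates, here is a rough sketch of an ADF (v1) linked service and dataset expressed as Python dicts. The names are made up and the property names are approximate, so treat it as a shape illustration rather than the real output:

```python
# Rough sketch of the ADF v1 JSON a Biml script might emit (names and properties approximate).
import json

# Linked service: the connection to the on-premises SQL Server, routed through the gateway.
on_prem_sql = {
    "name": "LS_OnPremSql",  # hypothetical name
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=False;",
            "gatewayName": "MyGateway",  # the Data Management Gateway registered with ADF
        },
    },
}

# Dataset: a pointer to one table exposed through that linked service.
input_table = {
    "name": "DS_SourceTable",  # hypothetical name
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": "LS_OnPremSql",
        "typeProperties": {"tableName": "dbo.SalesOrderHeader"},
        "availability": {"frequency": "Day", "interval": 1},
    },
}

print(json.dumps(on_prem_sql, indent=2))
print(json.dumps(input_table, indent=2))
```

The value of Biml here is that it scripts out dozens of these definitions from metadata instead of you hand-writing each one.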


Using Azure Data Lake Store With Hadoop

Amit Kulkarni shows how to make Azure Data Lake Store the default file system for a Hadoop cluster:

So to give a concrete example, if the default file system was hdfs://123.23.12.4344:9000, then /user/filename.txt would resolve to hdfs://123.23.12.4344:9000/user/filename.txt.

Why does the default file system matter? The first answer to this is purely convenience. It is a heck of a lot easier to simply say /events/sensor1/ than adl://amitadls.azuredatalakestore.net/ in code and configurations. Secondly, many components in Hadoop use relative paths by default. For instance, there is a fixed set of places, specified by relative paths, where various applications generate their log files. Finally, many ISV applications running on Hadoop specify important locations by relative paths.

Read on to see how.
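To make the quoted resolution rule concrete, here is a tiny Python sketch of how a relative path combines with the default file system (the fs.defaultFS setting in core-site.xml); both URIs come straight from Amit's example:

```python
# Sketch of how Hadoop resolves relative paths against the default file system.
# In core-site.xml this default lives in the fs.defaultFS property.

def resolve(default_fs: str, path: str) -> str:
    """Return a fully qualified URI, using default_fs when no scheme is given."""
    if "://" in path:              # already fully qualified (hdfs://, adl://, wasb://, ...)
        return path
    return default_fs.rstrip("/") + "/" + path.lstrip("/")

print(resolve("hdfs://123.23.12.4344:9000", "/user/filename.txt"))
# hdfs://123.23.12.4344:9000/user/filename.txt

print(resolve("adl://amitadls.azuredatalakestore.net", "/events/sensor1/"))
# adl://amitadls.azuredatalakestore.net/events/sensor1/
```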


Integrating Data Lake Storage With SQL Data Warehouse

Sachin Sheth alerts us to a new integration point between Azure Data Lake Storage and Azure SQL Data Warehouse via Polybase:

Most common patterns using Azure Data Lake Store (ADLS) involve customers ingesting and storing raw data into ADLS. This data is then cooked and prepared by analytic workloads like Azure Data Lake Analytics and HDInsight. Once cooked, this data is then explored using engines like Azure SQL Data Warehouse. One key pain point for customers is having to wait a substantial time after the data was cooked to be able to explore it and gather insights. This was because the data stored in ADLS had to be loaded into SQL Data Warehouse using tools that perform row-by-row insertion.

But now, you don’t have to wait that long anymore. With the new SQL Data Warehouse PolyBase support for ADLS, you will now be able to load and access the cooked data rapidly and lessen your time to start performing interactive analytics. PolyBase support will allow you to access unstructured/semi-structured files in ADLS faster because of a highly scalable loading design. You can load the files stored in ADLS into SQL Data Warehouse to perform analytics with fast response times, or you can use the files in ADLS as external tables. So get ready to unlock the value stored in your petabytes of data in ADLS.

I’ve been waiting for this support, and I’m happy that they were able to integrate the two products.
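If you're curious what the PolyBase side looks like, here is a hedged sketch: the external data source, file format, and external table DDL issued from Python via pyodbc. The object names, paths, and connection string are placeholders of my own, and the ADLS service principal credential is assumed to already exist; check the official documentation for the exact authentication steps.

```python
# Hedged sketch of exposing ADLS files to Azure SQL Data Warehouse via PolyBase external tables.
# Object names, paths, and the connection string are placeholders, not from Sachin's post.
import pyodbc

statements = [
    # External data source pointing at the ADLS account (assumes a database-scoped
    # credential named ADLSCredential has already been created).
    """CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
       WITH (TYPE = HADOOP,
             LOCATION = 'adl://myadls.azuredatalakestore.net',
             CREDENTIAL = ADLSCredential);""",
    # How the files are laid out.
    """CREATE EXTERNAL FILE FORMAT CsvFormat
       WITH (FORMAT_TYPE = DELIMITEDTEXT,
             FORMAT_OPTIONS (FIELD_TERMINATOR = ','));""",
    # External table over the cooked files -- query it in place, or...
    """CREATE EXTERNAL TABLE dbo.SensorEventsExt (DeviceId INT, EventTime DATETIME2, Reading FLOAT)
       WITH (LOCATION = '/events/', DATA_SOURCE = AzureDataLakeStore, FILE_FORMAT = CsvFormat);""",
    # ...load it into a distributed table with CTAS for repeated interactive queries.
    """CREATE TABLE dbo.SensorEvents
       WITH (DISTRIBUTION = HASH(DeviceId))
       AS SELECT * FROM dbo.SensorEventsExt;""",
]

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;UID=user;PWD=password"
)
conn.autocommit = True
cursor = conn.cursor()
for sql in statements:
    cursor.execute(sql)
```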


Local Azure Data Lake

Julie Koesmarno shows how to set up Azure Data Lake for local testing:

Late last year, I presented a Cognitive Intelligence demo using Azure Data Lake (ADL) at the PASS Summit keynote. It was a fun and quick demo! Watch it here :)

In case you’re new to ADL, you can now (since Dec 2015) develop, compile, and run ADL locally in Visual Studio. This is huge, because you don’t have to worry about your ADL Analytics Unit (AU) consumption. Plus, this allows you to try it before you buy it too!

Click through for the step-by-step installation instructions.


Sampling Data Lake Data

Alex Whittles shows how to use U-SQL to sample data to read in Power BI:

The answer is sampling: we don’t bring in 100% of the data, but maybe 10%, 1%, or even 0.01%, depending on how much you need to reduce your dataset. It is, however, critical to know how to sample data correctly in order to maintain the accuracy of the data in your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region? You’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.

Option 2: Take a random % sample
Now we’re talking. This option will take, for example, 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?

Read on for a couple of sampling methods.
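Alex does the sampling in U-SQL; as a language-agnostic illustration of option 2, here is a small Python sketch of a Bernoulli-style random sample (the file names and sampling rate are made up):

```python
# Sketch of option 2: keep each row independently with probability sample_rate,
# which gives an (approximately) even spread across the whole dataset.
import csv
import random

sample_rate = 0.01          # keep roughly 1% of rows
random.seed(42)             # fixed seed so the sample is reproducible

with open("events.csv", newline="") as src, open("events_sample.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))           # copy the header row as-is
    for row in reader:
        if random.random() < sample_rate:   # independent coin flip per row
            writer.writerow(row)
```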


Beginning With Amazon Athena

Jen Underwood looks at the basics behind Amazon Athena:

Today early adopters of Amazon Athena are using it for big data analytics pipeline projects along with Kinesis streaming data and other Amazon data sources.

Athena is a serverless, pay-per-use parallel query service. There is no infrastructure to set up or manage. It scales automatically and can handle large datasets or complex distributed queries.

The easy way of thinking about Athena is that it’s ElasticMapReduce (a pay-as-you-go Hadoop cluster) without the ceremony of administering or spinning up the cluster.
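To show just how little ceremony there is, here is a hedged boto3 sketch of submitting a query and reading the results; the database, table, and output bucket names are placeholders:

```python
# Hedged sketch: submit an Athena query with boto3 and wait for the result.
# Database, table, and bucket names are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

start = athena.start_query_execution(
    QueryString="SELECT event_type, count(*) AS events FROM web_logs GROUP BY event_type",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query finishes; you pay per data scanned, not for any infrastructure.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```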


Taxi Rides And Amazon Athena

Mark Litwintschik looks at using Amazon Athena to process the New York City taxi rides data set:

It’s important to note that Athena is not a general purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto’s performance, given it can work on some of the world’s largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

Athena is Amazon’s response to Azure Data Lake Analytics.  Check out Mark’s blog post for a good way of getting started with Athena.
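For a flavor of the "ask questions" workload Mark describes, this is the kind of aggregation Athena (via Presto) is built for; the table and column names below are hypothetical, not Mark's actual taxi schema, and you would run it with the boto3 pattern shown earlier:

```python
# Hypothetical aggregation over a taxi-trips table -- illustrative only.
TRIPS_BY_HOUR = """
SELECT hour(pickup_datetime) AS pickup_hour,
       count(*)              AS trips,
       avg(total_amount)     AS avg_fare
FROM taxi.trips
GROUP BY hour(pickup_datetime)
ORDER BY pickup_hour
"""
```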


Querying Genomic Data With Athena

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g., Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know the nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena, as well as demonstrate how Athena is well-adapted to address common genomics query paradigms. I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.
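The "query files straight from S3" step Aaron describes comes down to an external table definition. Here is a hedged sketch of what that DDL looks like; the schema is a made-up variant-call shape rather than the actual Thousand Genomes layout, and the bucket path is a placeholder:

```python
# Hedged sketch of defining an Athena external table over Parquet files in S3.
# The columns are an illustrative variant-call shape, not the real Thousand Genomes schema.
CREATE_VARIANTS_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS genomics.variants (
    chromosome  string,
    position    bigint,
    reference   string,
    alternate   string,
    sample_id   string,
    genotype    string
)
STORED AS PARQUET
LOCATION 's3://my-genomics-bucket/variants/parquet/'
"""
# Submit CREATE_VARIANTS_TABLE with start_query_execution (see the earlier boto3 sketch);
# once it exists, ordinary SELECTs against genomics.variants query the files in place.
```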


AWS Data Lake

Nick Corbett announces that Amazon is rolling out their own data lake solution:

Separating storage from processing can also help to reduce the cost of your data lake. Until you choose to analyze your data, you need to pay only for S3 storage. This model also makes it easier to attribute costs to individual projects. With the correct tagging policy in place, you can allocate the costs to each of your analytical projects based on the infrastructure that they consume. In turn, this makes it easy to work out which projects provide most value to your organization.

The data lake stores metadata in both DynamoDB and Amazon ES. DynamoDB is used as the system of record. Each change of metadata that you make is saved, so you have a complete audit trail of how your package has changed over time. You can see this on the data lake console by choosing History in the package view:

Having a competitor in the data lake space is a good thing for us. Based on this intro post, though, it seems that Amazon and Microsoft are taking different approaches to the data lake: Microsoft wants you to stay in the data lake (e.g., writing U-SQL or Python statements to query it), whereas Amazon wants you to shop the data lake, checking out specific S3 buckets and files for your own processing.
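That "shop the data lake" model is, at bottom, plain S3 access. Here is a minimal boto3 sketch of browsing a prefix and pulling objects down for your own processing; the bucket and prefix names are made up:

```python
# Minimal sketch of the "shop the data lake" model: browse a bucket prefix and pull
# down the objects you want to process locally. Bucket and prefix names are made up.
import boto3

s3 = boto3.client("s3")
bucket, prefix = "corp-data-lake", "sales/2017/01/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
        # To process locally, download each object:
        # s3.download_file(bucket, obj["Key"], obj["Key"].replace("/", "_"))
```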
