Press "Enter" to skip to content

Category: Data Lake

Using ACLs to Secure Azure Data Lake Data

Matthew Roche takes us through access control lists (ACLs) in Azure Data Lake Storage Gen2 and how they apply to Power BI:

Earlier this week I received a question from a customer on how to get Power BI to work with data in ADLSg2 that is  secured using ACLs. I didn’t know the answer, but I knew who would know, and I looped in Ben Sack from the dataflows team.Ben answered the customer’s questions and unblocked their efforts, and he said that I could turn them into a blog post. Thank you, Ben!

Read on for the answer.

Comments closed

Azure Databricks and Delta Lake

Brad Llewellyn starts a new series on Delta Lake in Azure Databricks:

Saving the data in Delta format is as simple as replacing the .format(“parquet”) function with .format(“delta”).  However, we see a major difference when we look at the table creation.  When creating a table using Delta, we don’t have to specify the schema, because the schema is already strongly defined when we save the data.  We also see that Delta tables can be easily queried using the same SQL we’re used to.  Next, let’s compare what the raw files look like by examining the blob storage container that we are storing them in.

There are some good demos in this post and it promises to be a nice series.

Comments closed

Evolution of the Data Lake

Jim Wankowski takes us through the history of data lakes:

It is important to understand the difference between data lakes and data warehouses. A data warehouse is highly structured. Much effort is done upfront in developing schemas and hierarchies prior to the data being loaded into a warehouse. There is no hierarchy or structure to the way data is stored in a data lake. The structure is applied afterward. There can be multiple schemas applied to the same data in a data lake.

Read on to learn how the data lake concept has evolved over the past few years.

Comments closed

Delta Lake to Become an Open Standard

Michael Armbrust and Reynold Xin have exciting news about Delta Lake:

At today’s Spark + AI Summit Europe in Amsterdam, we announced that Delta Lake is becoming a Linux Foundation project. Together with the community, the project aims to establish an open standard for managing large amounts of data in data lakes. The Apache 2.0 software license remains unchanged.

Delta Lake focuses on improving the reliability and scalability of data lakes. Its higher level abstractions and guarantees, including ACID transactions and time travel, drastically simplify the complexity of real-world data engineering architecture. Since we open sourced Delta Lake six months ago, we have been humbled by the reception. The project has been deployed at thousands of organizations and processes exabytes of data each month, becoming an indispensable pillar in data and AI architectures.

Read on to see what this means for Delta Lake.

Comments closed

The Benefits of Delta Lake

Kaushik Nath explains what a Delta Lake is and why it is beneficial:

Data lakes have generated a large amount of publicity as the new storage technology for our big data era. Because something new is always better, right? 

All this hype around data lakes has ignored their inherent drawbacks and limitations. Well, I’m Not Here to create a debate by saying that no one should ever use data lakes. But I am saying that companies should enter into the data lake investment with eyes wide open. Otherwise it might lead to some serious complications.

Delta Lake is a concept intended to mitigate some of the issues with data lakes in general, turning them into data swamps.

Comments closed

The Flexible Data Lake

Neil Stokes explains how you can optimize a Hadoop-based data lake:

There are many details, of course, but these trade-offs boil down to three facets as shown below.

Big refers to the volume of data you can handle with your environment. Hadoop allows you to scale your storage capacity – horizontally as well as vertically – to handle vast volumes of data.

Fast refers to the speed with which you can ingest and process the data and derive insights from it. Hadoop allows you to scale your processing capacity using relatively cheap commodity hardware and massively parallel processing techniques to access and process data quickly.

Cheap refers to the overall cost of the platform. This means not just the cost of the infrastructure to support your storage and processing requirements, but also the cost of building, maintaining and operating the environment which can grow quite complicated as more requirements come into play.

The bottom line here is that there’s no magic in Hadoop. Like any other technology, you can typically achieve one or at best two of these facets, but in the absence of an unlimited budget, you typically need to sacrifice in some way.

Software development is full of trade-offs, and data lakes are no different. Read the whole thing.

Comments closed

Architecting a Data Lake in AWS

Gaurav Mishra takes us through data lake architecture on AWS:

Landing zone: This is the area where all the raw data comes in, from all the different sources within the enterprise. The zone is strictly meant for data ingestion, and no modelling or extraction should be done at this stage.

Curation zone: Here’s where you get to play with the data. The entire extract-transform-load (ETL) process takes place at this stage, where the data is crawled to understand what it is and how it might be useful. The creation of metadata, or applying different modelling techniques to it to find potential uses, is all done here.

Production zone: This is where your data is ready to be consumed into different application, or to be accessed by different personas. 

This is a nice overview of data lake concepts and worth the read if you’re using AWS. Even if not, the same principles (if not the same technologies) apply for Azure, other clouds, and on-prem.

Comments closed

From Data Lake to Delta Lake

Anmol Sarna explains the benefits of Spark’s Delta Lake:

Traditionally data has been residing in silos across the organization and the ecosystem in which it operations (external data). That’s a challenge: you can’t combine the right data to succeed in a big data project if that data is a bit everywhere in and out of the cloud. This is where the idea – and reality – of (big) data lakes comes from.

data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Although the data lakes serve as a central ingestion point for the plethora of data that organizations seek to gather and mine, it still has various limitations or challenges.

Software is made up of tradeoffs. What you gain by dumping relational structure, you lose in dumping relational structure (and this applies in the opposite direction as well).

Comments closed

Accessing Data in Azure Data Lake Storage Gen 2

James Serra gives us several methods to access data in Azure Data Lake Storage Gen 2:

With data lakes becoming popular, and Azure Data Lake Store (ADLS) Gen2 being used for many of them, a common question I am asked about is “How can I access data in ADLS Gen2 instead of a copy of the data in another product (i.e. Azure SQL Data Warehouse)?”. The benefits of accessing ADLS Gen2 directly is less ETL, less cost, to see if the data in the data lake has value before making it part of ETL, for a one-time report, for a data scientist who wants to use the data to train a model, or for using a compute solution that points to ADLS Gen2 to clean your data. While these are all valid reasons, you still want to have a relational database (see Is the traditional data warehouse dead?). The trade-off in accessing data directly in ADLS Gen2 is slower performance, limited concurrency, limited data security (no row-level, column-level, dynamic data masking, etc) and the difficulty in accessing it compared to accessing a relational database.

Since ADLS Gen2 is just storage, you need other technologies to copy data to it or to read data in it.

Read on for the solution.

Comments closed

Machine Learning and Delta Lake

Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake:

As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to the data so that it can be fed into machine learning models, these data lakes can quickly become convoluted, unorganized messes that have given rise to the term “data swamps.”

Before a single data point has been transformed or analyzed, data engineers have already run into their first dilemma: how to bring together processing of historical (“batch”) data, and real-time streaming data. Traditionally, one might use a lambda architecture to bridge this gap, but that presents problems of its own stemming from lambda’s complexity, as well as its tendency to cause data loss or corruption.

Read the whole thing.

Comments closed