Data Lake – Page 11 – Curated SQL

Data Lake File Formats and Security

Published 2020-01-31 by Kevin Feasel

Ashish Kumar and Jorge Villamariona continue a series on data lakes:

People from a traditional RDBMS background are often surprised at the extraordinary amount of control that data lake architects have over how datasets can be stored. Data Lake Architects, as opposed to the Relational Database Administrators, get to determine an array of elements such as file sizes, type of storage (row vs. columnar), degrees of compression, indexing, schemas, and block sizes. These are related to the big data oriented ecosystem of formats commonly used for storing and accessing information in a data lake.

It is a bit of a different world and it comes with trade-offs. The whole thing is worth reading.

Comments closed

What’s New with Delta Lake

Published 2020-01-30 by Kevin Feasel

Denny Lee and Tathagata Das announce Delta Lake 0.5.0:

With the following pull requests, you can now run even more Delta Lake operations concurrently. With finer grain conflict detection, these updates make it easier to run complex workflows on Delta tables such as:
– Running deletes (e.g. for GDPR compliance) concurrently on older partitions while newer partitions are being appended.
– Running file compactions concurrently with appends.
– Running updates and merges concurrently on disjoint sets of partitions.

Click through for the full changelog.

Comments closed

Data Lake Storage and Data Processing

Published 2020-01-28 by Kevin Feasel

Ashish Kumar has started a series on data lake essentials:

Data Lake architecture is all about storing large amounts of data which can be structured, semi-structured or unstructured, e.g. web server logs, RDBMS data, NoSql data, social media, sensors, IoT data and third-party data. A data lake can store the data in the same format as its source systems or transform it before storing.
The main purpose of a data lake is to make organizational data from different sources, accessible to a variety of end users like business analysts, data engineers, data scientists, product managers, executives, etc, in order to enable these personas to leverage insights in a cost-effective manner, for improved business performance. Today, many forms of advanced analytics are only possible on data lakes.

Click through for more information on what a data lake should provide—whether that be in-house or a cloud provider.

Comments closed

Using ACLs to Secure Azure Data Lake Data

Published 2020-01-21 by Kevin Feasel

Matthew Roche takes us through access control lists (ACLs) in Azure Data Lake Storage Gen2 and how they apply to Power BI:

Earlier this week I received a question from a customer on how to get Power BI to work with data in ADLSg2 that is secured using ACLs. I didn’t know the answer, but I knew who would know, and I looped in Ben Sack from the dataflows team.Ben answered the customer’s questions and unblocked their efforts, and he said that I could turn them into a blog post. Thank you, Ben!

Read on for the answer.

Comments closed

Azure Databricks and Delta Lake

Published 2019-12-11 by Kevin Feasel

Brad Llewellyn starts a new series on Delta Lake in Azure Databricks:

Saving the data in Delta format is as simple as replacing the .format(“parquet”) function with .format(“delta”). However, we see a major difference when we look at the table creation. When creating a table using Delta, we don’t have to specify the schema, because the schema is already strongly defined when we save the data. We also see that Delta tables can be easily queried using the same SQL we’re used to. Next, let’s compare what the raw files look like by examining the blob storage container that we are storing them in.

There are some good demos in this post and it promises to be a nice series.

Comments closed

Evolution of the Data Lake

Published 2019-11-15 by Kevin Feasel

Jim Wankowski takes us through the history of data lakes:

It is important to understand the difference between data lakes and data warehouses. A data warehouse is highly structured. Much effort is done upfront in developing schemas and hierarchies prior to the data being loaded into a warehouse. There is no hierarchy or structure to the way data is stored in a data lake. The structure is applied afterward. There can be multiple schemas applied to the same data in a data lake.

Read on to learn how the data lake concept has evolved over the past few years.

Comments closed

Delta Lake to Become an Open Standard

Published 2019-10-17 by Kevin Feasel

Michael Armbrust and Reynold Xin have exciting news about Delta Lake:

At today’s Spark + AI Summit Europe in Amsterdam, we announced that Delta Lake is becoming a Linux Foundation project. Together with the community, the project aims to establish an open standard for managing large amounts of data in data lakes. The Apache 2.0 software license remains unchanged.
Delta Lake focuses on improving the reliability and scalability of data lakes. Its higher level abstractions and guarantees, including ACID transactions and time travel, drastically simplify the complexity of real-world data engineering architecture. Since we open sourced Delta Lake six months ago, we have been humbled by the reception. The project has been deployed at thousands of organizations and processes exabytes of data each month, becoming an indispensable pillar in data and AI architectures.

Read on to see what this means for Delta Lake.

Comments closed

The Benefits of Delta Lake

Published 2019-10-15 by Kevin Feasel

Kaushik Nath explains what a Delta Lake is and why it is beneficial:

Data lakes have generated a large amount of publicity as the new storage technology for our big data era. Because something new is always better, right?
All this hype around data lakes has ignored their inherent drawbacks and limitations. Well, I’m Not Here to create a debate by saying that no one should ever use data lakes. But I am saying that companies should enter into the data lake investment with eyes wide open. Otherwise it might lead to some serious complications.

Delta Lake is a concept intended to mitigate some of the issues with data lakes in general, turning them into data swamps.

Comments closed

The Flexible Data Lake

Published 2019-10-09 by Kevin Feasel

Neil Stokes explains how you can optimize a Hadoop-based data lake:

There are many details, of course, but these trade-offs boil down to three facets as shown below.
Big refers to the volume of data you can handle with your environment. Hadoop allows you to scale your storage capacity – horizontally as well as vertically – to handle vast volumes of data.
Fast refers to the speed with which you can ingest and process the data and derive insights from it. Hadoop allows you to scale your processing capacity using relatively cheap commodity hardware and massively parallel processing techniques to access and process data quickly.
Cheap refers to the overall cost of the platform. This means not just the cost of the infrastructure to support your storage and processing requirements, but also the cost of building, maintaining and operating the environment which can grow quite complicated as more requirements come into play.
The bottom line here is that there’s no magic in Hadoop. Like any other technology, you can typically achieve one or at best two of these facets, but in the absence of an unlimited budget, you typically need to sacrifice in some way.

Software development is full of trade-offs, and data lakes are no different. Read the whole thing.

Comments closed

Architecting a Data Lake in AWS

Published 2019-10-08 by Kevin Feasel

Gaurav Mishra takes us through data lake architecture on AWS:

Landing zone: This is the area where all the raw data comes in, from all the different sources within the enterprise. The zone is strictly meant for data ingestion, and no modelling or extraction should be done at this stage.
Curation zone: Here’s where you get to play with the data. The entire extract-transform-load (ETL) process takes place at this stage, where the data is crawled to understand what it is and how it might be useful. The creation of metadata, or applying different modelling techniques to it to find potential uses, is all done here.
Production zone: This is where your data is ready to be consumed into different application, or to be accessed by different personas.

This is a nice overview of data lake concepts and worth the read if you’re using AWS. Even if not, the same principles (if not the same technologies) apply for Azure, other clouds, and on-prem.

Comments closed

Category: Data Lake