Press "Enter" to skip to content

Category: Data Lake

Data Lakes for Smaller Projects

Thomas Spicer explains that your data lake doesn’t need to be enormous to be useful:

We recently wrote an article debunking common myths about data lake architectures, data lake definitions, and data lake analytics. It is called “What is a Data Lake? Get A Leg Up Avoiding The Biggest Myths.” In that article, we framed the current conversation about data lakes and how they fit within enterprise data strategies. This topic has historically been confusing and opaque for those wanting to get value from a data lake due to conflicting advice from consultants and vendors.

One area that can be particularly confusing is the perception that lakes are only for “big data.” If you spend any time reading materials on lakes, you would think there is only one type and it would look like the Caspian Sea (it’s a lake despite “sea” in the name). People describe data lakes as massive, all-encompassing entities, designed to hold all knowledge. The good news is that lakes are not just for “big data” and you have more opportunities than ever to have them be part of your data stack.

Click through for Thomas’s argument.

Working with ADLS Gen 2 in Power Query

Marco Russo takes us through some ways to optimize performance when working with Azure Data Lake Storage Gen 2 from Power Query:

With Power Query you can apply filters to the list obtained by the File System View option, thus restricting access to only those files (or a single file) you are interested in. However, there is no query folding of this filter. What happens is that every time you refresh the data source, the list of all these files is read by Power Query; the filters in M Query on the folder path and the file name are then applied to this list only client-side. This is very expensive because the entire list is also downloaded when the expression is initially evaluated just to get the structure of the result of the transformation.

A better way to manage the process is to specify in the URL the complete folder path to traverse the hierarchy, and get only the files that are interesting for the transformation – or the exact path of the file if you are able to do that. For example, the data lake I used had one file for each day, stored in a folder structure organized by yyyy\mm, so every folder holds up to 31 files (one month).

Read on for more advice in this vein.
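
To make the difference concrete outside of Power Query, here is a minimal Python sketch of the same principle using the azure-storage-file-datalake package: listing everything and filtering client-side versus passing the folder path so the service only returns the relevant files. The account, credential, filesystem name, and yyyy/mm layout are assumptions for illustration.

```python
# Minimal sketch: server-side path restriction vs. client-side filtering
# in ADLS Gen2. All names below are hypothetical.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # hypothetical account
    credential="my-account-key",                           # hypothetical credential
)
fs = service.get_file_system_client("mydata")              # hypothetical filesystem

# Expensive: enumerate every path in the filesystem, then filter on the
# client -- analogous to filtering the File System View list in M.
december_slow = [p.name for p in fs.get_paths(recursive=True)
                 if p.name.startswith("2019/12")]

# Cheaper: pass the folder path so the service returns only the files
# under that folder (at most 31 daily files in this layout).
december_fast = [p.name for p in fs.get_paths(path="2019/12")]
```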

Storing Streaming Data in Azure Data Lake

Jesse Gorter takes us through writing streaming data from Event Hubs into Azure Data Lake Storage:

In my previous blog I showed how you can stream Twitter data to an Event Hub and stream the data to a Power BI live dashboard. In this post, I am going to show you how to store this data for long-term storage. An Event Hub stores your events temporarily. That means it does not store them for later analysis. Say you want to analyze whether negative or positive tweets have an impact on your sales; you would need to store tweets for a historical view.

The question is where to store this data: directly in the data warehouse, or in a data lake? This really depends on the architecture that you want to have. A data lake is often used to store the raw data historically. It is especially interesting because it allows you to store any kind of data, structured or unstructured, and it is quite cheap compared to Azure SQL Database or Azure SQL Data Warehouse. So for that reason, we are going to store it in a data lake.

Jesse walks us through data lake creation and data migration from Event Hubs into a Data Lake Storage container.
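
To get a rough idea of what that landing step can look like in code, here is a hedged Python sketch that reads events with the azure-eventhub package and writes each one to a date-partitioned folder in ADLS Gen2. The connection string, names, and raw/tweets path layout are hypothetical, and Event Hubs Capture can perform this same landing step without custom code.

```python
# Sketch: land Event Hubs events in ADLS Gen2 for long-term storage.
# All names and connection strings are hypothetical.
from datetime import datetime, timezone

from azure.eventhub import EventHubConsumerClient
from azure.storage.filedatalake import DataLakeServiceClient

storage = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="my-account-key",
)
fs = storage.get_file_system_client("datalake")

def on_event(partition_context, event):
    # One file per event in a date-partitioned raw zone; a real pipeline
    # would batch events into larger files.
    now = datetime.now(timezone.utc)
    path = f"raw/tweets/{now:%Y/%m/%d}/{event.sequence_number}.json"
    fs.get_file_client(path).upload_data(event.body_as_str(), overwrite=True)

consumer = EventHubConsumerClient.from_connection_string(
    "my-eventhub-connection-string",  # hypothetical
    consumer_group="$Default",
    eventhub_name="tweets",
)
with consumer:
    # Blocks and processes events until interrupted.
    consumer.receive(on_event=on_event, starting_position="-1")
```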

Data Lakes and the Power of Data Catalogs

Ashish Kumar and Jorge Villamariona take us through data lakes and data catalogs:

Any data lake design should incorporate a metadata storage strategy to enable business users to search, locate and learn about the datasets that are available in the lake. While traditional data warehousing stores a fixed and static set of meaningful data definitions and characteristics within the relational storage layer, data lake storage is intended to support the application of schema at read time with flexibility. However, this means that a separate storage layer is required to house cataloging metadata that represents technical and business meaning. While organizations sometimes simply accumulate content in a data lake without a metadata layer, this is a recipe for an unmanageable data swamp instead of a useful data lake. There are a wide range of approaches and solutions to ensure that appropriate metadata is created and maintained. Here are some important principles and patterns to keep in mind. A single dataset can have multiple metadata layers depending on the use case (e.g., Hive Metastore, AWS Glue). The same data can also be exported to a NoSQL database, which would have a different schema.

Having a bunch of data isn’t helpful if you don’t know where it is, how it’s formatted, or anything else about the data.
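
As a small illustration of one such metadata layer, here is a PySpark sketch that registers an external Hive Metastore table over files already sitting in the lake, making the dataset discoverable by name and schema without moving any data. The table name, columns, and path are hypothetical.

```python
# Sketch: add a cataloging metadata layer (Hive Metastore) over existing
# data lake files. Table name, columns, and path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The data stays where it is; only metadata lands in the metastore.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
        order_id BIGINT,
        amount   DOUBLE,
        sold_at  TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://my-lake/raw/sales/'
""")

# Business users and tools can now search for and inspect the dataset.
spark.sql("DESCRIBE TABLE sales_raw").show()
```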

Good Ideas for Designing Data Lakes

Prateek Shrivastava and Rangasayee Chandrasekaran share some advice on designing data lakes in the cloud:

Data generation and data collection across semi-structured and unstructured formats is both bursty and continuous. Inspecting, exploring and analyzing these datasets in their raw form is tedious, because the analytical engines scan the entire data set across multiple files. We recommend five ways to reduce the data scanned and query overheads –

Click through for the details.
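
The five ways are in the linked post; as one common instance of reducing data scanned, here is a hedged PySpark sketch that partitions a dataset on write so query-time filters can prune whole folders. Paths and columns are made up for illustration.

```python
# Sketch: partition on write so engines scan only matching folders.
# Paths and the event_time column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.json("s3a://my-lake/raw/events/")

# Each partition becomes its own folder: year=2019/month=12/...
(events
    .withColumn("year", F.year("event_time"))
    .withColumn("month", F.month("event_time"))
    .write.partitionBy("year", "month")
    .parquet("s3a://my-lake/curated/events/"))

# Filters on partition columns prune folders instead of scanning
# the entire dataset.
december = (spark.read.parquet("s3a://my-lake/curated/events/")
            .where("year = 2019 AND month = 12"))
```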

Delta Lake and ACID Properties

Kundan Kumarr notes that Spark’s Delta Lake allows for ACID transactions:

DeltaLog is the crux of Delta Lake; it ensures atomicity, consistency, isolation, and durability of user-initiated transactions. DeltaLog is an ordered record of transactions. Every transaction performed since the inception of the Delta Lake table has an entry in the DeltaLog (also known as the Delta Lake transaction log). It acts as a single source of truth, giving users access to the last version of a DeltaTable’s state. It provides serializability, the strongest isolation level. Let’s see how DeltaLog ensures ACID transactions.

Click through for the explanation.
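
If you want to see the DeltaLog for yourself, here is a short PySpark sketch, assuming the delta-spark package and a local path, in which each operation becomes one ordered commit under _delta_log/.

```python
# Sketch: every Delta table operation is an atomic transaction recorded
# as an ordered JSON commit in the table's _delta_log/ folder.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/people"  # hypothetical local table

# Three transactions, three commit files in /tmp/delta/people/_delta_log/.
spark.range(5).write.format("delta").mode("overwrite").save(path)
table = DeltaTable.forPath(spark, path)
table.update(condition="id = 3", set={"id": "30"})
table.delete("id = 4")

# One row per DeltaLog commit -- the single source of truth for state.
table.history().select("version", "operation").show()
```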

Data Lake File Formats and Security

Ashish Kumar and Jorge Villamariona continue a series on data lakes:

People from a traditional RDBMS background are often surprised at the extraordinary amount of control that data lake architects have over how datasets can be stored. Data lake architects, as opposed to relational database administrators, get to determine an array of elements such as file sizes, type of storage (row vs. columnar), degrees of compression, indexing, schemas, and block sizes. These relate to the big-data-oriented ecosystem of formats commonly used for storing and accessing information in a data lake.

It is a bit of a different world and it comes with trade-offs. The whole thing is worth reading.
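
As a taste of those knobs, here is a hedged PySpark sketch that touches three of them: columnar storage, an explicit compression codec, and the number (and therefore rough size) of output files. Paths are hypothetical.

```python
# Sketch: a few storage decisions a data lake architect controls.
# Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

logs = spark.read.json("s3a://my-lake/raw/logs/")

# Columnar format (Parquet) with an explicit codec; coalesce controls
# how many files -- and therefore roughly how large each file -- the
# write produces.
(logs.coalesce(8)
     .write.option("compression", "gzip")
     .parquet("s3a://my-lake/curated/logs/"))
```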

What’s New with Delta Lake

Denny Lee and Tathagata Das announce Delta Lake 0.5.0:

With the following pull requests, you can now run even more Delta Lake operations concurrently. With finer-grained conflict detection, these updates make it easier to run complex workflows on Delta tables, such as:

– Running deletes (e.g. for GDPR compliance) concurrently on older partitions while newer partitions are being appended.
– Running file compactions concurrently with appends.
– Running updates and merges concurrently on disjoint sets of partitions.

Click through for the full changelog.
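
As a hedged sketch of the first bullet, here are the two operations side by side in PySpark: a delete that touches only old partitions and an append that touches only new ones, so finer-grained conflict detection lets them run concurrently without retries. The table layout and paths are made up, and the Spark session is assumed to be Delta-configured as in the earlier sketch.

```python
# Sketch: concurrent delete + append on disjoint partitions.
# Paths and the date column are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed Delta-configured
path = "/tmp/delta/events"

# Job 1 (e.g. GDPR cleanup) deletes only from old partitions...
DeltaTable.forPath(spark, path).delete("date < '2019-01-01'")

# ...while Job 2 appends only new partitions. The transactions touch
# disjoint files, so neither needs to retry.
new_day = spark.read.json("/tmp/incoming/2020-01-15/")
new_day.write.format("delta").mode("append").save(path)
```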

Data Lake Storage and Data Processing

Ashish Kumar has started a series on data lake essentials:

Data Lake architecture is all about storing large amounts of data which can be structured, semi-structured or unstructured, e.g. web server logs, RDBMS data, NoSQL data, social media, sensors, IoT data and third-party data. A data lake can store the data in the same format as its source systems or transform it before storing.

The main purpose of a data lake is to make organizational data from different sources accessible to a variety of end users like business analysts, data engineers, data scientists, product managers, executives, etc., in order to enable these personas to leverage insights in a cost-effective manner, for improved business performance. Today, many forms of advanced analytics are only possible on data lakes.

Click through for more information on what a data lake should provide—whether that be in-house or a cloud provider.

Using ACLs to Secure Azure Data Lake Data

Matthew Roche takes us through access control lists (ACLs) in Azure Data Lake Storage Gen2 and how they apply to Power BI:

Earlier this week I received a question from a customer on how to get Power BI to work with data in ADLSg2 that is secured using ACLs. I didn’t know the answer, but I knew who would know, and I looped in Ben Sack from the dataflows team. Ben answered the customer’s questions and unblocked their efforts, and he said that I could turn them into a blog post. Thank you, Ben!

Read on for the answer.
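
For readers who want to script this side of it, here is a hedged Python sketch of granting a principal access via ACLs with the azure-storage-file-datalake package; the account, filesystem, folders, and object ID are all hypothetical. The key rule is that the principal needs execute (x) on every folder along the path and read (r) on the data itself.

```python
# Sketch: grant an AAD principal read access to one folder via ACLs.
# All names and IDs below are hypothetical.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="my-account-key",
)
fs = service.get_file_system_client("powerbi")

oid = "00000000-0000-0000-0000-000000000000"  # the reader's AAD object ID

# Execute (x) lets the principal traverse folders on the way down;
# update_access_control_recursive merges this entry into existing ACLs.
fs.get_directory_client("/").update_access_control_recursive(
    acl=f"user:{oid}:--x")

# Read + execute (r-x) on the data folder and everything under it.
fs.get_directory_client("sales/2019").update_access_control_recursive(
    acl=f"user:{oid}:r-x")
```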
