Press "Enter" to skip to content

Category: Data Lake

The Need For Multiple Warehouse Architectures

James Serra argues in favor of a data lake approach and a traditional data warehouse:

I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution?  The short answer is you should use both.  The rest of this post will dig into the reasons why.

I touched on this ultimate question in a blog that is now over a few years old at Hadoop and Data Warehouses so this is a good time to provide an update.  I also touched on this topic in my blogs Use cases of various products for a big data cloud solutionData lake detailsWhy use a data lake?and What is a data lake? and my presentation Big data architectures and the data lake.  

Read on for James’s argument, which is good.  My argument is summed up as follows:  the purpose of a data warehouse is to solve known business problems—that is, to help build reports that people on the business side need based on established requirements.  The purpose of a data lake is to hold all kinds of data and curate it for when people come looking for something they didn’t know they needed.

Comments closed

Data Lake Archive Tier

Ust Oldfeld looks at an important part of a data lake:

The Archive access tier in blob storage was made generally available today (13th December 2017) and with it comes the final piece in the puzzle to archiving data from the data lake.

Where Hot and Cool access tiers can be applied at a storage account level, the Archive access tier can only be applied to a blob storage container. To understand why the Archive access tier can only be applied to a container, you need to understand the features of the Archive access tier. It is intended for data that has no or low SLAs for availability within an organisation and the data is stored offline (Hot and Cool access tiers are online). Therefore, it can take up to 15 hours for data to be made online and available. Brining Archive data online is a process called rehydration (fitting for the data lake). If you have lots of blob containers in a storage account, you can archive them and rehydrate them as required, rather than having to rehydrate the entire storage account.

Read on for more details, including a pattern for archiving data lake data.

Comments closed

Fetching U-SQL Job Input And Output Paths

Matthew Hicks shows how to retrieve information on U-SQL input and output paths using Powershell:

Each time you submit a U-SQL job, a job folder is created in your Azure Data Lake Store account. This folder contains useful debugging information about the job, including a file called the U-SQL algebra file. This is an XML file containing information about your job graph, the list of input and output files, and other key U-SQL job metadata.

We’ve just published a sample script that reads the U-SQL algebra file for a specified job and returns the input or output files. Give it a try!

Read on for more.

Comments closed

Automating Azure Data Lake Storage ACLs

Shannon Lowder shows how to automate Azure Data Lake Storage access control lists:

Now that you have these, you can use a for each loop to set your permissions.

foreach ($ACL in $ACLs) {
   write-host "Grant $useremail " $ACL[1] " access to " $ACL[0];
    Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adls -Path $ACL[0] -AceType User -Id $(Get-AzureRmADUser -Mail $useremail ).Id -Permissions $ACL[1]
    Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adls -Path $ACL[0] -AceType User -Id $(Get-AzureRmADUser -Mail $useremail ).Id -Permissions $ACL[1] -Default
}

Now, for each permission, we’ll set the ACL and the default.  Why set both?  Well, when folders are created under each of the target folders, you want to cascade those permissions down from parent to child, right?  Well, that’s what the Default ACL controls.  If you skip the second Set-AzureRMDataLakeStoreItemAclEntry, then new folders would not inherit the permissions of the containing folder and your users would be unable to access their files properly.

Read the whole thing.  Shannon also has one of the very few valid use cases for 3D pie charts.

Comments closed

Data Lake Zones

Shannon Lowder walks us through a multi-zone approach to storing data in a data lake:

Our first zone is the raw zone.  This zone will serve as the landing point for source files.  Like the extract (or stage) schema in our data warehouse, we want these files to match the source system as close as possible.In the data lake, we actually go one step beyond saying we want the schema of our raw files to match the source system, we also want these files to be immutable.

Immutable means once they are written to the raw folder we shouldn’t be able to modify or delete them.  That way, we can always reconstruct different states from these files without having to retrieve them from the source system.

Worth reading the whole thing.

Comments closed

Data Lakes Aren’t New

Shannon Lowder reveals one of the deep, dark data lake secrets:

Turns out there are three basic zones or areas to a data lake. Raw, Managed, and Presentation.

The raw zone should be optimized for fast storage.  The goal is to get the data in as quickly as possible.  Don’t make any changes to this data.  You want it stored as close to the original format as possible.  It sounds just like staged data to me.  Data you’d build an extract package to get from source to your staging environment, right?

Maybe you’re thinking this is just a coincidence…let’s move on.

Spoilers:  it’s not a coincidence.

Comments closed

Azure Data Lake Analytics Pipelines

Yan Li notes that Azure Data Lake Analytics now offers the ability to manage pipelines:

To make it easier to manage and understand jobs, ADLA now captures the pipeline and recurrence information for each job. This information can be used to connect and organize jobs belonging to the same pipeline or recurring instances. As shown in Fig 2, now jobs are organized by pipeline and recurring instances which enable you to:

  • Quickly identify jobs in pipelines which may have failed or taken longer than expected.

  • Get the aggregated statistics (e.g. job counts, successful and failed AU hours etc.) for a pipeline or a recurring instance

This is an interesting improvement.

Comments closed

Generating U-SQL Extract Scripts From Visual Studio

Yanan Cai shows a GUI for creating U-SQL EXTRACT scripts via Azure Data Lake Tools for Visual Studio:

One of U-SQL’s core capabilities is to be able to schematize unstructured data on the fly without having to create a metadata object for it. This capability is provided by the EXTRACT expression that will invoke either a user-defined extractor or built-in extractor to process the input file or set of files specified in the FROM clause and produces a rowset whose schema is specified in the EXTRACT clause.

While using the build-in extractor to schema semi-structured data, like data in .csv file, the schema definition in U-SQL is slow and error prone, especially for the .csv file contains hundreds of columns.

Recently, we released a new feature in the latest version of Azure Data Lake Tools for Visual Studio to help you generate this U-SQL EXTRACT statement automatically.

Click through for an example as well as a video showing the process.

Comments closed

The Data Lake From 10,000 Feet

Pradeep Menon has a high-level explanation of what a data lake is and how it differs from traditional data warehouses:

With the changes in the data paradigm, a new architectural pattern has emerged. It’s called as the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. Like the lake, it caters to need to different people, those who want to fish or those who want to take a boat ride or those who want to get drinking water from it, a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create a hypothesis. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present to stakeholders.

The way I compare a data lake to a data warehouse or a mart is like this:

Data Lake stores data in the purest form caters to multiple stakeholders and can also be used to package data in a form that can be consumed by end-users. On the other hand, Data Warehouse is already distilled and packaged for defined purposes.

One way of thinking about this is that data warehouses are great for solving known business questions:  generating 10K reports or other regulatory compliance reporting, building the end-of-month data, and viewing standard KPIs.  By contrast, the data lake is (among other things) for spelunking, trying to answer those one-off questions people seem to have but which the warehouse never seems to have quite the right set of information.

Comments closed