
Category: Data Lake

Overview: U-SQL Database Projects

Zach Stagers gives us an overview of the new U-SQL Database Project structure:

Source Control

The project integrates much more nicely with TFS than the older “U-SQL Project” does.

It actually gives you the icons (padlock, check mark, etc.) in the solution explorer, so it looks like it’s under source control!

Something I’d really hoped had been fixed, but hasn’t been, is that when you copy and rename an existing item, the rename isn’t recognized. You have to undo the checkout of the non-existent object (the copy, before it was renamed).

Read on for more improvements.


Using Azure Data Lake Analytics With Integration Services

Yanan Cai announces that Azure Data Lake Analytics has a new task in the Azure Feature Pack for SQL Server Integration Services:

With the ADLA Task in the Azure Feature Pack, you can now create and orchestrate U-SQL jobs as part of the SSIS workflow to process big data in the cloud. As ADLA is a serverless analytics service, you don’t need to worry about cluster creation and initialization; all you need is an ADLA account to start your analytics.

You can get the U-SQL script from different places by using SSIS built-in functions. You can:

  • Edit the inline U-SQL script in the ADLA Task to call table-valued functions and stored procedures in your U-SQL databases.

  • Use the U-SQL files stored in ADLS or Azure Blob Storage by leveraging Azure Data Lake Store File System Task and Azure Blob Download Task.

  • Use U-SQL files from the local file system directly using the SSIS File Connection Manager.

  • Use an SSIS variable that contains the U-SQL statements. You can also use an SSIS expression to generate the U-SQL statements dynamically.

Read on for more information and a link to download the pack.


Data Lakes eBook

Melissa Coates has a free eBook available:

I wrote the updated content from a practical point of view, totally hype-free. The table of contents:

  • Modern Data Architecture
  • Business Needs Driving Data Architectures to Evolve and Adapt
  • Principles of a Modern Data Architecture
  • Data Lake + Data Warehouse: Complementary Solutions
  • Tips for Designing a Data Lake
  • Azure Technologies for Implementing a Data Lake
  • Considerations for a Successful Data Lake in the Cloud
  • Getting Started with a Data Lake

To download the ebook, BlueGranite will ask you to register your information. That’s common for premium content like this. We take a low-key approach to sales, so I can assure you that registration only means you’ll receive notifications of new content that you may find interesting.

It’s the length of a good-sized paper, so you won’t have to invest dozens of hours to get the story.


Data Lakes And Data Swamps

Randolph West talks about data lakes:

Internet companies including search engines (Google, Bing), social media companies (Facebook, Twitter), and email providers (Yahoo!, Outlook.com) are managing data stores measured in petabytes. On a daily basis these organizations handle all sorts of structured and unstructured data.

Assuming they put all their data in one repository, that could technically be thought of as a data lake. These organizations have adapted existing tools, and even created new technologies, to manage data of this magnitude in a field called big data.

The short version: big data is not a 100 GB SQL Server database or data warehouse. Big data is a relatively new field that came about because traditional data management tools are simply unable to deal with such large volumes of data. Even so, a single SQL Server database can allegedly be more than 500 petabytes in size, but Michael J. Swart warns us that if you’re using over 10% of what SQL Server restricts you to, you’re doing it wrong.

Incidentally, I’ll note that the term data swamp has a storied history here at Curated SQL.


Backing Up Azure Data Lake Store Data

Hugo Almeida has some hints for backing up Azure Data Lake Store data using Azure Data Factory:

Our Hadoop HDP IaaS cluster on Azure uses Azure Data Lake Store (ADLS) as its data repository and accesses it through an application user created in Azure Active Directory (AAD). Check this tutorial if you want to connect your own Hadoop to ADLS.

Our ADLS is getting bigger and we’re working on a backup strategy for it. ADLS provides locally-redundant storage (LRS); however, this does not prevent our application from corrupting data or accidentally deleting it. Since Microsoft hasn’t published a new version of ADLS with a clone feature, we had to find a way to back up all the data stored in our data lake.

We’re going to show you how to do a full ADLS backup with Azure Data Factory (ADF). ADF does not preserve permissions. However, our Hadoop client can only access the AzureDataLakeStoreFilesystem (adl) through Hive with a “hive” user, and we can generate these permissions before the backup.
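
If you just need a quick, ad hoc copy rather than a full Data Factory pipeline, the azure-datalake-store Python package can also pull everything down in one pass. Here’s a minimal sketch, assuming an AAD service principal with access to the store; the account name, credentials, and paths are placeholders rather than anything from Hugo’s post:

    # Minimal sketch: pull a full copy of an ADLS (Gen1) account down to local
    # disk with the azure-datalake-store package. Tenant, service principal,
    # store name, and paths are placeholders.
    from azure.datalake.store import core, lib, multithread

    token = lib.auth(tenant_id="<tenant-guid>",
                     client_id="<app-id>",
                     client_secret="<app-secret>")

    adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

    # Recursively download everything under the root; nthreads controls parallelism.
    multithread.ADLDownloader(adls,
                              rpath="/",
                              lpath="/backup/adls-full",
                              nthreads=16,
                              overwrite=True)

Like the ADF copy described in the post, this doesn’t carry the ACLs across, so permissions still have to be re-applied on whatever holds the backup.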

Read the whole thing if you’re thinking of using Azure Data Lake Store.


Alerting On Azure Data Lake Store Data Usage

Jose Lara shows off an interesting feature in Azure Data Lake Store:

The massive scale and capabilities of Azure Data Lake Store are regularly used by companies for big data storage. As the number of files, file types, and folders grow, things get harder to manage and staying compliant becomes a greater challenge for companies. Regulations such as GDPR (General Data Protection Regulation) have heightened requirements for control and supervision of files that contain sensitive data.

In this blog post, I’ll show you how to set up alerts in your Azure Data Lake Store to make managing your data easier. We will create a log analytics query and an alert that monitors a specific path and file type and sends a notification whenever the path or file is created, accessed, modified, or deleted.

Auditing access has historically been tricky, so it’s nice that they were able to get that in.


Azure Data Lake Analytics Updates

Michael Rys has a boatload of new updates for Azure Data Lake:

The top items include expanding our built-in support for standard file formats with native Parquet support for extractors and outputters (in public preview) and ORC (in private preview)!

In addition, since the fast file set feature has now been generally released, we can consume hundreds of thousands of such files in bulk in a single EXTRACT statement. We will publish a blog post at a later date to give you much more detailed information on how this capability helps you process so many files efficiently in a scalable way.

Important aspects of processing files at scale include:

  1. the ability to generate many files from a rowset in a single statement, providing a way to dynamically partition the data for future use with Hadoop or Spark, or to provide individual files for customers. This has been our top customer ask on the ADL Feedback forum, and now it is in private preview!

  2. the ability to handle many small files. We recommend that you make your files large enough for the processing to be efficient (300 MB to 4 GB is a good range), but often your file formats (e.g., images) or data ingestion pipelines (e.g., EventHub archives) are not able to reach that size. Thus, we are adding the ability to group several files into a vertex to increase efficiency and lower the cost of your job (we have seen 10 to 30 times improvement in some customer jobs!).

Read on for the full changelog.


Using AU Analyzer To Lower Data Lake Analytics Costs

Matthew Hicks shows off the Data Lake Analytics AU Analyzer:

The AU Analyzer looks at all the vertices (or nodes) in your job, analyzes how long they ran and their dependencies, then models how long the job might run if a certain number of vertices could run at the same time. Each vertex may have to wait for input or for its spot in line to run. The AU Analyzer isn’t 100% accurate, but it provides general guidance to help you choose the right number of AUs for your job.

You’ll notice that there are diminishing returns when assigning more AUs, mainly because of input dependencies and the running times of the vertices themselves. So, a job with 10,000 total vertices likely won’t be able to use 10,000 AUs at once, since some will have to wait for input or for dependent vertices to complete.

In the graph below, here’s what the modeler might produce, when considering the different options. Notice that when the job is assigned 1427 AUs, assigning more won’t reduce the running time. 1427 is the “peak” number of AUs that can be assigned.
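
The intuition is essentially critical-path scheduling: extra AUs help only until the longest chain of dependent vertices becomes the bottleneck. Here’s a toy model of that tradeoff (my own sketch, not the AU Analyzer’s actual algorithm, with made-up numbers rather than figures from the job above):

    # Toy model of AU scaling -- not the real AU Analyzer. The estimated job time
    # at N AUs is bounded below both by the critical path (dependent vertices
    # cannot overlap) and by the total work spread across N AUs.
    def estimated_job_seconds(total_vertex_seconds, critical_path_seconds, aus):
        return max(critical_path_seconds, total_vertex_seconds / aus)

    total_work = 50_000      # sum of all vertex running times, in seconds (made up)
    critical_path = 35       # longest chain of dependent vertices, in seconds (made up)

    for aus in (10, 100, 500, 1000, 1500, 2000):
        print(aus, round(estimated_job_seconds(total_work, critical_path, aus), 1))

    # Past total_work / critical_path (about 1,429 AUs with these numbers), the
    # critical path dominates and extra AUs no longer shorten the job -- the same
    # kind of "peak" the analyzer reports.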

I like this kind of tooling, as it provides a realistic assessment of tradeoffs.


Exposing Azure Data Lake Store Data With Power BI

Melissa Coates shows how you can use Power BI to access data in Azure Data Lake Store:

What can you query from ADLS?

You can connect to the data stored in Azure Data Lake Store. What you *cannot* connect to currently is the data stored in the Catalog tables/views/stored procedures within Azure Data Lake Analytics (hopefully connectivity to the ADLA Catalog objects from tools other than U-SQL is available soon).

You’re not sending a U-SQL query here. Rather, you’re sending a web API request to an endpoint.

With an ADLS data source, you have to import the data into Power BI Desktop. There is no option for DirectQuery.
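
The endpoint in question is the WebHDFS-compatible REST API that ADLS exposes, and you can poke at it directly to see the sort of thing the connector works with. A rough sketch in Python (the account name, file path, and bearer token are placeholders; in Power BI the organizational sign-in takes care of the token for you):

    # Rough sketch of reading a file through the ADLS (Gen1) WebHDFS-compatible
    # REST API. Account name, path, and token are placeholders.
    import requests

    account = "<adls-account-name>"
    path = "/curated/sales/daily_sales.csv"
    token = "<aad-bearer-token>"

    resp = requests.get(
        f"https://{account}.azuredatalakestore.net/webhdfs/v1{path}",
        params={"op": "OPEN", "read": "true"},
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    print(resp.content[:200])  # raw file bytes -- no U-SQL query involved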

In other words, this is for data that you’ve already prepped using U-SQL and want to display to the outside world. Click through for a demonstration as well as additional helpful information.


Data Lake Permissions

Melissa Coates has started a multi-part series on Azure Data Lake permissions.  She’s put up the first three parts already.  Part 1 covers the types of permissions available as well as some official documentation:

(1) RBAC permissions to the ADLS account itself, for the purpose of managing the resource.
RBAC = Role-based access control. RBAC roles are the familiar Azure roles such as reader, contributor, or owner. Granting a role on the service allows someone to view or manage the configuration and settings for that particular Azure service (ADLS in this case). See Part 2 for info about setting up RBAC.

Part 2 looks at permissions for the Azure Data Lake Store service itself:

Setting permissions for the service + the data stored in ADLS is always two separate processes, with one exception: when you define an owner for the ADLS service in Azure, that owner is automatically granted ‘superuser’ (full) access to manage the ADLS resource in Azure *AND* full access to the data. Any RBAC role other than owner needs the data access specifically assigned via ACLs. This is a good thing because not all system administrators need to see the data, and not all data access users/groups/service principals need access to the service itself. This type of separation is true for certain other services too, such as Azure SQL Database.

Try to use groups whenever you can to grant access, rather than individual accounts. This is a consistent best practice for managing security across many types of systems.

Part 3 covers using ACLs to grant rights to specific files or folders in Azure Data Lake Storage:

There are two types of ACLs: Access ACLs and Default ACLs.

An Access ACL is the read/write/execute permissions specified for a folder or file. Every single folder or file has its security explicitly defined — so that means the ADLS security model is not an ‘inheritance’ model. That is an important concept to remember.

Default ACL is like a ‘template’ setting at a folder level (the concept of a default doesn’t apply at the file level). Any new child item placed in that folder will automatically obtain that default security setting. The default ACLs are absolutely critical, given that data permissions aren’t an inheritance model. You want to avoid a situation where a user has permission to read a folder, but is unable to see any of the files within the folder — that situation will happen if a new file gets added to a folder which has an access ACL set at the folder level, but not a default ACL to apply to new child objects.
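
To make that concrete, here’s roughly what granting an AAD group read access to a folder looks like with the azure-datalake-store Python package (the ACL helper shown is how I remember that SDK; the store name, path, and group object ID are placeholders). The grant goes on twice: once as the access ACL and once as the default ACL so that future child items pick it up:

    # Sketch: grant an AAD group read + execute on a folder, both as the access
    # ACL and as the default ACL so that new child items inherit it. Store name,
    # path, and group object ID are placeholders.
    from azure.datalake.store import core, lib

    token = lib.auth(tenant_id="<tenant-guid>",
                     client_id="<app-id>",
                     client_secret="<app-secret>")
    adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

    group_id = "<aad-group-object-id>"

    # Access ACL: lets the group list and traverse the folder as it exists today.
    adls.modify_acl_entries("/raw/sales", acl_spec=f"group:{group_id}:r-x")

    # Default ACL: applied automatically to new files and folders created under it.
    adls.modify_acl_entries("/raw/sales", acl_spec=f"default:group:{group_id}:r-x")

Using a group here rather than individual accounts also lines up nicely with the advice in part 2.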

There’s a lot of good information here and I’m looking forward to parts 4 and 5.
