One of U-SQL’s core capabilities is the ability to schematize unstructured data on the fly without having to create a metadata object for it. This capability is provided by the EXTRACT expression, which invokes either a user-defined extractor or a built-in extractor to process the input file or set of files specified in the FROM clause and produces a rowset whose schema is specified in the EXTRACT clause.
When you use a built-in extractor to schematize semi-structured data, such as data in a .csv file, writing the schema definition in U-SQL by hand is slow and error-prone, especially when the .csv file contains hundreds of columns.
Recently, we released a new feature in the latest version of Azure Data Lake Tools for Visual Studio to help you generate this U-SQL EXTRACT statement automatically.
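To make the idea concrete, here is a minimal Python sketch of what generating an EXTRACT statement from a .csv sample could look like. It is not how the Azure Data Lake Tools feature works internally; the function names, the type-inference heuristic (only int, double, and string), and the sample data are all assumptions made for illustration.

```python
import csv
import io


def infer_usql_type(values):
    """Guess a U-SQL type from sample strings (rough heuristic, an assumption)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for caster, usql_type in ((int, "int"), (float, "double")):
        try:
            for v in non_empty:
                caster(v)
            return usql_type
        except ValueError:
            continue
    return "string"


def generate_extract(csv_text, input_path, rowset_name="@input"):
    """Build a U-SQL EXTRACT statement from CSV text that has a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(header):
        col_values = [row[i] for row in samples if i < len(row)]
        # Bracket-quote column names so unusual header text stays a valid identifier.
        columns.append(f"        [{name}] {infer_usql_type(col_values)}")
    column_list = ",\n".join(columns)
    return (
        f"{rowset_name} =\n"
        f"    EXTRACT\n{column_list}\n"
        f"    FROM \"{input_path}\"\n"
        f"    USING Extractors.Csv(skipFirstNRows: 1);"
    )


# Hypothetical sample file contents and path.
sample = "id,name,score\n1,Ann,3.5\n2,Bob,4\n"
print(generate_extract(sample, "/data/sample.csv"))
```

A real tool would need to handle nullable columns, dates, and wider numeric types, which this sketch deliberately leaves out.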
Click through for an example as well as a video showing the process.
With the changes in the data paradigm, a new architectural pattern has emerged: the Data Lake Architecture. Like the water in a lake, data in a data lake is in the purest possible form. And just as a lake caters to the needs of different people (those who want to fish, those who want to take a boat ride, and those who want drinking water), a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create a hypothesis. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present to stakeholders.
The way I compare a data lake to a data warehouse or a mart is like this:
A data lake stores data in its purest form, caters to multiple stakeholders, and can also be used to package data in a form that end users can consume. A data warehouse, on the other hand, holds data that is already distilled and packaged for defined purposes.
One way of thinking about this is that data warehouses are great for solving known business questions: generating 10-K reports or other regulatory compliance reporting, building the end-of-month data, and viewing standard KPIs. By contrast, the data lake is (among other things) for spelunking, trying to answer those one-off questions people seem to have but for which the warehouse never seems to have quite the right set of information.
Until now, if you had to analyze data stored in ADLS with Excel, you would have to copy it into a relational data store like Azure SQL Data Warehouse or download the data onto a machine, and then use Excel to analyze that data. This was rather cumbersome, involving additional cost and time. With this new support, you can now access files stored in ADLS with Excel in place, without having to copy them to other stores or locations. You can quickly get advanced insights into raw or prepared data. Models and queries that you created in Excel and ran against local data can be run seamlessly against data stored in ADLS.
Security capabilities of ADLS allow administrators to control access to the data stored in ADLS in a discretionary manner. With this, you can limit the access that Excel users have to the data in ADLS. In this manner, data in the ADLS-based data lake continues to be the single source of truth, with no redundant copies, and can be analyzed by the analytics tools of your choice.
Click through for a demo video.
A data lake is a concept that opposes the idea of a data mart. Where a data mart is a silo with structured and cleansed data, a data lake is a huge data collection that is unstructured and raw. You could also say that a data mart is a bottle of clean water whereas the data lake is the lake with (not so clean) water. 🙂
Now why would you want a data lake? Imagine you are generating huge logfiles, for example in airplanes, where machines track air pressure, temperature, and so on. If something goes wrong, you definitely want to be alerted. That is event-driven: “if A and B happen, alert pilot, or do C” and there are tools for dealing with that kind of streaming data. But what if the plane landed safely? What do you do with all that data? You do not need it anymore, right?
Well, some people would say: “Wrong”. You might need that data later for reasons you do not know today. Google, Microsoft and Facebook are all hoarding data, including data they are not sure they will ever need. This data could later prove to be valuable for AI, machine learning or for something else.
Read the whole thing. The data lake concept is powerful, but it requires at least as much data governance as prior models. Just because you can dump a bunch of files without thinking about it doesn’t mean you’ll get back something useful later.
Local Debug enables you to debug your C# code-behind, step through the code, and validate your script locally before submitting to ADLA.
Use the ADL: Start Local Run Service command to start the local run service, set a breakpoint in your code-behind, then run the ADL: Local Debug command to start the local debug service. You can debug through the debug console and view parameter, variable, and call stack information.
Click through to see the other improvements.
The data lake introduces a paradigm shift in data analysis:
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
This allows you to avoid a lot of up-front work before you are able to analyze data. With the old way, you have to know the questions to ask. The new way supports situations when you don’t know the questions to ask.
This solves the two biggest reasons why many EDW projects fail:
Too much time spent modeling when you don’t know all of the questions your data needs to answer
Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value
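The ingest-first ordering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not anything from the original post: the JSON-lines format, the function names, and the sensor fields are all made up for the example. Raw records are stored untouched, and a schema is applied only when a question is actually asked.

```python
import json


def ingest(store, line):
    """Ingest: keep the raw record exactly as it arrived, with no schema enforced."""
    store.append(line)


def read_with_schema(store, schema):
    """Structure: apply a schema (field name -> caster) only at read time."""
    for line in store:
        record = json.loads(line)
        # Fields missing from a record are skipped rather than rejected.
        yield {
            name: cast(record[name])
            for name, cast in schema.items()
            if name in record
        }


# Hypothetical raw data; the second record lacks a pressure reading.
store = []
ingest(store, '{"sensor": "a", "temp": "21.5", "pressure": "1012"}')
ingest(store, '{"sensor": "b", "temp": "19.5"}')

# Analyze: the question (average temperature) decides which fields get structured.
temps = [row["temp"] for row in read_with_schema(store, {"temp": float})]
print(sum(temps) / len(temps))
```

The trade-off is that every reader has to cope with missing or malformed fields, which is part of why governance still matters with this approach.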
There are some good details here. My addition would be to reiterate the importance of a good data governance policy.
Yan Li has a three-part series looking at management of Azure Data Lake compute. First, an overview:
Scenario 2: Set One Specific Group to Different Limits
New members are joining and sharing the same ADLA account. To prevent any new members, who are just learning ADLA, from mistakenly submitting a job that consumes too much compute resource (increasing cost and blocking other jobs), customers want to set the maximum AU per job for new employees at 30 AUs while others can submit jobs with up to 100 AUs.
Default Policy:
- Job AU limit: 100
- Priority limit: 1
Exception Policy: New Employee Policy
- Job AU limit: 30
- Priority limit: 200
- Group: New Employee Group
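As a rough sketch of how the limits in this scenario might be evaluated, the Python below models the default limits (100 AUs, priority 1) and the exception for the New Employee Group (30 AUs, priority 200). This is not ADLA's API or its actual resolution logic. It assumes that a lower priority number means higher priority, so a job's priority number may not be smaller than the limit, and that the first matching exception wins; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobPolicy:
    max_au: int
    # Assumption: lower number = higher priority, so this is the smallest
    # priority number a user may request.
    priority_limit: int


DEFAULT_POLICY = JobPolicy(max_au=100, priority_limit=1)
EXCEPTIONS = {"New Employee Group": JobPolicy(max_au=30, priority_limit=200)}


def effective_policy(user_groups):
    """Return the first matching exception policy, else the default (assumed rule)."""
    for group in user_groups:
        if group in EXCEPTIONS:
            return EXCEPTIONS[group]
    return DEFAULT_POLICY


def check_job(user_groups, requested_au, priority):
    """True if the job stays within the submitter's job-level limits."""
    policy = effective_policy(user_groups)
    if requested_au > policy.max_au:
        return False
    if priority < policy.priority_limit:
        return False
    return True


# A new employee asking for 50 AUs is rejected; an established user is not.
print(check_job(["New Employee Group"], 50, 200))
print(check_job([], 50, 200))
```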
Next up is a look at job-level policies:
With job-level policies, you can control the maximum AUs and the maximum priority that individual users (or members of security groups) can set on the jobs that they submit. This allows you to not only control the costs incurred by your users but also control the impact they might have on high priority production jobs running in the same ADLA account.
There are two parts to a job-level policy:
- Default Policy: This is the policy that is applied to all users of the service.
- Exceptions: The set of “exception” policies that apply to specific users.
Submitted jobs that do not violate the job-level policies are still subject to the account-level policies as described in Azure Data Lake Analytics Account Level Policy.
Finally, account-level policies:
ADLA supports three types of account-level policies:
Maximum AUs: Controls the maximum number of AUs that can be used by running jobs.
Maximum Number of Running Jobs: Controls the maximum number of concurrently running jobs.
Days to Retain Job Queries: Controls how long detailed information about jobs is retained in the user's ADLS account.
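The first two account-level policies can be pictured as an admission check applied when a job is submitted. The Python below is a simplified model, not ADLA's implementation: the class name, the limit values, and the choice to report a job as "queued" when a limit would be exceeded are assumptions for illustration, and the retention policy is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    max_au: int            # Maximum AUs across all running jobs
    max_running_jobs: int  # Maximum number of concurrently running jobs
    running: dict = field(default_factory=dict)  # job name -> AUs in use

    def can_start(self, requested_au):
        """True if starting the job keeps the account within both limits."""
        au_in_use = sum(self.running.values())
        return (
            len(self.running) < self.max_running_jobs
            and au_in_use + requested_au <= self.max_au
        )

    def submit(self, name, requested_au):
        """Start the job if it fits; otherwise report it as queued (assumed behavior)."""
        if self.can_start(requested_au):
            self.running[name] = requested_au
            return "running"
        return "queued"


# Hypothetical limits: 50 AUs in total, at most 2 jobs running at once.
account = Account(max_au=50, max_running_jobs=2)
for job, au in [("etl", 30), ("report", 30), ("adhoc", 20), ("tiny", 1)]:
    print(job, account.submit(job, au))
```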
There’s a good amount of information here.
For the rest of this post, I assume that you have some basic familiarity with Python, Pandas and Jupyter.
On your machine, you will need all of the following installed:
Python 2 or 3 with Pip
Amit shows two separate methods for retrieving data, so check it out.
The ADL Tools for VSCode integrate well with ADLA. Azure Data Lake includes the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. U-SQL on ADLA offers Job as a Service with the Microsoft-invented U-SQL language. Customers do not have to manage the deployment of clusters; they can simply submit their jobs to ADLA, an analytics platform managed by Microsoft.
Click through for the full announcement.
The Azure Data Lake (ADL) vision from the beginning has been to transform business data into intelligence by providing analytics on any data at cloud scale. ADL enterprise customers gain insights on their business data using a wide range of tools and platforms. Today’s release of Cloudera Enterprise 5.11 brings another very valuable and widely used Hadoop computation platform to the set of platforms that can leverage ADLS. No matter what big data analytics platform you choose, Azure Data Lake Store provides a single high-throughput, enterprise-scale, hierarchical file system data lake repository for big data.
Anyone with an Azure subscription can now deploy Cloudera clusters with ADLS. To get started, you can use the Cloudera Enterprise Data Hub template or the Cloudera Director template on Azure Marketplace to create a Cloudera cluster. Once the cluster is up, see here for more information on how to set up your Cloudera cluster with ADLS today!
That’s an interesting development.