Handling the varying formats in U-SQL involves a few steps if it’s the first time you’ve done this:
Upload custom JSON assemblies [one time setup]
Create a database [one time setup]
Register custom JSON assemblies [one time setup]
Upload JSON file to Azure Data Lake Store [manual step as an example–usually automated]
Run U-SQL script to “standardize” the JSON file(s) into a consistent CSV column/row format
Melissa then shows us how to do this step-by-step.
It seemed some of the rows in my CSV files exceeded an upper limit on how much the Extractor.Csv function can handle and adding the silent:true parameter didn’t solve the issue.
I dug a bit deeper and found rows in some of the files that are long – really long. One in particular was 47MB long just for the row and this was valid data. I could have manually edited these outs by hand but thought I’d see if I could solve another way.
After some internet research and a couple of helpful tweets to and from Michael Rys, I decided to have a go at making my own custom U-SQL extractor.
Phillip has included the custom extractor code, so if you find yourself needing to parse very large rows of data in U-SQL, you’ll definitely be interested in this.
Many common scenarios for U-SQL developers require constructing a RowSet made up of a simple range of numbers or dates, for example the integers from 1 to 10. In this blog post we’ll take a look at options for doing this in U-SQL. In the process, we’ll get a chance to learn how to use some common U-SQL features:
Creating RowSets from constant values
Using CROSS JOIN
Using SELECT to map integers to DateTimes
Using CREATE TABLE to create a table directly from a RowSet. This is sometimes called “CREATE TABLE AS SELECT” and often abbreviated as “CTAS“.
Click through to learn more.
One of U-SQL’s core capabilities is to be able to schematize unstructured data on the fly without having to create a metadata object for it. This capability is provided by the EXTRACT expression that will invoke either a user-defined extractor or built-in extractor to process the input file or set of files specified in the FROM clause and produces a rowset whose schema is specified in the EXTRACT clause.
While using the build-in extractor to schema semi-structured data, like data in .csv file, the schema definition in U-SQL is slow and error prone, especially for the .csv file contains hundreds of columns.
Recently, we released a new feature in the latest version of Azure Data Lake Tools for Visual Studio to help you generate this U-SQL EXTRACT statement automatically.
Click through for an example as well as a video showing the process.
This is a quick tip about syntax for handling row headers in U-SQL, the data processing language of Azure Data Lake Analytics. There are two components: handling row headers on the source data which is being queried, and row headers on the dataset being generated by ADLA.
Click through for the one-liners as well as sample queries.
First, let’s talk about “zipimport”. Thanks to the adoption of PEP 273 – Python had the ability to import modules from ZIP files since Python 2.3. This ability is called “zipimport” and is a built-in feature of the Python’s existing import statement. Read the zipimport documentation now.
To review the basics.
You create a module (a .py file, etc.)
ZIP up the module into a .zip file
Add the path to the .zip file to sys.path
Then import the module
Read on for the step-by-step process.
The answer is sampling, we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, it depends how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain a level of accuracy of data in your reports.
Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region, you’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.
Option 2: Take a random % sample
Now we’re talking. This option will take, for example 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?
Read on for a couple of sampling methods.
Michael Rys has a couple pieces of U-SQL syntax which will be deprecated. First is partition by bucket:
In the upcoming refresh, we are removing the deprecated syntax
PARTITION BY BUCKETand will raise an error.
Thus, if you have not yet updated your table definitions with the previously announced new syntax, please do so now or your scripts will fail starting some day in February!
Back in October, we announced that we simplified the U-SQL Credentials by merging the password secrets that are being created in Powershell and the other parts of the credential object into credentials that are being completely created with a Powershell command. This reduces one statement from the creation process.
During the initial phase, we did provide support for both kinds of credential objects, and still supported the old syntax.
In the upcoming February refresh, we are now automatically migrating the existing old credentials into the new format and remove the
If you’re writing U-SQL code, you’ll want to read up on the ramifications and alternatives here.
In USQL there are built-in extractors for parsing text, comma delimited or tab delimined files. Once again, parsing JSON becomes problematic. There is a solution built into USQL, write some C# code to extend it or use someone else’s C# code to extend USQL. Since I wanted to parse JSON, fortunately there are libraries available on github containing the information required to do it. Download the github package and open up the Microsoft.Analytics.Samples project in Visual Studio. When I did this the first time, there was a problem loading the Newtonsoft.Json reference, so I right clicked on the references and downloaded the missing parts again. Build the solution and check out the code in the directory …Examples\DataFormats\Microsoft.Analytics.Samples.Formats\bin\Debug\ . There will be two DLLs, Microsoft.Analytics.Samples.Formats.dll and Newtonsoft.Json.dll. These dlls then need to be registered in Data Lake Analytics and locally if you chose to run your USQL locally. As at some point the goal is to run from within Data Lake analytics, you will need to copy both of these dlls to the data lake. I created a folder for the dlls called Assemblies, and ran this command
It’s funny how often that library comes up… Click through to see how to use it with U-SQL jobs.
There are a few steps required before any code is run. If the Data Lake Analytics Tools are not installed within Visual Studio, download themhere and install them. When the tools are installed, the menu item Data Lake appears in Visual studio. The second step is to model your PC with the same file structure as your data lake. The default location which the Data Lake tools will look for your data structure is C:\Users\<<insertyourname>>\AppData\Local\USQLDataRoot . What this means is if you have folders and subfolders created in your data lake, your PC needs to have the same structure, including the data.
There is also a way to test these jobs locally before you spend that Azure money spinning up Data Lake jobs.