First, let’s talk about “zipimport”. Thanks to the adoption of PEP 273 – Python had the ability to import modules from ZIP files since Python 2.3. This ability is called “zipimport” and is a built-in feature of the Python’s existing import statement. Read the zipimport documentation now.
To review the basics.
You create a module (a .py file, etc.)
ZIP up the module into a .zip file
Add the path to the .zip file to sys.path
Then import the module
Read on for the step-by-step process.
The answer is sampling, we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, it depends how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain a level of accuracy of data in your reports.
Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region, you’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.
Option 2: Take a random % sample
Now we’re talking. This option will take, for example 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?
Read on for a couple of sampling methods.
Michael Rys has a couple pieces of U-SQL syntax which will be deprecated. First is partition by bucket:
In the upcoming refresh, we are removing the deprecated syntax
PARTITION BY BUCKETand will raise an error.
Thus, if you have not yet updated your table definitions with the previously announced new syntax, please do so now or your scripts will fail starting some day in February!
Back in October, we announced that we simplified the U-SQL Credentials by merging the password secrets that are being created in Powershell and the other parts of the credential object into credentials that are being completely created with a Powershell command. This reduces one statement from the creation process.
During the initial phase, we did provide support for both kinds of credential objects, and still supported the old syntax.
In the upcoming February refresh, we are now automatically migrating the existing old credentials into the new format and remove the
If you’re writing U-SQL code, you’ll want to read up on the ramifications and alternatives here.
In USQL there are built-in extractors for parsing text, comma delimited or tab delimined files. Once again, parsing JSON becomes problematic. There is a solution built into USQL, write some C# code to extend it or use someone else’s C# code to extend USQL. Since I wanted to parse JSON, fortunately there are libraries available on github containing the information required to do it. Download the github package and open up the Microsoft.Analytics.Samples project in Visual Studio. When I did this the first time, there was a problem loading the Newtonsoft.Json reference, so I right clicked on the references and downloaded the missing parts again. Build the solution and check out the code in the directory …Examples\DataFormats\Microsoft.Analytics.Samples.Formats\bin\Debug\ . There will be two DLLs, Microsoft.Analytics.Samples.Formats.dll and Newtonsoft.Json.dll. These dlls then need to be registered in Data Lake Analytics and locally if you chose to run your USQL locally. As at some point the goal is to run from within Data Lake analytics, you will need to copy both of these dlls to the data lake. I created a folder for the dlls called Assemblies, and ran this command
It’s funny how often that library comes up… Click through to see how to use it with U-SQL jobs.
There are a few steps required before any code is run. If the Data Lake Analytics Tools are not installed within Visual Studio, download themhere and install them. When the tools are installed, the menu item Data Lake appears in Visual studio. The second step is to model your PC with the same file structure as your data lake. The default location which the Data Lake tools will look for your data structure is C:\Users\<<insertyourname>>\AppData\Local\USQLDataRoot . What this means is if you have folders and subfolders created in your data lake, your PC needs to have the same structure, including the data.
There is also a way to test these jobs locally before you spend that Azure money spinning up Data Lake jobs.
In my previous series on Stream Analytics, I wrote some U-SQL. That U-SQL didn’t look much different than Ansi-SQL, which is sort of the point of porting the functionality to a different yet familiar language. Another application which heavily uses U-SQL is Azure Data Lake. Data Lake stores its data in HDInsight, but you don’t need to write hive to query the data, as U-SQL will do it. Like Hive, U-SQL can be used to create a schema on top of some data, and then query it.
For example, to write a query on this csv file stored in a Data Lake, I would need to create the data definition for the data, then I could easily write a statement to query it.
I’m interested in seeing how much adoption we see in this language.
While the ADL Tools in VisualStudio make it easy to register an assembly, you can also do it with a script (in the same way that the tools do it for you) if you are for example developing on a different platform, have already compiled assemblies that you just want to upload and register. You basically follow the following steps:
You upload your assembly dll and all additionally required non-system dlls and resource files into a location of your choosing in your Azure Data Lake Storage account or even a Windows Azure Blob Store account that is linked to your Azure Data Lake account. You can use any of the many upload tools available to you (e.g., Powershell commands, VisualStudio’s ADL Tool Data Lake Explorer upload, your favorite SDK’s upload command or through the Azure Portal).
- Once you have uploaded the dlls, you use the
CREATE ASSEMBLYstatements to register them.
We will use this approach in the spatial example below.
There’s quite a bit going on in this post, making it an interesting read.
We have published C# libraries that supply UDOs and UDFs for processing images with U-SQL in our GitHub site. In this section, we introduce these UDOs and UDFs and, in the next section, we use them within a U-SQL walkthrough to operate on images.
The basic flow behind processing images in U-SQL has three stages:
Use the custom UDO extractor ImageExtractor to read a (JPEG or non-JPEG) image file and return the image data as a byte column value which contains the same exact image as the file in an (always) JPEG representation. Please note that there is a current limitation in U-SQL that a row cannot exceed a size of 4 MB, so you will run into issues if your image size is greater than 4 MB.
Use the image processing UDFs to manipulate this byte (the UDFs support JPEG and non-JPEG representations within this byte despite the previous step always producing a JPEG representation). For example, one UDF extracts metadata from an image to produce textual or numeric data. More interesting UDFs derive an output image from an input image; that output represents the visually transformed input (e.g. rotated or scaled/resized), also stored as a byte containing an (always) JPEG representation of the output.
Use the custom UDO outputter ImageOutputter to writes each byte to a JPEG image file so that we can view the output images of the aforementioned UDFs.
The major value proposition to me for U-SQL is “doing stuff SQL can’t do very well.” This is one of those cases.
As part of the Azure Data Lake Analytics and U-SQL August 1st refresh, we released a couple of new, long awaited features, are fixing some confusing syntax and fixed several bugs
Read on for more details.
As part of the Azure Data Lake Analytics and U-SQL July refresh released earlier this month, we released a couple of new features, are deprecating some confusing syntax and fixed several bugs:
Michael also links to the release notes in Github.