The main takeaway is that we continue the deprecation of items that we changed during the preview phase and introduce a lot of new capabilities, including PIVOT/UNPIVOT, more catalog sharing, and much more!
There’s a pretty hefty list of updates to check out.
The great thing about Biml is that I can use it as much or as little as I feel is helpful. That T-SQL statement to get column lists could have been Biml, but it didn’t have to be. The client can maintain and enhance these pipelines with or without Biml as they see fit. There is no vendor lock-in here. Just as with Biml-generated SSIS projects, there is no difference between a hand-written ADF solution and a Biml-generated ADF solution, other than that the Biml-generated solution is probably more consistent.
And have I mentioned the time savings? There is a reason why Varigence gives out shirts that say “It’s Monday and I’m done for the week.”
Click through for the script.
First, let’s talk about “zipimport”. Thanks to the adoption of PEP 273, Python has had the ability to import modules from ZIP files since Python 2.3. This ability is called “zipimport” and is a built-in feature of Python’s existing import statement. Read the zipimport documentation now.
To review the basics.
You create a module (a .py file, etc.)
ZIP up the module into a .zip file
Add the path to the .zip file to sys.path
Then import the module
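The four steps above can be sketched in a few lines of Python 3. This is a minimal illustration, not the linked author's script: the module name `zip_greeter`, its `greet` function, and the helper `import_from_zip` are all hypothetical names made up for the example.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def import_from_zip(module_name, source_code, work_dir):
    """Write a module, zip it, put the zip on sys.path, and import it."""
    # Step 1: create the module (a plain .py file).
    module_path = os.path.join(work_dir, module_name + ".py")
    with open(module_path, "w") as handle:
        handle.write(source_code)

    # Step 2: zip the module up, storing it at the root of the archive.
    zip_path = os.path.join(work_dir, module_name + "_bundle.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.write(module_path, arcname=module_name + ".py")

    # Remove the loose file so the import can only be satisfied by the zip.
    os.remove(module_path)

    # Step 3: add the path to the .zip file to sys.path.
    sys.path.insert(0, zip_path)
    importlib.invalidate_caches()

    # Step 4: import the module; zipimport handles the archive lookup.
    return importlib.import_module(module_name)


# Hypothetical module contents, used only for this demonstration.
SOURCE = "def greet(name):\n    return 'Hello, ' + name\n"

greeter = import_from_zip("zip_greeter", SOURCE, tempfile.mkdtemp())
print(greeter.greet("world"))
print(greeter.__file__)  # the path points inside the .zip archive
```

Nothing in the import statement itself changes; the only difference from a normal import is that the `sys.path` entry is an archive rather than a directory.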
Read on for the step-by-step process.
During the past few years though, end-to-end business use-cases have evolved to another level.
- The end-to-end business problems are now mostly solved by multiple applications working together.
- As the platform matured, users have increasingly wanted to focus solely on the business application layers, and are getting impatient to get on with developing their main business logic.
- However, YARN, and for that matter any other related platform, hasn’t catered to this evolving need, leaving users unwillingly involved in the painstaking details of wiring applications together, keeping them up, manually scaling them as the need arises, etc.
Manual plumbing of all these different colored services is tiresome! Further, there is a clear need for seamless aggregate deployment, lifecycle management and application wireup. This is the gap that needs to be bridged between what these end-to-end business use-cases need from the platform and what the platform offers today. If these features are provided, then the business use-case authors can focus singularly on the business logic.
This is a higher-level “where are we at?” kind of post which could be helpful if you’re new to the data lake concept.
Meagan Longoria has a multi-part series on using Biml to script Azure Data Factory tasks to migrate data from an on-prem SQL Server instance to Azure Data Lake Store. Here’s part 1:
My Azure Data Factory is made up of the following components:
Gateway – allows ADF to retrieve data from an on-premises data source
Linked Services – define the connection string and other connection properties for each source and destination
Datasets – define a pointer to the data you want to process, sometimes defining the schema of the input and output data
Pipelines – combine the data sets and activities and define an execution schedule
Click through for the Biml.
So to give a concrete example, if the default file system was
/user/filename.txt would resolve to
Why does the default file system matter? The first answer to this is purely convenience. It is a heck of a lot easier to simply say
adl://amitadls.azuredatalakestore.net/ in code and configurations. Secondly, many components in Hadoop use relative paths by default. For instance, there is a fixed set of places, specified by relative paths, where various applications generate their log files. Finally, many ISV applications running on Hadoop specify important locations by relative paths.
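As a rough illustration of what a default file system does, here is a small Python sketch of the idea. It is a simplified model and not Hadoop's actual resolution code: the function `resolve_path` and the account name in `DEFAULT_FS` are hypothetical, and the sketch only handles scheme-less paths that start at the root.

```python
def resolve_path(default_fs, path):
    """Qualify a scheme-less absolute path against a default file system URI."""
    # A path that already names its file system is left untouched.
    if "://" in path:
        return path
    # Hadoop resolves truly relative paths (no leading slash) against the
    # user's working directory; this sketch does not model that case.
    if not path.startswith("/"):
        raise ValueError("only absolute, scheme-less paths are handled here")
    # Join the default file system and the path with exactly one slash.
    return default_fs.rstrip("/") + path


# Hypothetical default file system, for illustration only.
DEFAULT_FS = "adl://example.azuredatalakestore.net/"

print(resolve_path(DEFAULT_FS, "/logs/app.log"))
print(resolve_path(DEFAULT_FS, "wasb://other@example.blob.core.windows.net/x"))
```

The point of the sketch is the convenience argument: code and configuration can carry the short form, and the cluster-wide default supplies the rest.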
Read on to see how.
Most common patterns using Azure Data Lake Store (ADLS) involve customers ingesting and storing raw data into ADLS. This data is then cooked and prepared by analytic workloads like Azure Data Lake Analytics and HDInsight. Once cooked, this data is then explored using engines like Azure SQL Data Warehouse. One key pain point for customers is having to wait for a substantial time after the data was cooked to be able to explore it and gather insights. This was because the data stored in ADLS would have to be loaded into SQL Data Warehouse using row-by-row insertion. But now, you don’t have to wait that long anymore. With the new SQL Data Warehouse PolyBase support for ADLS, you will now be able to load and access the cooked data rapidly and lessen your time to start performing interactive analytics. PolyBase support will allow you to access unstructured/semi-structured files in ADLS faster because of a highly scalable loading design. You can load the files stored in ADLS into SQL Data Warehouse to perform analytics with fast response times, or you can use the files in ADLS as external tables. So get ready to unlock the value stored in your petabytes of data stored in ADLS.
I’ve been waiting for this support, and I’m happy that they were able to integrate the two products.
Late last year, I presented a Cognitive Intelligence demo using Azure Data Lake (ADL) at the PASS Summit keynote. It was a fun and quick demo! Watch it here
In case you’re new to ADL, you can now (since Dec 2015) develop, compile and run ADL locally in Visual Studio. This is huge, because you don’t have to worry about your ADL Analytics Unit (AU) consumption. Plus, this allows you to try it before you buy it too!
Click through for the step-by-step installation instructions.
The answer is sampling: we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, depending on how much you need to reduce your dataset. It is, however, critical to know how to sample data correctly in order to maintain a level of accuracy of data in your reports.
Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region? You’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.
Option 2: Take a random % sample
Now we’re talking. This option will take, for example, 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?
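To see the difference between the two options, here is a small Python sketch. It is not one of the methods from the linked post: it uses a simple per-row random draw, so the sample is approximately (not exactly) the requested percentage, and the dataset, function names and seed are all made up for the illustration.

```python
import random
import string


def take_top_rows(rows, n):
    """Option 1: take the first n rows as they arrive (biased if pre-sorted)."""
    return rows[:n]


def take_random_percent(rows, percent, seed=None):
    """Option 2: keep each row independently with probability percent/100."""
    rng = random.Random(seed)
    keep_probability = percent / 100.0
    return [row for row in rows if rng.random() < keep_probability]


# Hypothetical dataset: 26 products ("a" to "z"), 1,000 rows each,
# pre-sorted by product name.
rows = [(product, i) for product in string.ascii_lowercase for i in range(1000)]

top_sample = take_top_rows(rows, 260)
random_sample = take_random_percent(rows, 1, seed=42)

# Which products does each sample cover?
print(sorted({product for product, _ in top_sample}))     # ['a']
print(sorted({product for product, _ in random_sample}))
```

The top-rows sample never gets past the first product, while the random sample draws from across the whole sorted dataset.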
Read on for a couple of sampling methods.
Today, early adopters of Amazon Athena are using it for big data analytics pipeline projects along with Kinesis streaming data and other Amazon data sources.
Athena is a serverless, parallel, pay-per-use query service. There is no infrastructure to set up or manage. It scales automatically and can handle large datasets or complex distributed queries.
The easy way of thinking about Athena is that it’s Elastic MapReduce (a pay-as-you-go Hadoop cluster) without the ceremony of administering or spinning up the cluster.