First things first: let’s get on the same page with what is meant by business logic. When I refer to business logic (also commonly referred to as business rules), I’m talking about the processing rules that are used to transform an organization’s data so that it is accurate, understandable, and usable. In almost all cases, these business rules are not designed to change the meaning of the data, but to clarify and make it easier to comprehend. Business logic may be applied when data arrives from other sources, or to existing data to reflect changes that have taken place.
Business logic is usually highly customized for and by each organization. The amount of processing required is heavily dependent on factors such as source data quality, reporting granularity, technical skill level of the intended audience, and even company culture.
It’s worth reading the whole thing.
ADF V2 natively supports decompression of files as documented at https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support. With this functionality ADF should change the extension of the file when it is decompressed so 1234_567.csv.gz would become 1234_567.csv however, I’ve noticed that this doesn’t happen in all cases.
In our particular case the file names and extensions of the source files are all uppercase and when ADF uploads them it doesn’t alter the file extension e.g. if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.
Click through for more details and be sure to vote on his Azure Feedback bug if this affects you.
To get some of this data usable for reporting we’re importing it into Azure SQL Database so people can start working their way through it, and we can fix up errors before we push it through into Azure Data Lake for mining. Being a fan of dbatools it was my first port of call for automating something like this.
Just to make life interesting, I want to add a time of creation field to the data to make tracking trends easier. As this information doesn’t actually exist in the CSV columns, I’m going to use
LastWriteTimeas a proxy for the creationtime.
Click through for the script.
ETL arose to solve a problem of providing businesses with clean and ready-to-analyze data. We remove dirty and irrelevant data and transform, enrich, and reshape the rest. The example of this could be sessionization: the process of creating sessions out of raw pageviews and users’ events.
ETL is complicated, especially the transformation part. It requires at least several months for a small-sized (less than 500 employees) company to get up and running. Once you have the initial transform jobs implemented, never-ending changes and updates will begin because data always evolves with business.
The other problem of ETL is that during the transformation, we reshape data into some specific form. This form usually lacks some data’s resolution and does not include data that is useless for that time or for that particular task. Often, “useless” data becomes “useful.” For example, if business users request daily data instead of weekly, then you will have to fix your transformation process, reshape data, and reload it. That would take a few weeks more.
Read on for more, including his argument for why ELT is better.
In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premise version of the integration runtime that is able to perform copy activities to and from on-premise data stores.
When we configure a self-hosted integration runtime the data factory service, that sits in Azure, will orchestrate the nodes that make up the integration runtime through the use of Azure Service Bus meaning our nodes that are hosted on-prem are performing all of our data movement and connecting to our on-premises data sources while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit as ADF will use all of the available nodes to perform processing.
Read on for the scripts and full process.
The shortage of data scientists – those triple-threat types who possess advanced statistics, business, and coding skills – has been well-documented over the years. But increasingly, businesses are facing a shortage of another key individual on the big data team who’s critical to achieving success – the data engineer.
Data engineers are experts in designing, building, and maintaining the data-based systems in support of an organization’s analytical and transactional operations. While they don’t boast the quantitative skills that a data scientist would use to, say, build a complex machine learning model, data engineers do much of the other work required to support that data science workload, such as:
Building data pipelines to collect data and move it into storage;
Preparing the data as part of an ETL or ELT process;
Stitching the data together with scripting languages;
Working with the DBA to construct data stores;
Ensuring the data is ready for use;
Using frameworks and microservices to serve data.
Read the whole thing. My experience is that most shops looking to hire a data scientist really need to get data engineers first; otherwise, you’re wasting that high-priced data scientist’s time. The plus side is that if you’re already a database developer, getting into data engineering is much easier than mastering statistics or neural networks.
While this seems like a lot of parts just to copy a few files, it’s important to note I only scratched the surface of what ADF can do. Think of ADF as an airline company that manages and enables cargo (data) movement between cities (data sources). A pipeline represents the overall goal of moving certain cargo from one city to another. The linked service is the airport, which provides a landing point and access control for the cities. The dataset is the list of cargo to move, the activity is the flight itself, while the integration runtime is the airport infrastructure that makes the cargo movement possible. A single pipeline requires all these objects to run successfully; however, many pipelines can use these same objects to complete different tasks. Once you’ve created these data factory objects, it is straightforward to layer on additional functionality or more pipelines. ADF also has visual tools that make building these objects a breeze – to build my pipeline, all I had to do was click on “Copy data” in the visual tools start menu and follow the steps provided.
Matt has a video demonstrating the process as well.
This is a quick post to share a few scripts to find what is currently executing in Azure Data Factory. These PowerShell scripts are applicable to ADF version 1 (not version 2 which uses different cmdlets).
Prerequisite: In addition to having installed the Azure Resource Manager modules, you’ll have to register the provider for Azure Data Factory:#One-time registration of the ADF provider #Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory
Click through for the Powershell snippets.
The data within the Data Lake store was organised into a Year and Month hierarchy for the folders, and each days transactions were stored in a file which was named after the day within the relevant month folder. The task then was to create a pipeline which copies the dataset in the Data Lake Store over to the dbo.Orders table in Azure SQL DB every day within the scheduled period (Q1 2016).
After creating all the json scripts and deploying them (with no errors), I clicked on the ‘Monitor and Manage’ tile to monitor the activities, check everything was working as it should be and monitor the progress. After waiting for at least 10 minutes, I started to get frustrated.
Click through for the fix and an explanation.
For Hadoop developer, the actual game starts after the data is being loaded in HDFS. They play around this data in order to gain various insights hidden in the data stored in HDFS.
So, for this analysis the data residing in the relational database management systems need to be transferred to HDFS. The task of writing MapReduce code for importing and exporting data from relational database to HDFS is uninteresting & tedious. This is where Apache Sqoop comes to rescue and removes their pain. It automates the process of importing & exporting the data.
Sqoop makes the life of developers easy by providing CLI for importing and exporting data. They just have to provide basic information like database authentication, source, destination, operations etc. It takes care of remaining part.
Sqoop internally converts the command into MapReduce tasks, which are then executed over HDFS. It uses YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
In my experience, Sqoop does two things really well: first, it lets you move data from a relational database into HDFS (or Hive). Second, it lets you move data from HDFS (or Hive) into a staging table on a relational database. That can make Sqoop a useful part of an ETL process.