Now let’s go to the construction of the sample application. In the example, we will first send the data from our Linux file system to the data storage unit of the Hadoop ecosystem (HDFS) (for example, Extraction). Then we will read the data we have written here with Spark and then we will apply a simple Transformation and write to Hive (Load). Hive is a substructure that allows us to query the data in the hadoop ecosystem, which is stored in this environment. With this infrastructure, we can easily query the data in our big data environment using SQL language.
Most of the things relational database professionals do are pretty much the same things that you do with Spark and Hive. There are differences in implementation and level of programming familiarity, but they’re pretty similar.
A work table is a table defined in a nearby data location; either a schema in the source or target database or in a database on the same instance. I take a constraint-driven approach to work table location selection. Closer – a schema in the same database – is often better for performance.
I write this knowing some folks will frown at the suggestion of polluting a data source or target database with additional schemas and tables. Best practices exist for a reason. It’s helpful to maintain a list of best practices and to include in this list the reasons each practice exists. This could be a case where violating one or more best practices is justified.
Andy throws out a few ideas as alternatives but states his preference for using work tables to solve this problem.
The Write-DbaDbTableData cmdlet is pretty neat because it can create automatically the destination table if it doesn’t exists, truncate the table if it exists (or append, your choice), keep the identity values and nulls if necessary and everything is done via a bulk insert with a configurable batch size.
Click through for the script. It’s not a replacement for a real ETL process but if you just need something fast, it will do the job.
File management may not be at the top of my list of priorities during data integration projects. I assume that once I learn enough about sourcing data systems and target destination platform, I’m ready to design and build a data integration solution between two or more connecting points. Then, a historical file management process becomes a necessity or a need to log and remove some of the incorrectly loaded data files. Basically, a step in my data integration process to remove (or clean) such files would be helpful.
Click through to see how to do this.
Obviously, you have to have the module installed, and a copy of AdventureWorksDW2017 db restored to a SQL Server. After that, all you have to do is loop through the tables, ‘query’ them with the Read-SqlTableData cmdlet, and pipe the results to the Export-Excel cmdlet.
I did some trial and error with this yesterday. I settled on exporting all of the Dimension tables to separate Worksheets within the same Excel file, and exporting all of the Fact tables to their own individual files (since they tend to be much larger).
Click through for Aaron’s script.
1) Azure Data Factory and Service Bus can find common grounds to communicate with each other, and Azure Logic Apps could serve as a good mediator to establish this type of messaging communication.
2) As soon as messages land in a service bus queue, it’s now a responsibility of recipient side to obtain and process those message, which may be part of another blog post.
Click through for a demo of the process.
It is a common practice to load data to blob storage or data lake storage before loading to a database, especially if your data is coming from outside of Azure. We often create a staging area in our data lakes to hold data until it has been loaded to its next destination. Then we delete the data in the staging area once our subsequent load is successful. But before February 2019, there was no Delete activity. We had to write an Azure Function or use a Logic App called by a Web Activity in order to delete a file. I imagine every person who started working with Data Factory had to go and look this up.
But now Data Factory V2 has a Delete activity.
Meagan shows how it works, what kinds of parameters you can set, and a couple of gotchas, so check it out.
Azure Data Factory (ADF) offers a convenient cloud-based platform for orchestrating data from and to on-premise, on-cloud, and hybrid sources and destinations. But it is not a full Extract, Transform, and Load (ETL) tool. For those who are well-versed with SQL Server Integration Services (SSIS), ADF would be the Control Flow portion.
You can scale out your SSIS implementation in Azure. In fact, there are two (2) options to do this: SSIS On-Premise using the SSIS runtime hosted by SQL Server or On Azure using the Azure-SSIS Integration Runtime.
Azure Data Factory is not quite an ETL tool as SSIS is. There is that transformation gap that needs to be filled for ADF to become a true On-Cloud ETL Tool. The second iteration of ADF in V2 is closing the transformation gap with the introduction of Data Flow.
Despite it not being nearly as complete as SSIS, there are useful data transformations available in Azure Data Factory, as Marlon shows.
Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of Airflow; since it’s Python code there are infinite ways to define your pipelines. The Zen of Python is a list of 19 Python design principles and in this blog post I point out some of these principles on four Airflow examples. This blog was written with Airflow 1.10.2.
My favorite of the Zen of Python principles is a combination of two: “simple is better than complex; complex is better than complicated.” That’s something I don’t always get right, but it is critical for a stable architecture.
In a previous post, I gave an overview to regression tests. In this post, I will give a practical example of developing and performing regression tests with the Pester framework for PowerShell. The code for performing regression tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test regression scenarios. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. The structure of the metadata database will be exactly the same as laid out in the Integration Test post.
There’s a hefty test script here too, so check it out.