First, you will need to create an Azure Data Factory SSIS Integration Runtime. If you don’t know how, that’s ok – I’ve written a post titled Lift and Shift SSIS Part 0: Creating the ADF Integration Runtime that describes one way to set up ADFIR.
Read on for an example.
I observed one thing here with BCP (Bulk Copy Program), when you have 2 versions of SQL Server installed on you PC or Server. I had SQL Server 2014 & 2016 installed on one of my DEV server.
So if you are executing Query from SQL 2016 instance, it was inserting records in SQL 2014 instance:
exec master..xp_cmdshell ‘BCP AdventureWorks2014.Person.Address2 IN d:\PersonAddressByQuery.txt -T -c’
But even if you use BCP 2016 version, it was still inserting in SQL 2014 instance:
Read on for the reason as well as how to specify which instance you want to use.
Azure Data Factory v2 (ADFv2) has some significant improvements over v1, and we now consider ADF as a viable platform for most of our cloud based projects. But things aren’t always as straightforward as they could be. I’m sure this will improve over time, but don’t let that stop you from getting started now.
This post provides a walk through of using the ‘Lookup’ and ‘If Condition’ activities to do some basic conditional logic depending on the results of a database query.
Assumptions: You already have an ADF pipeline created. If you want to hook into SSIS then you’ll also need the SSIS Integration Runtime set up – although this is not relevant just for the if condition.
Read on for an example.
2. Volume of data moved – this is measured in DMUs (data movement units). This is one you should be aware of as this will default to auto, which is basically using all the DMUs it can use and this is paid for by the hour. Let’s say you specify and use 2 DMUs and it takes an hour to move that data. The other option is you could use 8 DMUs and it takes 15 minutes, this price is going to end up the same. You’re using 4X the DMUs but it’s happening in a quarter of the time.
This is good to look at and do some comparisons since how many DMUs you’re using is where the bulk of your spend if going to be.
There are a few moving parts here, so the calculation is not trivial. But Chris makes good sense of it all.
First things first: let’s get on the same page with what is meant by business logic. When I refer to business logic (also commonly referred to as business rules), I’m talking about the processing rules that are used to transform an organization’s data so that it is accurate, understandable, and usable. In almost all cases, these business rules are not designed to change the meaning of the data, but to clarify and make it easier to comprehend. Business logic may be applied when data arrives from other sources, or to existing data to reflect changes that have taken place.
Business logic is usually highly customized for and by each organization. The amount of processing required is heavily dependent on factors such as source data quality, reporting granularity, technical skill level of the intended audience, and even company culture.
It’s worth reading the whole thing.
ADF V2 natively supports decompression of files as documented at https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support. With this functionality ADF should change the extension of the file when it is decompressed so 1234_567.csv.gz would become 1234_567.csv however, I’ve noticed that this doesn’t happen in all cases.
In our particular case the file names and extensions of the source files are all uppercase and when ADF uploads them it doesn’t alter the file extension e.g. if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.
Click through for more details and be sure to vote on his Azure Feedback bug if this affects you.
To get some of this data usable for reporting we’re importing it into Azure SQL Database so people can start working their way through it, and we can fix up errors before we push it through into Azure Data Lake for mining. Being a fan of dbatools it was my first port of call for automating something like this.
Just to make life interesting, I want to add a time of creation field to the data to make tracking trends easier. As this information doesn’t actually exist in the CSV columns, I’m going to use
LastWriteTimeas a proxy for the creationtime.
Click through for the script.
ETL arose to solve a problem of providing businesses with clean and ready-to-analyze data. We remove dirty and irrelevant data and transform, enrich, and reshape the rest. The example of this could be sessionization: the process of creating sessions out of raw pageviews and users’ events.
ETL is complicated, especially the transformation part. It requires at least several months for a small-sized (less than 500 employees) company to get up and running. Once you have the initial transform jobs implemented, never-ending changes and updates will begin because data always evolves with business.
The other problem of ETL is that during the transformation, we reshape data into some specific form. This form usually lacks some data’s resolution and does not include data that is useless for that time or for that particular task. Often, “useless” data becomes “useful.” For example, if business users request daily data instead of weekly, then you will have to fix your transformation process, reshape data, and reload it. That would take a few weeks more.
Read on for more, including his argument for why ELT is better.
In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premise version of the integration runtime that is able to perform copy activities to and from on-premise data stores.
When we configure a self-hosted integration runtime the data factory service, that sits in Azure, will orchestrate the nodes that make up the integration runtime through the use of Azure Service Bus meaning our nodes that are hosted on-prem are performing all of our data movement and connecting to our on-premises data sources while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit as ADF will use all of the available nodes to perform processing.
Read on for the scripts and full process.
The shortage of data scientists – those triple-threat types who possess advanced statistics, business, and coding skills – has been well-documented over the years. But increasingly, businesses are facing a shortage of another key individual on the big data team who’s critical to achieving success – the data engineer.
Data engineers are experts in designing, building, and maintaining the data-based systems in support of an organization’s analytical and transactional operations. While they don’t boast the quantitative skills that a data scientist would use to, say, build a complex machine learning model, data engineers do much of the other work required to support that data science workload, such as:
Building data pipelines to collect data and move it into storage;
Preparing the data as part of an ETL or ELT process;
Stitching the data together with scripting languages;
Working with the DBA to construct data stores;
Ensuring the data is ready for use;
Using frameworks and microservices to serve data.
Read the whole thing. My experience is that most shops looking to hire a data scientist really need to get data engineers first; otherwise, you’re wasting that high-priced data scientist’s time. The plus side is that if you’re already a database developer, getting into data engineering is much easier than mastering statistics or neural networks.