Press "Enter" to skip to content

Category: ETL / ELT

Integrating Kafka Into A Data Scientist’s Workflow

Liz Bennett from Stitch Fix has a guest post on the Confluent blog:

Our main requirement for this new project was to build infrastructure that would be 100 percent self-service for our Data Scientists. In other words, my teammates and I would never be directly involved in the discovery, creation, configuration and management of the event data. Self-service would fix the primary shortcoming of our legacy event delivery system: manual administration that was performed by my team whenever a new dataset was born. This manual process hindered the productivity and access to event data for our Data Scientists. Meanwhile, fulfilling the requests of the Data Scientists hindered our own ability to improve the infrastructure. This scenario is exactly what the Data Platform Team strives to avoid. Building self-service tooling is the number one tenet of the Data Platform Team at Stitch Fix, so whatever we built to replace the old event infrastructure needed to be self-service for our Data Scientists. You can learn more about our philosophy in Jeff Magnusson’s post Engineers Shouldn’t Write ETL.

This is an architectural overview and a good read.
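
For readers who have never touched Kafka, the plumbing underneath a self-service event pipeline is just producers writing to topics and consumers reading from them. Here's a minimal sketch with the kafka-python client, assuming a broker at localhost:9092 and a hypothetical topic name; the Stitch Fix post is really about everything wrapped around this layer (discovery, configuration, delivery) so Data Scientists never have to write even this much.

    # Minimal producer/consumer sketch using kafka-python.
    # Broker address, topic name, and event shape are hypothetical placeholders.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )
    producer.send("clickstream-events", {"user_id": 42, "action": "item_viewed"})
    producer.flush()

    consumer = KafkaConsumer(
        "clickstream-events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)  # downstream code would land this in the warehouse
        break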

Dataflows In Power BI

James Serra gives us a preview of Power BI Dataflows:

In short, Dataflows integrate data lake and ETL technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of the Power BI service, not just Power BI Desktop, and is called Power Query Online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep). Dataflows include a standard schema, called the Common Data Model (CDM), that contains the most common business entities across the major functions such as marketing, sales, service, and finance, along with connectors that ingest data from the most common sources into these schemas. This greatly simplifies modeling and integration challenges (it prevents multiple metadata definitions of the same data). You can also extend the CDM by creating custom entities. Lastly – Microsoft and their partners will be shipping out-of-the-box applications that run on Power BI that populate data in the Common Data Model and deliver insights through Power BI.

A dataflow is not just the data itself, but also logic on how the data is manipulated. Dataflows belong to the Data Warehouse/Mart/Lake family. Their main job is to aggregate, cleanse, transform, integrate and harmonize data from a large and growing set of supported on-premises and cloud-based data sources, including Dynamics 365, Salesforce, Azure SQL Database, Excel, and SharePoint. Dataflows hold a collection of data lake-stored entities (i.e., tables), which are stored in internal Power BI Common Data Model-compliant folders in Azure Data Lake Storage Gen2.

Also check out the comments for some clarification on why you’d want to use Dataflows rather than doing the work directly in the data lake.
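
If you want to poke at what a dataflow actually leaves behind in the lake, the entities land as CSV partitions described by a model.json metadata file in the CDM folder. Here's a rough sketch of reading that metadata with the Azure Data Lake Storage Gen2 SDK for Python; the account, filesystem, and folder path are hypothetical, and the model.json layout is simplified.

    # Sketch: list the entities a Power BI dataflow wrote to its CDM folder.
    # Account URL, filesystem, and folder path are hypothetical placeholders.
    import json
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://mydatalake.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    filesystem = service.get_file_system_client("powerbi")

    # Each dataflow folder contains model.json plus the CSV partitions it describes.
    model_bytes = (
        filesystem.get_file_client("MyWorkspace/MyDataflow/model.json")
        .download_file()
        .readall()
    )
    model = json.loads(model_bytes)

    for entity in model.get("entities", []):
        attributes = [a["name"] for a in entity.get("attributes", [])]
        print(entity["name"], attributes)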

Azure Data Factory V2 Now In General Availability

Chris Seferlis covers some of the improvements in Azure Data Factory V2:

With ADF V2 you get a browser-based interface using drag and drop technology; V1 was primarily done in the Visual Studio IDE. It also added triggers for scheduling, so you can schedule your jobs when required and in additional ways (which I’ll discuss further in a bit).

Some other features of ADF V2 that came out as it became generally available:

  • Lift and Shift operations for your SSIS packages, so if you have SSIS packages locally, you can now lift and shift those into compute with the Integration Runtime service in Data Factory.

  • This also allows for cloud-to-cloud, cloud-to-prem, and prem-to-prem movement, and some third-party tools are supported within that as well.

  • Control flow activities like link branching, looping, conditional execution and parameterization.

  • Integration with HDInsight Spark and Databricks for big data workloads and data science.

There’s more where that came from, too.
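
As a concrete taste of the trigger-based scheduling Chris mentions, here is a rough sketch using the azure-mgmt-datafactory Python SDK to attach an hourly schedule trigger to an existing pipeline. The subscription, resource group, factory, and pipeline names are hypothetical, and model names can shift between SDK versions, so treat this as a shape rather than copy-paste.

    # Sketch: schedule an existing ADF V2 pipeline to run hourly via a trigger.
    # Subscription, resource group, factory, and pipeline names are placeholders.
    from datetime import datetime, timedelta
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineReference,
        ScheduleTrigger,
        ScheduleTriggerRecurrence,
        TriggerPipelineReference,
        TriggerResource,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    recurrence = ScheduleTriggerRecurrence(
        frequency="Hour",
        interval=1,
        start_time=datetime.utcnow() + timedelta(minutes=5),
        time_zone="UTC",
    )
    trigger = TriggerResource(
        properties=ScheduleTrigger(
            recurrence=recurrence,
            pipelines=[
                TriggerPipelineReference(
                    pipeline_reference=PipelineReference(reference_name="MyPipeline"),
                    parameters={},
                )
            ],
        )
    )
    adf.triggers.create_or_update(
        "my-resource-group", "my-data-factory", "HourlyTrigger", trigger
    )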

Data Engineering Remains As Important As Ever

Prashanth Southekal has good news for ETL developers:

While many companies have embarked on data analytics initiatives, only a few have been successful. Studies have shown that over 70% of data analytics programs fail to realize their full potential and over 80% of digital transformation initiatives fail. While there are many reasons that affect successful deployment of data analytics, one fundamental reason is a lack of good-quality data. However, many business enterprises realize this and invest considerable time and effort in data cleansing and remediation, technically known as data engineering. It is estimated that about 60 to 70% of the effort in data analytics is spent on data engineering. Given that data quality is an essential requirement for analytics, there are 5 key reasons why data analytics is heavy on data engineering.

1. Different systems and technology mechanisms to integrate data.

Business systems are designed and implemented for a purpose, mainly recording business transactions. The mechanism for data capture in business systems such as ERP is batch/discrete data, while in SCADA/IoT field systems it is continuous/time-series data. This means that these business systems store diverse data types driven by the velocity, volume, and variety dimensions of the data. Hence the technology (including the database itself) to capture data is varied and complex. And when you are trying to integrate data from these diverse systems from different vendors, the metadata model varies, resulting in data integration challenges.

That 60-70% on data engineering is probably a moderate underestimate.
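
As a toy illustration of point 1: the reason batch/discrete and continuous/time-series sources are painful to combine is that they disagree on grain, so someone has to pick a grain and conform both sides to it. A hedged pandas sketch with made-up column names and values:

    # Sketch: conform a continuous sensor feed to the daily grain of an ERP extract.
    # Column names and values are invented for illustration.
    import pandas as pd

    erp_orders = pd.DataFrame(
        {"order_date": pd.to_datetime(["2018-11-01", "2018-11-02"]), "units": [120, 95]}
    )

    sensor = pd.DataFrame(
        {"ts": pd.date_range("2018-11-01", periods=48, freq="H"), "temperature_c": 20.0}
    )

    # Resample the time series down to the batch grain (one row per day) before joining.
    daily_sensor = (
        sensor.set_index("ts")
        .resample("D")["temperature_c"]
        .mean()
        .reset_index()
        .rename(columns={"ts": "order_date"})
    )

    combined = erp_orders.merge(daily_sensor, on="order_date", how="left")
    print(combined)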

Executing SSIS From Azure Data Factory

Andy Leonard shows us how to execute an SSIS package from Azure Data Factory:

The good people who work on Azure Data Factory recently added an Execute SSIS Package activity. It’s pretty cool. Let’s tinker with it some, shall we?

First, you will need to create an Azure Data Factory SSIS Integration Runtime. If you don’t know how, that’s ok – I’ve written a post titled Lift and Shift SSIS Part 0: Creating the ADF Integration Runtime that describes one way to set up ADFIR.

Read on for an example.
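
Andy walks through the designer experience; if you prefer to define the same thing in code, the azure-mgmt-datafactory Python SDK exposes the activity as a model. The sketch below assumes an Azure-SSIS Integration Runtime already exists (Andy's Part 0 covers that), uses hypothetical names throughout, and model or parameter names may differ between SDK versions.

    # Sketch: a pipeline whose only activity runs an SSIS package on an
    # existing Azure-SSIS Integration Runtime. All names are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        ExecuteSSISPackageActivity,
        IntegrationRuntimeReference,
        PipelineResource,
        SSISPackageLocation,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    run_package = ExecuteSSISPackageActivity(
        name="RunLoadDimCustomer",
        package_location=SSISPackageLocation(
            package_path="MyFolder/MyProject/LoadDimCustomer.dtsx"
        ),
        connect_via=IntegrationRuntimeReference(reference_name="MySsisIR"),
    )

    pipeline = PipelineResource(activities=[run_package])
    adf.pipelines.create_or_update(
        "my-resource-group", "my-data-factory", "ExecuteSsisDemo", pipeline
    )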

BCP And Multiple SQL Server Instances

Manoj Pandey investigates an interesting issue with BCP:

I observed one thing here with BCP (Bulk Copy Program) when you have two versions of SQL Server installed on your PC or server. I had SQL Server 2014 and 2016 installed on one of my DEV servers.
So even though I was executing the query from the SQL 2016 instance, it was inserting records into the SQL 2014 instance:

exec master..xp_cmdshell 'BCP AdventureWorks2014.Person.Address2 IN d:\PersonAddressByQuery.txt -T -c'

But even if you use the BCP 2016 version, it was still inserting into the SQL 2014 instance.

Read on for the reason as well as how to specify which instance you want to use.
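
Not to spoil Manoj's post, but the usual culprit in this situation is that bcp connects to the default instance unless you pass the -S switch with server\instance. A small sketch of calling bcp from Python with an explicit instance name; the server, instance, and file path are placeholders.

    # Sketch: call bcp with an explicit -S server\instance so the load goes
    # to the intended SQL Server instance. Server name and path are placeholders.
    import subprocess

    result = subprocess.run(
        [
            "bcp",
            "AdventureWorks2014.Person.Address2",
            "in",
            r"d:\PersonAddressByQuery.txt",
            "-S", r"MYSERVER\SQL2016",  # target instance; omit -S and bcp uses the default instance
            "-T",  # trusted (Windows) authentication
            "-c",  # character data format
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)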

Lookups And Conditionals In Azure Data Factory V2

Alex Whittles shows us how to perform lookups and operations with IF clauses in Azure Data Factory V2:

Azure Data Factory v2 (ADFv2) has some significant improvements over v1, and we now consider ADF a viable platform for most of our cloud-based projects. But things aren’t always as straightforward as they could be. I’m sure this will improve over time, but don’t let that stop you from getting started now.

This post provides a walk through of using the ‘Lookup’ and ‘If Condition’ activities to do some basic conditional logic depending on the results of a database query.

Assumptions: You already have an ADF pipeline created. If you want to hook into SSIS then you’ll also need the SSIS Integration Runtime set up – although this is not relevant just for the if condition.

Read on for an example.
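
Alex builds this in the ADF authoring UI; for reference, the same Lookup-then-If pattern can also be expressed through the azure-mgmt-datafactory Python SDK. The dataset, query, and expression below are hypothetical, and model names may vary by SDK version.

    # Sketch: a Lookup activity feeding an If Condition activity.
    # Dataset name, query, column name, and the expression are hypothetical.
    from azure.mgmt.datafactory.models import (
        ActivityDependency,
        AzureSqlSource,
        DatasetReference,
        Expression,
        IfConditionActivity,
        LookupActivity,
        PipelineResource,
        WaitActivity,
    )

    lookup = LookupActivity(
        name="GetLoadFlag",
        dataset=DatasetReference(reference_name="ControlTableDataset"),
        source=AzureSqlSource(sql_reader_query="SELECT TOP 1 LoadEnabled FROM etl.Control"),
        first_row_only=True,
    )

    branch = IfConditionActivity(
        name="CheckLoadFlag",
        depends_on=[ActivityDependency(activity="GetLoadFlag", dependency_conditions=["Succeeded"])],
        # ADF expression language: read the single row returned by the Lookup.
        expression=Expression(value="@equals(activity('GetLoadFlag').output.firstRow.LoadEnabled, 1)"),
        # Wait activities stand in for the real work on each branch.
        if_true_activities=[WaitActivity(name="DoTheLoad", wait_time_in_seconds=1)],
        if_false_activities=[WaitActivity(name="SkipTheLoad", wait_time_in_seconds=1)],
    )

    # Publishing would go through adf.pipelines.create_or_update(...) as usual.
    pipeline = PipelineResource(activities=[lookup, branch])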

Azure Data Factory V2 Pricing

Chris Seferlis gives us the details on how Azure Data Factory V2 pricing works:

2. Volume of data moved – this is measured in DMUs (data movement units). This is one you should be aware of, as this will default to auto, which basically uses all the DMUs it can, and this is paid for by the hour. Let’s say you specify and use 2 DMUs and it takes an hour to move that data. The other option is you could use 8 DMUs and it takes 15 minutes; the price is going to end up the same. You’re using 4x the DMUs, but it’s happening in a quarter of the time.

This is good to look at and do some comparisons, since how many DMUs you’re using is where the bulk of your spend is going to be.

There are a few moving parts here, so the calculation is not trivial.  But Chris makes good sense of it all.
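
The arithmetic behind Chris's 2-DMUs-for-an-hour versus 8-DMUs-for-15-minutes example is just DMUs multiplied by hours; only the rate per DMU-hour (a made-up number below) changes the absolute cost.

    # Sketch: the data movement bill scales with DMU-hours, not with DMU count alone.
    # The rate is a placeholder; check current Azure pricing for real numbers.
    RATE_PER_DMU_HOUR = 0.25  # hypothetical $/DMU-hour

    def movement_cost(dmus: int, hours: float) -> float:
        return dmus * hours * RATE_PER_DMU_HOUR

    print(movement_cost(2, 1.0))   # 2 DMUs for an hour     -> 0.5
    print(movement_cost(8, 0.25))  # 8 DMUs for 15 minutes  -> 0.5 (same spend, faster copy)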

Tips For Managing Business Logic

Tim Mitchell has a few tips for managing business logic, particularly when building ETL processes:

First things first: let’s get on the same page with what is meant by business logic. When I refer to business logic (also commonly referred to as business rules), I’m talking about the processing rules that are used to transform an organization’s data so that it is accurate, understandable, and usable. In almost all cases, these business rules are not designed to change the meaning of the data, but to clarify and make it easier to comprehend. Business logic may be applied when data arrives from other sources, or to existing data to reflect changes that have taken place.

Business logic is usually highly customized for and by each organization. The amount of processing required is heavily dependent on factors such as source data quality, reporting granularity, technical skill level of the intended audience, and even company culture.

It’s worth reading the whole thing.
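
A tiny, made-up example of the kind of rule Tim is describing: the transformation doesn't change what happened, it just makes the data understandable to the people consuming it downstream.

    # Sketch: a business rule that clarifies rather than changes the data.
    # Codes and labels are invented for illustration.
    import pandas as pd

    orders = pd.DataFrame({"order_id": [1, 2, 3], "status_code": ["S", "C", "X"]})

    # Keep the rule in one place so every downstream consumer sees the same meaning.
    STATUS_LABELS = {"S": "Shipped", "C": "Cancelled", "X": "Exception"}
    orders["status"] = orders["status_code"].map(STATUS_LABELS).fillna("Unknown")

    print(orders)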

Azure Data Factory v2 And Decompression

Ben Jarvis notes a file naming bug with Azure Data Factory v2 when decompressing files:

ADF V2 natively supports decompression of files, as documented at https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support. With this functionality ADF should change the extension of the file when it is decompressed, so 1234_567.csv.gz would become 1234_567.csv; however, I’ve noticed that this doesn’t happen in all cases.

In our particular case, the file names and extensions of the source files are all uppercase, and when ADF uploads them it doesn’t alter the file extension; e.g., if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.

Click through for more details and be sure to vote on his Azure Feedback bug if this affects you.
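
Until the behavior is fixed, one possible workaround is a post-copy rename that strips the compression suffix case-insensitively. A sketch with the Azure Blob Storage Python SDK; the connection string and container name are placeholders, and blobs are "renamed" by copy-then-delete since blob storage has no native rename.

    # Sketch workaround: strip a trailing .GZ/.gz from blob names after the copy.
    # Connection string and container name are placeholders.
    import re
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("landing")

    for blob in container.list_blobs():
        new_name = re.sub(r"\.gz$", "", blob.name, flags=re.IGNORECASE)
        if new_name == blob.name:
            continue  # no compression suffix to strip
        source = container.get_blob_client(blob.name)
        target = container.get_blob_client(new_name)
        target.start_copy_from_url(source.url)
        # For large blobs, poll target.get_blob_properties().copy.status == "success"
        # before deleting the source.
        source.delete_blob()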
