Press "Enter" to skip to content

Month: April 2020

Tying Azure Data Factory to Source Control

Eddy Djaja explains why you really want to tie Azure Data Factory to your source control:

Azure Data Factory (ADF) is Microsoft's ETL (or, more precisely, ELT) tool in the cloud. For more information on ADF, Microsoft provides an introduction at this link: https://docs.microsoft.com/en-us/azure/data-factory/introduction. As for the argument over whether ADF will replace or complement on-premises SSIS, it is uncertain, and only time will tell what happens in the future.
Unlike SSIS, authoring in ADF does not use Visual Studio. ADF authoring uses a web browser to create ADF components, such as pipelines, activities, datasets, etc. The simplicity of authoring ADF may confuse novice developers about how ADF components are saved, stored, and published. When logging in to ADF for the first time after creating a data factory, authoring is in ADF mode. How do we know?

Click through for the explanation and some resources on how to do it.

Checking JSON Structure with ADF

Rayis Imayev takes us through the solution of a tricky problem in Azure Data Factory:

Within my “ForEach” container I have also placed a Stored Procedure task and set 4 data elements from my incoming data stream as values for corresponding parameters.

However, this approach will not work for all my incoming JSON events; it actually failed for the last one, since it didn't have both "stop_time" and "last_update" data elements.

An easy way to fix this problem is to add the missing data elements with empty values for the last event record; however, when we don't have control over incoming data, we need to adjust our data processing steps.

Read on to see how Rayis solves this problem.
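Outside of ADF, the same normalization idea is easy to see in plain code. Here is a minimal Python sketch (not Rayis's ADF solution) that pads each incoming event with empty values for any expected keys before downstream processing; "stop_time" and "last_update" come from the excerpt, while the other key names are placeholders:

```python
import json

# Keys every event is expected to carry downstream.
# "stop_time" and "last_update" are from the excerpt; the rest are made up.
EXPECTED_KEYS = ["trip_id", "start_time", "stop_time", "last_update"]

def normalize_event(event: dict) -> dict:
    """Return a copy of the event with any missing expected keys set to ''."""
    return {key: event.get(key, "") for key in EXPECTED_KEYS}

raw_events = [
    '{"trip_id": 1, "start_time": "2020-04-01T08:00:00", "stop_time": "2020-04-01T08:30:00", "last_update": "2020-04-01T08:31:00"}',
    '{"trip_id": 2, "start_time": "2020-04-01T09:00:00"}',  # missing stop_time and last_update
]

events = [normalize_event(json.loads(e)) for e in raw_events]
print(events[1])  # stop_time and last_update are now present as empty strings
```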

Image Caching with Docker

etash2901 at the Knoldus blog walks us through the way Docker caches images:

If the objects on the file system that Docker is about to produce have not changed between builds, reusing a cache of a previous build on the host is a great time-saver. It makes building a new container really, really fast. None of those file structures have to be created and written to disk this time — the reference to them is sufficient to locate and reuse the previously built structures.

This is an order of magnitude faster than a fresh build. If you're building many containers, this reduced build time means getting that container into production costs less, as measured by compute time.

Click through for some advice on how to minimize the amount of time you spend waiting for image layers to download or process.
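As a concrete illustration of why layer ordering matters for the cache, here is a minimal Dockerfile sketch (file names like requirements.txt and app.py are placeholders): dependencies, which change rarely, are copied and installed before the application code, so a typical code-only change reuses every cached layer up to the final COPY.

```dockerfile
FROM python:3.8-slim

WORKDIR /app

# Dependency manifest first: this layer and the install below are reused
# from cache as long as requirements.txt has not changed.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code last: a code change only invalidates the layers from here down.
COPY . .

CMD ["python", "app.py"]
```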

Tips for Moving from Pandas to Koalas

Haejoon Lee, et al, walk us through migrating existing code written for Pandas to use the Koalas library:

In particular, two types of users benefit the most from Koalas:

– pandas users who want to scale out using PySpark and potentially migrate their codebase to PySpark. Koalas is scalable and makes learning PySpark much easier.
– Spark users who want to leverage Koalas to become more productive. Koalas offers pandas-like functions so that users don't have to build these functions themselves in PySpark.

This blog post will not only demonstrate how easy it is to convert code written in pandas to Koalas, but also discuss best practices for using Koalas: when to use Koalas as a drop-in replacement for pandas, how to use PySpark to work around cases where the pandas APIs are not available in Koalas, when to apply Koalas-specific APIs to improve productivity, and so on. The example notebook in this blog can be found here.

Read on to learn more.
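To give a flavor of how small the change can be, here is a minimal sketch, assuming the databricks koalas and pyspark packages are installed; the data and column names are made up for illustration:

```python
import pandas as pd
import databricks.koalas as ks

# A small pandas DataFrame; in practice this would be far larger data.
pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# pandas: runs on a single machine.
print(pdf.groupby("group")["value"].sum())

# Koalas: the same pandas-style API, executed on Spark.
kdf = ks.from_pandas(pdf)
print(kdf.groupby("group")["value"].sum())
```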

Handling Missing Data

Marina Wyss explains various techniques for handling missing data in data sets:

Missing or incomplete data can have a huge negative impact on any data science project. This is particularly relevant for companies in the early stages of developing solid data collection and management systems.

While the best solution for missing data is to avoid it in the first place by developing good data-collection and stewardship policies, often we have to make do with what's available.

This blog covers the different kinds of missing data and what we can do about missing data once we know what we're dealing with. These strategies range from simple – for example, choosing models that handle missing values automatically, or simply deleting problematic observations – to (probably superior) methods for estimating what those missing values may be, otherwise known as imputation.

I like the distinction among forms of missing data that Marina draws, and we also get a good set of techniques for filling the gaps.
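As a quick illustration of the two simplest strategies mentioned above, deletion versus imputation, here is a pandas sketch (the data and column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 48_000],
})

# Option 1: listwise deletion - drop any row with a missing value.
dropped = df.dropna()

# Option 2: simple mean imputation - fill each missing value with the column mean.
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)
```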

Generating Random Numbers with R

The folks at Data Sharkie walk us through random number generation in R:

Why is random number generation important, and where is it used?

Random number generation has applications in various fields, like statistical sampling, simulation, test design, and so on. Generally, when a data scientist is in need of a set of random numbers, they will have in mind

The R programming language allows users to generate randomly distributed numbers with a set of built-in functions: runif(), rnorm(), and rbinom().

Read on to generate random numbers across two separate distributions.

Combining User-Defined Types and Temp Tables

Andy Levy tries to make cats and dogs live together:

This tripped me up a few weeks ago, but once I stopped and thought about it for a moment, it made total sense. I was trying to copy some data into a temp table and got an error I'd never encountered before.

Column, parameter, or variable #1: Cannot find data type MyStringType.

What’s that all about? Let’s find out.

I don’t think it spoils things to say that Andy’s story is a tragedy and not a comedy. But in fairness, the number of shops using user-defined types (as opposed to user-defined table types) is probably not enormous.

Creating a Time Dimension in Power BI

Reza Rad walks us through creating a time dimension in Power BI:

I have explained the Date dimension a lot previously and mentioned why it is needed. The Date dimension gives you the ability to slice and dice your data by different date attributes, such as year, quarter, month, day, fiscal columns, etc. The Time dimension, on the other hand, will give you the ability to slice and dice data at the level of hours, minutes, seconds, and buckets related to that, such as every 30 minutes or 15 minutes.

The Time table SHOULD NOT be combined with the Date table; the main reason is the huge size of the combined result. Let's say your date table, which includes one record per day, has 10 years of data in it, which means 3,650 rows. Now if you have a Time table with a row for every second, you end up with 24*60*60 = 86,400 rows just for the time table. If you combine the date and time tables, you will have 3,650*86,400 = 315,360,000 rows. 315 million rows are not good for a dimension table. Even if you store one record per minute in your time table, you would still end up with over 5 million rows.

So don’t combine the Date and Time table. These two should be two different tables, and they both can have a relationship to the fact table.

With that in mind, click through to see how to create the table.
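Reza builds the table in Power BI; purely to illustrate the shape of a second-grain time dimension and the row-count arithmetic above, here is a pandas sketch with a couple of bucket columns (the column names are my own):

```python
import pandas as pd

# One row per second of the day: 24 * 60 * 60 = 86,400 rows.
seconds = pd.date_range("2020-04-01 00:00:00", periods=24 * 60 * 60, freq="S")

time_dim = pd.DataFrame({
    "Time": seconds.time,
    "Hour": seconds.hour,
    "Minute": seconds.minute,
    "Second": seconds.second,
    # Bucket columns, e.g. every 30 or 15 minutes.
    "Bucket30Min": seconds.floor("30min").time,
    "Bucket15Min": seconds.floor("15min").time,
})

print(len(time_dim))  # 86400 - versus 315,360,000 if combined with a 10-year date table
```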

Auto-Recovery with Power BI

Prathy Kamasani shows us how to recover lost Power BI Desktop reports:

A quick post: how many times in Power BI Desktop have you clicked on "No, remove the files." and then said OOPS? Well, I did it plenty of times before discovering this trick.

In short, you can find those removed files under the Temp folder, like many other Windows application files. Usually, the location will be somewhere like this: C:\Users\prathy\AppData\Local\Microsoft\Power BI Desktop\TempSaves. This location depends upon which version of Power BI Desktop you have. Beware: these files will be removed whenever you clear your Temp directory.

Auto-save and auto-recovery are marvelous things.

Execution Plan Training, in Video Form

Hugo Kornelis makes an announcement:

As those who have been to my full-day precon on execution plans know, I believe that learning to understand execution plans does not start with dozens of examples. It starts with an explanation of the basics, followed by an overview of operators. Just like learning Russian doesn’t start with reading Tolstoy’s Война и мир (War and Peace), but with learning the grammar rules and the vocabulary.

Once you know the grammar of a language, and enough of its vocabulary, you can then pick up any book. And the more you do that, the easier it becomes. Eventually, one day, you will be able to read Война и мир in its original language.

And once you know the basics of reading execution plans, and are familiar with most of the operators, you will be able to tackle any execution plan you find on your servers, no matter how complex.

And, at least for now, this is free. So check out what Hugo already has available and pass along a "thank you" if you like what you see there.
