Author: Kevin Feasel

Every report has a data source and getting source data in the right format for your BI platform is a substantial task – so much so, that you might be tempted to put Power BI on top of the data sources you have created for your previous BI platform with no changes. However different BI platforms need their data in different formats. Many BI platforms like their data munged together in one big table, sometimes even with data at different granularities in the same table. Power BI, on the other hand, likes its source data modelled as a star schema (you can find out what a star schema is and why it’s important here). If you don’t model your data as a star schema you may find that you see incorrect values in your reports, that report performance is poor, and that it’s a lot harder to write the DAX calculations that you need.

Four out of the five fit just as well with any other data platform technology.

Comments closed

Using Azure Functions Inside Azure Data Factory

Published 2020-04-20 by Kevin Feasel

Rayis Imayev shows how you can call an Azure Function from inside your Azure Data Factory Pipeline:

Creating a data solution with Azure Data Factory (ADF) may look like a straightforward process: you have incoming datasets, business rules of how to connect and change them and a final destination environment to save this transformed data. Very often your data transformation may require more complex business logic that can only be developed externally (scripts, functions, web-services, databricks notebooks, etc.).

In this blog post, I will try to share my experience of using Azure Functions in my Data Factory workflows: my highs and lows of using them, my victories and struggles to make them work.

This includes a description of the options, a demo function, and additional notes for each technique.

Comments closed

Avoiding Loops in Python with NumPy

Published 2020-04-17 by Kevin Feasel

Swantika Gupta walks us through vectorization and broadcasting with NumPy:

Vectorization is a powerful ability within NumPy which is used to speed up the code execution without using loop. It expresses operations as occurring on entire arrays rather than their individual elements.
Looping over an array or any data structure in Python has a lot of overhead involved. In NumPy, Vectorized Operations delegates the looping internally to highly optimized C and Fortran functions, making for cleaner and faster Python code. So, vectorization refers to the concept of replacing explicit for-loops with array expressions, which can then be computed internally with a low-level language, like C.

Read on for a few examples of this and broadcasting.

Comments closed

Apache Kafka 2.5 Released

Published 2020-04-17 by Kevin Feasel

David Arthur announces Apache Kafka 2.5:

KIP-500 update
In Apache Kafka 2.5, some preparatory work has been done towards the removal of Apache ZooKeeper™ (ZK).
– KIP-555: details about the ZooKeeper deprecation process in admin tools
– KIP-543: dynamic configs will not require ZooKeeper access

KIP-500 looks like a doozy.

Comments closed

Metadata-Only Column Changes with SQL Server 2016

Published 2020-04-17 by Kevin Feasel

Paul White takes us through several metadata-only changes which SQL Server 2016 introduced:

These changes can be metadata-only because the underlying binary data layout does not change when Column Descriptor row format is used (hence the need for compression). Without compression, row store uses the original FixedVar representation, which cannot accommodate these fixed-length data type changes without rewriting the physical layout.
You may notice that tinyint is omitted from the integer types list. This is because it is unsigned, while the other integer types are all signed, so a metadata-only change is not possible. For example, a value of 255 can fit in one byte for tinyint, but requires two bytes in any of the signed formats. The signed formats can hold -128 to +127 in one byte when compressed.

This is very interesting, but note the long list of requirements for it to work, notably that compression must be enabled on all indexes and partitions.

Comments closed

Dataflows as an Alternative to Incremental Loading in Power BI

Published 2020-04-17 by Kevin Feasel

Imke Feldmann gives us an alternative to Power BI’s incremental loading using Dataflows:

If you’ve been following my blog for a while, you might have noticed my interest in incremental load workarounds. It took some time before we saw the native functionality for it in Power BI and it was first released for premium workspaces only. Fortunately, we now have it for shared workspaces / pro licenses as well and it is a real live saver for scenarios where the refresh speed is an issue.
However, there is a second use case for incremental refresh scenarios that is not covered ideally with the current implementation. This is where the aim is to harvest and store data in Power BI that will become unavailable in their source in the future or one simply wants to create a track of changes in a data source. Chris Webb has beaten me to this article here and describes in great detail how that setup works. He also mentions that this is not a recommended setup, which I agree. Another disadvantage of that solution is that this harvested data is only available as a shared dataset instead of a “simple” table. This limits the use cases and might force you to set up these incremental refreshes in multiple datasets.

Read on for more information.

Comments closed

Diagnosing Out of Memory Failures with SQL Server 2017

Published 2020-04-17 by Kevin Feasel

Lonny Niederstadt had a curious issue:

When [Max Server Memory] can be attained by SQL Server, [Target Server Memory] will be equal to [Max Server Memory]. If memory conditions external to SQL Server make [Max Server Memory] unattainable, [Target Server Memory] will be adjusted downward to an attainable value. As workload is placed on the system, [Total Server Memory] grows toward [Target Server Memory]. That’s typical, expected behavior.
In this case, the story stays boring until after the following midnight. There wasn’t enough workload to drive much growth of [Total Server Memory] until about 2:15 am, after which [Total Server Memory] grew fairly rapidly. [Total Server Memory] reached a plateau between 3:00 am and 3:15 am, and then stepped slightly down later. [Target Server Memory] was never attained. That’s curious.

Indeed it is. Read the whole thing. And, given that it is labeled as Part 1, stay tuned for Part 2.

Comments closed

Documenting SQL Server Tables

Published 2020-04-17 by Kevin Feasel

Phil Factor has a way to create table documentation in source control and propagate it to the actual database:

It has always been a problem that documentation in the source, where it should be, is not then passed into the live database when the build script is executed. In a table, you have columns, constraints and indexes that you are likely to document using line-ending comments and block comments. You probably have a big block comment at the start, explaining the table. This information should be available in the live database. Microsoft don’t have a good answer and vaguely go on about adding comments in extended properties. Well, that’s fine but it hasn’t happened, unsurprisingly: Have you ever tried to do it? It is an almost impossible task, even with SQL Doc.
My solution is to execute my finely-documented build script as usual to create the latest version of the database, and then process the same script in PowerShell to add all the comments and documentation as extended properties in the right place in the live database.

It’s an interesting approach to a classic problem.

Comments closed

Querying Database and Log File Sizes with T-SQL

Published 2020-04-17 by Kevin Feasel

Allen White takes us through an easy technique to check database and log file sizes:

As a consultant, I have to be able to quickly spot problems, and one of the problems I frequently find is transaction log files that are incorrectly sized.
There are two catalog views in the master database which make this easy to do – sys.master_files and sys.databases. The sys.master_files view contains the database and individual file names, and the data_space_id column always has a value of 0 for the log file. The size column returns the value in 8KB pages, so we have to multiply the column by 8, then divide by 1024 to get the size in megabytes (MB).

Click through for the demo.

Comments closed

Explaining SQL_VARIANT

Published 2020-04-17 by Kevin Feasel

Kenneth Fisher explains what the SQL_VARIANT data type is used for:

The SQL_VARIANT data type is an interesting beast. It is a data type that can store most types of data. So a date, decimal, int, varchar etc. can be stored in this single data type. This sounds great right? Well there are a few issues here.

I don’t think I’ve ever used SQL_VARIANT data types before. It always struck me as a refuge for not wanting to think about what the proper data type should be. In fairness to it, though, I’ve seen plenty of unthoughtful solutions using NVARCHAR as well.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31