Press "Enter" to skip to content

Month: December 2019

Data Copy & Package Execution in ADF

Cathrine Wilhelmsen continues a series on Azure Data Factory. First, we get to see how to copy data from on-prem SQL Servers:

In the previous post, we looked at the three different types of integration runtimes. In this post, we will first create a self-hosted integration runtime. Then, we will create a new linked service and dataset using the self-hosted integration runtime. Finally, we will look at some common techniques and design patterns for copying data from and into an on-premises SQL Server.

And when I say “on-premises”, I really mean “in a private network”. It can either be a SQL Server on-premises on a physical server, or “on-premises” in a virtual machine.

Then, we learn how to run SSIS packages in Azure Data Factory:

Two posts ago, we looked at the three types of integration runtimes and created an Azure integration runtime. In the previous post, we created a self-hosted integration runtime for copying SQL Server data. In this post, we will complete the integration runtime part of the series. We will look at what SSIS Lift and Shift is, how to create an Azure-SSIS integration runtime, and how you can start executing SSIS packages in Azure Data Factory.

I’m going to guess that the next post will be all about the third integration runtime.


Problems with sp_estimate_data_compression_savings

Andy Mallon knows it’s getting close to Festivus and he has some grievances to air:

If you’re working with compressed indexes, SQL Server provides a system stored procedure to help test the space savings of implementing data compression: sp_estimate_data_compression_savings. Starting in SQL Server 2019, it can even be used to estimate savings with columnstore.

I really don’t like sp_estimate_data_compression_savings. In fact, I kind of hate it. It’s not always very accurate–and even when it is accurate, the results can be misleading. Before I get ranty about why I don’t like it, let’s look at it in action.
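For context, the call itself is straightforward; here is a minimal sketch against a placeholder table, estimating PAGE compression savings:

-- Estimate PAGE compression savings for a hypothetical dbo.SalesOrders table
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = N'dbo',
    @object_name      = N'SalesOrders',    -- placeholder table name
    @index_id         = NULL,              -- NULL means all indexes on the table
    @partition_number = NULL,              -- NULL means all partitions
    @data_compression = N'PAGE';           -- NONE, ROW, PAGE, or (in 2019) the columnstore options

The output compares the current size of each index and partition against an estimated size under the requested compression setting, based on sampling a subset of the data into tempdb, which is where some of the inaccuracy comes from.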

Andy makes good points in this, so check it out.


Schiphol Takeoff: Low-Code Automated Deployment

Tim van Cann and Daniel van der Ende have an open source project for automatic deployment on Azure:

To give a bit more insight into why we built Schiphol Takeoff, it’s good to take a look at an example use case. This use case ties a number of components together:

– Data arrives in a (near) real-time stream on an Azure Eventhub.
– A Spark job running on Databricks consumes this data from Eventhub, processes the data, and outputs predictions.
– A REST API is running on Azure Kubernetes Service, which exposes the predictions made by the Spark job.

Conceptually, this is not a very complex setup. However, there are quite a few components involved:

– Azure Eventhub
– Azure Databricks
– Azure Kubernetes Service

Each of these individually has some form of automation, but there is no unified way of coordinating and orchestrating deployment of the code to all at the same time. If, for example, you were to change the name of the consumer group for Azure Eventhub, you could script that. However, you’d also need to manually update your Spark job running on Databricks to ensure it could still consume the data.

This looks pretty nice. I’ll need to dive into it some more.


New Features in Kafka 2.4

Manikumar Reddy announces new features in Apache Kafka 2.4:

KIP-392: Allow consumers to fetch from closest replica

Historically, consumers were only allowed to fetch from leaders. In multi-datacenter deployments, this often means that consumers are forced to incur expensive cross-datacenter network costs in order to fetch from the leader. With KIP-392, Kafka now supports reading from follower replicas. This gives the broker the ability to redirect consumers to nearby replicas in order to save costs.

It’s not the biggest release of Kafka ever, but there are some really nice updates here.


Testing Power BI Report Performance in the Browser

Chris Webb continues a series on testing Power BI report performance in a browser. Part 2 walks us through some of the mechanics of the process:

Before you publish your report, in Power BI Desktop add a blank page with no visuals on to it. It doesn’t need to be the page that is opened when the report opens and you will be able to delete it later. Why do this? When you’re testing how long it takes for your report page to render, you’re probably doing so because you want to improve performance. Some things in the report page that influence performance you have the power to change, such as the design of the dataset, the DAX in the measures, the number and type of visuals on a page; some things will always happen when a report runs and you have to accept that overhead. Testing how long a blank page takes to render will give you an idea of how long this latter category of “things that always happen” takes, and you can subtract this time from the time your chosen report page takes to run.
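To put numbers on it: if the blank page renders in about 2 seconds and your report page takes 7, roughly 5 of those seconds come from your visuals, queries, and DAX, and that is the part you can actually tune.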

Part 3 is a demonstration of the process:

…so you go ahead and publish. You view the report after publishing and it still seems fast. Then the complaints start coming in: the report is slow!?! It seems to be users who are viewing the report on their phone who are having the most problems. So, following the instructions in my last post, you open up Chrome DevTools and run an audit using a simulated slow 4G connection:

That’s an important part of testing. We normally develop inside a fast network, but our users may be on rather slow networks.


Copy Reports with Shared Data Sets Between Workspaces

Gilbert Quevauvilliers ran into a cross-environment issue:

I was working on some documentation for a customer and came across a very quick and easy way to create a copy of a report that also keeps its connection to the shared dataset, which I could then copy to another “New Workspace”.

Before I found this gem, I had to do this manually via PowerShell, which worked really well, but I had to do a whole lot of extra work to find the GUIDs, then test it and make sure it worked. This new method makes it simple and quick. It is a WIN-WIN.

You can follow along as I show you how to do it below.

Click through for the demonstration.


When FOR JSON PATH Isn’t Enough

Dave Mason walks us through some options when working with JSON data in SQL Server:

In both situations, we need to know something about the JSON schema to query it in a meaningful way: in the first example, column names and types are hard-coded; in the second example, column names are hard-coded as path parameter values for the JSON_VALUE function. Even though JSON data is self-describing, SQL Server doesn’t have a way to infer schema. (I would be quite happy to be wrong about this–please add a comment if you know something I don’t!) About the time I came to this realization, I commented on Twitter that JSON might be fool’s gold. You don’t need to know schema to store JSON data in SQL Server. But you do if you want to query it. “It’s pay me now or pay me later.”
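To make the “you need to know the schema” point concrete, here is a rough OPENJSON sketch; the JSON payload and column definitions are invented for illustration:

-- Hypothetical JSON payload
DECLARE @json nvarchar(max) = N'[
    {"Name": "Anna",  "Age": 29},
    {"Name": "Bjorn", "Age": 34}
]';

-- The WITH clause is the schema SQL Server cannot infer on its own:
-- column names, types, and paths all have to be hard-coded by you
SELECT Name, Age
FROM OPENJSON(@json)
WITH (
    Name nvarchar(50) '$.Name',
    Age  int          '$.Age'
);

Nothing in the JSON itself tells SQL Server that Name should be an nvarchar(50) or Age an int; that knowledge has to come from whoever writes the query, which is exactly the “pay me later” part.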

It’s schema on read or schema on write. I’m not sure there is ever a truly schema-free scenario in a business application.


The Benefits of DAX Variables

Reza Rad explains why you should use DAX variables if you’re repeating calculations:

We have two main parts in the expression above: A and B. Each of those is doing a calculation. Now, with the markings above, reading the expression is much simpler. The whole expression means this:

=IF(A>B, A, B)

All the expression above is saying is that if A is bigger than B, then return A; otherwise, return B. Now it is much simpler to read because we have split the repetitive parts into sections. That is exactly what DAX variables are for.

Readability is not the only benefit, however. Reza has more.


Why Disabling the Clustered Index is a Bad Idea

Kenneth Fisher has an experiment in mind:

You are probably already aware that you can disable an index. This can be handy when you have a large load and the load + re-enabling the indexes (you have to completely rebuild them) is faster than leaving the indexes in place. I’ve had pretty limited occasions where this has helped but it can be a handy trick at times. That said, this is only true for non-clustered indexes. What happens when you disable the clustered index?

Nothing good, that’s what.
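If you want the short version of the experiment, it boils down to something like this; a sketch against a hypothetical table, so do not run it anywhere you care about:

-- Hypothetical table with a clustered primary key
CREATE TABLE dbo.DisableMe
(
    Id   int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_DisableMe PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL
);
INSERT INTO dbo.DisableMe (Name) VALUES ('a'), ('b');

-- Disabling the clustered index disables access to the table's data itself
ALTER INDEX PK_DisableMe ON dbo.DisableMe DISABLE;

-- This now fails: the query processor cannot produce a plan over a disabled clustered index
SELECT * FROM dbo.DisableMe;

-- The way back is a full rebuild (or dropping and recreating the index)
ALTER INDEX PK_DisableMe ON dbo.DisableMe REBUILD;

Until that rebuild runs, the table is effectively offline, which is rather different from the non-clustered case.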
