
Month: November 2020

The Evolving Lakehouse

Simon Whiteley looks at the current status of the Lakehouse model:

We have discussed in the past this idea of the lakehouse, the aspirational target of many analytics platforms these days of combining the huge power and potential of data lakes with the rigour, reliability and concurrency of a data warehouse. It’s an interesting concept but has, in the past, been firmly an aspiration.

In the world without lakehouses, we often see the “Modern Data Warehouse”, this two-phased approach to providing a holistic platform – we load our early data into a lake where we shape it and massage it into an understandable state. It is here we perform data science, exploratory data analysis, early sight analytics prototyping and various other functions that don’t quite fit into a data warehouse… but then we load our data into a relational store for serving to the business. This is where we can meet their demands for a rich SQL environment, auditable data models and rigorous change procedures. Essentially, we store data twice so that we can achieve the best of both worlds.

Definitely read Simon’s take on it. My take is that the Lakehouse concept will start to be useful to specific companies in about 2-3 years, as I don’t think the performance is there today.


Using Linked Servers to Scan SQL Server Instances

Kenneth Fisher has a plan:

I recently did a post on testing a linked server where I said I would explain why I wanted to make the test. Basically, I needed to scan a few hundred instance names and do the following:

– Check if the instance is one we have access to, or if it even exists. If not, make a note of the error so we can tell the difference.
– Collect information like instance size (total size of all databases), CPU count, memory, etc.
– Collect a list of database names on the instance, their status, size, etc.

I’m of mixed feelings about this. On the one hand, I appreciate that it’s pretty simple and does what is promised: a quick and dirty method for reaching a SQL Server instance (or something which might be one). On the other hand, it feels like this is trying to be the combination of two or three things which do the job better:

  • Central Management Server, for managing the SQL Server instances available to you. This is where you’d put confirmed instances and start running checks like Kenneth wants.
  • Nmap, for determining whether there are servers which are not part of your CMS listing. Nmap can check for availability on port 1433 and see whether there’s a SQL Server instance there, as well as hit the SQL Browser service for instances not on 1433 (assuming you leave the Browser on).
  • A PowerShell script to tie these together, or at least a script in some language which does a decent job with looping. Even better if it offers parallelism. There’s a rough sketch of the idea in Python below.
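
To give a rough idea of what that last piece might look like, here is a minimal sketch in Python with pyodbc instead of PowerShell. The instance names, driver name, and timeout are all assumptions, and the queries grab only a few of the data points Kenneth mentions:

```python
import pyodbc

# Hypothetical instance names; in practice this would be the list of a few
# hundred names to scan.
instances = ["SERVER01", "SERVER02\\INST1"]

for instance in instances:
    conn_str = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={instance};Trusted_Connection=yes;"
    )
    try:
        # Short login timeout so dead or missing servers fail fast
        conn = pyodbc.connect(conn_str, timeout=5)
    except pyodbc.Error as ex:
        # Keep the error text so "no access" and "does not exist" stay distinguishable
        print(f"{instance}: unreachable or no access - {ex}")
        continue

    cur = conn.cursor()

    # Instance-level info: version, CPU count, total database size in MB
    cur.execute("""
        SELECT CONVERT(nvarchar(128), SERVERPROPERTY('ProductVersion')),
               (SELECT cpu_count FROM sys.dm_os_sys_info),
               (SELECT SUM(size) * 8 / 1024 FROM sys.master_files)
    """)
    version, cpus, total_mb = cur.fetchone()
    print(f"{instance}: version {version}, {cpus} CPUs, ~{total_mb} MB total")

    # Per-database info: name, state, size in MB
    cur.execute("""
        SELECT d.name, d.state_desc, SUM(mf.size) * 8 / 1024
        FROM sys.databases AS d
        JOIN sys.master_files AS mf ON mf.database_id = d.database_id
        GROUP BY d.name, d.state_desc
    """)
    for db_name, state, size_mb in cur.fetchall():
        print(f"  {db_name} ({state}): {size_mb} MB")

    conn.close()
```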

Check if an ADF Pipeline is Already Running

Paul Andrew has a scenario for us:

Scenario: I want to trigger a Data Factory pipeline, but when I do I want the pipeline to know if it’s already running. If it is already running, stop the new run.

Sounds simple enough right?

Wrong!

But it’s now simple for you, because I’ve done it for you, yay! 🙂

I thought it was simple, but it wasn’t simple, but now it’s simple, but is it really simple? Click through to find out.
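
Not Paul’s implementation (click through for that), but as a rough sketch of the underlying question, here is one way to ask Data Factory whether a pipeline already has an active run, using the queryPipelineRuns REST endpoint from Python. The subscription, resource group, factory, and pipeline names are placeholders, and acquiring the Azure AD bearer token for the management API is left out:

```python
# A sketch of asking Data Factory whether a pipeline already has an active run,
# via the queryPipelineRuns REST endpoint. Names below are placeholders.
from datetime import datetime, timedelta, timezone
import requests

SUB, RG, FACTORY, PIPELINE = "<subscription-id>", "<resource-group>", "<factory>", "<pipeline>"


def has_active_run(bearer_token: str) -> bool:
    url = (
        f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
        f"/providers/Microsoft.DataFactory/factories/{FACTORY}"
        "/queryPipelineRuns?api-version=2018-06-01"
    )
    now = datetime.now(timezone.utc)
    body = {
        "lastUpdatedAfter": (now - timedelta(days=1)).isoformat(),
        "lastUpdatedBefore": now.isoformat(),
        "filters": [
            {"operand": "PipelineName", "operator": "Equals", "values": [PIPELINE]},
            {"operand": "Status", "operator": "In", "values": ["InProgress", "Queued"]},
        ],
    }
    resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {bearer_token}"})
    resp.raise_for_status()
    # Any run still in progress or queued means the pipeline is already running
    return len(resp.json().get("value", [])) > 0
```

One wrinkle: if you call something like this from inside the pipeline run itself, that run also counts as in progress, so you would check for more than one active run rather than any at all.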


Azure Data Studio November 2020 Release

Alan Yu announces the November 2020 release of Azure Data Studio:

Another feature request was to provide support for parameters in a notebook. Parameterization is the ability to execute the same notebook with different parameters.

With this release of Azure Data Studio, users will now be able to utilize Papermill’s ability to parameterize, execute, and store notebooks. Setting the parameters cell as the first code cell in your notebook ensures that the injected parameters in the output parameterized notebook will be placed directly after the original parameters cell. That way, the parameterized notebook will utilize the newly injected parameters instead of the original parameters cell.

Users can utilize the Papermill CLI as well as the Python API to pass in a new set of parameters quickly and efficiently, as shown below.

That does look interesting.
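
For a quick feel of the Papermill side of this, here is a minimal sketch of the Python API; the notebook paths and parameter names are hypothetical, and the input notebook’s first code cell would be tagged as the parameters cell with defaults for these values:

```python
import papermill as pm

# Execute the notebook with injected parameters and store the result as a new,
# parameterized notebook.
pm.execute_notebook(
    "sales_report.ipynb",          # input notebook with a tagged parameters cell
    "sales_report_2020_11.ipynb",  # output notebook with the injected parameters
    parameters={"year": 2020, "month": 11},
)

# Roughly the same thing from the Papermill CLI:
#   papermill sales_report.ipynb sales_report_2020_11.ipynb -p year 2020 -p month 11
```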


Pre-Calculating Semi- and Non-Additive Measures in Power BI

Gerhard Brueckl shows us how to work with semi-additive and non-additive measures in Power BI:

In one of my recent projects we wanted to visualize data from the customer’s analytical platform, based on Azure Databricks, in Power BI. The connection between those two tools works pretty flawlessly, which I also described in my previous post, but the challenge was the use case and the calculations. We wanted to display the distinct customers across various aggregation levels over a billion-row fact table. We came up with different potential solutions, all having their pros and cons:

1. load all data into Power BI (import mode) and do the aggregations there
2. use Power BI with direct query and let the back-end do the heavy lifting
3. load only necessary pre-aggregated data into Power BI (import mode)

Please keep in mind that we are dealing with a distinct count measure here. Semi- and non-additive measures like this cannot easily be aggregated from lower levels to higher levels without having all the detail data available!

Read on for Gerhard’s thoughts on each as well as the decision and process.
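
To show the spirit of option 3 (this is not Gerhard’s exact solution, so do read his post), here is a rough PySpark sketch which pre-calculates the distinct count at each aggregation level, since a distinct count cannot be rolled up from a lower grain. The table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pre-calculate COUNT(DISTINCT ...) at every aggregation level the report needs,
# because a distinct count at one grain cannot be summed up to a coarser grain.
pre_aggregated = spark.sql("""
    SELECT year, month, region,
           COUNT(DISTINCT customer_id) AS distinct_customers
    FROM   fact_sales
    GROUP BY year, month, region
    GROUPING SETS (
        (year),                  -- yearly totals
        (year, month),           -- monthly totals
        (year, month, region)    -- monthly totals per region
    )
""")

# This much smaller result set is what gets imported into Power BI.
pre_aggregated.write.mode("overwrite").saveAsTable("agg_distinct_customers")
```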


The Performance Overhead for Query Store

Erin Stellato updates a prior post:

I wrote the original Query Store performance overhead post just over two years ago, and just like the data in your database keeps changing, so does SQL Server.  However, the question, “What is the performance overhead of enabling Query Store?” is still the most frequent question I am asked. 

So why am I writing this post?  Because there have been many improvements specific to Query Store that have taken the feature to the point where it can support all workloads, including those that are ad-hoc.  This is a big deal.

Read on to learn more.


Monitoring Kafka Metrics with Prometheus and Grafana

Murat Derman shows how you can use Prometheus and Grafana to track vital measures on an Apache Kafka cluster:

You can add the scrape_interval parameter in your configuration; by default it is every 1 minute. For example: scrape_interval: 5s

Prometheus has its own query language called PromQL. You can learn more about this language here: https://prometheus.io/docs/prometheus/latest/querying/basics/

There are a lot of metrics you can define for Kafka. I will mention a few of them in this article.

Read on for a breakdown of some of these measures.
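
As a small extra, here is one way to pull a Kafka metric back out of Prometheus from Python via its HTTP API. The Prometheus address and the metric name (which depends on how the JMX exporter is configured) are assumptions:

```python
import requests

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address

# Hypothetical metric name for broker message throughput, as exposed by a
# typical JMX exporter configuration; adjust to whatever your exporter emits.
PROMQL = "rate(kafka_server_brokertopicmetrics_messagesin_total[5m])"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": PROMQL})
resp.raise_for_status()

# The instant-query result is a vector: one (labels, [timestamp, value]) pair
# per matching time series.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    timestamp, value = series["value"]
    print(labels.get("instance", "?"), value)
```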


Using Hints in SQL Server

Jared Poche is flirting with the dark side:

I work on hundreds of databases with the same schema. They have different data sets and distributions, different sizes, and their statistics are going to update at different times. But if one of them chooses a bad plan, I have to push aside whatever other work I’m doing to research the high CPU on database xyz.

Consistency is really valuable to me. And in this case, the answer is simple. Yes, I want to scan the fast, small memory-optimized table variable first, and use it to filter the larger, slower table. Adding a join hint or a force order to this query should keep its plan and performance consistent.

Click through for a few examples of where query hints can be useful, but also where they can fail you in unexpected ways.
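
As a toy illustration of the kind of hint Jared describes (not his actual query), here is a FORCE ORDER example wrapped in Python via pyodbc; the connection string, table names, and column names are all hypothetical:

```python
import pyodbc

# Hypothetical connection string; assumes ODBC Driver 17 for SQL Server.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=xyz;Trusted_Connection=yes;"
)

# FORCE ORDER keeps the join order as written, so the small filtering table is
# read first and drives the join into the larger table.
hinted_query = """
SELECT  b.OrderId, b.Amount
FROM    dbo.SmallFilterTable AS s          -- small table, read first
        INNER JOIN dbo.BigOrderTable AS b  -- large table, filtered by s
            ON b.CustomerId = s.CustomerId
OPTION (FORCE ORDER);
"""

for row in conn.cursor().execute(hinted_query):
    print(row.OrderId, row.Amount)

conn.close()
```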


Using the Vertipaq Analyzer with DAX Studio

Gilbert Quevauvilliers walks us through the VertiPaq Analyzer in DAX Studio:

If you are looking to better understand your Power BI model (how big your tables are, which column is taking up the most space), then you can use VertiPaq Analyzer, which is built directly into the amazing DAX Studio.

If you are looking for a video on how to do this, there are some awesome videos found here.

SQLBI.COM – Introducing VertiPaq Analyzer in DAX Studio

Guy In a Cube – How do you even use VertiPaq Analyzer with Power BI???

The reason for my blog post is that I find sometimes I want to watch a video to better understand the content, and other times I want to follow a step-by-step process on how to complete the analysis.

Click through for a walkthrough of the process.
