Press "Enter" to skip to content

February 2019

Creating Benchmarks With WorkloadTools

Gianluca Sartori announces a new tool:

Throughout my career, I had to go through the pain of benchmarking often enough to get fed up with all the existing tools and decide to code my own. The result of this endeavor is WorkloadTools: a collection of tools to collect, analyze and replay SQL Server workloads, on premises and in the cloud.

At the moment, the project includes 3 tools:
SqlWorkload – a command line tool to capture, replay and analyze a workload
ConvertWorkload – a command line tool to convert existing workloads (traces and extended events) to the format used by SqlWorkload
WorkloadViewer – a GUI tool to visualize and analyze workload data

Click through for the link to check it out.


Power Query SSIS Source Preview

Joost van Rossum is excited about a new SSIS data source in preview:

First you need to download and install the latest SSDT (Version 15.9.0) and since it is still in preview, you can only use it within SSDT or on an Azure Integration Runtime in Azure Data Factory.

If you drag the new Power Query Source to your Data Flow canvas and edit it, you can paste your Power Query script from Power BI (or Excel) in the query textbox. So there is not yet(?) an editor, but this should simplify a lot of tasks that previously could only be solved with .Net scripting.

Read on for more details. Because it’s a preview, there’s still a lot of work yet to do, it seems. But on the bright side, it’s new functionality in Integration Services.


Power BI Architecture Diagram V4

Dustin Ryan has a new version of the Power BI Architecture Diagram:

First and most importantly, I updated the Power BI logo in the diagram to the latest version of the logo!

Secondly, I included Power BI Dataflows in the diagram tagged #6. Power BI Dataflows are used to ingest, transform, integrate, and enrich big data by defining data source connections, ETL logic, refresh schedules, and more. Data is stored as entities in the Common Data Model in Azure Data Lake Storage Gen2. Dataflow entities can be consumed as a data source in Power BI and by using Power BI Desktop. Read more about Dataflows here.

Click through for a full changelog and a link to download the architecture diagram and legend.


Reference Column Names

Jon Shaulis shows why you want to reference tables when including column names in queries:

If you don’t read the rest of this setup, I want you to take away one thing. 

Always reference your tables with your columns when more than one table is involved in the query!

This post is made primarily with SQL Server in mind, but this behavior is actually ANSI SQL and can be replicated in PostgreSQL, MySQL, and Oracle.

Jon’s example is a case where perfectly valid ANSI SQL logic (which is why you can replicate this across platforms and why it’s not a bug) leads to an unexpected result.
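
To make the trap concrete, here’s a minimal sketch of the same class of problem (the table names are mine, not Jon’s): OrderArchive has no OrderID column, yet the first query below parses and runs, because the unqualified OrderID silently resolves to the outer table.

-- Hypothetical repro: OrderArchive has no OrderID column.
CREATE TABLE Orders (OrderID int);
CREATE TABLE OrderArchive (ArchiveID int);
INSERT INTO Orders VALUES (1), (2), (3);
INSERT INTO OrderArchive VALUES (99);

-- Runs without error and returns ALL three orders: the subquery's
-- OrderID binds to Orders.OrderID, so every row matches itself.
SELECT o.OrderID
FROM Orders AS o
WHERE o.OrderID IN (SELECT OrderID FROM OrderArchive);

-- Qualifying the column turns the silent logic bug into an
-- invalid column name error at compile time:
-- WHERE o.OrderID IN (SELECT oa.OrderID FROM OrderArchive AS oa);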


Databricks Runtime 5.2 Released

Nakul Jamadagni announces Databricks Runtime 5.2:

Delta Time Travel
Time Travel, released as an Experimental feature, adds the ability to query a snapshot of a table using a timestamp string or a version, using SQL syntax as well as DataFrameReader options for timestamp expressions.
Sample code:

SELECT count(*) FROM events TIMESTAMP AS OF timestamp_expression
SELECT count(*) FROM events VERSION AS OF version

Time travel looks a bit like temporal tables in SQL Server.
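
For comparison, here’s roughly what the point-in-time version looks like against a system-versioned temporal table in SQL Server (dbo.Events is a hypothetical table here; unlike Delta, SQL Server takes only a point in time, not a version number):

-- Point-in-time query against a hypothetical system-versioned table.
SELECT COUNT(*)
FROM dbo.Events
FOR SYSTEM_TIME AS OF '2019-02-01T00:00:00';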


Kafka And The Differing Aims Of Data Professionals

Kai Waehner argues that there is an impedance mismatch between data engineers, data scientists, and ML production engineers:

Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting edge APIs, as well as the majority of examples and tutorials use Python APIs. In addition to Python support, there is typically support for other programming languages, including JavaScript for web integration and Java for platform integration—though oftentimes with fewer features and less maturity. No matter what other platforms are supported, chances are very high that your data scientists will build and train their analytic models with Python.

There is an impedance mismatch between model development using Python, its tool stack and a scalable, reliable data platform with low latency, high throughput, zero data loss and 24/7 availability requirements needed for data ingestion, preprocessing, model deployment and monitoring at scale. Python in practice is not the most well-known technology for these requirements. However, it is a great client for a data platform like Apache Kafka.

Click through for the full argument as well as where Kafka can help mitigate some of the issues.


Why Increase Cost Threshold For Parallelism

Randolph West explains why the default value of cost threshold for parallelism is too low at 5:

Unfortunately, the default setting of 5 means that queries are likely to hit that threshold more often than not on modern hardware, and the optimizer is forced to look at parallel plans unnecessarily. A side-effect to this is that queries running in parallel will block queries running on a single thread, so short-running queries will be delayed by long-running queries needing the same resources.

Therefore, if I am setting up a SQL Server instance from scratch, I will set this value to 50 by default, and monitor the performance counters. Once a query cost exceeds 50, then it can use all the cores up to the limit defined by the max degree of parallelism.

50 is a reasonable first stab at a default. I’ll be honest, though: I’m not sure I know exactly where to set the cost threshold for parallelism. I agree that 5 is too low and 50 is better, but I don’t have a great feel for when it should move up or down short of “everything’s going parallel in your database, so increase the value” or “nothing’s ever going parallel, so maybe decrease it a skosh.”
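
If you want to start from Randolph’s suggested value, the change itself is quick; a minimal sketch (it’s an advanced option, hence the first step; monitor afterwards and adjust as he describes):

-- Raise cost threshold for parallelism from the default of 5 to 50.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'cost threshold for parallelism', 50;
RECONFIGURE;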


Power BI Violin Plots

Meagan Longoria shows off a violin plot custom visual in Power BI:

A violin plot is a nifty chart that shows both distribution and density of data. It’s essentially a box plot with a density plot on each side. Box plots are a common way to show variation in data, but their limitation is that you can’t see frequency of values. In other words, you can see statistics such as min, max, median, mean, or quartiles, but you can’t see the individual values nor how often they occurred.

Read on for a review of the custom visual available for violin plots, including areas where it does well and where it falls short at present.


Triggers In Postgres

Ryan Booz explains how triggers work in Postgres from the standpoint of someone familiar with SQL Server:

They are implemented as Functions (Stored Procedures)
This caught me off guard at first. I’ve been working with and dealing with triggers in SQL Server since day 1. They are first-class citizens… objects that have their own code blocks and rules.

PostgreSQL approaches it differently. Any reusable code block, regardless of its true purpose, is a Function of varying types. Triggers are no different. Therefore, you write the logic of your trigger in a Function and then call it by adding a trigger to the DML event of a table.

Click through for a few more tips on triggers.
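
Here’s a minimal sketch of the two-step pattern Ryan describes, with hypothetical object names: the logic lives in a function that returns trigger, and the trigger itself merely binds that function to a DML event.

-- Step 1: the trigger logic is an ordinary function returning trigger.
CREATE OR REPLACE FUNCTION audit_orders()
RETURNS trigger AS $$
BEGIN
    INSERT INTO order_audit (order_id, changed_at)
    VALUES (NEW.order_id, now());
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Step 2: the trigger just wires that function to an event on a table.
CREATE TRIGGER trg_orders_audit
AFTER INSERT ON orders
FOR EACH ROW
EXECUTE FUNCTION audit_orders();  -- EXECUTE PROCEDURE on PostgreSQL 10 and earlier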


Wait Stats In Azure Data Studio

Paul Randal takes the Wait Stats Report in Azure Data Studio for a spin:

Note that the x-axis is percentage of all waits, not wait count. You’ll see that PREEMPTIVE_OS_FLUSHFILEBUFFERS is the top wait on my Linux instance – that’s by design and I’ll blog about that next. I’ve also submitted a GitHub change to add that wait to the list of waits filtered out by the script the extension uses.

Anyway, you can drill in to the details by clicking the ellipsis at the top-right of the graph and selecting ‘Show Details’. That’ll give all the waits and by selecting each one you can see the usual output from my waits script. To get more information on what each wait means, select the bottom cell, right-click on the URL to copy it, and paste into your favorite browser to go to my waits library. And of course, you can refresh the results via the ellipsis as well.

I like how Azure Data Studio is coming together as a full product. There’s a ways to go yet, but it’s getting there.
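
If you’d rather pull the raw numbers yourself, the data behind reports like this ultimately comes from sys.dm_os_wait_stats. A simplified query (my own sketch, without the benign-wait filtering that Paul’s script and the extension apply) looks like this:

-- Top waits by share of total wait time since the last restart/clear.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    waiting_tasks_count,
    100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_total
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;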
