Press "Enter" to skip to content

Day: February 15, 2019

Databricks Runtime 5.2 Released

Nakul Jamadagni announces Databricks Runtime 5.2:

Delta Time Travel
Time Travel, released as an Experimental feature, adds the ability to query a snapshot of a table using a timestamp string or a version, using SQL syntax as well as DataFrameReader options for timestamp expressions.
Sample code
SELECT count(*) FROM events TIMESTAMP AS OF timestamp_expression
SELECT count(*) FROM events VERSION AS OF version

Time travel looks a bit like temporal tables in SQL Server.
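To make that comparison concrete, here is roughly what the equivalent point-in-time query looks like against a system-versioned temporal table in SQL Server (the table name is hypothetical, and the table must have been created with SYSTEM_VERSIONING = ON):

-- Query dbo.Events as it existed at a point in time (hypothetical table).
SELECT COUNT(*)
FROM dbo.Events
FOR SYSTEM_TIME AS OF '2019-02-01T00:00:00';

The mechanics differ, though: temporal tables version individual rows into a separate history table, whereas Delta time travel reads a snapshot of the whole table by version number or timestamp.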


Kafka And The Differing Aims Of Data Professionals

Kai Waehner argues that there is an impedance mismatch between data engineers, data scientists, and ML production engineers:

Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting edge APIs, as well as the majority of examples and tutorials use Python APIs. In addition to Python support, there is typically support for other programming languages, including JavaScript for web integration and Java for platform integration—though oftentimes with fewer features and less maturity. No matter what other platforms are supported, chances are very high that your data scientists will build and train their analytic models with Python.

There is an impedance mismatch between model development using Python, its tool stack and a scalable, reliable data platform with low latency, high throughput, zero data loss and 24/7 availability requirements needed for data ingestion, preprocessing, model deployment and monitoring at scale. Python in practice is not the most well-known technology for these requirements. However, it is a great client for a data platform like Apache Kafka.

Click through for the full argument as well as where Kafka can help mitigate some of the issues.


Why Increase Cost Threshold For Parallelism

Randolph West explains why the default value of cost threshold for parallelism is too low at 5:

Unfortunately, the default setting of 5 means that queries are likely to hit that threshold more often than not on modern hardware, and the optimizer is forced to look at parallel plans unnecessarily. A side-effect to this is that queries running in parallel will block queries running on a single thread, so short-running queries will be delayed by long-running queries needing the same resources.

Therefore, if I am setting up a SQL Server instance from scratch, I will set this value to 50 by default, and monitor the performance counters. Once a query cost exceeds 50, then it can use all the cores up to the limit defined by the max degree of parallelism.

50 is a reasonable first stab at a default. I’ll be honest, though: I’m not sure I know exactly where to set the cost threshold for parallelism. I agree that 5 is too low and 50 is better, but I don’t have a great feel for when it should move up or down short of “everything’s going parallel in your database, so increase the value” or “nothing’s ever going parallel, so maybe decrease it a skosh.”
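If you want to try Randolph's suggested starting point, a minimal T-SQL sketch looks like the following (pick whatever value fits your workload; 50 is only a first stab):

-- Cost threshold for parallelism is an advanced option, so expose it first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Raise the threshold from the default of 5 to 50, then keep monitoring
-- wait stats and the plan cache to see whether the value needs adjusting.
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;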


Power BI Violin Plots

Meagan Longoria shows off a violin plot custom visual in Power BI:

A violin plot is a nifty chart that shows both distribution and density of data. It’s essentially a box plot with a density plot on each side. Box plots are a common way to show variation in data, but their limitation is that you can’t see frequency of values. In other words, you can see statistics such as min, max, median, mean, or quartiles, but you can’t see the individual values nor how often they occurred.

Read on for a review of the custom visual available for violin plots, including areas where it does well and where it falls short at present.


Triggers In Postgres

Ryan Booz explains how triggers work in Postgres from the standpoint of someone familiar with SQL Server:

They are implemented as Functions (Stored Procedures)
This caught me off guard at first. I’ve been working with and dealing with triggers in SQL Server since day 1. They are first-class citizens… objects that have their own code blocks and rules.

PostgreSQL approaches it differently. Any reusable code block, regardless of its true purpose, is a Function of varying types. Triggers are no different. Therefore, you write the logic of your trigger in a Function and then call it by adding a trigger to the DML event of a table.
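A minimal sketch of that pattern (table and column names here are hypothetical): the logic goes into a function that returns the special trigger type, and the CREATE TRIGGER statement just wires it to a table's DML event.

-- The reusable logic lives in a plain function returning trigger.
CREATE FUNCTION set_modified_date() RETURNS trigger AS $$
BEGIN
    NEW.modified_date := now();  -- stamp the row being written
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- The trigger itself only binds that function to a DML event on a table.
CREATE TRIGGER trg_orders_set_modified
BEFORE UPDATE ON orders
FOR EACH ROW
EXECUTE FUNCTION set_modified_date();  -- use EXECUTE PROCEDURE before PostgreSQL 11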

Click through for a few more tips on triggers.


Wait Stats In Azure Data Studio

Paul Randal takes the Wait Stats Report in Azure Data Studio for a spin:

Note that the x-axis is percentage of all waits, not wait count. You’ll see that PREEMPTIVE_OS_FLUSHFILEBUFFERS is the top wait on my Linux instance – that’s by design and I’ll blog about that next. I’ve also submitted a GitHub change to add that wait to the list of waits filtered out by the script the extension uses.

Anyway, you can drill in to the details by clicking the ellipsis at the top-right of the graph and selecting ‘Show Details’. That’ll give all the waits and by selecting each one you can see the usual output from my waits script. To get more information on what each wait means, select the bottom cell, right-click on the URL to copy it, and paste into your favorite browser to go to my waits library. And of course, you can refresh the results via the ellipsis as well.
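The report draws on the same wait-statistics DMV you can query yourself. A rough hand-rolled equivalent (this is not the extension's actual script, and the benign-wait filter here is heavily abbreviated) looks something like:

-- Top waits by share of total wait time, with a token filter for benign waits.
SELECT TOP (10)
    wait_type,
    wait_time_ms / 1000.0 AS wait_time_s,
    100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_total
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH')
ORDER BY wait_time_ms DESC;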

I like how Azure Data Studio is coming together as a full product. There’s a ways to go yet, but it’s getting there.


How’s My Database?

Daniel Janik has a Windows app for you:

There are actually about 40 things it checks for.

Current limitations are that queries with a cursor or temp table are not analyzed. There’s also a bug where the missing indexes and warnings appear on the wrong node/operator. Since the tool is using estimated plans at the moment, it may not be as accurate.

I’m planning on a few new features in the next month: feeding the utility a query plan and displaying the original query. I’m also planning on adding history and the ability to execute a query from the tool. Before we get to those, we need to fix some known bugs, though. I’m hoping that you (yes, you!) can help me identify other bugs to make this a great tool for the SQL community.

The product is in beta, so check it out and send Daniel some feedback.


When Differential Backups Grow Larger Than Fulls

Kenneth Fisher notes that differential backups can end up being larger than full backups of the same database:

The thing about DBA Myths is that they are generally widespread and widely believed. At least I believed this one until I posted What’s a differential Backup? and Randolph West (b/t) pointed out that my belief that differential backups can’t get larger than full backups was incorrect. In fact, differential backups (like FULL backups) contain enough transaction log information to cover transactions that occur while the backup is taking place. So if the amount of data that needs to be backed up combined with transactions requires more space than just the data ….
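The shape of the demonstration is simple (the database name and paths below are made up; Kenneth's post supplies the heavy data churn between the two backups):

-- Take a full backup, change a large portion of the data, then take a
-- differential and compare the sizes of the two backup files.
BACKUP DATABASE [SampleDb] TO DISK = N'C:\Backups\SampleDb_full.bak';

-- ... rebuild indexes, bulk load, or otherwise touch most of the pages ...

BACKUP DATABASE [SampleDb] TO DISK = N'C:\Backups\SampleDb_diff.bak'
    WITH DIFFERENTIAL;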

Read on for a demonstration.
