Press "Enter" to skip to content

Month: March 2020

Monitoring Data Quality on Streaming Data

Abraham Pabbathi and Greg Wood want to check data quality on Spark Streaming data:

While the emergence of streaming in the mainstream is a net positive, there is some baggage that comes along with this architecture. In particular, there has historically been a tradeoff: high-quality data, or high-velocity data? In reality, this is not a valid question; quality must be coupled to velocity for all practical purposes — to achieve high velocity, we need high-quality data. After all, low quality at high velocity will require reprocessing, often in batch; low velocity at high quality, on the other hand, fails to meet the needs of many modern problems. As more companies adopt streaming as a lynchpin for their processing architectures, both velocity and quality must improve.

In this blog post, we’ll dive into one data management architecture that can be used to combat corrupt or bad data in streams by proactively monitoring and analyzing data as it arrives without causing bottlenecks.

This was one of the sticking points of the lambda architecture: new data could still be incomplete and possibly wrong, but until it reached the batch layer, you wouldn’t know it.


Two Query Store Stories

Mark Wilkinson gives us two separate takes on Query Store:

When the Query Data Store (QDS) feature was announced for SQL Server 2016, we were excited about the prospect of being able to have deep insight on any query running in our environment. I work for a company that deals heavily in the e-commerce space, and we have a large SQL Server footprint. Our environment is unique in that it is essentially a multi-tenant system, but all the tenants could have wildly different workloads. It’s really the kind of query execution scenario QDS was built for. We had the pleasure of working with the Microsoft SQLCAT team to get 2016 and QDS up and running in our production environment before it was GA.

In this post I’m going to share two stories about our QDS experience (from pre- and post-GA of the feature): one from the perspective of the Database Developer, and one from the Database Administrator. For the most part this is not a technical post full of queries and code samples. It’s just me talking about some things I have experienced using QDS in production.

CentralQDS, by the way, is really cool. Hopefully we’re able to show that to the world someday (and note how I say “we” even though I did absolutely nothing with it except for being in the same company as the people who developed it).


Improving vCenter Performance Metric Logging

David Klee has some recommendations on settings for vCenter performance metric collection:

By default, vCenter starts rolling up performance metric statistics into aggregates after just one hour. Much of the data necessary for troubleshooting performance challenges reported either the same day or on the previous day is lost from the vCenter data, forcing the administrator to resort to cumbersome and/or time-consuming tooling, such as vRealize Operations Manager. DBAs might not have access to such tools. Hopefully by now they have read-only access to vCenter!

The vCenter performance statistics collection and rollup settings can be customized to provide a longer window of time for critical metrics to be available to the administrator for management.

Click through for some recommendations of aggregation intervals and collection durations to help with virtual machine troubleshooting.


Loading Data from CSVs with Inconsistent Quoted Identifiers

Dave Mason has some fun with loading data from files:

BCP and OPENROWSET are long-lived SQL Server options for working with data in external files. I’ve blogged about OPENROWSET, including a recent article showing a way to deal with quoted data. One of the shortcomings I’ve never been able to overcome is an inconsistent data file, with data fields in some rows enclosed in double quotes but not in others.

Let’s demonstrate with BCP. Below is a sample data file I’ll attempt to load into a SQL Server table. Note the data fields highlighted in yellow, which are enclosed in double quotes and contain the field terminator character (a comma). For reference, the file is also available on GitHub.

I get unduly frustrated with the implementations of various data loaders around SQL Server and how they handle quoted identifiers differently. And don’t get me started on PolyBase.
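
For what it’s worth, the FORMAT = 'CSV' option on BULK INSERT (SQL Server 2017 and later) is one possible way around the mixed-quoting problem, since it parses RFC 4180-style files in which any individual field may or may not be quoted. A minimal sketch, assuming a hypothetical dbo.SampleData table and file path:

-- Assumes dbo.SampleData and the file path exist (both hypothetical).
-- FORMAT = 'CSV' (SQL Server 2017+) uses RFC 4180 parsing, so fields
-- are handled correctly whether or not they are wrapped in quotes.
BULK INSERT dbo.SampleData
FROM 'C:\Data\sample.csv'
WITH (
    FORMAT = 'CSV',
    FIELDQUOTE = '"',       -- the default quote character
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2            -- skip the header row
);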


Azure Data Studio and Query Store

Rob Farley explains why Azure Data Studio can’t be a 100% thing right now:

But the big thing with this new laptop is that I’ve made a conscious decision about what to install on it. And particularly, what things to NOT install.

For example, I’ve only installed SQL inside docker, not directly on Windows. I’m running the Linux version just because I can.

But today I feel like I’ve compromised.

Today I’ve installed SSMS, instead of persisting with just Azure Data Studio. It only took a week for me to cave, and the reason is Query Store.

Read on to understand why. I ended up having to cave on Docker-only for SQL Server because of PolyBase components that I needed and which aren’t (yet?) supported in the Linux version.
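
In the meantime, the Query Store catalog views are plain T-SQL, so you can still interrogate them from Azure Data Studio even without the SSMS reports. A minimal sketch (the views are standard; the TOP (10) cutoff is arbitrary):

-- Top queries by average duration, straight from the Query Store views.
-- avg_duration is stored in microseconds; divide for milliseconds.
SELECT TOP (10)
    q.query_id,
    qt.query_sql_text,
    SUM(rs.count_executions)      AS executions,
    AVG(rs.avg_duration) / 1000.0 AS avg_duration_ms
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt
    ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
    ON rs.plan_id = p.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY avg_duration_ms DESC;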


Finding Long-Running Queries with system_health

Grant Fritchey shows us where we can find long-running queries easily:

Wouldn’t it be great to just quickly and easily take a look at your system to see if you had any queries that ran for a long time, but, without actually doing any work to capture query metrics?

Oh, yeah, I can do that right now, and so can you.

All we need is something that is built into every single server you currently have under management (sorry, not Azure SQL Database) that is SQL Server 2008 or better: system_health

Grant then ties this into general benefits of Extended Events and shows how you can query and view the results.
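
If you want to poke at the raw output yourself, the system_health session writes to an event file target that you can read with plain T-SQL; which specific events matter for long-running queries is what Grant’s post covers. A minimal sketch of reading and shredding the file:

-- Read the system_health event file and shred out event name and time.
SELECT
    x.event_xml.value('(event/@name)[1]', 'varchar(100)')   AS event_name,
    x.event_xml.value('(event/@timestamp)[1]', 'datetime2') AS event_time
FROM (
    SELECT CAST(event_data AS xml) AS event_xml
    FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
) AS x;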


Dynamic Binning with Power BI

Reza Rad has fun with dynamic binning in Power BI:

In the previous article/video, I explained how to create dynamic bins by choosing the count of bins in a slicer in the Power BI report. In this article, I’ll explain how you can do it the other way around: by selecting the size of the bin, you will have bins and buckets dynamically generated.

I like this for its ability to let you select the proper number and size of bins when Power BI is being particularly obstinate about something. In an ideal world, I don’t like this so much as a user-facing feature because we as designers should know the proper number and size of bins.
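
Reza’s implementation is in Power BI, but the underlying idea (bucket each value by a user-selected bin size) translates to a quick T-SQL sketch, with a hypothetical dbo.FactSales table standing in for the data and @BinSize for the slicer selection:

-- Bin a non-negative integer measure by a chosen bin size (hypothetical table).
DECLARE @BinSize int = 10;  -- stands in for the user's slicer selection

SELECT
    (Sales / @BinSize) * @BinSize AS BinStart,  -- integer division floors non-negative values
    COUNT(*)                      AS RowsInBin
FROM dbo.FactSales
GROUP BY (Sales / @BinSize) * @BinSize
ORDER BY BinStart;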


Connecting Kafka to Elasticsearch

Danny Kay and Liz Bennett build an example of writing Kafka topic data to Elasticsearch:

The Elasticsearch sink connector helps you integrate Apache Kafka® and Elasticsearch with minimum effort. You can take data you’ve stored in Kafka and stream it into Elasticsearch to then be used for log analysis or full-text search. Alternatively, you can perform real-time analytics on this data or use it with other applications like Kibana.

For some background on what Elasticsearch is, you can read this blog post by Sarwar Bhuiyan. You can also learn more about Kafka Connect in this blog post by Tiffany Chang and in this presentation from Robin Moffatt.

This is a demo-heavy walkthrough, so check it out.
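
For reference, the sink connector configuration (in standalone .properties form) can be this small; the topic name and connection URL below are placeholders, and the post walks through a fuller setup:

# Minimal Elasticsearch sink connector config (standalone mode).
# Topic name and URL are placeholders for this sketch.
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=orders
connection.url=http://localhost:9200
key.ignore=true
schema.ignore=true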


Visualizing Rating Data

Stephanie Evergreen shows various ways to visualize ratings data:

Stacked Bars *seem* like a good idea – we show 100%, we can fit more questions and data into a similar amount of space – advantages, right? Except that stacked bars are difficult for people to read. How well can you compare the values of the orange segments? Not so much.

If you are going to use stacked bars, make some helpful formatting tweaks, like smarter color coding and an order from greatest to least.

A lot of this comes down to simplification and reduction of possibilities. Read the whole thing.


DISTINCT is not a Function

Lukas Eder corrects a misconception:

A very common misconception I often encounter with SQL users is the idea that DISTINCT is something like a function, and that it can take parenthesised arguments. Just recently, I’ve seen this Stack Overflow question where the OP was looking for a way to express this in jOOQ:

SELECT DISTINCT (emp.id), emp.fname, emp.name FROM employee emp;

Notice the parentheses around (emp.id), which make it look as though this is some special kind of DISTINCT usage, akin to a DISTINCT function. The idea is often that:

– The behaviour is somewhat different from omitting the parentheses
– The performance is faster, because only the ID needs to be considered for distinctness

Both of these ideas are (mostly) wrong, as Lukas shows.
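
To make the equivalence concrete, here is a quick sketch against the employee table from the question; the second statement is what the first one actually means, and the windowed query at the end is one common way to get what the OP likely wanted (my addition, not from Lukas’s post):

-- These two statements are identical: DISTINCT applies to the whole
-- select list, and the parentheses merely group the emp.id expression.
SELECT DISTINCT (emp.id), emp.fname, emp.name FROM employee emp;
SELECT DISTINCT emp.id, emp.fname, emp.name FROM employee emp;

-- If the goal is one row per id, a window function is a common approach
-- (a hedged sketch, not from the original post):
SELECT id, fname, name
FROM (
    SELECT emp.id, emp.fname, emp.name,
           ROW_NUMBER() OVER (PARTITION BY emp.id ORDER BY emp.fname) AS rn
    FROM employee emp
) AS numbered
WHERE rn = 1;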
