Press "Enter" to skip to content

Day: August 28, 2019

Sampling and Estimating Rare Events

Yi Liu takes us through a process to estimate rare events:

Naturally, we get an unbiased estimate of the overall prevalence of violation if we sample the videos uniformly from the population and have them reviewed by human raters to estimate the proportion of violating videos. We also get an unbiased estimate of the violation rate in each policy vertical. But given the low probability of violation and wanting to use our rater capacity wisely, this is not an adequate solution — we typically have too few positive labels in uniform samples to achieve an accurate estimate of the prevalence, especially for those sensitive policy verticals. To obtain a relative error of no more than 20%, we need roughly 100 positive labels, and more often than not, we have zero violation videos in the uniform samples for rarer policies.

This is similar in nature to testing for rare diseases, where a random sample of N people in the population is likely to turn up 0 cases of it.
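A quick back-of-the-envelope check of that 100-positive-labels figure: for a rare violation rate p and a uniform sample of size n, the relative error of the estimated proportion is driven by the number of positives you expect to see,

\[
\frac{\sqrt{p(1-p)/n}}{p} \approx \frac{1}{\sqrt{np}}, \qquad \frac{2}{\sqrt{np}} \le 0.2 \;\Rightarrow\; np \ge 100.
\]

So a 20% relative margin at roughly two standard errors needs about 100 expected positives, and if p is on the order of one in a hundred thousand (an illustrative figure), that means uniformly sampling millions of videos.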


A Plan for Troubleshooting Plans

Bert Wagner takes us through a workflow for troubleshooting performance issues in SQL Server using execution plans:

With the query pasted and formatted in my SSMS editor window, I like retrieving the estimated execution plan first, and then pasting the query into a second editor window and executing the query with the “Include Actual Execution Plan” option turned on. For bonus points, I’ll split the SSMS window vertically so I can start looking at the estimated execution plan while the query runs and returns the actual execution plan. I like this combination because I (almost) immediately receive my estimated execution plan and can start looking for problems. Once the query on the right finishes executing and I get the actual plan with all of its lovely run-time stats, I usually switch to looking at that one.

Even if your approach is quite different, it’s good to compare and contrast.
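Bert drives this from the SSMS toolbar, but the same two artifacts are available as session options if you prefer scripting them. A minimal sketch, where the SELECT is just a stand-in for your real query:

-- Estimated plan: the query is compiled but not executed.
SET SHOWPLAN_XML ON;
GO
SELECT TOP (10) name FROM sys.objects ORDER BY name;
GO
SET SHOWPLAN_XML OFF;
GO

-- Actual plan: the query runs and the plan XML includes run-time statistics.
SET STATISTICS XML ON;
GO
SELECT TOP (10) name FROM sys.objects ORDER BY name;
GO
SET STATISTICS XML OFF;
GO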


Migrating Old Databases to New Versions

Chrissy LeMaire walks us through migrating an old, old application database to SQL Server 2017:

There was even a linked server in the mix, but our biggest concerns revolved around the changing collation and the Agent jobs, which were known to be brittle.

The destination test server was an existing shared server, which mirrored the scenario that would play out in production. And while the databases only needed to exist on the new server for a limited period of time, these migrated databases were going to be the most important databases on the entire instance. This meant that the SQL Server configs were going to have to cater to this app’s needs. One exception was the collation, as the accent sensitivity was determined not to be a big deal and the vendor agreed.

Read on to see how Chrissy did it. The answer, naturally, is with dbatools.
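If you are sizing up a similar move, the collation concern is easy to quantify up front; a quick check along these lines (the database name is a placeholder):

-- Compare the server collation with the collation of the database you plan to migrate.
SELECT
    SERVERPROPERTY('Collation') AS server_collation,
    DATABASEPROPERTYEX(N'YourAppDb', 'Collation') AS database_collation; -- YourAppDb is hypothetical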


CPU Usage DMV in SQL Server on Linux

Tejas Shah announces an improvement to sys.dm_os_ring_buffers in SQL Server 2019 RC1 on Linux:

The sys.dm_os_ring_buffers DMV has been a key DMV used for monitoring SQL Server by built-in tools as well as third-party monitoring utilities. When SQL Server 2017 was released on Linux, unfortunately this DMV did not return correct CPU usage information for the SQL Server process. The SQL Server team is glad to announce that, starting with the SQL Server 2019 release candidate, the sys.dm_os_ring_buffers DMV returns SQL CPU utilization correctly. This improvement should benefit the SQL Server monitoring ecosystem on Linux by providing a way to monitor SQL Server CPU usage and enabling decisions about corrective action when required.

This brings it in line with what we have on Windows.
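For reference, the CPU history that monitoring tools typically pull from this DMV comes out of the scheduler monitor ring buffer, roughly like this (each record covers about a minute):

SELECT TOP (30)
    record.value('(./Record/@id)[1]', 'int') AS record_id,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_process_cpu_percent,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS system_idle_percent
FROM
(
    SELECT CONVERT(xml, record) AS record
    FROM sys.dm_os_ring_buffers
    WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
      AND record LIKE N'%<SystemHealth>%'
) AS rb
ORDER BY record_id DESC;

The ProcessUtilization value is the piece that was coming back wrong on Linux before this fix.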


The Structure of Graph Data

Mala Mahadevan begins a series on graph data in SQL Server:

The simplest way to understand a graph data model is that there are just two entities: Nodes, which are what we call entities in the relational world, and Edges, which are what we call relationships. They are typically represented with circles standing for nodes and arrows standing for relationships. The emphasis, as we can see, is on the bold arrows, because relationships are what graph data is about, with less emphasis on entities/nodes.

Read the whole thing.
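In SQL Server's implementation (2017 and later), nodes and edges are just tables declared AS NODE and AS EDGE, and MATCH walks the arrows. A minimal sketch, with invented table and column names:

CREATE TABLE dbo.Person
(
    PersonId INT PRIMARY KEY,
    PersonName NVARCHAR(100) NOT NULL
) AS NODE;

CREATE TABLE dbo.FriendOf AS EDGE;

INSERT INTO dbo.Person (PersonId, PersonName)
VALUES (1, N'Alice'), (2, N'Bob');

-- An edge row connects two nodes via their $node_id pseudo-columns.
INSERT INTO dbo.FriendOf ($from_id, $to_id)
VALUES
(
    (SELECT $node_id FROM dbo.Person WHERE PersonName = N'Alice'),
    (SELECT $node_id FROM dbo.Person WHERE PersonName = N'Bob')
);

-- Traversal emphasizes the relationship: find everyone Alice points to.
SELECT p2.PersonName
FROM dbo.Person AS p1,
     dbo.FriendOf AS f,
     dbo.Person AS p2
WHERE MATCH(p1-(f)->p2)
  AND p1.PersonName = N'Alice';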


PARSE, CAST, and CONVERT

Max Vernon gives us three ways to change data types:

PARSE provides a mechanism to convert a wide variety of character-based dates into a datetime data type. From the Docs:

Returns the result of an expression, translated to the requested data type in SQL Server.

Use PARSE only for converting from string to date/time and number types. For general type conversions, continue to use CAST or CONVERT. Keep in mind that there is a certain performance overhead in parsing the string value.

PARSE has its value, but comes at a cost, as Max shows.
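A quick side-by-side of the three, with throwaway literal values:

-- PARSE: culture-aware string parsing; flexible, but carries the parsing overhead noted above.
SELECT PARSE('28 August 2019' AS datetime2 USING 'en-US') AS parsed_value;

-- CAST: the ANSI-standard syntax, with no style or culture options.
SELECT CAST('2019-08-28' AS date) AS cast_value;

-- CONVERT: SQL Server-specific, with style codes (103 = dd/mm/yyyy).
SELECT CONVERT(date, '28/08/2019', 103) AS converted_value;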


Two Takes on Power BI Time Dimensions

Brett Powell gives us a version of a time dimension in Power Query:

Like most dimensions in a data model, users will generally begin their analysis of time at a high level (e.g. hour) and then filter and drill their way into greater levels of detail, often leveraging hierarchies built into the model and/or drilling functionality in tools like Power BI. To support analysis by multiple grains and hierarchies, the query produces columns which group the seconds into hourly quartiles, hours, and minutes.

Gilbert Quevauvilliers has a script to generate a time dimension as well:

Below is the syntax to create a Time Dimension Table in Power Query

I had a requirement where I needed to create a Time Dimension for a customer. Most of the time I only need the date. Upon searching I could not find a resource where they had created the time dimension only using Power Query.

Gilbert’s script gives you a key insight into the value of time dimensions: breaking everything out into periods, whether that’s 5-minute intervals, 15-minute intervals, or however the company looks at data.
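Both posts build the table in Power Query; purely to make the bucketing idea concrete, the same shape sketched in T-SQL (not their code) is one row per minute with interval columns:

-- One row per minute of the day, with 5- and 15-minute bucket columns.
-- Assumes sys.all_objects has at least 1,440 rows, which it does on any modern instance.
SELECT
    t.n AS MinuteOfDay,
    DATEADD(MINUTE, t.n, CAST('00:00' AS time(0))) AS TimeOfDay,
    t.n / 60 AS HourOfDay,
    (t.n / 5) * 5 AS FiveMinuteBucket,
    (t.n / 15) * 15 AS FifteenMinuteBucket
FROM
(
    SELECT TOP (1440) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects
) AS t
ORDER BY t.n;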
