Press "Enter" to skip to content

Curated SQL Posts

Area Under The ROC Is Not Accuracy

Stephen Chen debunks bad journalistic summaries of a Google research paper:

Journalists latched onto Google’s NN 0.95 score vs. the comparison 0.86 (see EWS Strawman below), as the accuracy of determining mortality. However the actual metric the researchers used is AUROC (Area Under Receiver Operating Characteristic Curve) and not a measure of predictive accuracy that indexes the difference between the predicted vs. actual like RMSE (Root Mean Squared Error) or MAPE (Mean Absolute Percentage Error). Some articles even erroneously try to explain the 0.95 as the odds ratio.

Just as the concept of significance has different meanings to statisticians and laypersons, AUROC as a measure of model accuracy does not mean the probability of Google’s NN predicting mortality accurately as journalists/laypersons have taken it to mean. The ROC (see sample above) is a plot of a model’s False Positive Rate (i.e. predicting mortality where there is none) vs. the True Positive Rate (i.e. correctly predicting mortality). A larger area under the curve (AUROC) means the model produces less False Positives, not the certainty of mortality as journalists erroneously suggest.

The researchers themselves made no claim to soothsayer abilities, what they said in the paper was:

… (their) deep learning model would fire half the number of alerts of a traditional predictive model, resulting in many fewer false positives.

It’s an interesting article and a reminder of the importance of terminological precision (something I personally am not particularly good at).

Comments closed

How Perfmon Memory Counters Fit Together

Lonny Niederstadt takes us through a tour of how various Perfmon memory counters relate:

Wading through all of the SQL Server memory-related perfmon counters to understand how they related to each other took me a really long time.  Time-series graphs that show the relationship help me tremendously, and when I started trying to account for SQL Server memory years ago I couldn’t find any.  So I started to blog some time-series graphs, under the theory that either my understanding was correct and my graphs would be helpful to someone… or they’d be wrong and someone would correct me.
Well… its been about 5 years and my graphs haven’t generated too much discussion, but they’ve really helped me 😀😀😀

Perfmon: SQL Server Database pages + Stolen pages + Free pages = Total pages
http://sql-sasquatch.blogspot.com/2013/09/perfmon-database-pages-stolen-pages.html

Working with SQL Server 2016 and some demanding ColumnStore batch mode workloads, I began to see suspicious numbers, and graphs that didn’t make sense to me.  Today I got pretty close to figuring it out so I wanted to share what I’ve learned.

The following graphs are from a 4×10 physical server running Windows and SQL Server.  Four sockets, 4 NUMA nodes.

For bonus points, Lonny traces down a problem where expectations aren’t meeting reality.

Comments closed

Alleviating tempdb Contention

Pam Lahoud has some advice for those with tempdb-heavy workloads:

TL;DR – Update to the latest CU, create multiple tempdb files, if you’re on SQL 2014 or earlier enable TF 1117 and 1118, if you’re on SQL 2016 enable TF 3427.

And now it’s time for everyone’s favorite SQL Server topic – tempdb! In this article, I’d like to cover some recent changes that you may not be aware of that can help alleviate some common performance issues for systems that have a very heavy tempdb workload. We’re going to cover three different scenarios here:

  1. Object allocation contention

  2. Metadata contention

  3. Auditing overhead (even if you don’t use auditing)

There’s some good information in here so don’t just say tl;dr.

Comments closed

Reading The Transaction Log

Nesha Maric shows us a couple of methods and a third-party tool for reading the SQL Server transaction log:

Update operations in SQL Server are not fully logged in the transaction log. Full before-and-after values, unfortunately don’t exist, only the delta of the change for that record. For example, SQL Server may show a change from “H” to “M” when the actual record that was changed was from “House” to “Mouse”. To piece together the full picture a process must be devised to manually reconstruct the history of changes, including the state of the record prior to update. This requires painstakingly re-constructing every record from the original insert to the final update, and everything in between.

BLOBs are another challenge when trying to use fn_dblog to read transaction history. BLOBs, when deleted, are never inserted into the transaction log. So examining the transaction log won’t provide information about its existence unless the original insert can be located. But only by combining these two pieces of data will you be able to recover a deleted BLOB file. This obviously requires that the original insert exists in the active/online portion of the transaction log, the only part accessible to fn_dblog. This may be problematic if the original insert was done some weeks, months or years earlier and the transaction log has been subsequently backed up or truncated

I’ve tried to avoid messing directly with the transaction log whenever possible, but there are scenarios where it’s the only place you have needed information.

Comments closed

When Multiple Missing Indexes Exist

Brent Ozar shows what happens when there are multiple missing indexes for a query:

SQL Server Management Studio only shows you the first missing index recommendation in a plan.

Not the best one. Not all of them. Just whichever one happens to show up first.

Using the public Stack Overflow database, I’ll run a simple query:

But that behavior isn’t the case for all tools; SQL Operations Studio is a bit different.

Comments closed

Problem With Merge Replication And FILESTREAM

Gianluca Sartori walks us through an error when combining merge replication with FILESTREAM:

I published tables with FILESTREAM data before, but it seems like there is a particular planetary alignment that triggers an error during the execution of the snapshot agent.

This unlikely combination consists in a merge article with a FILESTREAM column and two UNIQUE indexes on the ROWGUIDCOL column. Yes, I know that generally it does not make sense to have two indexes on the same column, but this happened to be one of the cases where it did, so we had a CLUSTERED PRIMARY KEY on the uniqueidentifier column decorated with the ROWGUIDCOL attribute and, on top, one more NONCLUSTERED UNIQUE index on the same column, backed by a UNIQUE constraint.

Setting up the publication does not throw any error, but generating the initial snapshot for the publication does:

Cannot create, drop, enable, or disable more than one constraint,
column, index, or trigger named 'ncMSmerge_conflict_TestMergeRep_DataStream'
in this context. Duplicate names are not allowed.

This is a rather specific confluence of events, so it probably won’t affect many people.  Still, it is a bug.

Comments closed

Enabling LDAP Authentication On Cassandra

Kurt Greaves shows off a new LDAP authenticator for Apache Cassandra:

The LDAPAuthenticator is implemented using JNDI, and authentication requests will be made by Cassandra to the LDAP server using the username and password provided by the client. At this time only plain text authentication is supported.

If you configure a service LDAP user in the ldap.properties file, on startup Cassandra will authenticate the service user and create a corresponding role in the system_auth.roles table. This service user will then be used for future authentication requests received from clients. Alternatively (not recommended), if you have anonymous access enabled for your LDAP server, the authenticator allows authentication without a service user configured. The service user will be configured as a superuser role in Cassandra, and you will need to log in as the service user to define permissions for other users once they have authenticated.

The authenticator itself is hosted on GitHub, so you can check out its repo too.

Comments closed

RStudio Integration With Databricks

Brian Dirking, et al, announce support between RStudio and the Databricks platform:

With Databricks RStudio Integration, both popular R packages for interacting with Apache Spark, SparkR or sparklyr can be used the inside the RStudio IDE on Databricks. When multiple users use a cluster, each creates a separate SparkR Context or sparklyr connection, but they are all talking to a single Databricks managed Spark application allowing unique opportunities for collaboration between users. Together, RStudio can take advantage of Databricks’ cluster management and Apache Spark to perform such as a massive model selection as noted in the figure below.

I like seeing this level of integration, especially from a language like R, which has historically been limited to operating on a single machine’s memory.

Comments closed

Auto-Scaling Amazon ElasticMapReduce

Brandon Schller gives us some tips on sizing and scaling ElasticMapReduce:

EMR scaling is more complex than simply adding or removing nodes from the cluster. One common misconception is that scaling in Amazon EMR works exactly like Amazon EC2 scaling. With EC2 scaling, you can add/remove nodes almost instantly and without worry, but EMR has more complexity to it, especially when scaling a cluster down. This is because important data or jobs could be running on your nodes.

To prevent data loss, Amazon EMR scaling ensures that your node has no running Apache Hadoop tasks or unique data that could be lost before removing your node. It is worth considering this decommissioning delay when resizing your EMR cluster. By understanding and accounting for how this process works, you can avoid issues that have plagued others, such as slow cluster resizes and inefficient automatic scaling policies.

If you’re using EMR today or think you might use it in the future, you should read this.

Comments closed

Using DAX With Reporting Services

David Stelfox gives us an example of using DAX when connecting SQL Server Reporting Services to a SQL Server Analysis Services Tabular model:

Tools like Power BI have changed reporting allowing power users to leverage tabular cubes to present information quicker and without the (perceived) need for developers. However, experience tells us many users still want data in tables with a myriad of formatting and display rules. Power BI is not quite there yet in terms of providing all this functionality in the same way that SSRS is. For me, SSRS’s great value and, at the same time its curse, is the sheer amount of customisation a developer can do. I have found that almost anything a business user demands in terms of formatting and display is possible.

But you have invested your time and money in a tabular SSAS model which plays nicely with Power BI but your users want SSRS reports so how to get to your data – using DAX, of course. Using EVALUATE, SUMMARIZECOLUMNS and SELECTCOLUMNS you can return data from a tabular model in a tabular format ready to be read as a dataset in SSRS.

It’s a good post and a good example.  The only quibble I have is in the motivating paragraph; Power BI and SQL Server Reporting Services have different end goals—Power BI isn’t (and I think never will be) a pixel-perfect report building product; it’s meant to be a dashboarding technology.  That quibble aside, the example is well worth checking out.

Comments closed