Press "Enter" to skip to content

Curated SQL Posts

When Using DBCC DROPCLEANBUFFERS

Dan Guzman shares words of wisdom about using DBCC DROPCLEANBUFFERS when testing query performance in SQL Server:

One can make the argument that DBCC DROPCLEANBUFFERS might not be particularly valuable for testing. First, the storage engine in SQL Server Enterprise Edition (or Developer Edition, which is often used when testing) behaves differently with a cold cache versus a warm one. With a warm cache, a page not already in cache (e.g. index seek by primary key) will be fetched from disk using a single 8K page IO request as one expects. However, when the cache isn’t fully warmed up (Buffer Manager’s Target Pages not yet met), the entire 64K extent (8 contiguous 8K pages) is read for the single page request regardless of whether the adjacent pages are actually needed by the query. This has the benefit of warming the cache much more quickly than would otherwise occur, but given that the normal steady state of a production SQL Server is a warm cache, testing with a cold cache isn’t a fair comparison of different plans. More data than normal will be transferred from storage so timings may not be indicative of actual performance.

I don’t think I agree 100% with that argument, but I am sympathetic to it.  Still, Dan has great advice in this post.
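
If you do go down the cold-cache route on a test box, the usual pattern pairs DBCC DROPCLEANBUFFERS with a CHECKPOINT so that dirty pages are written out first and the buffer pool really does end up empty.  A minimal sketch (the database objects and query are made up):

-- run only against a test or development instance: this empties the buffer pool
CHECKPOINT;              -- write dirty pages to disk so they become clean buffers
DBCC DROPCLEANBUFFERS;   -- drop the clean buffers, leaving a cold cache
GO

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- hypothetical query under test
SELECT COUNT(*)
FROM dbo.SomeLargeTable
WHERE SomeColumn = 42;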


Layout Images In Power BI

Meagan Longoria has some tips for using layout images in Power BI:

Using layout images in Power BI has become a popular design trend. When I say layout images, I’m referring to background images with shapes around areas where visuals are placed. This is different from the new wallpaper feature that became available in the July release, which can be used to format the grey area outside your report page and extend the main color of background images.

Layout images can help with spacing and alignment within a report and can help create consistency across reports. They can also help create affordances, using consistent layout and design to make it obvious how users should interact with our reports.

I use layout images in some of my reports, but I don’t think they are necessary on every report. There are a couple of things to consider when using layout images.

Read on for an example of a good layout image versus a bad one, as well as tips and tricks on how to create good layout images.


Bug In July Windows Updates Causing “TCP port is already in use” Errors

Jordon Riel warns us about a recent Windows update which can cause SQL Server’s database engine to fail to start up:

We have recently become aware of a regression, introduced in the July 10, 2018 Windows updates for Windows 7/Server 2008 R2 and Windows 8.1/Server 2012 R2, in one of the TCP/IP functions that manages the TCP port pool.

This regression may cause the restart of the SQL Server service to fail with the error, "TCP port is already in use". We have also observed this issue preventing Availability Group listeners from coming online during failover events, both planned and unplanned. When this occurs, you may observe errors similar to the following in the SQL ERRORLOGs:

Error: 26023, Severity: 16, State: 1.
Server TCP provider failed to listen on [ <IP ADDRESS> <ipv4> <PORT>]. Tcp port is already in use.
Error: 17182, Severity: 16, State: 1.
TDSSNIClient initialization failed with error 0x2740, status code 0xa. Reason: Unable to initialize the TCP/IP listener. Only one usage of each socket address (protocol/network address/port) is normally permitted. 
Error: 17182, Severity: 16, State: 1.
TDSSNIClient initialization failed with error 0x2740, status code 0x1. Reason: Initialization failed with an infrastructure error. Check for previous errors. Only one usage of each socket address (protocol/network address/port) is normally permitted. 
Error: 17826, Severity: 18, State: 3.
Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.
Error: 17120, Severity: 16, State: 1.
SQL Server could not spawn FRunCommunicationsManager thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

Read on for the solution.


Using rquery On Databricks

Nina Zumel and John Mount talk about rquery, a relational data transformation engine for R which runs on Spark:

rquery is based on an appreciation of Codd’s relational algebra. Codd’s relational algebra is a formal algebra that describes the semantics of data transformations and queries. Previous, hierarchical, databases required associations to be represented as functions or maps. Codd relaxed this requirement from functions to relations, allowing tables that represent more powerful associations (allowing, for instance, two-way multimaps).

Codd’s work allows most significant data transformations to be decomposed into sequences made up from a smaller set of fundamental operations:

  • select (row selection)
  • project (column selection/aggregation)
  • Cartesian product (table joins, row binding, and set difference)
  • extend (derived columns; the keyword comes from Tutorial D)

One of the earliest and still most common implementations of Codd’s algebra is SQL. Formally, Codd’s algebra assumes that all rows in a table are unique; SQL further relaxes this restriction to allow multisets.

rquery is another realization of the Codd algebra that implements the above operators, some higher-order operators, and emphasizes a right to left pipe notation. This gives the Spark user an additional way to work effectively.

They include a fairly lengthy example and give a great introduction to the tool.  It’s now officially on my list of stuff to try out.
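
To give a flavor of the notation, here is a rough sketch of an rquery pipeline over a toy data frame, using the operators listed above.  The data and column names are my own, and the operator and argument names are from my reading of the rquery API, so check the package documentation before borrowing this:

library(rquery)
library(wrapr)   # supplies the %.>% "dot arrow" pipe

# toy data standing in for a remote table
d <- data.frame(group = c("a", "a", "b"), value = c(1, 5, 3))

# build an operator tree out of Codd-style steps
ops <- local_td(d) %.>%
  select_rows(., value > 1) %.>%                        # row selection
  extend(., value2 = value * 2) %.>%                    # derived column
  project(., total = sum(value2), groupby = "group")    # aggregation

# execute locally via rqdatatable; against Spark or a database the same
# operator tree can be rendered to SQL instead
library(rqdatatable)   # registers a local executor for rquery operator trees
d %.>% ops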


Explaining Text Classification Models With LIME

Shirin Glander shows us how to use LIME to explain which words help us classify whether a user liked a particular item:

Okay, not a perfect score but good enough for me – right now, I’m more interested in the explanations of the model’s predictions. For this, we need to run the lime() function and give it

  • the text input that was used to construct the model
  • the trained model
  • the preprocessing function
library(lime)

explainer <- lime(clothing_reviews_train$text,  # the text used to train the model
                  xgb_model,                    # the trained xgboost model
                  preprocess = get_matrix)      # preprocessing function (text -> model matrix)

With this, we could right away call the interactive explainer Shiny app, where we can type any text we want into the field on the left and see the explanation on the right: words that are underlined green support the classification, red words contradict them.

I hadn’t used LIME for this before, and it looks very interesting.  H/T R-Bloggers
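
If I’m reading the lime package correctly, the Shiny app Shirin mentions is launched with interactive_text_explanations(), and static explanations come from explain(); a sketch reusing her explainer, with a hypothetical held-out set:

# static explanations for a few (hypothetical) held-out reviews
explanation <- explain(clothing_reviews_test$text[1:4], explainer,
                       n_labels = 1,    # explain the predicted class only
                       n_features = 5)  # show the five most influential words
plot_text_explanations(explanation)

# or explore interactively in the Shiny app
interactive_text_explanations(explainer)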


Querying Data In Temporal Tables

Jeanne Combrinck shows us how to query data stored in temporal tables:

When you want to get latest (actual) state of data in a temporal table, you can query the same way as you query a normal table. If the PERIOD columns are not hidden, their values will appear in a SELECT * query. If you specified PERIOD columns as hidden, their values won’t appear in a SELECT * query. When the PERIOD columns are hidden, reference the PERIOD columns specifically in the SELECT clause to return the values for these columns.

To perform any type of time-based analysis, use the new FOR SYSTEM_TIME clause with the following temporal-specific sub-clauses to query data across the current and history tables:

  • AS OF <date_time>
  • FROM <start_date_time> TO <end_date_time>
  • BETWEEN <start_date_time> AND <end_date_time>
  • CONTAINED IN (<start_date_time>, <end_date_time>)
  • ALL

Do check it out.  The big downside I’ve seen to temporal tables is that I can’t query a temporal table as of a per-row timestamp in another table.
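
As a quick reference, here is a minimal sketch of the AS OF sub-clause, plus the workaround I usually reach for on that per-row limitation: FOR SYSTEM_TIME ALL with an explicit range predicate on the period columns.  All table and column names below are made up:

DECLARE @asof datetime2 = '2018-07-01T00:00:00';

-- state of a hypothetical temporal table at a single point in time
SELECT CustomerID, Name, ValidFrom, ValidTo
FROM dbo.Customer
    FOR SYSTEM_TIME AS OF @asof;

-- per-row "as of" against another table: use ALL and filter on the period columns
SELECT o.OrderID, c.Name
FROM dbo.Orders AS o
    JOIN dbo.Customer FOR SYSTEM_TIME ALL AS c
        ON c.CustomerID = o.CustomerID
       AND o.OrderDate >= c.ValidFrom
       AND o.OrderDate <  c.ValidTo;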


The Halloween Problem And Deleting From Self-Referencing Tables

Louis Davidson walks us through a case where deleting from a table can be extremely slow:

I have a great interest in hierarchy solutions, as I plan to write a book on implementing hierarchies in SQL Server in the next year or so (after I see what happens in SQL Server vNext after 2017). Something I didn’t include in my design testing when I built my presentation on hierarchies (available from a link here) a few years back was the cost to remove nodes from a tree. I learned a big lesson today about the cost of deleting from a hierarchy (the hard way). It is definitely something I will be digging deeper into at some point in a lab environment, comparing multiple methods of deleting from a hierarchy, but I wanted to write down the issue with self-referencing tables for future use (and hopefully for your use too).

We have a table with hundreds of millions of rows, and deletes from this table were taking forever. There were 17 foreign key constraints to this table, but each of the references was only showing as 1% of the plan. The physical DELETE from the clustered index operator was ~80%, but an operator showed up in the plan that looked ominous.

Definitely worth a read.
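
To sketch the shape of the problem (this is not Louis’s actual schema): with a self-referencing foreign key, the read side and the write side of a DELETE touch the same table, so the optimizer may add Halloween protection in the form of an eager spool, and that spool gets very expensive at hundreds of millions of rows.

-- minimal self-referencing hierarchy, hypothetical names
CREATE TABLE dbo.Node
(
    NodeId       int NOT NULL CONSTRAINT PK_Node PRIMARY KEY,
    ParentNodeId int NULL CONSTRAINT FK_Node_Parent
                 REFERENCES dbo.Node (NodeId)
);

-- deleting leaf nodes both reads from and writes to dbo.Node;
-- check the execution plan for an Eager Spool feeding the foreign key check
DELETE n
FROM dbo.Node AS n
WHERE NOT EXISTS (SELECT 1 FROM dbo.Node AS c WHERE c.ParentNodeId = n.NodeId);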


Using Process Monitor With Power Query

Chris Webb has a couple of interesting posts on using Process Monitor, a venerable Sysinternals tool, to troubleshoot Power Query performance issues.  First, Chris looks at how often Power Query opens and closes a file during data processing:

Troubleshooting Power Query performance issues in Power BI and Excel can be difficult because it’s a bit of a black box: there’s nothing in the UI to tell you what’s going on inside the Power Query engine and the diagnostic logs are very difficult to interpret. With a relational data source like SQL Server you can use tools like SQL Server Profiler to see the queries that are being run by Power Query, and I blogged recently about using Fiddler to troubleshoot OData performance issues; but what about file-based data sources, which often present the most challenges regarding performance?

Process Monitor, a free tool from Microsoft, allows you to monitor file system activity in real-time and even having spent a limited amount of time using it I can already tell that it can provide a lot of information to help identify performance issues with file-based data sources. Take, for example, the scenario I described in my recent post on improving the performance of merge operations. In that post (which I suggest you read before you carry on) I mentioned that it looked as though the Power Query engine was reading data from one of the source files multiple times, and Process Monitor confirms that this is indeed the case.

Then he shows us just how much data Power Query is reading each time it opens the file:

This post is really just a quick follow-on from my post earlier this week on using Process Monitor to troubleshoot Power Query performance issues with file-based data sources, which I suggest you read before carrying on. I realised, after playing around with Process Monitor some more, that the ReadFile operation actually tells you how much data is being read from a file when a Power Query query is running. For example, here’s a sample of some of the ReadFile operations captured while running the unoptimised version of the query I talked about in my last post:

I get the feeling that really learning a few Sysinternals tools can be extremely useful, much like learning Wireshark can make you a better DBA.


Tableau And Power BI: Complementarities

Jen Stirrup argues that there is a good reason to use both Tableau and Power BI in the same environment:

A question I often hear is this: Which tool should I use, Tableau or Power BI? The truth is: They are not mutually exclusive.

Tableau is great at business mysteries: ill-defined questions where you have to surf the data for results. Power BI is particularly great at modelling and cleaning the data, with clean, crisp data visualisation and the ability to use custom and open-source data visualizations. This blog isn’t aimed at the technical user, but at the analyst who needs to get information out quickly. I will do another post, aimed at the geeks, another time.

I am all about embracing the power of And.  The downside is, my corporate budget went bankrupt just thinking about the cost of both of them together…


Visualizing Linear Regression Results

Bernardo Lares gives us a few ways of visually interpreting linear regression results in R:

The most obvious plot to study for a linear regression model, you guessed it, is the regression itself. If we plot the predicted values vs the real values, we can see how close they are to our reference line of 45° (intercept = 0, slope = 1). If we have a very sparse plot with no clear tendency along that line, then we have a bad regression. On the other hand, if all our points lie right on the line, I’d bet you fed the model the results you wished for!

Then, the Adjusted R-squared on the plot gives us an easy parameter with which to compare models and see how well they fit our reference line. The nearer this value gets to 1, the better. Without getting too technical, if you add more and more useless variables to a model, this value will decrease; but if you add useful variables, the Adjusted R-squared will improve.

We also get the RMSE and MAE (Root Mean Squared Error and Mean Absolute Error) for our regression’s results. MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. RMSE, on the other hand, is a quadratic scoring rule that also measures the average magnitude of the error: it’s the square root of the average of squared differences between prediction and actual observation. Both metrics range from 0 to ∞ and are indifferent to the direction of errors. They are negatively-oriented scores, which means lower values are better.

I like this approach to explaining models.
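
If you want to reproduce the core idea without Bernardo’s helper functions, here is a rough base-R/ggplot2 sketch over made-up data: fit a model, compute the Adjusted R-squared, RMSE, and MAE, and plot predictions against actuals with the 45° reference line.

library(ggplot2)

# toy data and model (hypothetical)
set.seed(1)
df   <- data.frame(x = runif(200, 0, 10))
df$y <- 3 + 2 * df$x + rnorm(200, sd = 2)
fit  <- lm(y ~ x, data = df)

pred   <- predict(fit, df)
rmse   <- sqrt(mean((df$y - pred)^2))   # Root Mean Squared Error
mae    <- mean(abs(df$y - pred))        # Mean Absolute Error
adj_r2 <- summary(fit)$adj.r.squared    # Adjusted R-squared

# predicted vs real values against the 45-degree reference line
ggplot(data.frame(real = df$y, predicted = pred), aes(real, predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(title = sprintf("Adj. R2 = %.3f | RMSE = %.2f | MAE = %.2f",
                       adj_r2, rmse, mae))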
