Day: August 2, 2018

Highlighting Data With gghighlight

Laura Ellis shows off the gghighlight package, which lets you selectively highlight certain sets of data in ggplot:

While the above methodology is quite easy, it can be a bit of a pain at times to create and add the new data frame.  Further, you have to tinker more with the labelling to really call out the highlighted data points.

Thanks to Hiroaki Yutani, we now have the gghighlight package, which does most of the work for us with a small function call! Please note that a lot of this code was created by looking at examples in the package's introduction document.

The new school way is even simpler:

  1. Using ggplot2, create a plot with your full data set.

  2. Add the gghighlight() function to your plot with the conditions set to identify your subset.

  3. Celebrate! This was one less step AND we got labels!

That’s a very cool package.  H/T R-Bloggers

Stoppable, Async Shiny Interfaces

Ian at Fells Stats wants to make a long-running Shiny app a bit more user-friendly:

Shiny operates in a reactive programming framework. Fundamentally this means that any time any UI element that affects the result changes, so does the result. This happens automatically, with your analysis code running every time a widget is changed. In a lot of cases, this is exactly what you want and it makes Shiny programs concise and easy to make; however, in the case of long-running processes, this can lead to frozen UI elements and a frustrating user experience.

The easiest solution is to use an Action Button and only run the analysis code when the action button is clicked. Another important component is to provide your user with feedback as to how long the analysis is going to take. Shiny has nice built-in progress indicators that allow you to do this.

There are a couple of false starts in there, but by the time you reach the third act, the story makes sense.  H/T R-Bloggers
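
The general pattern described in the excerpt (start the work only on an explicit trigger, report progress, and let the user stop it) can be sketched outside of Shiny. The following is not Shiny code and not Ian's implementation; it is a minimal Python illustration, and the names `long_analysis` and `run_on_click` are hypothetical.

```python
import threading

def long_analysis(steps, cancel, progress):
    """Run `steps` units of work, recording progress and honouring cancellation."""
    total = 0
    for i in range(steps):
        if cancel.is_set():
            return None  # the user asked to stop, so abandon the run
        total += i  # stand-in for one slice of the real analysis
        progress.append((i + 1) / steps)  # fraction complete so far
    return total

def run_on_click(steps, cancel, progress):
    """Start the analysis on a background thread, as a button handler might."""
    result = {}

    def worker():
        result["value"] = long_analysis(steps, cancel, progress)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread, result

# A run that is allowed to finish.
cancel = threading.Event()
progress = []
thread, result = run_on_click(5, cancel, progress)
thread.join()
print(result["value"], progress[-1])

# A run that is cancelled before it starts doing any work.
cancelled = threading.Event()
cancelled.set()
cancelled_progress = []
cancelled_thread, cancelled_result = run_on_click(5, cancelled, cancelled_progress)
cancelled_thread.join()
print(cancelled_result["value"], cancelled_progress)
```

Because the work runs on its own thread and checks the cancel flag between slices, the caller stays free to update a progress display or set the flag at any time.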

Classes And Vectors In R

Dave Mason continues his journey toward learning R.  He looks next at the class() function:

Note the value assigned to horse_power is a whole number (integer) and the value assigned to miles_per_gallon is a rational number. But R tells us they are both of the “numeric” class. R does have an integer class. A variable’s class will be “integer” if the value is followed by a capital “L”. Let’s reassign a value to horse_power to demonstrate:

> horse_power <- 240L
> class(horse_power)
[1] "integer"

Another way to determine the class of a variable is to use one of the is.*() functions. For example, is.integer() and is.numeric() tell us the miles_per_gallon is not an integer, and is a numeric:

> is.integer(miles_per_gallon)
[1] FALSE
> is.numeric(miles_per_gallon)
[1] TRUE

There’s also the typeof() function and the mode() function, and all three can differ under certain circumstances.

Next up, Dave hits vectors, the simplest of the interesting data types in R:

It’s important to know that the elements of a vector must be of the same class (data type). If the values passed to the c() function are of different classes, some of them will be coerced to a different class to ensure all elements of the vector share the same class. Below, the parameter classes passed to the c() function include character, numeric, and integer. The corresponding numeric and integer parameter values are coerced to character within the vector:

> some_data <- c("a", "b", 7.5, 25L)
> some_data
[1] "a" "b" "7.5" "25"

Read on for more about vectors.

Configuring SQL Server Management Studio

Brent Ozar shares his configuration settings for SQL Server Management Studio:

Under Query Results, SQL Server, Results to Grid, I change my XML data size to unlimited so that it brings back giant query plans. (Man, does my job suck sometimes.)

A lot of presenters like to check the box for “Display results in a separate tab” and “Switch to results tab after the query executes” because this gives them more screen real estate for the query and results. I’m just really comfortable with Control-R to hide the results pane.

And I just went and removed a bunch of menu bar icons I never use…  Good advice from Brent.

Pivoting And Unpivoting Data In T-SQL

Jeanne Combrinck shows how to use the PIVOT and UNPIVOT operators in SQL Server:

One thing that I still get confused about writing is pivot queries. I find myself needing to look up the syntax every time. Basically you use Pivot and Unpivot to change the output of a table. If you would like rows turned into columns you can use pivot and for the opposite you can use unpivot.

One thing to note is the column identifiers in the unpivot clause follow the catalog collation. For SQL Database, the collation is always SQL_Latin1_General_CP1_CI_AS. For SQL Server partially contained databases, the collation is always Latin1_General_100_CI_AS_KS_WS_SC. If the column is combined with other columns, then a collate clause (COLLATE DATABASE_DEFAULT) is required to avoid conflicts.

Click through for an example of each.
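
To make the "rows into columns" idea concrete without relying on T-SQL syntax, here is a minimal Python sketch of the two reshaping directions. It is not Jeanne's example: the `sales` data and the `pivot` and `unpivot` helpers are hypothetical, and the sketch assumes exactly one value per (product, quarter) pair, whereas the T-SQL PIVOT operator applies an aggregate function to each group.

```python
from collections import defaultdict

# Hypothetical sample data in "long" form: one row per (product, quarter).
sales = [
    ("widget", "Q1", 10),
    ("widget", "Q2", 15),
    ("gadget", "Q1", 7),
    ("gadget", "Q2", 12),
]

def pivot(rows):
    """Rows to columns: collect each key's (column, value) pairs into one record."""
    table = defaultdict(dict)
    for key, column, value in rows:
        table[key][column] = value
    return dict(table)

def unpivot(table):
    """Columns to rows: expand each record back into (key, column, value) rows."""
    return [
        (key, column, value)
        for key, columns in table.items()
        for column, value in columns.items()
    ]

wide = pivot(sales)
long_again = unpivot(wide)
print(wide)
print(long_again)
```

In this sketch, unpivoting the pivoted result recovers the original rows, which only holds because no aggregation collapsed multiple values into one.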

Blocking A Truncate Statement

Arun Sirpal shows that the TRUNCATE command needs to take locks like any other data modification command:

The truncate option is fast and efficient but did you know that it takes a certain lock where you could actually be blocked?

What am I talking about? When you issue a truncate it takes a Sch-M lock and it uses this when it is moving the allocation units to the deferred drop queue. So if it takes this lock and you look at the locking compatibility matrix below you will see what can cause a conflict (C).

Arun includes an image which shows what can block what, and also shows us an example.
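
The consequence of the compatibility matrix for a truncate can be sketched in a few lines of Python. This is an illustration rather than Arun's example: it encodes only the one rule relevant here, that a schema modification (Sch-M) lock is incompatible with every other lock mode, and the session IDs and the helper names `blocks_sch_m` and `truncate_blockers` are hypothetical.

```python
# A subset of SQL Server lock modes, listed for illustration only.
LOCK_MODES = ("Sch-S", "Sch-M", "IS", "S", "U", "IX", "SIX", "X")

def blocks_sch_m(held_mode):
    """Return True if a lock held in `held_mode` conflicts with a Sch-M request."""
    if held_mode not in LOCK_MODES:
        raise ValueError(f"unknown lock mode: {held_mode}")
    # Sch-M is incompatible with every mode, including Sch-S and another Sch-M.
    return True

def truncate_blockers(held_locks):
    """Given (session_id, mode) locks held by other sessions on the table,
    list the sessions that would block a truncate's Sch-M request."""
    return [session for session, mode in held_locks if blocks_sch_m(mode)]

# Hypothetical locks held on the table by three other sessions.
held = [(51, "Sch-S"), (52, "IS"), (53, "X")]
blockers = truncate_blockers(held)
print(blockers)
```

The point of the sketch is that any lock another session holds on the table, even the lightweight Sch-S, is enough to make the truncate wait.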

The Blocking Monitoring Framework

Dmitri Korotkevitch announces a new tool:

Troubleshooting blocking and concurrency issues is, in a nutshell, a simple process. You need to identify the processes involved in blocking conditions or deadlocks and analyze why those processes acquire locks on the same resources. In the majority of cases, you need to analyze queries and their execution plans, identifying possible inefficiencies that led to an excessive number of locks being acquired.

Collecting this information is not a trivial task. The information is exposed through DMVs (you can download the set of scripts here); however, it requires you to run the queries at the time the blocking occurs. Fortunately, SQL Server allows you to capture blocking and deadlock conditions with the blocked process report and deadlock graph and analyze them later.

There is a caveat, though. Neither the blocked process report nor the deadlock graph provides you with execution plans of the statements. Nor do they always include the affected statements in plain text. You may need to query the plan cache and other DMVs to get this information, and the longer you wait, the lower the chance that the information is still available. Moreover, SQL Server may generate an enormous number of blocked process reports in cases of prolonged blocking and complex blocking chains, which complicates the analysis.

Confirmed to work with SQL Server 2012 and later, but might work on earlier versions as well.  Dmitri has released it to the public, so check it out.

In Defense Of Inline Table-Valued Functions

Riley Major defends the honor of inline table-valued functions:

So no, user-defined functions are not the devil. Scalar user-defined functions can cause big problems if misused, but generally inline user-defined functions do not cause problems.

The real rule of thumb is not to avoid functions, but rather to avoid adorning your index fields with logic or functions. Because when you hide your intentions from the optimizer with complex syntax, you risk not getting the better performing index seek.

Riley shows an example where a query using his inline table-valued UDF gets an execution plan just as efficient as the same query written without the UDF.
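
The rule of thumb about not adorning index fields with functions can be observed in a small experiment. The sketch below is not Riley's example and does not use SQL Server: it assumes SQLite (through Python's built-in `sqlite3` module) as a stand-in engine, with a hypothetical `orders` table and index, and the plan text it prints is SQLite's, not SQL Server's.

```python
import sqlite3

# Hypothetical table with an index on the column we filter by.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("CREATE INDEX ix_orders_customer ON orders (customer)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 10.0), ("bob", 20.0), ("carol", 30.0)],
)

def plan(sql):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column holds the plan detail

# Bare column in the predicate: the optimizer can seek on the index.
bare_plan = plan("SELECT total FROM orders WHERE customer = 'alice'")

# Column wrapped in a function: the index on the raw column no longer applies.
wrapped_plan = plan("SELECT total FROM orders WHERE upper(customer) = 'ALICE'")

print(bare_plan)
print(wrapped_plan)
```

The first plan searches the index while the second falls back to scanning the table, which is the same trade-off the excerpt describes for index seeks in SQL Server.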

Things Not To Do In SQL Server

Randolph West has a how-not-to guide for SQL Server:

Don’t use TIMESTAMP

We covered this in detail in a previous post, What about TIMESTAMP? It’s better to pretend that this data type doesn’t exist.

Why not?

It is not what you think it is. TIMESTAMP is actually a synonym for ROWVERSION: an automatically generated binary value that increments whenever a row in the database is inserted or updated, and it does not store a date or a time. If you need to record an actual date and time, use DATETIME2 instead.

When should we?

Never.

I appreciate that Randolph includes a “when should you not listen to my overall pronouncement?” bit, as there are commonly exceptions to “do not do X” style rules.
