Press "Enter" to skip to content

Author: Kevin Feasel

Forcing 0 Intercept Inflates R-squared In R

John Mount has an informative post on how you can trick yourself when running linear regression models in R and forcing the y intercept to be 0:

So far so good. Let’s now remove the “intercept term” by adding the “0+” from the fitting command.

m2 <- lm(y~0+x, data=d)
t(broom::glance(m2))
##                        [,1]
## r.squared      7.524811e-01
## adj.r.squared  7.474297e-01
## sigma          3.028515e-01
## statistic      1.489647e+02
## p.value        1.935559e-30
## df             2.000000e+00
## logLik        -2.143244e+01
## AIC            4.886488e+01
## BIC            5.668039e+01
## deviance       8.988464e+00
## df.residual    9.800000e+01
d$pred2 <- predict(m2, newdata = d)

Uh oh. That appeared to vastly improve the reported R-squared and the significance (“p.value“)!

Read on to learn why this happens and how you can prevent this from tricking you in the future.

Comments closed

Scripting Azure Resources With ARM Templates

Melissa Coates has a detailed explanation of how to script out the creation and configuration of services in Azure using Azure Resource Manager (ARM) templates:

In many cases, you can easily provision resources in the web-based Azure portal. If you’re never going to repeat the deployment process, then by all means use the interface in the Azure portal. It doesn’t always make sense to invest the time in automated deployments. However, ARM templates are really helpful if you’re interested in achieving repeatability, improving accuracy, achieving consistency between environments, and reducing manual effort.

Use ARM templates if you intend to:

  • Include the configuration of Azure resources in source control (“Infrastructure as Code”), and/or

  • Repeat the deployment process numerous times, and/or

  • Automate deployments, and/or

  • Employ continuous integration techniques, and/or

  • Utilize DevOps principles and practices, and/or

  • Repeatedly utilize testing infrastructure then de-provision it when finished

Melissa walks through an example of deploying a website with backing database, along with various configuration changes.

Comments closed

Linked Servers And The Kerberos Double-Hop Problem

Jana Sattainathan shows how to set up Kerberos pass-through when dealing with linked servers:

Let us say you have SQLServer1 and you want to setup a linked server to SQLServer2 using “pass-through authentication”, a double-hop happens as explain in the article below. Basically, the first hop is when the user authenticates to SQLServer1 and the second hop when that gets passed on from SQLServer1 to SQLServer2.

The below article is a must-read before you proceed:

The three nodes involved in the double-hop as illustrated in the example are

  1. Client – The client PC from which the user is initiating connection to SQLServer1

  2. Middle server – SQLServer1

  3. Second server – SQLServer2

Dealing with the double-hop problem is far trickier than it should be; if you’ve had to deal with this, I recommend Jana’s guide.

Comments closed

Good Indexes Make Good Queries

Thomas Rushton has an example of a positive indexing experience:

100k runs of the query in a ten minute interval? yeowch. Yeah, this should be optimised if possible. The primary wait type was CPU – indicating that the data was all in RAM, but the CPU was having to schlep through the entire table to find what it needed. Or to find that it didn’t need anything. Or something.

The interesting story buried in here is that none of the tooling Thomas used indicated that things could be improved with a better index, and yet there was a tremendous opportunity here based on the graphics at the end.

Comments closed

Leave ANSI PADDING On

Kenneth Fisher explains what the ANSI PADDING setting does:

ON is the default and is what you would expect. Trailing spaces are saved in VARCHAR and in CHAR additional spaces added to fill the entire space. When ANSI_PADDING is off then additional spaces are not saved .. unless the column is CHAR AND NOT NULL.

So there’s the first reason to not turn ANSI_PADDING off. Most people expect the ON results and the OFF results can be .. let’s just say confusing.

Click through for more details, including how painful it is to change the setting on a column after the fact.

Comments closed

Not All Defaults Are Good

Nate Johnson rails against bad defaults in SQL Server:

Your servers have many-core CPUs, right?  And you want SQL to utilize those cores to the best of its ability, distributing the many users’ workloads fairly amongst them, yes?  Damn right, you paid $3k or more per core in freaking licensing costs!  “OK”, says SQL Server, “I’ll use all available CPUs for any query with a ‘cost’ over ‘5’“.  (To give context here, in case you’re not aware, ‘5’ is a LOW number; most OLTP workload queries are in the double to triple digits).  “But wait!”, you protest, “I have more than 1 user, obviously, and I don’t want their horrible queries bringing all CPUs to their knees and forcing the 50 other user queries to wait their turn!”

Nate has a few recommendations here, as well as a picture of kittens.

Comments closed

Environmental Factors And SQL Server

Jeff Mlakar has a set of tips and tricks around SQL Server performance:

Performance problems for a SQL Server based application are likely to be caused by environmental factors and not buggy code.

Whether it is a configuration you can change in SQL Server, Windows Server, VMware, or the network it is likely the first course of action is to perform a quick assessment of the environment. This is where understanding the various configurations and best practices are key. Knowing what to look for can save tons of time.

A mistake I often see is a performance issue is passed off to someone else (more senior) and that engineer assumes a lot of things without checking. People are going to relay the problem as they see it – not as it actually is. This leads to skipping over some elementary checks which can save time and frustration from tracking down imaginary bugs.

Start troubleshooting with a quick environmental check.

There are quite a few checks here.

Comments closed

The Assumptive Nature Of R

Tim Sweetser and Kyle Schmaus explain some of the less-obvious bits of R that make it harder to use as a production language:

For us, the biggest surprise when using an R data.frame is what happens when you try to access a nonexistent column. Suppose we wanted to do something with the prices of our diamonds. price is a valid column of diamonds, but say we forgot the name and thought it was title case. When we ask for diamonds[["Price"]], R returns NULL rather than throwing an error! This is the behavior not just for tibble, but for data.tableand data.frame as well. For production jobs, we need things to fail loudly, i.e. throw errors, in order to get our attention. We’d like this loud failure to occur when, for example, some upstream data change breaks our script’s assumptions. Otherwise, we assume everything ran smoothly and as intended. This highlights the difference between interactive use, where R shines, and production use.

Read on for several good points along these lines.

Comments closed

Great Circles In R

Yan Holtz shows how to draw great circles using an R package called geosphere:

This post explains how to draw connection lines between several localizations on a map, using R. The method proposed here relies on the use of the gcIntermediate function from the geosphere package. Instead of making straight lines, it offers to draw the shortest routes, using great circles. A special care is given for situations where cities are very far from each other and where the shortest connection thus passes behind the map.

Now we know how to make pretty-looking global route charts.

Comments closed

Modular, Production-Ready R

David Smith highlights Syberia, a development framework for productionalizing R code:

Syberia also encourages you to break up your process into a series of distinct steps, each of which can be run (and tested) independently. It also has a make-like feature, in that results from intermediate steps are cached, and do not need to be re-run each time unless their dependencies have been modified.

Syberia can also be used to associate specific R versions with scripts, or even other R engines like Microsoft R. I was extremely impressed when during a 30-minute-break at the R/Finance conference last month, Robert was able to sketch out a Syberia implementation of a modeling process using the RevoScaleR library. In fact Robert’s talk from the conference, embedded below, provides a nice introduction to Syberia.

Interesting stuff.  If you’re working with models in R today, this could be up your alley.

Comments closed