Comparing Poisson Regression to Regressing Against Logs

Nina Zumel compares a pair of methods for performing regression when income is the dependent variable:

Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.

Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.

This is an interesting post with a great teaser for the next post in the series.

tidylo: Calculating Log Odds in R

Julia Silge announces a new package, tidylo:

The package contains examples in the README and vignette, but let’s walk though another, different example here. This weighted log odds approach is useful for text analysis, but not only for text analysis. In the weeks since we’ve had this package up and running, I’ve found myself reaching for it in multiple situations, both text and not, in my real-life day job. For this example, let’s look at the same data as my last post, names given to children in the US.

Which names were most common in the 1950s, 1960s, 1970s, and 1980?

This package looks like it’s worth checking out if you deal with frequency-based problems.

Deep Dive on Index Seeks

Hugo Kornelis gives us a great deal of information on index seeks in SQL Server:

Every Seek Keys specification can be either for a “singleton seek”, or for a “range seek”. A singleton seek applies when at most a single row can satisfy the requirement of the Seek Keys specification. A range seek means that (potentially) more than a single row can qualify.

For a singleton seek, the index structure is used to find the row that matches the specified condition. If it exists, it is returned and then the operator immediately continues to the next Seek Keys specification. If it doesn’t, then nothing is returned and  the operator continues to the next Seek Keys specification. 

Read the whole thing and pair it with index scans if you haven’t read that already.

Checking if an Account is Disabled

Jack Vamvas has a script to see if the sa account is disabled:

Often organisations have a   SQL server security policy  dictating the ‘sa’  login is disabled. There is some sound reasoning behind this policy. The primary reason is to decreases the surface area available to attack – and is a common principle of security. 

When this is a requirement – I’ll  add some code to disable ‘sa’  as part of the  build  and certification process. But , from time to time – the ‘sa’ login will be re enabled, so it’s important to build an alerting or reporting mechanism identifying the exception. 

Click through for the script.

ML Services and Injectable Code

Grant Fritchey looks at sp_execute_external_script for potential SQL injection vulnerabilities:

The sharp eyed will see that the data set is defined by SQL. So, does that suffer from injection attacks? Short answer is no. If there was more than one result set within the Python code, it’s going to error out. So you’re protected there.

This is important, because the data set query can be defined with parameters. You can pass values to those parameters, heck, you’re likely to pass values to those parameters, from the external query or procedure. So, is that an attack vector?

No.

Another factor is that you need explicitly to grant EXECUTE ANY EXTERNAL SCRIPT rights to non-sysadmin, non-db_owner users, meaning a non-privileged user can’t execute external scripts at all. You can also limit the executing service account

Pester and Testable Powershell Scripts

Shane O’Neill has a public service announcement:

Now scripts are notoriously hard to test, I’ve written about how I’ve done that before but, honestly, if you really want to know then you need to check out Jakub Jares ( blog | twitter ).

Knowing how difficult testing scripts are, the first thing I decided to do was take the functions in the script and split them out. This way they can be abstracted away and tested safely.

I also didn’t want to take on too much at one time so I choose a random function, GetPreviousTag, and only actioned that one first.

Shane also found a bug in the first script, underscoring the importance of good tests.

Auditing Databases in Use

Jason Brimhall shows how you can use Extended Events to figure out if that database is still in use:

Here we can see there are indeed some databases that are still in use on this server. If we are looking to retire the instance, or migrate relevant databases to a new server, we have better information about how to go about planning that work. Based on this data, I would be able to retire the ProtossZealot database but would be required to do something with the ZergRush and BroodWar databases. In the case of the ProtossZealot database that is not being used, we now have evidence to present back to the team that the database is not used. It is now up to those insisting on keeping it around to justify its existence and document why it must remain in tact.

Click through for the script Jason used to determine this.

IsNumeric in Power Query

Matthew Roche (with an assist from Imke Feldmann) shows how you can check if a value—or part of a value—is numeric in Power Query:

Using Imke’s approach, this is what I ended up doing:

1. Extract the last two characters from the source column
2. When the extracted characters contain a number, convert them to numeric
3. Standardize the resulting value to represent the correct year

Read on to see how Matthew pulled it off.

Categories

July 2019
MTWTFSS
« Jun  
1234567
891011121314
15161718192021
22232425262728
293031