Why Does Empirical Variance Use n-1 Instead Of n?

Sebastian Sauer gives us a simulation showing why we use n-1 instead of n as the denominator when calculating the variance of a sample:

Our results show that the variance of the sample is smaller than the empirical variance; however, even the empirical variance is a little too small compared with the population variance (which is 1). Note that sample size was n=10 in each draw of the simulation. With sample size increasing, both should get closer to the “real” (population) variance (although the bias is negligible for the empirical variance). Let’s check that.

This is an R-heavy post that does a great job of showing that the n-1 correction is necessary, and it ends with recommended reading if you want to understand why.
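If you would rather see the effect outside of R, here is a minimal Python sketch of the same kind of simulation. It is not Sebastian's code: the function name, the number of draws, and the seed are my own choices, and it assumes draws of n=10 from a standard normal distribution (population variance 1) as described in the quote.

```python
import random


def simulate_variances(n=10, draws=20000, seed=42):
    """Average the n-denominator and (n-1)-denominator variances over many samples.

    Each sample holds n values drawn from a standard normal distribution,
    so the population variance being estimated is 1.
    """
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(draws):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        sum_sq = sum((x - mean) ** 2 for x in sample)
        total_n += sum_sq / n                # divide by n: biased low
        total_n_minus_1 += sum_sq / (n - 1)  # divide by n-1: unbiased
    return total_n / draws, total_n_minus_1 / draws


var_n, var_n_minus_1 = simulate_variances()
print(f"denominator n:   {var_n:.3f}")
print(f"denominator n-1: {var_n_minus_1:.3f}")
```

With n=10, the n-denominator average should land near (n-1)/n = 0.9, while the n-1 version should land near 1.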

Testing Disk Speed With diskspd

Marek Masko shows how to test I/O performance using the diskspd utility:

What is Diskspd?

Diskspd is a storage testing tool created by Microsoft Windows, Windows Server and Cloud Server Infrastructure Engineering teams. It combines robust and granular IO workload definition with flexible runtime and output options. That makes it a perfect tool for storage performance testing, validation and benchmarking.

Where to find Diskspd?

Diskspd is a free and open source utility. Its source code can be found on GitHub. The repository also hosts other frameworks which use Diskspd. You can find them under the ‘Frameworks’ directory. A binary release is hosted by Microsoft at the following location: http://aka.ms/diskspd.

Click through for more details, including an example of a poorly-performing I/O solution.

Finding SQL Server Instances With dbatools

Chrissy LeMaire shows off a very helpful command in dbatools:

Nearly every time I inherit a SQL Server environment, I’m only given a partial list of SQL Servers that exist on the network. It’s my usual routine to get permission to sniff the network then run about five different programs including Idera’s SQL Discovery and Microsoft’s SQL Server Assessment and Planning Toolkit.

I always thought it’d be cool to have one comprehensive PowerShell command that could do the work of all the above and was ecstatic to see NetSPI’s Scott Sutherland had written a few commands to do just that in his awesome PowerShell module PowerUpSQL.

When I saw Scott’s multi-pronged approach (including some UDP magic 🎩), I asked if he’d be interested in contributing to dbatools and he said yes! He submitted a gorgeous mock-up and I was so excited. Then came the PR, complete with great documentation and multithreading.

Click through for a lot more information on the command.

Resuming Azure SQL Data Warehouse With PowerShell

Arun Sirpal shows how to unpause an Azure SQL Data Warehouse instance using PowerShell:

I totally forgot that with Azure SQL DWH you can pause and resume compute, to save money because it is expensive. Question is how do you go about resuming compute? TSQL is not possible and sure you can do the change via Azure portal but what about PowerShell?

This makes it easy to script out an overnight data load and then pause the Azure SQL Data Warehouse until the morning when those analysts come in, so that you can save a bit of cash (or a lot, depending upon your DWU utilization).

Comparing Distinctness

Michael J. Swart shows several options for comparing whether an attribute’s value is distinct from a parameter:

Check it:

DECLARE @TeamId bigint = NULL, @SubTeamId bigint = NULL;
FROM tasks

Talk about elegant! That’s what we wanted from the beginning. It’s part of ANSI’s SQL 1999 standard. Paul White tells us it’s implemented internally as part of the query processor, but it’s not part of T-SQL! There’s a connect item for it… err. Or whatever they’re calling it these days. Go read all the comments and then give it a vote. There are lots of examples of problems that this feature would solve.

PROS: Super-elegant!
CONS: Invalid syntax (vote to have it included).

This would be nice to have.  In the meantime, Michael shows several options which are currently valid syntax.
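To see why a null-safe comparison matters, here is a small Python sketch using the standard library's sqlite3 module as a stand-in for SQL Server. It assumes the syntax under discussion is the standard's IS [NOT] DISTINCT FROM predicate. The table contents, the TaskId column, and the helper function are hypothetical, and the expanded predicate is one common workaround rather than necessarily one of the options in Michael's post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (TaskId INTEGER PRIMARY KEY, TeamId INTEGER, SubTeamId INTEGER)"
)
conn.executemany(
    "INSERT INTO tasks (TaskId, TeamId, SubTeamId) VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 10, None), (3, None, None)],
)

# Plain equality: NULL = NULL evaluates to NULL, so NULL parameters never match.
NAIVE = "TeamId = :team AND SubTeamId = :sub"

# Expanded null-safe comparison: equal values, or both sides NULL.
NULL_SAFE = (
    "(TeamId = :team OR (TeamId IS NULL AND :team IS NULL)) "
    "AND (SubTeamId = :sub OR (SubTeamId IS NULL AND :sub IS NULL))"
)

# SQLite-specific shorthand: its IS operator treats two NULLs as equal.
SQLITE_IS = "TeamId IS :team AND SubTeamId IS :sub"


def find_tasks(predicate, team_id, sub_team_id):
    """Return the TaskIds that satisfy the given WHERE-clause predicate."""
    sql = f"SELECT TaskId FROM tasks WHERE {predicate} ORDER BY TaskId"
    params = {"team": team_id, "sub": sub_team_id}
    return [row[0] for row in conn.execute(sql, params)]


print(find_tasks(NAIVE, None, None))      # plain equality misses the all-NULL row
print(find_tasks(NULL_SAFE, None, None))  # null-safe predicate finds it
print(find_tasks(SQLITE_IS, 10, None))
```

The expanded predicate is verbose and repeats each parameter, which is the kind of clutter a built-in distinctness predicate would remove.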

Rotating Out Partitions

Kendra Little explains that there are a couple of models available for partitioned table management:

I recently received a terrific question about table partitioning:

I’m writing up a proposal for my company to start partitioning a 2.5TB table. The idea/need is to control the size of the table while saving the old data. The old data will be moved to an archive database on a different server where the BI guys work with it.

In none of the videos or articles I’ve seen is there an explanation of how the rolling partition works on a long-term daily basis.

  1. Are the partitions reused, like in a ROUND ROBIN fashion?
  2. Or, do you add new partitions each day with new filegroups, drop the oldest partition off – this would be FIFO?

Lots of folks assume the answer here is always #2, simply because there’s a bunch of sample code out there for it.

But option #1 can be simpler to manage when it fits your data retention technique!

Click through to learn more about reusable partitioning.
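One way to picture the difference between the two models in the question is the toy Python sketch below. It is only an illustration of the bookkeeping, not Kendra's implementation or anything SQL Server-specific: the seven-day retention period and both function names are assumptions of mine.

```python
from collections import deque
from datetime import date, timedelta

RETENTION_DAYS = 7  # hypothetical retention period


def round_robin_slot(day, slots=RETENTION_DAYS):
    """Model 1: a fixed set of partitions, where each slot is reused every `slots` days.

    The slot has to be emptied (for example, archived) before the new day's data lands in it.
    """
    return day.toordinal() % slots


def slide_window(window, new_day):
    """Model 2 (FIFO): add a partition for new_day and drop the oldest one.

    Returns the day whose partition was dropped, or None while the window is still filling.
    """
    window.append(new_day)
    if len(window) > RETENTION_DAYS:
        return window.popleft()
    return None


start = date(2018, 3, 1)
# Round robin: the same slot comes back after a full retention cycle.
print(round_robin_slot(start) == round_robin_slot(start + timedelta(days=RETENTION_DAYS)))

# FIFO: the eighth day pushes the first day's partition out.
window = deque()
dropped = [slide_window(window, start + timedelta(days=i)) for i in range(8)]
print(dropped[-1])
```

In the round-robin model the set of partitions never changes, which is why it can be simpler to manage when the retention period is fixed.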

Second-Order SQL Injection Attacks

Bert Wagner explains what he calls second-order SQL injection attacks:

SQL injection attacks that delay execution until a secondary query are known as “second order”.

This means a malicious user can inject a query fragment into a query (that’s not necessarily vulnerable to injection), and then have that injected SQL execute in a second query that is vulnerable to SQL injection.

Let’s look at an example.

Another way of thinking about this is as a persisted SQL injection attack, akin to reflected versus persisted cross-site scripting attacks. The fix is not to trust unsanitized user input. Just because you put a user’s data into your database doesn’t mean that someone sanitized it, so treat that stuff as unsafe unless you know otherwise.
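Here is a minimal Python sketch of the pattern, using the standard library's sqlite3 module rather than SQL Server. It is not Bert's example: the tables, the payload, and the function names are all hypothetical. The first query stores the payload safely with a parameter, and the damage only happens when a later query concatenates the stored value into SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, secret TEXT);
    INSERT INTO orders (customer, secret) VALUES ('alice', 'a-1'), ('bob', 'b-1');
    """
)

# Step 1: the payload goes in through a parameterized insert.
# Nothing executes here; the text is stored verbatim.
payload = "nobody' OR '1'='1"
conn.execute("INSERT INTO users (name) VALUES (?)", (payload,))

# Step 2: later, the application reads the value back out of its own database.
stored_name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]


def orders_unsafe(name):
    """Vulnerable: trusts the stored value and concatenates it into the query text."""
    sql = "SELECT secret FROM orders WHERE customer = '" + name + "'"
    return sorted(row[0] for row in conn.execute(sql))


def orders_safe(name):
    """Safer: the stored value is passed as a parameter, so it stays data."""
    sql = "SELECT secret FROM orders WHERE customer = ?"
    return sorted(row[0] for row in conn.execute(sql, (name,)))


print(orders_unsafe(stored_name))  # the injected OR clause matches every order
print(orders_safe(stored_name))    # no customer has that literal name
```

The point of the sketch is that parameterizing only the first query is not enough; every query that touches the stored value needs the same care.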

