Standard Deviation Estimation

Kevin Feasel

2016-06-23

R

Dan Goldstein gives a rule of thumb for getting standard deviations for various distributions:

Say you’ve got 30 numbers and a strong urge to estimate their standard deviation. But you’ve left your computer at home. Unless you’re really good at mentally squaring and summing, it’s pretty hard to compute a standard deviation in your head. But there’s a heuristic you can use:

Subtract the smallest number from the largest number and divide by four

Let’s call it the “range over four” heuristic. You could, and probably should, be skeptical. You could want to see how accurate the heuristic is. And you could want to see how the heuristic’s accuracy depends on the distribution of numbers you are dealing with.

Sometimes you just don’t have STDEV() available.
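
If you do have a computer handy, the heuristic is easy to check. Here is a minimal Python sketch that compares "range over four" against the sample standard deviation. The 30 normally distributed values and the seed are my own assumptions for illustration, not Dan's data.

```python
import random
import statistics


def range_over_four(values):
    """Estimate the standard deviation as (max - min) / 4."""
    return (max(values) - min(values)) / 4


def compare(values):
    """Return (heuristic estimate, sample standard deviation)."""
    return range_over_four(values), statistics.stdev(values)


# Hypothetical data: 30 draws from a normal distribution with sd = 10.
rng = random.Random(42)
sample = [rng.gauss(50, 10) for _ in range(30)]

estimate, actual = compare(sample)
print(f"range over four: {estimate:.2f}")
print(f"sample stdev:    {actual:.2f}")
```

Swapping `rng.gauss` for another generator (uniform, exponential, and so on) is a quick way to see how the accuracy depends on the distribution.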

2016 SOS_RWLock

Ewald Cress continues his series on internals, and looks at how SOS_RWLock has changed in SQL Server 2016:

Allow me to call out some layout comparison points against the 2014 version:

  • There is no separate member to track the shared reader count.

  • The four-byte spinlock is gone.

  • The four-byte waiting writer count is gone.

  • The two chunks of four-byte padding (for qword alignment of pointers) are gone.

  • The WaitListEntry structure hasn’t changed at all.

Ewald also covers Compare-And-Swap operations in detail.  Definitely a good read.
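
For anyone new to compare-and-swap, here is a rough Python sketch of the general retry-loop pattern. This is not SQL Server's actual layout or code: the `AtomicWord` class, the `WRITER_BIT` flag, and the function names are all hypothetical, and a `threading.Lock` stands in for the atomicity that hardware provides. It only shows how a reader count can live inside a single word and be updated without a separate spinlock.

```python
import threading


class AtomicWord:
    """A single integer with a simulated atomic compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store new only if the current value still equals expected."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


WRITER_BIT = 1 << 31  # hypothetical flag meaning "a writer holds the lock"


def try_acquire_shared(word):
    """Add one reader to the packed count, retrying if the CAS loses a race."""
    while True:
        old = word.load()
        if old & WRITER_BIT:
            return False  # a writer owns the lock; a real caller would wait
        if word.compare_and_swap(old, old + 1):
            return True


def release_shared(word):
    """Remove one reader from the packed count."""
    while True:
        old = word.load()
        if word.compare_and_swap(old, old - 1):
            return
```

The key idea is that a failed swap means somebody else changed the word first, so the caller re-reads it and tries again rather than blocking.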

T-SQL Tuesday Roundup

Michael Swart rounds up the usual suspects:

There’s always some anxiety when throwing a party. Wondering whether it will be a smash. Well I had nothing to worry about with the twenty bloggers who participated last week. You guys hit it out of the park!

Michael put a lot of effort into making his round-up look nice and making my life a little easier by exposing me to a couple blogs I didn’t know about.  Great job.

Minimizing Cloud Costs

Kevin Feasel

2016-06-22

Cloud

Kenneth Fisher looks at reducing the bottom line for cloud operations:

This got me thinking about ways to reduce/minimize costs. These are some general ideas since from what I can tell cloud billing is as complex as the tax codes and at that I have limited experience.

  • If you aren’t using your VM, shut it down. You can do this manually, or with a PowerShell script or even at the push of a button.

  • Start small. Only create the machines you need and keep them to a minimum.

  • Starting small will lead to some bottlenecks. Feel free to bounce up and down as you need. There are some restrictions (size etc) when you move downwards, so be careful. Again this can be done manually or with PowerShell. Let’s say you need to do a high volume load. Bump your service tier, then once you are done, bump it back down again.

  • And my personal favorite: Don’t install Enterprise when you only need Standard.

Doing business on Azure or AWS does require a bit of a shift in mindset.  Cloud costs are entirely variable: you control when services run; how much compute, storage, and bandwidth you want to use; and your SLA.  Choosing different spots on the continuum results in different pricing.  This has also helped the growth of technologies like Hadoop, in which you can separate compute from storage.  If I know that my cluster gets heavy usage during core business hours, light usage overnight, and no usage on the weekend, I can spin nodes up and down as necessary, and can even shut off clusters which don’t need to operate.  Because I’m storing the data off of the cluster nodes (on S3 or in Azure Data Lake Storage), data doesn’t become unavailable just because the primary compute process is unavailable; I could spin up another cluster or write a quick one-off data reader.
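
To put rough numbers on the cluster scenario, here is a back-of-the-envelope Python sketch. Every figure in it is an assumption I made up for illustration: the node counts, the hours, and especially the price per node-hour, which is not a quote from any provider.

```python
def weekly_node_hours(schedule):
    """Sum nodes * hours * days over (nodes, hours_per_day, days_per_week) entries."""
    return sum(nodes * hours * days for nodes, hours, days in schedule)


# Hypothetical cluster: 10 nodes running around the clock.
always_on = [(10, 24, 7)]

# The same hypothetical cluster, scaled to match usage.
scheduled = [
    (10, 10, 5),  # core business hours on weekdays
    (2, 14, 5),   # light overnight usage on weekdays
    (0, 24, 2),   # shut down on the weekend
]

RATE_PER_NODE_HOUR = 0.50  # hypothetical price, not a real quote

for name, schedule in (("always on", always_on), ("scheduled", scheduled)):
    hours = weekly_node_hours(schedule)
    cost = hours * RATE_PER_NODE_HOUR
    print(f"{name}: {hours} node-hours, ${cost:.2f} per week")
```

Under these made-up numbers, the scheduled cluster uses 640 node-hours a week against 1,680 for the always-on one. Storage is billed separately in this model, which is the point of keeping data off the compute nodes.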

Presentation Versus Storage

Edwin Sarmiento looks at how data is stored on disk when you use Dynamic Data Masking or Always Encrypted in SQL Server 2016:

Looking at the data, the masked columns appear as they are on disk. This validates Ronit Reger’s statement on his blog post Use Dynamic Data Masking to obfuscate your sensitive data.

There are no physical changes to the data in the database itself; the data remains intact and is fully available to authorized users or applications. Note that Dynamic Data Masking is not a replacement for access control mechanisms, and is not a method for physical data encryption.

In contrast, the encrypted columns are encrypted on disk and the data types are different on disk compared to how they were defined in the table schema – SSN is defined with nvarchar(11) while CreditCardNumber is defined with nvarchar(25). This means that those “valuables” are even more secured on disk, requiring additional layers of security just to get access to them.

Read the whole thing.

Analyze Fantasy Sports With Spark

Jordan Volz is back with part two of his series on fantasy sports analysis using Apache Spark:

We’ll look at both zTot and nTot, and consider the player’s age and experience. The latter is potentially important because there have been shifts in what ages players joined the league over the timespan we are considering. It used to be rare for players to skip college, then it wasn’t, now they are required to play at least one year. It will be interesting to see if we see a difference in age versus experience in the numbers.

We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data, one a key-value pair of age-values, and one a key-value pair of experience-values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)

Jordan also looks at player performance over time and makes data analysis look pretty easy.
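
The year-over-year step is straightforward to picture outside of Spark. Below is a plain-Python sketch of the idea, not Jordan's code: the row layout, the function names, and the choice of population standard deviation for the z-score are all my assumptions.

```python
import statistics
from collections import defaultdict


def z_scores(values):
    """Standardize values as (x - mean) / population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]


def yearly_change_by_key(rows, key_index):
    """Group year-over-year score changes by age (index 2) or experience (index 3).

    Each row is a hypothetical tuple: (player, year, age, experience, score).
    """
    by_player = defaultdict(list)
    for row in rows:
        by_player[row[0]].append(row)

    changes = defaultdict(list)
    for seasons in by_player.values():
        seasons.sort(key=lambda r: r[1])
        for prev, cur in zip(seasons, seasons[1:]):
            if cur[1] == prev[1] + 1:  # only compare consecutive seasons
                changes[cur[key_index]].append(cur[4] - prev[4])
    return dict(changes)


# Made-up rows for two players across two seasons.
rows = [
    ("A", 2000, 22, 1, 1.0),
    ("A", 2001, 23, 2, 1.5),
    ("B", 2000, 25, 4, 0.2),
    ("B", 2001, 26, 5, -0.1),
]
print(yearly_change_by_key(rows, 2))  # keyed by age
print(yearly_change_by_key(rows, 3))  # keyed by experience
```

The two calls at the end mirror the two key-value data sets described in the excerpt, one keyed by age and one by experience.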

BigQuery Versus Redshift

Kiyoto Tamura compares Google’s BigQuery versus Amazon’s Redshift for cloud-based warehousing:

Neither service is truly “set and forget”; each requires a dedicated engineer to learn the service and maintain it. You can use various tools to automate many aspects of the operation, but someone will have to maintain automation scripts and workflows.

That said, here are things that I’ve heard first-hand from talking to users

The bottom line there is that Redshift is a bit more mature than BigQuery today, but keep an eye on both of them.

LAST_VALUE

Kevin Feasel

2016-06-22

T-SQL

Steve Jones plays with a window function new to SQL Server 2012:

The important thing to understand with window functions is that there is a frame at any point in time when the data is being scanned or processed. I’m not sure what the best term to use is.

Let’s look at the same data set Kathi used. For simplicity, I’ll use a few images of her dataset, but I’ll examine the SalesOrderID. I think that can be easier than looking at the amounts.

Here’s the base dataset for two customers, separated by CustomerID and ordered by the OrderDate. I’ve included amount, but it’s really not important.

Steve goes into detail and explains what’s going on each step of the way.  Window functions are extremely useful; check them out if you’re not already familiar with them.
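
The frame is what trips most people up with LAST_VALUE, so here is a small sketch of it. I am using SQLite through Python's built-in sqlite3 module as a stand-in for SQL Server (it needs SQLite 3.25 or later for window functions), and the table and rows are made up rather than taken from Steve's or Kathi's posts.

```python
import sqlite3

# Hypothetical orders for two customers (not the dataset from the post).
ORDERS = [
    (1, "2016-01-01", 101),
    (1, "2016-02-01", 102),
    (1, "2016-03-01", 103),
    (2, "2016-01-15", 201),
    (2, "2016-02-15", 202),
]

QUERY = """
SELECT customer_id,
       sales_order_id,
       -- Default frame: start of the partition up to the current row.
       LAST_VALUE(sales_order_id) OVER (
           PARTITION BY customer_id ORDER BY order_date
       ) AS default_frame,
       -- Explicit frame covering the whole partition.
       LAST_VALUE(sales_order_id) OVER (
           PARTITION BY customer_id ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS full_frame
FROM orders
ORDER BY customer_id, order_date;
"""


def last_value_demo():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(customer_id INTEGER, order_date TEXT, sales_order_id INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


for row in last_value_demo():
    print(row)
```

With the default frame, the "last" row is the current row, so LAST_VALUE just echoes each row's own SalesOrderID. Extending the frame to UNBOUNDED FOLLOWING gives the final order for each customer.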

Connecting SQL Server To Hadoop Using Polybase

I have a post up on using Polybase to create an external table which points to Hadoop:

An interesting thing about FIELD_TERMINATOR is that it can be multi-character.  MSDN uses ~|~ as a potential delimiter.  The reason you’d look at a multi-character delimiter is that not all file formats handle quoted identifiers very well (that is, putting quotation marks around strings that have commas in them to indicate that commas inside quotation marks are punctuation marks rather than field separators).  For example, the default Hive SerDe (Serializer and Deserializer) does not handle quoted identifiers; you can easily grab a different SerDe which does offer quoted identifiers and use it instead, or you can make your delimiter something which is guaranteed not to show up in the file.

You can also set some defaults such as date format, string format, and data compression codec you’re using, but we don’t need those here.  Read the MSDN doc above if you’re interested in digging into that a bit further.

It’s a bit of a read, but the end result is that we can retrieve data from a Hadoop cluster as though it were coming from a standard SQL Server table.  This is easily my favorite feature in SQL Server 2016.
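
To see why a multi-character delimiter helps, here is a tiny Python sketch. It is not Polybase or Hive code; it only mimics a reader that splits on the delimiter without understanding quotes, and the sample record is made up.

```python
FIELD_TERMINATOR = "~|~"  # multi-character delimiter, as in the MSDN example

# Hypothetical record: two of the three fields contain commas.
fields = ["Smith, John", "Columbus, OH", "42"]

# A naive comma-delimited reader splits inside the fields as well.
comma_line = ",".join(fields)
naive = comma_line.split(",")

# A delimiter that never appears in the data round-trips cleanly.
safe_line = FIELD_TERMINATOR.join(fields)
parsed = safe_line.split(FIELD_TERMINATOR)

print(len(naive), naive)
print(len(parsed), parsed)
```

The comma version comes back as five pieces instead of three, while the ~|~ version returns the original three fields.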

Blitz Scripts Open Sourced

Kevin Feasel

2016-06-22

Tools

Brent Ozar announces that the sp_Blitz series of scripts is now open source:

Our prior copyright license said you couldn’t install this on servers you don’t own. We’d had a ton of problems with consultants and software vendors handing out outdated or broken versions of our scripts, and then coming to us for support.

Now, it’s a free-for-all! If you find the scripts useful, go ahead and use ’em. Include sp_Blitz, sp_BlitzCache, sp_BlitzIndex, etc as part of your deployments for easier troubleshooting.

This is very good news.
