Press "Enter" to skip to content

February 16, 2018

The Basics Of PCA In R

Prashant Shekhar gives us an overview of Principal Component Analysis using R:

PCA rotates the axes toward the directions of maximum variance and then projects the data onto those new axes. The direction of maximum variance is represented by the first principal component (PC1). There are multiple principal components, one per dimension (feature) in the dataset, and they are orthogonal to each other; the maximum number of principal components equals the number of dimensions in the data. For two-dimensional data, for example, there are at most two principal components (PC1 and PC2). The first principal component captures the most variance, followed by the second, then the third, and so on. Dimension reduction comes from the fact that you can discard the last few principal components, since they capture little of the variance in the data.

PCA is a useful technique for reducing dimensionality and removing covariance between features: the resulting components are uncorrelated with one another.
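As a quick illustration, here is a minimal sketch in base R, using the built-in iris dataset rather than anything from the original post:

    # PCA on the four numeric columns of iris via base R's prcomp.
    # Centering and scaling keep one wide-ranging feature from dominating PC1.
    pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

    # Importance of components: PC1 explains the largest share of variance,
    # PC2 the next largest, and so on.
    summary(pca)

    # Dimension reduction: discard the trailing components, keeping PC1 and PC2.
    reduced <- pca$x[, 1:2]
    head(reduced)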


Tidy Data Is Normalized Data

I emphasize the link between a tidy dataframe and a normalized data structure:

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data.  Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to denormalize data because analytics tools prefer having everything connected together. But the way we denormalize still retains a fairly normalized structure: we still treat observations and variables as we would in a normalized design, so we don’t try to pack multiple observations into the same row or multiple variables into the same column, reuse a column for multiple purposes, etc.

I had an inkling of this early on and figured I was onto something clever until I picked up Wickham’s vignette and read that yeah, that’s exactly the intent.
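To make the parallel concrete, here is a minimal sketch using tidyr and an invented, untidy dataframe (not an example from the paper):

    library(tidyr)

    # Hypothetical untidy frame: year is a variable, but it is packed into
    # the column headers, so each row carries two observations.
    untidy <- data.frame(
      country = c("US", "CA"),
      `2017`  = c(10, 20),
      `2018`  = c(12, 21),
      check.names = FALSE
    )

    # Tidying doubles as lightweight normalization: one row per observation,
    # one column per variable, one value per cell.
    tidy <- pivot_longer(untidy, cols = c("2017", "2018"),
                         names_to = "year", values_to = "value")
    tidy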


SQL Operations Studio February Release

Alan Yu announces the February release of SQL Operations Studio:

The February release includes several major repo updates and feature releases, including:

  • Added Auto-Update Installation feature
  • Added Connection Dialog ‘Database’ Drop-down
  • Added functionality for new query tabs keeping active connection
  • Fixed bugs in SQL Editor and auto-completion

For complete updates, refer to the Release Notes.

Auto-update is something that Management Studio and Power BI don’t do.


Navigating A SQL Server Instance With PowerShell

Andy Mallon shows off the drive-based navigation available in the SQL Server PowerShell module:

I recently noticed that the system objects were missing from my results when I do a Get-ChildItem. I noticed it with views, but then realized that none of the system objects showed up. What gives? I floundered through a quick Google search, where I knew I wasn’t searching for the right thing, and was not surprised when I didn’t see the answer.

I said to myself, “Andy, hold on a second & think. If something doesn’t want to open up, sometimes you just have to -force it open.”

I don’t tend to use this much, as I have recollections of it being slow.  Nonetheless, it is good to know about.
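If you want to try it yourself, here is a minimal sketch, assuming the SqlServer module and a local default instance (the server, instance, and database names are placeholders):

    # Load the module that provides the SQLSERVER: drive.
    Import-Module SqlServer

    # Without -Force, system objects are filtered out of the results.
    Get-ChildItem SQLSERVER:\SQL\localhost\DEFAULT\Databases\master\Views

    # With -Force, system views (e.g., the sys schema) show up as well.
    Get-ChildItem SQLSERVER:\SQL\localhost\DEFAULT\Databases\master\Views -Force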


Find And Replace Stored Procedure Code

Jana Sattainathan has to find and replace a lot of code:

The database has hundreds of stored procedures (700+, to be somewhat precise).

Qualifying stored procedures have these characteristics:

  • Procedure name ends with “_del”
  • Procedure has the string “exec” in the code
  • Procedure has the string “sp_execute” in the code

This is what needs to be done:

  • Replace CREATE PROC with ALTER PROC
  • Replace SYSTEM_USER with “ORIGINAL_LOGIN()”
  • Replace AS at the beginning of CREATE PROC with “WITH EXECUTE AS OWNER AS”
  • Comment out some SET statements
  • …in fact, there could be any number of other changes

Read on to see how Jana did it.
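As a rough idea of the first step, a catalog-view query along these lines can identify the qualifying procedures; this is a hedged sketch, not the approach from the post:

    -- Find procedures whose names end in "_del" and whose definitions
    -- contain both "exec" and "sp_execute".
    SELECT p.name, m.definition
    FROM sys.procedures AS p
        INNER JOIN sys.sql_modules AS m
            ON m.object_id = p.object_id
    WHERE p.name LIKE '%[_]del'
        AND m.definition LIKE '%exec%'
        AND m.definition LIKE '%sp_execute%';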


The Downside Of Nested Views

Randolph West doesn’t mince words:

Nested views are bad. Let’s get that out of the way.

What is a nested view anyway? Imagine that you have a SELECT statement you tend to use all over the place (a very common practice when checking user permissions). There are five base tables in the join, but it’s fast enough.

Instead of copying and pasting the code wherever you need it, or using a stored procedure (because for some reason you’re allergic to stored procedures), you decide to simplify your code and re-use that SELECT in the form of a database view. Now whenever you need to run that complicated query, you can instead query the view directly.

What harm could this do?

Spoilers:  a lot.
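For the uninitiated, here is a minimal sketch of the pattern, with hypothetical permission tables standing in for the five base tables in the example above:

    -- The "handy" view hiding a five-table join.
    CREATE VIEW dbo.UserPermissions AS
    SELECT u.UserId, u.UserName, r.RoleName, p.PermissionName
    FROM dbo.Users AS u
        INNER JOIN dbo.UserRoles AS ur ON ur.UserId = u.UserId
        INNER JOIN dbo.Roles AS r ON r.RoleId = ur.RoleId
        INNER JOIN dbo.RolePermissions AS rp ON rp.RoleId = r.RoleId
        INNER JOIN dbo.Permissions AS p ON p.PermissionId = rp.PermissionId;
    GO

    -- The nesting: a second view stacked on the first. Every query against
    -- this view expands all of the layers underneath it.
    CREATE VIEW dbo.ActiveUserPermissions AS
    SELECT up.UserId, up.UserName, up.PermissionName
    FROM dbo.UserPermissions AS up
        INNER JOIN dbo.Users AS u ON u.UserId = up.UserId
    WHERE u.IsActive = 1;
    GO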


Diagramming Databases With Power BI

Philip Seamark shows how to visualize the relationships between tables using Power BI:

The network navigator was another good visual, and if you have an R instance installed on your local machine, you can play with some of the custom R visuals.

The catalog views could be used in a similar way to generate Power BI visuals showing other object dependencies inside a SQL Server database. Additional columns could be added to the base query to be used in tool-tips, etc.

If your database tables do not have foreign keys, here is another query that can be used to help guess relationships between tables based on column name and datatype.

Most of these techniques look like they wouldn’t work well on databases with a very large number of tables, but for an average-sized database, this approach can serve as an avant-garde ERD.
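The kind of base query that feeds such a visual might look like this sketch, which is my guess at the general shape rather than Philip’s exact query:

    -- One row per foreign key: the parent table and the referenced table
    -- become the "from" and "to" nodes of the relationship graph.
    SELECT
        fk.name AS ForeignKeyName,
        OBJECT_SCHEMA_NAME(fk.parent_object_id) + N'.'
            + OBJECT_NAME(fk.parent_object_id) AS FromTable,
        OBJECT_SCHEMA_NAME(fk.referenced_object_id) + N'.'
            + OBJECT_NAME(fk.referenced_object_id) AS ToTable
    FROM sys.foreign_keys AS fk;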


Batch Mode Memory Fractions

Joe Obbish explains what memory fractions are and how incorrect calculations can lead to tempdb spills:

There’s very little information out there about memory fractions. I would define them as information in the query plan that can give you clues about each operator’s share of the total query memory grant. This is naturally more complicated for query plans that insert into tables with columnstore indexes, but that won’t be covered here. Most references will tell you not to worry about memory fractions or that they aren’t useful most of the time. Out of thousands of queries that I’ve tuned, I can only think of a few for which memory fractions were relevant.

Sometimes queries spill to tempdb even though SQL Server reports that a lot of query memory was unused. In these situations I generally hope for a poor cardinality estimate which leads to a memory fraction that is too low for the spilling operator. If fixing the cardinality estimate doesn’t prevent the spill, then things can get a lot more complicated, assuming that you don’t just give up.

Extremely interesting post.
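For reference, memory fractions live in the showplan XML as a MemoryFractions element on each operator. Here is a sketch of one way to surface them from the plan cache; note that scanning the cache like this is expensive on busy systems:

    -- Each RelOp's MemoryFractions element estimates that operator's
    -- share of the query memory grant on input and output.
    WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT TOP (50)
        op.value('@PhysicalOp', 'nvarchar(128)') AS PhysicalOp,
        op.value('(MemoryFractions/@Input)[1]', 'float') AS InputFraction,
        op.value('(MemoryFractions/@Output)[1]', 'float') AS OutputFraction
    FROM sys.dm_exec_cached_plans AS cp
        CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
        CROSS APPLY qp.query_plan.nodes('//RelOp') AS r(op)
    WHERE op.exist('MemoryFractions') = 1;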
