# Day: August 14, 2018

PCA looks for a new the reference system to describe your data. This new reference system is designed in such a way to maximize the variance of the data across the new axis. The first principal component accounts for as much variance as possible, as does the second and so on. PCA transforms a set of (tipically) correlated variables into a set of uncorrelated variables called principal components. By design, each principal component will account for as much variance as possible. The hope is that a fewer number of PCs can be used to summarise the whole dataset. Note that PCs are a linear combination of the original data.

The procedure simply boils down to the following steps

1. Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).

2. Calculate the covariance matrix of the data.

3. Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.

4. Choose only the first N biggest eigenvalues according to one of the many criteria available in the literature.

5. Project your data in the new frame of reference by multipliying your data matrix by a matrix whose columns are the N eigenvectors associated with the N biggest eigenvalues.

6. Use the projected data (very confusingly called “scores”) as your new variables for further analysis.

I like the explanations provided, and the data set is definitely something I’m not used to seeing with PCA.  H/T R-bloggers

Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: `data.table` uses a radix sort which has the potential to perform in near linear time (faster than the `n log(n)` lower bound known comparison sorting) for a range of problems (also we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the `y` in the above graph by `log(rows)` we get something approaching a constant.

John has also provided us with a markdown document for comparison.

All of the examples so far have consisted of matrices with data elements of the same class. And for good reason: it’s a requirement for a matrix. R will coerce elements with mismatched classes to the same class. Here are two vectors, one of class integer and the other of class character. After combining them into a matrix via rbind(), we see the first row of data elements are of the character class (enclosed in double quotes):

```> row1 <- c(1L, 2L, 3L, 4L)
> row2 <- c("a", "b", "c", "d")
> new_matrix <- rbind(row1, row2)
> new_matrix [,1] [,2] [,3] [,4]
row1 "1" "2" "3" "4"
row2 "a" "b" "c" "d"```

Matrices drive a large number of statistical techniques, though I tend to deal with them less directly than I would have imagined.

Another special new capability that we gain with Composite Models is native support for many-to-many relationships. I see this as incredibly helpful to new Power BI users that are not familiar with strict requirements around relationships.

Traditionally, Power BI is only able to create relationships that are one-many. This means that one of the two tables in a relationship must have a column with unique values for the relationship to work. With this new feature the requirement that a table must have a column with unique values is no longer needed.

For those of you that have a data modeling background this may be a little different than what you traditionally envision a many-to-many relationship should be. Normally a many-to-many relationship includes a bridge table to bring the two tables together, but now with Power BI Composite Models the relationship can be directly defined between them.

Check it out.

You have been warned. I don’t like the code above. If you want to apply filters, use CALCULATE. Maybe the code written using CALCULATE is verbose and boring, but it’s easier to read. If you are looking for an enjoyable reading for your free time, you can choose a novel rather than playing with an obscure DAX syntax.

So, why writing a blog post about this? The reason is that TOTALYTD has the ability to accept a scalar value as an argument instead of (or after) the filter. This is the same additional optional parameter accepted by DATESYTD in case the fiscal year does not end on December 31st. The problem is that the third argument of TOTALYTD is a filter if you use a predicate or a table expression, and it’s a marker for end of year in case it’s a scalar value. Confusing enough? Ok, let me present some example.

Performance troubleshooting should begin with capturing wait stats so we can understand where SQL Server is busy. The script below captures wait stats into a table in tempdb; the script should be ran via a SQL Server Agent job or some other scheduling mechanism.

I like the definitions that Max provides.  My only recommendation would be to store this data someplace a bit more permanent than tempdb.

The problems come with uncommon values. If you are pseudonymizing a medical database that is required for research purposes on people with potentially embarrassing diseases, and it appears on the dark web, anyone with a rare or unusual surname or first-name comes up on the list, so the shuffle doesn’t help the privacy of Fortescue Ceresole, or whatever his name may be.

If you are spoofing data entirely, you don’t necessarily have this problem because your constructed value will have no relationship to the original value. If it comes from a list of common names or if you randomly create a name ‘Thomas’, it will have no relationship to the original names in the database as long as you did things correctly and shuffle the list. Although a Markov string can produce an identical name that is uncommon, it can be eliminated from the list by an outer join with the original data.

After you shuffle data, you ‘zip’ it. Zipping lists is something you come across in procedural programming, and Linq has a good example. A .net array has an order, and all you are doing is to join by the order of the element in the list. If you randomize that order, you get a shuffle.

This is a sample Power Query file, which in that I do a very simple transformation. The transformation is adding one to the existing number. However, in this sample, we are doing it over thousands of steps! one step at a time, we are adding thousands to a number. The main reason to do it this way is to show you what is the performance you get when you have too many variables (or let’s say steps) in Power Query.

Reza reduces the number of steps and turns a 15-minute operation into one which finishes in under a second.

Docker is available for Mac and Windows with a simple installation by the defaults.