
September 12, 2017

Naive PCA With R

Pablo Bernabeu gives us a naive method for performing a Principal Component Analysis:

STAGE 1.  Determine whether PCA is appropriate at all, considering the variables

  • Variables should be inter-correlated enough but not too much. Field et al. (2012) provide some thresholds, suggesting that no variable should have many correlations below .30, or any correlation at all above .90. Thus, in the example here, variable Q06 should probably be excluded from the PCA.

  • Bartlett’s test, on the nature of the intercorrelations, should be significant. Significance suggests that the correlation matrix is not an identity matrix, i.e., that the observed intercorrelations are not mere sampling error.

  • KMO (Kaiser-Meyer-Olkin), a measure of sampling adequacy based on common variance (so it serves a similar purpose to Bartlett’s). As Field et al. review, ‘values between .5 and .7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and values above .9 are superb’ (p. 761). There’s an overall score as well as one per variable. The overall score will often be good, whereas the individual scores are more likely to fall short. Any variable with a score below .5 should probably be removed, and the test should be run again.

  • Determinant: a measure of multicollinearity in the correlation matrix. The determinant should preferably be above .00001; smaller values signal problematic multicollinearity.

PCA is a powerful tool in several fields, including clinical testing.
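Pablo works through these checks in R; purely as an illustration of the same diagnostics, here is a minimal Python/NumPy sketch. The data frame `df` (numeric survey items) is a placeholder, and the thresholds in the comments are the ones quoted above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def pca_suitability_checks(df: pd.DataFrame) -> dict:
    """Rough pre-PCA diagnostics: inter-correlations, Bartlett's test of
    sphericity, KMO, and the determinant of the correlation matrix."""
    R = df.corr().to_numpy()
    n, p = df.shape

    # Inter-correlations: count, per variable, how many correlations sit below .30,
    # and flag any off-diagonal correlation above .90.
    weak_per_var = (np.abs(R) < 0.30).sum(axis=1)               # diagonal (=1) never counts
    any_too_high = bool((np.abs(R - np.eye(p)) > 0.90).any())

    # Bartlett's test of sphericity: H0 says the correlation matrix is an identity matrix.
    det_R = np.linalg.det(R)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(det_R)
    dof = p * (p - 1) / 2
    bartlett_p = float(stats.chi2.sf(chi2, dof))

    # KMO: shared variance relative to shared plus partial (unique) variance.
    inv_R = np.linalg.inv(R)
    scale = np.sqrt(np.outer(np.diag(inv_R), np.diag(inv_R)))
    partial = -inv_R / scale                                    # partial correlation matrix
    np.fill_diagonal(partial, 0.0)
    r2 = (R - np.eye(p)) ** 2                                   # squared off-diagonal correlations
    q2 = partial ** 2
    kmo_per_var = r2.sum(axis=0) / (r2.sum(axis=0) + q2.sum(axis=0))
    kmo_overall = r2.sum() / (r2.sum() + q2.sum())

    return {
        "weak_correlations_per_variable": dict(zip(df.columns, weak_per_var)),
        "any_correlation_above_0.90": any_too_high,
        "bartlett_p_value": bartlett_p,                         # should be significant (e.g. < .05)
        "kmo_per_variable": dict(zip(df.columns, np.round(kmo_per_var, 3))),
        "kmo_overall": float(kmo_overall),                      # > .5 at a minimum; > .8 is great
        "determinant": float(det_R),                            # should stay above .00001
    }
```

In R itself, the psych package’s cortest.bartlett() and KMO() functions cover Bartlett’s test and KMO directly.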


Serverless Lambda Architecture

Laith Al-Saadoon shows off a new Amazon Web Services product, AWS Glue, which allows you to build a data processing system on the Lambda architecture without directly provisioning any EC2 instances:

With the launch of AWS Glue, AWS provides a portfolio of services to architect a Big Data platform without managing any servers or clusters. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (for example, the table definition and schema) in the AWS Glue Data Catalog. After it’s cataloged, your data is immediately searchable, queryable, and available for ETL.

AWS Glue generates the code to execute your data transformations and data loading processes. Furthermore, AWS Glue provides a managed Spark execution environment to run ETL jobs against a data lake in Amazon S3. In short, you can now run a Lambda Architecture in AWS in a completely serverless fashion!

“Serverless” applications allow you to build and run applications without thinking about servers. What this means is that you can now stream data in real-time, process huge volumes of data in S3, and run SQL queries and visualizations against that data without managing server provisioning, installation, patching, or capacity scaling. This frees up your users to spend more time interpreting the data and deriving business value for your organization.

Laith has a working demo of the process available as well.
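For a sense of what a Glue job looks like in practice, here is a minimal PySpark-style ETL skeleton of the kind Glue generates. The database, table, and S3 path names are placeholders, and the real jobs in Laith’s demo will differ.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate: wire up Spark, the Glue context, and job bookkeeping.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a Glue crawler has already registered in the Data Catalog.
# "demo_db" and "raw_events" are placeholder names.
source = glue_context.create_dynamic_frame.from_catalog(
    database="demo_db",
    table_name="raw_events",
)

# A trivial transformation step; real jobs would map, filter, and join here.
cleaned = source.drop_fields(["_corrupt_record"])

# Write the result back to the data lake in S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
)

job.commit()
```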


Trigram Search In SQL Server

Paul White shows how to implement trigram wildcard searches in SQL Server:

The basic idea of a trigram search is quite simple:

  1. Persist three-character substrings (trigrams) of the target data.
  2. Split the search term(s) into trigrams.
  3. Match search trigrams against the stored trigrams (equality search).
  4. Intersect the qualified rows to find strings that match all trigrams.
  5. Apply the original search filter to the much-reduced intersection.

We will work through an example to see exactly how this all works, and what the trade-offs are.

A must-read.  Using n-grams in SQL Server is an example of a non-obvious data architecture that performs much better than the obvious alternative, at least when the conditions are right.
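Paul’s article implements this with computed columns and indexes in T-SQL; purely to illustrate the logic of the five steps above, here is a small Python sketch using an in-memory inverted index (the sample strings are made up).

```python
from collections import defaultdict

def trigrams(value: str) -> set:
    """Return the set of three-character substrings (trigrams) of a string."""
    value = value.lower()
    return {value[i:i + 3] for i in range(len(value) - 2)}

# Step 1: persist trigrams of the target data (here, an in-memory inverted index).
rows = ["aardvark", "abide", "barracuda", "cardigan", "mustard"]
trigram_index = defaultdict(set)
for row_id, value in enumerate(rows):
    for tri in trigrams(value):
        trigram_index[tri].add(row_id)

def trigram_search(fragment: str) -> list:
    # Step 2: split the search term into trigrams.
    search_tris = trigrams(fragment)
    if not search_tris:
        candidates = set(range(len(rows)))          # term too short: no reduction possible
    else:
        # Steps 3 and 4: equality-match each trigram and intersect the qualifying rows.
        candidates = set.intersection(*(trigram_index.get(t, set()) for t in search_tris))
    # Step 5: apply the original (expensive) filter only to the reduced candidate set.
    return sorted(rows[i] for i in candidates if fragment.lower() in rows[i])

print(trigram_search("ard"))    # ['aardvark', 'cardigan', 'mustard']
```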


Transit Data Visualization In R

Goncalo Trincao Cunha shows us how to plot General Transit Feed Specification data in R:

GTFS (General Transit Feed Specification) is a specification that defines a data format for public transportation routes, stops, schedules, and associated geographic information.

In this post, we’ll use R with ggplot2 and ggmap to visualize GTFS route and schedule information on a map.

This post uses a GTFS feed from CARRIS, which is a bus public transport operator from the city of Lisbon.

Click through for code and a few interesting maps of Lisbon, Portugal.
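The post itself uses R with ggplot2 and ggmap; as a rough equivalent, a pandas/matplotlib sketch that draws route shapes and stops from a GTFS feed might look like this. The gtfs/ path is a placeholder for wherever the CARRIS feed is unzipped; the column names come from the GTFS specification.

```python
import pandas as pd
import matplotlib.pyplot as plt

# stops.txt and shapes.txt are standard GTFS files; the column names below
# (stop_lat, stop_lon, shape_id, shape_pt_lat, shape_pt_lon, shape_pt_sequence)
# are defined by the GTFS specification.
stops = pd.read_csv("gtfs/stops.txt")
shapes = pd.read_csv("gtfs/shapes.txt")

fig, ax = plt.subplots(figsize=(8, 10))

# Draw each route shape as a thin line, with points in their published sequence.
for shape_id, pts in shapes.sort_values("shape_pt_sequence").groupby("shape_id"):
    ax.plot(pts["shape_pt_lon"], pts["shape_pt_lat"],
            linewidth=0.5, alpha=0.4, color="steelblue")

# Overlay the stops on top of the route network.
ax.scatter(stops["stop_lon"], stops["stop_lat"], s=2, color="darkred", alpha=0.6)

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("GTFS route shapes and stops")
plt.show()
```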


Deleting SSAS Cube Partitions With Powershell

Richie Lee shows how to remove Analysis Services cube partitions using Powershell:

One such example of an ad-hoc DBA task was when I had to delete about 600 partitions from a measure group that had thousands of partitions. Doing this manually would be ridiculous, so at the time I created a SQL script that used some dynamic T-SQL to create the delete commands in XMLA. XMLA has no “delete if exists” type syntax, so if I needed to run this again, the dynamic SQL output wouldn’t work. And so I decided that if I had to run the same task again, I would write a PowerShell script that would run DSC-style and drop the partitions that were no longer required. And funnily enough, that is exactly what I had to do.

I knew I would be able to create a Powershell script that used AMO to check if a partition exists and drop it if it did. I also wanted the script to take into account any other partitions in other measure groups that may also need to be dropped. So I made sure the script uses PowerShell switches that can be included when calling the function, and if they are included then the pertaining partitions in that measure group will be deleted. So you can run the script for one, some or all of the measure groups in a cube.

Click through for the script.


Instant Log Initialization In Azure

Dimitri Furman shows a benefit of creating database files with Azure Blob Storage:

Recently, we were working on a performance testing exercise using a SQL Server database with files in Azure Blob Storage. After creating the database using the default 8 MB size for data and log file (as in the example above), we wanted to increase the size of all files to be sufficient for the expected workload. IFI was not yet enabled for the SQL Server instance we were working with, and growing the data file from 8 MB to 1 TB took about one minute (using Premium Storage). This was expected, since the data file had to be fully initialized. We expected that the log growth to 1 TB would take about as much time, for the same reason. It was very surprising then that the same operation on the log file completed in less than one second.

It turns out that this is due to differences between Azure Blob Storage and traditional storage systems.


Starting Out With Powershell

Rob Sewell has his own introduction to Powershell:

PowerShell is another language; you are not going to be as proficient in a new language as you are in the language you spend all day working with. You will have to go through the learning curve, just like with any language, and you will have to understand how to interpret errors. Undoubtedly you will get frustrated at times. Reach out for help: use Twitter, use the #powershellhelp channel in the SQL Server Community Slack. There are many other places and plenty of people who will be glad to help you.

Rob has a pretty detailed introduction to the topic, so it’s well worth the read for new Powershell users.


Power BI Drillthrough

Reid Havens shows how to use the new drillthrough functionality in Power BI:

Report Drill Through enables users to create a report page filtered to a single entity (e.g., Customer, Employee, Store, Product). Reports often have a summary landing page where there might be a lot of information, but not much detail about a specific item. Traditionally in Power BI we’d create a separate reporting page for further detail breakouts on an entity. However, that would require its own set of SLICERS that you would have to re-select if you wanted it to mirror the reporting page you were coming from…that’s too many clicks!

Selecting an item in a table on one reporting page and drilling through will take you to another page, FILTERED to the entity you selected! This feature essentially lets us create detail sub-pages that are linked to whatever primary reporting page the report uses. Features like this have been available for YEARS in Excel using linked cells; I’m super happy we finally have this as a feature in Power BI Desktop. It only takes a few steps to set up, but I’ll leave the instructions for that over at the Power BI September release page. With that said, let me quickly run you through what this looks like in a sample report.

Click through for an example, showing how useful drillthrough can be.


Getting Started With Powershell

Rob Farley gives us a basic introduction to Powershell:

I want to take a few minutes to get you started with PowerShell, in a way that I think is less overwhelming than how I found most PowerShell material back in the day. Because you’ve told me you keep hearing about PowerShell, and that you can’t tell how to get into it.

The first thing to know is that it feels like Command Prompt on steroids, but a lot more useful than Command Prompt. You’ll see what I mean as soon as you open it. Let’s do that, but let’s open it using ‘Run as Administrator’, because that just helps for some of the demos. In real life, you’ll only open it with Admin rights if you really need to.

So click on the start button and start typing PowerShell. I don’t want you to open the ISE (yet), just find the “Windows PowerShell” shortcut, and right-click on it to choose ‘Run as Administrator’.

This T-SQL Tuesday is all about Powershell, and Rob’s is one of the first posts you should read if you’re not familiar with it.
