Press "Enter" to skip to content

Month: July 2017

Data Cleaning Tips

Michael Grogan has a few tips for data cleaning with R:

6. Delete observations using head and tail functions

The head and tail functions can be used if we wish to delete certain observations from a variable, e.g. Sales. Called with a negative argument, the head function allows us to delete the last 30 rows, while the tail function allows us to delete the first 30 rows.

When it comes to using a variable edited in this way for calculation purposes, e.g. a regression, the as.matrix function is also used to convert the variable into matrix format:

# Drop the last 30 observations, then convert to matrix form
Salesminus30days <- head(Sales, -30)
X1 <- as.matrix(Salesminus30days)
X1

# Drop the first 30 observations, then convert to matrix form
Salesplus30days <- tail(Sales, -30)
X2 <- as.matrix(Salesplus30days)
X2
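To see how the two results line up, here is a quick sketch (Sales below is just a made-up numeric vector standing in for the real variable, and the lagged regression is my own illustration, not Grogan's):

# Hypothetical stand-in for the Sales variable in the excerpt
Sales <- rnorm(365, mean = 100, sd = 10)

# head(x, -n) drops the LAST n elements; tail(x, -n) drops the FIRST n
X1 <- as.matrix(head(Sales, -30))   # observations 1..335
X2 <- as.matrix(tail(Sales, -30))   # observations 31..365

# Both are 335 x 1, so they line up for, e.g., regressing each
# observation on its value 30 periods earlier
fit <- lm(X2 ~ X1)
summary(fit)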

Some of these tips are for people familiar with Excel but fairly new to R.  These also use base R rather than the tidyverse packages (e.g., using merge instead of dplyr's joins, or as.Date instead of lubridate).  You may consider that a small negative, but if it is, it's a very small one.


Useful dplyr Functions

S. Richter-Walsh explains seven important dplyr functions with plenty of examples:

There are many useful functions contained within the dplyr package. This post does not attempt to cover them all but does look at the major functions that are commonly used in data manipulation tasks. These are:

select() 
filter()
mutate() 
group_by() 
summarise()
arrange() 
join()

The data used in this post are taken from the UCI Machine Learning Repository and contain census information from 1994 for the USA. The dataset can be used for classification of income class in a machine learning setting and can be obtained here.
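Here's a minimal sketch of the seven in action (the data frame below is a made-up stand-in for the census data, not the actual UCI dataset; note also that "join()" in the list stands for dplyr's family of joins, such as left_join() and inner_join()):

library(dplyr)

# Toy stand-in for the census data
census <- data.frame(
  age            = c(25, 38, 52, 41, 30),
  sex            = c("F", "M", "M", "F", "F"),
  hours_per_week = c(40, 50, 60, 38, 45),
  income         = c("<=50K", ">50K", ">50K", "<=50K", ">50K")
)

census %>%
  select(age, sex, hours_per_week, income) %>%  # pick columns
  filter(age >= 30) %>%                         # keep rows meeting a condition
  mutate(overtime = hours_per_week > 40) %>%    # derive a new column
  group_by(income) %>%                          # aggregate per income class
  summarise(mean_age = mean(age), n = n()) %>%  # one row per group
  arrange(desc(mean_age))                       # sort the result

# The joins take two tables, e.g.:
labels <- data.frame(income = c("<=50K", ">50K"),
                     label  = c("low", "high"))
left_join(census, labels, by = "income")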

That’s probably the bare minimum you should know about dplyr, but knowing just these seven can make data analysis in R much easier.


Streaming ETL Using CDC And Event Hub

Rolf Tesmer combines Change Data Capture and Event Hubs to build a streaming ETL solution:

The solution picks up the SQL data changes from the CDC Change Tracking system tables, creates JSON messages from the change rows, and then posts the messages to an Azure Event Hub.  Once landed in the Event Hub, an Azure Stream Analytics (ASA) job distributes the changes to the multiple outputs.

What I found pretty cool was that I could transmit SQL delta changes from source to target in as little as 5 seconds end to end!

There are a bunch of steps, but the end result is worth it.
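To make the "JSON messages from the change rows" step concrete, here's a rough sketch in R of what serializing one change row could look like (the field names and values are entirely hypothetical, not Rolf's actual schema):

library(jsonlite)

# Hypothetical CDC change row, flattened into a list
change <- list(
  operation = "UPDATE",
  table     = "dbo.Orders",
  lsn       = "0x0000002A000001C80003",  # made-up log sequence number
  columns   = list(OrderID = 1001, Status = "Shipped")
)

# Serialize to JSON; this is the kind of payload that would be
# posted to the Event Hub endpoint
msg <- toJSON(change, auto_unbox = TRUE)
msg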


Check Where That Backup’s Restoring To

Shane O’Neill “has a friend” who learned an important lesson about the database restore GUI:

GUIs are good for….

…discovery.

They give you the option to script out the configurations you have chosen. If my friend had chosen to script out the restore, rather than clicking "OK" to run it, maybe he would have caught this mistake when reviewing it – rather than overwriting the Live database with two-week-old data and spending a weekend in the office with three colleagues fixing it.

Plus if you ever want to ensure that you know something, try and script it out from scratch.

Read the whole thing; good thing that totally didn’t happen to Shane and was just his friend!


Save Early, Save Often

Kenneth Fisher relays an important life lesson:

So years and years ago, when I was in college, one of my favorite classes was Assembly Language. We were working with Mac Assembly in case anyone is interested (yes I used a Mac at school, one of the big ones that had the monitor built into it). Somewhere around week three or four, we were supposed to print something to the screen. I spent several hours (this was only my second programming class so even Hello World was a challenge) and got my program ready to test. It worked! Sort of.

Hello World was written to the top of the screen! Then a second or so later the bottom half of the screen turned into random ASCII garbage. Then a second or so later the computer rebooted. Well, that’s not good. Time to debug!

So the computer comes back up, I take a look, and I don’t have ANY code. I hadn’t saved (and this was long enough ago there was no auto-save). I had to start ALL over again. In the end, I did manage to re-write my code, got it working and even got an A. I also learned that I needed to save my work before running it. Well, learned my lesson for the first time (of many).

I have attempted to put a sanguine spin on this mishap, based on something Phil Factor once wrote:  if you throw away (or lose) the code the first time around, the second time you write it, the code will probably be better.  This is because the first time you’re writing a set of code, you’re trying to force the pieces together and get the code working; the second time around, you have a working algorithm in mind, so the code will likely be much cleaner.


Changing A Large Table’s Clustered Index

Nate Johnson explains how to change the clustered index on a very large table:

I call this the "setup, dump, & swap".  Essentially we need to create an empty copy of the table, with the desired index(es), dump all the data into it, and then swap it in.  There are a couple of ways you can do this, but it boils down to the same basic premise: it's "better" (probably not in terms of speed, but definitely in terms of efficiency and overhead) to fill this new copy of the table & its indexes than it is to build the desired index on the existing table.

As Nate notes, "very large" here will depend on your environment, but this is a useful technique because the old table can stay live until the moment of the swap.  As I write this, I'm actually in the middle of one of these sorts of swaps—one that will take a week or two to finish due to pacing.


SQL Server Roll-Your-Own Cryptography

David Fowler has an interesting article on a simple cryptographic algorithm in SQL Server:

As this post was prompted by my post on bitwise logic, we're going to base our algorithm around the XOR cypher.  Basically this cypher works by taking a key, which for simplicity's sake will be a single byte, and XOR-ing that against the message (or plain text) that we want to encrypt.

Let’s look at an example of how this is going to work.

Let’s say that we want to encrypt the plain text ‘SQLUndercover’.  How are we going to do this?  Firstly we need to remember that all text characters are represented by a single byte as an ASCII code.  ‘SQLUndercover’ is represented by the following set of ASCII codes, 83 81 76 85 110 100 101 114 99 111 118 101 114.

I definitely recommend reading this article for two reasons:  first, because it's interesting; second, because it shows how easy it is to break amateur crypto.  If you feel the need to roll your own cryptographic algorithm, follow my Official Crypto Flow Chart (patent pending):  Do you have a PhD in mathematics with a specialty in cryptanalysis?  If not, don't roll your own crypto; if so, maybe do it, but even then probably not.
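The excerpt's worked example translates directly into a few lines of R (a sketch of the same XOR cypher; the key value of 42 is my own arbitrary choice):

# Convert the plain text to its ASCII codes
plain <- "SQLUndercover"
codes <- utf8ToInt(plain)  # 83 81 76 85 110 100 101 114 99 111 118 101 114

# XOR each code against a single-byte key
key    <- 42L
cipher <- bitwXor(codes, key)

# XOR-ing with the same key again recovers the plain text -- which is
# also why a single-byte key falls to brute force in at most 256 guesses
intToUtf8(bitwXor(cipher, key))  # "SQLUndercover"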


Availability Group Extended Events

Tracy Boggiano shows how to enhance the built-in Extended Events session for Availability Groups:

Now that the extended events session is set up, we can use some queries to pull information about our AGs and the error messages happening on our servers.  First, to see when the server fails over and becomes your primary server, query the availability_replica_state_change event.  The first 10 lines just read in the files for the extended events session so you don't have to identify the exact filenames.  Then we parse the XML event to get the timestamp, previous state, current state, replica name, and group name for the event from the files we collected before.  In the WHERE clause we are looking for when the state has changed to PRIMARY_NORMAL, indicating a failover to the server.

Read on for scripts showing what she extends and also how to query this data.


Tibbles In R

Tristan Mahr explains what tibbles and tribbles are and how they compare to built-in data frames:

The name "tribble" is short for "transposed tibble" (the transposed part referring to the change from column-wise creation in tibble() to row-wise creation in tribble()).

I like to use light-weight tribbles for two particular tasks:

  • Recoding: Create a tribble of, say, labels for a plot and join it onto a dataset.

  • Exclusion: Identify observations to exclude, and remove them with an anti-join.
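Both patterns are easy to demonstrate (a minimal sketch with made-up data, not taken from Tristan's post):

library(dplyr)
library(tibble)

scores <- tibble(id    = c("a", "b", "c"),
                 group = c(1, 2, 1),
                 score = c(10, 12, 9))

# Recoding: a row-wise tribble of labels, joined onto the dataset
labels <- tribble(
  ~group, ~group_label,
  1,      "Control",
  2,      "Treatment"
)
left_join(scores, labels, by = "group")

# Exclusion: identify observations to drop, then anti-join them away
exclusions <- tribble(
  ~id,
  "b"
)
anti_join(scores, exclusions, by = "id")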

I’ve been more used to data frames than tibbles, but this post shows some interesting things you can do with tibbles a lot more easily than with data frames.  It’s enough to make me want to use tibbles more frequently.  H/T R-bloggers


Measuring Model Accuracy

Fabio Veronesi shows several methods of testing model accuracy:

Mean Squared Deviation or Mean Squared Error

This is simply the numerator of the previous equation, but it is not used often. The issue with both the RMSE and the MSE is that, since they square the residuals, they tend to be more affected by large residuals. This means that even if our model explains the large majority of the variation in the data very well, with only a few exceptions, those exceptions will inflate the value of RMSE.
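In other words, MSE is the mean of the squared residuals and RMSE is its square root. A quick sketch (the observed and predicted values here are made up):

# Observed and predicted values for some hypothetical model
obs  <- c(3.1, 4.0, 5.2, 6.8, 7.9)
pred <- c(3.0, 4.2, 5.0, 6.5, 8.4)

mse  <- mean((obs - pred)^2)
rmse <- sqrt(mse)
c(MSE = mse, RMSE = rmse)

# One large residual dominates because residuals are squared:
obs2  <- c(obs, 20)
pred2 <- c(pred, 10)
sqrt(mean((obs2 - pred2)^2))  # RMSE inflated by the single outlier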

Click through for several calculations.  H/T R-bloggers
