Shredding Excel With R

Kevin Feasel



John MacKintosh shows how to use R for wrangling + ETL:

I had over 140 files to process. That’s not usually a big deal – I normally use SQL Server Integration Services to loop through network folders, connect to hundreds of spreadsheets and extract the source data.

But this relies on the data being in a tabular format (like a dataframe or database table).

A quick glance at the first few sheets confirmed I could not use this approach – the data was not in tabular format. Instead it was laid out in a format suited to viewing the data on screen – with the required data scattered in different ranges throughout each sheet ( over 100 rows and many columns). It wasn’t going to be feasible to point SSIS at different locations within each sheet. (It can be done, but it’s pretty complex and I didn’t have time to experiment).

The other challenge was that over time, changes to design meant that data moved location e.g. dates that were originally in cell C2 moved to D7, then moved again as requirements evolved. There were 14 different templates in all, each with subtle changes. Each template was going to need a custom solution to extract the data.

This is a good look at how R can be about more than “just” statistical analysis.

Getting Started With Functional Programming

Jose Gonzalez links to a set of blog posts and videos introducing functional programming:

Since I’ve started to play with (and rave about) functional programming (FP), a lot of people have asked me how to get started.

Instead of writing the same email multiple times, I decided to create a blog post I can refer them to. Also, it’s a central place to put all my notes about the topic.

Here’s a small collection of all the resources I’ve accumulated on my adventure on learning functional programming.

I think the functional paradigm fits relational database development extremely well, better than the object-oriented paradigm.

Support Vector Machines In R

Deepanshu Bhalla explains what support vector machines are:

The main idea of support vector machine is to find the optimal hyperplane (line in 2D, plane in 3D and hyperplane in more than 3 dimensions) which maximizes the margin between two classes. In this case, two classes are red and blue balls. In layman’s term, it is finding the optimal separating boundary to separate two classes (events and non-events).

Deepanshu then goes on to implement this in R.

Disabling Office 365 Groups

John White says, spoilers, you can’t:

The bottom line of all this is that even if you use Office 365, and you think that you have disabled Groups in your tenant, the chances are that you could be in for a surprise. If any of these dependent services are in use, the chances are that you already have several created.

Groups are the bedrock of all new features in Office 365 moving forward – it is therefore a good idea that your organization understand them as soon as possible. Their inevitability is also another strong argument for paying close attention to them. If you are currently discussing whether or not they should be used, I would strongly encourage you to shifting that discussion to how they should best be used.

Read the whole thing if you’re getting into Office 365.

Update:  John has a clarification, showing that it is possible, but it’s not trivial.

Data Analysis Basics In R

Sibanjan Das provides some of the basics of data analysis using R:

Let’s start thinking in a logical way the steps that one should perform once we have the data imported into R.

  • The first step would be to discover what’s in the data file that was exported. To do this, we can:
    • Use head function to view few rows from the data set. By default, head shows first 5 rows. Ex: head(mtcars)
    • str to view the structure of the imported data. Ex: str(mtcars)
    • summary to view the data summary. Ex: summary(mtcars) 

There’s a lot to data analysis, but this is a good start.

Suse On Windows 10

Kevin Feasel



Brad Sams reports that Ubuntu isn’t the only flavor of Linux available on Windows 10 anymore:

If you do go down this route, you have the option for installing either openSUSE Leap 42.2 and SUSE Linux Enterprise Server 12 SP2.

The benefits here are obvious, with Microsoft enabling the Windows subsystem for Linux, they are opening the door to more than simply running Bash inside of Windows 10. While that is a good feature and one of the most likely used instances of this subsystem, what Microsoft has actually done is opened the door for more vendors to bring their Linux tools to the Windows platform.

I’d expect Red Hat to follow suit.

Detail Rows Expression In SSAS Tabular

Chris Webb shows a new feature in SSAS Tabular vNext:

What drillthrough does in SSAS Multidimensional, and what the new Detail Rows Expression property in SSAS Tabular does, is allow an end user to see the detail-level data (usually the rows in the fact table) that was aggregated to give the value the user clicked on in the original PivotTable.

Read through for an example as well as how it’s already an improvement over Multidimensional’s dillthrough.

Thinking About Real-Time Analytics

Martin Willcox offers some advice for people getting into the real-time analytics game:

  1. Clarify who will be making the decision – man, or machine? Humans have powers of discretion that machines sometimes lack, but are much slower than a silicon-based system, and only able to make decisions one-at-a-time, one-after-another.  If we chose to put a human in the loop, we are normally in “please-update-my-dashboard-faster-and-more-often” territory.

  2. It is important to be clear about decision-latency. Think about how soon after a business event you need to take a decision and then implement it. You also need to understand whether decision-latency and data-latency are the same. Sometimes a good decision can be made now on the basis of older data. But sometimes you need the latest, greatest and most up-to-date information to make the right choices.

There are some good insights here.


January 2017
« Dec Feb »