
Month: December 2017

Outlier Detection With dplyr And ruler

Evgeni Chasnovski shows how to use a couple R packages in concert to find outliers:

During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have an essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. The majority of them deal with numerical data. This post will describe the most basic ones with their application using the dplyr and ruler packages.

After reading this post you will know:

  • Most basic outlier detection techniques.

  • A way to implement them using dplyr and ruler.

  • A way to combine their results in order to obtain a new outlier detection method.

  • A way to discover a notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of the previous point).

Read the whole thing.  H/T R-Bloggers
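As a rough illustration of what "most basic" techniques and "combining their results" can look like, here is a minimal sketch in plain Python rather than dplyr and ruler. The two rules (z-score and Tukey's fences), the thresholds, and the vote-counting combination are assumptions chosen for illustration; they are not taken from the linked post, and all function names are hypothetical.

```python
from statistics import mean, quantiles, stdev


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier under this rule.
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


def iqr_outliers(values, k=1.5):
    """Flag values outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def combined_outliers(values, min_votes=2):
    """Combine both rules: a value is an outlier if at least `min_votes` rules flag it."""
    votes = zip(zscore_outliers(values), iqr_outliers(values))
    return [sum(flags) >= min_votes for flags in votes]
```

One caveat with the z-score rule: on very small samples a single extreme value inflates the standard deviation enough that it may never cross a threshold of 3, which is one reason to combine several rules instead of trusting one.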


rquery: Relational Algebra In R

John Mount announces rquery:

rquery is Win-Vector LLC’s currently in-development big data query tool for R.

rquery supplies a set of operators inspired by Edgar F. Codd’s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production).

As an example: rquery operators allow us to write our earlier “treatment and control” example as follows.

dQ <- d %.>%
  extend_se(.,
            if_else_block(
              testexpr =
                "rand()>=0.5",
              thenexprs = qae(
                a_1 := 'treatment',
                a_2 := 'control'),
              elseexprs = qae(
                a_1 := 'control',
                a_2 := 'treatment'))) %.>%
  select_columns(., c("rowNum", "a_1", "a_2"))

It’s an interesting idea.
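To make the intent of the if/else block easier to see, here is a small sketch of the same row-level logic in plain Python. It assumes the block draws one random value per row and then sets both columns from that single draw, so a_1 and a_2 always receive opposite labels; that reading of the rquery example, and the function name, are assumptions for illustration.

```python
import random


def assign_groups(rows, rng):
    """Give each row opposite 'treatment'/'control' labels in a_1 and a_2.

    A single random draw per row decides both columns, so the two
    assignments can never disagree with each other.
    """
    assigned = []
    for row in rows:
        if rng.random() >= 0.5:
            a_1, a_2 = "treatment", "control"
        else:
            a_1, a_2 = "control", "treatment"
        # Keep only the columns selected at the end of the original pipeline.
        assigned.append({"rowNum": row["rowNum"], "a_1": a_1, "a_2": a_2})
    return assigned


rows = [{"rowNum": i} for i in range(1, 6)]
result = assign_groups(rows, random.Random(2017))
```

Passing in a seeded generator keeps the assignment reproducible, which is convenient when testing.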


Getting Started With dplyr

Abdul Majed Raja has a dplyr tutorial:

dplyr is one of the most popular R packages and is part of the tidyverse developed by Hadley Wickham. The mere fact that the dplyr package is very famous means it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models; Data Analysis (Manipulation) and Data Visualization play a very important role in the BAU of many of us. In fact, they are a very important part before any modeling exercise, since Feature Engineering and EDA are the most important differentiating factors between your model and someone else’s.
Hence, this post aims to bring out some well-known and not-so-well-known applications of dplyr so that any data analyst can leverage its potential using a very familiar dataset: the Titanic Dataset.

This covers the main pieces of dplyr, including its pipeline.  dplyr is a key part of the tidyverse, and knowing it well makes R so much easier.  H/T R-Bloggers


Organizing SQL Queries

Eleni Markou shows a few techniques available to organize SQL queries, especially for analytics:

Jupyter Notebook

For the advocates of Python, a commonly used application is Jupyter Notebook. Jupyter Notebook is a server-client application that allows editing and running of Python code via a web browser, combining Python code, SQL, equations, text, and visualizations. It also offers syncing with GitHub repositories.

More specifically, Jupyter Notebook will be rendered by GitHub directly on your repo page.  This means that one can enjoy all the benefits that Git offers regarding version control, branching, merging and collaborative development when using Jupyter Notebook.

The best strategy is probably a multi-tiered strategy.  It absolutely starts with source control, but it doesn’t have to end there.


Columnstore Indexes And Partition Operations

Niko Neugebauer continues his columnstore index series, this time looking at how partitioned tables behave:

Let’s start with a simple test of merging the 2007 partition with the year 2008, by issuing the following command:

It might come as a reasonably big surprise, but this command will fail if you are using Columnstore Indexes.

The very same command will work without any problem if we simply avoid creating the Clustered Columnstore Index …
The reason behind this limitation is that Columnstore Indexes do not sort or control the boundaries of the data, and this hampers the whole implementation of such operations.

It’s an interesting read, and a little disappointing.


Upgrading SSIS Packages

Bartosz Ratajczyk explains why upgrading SQL Server Integration Services packages could be in your best interest:

Looking at the times of the upgrade (it takes milliseconds) we can live with automatic version migrations during thousands of executions. So – is there any gain if we retain it?

Let’s take a closer look at the SSIS Toolbox. We are migrating to SSIS 2017 from a lower version; let’s say the source is SSIS 2012. Open SQL Server Data Tools (for Visual Studio 2015 or 2017, it does not matter for now) and load your project. I will use SSDT for VS 2017 with a sample project created for SSIS 2012. See the SSIS Toolbox for the project in version SSIS 2012? There is a Script Task following an FTPTask.

I will upgrade the SSIS project to the latest version (and write more about it in a few lines) and take a look at the SSIS Toolbox now.

Now we can see additional tasks for Hadoop. Upgrading the project does at least two things that are interesting to us: it uses the latest versions of the tasks and components, but also introduces the new elements to use.

There are some benefits, but those come with a little bit of risk.


Mapping Extended Events To Trace Events

Jason Brimhall shows how Extended Events and server trace events (like Profiler uses) map together:

If you have browsed XEvents to any extent you should probably be familiar with at least one map object that Microsoft has given us. That particular object is sys.dm_xe_map_values and I wrote a bit about it here. That object presents some good information about various data values within XEvents that are delivered with the Event payload as well as some hidden gems that can be extra useful for exploring SQL Server internals. The point is, maps are not necessarily a new concept.

While the concept of a map is not new within XEvents, the implementation in this case is a tad different. You see, the dm_xe_map_values object is more like an EAV object while the map I will introduce today is more of an ordinary lookup table. Let’s take a look.

The map I want to concentrate on for now is the sys.trace_xe_event_map table. You read that correctly – it is a table. We don’t have very many “tables” when dealing with XEvents, but this one happens to be. As Microsoft Docs describes it, the table “contains one row for each Extended Events event that is mapped to a SQL Trace event class.”

Click through for a script which shows how to map them, as well as a couple interesting points.


Regex In SSMS

Jen McCown goes through a good example of where regular expressions can solve a DBA’s boring and repetitive task:

What I needed was to be able to check each one of those tables, and see if any of the should-be-deleted rows still exist. Of course, I can certainly type this all out by hand…

IF EXISTS (SELECT * FROM dbo.Table1 WHERE val = 1) SELECT 'Table1 has undeleted rows!' as TableName;
IF EXISTS (SELECT * FROM dbo.Table2 WHERE val = 1) SELECT 'Table2 has undeleted rows!' as TableName;

*Sigh*. I’m already bored. And in reality, I had 45 of these statements, not 15. I’m not a big fan of repetitive, monkey-button work.

SSMS regular expressions are great.  I got to see a version of Sean’s linked talk in person and it helped things click in my mind.
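The same find-and-replace idea can be sketched outside of SSMS. The example below uses Python's re module, not the SSMS regular expression dialect (which is not shown in the excerpt), to turn a list of table names into the repetitive IF EXISTS checks quoted above. The function name and the assumption that every table lives in the dbo schema with a val column are illustrative only.

```python
import re

# Match a whole line consisting of one table name and capture it as group 1.
LINE = re.compile(r"^(\w+)$", flags=re.MULTILINE)

# Replacement template: \1 is the captured table name, used twice.
TEMPLATE = (
    r"IF EXISTS (SELECT * FROM dbo.\1 WHERE val = 1) "
    r"SELECT '\1 has undeleted rows!' as TableName;"
)


def build_checks(table_names):
    """Expand each table name into one IF EXISTS check via a single regex replace."""
    text = "\n".join(table_names)
    return LINE.sub(TEMPLATE, text).splitlines()


for statement in build_checks(["Table1", "Table2", "Table3"]):
    print(statement)
```

Names containing characters outside letters, digits, and underscores would not match the pattern and would pass through unchanged, so the pattern would need adjusting for bracketed or schema-qualified names.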


Design Concepts: Affordances

Meagan Longoria continues her design concepts series:

Affordances allow us to take advantage of common experiences and cultural standards to create immediate understanding of our designs. Examples of affordances in physical products include:

  • A coffee cup with a handle suggests that you should grip the cup by the handle.

  • Buttons on doorbells are for pushing.

  • The material and shape of balls suggest they are for throwing and bouncing.

Read the whole thing.  If you want to learn more about affordances, Don Norman’s The Design of Everyday Things is a great starting point.


Breeze: Mathematics In Scala

Nitin Aggarwal introduces the mathematics library behind Spark’s machine learning library, MLlib:

In simple terms, Breeze is a Scala library that extends the Scala collection library to provide support for vectors and matrices in addition to providing a whole bunch of functions that support their manipulation. We could safely compare Breeze to NumPy in Python terms. Breeze forms the foundation of MLlib—the Machine Learning library in Spark.

Breeze comprises four libraries:

  • breeze-math: Numerics and Linear Algebra. Fast linear algebra backed by native libraries (via JBlas) where appropriate.

  • breeze-process: Tools for tokenizing, processing, and massaging data, especially textual data. Includes stemmers, tokenizers, and stop word filtering, among other features.

  • breeze-learn: Optimization and Machine Learning. Contains state-of-the-art routines for convex optimization, sampling distributions, several classifiers, and DSLs for Linear Programming and Belief Propagation.

  • breeze-viz: (Very alpha) Basic support for plotting, using JFreeChart.

Read on for samples and basic usage.
