Press "Enter" to skip to content

Day: November 6, 2018

Exploratory Data Analysis In R

Laura Ellis walks us through some easy techniques for learning about our data using R:

DIM AND GLIMPSE

Next, we will run the dim function which displays the dimensions of the table. The output takes the form of row, column.

And then we run the glimpse function from the dplyr package. This will display a vertical preview of the dataset. It allows us to easily preview data type and sample data.

Spending some quality time doing EDA can save you in the long run, as it can help you get a feel for things like data quality, the distributions of variables, and completeness of data.

Comments closed

Using Datadog To Monitor Spark Clusters On EMR

Priya Matpadi walks us through one way to monitor Spark clusters on Amazon ElasticMapReduce:

We recently implemented a Spark streaming application, which consumes data from from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.

As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.

In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.

Check it out.  If this is interesting, Priya’s blog has the full series.

Comments closed

Pivoting With Spark SQL

MaryAnn Xue shows us how to use the PIVOT operator in Spark SQL:

Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.

The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as well. In this blog, using temperatures recordings in Seattle, we’ll show how we can use this common SQL Pivot feature to achieve complex data transformations.

The syntax is quite similar to the PIVOT syntax that SQL Server uses.

Comments closed

Showing Forecasts With Actuals In Power BI

Alberto Ferrari shows us how we can incorporate actuals and forecasted values in the same Power BI visuals:

The Forecast measure in the demo model is quite an advanced piece of DAX code that would require a full article by itself. The curious reader will find more information on how to reallocate budget at different granularities in the video Budgeting with Power BI. In this article, we use the Forecast measure without detailed explanations; our goal is to explain how to compute the next measure: Remaining Forecast.

The Remaining Forecast measure must analyze the Sales table, finding the last day for which there are sales, and only then computing the forecasts.

Read the whole thing.

Comments closed