Press "Enter" to skip to content

Author: Kevin Feasel

ggplot2 Scales And Coordinates

I continue my series on ggplot2:

The other thing I want to cover today is coordinate systems.  The ggplot2 documentation shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.

The special coordinate system I want to point out is coord_flip, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent data, I can create a bar chart, whereas before we’ve been looking at column charts.

There are a lot of pictures and more step-by-step work.  Most of these are still 3-4 lines of code, so again, pretty simple.
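For a rough feel of what the coordinate pieces do, here is a minimal sketch in Python via plotnine, which mirrors the ggplot2 grammar (the series itself uses R, and the life expectancy numbers below are made up rather than taken from the post):

# plotnine is a Python port of the ggplot2 grammar; this data frame is hypothetical.
import pandas as pd
from plotnine import ggplot, aes, geom_col, coord_flip, coord_cartesian

life_exp = pd.DataFrame({
    "continent": ["Africa", "Americas", "Asia", "Europe", "Oceania"],
    "life_expectancy": [61.0, 75.0, 72.0, 78.0, 81.0],
})

# A column chart: continents on the x axis, values as vertical columns.
columns = ggplot(life_exp, aes(x="continent", y="life_expectancy")) + geom_col()

# coord_flip() swaps the axes, turning the column chart into a bar chart.
bars = columns + coord_flip()

# coord_cartesian's xlim/ylim zoom the viewing area without discarding data.
zoomed = columns + coord_cartesian(ylim=(50, 85))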


Configuring Disk Controllers In VMs

David Klee explains how you can get a performance improvement in high-I/O virtual machines:

Both Hyper-V’s and VMware’s default controllers emulate the LSI Logic SAS controller, because that’s what is built into Windows’ driver store, and it *just works* without having to do anything fancy.

But… it’s not necessarily there for speed. It’s there for compatibility so that you can boot up a VM without having to deal with extra drivers.

VMware created a driver a while back that comes with the VMware Tools package called the Paravirtual SCSI controller, and it gives a 10-30% bump in performance (depending on the speed of the underlying storage) because it’s built for speed from the beginning. It’s just not native to Windows, so I don’t personally feel comfortable using it for the C: drive controller unless required. You can change the controller type for these controllers, so we use it by default for non-OS SQL Server object drives.

Definitely worth the read.


GDPR In The UK

Ed Elliott covers that lesser-known Sex Pistols track in a multi-part series.

Part 1 covers some of the official documentation around how the ICO interprets GDPR:

To read the articles and the actual requirements, I would start at page 32, which begins “HAVE ADOPTED THIS REGULATION:”; this lists each of the articles (requirements). You can go through each of these and make sure you are compliant with them.

The exciting bit, the fines

The exciting, headline-grabbing parts of GDPR are the fines that can be enforced. We don’t yet know how the ICO will apply the fines; words like “maximum” are used, and the maximum possible fines are large. It is possible that the maximum fines will apply, but we will look in part 2 at previous ICO enforcement actions to see if the ICO’s past performance gives us any clues as to its possible future decisions.

Part 2 looks at a couple of prior cases and how the ICO handled them:

Talk Talk started mitigating the issue by writing to all of its customers telling them how to deal with scam calls. Talk Talk told the ICO what happened and they responded with their own investigation and a £100,000 fine. The reasons were:

– The system failed to have adequate controls over who could access which records, i.e. anyone could access any record not just the cases they were working on
– The exports allowed all fields, not just the ones required for the regulatory reports
– Wipro were able to make wildcard searches
– The issue was long-running, from 2004 (when Wipro were given access) until 2014

One of the mitigating factors was that there was no evidence that this was even the source of the scam calls; there was also no evidence that anyone suffered any damage or distress as a result of this incident.

Part 3 looks at a couple more cases, too.  And Ed promises part 4.


Yarn Service Framework Coming

Jian He, et al, announce the Yarn Service Framework:

Apache Hadoop YARN is well known as the general resource-management platform for big-data applications such as MapReduce, Hive / Tez and Spark. It abstracts the complicated cluster resource management and scheduling from higher level applications and enables them to focus solely on their own application specific logic.

In addition to big-data apps, another broad spectrum of workloads we see today are long running services such as HBase, Hive/LLAP and container (e.g. Docker) based services. In the past year, the YARN community has been working hard to build first-class support for long running services on YARN.

This is going to ship with Hadoop 3.1.


PySpark DataFrame Transformations

Vincent-Philippe Lauzon shows how to perform data frame transformations using PySpark:

We wanted to look at some more Data Frames, with a bigger data set, more precisely some transformation techniques.  We often say that most of the leg work in machine learning is data cleansing.  Similarly, we can affirm that the clever & insightful aggregation query performed on a large dataset can only be executed after a considerable amount of work has gone into formatting, filtering & massaging data:  data wrangling.

Here, we’ll look at an interesting dataset, the H-1B Visa Petitions 2011-2016 (from Kaggle) and find some good insights with just a few queries, but also some data wrangling.

It is important to note that just about everything in this article isn’t specific to Azure Databricks and would work with any distribution of Apache Spark.

The notebook used for this article is persisted on GitHub.

Read on for explanation, or check out the notebook to work on it at your own pace.
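As a hedged sketch of that wrangle-then-aggregate flow in PySpark (the file path and column names below are assumptions about the Kaggle extract, not pulled from the notebook):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("h1b-wrangling").getOrCreate()

# Load the raw CSV; the path and schema inference are placeholders.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/data/h1b_kaggle.csv"))

# Data wrangling: keep certified petitions with a usable wage figure.
certified = (raw
             .filter(F.col("CASE_STATUS") == "CERTIFIED")
             .filter(F.col("PREVAILING_WAGE").isNotNull()))

# The insightful aggregation only pays off after the wrangling above:
# petition counts and an approximate median wage per employer per year.
summary = (certified
           .groupBy("EMPLOYER_NAME", "YEAR")
           .agg(F.count("*").alias("petitions"),
                F.expr("percentile_approx(PREVAILING_WAGE, 0.5)").alias("median_wage"))
           .orderBy(F.desc("petitions")))

summary.show(10, truncate=False)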


ggplot2 Mappings And Geoms

I continue my ggplot2 series:

We have used a new geom here, geom_smooth.  The geom_smooth function creates a smoothed conditional mean.  Basically, we’re drawing some line as a result of a function based on this input data.  Notice that there are two parameters that I set:  method and se.  The method parameter tells the function which method to use.  There are five methods available, including using a Generalized Additive Model (gam), Locally Weighted Scatterplot Smoothing (loess), and three varieties of Linear Models (lm, glm, and rlm).  The se parameter controls whether we want to see the standard error bar.

I don’t cover all of the mapping options and all of the geoms, but I think it’s enough to get a grip on the concept.
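To make the method and se parameters concrete, here is a small sketch in Python via plotnine again (the post’s code is R, and the data frame below is synthetic):

import numpy as np
import pandas as pd
from plotnine import ggplot, aes, geom_point, geom_smooth

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
df = pd.DataFrame({"x": x, "y": np.sin(x) + rng.normal(scale=0.3, size=x.size)})

base = ggplot(df, aes(x="x", y="y")) + geom_point(alpha=0.4)

# method chooses the smoother (a linear model here, a local smoother below);
# se=False suppresses the standard error band around the fitted line.
linear = base + geom_smooth(method="lm", se=False)
smoothed = base + geom_smooth(method="lowess", se=True)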


Using FreeTDS To Connect To SQL Server

Steph Locke embraces the pain of FreeTDS:

If you use SQL Server (or Azure SQL DB) as your data store and you need to connect to the database from shinyapps.io, you’re presently stuck with FreeTDS. If you have any control over your infrastructure, I cannot recommend highly enough the actual ODBC Driver on Linux for ease. Alas, shinyapps.io does not let you control the infrastructure. We have to make do with FreeTDS and it can be pretty painful to get right.

Given how obtuse the error messages you get back from FreeTDS in your shiny app are, and how long it takes to deploy an app, you might just want to cry a little. I know I did. Determined to succeed, here is my solution to getting a working database connection that you can also use to test you’re doing it right. If you’re on a particularly old version of SQL Server, though, I can’t guarantee this will work for you.

Read on for more.  I also have an older post on working with FreeTDS, though I ended up using TDS_Version = 8.0 instead of 7.4.
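If it helps to see the shape of such a connection string, here is a rough pyodbc version of the same idea (Steph works in R with the odbc package; the server, database, and credentials below are placeholders, and the driver name has to match your FreeTDS entry in odbcinst.ini):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={FreeTDS};"
    "SERVER=yourserver.database.windows.net;"
    "PORT=1433;"
    "DATABASE=yourdb;"
    "UID=youruser;"
    "PWD=yourpassword;"
    "TDS_Version=7.4;"   # the setting mentioned above; I ended up needing 8.0 in my older post
)

cur = conn.cursor()
cur.execute("SELECT @@VERSION;")   # quick sanity check that the handshake works
print(cur.fetchone()[0])
conn.close()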


Power BI Cumulative Totaling

Martin Schoombee runs into an interesting issue with cumulative totals in Power BI:

A common practice in the data warehousing world is to use a Date Key as the unique identifier in a date dimension. This attribute is usually a number in the format yyyymmdd. I’m not going to dive into all the reasons why it is used in data warehouse environments here, but (for fun) let’s change our data model to use the Date Key attribute in the relationship between the two tables.

If we look at our visualizations again, we see a very different picture. Sales by date still looks the same, but the sales by month seems a little out of whack (image below). If you had cumulative sales at any other aggregated level (quarter, year, etc.) it would also have been incorrect.

The answer is not immediately intuitive, so it’s good to know this ahead of time rather than have to struggle with it later.
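For anyone who hasn’t seen the convention, the Date Key is just an integer built from the date’s parts, so anything that implicitly treats it as a real date can behave differently.  A tiny illustration (not code from the post):

from datetime import date

def date_key(d: date) -> int:
    """Build a yyyymmdd-style surrogate key, e.g. 20180226."""
    return d.year * 10000 + d.month * 100 + d.day

print(date_key(date(2018, 2, 26)))   # 20180226 -- an integer, not a date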


Processing Tabular Models With Helpful Information

Ust Oldfield has a stored procedure which runs a SQL Agent job and provides notice when processing completes:

Recently, at a client, I was challenged to create a stored procedure that would process a tabular model. This stored procedure would then be executed from a web application. The process behind it being: a user enters data into a web application, which gets written to a database. That data then needs to be immediately surfaced up into reports, with additional calculations and measures along the way. Therefore the tabular model, which handles all the additional calculations and measures, needs to be processed by a user from the web application.

Click through for the script.
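The actual script is in the linked post; as a loose Python sketch of the kick-off-and-wait pattern described above (the connection details and job name are placeholders, while sp_start_job and sysjobactivity are standard SQL Agent objects):

import time
import pyodbc

conn = pyodbc.connect("DSN=sqlserver;UID=app_user;PWD=app_password", autocommit=True)
cur = conn.cursor()

# Start the Agent job that processes the tabular model.
cur.execute("EXEC msdb.dbo.sp_start_job @job_name = N'Process Tabular Model';")

# Poll until no instance of the job is still running, then notify the caller.
while True:
    cur.execute("""
        SELECT COUNT(*)
        FROM msdb.dbo.sysjobactivity AS ja
        JOIN msdb.dbo.sysjobs AS j ON j.job_id = ja.job_id
        WHERE j.name = N'Process Tabular Model'
          AND ja.start_execution_date IS NOT NULL
          AND ja.stop_execution_date IS NULL;
    """)
    if cur.fetchone()[0] == 0:
        break
    time.sleep(5)

print("Tabular model processing finished.")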


Ungrouped Results In Query Store

Erin Stellato explains why you might see two rows for the same query plan in Query Store’s run-time stats:

You can see that there are two rows for the same query_id and plan_id, but the count_executions is different, as are the avg_duration and avg_logical_io_reads values.  The data is not truly a duplicate.  This behavior occurs because the Query Store data is stored in memory before it is flushed to disk, and when you query the data SQL Server is pulling it from both locations (and doing a UNION ALL) and displaying it in the output.  If I waited a bit and ran the query again, the two rows for that interval would probably disappear – most likely because the in-memory data had been flushed to disk.

Read the whole thing.
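Since the apparent duplicates come from in-memory and on-disk data being unioned together, one way to report a single row per plan per interval is to group and weight the averages by execution count.  Here is a sketch against the standard Query Store catalog views (not Erin’s exact query), run through pyodbc with a placeholder DSN:

import pyodbc

conn = pyodbc.connect("DSN=sqlserver;Trusted_Connection=yes;")
cur = conn.cursor()

cur.execute("""
    SELECT p.query_id,
           rs.plan_id,
           rs.runtime_stats_interval_id,
           SUM(rs.count_executions) AS count_executions,
           SUM(rs.avg_duration * rs.count_executions)
               / SUM(rs.count_executions) AS avg_duration,
           SUM(rs.avg_logical_io_reads * rs.count_executions)
               / SUM(rs.count_executions) AS avg_logical_io_reads
    FROM sys.query_store_runtime_stats AS rs
    JOIN sys.query_store_plan AS p ON p.plan_id = rs.plan_id
    GROUP BY p.query_id, rs.plan_id, rs.runtime_stats_interval_id;
""")

for row in cur.fetchall():
    print(row)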
