Press "Enter" to skip to content

February 2, 2018

Dealing With Dates In R

Mathew McLean shows how to convert strings to dates using a couple of well-known packages and introduces flipTime:

The package flipTime provides utilities for working with time series and date-time data. The package can be installed from GitHub using

require(devtools)
install_github("Displayr/flipTime")

I will discuss only two functions from the package in this post, AsDate() and AsDateTime(). These are used for the conversion of date and date-time strings, respectively. These functions build on the convenience and speed of the lubridate package, while providing additional functionality that makes them easier to use. The functions are smart about identifying the proper format to use, so the user doesn’t need to specify the format(s) as inputs. At the same time, both AsDate() and AsDateTime() are careful not to convert any strings to dates when they are not formatted as dates. They will also warn the user when the dates are in an ambiguous format.
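
For instance, a minimal sketch of the two functions in action (the sample date strings are my own, and the exact warning behavior may differ by package version):

library(flipTime)

# Formats are detected automatically; no format string is supplied.
AsDate("1 February 2018")
AsDate("2018-02-01")

# Date-time strings go through AsDateTime() instead.
AsDateTime("2018-02-01 09:30:00")

# An ambiguous string such as "1/2/2018" (January 2 or February 1?)
# should produce a warning rather than a silent guess.
AsDate("1/2/2018")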

Check it out.

ARIMA In R

Subhasree Chatterjee shows us how to use R to implement an ARIMA model:

Once the data is ready and satisfies all the assumptions of modeling, we need three variables to determine the order of the model to be fitted to the data: p, d, and q, which are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model, respectively.

To examine which p and q values will be appropriate, we need to run the acf() and pacf() functions.

pacf() at lag k gives the partial autocorrelation function, which describes the correlation between all data points that are exactly k steps apart, after accounting for their correlation with the data between those k steps. It helps to identify the number of autoregressive (AR) coefficients (the value of p) in an ARIMA model.
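
As a minimal sketch of that workflow in base R (the simulated series and the final (1, 0, 1) order are mine, chosen purely for illustration):

# Simulate a series with known AR and MA components.
set.seed(42)
x <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 200)

# Inspect the correlograms: the pacf() cut-off suggests p,
# while the acf() cut-off suggests q.
acf(x)
pacf(x)

# Fit the chosen ARIMA(p, d, q) model, here (1, 0, 1);
# d = 0 because this simulated series is already stationary.
fit <- arima(x, order = c(1, 0, 1))
fit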

ARIMA feels like it should be too simple to work, but it does.

Logs Are For Parsing

Tim Wilde shares an oft-forgotten truth:

How often have you found yourself contemplating some harebrained regex scheme in order to extract an inkling of value from a string and wishing the data had just arrived in a well-structured package without all the textual fluff?

So why do we insist on writing prose in our logs? Take “Exception while processing order 1234 for customer abc123” for example. There are at least four important pieces of information drowning in that one sentence alone:

  1. An exception was raised!
  2. During order processing
  3. Order number 1234
  4. Customer abc123

Being an exception log message, it’s more than likely followed by a stack trace, too. And stack traces certainly don’t conform to carefully crafted log layout patterns.
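
To make the contrast concrete, here is a small sketch in R (the field names and JSON shape are my own invention) of emitting that same event as a structured, machine-parseable record instead of prose:

library(jsonlite)

# Each fact buried in the prose sentence becomes its own queryable field.
log_event <- list(
  level    = "ERROR",
  event    = "OrderProcessingFailed",
  orderId  = 1234,
  customer = "abc123"
)

cat(toJSON(log_event, auto_unbox = TRUE), "\n")
# {"level":"ERROR","event":"OrderProcessingFailed","orderId":1234,"customer":"abc123"}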

Logging is something we tend to forget about and slap in at the last minute.  We also think about it from the viewpoint of a developer looking at a single error message.  Those are both mistakes that lead to a huge amount of extra work later.

KSQL 0.4 Released

Apurva Mehta announces the release of KSQL 0.4:

The SHOW TOPICS command has been enhanced to include the number of active consumers and also the number of active consumer groups which are reading the topics.

Consumer groups are a feature of Apache Kafka which enables multiple consumer processes to divide the work of consuming a Kafka topic. You can learn more about them in the Kafka Consumer JavaDocs, and of course you should read the SHOW TOPICS documentation for more information.

Read on for the full set of changes.

Understanding Kafka Consumers And Offsets

Simarpreet Kaur Monga builds a simple Kafka consumer in Scala to demonstrate how offsets work:

The method endOffsets accepts a collection of TopicPartitions for which you want to find the end offsets.

As I want to find the end offsets of the partitions that are assigned to my topic, I have passed the value of consumer.assignment() as the parameter of endOffsets. consumer.assignment() gives the set of TopicPartitions that the consumer has been assigned.

Note: You should call the method assignment only after calling poll on the consumer; otherwise, it will give null as the result. Additionally, the method endOffsets doesn’t change the position of the consumer, unlike seek methods, which do change the consumer position/offset.

Read the whole thing.

Displaying Power BI Filter Values

Reid Havens shows how to display Power BI report filter values on the report itself:

The client and I brainstormed, and we decided to create card visuals to identify filter selections. The beautiful thing about DAX, as mentioned in many articles on our site, is that it can easily return text values. I actually did something similar to this with dynamic titles when I posted about Power BI’s new Drill Through feature last year; that article can be found here. So I essentially wanted to do something similar here, but to call out the filter selection for each slicer. The end result looked something like this.

The end result was a new section of the report dedicated to calling out slicer selections. The BIGGEST reason the client wanted this was for screenshots: they often took screenshots of this report and pasted them into emails or slides to use in presentations. The result works well, and uses a bit of clever DAX to always return the right selections, no matter the combination of selections among the slicers.

Read on to see a couple odd scenarios that Reid ran into and how to fix them.

ggplot2 Scales And Coordinates

I continue my series on ggplot2:

The other thing I want to cover today is coordinate systems.  The ggplot2 documentation shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.

The special coordinate system I want to point out is coord_flip, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent data, I can create a bar chart, whereas before we’ve been looking at column charts.
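
A minimal sketch of that flip (the data values here are illustrative stand-ins, not the actual life expectancy figures):

library(ggplot2)

# Stand-in data for life expectancy by continent.
df <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  life_exp  = c(55, 74, 71, 78, 81)
)

# The familiar column chart...
p <- ggplot(df, aes(x = continent, y = life_exp)) +
  geom_col()

# ...becomes a bar chart just by flipping the coordinates.
p + coord_flip()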

There are a lot of pictures and more step-by-step work.  Most of these are still 3-4 lines of code, so again, pretty simple.

Configuring Disk Controllers In VMs

David Klee explains how you can get a performance improvement in high-I/O virtual machines:

Both Hyper-V’s and VMware’s default controllers emulate the LSI Logic SAS controller, because that’s what is built into Windows’ driver store, and it *just works* without having to do anything fancy.

But… it’s not necessarily there for speed. It’s there for compatibility so that you can boot up a VM without having to deal with extra drivers.

VMware created a driver a while back that comes with the VMware Tools package called the Paravirtual SCSI controller, and it gives a 10-30% bump in performance (depending on the speed of the underlying storage) because it’s built for speed from the beginning. It’s just not native to Windows, so I don’t personally feel comfortable using it for the C: drive controller unless required. You can change the controller type for these controllers, so we use it by default for non-OS SQL Server object drives.

Definitely worth the read.

GDPR In The UK

Ed Elliott covers that lesser-known Sex Pistols track in a multi-part series.

Part 1 covers some of the official documentation around how the ICO interprets GDPR:

To read the articles (the actual requirements), I would start at page 32, which begins “HAVE ADOPTED THIS REGULATION:”; this lists each of the articles. You can go through each of these and make sure you are compliant with them.

The exciting bit: the fines

The exciting, headline-grabbing parts of GDPR are the fines that can be enforced. We don’t yet know how the ICO will apply the fines; words like “maximum” are used, and the maximum possible fines are large. It is possible that the maximum fines will apply, but we will look in part 2 at previous ICO enforcement actions to see whether the ICO’s past performance gives us any clues as to its possible future decisions.

Part 2 looks at a couple of prior cases and how the ICO handled them:

Talk Talk started mitigating the issue by writing to all of its customers, telling them how to deal with scam calls. Talk Talk told the ICO what happened, and the ICO responded with its own investigation and a £100,000 fine. The reasons were:

– The system failed to have adequate controls over who could access which records; anyone could access any record, not just the cases they were working on
– The exports allowed all fields, not just the ones required for the regulatory reports
– Wipro were able to make wildcard searches
– The issue was long-running, lasting from 2004, when Wipro were given access, until 2014

One of the mitigating factors was that there was no evidence that this was even the source of the scam calls; additionally, there was no evidence that anyone suffered any damage or distress as a result of this incident.

Part 3 looks at a couple more cases, too.  And Ed promises part 4.
