Press "Enter" to skip to content

Month: February 2018

KSQL 0.4 Released

Apurva Mehta announces the release of KSQL 0.4:

The SHOW TOPICS command has been enhanced to include the number of active consumers and also the number of active consumer groups which are reading the topics.

Consumer groups are a feature of Apache Kafka which enables multiple consumer processes to divide the work of consuming a Kafka topic. You can learn more about them in the Kafka Consumer JavaDocs, and of course you should read the SHOW TOPICS documentation for more information.
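To make the consumer group idea concrete, here is a minimal sketch using the kafka-python client (not from the announcement; the broker address, topic, and group names are placeholders): start two copies of this script with the same group_id and the group coordinator divides the topic's partitions between them.

```python
from kafka import KafkaConsumer

# Placeholder broker, topic, and group names, purely for illustration.
consumer = KafkaConsumer(
    "pageviews",                         # topic to consume
    bootstrap_servers="localhost:9092",
    group_id="pageviews-readers",        # consumers sharing this id split the work
    auto_offset_reset="earliest",
)

# Run this script twice: each process gets a disjoint subset
# of the topic's partitions.
for message in consumer:
    print(message.partition, message.offset, message.value)
```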

Read on for the full set of changes.


Understanding Kafka Consumers And Offsets

Simarpreet Kaur Monga builds a simple Kafka consumer in Scala to demonstrate how offsets work:

The method endOffsets accepts a collection of TopicPartition, for which you want to find the endOffsets.

As I want to find the endOffsets of the partitions that are assigned to my topic, I have passed consumer.assignment() as the argument to endOffsets. consumer.assignment() gives the set of TopicPartitions that the consumer has been assigned.

Note: You should call the method assignment only after calling poll on the consumer; otherwise, it will give null as the result. Additionally, the method endOffsets doesn’t change the position of the consumer, unlike seek methods, which do change the consumer position/offset.
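Here is roughly the same sequence in Python with the kafka-python client — a hedged translation of the Scala example, with placeholder broker and topic names:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pageviews",                        # placeholder topic name
    bootstrap_servers="localhost:9092",
    group_id="offset-demo",
)

# Poll first: assignment() returns an empty set until the group
# coordinator has actually handed this consumer its partitions.
consumer.poll(timeout_ms=5000)

assigned = consumer.assignment()              # set of TopicPartition
end_offsets = consumer.end_offsets(assigned)  # {TopicPartition: end offset}

for tp, end in end_offsets.items():
    # position() reports where this consumer will read next;
    # end_offsets() does not move it, unlike seek().
    print(tp.partition, consumer.position(tp), end)
```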

Read the whole thing.


Displaying Power BI Filter Values

Reid Havens shows how to display Power BI report filter values on the report itself:

The client and I brainstormed, and we decided to create card visuals to identify filter selections. The beautiful thing about DAX, as mentioned in many articles on our site, is that it can easily return text values. I actually did something similar to this with Dynamic Titles when I posted about Power BI’s new Drill Through feature last year; that article can be found here. So I essentially wanted to do something similar here, but to call out the filter selection for each slicer. The end result looked something like this.

The end result was a new section of the report, dedicated to calling out slicer selections. The BIGGEST reason the client wanted this was for screenshots. They often took screenshots of this report and pasted them into emails or slides to use in presentations. The result works well, and uses a bit of clever DAX to always return the right selections, no matter the combination of selections among the slicers.

Read on to see a couple of odd scenarios that Reid ran into and how to fix them.


ggplot2 Scales And Coordinates

I continue my series on ggplot2:

The other thing I want to cover today is coordinate systems.  The ggplot2 documentation shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.

The special coordinate system I want to point out is coord_flip, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent data, I can create a bar chart, whereas before we’d been looking at column charts.
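For a taste of the flip outside R, here is a sketch of the same grammar in Python via plotnine, a port of ggplot2 — the series itself is in R, and I'm assuming the gapminder life expectancy data it draws on:

```python
from gapminder import gapminder   # life expectancy data, as in the series
from plotnine import ggplot, aes, geom_col, coord_flip, labs

# Mean life expectancy by continent for the most recent year.
latest = gapminder[gapminder["year"] == gapminder["year"].max()]
by_continent = latest.groupby("continent", as_index=False)["lifeExp"].mean()

# The same plot spec; coord_flip() turns the columns into bars.
plot = (
    ggplot(by_continent, aes(x="continent", y="lifeExp"))
    + geom_col()
    + coord_flip()
    + labs(x="Continent", y="Mean life expectancy")
)
plot.draw()   # or plot.save("bars.png")
```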

There are a lot of pictures and more step-by-step work.  Most of these are still 3-4 lines of code, so again, pretty simple.


Configuring Disk Controllers In VMs

David Klee explains how you can get a performance improvement in high-I/O virtual machines:

Both Hyper-V’s and VMware’s default controllers emulate the LSI Logic SAS controller, because its driver is built into Windows, and it *just works* without having to do anything fancy.

But… it’s not necessarily there for speed. It’s there for compatibility so that you can boot up a VM without having to deal with extra drivers.

VMware created a driver a while back that comes with the VMware Tools package called the Paravirtual SCSI controller, and it gives a 10-30% bump in performance (depending on the speed of the underlying storage) because it’s built for speed from the beginning. It’s just not native to Windows, so I don’t personally feel comfortable using it for the C: drive controller unless required. You can change the controller type for these controllers, so we use it by default for non-OS SQL Server object drives.

Definitely worth the read.


GDPR In The UK

Ed Elliott covers that lesser-known Sex Pistols track in a multi-part series.

Part 1 covers some of the official documentation around how the ICO interprets GDPR:

To read the article and the actual requirements, I would start at page 32, which begins “HAVE ADOPTED THIS REGULATION:”; this lists each of the articles (requirements). You can go through each of these and make sure you are compliant with them.

The exciting bit, the fines

The exciting, headline-grabbing parts of GDPR are the fines that can be enforced. We don’t yet know how the ICO will apply the fines; words like “maximum” are used, and the maximum possible fines are large. It is possible that the maximum fines will apply, but we will look in part 2 at previous ICO enforcement actions to see if the ICO’s past performance gives us any clues as to its possible future decisions.

Part 2 looks at a couple of prior cases and how the ICO handled them:

Talk Talk started mitigating the issue by writing to all of its customers telling them how to deal with scam calls. Talk Talk told the ICO what happened, and the ICO responded with its own investigation and a £100,000 fine. The reasons were:

– The system failed to have adequate controls over who could access which records, i.e., anyone could access any record, not just the cases they were working on
– The exports allowed all fields, not just the ones required for the regulatory reports
– Wipro were able to make wildcard searches
– The issue was long-running, from 2004, when Wipro were given access, until 2014

One of the mitigating factors was that there was no evidence that this was even the source of the scam calls, and there was no evidence that anyone suffered any damage or distress as a result of this incident.

Part 3 looks at a couple more cases, too.  And Ed promises part 4.


Yarn Service Framework Coming

Jian He, et al, announce the Yarn Service Framework:

Apache Hadoop YARN is well known as the general resource-management platform for big-data applications such as MapReduce, Hive / Tez and Spark. It abstracts the complicated cluster resource management and scheduling from higher level applications and enables them to focus solely on their own application specific logic.

In addition to big-data apps, another broad spectrum of workloads we see today is long-running services such as HBase, Hive/LLAP, and container-based (e.g., Docker) services. In the past year, the YARN community has been working hard to build first-class support for long-running services on YARN.

This is going to ship with Hadoop 3.1.


PySpark DataFrame Transformations

Vincent-Philippe Lauzon shows how to perform data frame transformations using PySpark:

We wanted to look at some more Data Frames with a bigger data set, more precisely some transformation techniques.  We often say that most of the legwork in machine learning is data cleansing.  Similarly, we can affirm that a clever & insightful aggregation query on a large dataset can only be executed after a considerable amount of work has gone into formatting, filtering, & massaging the data:  data wrangling.

Here, we’ll look at an interesting dataset, the H-1B Visa Petitions 2011-2016 (from Kaggle) and find some good insights with just a few queries, but also some data wrangling.

It is important to note that just about everything in this article isn’t specific to Azure Databricks and would work with any distribution of Apache Spark.

The notebook used for this article is persisted on GitHub.
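To give a flavor of the wrangle-then-aggregate pattern the article walks through, here is a minimal PySpark sketch; the file path is a placeholder and the column names follow the Kaggle H-1B dataset, so treat both as assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("h1b-demo").getOrCreate()

# Kaggle's H-1B petitions CSV; the path is a placeholder.
df = spark.read.csv("/data/h1b_kaggle.csv", header=True, inferSchema=True)

# Wrangling first: keep certified petitions with a known wage.
certified = (
    df.where(F.col("CASE_STATUS") == "CERTIFIED")
      .where(F.col("PREVAILING_WAGE").isNotNull())
)

# Then the aggregation: top employers by petition count and average wage.
top_employers = (
    certified.groupBy("EMPLOYER_NAME")
             .agg(F.count("*").alias("petitions"),
                  F.avg("PREVAILING_WAGE").alias("avg_wage"))
             .orderBy(F.desc("petitions"))
             .limit(10)
)
top_employers.show(truncate=False)
```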

Read on for explanation, or check out the notebook to work on it at your own pace.


ggplot2 Mappings And Geoms

I continue my ggplot2 series:

We have used a new geom here, geom_smooth.  The geom_smooth function creates a smoothed conditional mean.  Basically, we’re drawing some line as a result of a function based on this input data.  Notice that there are two parameters that I set:  method and se.  The method parameter tells the function which method to use.  There are five methods available, including using a Generalized Additive Model (gam), Locally Weighted Scatterplot Smoothing (loess), and three varieties of Linear Models (lm, glm, and rlm).  The se parameter controls whether we want to see the standard error bar.
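As a rough Python equivalent, plotnine exposes the same geom_smooth with the same method and se parameters (a sketch, again assuming the gapminder data; the series itself uses R):

```python
import numpy as np
from gapminder import gapminder
from plotnine import ggplot, aes, geom_point, geom_smooth

# Scatter log GDP per capita against life expectancy, then overlay a
# linear-model smoother; se=True draws the standard error ribbon.
df = gapminder.assign(log_gdp=np.log(gapminder["gdpPercap"]))

plot = (
    ggplot(df, aes(x="log_gdp", y="lifeExp"))
    + geom_point(alpha=0.3)
    + geom_smooth(method="lm", se=True)
)
plot.draw()
```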

I don’t cover all of the mapping options and all of the geoms, but I think it’s enough to get a grip on the concept.


Using FreeTDS To Connect To SQL Server

Steph Locke embraces the pain of FreeTDS:

If you use SQL Server (or Azure SQL DB) as your data store and you need to connect to the database from shinyapps.io, you’re presently stuck with FreeTDS. If you have any control over infrastructure, I cannot recommend highly enough the actual ODBC Driver on Linux for ease. Alas, shinyapps.io does not let you control the infrastructure, so we have to make do with FreeTDS, and it can be pretty painful to get right.

Given how obtuse the error messages you get back from FreeTDS in your shiny app are, and how long it takes to deploy an app, you might just want to cry a little. I know I did. Determined to succeed, here is my solution for getting a working database connection, which you can also use to test that you’re doing it right. If you’re on a particularly old version of SQL Server, though, I can’t guarantee this will work for you.

Read on for more.  I also have an older post on working with FreeTDS, though I ended up using TDS_Version = 8.0 instead of 7.4.
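If you want to experiment with the same connection-string knobs outside R, here is a hedged pyodbc sketch — server, database, and credentials are placeholders, and it assumes the FreeTDS ODBC driver is registered under the name FreeTDS:

```python
import pyodbc

# DSN-less connection through the FreeTDS ODBC driver.
conn = pyodbc.connect(
    "DRIVER={FreeTDS};"
    "SERVER=myserver.example.com;"   # placeholder
    "PORT=1433;"
    "DATABASE=mydb;"                 # placeholder
    "UID=myuser;"                    # placeholder
    "PWD=mypassword;"                # placeholder
    "TDS_Version=7.4;"               # try 8.0 if 7.4 misbehaves
)

cursor = conn.cursor()
cursor.execute("SELECT @@VERSION;")
print(cursor.fetchone()[0])
conn.close()
```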
