Kerberos Authentication In Apache Cassandra

Justin Cameron announces an open source Kerberos authenticator in Apache Cassandra:

In conjunction with the Cassandra authenticator, we have also published an open-source Kerberos authenticator plugin for the Cassandra Java driver.

The plugin supports multiple Kerberos quality of protection (QOP) levels, which may be specified directly when configuring the authenticator. The driver’s QOP level must match the QOP level configured for the server authenticator, and is only used during the authentication exchange. If confidentiality and/or integrity protection is required for all traffic between the client and Cassandra, it is recommended that Cassandra’s built-in SSL/TLS be used (note that TLS also protects the Kerberos authentication exchange, when enabled).

An (optional) SASL authorization ID is also supported. If provided, it specifies a Cassandra role that will be assumed once the Kerberos client principal has authenticated, provided the Cassandra user represented by the client principal has been granted permission to assume the role. Access to other roles may be granted using the GRANT ROLE CQL statement.

Click through for more details and check out the GitHub repo.

Azure Databricks Geospatial Analysis

Jose Mendes gives us an example of using Azure Databricks to perform geospatial analysis:

Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries (further details here).

Although people mentioned in their GitHub page that the 1.0.5 Magellan library is available for Apache Spark 2.3+ clusters, I learned through a very difficult process that the only way to make it work in Azure Databricks is if you have an Apache Spark 2.2.1 cluster with Scala 2.11. The cluster I used for this experience consisted of a Standard_DS3_v2 driver type with 14GB Memory, 4 Cores and auto scaling enabled.

In terms of datasets, I used the NYC Taxicab dataset to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON dataset to extract the polygons. Both datasets were stored in a blob storage and added to Azure Databricks as a mount point.

It sounds like this is much faster than using U-SQL to perform the same task.

Looking At The Robin Hood Caching Algorithm

Adrian Colyer reviews a paper on a multi-system caching algorithm:

The thing about this common pattern is that we need to wait for all of these back-end requests to complete before returning to the user. So improving the average latency of these requests doesn’t help us one little bit.

Since each request must wait for all of its queries to complete, the overall request latency is defined to be the latency of the request’s slowest query. Even if almost all backends have low tail latencies, the tail latency of the maximum of several queries could be high.

(See ‘The Tail at Scale’).

The user can easily see P99 latency or greater.

Techniques to mitigate tail latencies include making redundant requests, clever use of scheduling, auto-scaling and capacity provisioning, and approximate computing. Robin Hood takes a different (complementary) approach: use the cache to improve tail latency!

Robin Hood doesn’t necessarily allocate caching resources to the most popular back-ends, instead, it allocates caching resources to the backends (currently) responsible for the highest tail latency.

This is a great review of an interesting algorithm.

Data Modeling In Cassandra

Charmy Garg walks us through some of the basics of modeling tables in Cassandra:

Two basic goals in Cassandra which we should keep in mind:

  • Spread data evenly around the cluster – You want every node in the cluster to have roughly the same amount of data. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key.

  • Minimize the number of partitions read – Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible. Why is this important? [Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.]

Charmy also has a couple of pitfalls that people used to the relational database model may hit.

Voice Control For Shiny Apps

Over at Jumping Rivers, an example of using a Javascript library to control a page using voice commands:

I have found that performance across all devices and browsers is definitely not equal. By far the best browser I have found for viewing the apps is Google Chrome. I have also tended to find that my Ubuntu machines don’t do as well as Microsoft machines in picking up words correctly. A chat I had with someone recently suggested this might be down to drivers under Ubuntu for the microphones but that is not my area of expertise. Voice recognition was also fine on both of my Blackberry phones (one running BB OS 10, the other running Android 7).

It is worth noting that this does require an internet connection to function, in Chrome the voice to text is performed in the cloud.

The other thing I have noticed is that annyang seems relatively sensitive to background noise. This isn’t so bad for functions called using specific phrases but does sometimes have a large effect on the multi-word splats. This is because the splats are greedy and the background noise makes the recognition engine think that you are still talking long after you finished which gives the appearance of the application hanging.

The solution is by no means perfect, but it does look quite interesting.

Reading Excel Files In An Office-less World

Bill Fellows shows us how to read from an Excel file on a machine without Microsoft Office installed:

A common problem working with Excel data is Excel itself. Working with it programatically requires an installation of Office, and the resulting license cost, and once everything is set, you’re still working with COM objects which present its own set of challenges. If only there was a better way.

Enter, the better way – EPPlus. This is an open source library that wraps the OpenXml library which allows you to simply reference a DLL. No more installation hassles, no more licensing (LGPL) expense, just a simple reference you can package with your solutions.

Let’s look at an example.

Read on for the example.  A couple alternatives I like are readxl and XLConnect in R.

Building A Basic Kafka Producer

M. Mallikarjun shows us a simple producer in Kafka:

A Kafka producer is an application that can act as a source of data in a Kafka cluster. A producer can publish messages to one or more Kafka topics.

So, how many ways are there to implement a Kafka producer? Well, there are a lot! But in this article, we shall walk you through two ways.

  1. Kafka Command Line Tools
  2. Kafka Producer Java API

You can write producers in quite a few languages.  Java is the example here, but there are several libraries, including a good one for .NET.

Writing Higher-Order Functions With Scala

Jyoti Sachdeva explains the concept of higher-order functions and shares an example in Scala:

In this blog, I’m going to explain higher-order functions.

A higher order function takes other function as a parameter or return a function as a result.

This is possible because functions are first-class value in scala. What does that mean?

It means that functions can be passed as arguments to other functions and functions can return other function.

The map function is a classic example of a higher order function.

Higher-order functions are one of the key components to functional programming and allows us to reason in small chunks at a time

What’s New With Machine Learning Services

Niels Berglund looks at SQL Server 2019’s Machine Learning Services offering for updates:

So, when I read What’s new in SQL Server 2019, I came across a lot of interesting “stuff”, but one thing that stood out was Java language programmability extensions. In essence, it allows us to execute Java code in SQL Server by using a pre-built Java language extension! The way it works is as with R and Python; the code executes outside of the SQL Server engine, and you use sp_execute_external_script as the entry-point.

I haven’t had time to execute any Java code as of yet, but in the coming days, I definitely will drill into this. Something I noticed is that the architecture for SQL Server Machine Learning Services has changed (or had additions to it).

That Java support is for Spark, I’d imagine.  And I hope they allow for Scala.

Hadoop + SQL Server In 2019

Travis Wright shows off a big part of what the SQL Server team has been working on the last couple of years:

SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.

SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.

Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.

Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.

I’ve wanted Spark integration ever since 2016 and we’re going to get it.


November 2018
« Oct