A Simple Example With Spark And Kafka

Gary Dusbabek has a nice example showing how to build a simple application with Spark and Kafka:

This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this tutorial approachable. Still, there are a few prerequisites in terms of knowledge and tools.

The following tools will be used:

  • Git—to manage and clone source code

  • Docker—to run some services in containers

  • Java 8 (Oracle JDK)—programming language and a runtime (execution) environment used by Maven and Scala

  • Maven 3—to compile the code we write

  • Some kind of code editor or IDE—we used the community edition of IntelliJ while creating this tutorial

  • Scala—programming language that uses the Java runtime. All examples are written using Scala 2.12. Note: You do not need to download Scala.

The Hello World of streaming apps is a Twitter client.

Talking To Secure Hadoop Clusters

Mubashir Kazia shows how to connect to a secured Hadoop cluster using Active Directory:

The primary form of strong authentication used on a secure cluster is Kerberos. Kerberos supports credentials delegation where a server process to which a user has authenticated, can perform actions on behalf of the user. This involves the server process accessing databases or other web services as the authenticated user. Historically the form of delegation that was supported by Kerberos is now called “full delegation”. In this type of delegation, the Ticket Granting Ticket (TGT) of the user is made available to the server process and server can then authenticate to any service where the user has been granted authorization. Until recently most Kerberos Key Distribution Center(KDC)s other than Active Directory supported only this form of delegation. Also Java until Java 7 supported only this form of delegation. Starting with Java 8, Java now supports Kerberos constrained delegation (S4U2Proxy), where if the KDC supports it, it is possible to specify which particular services the server process can be delegated access to.

Hadoop within its security framework has implemented impersonation or proxy support that is independent of Kerberos delegation. With Hadoop impersonation support you can assign certain accounts proxy privileges where the proxy accounts can access Hadoop resources or run jobs on behalf of other users. We can restrict proxy privileges granted to a proxy account to act on behalf of only certain users who are members of certain groups and/or only for connections originating from certain hosts. However we can’t restrict the proxy privileges to only certain services within the cluster.

What we are discussing in this article is how to setup Kerberos constrained delegation and access a secure cluster. The example here involves Apache Tomcat, however you can easily extend this to other Java Application Servers.

This is a good article showing specific details on using Kerberos in applications connecting to Hadoop.

Power Query: Joining On Date Ranges

Reza Rad shows how to build merge joins between date ranges in Power Query:

Customer’s table has the history details of changes through the time. For example, the customer ID 2, has a track of change. John was living in Sydney for a period of time, then moved to Melbourne after that.

The problem we are trying to solve is to join these two tables based on their customer ID, and find out the City related to that for that specific period of time. We have to check the Date field from Sales Table to fit into FromDate and ToDate of the Customer table.

This is a common type 2 SCD scenario.  I’d be concerned that this solution would not work with large data sets which may already be pushing the size limits of the Vertipaq engine.

K Nearest Cliques

Vincent Granville explains an algorithm built around finding cliques of data points:

The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions.) In the most basic version, we have one clique for each cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points from the cluster in question. If the clusters are well separated, we can even use p = 1. We define the density of a clique as the number of points per unit area. In general, we want to build cliques with high density.

Ideally, we want each cluster in the training set to be covered by a small number of (possibly slightly overlapping) cliques, each one having a high density.  Also, as a general rule, a training set point can only belong to one clique, and (ideally) to only one cluster. But the circles associated with two cliques are allowed to overlap.

It’s an interesting approach, and I can see how it’d be faster than K Nearest Neighbors, but I do wonder how accurate the results would be in comparison to KNN.

Your Reminder Not To MERGE

Kevin Wilkie points out the numerous problems with the MERGE operator:

Now, when I last posted, I’m sure you thought I was done talking about the MERGE statement. You are so wrong, compadre! One more post is absolutely needed!

There are a few issues with the MERGE statement. Well, as of this writing, there are 361 possible issues according to Microsoft Connect – the actual website where Microsoft checks to see what issues exist!

So, if you want to use the MERGE statement, please read through every issue listed on the link above and make sure that none of those scenarios could exist for you. If they don’t, great. Knock yourself out and use it.

But wait, there’s more!  Read on to see what else could be a problem.

OS Threads DMV

Ewald Cress moves up the internals stack a little further and looks at a DMV:

Broadly speaking, a DMV presents just another iterator that can be plugged into a query plan. The execution engine calls GetRow() repeatedly until it reaches the end, and the iterator emits rows. The only unusual thing is that the ultimate source of the data may have nothing to do with the storage engine.

Now if you asked me to guess where in the world we’d find a list of all threads to iterate over, I would have expected that we’d start with the NodeManager, iterating over all SOS_Nodes, and then for each of them iterating over its collection of associated SystemThreads. After all, we have a guaranteed 1:1 correspondence between threads and SystemThreads, and I figured that all SystemThreads enlist themselves into a parent SOS_Node upon creation. No-brainer, right?

Turns out that this guess would have been completely wrong, and the reason it would have been a broken implementation will become apparent when we look at the started_by_sqlservr column.

Definitely read the whole thing, but you may need to top off your caffeinated beverage of choice first.  Also, I’m pretty sure Ewald promised to hammer one out once per week; that’s how I read it, at least…

One CLR Solution

Solomon Rutzky continues his SQL Server 2017 CLR security series:

This new requirement prevents the technique described towards the end of Part 1 from working. That technique uses a SAFE Assembly as an indirect means of creating the Asymmetric Key to create the Login from. That worked perfectly prior to SQL Server 2017, but now even SAFE Assemblies require that the signature-based Login be created first, which now puts us in a whole chicken-egg paradox.

Before proceeding to the solution, it should be noted that yes, Microsoft has, as of RC2 (released on 2017-08-02), provided a kinda/sorta “fix” for this that allows for creating an Assembly without having the signature-based Login. HOWEVER, that “fix” is absolutely horrible, convoluted, and unnecessary. It should not be used by anyone. Ever! In fact, it should be completely removed and forgotten about. In no uncertain terms: it is not an option! To help clarify, I am being intentionally vague about that new feature here (and in Part 1) so as not to distract from these two solutions (this post and Part 3) that do not promote bad practices; it will be covered starting in Part 4.

Solomon outlines one approach to dealing with CLR security changes, though it’s a bit lengthy.

Availability Group DMVs

Adrian Buckman shows off a few Availability Group DMVs:

Whilst I was combing through various combinations of sys.dm_hadr and sys.Availability I stumbled across a couple of gems that I thought would be nice to share.

  • sys.dm_hadr_cluster

  • sys.dm_hadr_cluster_members

  • sys.dm_hadr_cluster_networks

Read on for usage examples.

Building An Image Recognizer With R

David Smith has a post showing how to build an image recognizer with R and Microsoft’s Cognitive Services Library:

The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates “features” from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these “features”, you could use them in a traditional machine learning model to classify the images, or perform other recognition tasks.

But if you don’t have millions of images, it’s still possible to generate these features from a model that has already been trained on millions of images. ResNet is a very deep neural network model trained for the task of image recognition which has been used to win major computer-vision competitions. With the rxFeaturize function in Microsoft R Client and Microsoft R Server, you can generate 4096 features from this model on any image you provide. The features themselves are meaningful only to a computer, but that vector of 4096 numbers between zero and one is (ideally) a distillation of the unique characteristics of that image as a human would recognize it. You can then use that features vector to create your own image-recognition system without the burden of training your own neural network on a large corpus of images.

Read the whole thing and follow David’s link to the Microsoft Cognitive blog for more details.

MERGE In Hive

Carter Shanklin introduces the MERGE operator in Hive:

USE CASE 2: UPDATE HIVE PARTITIONS.

A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a 3rd-party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, suddenly records are in the wrong partition and need to be cleaned up.

It has been interesting to see Hive morph over the past few years from a batch warehousing system to something approaching a relational warehouse.  This is one additional step in that direction.

Categories

August 2017
MTWTFSS
« Jul  
 123456
78910111213
14151617181920
21222324252627
28293031