A Simple Example With Spark And Kafka

Gary Dusbabek has a nice example showing how to build a simple application with Spark and Kafka:

This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this tutorial approachable. Still, there are a few prerequisites in terms of knowledge and tools.

The following tools will be used:

  • Git—to manage and clone source code

  • Docker—to run some services in containers

  • Java 8 (Oracle JDK)—programming language and a runtime (execution) environment used by Maven and Scala

  • Maven 3—to compile the code we write

  • Some kind of code editor or IDE—we used the community edition of IntelliJ while creating this tutorial

  • Scala—programming language that uses the Java runtime. All examples are written using Scala 2.12. Note: You do not need to download Scala.

The Hello World of streaming apps is a Twitter client.

Talking To Secure Hadoop Clusters

Mubashir Kazia shows how to connect to a secured Hadoop cluster using Active Directory:

The primary form of strong authentication used on a secure cluster is Kerberos. Kerberos supports credentials delegation where a server process to which a user has authenticated, can perform actions on behalf of the user. This involves the server process accessing databases or other web services as the authenticated user. Historically the form of delegation that was supported by Kerberos is now called “full delegation”. In this type of delegation, the Ticket Granting Ticket (TGT) of the user is made available to the server process and server can then authenticate to any service where the user has been granted authorization. Until recently most Kerberos Key Distribution Center(KDC)s other than Active Directory supported only this form of delegation. Also Java until Java 7 supported only this form of delegation. Starting with Java 8, Java now supports Kerberos constrained delegation (S4U2Proxy), where if the KDC supports it, it is possible to specify which particular services the server process can be delegated access to.

Hadoop within its security framework has implemented impersonation or proxy support that is independent of Kerberos delegation. With Hadoop impersonation support you can assign certain accounts proxy privileges where the proxy accounts can access Hadoop resources or run jobs on behalf of other users. We can restrict proxy privileges granted to a proxy account to act on behalf of only certain users who are members of certain groups and/or only for connections originating from certain hosts. However we can’t restrict the proxy privileges to only certain services within the cluster.

What we are discussing in this article is how to setup Kerberos constrained delegation and access a secure cluster. The example here involves Apache Tomcat, however you can easily extend this to other Java Application Servers.

This is a good article showing specific details on using Kerberos in applications connecting to Hadoop.

Power Query: Joining On Date Ranges

Reza Rad shows how to build merge joins between date ranges in Power Query:

Customer’s table has the history details of changes through the time. For example, the customer ID 2, has a track of change. John was living in Sydney for a period of time, then moved to Melbourne after that.

The problem we are trying to solve is to join these two tables based on their customer ID, and find out the City related to that for that specific period of time. We have to check the Date field from Sales Table to fit into FromDate and ToDate of the Customer table.

This is a common type 2 SCD scenario.  I’d be concerned that this solution would not work with large data sets which may already be pushing the size limits of the Vertipaq engine.

K Nearest Cliques

Vincent Granville explains an algorithm built around finding cliques of data points:

The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions.) In the most basic version, we have one clique for each cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points from the cluster in question. If the clusters are well separated, we can even use p = 1. We define the density of a clique as the number of points per unit area. In general, we want to build cliques with high density.

Ideally, we want each cluster in the training set to be covered by a small number of (possibly slightly overlapping) cliques, each one having a high density.  Also, as a general rule, a training set point can only belong to one clique, and (ideally) to only one cluster. But the circles associated with two cliques are allowed to overlap.

It’s an interesting approach, and I can see how it’d be faster than K Nearest Neighbors, but I do wonder how accurate the results would be in comparison to KNN.

Your Reminder Not To MERGE

Kevin Wilkie points out the numerous problems with the MERGE operator:

Now, when I last posted, I’m sure you thought I was done talking about the MERGE statement. You are so wrong, compadre! One more post is absolutely needed!

There are a few issues with the MERGE statement. Well, as of this writing, there are 361 possible issues according to Microsoft Connect – the actual website where Microsoft checks to see what issues exist!

So, if you want to use the MERGE statement, please read through every issue listed on the link above and make sure that none of those scenarios could exist for you. If they don’t, great. Knock yourself out and use it.

But wait, there’s more!  Read on to see what else could be a problem.

OS Threads DMV

Ewald Cress moves up the internals stack a little further and looks at a DMV:

Broadly speaking, a DMV presents just another iterator that can be plugged into a query plan. The execution engine calls GetRow() repeatedly until it reaches the end, and the iterator emits rows. The only unusual thing is that the ultimate source of the data may have nothing to do with the storage engine.

Now if you asked me to guess where in the world we’d find a list of all threads to iterate over, I would have expected that we’d start with the NodeManager, iterating over all SOS_Nodes, and then for each of them iterating over its collection of associated SystemThreads. After all, we have a guaranteed 1:1 correspondence between threads and SystemThreads, and I figured that all SystemThreads enlist themselves into a parent SOS_Node upon creation. No-brainer, right?

Turns out that this guess would have been completely wrong, and the reason it would have been a broken implementation will become apparent when we look at the started_by_sqlservr column.

Definitely read the whole thing, but you may need to top off your caffeinated beverage of choice first.  Also, I’m pretty sure Ewald promised to hammer one out once per week; that’s how I read it, at least…

One CLR Solution

Solomon Rutzky continues his SQL Server 2017 CLR security series:

This new requirement prevents the technique described towards the end of Part 1 from working. That technique uses a SAFE Assembly as an indirect means of creating the Asymmetric Key to create the Login from. That worked perfectly prior to SQL Server 2017, but now even SAFE Assemblies require that the signature-based Login be created first, which now puts us in a whole chicken-egg paradox.

Before proceeding to the solution, it should be noted that yes, Microsoft has, as of RC2 (released on 2017-08-02), provided a kinda/sorta “fix” for this that allows for creating an Assembly without having the signature-based Login. HOWEVER, that “fix” is absolutely horrible, convoluted, and unnecessary. It should not be used by anyone. Ever! In fact, it should be completely removed and forgotten about. In no uncertain terms: it is not an option! To help clarify, I am being intentionally vague about that new feature here (and in Part 1) so as not to distract from these two solutions (this post and Part 3) that do not promote bad practices; it will be covered starting in Part 4.

Solomon outlines one approach to dealing with CLR security changes, though it’s a bit lengthy.

Availability Group DMVs

Adrian Buckman shows off a few Availability Group DMVs:

Whilst I was combing through various combinations of sys.dm_hadr and sys.Availability I stumbled across a couple of gems that I thought would be nice to share.

  • sys.dm_hadr_cluster

  • sys.dm_hadr_cluster_members

  • sys.dm_hadr_cluster_networks

Read on for usage examples.


August 2017
« Jul Sep »