Category: Architecture

Page Ranking With Kafka Streams

Published 2017-10-20 by Kevin Feasel

Hunter Kelly walks through a page ranking algorithm:

Once you have the adjacency matrix, you perform some straightforward matrix calculations to calculate a vector of Hub scores and a vector of Authority scores as follows:

Sum across the columns and normalize, this becomes your Hub vector

Multiply the Hub vector element-wise across the adjacency matrix

Sum down the rows and normalize, this becomes your Authority vector

Multiply the Authority vector element-wise down the the adjacency matrix

Repeat

An important thing to note is that the algorithm is iterative: you perform the steps above until eventually you reach convergence—that is, the vectors stop changing—and you’re done. For our purposes, we just pick a set number of iterations, execute them, and then accept the results from that point. We’re mostly interested in the top entries, and those tend to stabilize pretty quickly.

This is an architectural-level post, so there’s no code but there is a useful discussion of the algorithm.

Comments closed

Predicting Advertising Budgets With Kafka Streams

Published 2017-10-11 by Kevin Feasel

Boyang Chen explains how Pinterest uses Kafka Streams to reduce advertising overdelivery:

Overdelivery occurs when free ads are shown for out-of-budget advertisers. This reduces opportunities for advertisers with available budget to have their products and services discovered by potential customers.

Overdelivery is a difficult problem to solve for two reason:

Real-time spend data: Information about ad impressions needs to be fed back into the system within seconds in order to shut down out-of-budget campaigns.
Predictive spend: Fast, historical spend data isn’t enough. The system needs to be able to predict spend that might occur in the future and slow down campaigns close to reaching their budget. That’s because an inserted ad could remain available to be acted on by a user. This makes the spend information difficult to accurately measure in a short timeframe. Such a natural delay is inevitable, and the only thing we can be sure of is the ad insertion event.

This is a very interesting architectural overview.

1 Comment

Using Kafka To Drive Machine Learning

Published 2017-10-02 by Kevin Feasel

Kai Waehner has a nice architectural post on using Kafka as the focal point for machine learning training and prediction:

The essence of this architecture is that it uses Kafka as an intermediary between the various data sources from which feature data is collected, the model building environment where the model is fit, and the production application that serves predictions.

Feature data is pulled into Kafka from the various apps and databases that host it. This data is used to build models. The environment for this will vary based on the skills and preferred toolset of the team. The model building could be a data warehouse, a big data environment like Spark or Hadoop, or a simple server running python scripts. The model can be published where the production app that gets the same model parameters can apply it to incoming examples (perhaps using Kafka Streams to help index the feature data for easy usage on demand). The production app can either receive data from Kafka as a pipeline or even be a Kafka Streams application itself.

This is approximately 80% of my interests wrapped up in one post, so of course I’m going to read it…

Comments closed

Distributed Database Writes

Published 2017-09-25 by Kevin Feasel

James Serra provides a number of options around distributed writes:

In SQL Server, scaling out reads (i.e. using Active secondary replicas via AlwaysOn Availability Groups) is a lot easier than scaling out writes. So what are your options when you have a tremendous amount of writes that scaling up will not handle, no matter how big your server is? There are a number of options that allow you to write to many servers (instead of writing to one master server) that I’ll call distributed writes. Here are some ideas:

Peer-to-Peer transactional replication (or Multi-master replication) with SQL Server. See Peer-to-Peer – Transactional Replication
Sharding in Azure SQL Database via elastic database tools which requires coding. See Building scalable cloud databases. You can also implement sharding in code for SQL Server
Merge replication in SQL Server. See Merge Replication
Create a messaging and queuing application in SQL Server Service Broker where all writes are placed on the queue and sent to different servers

Read on for more options and some additional thoughts around Cosmos DB. My first inclination would be to put Kafka in front of a distributed write system, but that’s my bias.

Comments closed

Trigram Search In SQL Server

Published 2017-09-12 by Kevin Feasel

Paul White shows how to implement trigram wildcard searches in SQL Server:

The basic idea of a trigram search is quite simple:

Persist three-character substrings (trigrams) of the target data.

Split the search term(s) into trigrams.

Match search trigrams against the stored trigrams (equality search)

Intersect the qualified rows to find strings that match all trigrams

Apply the original search filter to the much-reduced intersection

We will work through an example to see exactly how this all works, and what the trade-offs are.

A must-read. N-grams in SQL Server is an example of a non-obvious data architecture which performs much better than the obvious alternative, at least when the conditions are right.

Comments closed

How The New York Times Uses Apache Kafka

Published 2017-09-07 by Kevin Feasel

Boerge Svingen gives us an architectural overview of how the New York Times uses Apache Kafka to link different services together:

These are all sources of what we call published content. This is content that has been written, edited, and that is considered ready for public consumption.

On the other side we have a wide range of services and applications that need access to this published content — there are search engines, personalization services, feed generators, as well as all the different front-end applications, like the website and the native apps. Whenever an asset is published, it should be made available to all these systems with very low latency — this is news, after all — and without data loss.

This article describes a new approach we developed to solving this problem, based on a log-based architecture powered by Apache Kafka^TM. We call it the Publishing Pipeline. The focus of the article will be on back-end systems. Specifically, we will cover how Kafka is used for storing all the articles ever published by The New York Times, and how Kafka and the Streams API is used to feed published content in real-time to the various applications and systems that make it available to our readers. The new architecture is summarized in the diagram below, and we will deep-dive into the architecture in the remainder of this article.

This is a nice write-up of a real-world use case for Kafka.

Comments closed

Lambda And Kappa Architectures

Published 2017-08-31 by Kevin Feasel

Michael Verrilli has a post contrasting the Lambda and Kappa data architectures:

Any query may get a complete picture by retrieving data from both the batch views and the real-time views. The queries will get the best of both worlds. The batch views may be processed with more complex or expensive rules and may have better data quality and less skew, while the real-time views give you up to the moment access to the latest possible data. As time goes on, real-time data expires and is replaced with data in the batch views.

One additional benefit to this architecture is that you can replay the same incoming data and produce new views in case code or formula changes.

The biggest detraction to this architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and speed layers. Luckily with Spark Streaming (abstraction layer) or Talend (Spark Batch and Streaming code generator), this has become far less of an issue… although the operational burden still exists.

I haven’t seen much on the topic of Big Data architectures this year; it seems like it was a much more popular topic last year.

Comments closed

Thoughts On Exactly-Once Processing And First-In First-Out

Published 2017-08-03 by Kevin Feasel

Kevin Sookocheff looks into Amazon’s Simple Queue Service and explains some concepts of distributed messaging systems in the process:

In an ideal scenario, the five minute window would be a complete non-issue. Unfortunately, if you are relying on SQS’s exactly-once guarantee for critical use cases you will need to account for the possibility of this error and design your application accordingly.

On the message consumer side, FIFO queues do not guarantee exactly once delivery, because in simple fact, exactly once delivery at the transport level is provably impossible. Even if you could ensure exactly-once delivery at the transport level, it probably isn’t what you want anyways — if a subscriber receives a message from the transport, there is still a chance that it can crash before processing it, in which case you definitely want the messaging system to deliver the message again.

Instead, FIFO queues offer exactly-once processing by guaranteeing that once a message has successfully been acknowledged as processed that it won’t be delivered again. To understand more completely how this works, let’s walk through the details of how you go about consuming messages from SQS.

It’s a great read, so check it out.

Comments closed

Isolation Level Basics

Published 2017-08-03 by Kevin Feasel

Randolph West describes the primary isolation levels in SQL Server:

There are four isolation levels in SQL Server (as quoted from SQL Server Books Online):

Read uncommitted (the lowest level where transactions are isolated only enough to ensure that physically corrupt data is not read)

Read committed (Database Engine default level)

Repeatable read

Serializable (the highest level, where transactions are completely isolated from one another)

Read on for a discussion of what these mean, as well as how optimistic versus pessimistic concurrency (in this case, Read Committed Snapshot Isolation versus Read Committed) comes into play.

1 Comment

The Data Lake From 10,000 Feet

Published 2017-08-02 by Kevin Feasel

Pradeep Menon has a high-level explanation of what a data lake is and how it differs from traditional data warehouses:

With the changes in the data paradigm, a new architectural pattern has emerged. It’s called as the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. Like the lake, it caters to need to different people, those who want to fish or those who want to take a boat ride or those who want to get drinking water from it, a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create a hypothesis. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present to stakeholders.

The way I compare a data lake to a data warehouse or a mart is like this:

Data Lake stores data in the purest form caters to multiple stakeholders and can also be used to package data in a form that can be consumed by end-users. On the other hand, Data Warehouse is already distilled and packaged for defined purposes.

One way of thinking about this is that data warehouses are great for solving known business questions: generating 10K reports or other regulatory compliance reporting, building the end-of-month data, and viewing standard KPIs. By contrast, the data lake is (among other things) for spelunking, trying to answer those one-off questions people seem to have but which the warehouse never seems to have quite the right set of information.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31