Category: Architecture

Caching Strategy

Published 2017-12-01 by Kevin Feasel

Kevin Gessner explains some caching concepts used at Etsy:

A major drawback of modulo hashing is that the size of the cache pool needs to be stable over time. Changing the size of the cache pool will cause most cache keys to hash to a new server. Even though the values are still in the cache, if the key is distributed to a different server, the lookup will be a miss. That makes changing the size of the cache pool—to make it larger or for maintenance—an expensive and inefficient operation, as performance will suffer under tons of spurious cache misses.

For instance, if you have a pool of 4 hosts, a key that hashes to 500 will be stored on pool member 500 % 4 == 0, while a key that hashes to 1299 will be stored on pool member 1299 % 4 == 3. If you grow your cache by adding a fifth host, the cache pool calculated for each key may change. The key that hashed to 500 will still be found on pool member 500 % 5 == 0, but the key that hashed to 1299 will be on pool member 1299 % 5 == 4. Until the new pool member is warmed up, your cache hit rate will suffer, as the cache data will suddenly be on the ‘wrong’ host. In some cases, pool changes can cause more than half of your cached data to be assigned to a different host, slashing the efficiency of the cache temporarily. In the case of going from 4 to 5 hosts, only 20% of cache keys will be on the same host as before!
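The arithmetic in that example is easy to check. The following is a minimal sketch, not code from the linked article: the function names are made up for illustration, and key hashes are simulated as consecutive integers rather than produced by a real hash function.

```python
def pool_member(key_hash: int, pool_size: int) -> int:
    """Modulo hashing: pick a host index from the key's hash."""
    return key_hash % pool_size


def fraction_unmoved(old_size: int, new_size: int, sample: int = 100_000) -> float:
    """Share of key hashes that map to the same host before and after a resize."""
    same = sum(
        1 for h in range(sample)
        if pool_member(h, old_size) == pool_member(h, new_size)
    )
    return same / sample


print(pool_member(500, 4), pool_member(500, 5))    # 0 0
print(pool_member(1299, 4), pool_member(1299, 5))  # 3 4
print(fraction_unmoved(4, 5))                      # 0.2
```

A hash stays on the same host only when its remainders modulo 4 and modulo 5 agree, which happens for 4 out of every 20 consecutive values, hence the 20% figure.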

It’s interesting reading.

Microservices With Kafka Streams

Published 2017-11-15 by Kevin Feasel

Ben Stopford walks us through a microservices architecture built on top of Kafka:

So we can use the Kafka Streams API to piece together complex business systems as a collection of asynchronously executing, event-driven services. The differentiator here is the API itself, which is far richer than, say, the Kafka Producer or Consumer. It makes code more readable, provides reusable implementations of common patterns like joins, aggregates, and filters and wraps the whole ecosystem with a transparent level of correctness.

Systems built in this way, in the real world, come in a variety of guises. They can be fine grained and fast executing, completing in the context of an HTTP request, or complex and long-running, manipulating the stream of events that map a whole company’s business flow. This post focusses on the former, building up a real-world example of a simple order management system that executes within the context of an HTTP request, and is entirely built with Kafka Streams. Each service is a small function, with well-defined inputs and outputs. As we build this ecosystem up, we will encounter problems such as blending streams and tables, reading our own writes, and managing consistency in a distributed and autonomous environment.

This post stays high-level and covers a lot of ground. I’m wishy-washy on the idea of microservices, but if you are going to do them, it’s better to do them right.

Thinking About Slowly Degrading Page Performance

Published 2017-11-02 by Kevin Feasel

Ritesh Maheshwari talks about how LinkedIn deals with performance regressions:

Looking at the chart above, where the dotted red line is a reference point to show where we started the year, notice how site speed improvements tend to be significant and noticeable, as they are optimization-driven. Degradations, however, can generally be of any “amount,” as they happen for various reasons. LinkedIn’s page-serving pipeline has many moving parts. We deploy code multiple times per day, operate a micro-service architecture with hundreds of services, and infrastructure upgrades are frequent. A slowdown in any of these components can cause degradations.

While large degradations can be caught using A/B testing, canary analysis, or anomaly detection, small ones tend to leak to production. Thus, performance of a page has a tendency to always degrade over time.

This led to having the centralized Performance Team focus on identifying these leaks, called “site speed regressions,” and to craft tools and processes to fix them.

It’s an interesting principle, and I could see it working for tracking database performance degradation as well.

Page Ranking With Kafka Streams

Published 2017-10-20 by Kevin Feasel

Hunter Kelly walks through a page ranking algorithm:

Once you have the adjacency matrix, you perform some straightforward matrix calculations to calculate a vector of Hub scores and a vector of Authority scores as follows:

Sum across the columns and normalize, this becomes your Hub vector

Multiply the Hub vector element-wise across the adjacency matrix

Sum down the rows and normalize, this becomes your Authority vector

Multiply the Authority vector element-wise down the adjacency matrix

Repeat

An important thing to note is that the algorithm is iterative: you perform the steps above until eventually you reach convergence—that is, the vectors stop changing—and you’re done. For our purposes, we just pick a set number of iterations, execute them, and then accept the results from that point. We’re mostly interested in the top entries, and those tend to stabilize pretty quickly.
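As a rough illustration of the iterate-and-normalize loop, here is a small sketch of hub and authority scoring in its textbook form (commonly known as HITS), run for a fixed number of iterations as the excerpt describes. It is not the author's Kafka Streams implementation, it does not reproduce the exact matrix steps listed above, and the four-page graph is hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def hits(adjacency, iterations=20):
    """adjacency[i][j] == 1 means page i links to page j."""
    n = len(adjacency)
    hubs = [1.0] * n
    authorities = [1.0] * n
    for _ in range(iterations):
        # Authority of j: total hub score of the pages that link to j.
        authorities = normalize(
            [sum(adjacency[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        )
        # Hub of i: total authority score of the pages that i links to.
        hubs = normalize(
            [sum(adjacency[i][j] * authorities[j] for j in range(n)) for i in range(n)]
        )
    return hubs, authorities


# Hypothetical 4-page graph: pages 0 and 1 both link to 2 and 3; page 2 links to 3.
links = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
hubs, authorities = hits(links)
print(max(range(4), key=lambda i: authorities[i]))  # 3
```

Page 3 comes out as the top authority because every other page links to it, while pages 0 and 1 tie as the strongest hubs.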

This is an architectural-level post, so there’s no code but there is a useful discussion of the algorithm.

Predicting Advertising Budgets With Kafka Streams

Published 2017-10-11 by Kevin Feasel

Boyang Chen explains how Pinterest uses Kafka Streams to reduce advertising overdelivery:

Overdelivery occurs when free ads are shown for out-of-budget advertisers. This reduces opportunities for advertisers with available budget to have their products and services discovered by potential customers.

Overdelivery is a difficult problem to solve for two reasons:

Real-time spend data: Information about ad impressions needs to be fed back into the system within seconds in order to shut down out-of-budget campaigns.
Predictive spend: Fast, historical spend data isn’t enough. The system needs to be able to predict spend that might occur in the future and slow down campaigns close to reaching their budget. That’s because an inserted ad could remain available to be acted on by a user. This makes the spend information difficult to accurately measure in a short timeframe. Such a natural delay is inevitable, and the only thing we can be sure of is the ad insertion event.
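To make the predictive-spend idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `Campaign` fields, the expected-value formula, and the 90% threshold are hypothetical and do not come from Pinterest's system.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    budget: float             # total budget for the period
    recorded_spend: float     # spend already confirmed by billing events
    inflight_insertions: int  # ads inserted whose outcome is not yet known
    action_rate: float        # assumed share of insertions that become billable
    cost_per_action: float    # assumed price of one billable action


def predicted_spend(c: Campaign) -> float:
    """Recorded spend plus the expected cost of insertions still in flight."""
    return c.recorded_spend + c.inflight_insertions * c.action_rate * c.cost_per_action


def should_throttle(c: Campaign, threshold: float = 0.9) -> bool:
    """Slow the campaign once predicted spend nears the budget (assumed rule)."""
    return predicted_spend(c) >= threshold * c.budget


campaign = Campaign(budget=100.0, recorded_spend=80.0,
                    inflight_insertions=100, action_rate=0.25, cost_per_action=0.5)
print(predicted_spend(campaign))  # 92.5
print(should_throttle(campaign))  # True
```

Recorded spend alone (80 of 100) would leave this campaign running at full speed; counting the insertions still in flight is what pushes it over the throttling threshold.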

This is a very interesting architectural overview.

Using Kafka To Drive Machine Learning

Published 2017-10-02 by Kevin Feasel

Kai Waehner has a nice architectural post on using Kafka as the focal point for machine learning training and prediction:

The essence of this architecture is that it uses Kafka as an intermediary between the various data sources from which feature data is collected, the model building environment where the model is fit, and the production application that serves predictions.

Feature data is pulled into Kafka from the various apps and databases that host it. This data is used to build models. The environment for this will vary based on the skills and preferred toolset of the team. The model building environment could be a data warehouse, a big data environment like Spark or Hadoop, or a simple server running Python scripts. The model can be published where the production app that gets the same model parameters can apply it to incoming examples (perhaps using Kafka Streams to help index the feature data for easy usage on demand). The production app can either receive data from Kafka as a pipeline or even be a Kafka Streams application itself.

This is approximately 80% of my interests wrapped up in one post, so of course I’m going to read it…

Distributed Database Writes

Published 2017-09-25 by Kevin Feasel

James Serra provides a number of options around distributed writes:

In SQL Server, scaling out reads (i.e. using Active secondary replicas via AlwaysOn Availability Groups) is a lot easier than scaling out writes. So what are your options when you have a tremendous amount of writes that scaling up will not handle, no matter how big your server is? There are a number of options that allow you to write to many servers (instead of writing to one master server) that I’ll call distributed writes. Here are some ideas:

Peer-to-Peer transactional replication (or Multi-master replication) with SQL Server. See Peer-to-Peer – Transactional Replication
Sharding in Azure SQL Database via elastic database tools which requires coding. See Building scalable cloud databases. You can also implement sharding in code for SQL Server
Merge replication in SQL Server. See Merge Replication
Create a messaging and queuing application in SQL Server Service Broker where all writes are placed on the queue and sent to different servers

Read on for more options and some additional thoughts around Cosmos DB. My first inclination would be to put Kafka in front of a distributed write system, but that’s my bias.

Trigram Search In SQL Server

Published 2017-09-12 by Kevin Feasel

Paul White shows how to implement trigram wildcard searches in SQL Server:

The basic idea of a trigram search is quite simple:

Persist three-character substrings (trigrams) of the target data.

Split the search term(s) into trigrams.

Match search trigrams against the stored trigrams (equality search)

Intersect the qualified rows to find strings that match all trigrams

Apply the original search filter to the much-reduced intersection

We will work through an example to see exactly how this all works, and what the trade-offs are.
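The five steps translate fairly directly into code. The sketch below uses in-memory Python structures rather than SQL Server tables, so it illustrates the idea only and is not Paul White's implementation; the sample rows and function names are hypothetical.

```python
from collections import defaultdict


def trigrams(text: str) -> set:
    """All three-character substrings of the text (case-folded)."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def build_index(rows: dict) -> dict:
    """Step 1: persist trigram -> set of row ids."""
    index = defaultdict(set)
    for row_id, value in rows.items():
        for gram in trigrams(value):
            index[gram].add(row_id)
    return index


def search(rows: dict, index: dict, term: str) -> list:
    """Steps 2-5: split, match, intersect, then apply the original filter."""
    grams = trigrams(term)
    if not grams:
        # Terms shorter than three characters cannot use the index; scan instead.
        candidates = set(rows)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(r for r in candidates if term.lower() in rows[r].lower())


# Hypothetical sample data.
rows = {1: "SQL Server", 2: "PostgreSQL", 3: "Serverless", 4: "Observer"}
index = build_index(rows)
print(search(rows, index, "server"))  # [1, 3, 4]
```

The final substring check still matters: a row can contain every trigram of the search term without containing the term itself, so the intersection only narrows the candidates.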

A must-read. N-grams in SQL Server is an example of a non-obvious data architecture which performs much better than the obvious alternative, at least when the conditions are right.

How The New York Times Uses Apache Kafka

Published 2017-09-07 by Kevin Feasel

Boerge Svingen gives us an architectural overview of how the New York Times uses Apache Kafka to link different services together:

These are all sources of what we call published content. This is content that has been written, edited, and that is considered ready for public consumption.

On the other side we have a wide range of services and applications that need access to this published content — there are search engines, personalization services, feed generators, as well as all the different front-end applications, like the website and the native apps. Whenever an asset is published, it should be made available to all these systems with very low latency — this is news, after all — and without data loss.

This article describes a new approach we developed to solving this problem, based on a log-based architecture powered by Apache Kafka™. We call it the Publishing Pipeline. The focus of the article will be on back-end systems. Specifically, we will cover how Kafka is used for storing all the articles ever published by The New York Times, and how Kafka and the Streams API are used to feed published content in real-time to the various applications and systems that make it available to our readers. The new architecture is summarized in the diagram below, and we will deep-dive into the architecture in the remainder of this article.

This is a nice write-up of a real-world use case for Kafka.

Lambda And Kappa Architectures

Published 2017-08-31 by Kevin Feasel

Michael Verrilli has a post contrasting the Lambda and Kappa data architectures:

Any query may get a complete picture by retrieving data from both the batch views and the real-time views. The queries will get the best of both worlds. The batch views may be processed with more complex or expensive rules and may have better data quality and less skew, while the real-time views give you up to the moment access to the latest possible data. As time goes on, real-time data expires and is replaced with data in the batch views.
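A toy sketch of that query pattern, assuming page-view counts as the data: the view names, functions, and numbers are hypothetical, and the batch run is heavily simplified.

```python
def query_page_views(batch_view: dict, realtime_view: dict, page: str) -> int:
    """Serving-layer query: combine the batch view with the real-time delta."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


def run_batch(batch_view: dict, realtime_view: dict) -> None:
    """Simulate a batch run catching up on events the speed layer was covering.

    Simplification: a real batch layer recomputes its views from the master
    dataset rather than copying counts from the real-time view.
    """
    for page, count in realtime_view.items():
        batch_view[page] = batch_view.get(page, 0) + count
    realtime_view.clear()  # real-time data expires once the batch view has it


# Hypothetical page-view counts.
batch_view = {"/home": 1000, "/about": 40}
realtime_view = {"/home": 12, "/pricing": 3}

print(query_page_views(batch_view, realtime_view, "/home"))  # 1012
run_batch(batch_view, realtime_view)
print(query_page_views(batch_view, realtime_view, "/home"))  # 1012
```

The query result is the same before and after the batch run; only where the recent data lives has changed.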

One additional benefit to this architecture is that you can replay the same incoming data and produce new views in case code or formula changes.

The biggest detraction to this architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and speed layers. Luckily with Spark Streaming (abstraction layer) or Talend (Spark Batch and Streaming code generator), this has become far less of an issue… although the operational burden still exists.

I haven’t seen much on the topic of Big Data architectures this year; it seems like it was a much more popular topic last year.
