Thoughts On Exactly-Once Processing And First-In First-Out

Kevin Sookocheff looks into Amazon’s Simple Queue Service and explains some concepts of distributed messaging systems in the process:

In an ideal scenario, the five minute window would be a complete non-issue. Unfortunately, if you are relying on SQS’s exactly-once guarantee for critical use cases you will need to account for the possibility of this error and design your application accordingly.

On the message consumer side, FIFO queues do not guarantee exactly once delivery, because in simple fact, exactly once delivery at the transport level is provably impossible. Even if you could ensure exactly-once delivery at the transport level, it probably isn’t what you want anyways — if a subscriber receives a message from the transport, there is still a chance that it can crash before processing it, in which case you definitely want the messaging system to deliver the message again.

Instead, FIFO queues offer exactly-once processing by guaranteeing that once a message has successfully been acknowledged as processed that it won’t be delivered again. To understand more completely how this works, let’s walk through the details of how you go about consuming messages from SQS.

It’s a great read, so check it out.

Isolation Level Basics

Randolph West describes the primary isolation levels in SQL Server:

There are four isolation levels in SQL Server (as quoted from SQL Server Books Online):

  • Read uncommitted (the lowest level where transactions are isolated only enough to ensure that physically corrupt data is not read)
  • Read committed (Database Engine default level)
  • Repeatable read
  • Serializable (the highest level, where transactions are completely isolated from one another)

Read on for a discussion of what these mean, as well as how optimistic versus pessimistic concurrency (in this case, Read Committed Snapshot Isolation versus Read Committed) comes into play.

The Data Lake From 10,000 Feet

Pradeep Menon has a high-level explanation of what a data lake is and how it differs from traditional data warehouses:

With the changes in the data paradigm, a new architectural pattern has emerged. It’s called as the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. Like the lake, it caters to need to different people, those who want to fish or those who want to take a boat ride or those who want to get drinking water from it, a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create a hypothesis. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present to stakeholders.

The way I compare a data lake to a data warehouse or a mart is like this:

Data Lake stores data in the purest form caters to multiple stakeholders and can also be used to package data in a form that can be consumed by end-users. On the other hand, Data Warehouse is already distilled and packaged for defined purposes.

One way of thinking about this is that data warehouses are great for solving known business questions:  generating 10K reports or other regulatory compliance reporting, building the end-of-month data, and viewing standard KPIs.  By contrast, the data lake is (among other things) for spelunking, trying to answer those one-off questions people seem to have but which the warehouse never seems to have quite the right set of information.

Chaining Exactly-Once Operations With Kafka

Ben Stopford shows how you can use Kafka to chain together services while maintaining exactly-once guarantees:

Any service-based architecture is itself a distributed system, a field renowned for being difficult, particularly when things go wrong. We have thought experiments like The Two Generals Problemand proofs like FLP which highlight that these systems are difficult to work with.

In practice we make compromises. We rely on timeouts. If one service calls another service and gets an error, or no response at all, it retries that call in the knowledge that it will get there in the end.

The problem is that retries can result in duplicate processing—which can cause very real problems. Taking a payment, twice, from someone’s account will lead to an incorrect balance. Adding duplicate tweets to a user’s feed will lead to a poor user experience.  The list goes on.

I just had a discussion at SQL Saturday Albany about this exact thing, and the pain of rolling your own solutions.

Understanding ACID Properties

Randolph West explains the basics of ACID properties and gives a high-level description of how relational databases typically ensure these properties:

Relational database management systems (RDBMS) such as SQL Server, Oracle, MySQL, and PostgreSQL use transactions to allow concurrent users to select, insert, update, and delete data without affecting everyone else.

An RDBMS is considered ACID-compliant if it can guarantee data integrity during transactions under the following conditions:

Read on for more.

Kafka As A Backbone

Ben Stopford explains how to use Kafka as a backbone for a microservices architecture:

Taking a log-structured approach has an interesting side effect. Both reads and writes are sequential operations. This makes them sympathetic to the underlying media, leveraging pre-fetch, the various layers of caching and naturally batching operations together. This makes them efficient. In fact, when you read messages from Kafka, the server doesn’t even import them into the JVM. Data is copied directly from the disk buffer to the network buffer. An opportunity afforded by the simplicity of both the contract and the underlying data structure.

So batched, sequential operations help with overall performance. They also make the system well suited to storing messages longer term. Most traditional message brokers are built using index structures, hash tables or B-trees, used to manage acknowledgements, filter message headers, and remove messages when they have been read. But the downside is that these indexes must be maintained. This comes at a cost. They must be kept in memory to get good performance, limiting retention significantly. But the log is O(1) when either reading or writing messages to a partition, so whether the data is on disk or cached in memory matters far less.

This is a higher-level look and helps explain why I like Kafka so much as a message broker.

Thinking About Databases At Scale

Chris Adkin has a great post explaining some of the hardware and query principles behind scale issues:

All execution plans iterators that require memory grants have two fundamental code paths, one path for when the memory grant is blown and memory spills out into tempdb and one for when the memory grant is correct or under-estimated. Perhaps the database engine team may at some point include a third option, which is for when the grant can be accommodated inside the CPU cache.

As an example, if you run a log record generation intensive workload on the same CPU socket as the log writer, usually socket 0, this will run in a shorter time compared to running the exact same workload in a different socket

This is the type of post where I catch just enough of it to know that I need to dig deeper and learn more.

Dimensional Modeling

Jen Underwood explains the basics of dimensional modeling:

A dimensional model is also commonly called a star schema. It provides a way to improve report query performance without affecting data integrity. This type of model is popular in data warehousing because it can provide better query performance than transactional, normalized, OLTP data models. It also allows for data history to be stored accurately over time for reporting. Another reason why dimensional models are created…they are easier for non-technical users to navigate. Creating reports by joining many OLTP database tables together becomes overwhelming quickly.

Dimensional models contain facts surrounded by descriptive data called dimensions. Facts contains numerical values of what you measure such as sales or user counts that are additive, or semi-additive in nature. Fact tables also contain the keys/links to associated dimension tables. Compared to most dimension tables, fact tables typically have a large number of rows.

Jen’s post was built off of an early SQL Saturday presentation.  It’s still quite relevant today.

Lamba Architecture Basics

Michael Walker walks through the basics of the Lambda architecture:

Lambda architecture – developed by Nathan Marz – provides a clear set of architecture principles that allows both batch and real-time or stream data processing to work together while building immutability and recomputation into the system. Batch processes high volumes of data where a group of transactions is collected over a period of time. Data is collected, entered, processed and then batch results produced. Batch processing requires separate programs for input, process and output. An example is payroll and billing systems. In contrast, real-time data processing involves a continual input, process and output of data. Data must be processed in a small time period (or near real-time). Customer services and bank ATMs are examples.

Lambda architecture has three (3) layers:

  • Batch Layer

  • Serving Layer

  • Speed Layer

I haven’t heard much about the Lambda and Kappa architectures lately, so when I saw this, I figured it was time for a refresher.

S3 Versus HDFS For Spark Data Storage

Reynold Xin, Josh Rosen, and Kyle Pistor argue that you should use blob storage (S3, Azure Blob, etc.) instead of disk when building a cloud-based Spark cluster:

Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.

Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).

For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTTF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.

It’s interesting how opinion has shifted; even a year ago, the recommendation would be different.


August 2017
« Jul