Category: Architecture

An Introduction to Event Sourcing

Published 2022-10-28 by Kevin Feasel

Aasif Ali provides a high-level introduction to the concept of event sourcing:

Event sourcing is a way to store data as events in an append-only log. Rather than keeping only the latest version of the entity state, this method stores the state of a database object as a sequence of events: essentially one new event each time the object changes state, from the beginning of the object’s existence. An event can be anything generated by a user, such as a mouse click or a key press on a keyboard. It is a great way to atomically update the state and publish events. Not only can we query these events, but we can also use the event log to reconstruct past states, and as a foundation to automatically adjust the state to cope with retroactive changes.
Events are immutable: they cannot be changed. This well-known rule is often the first defining characteristic of event stores and event sourcing.
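
To make the idea concrete, here is a minimal sketch of an append-only event log with state rebuilt by replay. It is not code from Aasif's post: the Event and EventLog classes, the "deposited" and "withdrew" event kinds, and the bank-balance example are all hypothetical names chosen for illustration, and a real event store (Kafka included) adds durability, partitioning, and much more.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class Event:
    kind: str    # hypothetical event kinds: "deposited" or "withdrew"
    amount: int


class EventLog:
    """Append-only log: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def read(self, upto: Optional[int] = None) -> Tuple[Event, ...]:
        # Return the first `upto` events (all of them when upto is None).
        return tuple(self._events[:upto])


def replay(events) -> int:
    """Rebuild an account balance by applying the events in order."""
    balance = 0
    for event in events:
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrew":
            balance -= event.amount
    return balance


log = EventLog()
log.append(Event("deposited", 100))
log.append(Event("withdrew", 30))
log.append(Event("deposited", 5))

current_balance = replay(log.read())      # state after all three events
balance_after_two = replay(log.read(2))   # a past state: first two events only
```

The current state is never stored directly here; it is always derived from the log, which is why replaying a prefix of the log yields a past state.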

Read on to see how this concept works and how products like Apache Kafka make event sourcing viable.

Designing Event Streams for Kafka

Published 2022-10-24 by Kevin Feasel

Dave Shook announces a new course:

Properly designing your events and event streams is essential for any event-driven architecture. Precisely how you design and implement them will significantly affect not only what you can do today, but what you can do tomorrow. For such a critical part of any data infrastructure, most event streaming tutorials gloss over event design.
In the new course on Confluent Developer, events and event streams are put front and center. We’re going to look at the dimensions of event and event stream design and how to apply them to real-world problems. But dimensions and theory are nothing without best practices, so we are also going to take a look at these to help keep you clear of pitfalls and set you up for success. This course also includes hands-on exercises, during which you will work through use cases related to the different dimensions of event design and event streaming.

Click through to learn more about what’s in the course and to check it out; it is free, after all.

General Purpose Tier Azure SQL DB Performance

Published 2022-10-17 by Kevin Feasel

Reitse Eskens continues a series on comparing tiers of Azure SQL Database:

In my previous blog, I wrote about the serverless tier, the one that can go to sleep if you’re not using it for more than one hour (minimum). That tier is cheaper as long as you’re not running it for more than 25% of the time. If you need more time, go provisioned.
Another difference between serverless and provisioned is that the provisioned one gets a set number of cores whereas the serverless one has a minimum and a maximum number of cores. So this time, the blog is about the provisioned tier where you choose a fixed number of CPUs with a fixed monthly cost.

Click through for the analysis. I’ll reiterate here that I really hope Reitse has some graphics at the end (or at least tables) which sort of lay out where the boundaries between tiers are and what the performance and cost profiles look like between them.

Architecting a Data Lake

Published 2022-10-05 by Kevin Feasel

James Serra provides some guidance:

I have had a lot of conversations with customers to help them understand how to design a data lake. I touched on this in my blog Data lake details, but that was written a long time ago so I wanted to update it. I often find customers do not spend enough time designing a data lake and many times have to go back and redo their design and data lake build-out because they did not think through all their use cases for data. So make sure you think through all the sources of data you will use now and in the future, understanding the size, type, and speed of the data. Then absorb all the information you can find on data lake architecture and choose the appropriate design for your situation.

The concepts are simple but there are some interesting implications to what James includes as well as additional resources, so check it out.

Event-Driven Microservices in Python with Kafka

Published 2022-09-26 by Kevin Feasel

Dave Klein demonstrates how event-driven microservices work:

Along came microservices. Individual, smaller applications that could be changed, deployed, and scaled independently. After some initial skepticism, this architectural style took off. It truly did solve several significant problems. However, as is often the case, it brought new levels of complexity for us to deal with. We now had distributed systems that needed to communicate and depend on each other to accomplish the tasks at hand.
The most common approach to getting our applications talking to each other was to use what we were already using between our clients and servers: HTTP-based request/response communications, perhaps using REST or gRPC. This works, but it increases the coupling between our independent applications by requiring them to know about APIs, endpoints, request parameters, etc., making them less independent.
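
As a rough illustration of the event-driven alternative the post goes on to describe, here is a toy in-memory publish/subscribe sketch. It is not Dave's code and it does not use Kafka: InMemoryBroker, the "orders" topic, and the three service functions are hypothetical names, and delivery here is synchronous and non-durable, unlike a real broker.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for an event broker; synchronous and not durable."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the event to every handler registered for the topic.
        for handler in self._subscribers[topic]:
            handler(event)


broker = InMemoryBroker()
shipped = []   # records what the hypothetical shipping service did
emailed = []   # records what the hypothetical email service did


def shipping_service(event):
    shipped.append(event["order_id"])


def email_service(event):
    emailed.append(f"Receipt for order {event['order_id']}")


broker.subscribe("orders", shipping_service)
broker.subscribe("orders", email_service)


def order_service(order_id):
    # The producer only knows the topic name, not who consumes the event.
    broker.publish("orders", {"type": "order_placed", "order_id": order_id})


order_service(42)
```

The order service never calls the other two services directly, so a new consumer can be added with one more subscribe call and no change to the producer.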

Read the whole thing.

Tips for Using Synapse Database Templates

Published 2022-09-20 by Kevin Feasel

James Serra provides some guidance:

I had previously blogged about Azure Synapse Analytics database templates, and wanted to follow up with some notes and tips on that feature as I have been involved in a project that is using it:
– Purview does not yet pull in the metadata for database templates (table/field descriptions and table relationships). Right now it pulls in the metadata as if it were a SQL table or a file in ADLS. Both just have the basic information supported by those types. The SQL one is probably preferred.
– Power BI does not import the table and field descriptions when connecting to a lake database (where the database templates are stored), but it does import the table relationships. You can see the table descriptions by hovering over the table names in the navigator when importing tables using the “Azure Synapse Analytics workspace (Beta)” connector. Note that you are not able to see the table descriptions when hovering over the table names using the “Azure Synapse Analytics SQL” connector. Also note that the “Select Related Tables” button does not work in the navigator.

Click through for more notes from the field.

Building a Lakehouse with Azure Synapse Analytics

Published 2022-09-12 by Kevin Feasel

Arshad Ali does a bit of construction:

Data Lakehouse architecture has become the de facto standard for designing and building data platforms for analytics as it bridges the gap and breaks the silos created by the traditional/modern data warehouse and the data lake. This blog post introduces you to the world of data lakehouse and it goes into details of how to implement it successfully in Azure with Azure Synapse Analytics.

Read the whole thing.

Inverted Indexes for Full-Text Search

Published 2022-09-09 by Kevin Feasel

Maria Zakourdaev twists some text inside-out:

Sometimes there are properties in the document with unstructured text, like newspaper articles, blog posts, or book abstracts. The inverted index is easy to build and is similar to data structures search engines use.
Such document structures can help in various complex search patterns, like common word detection, full-text searches, or document similarity searches, using Hamming distance or L2 distance algorithms. Inverted indexes are useful when the number of keywords is not too large and when the existing data is either totally immutable or rarely changed, but frequently searched.
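
For a sense of what "inverted" means here, the following is a minimal sketch that maps each word to the set of documents containing it. It is an illustration only, not Maria's implementation: the sample documents, the simple regex tokenizer, and the AND-only search are all assumptions, and a relational version would store the same word-to-document mapping as rows in a table.

```python
import re
from collections import defaultdict

# Hypothetical documents, keyed by document id.
docs = {
    1: "Event sourcing stores events in a log",
    2: "Write-ahead logging protects committed data",
    3: "An inverted index maps words to documents",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_inverted_index(docs)
hits_log = search(index, "log")               # exact word match only
hits_both = search(index, "inverted index")   # both words must appear
hits_none = search(index, "kafka")            # word not in any document
```

Lookups touch only the entries for the query words rather than scanning every document, which is why this layout suits data that is searched often and changed rarely.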

This post and Maria’s MSSQLTips post both cover the high-level concept, focusing on tradeoffs between different data models. I like this sort of idea a lot and like telling people that sometimes, the right answer in a relational database involves thinking backwards.

Understanding Write-Ahead Logging

Published 2022-08-30 by Kevin Feasel

Kevin Sookocheff explains how write-ahead logging protects data in databases:

A central tenet of databases is that any committed data survives a crash or a failure. Write-ahead logging is a fundamental primitive that ensures all changes to data are first written safely to stable storage before being applied. Couple that with some careful use of sequence numbers and we can guarantee that changes made to a database can survive system crashes.
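
The log-first rule and the role of sequence numbers can be sketched with a toy in-memory model. This is a deliberately simplified illustration, not the algorithm Kevin Sookocheff walks through: the TinyWAL class and its fields are hypothetical, Python lists and dicts stand in for stable storage and data pages, and there is no handling of transactions, undo, or partial writes.

```python
class TinyWAL:
    """Toy write-ahead log: every change is logged before it is applied."""

    def __init__(self):
        self.log = []              # stands in for stable storage; survives a crash
        self.checkpoint_data = {}  # last snapshot on stable storage
        self.checkpoint_lsn = 0    # log sequence number covered by the snapshot
        self.data = {}             # stands in for in-memory pages; lost on a crash
        self.applied_lsn = 0       # LSN of the last change reflected in self.data
        self.next_lsn = 1

    def put(self, key, value):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.log.append((lsn, key, value))  # 1. write the log record first
        self.data[key] = value              # 2. only then apply the change
        self.applied_lsn = lsn
        return lsn

    def checkpoint(self):
        # Persist a snapshot along with the LSN it is current as of.
        self.checkpoint_data = dict(self.data)
        self.checkpoint_lsn = self.applied_lsn

    def crash(self):
        # Simulate losing everything that was only in memory.
        self.data = {}
        self.applied_lsn = 0

    def recover(self):
        # Start from the snapshot, then redo only the newer log records.
        self.data = dict(self.checkpoint_data)
        self.applied_lsn = self.checkpoint_lsn
        replayed = 0
        for lsn, key, value in self.log:
            if lsn > self.applied_lsn:
                self.data[key] = value
                self.applied_lsn = lsn
                replayed += 1
        return replayed


wal = TinyWAL()
wal.put("a", 1)
wal.put("b", 2)
wal.checkpoint()
wal.put("a", 3)        # logged, but made after the checkpoint
wal.crash()
replayed = wal.recover()
recovered = dict(wal.data)
```

Because each record carries a sequence number, recovery can tell which changes the snapshot already contains and redo only the ones that came after it.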

This is a core feature in pretty much every relational database and Kevin dives into how one of the key algorithms behind it works.

Optimizing Azure Pricing for Storage and VMs

Published 2022-08-18 by Kevin Feasel

Shane Baldacchino continues a series on cost optimization in the cloud:

Cost. I have been fortunate to work for and help migrate one of Australia’s leading websites (seek.com.au) into the cloud and have worked for both large public cloud vendors. I have seen the really good, and the not so good when it comes to architecture.
Cloud and cost. It can be quite a polarising topic. Do it right, and you can run super lean, drive down the cost to serve and ride the cloud innovation train. But conversely, do it wrong and treat public cloud like a datacentre, and your costs could be significantly larger than on-premises.

Click through for some good advice, including an appreciation of spot instances.
