Category: Architecture

Designing Event Streams for Kafka

Dave Shook announces a new course:

Properly designing your events and event streams is essential for any event-driven architecture. Precisely how you design and implement them will significantly affect not only what you can do today, but what you can do tomorrow. For such a critical part of any data infrastructure, most event streaming tutorials gloss over event design.

In the new course on Confluent Developer, events and event streams are put front and center. We’re going to look at the dimensions of event and event stream design and how to apply them to real-world problems. But dimensions and theory are nothing without best practices, so we are also going to take a look at these to help keep you clear of pitfalls and set you up for success. This course also includes hands-on exercises, during which you will work through use cases related to the different dimensions of event design and event streaming.

Click through to learn more about what’s in the course and to check it out; it is free, after all.

General Purpose Tier Azure SQL DB Performance

Reitse Eskens continues a series on comparing tiers of Azure SQL Database:

In my previous blog, I wrote about the serverless tier, the one that can go to sleep if you’re not using it for more than one hour (minimum). That tier is cheaper as long as you’re not running it for more than 25% of the time. If you need more time, go provisioned.
Another difference between serverless and provisioned is that the provisioned one gets a set number of cores whereas the serverless one has a minimum and a maximum number of cores. So this time, the blog is about the provisioned tier where you choose a fixed number of CPUs with a fixed monthly cost.

Click through for the analysis. I’ll reiterate here that I really hope Reitse has some graphics at the end (or at least tables) which sort of lay out where the boundaries between tiers are and what the performance and cost profiles look like between them.
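
The 25% rule of thumb quoted above can be checked with simple arithmetic. The sketch below is only an illustration: the hourly rates are hypothetical, picked so that serverless costs four times as much per active hour as provisioned, and it ignores storage charges and the finer details of serverless billing.

```python
HOURS_PER_MONTH = 730  # common billing approximation for one month


def provisioned_monthly_cost(rate_per_hour):
    """Provisioned compute is billed for every hour of the month."""
    return HOURS_PER_MONTH * rate_per_hour


def serverless_monthly_cost(rate_per_active_hour, active_fraction):
    """Serverless compute is billed only for the fraction of time it is active."""
    return HOURS_PER_MONTH * active_fraction * rate_per_active_hour


def break_even_fraction(provisioned_rate, serverless_rate):
    """Fraction of active time at which both options cost the same."""
    return provisioned_rate / serverless_rate


# Hypothetical rates (assumptions, not real Azure prices).
PROVISIONED_RATE = 0.25
SERVERLESS_RATE = 1.00

break_even = break_even_fraction(PROVISIONED_RATE, SERVERLESS_RATE)
print(f"Break-even at {break_even:.0%} active time")
print("Serverless at 12.5% active:", serverless_monthly_cost(SERVERLESS_RATE, 0.125))
print("Provisioned:", provisioned_monthly_cost(PROVISIONED_RATE))
```

With real prices, the break-even point is simply the ratio of the two hourly rates, so it is worth recomputing for the specific tier and region in question.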

Architecting a Data Lake

James Serra provides some guidance:

I have had a lot of conversations with customers to help them understand how to design a data lake. I touched on this in my blog Data lake details, but that was written a long time ago so I wanted to update it. I often find customers do not spend enough time in designing a data lake and many times have to go back and redo their design and data lake build-out because they did not think through all their use cases for data. So make sure you think through all the sources of data you will use now and in the future, understanding the size, type, and speed of the data. Then absorb all the information you can find on data lake architecture and choose the appropriate design for your situation.

The concepts are simple, but what James includes has some interesting implications, and he links to additional resources, so check it out.

Event-Driven Microservices in Python with Kafka

Dave Klein demonstrates how event-driven microservices work:

Along came microservices. Individual, smaller applications that could be changed, deployed, and scaled independently. After some initial skepticism, this architectural style took off. It truly did solve several significant problems. However, as is often the case, it brought new levels of complexity for us to deal with. We now had distributed systems that needed to communicate and depend on each other to accomplish the tasks at hand.

The most common approach to getting our applications talking to each other was to use what we were already using between our clients and servers: HTTP-based request/response communications, perhaps using REST or gRPC. This works, but it increases the coupling between our independent applications by requiring them to know about APIs, endpoints, request parameters, etc., making them less independent.

Read the whole thing.
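
To make the coupling point concrete, here is a minimal sketch of the event-driven alternative. It uses a tiny in-memory event bus as a stand-in for Kafka, so it is not Dave's code and it delivers events synchronously, which a real broker would not. The `EventBus` class, the `order_placed` topic, and both services are hypothetical names invented for illustration.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for a broker: each topic maps to subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a callback to run for every event published to the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Hand the event to every subscriber; the producer never names them."""
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()
shipped = []
emailed = []


# Hypothetical downstream services. Neither one exposes an API to the producer.
def shipping_service(event):
    shipped.append(event["order_id"])


def email_service(event):
    emailed.append(f"Order {event['order_id']} confirmed")


bus.subscribe("order_placed", shipping_service)
bus.subscribe("order_placed", email_service)

# The producer only knows the topic name and the event shape.
bus.publish("order_placed", {"order_id": 42})

print(shipped)
print(emailed)
```

Adding a third consumer means one more `subscribe` call and no change to the producer, which is the independence that request/response calls tend to erode.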

Tips for Using Synapse Database Templates

James Serra provides some guidance:

I had previously blogged about Azure Synapse Analytics database templates, and wanted to follow up with some notes and tips on that feature as I have been involved in a project that is using it:

– Purview does not yet pull in the metadata for database templates (table/field descriptions and table relationships). Right now it pulls in the metadata as if it were a SQL table or a file in ADLS. Both just have the basic information supported by those types. The SQL one is probably preferred.

– Power BI does not import the table and field descriptions when connecting to a lake database (where the database templates are stored), but it does import the table relationships. You can see the table descriptions by hovering over the table names in the navigator when importing tables using the “Azure Synapse Analytics workspace (Beta)” connector. Note that you are not able to see the table descriptions when hovering over the table names using the “Azure Synapse Analytics SQL” connector. Also note that the “Select Related Tables” button does not work in the navigator.

Click through for more notes from the field.

Building a Lakehouse with Azure Synapse Analytics

Arshad Ali does a bit of construction:

Data Lakehouse architecture has become the de facto standard for designing and building data platforms for analytics as it bridges the gap and breaks the silos created by the traditional/modern data warehouse and the data lake. This blog post introduces you to the world of the data lakehouse and goes into the details of how to implement it successfully in Azure with Azure Synapse Analytics.

Read the whole thing.

Inverted Indexes for Full-Text Search

Maria Zakourdaev twists some text inside-out:

Sometimes there are properties in the document with unstructured text, like newspaper articles, blog posts, or book abstracts. The inverted index is easy to build and is similar to data structures search engines use.

Such document structures can help in various complex search patterns, like common word detection, full-text searches, or document similarity searches, using Hamming distance or L2 distance algorithms. Inverted indexes are useful when the number of keywords is not too large and when the existing data is either totally immutable or rarely changed, but frequently searched.

This post and Maria’s MSSQLTips post both cover the high-level concept, focusing on tradeoffs between different data models. I like this sort of idea a lot and like telling people that sometimes, the right answer in a relational database involves thinking backwards.
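
For a feel of what "thinking backwards" looks like, here is a minimal sketch of an inverted index in plain Python rather than in a database. It is not Maria's implementation: the tokenizer, the function names, and the sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents.
documents = {
    1: "Kafka streams events between services",
    2: "Write-ahead logging protects committed data",
    3: "Event streams and logging in Kafka",
}
index = build_inverted_index(documents)

print(search(index, "kafka logging"))  # {3}
print(search(index, "streams"))        # {1, 3}
```

Instead of scanning every document for a word, the lookup starts from the word and lands on the documents directly. That is why the structure suits data that is searched often and changed rarely: every update to a document means touching the entry for each word it contains.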

Understanding Write-Ahead Logging

Kevin Sookocheff explains how write-ahead logging protects data in databases:

A central tenet of databases is that any committed data survives a crash or a failure. Write-ahead logging is a fundamental primitive that ensures all changes to data are first written safely to stable storage before being applied. Couple that with some careful use of sequence numbers and we can guarantee that changes made to a database can survive system crashes.

This is a core feature in pretty much every relational database and Kevin dives into how one of the key algorithms behind it works.
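
The basic idea can be sketched in a few lines of Python. This is a toy, not the algorithm Kevin walks through: `TinyWAL` is a hypothetical key-value store that appends each change to a log file, forces it to disk, and only then applies it in memory. On startup it replays the log, using a log sequence number (LSN) to track what has been applied. It does not handle torn writes, checkpoints, or transactions.

```python
import json
import os
import tempfile


class TinyWAL:
    """A toy key-value store that logs every change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self.last_lsn = 0  # sequence number of the last applied record
        self._recover()

    def _recover(self):
        """Replay the log in order to rebuild in-memory state after a restart."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                # Skip anything already applied; apply the rest in log order.
                if record["lsn"] > self.last_lsn:
                    self._apply(record)

    def _apply(self, record):
        """Apply one logged change to the in-memory state."""
        self.data[record["key"]] = record["value"]
        self.last_lsn = record["lsn"]

    def put(self, key, value):
        """Write the change to stable storage first, then apply it."""
        record = {"lsn": self.last_lsn + 1, "key": key, "value": value}
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the record to disk before applying
        self._apply(record)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    store = TinyWAL(path)
    store.put("a", 1)
    store.put("a", 2)

    # Simulate a crash: drop the in-memory state and rebuild it from the log.
    recovered = TinyWAL(path)
    recovered_data = dict(recovered.data)
    recovered_lsn = recovered.last_lsn

print(recovered_data, recovered_lsn)
```

The ordering in `put` is the whole point: because the log record reaches disk before the in-memory change happens, anything the store has acknowledged can be rebuilt by replaying the log.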

Optimizing Azure Pricing for Storage and VMs

Shane Baldacchino continues a series on cost optimization in the cloud:

Cost. I have been fortunate to work for and help migrate one of Australia’s leading websites (seek.com.au) into the cloud and have worked for both large public cloud vendors. I have seen the really good, and the not-so-good, when it comes to architecture.

Cloud and cost. It can be quite a polarising topic. Do it right, and you can run super lean, drive down the cost to serve and ride the cloud innovation train. Conversely, do it wrong and treat public cloud like a datacentre, and your costs could be significantly higher than on-premises.

Click through for some good advice, including an appreciation of spot instances.

Building a Data Serving API in Azure

Justice Zishanhi has some recommendations for serving data in Azure:

Data is an important asset to all organizations big and small. As these organizations mature, building an end-to-end data platform to enable BI and AI at scale has become part of that journey. Some organizations have the requirement to expose modelled data in a data warehouse or data lake (Azure Data Lake Storage Gen2) to downstream consumer applications (mobile or web apps) where access patterns can be unpredictable with respect to frequency of access and/or type of data that is requested.

Data warehouse engines and data lakes are not designed for singleton transactional (request / response) interactions. To serve these requests at scale and to meet the different SLAs and access pattern unpredictability, data needs to be offloaded to a suitable database engine (i.e., a caching layer) that is built to serve such queries.

The “Design Patterns” section of this article highlights a generalized pattern for implementing a data serving API which meets this requirement – consisting of a Data Platform component and an API component. For implementing the API, two patterns are commonly adopted – a synchronous pattern or an asynchronous pattern. Both are explored in the “API Implementation Patterns” section of this article.

The example focuses on Cosmos DB and provides quite a bit of helpful guidance.
