Press "Enter" to skip to content

Category: Architecture

Azure API Management in front of Databricks and OpenAI

Drew Furgiuele has a follow-up:

A few months ago, I wrote a blog post about using Azure API Management with Databricks Model Serving endpoints. It struck a chord with a lot of people using Databricks on Azure specifically, because more and more people and organizations are trying their damndest to wrangle all the APIs they use and/or deploy themselves. Recently, I got an email from someone who read it and asked a really good question:

Click through for that question, as well as Drew’s answer.

Leave a Comment

Retry Resiliency in Apache Kafka Pipelines

Ravi Teja Thutari explains the value of idempotence in moving data between systems:

In modern flight booking systems, streaming fare updates and reservations through distributed microservices is common. These pipelines must be retry-resilient, ensuring that transient failures or replays don’t cause duplicate bookings or stale pricing. A core strategy is idempotency: each event (e.g., a fare-update or booking command) carries a unique identifier so processing it more than once has no adverse effect. 

Read on to learn more. For reference, idempotence is a property of an operation where you can run through the operation as many times as you wish and will always end up at the same result. In the data operations world, this ties to the final state in a database. If I run a process once and it adds three rows to the database, I should be able to run the process a second time and end up with those exact three rows, no more, no fewer, and no different.

Leave a Comment

Tips for Highly Available PostgreSQL Systems

Semab Tariq provides some high-level guidance:

In today’s digital landscape, downtime isn’t just inconvenient, it’s costly. No matter what business you are running, an e-commerce site, a SaaS platform, or critical internal systems, your PostgreSQL database must be resilient, recoverable, and continuously available. So in short

High Availability (HA) is not a feature you enable; it’s a system you design.

In this blog, we will walk through the important things to consider when setting up a reliable, production-ready HA PostgreSQL system for your applications.

Click through for a variety of things to think about. Most of this will apply to other database systems as well, though specific tools will differ.

Leave a Comment

High Availability Architecture for PostgreSQL

Umair Shahid adds a 9:

Most teams building production applications understand that “uptime” matters. I am writing this blog to demonstrate how much difference an extra 0.09% makes.

At 99.9% availability, your system can be down for over 43 minutes every month. At 99.99%, that window drops to just over 4 minutes. If your product is critical to business operations, customer workflows, or revenue generation, those 39 extra minutes of downtime each month can be the difference between trust and churn.

Click through for some of the tools and practices that can help get you there in PostgreSQL.

Comments closed

The Small Data Showdown in Microsoft Fabric

Miles Cole does a bit of testing:

First, let’s revisit the purpose of the benchmark: The objective is to explore data engineering engines available in Fabric to understand whether Spark with vectorized execution (the Native Execution Engine) should be considered in small data architectures.

Beyond refreshing the benchmark to see if any core findings have changed, I do want to expand in a few areas where I got great feedback from the community:

I really appreciate the approach behind this, both in terms of sticking to more realistic data sizes for many operations as well as performing this test given all of the recent improvements in each engine.

Comments closed

Building Entity-Relationship Diagrams with DBeaver

Dave Stokes builds a diagram:

Even the most experienced database professionals are known to feel a little anxious when peering into an unfamiliar database. Hopefully, they inspect to see how the data is normalized and how the various tables are combined to answer complex queries.  Entity Relationship Maps (ERM) provide a visual overview of how tables are related and can document the structure of the data.

Read on to see how you can do this with the DBeaver database access client.

Comments closed

Building an ML-Friendly Data Lake with Apache Iceberg

Anant Kumar designs a data lake:

As companies collect massive amounts of data to fuel their artificial intelligence and machine learning initiatives, finding the right data architecture for storing, managing, and accessing such data is crucial. Traditional data storage practices are likely to fall short to meet the scale, variety, and velocity required by modern AI/ML workflows. Apache Iceberg steps in as a strong open-source table format to build solid and efficient data lakes for AI and ML.

Click through for a primer on Iceberg, how to set up a fairly simple data lake, and some functionality that can help in model training.

Comments closed

Comparing Microsoft Fabric Engines

Nikola Ilic performs a comparison:

Before we proceed, an important disclaimer: the guidance I’m providing here is based on both my experience with implementing Microsoft Fabric in real-world scenarios, and the recommended practices provided by Microsoft. 

Please keep in mind that the guidance relies on general recommended practices (I intentionally avoid using the phrase best practices, because the best is very hard to determine and agree on). The word general means that the practice I recommend should be used in most of the respective scenarios, but there will always be edge cases when the recommended practice is simply not the best solution. Therefore, you should always evaluate whether the general recommended practice makes sense in your specific use case.

Click through for a comparison between three engines: the lakehouse, the warehouse, and the eventhouse. It would really simplify things if the lakehouse and warehouse combined into one coherent whole.

Comments closed

Organizing a Microsoft Fabric Data Platform with Domains

Jon Vöge does a bit of organization:

A topic which seems more relevant than ever, is the question of how to organize the contents of your Microsoft Fabric Platform.

Through the contents of a few blogs, I will give you an overview of things to consider, as well as suggestions that you can choose from when designing your platform.

This first week, we’ll take a look at Domains in Microsoft Fabric.

Read on to understand why domains can be valuable and a solid way to structure them.

Comments closed

Choosing a Warehousing Data Architecture

James Serra compares and contrasts OLAP architectures:

As discussed in my blog and book “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh” (Amazon), organizations are often challenged with choosing the right data architecture to meet their business goals—especially as AI and data-driven decision-making take center stage. To help clarify, here’s a quick review of the four core architectures, followed by guidance on when to use each. Each architecture includes five stages of data movement – ingest, store, transform, model, and visualize (described here).

Click through for James’s take on how each of them works and when you might choose one over the other.

Comments closed