Press "Enter" to skip to content

Category: Architecture

Minimizing Latency in Kafka Streaming Applications using APIs

Abhishek Goswami doesn’t want to slow down the stream:

Kafka is widely adopted for building real-time streaming applications due to its fault tolerance, scalability, and ability to process large volumes of data. However, in general, Kafka streaming consumers work best only in an environment where they do not have to call external APIs or databases. In a situation when a Kafka consumer must make a synchronous database or API call, the latency introduced by network hops or I/O operations adds up and accumulates easily (especially when the streaming pipeline is performing an initial load of a large volume of data before starting CDC). This can significantly slow down the streaming pipeline and result in the blowing of system resources impacting the throughput of the pipeline. In extreme situations, this may even become unsustainable as Kafka consumers may not be able to commit offsets due to increased latency before the next polling call and get continuously rebalanced by the broker, practically not processing anything yet incrementally consuming more system resources as time passes.

This is a real problem faced by many streaming applications. In this article, we’ll explore some effective strategies to minimize latency in Kafka streaming applications where external API or database calls are inevitable. We’ll also compare these strategies with the alternative approach of separating out the parts of the pipeline that require these external interactions into a separate publish/subscribe-based consumer.

Read on to understand the causes of this latency and several patterns you can use to limit it.

Comments closed

A Primer on Medallion Architecture in Microsoft Fabric

Kenneth Omorodion builds a warehouse:

Data warehouses are essential components of modern analytics systems, offering optimized storage and processing capabilities for large volumes of data. When integrated with a Lakehouse architecture, you can combine the best of both worlds—structured, schema-enforced data storage with the flexibility and scalability of data lakes. Microsoft Fabric provides an excellent environment for implementing the Medallion Architecture, a design pattern for building efficient data processing pipelines by layering data into bronze, silver, and gold zones.

Click through for the process.

Comments closed

The Importance of Planning before Power BI Data Modeling

Kelly Broekstra recommends against jumping right in:

Who has been told by a manager or business person to just connect to the source data and start creating a new report? Here is my tip:

DON’T DO IT

All Power BI and Fabric reports must have a semantic model, which Microsoft describes as “a logical description of an analytical domain, with metrics, business-friendly terminology, and representation, to enable deeper analysis.” – Source

Read on to learn why and what you should instead do if you want to have a better long-term experience with Power BI.

Comments closed

Tips for Adopting Microsoft Fabric

Paul Turley shares some thoughts:

Hello, friends. I’ve spent the past few months working with several new Fabric customers who were seeking guidance and recommendations for Fabric architecture decisions. What have we learned about using Fabric in enterprise data settings in the past 11 months? This post covers some of the important decisions points and Fabric solution design patterns.

Much of the industry’s experience with Microsoft Fabric over the past several months has been at a high-level as organizations were dipping their toe in the pool to test the water. So far, our Data & AI team have assisted around 50 clients with Fabric projects of various sizes. We have also implemented a handful of production scale projects with enterprise workloads, comparing notes with community leaders and the product teams who develop the product. What lessons have we learned?

Click through for several bits of high-level architectural guidance intended to make that adoption easier.

Comments closed

Tips for Orchestrating Fabric Notebooks

Stepan Resl talks orchestration:

Let’s start by introducing what orchestration is and why it’s important to talk about shared resources. Orchestration is a discipline focused on managing and coordinating individual items or control elements to collectively manage the flow of our data operations. In the context of Fabric, this involves managing notebooks, dataflows, pipelines, stored procedures, semantic model updates, and many other items, activities, and services that may even be outside of Fabric.

Read on for some of the options, how they work in Microsoft Fabric, and tips for success.

Comments closed

Tablespaces in Oracle and PostgreSQL

Umair Shahid explains how tablespaces work in Oracle and PostgreSQL:

Tablespaces play an important role in database management systems, as they determine where and how database objects like tables and indexes are stored. Both Oracle and PostgreSQL have the concept of tablespaces, but they implement them differently based on the overall architecture of each database.

Oracle’s tablespaces are an integral part of the database that provide various functionalities, including separating data types, managing storage, and optimizing performance. PostgreSQL, on the other hand, takes a more simplified approach, using tablespaces primarily to control where physical files are stored.

This blog aims to provide a comprehensive comparison between Oracle and PostgreSQL tablespaces, covering their architecture, creation, and practical use cases, with the goal of helping DBAs better understand their capabilities and limitations

Read on to learn more about how tablespaces work in each platform and how they differ.

Comments closed

When to Think about Scalability

Steve Jones hits upon a dilemma:

There may not be a large workload in production either, at least not at first.

So, what do you worry about first: your code being used or performing well? That’s a similar question to this one: Worry about Scalability or Popularity First? While most of us don’t work for a startup and our organizations have some sort of financial stability, does popularity matter?

I don’t always have a solid answer for this. The closest I have is to try to make my baseline:

  • Easy for maintainers (including myself) to read
  • Reasonably efficient
  • Capable of some level of scale but not necessarily the most scalable

If you want practical terms, I create a somewhat-educated guess on approximately how many rows there will be in a steady state after launch and then multiply that by a factor of 2-3 when generating test data. If you can dodge 2-3x expectations, you can dodge a ball.

And if you suddenly balloon to 10x, you grouse and grumble and spend a couple of sprints digging out from the mess of success.

Comments closed

A Primer on Database Sharding

Adrien Payong covers one technique to scale out databases:

Companies of all sizes and across industries are struggling to cope with an explosion of data never before seen in the short history of computing. As applications reach new levels of sophistication and become deeply interconnected, these companies find themselves increasingly overworked, overheated, and at their wits’ end, desperately trying to squeeze just a bit more performance and availability out of their aging database architectures.

Enter sharding, a powerful database architecture pattern that offers a solution to these challenges. Sharding scales out databases as data volume and user load grow, providing performance and high availability by spreading a database’s data across multiple servers.

Read on to learn more about it. Adrien mentions MongoDB, Cassandra, MySQL, and Postgres, though the real trick of sharding is in the client, so it also works for other data platform technologies as well, including SQL Server.

Comments closed

Discerning a Star Schema from an Existing Report

Kelly Broekstra describes a common flow for business intelligence projects:

I have worked as a business intelligence developer for several years, and I’m always asked: “How do you convert user requirements to a functioning data model?”

I follow the Kimball methodology. For more information, check out the official pages.

But, here are some specific tips on what works for me.

Click through for those tips.

Comments closed

Real-Time Streaming in Azure

Temidayo Omoniyi takes us through an architecture:

In today’s world, billions of data are generated daily from messaging applications like WhatsApp, financial data like the New York Stock Exchange, or video streaming platforms like YouTube. As a data engineer or solution architect, you are tasked to design a real-time streaming platform that captures the data as they are generated and stored in the necessary storage for decision-making.

This does a great job of going into detail, not only at the architectural level, but also setup and practical implementation.

Comments closed