Press "Enter" to skip to content

Category: Architecture

Tips for Orchestrating Fabric Notebooks

Stepan Resl talks orchestration:

Let’s start by introducing what orchestration is and why it’s important to talk about shared resources. Orchestration is a discipline focused on managing and coordinating individual items or control elements to collectively manage the flow of our data operations. In the context of Fabric, this involves managing notebooks, dataflows, pipelines, stored procedures, semantic model updates, and many other items, activities, and services that may even be outside of Fabric.

Read on for some of the options, how they work in Microsoft Fabric, and tips for success.

Comments closed

Tablespaces in Oracle and PostgreSQL

Umair Shahid explains how tablespaces work in Oracle and PostgreSQL:

Tablespaces play an important role in database management systems, as they determine where and how database objects like tables and indexes are stored. Both Oracle and PostgreSQL have the concept of tablespaces, but they implement them differently based on the overall architecture of each database.

Oracle’s tablespaces are an integral part of the database that provide various functionalities, including separating data types, managing storage, and optimizing performance. PostgreSQL, on the other hand, takes a more simplified approach, using tablespaces primarily to control where physical files are stored.

This blog aims to provide a comprehensive comparison between Oracle and PostgreSQL tablespaces, covering their architecture, creation, and practical use cases, with the goal of helping DBAs better understand their capabilities and limitations

Read on to learn more about how tablespaces work in each platform and how they differ.

Comments closed

When to Think about Scalability

Steve Jones hits upon a dilemma:

There may not be a large workload in production either, at least not at first.

So, what do you worry about first: your code being used or performing well? That’s a similar question to this one: Worry about Scalability or Popularity First? While most of us don’t work for a startup and our organizations have some sort of financial stability, does popularity matter?

I don’t always have a solid answer for this. The closest I have is to try to make my baseline:

  • Easy for maintainers (including myself) to read
  • Reasonably efficient
  • Capable of some level of scale but not necessarily the most scalable

If you want practical terms, I create a somewhat-educated guess on approximately how many rows there will be in a steady state after launch and then multiply that by a factor of 2-3 when generating test data. If you can dodge 2-3x expectations, you can dodge a ball.

And if you suddenly balloon to 10x, you grouse and grumble and spend a couple of sprints digging out from the mess of success.

Comments closed

A Primer on Database Sharding

Adrien Payong covers one technique to scale out databases:

Companies of all sizes and across industries are struggling to cope with an explosion of data never before seen in the short history of computing. As applications reach new levels of sophistication and become deeply interconnected, these companies find themselves increasingly overworked, overheated, and at their wits’ end, desperately trying to squeeze just a bit more performance and availability out of their aging database architectures.

Enter sharding, a powerful database architecture pattern that offers a solution to these challenges. Sharding scales out databases as data volume and user load grow, providing performance and high availability by spreading a database’s data across multiple servers.

Read on to learn more about it. Adrien mentions MongoDB, Cassandra, MySQL, and Postgres, though the real trick of sharding is in the client, so it also works for other data platform technologies as well, including SQL Server.

Comments closed

Discerning a Star Schema from an Existing Report

Kelly Broekstra describes a common flow for business intelligence projects:

I have worked as a business intelligence developer for several years, and I’m always asked: “How do you convert user requirements to a functioning data model?”

I follow the Kimball methodology. For more information, check out the official pages.

But, here are some specific tips on what works for me.

Click through for those tips.

Comments closed

Real-Time Streaming in Azure

Temidayo Omoniyi takes us through an architecture:

In today’s world, billions of data are generated daily from messaging applications like WhatsApp, financial data like the New York Stock Exchange, or video streaming platforms like YouTube. As a data engineer or solution architect, you are tasked to design a real-time streaming platform that captures the data as they are generated and stored in the necessary storage for decision-making.

This does a great job of going into detail, not only at the architectural level, but also setup and practical implementation.

Comments closed

Query Re-Optimization in Postgres

Andrei Lepikhov walks through an interesting scenario:

What was the impetus to begin this work? It was caused by many real cases that may be demonstrated clearly by the Join Order Benchmark. How much performance do you think Postgres loses if you change its preference of employing parallel workers from one to zero? Two times regression? What about 10 or 100 times slower?

The black line in the graph below shows the change in execution time of each query between two cases: with parallel workers disabled and with a single parallel worker per gather allowed. For details, see the test script and EXPLAINs, with and without parallel workers.

Click through for an overview of what Andrei wrote, including architectural notes. But stick around until the end to see just how difficult the challenge is to re-optimize without making performance worse in the end.

Comments closed

A Reference Architecture for Microsoft Fabric

James Serra draws boxes:

Microsoft Fabric uses a data lakehouse architecture, which means it does not use a relational data warehouse (with its relational engine and relational storage) and instead uses only a data lake to store data. Data is stored in Delta lake format so that the data lake acquires relational data warehouse-like features (check out my book that goes into much detail on this, or my video). Here is what a typical architecture looks like when using Fabric (click here for the .vsd):

Click through for the image as well as James’s explanation of the components.

Comments closed

Multi-Tenant Database Design Choices for SQL Server

Erik Darling finally has a blog post whose text I can quote versus a string of good videos to check out:

When you’re designing the back end of a database, people will have all sorts of ideas.

Normalization, partitioning, referential integrity, and, usually, trying to figure out what to do when you have more than one client.

If your application is user-focused (like Stack Overflow), you don’t have to struggle too much with the idea of isolation. But when your application is geared more towards supporting multiple entities that have one or more users, things change. Sort of like how Stack Overflow manages all the other Stack Network sites.

Were you to ask me which model I prefer, it would be every tenant getting their own database. Your other options are:

  • Everyone all mixed in together like gen-pop
  • Using separate schemas inside a single database

Definitely read what Erik has to say. My prior job was a hybrid multi-tenant environment: for the main transactional database, there were several dozen SQL Server instances. Each instance had anywhere from one to a few dozen copies of the transactional database, and each database hosted one or more customers’ data. There’s not a lot of tooling out there to support that kind of strategy, so we had to build a lot of it in-house. But that said, it did work out reasonably well without having hundreds or thousands of databases on a single instance.

Comments closed

An Overview of Slowly Changing Dimensions

Reza Rad talks slowly changing dimensions:

If you want to use Power BI, Microsoft Fabric, or any other data analytics tools, one of the key concepts to understand when working with a data warehouse system is the SCD (Slowly Changing Dimension). I will do this in a series of at least two articles. The first one (this one) will be on the concept of what SCD is, its meaning, and its different types. Then, the next one will discuss how to implement SCD types (such as Type 2) using Microsoft Fabric and Power BI.

Reza focuses on SCD types 0-4 but does briefly touch on types 5-7 (of which, I’d never heard of SCD type 7).

Comments closed