Press "Enter" to skip to content

Category: Architecture

Creating Orchestrators in Azure Data Factory

Martin Schoombee continues a series on building an orchestration framework in Azure Data Factory:

The orchestration layer of the framework is where all the magic happens. It facilitates the execution of processes and/or tasks as defined in the metadata, and needs to do it both seamlessly and efficiently. Ideally you would want to deploy this layer only once, and never have to touch it again. And it is really with that in mind that I designed this layer…to function independently and with minimal dependencies in both directions.

I would have loved for this layer to consist of only one pipeline but there are some nuances in Data Factory that make it impossible, the primary nuance being that you cannot nest ForEach activities. As a result, this layer contains three pipelines that will be covered by the sections below in more detail.

Read on to see what those three pipelines are.

Comments closed

Number of Fabric Workspaces and the Medallion Architecture

Kevin Chant opens a can of worms:

Since I got asked about it this week during the Learn Together session I did alongside Shabnam Watson (l/X). Plus, it is a highly debated topic in our community, and I wanted to share my thoughts about it.

Due to the fact that my personal opinion is that it depends. However, the number you choose depends on a variety of reasons which I intend to cover in this post.

By the end of this post, you will know my personal opinions as to why. Plus, plenty of things to consider when deciding on the number of workspaces to implement.

Read on for Kevin’s thoughts. My quick opinion is, one workspace per layer. Just from a logistical standpoint, keeping the several layers separated in one workspace is an immense challenge and typically requires exposing data engineering details (like what “gold”/”silver” or “curated”/”refined” actually means) with end users.

Comments closed

Listen and Notify in Postgres

Brandur Leach shows how to use PostgreSQL’s listen/notify capabilities:

Listen/notify in Postgres is an incredible feature that makes itself useful in all kinds of situations. I’ve been using it a long time, started taking it for granted long ago, and was somewhat shocked recently looking into MySQL and SQLite to learn that even in 2024, no equivalent exists.

In a basic sense, listen/notify is such a simple concept that it needs little explanation. Clients subscribe on topics and other clients can send on topics, passing a message to each subscribed client. The idea takes only three seconds to demonstrate using nothing more than a psql shell:

Read on to learn more about the notifier pattern. What’s interesting is that the notifier patter, which adds a fair bit of structure to this very simple process, makes it work a good bit like SQL Server’s Service Broker.

Comments closed

Three Layers of Azure Data Factory Framework Components

Martin Schoombee continues a series on orchestration in Azure Data Factory:

Before we dive into the details of the Data Factory pipelines, it is worth explaining the conceptual structure of my framework and its components. How it all fits together is important, and after reading the post on the metadata as well the pieces of the puzzle will hopefully start falling into place.

When I started thinking about what I’d like the framework to do, three conceptual layers started to emerge and we’ll review them from the bottom up:

Click through for the description of each layer.

Comments closed

Implementing a Star Schema for a Power BI Semantic Model

Nikola Ilic reminds us to keep Ralph Kimball’s Data Warehouse Toolkit book at hand:

But, what is a star schema in the first place? I have good and bad news for you:)…The bad news is: I’m not covering it in this article, because this one focuses on explaining how to implement a star schema in Power BI (assuming that you already know what star schema is). The good news is: I’ve already written about it, so go and read this article first, if you’re not sure what star schema represents in the world of data modeling…

Now, let’s get our hands dirty and build a star schema!

Read on for the demo.

Comments closed

Database Subetting and Data Generation

Phil Factor tells us about two possibilities for loading a lower environment:

When dealing with the development, testing and releasing of new versions of an existing production database, developers like to use their existing production data. In doing so, the development team will be hit with the difficulties of managing and accommodating the large amount of storage used by a typical production database. It’s not a new problem because the practical storage capacity has grown over the years in line with our ingenuity in finding ways of using it.

To deal with using production data for testing, we generally want to reduce its size by extracting a subset of the entities from a ‘production’ database, anonymized and with referential integrity intact. We then deliver this subset to the various development environments.

Phil gets into some detail on the process behind subsetting and then covers data generation as an alternative.

Comments closed

Cloud Governance Guidance in the Cloud Adoption Framework

Stephen Sumner notes an addition to the Microsoft Cloud Adoption Framework (CAF) for Azure:

We are thrilled to announce the latest enhancement to Microsoft’s Cloud Adoption Framework for Azure. We comprehensively updated our cloud governance guidance in the Govern section of the Cloud Adoption Framework (CAF). The updated governance guidance represents Microsoft’s commitment to supporting your organization’s cloud journey, offering a clearer, more accessible, and comprehensive path to effective cloud governance. It encompasses identity, cost, resource, data, and AI governance among other areas of governance categories.

Whether you’re a startup looking to scale efficiently or a large enterprise aiming to refine your governance practices, we designed this governance guidance to meet your needs and guide you to where you need to be.

Read on to learn more about what cloud governance means and the tooling available.

Comments closed

Designing for Direct Lake Mode

Paul Turley shares some advice:

Since the introduction of Power Pivot for Excel, SQL Server Analysis Services Tabular, Azure Analysis Services and Power BI; the native mode for storing data in a semantic data model (previously called a “dataset” in Power BI) has been a proprietary file structure consisting of binary and XML files. These file structures were established in the early days of multidimensional SSAS back in 2000 and 2005. When an Import mode model is published to the Power BI service, deployed to an SSAS server or when Power BI Desktop is running, data for the model is loaded into memory where it remains as long as the service is running. When users interact with a report or when DAX queries are run against the model, results are retrieved very quickly from the data residing in memory. There are some exceptions for very large models or when many models in the service don’t all fit into memory at the same time, the service will page some or all of the model in and out of memory to make sure that the most-often used model pages remain in memory for the next user request. But, for argument’s sake, the entire semantic model sits in memory, waiting for the next report or user request.

Rather than the proprietary SSAS file structure, Direct Lake models use the native Delta-parquet files that store structured data tables for a Fabric lakehouse or warehouse in One Lake. And rather than making a copy of the data in memory, the semantic model is a metadata structure that shares the same Delta-parquet file storage. As soon as a report runs against a model, all of the model data is paged into memory which then behaves a lot like an Import mode model. This means than while the model remains in memory, performance should about the same as Import, with a few exceptions.

Read on to see what the capabilities of Direct Lake mode are today, as well as a few design considerations for your Microsoft Fabric architecture.

Comments closed

Feature Engineering with Azure ML and Microsoft Fabric

Siliang Jiao, et al, talk architecture:

Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. The extracted features are used for training the models that can predict values for relevant business scenarios. A feature engineering system provides the tools, processes, and techniques used to perform feature engineering consistently and efficiently. 

This article elaborates on how to build a feature engineering system based on Azure Machine Learning managed feature store and Microsoft Fabric. 

Click through to see how the pieces fit together.

Comments closed

Data Management with Open Table Formats

Anandaganesh Balakrishnan covers a few open-source products and formats:

Apache Iceberg is an open-source table format designed for large-scale data lakes, aiming to improve data reliability, performance, and scalability. Its architecture introduces several key components and concepts that address the challenges commonly associated with big data processing and analytics, such as managing large datasets, schema evolution, efficient querying, and ensuring transactional integrity. Here’s a deep dive into the core components and architectural design of Apache Iceberg:

Click through for a review of Iceberg, Hudi, and the Delta Lake format.

Comments closed