Category: Architecture

We are thrilled to announce the latest enhancement to Microsoft’s Cloud Adoption Framework for Azure. We comprehensively updated our cloud governance guidance in the Govern section of the Cloud Adoption Framework (CAF). The updated governance guidance represents Microsoft’s commitment to supporting your organization’s cloud journey, offering a clearer, more accessible, and comprehensive path to effective cloud governance. It encompasses identity, cost, resource, data, and AI governance among other areas of governance categories.

Whether you’re a startup looking to scale efficiently or a large enterprise aiming to refine your governance practices, we designed this governance guidance to meet your needs and guide you to where you need to be.

Read on to learn more about what cloud governance means and the tooling available.

Comments closed

Designing for Direct Lake Mode

Published 2024-04-01 by Kevin Feasel

Paul Turley shares some advice:

Since the introduction of Power Pivot for Excel, SQL Server Analysis Services Tabular, Azure Analysis Services and Power BI; the native mode for storing data in a semantic data model (previously called a “dataset” in Power BI) has been a proprietary file structure consisting of binary and XML files. These file structures were established in the early days of multidimensional SSAS back in 2000 and 2005. When an Import mode model is published to the Power BI service, deployed to an SSAS server or when Power BI Desktop is running, data for the model is loaded into memory where it remains as long as the service is running. When users interact with a report or when DAX queries are run against the model, results are retrieved very quickly from the data residing in memory. There are some exceptions for very large models or when many models in the service don’t all fit into memory at the same time, the service will page some or all of the model in and out of memory to make sure that the most-often used model pages remain in memory for the next user request. But, for argument’s sake, the entire semantic model sits in memory, waiting for the next report or user request.

Rather than the proprietary SSAS file structure, Direct Lake models use the native Delta-parquet files that store structured data tables for a Fabric lakehouse or warehouse in One Lake. And rather than making a copy of the data in memory, the semantic model is a metadata structure that shares the same Delta-parquet file storage. As soon as a report runs against a model, all of the model data is paged into memory which then behaves a lot like an Import mode model. This means than while the model remains in memory, performance should about the same as Import, with a few exceptions.

Read on to see what the capabilities of Direct Lake mode are today, as well as a few design considerations for your Microsoft Fabric architecture.

Comments closed

Feature Engineering with Azure ML and Microsoft Fabric

Published 2024-03-18 by Kevin Feasel

Siliang Jiao, et al, talk architecture:

Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. The extracted features are used for training the models that can predict values for relevant business scenarios. A feature engineering system provides the tools, processes, and techniques used to perform feature engineering consistently and efficiently.

This article elaborates on how to build a feature engineering system based on Azure Machine Learning managed feature store and Microsoft Fabric.

Click through to see how the pieces fit together.

Comments closed

Data Management with Open Table Formats

Published 2024-03-07 by Kevin Feasel

Anandaganesh Balakrishnan covers a few open-source products and formats:

Apache Iceberg is an open-source table format designed for large-scale data lakes, aiming to improve data reliability, performance, and scalability. Its architecture introduces several key components and concepts that address the challenges commonly associated with big data processing and analytics, such as managing large datasets, schema evolution, efficient querying, and ensuring transactional integrity. Here’s a deep dive into the core components and architectural design of Apache Iceberg:

Click through for a review of Iceberg, Hudi, and the Delta Lake format.

Comments closed

Piecemeal Database Restoration

Published 2024-03-01 by Kevin Feasel

Chad Callihan restores an elephant one bite at a time…or something:

The larger a database grows, the more difficult it becomes to restore it in a timely manner. When a database is young, you might be able to manage full restores in seconds. But as it matures and backup sizes go from megabytes to gigabytes to terabytes, those restore times will expand as well.

If you plan ahead, it’s not always a requirement to restore the entire database if only part of the database is necessary. This is where the idea of piecemeal restores can save you time and wasted effort.

I’ve always found piecemeal database restoration more of an interesting idea than something quite practical. The problem is, if your data is so easily separable that you can restore one set and not need the other for some reasonable length of time, why are they in the same database? I understand that there are reasonable answers to this question, but I also rarely see those scenarios pop up.

Comments closed

Architecting a Public-Facing Azure Container Registry

Published 2024-01-24 by Kevin Feasel

Kumar Ashwin Hubert and Rajesh Singh share an architecture with us:

This reference architecture describes the deployment of secured Azure Container Registry for consuming docker images and artifacts by customer applications over external (public internet) network.

This architecture builds on Microsoft’s recommended security best practices to expose private applications for external access. It utilizes the ACR’s token and scope map feature to provide granular access control to ACR’s repositories. Also, ACR internally uses the Docker APIs, and it is recommended to be familiar with these concepts before deploying this architecture.

I think this is a great example of the good and the bad of Azure architectures. The good is that you get a thoughful, well-explained, thorough description of the services you need and how they fit together, and there are a lot of those in the Azure Architecture Center. The bad is that, if I want to secure one container registry, I need a dozen different services. If we didn’t have this particular architecture diagram, I doubt 1 in 50 cloud specialists would come up with all of these services.

Comments closed

Common Warehouse Load Patterns

Published 2024-01-02 by Kevin Feasel

Ben Johnston continues a series on warehouse load patterns:

This continues and finishes my two-part series on warehouse load patterns. There are many methods to transfer rows between systems from a basic design perspective. This isn’t specific to any ETL tool but rather the basic patterns for moving data. The most difficult part in designing a pattern is efficiency. It has to be accurate and not adversely impact the source system, but this is all intertwined and dependent on efficiency. You only want to move the rows that have changed or been added since the previous ETL execution, deltas. This reduces the network load, the source system load (I/O, CPU, locking, etc.), the destination system load. Being efficient also improves the speed and as a direct result it increases the potential frequency for each ETL run, which has a direct impact on business value.

The pattern you select depends on many things. The previous part of the series covers generic design patterns and considerations for warehouse loads that can be applied to most of the ETL designs presented below. This section covers patterns I have used in various projects. I’m sure there are some patterns I have missed, but these cover the most used types that I have seen. These are not specific to any data engine or ETL tool, but the examples use SQL Server as a base for functionality considerations. Design considerations, columns available, administrative support, DevOps practices, reliability of systems, and cleanliness of data all come into consideration when choosing your actual ETL pattern.

Click through for a compendium of common patterns you can use to indicate that a row should go into a warehouse.

Comments closed

Data Warehouse ETL Patterns

Published 2023-11-27 by Kevin Feasel

Ben Johnston starts a new series:

No matter the ETL tool used, there are some basic patterns to follow when transferring data between systems. There are many data tools and platforms, but the basic patterns remain the same. This focuses on SQL Server, but most of these methods work in any data platform. Even if you are using a virtualization layer, you likely need to prepare the data before exposing it to that engine, which means ETL and data transfers.

Warehouse is very loosely a data warehouse, but the same process applies to other systems. This includes virtualization layers, and to a smaller degree, bulk transfers between transactional systems.

Read on for a few things Ben recommends you have in place before beginning the project, as well as several warehouse loading patterns.

Comments closed

Surrogate Keys and Logical Data Models

Published 2023-11-22 by Kevin Feasel

I wrap up a series on database normalization:

This video serves as a coda, covering one topic I did not include in the main series: do surrogate keys belong in the logical data model?

The short answer is, “no.” The longer answer is, “no and here’s why.”

Comments closed

Choosing the Right Technology in the Modern Azure Data Warehouse

Published 2023-11-20 by Kevin Feasel

Josephine Bush has some advice:

Here’s a quick description of the options we explored:

Azure Data Factory – Orchestrates and automates data movement and transformation. You can create workflows, pipelines, and ETL (Extract, Transform, Load) processes using it.

Databricks – A unified data science, engineering, and analytics platform based on Apache Spark. It simplifies data exploration, preparation, and machine learning workflows, allowing teams to collaborate efficiently. Interactive notebooks make Databricks a versatile tool for scalable data analysis and processing.

Synapse – Integration of big data and data warehousing in the cloud. It facilitates collaborative analytics and AI-driven insights using serverless and provisioned resources across various data sources. Integrated analytics, warehousing, and data integration are part of Synapse’s unified experience.

Fabric – An all-in-one analytics solution for enterprises that offers data movement, data lakes, data engineering, data integration, data science, and real-time analytics.

Read on for pros and cons of different options Josephine & crew reviewed, as well as the option they landed on and why.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31