Category: Architecture

Data Management with Open Table Formats

Published 2024-03-07 by Kevin Feasel

Anandaganesh Balakrishnan covers a few open-source products and formats:

Apache Iceberg is an open-source table format designed for large-scale data lakes, aiming to improve data reliability, performance, and scalability. Its architecture introduces several key components and concepts that address the challenges commonly associated with big data processing and analytics, such as managing large datasets, schema evolution, efficient querying, and ensuring transactional integrity. Here’s a deep dive into the core components and architectural design of Apache Iceberg:

Click through for a review of Iceberg, Hudi, and the Delta Lake format.

Piecemeal Database Restoration

Published 2024-03-01 by Kevin Feasel

Chad Callihan restores an elephant one bite at a time…or something:

The larger a database grows, the more difficult it becomes to restore it in a timely manner. When a database is young, you might be able to manage full restores in seconds. But as it matures and backup sizes go from megabytes to gigabytes to terabytes, those restore times will expand as well.

If you plan ahead, it’s not always a requirement to restore the entire database if only part of the database is necessary. This is where the idea of piecemeal restores can save you time and wasted effort.

I’ve always found piecemeal database restoration more of an interesting idea than something quite practical. The problem is, if your data is so easily separable that you can restore one set and not need the other for some reasonable length of time, why are they in the same database? I understand that there are reasonable answers to this question, but I also rarely see those scenarios pop up.

Architecting a Public-Facing Azure Container Registry

Published 2024-01-24 by Kevin Feasel

Kumar Ashwin Hubert and Rajesh Singh share an architecture with us:

This reference architecture describes the deployment of a secured Azure Container Registry so that customer applications can consume Docker images and artifacts over an external (public internet) network.

This architecture builds on Microsoft’s recommended security best practices to expose private applications for external access. It utilizes the ACR’s token and scope map feature to provide granular access control to ACR’s repositories. Also, ACR internally uses the Docker APIs, and it is recommended to be familiar with these concepts before deploying this architecture.

I think this is a great example of the good and the bad of Azure architectures. The good is that you get a thoughtful, well-explained, thorough description of the services you need and how they fit together, and there are a lot of those in the Azure Architecture Center. The bad is that, if I want to secure one container registry, I need a dozen different services. If we didn’t have this particular architecture diagram, I doubt 1 in 50 cloud specialists would come up with all of these services.

Common Warehouse Load Patterns

Published 2024-01-02 by Kevin Feasel

Ben Johnston continues a series on warehouse load patterns:

This continues and finishes my two-part series on warehouse load patterns. There are many methods to transfer rows between systems from a basic design perspective. This isn’t specific to any ETL tool but rather the basic patterns for moving data. The most difficult part in designing a pattern is efficiency. It has to be accurate and not adversely impact the source system, but this is all intertwined and dependent on efficiency. You only want to move the deltas: the rows that have changed or been added since the previous ETL execution. This reduces the network load, the source system load (I/O, CPU, locking, etc.), and the destination system load. Being efficient also improves the speed and, as a direct result, increases the potential frequency of each ETL run, which has a direct impact on business value.

The pattern you select depends on many things. The previous part of the series covers generic design patterns and considerations for warehouse loads that can be applied to most of the ETL designs presented below. This section covers patterns I have used in various projects. I’m sure there are some patterns I have missed, but these cover the most used types that I have seen. These are not specific to any data engine or ETL tool, but the examples use SQL Server as a base for functionality considerations. Design considerations, columns available, administrative support, DevOps practices, reliability of systems, and cleanliness of data all come into consideration when choosing your actual ETL pattern.

Click through for a compendium of common patterns you can use to indicate that a row should go into a warehouse.
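
To make the idea of moving only deltas concrete, here is a minimal sketch of one common approach, a high-water mark on a modified-date column. It is not taken from Ben's article: the `customer` table, its `modified_at` column, and the `load_deltas` function are hypothetical, and SQLite stands in for SQL Server so the example runs anywhere.

```python
import sqlite3


def load_deltas(source, warehouse, last_watermark):
    """Copy rows changed after last_watermark; return (new_watermark, row_count).

    Assumes a hypothetical customer table whose modified_at column is an
    ISO-8601 string updated on every insert or update.
    """
    rows = source.execute(
        "SELECT id, name, modified_at FROM customer "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    # Upsert by primary key so changed rows overwrite their earlier copies.
    warehouse.executemany(
        "INSERT OR REPLACE INTO customer (id, name, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    warehouse.commit()
    new_watermark = rows[-1][2] if rows else last_watermark
    return new_watermark, len(rows)


# Two in-memory databases play the source system and the warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
ddl = "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)"
source.execute(ddl)
warehouse.execute(ddl)
source.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ada", "2024-01-01T00:00:00"), (2, "Grace", "2024-01-02T00:00:00")],
)

# First run: an empty watermark means every row counts as new.
watermark, moved_first = load_deltas(source, warehouse, "")

# One row changes at the source; the second run should move only that row.
source.execute(
    "UPDATE customer SET name = 'Ada L.', modified_at = '2024-01-03T00:00:00' "
    "WHERE id = 1"
)
watermark, moved_second = load_deltas(source, warehouse, watermark)
```

A watermark like this does not detect rows deleted at the source, which is one reason a real load usually pairs it with another pattern.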

Data Warehouse ETL Patterns

Published 2023-11-27 by Kevin Feasel

Ben Johnston starts a new series:

No matter the ETL tool used, there are some basic patterns to follow when transferring data between systems. There are many data tools and platforms, but the basic patterns remain the same. This focuses on SQL Server, but most of these methods work in any data platform. Even if you are using a virtualization layer, you likely need to prepare the data before exposing it to that engine, which means ETL and data transfers.

Warehouse is very loosely a data warehouse, but the same process applies to other systems. This includes virtualization layers, and to a smaller degree, bulk transfers between transactional systems.

Read on for a few things Ben recommends you have in place before beginning the project, as well as several warehouse loading patterns.

Surrogate Keys and Logical Data Models

Published 2023-11-22 by Kevin Feasel

I wrap up a series on database normalization:

This video serves as a coda, covering one topic I did not include in the main series: do surrogate keys belong in the logical data model?

The short answer is, “no.” The longer answer is, “no and here’s why.”

Choosing the Right Technology in the Modern Azure Data Warehouse

Published 2023-11-20 by Kevin Feasel

Josephine Bush has some advice:

Here’s a quick description of the options we explored:

Azure Data Factory – Orchestrates and automates data movement and transformation. You can create workflows, pipelines, and ETL (Extract, Transform, Load) processes using it.

Databricks – A unified data science, engineering, and analytics platform based on Apache Spark. It simplifies data exploration, preparation, and machine learning workflows, allowing teams to collaborate efficiently. Interactive notebooks make Databricks a versatile tool for scalable data analysis and processing.

Synapse – Integration of big data and data warehousing in the cloud. It facilitates collaborative analytics and AI-driven insights using serverless and provisioned resources across various data sources. Integrated analytics, warehousing, and data integration are part of Synapse’s unified experience.

Fabric – An all-in-one analytics solution for enterprises that offers data movement, data lakes, data engineering, data integration, data science, and real-time analytics.

Read on for pros and cons of different options Josephine & crew reviewed, as well as the option they landed on and why.

Updates to Azure Well-Architected Review Assessments

Published 2023-11-15 by Kevin Feasel

Stephen Sumner shows off some changes:

Microsoft is excited to announce a significant update to the Azure Well-Architected Review assessment. The assessment helps you build and optimize workloads. It walks you through a series of questions about your workload. Based on your responses, it generates tailored and prioritized recommendations to improve your workload design. The guidance is actionable and applicable to nearly every workload. It aligns with the latest best practices across the five key pillars of reliability, security, cost optimization, operational excellence, and performance efficiency (see figure 1).

I’m a big fan of the Well-Architected Framework and the assessments Microsoft has put together. An assessment can take teams within a company days to complete because the questions are so thorough, but once you do get through the list, you’ll get some great practical insights on your setup and what you can do to improve performance and save money.

Comparing Service Endpoints and Private Endpoints in Azure

Published 2023-10-25 by Kevin Feasel

Khushbu Gandhi clarifies a choice:

For a long time, if you were using the multi-tenant, PaaS version of many Azure services, then you had to access them over the internet with no way to restrict access just to your resources. This restriction was primarily down to the complexity of doing this sort of restriction with a multi-tenant service. At that time, the only way to get this sort of restriction was to look at using single-tenant solutions like App Service Environment or running the service yourself in a VM instead of using PaaS.

This public access was a concern for many, and so Microsoft implemented new services that allow you to limit access to these multi-tenant services. Today, we have two solutions that on the face of it look quite similar: Service Endpoints and Private Link/Endpoints. These two services are both designed to allow you to restrict who connects to your service, and how they do it. Because of this, it can be confusing to know which service to use and what the benefits are. In this article, we will look at these services and try to make your decision clearer.

Read on to see what the differences are between the two, as well as a comparison table and recommendations on which to choose in what circumstances.

Building a Multi-Tenant Database

Published 2023-10-25 by Kevin Feasel

Adron Hall looks at multi-tenancy within Postgres:

Music has always been a significant part of my life. From the melodies that accompany my daily routines to the anthems of my most memorable moments, it’s been a constant. As my collection grew, I realized I needed a better way to organize it. That’s when I stumbled upon the concept of multi-tenancy databases and decided to give it a shot with PostgreSQL. Here’s my experience.

Multi-tenancy is one case in which I’m much more relaxed about including the tenant ID on tables where it is not absolutely necessary in order to prevent a series of joins to get the appropriate tenant ID. We can quibble about whether that’s reasonable denormalization or appropriate use of a superkey—especially because, in SQL Server, tenant ID ends up being part of the clustered index and likely part of the primary key anyhow—but it’s extremely useful nonetheless.
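
As a rough illustration of that trade-off, the sketch below carries a redundant `tenant_id` on a child table so a tenant's rows can be found without walking back up through the parent. The `album` and `track` tables, their columns, and both helper functions are hypothetical and not taken from Adron's post, and SQLite stands in for Postgres so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE album (
    album_id  INTEGER PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    title     TEXT NOT NULL
);
-- album_id alone identifies the owning tenant, so tenant_id on track is
-- redundant; it is carried here only to avoid the join.
CREATE TABLE track (
    track_id  INTEGER PRIMARY KEY,
    album_id  INTEGER NOT NULL REFERENCES album (album_id),
    tenant_id INTEGER NOT NULL,
    title     TEXT NOT NULL
);
INSERT INTO album VALUES (10, 1, 'Album A'), (20, 2, 'Album B');
INSERT INTO track VALUES
    (100, 10, 1, 'Track A1'),
    (101, 10, 1, 'Track A2'),
    (200, 20, 2, 'Track B1');
""")


def tracks_via_join(conn, tenant_id):
    """Normalized route: find the tenant by joining back to album."""
    rows = conn.execute(
        "SELECT t.title FROM track AS t "
        "JOIN album AS a ON a.album_id = t.album_id "
        "WHERE a.tenant_id = ? ORDER BY t.track_id",
        (tenant_id,),
    )
    return [r[0] for r in rows]


def tracks_direct(conn, tenant_id):
    """Denormalized route: filter on the tenant_id stored on track itself."""
    rows = conn.execute(
        "SELECT title FROM track WHERE tenant_id = ? ORDER BY track_id",
        (tenant_id,),
    )
    return [r[0] for r in rows]


tenant_one_tracks = tracks_direct(conn, 1)
```

With only two tables the join is cheap, but the same redundant column saves a longer chain of joins as the hierarchy gets deeper. The cost is that nothing in this sketch keeps `track.tenant_id` consistent with the parent album; a composite foreign key on (`tenant_id`, `album_id`) is one way to enforce that.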
