Category: Architecture

Building Data Mesh Edges in Azure

Published 2021-12-29 by Kevin Feasel

Paul Andrew continues a series on data mesh in Azure:

Anyway, moving on. In part 2 of this blog series, keeping the same focus from part 1, with the first data mesh principal. Let’s take our nodes and start thinking about the edges. The data mesh – data product interfaces… Enter my Azure Resource Group with arms/antenna type things, seen on the right
Caveat: as you may have already gathered, I’m going to use the terms edge and interface a lot in this post. The meaning in the context of the data mesh is the same. Nodes with edges, nodes with interfaces.

Click through for more detail.

Comments closed

Building a Data Mesh in Azure

Published 2021-12-16 by Kevin Feasel

Paul Andrew starts a new series:

The concepts and principals of a data mesh architecture have been around for a while now and I’ve yet to see anyone else apply/deliver such a solution in Azure. I’m wondering if the concepts are so abstract that it’s hard to translate the principals into real world requirements, and maybe even harder to think about what technology you might actually need to deploy in your Azure environment.
Given this context (and certainly no fear of going first with an idea and being wrong ) here’s what I think we could do to build a data mesh architecture in the Microsoft cloud platform – Azure.

Click through for Paul’s take on the first data mesh principle.

Comments closed

Organizing Synapse Workspaces and Lakehouses

Published 2021-11-26 by Kevin Feasel

Jovan Popovic confirms that Microsoft is using the term “Lakehouse” like Databricks does:

The lakehouse pattern enables you to keep a large amount of your data in Data Lake and to get the analytic capabilities without a need to move your data to some data warehouse to start an analysis. A lakehouse represents a good trade-off between query performance and the ability to access the latest version of data without the need to wait for data to be reloaded.
Azure Synapse Analytics workspace enables you to implement the Lakehouse pattern on top of Azure Data Lake storage.
When you think about your lakehouse solution, be aware that there are two options for creating databases over the lake:
– Lake databases that are created using Spark or database template
– SQL databases that are created using serverless SQL pools on top of data lake.
Although you might use different tools and languages to create these types of databases, the principles described in this article apply to both types. I will use the term “lakehouse” whenever i reference Spak Lake database or SQL database created using the serverless SQL pools.

Click through for Jovan’s guidance.

Comments closed

A Case for EAV?

Published 2021-11-22 by Kevin Feasel

Erik Darling makes the case:

EAV styled tables can be excellent for certain data design patterns, particularly ones with a variable number of entries.
Some examples of when I recommend it are when users are allowed to specify multiple things, like:

I’m not sure I agree on the examples. When there are specific known things with expected shapes, I’d rather have a separate entity to model each. Even if each table is a single string, I’d still like the separation for logical modeling purposes.

That said, there are cases when EAV ends up being the best approach (unfortunately), particularly when you don’t even know the types of things a customer would wish to include. Just try to fight back hard when the inevitable request comes in to pivot all of that data.

Comments closed

Azure Synapse Database Templates

Published 2021-11-22 by Kevin Feasel

Aaron Merrill announces database templates for Azure Synapse Analytics:

The Synapse database template for Agriculture is a comprehensive data model that addresses the typical data requirements of organizations engaged in growing crops, raising livestock, and producing dairy products, including field and pasture management and satellite and drone data.
The Synapse database template for Energy & Commodity Trading is a comprehensive data model that addresses the typical data requirements of organizations engaged in trading energy, commodities, and/or carbon credits, whether as a primary trading business or in support of their supply chains, operating businesses, and hedging activities.

You may remember Microsoft buying ADRM Software a while back. This is why.

Comments closed

Azure Synapse Analytics Database Templates

Published 2021-11-08 by Kevin Feasel

Santosh Balasubramanian shows off database templates in Azure Synapse Analytics:

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources—at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
One of the challenges that users in key industry areas face is how to describe and shape the mass of data that they are gathering. Most of this data is currently stored in data lakes or in application-specific data silos. The challenge is to bring all this data together in a standardized format enabling it to be more easily analyzed and understood and for ML and AI to be applied to it.
Azure Synapse solves this problem by introducing industry-specific templates for your data, providing a standardized way to store and shape data. These templates provide schemas for predefined business areas, enabling data to be loaded into a database in a structured way.

Read on to see what they can do, and try them out in a Synapse workspace.

Comments closed

Event-Driven Programming

Published 2021-11-02 by Kevin Feasel

Gigi Sayfan explains the idea of event-driven programming:

Event-driven programming is a great approach for building complex systems. It embodies the divide-and-conquer principle while allowing you to continue using other approaches like synchronous calls.
When discussing event-based systems, several different terms often refer to the same concept. For simplicity, we’ll primarily use the terms listed below in bold:
– Event, message, and notification
– Producer, publisher, sender, and event source
– Consumer, receiver, subscriber, handler, and event sink
– Message queue and event queue

I really like the ideas behind event-driven programming. It’s most commonly useful when working with cloud services, but also in solving some difficult T-SQL problems.

Comments closed

Scaling Out vs Scaling Up

Published 2021-10-12 by Kevin Feasel

Jordan Braiuka compares two models for scaling:

We often get questions from customers about the best way to add capacity to their cluster. Is it better to add nodes, or simply to increase the capacity in their nodes? Unfortunately, the truth is there is no best way—like all complex issues in distributed systems, there are benefits and drawbacks to each scaling approach.
While each of our highly distributed systems (Apache Cassandra, Apache Kafka, etc.) have slightly different implementations of scaling, the concepts remain consistent across most distributed systems.

Click through for a comparison between the two approaches. As the article indicates, both are meaningful strategies and your choice might come down to a combination of the technology stack and the problem at hand.

Comments closed

Trying a Read-Only API

Published 2021-10-11 by Kevin Feasel

Mark Litwintschik reviews ROAPI:

ROAPI is an API Server that exposes CSV, JSON and Parquet files without the need to write any code. The project was started by Qingping Hou around this time last year. Qingping had spent the better part of four years working at LinkedIn prior to joining Scribd as a Senior Engineer. He is also a committer to both the Apache Airflow and Arrow projects.
ROAPI is made up of 4K lines of Rust. This line count is low due to the intense use of 3rd party libraries. These include Apache Arrow for, among other things, Parquet support, Arrow’s DataFusion Project, which provides SQL and query execution support, Actix, which provides the HTTP interface and Rusoto, the AWS SDK for Rust.

Click through to see how to set it up and how to use it.

Comments closed

Distribution Techniques in Azure Synapse Analytics

Published 2021-09-21 by Kevin Feasel

Gauri Mahajan takes us through three distribution techniques when working with Azure Synapse Analytics dedicated SQL pool tables:

Data warehouses host much larger volumes of data compared to transactional databases, the volume of reads is much more compared to writes and queries tend to result in much larger result sets compared to queries that retrieve scalar values or paginated record sets from transactional databases. Due to this nature of data warehouses, there is a higher impetus on the server to perform faster. Modern data warehouses like AWS Redshift, Azure Synapse, Snowflake and others employ approaches like data sharding where data is distributed horizontally on multiple nodes which process data in parallel. This approach is highly scalable as nodes can be easily added to a data cluster as the storage and performance need increases. One key aspect that is different for tables hosted on such data warehouses is that tables are distributed horizontally using different distribution algorithms, so that all the nodes in an Azure Synapse cluster have an equal share of responsibility for hosting, processing, and delivering data for any given query to maximize performance.
In this article, we will learn about the table distribution styles supported in an Azure Synapse and how to use them for creating distributed tables.

Read on to learn more. This is an example of something we don’t think about on the SQL Server side, so when moving to Azure Synapse Analytics dedicated SQL pools, it can be easy to get this wrong and end up with sub-optimal performance.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31