Category: Architecture

Knowledge Graphs, Data Fabrics, and Data Meshes

Alan Morrison describes the history of three concepts:

By 2014, SAP was using “in-memory data fabric” to describe a virtual data warehouse, a key element of its HANA “360-degree customer view” product line. Gartner, for its part, uses the term “data fabric” to this day as an all-encompassing means of heterogeneous data integration. From a 2021 post on data fabric architecture:

Read on for a high-level discussion of what each is and how it fits into the context of data warehouses and data lakes.

Defining Data Products Down

Paul Andrew plays Papers, Please! with data mesh:

The reasons for calling out what should not be declared as a data product can take many different forms and in some cases even result in anti-patterns that should be addressed as technical debt, etc. Governance and compliance have always been a key part of successful platform deliveries, but who/how do we go about policing this, both technically and in the context of people/process? Maybe ‘Data Product Police’ could be a thing!

Read on to see how to spot a data pretender in a field of data products.

Minimum Viable Data Mesh in Azure

Paul Andrew was on a podcast:

For Paul, delivering a single data mesh data product on its own is not all that valuable – if you are going to go to the expense of implementing data mesh, you need to be able to satisfy use cases that cross domains. The greater value is in cross-domain interoperability: getting to a data product that wasn’t possible before. And you need to deliver the data platform alongside those first 2-3 data products; otherwise, you create a very hard-to-support data asset, not really a data product.

When thinking about minimum viable data mesh, Paul views an approach leveraging DevOps and CI/CD – Continuous Integration/Continuous Delivery – as crucial. You need repeatability and reproducibility to really call something a data product.

Click through for the interview as well as Scott Hirleman’s summary.

The Move to General Database Platforms

Steve Jones muses on specialization in data platforms:

It’s been a decade-plus of the Not-Only-SQL (NoSQL) movement where a large variety of specialized database platforms have been developed and sold. It seems that there are so many different platforms for data stores that you can find one for whatever specialized type of data you are working with. However, is that what people are doing to store data in their applications?

I saw this piece on the return to the general-purpose database, postulating that a lot of the NoSQL database platforms have added additional capabilities that make them less specialized and more generalized. I’ve seen some of this, just as many relational platforms have added features that compete with one of the NoSQL classes of databases. The NoSQL datastores might be adding SQL-like features because some of these platforms are too specialized, and the vendors have decided they need to cover a slightly wider set of use cases.

I see three overlapping forces in play here. First, you have vendors looking at the Total Addressable Market for their specific technology (document, key-value, graph, whatever) versus the size of the relational database market, and they salivate about getting those general-purpose fat stacks of cash. That’s what Steve is getting at in the graf above.

I think the second force is that specialization is ultimately a sucker’s game when it comes to databases. By specializing in one area, you ultimately sacrifice others. A tool like Elasticsearch is outstanding as a document search engine but miserable as an aggregation engine (ask me how I know—it’s like every six months, another product team decides that this time, they’ll get stats aggregation with Elasticsearch to work well…and six months after people actually start to use the thing, they move all the data to someplace else that is adequately queryable). Similarly, document databases are excellent for populating details in an application but not at all good at aggregation or arbitrary queries connecting data together. Specialization seems like a great idea until new requirements come in which require advanced reporting.
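
To make that trade-off concrete, here is a minimal, purely illustrative Python sketch (the data and names are made up, with a plain dict standing in for a document store) of why the document/key-value shape is great for point lookups but pushes aggregation work into application code:

```python
# Illustrative only: a dict standing in for a document store.
orders = {
    "order-1": {"customer": "alice", "total": 40.0},
    "order-2": {"customer": "bob",   "total": 25.0},
    "order-3": {"customer": "alice", "total": 15.0},
}

# Point lookup: the access pattern a document store is optimized for.
print(orders["order-2"]["total"])  # 25.0

# Aggregation: the application has to scan every document itself --
# the kind of thing a relational engine expresses as a one-line GROUP BY.
totals = {}
for doc in orders.values():
    totals[doc["customer"]] = totals.get(doc["customer"], 0.0) + doc["total"]
print(totals)  # {'alice': 55.0, 'bob': 25.0}
```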

The third force is that these systems are independent, and getting them to talk to each other typically involves writing a lot of ETL/ELT code or using additional third-party tools. To the extent that there are data virtualization platforms, they’re either excruciatingly slow (e.g., PolyBase) or expensive and out of date because they cache the data periodically. A corollary of the third force is that different platforms tend to use different languages, and trying to remember which of the three or four languages you need in order to access a given piece of data can be a bit painful. This is part of the reason Feasel’s Law exists.
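
As a hedged illustration of what that glue code tends to look like (the function names and schema here are hypothetical stand-ins, not any particular product’s API), consider hand-rolled ETL that flattens documents out of one store into a relational table in another:

```python
import sqlite3

def fetch_documents():
    # Hypothetical stand-in for a document-store client call.
    return [
        {"id": 1, "customer": "alice", "total": 40.0},
        {"id": 2, "customer": "bob", "total": 25.0},
    ]

def load_into_warehouse(conn, docs):
    # Flatten each document into a row. Every new field in the source
    # means another change here -- this is where the maintenance cost lives.
    rows = [(d["id"], d["customer"], d["total"]) for d in docs]
    conn.executemany(
        "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)", rows
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
load_into_warehouse(conn, fetch_documents())
print(conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer"
).fetchall())  # [('alice', 40.0), ('bob', 25.0)]
```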

The net result of all of this is that you end up with the same piece of information in several separate places, plus complicated systems built to keep those separate systems aligned. Each system is (theoretically) optimized for a given use case, but you end up with more and more people spending their time gluing together data from disparate systems, ensuring that data in disparate systems matches up, or moving data between disparate systems. If you need to do all of this, then sure, do it. But if there’s a single general-purpose platform which does all of this stuff 90% as well, a large number of companies and use cases will do just fine with the single tool. And that’s why general-purpose database platforms are still so popular and why I believe they will remain popular indefinitely.

The biggest exception I see is caching, but that’s because it’s more of a “fire-and-forget” data storage system. If you do it right, you don’t have any ETL/ELT to or from the cache, and if the cache dies, your system continues to work (albeit more slowly than with the cache). It’s also tied to a specific application and only exists temporarily, so data mismatches are (hopefully) transitory enough not to matter.
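
Here’s a minimal cache-aside sketch in Python (hypothetical names throughout; imagine Redis or memcached behind the cache dict) showing that graceful-degradation property: when the cache is unavailable, reads simply fall through to the database and the application keeps working, just more slowly:

```python
cache = {}  # stand-in for a real cache such as Redis or memcached

def read_from_database(key):
    # Placeholder for a real (slower) database query.
    return f"value-for-{key}"

def get(key, cache_healthy=True):
    if cache_healthy and key in cache:
        return cache[key]                 # cache hit: fast path
    value = read_from_database(key)       # miss, or the cache is down
    if cache_healthy:
        cache[key] = value                # repopulate; no ETL/ELT involved
    return value

print(get("user:42"))                        # miss -> reads the database
print(get("user:42"))                        # hit  -> served from cache
print(get("user:42", cache_healthy=False))   # cache down -> still works
```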

Data Products in Data Mesh

Paul Andrew takes us through a thought process:

In the context of an idealistic data mesh architecture, establishing a working definition of a data product seems to be a very real problem for most. What constitutes a data product seems to be very subjective, circumstantial in terms of requirements, and interlaced with platform technical maturity. AKA, a ‘minefield’ to navigate in definitional terms.

To help get my thoughts in order (as always), here is my current thinking and definition for a data mesh data product.

Read on for Paul’s thoughts.

Synapse Database Templates GA

Kevin Schofield makes an announcement:

We’re pleased to announce today that Synapse Database Templates are now Generally Available and that we are also making available three additional Synapse Database Templates for Healthcare Insurance, Healthcare Providers, and R&D and Clinical Trials.

The Healthcare Insurance template is a comprehensive data model that addresses the typical data requirements of organizations providing insurance to cover healthcare needs (sometimes known as Payors).

The Healthcare Providers template is a comprehensive data model that addresses the typical data requirements of organizations providing healthcare services.

The R&D and Clinical Trials template is a comprehensive data model that addresses the typical data requirements of organizations involved in research and development and clinical trials of pharmaceutical products and devices.

Read on to learn more about how these templates work and what you can do with them.

Imagining a SaaS Plane for Data Mesh in Azure

Paul Andrew shares some deep thoughts:

For part 7 of this series, I want to explore what else could be delivered in our Azure Data Mesh if we continue our established thinking around the planes of interaction for our data products. As with part 6, we are still missing good Azure Resources that can be deployed for certain situations. However, I want to frontload some concepts now, so we are ready if/when a suitable technical answer arrives in the cloud.

Note that this is all speculative. It’s interesting speculation, though.

Well-Architected Framework for IoT

Ben Brauer announces the Well-Architected Framework for IoT devices on Azure:

The IoT workload guidance outlines the core principles that facilitate a well-architected IoT solution and provides recommendations for each of the 5 pillars of the Well-Architected Framework. This guidance highlights the key considerations and high-level principles for an IoT workload, design considerations to help you enable those principles, and tradeoffs to consider in order to meet your business goals.

Despite its overloaded acronym, I like the Well-Architected Framework as a way of making sure that you are implementing a solution in Azure the right way.

Thinking About Azure Data Platform Security Architecture

Craig Porteous begins a new series:

Reference architectures are great! You’ve got all of the key components in there, nice and clear. Colourful lines showing how data moves through each stage, product, or service. Great for a slide deck or a proposal to get rid of that old creaking data warehouse and into a shiny new Data Lakehouse.

Not so great for the finer details demanded by security operations teams, however.

This promises to be an interesting series.

Making Daily Standups Worthwhile

Amit Nair has some advice:

A Big No to technical discussions

This is the first reason why daily Scrum meetings take more than 15 minutes. We can understand that the Scrum team needs to discuss a technical issue when they face one during the execution of a task. These technical issues need to be reported to the Scrum Master as an obstacle or impediment, to be resolved later on, not while the meeting is in progress.

This is generally sound advice, especially because the idea of a daily standup is to get together, quickly discuss plans and activities relating to the sprint, and make the team aware of blockers. Going beyond that is typically unnecessary. As teams get smaller, you can be a bit more lax with the rules; as you get closer to that 8-10 person team, you have to be pretty ruthless about keeping on time and on topic. Save the more detailed discussions for relevant meetings afterward.
