Category: Architecture

The Basics of Snowflake Architecture

Arun Sirpal lays out the foundation of Snowflake DB’s architecture:

At the most basic level, Snowflake has three important components: the cloud services layer, the centralised storage layer, and the compute layer.

Cloud services – they call this the “brains” of Snowflake. This is where infrastructure management takes place, where the (cost-based) optimiser lives, and where metadata management and security (authentication and access control) are handled.

Read on to learn about the other two layers and how they meet.
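
To make the three layers a little more concrete, here is a minimal Python sketch using the snowflake-connector-python package; the account, credentials, warehouse, and table names are all placeholders. The point is that a warehouse (compute) is something you create and size independently of the centrally stored data it queries, while the cloud services layer quietly handles authentication, metadata, and query optimization.

```python
# A minimal sketch, assuming the snowflake-connector-python package and
# hypothetical account, credential, and object names.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
cur = conn.cursor()

# Compute is a virtual warehouse you create, resize, and suspend on its own;
# the data it reads lives in the shared, centralised storage layer.
cur.execute(
    "CREATE WAREHOUSE IF NOT EXISTS reporting_wh "
    "WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60"
)
cur.execute("USE WAREHOUSE reporting_wh")

# The cloud services layer handles authentication, metadata, and the
# cost-based optimiser before this query ever touches compute or storage.
cur.execute("SELECT COUNT(*) FROM analytics_db.public.sales")
print(cur.fetchone()[0])

cur.close()
conn.close()
```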

The Importance of a Proper Datamart / Data Warehouse

Teo Lachev explains why you want a datamart (or a data warehouse) for BI solutions:

I sent a proposal for implementing a classic BI solution: Azure SQL-based datamart (not Power BI datamart please), ETL, semantic model, and reports. The client had sticker shock. Return to sender … as other BI companies that quoted can do it for half! Upon digging, it turned out the other companies would build the semantic model (aka Power BI dataset) directly on top of the data source. On a T&M basis, of course, what else? By contrast, I give fixed-price, milestone-driven proposals and I don’t get paid unless I deliver and meet written and agreed-upon success criteria, but that’s a different story.

So, let me count the ways, as the poet would say. It’s certainly technically possible to slap a dataset on top of the data source(s). That’s what self-service BI is all about, right … until it doesn’t serve anymore.

Read on for more detail.

Proofs of Concept and Pilots

Kenneth Fisher strikes a chord:

If your POC does not follow your company’s best practices and standards, then it is not a valid POC.

There are way too many settings that will change its performance, cause security issues, etc. On top of that, almost every POC I’ve ever seen ends up becoming the test environment, if not the actual production environment. So all of those little compromises end up in your actual, non-POC environment because it’s way too much work to fix them now. You should have said something when we set this up.

To use one of my favorite lines, “Short answer: yes with an if; long answer: no with a but.”

Before I get to the answer, I do want to differentiate between a proof of concept and a pilot. The idea of a proof of concept is to see if I can make this thing work. Can I get these two processes to talk to each other? Can I build a website which accepts user input and displays something? Can I get this idea from my head into code? Can I process 500,000 records per second using our existing hardware? One important thing about a proof of concept is that it always has the possibility of failure: “No” is a valid answer here based on the conditions. You also want that answer as fast as is reasonable, so that your business decision-makers can make business decisions based on that information. By contrast, a pilot is a starter for the full project. You might work with one business unit instead of all of them, migrate a small amount of traffic to the new system, or only handle data from a single branch office. When we do a pilot, we already know the answer is yes; we just need to build it out and answer the technical details along the way.

Returning to the line above: Yes, I agree with Kenneth if your company lacks the discipline to differentiate between proofs of concept and pilots (and that’s not as denigrating a remark as it sounds…though it’s somewhat denigrating). No, do not follow the same practices for a proof of concept that you would for a full product, but you need to ensure that code gets destroyed and you start over with new code which does follow those practices.

Have One Data Model per Business Area

James McGillivray offers us an important piece of advice:

I cannot stress this enough. If people are consuming your data in multiple places, the data needs to come from the same data model. That can be an Enterprise Data Warehouse, a Data Mart, a Power BI Model, or any other data source, but at some point you need to be able to track the data back to a single place. If you don’t do this, you will spend THE REST OF YOUR DAYS explaining the differences between the data models to business and customers, and reconciling the differences over and over again.

Read on to learn why this is so important.

Parameterizing Queries with Amazon Athena

Blayze Stefaniak, et al., architect a service to provide data via Amazon Athena:

Customers tell us they are finding new ways to make effective use of their data assets by providing data as a service (DaaS). In this post, we share a sample architecture using parameterized queries applied in the form of a DaaS application. This is helpful for many types of organizations, whether you’re working with an enterprise making data available to other lines of business, a regulator making reports available to your industry, a company monetizing your data assets, an independent software vendor (ISV) enabling your applications’ tenants to query their data when they need it, or trying to share data at scale in other ways.

In DaaS applications, you can provide predefined queries to run against your governed datasets with values your users input. You can expand your DaaS application to break away from monolithic data infrastructure by treating data as a product (DaaP) and providing a distribution of datasets, which have distinct domain-specific data pipelines. You can authorize these datasets to consumers in your DaaS application permissions.

You can use Athena parameterized queries as a way to predefine your queries, which you can use to run queries across your datasets, and serve as a layer of protection for your DaaS applications. This post first describes how parameterized queries work, then applies parameterized queries in the form of a DaaS application.

Click through to learn how.
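
If you just want a feel for the parameterized-query mechanism (as opposed to the full DaaS architecture in the post), here is a rough boto3 sketch under assumed names: the query text is predefined with ? placeholders, and callers only pass in values via ExecutionParameters. The database, workgroup, table, and output bucket are hypothetical.

```python
# A hedged sketch of the parameterized-query piece only, using boto3.
# The database, workgroup, table, and S3 output location are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# The query text is fixed; consumers of the DaaS application only supply
# the values that replace the ? placeholders at execution time.
response = athena.start_query_execution(
    QueryString=(
        "SELECT region, SUM(amount) AS total "
        "FROM daily_sales WHERE sales_year = ? AND sales_month = ? "
        "GROUP BY region"
    ),
    QueryExecutionContext={"Database": "daas_catalog"},
    WorkGroup="daas-consumers",
    ResultConfiguration={"OutputLocation": "s3://example-daas-results/"},
    ExecutionParameters=["2022", "6"],
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then pull the result set.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[:5])
```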

Searching Industry Templates for Lake Databases in Synapse

Lakshmi Murthy is just browsing:

With Azure Synapse Database Templates generally available, our customers are constantly wanting to see and learn more about how to use these templates. Through these blogs we want to share tips and tricks our customers can use to help them utilize these templates in an efficient way. We’ve recently received several questions around the different ways a user can navigate these templates to create their lake databases. In this blog, I’d like to walk through a few options that may come in handy as you give database templates a try.

Azure Synapse Analytics offers a no-code database designer which allows you to browse these database templates and select and customize the tables you want to use to model your enterprise data. There are several ways to browse the tables provided by the comprehensive industry templates within the designer’s exploration experience. Though the user experience is super intuitive, there are a few tips and tricks that can make this process even easier. Let’s do a quick tour to learn about the different ways to browse these templates.

Click through for a few different ways to look at standard tables for different industries.

Guidance on When to Use Azure Data Explorer

Tzvia Gitlin Troyna has a flow chart for us:

Azure Data Explorer is a big data interactive analytics platform that empowers people to make data-driven decisions in a highly agile environment. The factors listed below can help assess whether Azure Data Explorer is a good fit for the workload at hand. These are the key questions to ask yourself.

The following flowchart summarizes the key questions to ask when you’re considering using Azure Data Explorer.

In addition to the flowchart, there is a table of three common interaction patterns that Azure Data Explorer handles well.

Knowledge Graphs, Data Fabrics, and Data Meshes

Alan Morrison describes the history of three concepts:

By 2014, SAP was using “in-memory data fabric” to describe a virtual data warehouse, a key element of its HANA “360-degree customer view” product line. Gartner, for its part, uses the term “data fabric” to this day as an all-encompassing means of heterogeneous data integration. From a 2021 post on data fabric architecture:

Read on for a high-level discussion of what each is and how it fits into the context of data warehouses and data lakes.

Defining Data Products Down

Paul Andrew plays Papers, Please! with data mesh:

The reasons for calling out what should not be declared as a data product can take many different forms, and in some cases even result in anti-patterns that should be addressed as technical debt, etc. Governance and compliance have always been a key part of successful platform delivery, but who/how do we go about policing this, both technically and in the context of people/process? Maybe ‘Data Product Police’ could be a thing!

Read on to see how to spot a data pretender in a field of data products.

Minimum Viable Data Mesh in Azure

Paul Andrew was on a podcast:

For Paul, delivering a single data mesh data product on its own is not all that valuable – if you are going to go to the expense of implementing data mesh, you need to be able to satisfy use cases that cross domains. The greater value is in cross-domain interoperability, getting to a data product that wasn’t possible before. And you need to deliver the data platform alongside those first 2-3 data products; otherwise, you create a very hard-to-support data asset, not really a data product.

When thinking about minimum viable data mesh, Paul views an approach leveraging DevOps and CI/CD – Continuous Integration/Continuous Delivery – as crucial. You need repeatability and reproducibility to really call something a data product.

Click through for the interview as well as Scott Hirleman’s summary.
