
Category: Architecture

Cloud Risk: Service Obsolescence

Joy George Kunjikkur takes us through a risk scenario using an example of the Azure chat bot service:

At the beginning of last year, we started to develop a chat bot demo. The idea was to integrate the chat bot into one of the big applications as a replacement for the FAQ. Users could ask the bot questions, thus avoiding obvious support tickets in the future.

Things went well. The demo was well received and we started moving to production. About halfway there, things started turning south. The demo chat bot application used Bot SDK V3. It had voice recognition enabled, which allowed users to talk to it and get the response back in voice; during the demo, we used the Azure Bing Speech API. But later, before production, we got notice that the service was obsolete and would be retired in mid-2019. Another surprise was the introduction of Bot SDK V4, which is entirely different from Bot SDK V3, something like AngularJS versus Angular.

The major services tend to give you some time to switch over—in this case, they had 10 months to make a move. But when dealing with online services versus locally installed products, there’s always a risk that the service you’re calling won’t be there, and depending upon how critical that service is, it can have a major effect on your ability to function if it disappears one day. That’s definitely not a reason to ignore these services; it’s a reason to have a backup plan in place.
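
One form that backup plan can take is in the code itself. Here is a minimal Python sketch (every name in it is hypothetical, not from Joy's post) of putting speech providers behind a single interface so a retired service can be swapped out without touching the rest of the application:

```python
from typing import Callable

# A provider is just a callable from audio bytes to transcribed text; the
# application codes against this signature rather than any vendor's SDK.
SpeechToText = Callable[[bytes], str]

def transcribe_with_fallback(audio: bytes, providers: list[SpeechToText]) -> str:
    """Try each provider in order, falling through when one fails."""
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider(audio)
        except Exception as err:  # retired, throttled, or simply down
            last_error = err
    raise RuntimeError("All speech providers failed") from last_error

# Hypothetical providers: a primary cloud API and a local fallback.
def primary_cloud_api(audio: bytes) -> str:
    raise ConnectionError("service retired")

def local_fallback(audio: bytes) -> str:
    return "(transcribed locally)"

print(transcribe_with_fallback(b"...", [primary_cloud_api, local_fallback]))
```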


Review: dbForge Studio For Database Modeling

Randolph West is looking for a product for database modeling and tries out dbForge Studio:

These days I still design new databases from scratch with pen and paper (or iPad and Apple Pencil), where the entity relationship diagram (ERD) is rudimentary and the crow's feet relationships are badly scrawled. But it got me wondering which database modelling tools are on the market today (commercial and free).

My ideal tool should be able to design a new database from scratch and generate creation scripts in T-SQL without falling over common issues like referential integrity and dependencies. More importantly, though, it should be able to reverse-engineer a database (like Microsoft Visio used to be able to). This is extremely useful for consulting engagements when I need to get a picture in my head of the database I'm looking at. It is the one place where I've used the Database Designer in SSMS more than I had initially remembered.

Randolph also mentions SQL Database Modeler, which I used on a consulting engagement where I wanted to replicate Visio's database reverse-engineering functionality.
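
If all you need from that reverse-engineering step is the list of relationships, you can get surprisingly far by querying the catalog directly. A rough Python sketch (the connection string and database are hypothetical) that pulls parent/child table pairs from the standard INFORMATION_SCHEMA views:

```python
import pyodbc  # assumes a SQL Server ODBC driver is installed

# Hypothetical connection details; adjust server and database names.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorks;Trusted_Connection=yes;",
    autocommit=True,
)

# Each foreign key names its own (child) constraint and the unique/primary
# key constraint it references; joining twice recovers the table pairs that
# an ERD tool would draw as relationships.
query = """
SELECT fk.TABLE_NAME AS child_table,
       pk.TABLE_NAME AS parent_table,
       rc.CONSTRAINT_NAME
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS fk
  ON rc.CONSTRAINT_NAME = fk.CONSTRAINT_NAME
JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS pk
  ON rc.UNIQUE_CONSTRAINT_NAME = pk.CONSTRAINT_NAME
"""
for child, parent, name in conn.cursor().execute(query):
    print(f"{child} -> {parent}  ({name})")
```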


The Forgotten Infrastructure Below Azure BI Architecture Diagrams

Meagan Longoria reminds us that there are several products which Azure BI projects need but which we tend to forget when building architectural diagrams:

Let’s start with Azure Active Directory (AAD). In order to provision the resources in the diagram, your Azure subscription must already be associated with an Active Directory. AAD is Microsoft’s cloud-based identity and access management service. Members of an organization have a user account that can sign in to various services. AAD is used to access Office 365, Power BI, and Dynamics 365, as well as the Azure portal. It can also be used to grant access and permissions to specific Azure resources.

Meagan has several of these, so check it out.
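
For a concrete taste of AAD from the code side, here is a minimal sketch using Microsoft's azure-identity Python library, which walks a chain of credential sources and hands back a token usable against Azure resource APIs:

```python
from azure.identity import DefaultAzureCredential  # pip install azure-identity

# DefaultAzureCredential tries a chain of AAD credential sources in turn:
# environment variables, managed identity, developer tool logins, etc.
credential = DefaultAzureCredential()

# Request an access token scoped to the Azure Resource Manager API; the
# same credential object can be passed straight into azure-mgmt-* clients.
token = credential.get_token("https://management.azure.com/.default")
print("token acquired, expires at", token.expires_on)
```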


Data Transformation Tools In The Azure Space

James Serra gives us an overview of the major tools you would use for ETL and ELT in Azure:

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e., cleaned and joined together – the "T" in ETL). Since the data lake is just storage (i.e., Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I'll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part).

The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.
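
Whichever compute product you pick, the shape of the "T" is similar. Here is a minimal PySpark sketch (the lake paths and column names are hypothetical) of cleaning and joining two datasets sitting in an ADLS Gen2 data lake:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-transform").getOrCreate()

# Hypothetical ADLS Gen2 paths; the lake is just storage, so any of the
# compute options could run a transformation shaped like this one.
raw = spark.read.parquet("abfss://raw@mylake.dfs.core.windows.net/orders/")
customers = spark.read.parquet("abfss://raw@mylake.dfs.core.windows.net/customers/")

cleaned = (
    raw.dropDuplicates(["order_id"])                    # clean: deduplicate
       .filter(F.col("order_total") > 0)                # clean: drop bad rows
       .join(customers, on="customer_id", how="inner")  # join: the "T" in ETL
)

cleaned.write.mode("overwrite").parquet(
    "abfss://curated@mylake.dfs.core.windows.net/orders_enriched/"
)
```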


Design Tips For Scaling Systems

Erik Darling has a few ideas for how you can design that SQL Server instance and database for future growth:

I can’t begin to tell you how many terrible things you can avoid by starting your apps out using an optimistic isolation level. Read queries and write queries can magically exist together, at the expense of some tempdb.
Yes, that means you can’t leave transactions open for a very long time, but hey, you shouldn’t do that anyway.
Yes, that means you’ll suffer a bit more if you perform large modifications, but you should be batching them anyway.

Optimistic concurrency is huge—definitely worth the top slot in Erik’s list.
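
For a new application, turning it on is a single statement per database. A minimal sketch, run here through pyodbc with hypothetical server and database names:

```python
import pyodbc

# ALTER DATABASE cannot run inside a transaction, so connect with autocommit.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)

# Read Committed Snapshot Isolation: readers see row versions (kept in
# tempdb) instead of blocking on writers' locks. WITH ROLLBACK IMMEDIATE
# kicks out open transactions so the setting can take effect.
conn.cursor().execute(
    "ALTER DATABASE MyApp SET READ_COMMITTED_SNAPSHOT ON "
    "WITH ROLLBACK IMMEDIATE;"
)
```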


Using Databricks Delta In Lieu Of Lambda Architecture

Jose Mendes contrasts the Lambda architecture with the Databricks Delta architecture and gives us a quick example of using Databricks Delta:

The major problem with the Lambda architecture is that we have to build two separate pipelines, which can be very complex and, ultimately, make it difficult to combine the processing of batch and real-time data. However, it is now possible to overcome this limitation if we are able to change our approach.
Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks File System (DBFS). It is a single data management tool that combines the scale of a data lake, the reliability and performance of a data warehouse, and the low latency of streaming in a single system. The core abstraction of Databricks Delta is an optimized Spark table that stores data as parquet files in DBFS and maintains a transaction log that tracks changes to the table.

It’s an interesting contrast and I recommend reading the whole thing.
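
To make the contrast concrete, here is a minimal PySpark sketch (intended for a Databricks notebook; the table path is hypothetical) showing the property that collapses the two Lambda pipelines: batch and streaming reads target the same Delta table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Writing with format("delta") stores Parquet files plus a transaction log
# in DBFS that tracks every change to the table.
events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/delta/events")

# A batch read and a streaming read of the very same table: one storage
# layer serving both halves of what Lambda split into two pipelines.
batch_df = spark.read.format("delta").load("/delta/events")
stream_df = spark.readStream.format("delta").load("/delta/events")
```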


An Overview Of Apache Kafka

Leona Zhang has a series going on Apache Kafka.  Part one covers some of the concepts around messaging systems:

There is a difference between batch processing applications and stream processing applications. The most significant difference is whether the data has a visible boundary: if it does, the processing is called batch processing. For example, a client collects data once every hour, sends it to the server for statistics, and then saves the statistical results in the statistics database.
If the boundary doesn't exist, the processing is called stream processing. Here is an example: logs and orders are generated continuously on a large website, just like a data flow. If each log or order is processed within several hundred milliseconds to several seconds of its generation, the application is a streaming application. If logs and orders are instead collected once every hour and then transmitted in one batch, the original streaming data has been converted into batch data.
Occasionally, stream processing becomes imperative. For example, suppose Jack Ma wanted to display the orders and sales on Tmall for November 11 on a large screen. If the data center worked in T+1 mode and could only obtain the data for November 11 on November 12, Jack Ma would not be happy.

Part two is an overview of the architectural components used in Kafka:

Kafka uses the group concept to integrate the producer/consumer and publisher/subscriber models.
One topic may have multiple groups, and one group may include multiple consumers. Within a group, only one consumer can consume a given message; across groups, consumers follow the publisher/subscriber model, so every group receives each message.
Note: a partition is allocated to only one consumer within a group. If there are three partitions and four consumers in one group, one consumer is redundant and cannot receive any data.

This looks to be the start to a good series.
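
The group behavior is easy to see with the kafka-python client. A minimal sketch, with hypothetical topic, broker, and group names: run two copies with the same group_id and they split the topic's partitions between them; change the group_id and the new consumer receives the full stream again:

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Consumers sharing a group_id divide partitions among themselves (each
# message goes to one of them); a different group_id gets its own full copy.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",          # change to "analytics" for a second group
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.partition, message.offset, message.value)
```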


NoSQL? No! MoSQL

Steve Jones points out a bit of a shift at Google:

Google is doing more SQL, or at least shifting towards relational SQL databases as a way of storing data. At least, some of their engineers see this as a better way to store data for many problems. Since I’m a relational database advocate, I found this to be interesting.
When Google first started to publish information on BigTable and other new ways of dealing with large amounts of data, I felt that these weren’t solutions I’d use or problems that many people had. The idea of MapReduce is interesting and certainly applicable to the problem space Google had of a global database of sites, but that’s not a problem I’ve ever encountered. Instead, most of the struggles I’ve had with relational systems are still better addressed in a relational system.

Read the whole thing. Note that this is slightly different from Feasel’s Law, as Steve is focusing more on the consistency side of things than on the interface.


Dealing With System Sprawl

Charity Majors has a simple (but not easy) solution to system sprawl:

Stop me if you’ve heard this one before.

The company is growing like crazy, your engineering team keeps rising to the challenge, and you are ferociously proud of them.  But some cracks are beginning to show, and frankly you’re a little worried.  You have always advocated for engineers to have broad latitude in technical decisions, including choosing languages and tools.  This autonomy and culture of ownership is part of how you have successfully hired and retained top talent despite the siren song of the Faceboogles.

But recently you saw something terrifying that you cannot unsee: your company is using all the languages, all the environments, all the databases, all the build tools.  Shit!!!  Your ops team is in full revolt and you can’t really blame them.  It’s grown into an unsupportable nightmare and something MUST be done, but you don’t know what or how — let alone how to solve it while retaining the autonomy and personal agency that you all value so highly.

I hear a version of this everywhere I’ve gone for the past year or two.  It’s crazy how often.  I’ve been meaning to write my answer up for ages, and here it (finally) is.

I like the solution:  embrace the sprawl but make the default a stable set of well-supported systems with reasons for people to want to start there.  Read the whole thing.


Monitoring At Stack Overflow

Nick Craver has been driven around the bend by monitoring, and we get to enjoy the fruits of it:

…but evidently some people think of other things. Those people are obviously wrong, but let’s continue. When I’m not a walking zombie after reading a 10,000 word blog post some idiot wrote, I see monitoring as the process of keeping an eye on your stuff, like a security guard sitting at a desk full of cameras somewhere. Sometimes they fall asleep–that’s monitoring going down. Sometimes they’re distracted with a doughnut delivery–that’s an upgrade outage. Sometimes the camera is on a loop–I don’t know where I was going with that one, but someone’s probably robbing you. And then you have the fire alarm. You don’t need a human to trigger that. The same applies when a door gets opened, maybe that’s wired to a siren. Or maybe it’s not. Or maybe the siren broke in 1984.

I know what you’re thinking: Nick, what the hell? My point is only that monitoring any application isn’t that much different from monitoring anything else. Some things you can automate. Some things you can’t. Some things have thresholds for which alarms are valid. Sometimes you’ll get those thresholds wrong (especially on holidays). And sometimes, when setting up further automation isn’t quite worth it, you just make using human eyes easier.

This is a really good post covering monitoring techniques at a high level and getting into specific implementations at Stack Overflow.
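
As a toy illustration of the thresholds point (a sketch of the general idea, not anything from Stack Overflow's actual stack), alerting off recent history rather than a single hard-coded number is one way to reduce the holiday problem Nick mentions:

```python
import statistics

def should_alert(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Alert when the current reading is an outlier versus recent history.

    A static threshold suits fire-alarm cases; for noisier metrics,
    comparing against a rolling baseline avoids hard-coding a number
    that will be wrong on holidays.
    """
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return stdev > 0 and abs(current - mean) > sigmas * stdev

# Example: a sudden spike against a calm baseline trips the alert.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
print(should_alert(baseline, 180))  # True
print(should_alert(baseline, 101))  # False
```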
