Press "Enter" to skip to content

Category: Architecture

Storing Semi-Additive Facts as Timespans

Timo Zishiri gives a new spin to a common warehousing problem:

In these cases, the measure may be aggregated across dates by averaging over the number of periods, e.g., average daily inventory levels. Measures can also be aggregated across dates by taking the maximum/minimum for the time interval.

More specifically, this blog focuses on an alternative approach to providing end users with the ability to do point-in-time analysis, so-called trend analysis.

Click through to see how a timespan table would work.
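
To make the idea concrete, here is a minimal sketch of what a timespan fact table might look like, with illustrative table and column names (not necessarily the ones Timo uses): each row covers the interval during which the measure held steady, and a point-in-time query picks the row whose interval contains the date of interest.

```sql
-- Hypothetical timespan fact table: one row per interval during which
-- the inventory level was constant, instead of one row per day.
CREATE TABLE dbo.FactInventoryTimespan
(
    ProductKey     int  NOT NULL,
    WarehouseKey   int  NOT NULL,
    QuantityOnHand int  NOT NULL,
    ValidFromDate  date NOT NULL,
    ValidToDate    date NOT NULL   -- exclusive upper bound
);

-- Point-in-time analysis: the inventory level on a given date comes
-- from the row whose interval contains that date.
DECLARE @AsOfDate date = '2022-06-30';

SELECT ProductKey, WarehouseKey, QuantityOnHand
FROM dbo.FactInventoryTimespan
WHERE ValidFromDate <= @AsOfDate
  AND ValidToDate   >  @AsOfDate;
```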


The Importance of Proper Data Modeling in Power BI

Paul Turley avoids “big, wide tables”:

Power BI is architected to consume data in a dimensional model, with narrow fact tables and related dimensions. Introducing a big, wide table in a tabular model is extremely inefficient. It takes up space and memory resources, impacts performance, and complicates measure coding. Flattening records into one wide table is one of the worst things you can do in Power BI and a common mistake made by novice Power BI users.

This is a conversation I’ve had with many customers. We want our cake, and we want to eat it too. We want to have all the analytic capabilities, interactivity, and high performance, but we also want the ability to drill down to a lot of detail. What if we have a legitimate need to report on transaction details and/or a large table with many columns? It is well known that the ideal shape is a star schema, but what if we need to shape data for detail reporting? The answer is that you can have it both ways, just not in one table.

Read on for a better model design (hint: the Kimball style) as well as several tips and tricks.
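
As a rough illustration of the shape Power BI prefers, here is a minimal star-schema sketch; the table and column names are invented for the example. The fact table stays narrow, holding only surrogate keys and numeric measures, while descriptive attributes live in the dimensions.

```sql
CREATE TABLE dbo.DimDate
(
    DateKey      int  PRIMARY KEY,   -- e.g., 20230115
    CalendarDate date NOT NULL
);

CREATE TABLE dbo.DimCustomer
(
    CustomerKey  int IDENTITY PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL,
    Region       nvarchar(50)  NOT NULL
);

-- Narrow fact table: surrogate keys plus measures only.
CREATE TABLE dbo.FactSales
(
    DateKey     int NOT NULL REFERENCES dbo.DimDate (DateKey),
    CustomerKey int NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    Quantity    int NOT NULL,
    SalesAmount decimal(18, 2) NOT NULL
);
```

Detail-level reporting can then live in a separate, larger table related to the same dimensions, which is the “both ways, just not in one table” idea.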


Automating Archive Table Creation

Aaron Bertrand doesn’t want to archive things by himself every month:

Earlier in this series (part 1 | part 2), I wrote at a high level about how to solve issues with ever-growing log tables without large delete operations or data movement to a secondary archive table. In this tip, I’ll share a few code snippets you can use to automate the generation of objects to help make these solutions hands-free.

Read on for the tips.
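
Aaron’s actual snippets are in the tip itself; as a hedged sketch of the general idea, here is one generic way to generate an empty archive table with the same columns as its source via dynamic SQL. The table names are illustrative, and constraints and indexes are deliberately omitted.

```sql
DECLARE @SourceTable  sysname = N'dbo.EventLog',          -- illustrative
        @ArchiveTable sysname = N'dbo.EventLog_Archive',  -- illustrative
        @sql nvarchar(max);

-- SELECT ... INTO with a false predicate copies the column definitions
-- but no rows, producing an empty archive table matching the source.
SET @sql = N'SELECT * INTO ' + @ArchiveTable
         + N' FROM ' + @SourceTable
         + N' WHERE 1 = 0;';

EXEC sys.sp_executesql @sql;
```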


Tips for Large Table Data Archival

Aaron Bertrand follows up on a prior post:

As soon as you realize your growth rates are higher than expected, you need to plan to buy or allocate more disk space. There is no way around this—more data means more disk. You can delay the inevitable for a little bit with better compression, but this is not a long-term fix, and it can impact query performance in different ways (trading CPU for I/O).

Once more disk is in place, you can plan your growth better.

Click through for some guidance on how to plan that growth.
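
One prerequisite for that planning is knowing your actual growth rate. A simple approach, sketched below with an assumed history table rather than anything from Aaron’s post, is to snapshot table sizes on a schedule and trend the deltas between snapshots.

```sql
CREATE TABLE dbo.TableSizeHistory
(
    CaptureDate datetime2(0) NOT NULL DEFAULT SYSUTCDATETIME(),
    SchemaName  sysname      NOT NULL,
    TableName   sysname      NOT NULL,
    RowCnt      bigint       NOT NULL,
    ReservedMB  bigint       NOT NULL
);

-- Run on a schedule (e.g., a daily Agent job); comparing snapshots
-- gives rows and megabytes of growth per day.
INSERT dbo.TableSizeHistory (SchemaName, TableName, RowCnt, ReservedMB)
SELECT s.name,
       t.name,
       SUM(ps.row_count),
       SUM(ps.reserved_page_count) * 8 / 1024   -- 8 KB pages
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables  AS t ON t.object_id = ps.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE ps.index_id IN (0, 1)   -- heap or clustered index only
GROUP BY s.name, t.name;
```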


Archival Tables in SQL Server

Aaron Bertrand starts a new series:

We all have one: the table that grows forever. Maybe it contains chat messages, post comments, or simple web traffic. Eventually, the table gets large enough that it becomes problematic – for example, users will notice that searches or updates take longer and longer as this massive, ever-growing table is scanned.

People often deal with this by archiving older data into a separate table. In this tip series, I’ll describe an archive table, explain why that solution carries its own set of problems, and show other potential ways to deal with data that grows indefinitely.

This is where we say, “Ah, if only Stretch DB had been priced approximately 1/100th of what it really was.” Stretch DB also had its own problems—especially if you ever needed to change the large table’s schema—but stay tuned for Aaron’s answers.
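
For reference, the classic version of the archive-table approach the series starts from looks something like this: one atomic statement that moves aged rows into the archive in batches. The table names and retention window are illustrative, not Aaron’s.

```sql
-- Move rows older than six months, 5,000 at a time; repeat until the
-- statement affects zero rows. dbo.ChatMessages_Archive must already
-- exist with matching columns.
DELETE TOP (5000) src
OUTPUT deleted.*
INTO dbo.ChatMessages_Archive
FROM dbo.ChatMessages AS src
WHERE src.CreatedDate < DATEADD(MONTH, -6, SYSUTCDATETIME());
```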


Software-as-a-Service: Single DB or Per-Client DB

Greg Low makes a choice:

On-premises applications are mostly single-tenant. They support a single organization. We do occasionally see multi-tenant databases. They hold the same types of information for many organizations.

But what about SaaS-based applications? By default, you’ll want to store data for many client organizations. Should you create a large single database that holds data for everyone? Should you create a separate database for each client? Or should you create something in between?

As with most things in computing, there is no one simple answer to this. Here are the main decision points that I look at:

Click through for Greg’s thoughts on the matter. Most of these factors are also relevant for on-premises SQL Server installations, not just Azure SQL DB/Managed Instance.
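
To make the trade-off concrete, here is a hedged sketch of the single-database option in SQL Server terms: every row carries a tenant key, and Row-Level Security keeps tenants from seeing each other’s data. The object names are invented; RLS itself is a real engine feature.

```sql
CREATE TABLE dbo.Orders
(
    OrderId   int IDENTITY PRIMARY KEY,
    TenantId  int  NOT NULL,
    OrderDate date NOT NULL
);
GO

-- Predicate function: a row is visible only when its TenantId matches
-- the tenant recorded in the session context.
CREATE FUNCTION dbo.fn_TenantFilter (@TenantId int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS allowed
    WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS int);
GO

CREATE SECURITY POLICY dbo.TenantIsolation
    ADD FILTER PREDICATE dbo.fn_TenantFilter(TenantId) ON dbo.Orders
    WITH (STATE = ON);
GO

-- The application sets the tenant once per connection:
EXEC sys.sp_set_session_context @key = N'TenantId', @value = 42;
```

The per-client-database option skips all of this at the cost of managing many databases, which is exactly the trade Greg walks through.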


An Introduction to Event Sourcing

Aasif Ali provides a high-level introduction to the concept of event sourcing:

Event sourcing is a way to store data as events in an append-only log. Rather than keeping only the latest version of the entity state, this method stores the state of a database object as a sequence of events: essentially a new event each time the object changed state, from the beginning of the object’s existence. An event can be anything generated by a user: a mouse click, a key press on a keyboard, and so on. It is a great way to atomically update state and publish events. Not only can we query these events, but we can also use the event log to reconstruct past states, and as a foundation to automatically adjust the state to cope with retroactive changes.

Events are immutable, they cannot be changed. This well-known rule of event stores is often the first defining characteristic of event stores and event sourcing.

Read on to see how this concept works and how products like Apache Kafka make event sourcing viable.
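
Kafka is the usual home for this, but the core idea fits in a few lines of illustrative SQL: an append-only event table that is never updated, with current (or historical) state derived by replaying events in order.

```sql
CREATE TABLE dbo.AccountEvents
(
    EventId    bigint IDENTITY PRIMARY KEY,   -- replay order
    AccountId  int            NOT NULL,
    EventType  varchar(20)    NOT NULL,       -- 'Deposit' or 'Withdrawal'
    Amount     decimal(18, 2) NOT NULL,
    OccurredAt datetime2(0)   NOT NULL DEFAULT SYSUTCDATETIME()
);

INSERT dbo.AccountEvents (AccountId, EventType, Amount)
VALUES (1, 'Deposit', 100.00),
       (1, 'Withdrawal', 30.00),
       (1, 'Deposit', 15.00);

-- Current state is a fold over the log; filtering on EventId or
-- OccurredAt reconstructs any past state.
SELECT AccountId,
       SUM(CASE EventType WHEN 'Deposit' THEN Amount
                          ELSE -Amount END) AS Balance
FROM dbo.AccountEvents
WHERE AccountId = 1
GROUP BY AccountId;   -- Balance = 85.00
```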


Designing Event Streams for Kafka

Dave Shook announces a new course:

Properly designing your events and event streams is essential for any event-driven architecture. Precisely how you design and implement them will significantly affect not only what you can do today, but what you can do tomorrow. For such a critical part of any data infrastructure, most event streaming tutorials gloss over event design.

In the new course on Confluent Developer, events and event streams are put front and center. We’re going to look at the dimensions of event and event stream design and how to apply them to real-world problems. But dimensions and theory are nothing without best practices, so we are also going to take a look at these to help keep you clear of pitfalls and set you up for success. This course also includes hands-on exercises, during which you will work through use cases related to the different dimensions of event design and event streaming.

Click through to learn more about what’s in the course and to check it out; it is free, after all.


General Purpose Tier Azure SQL DB Performance

Reitse Eskens continues a series on comparing tiers of Azure SQL Database:

In my previous blog, I wrote about the serverless tier, the one that can go to sleep if you’re not using it for more than one hour (minimum). That tier is cheaper as long as you’re not running it for more than 25% of the time. If you need more time, go provisioned.
Another difference between serverless and provisioned is that the provisioned one gets a set number of cores, whereas the serverless one has a minimum and a maximum number of cores. So this time, the blog is about the provisioned tier, where you choose a fixed number of CPUs at a fixed monthly cost.

Click through for the analysis. I’ll reiterate here that I really hope Reitse has some graphics at the end (or at least tables) which sort of lay out where the boundaries between tiers are and what the performance and cost profiles look like between them.


Architecting a Data Lake

James Serra provides some guidance:

I have had a lot of conversations with customers to help them understand how to design a data lake. I touched on this in my blog Data lake details, but that was written a long time ago so I wanted to update it. I often find customers do not spend enough time in designing a data lake and many times have to go back and redo their design and data lake build-out because they did not think through all their use cases for data. So make sure you think through all the sources of data you will use now and in the future, understanding the size, type, and speed of the data. Then absorb all the information you can find on data lake architecture and choose the appropriate design for your situation.

The concepts are simple but there are some interesting implications to what James includes as well as additional resources, so check it out.
