A classic interview question is: “How do you go about displaying large amounts of data in a performant way?” Most people (at least on the front end), usually come up with pagination first. An implementation for pagination might go something like this:
Out of a list of 100, request 10 items at a time until 100 items are reached. So you would do 9 requests, asking for 1–10, 11–20, etc., until the 100 are reached.
In Kafka’s case, there could be 1 million messages between successive requests, so a user can never see the “latest” message, only the range as requested by the browser. In addition, there is a fundamental problem with pagination as it relates to Kafka. Message ordering across partitions is non-deterministic, so what is displayed in the UI, a linear sequence from 1–100, would not represent the data as it is laid out inside of Kafka.
Very interesting reading.
I couldn’t find numbers from Google but others say that by 2017 Google processed over 20PB a day (not to mention answering 40K search queries/second) so Google is definitely in the big data game. The numbers go down fast after that, even for companies who are really big data companies — Facebook presented back in 2017 that they handle 500TB+ of new data daily, the whole of Twitter’s data as of May 2018 was around 300PB, and Uber reported their data warehouse is in the 100+ PB range.
Ok, but what about the rest of us? Let’s take a look at an example.
I often fight with this myself—SQL Server can easily handle multi-billion row data sets, for example. It’s the same problem in Azure with SQL Data Warehouse: the “you must be this tall to ride the rides” marker is set pretty high.
Power Query (M) made a lot of data transformation activities much easier and value replacement is one of them. You can easily right click on any desired value in Power Query, either in Excel or Power BI, or other components of Power Platform in general, and simply replace that value with any desired alternative. Replacing values based on certain conditions however, may not seem that easy at first. I’ve seen a lot of Power Query (M) developers adding new columns to accomplish that. But adding a new column is not always a good idea, especially when you can do it in a simple single step in Power Query. In this post I show you a quick and easy way to that can help you handling many different value replacement scenarios.
Imagine you have a table like below and you have a requirement to replace the values column [B] with the values of column [C] if the [A] = [B].
Click through for the solution.
I would not mind quoting the Druid documentation for this purpose: “Druid is a data store designed for high-performance slice-and-dice analytics (“OLAP“-style) on large data sets. Druid is most often used as a data store for powering GUI analytical applications, or as a backend for highly-concurrent APIs that need fast aggregations.”
You might be wondering where is “SQL” in that? Actually, the fact is Druid is designed for special kind of SQL workloads which we can relate with powering the GUI analytical applications which require low latency query response. But in this post, we will only look in the “how part” of it using Druid to quickly run queries.
Click through to see how.
When Azure SQL Data Warehouse was chosen to implement a multi-dimensional data warehouse, it may have seemed like the ideal choice. Why? because it was plain to see: keywords: “SQL”, “Warehouse”. However, no, SQL Data Warehouse is ideal only when you have data loads that are quite high, not when it is only several 100GBs. Armed with a few more reasons as to why not (A good reference for choosing Azure SQL Data Warehouse), I had confronted them. But the rebuke then was that they did get good enough performance, and that cost wasn’t a problem. Until of course a few months later when complex queries started hitting the system, and despite being able to afford that cost, the value of paying that amount did not seem worth it.
Having a good architectural understanding of the Azure or AWS platform—even if you aren’t deeply familiar with all of the tools—can help avoid these types of problems.
Containers are perfectly suited for testing, meant to fast deployment of a solution, and can be easily deployed to the cloud. It’s cost effective!
Very important to understand! Containers disk data only exist while the container is running. If the container is removed, that data is gone.
So, you got to find the way to properly configure your container environment to make the data persist on disk.
Click through for an example.
In this post, I show my next step: executing an SSIS package in a container. Spoilers:
1. It’s more work than you think;
2. This is merely one way to do it; and
3. This is not my ultimate goal.
Read on to see how to do this.
First let’s think a little about indexes in general.
An index provides an ordered structure to a set of data. (I could be pedantic and point out that reading through the data in an index from start to end might jump you from page to page in a seeming haphazard way, but still as you’re reading through pages, following the pointers from one page to the next you can be confident the data is ordered. Within each page you might even jump around to read the data in order, but there is a list showing you which parts (slots) of the page should be read in which order. There really is no point in my pedantry except to answer those equally pedantic who will comment if I don’t.)
And this order is according to the key columns – that’s the easy bit that everyone gets. It’s useful not only for being able to avoid re-ordering the data later, but also for being able to quickly locate any particular row or range of rows by those columns.
Rob does a great job of covering some of the nuances of filtered indexes.