Press "Enter" to skip to content

Category: Internals

Optimizing Kafka Streams Apps

Bill Bejeck and Guozhang Wang give us an idea of some Kafka Streams internals:

At a high level, when you use the Streams DSL, it auto-creates the processor nodes as well as state stores if needed, and connects them to construct the processor topology. To dig a little deeper, let’s take an example and focus on stateful operators in this section.

An important observation regarding the Streams DSL is that most stateful operations are keyed operations (e.g., joins are based on record keys, and aggregations are based on grouped-by keys), and the computation for each key is independent of all the other keys. These computational patterns fall under the term data parallelism in the distributed computing world. The straightforward way to execute data parallelism at scale is to just partition the incoming data streams by key, and work on each partition independently and in parallel. Kafka Streams leans heavily on this technique in order to achieve scalability in a distributed computing environment.

They then use that info to show you how you can make your Streams apps faster.

Reverse Engineering the Key Influencers Visual in Power BI

Chris Webb learns how the Key Influencers Power BI visual works, including some interesting undocumented functions:

A fascinating insight into how Power BI works, but is this any practical use to us? Let me be clear: I don’t think you should be using any of these functions yourself in a real-world report. I’m sure all this would be documented and publicised if Microsoft did want us to use it ourselves! Another consideration is that these new functions return tables and that makes them awkward to use in regular .pbix Power BI reports – I guess we could create calculated tables although that’s not as flexible as returning a table from a query as shown above. That said, even though we can’t write our own DAX queries in regular Power BI reports, we can write our own DAX queries in Paginated Reports and we can now create Paginated Reports that use a Power BI dataset as a data source. I tested putting one of the queries generated by the Key Influencers visual into a Paginated Report connected to the same dataset and it worked ok (even after publishing). You can also embed DAX queries connected to a published dataset in Excel too, as I show here. Hmm, plenty to think about then…

Chalk this up as “fun to know but not recommended to use yourself.”

Understanding the Page Resource Cracker

John Morehouse looks at the new sys.fn_PageResCracker() function in SQL Server 2019:

In a previous blog post, I discussed two new methods in SQL Server 2019 to determine exactly which page a request might be waiting for when there is contention.  One of these new methods involves a new function, fn_pagerescracker.   Naturally, I wanted to see how this function operates.  Let’s look at the Master database to investigate how it works!

Click through for the function definition and what it all means.
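
If you want to see the intended usage pattern, here is a minimal sketch (mine, not John's): crack the page_resource value of any waiting request, then hand the pieces to sys.dm_db_page_info for the page details.

-- Crack page_resource for any waiting request, then pass the result
-- to sys.dm_db_page_info to get the page header details.
SELECT r.session_id,
       r.wait_type,
       pc.db_id,
       pc.file_id,
       pc.page_id,
       pi.page_type_desc
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.fn_PageResCracker(r.page_resource) AS pc
CROSS APPLY sys.dm_db_page_info(pc.db_id, pc.file_id, pc.page_id, 'DETAILED') AS pi
WHERE r.page_resource IS NOT NULL;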

Testing Data Pages in Linux

John Morehouse shows that SQL Server data pages are the same in Windows as they are in Linux:

One of the new phrases coming out of Microsoft is that "SQL is just SQL" regardless of what operating system it resides on. This was echoed during the keynote at SQL Bits 2019 by the Microsoft team, which you can watch here. Later that weekend, I gave a session about database internals. My presentation is about how data is structured within a row and why that matters. Understanding the internals of table structures, even in today's age of technology, including SQL Server 2019 (which will be released in Q3/Q4 of 2019), is important. During my session, a question came up about how a data page is structured if SQL Server is sitting on top of a Linux server, such as Ubuntu. Does the data page have the same size and shape in Linux as it does in Windows?

They do. Click through to see John prove it.
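
If you want to run the comparison yourself, a sketch like this works identically on Windows and Linux; the database, table, and page id are placeholders for your own environment. Dump just the page header (print option 0) on each platform and compare fields such as m_type and m_slotCnt.

-- Placeholders throughout: swap in your own database, table, and page id.
DBCC TRACEON(3604);                              -- route DBCC PAGE output to the client
DBCC IND('SomeDatabase', 'dbo.SomeTable', -1);   -- list the pages belonging to the table
DBCC PAGE('SomeDatabase', 1, 312, 0);            -- header only: database, file id, page id, print option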

Looking At Compressed Pages

Jess Pomfret shows us what compressed data looks like in SQL Server:

We first need to switch on trace flag 3604: this will write the output of our DBCC PAGE command to the messages tab instead of the event log.

There are 4 parameters for DBCC PAGE: we will need to pass in the database name (or id), the file number, the page id and the print option. Using a print option of 0 will give us just the page header. In these examples I’m going to use option 3 which gives us more details on the rows stored on the page. For more information on using DBCC PAGE I’d recommend Paul Randal’s post “How to use DBCC PAGE”.

This kind of investigation lets you see how compression really works.
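
Here is a minimal sketch of the commands Jess describes; the database name, file id, and page id are placeholders for your own environment.

DBCC TRACEON(3604);                    -- write DBCC PAGE output to the messages tab
DBCC PAGE('SomeDatabase', 1, 424, 3);  -- database, file id, page id, print option 3 for row-level detail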

When A Procedure Has Multiple Plan Cache Entries

Arthur Daniels shows that multi-statement stored procedures can have multiple entries in the plan cache:

So we have two entries for this stored procedure. I included the statement sql handle to show that each statement handle has its own text. Let’s parse that text to see each statement. I copied the parsing SQL from this Plan Cache article.

This is a good thing to keep in mind if you’re trying to figure out how often a procedure gets called: SUM on the execution counts grouped only by text might not give you the results you expect.
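
As a sketch of that idea, here is the widely shared offset-math pattern (not necessarily Arthur's exact query; dbo.SomeProcedure is a placeholder). Each row in sys.dm_exec_query_stats is a statement, so execution counts need to be read per statement rather than per procedure.

-- One row per cached statement, with the statement text carved out of the
-- procedure text using the start/end offsets (offsets are in bytes of
-- nvarchar data, hence the divide-by-two).
SELECT qs.plan_handle,
       qs.execution_count,
       SUBSTRING(st.text,
                 (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.objectid = OBJECT_ID('dbo.SomeProcedure');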

Computer Internals and the Buffer Pool

Randolph West starts a new series on the buffer pool in SQL Server:

Now that we’ve reminded ourselves of those fundamentals, let’s take a closer look at the buffer pool.

The buffer pool in SQL Server resides in the computer’s main memory (RAM). When the database engine requests a data page for reading or writing, it is assumed to be in the buffer pool. The buffer pool itself controls access between RAM and storage. If the data page that the database engine requests is not in RAM, a request is sent to the storage engine to retrieve that page. This may be storage directly attached to the system, or via a network interface card.

This first post is a high-level overview, but it looks like there’s a lot more in store from Randolph.
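
As a small supplement (mine, not Randolph's), you can see what is sitting in the buffer pool right now, since sys.dm_os_buffer_descriptors exposes one row per cached 8 KB page:

-- How much of each database is currently cached in the buffer pool.
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) AS cached_pages,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_pages DESC;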

The Bitmap Operator

Hugo Kornelis describes the Bitmap operator:

The Bitmap operator is used to build a bitmap that, based on a hash, represents which values may be present in a data flow. Due to the chance of hash collisions in the hash function used, the Bitmap process can produce false positives but not false negatives – so a match based on a bitmap is not guaranteed to be a match to the actual data, but a non-match based on a bitmap is guaranteed to not be a match in the actual data.
The generated bitmap is typically used in other operators to remove rows for which there is no match in the bitmap, and hence guaranteed no match in the original set of data processed by the Bitmap operator. The use of Bitmap operators is most common in execution plans for star join queries in large data warehouses. An example can be seen here.

Click through for details on how it works, along with plenty more good information.
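
If you want to catch one in the wild, the classic trigger is a parallel hash join in a star-join-shaped query. The tables below are hypothetical; on a large enough fact table, you would expect a Bitmap built from the dimension (build) side to filter fact rows early in the plan.

-- Hypothetical star-join shape that commonly produces a Bitmap operator
-- when the plan goes parallel with a hash join.
SELECT d.CalendarYear,
       SUM(f.SalesAmount) AS total_sales
FROM dbo.FactSales AS f
    INNER JOIN dbo.DimDate AS d
        ON f.DateKey = d.DateKey
WHERE d.CalendarYear = 2019
GROUP BY d.CalendarYear;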

Visualizing Merge Joins

Bert Wagner continues his series on physical join operators:

The base algorithm works as follows: SQL Server compares the first rows from both sorted inputs.  It then continues comparing the next rows from the second input as long as the values match the first input’s value.
Once the values no longer match, SQL Server increments the row of whichever input has the smaller value – it then continues performing comparisons and outputting any joined records. (For more detailed information, be sure to check out Craig Freedman’s post on merge joins.)

Beware:  Bert has animations which might cause you to space out for a few hours.
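
If you would rather step through the algorithm in a live plan than watch the animations, a join hint forces the physical operator; the table names here are placeholders.

-- Force a merge join between two placeholder tables. Both inputs must be
-- sorted on the join key, so expect Sort operators in the plan if no
-- supporting indexes exist.
SELECT o.OrderID,
       c.CustomerName
FROM dbo.Orders AS o
    INNER MERGE JOIN dbo.Customers AS c
        ON o.CustomerID = c.CustomerID;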

Understanding Analysis Services Memory Behavior

Shabnam Watson walks us through SQL Server Analysis Services memory settings and application behavior under memory stress:

If memory consumption is below the Low limit, everything is fine and it is free to stay in memory. Once the consumption passes the Low limit, a cleaner thread wakes up and tries to clean up memory. At this point, the price of memory is no longer zero. It starts at 2 at the Low limit and goes as high as 1000 when memory consumption reaches the Total limit. The higher the memory pressure, the more aggressive the cleaner gets. Once memory consumption reaches the Hard limit, all connections/sessions are closed and queries are cancelled with an out-of-memory error.

This is a thorough explanation with some good demos and terrible queries.  Give it a read.
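
To put rough numbers on it: values of 100 or below for these limits are percentages of physical memory, so on a 128 GB server with the default LowMemoryLimit of 65 and TotalMemoryLimit of 80 (check your own msmdsrv.ini before trusting my arithmetic), the cleaner wakes up at around 83 GB and the price of memory climbs toward 1,000 as consumption approaches roughly 102 GB.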