Press "Enter" to skip to content

Curated SQL Posts

More R Services Internals Spelunking

Niels Berglund continues his series on R Services internals:

What happens is that straight after the AuthenticateConnection you will hit the WriteAsync breakpoint twice, the same way as we see in Code Snippet 4. The first hit at WriteAsync sort of makes sense, as it ties up with the call-stack we see in Code Snippet 5. But what about the second WriteAsync (the one that causes the second package to be sent), where does that come from? To try to figure that out, we start with the call-stack for that particular WriteAsync.

Execute the code in Code Snippet 1 again, and continue to the second WriteAsync. When you hit the breakpoint do a kc again. The call-stack should now look somewhat like this (this time it is the full call-stack):

It’s interesting the kind of stuff you can find with Wireshark and a debugger.

Comments closed

External Memory Pressure With SQL On Linux

Anthony Nocentino explains how SQL Server on Linux reacts to memory pressure:

We can use tools like ps, top and htop to look our are virtual and physical memory allocations. We can also look in the /proc virtual file system for our process and look at the status file. In here we’ll find the point in time status of a process, and most importantly the types of memory allocations for a process. We’ll get granular data on the virtual memory allocations and also the resident set size of the process. Here are the interesting values in the status file we’re going to focus on today.

  • VmSize – total current virtual address space of the process

  • VmRSS – total amount of physical memory currently allocated to the process

  • VmSwap – total amount of virtual memory currently paged out to the swap file (disk)

The differences are going to be interesting for people to troubleshoot later, particularly if you look at SOS_SCHEDULER_YIELD and give a knee-jerk reaction that the problem is with CPU.

Comments closed

Powershell’s Switch Statement

Shane O’Neill shows us Powershell’s switch statement:

We send out notification emails if jobs fail as well but hey, we’re DBAs, we like a backup!

Since this used to be done manually, and since I like PowerShell, I created separate functions that can take individual log files and parse out the needed data in a nice, easy format.

My problem, and the reason for this post, was figuring out, based on the name of the file in a directory, which function to call…

Click through to see how to use the switch statement, as well as how to switch on an expression.

Comments closed

Page Ranking With Kafka Streams

Hunter Kelly walks through a page ranking algorithm:

Once you have the adjacency matrix, you perform some straightforward matrix calculations to calculate a vector of Hub scores and a vector of Authority scores as follows:

  • Sum across the columns and normalize, this becomes your Hub vector
  • Multiply the Hub vector element-wise across the adjacency matrix
  • Sum down the rows and normalize, this becomes your Authority vector
  • Multiply the Authority vector element-wise down the the adjacency matrix
  • Repeat

An important thing to note is that the algorithm is iterative: you perform the steps above until  eventually you reach convergence—that is, the vectors stop changing—and you’re done. For our purposes, we just pick a set number of iterations, execute them, and then accept the results from that point.  We’re mostly interested in the top entries, and those tend to stabilize pretty quickly.

This is an architectural-level post, so there’s no code but there is a useful discussion of the algorithm.

Comments closed

Stateful Processing In Spark Streaming

Bill Chambers and Jules Damji look at a couple of stateful scenarios within Spark Streaming:

No streaming events are free of duplicate entries. Dropping duplicate entries in record-at-a-time systems is imperative—and often a cumbersome operation for a couple of reasons. First, you’ll have to process small or large batches of records at time to discard them. Second, some events, because of network high latencies, may arrive out-of-order or late, which may force you to reiterate or repeat the process. How do you account for that?

Structured Streaming, which ensures exactly once-semantics, can drop duplicate messages as they come in based on arbitrary keys. To deduplicate data, Spark will maintain a number of user-specified keys and ensure that duplicates, when encountered, are discarded.

Just as other stateful processing APIs in Structured Streaming are bounded by declaring watermarking for late data semantics, so is dropping duplicates. Without watermarking, the maintained state can grow infinitely over the course of your stream.

In this scenario, you would still want some sort of de-duplication code at the far end of your process if you can never have duplicates come in across the lifetime of the application.  This sounds like it’s more about preventing bursty duplicates from sensors.

Comments closed

Data Lake Zones

Shannon Lowder walks us through a multi-zone approach to storing data in a data lake:

Our first zone is the raw zone.  This zone will serve as the landing point for source files.  Like the extract (or stage) schema in our data warehouse, we want these files to match the source system as close as possible.In the data lake, we actually go one step beyond saying we want the schema of our raw files to match the source system, we also want these files to be immutable.

Immutable means once they are written to the raw folder we shouldn’t be able to modify or delete them.  That way, we can always reconstruct different states from these files without having to retrieve them from the source system.

Worth reading the whole thing.

Comments closed

Multi-Parameter Website Scraping With Power Query

Callum Green shows how to build up a URL based off of multiple parameters, scraping data from a page for each permutation of parameters:

The sections highlighted in red are the parameters and sit in between some of the hard-coded URL text

Code Breakdown:

–          Text = http://www.boxofficemojo.com/monthly/?page=

–          Parameter = [Page]

–          Text = &view=calendargross&yr=

–          Parameter = [Year]

–          Text = &month=

–          Parameter = [Month]

–          Text = &p=.htm

This is a rather clever solution, and if your parameters are functionally dependent (unlike this example, where it was a simple cross join of the three domains), you can still use the solution the same way; you just need to populate your parameter combination table differently.

Comments closed

Comparing Ranking Functions

Doug Kline compares three window functions:  RANK, DENSE_RANK, and ROW_NUMBER:

— so let’s say that we’ve created a contest

— places in the contest (top place, 2nd place, etc.)
— will be determined by the test score

— in other words, we’re not so concerned with the raw score
— but rather, we’re interested in the *relative* score
— and the order in which people appear, based on their score

— we can use the ROW_NUMBER() function to give a
— ‘ranking’ to each record, based on Score

Doug’s post is a video and an extended script so you can follow along.

Comments closed

Row Goals On Nested Loops

Joe Obbish has performed a very interesting investigation of how row goals work with nested loop joins and the TOP operator:

This does not happen. The cost remains the same as before: 0.294842 units. This is because the scan is costed according to density instead of by looking at the histogram of the outer table. The following query with a local variable repeated five times also has a cost of 0.294842 optimizer units:

DECLARE @var BIGINT = 1;
SELECT *
FROM (
VALUES (@var), (@var), (@var), (@var), (@var)
) s (ID)
WHERE NOT EXISTS
(
	SELECT 1
	FROM dbo.BIG_HEAP b
	WHERE s.ID = b.ID
)
OPTION (NO_PERFORMANCE_SPOOL);

The problem with using density instead of looking at the data in the outer table is mostly apparent when the outer table contains rows without a match in the inner table.

It’s a great bit of investigative legwork and Joe has a Connect item he’d like you to upvote.

Comments closed