# Month: October 2019

Here’s where my math nerd-iness comes out… There’s a little trick for “bucketizing” numbers (in this case, turning “Months” into “Month Buckets”):

– Take a number
– Divide it by your bucket size
– Round that number down to a whole number–We’ll call this the “divided number”
– Multiply the “divided number” by the bucket size–This is your bucket floor
– Add one to the “divided number” and then multiply that by your bucket size–This is your bucket ceiling

I was promised that there would be no math.

If you’re the kind of person who cares about various caches on your server, like the buffer pool or the plan cache, then you’d wanna measure something totally different. You’d wanna measure how much free space you have on each page, because having a bunch of empty space on each page means your data will take up more space in memory when you read it in there from disk.

You could do that with the column avg_page_space_used_in_percent.

BUT…

We can select those rows which we want into our real table which will either already have constraints enabled or we can enable the constraints after we load the data. There are a few problems with this approach, the first problem is that this isn’t very easy to debug. When you realise that your load process only loaded half of the expected rows, why didn’t it load the rest? Other problems include, what do we do with any failed rows – i.e. which rows failed? To answer this we need another query that is the reverse of the “get good rows”. Finally, this is quite a lot of SQL for a file with four columns and three rules, what if the rules are more complex and our data file has lots of columns? I tell you what happens if the rules are more complex: “a mess of SQL jank”, that’s what.

Read on to learn about the alternative, which is an interesting approach. This is another case where a good SQL-based solution is almost a pivot of a good imperative solution: instead of thinking in rows, focus on columns first and let the rows sort themselves out.

Borrowing and adapting code from a previous post, Examining the Performance Impact of an Adhoc Workload, we will first create two stored procedures. The first, dbo.RandomSelects, generates and executes an ad hoc statement, and the second, dbo.SPRandomSelects, generates and executes a parameterized query.

Erin then shows how to review query stats and group together executions which are the same save for a change in literals. Read the whole thing.

Now, I have had a few people ask me what a Product Owner actually does. Some say that it sounds like an architect role.

In reality, the role is one that’s mainly related to newer working practices like Scrum.

A Product Owners list of responsibilities include talking to all the stakeholders for you team in the business and organise the priorities on your backlog board.

The concept makes sense, though this is the first time I’ve heard of such a role for a tool the engineers use rather than a product offered for sale.

In the null values post mentioned above, I mentioned that there are differences between SQL Server and Java in how they handle null. So, when we call into Java from SQL Server, we may want to treat null values the same way as we do in SQL Server.

I wrote about this in the SQL Server 2019 Extensibility Framework & Java – Null Values post mentioned above. However, that post was written before SQL Server 2019 CTP 2.5. In CTP 2.5 Microsoft introduced the Java SDK, and certain things changed. Amongst the things that changed is the way we handle nulls when we receive datasets from SQL Server in our Java code.

Read on to learn how it works today.

Linear models are a standard way of predicting or explaining some data. Visualizing data is not only of didactical value but provides heuristical value too, as demonstrated by Anscombe’s Quartet.

Visualizing linear models in 2D is straightforward, but visualizing linear models with more than one predictor is much less so. The aim of this post is to demonstrate some ways do visualize linear models with more than one predictor, using popular R packages. We will focus on 3D examples, that is, two predictors.

I have a strong bias against 3D visuals because they tend to be so difficult to see clearly. There are times when they’re necessary, though.

We can make use of the built-in .join() function for RDDs.  Similar to the .aggregateByKey() function we saw in the previous post, the .join() function for RDDs requires a 2-element tuple, with the first element being the key and the second element being the value.  So, we need to use the .map() function to restructure our RDDs to store the keys in the first element and the original array/tuple in the second element.  After the join, we end up with an awkward nested structure of arrays and tuples that we need to restructure using another .map() function, leading to a lengthy code snippet.

This is a place where DataFrames make so much more sense.

By using the same formula, the estimated number of rows is:

`SELECT 0.001992032*1000000--1992.032000000`
This is exactly what we see in the execution plan. OK, that was CL 140, let’s see how SQL Server 2019 handles this simple case.

When we switch to CL 150, the plan and estimations for the c1 column (non-nullable) are the same. However, the estimation for the nullable column is changed!

`DELETE dbo.ChildTable1 WHERE ParentID < <some constant>; -- This would delete 47 row(s).`
`DELETE dbo.ChildTable2 WHERE ParentID < <some constant>; -- This would delete 14 row(s). ... DELETE dbo.ParentTable WHERE ID < <some constant>; -- This would delete 11 row(s). `