Press "Enter" to skip to content

Curated SQL Posts

Using map And flatMap In Scala

Shubham Verma explains the map and flatMap functions in Scala:

Consider two sets, A = {-2, -1, 0, 1, 2} and B = {0.5, 1, 1.5, 2.5, 4, 4.5, 5, 5.5} and a function          f: A => B

y = x ^ 2 + 0.5;  x is an element from set A and y corresponds to an element from set B, now we see that function f is applied to every element of set A but the result could be a subset of set B also.

So from the above text, we can draw the analogy that sets A and B can be seen as any collection in programming paradigm. Now what is “f”, so “f” could be seen as a function that takes an element from A and returns an element that exists in B, the point here to note is that, as scala promotes immutability whenever we apply map (or any other transformer) on some collection of type A, it returns a new collection of the same type with elements of type B. It would be helpful to understand it from the snippet below.

val result: List[B] = List[A].map(f: A => B)

So when a map operation is applied on a collection (here a List) of type A, with passing f as its argument it applies that function to every element of List of type A returns a new collection (again a List) of type B.

Read the whole thing.

Comments closed

Smarter Distribution Database Cleanup

Sourabh Agarwal writes about an improvement to the Distribution Cleanup job:

In case of a highly active environment the cleanup job would need to delete a substantially large set of expired transactions and commands. The cleanup job uses a while loop to delete entries from the MSRepl_Commands and MSRepl_Transactions in batches of 2000 and 5000 rows, respectively. Depending on how large the MSRepl_Commands and MSRepl_Transactions Tables are, the cleanup activity can be cumbersome and time consuming, leading to several performance issues like lock blocking and sometimes replication agent failures. These hard-coded batch sizes work well when the size of the replication tables is small, but do not perform well when these replication tables become large, let’s say like 200-300 million rows.

The new implementation of the procedure allows users to parameterize the batch sizes inside the cleanup stored procedure and introduces a new adaptive approach to determine the batch sizes for each iteration. The procedure will, by default, honor any batch sizes that are mentioned explicitly as parameters. If no parameters are supplied, then the procedure will start with the default batch sizes of 2000 and 5000 rows and increase/decrease the batch sizes based on the performance of the previous iteration of the delete operation. If the time taken by the delete query improves by 50% compared to previous execution, the batch size value is increased by 20% up to a max value of 50000 rows per batch, and if the performance of the delete operation decreases by 20% as compared to the previous iteration, the batch size is decreased by 50% up to a default value of 2000/5000 rows per batch for MSReplCommands and MSRepl_Transactions, respectively.

It’s good to see Microsoft making incremental improvements to replication.

Comments closed

Unit Testing ETL Processes

Ed Elliott has a post showing how to perform unit testing on ETL processes:

In tSQLt, we can call tSQLt.FakeTable and then do an insert, if we don’t use tSQLt what do we do? Well, we need to setup the data we want, this could be by using a tool or by writing a load of insert statements. I have seen this done in various ways such as:

  1. Writing manual insert scripts
  2. Using a tool to setup the data
  3. Making use of API’s in the application to setup the data we need
  4. Some wierd and wonderful things that we shouldn’t recommend

Ultimately, for each test that you do you need to know what data you need for it. There isn’t really any way around this, and the sooner you get yourself in the position where you can setup the data you need for a test, the better.

Read the whole thing.

Comments closed

Test Your Restore Process

Clive Strong tells a tale about a mental flub while restoring a backup:

Our automated restore process works really nicely. We take full backups on Saturday and differential backups through the week. We also take log backups through the day, but we were not going to be restoring those for this task. We have a number of internal platforms we restore to in full (or in part following a cut down process) so which gives us good validation of our backup files on a regular basis. We also have regular test restores from tape just for good measure.

However, a while ago I was asked to build a new server and restore the databases up to a specific date. We didn’t need a point in time restore, just to a specific day, so I pulled the full and differentials and wrote the script to do the restore for me. The script restored the full backup and the differential backup for Sunday, Monday, Tuesday, Wednesday and Thursday. I gave it the once over and executed the script. A while later and I came back and it was unexpectedly, still running. I eventually left the office and noted it finished in the early hours and ran for many hours longer than I had anticipated.

Read on for Clive’s more detailed explanation of the whoopsie moment.

Comments closed

Rotating Encryption Keys When Using Always Encrypted

Ed Leighton-Dick explains how to rotate your encryption keys when using Always Encrypted in SQL Server:

In my last post, I talked about the process of rotating your encryption keys. It’s just one of those routine maintenance tasks that need to be done from time to time in order to keep your encryption strong for the long haul. One type of rotation I didn’t address in that post was rotation for Always Encrypted, SQL Server’s newest form of encryption.

If you recall, Always Encrypted has two associated keys: a Column Master Key and a Column Encryption Key. The Column Encryption Key (CEK) is a symmetric key, stored in SQL Server. Like other symmetric keys, the CEK is not changed during a rotation. The Column Master Key (CMK), on the other hand, is a certificate, similar to the certificates we’ve used for transparent data encryption and for in-column encryption, and it therefore needs to be rotated regularly. The biggest difference is that the CMK is stored outside of SQL Server, in the Windows certificate store by default, so DBAs may need assistance from their Windows administrators or security administrators.

Read on to see what you need to do.

Comments closed

Architecting A Power BI Environment

Reza Rad explains different architectural patterns for a Power BI implementation:

Implementing a Power BI solution is not just about developing reports, creating a data model, or using visuals. Power BI, like any other technologies, can be used in a correct, or incorrect way. Any technology can be used more effective if it harnesses the right architecture. A right architecture can be achieved after a requirement gathering and designing aspects and components of the technology to fit the requirement. In this post, you will learn about some of the most common architectures to use Power BI. You will learn about using Power BI in different architecture guidelines;

  • Sharing architecture

  • Self-service architecture

  • Enterprise architecture

Read on to learn more about these three patterns.

Comments closed

Operator Precedence In Complex Check Constraints

Louis Davidson walks us through the case of the disappearing parentheses:

You will see the code is not the same as when I coded it:

Looking as this, one thing stands out to the eye: the parenthesis have been removed from the comparisons, but added to the literal value of 2. (Also, everything is surrounded by square brackets, but as awful as that looks, it is a common thing to see in code that is generated/parsed, such as this or a query plan.) When the person who wrote the similar code saw this, they were concerned it was wrong, then the next person who did a code review was also concerned it was wrong, and when I saw it, I was sure it was right, but only because I had a bit more faith that if this were a problem with SQL Server’s toolset, I would probably have read about it!

Read on as Louis explains the logic.  My preference would be to retain the parentheses to make it easier for humans to follow.

Comments closed

The Basics Of Lambda Calculus

Kevin Sookocheff walks us through some of the basics of Lambda calculus:

Functions are a bit more complicated. Michaelson states that a λ function serves as an abstraction over a λ expression, which isn’t that informative unless we take some time to understand what abstraction actually means.

Programmers use abstraction all the time by generalizing from a specific instance of a problem to a parameterized version of it. Abstraction uses names to refer to concrete objects or values (you can call them parameters if you like), as a means to create generalizations of specific problems. You can then take this abstraction (you can call it a function if you like), and replace the names with concrete objects or values to create a particular concrete instance of the problem. Readers familiar with refactoring can view abstraction as an “Extract Method” refactoring that turns a fragment of code into a method with parameters that explain the purpose of the method.

I think having a good understanding of Lambda calculus is a huge advantage for a data platform professional, as it gives you an inroad to learning data-centric functional programming languages (e.g., Scala, R, and F#) and neatly sidesteps the impedance mismatch problem with object-oriented languages.

Comments closed

Findings From The Field: Reviewing Monitored Servers

Brent Ozar shares some findings from his SQL ConstantCare service:

Let’s look at a sample day from last week when 562 servers sent in data (out of the users who agreed to share their data to improve recommendations):

  • No one was using the -x startup flag

  • No one had known-dangerous third party modules installed

  • No one was running a 32-bit SQL Server

  • No one had high memory usage for in-memory OLTP (Hekaton)

  • No one’s plan cache had been erased recently (aside from server restarts)

  • No one needed to restart to fix a TempDB file configuration mismatch in the DMVs

  • Everyone on 2016 Standard Edition had patched to SP1 to get the free features

Read on to see the rules that customers are struggling with the most, as well as some thoughts on a scenario where a broken rule might not be that important.

Comments closed

Finding Invoices Containing A Product

Matt Allington shows how to use Power Query to get details on a top-level item (including all child items) which contains a particular child item:

As you can see in the image below, when a slicer is added to the product table and then you select an item in the slicer, the invoice detail table ONLY shows that single selected product.  But the requirement it to see ALL the items on ALL the invoices that contain that selected product.

Read on, as the solution definitely isn’t trivial.

Comments closed