Press "Enter" to skip to content

Author: Kevin Feasel

Getting Pagination Wrong

Lukas Eder discusses common pagination issues:

If your data source is a SQL database, you might have implemented pagination by using LIMIT .. OFFSET, or OFFSET .. FETCH or some ROWNUM / ROW_NUMBER() filtering (see the jOOQ manual for some syntax comparisons across RDBMS). OFFSET is the right tool to jump to page 317, but remember, no one really wants to jump to that page, and besides, OFFSET just skips a fixed number of rows. If there are new rows in the system between the time page number 316 is displayed to a user and when the user skips to page number 317, the rows will shift, because the offsets will shift. No one wants that either, when they click on “next”.

Instead, you should be using what we refer to as “keyset pagination” (as opposed to “offset pagination”).

He also has a good explanation of the seek method.

I will throw in one jab at Oracle (because hey, it’s been a while since I’ve lobbed a bomb at Oracle on this blog):  it’d really suck to have a system where I legally wasn’t allowed to distribute relevant performance comparison benchmarks.  Fortunately, I tend to work on better data stacks.

Comments closed

Extended Events Audit

Steve Jones creates an audit with Extended Events:

The third part of the invitation was to write this. I covered what I did, and some of what I learned. I’ll add a bit more here.

I certainly was clumsy working with XE, and despite working my way through the course, I realize I have a lot of learning to do in order to become more familiar with how to use XE. While I got a basic session going, depending on when I started it and what I was experimenting with, I sometimes found myself with events that never went away, such as a commit or rollback with no corresponding opening transaction.

This T-SQL Tuesday was a bit broader in scope, so it has been interesting watching people respond.

Comments closed

Trapping HTTP Error Codes In Power BI

Chris Webb shows how to handle specific HTTP error codes when using the Web.Contents() function in M:

This thread on the Power Query forum suggests it’s something to do with lazy evaluation, but I haven’t been able to determine the situations when it does work and when it doesn’t.

Instead, it is possible to handle specific HTTP error codes using the ManualStatusHandling option in Web.Contents()

I guess this beats not being able to handle errors at all, but it seems like a fairly fragile solution if you next want to start handling the entire 500 class of response codes.

Comments closed

Unit Testing Of Spark Streaming

Felipe Fernandez shows how to unit test Spark Streaming:

Controlling the lifecycle of Spark can be cumbersome and tedious. Fortunately, Spark Testing Baseproject offers us Scala Traits that handle those low-level details for us. Streaming has an extra bit of complexity as we need to produce data for ingestion in a timely way. At the same time, Spark internal clock needs to tick in a controlled way if we want to test timed operations as sliding windows.

This is part one of a series.  I’m interesting in seeing where this goes.

Comments closed

Securing Spark Shuffle

Cheng Xu uses Apache Commons Crypto to secure data when Spark shuffles off to disk:

The basic steps can be described as follows:

  1. When a Spark job starts, it will generate encryption keys and store them in the current user’s credentials, which are shared with all executors.

  2. When shuffle happens, the shuffle writer will first compress the plaintext if compression is enabled. Spark will use the randomly generated Initial Vector (IV) and keys obtained from the credentials to encrypt the plaintext by using CryptoOutputStream from Crypto.

  3. CryptoOutputStream will encrypt the shuffle data and write it to the disk as it arrives. The first 16 bytes of the encrypted output file are preserved to store the initial vector.

  4. For the read path, the first 16 bytes are used to initialize the IV, which is provided to CryptoInputStreamalong with the user’s credentials. The decrypted data is then provided to Spark’s shuffle mechanism for further processing.

Once you have things optimized, the performance hit is surprisingly small.

Comments closed

AutoRestart SSAS Extended Events

Bill Anton looks at the AutoRestart option on Extended Events for Analysis Services:

So how do we handle the scenario where the server is rebooted?

  • Option 1: always remember to restart the trace after server reboots
  • Option 2: create a SQL Agent job to poll for the SSAS service status and start the xEvent trace if its not already running
  • Option 3: write a custom .NET watchdog service to poll for the SSAS service status and start the xEvents trace if its not already running

Those are the options I’ve used or seen used in the past… and to be sure, all of them have their drawbacks in reliability and/or complexity.

…which is why I was so excited when it was brought to my attention that there is an “AutoRestart” option for SSAS xEvents!

Do read the whole thing.

Comments closed

Microsoft R Server On Spark

Max Kaznady, et al, discuss using Microsoft R Server on Spark to perform rapid prototyping against the NYC Taxi dataset:

Once the cluster is created, you can connect to the edge node where MRS is already pre-installed by SSHing to r-server.YOURCLUSTERNAME-ssh.azurehdinsight.net with the credentials which you supplied during the cluster creation process. In order to do this in MobaXterm, you can go to Sessions, then New Sessions and then SSH.

The default installation of HDI Spark on Linux cluster does not come with RStudio Server installed on the edge node. RStudio Server is a popular open source integrated development environment (IDE) available for R that provides a browser-based IDE for use by remote clients. This tool allows you to benefit from all the power of R, Spark and Microsoft HDInsight cluster through your browser. In order to install RStudio you can follow the steps detailed in the guide, which reduces to running a script on the edge node.

If you’ve been meaning to get further into Spark & R, this is a great article to follow along with on your own.

Comments closed

Azure SQL Threat Detection

Ron Matchoro discusses use cases for Azure SQL Threat Detection:

Thanks to SQL Threat Detection, we were able to detect and fix code vulnerabilities to SQL injection attacks and prevent potential threats to our database. I was extremely impressed how simple it was to enable threat detection policy using the Azure portal, which required no modifications to our SQL client applications. A while after enabling SQL Threat Detection, we received an email notification about ‘An application error that may indicate a vulnerability to SQL injection attacks’.  The notification provided details of the suspicious activity and recommended concrete actions to further investigate and remediate the threat.  The alert helped me to track down the source my error and pointed me to the Microsoft documentation that thoroughly explained how to fix my code.  As the head of IT for an information technology and services company, I now guide my team to turn on SQL Auditing and Threat Detection on all our projects, because it gives us another layer of protection and is like having a free security expert on our team.”

Anything which helps kill SQL injection for good makes me happy.

Comments closed

Data Cleansing In SQLite

Allison Tharp wants to clear out kinda-sorta duplicates from a SQLite table:

However, now I have a lot of database entries that are unneeded.  I thought I would take the time to clean this up (even though I’ll no longer use the data and could easily just delete the tables).  For the BGG Hotness, I have the tables: hotgame, hotperson, and hotcompany.  I have 7,350 rows in each of those tables, since I collected data on 50 rankings every hour for just over 6 days.  However, since the BGG hotness rankings only update daily, I really only need 300 rows (50 rankings * 6 days = 300 rows).

I know think the rankings update between 3 and 4, so I want to only keep the entries from 4:00 AM.  I use the following SELECT statement to make sure I’m in the ballpark with where the data is that I want to keep:

There are several ways to solve this problem; this one is easy and works.  The syntax won’t work for all database platforms, but does the trick for SQLite.

Comments closed

Considerations For Azure SQL Database

Grant Fritchey discusses whether new database administrators might want to start with Azure SQL Database rather than on-premises SQL Server:

Since you are right at the start of your career, you may as well plan on maximizing the life of the knowledge and skills you’re building. By this, I mean spend your time learning the newest and most advanced software rather than the old approach. Is there still work for people who only know SQL Server 2000? Sure. However, if you’re looking at the future, I strongly advocate for going with online, cloud-based systems. This is because, more and more, you’re going to be working with online, connected, applications. If the app is in the cloud, so should the data be. Azure and the technologies within it are absolutely the cutting edge today. Spending your limited learning time on this technology is an investment in your future.

This answer is a tougher call for me.  Looking at new database developers (or development DBAs or database engineers or whatever…), I think the case is pretty solid:  there’s so much skill overlap that it’s relatively easy to move from Azure SQL Database to on-prem.  With production DBAs, the story’s a little different:  as Grant mentions, this is a Platform as a Service technology, and so the management interface is going to be different.  There are quite a few commonalities (common DMVs, some common functionality), but Grant gives a good example of something which is quite different between the PaaS offering and the on-prem offering:  database backup and restoration.  I think the amount of skills transfer is lower, and so the question becomes whether the marginal value of learning PaaS before IaaS/on-prem is high enough.  Given my (likely biased) discussions of Azure SQL Database implementations at companies, I’d stick with learning on-prem first because you’re much more likely to find a company with an on-prem SQL Server installation than an Azure SQL Database.

Comments closed