Press "Enter" to skip to content

Curated SQL Posts

Using Sqoop’s Logic To Improve Spark JDBC Performance

Avi Yehuda analyzes how Sqoop works to make relational database access from Spark faster:

Sqoop performed so much better almost instantly, all you needed to do is to set the number of mappers according to the size of the data and it was working perfectly.
Since both Spark and Sqoop are based on the Hadoop map-reduce framework, it’s clear that Spark can work at least as good as Sqoop, I only needed to find out how to do it. I decided to look closer at what Sqoop does to see if I can imitate that with Spark.
By turning on the verbose flag of Sqoop, you can get a lot more details. What I found was that Sqoop is splitting the input to the different mappers which makes sense, this is map-reduce after all, Spark does the same thing. But before doing that, Sqoop does something smart that Spark doesn’t do.

Read on to see what in particular Sqoop does, and how you can use that in your Spark code.

Comments closed

Powershell Core Preview 3 Fails To Start

Max Trinidad walks us through an issue with Powershell Core Preview 3:

Just in case you haven’t try to install PowerShell Core Preview.3, in Windows, which became available on the evening of the 10th of December. If are doing a clean installation, meaning that it was previously manually uninstall, or that this is your first installation, then you are fine.
The issue with installing PowerShell Core Preview.3 is when you do an upgrade over a previous version: either Preview.1 or Preview.2.

That’s a common occupational hazard when working with preview tools.  But Max has your solution.

Comments closed

Deploying SQL Server 2019 Big Data Clusters With Kubernetes

Chris Adkin has the start of a new series:

Minikube is a good learning tool and Microsoft provides instructions for deploying a big data cluster to this ‘Platform’. However, its single node nature and the fact that application pods run on the master node means that this does not reflect a cluster that anyone would run in production. Kubernetes-as-a-service is probably by far the easiest option for spinning a cluster up, however it relies on an Aws, Azure or Google Cloud Platform account, hence there is a $ cost associated with this. This leaves a vanilla deployment of Kubernetes on premises. Based on the assumption that most people will have access to Windows server version 2008 or above, a relatively cheap and way of deploying a Kubernetes cluster is via Linux virtual machines running on Hyper-V. This blog post will provide step by step instructions for creating the virtual machines to act as the master and worker nodes in the cluster. 

This is going on my “try this out when I have time” list.  So expect a full report sometime in the year 2023.

Comments closed

Using Hidden Slicers To Control Visible Power BI Slicers

Prathy Kamasani has an interesting problem to solve:

In my current project, one of the user requirement was to have a filter on the Year Slicer. To explain in detail, we have various measures to show metrics for current and earlier measures. For example 2016,2017 and 2018. In 2016, we always have blank values for Last Year metrics, having empty values don’t tell the story well. So to tell the story, we need to pull three years worth of data but display only two years in the Slicer. The easiest way to handle this situation would have had a visual level filter on the Year slicer.
Power BI Slicers doesn’t support Visual Slicers. However, with the help of Selection Pane and Sync slicers, I did a quick workaround.

Read on to see how to use hidden slicers to  control what’s displayed to the user in visible slicers.

Comments closed

Visualizing Merge Joins

Bert Wagner continues his series on physical join operators:

The base algorithm works as follows: SQL Server compares the first rows from both sorted inputs.  It then continues comparing the next rows from the second input as long as the values match the first input’s value.
Once the values no longer match, SQL Server increments the row of whichever input has the smaller value – it then continues performing comparisons and outputting any joined records. (For more detailed information, be sure to check out Craig Freedman’s post on merge joins.)

Beware:  Bert has animations which might cause you to space out for a few hours.

Comments closed

Traversing Nearest Neighbors With Dijkstra’s Algorithm And SQL Server Graph

James Livingston gives us a Powershell-based implementation of Dijkstra’s shortest path algorithm:

In my previous post I shared a SQL Server 2017 graph database of US capitals. Graphs are a computer science core competency and present some interesting challenges for programmers. Most notable of these challenges is finding the shortest path between nodes. Dijkstra’s algorithm is a commonly taught algorithm for finding shortest path. Dijkstra’s is often asked about during entry level developer interviews and it is a great algorithm to implement when learning a new language since it requires utilizing loops, logic, and data structures.
Here’s my implementation of Dijkstra’s algorithm using PowerShell, traversing a graph of US capitals. Rather than manage our own graph nodes and edges, we’ll utilize graph tables and queries in SQL Server. There’s a lot of different ways to implement this in PowerShell, my first cut of this ended up using a hash table so I could perform random access. There’s a give-and-take with custom PowerShell objects, which sacrifice random access for some other benefits.

Click through for the code.

Comments closed

Backups With Checksum Validation

Eric Blinn shows the upside to performing checksum validation during backups:

Since a full backup reads every data page it makes sense that we can ask SQL Server to calculate and compare each checksum during this operation. Even a differential backup reads all of the pages for any extent (a group of 8 data pages) if any one page in it changes so it will validate some pages that didn’t change.

Read on for a demonstration.

Comments closed

KSQL Deployment Options

Hojjat Jafarpour shows us two deployment options for Kafka Streams with KSQL:

As I mentioned, we have implemented KSQL on top of the Kafka Streams API. This means that every KSQL query is compiled into a Kafka Streams application. Therefore, KSQL queries follow the same execution model of Kafka Streams applications.
A query can be executed on multiple instances, and each instance will process a portion of the input data from the input topic(s) as well as generate portions of the output data to output topic(s). Based on this execution model and depending on how we want to run our queries, currently, KSQL provides two deployment options.

Read on for those options.

Comments closed

Reviewing Word Associations With R

Julia Silge does some exploratory analysis on the Small World of Words project:

The Small World of Words project focuses on word associations. You can try it out for yourself to see how it works, but the general idea is that the participant is presented with a word (from “telephone” to “journalist” to “yoga”) and is then asked to give their immediate association with that word. The project has collected more than 15 million responses to date, and is still collecting data. You can check out some pre-built visualizations the researchers have put together to explore the dataset, or you can download the data for yourself.

It’s an interesting analysis of the data set, mixed in with some good R code.

Comments closed