Curated SQL Posts

Throwing Hardware At The Problem

Erik Darling says get more RAM:

I’m not saying you need a 1:1 relationship between data and memory all the time, but if you’re not caching the stuff users are, you know, using, in an efficient way, you may wanna think about your strategy here.

  • Option 1: Buy some more RAM
  • Option 2: Buy an all flash array

You’ll still need to blow some development time on tuning queries and indexes, but hardware can usually bridge the gap if things are already critical.

Looking at hardware is a reasonable approach.  The best bet is to satisfy the most pressing need at the margin.  Sometimes that means more (or better) hardware, sometimes it means tuning queries, and sometimes it means application-level changes to retrieve data differently.

Group Workspaces

Ginger Grant shows how to use Group Workspaces within Power BI to allow a team to work together on the same files without stepping on each other’s feet:

Much of the functionality people associate with source control programs lives inside the group OneDrive which is created for Power BI. Looking at the picture of the group screen, which was created when a Power BI Workspace was created, you will see that this group contains seven members and four files. The members of this group are the only ones who have access to the files. The file AcmeThree.pbix is selected, and clicking on the ellipsis (…) brings up a menu for the file. Notice that one of the options is Check Out. If I check out a file, the icon next to the name changes, providing a visual cue to all who wish to edit the file that it is being worked on. The menu option for me would change to Check In, providing the ability to check the file in to the directory, allowing others to check out the file and work on it. Notice Version History also exists. This feature allows previous versions of the file to be loaded, which means that changes made to a file can be rolled back.

It’s good that this is available, and I’d make use of it.  For Power BI Desktop, it seems prudent to continue using source control.

Cache Eviction Policies

Dan Luu has a great article from a couple years ago on when a random cache eviction policy might be preferable to Least Recently Used:

Once upon a time, my computer architecture professor mentioned that using a random eviction policy for caches really isn’t so bad. That random eviction isn’t bad can be surprising – if your cache fills up and you have to get rid of something, choosing the least recently used (LRU) is an obvious choice, since you’re more likely to use something if you’ve used it recently. If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it’s going to cause a miss every time if the loop doesn’t fit. A random eviction policy degrades gracefully as the loop gets too big.

In practice, on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?

Here are the relative miss rates we get for SPEC CPU with a Sandy Bridge-like cache (8-way associative, 64k, 256k, and 2MB L1, L2, and L3 caches, respectively). These are ratios (algorithm miss rate : random miss rate); lower is better. Each cache uses the same policy at all levels of the cache.

Dan writes at a depth I appreciate and on topics I often don’t understand (particularly when he gets into CPU engineering details).
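
For intuition, here is a small, self-contained Python simulation comparing LRU, pure random, and the "two random choices" policy Dan describes, on a looping access pattern slightly larger than the cache. It is a toy model (a single fully associative cache, synthetic workload), not a reproduction of his SPEC results.

    import random

    def simulate(policy, capacity, accesses):
        """Return the miss count for a fully associative cache of the given capacity."""
        cache = {}          # key -> timestamp of last use
        misses = 0
        for t, key in enumerate(accesses):
            if key in cache:
                cache[key] = t
                continue
            misses += 1
            if len(cache) >= capacity:
                if policy == "lru":
                    victim = min(cache, key=cache.get)
                elif policy == "random":
                    victim = random.choice(list(cache))
                else:  # "2-random": evict the least recently used of two random picks
                    a, b = random.sample(list(cache), 2)
                    victim = a if cache[a] <= cache[b] else b
                del cache[victim]
            cache[key] = t
        return misses

    # A loop over 12 distinct keys against an 8-entry cache: LRU misses every time,
    # while random and 2-random degrade more gracefully.
    workload = list(range(12)) * 200
    for policy in ("lru", "random", "2-random"):
        print(policy, simulate(policy, 8, workload))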

MapReduce

I talk about Hadoop a good bit on Curated SQL.  Therefore, I think it’s worth mentioning the original MapReduce paper that Jeffrey Dean and Sanjay Ghemawat published in 2004:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

If you’ve never read this paper before, today might be a good day to do so.
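
To make the programming model concrete, here is a single-process toy version of the word-count example from the paper. Real MapReduce distributes the map and reduce calls across machines and handles partitioning and failures; this sketch only shows the shape of the two user-supplied functions.

    from collections import defaultdict

    # map: (document name, contents) -> stream of (word, 1) pairs
    def map_fn(key, value):
        for word in value.split():
            yield (word, 1)

    # reduce: (word, all intermediate counts for that word) -> (word, total)
    def reduce_fn(key, values):
        yield (key, sum(values))

    def mapreduce(inputs, map_fn, reduce_fn):
        intermediate = defaultdict(list)    # the "shuffle": group values by key
        for k, v in inputs:
            for ik, iv in map_fn(k, v):
                intermediate[ik].append(iv)
        results = []
        for ik, ivs in intermediate.items():
            results.extend(reduce_fn(ik, ivs))
        return results

    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog saw the fox")]
    print(sorted(mapreduce(docs, map_fn, reduce_fn)))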

Arrays And Lists In SQL Server

Erland Sommarskog has updated his essay on Arrays and Lists in SQL Server.  He’s broken it down into a few parts.  First, the short version:

Now you know why IN (@list) does not work as you hoped for, but if you have a comma-separated list you still need to know how to work with it.

The best approach in my opinion is to reconsider having a comma-separated list at all. After all, you are in a relational database, so why not use a table instead? That is, you should pass the data in a table-valued parameter (TVP) instead of that comma-separated list. If you have never used TVPs before, I have an article, Using Table-Valued Parameters in SQL Server and .NET, where I give a tutorial on passing TVPs from .NET to SQL Server, and there is a detailed description of exactly this case of passing a comma-separated list to a TVP. You will find that it is astonishingly simple.

Unfortunately, not all environments support TVPs – Entity Framework has no real support for TVPs, and reportedly neither does Reporting Services. The same applies if you are on SQL 2005 or earlier, since TVPs were added in SQL 2008. Or you may just be plain stubborn and want to use your comma-separated list. Or you are simply pressed for time, and don’t have the time to learn something new right now.
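
Erland's examples are in .NET, but to give a flavor of the client side in another language, here is a rough Python/pyodbc sketch. It assumes a user-defined table type dbo.IntList and a stored procedure dbo.GetOrdersByID already exist on the server (both names are invented here); pyodbc accepts a table-valued parameter as a list of row tuples.

    import pyodbc

    conn = pyodbc.connect("DSN=MySqlServer;Trusted_Connection=yes")  # hypothetical DSN
    cursor = conn.cursor()

    # Instead of splicing "1,2,3" into the SQL text, pass the values as the rows
    # of a table-valued parameter (backed by a table type such as dbo.IntList).
    order_ids = [(1,), (2,), (3,)]   # one tuple per row of the TVP
    cursor.execute("EXEC dbo.GetOrdersByID @OrderIDs = ?", [order_ids])
    for row in cursor.fetchall():
        print(row)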

If you want a longer article on using table-valued parameters, Erland has one of those as well:

This is an article that is intended to get you started with passing table-valued parameters (TVPs) from .NET to SQL Server. I begin with presenting how you use table-valued parameters in SQL Server itself, whereupon I give a quick overview of the mechanisms to pass TVPs from ADO .NET to SQL Server.

The main meat of this article is two real-world examples where I use TVPs. The first example is the classical problem of passing a comma-separated list of values to SQL Server, this time through a table-valued parameter. You will be amazed at how simple it is. In the second example I show two ways to load a file with master-detail data into tables in SQL Server. In addition to the examples, there is also some discussion on how you can improve performance when loading large amounts of data.

Despite the appearance of .NET in the title of this article, there is a final chapter that explores the possibilities in other APIs, of which some do and some do not support TVPs. This includes Entity Framework, which has no support for TVPs. In this chapter I briefly discuss workarounds when TVPs are not available to you.

And for the advanced look at arrays and lists, you have the long-form article:

A problem that has been popular over the years with SQL Server is how to handle a list of values. In the majority of the cases, people have a comma-separated list, because this format is produced by commonly used tools like multi-choice controls in .NET, Reporting Services and other places.

When I say that the problem is popular, I don’t only mean that the questions are commonplace – but so are solutions. You can find no end of blog posts that present string-splitting functions, including performance tests of such functions, and there are functions that are known to be the fastest, and so on.

The aim of this article is two-fold: 1) Give a general discussion of how to design string-splitting functions. 2) Present and discuss each method from the angles I bring up in the general discussion. This includes performance, but not only performance.

Even if you’ve read this article before, it’s worth checking again to refresh your memory and to see his changes.
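
And if TVPs genuinely are not an option (the Entity Framework and Reporting Services cases mentioned above), one common client-side workaround — not one of the T-SQL splitting methods Erland surveys — is to split the list in application code and send one parameter per value. Below is a rough pyodbc sketch with invented table and column names; keep in mind SQL Server allows at most around 2,100 parameters per request, so this only suits short lists.

    import pyodbc

    conn = pyodbc.connect("DSN=MySqlServer;Trusted_Connection=yes")  # hypothetical DSN
    cursor = conn.cursor()

    raw = "1,2,3,5,8"                        # the comma-separated list you were handed
    ids = [int(x) for x in raw.split(",")]   # split at the client, not in T-SQL

    # One placeholder per value keeps the query fully parameterized.
    placeholders = ",".join("?" * len(ids))
    sql = "SELECT OrderID, OrderDate FROM dbo.Orders WHERE OrderID IN (%s)" % placeholders
    cursor.execute(sql, ids)
    for row in cursor.fetchall():
        print(row)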

Labor Day

Today is Labor Day in the United States.  Because most Curated SQL readers have the day off, I’m going to link to some longer-form and more timeless material.

Data Quality

Milind Paradkar discusses clean data:

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 3:30 pm IST, thus comprising 375 minutes. For 14 trading sessions, we should have 5250 data points for each of these stocks. We wrote a simple script in R to perform the check.

I like this post because it exposes a data quality issue people don’t tend to think about very often:  when all of the data is legitimate and correctly structured, but there are gaps in the available data set.  This is one of many data quality problems you’ll run into, so it may be important to have a plan in place in case you hit this scenario.
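
The original check was written in R; a rough pandas analogue of the same idea might look like the sketch below. The column names and session times are assumptions — the point is simply to compare each symbol's bars for each trading day against the full set of expected 1-minute timestamps.

    import pandas as pd

    def missing_bars(df, session_start="09:15", session_end="15:29"):
        """Count missing 1-minute bars per symbol per trading day.

        Assumes df has a 'symbol' column and a datetime64 'timestamp' column
        with one row per 1-minute bar.
        """
        results = []
        for (symbol, day), bars in df.groupby(["symbol", df["timestamp"].dt.date]):
            expected = pd.date_range(f"{day} {session_start}",
                                     f"{day} {session_end}", freq="1min")  # 375 bars
            missing = expected.difference(bars["timestamp"])
            results.append((symbol, day, len(missing)))
        return pd.DataFrame(results, columns=["symbol", "day", "missing_minutes"])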

LIME

William Vorhies discusses a new technical paper on Local Interpretable Model-Agnostic Explanations:

What the model actually used for classification were these: ‘posting’, ‘host’, ‘NNTP’, ‘EDU’, ‘have’, ‘there’.  These are meaningless artifacts that appear in both the training and test sets and have nothing to do with the topic except that, for example, the word “posting” (part of the email header) appears in 21.6% of the examples in the training set but only two times in the class “Christianity.”

Is this model going to generalize?  Absolutely not.

An Example from Image Processing

In this example, using Google’s Inception NN on arbitrary images, the objective was to correctly classify “tree frogs”.  The classifier was correct in about 54% of cases but also interpreted the image as a pool table (7%) and a balloon (5%).

Looks like an interesting paper.  Click through for the link.
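
The quoted example is the paper's 20 Newsgroups experiment (atheism vs. Christianity). For the curious, the authors also released a lime Python package; a sketch of using it against a scikit-learn text classifier might look like this — treat the package details as illustrative rather than a recipe.

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from lime.lime_text import LimeTextExplainer

    cats = ["alt.atheism", "soc.religion.christian"]
    train = fetch_20newsgroups(subset="train", categories=cats)

    # A deliberately simple bag-of-words classifier
    pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
    pipe.fit(train.data, train.target)

    # Ask LIME which words drove the prediction for one document
    explainer = LimeTextExplainer(class_names=cats)
    exp = explainer.explain_instance(train.data[0], pipe.predict_proba, num_features=6)
    print(exp.as_list())   # header artifacts like 'posting' or 'NNTP' may dominate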

The Spark Ecosystem

Frank Evans gives an overview of what the Apache Spark ecosystem looks like:

The built-in machine learning library in Spark is broken into two parts: MLlib and KeystoneML.

  • MLlib: This is the principal library for machine learning tasks. It includes both algorithms and specialized data structures. Machine learning algorithms for clustering, regression, classification, and collaborative filtering are available. Data structures such as sparse and dense matrices and vectors, as well as supervised learning structures that act like vectors but distinguish the features of the data set from its labels, are also available. This makes feeding data into a machine learning algorithm incredibly straightforward and does not require writing a bunch of code to denote how the algorithm should organize the data inside itself.

  • KeystoneML: Like the oil pipeline it takes its name from, KeystoneML is built to help construct machine learning pipelines. The pipelines help prepare the data for the model, build and iteratively test the model, and tune the parameters of the model to squeeze out the best performance and capability.

Whereas Hadoop’s ecosystem is large and sprawling, the Spark ecosystem tends to be more tightly constrained.  The nice part about Spark is that it plays well with the Hadoop ecosystem—you can have a cluster or architecture with Spark and Hadoop-centric technologies (Storm, Kafka, Hive, Flume, and so on) working together quite nicely.
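
As a small illustration of the MLlib piece described above, here is a minimal PySpark sketch using the RDD-based API, where LabeledPoint is the structure that pairs a label with a feature vector. The data is made up; it only shows the shape of the API.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS

    sc = SparkContext(appName="mllib-sketch")

    # LabeledPoint = a label plus a (dense or sparse) feature vector
    training = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.1, 0.1]),
        LabeledPoint(1.0, [2.0, 1.0, -1.0]),
        LabeledPoint(0.0, [0.5, 0.9, 0.2]),
        LabeledPoint(1.0, [2.1, 1.2, -0.8]),
    ])

    model = LogisticRegressionWithLBFGS.train(training)
    print(model.predict([2.0, 1.0, -1.0]))   # predicted class for a new vector
    sc.stop()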

Spark Notebook Workflows

Dave Wang, Eric Liang, and Maddie Schults introduce Notebook Workflows:

Notebooks are very helpful in building a pipeline even with compiled artifacts. Being able to visualize data and interactively experiment with transformations makes it much easier to write code in small, testable chunks. More importantly, the development of most data pipelines begins with exploration, which is the perfect use case for notebooks. As an example, Yesware regularly uses Databricks Notebooks to prototype new features for their ETL pipeline.

On the flip side, teams also run into problems as they use notebooks to take on more complex data processing tasks:

  • Logic within notebooks becomes harder to organize. Exploratory notebooks start off as simple sequences of Spark commands that run in order. However, it is common to make decisions based on the result of prior steps in a production pipeline – which is often at odds with how notebooks are written during the initial exploration.
  • Notebooks are not modular enough. Teams need the ability to retry only a subset of a data pipeline so that a failure does not require re-running the entire pipeline.

These are the common reasons that teams often re-implement notebook code for production. The re-implementation process is time-consuming, tedious, and negates the interactive properties of notebooks.

Those two reasons are why I’ve argued that you should sit down in front of a REPL and figure out what you’re doing with a particular data set.  Once you’ve got it figured out, perform the operations in a notebook for posterity and to replicate your actions later.  I’m curious to see how this gets adopted in practice.
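
For context, the Notebook Workflows feature the post introduces boils down to dbutils.notebook.run() and dbutils.notebook.exit(). A rough sketch of an orchestration notebook is below; it only runs inside Databricks (dbutils is supplied by the runtime), and the notebook paths, timeout, and parameters are invented for illustration.

    # Runs inside a Databricks "driver" notebook; dbutils comes from the runtime.

    # Run the extract notebook with a one-hour timeout and a date parameter.
    status = dbutils.notebook.run("/pipelines/extract", 3600, {"run_date": "2016-08-30"})

    # The child notebook reports back via dbutils.notebook.exit("OK") (or an error string),
    # which lets the orchestrator branch and retry just the failed stage.
    if status != "OK":
        status = dbutils.notebook.run("/pipelines/extract", 3600, {"run_date": "2016-08-30"})

    if status == "OK":
        dbutils.notebook.run("/pipelines/transform", 3600, {"run_date": "2016-08-30"})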
