
Curated SQL Posts

Quick Geospatial Data Plots In R And Python

Harry McLellan shows us how we can use R and Python to generate quick-and-dirty plots of geospatial data:

Now R has some useful packages like ggmap, mapdata and ggplot2 which allow you to source your map satellite images directly from Google Maps, but this does require a free Google API key to source from the cloud. These packages can also plot the map around the data, as I am currently trimming the map to fit the data. But for a fair test I also used a simplistic pre-built map in R. This was from the package rworldmap, which allows plotting at a country level with defined borders. Axes can be scaled to act like a zoom function, but without a higher-resolution map or raster satellite image map it is pointless to go past a country level.

There’s a lot more you can do with both languages, but when you just want a plot in a few lines of code, both are up to the task.
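If you want to try the Python side of this yourself, here is a minimal sketch using geopandas and matplotlib; the CSV of points, its lon/lat column names, and the world shapefile path are all assumptions, not files from the original post:

```python
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical point data with longitude/latitude columns.
df = pd.read_csv("points.csv")
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat), crs="EPSG:4326")

# Country-level basemap, analogous to rworldmap in R; the shapefile path is an assumption.
world = gpd.read_file("world_countries.shp")

ax = world.plot(color="lightgrey", edgecolor="white", figsize=(8, 5))
gdf.plot(ax=ax, color="red", markersize=10)

# Trim the map to fit the data (plus a small margin) rather than showing the whole world.
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx - 1, maxx + 1)
ax.set_ylim(miny - 1, maxy + 1)
plt.show()
```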

Using K-Means Clustering For Anomaly Detection

Anais Dotis-Georgiou gives us an interesting use case of using k-means clustering along with InfluxDB (a time-series database) to detect anomalies in EKG data:

If you read Part Two, then you know these are the steps I used for anomaly detection with K-means:

  1. Segmentation – the process of splitting your time series data into small segments with a horizontal translation.

  2. Windowing – the action of multiplying your segmented data by a windowing function to truncate the dataset before and after the window. The term windowing gets its name from its functionality: it allows you to only see the data in the window range since everything before and after (or outside the window) is multiplied by zero. Windowing allows you to seamlessly stitch your reconstructed data together.

  3. Clustering – the task of grouping similar windowed segments and finding the centroids in the clusters. A centroid is at the center of a cluster. Mathematically, it is defined by the arithmetic mean position of all the points in the cluster.

  4. Reconstruction – the process of rebuilding your time series data. Essentially, you are matching your normal time series data to the closest centroid (the predicted centroid) and stitching those centroids together to produce the reconstructed data.

  5. Normal Error – The purpose of the Reconstruction is to calculate the normal error associated with the output of your time series prediction.

  6. Anomaly Detection – Since you know what the normal error for reconstruction is, you can now use it as a threshold for anomaly detection. Any reconstruction error above that normal error can be considered an anomaly.

Read the whole thing.  This is a really cool use case of a set of technologies along with a venerable (if sometimes troublesome) algorithm.
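To make those six steps concrete, here is a minimal Python sketch with NumPy and scikit-learn; the synthetic signal, window length, overlap, and cluster count are illustrative assumptions rather than the parameters used in the article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical quasi-periodic signal standing in for the EKG data pulled from InfluxDB.
signal = np.sin(np.linspace(0, 40 * np.pi, 4000)) + 0.05 * np.random.randn(4000)

window, step = 64, 32      # segment length and horizontal translation (50% overlap)
hann = np.hanning(window)  # windowing function so reconstructed segments stitch together

# 1-2. Segmentation and windowing.
segments = np.array([signal[i:i + window] * hann
                     for i in range(0, len(signal) - window, step)])

# 3. Clustering: the centroids summarize the "normal" windowed shapes.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(segments)

# 4. Reconstruction: overlap-add the nearest (predicted) centroid for each segment.
recon = np.zeros_like(signal)
for idx, i in enumerate(range(0, len(signal) - window, step)):
    recon[i:i + window] += kmeans.cluster_centers_[kmeans.labels_[idx]]

# 5-6. The normal reconstruction error becomes the anomaly threshold.
covered = (len(segments) - 1) * step + window
error = np.abs(signal[:covered] - recon[:covered])
threshold = error.mean() + 3 * error.std()
anomalies = np.where(error > threshold)[0]
```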

Running Windows On Kubernetes

Chris Adkin shows us how you can run the Windows OS on Kubernetes (with some limitations):

With the caveat that this information is correct at the time of writing, the following points should be noted:

  • The control plane and worker nodes should always be on the same release of Kubernetes.
  • The control plane can only run on Linux.
  • Windows Server 2016 RTM is the minimum version required for worker nodes, but version 1709 is preferred.

A full list of restrictions can be found here.
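As an aside, if you want to verify the first two points against a live mixed cluster, a small sketch with the official Kubernetes Python client should do it (it assumes a working kubeconfig for the cluster):

```python
from kubernetes import client, config

# List each node's operating system and kubelet version, so you can confirm that
# Windows workers and the Linux control plane are on the same Kubernetes release.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    info = node.status.node_info
    print(f"{node.metadata.name}: {info.operating_system}, kubelet {info.kubelet_version}")
```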

But there is something particularly significant for anyone wishing to deploy highly available SQL Server infrastructures to Kubernetes via availability groups.

Read on for that particularly significant limitation; it’s a doozy.

Using Power BI Dataflows To Create Common Data Sets

Alexander Arvidsson shares an interesting use case for Power BI Dataflows:

There are several more use cases for a dataflow, but one that is very useful is the ability to share a dataset between apps. Previously we had to duplicate the dataset to each and every app that needed to use it, increasing the risk that one dataset was ignored, not refreshed properly or otherwise out of sync with reality. By using dataflows we can have several apps rely on the same dataflow (via a dataset), and thus it is quite possible to have a “master dataset”.

Click through for a walkthrough, as well as an understanding of the process’s limitations.

No Laptop For You: A Case Of Machine Learning Failure

Ginger Grant walks us through a scenario where Lenovo refused to sell her a laptop four times:

Buying a laptop from Lenovo reminded me of an episode of Seinfeld when Elaine was trying to buy soup.  For some unknown reason, when I placed an order on their website and gave them my money, Lenovo gave me a Cancellation Notice, the email equivalent of “No Soup for you!”  After placing an order, about 15 minutes later, I received a cancellation notice.  I called customer service.  They looked at the order and advised me the system incorrectly cancelled the order.  I was told to place the order again as they had resolved the problem.  I created a new order, and just like the last time, I received the No Laptop for You cancellation email.  I called back. This time I was told that the system thinks I am a fraud. Now I have no laptop and I have been insulted.

In all the talk of ML running the future, one thing that gets forgotten is that models, being simplifications of reality, necessarily make mistakes.  Failing to have some sort of manual override means, in this case, throwing away money for no good reason.

Join Estimation: Details On A Cardinality Estimator Change

Paul White shares some detail on a change to the Cardinality Estimator in SQL Server 2014:

The question is now how to perform a coarse estimation of the equijoin cardinality of the highlighted steps, using the information available.

The original cardinality estimator would have performed a fine-grained step-by-step histogram alignment using linear interpolation, assessed the join contribution of each step (much as we did for the minimum step value before), and summed each step contribution to acquire a full join estimate. While this procedure makes a lot of intuitive sense, practical experience was that this fine-grained approach added computational overhead and could produce results of variable quality.

The original estimator had another way to estimate join cardinality when histogram information was either not available, or heuristically assessed to be inferior. This is known as a frequency-based estimation[…]
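For a feel of what a coarse, frequency-based estimate looks like, here is the textbook independence-assumption calculation; this illustrates the general idea only and is not a claim about SQL Server's exact arithmetic:

```python
# Classic frequency-based equijoin estimate: assume a join selectivity of
# 1 / max(distinct values on either side). Illustrative only.
def coarse_join_estimate(rows_a: float, distinct_a: float,
                         rows_b: float, distinct_b: float) -> float:
    return rows_a * rows_b / max(distinct_a, distinct_b)

# e.g. 100,000 rows joined to 50,000 rows, with 2,000 vs 1,500 distinct join values
print(coarse_join_estimate(100_000, 2_000, 50_000, 1_500))  # 2,500,000.0
```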

It’s an interesting dive into one of the changes in 2014’s CE.  The post is a little math-heavy but Paul does a great job keeping it interesting.

SQL-to-Excel: A Tool For Automating Exports To Excel

Dave Mason has written a command line tool for Excel-based productivity:

In 2018, I’ve found myself frequently running a series of static SQL Server queries and copying/pasting the output of each query into a separate sheet in a Microsoft Excel Worksheet file. They say if you perform a manual task X number of times or more, you should automate it. Great advice, right? There are a number of ways to export SQL Server query results to Excel, and an internet search readily turns up many options. I didn’t find any that suited my needs, so I wrote my own utility, uncleverly named “SQL to Excel”.

SQL to Excel is a modest command line utility. It iterates through a folder of *.sql script files, executing each script, capturing the script query output, and writing the results to a Microsoft Excel Worksheet file (one worksheet per script file). And yes, it’s an actual Excel file, not a .csv file. The C# code for SQL to Excel is available as a GitHub repository.

Click through for more details.  This sounds a bit more robust than building Integration Services packages to do the same.
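If you'd rather prototype the same idea in Python than use the C# utility, a rough sketch with pandas and pyodbc might look like this; the connection string, script folder, and the openpyxl dependency for Excel output are all assumptions:

```python
import glob
import os
import pandas as pd
import pyodbc

# Hypothetical connection; adjust driver, server, and database for your environment.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=localhost;"
    "Database=master;Trusted_Connection=yes;"
)

with pd.ExcelWriter("results.xlsx") as writer:       # requires openpyxl
    for path in sorted(glob.glob("scripts/*.sql")):
        with open(path) as f:
            query = f.read()
        # One worksheet per script file; Excel caps sheet names at 31 characters.
        sheet = os.path.splitext(os.path.basename(path))[0][:31]
        pd.read_sql(query, conn).to_excel(writer, sheet_name=sheet, index=False)
```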

Corrupting Databases For Fun And Profit

Eric Blinn has started a new series on database corruption.  In part one, he shows us how to corrupt a database (probably by letting it listen to Alice Cooper and Frank Zappa):

I’m going to start a series of posts regarding corruption detection, prevention, and correction.  In order to detect corruption we must first have a corrupt database.  This post is going to show how to purposely corrupt a database.  I’ll refer back to this post often as it will be the basis of most of the future posts in this series which will pick up where this leaves off.

Step 1.  Create a new database.  You didn’t think I was going to ask you to corrupt a real database, did you?  We will create one table with some names in it.  The table is abnormally wide on purpose.  The goal is to get fewer rows on any one page of data so that we can get several pages of data without needing many hundreds of rows.

Part two explains the concept of page verification:

Page verification is a process in SQL Server where the engine writes extra data to the header of the page while writing it to disk.  This extra data can be used to verify that the data later read from that page is what was expected.  There are 3 options for the setting.  They are NONE, CHECKSUM, and TORN_PAGE_DETECTION.  The torn page option is deprecated.  It should not be used and will not be covered in this series.

When set to CHECKSUM, SQL Server will quickly determine a checksum for the page of data and write it to the header of the data page.  Any time from that point forward, when the page is read from disk, SQL Server will perform the same checksum calculation and compare that to the stored value in the header of the page.  If the values match, that would indicate there is probably no corruption.  If the values do not match, that would indicate there is almost certainly some kind of corruption.
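To make the write-a-checksum, verify-on-read idea concrete, here is a toy illustration in Python; it uses CRC32 purely for demonstration and is in no way SQL Server's actual page checksum algorithm:

```python
import zlib

def write_page(data: bytes) -> dict:
    # Store a checksum alongside the data, mimicking the checksum written to the page header.
    return {"checksum": zlib.crc32(data), "data": bytearray(data)}

def read_page(page: dict) -> bytes:
    # Recompute the checksum on read and compare it to the stored value.
    if zlib.crc32(bytes(page["data"])) != page["checksum"]:
        raise IOError("Checksum mismatch: page is likely corrupt")
    return bytes(page["data"])

page = write_page(b"some row data")
page["data"][3] ^= 0xFF   # simulate a corrupted byte "on disk"
read_page(page)           # raises IOError: the corruption is detected on read
```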

Corruption is one of the scariest things that can happen to a database; knowing how to respond to an incident of corruption is critical even as the incidence of corruption is quite low in practice (unless you’re running on awful hardware).

Using Kafka To Drive ML Predictions

Kai Waehner shows us a model architecture for using Apache Kafka to generate predictions from trained models:

Kafka applications are event-based, and leverage stream processing to continuously process input data. If you’re using Kafka, then you can embed an analytic model natively in a Kafka Streams or KSQL application. There are various examples of Kafka Streams microservices embedding models built with TensorFlow, H2O or Deeplearning4j natively.

It is not always possible or feasible to embed analytic models directly due to architectural, security or organizational reasons. You can also choose to use RPC to perform model inference from your Kafka application (bearing in mind the pros and cons discussed above). You can visit my project for an example of gRPC integration between a Kafka Streams microservice and a locally hosted TensorFlow Serving container for making predictions with a hosted TensorFlow model.

There are a couple separate and interesting patterns here.
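As an analogous pattern in plain Python (rather than the Kafka Streams and KSQL examples from the post), here is a minimal sketch using kafka-python and a pickled scikit-learn model; the topic names, broker address, model file, and message schema are all assumptions:

```python
import json
import pickle

from kafka import KafkaConsumer, KafkaProducer

# Load a model trained offline and embed it directly in the streaming application.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

consumer = KafkaConsumer("sensor-readings",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# Score each incoming event and publish the prediction to another topic.
for msg in consumer:
    features = [msg.value["features"]]                 # assumes a list of numeric features
    prediction = float(model.predict(features)[0])
    producer.send("predictions", {"id": msg.value["id"], "prediction": prediction})
```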

Strategies For Dealing With Failed Projects

Edwin Thoen gives us a few tips for dealing with failing data science projects:

At the beginning of a project, the levels of enthusiasm and optimism are always at their peak, especially in data science projects. Isn’t data the new oil? This is the time we are finally going to dig into that well and leverage our data in unprecedented ways! No setbacks have been experienced yet. There is only one road ahead and it will lead us to success. Probably at this stage you, the data scientist, are already well aware of a number of project risks. You might want to keep these concerns to yourself, as you don’t want to come across as negative, or worse, someone who is not up to the job ahead. Please don’t: if you foresee possible problems at this stage and don’t speak out, they can come back like a boomerang when the problems actually occur. Rather, invite all stakeholders to perform a risk analysis together.

This is good advice and applies outside of data science projects as well.  H/T R-bloggers
