Press "Enter" to skip to content

Curated SQL Posts

Tuning a Query Searching for a Substring in Text

Eddy Djaja gives us two methods for improving performance of a search for a fixed substring:

The reason substring function is used because the column ACCOUNTDISPLYVALUE has multiple values combined in one column. In this case, the query is searching for the Account Number which is the first six characters. The long running query is listed below:

set statistics io on
go
select sum(ACCOUNTINGCURRENCYAMOUNT)from [d365].[GeneralJournalAccountMultiCompanyEntries]where substring([ACCOUNTDISPLAYVALUE], 1, 6)  = '877601'

Eddy gives us two solutions. As a quick note, these solutions work because the query is looking for a specific stretch of characters after a specific starting point. For arbitrary text, things get a little trickier.

Comments closed

An Example of Complex CSV Rule Parsing with Power Query

Cedric Charlier shows off some of the benefit of Power Query with a fairly complicated set of rules:

At the beginning, some of us thought that it would be easy to fix these issues by returning to the data quality team and ask them to fix these issues but it was not so easy. Identifing the rules needing a fix would be huge task (the CSV files are not created if the test is successful, maling it impossible to address this issue in one run and other impediments). I took the decision to go over this issue with the implementation of the following heuristic:

– if the CSV has a column DateTime then we’ll use it
– if the header is empty or no column is named DateTime then use the first column
– if the content of the selected column is not a date then try to parse it as the inner content of a JSON element.

Read on to see how.

Comments closed

Characterizing and Optimizing a Serverless Workload

Adrian Colyer reviews an interesting paper:

Today’s paper analyses serverless workloads on Azure (the characterisation of those workloads is interesting in its own right), where users want fast function start times (avoiding cold starts), and the cloud provider wants to minimise resources consumed (costs). With fine-grained, usage based billing, resources used to keep function execution environments warm eat directly into margins. The authors demonstrate a policy combining keep-alive times for function runtimes with pre-warming, that dominates the currently popular strategy of simply keeping a function execution environment around for a period of time after it has been used, in the hope that it will be re-used. Using this policy, users see much fewer cold starts, while the cloud provider uses fewer resources. It’s the difference between the red (state-of-the-practice) and green (this paper) policies in the chart below. Win-win!

Very interesting.

Comments closed

Correlation in easystats

The easystats team announces a new R package:

The easystats project continues to grow with its more recent addition, a package devoted to correlations. Check-out its webpage here!

It’s lightweight, easy to use, and allows for the computation of many different kinds of correlations, such as partial correlations, Bayesian correlations, multilevel correlations, polychoric correlations, biweightpercentage bend or Sheperd’s Pi correlations (types of robust correlation), distance correlation (a type of non-linear correlation) and more, also allowing for combinations between them (for instance, Bayesian partial multilevel correlation).

I’d recommend reading the examples on the GitHub repo due to formatting. Looks quite interesting. H/T R-Bloggers.

Comments closed

Visualizing “Check All that Apply” Options

Stephanie Evergreen shows a couple of ways to visualize multi-select results:

Which means a bar chart, ordered greatest to least, is your alternative. But that can have many variations.

In this example, created by Dr. Sheila B. Robinson, she used 100% stacked bars for each survey item, to indicate that each item could have totaled 100% if all respondents checked that box. This is a nice way to show that, while the response options as a whole can’t add to 100%, each option on its own CAN. Plus, look at the cute icons.

Click through for several alternatives depending upon the story you’re trying to tell.

Comments closed

Connecting to Cosmos DB from the Gremlin Console

Hasan Savran shows how to connect to Cosmos DB’s Gremlin API via the Gremlin Console:

Graph Databases have been popular lately. You can use Azure Cosmos DB as your  Graph database source by selecting Gremlin API. Gremlin programing language is developed by Apache TinkerPop of the Apache Software Foundation. I will show you how to connect to Cosmos DB Gremlin API from TinkerPop Gremlin console.

     You can download the latest version of Gremlin console from here. The latest version was 3.4.6 when I wrote this post. I was able to connect to the Cosmos DB by using the versions 3.4.3 and 3.4.6.  You can run the console from Linux or Windows, I will focus on the Windows version here but Linux version should work the same way. You must have Java SDK 8 to run this console. Latest version of Java SDK does not work with this console.

There are a couple of configuration steps, but nothing crazy.

Comments closed

Trimming Characters with Power Query

Imke Feldmann shows off an optional parameter with Text.TrimStart() and Text.TrimEnd() in Power Query:

When cleaning dirty data, you might have used the Trim-function (Text.TrimStart or Text.TrimEnd) to delete leading or trailing whitespace from your strings. However, did you know that you could use these functions as well to delete any other characters as well from the start or end of a string? Trimming text with custom characters is pretty straightforward:

Interesting that it does not take a regular expression but rather a list of characters.

Comments closed

Tips and Traps with PowerShell 7

Jeffrey Hicks takes us through some of the tricky parts of migrating to Powershell 7:

A long established community best practice in PowerShell scripting is not using command and parameter aliases. In a cross-platform world, this is even more critical. You may have been in the habit of using Sort in your code in place of Sort-Object. I know I have. I didn’t mind bending the no alias rule abit because there was nothing cryptic about Sort.

But in the Linux world, sort is a native command. There is no PowerShell alias. If your code uses sort, on Linux it will call the native command which will most likely break your code.

Read on for several more hints.

Comments closed

Evaluating Regression Models in Azure ML

Dan Fitton continues a series on model evaluation with Azure Machine Learning:

The initial go-to metric for understanding a regression model is the R squared (or R2) value, also known as the coefficient of determination. R squared measures how well the model is fitted to the data – the goodness of fit. It indicates how much of the variation of y (the target) is explained by the variation in x (the features).

The measures are bog standard if you’ve worked with regressions before, and Dan does a good job explaining them.

Comments closed