Curated SQL – Page 1043 – A Fine Slice Of SQL Server

At the beginning, some of us thought that it would be easy to fix these issues by returning to the data quality team and ask them to fix these issues but it was not so easy. Identifing the rules needing a fix would be huge task (the CSV files are not created if the test is successful, maling it impossible to address this issue in one run and other impediments). I took the decision to go over this issue with the implementation of the following heuristic:
– if the CSV has a column DateTime then we’ll use it
– if the header is empty or no column is named DateTime then use the first column
– if the content of the selected column is not a date then try to parse it as the inner content of a JSON element.

Read on to see how.

Comments closed

Enabling the SQL Server Agent on Big Data Clusters

Published 2020-03-23 by Kevin Feasel

Mohammad Darab shows how you can enable the SQL Server Agent when running a Big Data Cluster:

In order to enable the SQL Agent, Microsoft has provided a workaround. I’m sure in a future release you will have that option, but as of right now there is no other way. The process is straight forward.

Mohammad gives us the steps and shows it to us in video form.

Comments closed

Characterizing and Optimizing a Serverless Workload

Published 2020-03-20 by Kevin Feasel

Adrian Colyer reviews an interesting paper:

Today’s paper analyses serverless workloads on Azure (the characterisation of those workloads is interesting in its own right), where users want fast function start times (avoiding cold starts), and the cloud provider wants to minimise resources consumed (costs). With fine-grained, usage based billing, resources used to keep function execution environments warm eat directly into margins. The authors demonstrate a policy combining keep-alive times for function runtimes with pre-warming, that dominates the currently popular strategy of simply keeping a function execution environment around for a period of time after it has been used, in the hope that it will be re-used. Using this policy, users see much fewer cold starts, while the cloud provider uses fewer resources. It’s the difference between the red (state-of-the-practice) and green (this paper) policies in the chart below. Win-win!

Very interesting.

Comments closed

Correlation in easystats

Published 2020-03-20 by Kevin Feasel

The easystats team announces a new R package:

The easystats project continues to grow with its more recent addition, a package devoted to correlations. Check-out its webpage here!
It’s lightweight, easy to use, and allows for the computation of many different kinds of correlations, such as partial correlations, Bayesian correlations, multilevel correlations, polychoric correlations, biweight, percentage bend or Sheperd’s Pi correlations (types of robust correlation), distance correlation (a type of non-linear correlation) and more, also allowing for combinations between them (for instance, Bayesian partial multilevel correlation).

I’d recommend reading the examples on the GitHub repo due to formatting. Looks quite interesting. H/T R-Bloggers.

Comments closed

Visualizing “Check All that Apply” Options

Published 2020-03-20 by Kevin Feasel

Stephanie Evergreen shows a couple of ways to visualize multi-select results:

Which means a bar chart, ordered greatest to least, is your alternative. But that can have many variations.
In this example, created by Dr. Sheila B. Robinson, she used 100% stacked bars for each survey item, to indicate that each item could have totaled 100% if all respondents checked that box. This is a nice way to show that, while the response options as a whole can’t add to 100%, each option on its own CAN. Plus, look at the cute icons.

Click through for several alternatives depending upon the story you’re trying to tell.

Comments closed

Connecting to Cosmos DB from the Gremlin Console

Published 2020-03-20 by Kevin Feasel

Hasan Savran shows how to connect to Cosmos DB’s Gremlin API via the Gremlin Console:

Graph Databases have been popular lately. You can use Azure Cosmos DB as your Graph database source by selecting Gremlin API. Gremlin programing language is developed by Apache TinkerPop of the Apache Software Foundation. I will show you how to connect to Cosmos DB Gremlin API from TinkerPop Gremlin console.
You can download the latest version of Gremlin console from here. The latest version was 3.4.6 when I wrote this post. I was able to connect to the Cosmos DB by using the versions 3.4.3 and 3.4.6. You can run the console from Linux or Windows, I will focus on the Windows version here but Linux version should work the same way. You must have Java SDK 8 to run this console. Latest version of Java SDK does not work with this console.

There are a couple of configuration steps, but nothing crazy.

Comments closed

Trimming Characters with Power Query

Published 2020-03-20 by Kevin Feasel

Imke Feldmann shows off an optional parameter with Text.TrimStart() and Text.TrimEnd() in Power Query:

When cleaning dirty data, you might have used the Trim-function (Text.TrimStart or Text.TrimEnd) to delete leading or trailing whitespace from your strings. However, did you know that you could use these functions as well to delete any other characters as well from the start or end of a string? Trimming text with custom characters is pretty straightforward:

Interesting that it does not take a regular expression but rather a list of characters.

Comments closed

Tips and Traps with PowerShell 7

Published 2020-03-20 by Kevin Feasel

Jeffrey Hicks takes us through some of the tricky parts of migrating to Powershell 7:

A long established community best practice in PowerShell scripting is not using command and parameter aliases. In a cross-platform world, this is even more critical. You may have been in the habit of using Sort in your code in place of Sort-Object. I know I have. I didn’t mind bending the no alias rule abit because there was nothing cryptic about Sort.
But in the Linux world, sort is a native command. There is no PowerShell alias. If your code uses sort, on Linux it will call the native command which will most likely break your code.

Read on for several more hints.

Comments closed

Evaluating Regression Models in Azure ML

Published 2020-03-19 by Kevin Feasel

Dan Fitton continues a series on model evaluation with Azure Machine Learning:

The initial go-to metric for understanding a regression model is the R squared (or R²) value, also known as the coefficient of determination. R squared measures how well the model is fitted to the data – the goodness of fit. It indicates how much of the variation of y (the target) is explained by the variation in x (the features).

The measures are bog standard if you’ve worked with regressions before, and Dan does a good job explaining them.

Comments closed

Ensuring Ordering of Kafka Data

Published 2020-03-19 by Kevin Feasel

Zeke Dean takes us through an explanation of Kafka partitioning and message delivery behaviors:

Apache Kafka provides developers with a uniquely powerful, open source and versatile distributed streaming platform – but it also has some rather complex nuances to understand when trying to store and retrieve data in your preferred order.
Kafka captures streaming data by publishing records to a category or feed name called a topic. Kafka consumers can then subscribe to topics to retrieve that data. For each topic, the Kafka cluster creates and updates a partitioned log. Kafka send all messages from a particular producer to the same partition, storing each message in the order it arrives. Partitions therefore function as a structure commit log, containing sequences of records that are both ordered and immutable.

Click through for the explanation.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Curated SQL Posts