Author: Kevin Feasel

Cloudera Data Platform in Azure Marketplace

Ram Venkatesh announces availability for Cloudera Data Platform in the Azure Marketplace:

Cloudera Data Platform (CDP) is now available on Microsoft Azure Marketplace – so joint customers can easily deploy the world’s first enterprise data cloud on Microsoft Azure.

Last week we announced the availability of Cloudera Data Platform (CDP) on Azure Marketplace. CDP is an integrated data platform that is easy to secure, manage, and deploy. With its availability on the Azure Marketplace, joint customers of Cloudera and Microsoft will be able to easily discover and provision CDP Public Cloud across all Azure regions. Additionally, by procuring CDP through the Azure Marketplace, these customers can leverage integrated billing, i.e., the cost of CDP will be part of a single Azure bill, making procurement simple and friction-free.

The new Cloudera’s approach has been cloud-first to the point of being cloud-only. It’s an interesting shift from the merger of two on-prem companies.

Tuning a Query Searching for a Substring in Text

Eddy Djaja gives us two methods for improving performance of a search for a fixed substring:

The substring function is used because the column ACCOUNTDISPLAYVALUE has multiple values combined in one column. In this case, the query is searching for the Account Number, which is the first six characters. The long-running query is listed below:

set statistics io on
go
select sum(ACCOUNTINGCURRENCYAMOUNT)
from [d365].[GeneralJournalAccountMultiCompanyEntries]
where substring([ACCOUNTDISPLAYVALUE], 1, 6) = '877601'

Eddy gives us two solutions. As a quick note, these solutions work because the query is looking for a specific stretch of characters after a specific starting point. For arbitrary text, things get a little trickier.
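
To make the "specific starting point" idea concrete, here is a minimal Python sketch. It is not either of Eddy's solutions; it only illustrates why a fixed prefix is friendlier than arbitrary text. A sorted list stands in for an index, and the function names and sample account values are made up for illustration. A prefix lookup can binary-search to the first match and stop at the first non-match, while an arbitrary substring has to look at every value.

```python
from bisect import bisect_left


def prefix_seek(sorted_keys, prefix):
    """Find keys starting with prefix by seeking into a sorted list.

    All keys sharing a prefix are contiguous in sorted order, so we can
    binary-search to the first candidate and stop at the first mismatch.
    """
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


def substring_scan(keys, needle):
    """Find keys containing needle anywhere; sort order does not help here."""
    return [key for key in keys if needle in key]


# Hypothetical display values: account number first, then other segments.
accounts = sorted([
    "877601-001-A",
    "877601-002-B",
    "877602-001-A",
    "100200-001-C",
])

print(prefix_seek(accounts, "877601"))
print(substring_scan(accounts, "001"))
```

The prefix search touches only the matching neighborhood of the list, which is roughly what an index seek buys you. The substring search touches everything, which is the "trickier" case.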

Illogical Errors and Implicit Conversion

Aaron Bertrand takes us through a problem with seemingly indeterminate query errors:

I’ve talked about illogical errors before. In several answers on Database Administrators (one, two, three), I show how you can use a CASE expression or TRY_CONVERT to work around an error where a non-numeric value, that should have been filtered out by a join or other clause, still leads to a conversion error. Erland Sommarskog raised a Connect item over a decade ago, still unaddressed, called “SQL Server should not raise illogical errors.”

Recently we had a scenario where a query was failing on one server but not another. But this was slightly different; there were no numerics involved. Imagine this scenario: a source table has a column that is varchar(20). A query creates a table variable with a column that is varchar(10), and inserts rows from the source table, with a filter in place that only exposes values that are 10 characters or less.

In a lot of cases, of course, this scenario is perfectly fine, and everything works as expected.

Read the whole thing. There is a method to the madness, and Aaron explains how it can come up in some cases but not others.
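
As a rough analogy for the numeric case Aaron mentions first (not the varchar scenario from his post, and not how SQL Server works internally), here is a small Python sketch. The function names and sample rows are hypothetical. The point is that if the conversion runs before the filter, the bad row blows up even though the filter would have removed it, and a TRY_CONVERT-style helper makes the result independent of the order.

```python
def try_convert_int(value):
    """Return int(value), or None when the text is not a valid integer.

    Similar in spirit to TRY_CONVERT: a failed conversion yields a null
    instead of an error.
    """
    try:
        return int(value)
    except ValueError:
        return None


def filter_then_convert(rows):
    """The order we expect: drop non-numeric rows, then convert."""
    return [int(r) for r in rows if r.isdigit()]


def convert_then_filter(rows):
    """The order we did not ask for: convert every row, then filter."""
    converted = [(r, int(r)) for r in rows]  # raises on 'abc'
    return [number for r, number in converted if r.isdigit()]


def safe_in_either_order(rows):
    """Convert defensively so the evaluation order no longer matters."""
    converted = [(r, try_convert_int(r)) for r in rows]
    return [number for r, number in converted if r.isdigit()]


rows = ["10", "20", "abc", "30"]

print(filter_then_convert(rows))
print(safe_in_either_order(rows))
try:
    convert_then_filter(rows)
except ValueError as error:
    print("illogical error:", error)
```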

An Example of Complex CSV Rule Parsing with Power Query

Cedric Charlier shows off some of the benefits of Power Query with a fairly complicated set of rules:

At the beginning, some of us thought that it would be easy to fix these issues by returning to the data quality team and asking them to fix them, but it was not so easy. Identifying the rules needing a fix would be a huge task (the CSV files are not created if the test is successful, making it impossible to address this issue in one run, among other impediments). I took the decision to work around this issue by implementing the following heuristic:

– if the CSV has a column DateTime then we’ll use it
– if the header is empty or no column is named DateTime then use the first column
– if the content of the selected column is not a date then try to parse it as the inner content of a JSON element.

Read on to see how.
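
Cedric's implementation is in Power Query, so the following is only a Python sketch of the three rules as I read them. Two assumptions to flag: I treat "a date" as ISO-8601 text, and I interpret "the inner content of a JSON element" as text that becomes valid JSON once wrapped in braces. The function names are mine.

```python
import csv
import io
import json
from datetime import datetime


def parse_date(text):
    """Return a datetime for ISO-8601 text, or None if it does not parse."""
    try:
        return datetime.fromisoformat(text.strip())
    except ValueError:
        return None


def parse_json_inner(text):
    """Wrap text in braces (assumption), parse as JSON, return the first date."""
    try:
        obj = json.loads("{" + text + "}")
    except json.JSONDecodeError:
        return None
    for value in obj.values():
        if isinstance(value, str):
            found = parse_date(value)
            if found is not None:
                return found
    return None


def extract_datetimes(csv_text):
    """Apply the three rules to each data row of a CSV document."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0] if rows else []
    # Rules 1 and 2: use the DateTime column if present, else the first column.
    index = header.index("DateTime") if "DateTime" in header else 0
    results = []
    for row in rows[1:]:
        cell = row[index] if index < len(row) else ""
        value = parse_date(cell)
        if value is None:
            # Rule 3: not a date, so try it as the inside of a JSON element.
            value = parse_json_inner(cell)
        results.append(value)
    return results


print(extract_datetimes("Id,DateTime\n1,2020-05-01T10:00:00\n"))
print(extract_datetimes("When,Id\n2020-05-02,7\n"))
```

Rows where none of the rules produce a date come back as None rather than raising, which seemed like the friendlier choice for files of unknown quality.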

Correlation in easystats

The easystats team announces a new R package:

The easystats project continues to grow with its more recent addition, a package devoted to correlations. Check out its webpage here!

It’s lightweight, easy to use, and allows for the computation of many different kinds of correlations, such as partial correlations, Bayesian correlations, multilevel correlations, polychoric correlations, biweight, percentage bend or Shepherd’s Pi correlations (types of robust correlation), distance correlation (a type of non-linear correlation) and more, also allowing for combinations between them (for instance, Bayesian partial multilevel correlation).

I’d recommend reading the examples on the GitHub repo due to formatting. Looks quite interesting. H/T R-Bloggers.

Characterizing and Optimizing a Serverless Workload

Adrian Colyer reviews an interesting paper:

Today’s paper analyses serverless workloads on Azure (the characterisation of those workloads is interesting in its own right), where users want fast function start times (avoiding cold starts), and the cloud provider wants to minimise resources consumed (costs). With fine-grained, usage-based billing, resources used to keep function execution environments warm eat directly into margins. The authors demonstrate a policy combining keep-alive times for function runtimes with pre-warming, that dominates the currently popular strategy of simply keeping a function execution environment around for a period of time after it has been used, in the hope that it will be re-used. Using this policy, users see far fewer cold starts, while the cloud provider uses fewer resources. It’s the difference between the red (state-of-the-practice) and green (this paper) policies in the chart below. Win-win!

Very interesting.
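
To see why pre-warming can win on both axes, here is a toy Python model. It is not the paper's policy or its numbers; the gap lengths, window sizes, and function names are all made up. Each entry in gaps is the idle time in minutes between two invocations of the same function. The fixed policy keeps the environment warm for a set time after each call. The pre-warming policy unloads right away, reloads shortly before the next call is expected, and then holds for a short window.

```python
def fixed_keep_alive(gaps, keep_alive):
    """Keep the environment warm for keep_alive minutes after every call."""
    cold_starts = sum(1 for gap in gaps if gap > keep_alive)
    warm_minutes = sum(min(gap, keep_alive) for gap in gaps)
    return cold_starts, warm_minutes


def prewarm_then_keep_alive(gaps, prewarm, keep_alive):
    """Unload after each call, reload at prewarm minutes, hold for keep_alive."""
    cold_starts = 0
    warm_minutes = 0
    for gap in gaps:
        # A call is warm only if it lands inside the pre-warmed window.
        if gap < prewarm or gap > prewarm + keep_alive:
            cold_starts += 1
        # Memory is only held from the pre-warm point until the call or expiry.
        if gap >= prewarm:
            warm_minutes += min(gap, prewarm + keep_alive) - prewarm
    return cold_starts, warm_minutes


# Hypothetical function that fires once an hour, five times after the first call.
hourly_gaps = [60] * 5

print(fixed_keep_alive(hourly_gaps, keep_alive=10))
print(prewarm_then_keep_alive(hourly_gaps, prewarm=55, keep_alive=10))
```

For a predictable hourly function, the fixed window expires long before the next call, so every call is cold and the warm time is wasted. The pre-warmed window catches every call while holding memory for less time. The real difficulty, of course, is predicting the gaps, which this sketch simply assumes away.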

Connecting to Cosmos DB from the Gremlin Console

Hasan Savran shows how to connect to Cosmos DB’s Gremlin API via the Gremlin Console:

Graph databases have been popular lately. You can use Azure Cosmos DB as your graph database source by selecting the Gremlin API. The Gremlin programming language is developed by Apache TinkerPop of the Apache Software Foundation. I will show you how to connect to the Cosmos DB Gremlin API from the TinkerPop Gremlin console.

You can download the latest version of the Gremlin console from here. The latest version was 3.4.6 when I wrote this post. I was able to connect to Cosmos DB using versions 3.4.3 and 3.4.6. You can run the console on Linux or Windows; I will focus on the Windows version here, but the Linux version should work the same way. You must have Java SDK 8 to run this console. The latest version of the Java SDK does not work with this console.

There are a couple of configuration steps, but nothing crazy.

Visualizing “Check All that Apply” Options

Stephanie Evergreen shows a couple of ways to visualize multi-select results:

Which means a bar chart, ordered greatest to least, is your alternative. But that can have many variations.

In this example, created by Dr. Sheila B. Robinson, she used 100% stacked bars for each survey item, to indicate that each item could have totaled 100% if all respondents checked that box. This is a nice way to show that, while the response options as a whole can’t add to 100%, each option on its own CAN. Plus, look at the cute icons.

Click through for several alternatives depending upon the story you’re trying to tell.

Tips and Traps with PowerShell 7

Jeffrey Hicks takes us through some of the tricky parts of migrating to PowerShell 7:

A long-established community best practice in PowerShell scripting is not using command and parameter aliases. In a cross-platform world, this is even more critical. You may have been in the habit of using Sort in your code in place of Sort-Object. I know I have. I didn’t mind bending the no-alias rule a bit because there was nothing cryptic about Sort.

But in the Linux world, sort is a native command. There is no PowerShell alias. If your code uses sort, on Linux it will call the native command which will most likely break your code.

Read on for several more hints.
