Curated SQL – Page 1115 – A Fine Slice Of SQL Server

Migrating Databricks Workspaces

Published 2019-10-03 by Kevin Feasel

Gerhard Brueckl has made DatabricksPS better:

I do not know what is/was the problem here but I did not have time to investigate but instead needed to come up with a proper solution in time. So I had a look what needs to be done for a manual export. Basically there are 5 types of content within a Databricks workspace:
– Workspace items (notebooks and folders)
– Clusters
– Jobs
– Secrets
– Security (users and groups)
For all of them an appropriate REST API is provided by Databricks to manage and also exports and imports. This was fantastic news for me as I knew I could use my existing PowerShell module DatabricksPS to do all the stuff without having to re-invent the wheel again.

I’ve used DatabricksPS and really like it for cases where I’d have to loop with the Databricks REST API—for example, when uploading files.

Comments closed

Backing Up Cosmos DB

Published 2019-10-03 by Kevin Feasel

Josh Smith takes us through backing up Cosmos DB yourself:

Unfortunately if you are restricting access to your Cosmos DB service based on IP address (a reasonable security measure) then Data Factory won’t work as of this writing as Azure Data Factory doesn’t operate like a trusted Azure service and presents as IP address from somewhere in the data center where it is spun up. Thankfully they are working on this. In the meantime however the next best thing is to use the Cosmos DB migration tool (scripts below) to dump the contents to a location where they can be retained as long as needed. Be aware in addition to the RU cost of returning the data that if you bring these backups back out of the data center where the Cosmos DB lives you’ll also incur egress charges on the data.

Having a plan for this kind of thing is important, even if you normally rely on service-provided automated backups.

Comments closed

Building an Azure Usage Report with Powershell

Published 2019-10-03 by Kevin Feasel

June Castillote shows us how we can use Powershell to get usage data from Azure for our subscriptions:

In the section above, it would be common for the command to return many thousand objects especially for long date ranges. To prevent overwhelming the API, the Get-UsageAggregates command only returns a maximum of 1000 results. If you’ve saved the $usageData variable as covered in the previous section, you can confirm it by using running this command $usageData.UsageAggregations.count.
What if there are more than 1000 results? You’re going to have to do a little more work.

Knowing how much you’re spending is critical in an Op-X world like Azure or AWS.

Comments closed

Generating Anonymous Data

Published 2019-10-03 by Kevin Feasel

Daniel Hutmacher has a nice web API to generate fake customer data:

I’ve been working on a little gadget for a while now, and today I finally got around to completing it and so now I’ve published it for everyone to try out. It’s a web API (wait, wait, don’t go away – it’s for database people!) that creates a randomized list of names, addresses, etc.
In this post, I’ll show you how easy it is to use this service to anonymize a development or test database so you don’t have all that personally identifiable information floating around.

Read the whole thing and check out his service. Also, Daniel was the one who spurred me on to update the theme here to get rid of some problems, so you can thank him for that too.

Comments closed

Percentages of Totals in Snowflake

Published 2019-10-03 by Kevin Feasel

Koen Verbeeck shows how you can use the RATIO_TO_REPORT() function in Snowflake to determine the current row’s percentage of the total:

This episode talks about a new window function Snowflake recently introduced: RATIO_TO_REPORT. The function returns the ratio of the value of the current row to the sum of the values within the set. Or in other words, some sort of “percentage of total”. Nothing we couldn’t calculate before, but a bit of syntactic sugar so we don’t have to write two expressions.

Click through to see how to use it and a contrast with the ANSI SQL approach.

Comments closed

Naming Temporary Columns in DAX

Published 2019-10-03 by Kevin Feasel

Marco Russo and Alberto Ferrari team up to share a standard for naming temporary columns in DAX:

The formula works just fine, but it violates one of the golden rules of DAX code: you always prefix a column reference with its table name, and you never use the table name when referencing a measure. Therefore, when reading DAX code, [Sales Amt] is a measure reference, whereas ‘Product'[Sales Amt] is a column reference.
Nevertheless, in our DAX example ProdSalesAmt is a column of a temporary table (SalesByProduct) created by the FilteredSalesAmount measure. As such, ProdSalesAmt is a temporary column that does not originate from any column in the model and does not have a table name you can use as a prefix. This situation creates ambiguity in the code: it is not easy to discriminate between a column reference and a measure reference. Therefore, the code is harder to read and more error prone.

Read on for their standard, which is pretty easy to follow.

Comments closed

Azure Data Studio Process Explorer

Published 2019-10-03 by Kevin Feasel

Dave Bland shows us the process explorer in Azure Data Studio:

Notice that the pids all point to the azuredatestudio.exe processes. Azure Data Studio provides just a bit more information than Task Manager. Please be careful changing the state of a service. In other words, be careful stopping a process unless it is a last restore approach to fixing an issue.

The first thing I thought when looking at it wasn’t the Task Manager; it was Chrome’s process explorer.

Comments closed

Significance, Confidence Level, and Confidence Interval

Published 2019-10-02 by Kevin Feasel

Stephanie Glen disambiguates three commonly confused but quite different terms:

In a nutshell, here are the definitions for all three.
1. Significance level: In a hypothesis test, the significance level, alpha, is the probability of making the wrong decision when the null hypothesis is true.
2. Confidence level: The probability that if a poll/test/survey were repeated over and over again, the results obtained would be the same. A confidence level = 1 – alpha.
3. Confidence interval: A range of results from a poll, experiment, or survey that would be expected to contain the population parameter of interest. For example, an average response. Confidence intervals are constructed using significance levels / confidence levels.

Read on for several examples and more elaboration.

Comments closed

Resource Allocation in Spark Applications

Published 2019-10-02 by Kevin Feasel

The folks at Beginner’s Hadoop take us through resource allocation in Spark applications:

Tiny executors essentially means one executor per core. Following table depicts the values of our spar-config params with this approach:

Analysis: With only one executor per core, as we discussed above, we’ll not be able to take advantage of running multiple tasks in the same JVM. Also, shared/cached variables like broadcast variables and accumulators will be replicated in each core of the nodes which is 16 times. Also, we are not leaving enough memory overhead for Hadoop/Yarn daemon processes and we are not counting in ApplicationManager. NOT GOOD!

Read on for the full analysis.

Comments closed

Query Folding, Azure DevOps, and Power BI

Published 2019-10-02 by Kevin Feasel

Eugene Meidinger tries to work around a query folding limitation:

Query folding is one of the most powerful tools in Power Query and Power BI. It is the automatic process of pushing down filters and other transformations back to the data source. This can dramatically improve performance for your queries.
Unfortunately, OData is not guaranteed to support query folding. According to the Power BI documentation on incremental refresh.

Click through for Eugene’s alternative solution.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Curated SQL Posts