Press "Enter" to skip to content

Author: Kevin Feasel

Backing Up Cosmos DB

Josh Smith takes us through backing up Cosmos DB yourself:

Unfortunately, if you are restricting access to your Cosmos DB service based on IP address (a reasonable security measure), then Data Factory won't work as of this writing, as Azure Data Factory doesn't operate like a trusted Azure service and presents an IP address from somewhere in the data center where it is spun up. Thankfully they are working on this. In the meantime, however, the next best thing is to use the Cosmos DB migration tool (scripts below) to dump the contents to a location where they can be retained as long as needed. Be aware that, in addition to the RU cost of returning the data, if you bring these backups back out of the data center where the Cosmos DB lives you'll also incur egress charges on the data.
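If you'd rather script the dump yourself instead of using the migration tool, a rough Python sketch with the azure-cosmos SDK might look like the following. The endpoint, key, database, and container names are placeholders, and this is only a starting point, not Josh's actual scripts.

```python
# Hypothetical sketch: dump every document in a Cosmos DB container to a
# local JSON file with the azure-cosmos Python SDK. All connection details
# below are placeholders.
import json

from azure.cosmos import CosmosClient

ENDPOINT = "https://your-account.documents.azure.com:443/"  # placeholder
KEY = "<primary-key>"                                       # placeholder
DATABASE = "SalesDb"                                        # placeholder
CONTAINER = "Customers"                                     # placeholder

client = CosmosClient(ENDPOINT, credential=KEY)
container = client.get_database_client(DATABASE).get_container_client(CONTAINER)

# read_all_items() pages through the container; every document returned
# costs RUs, so schedule this for a quiet period if throughput is tight.
items = list(container.read_all_items())

with open("cosmos-backup.json", "w", encoding="utf-8") as f:
    json.dump(items, f, indent=2, default=str)

print(f"Dumped {len(items)} documents")
```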

Having a plan for this kind of thing is important, even if you normally rely on service-provided automated backups.


Building an Azure Usage Report with PowerShell

June Castillote shows us how we can use PowerShell to get usage data from Azure for our subscriptions:

In the section above, it would be common for the command to return many thousands of objects, especially for long date ranges. To prevent overwhelming the API, the Get-UsageAggregates command only returns a maximum of 1000 results. If you've saved the $usageData variable as covered in the previous section, you can confirm it by running the command $usageData.UsageAggregations.count.

What if there are more than 1000 results? You’re going to have to do a little more work.
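The shape of that extra work is roughly the same in any language: keep requesting pages until the service stops handing back a continuation token. Here is a generic Python sketch of the loop, with an in-memory fake standing in for Get-UsageAggregates and its 1000-row pages:

```python
# Generic continuation-token loop: keep asking for pages until the service
# stops returning a token. fetch_page is any callable that returns
# (items, next_token); here it is faked with an in-memory list.
def fetch_all(fetch_page):
    items, token = [], None
    while True:
        page, token = fetch_page(token)
        items.extend(page)
        if not token:  # no token means the last page has been read
            break
    return items

# Fake paged source standing in for Get-UsageAggregates (1000-row pages).
data = list(range(2500))

def fake_fetch_page(token):
    start = token or 0
    page = data[start:start + 1000]
    next_token = start + 1000 if start + 1000 < len(data) else None
    return page, next_token

print(len(fetch_all(fake_fetch_page)))  # 2500
```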

Knowing how much you're spending is critical in an OpEx world like Azure or AWS.


Generating Anonymous Data

Daniel Hutmacher has a nice web API to generate fake customer data:

I’ve been working on a little gadget for a while now, and today I finally got around to completing it and so now I’ve published it for everyone to try out. It’s a web API (wait, wait, don’t go away – it’s for database people!) that creates a randomized list of names, addresses, etc.

In this post, I’ll show you how easy it is to use this service to anonymize a development or test database so you don’t have all that personally identifiable information floating around.

Read the whole thing and check out his service. Also, Daniel was the one who spurred me on to update the theme here to get rid of some problems, so you can thank him for that too.


Percentages of Totals in Snowflake

Koen Verbeeck shows how you can use the RATIO_TO_REPORT() function in Snowflake to determine the current row’s percentage of the total:

This episode talks about a new window function Snowflake recently introduced: RATIO_TO_REPORT. The function returns the ratio of the value of the current row to the sum of the values within the set. Or in other words, some sort of “percentage of total”. Nothing we couldn’t calculate before, but a bit of syntactic sugar so we don’t have to write two expressions.
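For a feel of what the function computes, here is the same percentage-of-total idea expressed in pandas; the table and column names below are made up, and the actual SQL syntax is in Koen's post.

```python
# Percentage of total within a group -- the same idea RATIO_TO_REPORT
# expresses in SQL, shown here against a made-up sales table in pandas.
import pandas as pd

sales = pd.DataFrame({
    "region": ["EMEA", "EMEA", "AMER", "AMER", "AMER"],
    "amount": [100.0, 300.0, 50.0, 150.0, 200.0],
})

# Each row's share of its region's total, i.e. ratio-to-report partitioned by region.
sales["pct_of_region"] = sales["amount"] / sales.groupby("region")["amount"].transform("sum")
print(sales)
```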

Click through to see how to use it and a contrast with the ANSI SQL approach.


Naming Temporary Columns in DAX

Marco Russo and Alberto Ferrari team up to share a standard for naming temporary columns in DAX:

The formula works just fine, but it violates one of the golden rules of DAX code: you always prefix a column reference with its table name, and you never use the table name when referencing a measure. Therefore, when reading DAX code, [Sales Amt] is a measure reference, whereas 'Product'[Sales Amt] is a column reference.

Nevertheless, in our DAX example ProdSalesAmt is a column of a temporary table (SalesByProduct) created by the FilteredSalesAmount measure. As such, ProdSalesAmt is a temporary column that does not originate from any column in the model and does not have a table name you can use as a prefix. This situation creates ambiguity in the code: it is not easy to discriminate between a column reference and a measure reference. Therefore, the code is harder to read and more error prone.

Read on for their standard, which is pretty easy to follow.


Azure Data Studio Process Explorer

Dave Bland shows us the process explorer in Azure Data Studio:

Notice that the pids all point to the azuredatastudio.exe processes. Azure Data Studio provides just a bit more information than Task Manager. Please be careful changing the state of a service. In other words, be careful stopping a process unless it is a last-resort approach to fixing an issue.

The first thing I thought when looking at it wasn’t the Task Manager; it was Chrome’s process explorer.


Significance, Confidence Level, and Confidence Interval

Stephanie Glen disambiguates three commonly confused but quite different terms:

In a nutshell, here are the definitions for all three.

1. Significance level: In a hypothesis test, the significance level, alpha, is the probability of making the wrong decision when the null hypothesis is true.
2. Confidence level: The probability that if a poll/test/survey were repeated over and over again, the results obtained would be the same. A confidence level = 1 – alpha.
3. Confidence interval: A range of results from a poll, experiment, or survey that would be expected to contain the population parameter of interest. For example, an average response. Confidence intervals are constructed using significance levels / confidence levels.
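To see how the three fit together numerically, here is a small Python sketch using only the standard library; the sample values are made up and it uses a plain normal approximation rather than a t-interval.

```python
# alpha, confidence level, and a confidence interval for a sample mean,
# standard library only. The sample values are made up.
import math
from statistics import NormalDist, mean, stdev

sample = [102, 98, 101, 97, 103, 99, 100, 104, 96, 100]  # made-up responses

alpha = 0.05
confidence_level = 1 - alpha                  # 0.95

x_bar = mean(sample)
se = stdev(sample) / math.sqrt(len(sample))   # standard error of the mean
z = NormalDist().inv_cdf(1 - alpha / 2)       # ~1.96 for a 95% level

lower, upper = x_bar - z * se, x_bar + z * se
print(f"{confidence_level:.0%} CI for the mean: ({lower:.2f}, {upper:.2f})")
```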

Read on for several examples and more elaboration.


Resource Allocation in Spark Applications

The folks at Beginner’s Hadoop take us through resource allocation in Spark applications:

Tiny executors essentially means one executor per core. The following table depicts the values of our spark-config params with this approach:

Analysis: With only one executor per core, as we discussed above, we'll not be able to take advantage of running multiple tasks in the same JVM. Also, shared/cached variables like broadcast variables and accumulators will be replicated in each core of the nodes, which is 16 times. Also, we are not leaving enough memory overhead for Hadoop/YARN daemon processes and we are not accounting for the ApplicationManager. NOT GOOD!
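For context, the knobs being weighed in that analysis boil down to a handful of Spark configuration values. Here is a purely illustrative PySpark sketch of a middle-ground sizing; it assumes the 16-core nodes from the excerpt, and the exact numbers are not a recommendation for any particular cluster.

```python
# Illustrative only: the executor-sizing settings the analysis is weighing,
# set through SparkSession. The values assume 16-core nodes and leave some
# cores and memory per node for OS/YARN daemons and the application master.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-sizing-sketch")
    .config("spark.executor.cores", "5")            # several tasks share one JVM
    .config("spark.executor.memory", "18g")         # heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap / YARN overhead
    .config("spark.executor.instances", "17")       # total executors across the cluster
    .getOrCreate()
)

print(spark.sparkContext.getConf().get("spark.executor.cores"))
```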

Read on for the full analysis.


Query Folding, Azure DevOps, and Power BI

Eugene Meidinger tries to work around a query folding limitation:

Query folding is one of the most powerful tools in Power Query and Power BI. It is the automatic process of pushing down filters and other transformations back to the data source. This can dramatically improve performance for your queries.

Unfortunately, OData is not guaranteed to support query folding, according to the Power BI documentation on incremental refresh.

Click through for Eugene’s alternative solution.


Principal Component Analysis in Python

Abhinav Choudhary shows us how to implement Principal Component Analysis in Python:

Principal Component Analysis (PCA) is an unsupervised statistical technique used to examine the interrelation among a set of variables in order to identify the underlying structure of those variables. In simple words, suppose you have 30 feature columns in a data frame; PCA will help to reduce the number of features by making a new feature which is the combined effect of all the features of the data frame. It is also known as factor analysis.
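If you want to see the mechanics before clicking through, a minimal scikit-learn sketch (with random data standing in for a real 30-column data frame) looks something like this:

```python
# Minimal PCA sketch with scikit-learn: standardize the features, then
# project 30 columns down to a handful of components. The data is random,
# standing in for a real data frame.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 30))                 # 500 rows, 30 feature columns

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scale

pca = PCA(n_components=5)
components = pca.fit_transform(X_scaled)

print(components.shape)                        # (500, 5)
print(pca.explained_variance_ratio_.round(3))  # variance captured per component
```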

PCA is quite useful in practice, though it has the unfortunate side effect of making it harder to interpret which factors are driving your solution.
