Press "Enter" to skip to content

Curated SQL Posts

Retrieving Counts of Cosmos DB Collections

Manoj Pandey shows how you can retrieve counts of records in Cosmos DB using the .NET client:

Here in this post we will use C# .NET code (for beginners like me) to see how to:
1. Connect to a Cosmos DB instance
2. Get a list of all Databases in a Cosmos DB instance
3. Iterate through all the Databases and get the list of all Collections (or Tables)
4. Get the COUNT of all documents/items (or records) in these Collections

Click through to see how.
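Step 4 is typically the only part that needs a query; in the Core (SQL) API it boils down to a single aggregate run against each collection, roughly like the sketch below (the connection handling and the database/collection iteration happen in the .NET client, and the linked post shows the author's actual C# code).

    -- Cosmos DB SQL: count every document in the current collection.
    -- VALUE returns the bare number rather than a wrapped JSON object.
    SELECT VALUE COUNT(1)
    FROM c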


Building a Better Random

David Fowler wants to generate a random number:

You see the problem? Well, in the post that I mentioned, I addressed this issue and offered a solution using CHECKSUM() and NEWID().

That’s OK, but it can be a bit cumbersome. I recently stumbled across a function that I hadn’t realised was there before (and of course this is where you all tell me you’ve known about it all along): CRYPT_GEN_RANDOM().

My expectation was that CRYPT_GEN_RANDOM() would generate values according to a uniform distribution. A simple experiment with about 4 million entries, generating random numbers 0-255 (one byte), confirms that it does approximate a uniform distribution.
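As a rough illustration (not David's exact code): CRYPT_GEN_RANDOM() takes a length in bytes and returns that many random bytes as a varbinary, so a cast plus a little modulo arithmetic turns it into an integer in whatever range you need, and unlike seedless RAND() it is evaluated per row.

    -- One random byte, 0-255 (the range used in the uniformity check above)
    SELECT CAST(CRYPT_GEN_RANDOM(1) AS int) AS RandomByte;

    -- A random integer from 1 to 100: two random bytes give 0-65535, and the
    -- modulo maps that into the target range (with a small modulo bias that
    -- rarely matters in practice)
    SELECT CAST(CRYPT_GEN_RANDOM(2) AS int) % 100 + 1 AS RandomInt;

    -- Evaluated per row, so each row gets its own value
    SELECT o.name,
           CAST(CRYPT_GEN_RANDOM(2) AS int) % 100 + 1 AS RandomInt
    FROM sys.objects AS o;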


When You Can’t Win for Trying: SQL Agent Failures

Garry Bargsley troubleshoots a strange issue:

What are your troubleshooting steps when a job failure is reported?

1. Open SQL Agent Job History for the failed job
2. Look at the SQL Server Log
3. This was an Ola job, so look at the CommandLog table
4. Look at the text log file stored on the file system
5. Open the job step, get the code being executed, and run it in a new query window

Now that you and I have performed the same steps and found no smoking gun, where do you go next?

But what happens when all of your indicators look fine, yet the job is still failing? Read on for one possible answer.
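For reference, steps 1 and 3 of that checklist can also be done from a query window; a rough sketch is below. The job name is hypothetical, and dbo.CommandLog assumes a default Ola Hallengren installation (column names can vary by version).

    -- Step 1: recent history for the failing job (run_status 0 = failed)
    SELECT j.name, h.step_id, h.step_name, h.run_status,
           h.run_date, h.run_time, h.message
    FROM msdb.dbo.sysjobs AS j
    JOIN msdb.dbo.sysjobhistory AS h
        ON h.job_id = j.job_id
    WHERE j.name = N'YourFailingJob'
    ORDER BY h.instance_id DESC;

    -- Step 3: the most recent commands logged by Ola's maintenance solution
    SELECT TOP (50) StartTime, EndTime, DatabaseName, CommandType,
           ErrorNumber, ErrorMessage, Command
    FROM dbo.CommandLog
    ORDER BY StartTime DESC;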


Using Azure Functions to Tag Resources

Jess Pomfret shows off an interesting way of using Azure Functions to apply tags to resources:

In part one I discussed how useful Azure tags can be, and specifically how adding a ‘dateCreated’ tag can help you keep track of your resources, and how to find resources with certain tags using PowerShell. Parts 2 and 3 are based around the fact that adding the ‘dateCreated’ tag is a great idea, but relying on a human to remember to add it is less than ideal. In part 2 we looked at using Azure Policy to automatically add the tag. Today’s post covers another option: using Azure Functions.

Azure Functions gives us a way of running serverless code, written in a number of different languages, triggered by specific events or timings. Looking through the documentation, there are many use cases, from processing files to analysing IoT workstreams. Our use case is to run a PowerShell script that tags any resources that are missing the ‘dateCreated’ tag.

Click through to see how.


Window Functions in Row and Batch Modes

Erik Darling kicks off a new series:

To start things off, we’re going to talk about query plan patterns related to windowing functions.

There are several things to consider with windowing function query plans:

– Row vs Batch mode
– With and Without Partition By
– Index Support for Partition and Order By
– Column SELECTion
– Rows vs Range/Global aggregates

We’ll get to them in separate posts, because there are particulars about them that would make covering them all in a single post unwieldy.

Anyway, the first one is pretty simple, and starting simple is about my speed.

Read on for this quick coverage of row mode versus batch mode processing with respect to window functions.
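For context, a typical windowed running total looks like the sketch below (dbo.Orders and its columns are hypothetical). Against a plain rowstore table this sort of query tends to run in row mode with a Window Spool; with a columnstore index present, or with batch mode on rowstore in SQL Server 2019+, the optimizer can use the batch mode Window Aggregate operator instead, which is the kind of difference Erik examines.

    SELECT
        o.CustomerID,
        o.OrderDate,
        SUM(o.TotalDue) OVER
        (
            PARTITION BY o.CustomerID
            ORDER BY o.OrderDate
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS RunningTotal
    FROM dbo.Orders AS o;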


Caching versus Persisting in Spark

The Hadoop in Real World team explains a subtle difference:

The cache() and persist() functions are used to cache intermediate results of an RDD, DataFrame, or Dataset. You can mark an RDD, DataFrame, or Dataset to be persisted by calling persist() or cache() on it. The first time it is computed in an action, the objects behind the RDD, DataFrame, or Dataset on which cache() or persist() was called will be kept in memory, or at the configured storage level, on the nodes.

That’s the similarity, but click through for the difference.
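The same distinction shows up in Spark SQL, which is one way to see it without leaving a query: a bare CACHE TABLE uses the default storage level (roughly what cache() does), while the storageLevel option plays the role of the argument you would pass to persist(). The table names below are hypothetical, and note that the post itself is about the DataFrame/RDD methods rather than these statements.

    -- Default storage level, analogous to calling cache()
    CACHE TABLE sales_raw;

    -- Explicit storage level, analogous to persist(StorageLevel.DISK_ONLY)
    CACHE TABLE sales_by_region OPTIONS ('storageLevel' 'DISK_ONLY')
        AS SELECT region, SUM(amount) AS total_amount
           FROM sales_raw
           GROUP BY region;

    -- CACHE TABLE is eager unless written as CACHE LAZY TABLE, whereas
    -- cache()/persist() on a DataFrame only take effect once an action runs.

    -- Release the cached data when finished
    UNCACHE TABLE sales_by_region;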


Working with Prediction Intervals

Bryan Shalloway explains how generating prediction intervals is different from making point predictions:

Before using the model for predictive inference, one should have reviewed overall performance on a holdout dataset to ensure the model is sufficiently accurate for the business context. For example, for our problem, is an average error of ~12% and a 90% prediction interval of +/- ~25% of Sale_Price useful? If the answer is “no,” that suggests the need for more effort in improving the accuracy of the model (e.g. trying other transformations, features, or model types). For our examples we are assuming the answer is “yes,” our model is accurate enough (so it is appropriate to move on and focus on prediction intervals).

Click through for the article.
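As a quick reminder of what is being promised there: a 90% prediction interval is a pair of bounds [l(x), u(x)] for an individual new observation, chosen so that roughly

    P( l(X) <= Y <= u(X) ) ≈ 0.90

which is a statement about a single future Sale_Price rather than about the average, and so it is necessarily wider than a confidence interval around the predicted mean.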


Spark Performance in Azure Synapse Analytics

Euan Garden shares some numbers around Apache Spark performance in Azure Synapse Analytics:

To compare the performance, we derived queries from TPC-DS at 1TB scale and ran them on an 8-node Azure E8V3 cluster (15 executors, 28 GB memory and 4 cores each). Even though our version running inside Azure Synapse today is a derivative of Apache Spark™ 2.4.4, we compared it with the latest open-source release of Apache Spark™ 3.0.1 and saw that Azure Synapse was 2x faster in total runtime for the Test-DS comparison.

Click through for several techniques the Azure Synapse Analytics team has implemented to make some significant performance improvements. It’s still slower than Databricks, but considerably faster than the open-source Apache Spark baseline.
