
Curated SQL Posts

PowerShell Functions and Return Type Oddities

Dave Mason goes down the rabbit hole:

As a result of some struggles trying to automate a process, I’ve learned some things about PowerShell. After getting to the bottom of a time-consuming problem, I thought it worth a blog post that might save someone else some time and heartache.

Let’s begin with this simple function named Get-RandomDate. It generates and returns a random date between today and X days ago. It has a mandatory input parameter, $DaysAgo, which is of type [System.Int32].

That’s all straightforward, but then things get weird.


Understanding Capitalization Changes in Power BI

Jeroen ter Heerdt explains why you might see capitalization change in text loaded into Power BI:

Some of you might have seen this before – you load some data into Power BI and suddenly the capitalization (uppercase / lowercase) of your text changes on you. Let me explain what is happening here.

Click through to learn why. Normally I’d say “and what you can do about it,” but there’s not much. I guess the best answer is to embrace consistent capitalization, and then you won’t notice anything.


Managing Spark on Kubernetes instead of YARN

Rohit Choudhary argues that it’s a good idea to move from YARN to Kubernetes for your Spark clusters:

When it comes to data operations, Spark provides a tremendous advantage because it aligns with the things that make data ops valuable. It is optimized for machine learning and AI, which are used for batch processing (in real-time and at scale), and it is adept at operating within different types of environments.

Spark doesn’t completely manage these clusters of machines but instead uses a cluster manager (known as a scheduler). Most companies have traditionally used the Java Virtual Machine (JVM)-based Hadoop YARN to manage their clusters. But with the dramatic rise of Kubernetes and cloud-native computing, many organizations are moving away from YARN to Kubernetes to manage their Spark clusters. Spark on Kubernetes has even been generally available since the Apache Spark 3.1 release in March 2021.

I see some of the benefits there but am not totally sold, especially given the complexity of Kubernetes and its own lack of built-in security measures.


Ordered Set Functions in SQL Server

I continue a series on window functions in SQL Server:

As of SQL Server 2019, there is only one ordered set function: STRING_AGG(). I like STRING_AGG() a lot, especially because it means my days of needing to explain the STUFF() + FOR XML PATH trick to concatenate values together in SQL Server are numbered.

STRING_AGG() is interesting in that we categorize it as a window function and yet it violates my first rule of window functions: there isn’t an OVER() clause. Instead, it accepts but does not require a WITHIN GROUP() clause. Let’s see it in action.

Click through for a look at that, as well as a little hint that maybe we’ve seen ordered set functions before in a different guise.
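
To give a rough sense of the shape of the syntax, here is a minimal sketch of STRING_AGG() with its optional WITHIN GROUP() clause; the table and column names are hypothetical:

-- Hypothetical table and columns, for illustration only.
-- WITHIN GROUP (ORDER BY ...) controls concatenation order;
-- without it, the order of the concatenated values is not guaranteed.
SELECT
    o.CustomerID,
    STRING_AGG(CAST(o.OrderID AS varchar(12)), ', ')
        WITHIN GROUP (ORDER BY o.OrderDate) AS OrderList
FROM dbo.Orders o
GROUP BY o.CustomerID;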


FAST_FORWARD and Cursors

Joe Obbish skips past the commercials:

If you’re like me, you started your database journey by defining cursors with the default options. This went on until a senior developer or DBA kindly pointed out that you can get better performance by using the FAST_FORWARD option. Or maybe you were a real go-getter and found Aaron Bertrand’s performance benchmarking blog post on different cursor options. I admit that for many years I didn’t care to know why FAST_FORWARD sometimes made my queries faster. It had “FAST” in the name and that was good enough for me.

Recently I saw a production issue where using the right cursor options led to a 1000X performance improvement. I decided that ten years of ignorance was enough and finally did some research on different cursor options. This post contains a reproduction and discussion of the production issue.

I thought everybody knew how this works: the database streams the data tape from the supply pulley to the play shaft by using a sprocket to rotate the gear in the cassette at a fixed speed. The FAST_FORWARD cursor option engages the fast forward idler in the VCR database and causes rotation to occur more rapidly than normal.
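
Joking aside, here is a minimal sketch of the option in use, against a hypothetical dbo.Orders table. FAST_FORWARD declares the cursor as both FORWARD_ONLY and READ_ONLY, which lets the engine choose a cheaper execution strategy:

-- Hypothetical table; the cursor body is a placeholder.
DECLARE @OrderID int;

DECLARE order_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT OrderID
    FROM dbo.Orders;

OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @OrderID;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Per-row work would go here.
    FETCH NEXT FROM order_cursor INTO @OrderID;
END;

CLOSE order_cursor;
DEALLOCATE order_cursor;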


Building an MLOps Workflow with SageMaker and GitLab

Lauren Mullennex, et al., build out some pipelines:

Machine learning operations (MLOps) is key to effectively transitioning from an experimentation phase to production. The practice gives you the ability to create a repeatable mechanism to build, train, deploy, and manage machine learning models. To quickly adopt MLOps, you often require capabilities that use your existing toolsets and expertise. Projects in Amazon SageMaker give organizations the ability to easily set up and standardize developer environments for data scientists and CI/CD (continuous integration, continuous delivery) systems for MLOps engineers. With SageMaker projects, MLOps engineers or organization administrators can define templates that bootstrap the ML workflow with source version control, automated ML pipelines, and a set of code to quickly start iterating over ML use cases. With projects, dependency management, code repository management, build reproducibility, and artifact sharing and management become easy for organizations to set up. SageMaker projects are provisioned using AWS Service Catalog products. Your organization can use project templates to provision projects for each of your users.

In this post, you use a custom SageMaker project template to incorporate CI/CD practices with GitLab and GitLab pipelines. You automate building a model using Amazon SageMaker Pipelines for data preparation, model training, and model evaluation. SageMaker projects build on Pipelines by implementing the model deployment steps and using SageMaker Model Registry, along with your existing CI/CD tooling, to automatically provision a CI/CD pipeline. In our use case, after the trained model is approved in the model registry, the model deployment pipeline is triggered via a GitLab pipeline.

Click through for the step-by-step guide on how to do this.


Statistical Window Functions in SQL Server

I continue a series on window functions in SQL Server:

CUME_DIST() doesn’t show 0 for the smallest record. The reason for this is in the definition: CUME_DIST() tells us how far along we are in describing the entire set—that is, what percentage of values have we covered so far. This percentage is always greater than 0. By contrast, PERCENT_RANK() forces the lowest value to be 0 and the highest value to be 1.

Another thing to note is ties. There are 117 values for customer 1 in my dataset. Rows 5 and 6 both have a percent rank of 0.0344, which is approximately rank 4 (remembering that we start from 0, not 1). Both rows 5 and 6 have the same rank of 4, and then we move up to a rank of 6. Meanwhile, for cumulative distribution, we see that rows 5 and 6 have a cumulative distribution of 6/117 = 0.0513. In other words, PERCENT_RANK() ties get the lowest possible value, whereas CUME_DIST() ties get the highest possible value.

Click through for much more detail, including examples galore.
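
For a quick sense of the difference, here is a minimal sketch that puts the two functions side by side; the table and column names are hypothetical:

-- Hypothetical table and columns, for illustration only.
-- PERCENT_RANK() = (rank - 1) / (row count - 1), so the lowest value is always 0.
-- CUME_DIST() = (rows at or below the current value) / (row count), so it is always above 0.
SELECT
    OrderID,
    OrderTotal,
    PERCENT_RANK() OVER (ORDER BY OrderTotal) AS PctRank,
    CUME_DIST() OVER (ORDER BY OrderTotal) AS CumeDist
FROM dbo.Orders
WHERE CustomerID = 1;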


Organizing Synapse Workspaces and Lakehouses

Jovan Popovic confirms that Microsoft is using the term “Lakehouse” like Databricks does:

The lakehouse pattern enables you to keep a large amount of your data in the data lake and to get analytic capabilities without needing to move your data to a data warehouse to start an analysis. A lakehouse represents a good trade-off between query performance and the ability to access the latest version of data without waiting for data to be reloaded.

An Azure Synapse Analytics workspace enables you to implement the lakehouse pattern on top of Azure Data Lake Storage.

When you think about your lakehouse solution, be aware that there are two options for creating databases over the lake:

Lake databases, which are created using Spark or a database template

SQL databases, which are created using serverless SQL pools on top of the data lake

Although you might use different tools and languages to create these types of databases, the principles described in this article apply to both types. I will use the term “lakehouse” whenever I reference a Spark Lake database or a SQL database created using serverless SQL pools.

Click through for Jovan’s guidance.
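
As a point of reference, the serverless SQL pool side of this pattern often starts with OPENROWSET over files in the lake; here is a minimal sketch with a hypothetical storage path:

-- Hypothetical storage account and container path.
-- The serverless SQL pool queries the Parquet files in place, with no data loading step.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/files/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;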
