Press "Enter" to skip to content

Month: November 2021

Building an MLOps Workflow with SageMaker and GitLab

Lauren Mullennex, et al, build out some pipelines:

Machine learning operations (MLOps) are key to effectively transition from an experimentation phase to production. The practice provides you the ability to create a repeatable mechanism to build, train, deploy, and manage machine learning models. To quickly adopt MLOps, you often require capabilities that use your existing toolsets and expertise. Projects in Amazon SageMaker give organizations the ability to easily set up and standardize developer environments for data scientists and CI/CD (continuous integration, continuous delivery) systems for MLOps engineers. With SageMaker projects, MLOps engineers or organization administrators can define templates that bootstrap the ML workflow with source version control, automated ML pipelines, and a set of code to quickly start iterating over ML use cases. With projects, dependency management, code repository management, build reproducibility, and artifact sharing and management become easy for organizations to set up. SageMaker projects are provisioned using AWS Service Catalog products. Your organization can use project templates to provision projects for each of your users.

In this post, you use a custom SageMaker project template to incorporate CI/CD practices with GitLab and GitLab pipelines. You automate building a model using Amazon SageMaker Pipelines for data preparation, model training, and model evaluation. SageMaker projects builds on Pipelines by implementing the model deployment steps and using SageMaker Model Registry, along with your existing CI/CD tooling, to automatically provision a CI/CD pipeline. In our use case, after the trained model is approved in the model registry, the model deployment pipeline is triggered via a GitLab pipeline.

Click through for the step-by-step guide on how to do this.

Leave a Comment

Statistical Window Functions in SQL Server

I continue a series on window functions in SQL Server:

CUME_DIST() doesn’t show 0 for the smallest record. The reason for this is in the definition: CUME_DIST() tells us how far along we are in describing the entire set—that is, what percentage of values have we covered so far. This percentage is always greater than 0. By contrast, PERCENT_RANK() forces the lowest value to be 0 and the highest value to be 1.

Another thing to note is ties. There are 117 values for customer 1 in my dataset. Rows 5 and 6 both have a percent rank of 0.0344, which is approximately rank 4 (remembering that we start from 0, not 1). Both rows 5 and 6 have the same rank of 4, and then we move up to a rank of 6. Meanwhile, for cumulative distribution, we see that rows 5 and 6 have a cumulative distribution of 6/117 = 0.5128. In other words, PERCENT_RANK() ties get the lowest possible value, whereas CUME_DIST() ties get the highest possible value.

Click through for much more detail, including examples galore.

Leave a Comment

Organizing Synapse Workspaces and Lakehouses

Jovan Popovic confirms that Microsoft is using the term “Lakehouse” like Databricks does:

The lakehouse pattern enables you to keep a large amount of your data in Data Lake and to get the analytic capabilities without a need to move your data to some data warehouse to start an analysis. A lakehouse represents a good trade-off between query performance and the ability to access the latest version of data without the need to wait for data to be reloaded.

Azure Synapse Analytics workspace enables you to implement the Lakehouse pattern on top of Azure Data Lake storage.

When you think about your lakehouse solution, be aware that there are two options for creating databases over the lake:

Lake databases that are created using Spark or database template

SQL databases that are created using serverless SQL pools on top of data lake.

Although you might use different tools and languages to create these types of databases, the principles described in this article apply to both types. I will use the term “lakehouse” whenever i reference Spak Lake database or SQL database created using the serverless SQL pools.

Click through for Jovan’s guidance.

Leave a Comment

Dynamic Data Masking and Granular Unmasking

Dennes Torres points out a change to dynamic data masking in Azure SQL DB:

Dynamic data mask is a very interesting security feature allowing us to mask critical fields such as e-mail, phone number, credit card and so on. We can decide what users will be able to see the value of these features or not.

This feature faced many flaws when it was released, but I believe it’s stable now, although It’s not the main security feature you should care about, it can still be very useful.

However, until very recently, this feature was not very useful. If you mask many fields in many different tables, the fields may require different permission levels in order to be unmasked.

I agree that this is definitely not a security feature. But hey, at least it’s a bit more useful than it was before.

Leave a Comment

Logging Database Scoped Configuration Changes

Jonathan Kehayias wants to know what changes you’ve made:

The introduction of DATABASE SCOPED CONFIGURATIONS in SQL Server 2016 enabled different configuration settings at the individual database level. However, there is no logging of changes to the database scoped settings by default in SQL Server, making it nearly impossible to track down when a change was made and by who. After recently working on a client problem where performance issues were attributed to a DATABASE SCOPED CONFIGURATION of MAXDOP = 1 multiple times, I decided to create a DDL trigger for the ALTER_DATABASE_SCOPED_CONFIGURATION events in SQL Server to have it log the change to the ERRORLOG file similar to the one I wrote years ago for logging Extended Event session changes.

Click through for the definition of that trigger.

Leave a Comment

BETWEEN and Overlaps

Chad Callihan reminds us that BETWEEN is inclusive of both sides:

Thanks to Robert for his comment on the last post that then spawned this post. In the example about sargable dates, I thought I would go with the more simple look and only use dates instead of adding the times. The point is to look at sargability, right? Well, here’s an example on why you don’t mix and match dates and datetimes.

Click through for the demonstration.

Leave a Comment

Storytelling with Data Book Review

Camila Henrique has a book review for us:

Hello! As you may have noticed from my Reading List page here, I like to read. Recently, with the new job, I was looking for a book that talked about Data Visualization. While searching, I came across “Storytelling with Data”, and it was not the first time I saw it. After checking a few reviews, I decided to invest my time reading it. Turns out it was a great decision! I liked it so much that I wanted to talk about it here, so here it comes, grab your reading glasses.

This has been on my backlog of books to review, and I agree with Camila that it’s absolutely worth grabbing a copy.

Leave a Comment

MMLSpark Is Now SynapseML

Mark Hamilton has an announcement:

Today, we’re excited to announce the release of SynapseML (previously MMLSpark), an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. Building production-ready distributed ML pipelines can be difficult, even for the most seasoned developer. Composing tools from different ecosystems often requires considerable “glue” code, and many frameworks aren’t designed with thousand-machine elastic clusters in mind. SynapseML resolves this challenge by unifying several existing ML frameworks and new Microsoft algorithms in a single, scalable API that’s usable across Python, R, Scala, and Java.

Read on to learn more about the library.

Leave a Comment

Documenting Power BI Dataset Measures

Gilbert Quevauvilliers thinks about documentation:

One thing that often happens is when users are using a dataset, they want to know which measures are available. And not only that sometimes they want to know the measure definition.

This got me thinking and how best could I give this to the users in my organization to be able to find this information quickly and easily.

In the past this was a manual effort not only to export the measures, but also to maintain a document, so that as measures are added, updated, or deleted I would then need to manually update some document.

Yep, you guessed it I created a Power BI report which has got all the measures and their measure definitions, which will update with the dataset! And I show you how I did this below.

Click through to see how.sfff

Leave a Comment

Clear out Those Old Container Images

Joy George Kunjikkur has a public service announcement for us:

When we use self-hosted Azure pipeline agents, we may encounter the below issue during the build process. This is not a hard issue to troubleshoot. The reason is there in the error message.

Error processing tar file(exit status 1): open /root/.local/share/NuGet/v3-cache/670c1461c29885f9aa22c281d8b7da90845b38e4$ps:_api.nuget.org_v3_index.json/nupkg_system.reflection.metadata.1.4.2.dat: no space left on device

This is known in the industry as a whoopsie-doops. Click through to see what you can do to resolve the problem.

Leave a Comment