Press "Enter" to skip to content

Category: Source Control

Git Native Support for Databricks Workflows

Vaibhav Sethi and Roland Faeustlin make an announcement:

We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow, for example, a notebook from the main branch of a repository on GitHub can be used in a notebook task. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improve reproducibility as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks supported Git providers including GitHub, Gitlab, Bitbucket, Azure Devops and AWS CodeCommit.

Read on to see how it works.

Comments closed

Discovering Data Drift with DVC

Milecia McGregor looks at a version control system for ML projects (and data):

What happens when the machine learning model you’ve worked so hard to get to production becomes stale? Machine learning engineers and data scientists face this problem all the time. You usually have to figure out where the data drift started so you can determine what input data has changed. Then you need to retrain the model with this new dataset.

Retraining could involve a number of experiments across multiple datasets, and it would be helpful to be able to keep track of all of them. In this tutorial, we’ll walk through how using DVC, an open source version control system for machine learning projects, can help you keep track of those experiments and how this will speed up the time it takes to get new models out to production, preventing stale ones from lingering too long.

My team is working on integrating DVC. It’s a really good project for analytics teams, as it extends the notion of version control to datasets and helps you tie in code (source control), models (tools like MLflow), and data.

Comments closed

Logic Apps: Source Control and Deployment

Koen Verbeeck has a two-parter. First up is storing Logic App code in source control:

At a data warehouse project I’m using a couple of Logic Apps to do some lightweight data movements. For example: reading a SharePoint list and dumping the contents into a SQL Server table. Or reading CSV files from a OneDrive directory and putting them in Blob storage. Some of those things can be done in Azure Data Factory as well, but it’s easier and cheaper to do them with Logic apps.

Logic Apps are essentially JSON code behind the scenes, so they should be included into the source control system of your choice (for the remainder of the blog post we’re going to assume this is git).

The second post covers deployment:

It’s easy to duplicate an Azure Logic App in a resource group, but unfortunately you cannot duplicate a Logic App between environments (you might try to copy paste the JSON though). So unless you want to hand craft every Logic App yourself on each of your environments, you need a way to automatically deploy your Logic Apps. It’s easier, faster and less error-prone than any manual method.

Check out both posts.

Comments closed

PSProjectStatus

Jeffery Hicks wants to check Git status:

I write a lot of PowerShell modules. And probably like you, I am working on more than one project at a time. I was finding it difficult to keep track of what I was working on and what I might be neglecting. So I turned to PowerShell and created a tool that I use to keep on top of my projects. The PowerShell module is called PSProjectStatus and you can install it from the PowerShell Gallery. You can find the project on GitHub, but I thought I’d provide an introduction here.

Read on to see how it works.

Comments closed

Version Control for SSMS Templates

Kevin Chant saves some templates:

Previously I wrote a post about how to do version control for SQL Server Management Studio templates using Azure Repos. I wanted to highlight some things I did not point out in that post. In addition, I thought it was only fair that I showed how to do it with GitHub.

Plus, in my last T-SQL Tuesday post I mentioned the SQL Server diagnostic queries provided by Glenn Berry. Which reminded me to do this post. Because I want to do an example based on sharing one of the queries with your colleagues via GitHub. Like in the below diagram.

Click through to see the process.

Comments closed

Identifying R Functions and Packages in GitHub Gists

Bryan Shalloway looks at gists:

A problem I bumped into was that most of Chelsea’s gists don’t actually have .R or .Rmd extensions so my approach skipped most of her snippets. I wanted to parse my own gists but ran into a related problem that most of my github gist code snippets are saved as .md files1.

In this post I…

1. create a function to extract code chunks from simple .md files

2. parse the functions and packages in my code using funspotr.

Click through to see the code in action.

Comments closed

Handling Merge Conflicts with SSAS Tabular Projects

Richard Swinbank fights Visual Studio:

I sometimes find working with Visual Studio’s projects a challenge in multi-developer environments, because each project type seems to have its own vulnerability to Git merge conflicts. In the case of SSAS tabular, I’ve found two issues to be a regular source of conflicts:

Click through to see what those two causes are and what you can do to reduce the risk of having either one burn you.

Comments closed

Databricks Integration with Git Repos

Ka-Hing Chueng and Vaibhav Sethi announce Databricks Repos is now generally available:

Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.

Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps and UIs to simply review and commit code. Not only is this time-consuming, but it’s also error-prone.

This has been a bit of a pain point with Databricks in the past, and they’ve come up with this solution. Given that Azure Synapse Analytics has some of the same pain points, I’d expect we’ll see something similar in time.

Comments closed