
Category: Source Control

Adding an Existing Data Factory to GitHub

Andy Leonard has a three-parter for us. Part 1 shows you how to create a GitHub account and repo:

The unabridged topic of source control with GitHub is beyond the scope of this post. There are a number of ways to accomplish the tasks described in this post and series. I welcome your suggestions in the comments.

This post is written to help Azure Data Factory developers get started using GitHub.

Part 2 connects a Data Factory to the repository:

For the purposes of this demo, accept the defaults for “Publish branch” and “Root folder.” Check the “Import existing resources to repository” checkbox under the “Import existing resource” property, select the main branch in the “Import resource into this branch” property, and then click the “Apply” button:
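For readers who would rather see these settings as data than as portal clicks, here is a minimal sketch of the Git configuration a factory carries once it is connected to GitHub. The property names follow my understanding of the `repoConfiguration` section of the Data Factory resource definition, so verify them against the current schema before relying on them. The account and repository names are hypothetical placeholders, and the defaults assume the `main` branch and a root folder of `/`, as in the demo.

```python
import json


def github_repo_configuration(account_name, repository_name,
                              collaboration_branch="main", root_folder="/"):
    """Build a repoConfiguration-style dictionary for a GitHub-backed factory.

    Assumption: the key names mirror the Data Factory resource schema;
    the values passed in are illustrative only.
    """
    return {
        "type": "FactoryGitHubConfiguration",
        "accountName": account_name,
        "repositoryName": repository_name,
        "collaborationBranch": collaboration_branch,
        "rootFolder": root_folder,
    }


if __name__ == "__main__":
    # Hypothetical account and repository names.
    config = github_repo_configuration("example-account", "example-adf-repo")
    print(json.dumps(config, indent=2))
```

Keeping the branch and root folder as defaulted parameters matches the "accept the defaults" advice while still making it obvious where you would change them.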

Part 3 handles changes:

Applying what we’ve configured and learned thus far, let’s put this to work in a code-management workflow.

When it’s time to make a change, first create a new branch. I can hear some of you thinking, “Why, Andy? Why create a new branch?” That’s an excellent question. I am so glad you asked! Think of the new branch as a temporary copy of the current state of my Azure Data Factory. 

This series works from the assumption that you don’t have any real experience with Git (or GitHub) for source control, and maybe not much source control experience at all.
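Andy's walkthrough does its branching inside the Data Factory UI. As a rough command-line equivalent of the branch-first workflow from Part 3, the sketch below lists the Git commands in order. The branch name and commit message are hypothetical, and the function only builds the commands rather than running them.

```python
import shlex


def feature_branch_commands(branch_name, commit_message, base_branch="main"):
    """Return the Git commands for a branch-first change, in order."""
    return [
        ["git", "checkout", base_branch],            # start from the shared branch
        ["git", "pull", "origin", base_branch],      # make sure it is up to date
        ["git", "checkout", "-b", branch_name],      # create the temporary working branch
        ["git", "add", "--all"],                     # stage the change
        ["git", "commit", "-m", commit_message],     # record it on the new branch
        ["git", "push", "--set-upstream", "origin", branch_name],
    ]


if __name__ == "__main__":
    # Hypothetical branch name and message.
    for command in feature_branch_commands("feature/add-pipeline", "Add pipeline"):
        print(shlex.join(command))
```

After the push, the change would typically go back to the main branch through a pull request on GitHub.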

Comments closed

Updates to AzureDevOps-AzureSQLDatabase Repo

Kevin Chant updates a repo:

In this post I want to cover some significant updates to an Azure SQL Database repository that I have been doing for one of the public GitHub repositories that I share.

I have updated the AzureDevOps-AzureSQLDatabase repository, which contains an example of a SQL Server database project that you can use to perform CI/CD on an Azure SQL Database using Azure DevOps.

It does this by using the popular state-based migration method of creating a dacpac file based on the contents of a database project. From there, the dacpac file can be used to update one or more databases.

Click through for those updates.
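To make the "one dacpac, one or more databases" idea concrete, here is a minimal sketch that builds one SqlPackage publish command per target database. It is not taken from Kevin's repository, which drives deployments from an Azure DevOps pipeline. The file, server, and database names are hypothetical, and authentication arguments are deliberately left out.

```python
def publish_commands(dacpac_path, server_name, database_names):
    """Build one SqlPackage publish command per target database.

    Assumption: SqlPackage is on the PATH and credentials are supplied
    separately; this sketch only assembles the arguments.
    """
    return [
        [
            "SqlPackage",
            "/Action:Publish",
            f"/SourceFile:{dacpac_path}",
            f"/TargetServerName:{server_name}",
            f"/TargetDatabaseName:{database_name}",
        ]
        for database_name in database_names
    ]


if __name__ == "__main__":
    # Hypothetical dacpac, server, and database names.
    commands = publish_commands(
        "bin/Release/ExampleDb.dacpac",
        "example-server.database.windows.net",
        ["ExampleDb_Test", "ExampleDb_Prod"],
    )
    for command in commands:
        print(" ".join(command))
```

Because the dacpac describes the desired end state, the same file is reused for every target, and the publish step works out the changes each database needs.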


Choosing Azure DevOps or GitHub Actions

Sarah Dutkiewicz compares and contrasts:

Every time I do an Azure DevOps talk, I get someone asking me about migrating from GitHub to Azure DevOps. Every time, I have to ask “Why do you want to migrate from GitHub to Azure DevOps?” Why would you choose between Azure DevOps and GitHub? Or better yet – do you have to choose between them? Let’s look at how they compare and the tooling available.

This is a really tough question and Sarah helps explain why.


Sources of Data Structure Truth

Deb Melkin performs database epistemology:

The “source of truth” is my newly made up phrase for whatever you are using to say this is my database schema and initial data needed to start up the application. This can be your script directory; this can be a dacpac or bacpac; this can be your data model; this can be a combination of these things. My go-to “source of truth” right now is my source control repository. I’ve got both the schema and the default data needed in the same location. In the past, I would have probably included the data model as a way to help me make sure whatever database table changes I have in my source control are there, especially for that one database which only had tables and views. (A different rant for a different time.) Whatever you use, it absolutely CANNOT be an actual database. There are two main reasons for this:

Read on for those reasons.


Database Project Versioning and Identification

Eitan Blumin answers an important question:

“What is SSDT“, you ask? Oh, you didn’t? Well, let me tell you anyway! SSDT is the go-to solution from Microsoft for versioning SQL Server databases and performing state-based deployments (and it’s free!). It has many useful capabilities for developing and publishing changes from your SQL Database project to your SQL Database in production (or wherever).

One of the things that is not so clear about SSDT specifically, and database versioning in general, is how to identify which “version” of your database project was last deployed to your server.

Eitan includes several ways of tracking and controlling database versions.


Databases, Applications, and Source Control Repos

Eitan Blumin asks and answers a question:

Following the rise in popularity of DevOps for Databases, many interesting questions are being asked on the topic.

One of these questions is: Should your SQL Database project be in the same source control repository and solution as the App code project? Or maybe they should be in the same repository but separate solutions? Or maybe they should be in completely separate repositories?

Pre-registering my answer here: for most organizations, databases should be in a separate repository. The deployment cadence is different, the deployment mechanism is different, and the people working on each likely differ. Read on for Eitan’s thoughts, which get into more of the nuance behind the answer.


Filling in GitHub Repo Details

Kevin Chant practices GitHub hygiene:

To clarify, GitHub hygiene is a term that I use to describe the practice of keeping GitHub repositories healthy.

Some of you have probably noticed I have been doing this more recently. With this in mind, I thought I would share what I have been doing in this post for a couple of reasons.

First of all, to help raise awareness about some of the best practices I have been doing.

Secondly, because I am interested to get feedback from other members of the Microsoft Data Platform community about this. For example, do you also follow the same practices?

This is a reminder that there’s a lot you can include in a GitHub repo aside from the code itself.


“Unsafe Repository” When using Git

Niels Berglund sees something odd:

Every 6 – 9 months (or so), I clean up my development PC just to keep it “lean and mean”. I do it by formatting the hard-drive partition the OS (in this case, Windows) is on, followed by a new install. Recently I had a four-day weekend here in SA. Four glorious days off, a perfect time to “nuke” my PC and re-install!

Off I go, everything goes to plan (Chocolatey is my friend), and after a while, I am done (or as done as one can be). At this stage, I needed to do my weekly roundup blog post for the week gone by, and as I had done some changes to the GitHub repo from my MacBook Pro, I wanted to do a git pull in the repo directory for my blog. Part of the story is that on my dev PC, I have all my repos on a separate partition from the system partition, so the non-system partition was un-affected by the reformat (or so I thought). Imagine my surprise when doing the git pull I got:

Click through to see the error and root cause.
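Assuming the error is the repository ownership check that Git introduced in version 2.35.2 (which is what the post title suggests), the usual workaround is to add the repository to Git's `safe.directory` list. The sketch below only builds that command; the path is a hypothetical example, and Niels's post remains the place to go for the actual diagnosis.

```python
from pathlib import PureWindowsPath


def mark_safe_directory_command(repo_path):
    """Build the git config command that trusts a repository directory.

    Git reports the offending path with forward slashes, so this sketch
    converts a Windows-style path to the same form.
    """
    normalized = PureWindowsPath(repo_path).as_posix()
    return ["git", "config", "--global", "--add", "safe.directory", normalized]


if __name__ == "__main__":
    # Hypothetical repository location on a non-system partition.
    print(" ".join(mark_safe_directory_command(r"D:\repos\blog")))
```

Only mark a directory as safe when you know who created it, since the check exists to stop Git from acting on configuration in a repository owned by another user.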


Git Native Support for Databricks Workflows

Vaibhav Sethi and Roland Faeustlin make an announcement:

We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow, for example, a notebook from the main branch of a repository on GitHub can be used in a notebook task. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improve reproducibility as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks supported Git providers including GitHub, GitLab, Bitbucket, Azure DevOps and AWS CodeCommit.

Read on to see how it works.
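As a rough illustration of what "a notebook from the main branch of a repository on GitHub" looks like in a job definition, here is a minimal sketch of a job payload. The field names follow my recollection of the Databricks Jobs API 2.1 and should be checked against the current reference. The job name, repository URL, and notebook path are hypothetical, and cluster settings are omitted.

```python
import json


def git_backed_notebook_job(job_name, git_url, notebook_path, branch="main"):
    """Build a job definition whose notebook task reads from a Git branch.

    Assumption: key names mirror the Jobs API 2.1 payload; compute
    configuration is left out to keep the sketch short.
    """
    return {
        "name": job_name,
        "git_source": {
            "git_url": git_url,
            "git_provider": "gitHub",
            "git_branch": branch,
        },
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {
                    # Path is relative to the repository root.
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical job name, repository, and notebook path.
    job = git_backed_notebook_job(
        "nightly-refresh",
        "https://github.com/example-org/example-repo",
        "notebooks/refresh",
    )
    print(json.dumps(job, indent=2))
```

Pointing the job at a branch keeps production tracking whatever is merged there; the announcement notes that each run is still tied to a specific commit hash.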


Discovering Data Drift with DVC

Milecia McGregor looks at a version control system for ML projects (and data):

What happens when the machine learning model you’ve worked so hard to get to production becomes stale? Machine learning engineers and data scientists face this problem all the time. You usually have to figure out where the data drift started so you can determine what input data has changed. Then you need to retrain the model with this new dataset.

Retraining could involve a number of experiments across multiple datasets, and it would be helpful to be able to keep track of all of them. In this tutorial, we’ll walk through how using DVC, an open source version control system for machine learning projects, can help you keep track of those experiments and how this will speed up the time it takes to get new models out to production, preventing stale ones from lingering too long.

My team is working on integrating DVC. It’s a really good project for analytics teams, as it extends the notion of version control to datasets and helps you tie in code (source control), models (tools like MLflow), and data.
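To show how DVC ties a dataset version to a Git commit, here is a minimal sketch of the commands for tracking one data file. It only builds the command list; the file path and commit message are hypothetical, and the final `dvc push` assumes a DVC remote has already been configured.

```python
import posixpath


def track_dataset_commands(dataset_path, commit_message):
    """Return the DVC and Git commands that version a single dataset file."""
    # `dvc add` writes a small pointer file next to the data and lists the
    # data file in a .gitignore in the same directory.
    pointer_file = dataset_path + ".dvc"
    ignore_file = posixpath.join(posixpath.dirname(dataset_path), ".gitignore")
    return [
        ["dvc", "add", dataset_path],
        ["git", "add", pointer_file, ignore_file],
        ["git", "commit", "-m", commit_message],
        ["dvc", "push"],  # upload the data itself to the configured remote
    ]


if __name__ == "__main__":
    # Hypothetical dataset path and commit message.
    for command in track_dataset_commands("data/training.csv", "Track training data"):
        print(" ".join(command))
```

Git stores only the small pointer file, so checking out an older commit and running `dvc checkout` brings back the matching version of the data, which is what makes it possible to trace when the inputs changed.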
