Working with Azure Data Factory (ADF) enables me to build and monitor my Extract Transform Load (ETL) workflows in Azure. My ADF pipelines is a cloud version of previously used ETL projects in SQL Server SSIS.
And prior to this point, all my sample ADF pipelines were developed in so-called “Live Data Factory Mode” using my personal workspace, i.e. all changes had to be published in order to be saved. This hasn’t been the best practice from my side, and I needed to start using a source control tool to preserve and version my development code.
Click through for a detailed demo.
There are a lot of source control applications and software, everyone has its pros and cons, but personally, I like to use GitHub, since it is free to use and since it was recently acquired by Microsoft, support for other products is easier (SQL Server for this case).
On this post, I will show you how to implement a source control for a database using GitHub and Azure Data Studio (ADS).
Click through for the step-by-step instructions.
To clarify, Azure DevOps Server 2019 is the new name for TFS. Furthermore, it’s now rebranded and updated to look and feel like the Azure DevOps service. Which is the new name for Visual Studio team Services (VSTS) in Azure.
Therefore, I decided to do this because I am currently involved with Azure DevOps and using it with SQL Server deployments. Hence, I decided I wanted a local copy of my own to test things.
Click through for a couple of interesting findings, especially around required (and disallowed) versions of prerequisites like SQL Server versions.
If I could teach SQL to analysts who plan to work in industry data science, I’d start by sharing a few SQL Truths I’ve learned, and why I recommend tracking SQL queries in git. Here goes:
- You will *always* need that query again
- Queries are living artifacts that change over time
- If it’s useful to you, it’s useful to others (and vice versa)
Focusing on these points has led to my continuous adoption of a query library — a git repository for saving and sharing commonly (and uncommonly) used queries, all while tracking any changes made to these queries over time.
This is separate from keeping database objects (like table or procedure definitions) in source control.
But wait, I would need to learn an additional tool?
Yes, but don’t panic! Git is a tool with various commands that you can use to help track your changes. Luckily, you don’t need to know too many commands in Git to use the basic functionality. As an added bonus, using Git with RStudio takes away some of the burden of knowing Git commands by including buttons for common actions.
As with any tool that you pick up to help your scientific workflows, there is some upfront work before you can start seeing the benefits. Don’t let that deter you. Git can be very easy once you get the gist. Think about the benefits of being able to track changes: you can make some changes, have a record of that change and who made it, and you can tie that change to a specific problem that was reported or feature request that was noted.
It’s still code, and you gain a lot by keeping code in source control.
I almost always have trouble remembering which option is for use with a non-empty folder of “here’s a bunch of files that I want to dump into the repo to start with”, vs. “here’s an empty folder where I want to pull down the contents of an existing repo”. Fortunately, Tortoise yells at you if you try to do the latter — which is
Export— into a non-empty folder. So we want to
Import. Assuming you have a folder where all your SQL scripts live already, right-clicky and say “Tortoise SVN .. Import.”
Check it out. The only concern I have is that this source control is just local source control. That’s very helpful in situations where you accidentally mess something up, but my preference is to put my code in the same source control system the developers are using. And if the developers aren’t using source control, get that institution in place as soon as possible because that’s begging for trouble.
Fantastic, one build task created! How easy was that??
Let’s test by running: –az acr build-task run --registry TestContainerRegistry01 --name buildsqlimage
And the progress of the build task can be monitored: –az acr build-task logs --registry TestContainerRegistry01
Andrew gives us the step-by-step details, so check it out.
Yeah, things get messy, no matter what model you choose for securing your PROD data:
- Create one user and give it all rights to the database
- Create specific users and give them the least amount of rights to do what is must, and no more
- Somewhere in between the previous 2
Truly, #3 is generally the answer. Let’s say that you give the application all the rights that any user of the system can have, and let the application dole out the rights to individuals. This is not a terrible plan, but I dare say that many databases contain data, or utilities that it is not desirable to give to the users. (My utility schema generally has tools to maintain and release code, something that you don’t want general users to have access to. And lest you have a developer working “with” you like I once did, you don’t want the application to have access to the tools to disable all of the constraints in the database, even if you have ETL uses for that code.)
Check it out for some examples.
For the advocates of python, a commonly used application is Jupyter Notebook. Jupyter Notebook is a server-client application that allows editing and running of python code via a web browser combining python code, SQL, equations, text, and visualizations. It also offers syncing with GitHub repositories.
More specifically, Jupyter Notebook will be rendered by GitHub directly on your repo page. This means that one can enjoy all the benefits that Git offers regarding version control, branching, merging and collaborative development when using Jupyter Notebook.
The best strategy is probably a multi-tiered strategy. It absolutely starts with source control, but it doesn’t have to end there.
I will assume that you already have TFS setup and are just using HTTP and want to make things a bit more secure with HTTPS. I am also assuming that you will be using port 443 for HTTPS traffic.
To update TFS to use HTTPS you need to do a couple of things:
Have a legitimate certificate installed on the server that you can bind to
Have an IP address on the server and have firewall access setup to that IP address on port 443
But there are a few more steps as well, so click through to see them all.