Minikube is a good learning tool and Microsoft provides instructions for deploying a big data cluster to this ‘Platform’. However, its single node nature and the fact that application pods run on the master node means that this does not reflect a cluster that anyone would run in production. Kubernetes-as-a-service is probably by far the easiest option for spinning a cluster up, however it relies on an Aws, Azure or Google Cloud Platform account, hence there is a $ cost associated with this. This leaves a vanilla deployment of Kubernetes on premises. Based on the assumption that most people will have access to Windows server version 2008 or above, a relatively cheap and way of deploying a Kubernetes cluster is via Linux virtual machines running on Hyper-V. This blog post will provide step by step instructions for creating the virtual machines to act as the master and worker nodes in the cluster.
This is going on my “try this out when I have time” list. So expect a full report sometime in the year 2023.
Whatever you elect to do there will always be a master branch, where you go from here depends on whether you favor branching or feature toggles. Wikipedia provides a nice definition of what a feature toggle is, thus:
A feature toggle (also feature switch, feature flag, feature flipper, conditional feature, etc.) is a technique in software development that attempts to provide an alternative to maintaining multiple source-code branches (known as feature branches), such that a feature can be tested even before it is completed and ready for release. Feature toggle is used to hide, enable or disable the feature during run time. For example, during the development process, a developer can enable the feature for testing and disable it for other users.
A branch is initially a clone of the master branch to begin with, developers work on the branch. Once the work on that branch is code complete and it has been tested to satisfaction, it is merged into the master. An overview of the branching and merging process is provided in the Git documentation here.
The continuous integration and delivery purist are not great fans of branches and prefer the ethos of integrating changes into one place to be rigidly adhered to, ergo one code branch only. However, in practice you will find that most projects have to come up with some sane branching strategy. The subject of branching is a topic in its own right, suffice it to say there is an overhead in applying changes across multiple branches and overheads involved in merging into the master branch. Therefore, there needs to be some governance and rigor applied around the number of branches in the source code repository.
Chris then shows us how to create a multi-branch pipeline with Jenkins.
The mainstay of my presentation material this year has been my deck on continuous integration, Docker and Jenkins. For people who have not had the chance to see this presentation or have seen it and wanted to get some more context around it, I have written this first in a series of posts. Much, in fact just about all of the material in this post features in other posts on my blog. The aim of this set of posts is to present the material in a more digestible manner for people who might not be fully fully familiar with Docker and Jenkins.
This first post will cover an introduction to Jenkins and use of the “Sidecar pattern” for deploying DACPACs to. Subsequent posts will expand on this to include:
- Multi branch build pipelines
- Unit testing with tSQLt
- The management of database state via Docker volumes
Many people in the SQL Server community have displayed a great interest in containers, only to be left scratching their heads thinking “Well, that is nice, but what can I practically use them for ?”. In my humble opinion, spinning up SQL Server inside a container as a deployment target for a continuous integration pipeline, is one of the, if not the best ways to leverage SQL Server and Docker.
I’m looking forward to the rest of the series.
But this had to be done.
I thanked Geoff and hung up the phone. I then made another judgment call and exercised yet more of my ETL Architect authority. I assigned the PrUAT ticket to myself, logged into PrUAT, executed the patch, copied the output of the execution to the Notes field of the ticket (as we’d trained all DBAs and Release Management people to do), and then manually verified the patch was, in fact, deployed to PrUAT.
I closed the ticket and called my boss. “Done. And verified,” I said. My boss replied, “Good,” and hung up. He passed the good news up the chain.
A funny thing happened the next morning. And by “funny,” I mean no-fun-at-all. My boss called and asked, “Andy? I thought you said the patch was was deployed to PrUAT.” I was a little stunned, grappling with the implications of the accusation. He continued, “The process failed again last night and vendor checks were – again – not cut.” I finally stammered, “Let me check on it and get back to you.”
It’s a good story and really sells the idea that you have to track deployment details, including when you’re doing manual deployments.
Stop jobs that process CDC (SSIS).
Inside a transaction with isolation level serializable: Alter Table schema and create temporary CDC table
Copy old CDC rows to new table excluding dup rows (based on [__$seqval])
Disable old (original) CDC table (schema is outdated). Will drop table
Click through for the rest of the steps and an example script.
The documentation for how to specify build steps in YAML is still a work in progress. In summary, the current procedure is to visit the VSTS Tasks repo on GitHub, open the folder for the task your are interested in, and take a look at the
In our case, the first task we need is MSBuild, to build the database project. Looking inside task.json, we can see that the name of the task we need is
MSBuild, and that there are a huge number of available
inputswe can use to configure the task;
solutionto specify the project or solution to build,
configuration, and many more. In our case, we’ll just specify the path to our
.sqlprojfile and let
msbuildtake care of the rest.
Gavin walks us through the entire process step by step, though as he notes, some of these steps are liable to change as the VSTS team continues implementation.
I have finished my series on launching a data science project. First, I have a post on deploying models as microservices:
The other big shift is a shift away from single, large services which try to solve all of the problems. Instead, we’ve entered the era of the microservice: a small service dedicated to providing a single answer to a single problem. A microservice architecture lets us build smaller applications geared toward solving the domain problem rather than trying to solve the integration problem. Although you can definitely configure other forms of interoperation, most microservices typically are exposed via web calls and that’s the scenario I’ll discuss today. The biggest benefit to setting up a microservice this way is that I can write my service in R, you can call it from your Python service, and then some .NET service could call yours, and nobody cares about the particular languages used because they all speak over a common, known protocol.
One concern here is that you don’t want to waste your analysts time learning how to build web services, and that’s where data science workbenches and deployment tools like DeployRcome into play. These make it easier to deploy scalable predictive services, allowing practitioners to build their R scripts, push them to a service, and let that service host the models and turn function calls into API calls automatically.
But if you already have application development skills on your team, you can make use of other patterns. Let me give two examples of patterns that my team has used to solve specific problems.
At this point in the data science process, we’ve launched a product into production. Now it’s time to kick back and hibernate for two months, right? Yeah, about that…
Just because you’ve got your project in production doesn’t mean you’re done. First of all, it’s important to keep checking the efficacy of your models. Shift happens, where a model might have been good at one point in time but becomes progressively worse as circumstances change. Some models are fairly stable, where they can last for years without significant modification; others have unstable underlying trends, to the point that you might need to retrain such a model continuously. You might also find out that your training and testing data was not truly indicative of real-world data, especially that the real world is a lot messier than what you trained against.
The best way to guard against unbeknownst model shift is to take new production data and retrain the model. This works best if you can keep track of your model’s predictions versus actual outcomes; that way, you can tell the actual efficacy of the model, figuring out how frequently and by how much your model was wrong.
This was a fun series to write and will be interesting to come back to in a couple of years to see how much I disagree with the me of now.
No one can deny how large the online support community for data science is. Today, it’s possible to teach yourself Python and other programming languages in a matter of weeks. And if you’re ever in doubt, there’s a StackOverflow thread or something similar waiting to give you the perfect piece of code to help you.
But when it came to pushing it to production, we found very little documentation online. Most data scientists seem to work on Python notebooks in a silo. They process large volumes of data and analyze it — but within the confines of Jupyter Notebooks. And most of the resources we’ve found while growing as data scientists revolve around Jupyter Notebooks.
Another option might be to use JupyterHub.
There’s two things going on here (and one hidden thing):
The first two messages point out that a procedure is referencing the column
ColdRoomSensorNumberwith schemabinding. The reason it’s using schemabinding is because it’s a natively compiled stored procedure. And that tells me that the table
Warehouse.ColdRoomTemperaturesis an In-Memory table. That’s not all. I noticed another wrinkle. The procedure takes a table-valued parameter whose table type contains a column called
ColdRoomSensorLabel. We’re going to have to replace that too. Ugh. Part of me wanted to look for another example.
The last message tells me that the table is a system versioned table. So there’s a corresponding archive table where history is maintained. That has to be dealt with too. Luckily Microsoft has a great article on Changing the Schema of a System-Versioned Temporal Table.
One last thing to worry about is a index on
ColdRoomSensorNumber. That should be replaced with an index on
ColdRoomSensorLabel. SSDT didn’t warn me about that because apparently, it can deal with that pretty nicely.
I’m glad that Michael went with a more complex example—it’s easy to tell this story with a simple procedure versioning, but in seeing a larger change, you can see the rhythm in the process—it’s all the same pattern of steps over and over.
There are two main kinds of SQL queries. SELECT/INSERT/UPDATE/DELETE statements are examples of Data Manipulation Language (DML). CREATE/ALTER/DROP statements are examples of Data Definition Language (DDL).
With schema changes – DDL – we have the added complexity of the SCH-M lock. It’s a kind of lock you don’t see with DML statements. DML statements take and hold schema stability locks (SCH-S) on the tables they need. This can cause interesting blocking chains between the two types where new queries can’t start until the schema change succeeds
Click through for suggestions with regard to schema locks, as well as a few tips for modifying large tables.