Press "Enter" to skip to content

Day: March 18, 2020

CI/CD with Databricks

Sumit Mehrotra takes us through the continuous integration story around Databricks:

Development environment – Now that you have delivered a fully configured data environment to the product (or services) team in your organization, the data scientists have started working on it. They are using the data science notebook interface that they are familiar with to do exploratory analysis. The data engineers have also started working in the environment and they like working in the context of their IDEs. They would prefer a connection between their favorite IDE and the data environment that allows them to use the familiar interface of their IDE to code and, at the same time, use the power of the data environment to run unit tests, all in the context of their IDE.

Any disciplined engineering team would take their code from the developer’s desktop to production, running through various quality gates and feedback loops. As a start, the team needs to connect their data environment to their code repository on a service like git so that the code base is properly versioned and the team can work collaboratively on the codebase.

This is more of a conceptual post than a direct how-to guide, but it does a good job of getting you on the right path.
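If you want a concrete picture of that IDE-to-cluster connection, one option is Databricks Connect. A minimal sketch, assuming you have already run pip install databricks-connect and databricks-connect configure (workspace URL, access token, cluster ID):

    # With databricks-connect configured, ordinary PySpark code written in
    # your IDE executes against the remote Databricks cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # attaches to the configured cluster

    # A tiny unit-test-style check: written locally, executed on the cluster
    df = spark.range(10)
    assert df.count() == 10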


Distributed Model Training with Dask and SciKit-Learn

Matthieu Lamairesse shows us how we can use Dask to perform distributed ML model training:

Dask is an open-source parallel computing framework written natively in Python (initially released in 2014). It has a significant following and support, largely due to its good integration with the popular Python ML ecosystem triumvirate of NumPy, Pandas, and Scikit-learn.

Why Dask over other distributed machine learning frameworks? 

In the context of this article, it’s about Dask’s tight integration with Scikit-learn’s joblib parallel computing library, which allows us to distribute Scikit-learn code with (almost) no code change, making it a very interesting framework for accelerating ML training.

Click through for an interesting article and an example of using this on Cloudera’s ML platform.
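To make the “(almost) no code change” point concrete, here is a minimal sketch assuming a local Dask cluster; in a real deployment, Client() would point at a distributed scheduler instead:

    # The "dask" joblib backend is registered when dask.distributed is imported.
    import joblib
    from dask.distributed import Client
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    client = Client()  # spins up a local cluster of worker processes

    X, y = make_classification(n_samples=5000, n_features=20)
    search = GridSearchCV(
        RandomForestClassifier(),
        param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
        cv=3,
        n_jobs=-1,  # let joblib parallelize the cross-validation fits
    )

    # The only Dask-specific change: swap in the Dask backend, and the
    # CV fits fan out to the Dask workers instead of local processes.
    with joblib.parallel_backend("dask"):
        search.fit(X, y)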


Exploring the Extended Events Live Data Window

Grant Fritchey takes us through the Extended Events Live Data window:

One reason a lot of people don’t like Extended Events is because the output is in XML. Let’s face it, XML is a pain in the bottom. However, there are a bunch of ways around dealing with the XML data. The first, and easiest, is to ignore it completely and use the Live Data window built into SQL Server Management Studio.

I’ve written about the Live Data window before, and I’ve been using it throughout this series of posts on Extended Events. There’s a lot more to this tool than is immediately apparent. Today, we’re going to explore the basics of this tool.

Read on to see what you can do with this. It’s a lot more powerful than it first appears to be.
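If you want a session to watch while experimenting, here is a minimal sketch (the event and target choices are purely illustrative). Once it is running, right-click the session under Management > Extended Events > Sessions in SSMS and choose Watch Live Data:

    -- Illustrative session: capture completed batches with a couple of actions
    CREATE EVENT SESSION [QueryTracker] ON SERVER
    ADD EVENT sqlserver.sql_batch_completed
        (ACTION (sqlserver.database_name, sqlserver.sql_text))
    ADD TARGET package0.ring_buffer;
    GO
    ALTER EVENT SESSION [QueryTracker] ON SERVER STATE = START;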


Monitoring with WhoIsActive

Hadi Fadlallah looks at one of my favorite stored procedures:

For this reason, Adam Machanic (a Microsoft MVP since 2004) developed a more powerful stored procedure called “sp_whoisactive” to fill in the gap between the actual needs of DBAs and the currently provided procedures (sp_who and sp_who2).

In the following sections, we will talk briefly about the sp_who and sp_who2 stored procedures, and then we will illustrate how to download and use the sp_whoisactive stored procedure.

Though there is a nicer way to insert WhoIsActive output into a table.
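For reference, a quick sketch of basic usage plus the table-logging pattern built into the procedure itself (parameter names as documented for sp_WhoIsActive; the table name is just an example):

    -- Point-in-time look at current activity
    EXEC dbo.sp_WhoIsActive;

    -- Logging to a table: have the procedure generate matching DDL,
    -- create the table once, then write snapshots to it
    DECLARE @schema nvarchar(max);
    EXEC dbo.sp_WhoIsActive @return_schema = 1, @schema = @schema OUTPUT;
    SET @schema = REPLACE(@schema, '<table_name>', 'dbo.WhoIsActiveLog');
    EXEC (@schema);

    EXEC dbo.sp_WhoIsActive @destination_table = 'dbo.WhoIsActiveLog';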


Trimming Strings with T-SQL

Andy Levy saves us all several characters at a time:

Every now and then, we encounter data that needs to be cleaned up because it’s got leading and/or trailing spaces. Or maybe you’re storing short data in a CHAR(N) field, so when you query it, you’re getting trailing spaces. Since time immemorial, we’ve had to wrap these fields in rtrim(ltrim(fieldname)) to do the deed.

Effective with SQL Server 2017, that’s no longer the case. 

The eight keystrokes add up over time. In all seriousness, I am happy that TRIM() is a thing in SQL Server 2017. And Andy gives us a little bonus to make it worth your refactoring while.
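A quick illustration (sample values made up):

    -- The old pattern and its SQL Server 2017+ replacement
    SELECT LTRIM(RTRIM('   padded   '));   -- 'padded'
    SELECT TRIM('   padded   ');           -- 'padded'

    -- TRIM can also strip characters other than spaces
    SELECT TRIM('.,' FROM '...cleaned,,'); -- 'cleaned'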


Running SQL Server on a Windows Container

Jamie Wick takes us through the less-trodden path:

SQL Server containers are gaining popularity as a way of enhancing and standardizing development environments for Windows & Linux based SQL databases. SQL containers allow developers to have their ‘own’ dedicated copy of a database, usually without the need for extensive server infrastructure. Additionally, a single computer can host multiple containers, each with a different edition/version of SQL Server. This allows the user to quickly switch between environments, without the need to reinstall. Currently, a popular option for implementing containers on Windows-based computers is Docker.

For those not familiar with containerization, here is a Microsoft article on Windows containers.

I’d definitely prefer to use Linux containers, even on Windows machines. But if Windows-based containers are your thing (or you need to use them for some reason), Jamie’s got you covered.
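For reference, a hedged sketch of both flavors; image names and tags drift over time, so treat these as illustrative rather than canonical:

    # Linux-based SQL Server container (runs on Windows via Docker Desktop in Linux mode)
    docker run -d -p 1433:1433 -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=YourStrong!Passw0rd" mcr.microsoft.com/mssql/server:2019-latest

    # Windows-based image (this image name has moved around over the years and may be deprecated)
    docker run -d -p 1433:1433 -e ACCEPT_EULA=Y -e sa_password=YourStrong!Passw0rd microsoft/mssql-server-windows-developer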


The Iterative Nature of Index Tuning

Erik Darling explains why index tuning is typically not a one-shot process:

For many people, index tuning means occasionally adding an index when there’s a report about a slow query. Those indexes might come from a query plan, or from the missing index DMVs, where SQL Server stores every complaint the optimizer files when it thinks an index might make a query better.

Sure, there are some people who think index tuning means rebuilding indexes or running DTA and checking all the boxes, but I ban those IP addresses.

The prudent choice, of course. Read on for Erik’s thoughts on the subject, which are quite good.
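For reference, the optimizer’s complaints Erik mentions live in the missing index DMVs. A rough sketch of pulling the loudest ones (the impact-times-seeks ordering is a common heuristic, not gospel):

    -- Top "missing index" complaints, weighted by use and estimated impact
    SELECT TOP (10)
        d.statement AS table_name,
        d.equality_columns,
        d.inequality_columns,
        d.included_columns,
        s.user_seeks,
        s.avg_user_impact
    FROM sys.dm_db_missing_index_details AS d
    JOIN sys.dm_db_missing_index_groups AS g
        ON d.index_handle = g.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS s
        ON g.index_group_handle = s.group_handle
    ORDER BY s.user_seeks * s.avg_user_impact DESC;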
