Month: April 2020

Installing SSMS on Servers Running SQL Server?

Andy Mallon says yes, install SQL Server Management Studio on those servers running SQL Server instances:

“But wait, Andy. That’s not a best practice!” you say?

The pseudo best practice of “don’t install SSMS” is a misguided one–advice that I even fell into repeating in the past. However, that’s actually a proposed solution to a best practice, rather than being itself a best practice.

I agree with Andy wholeheartedly on this.

Comments closed

The Importance of Unit Testing Database Code

Chris Johnson shares some thoughts on unit testing database code:

This is a topic that is quite close to my heart. I didn’t come from a computing background before I started working with SQL Server, so I was quite ignorant when it came to a lot of best practices that other developers who have worked with other languages are aware of. Because of this, I had no idea about unit testing until I attended a talk at a SQL Saturday all about tSQLt. If anyone isn’t aware (as I wasn’t), tSQLt is a free open source unit testing framework for use in SQL Server databases. It is the basis of Redgate’s SQL Test software, and is the most used framework for writing unit tests in SQL Server.

Since then I’ve worked to try and get employers to adopt this as part of a standard development life cycle, with mixed success at best. My current employer is quite keen, but there are two major problems. First, we have a huge amount of legacy code that obviously has no unit tests in place; and second, the way people code is not conducive to unit testing.

Click through for additional thoughts on writing good tests and an example of modularizing code to make it more testable. I’m still in the camp of “test what you can, but you can’t test everything” with databases. There’s just too much state dependency.

Tidy Simulation of Stochastic Processes in R

David Robinson shows off my favorite distribution:

The Riddler puzzle describes a Poisson process, which is one of the most important stochastic processes. A Poisson process models the intuitive concept of “an event is equally likely to happen at any moment.” It’s named because the number of events occurring in a time interval of length $t$ is distributed according to $\mathrm{Poisson}(\lambda t)$, for some rate parameter $\lambda$ (for this puzzle, the rate is described as one per day, $\lambda = 1$).

How can we simulate a Poisson process? This is an important connection between distributions. The waiting time for the next event in a Poisson process has an exponential distribution, which can be simulated with rexp().

Read on to learn about the Poisson distribution and Yule processes.
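
As a rough illustration of the connection described in the excerpt, here is a minimal Python sketch (David's post uses R and `rexp()`; this is not his code). It builds event times by accumulating exponential waiting times from the standard library's `random.expovariate`. The function names, the 10-day horizon, and the seed are assumptions made for the example.

```python
import random


def simulate_poisson_process(rate, horizon, rng):
    """Return sorted event times in [0, horizon) for a Poisson process.

    Each gap between consecutive events is an independent exponential
    draw with the given rate, mirroring what rexp() provides in R.
    """
    times = []
    t = rng.expovariate(rate)  # waiting time until the first event
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)  # exponential gap to the next event
    return times


def mean_event_count(rate, horizon, trials, seed=0):
    """Average number of events per simulated run (hypothetical helper)."""
    rng = random.Random(seed)
    total = sum(
        len(simulate_poisson_process(rate, horizon, rng)) for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # With rate 1 per day over 10 days, the average count should land near 10.
    print(mean_event_count(rate=1.0, horizon=10.0, trials=2000))
```

The count of events in each run is what follows the Poisson distribution, so the average over many runs should sit close to the rate times the horizon.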

Changing the Graphics Device in RMarkdown Docs

Colin Gillespie shows us how to change PDF and PNG output settings within knitr:

In many workflows, function calls to graphic devices are not explicit. Instead, the call is made by another package, such as knitr.

When knitting an Rmarkdown document, the default graphics device when creating PDF documents is grDevices::pdf() and for HTML documents it’s grDevices::png(). As we demonstrated, these are the worst possible choices!

Click through to see what you can do about it.

Clarifying Nomenclature around Azure Synapse Analytics

James Serra clears a few things up:

I see a lot of confusion among many people on what features are available today in Azure Synapse Analytics (formerly called Azure SQL Data Warehouse) and what features are coming in the future. Below is a picture (click to zoom) that I describe below that hopefully clears things up:

I tend to just say “Azure Synapse Analytics SQL Pools” for the product formerly known as Azure SQL Data Warehouse and save “Azure Synapse Analytics” to include Spark + hyperscale (James’s v3).

Why Unit Testing in the Database Is Tough

Rob Farley talks about a couple of reasons why database unit testing can be difficult to do:

Hamish wants to develop a conversation about unit testing within database because he recognises that the lack of unit testing is a significant problem. It’s quite commonplace in the world of iterative code, of C#, Java, and those kinds of languages, but a lot less commonplace in the world of data. I’m going to look at two of the reasons why I think this is.

Read Rob’s thoughts in their entirety. I fully agree that we need to test, but get wishy-washy on the topic of automated testing. The reason for that is that tooling is quite limited, and many of those limitations are inherent limits in the database platform itself. For the types of things you most need to test (like hefty stored procedures), the number of test cases spirals out of control quickly. And unlike functional or structured programming languages, T-SQL performance gets markedly worse as you modularize, which makes it so difficult to get down to an easily testable block of code.

SQL Server Backup History

Dave Bland talks about a few useful tables in msdb:

How long a database takes to back up is something that I have been asked to report many times over the years. These requests come for different reasons: sometimes it is to find out how much the duration has increased over time, sometimes it is to see if the backup job is interfering with other jobs, and sometimes it isn’t about duration at all and is more about showing that the backups were completed. Over the years I have had a number of auditors ask for backup history.

In order to get this information we need to pull data from two tables in the MSDB database, backupset and backupmediafamily.

Read on to learn about these two tables and to get a sample query. On systems with a large number of databases and a DBA who loves frequent transaction log backups (like I do), these tables can get pretty big, so don’t forget to prune that data over time.
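
To make the shape of that join concrete, here is a small Python sketch. It is not Dave's query. It uses an in-memory SQLite database as a stand-in for msdb, with cut-down copies of `backupset` and `backupmediafamily` that keep only a few of the real column names (`media_set_id`, `database_name`, `type`, `backup_start_date`, `backup_finish_date`, `physical_device_name`). The sample rows, file paths, and function names are invented for illustration.

```python
import sqlite3
from datetime import datetime


def build_sample_db():
    """Create an in-memory stand-in for the two msdb tables (invented data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE backupset (
            media_set_id INTEGER, database_name TEXT, type TEXT,
            backup_start_date TEXT, backup_finish_date TEXT);
        CREATE TABLE backupmediafamily (
            media_set_id INTEGER, physical_device_name TEXT);
        INSERT INTO backupset VALUES
            (1, 'Sales', 'D', '2020-04-01 01:00:00', '2020-04-01 01:12:30'),
            (2, 'Sales', 'L', '2020-04-01 02:00:00', '2020-04-01 02:00:05');
        INSERT INTO backupmediafamily VALUES
            (1, '/backups/sales_full.bak'),
            (2, '/backups/sales_log.trn');
    """)
    return conn


def backup_durations(conn):
    """Join the tables on media_set_id; return (db, type, seconds, device)."""
    query = """
        SELECT bs.database_name, bs.type,
               bs.backup_start_date, bs.backup_finish_date,
               bmf.physical_device_name
        FROM backupset AS bs
        JOIN backupmediafamily AS bmf
          ON bmf.media_set_id = bs.media_set_id
        ORDER BY bs.backup_start_date
    """
    rows = []
    for name, btype, start, finish, device in conn.execute(query):
        elapsed = datetime.fromisoformat(finish) - datetime.fromisoformat(start)
        rows.append((name, btype, int(elapsed.total_seconds()), device))
    return rows


if __name__ == "__main__":
    for row in backup_durations(build_sample_db()):
        print(row)
```

Against a real instance, the same join would target `msdb.dbo.backupset` and `msdb.dbo.backupmediafamily`, and the duration would more likely be computed in T-SQL than in client code.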

Incremental Refresh with Power BI

Chris Webb talks about a special use case for Power BI incremental refresh:

Power BI incremental refresh is a very powerful feature, and now that it’s available in Shared capacity (not just Premium), everyone can use it. It’s designed for scenarios where you have a data warehouse running on a relational database, but with a little thought you can make it do all kinds of other interesting things; Miguel Escobar’s recent blog post on how to use incremental refresh for files in a folder is a great example of this. In this post I’m going to show you how to use incremental refresh to solve another very common problem – namely how to get Power BI to keep the data that’s already in your dataset and add new data to it.

Click through for the details.

Distributed XGBoost in Cloudera

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

DASK is an open-source parallel computing framework – written natively in Python – that integrates well with popular Python packages such as Numpy, Pandas, and Scikit-Learn. Dask was initially released around 2014 and has since built a significant following and support.

DASK uses Python natively, distinguishing it from Spark, which is written in Scala, runs on the JVM, and has the overhead of running JVMs and context switching between Python and the JVM. It is also much harder to debug Spark errors vs. looking at a Python stack trace that comes from DASK.

We will run Xgboost on DASK to train in parallel on CML. The source code for this blog can be found here.

Click through for the process.
