Press "Enter" to skip to content

Curated SQL Posts

Visualizing Linear Regression Results

Bernardo Lares gives us a few ways of visually interpreting linear regression results in R:

The most obvious plot to study for a linear regression model, you guessed it, is the regression itself. If we plot the predicted values vs the real values we can see how close they are to our reference line of 45° (intercept = 0, slope = 1). If we have a very sparse plot where we can see no clear tendency over that line, then we have a bad regression. On the other hand, if we have all our points over the line, I bet you gave the model the results you wished for!

Then, the Adjusted R² on the plot gives us an easy parameter for comparing models and seeing how well each fits our reference line. The nearer this value gets to 1, the better. Without getting too technical: if you add more and more useless variables to a model, this value will decrease; but if you add useful variables, the Adjusted R² will improve.

We also get the RMSE and MAE (Root Mean Squared Error and Mean Absolute Error) for our regression’s results. MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. On the other hand we have RMSE, which is a quadratic scoring rule that also measures the average magnitude of the error: it’s the square root of the average of squared differences between prediction and actual observation. Both metrics can range from 0 to ∞ and are indifferent to the direction of errors. They are negatively-oriented scores, which means lower values are better.

I like this approach to explaining models.
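
If you want to poke at the same ideas outside of R, here is a minimal Python sketch (not the R approach from the post) that fits a toy regression, plots predicted against actual values over the 45° reference line, and prints RMSE and MAE; the generated data is an assumption purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy data: three useful predictors plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

pred = LinearRegression().fit(X, y).predict(X)

rmse = mean_squared_error(y, pred) ** 0.5   # quadratic: penalizes large errors more
mae = mean_absolute_error(y, pred)          # average absolute error, same units as y

# Predicted vs actual, with the 45-degree (intercept 0, slope 1) reference line.
plt.scatter(y, pred, alpha=0.5)
lims = [y.min(), y.max()]
plt.plot(lims, lims, color="red")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
plt.show()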


Test-Driven Database Development

Haroon Ashraf walks us through a simplified example of test-driven database development:

In TDDD, business requirements are encapsulated in database unit tests.

If the requirement is to add a new category to the Category table, TDDD is implemented according to the following steps:

  1. Create a database unit test to check for the existence of the AddNewCategory database object.

  2. The unit test fails because the database object does not exist.

  3. Create the AddNewCategory object so that the unit test passes.

  4. Write a unit test to determine whether the AddNewCategory stored procedure actually adds a new category.

  5. That unit test also fails.

  6. Change the AddNewCategory procedure code to add a new category, then verify by rerunning the unit test, which now passes.

Laying out my biases, I’m not a fan of TDD for application development and definitely not a fan of it for database development. “Unit testing” inside a database is extremely limited, particularly when there are so many side effects and when encapsulation tends to be actively harmful.
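
For anyone who does want to kick the tires on the cycle above, here is a hedged sketch of what the two tests might look like in pytest and pyodbc; the post itself uses SQL Server tooling rather than Python, and every name below (the connection string, the Store database, the Category table, and the AddNewCategory parameter) is an assumption for illustration.

import pyodbc
import pytest

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=Store;Trusted_Connection=yes")

@pytest.fixture
def conn():
    cn = pyodbc.connect(CONN_STR, autocommit=False)
    yield cn
    cn.rollback()   # leave no test data behind
    cn.close()

def test_add_new_category_procedure_exists(conn):
    # Steps 1-3: fails until the AddNewCategory object is created.
    row = conn.execute("SELECT OBJECT_ID('dbo.AddNewCategory', 'P')").fetchone()
    assert row[0] is not None

def test_add_new_category_inserts_a_row(conn):
    # Steps 4-6: fails until the procedure actually adds the category.
    conn.execute("EXEC dbo.AddNewCategory @Name = ?", "Stationery")
    count = conn.execute("SELECT COUNT(*) FROM dbo.Category WHERE Name = ?",
                         "Stationery").fetchone()[0]
    assert count == 1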


Improving AG Database Level Failover

Sourabh Agarwal announces improvements to Availability Groups when it comes to database level failover:

In addition to the existing checks, the new implementation has the following additional checks.

  1. The new implementation stores and uses a historical snapshot of the database state information to decide if a failover should be initiated. The health check routine caches the database state and associated error information for the last two executions, which is then compared with the state information from the current execution of the health detection routine. If the same error condition (for the error codes mentioned below) exists in three consecutive runs of the health detection routine, a failover is initiated. This implementation is intended to provide safeguards against transient errors and issues which can be fixed by the auto page repair capabilities of availability groups.

  2. The new implementation checks for the following additional errors. The majority of these errors are indicative of hardware issues on the server. Please note that this is not an exhaustive list of errors which could impact database availability. There is an outstanding item to include error 824 in this list.

Great news from the Tiger Team.
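
To make the three-consecutive-runs idea concrete, here is a toy Python sketch of that rule as I read it; this is a paraphrase of the described behavior, not SQL Server's actual code, and the error codes in the example are an illustrative subset rather than the official list.

from collections import deque

CONSECUTIVE_RUNS = 3                      # current run plus the two cached runs
history = deque(maxlen=CONSECUTIVE_RUNS)  # cached error state per health-check run

def should_fail_over(current_error_code, monitored_errors):
    """Record this run's error state and decide whether to initiate a failover."""
    history.append(current_error_code)
    if len(history) < CONSECUTIVE_RUNS:
        return False
    first = history[0]
    # Fail over only when the same monitored error appears in every cached run;
    # anything transient (fixable by auto page repair) falls out of the window.
    return (first is not None
            and first in monitored_errors
            and all(code == first for code in history))

# Example: error 823 seen in three consecutive health-check runs triggers failover.
monitored = {605, 823, 829, 832}          # illustrative subset, not the official list
for code in (None, 823, 823, 823):
    print(should_fail_over(code, monitored))   # False, False, False, True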


Managing SQL Logins Across Different Instances

Raul Gonzalez shows us how to maintain the same login across different SQL Server instances—it’s all about the SIDs:

Most servers out there would have both enabled, so sooner or later we DBAs need to deal with SQL logins, and there is more to it than providing a name and a password (a strong one, of course).

If you also have different environments, most likely you want to create different logins to avoid DEV apps or users connecting to LIVE, or vice versa.

But when you have different logins, and therefore different database users, whenever you need to refresh your DEV (TEST, QA…) environment you have to re-apply all of the granted permissions to the right user, because the login does not exist in that environment. Does that sound familiar?

In this post I will show you how you can handle this problem in a very simple way.

Click through to read the whole thing.
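
As a rough Python illustration of the SID idea (not Raul's code), here is a pyodbc sketch that reads a SQL login's SID from one instance and re-creates the login with the same SID on another; the server names, login name, and password are assumptions.

import pyodbc

def connect(server):
    return pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=" + server +
        ";DATABASE=master;Trusted_Connection=yes", autocommit=True)

source = connect("PROD01")   # instance that already has the login
target = connect("DEV01")    # refreshed environment that needs the matching login

# Read the SID of the SQL login on the source instance.
sid = source.execute(
    "SELECT sid FROM sys.server_principals WHERE name = ? AND type = 'S'",
    "app_login").fetchone()[0]

# Re-create the login on the target with the same SID, so database users in any
# copy restored from the source map straight back to it instead of being orphaned.
target.execute("CREATE LOGIN [app_login] WITH PASSWORD = N'Str0ng!Passw0rd', "
               "SID = 0x" + sid.hex())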


It’s 10 O’Clock; Do You Know Where Your Backups Are?

Adrian Buckman has a script which makes sure your backups are where msdb says they are:

Here is the information that the script provides:

  1. Warns of restores over the top of the database since its last FULL backup.

  2. Shows database snapshots currently against the database.

  3. Shows the last FULL, DIFF, and LOG backup for the database, including the backup durations and backup age.

  4. Shows backup file information such as backup start/finish time, file path, first LSN, and last LSN, plus a status column which states whether the log chain is intact (based on first and last LSN and on whether the file exists on disk), and finally a file-exists column which tells you if the file still exists on disk.

This is a great script if you take transaction log backups frequently (typically a good idea).
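
If you want the gist in a few lines (this is emphatically not Adrian's script, which does far more), here is a small Python sketch that reads recent backup history from msdb and checks whether each file still exists; the connection string and the seven-day window are assumptions, and the file check only makes sense when run on the SQL Server host or against UNC paths.

import os
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=msdb;Trusted_Connection=yes")

# Backup history as msdb recorded it: one row per backup file written in the last week.
rows = conn.execute("""
    SELECT bs.database_name, bs.type, bs.backup_finish_date, bmf.physical_device_name
    FROM msdb.dbo.backupset AS bs
    JOIN msdb.dbo.backupmediafamily AS bmf ON bmf.media_set_id = bs.media_set_id
    WHERE bs.backup_finish_date >= DATEADD(DAY, -7, GETDATE())
    ORDER BY bs.backup_finish_date DESC
""").fetchall()

for db_name, backup_type, finished, path in rows:
    # type: D = full, I = differential, L = log
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"{status:8}{db_name:30}{backup_type:3}{finished}  {path}")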


SSAS Tabular Deployment Fails: Newtonsoft.Json Missing

Alex Whittles walks us through an error deploying a SQL Server Analysis Services tabular model:

Deploying an Analysis Services Tabular model to SSAS Azure using the Analysis Services Deployment Wizard, with both Visual Studio 2017 and SQL Server 2017 installed on the client.

Trying to click on the ellipses to change the data source connection string or impersonation information results in a Newtonsoft.Json error:

“Could not load file or assembly ‘Newtonsoft.Json, Version 6.0.0.0, Culture=neutral, ……”

As I like to joke, every single .NET project in existence includes Newtonsoft.Json.  As Alex shows, sometimes they don’t reference the right version.


Connecting GitHub To Azure Container Registry

Andrew Pruski automates the generation of SQL Server Docker images in Azure Container Registry, generating a new image with each GitHub repo check-in:

Fantastic, one build task created! How easy was that??

Let’s test by running:

az acr build-task run --registry TestContainerRegistry01 --name buildsqlimage

And the progress of the build task can be monitored:

az acr build-task logs --registry TestContainerRegistry01

Andrew gives us the step-by-step details, so check it out.


Parsing T-SQL Scripts With Pester

Rob Sewell shows us how to use Pester to ensure that a set of SQL scripts are valid T-SQL:

This is a quick Pester test I wrote to ensure that some SQL scripts in a directory would parse, so there was some guarantee that they were valid T-SQL. It uses SQLParser.dll, and because it was running on a build server without SQL Server I had to load the required DLLs from the dbatools module (thank you, dbatools 🙂).

It simply runs through all of the .sql files, runs the parser against them, and checks for errors. In the case of failures, it will also output where the parse failed in the error message of the failed Pester result.

This particular example doesn’t ensure that the scripts do what you want them to do, but hey, Pester was built for that as well.
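
The same kind of check translates to other stacks, too. As a loose analogue (my assumption, not anything from Rob's post, which uses Pester and SQLParser.dll), pytest plus the sqlglot library can assert that every script at least parses as T-SQL, with the caveat that sqlglot is not Microsoft's parser and will not catch everything SQLParser.dll does.

import pathlib

import pytest
import sqlglot
from sqlglot.errors import ParseError

SCRIPT_DIR = pathlib.Path("sql")   # directory of .sql files; adjust as needed

@pytest.mark.parametrize("script", sorted(SCRIPT_DIR.glob("*.sql")), ids=str)
def test_script_parses_as_tsql(script):
    sql = script.read_text(encoding="utf-8")
    try:
        sqlglot.parse(sql, read="tsql")   # raises ParseError on invalid syntax
    except ParseError as err:
        pytest.fail(f"{script} failed to parse: {err}")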


Data Engineering Remains As Important As Ever

Prashanth Southekal has good news for ETL developers:

While many companies have embarked on data analytics initiatives, only a few have been successful. Studies have shown that over 70% of data analytics programs fail to realize their full potential and over 80% of digital transformation initiatives fail. While there are many reasons that affect the successful deployment of data analytics, one fundamental reason is lack of good quality data. However, many business enterprises realize this and invest considerable time and effort in data cleansing and remediation, technically known as data engineering. It is estimated that about 60 to 70% of the effort in data analytics is spent on data engineering. Given that data quality is an essential requirement for analytics, there are five key reasons why data analytics is heavy on data engineering.

1. Different systems and technology mechanisms to integrate data.

Business systems are designed and implemented for a purpose, mainly for recording business transactions. The mechanism for data capture in business systems such as ERP is batch/discrete data, while in SCADA/IoT field systems it is continuous/time-series data. This means that these business systems store diverse data types caused by the velocity, volume, and variety dimensions of the data. Hence the technology (including the database itself) to capture data is varied and complex. And when you are trying to integrate data from these diverse systems from different vendors, the metadata model varies, resulting in data integration challenges.

That 60-70% on data engineering is probably a moderate underestimate.


Analyzing Clickstream Data With Spark

Tony Cruz and Denny Lee analyze advertising data in Spark and predict click counts given certain input features:

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle.  The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

  • ETL

  • Data Exploration, for example, using SQL

  • Advanced Analytics / Machine Learning

The Databricks blog has a couple other examples, but this was the most interesting one for me.
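
To give a flavour of what those three steps look like in code, here is a condensed PySpark sketch rather than the Databricks notebook itself; the column names exist in the Kaggle CTR dataset, but the feature selection, file path, and model choice are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ctr-prediction").getOrCreate()

# ETL: load the raw impressions and keep a handful of categorical features.
df = spark.read.csv("/data/ctr/train.csv", header=True, inferSchema=True)
features = ["banner_pos", "site_category", "device_type"]

# Data exploration could happen here via df.createOrReplaceTempView("impressions")
# and plain SQL; below is the machine learning step: index the categories, assemble
# a feature vector, and fit a logistic regression to predict the click label.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in features]
assembler = VectorAssembler(inputCols=[c + "_idx" for c in features],
                            outputCol="features")
lr = LogisticRegression(labelCol="click", featuresCol="features")

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=indexers + [assembler, lr]).fit(train)

auc = BinaryClassificationEvaluator(labelCol="click").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")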
