Press "Enter" to skip to content

Curated SQL Posts

Solving Naive Bayes By Hand

I have a post that requires math and is meaner toward the Buffalo Bills than I normally am:

Trust the Process
There are three steps to the process of solving the simplest of Naive Bayes algorithms. They are:
1. Find the probability of winning a game (that is, our prior probability).
2. Find the probability of winning given each input variable: whether Josh Allen starts the game, whether the team is home or away, whether the team scores 14 points, and who the top receiver was.
3. Plug in values from our new data into the formula to obtain the posterior probability.

This is an algorithm you want to solve by hand first—it’s just that easy. Then, once you understand it, let a computer do the work for larger data sets. Also, Super Bowl 2020 because I’m the kind of overly optimistic fool required of Bills fans. Just gonna leave this link here.
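
To make the arithmetic concrete, here is a minimal sketch of the by-hand calculation in Python. The game history and feature values are invented purely for illustration (the real post works through actual Bills data); the idea is just the prior times the per-feature conditional probabilities, normalized across outcomes.

# Hypothetical game history; each row is
# (allen_starts, home_or_away, scored_14_plus, top_receiver, won).
# The values are made up purely to show the arithmetic.
games = [
    ("yes", "home", "yes", "Brown",  True),
    ("yes", "away", "yes", "Brown",  True),
    ("yes", "home", "no",  "Foster", False),
    ("no",  "away", "no",  "Foster", False),
    ("yes", "home", "yes", "Foster", True),
    ("no",  "away", "yes", "Brown",  False),
]

def posterior(new_game):
    scores = {}
    for outcome in (True, False):
        subset = [g for g in games if g[-1] == outcome]
        prob = len(subset) / len(games)            # step 1: the prior
        for i, value in enumerate(new_game):       # step 2: P(feature value | outcome)
            prob *= sum(1 for g in subset if g[i] == value) / len(subset)
        scores[outcome] = prob
    total = sum(scores.values())                   # step 3: normalize to get posteriors
    return {outcome: score / total for outcome, score in scores.items()}

# New game: Allen starts, at home, 14+ points, Brown is the top receiver.
print(posterior(("yes", "home", "yes", "Brown")))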


Data Transformation Tools In The Azure Space

James Serra gives us an overview of the major tools you would use for ETL and ELT in Azure:

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part).

The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.


The Siren’s Call Of AG Readable Secondaries

Jonathan Kehayias brings up some of the reasons he tends to recommend customers keep away from using readable secondaries on Availability Groups:

A lot of the implementation questions I get for Availability Groups are generally looking at leveraging a secondary replica for High Availability or Disaster Recovery, and oftentimes also for offloading reporting or readable workloads. Readable Secondary copies of the database are really simple on the surface conceptually, but there are a lot of additional considerations I make that could lead you to another technology entirely, like Transactional Replication or Log Shipping with Standby. Let’s look at how these three stack up against each other for different data access requirements.

There are some good reasons here not to kick replication and log shipping to the curb.


Apache Airflow Now A Top-Level Project

Fokko Driesprong announces that Apache Airflow is now a top-level Apache project:

Today is a great day for Apache Airflow as it graduates from incubating status to a Top-Level Apache project. This is the next step of maturity for Airflow. For those unfamiliar, Airflow is an orchestration tool to schedule and orchestrate your data workflows. From ETL to training of models, or any other arbitrary tasks. Unlike other orchestrators, everything is written in Python, which makes it easy to use for both engineers and scientists. Having everything in code means that it is easy to version and maintain.

Airflow has been getting some hype lately, especially in the AWS space.
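
For anyone who has not seen it, a workflow in Airflow is just a Python module that defines a DAG. The sketch below uses the 1.10-era API; the task functions, IDs, and schedule are placeholders rather than anything from the announcement.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("clean and join the extracted data")

# The schedule and dependency graph live in code, so they can be versioned
# and reviewed like anything else.
dag = DAG(dag_id="example_etl", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)

extract_task >> transform_task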


Databricks Library Utilities For Notebooks

Srinath Shankar and Todd Greenstein announce a new feature in Databricks Runtime 5.1:

We can see that there are no libraries installed and scoped specifically to this notebook.  Now I’m going to install a later version of SciPy, restart the python interpreter, and then run that same helper function we ran previously to list any libraries installed and scoped specifically to this notebook session. When using the list() function PyPI libraries scoped to this notebook session are displayed as  <library_name>-<version_number>-<repo>, and (empty) indicates that the corresponding part has no specification. This also works with wheel and egg install artifacts, but for the sake of this example we’ll just be installing the single package directly.

This does seem easier than dropping to a shell and installing with Pip, especially if you need different versions of libraries.
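
For reference, the notebook-scoped utilities described above look roughly like this inside a Databricks notebook. dbutils is provided by the runtime, so none of this runs outside Databricks, and the SciPy version here is only an example.

dbutils.library.installPyPI("scipy", version="1.2.0")  # install scoped to this notebook
dbutils.library.restartPython()                        # restart the interpreter to pick it up
dbutils.library.list()                                 # e.g. shows scipy-1.2.0-PyPI for this session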


Capturing SQLCMD Errors

Jack Vamvas shows us how we can capture errors from SQLCMD:

I’m executing code using SQLCMD from a batch file. The code points to a SQL file and there is also an output file.


SQLCMD -E -S MYSERVER\INST1 -i "setup_job_entry.sql" -o "setup_job_entry.log"


But I noticed that if the actual SQLCMD returns an error, for example if I’m connecting to a server which doesn’t exist, this error message will appear in the output file – but there will not be an ERROR number which would allow me to trap and return an appropriate message.

There is a way and Jack shows us how.
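
As a rough illustration of the general idea (not necessarily Jack's exact solution), the sketch below drives sqlcmd from Python and leans on the -b switch, which makes sqlcmd exit with a nonzero code on T-SQL errors; connection failures also surface as a nonzero exit code, so there is something to branch on beyond the output file.

import subprocess

# Re-run the quoted command with -b so errors surface as a nonzero exit code.
result = subprocess.run(
    ["sqlcmd", "-E", "-S", r"MYSERVER\INST1", "-b",
     "-i", "setup_job_entry.sql", "-o", "setup_job_entry.log"],
    capture_output=True, text=True,
)

if result.returncode != 0:
    # The error text still lands in the -o file; the exit code is what we trap on.
    print(f"sqlcmd failed with exit code {result.returncode}")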


Creating Cosmos DB Indexes

Hasan Savran explains indexing in Cosmos DB:

In SQL Server you need to pick which columns you’d like to index; in CosmosDB you need to pick which columns not to index. It’s kind of the same thing at the end. You might ask, “If everything is indexed and working fine, why do you want me to poke the well-running system?” When we compare SQL Server indexes to CosmosDB indexes, one thing works exactly the same: the index file size. CosmosDB holds the indexes in a separate file like SQL Server, and if we want to index everything, the index file size is going to get large. Since we need to pay for the file space in CosmosDB, you might need to pay extra for indexes that you might never use. Also, your updates, inserts, and deletes might cost you more Request Units, since CosmosDB needs to maintain all the indexes in the background.

There’s just enough difference to make you pay the price if you assume Cosmos DB works just like SQL Server.
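
As a hypothetical illustration, excluding properties from indexing is done through the container's indexing policy; the property paths below are made up, and the policy would be supplied when the container is created (through the portal or an SDK).

# Index everything by default, but exclude a couple of made-up properties
# that are never filtered or sorted on.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [
        {"path": "/*"},
    ],
    "excludedPaths": [
        {"path": "/description/?"},    # a free-text field we never query on
        {"path": "/rawPayload/*"},     # a large nested blob
    ],
}

# With the Python SDK this dictionary would be passed when creating the
# container, e.g. create_container(..., indexing_policy=indexing_policy).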


Migrating Lots Of Databases To SQL Server 2016

Andy Levy has a problem. Well, about 8000 of them. In part 1, he describes the plan:

How do you move eight thousand databases in a reasonable amount of time? I spent about an hour and a half one morning hashing ideas out w/ folks in the dbatools Slack channel, plus several conversations in the office and with our hosting provider.

Then, in part 2, he describes the execution:

We missed the estimated time for our go/no-go decision by five minutes. With the number of moving parts, databases in play, unexpected delays, and amount of testing we had to do, that’s pretty good! My colleague and I had some additional work we needed to take care of after the team declared the migration a success. Agent jobs needed to be enabled, overnight job startups monitored, things like that. We called it a day after about 14 hours in the office.

It was a nice success story, so check it out.


What Happens When Your Secondary DAG Fails Over

Tianyu Wen explains what happens when there is a failover incident in the secondary Availability Group of a Distributed Availability Group:

If the primary replica on the secondary AG (also known as the “forwarder”) is lost and causes automatic failover to happen or manual failover is performed in the secondary AG in a DAG, there will be no data loss if the following conditions are met:
– The primary replica on the primary AG runs with no synchronization issue when the failover happens;
– The secondary AG on the DAG has a functioning secondary replica before the failover happens;
– The primary replica on the primary AG can communicate properly with the secondary replica on the secondary AG over their database mirroring endpoints.

There’s a lot of detail here, so if you are supporting Distributed Availability Groups, check it out.


Getting Reporting Services Installation Details With PowerShell

Josh Smith wants to find every installation of SQL Server Reporting Services on a machine:

This is one of those posts so I never have to google this again (one hopes). Here is the PS code to pull back a set of details about every SSRS instance installed on a server, including the SSRS instance name, the first URL port it is running on, the service name, and the name of the report server database, etc.

Click through for the PowerShell script.
