Press "Enter" to skip to content

Curated SQL Posts

Exploratory Analysis of Earthquake Data

Giorgio Garziano walks us through an earthquake data set:

Boxplots for each quantitative variable are shown. We take advantage of the quantitative variable names (quantitative_vars) determined earlier to apply a ggplot2-based boxplot function. The Y axis labeling and title are determined by the variable to be plotted. Further, the legend is not displayed, and we adopt the coordinate flip option for improved readability.

Check it out to get an idea of how to do exploratory data analysis.
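
Giorgio's post does this in R with ggplot2. If you'd rather follow along in Python, here is a minimal sketch of the same idea; the file name and column names are placeholders, and the horizontal orientation stands in for ggplot2's coordinate flip.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical earthquake data set and quantitative column names
quakes = pd.read_csv("earthquakes.csv")
quantitative_vars = ["depth", "mag", "gap", "rms"]

for var in quantitative_vars:
    fig, ax = plt.subplots()
    # One boxplot per quantitative variable; vert=False plays the role of coord_flip()
    quakes.boxplot(column=var, vert=False, ax=ax)
    ax.set_title(var)
    ax.set_xlabel(var)
    plt.show()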

Azure SQL Linux VM Configuration with dbatools

Rob Sewell walks us through configuring SQL Server on an Azure VM running Linux, installing Powershell, and using dbatools:

I had set the Network security rules to accept connections only from my static IP, using variables in the Build Pipeline. I use MobaXterm as my SSH client. It's a free download. I click on Sessions.

There wasn’t much I could excerpt here, but this is a heavily screenshot-driven tutorial.

Defining Data Egress Charges with Azure Query Editor

Dave Bland looks into what constitutes data egress with Azure and looks at a specific example:

If you have attempted to calculate the price of your Azure environment, you know that the pricing can be complex, taking into account a number of factors. These factors include data egress, compute, and storage. The intent of this blog is not to outline all of Azure's billing factors. The purpose of this post is to answer one question:

When I use the Azure SQL Query Editor, does that count as part of the data egress charges?

I completed a blog post on the Azure Query Editor a few weeks back.  Here is a link, if you would like to check it out.

Read on to see what Dave learned.

Cancelling Power Query Refreshes

Imke Feldmann shares how you can stop a data refresh in Power Query without losing the work you’ve done in the designer:

If you’re working with large data or complex queries that take a long time to refresh, cancelling one of those refreshes can take even longer, especially if the query has already been running for quite some time.

Luckily, there is an easy trick to cancel a refresh without losing the work you’ve done already.

Read on to see how.

Linked Servers and Remote Insertion

Max Vernon recommends a pull rather than push model when you need to insert cross-server:

Linked Servers offer a great way to connect two SQL Servers together, allowing remote querying and DML operations. Frequently, this is used to copy data from production to reporting. However, the temptation is to run the copy operation on the production, or source side. If you do that, even with a single INSERT INTO statement, SQL Server will process each individual row as a discrete INSERT INTO statement via a cursor operation. This makes for very slow inserts across a linked server. Running the operation from the destination server means SQL Server can simply query the remote source table for all the rows, inserting them as a set into the destination table. The difference in speed can be eye-watering.

Click through for a slightly creepy picture and a less creepy example.
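
To make the pull model concrete, here's a hedged sketch. The essential piece is the single set-based INSERT ... SELECT that runs on the destination server and reads the linked source server through four-part naming; the Python/pyodbc wrapper is only there to show where it executes, and the server, database, and table names are hypothetical.

import pyodbc

# Connect to the DESTINATION (reporting) server -- in the pull model, the copy runs here.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=ReportServer;DATABASE=Reporting;Trusted_Connection=yes;"
)

# One set-based statement: the destination pulls all rows from the linked
# source server at once, instead of the source pushing them row by row.
pull_insert = """
INSERT INTO dbo.SalesCopy (SaleID, SaleDate, Amount)
SELECT SaleID, SaleDate, Amount
FROM [ProdServer].[SalesDB].[dbo].[Sales];
"""

with conn:  # the pyodbc connection context manager commits on successful exit
    conn.execute(pull_insert)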

Spark UDFs in Scala

Achilleus shows us how to create a user-defined function for Spark in Scala, as well as the performance drawbacks:

It is pretty straightforward and easy to create one in Spark. Let’s say we have this customer data from Central Perk. If you look at the country data, it has a lot of discrepancies, but we kind of know it’s the right country; it’s just that the way it is entered is not typical. Let’s say we need to normalize everything that looks like the USA with the help of a known dictionary.

The performance hit is often too much for me to accept, though that could just be that I write bad functions.
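
Achilleus writes the UDF in Scala; a rough PySpark analogue of the same idea looks like the sketch below, where the dictionary of country spellings and the sample rows are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("normalize-country").getOrCreate()

# Known dictionary of variant spellings we want to map to "USA"
usa_variants = {"us", "u.s.", "usa", "u.s.a.", "united states", "united states of america"}

def normalize_country(value):
    # Map any recognized variant to the canonical spelling; leave everything else alone
    if value is not None and value.strip().lower() in usa_variants:
        return "USA"
    return value

normalize_country_udf = udf(normalize_country, StringType())

customers = spark.createDataFrame(
    [("Rachel", "U.S.A."), ("Ross", "United States"), ("Gunther", "Canada")],
    ["name", "country"],
)
customers.withColumn("country", normalize_country_udf(col("country"))).show()

Note that a Python UDF adds serialization overhead on top of the usual UDF optimization barriers, so built-in functions are preferable whenever they can do the job.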

ETL with Spark and Hive

Emrah Mete gives us an example of using Apache Spark for ETL into Apache Hive:

Now let’s move on to building the sample application. In the example, we will first send the data from our Linux file system to the storage layer of the Hadoop ecosystem, HDFS (Extraction). Then we will read the data we have written there with Spark, apply a simple Transformation, and write the result to Hive (Load). Hive is a layer that allows us to query the data stored in the Hadoop ecosystem. With this infrastructure, we can easily query the data in our big data environment using SQL.

Most of the things relational database professionals do are pretty much the same things that you do with Spark and Hive. There are differences in implementation and level of programming familiarity, but they’re pretty similar.
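
As a rough illustration of that extract-transform-load flow, here is a minimal PySpark sketch; the HDFS path, column names, and Hive table name are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# enableHiveSupport() lets Spark register the result in the Hive metastore
spark = (
    SparkSession.builder
    .appName("hdfs-to-hive-etl")
    .enableHiveSupport()
    .getOrCreate()
)

# Extraction: read the raw file previously copied into HDFS
raw = spark.read.option("header", "true").csv("hdfs:///user/etl/input/sales.csv")

# Transformation: a simple cast-and-filter step
transformed = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)
)

# Load: write the result as a Hive table, queryable with plain SQL afterwards
transformed.write.mode("overwrite").saveAsTable("sales_clean")
spark.sql("SELECT COUNT(*) AS row_count FROM sales_clean").show()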

Choosing Between Management Studio and Azure Data Studio

Brent Ozar gives us the lay of the tooling land:

Every time there’s a new release of SQL Server or SQL Server Management Studio, you can grab the latest version of SSMS and keep right on keepin’ on. Your job still functions the same way using the same tool, and the tool keeps getting better.

And it’s free. You don’t have to ask the boss for upgrade money. You can just download it, install it, and take advantage of things like the cool new execution plan est-vs-actual numbers (which also cause presenters all over to curse, knowing that they have to redo a bunch of screenshots.)

I spend a lot of time jumping back & forth between SQL Server and Postgres, and lemme just tell you, the tooling options on the other side of the fence are a hot mess.

Yeah, Management Studio is the best of the bunch. I’m using Azure Data Studio more at home but still need a couple of plugins to use it often at work. And those two beat pretty much every other tool I’ve ever worked with.
