Press "Enter" to skip to content

Curated SQL Posts

Working with Network Graphs in R

John MacKintosh shows us the visNetwork package:

I’ve long been hoping for a reason to have to devote time to learning how to produce network plots. In my world, where bar and line charts reign supreme (with heatmaps and waffle charts thrown in occasionally) it is nice to be able to develop a new visualisation.

I’ve been wanting to produce a network plot for some time. But, the data structure, with its nodes and edges, and seeming lack of any identifiable characteristics, has meant it has not been hugely far up my agenda, or at least, never far up enough to make me learn more about it.

Click through for an example of where a network diagram can work out. H/T R-Bloggers

Comments closed

Asymmetrical Matrices in Power BI

Matt Allington walks us through a complex scenario using the Matrix visual in Power BI:

Not every report you may want to build has such simple layout requirements as the one above.  When using an Excel pivot table, there is a feature called “Field, Items and Sets” that makes the process of creating asymmetric pivot tables relatively easy, but that doesn’t exist in Power BI.  To help you understand, here is an example of an asymmetric visual (Power BI matrix in this case).

Note in my example above, the first 4 years come from the year column, just like before, but now I have 2 additional columns at the end of the matrix that are not simply the addition of other columns.  To create the last 2 columns above I need to write measures, but you can’t have both measures and columns from your model showing like this in a Power BI matrix.  Instead you need to build a solution that leverages a separate table to generate the columns you need, and then to generate the results you need in each column.

Below I will show you how to build such an asymmetric matrix like this.

Read on for the solution. This is quite clever but also makes me wonder if the product shouldn’t make this scenario a bit easier for us

Comments closed

Automatic Seeding of Availability Groups

Jamie Wick takes us through some considerations when trying out automatic seeding of availability groups:

In SQL Server 2016 Microsoft introduced Automatic Seeding for Availability Groups (AG). The Automatic Seeding process streams the database files directly to the secondary server(s) using the database mirroring endpoints, removing the need to restore the databases, before joining them to an AG.

Read on for Jamie’s thoughts and notes.

Comments closed

Testing a Linked Server via T-SQL

Kenneth Fisher has evil afoot:

Unfortunately, if there is a problem with the server / server name the sp_addlinkedserver doesn’t seem to throw an error and sp_addlinkedsrvlogin throws an error that breaks me out of the try block. So after my Google-fu failed me I did what I frequently do and asked on #sqlhelp. And I was given an answer that led me to the solution.

Click through for the solution. But not for the answer as to why he’s growing batches of monsters linked servers.

Comments closed

Updating a Power BI Refresh Schedule

Martin Schoombee continues a series on automating Power BI deployments:

There’s a few things you need to pay close attention to when setting the refresh schedule via the API:

– Unless you’re setting the refresh schedule for a Premium workspace, you can only refresh a dataset up to 8 times a day. We’re only going to set it to update once a day here, but keep this in mind if you’re planning to adjust the API call to refresh multiple times a day.

– The name of the time zone you provide has to match exactly with the names (middle column) in this reference: Microsoft Time Zone Index

– The refresh time has to be in the format hh:mm, and similar to the options in the Power BI portal you can only refresh on the hour or half-hour.

Read on to see how it works and the API call to make.

Comments closed

Setting the Default Command Timeout with Microsoft.Data.SqlClient

Erik Ejlskov Jensen shows us a way to set a default command timeout in .NET’s Microsoft.Data.SqlClient:

With the latest 2.1.0 preview 2 release of the open source .NET client driver for Microsoft SQL Server and Azure SQL Database, Microsoft.Data.SqlClient, it is now possible to set the default command timeout via the connection string.

Now you can work around timeout issues simply by changing the connection string, where this previously required changes to code, and maybe changes to code you did not have the ability to change.

This is pretty nice, as my recollection was that you could set connection timeout via connection string, but not command timeout. And not everything’s going to wrap up nicely within 30 seconds.

1 Comment

Combining CSV Files in Power BI

Stephanie Bruno shows us a method for combining CSV or Excel files:

One of the unsung heroes to me in Power BI desktop (or Power Query in Excel) is how wonderfully simple it is to combine csv or Excel files from a folder. Maybe it’s not totally unsung, but I think it’s an everyday problem that many people have to deal with. It could easily be an entry point for many new users to see just how much easier Power Query can make their lives.

As much as I appreciate how easy the tool allows us to automatically combine files without having to write any code, I never like all those extra queries that get created in my file and so I prefer to simplify it by just writing one magical little line of code. But I always got frustrated because my one line of code left me without proper column headers, so my one line of code turned into a few extra cumbersome steps. In this post, I’ll show you how to slightly modify that one line of code to get the column headers and keep your queries clean and simple.

Click through for a demo.

Comments closed

Outlier Identification Using Spark 3.0

Tori Tompkins takes us through principles of anomaly detection in Apache Spark 3.0:

To calculate Median Absolute Deviation (MAD) you need to calculate the difference between the value and the median. In simpler terms, you will need to calculate the median of the entire dataset, the difference between each value and this median, then take another median of all the differences.

In Spark you can use a SQL expression ‘percentile()’ to calculate any medians or quartiles in a dataframe. ‘percentile()’ expects a column and an array of percentiles to calculate (for median we can provide ‘array(0.5)’ because we want the 50% value ie median) and will return an array of results.

Like standard deviation, to use MAD to identify the outliers it needs to be a certain number of MAD’s away. This number is also referred to as the threshold and is defaulted to 3.

Read on for three measures and their implementations in PySpark.

Comments closed

Features and Improvements in Spark 3.0

Manoj Pandey summarizes some of the improvements in Apache Spark 3.0:

With Spark 3.0 release (on June 2020) there are some major improvements over the previous releases, some of the main and exciting features for Spark SQL & Scala developers are AQE (Adaptive Query Execution), Dynamic Partition Pruning and other performance optimization and enhancements.

Below I’ve listed out these new features and enhancements all together in one page for better understanding and future reference.

Click through for the summary.

Comments closed