Press "Enter" to skip to content

Curated SQL Posts

Using SQL Server’s PIVOT Operator

Dan Blank shows how to use the PIVOT operator in SQL Server:

I was recently approached by my firms Marketing Manager with a request for some information.  She wanted to know “Which departments have our top clients never done any work for?”.  For some clarity, I work in a law firm with 11 departments. The request seems pretty straightforward at first. Then once I got thinking about how the output of this report would be presented it made me reconsider quite how simple a request this was.  She handed me a drawing with her vision for the output.  What she wanted was :

  1. Client’s names down the left,

  2. List of Departments across the top,

  3. Ticks and crosses at the intersection to show whether they had or had not done work for them.

You can make a good argument that presentation mechanics like this are meant for a different tool (a presentation layer) but it’s useful to know how to pivot and unpivot data within T-SQL for more reasons than just presentation.

Comments closed

The Secret Power Of TRIM

Dave Mason points out something quite useful about TRIM in SQL Server 2017:

Now I can tidy things up and remove both leading and trailing spaces with a single call to TRIM():

SELECT TRIM(b.foo)
FROM dbo.bar b;

On the surface, this may not seem like that big of a deal. And I would tend to agree. But, in this example, it saves a few key strokes and makes the code slightly more readable. And it is nice for T-SQL to finally have a function that has been around in other languages for far longer than I’ve been writing code for a living.

But Wait, There’s More!

Click through for that more.  This makes TRIM a lot more useful, so go check it out.

Comments closed

Joining Multiple Types Of Data With KSQL

Robin Moffatt has an example where he enriches streaming CSV data with information stored in MySQL:

This is a continuous query that executes in the background until explicitly terminated by the user. In effect, these are stream processing applications, and all we need to create them is SQL! Here all we’ve done is an enrichment (joining two sets of data), but we could easily add predicates to the data (simply include a WHERE clause), or even aggregations.

You can see which queries are running with the SHOW QUERIES; statement. All queries will pause if the KSQL server stops, and restart automagically when the KSQL server starts again.

The DESCRIBE EXTENDED command can be used to see information about the derived stream such as the one created above. As well as simply the columns involved, we can see information about the underlying topic, and run-time stats such as the number of messages processed and the timestamp of the most recent one.

It’s pretty easy to do; click through to see just how easy.

Comments closed

Joining Objects In Powershell

Shane O’Neill makes a discovery:

…if there is enough data to import this into the database & use T-SQL then you can bet that’s what I’m going to do! It’s what it was designed for, I’d find it easier, and it’s probably going to be faster after you hit a certain threshold.

However, if it’s small sets and the effort of importing the data is going to slow you down and break your flow…

Well, that doesn’t have to be the case anymore.

Read on to see joins in action.

Comments closed

Time Zone Conversion With M

Cedric Charlier shows how to perform time zone conversions with the M language in Power Query:

Everything is fine … except if I share my code with someone from another time zone. The function DateTimeZone.ToLocal is relying on regional settings and in that case my conversion should always be from UTC to “Brussels time”.

I didn’t find any other way to ensure that I’m always converting from UTC to “Brussels time” than implementing the conversion by myself. That’s the goal of the following function

Looks like there may not be a nice “convert to a different time zone” here like lubridate::with_tz() does in R.

Comments closed

Read-Only Replicas With Filled TempDB

David Fowler explains what could cause a read-only secondary replica in an Availability Group to have its tempdb fill up:

When I have an issue with tempdb filling up the first thing that I usually do is try to figure out exactly what the space has been allocated to.

You can quickly figure out what process has the most space allocated by using a quick query against dm_db_session_space_usage.

SELECT session_id, database_id, user_objects_alloc_page_count + internal_objects_dealloc_page_count AS TotalAllocatedPages
FROM sys.dm_db_session_space_usage
ORDER BY TotalAllocatedPages DESC

But what if you can see that there aren’t any pages allocated to sessions?  What could be taking up all the space?  Well let’s have a little look and see exactly where those pages are allocated.

Click through to see David’s results and explanation.

Comments closed

Sharing Power BI Content Via E-Mail

Steve Hughes looks at the security implications of being able to share Power BI reports through e-mail:

My account does not have Power BI Pro, but now I can try it for free for 60 days and get access to the data while I am on the trial. I clicked both options, because I can. The Upgrade account option would require me to pay for Pro. However, Try Pro for free works and I was able to access the report fully. I have successfully shared my corporate content with a personal user.

Steve shows us where you can go to disable this if you want, as well as places where you can see what content has been shared.

Comments closed

Finding Where Power BI Local Credentials Get Stored

Eugene Meidinger hunts down where those local Power BI credentials live:

With SSIS, you have to be careful to export the SSIS files without any sensitive information included. But what about Power BI? If you save the .PBIX files on OneDrive, can you be exposing yourself to a security risk?

Looking at things, it looks like credentials for data sources are stored globally, so one wouldn’t expect them to be in the .pbix files.

Read on as he does some more sleuthing and discovers the answer.

Comments closed

ggplot2 Geoms And Aesthetics

Tyler Rinker digs into ggplot2’s geoms and aesthetics:

I thought it my be fun to use the geoms aesthetics to see if we could cluster aesthetically similar geoms closer together. The heatmap below uses cosine similarity and heirarchical clustering to reorder the matrix that will allow for like geoms to be found closer to one another (note that today I learned from “R for Data Science” about the seriation package [https://cran.r-project.org/web/packages/seriation/index.html] that may make this matrix reordering task much easier).

It’s an interesting analysis of what’s available within ggplot2 and a detailed look at how different geoms fit together with respect to aesthetic options.

Comments closed

Multi-Class Text Classification In Python

Susan Li has a series on multi-class text classification in Python.  First up is analysis with PySpark:

Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.

Given a new crime description comes in, we want to assign it to one of 33 categories. The classifier makes the assumption that each new crime description is assigned to one and only one category. This is multi-class text classification problem.

    • * Input: Descript
    • * Example: “STOLEN AUTOMOBILE”
    • * Output: Category
    • * Example: VEHICLE THEFT

To solve this problem, we will use a variety of feature extraction technique along with different supervised machine learning algorithms in Spark. Let’s get started!

Then, she looks at multi-class text classification with scikit-learn:

The classifiers and learning algorithms can not directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.

Specifically, for each term in our dataset, we will calculate a measure called Term Frequency, Inverse Document Frequency, abbreviated to tf-idf.

This is a nice pair of articles on the topic.  Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.

Comments closed