Press "Enter" to skip to content

Author: Kevin Feasel

Using the HAVING Clause in Spark

Lnadon Robinson continues the Spark Starter Guide:

Having is similar to filtering (filter()where() or where (in a SQL clause)), but the use cases differ slightly. While filtering allows you to apply conditions on your non-aggregated columns to limit the result set, Having allows you to apply conditions on aggregate functions / columns instead.

Read on for examples in Spark SQL, both as a SQL query and Scala/Python function calls.

Leave a Comment

Databricks Workflows

Stacy Kerkela, et al, make an announcement:

Today we are excited to introduce Databricks Workflows, the fully-managed orchestration service that is deeply integrated with the Databricks Lakehouse Platform. Workflows enables data engineers, data scientists and analysts to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure. Finally, every user is empowered to deliver timely, accurate, and actionable insights for their business initiatives.

This looks a bit like Synapse pipelines. It’ll be interesting to see how this evovles.

Leave a Comment

Reviewing the ReportServer Database

Ed Pollack opens the lid:

SQL Server Reporting Services is a convenient application for generating reports quickly and efficiently. Its back-end components are a bit more confusing to an unsuspecting administrator.

This article delves into the ReportServer database, revealing the tables and data that are used to power SSRS. In addition, the ability to alter data in these tables is presented as a way to avoid time-consuming migration or data modification processes.

Do be sure to check out Ed’s warning and then dive into the database.

Leave a Comment

TRY-CATCH in T-SQL

Kevin Wilkie breaks out the catcher’s mitt:

In our last post, we started talking about transactions and how they can help your databases have data integrity. Let’s continue that today with our friends TRY and CATCH.

TRY is just telling SQL Server that we want to try out something. CATCH tells SQL Server what to do if it completely messes up.

Read on to see how it works. As a quick note for those of you who come at this from a software developer’s background, T-SQL’s try-catch capabilities are not as robust as what you probably are used to, especially with respect to what gets caught. Most severe error messages are not catchable, so you can’t always expect that database query to go quietly in the night.

Leave a Comment

Replacing Common Table Expressions in ADF Dataflows

Jeet Kainth needs an alternative:

At the time of writing, it is not possible to write a query using a CTE in the source of a dataflow. However, there are a few options to deal with this limitation:

– re-write the query using subqueries instead of CTEs

– use a stored procedure that contains the query and reference the stored proc in the source of the dataflow

– write the query as a view and reference the view in the source of the dataflow (this is my preferred method and the one I will demo here)

Jeet focuses on the third alternative. I’d lean toward the second or the third alternative, myself. Probably the second one (stored procedures) but both allow me to create an interface between ADF and the database. That way, underlying table changes will be less likely to require me to make code changes in ADF.

Leave a Comment

Reviewing a SQL Server Backup File

Andy Yun digs into a SQL Server backup file:

This first came up during my onboarding with Pure Storage. Anthony Nocentino (b|t) taught me that a SQL Server backup file is a byte-for-byte copy of your data, as materialized in SQL Server MDF files (assuming no backup compression or backup encryption). And that would make sense – how else would SQL Server store a copy of your data in a backup file? It does not make sense for SQL Server to alter your data when it writes it down to a backup file (again, with NO backup compression/encryption) – that’s a waste of compute and effort.

Well, I had a conversation with someone who was unclear about that assertion. I tried some Google-fu to present some supporting materials, but could not actually find any documentation, official or otherwise, to back it up. So here we are.

Click through to dive into a backup file with Andy.

Leave a Comment

Reviewing the Gartner BI Magic Quadrant

Matt Allington takes a look:

I update this blog article regularly to keep track of how Microsoft is progressing on its strategic journey in building a world class BI platform.  Gartner released the 2022 magic quadrant for Business Intelligence in March 2022.  As expected (by me at least), Microsoft is continuing its trail blazing and has maintained a clear lead over the other contenders in both ability to execute and completeness of vision.  I first wrote this blog article in 2019 to show the trend over time. Since then, I have been collecting the new Gartner magic quadrant images and I have combined them all (2015 – 2022) into an animated gif file (see below).   The visualisation was built using Power BI (what else) so it is very easy to see the changing nature of the competitive environment.  Note that Microsoft started behind back in 2015, and arguably hit the lead in 2018.

Also check out Matt’s video on how to translate spatial values to numbers so Power BI can plot them.

Leave a Comment

Importing Data into R

Sebastian Sauer shows off several ways of loading data into R:

Importing data into R can cause headaches for newbies. For some, the concept of relative and absolute paths is new. That’s why I compiled here some recommendations on how to important data into R and on how to ditch the “what’s my path” problem.

Click through for some notes. This post focused on files rather than databases, though that’s a very common way of loading data as well.

Leave a Comment

Negative Identity Reseeding

Andrea Allred hits identity integer absolute zero:

Recently we had a system go down because we ran out of integers. (Mental note to create an alert when we are getting near to running out of integers.) If we upped the column to a bigint we were going to have to drop all the indexes. This server doesn’t have the capacity needed to do an operation of that size and it was estimated that it would be down for 8 hours while we dropped indexes, upped the column type and added back the indexes. This was way too long as it was early in the work day (had it been evening, it would have been fine).

There was also a concern about how many stored procedures were expecting an int but would need to be modified to a bigint, along with any code. That was a big undertaking and we were in an emergency down.

This reseeding typically works well, though it’s important to know if the code will fail upon getting negative numbers.

Leave a Comment