Syntax – Page 35 – Curated SQL

Join Types in Spark SQL

Published 2022-12-29 by Kevin Feasel

In Apache Spark, we can use the following types of joins in SQL:

Inner join: An inner join in Apache Spark is a type of join that returns only the rows that match a given predicate in both tables. To perform an inner join in Spark using Scala, we can use the join method on a DataFrame.

The set of options is the same as you’d see in a relational database: inner, left outer, right outer, full outer, and cross. The examples here are in Scala, though they would apply just as easily to PySpark and, of course, to classic SQL statements.
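To make the join types concrete, here is a minimal sketch in Python using the standard-library sqlite3 module rather than Spark. The authors and posts tables and their rows are made up for illustration, and SQLite is only a stand-in engine. Right and full outer joins are left out because SQLite only supports them from version 3.39 onward.

```python
import sqlite3

# Hypothetical sample data: two authors have no posts, one post has no author.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER, name TEXT);
    CREATE TABLE posts (post_id INTEGER, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cho');
    INSERT INTO posts VALUES (10, 1, 'Joins'), (11, 1, 'Windows'), (12, 4, 'Orphan');
""")

# Inner join: only rows that satisfy the predicate on both sides.
inner_rows = conn.execute("""
    SELECT a.name, p.title
    FROM authors a
    INNER JOIN posts p ON p.author_id = a.author_id
    ORDER BY p.post_id
""").fetchall()

# Left outer join: every author, with NULL (None) where no post matches.
left_rows = conn.execute("""
    SELECT a.name, p.title
    FROM authors a
    LEFT OUTER JOIN posts p ON p.author_id = a.author_id
    ORDER BY a.author_id, p.post_id
""").fetchall()

# Cross join: every author paired with every post, no predicate at all.
cross_rows = conn.execute("""
    SELECT a.name, p.title
    FROM authors a
    CROSS JOIN posts p
""").fetchall()

print(inner_rows)
print(left_rows)
print(len(cross_rows))
```

The same three statements should carry over to Spark SQL largely unchanged, since the join keywords are standard.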

Comparing Table Records with T-SQL

Published 2022-12-22 by Kevin Feasel

Chad Callihan compares and contrasts:

We recently looked at comparing schemas using Azure Data Studio. What if we need to compare tables by using a query? For this post we’ll compare using EXCEPT, NOT IN, and NOT EXISTS to find differences between two tables.

Our two tables to compare will be Comic and Comic_Copy. Based on counts, we have 48 more records in Comic than we do in Comic_Copy. Let’s find the differences.

In Chad’s specific query, NOT EXISTS works great. Where I like EXCEPT is when you need to see if any of the non-key columns differ: for example, if you also need to compare titles for rows with the same ID and ensure those titles match.
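As a rough illustration of that difference, here is a small sketch in Python with the standard-library sqlite3 module standing in for SQL Server. The Id and Title columns and the sample rows are assumptions made for the example, not Chad's actual schema.

```python
import sqlite3

# Hypothetical data: Id 3 is missing from the copy, Id 2 has a different title.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Comic (Id INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Comic_Copy (Id INTEGER PRIMARY KEY, Title TEXT);
    INSERT INTO Comic VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO Comic_Copy VALUES (1, 'Alpha'), (2, 'Beta (reprint)');
""")

# NOT EXISTS on the key: finds rows whose Id is absent from Comic_Copy.
missing_by_key = conn.execute("""
    SELECT c.Id, c.Title
    FROM Comic c
    WHERE NOT EXISTS (SELECT 1 FROM Comic_Copy cc WHERE cc.Id = c.Id)
    ORDER BY c.Id
""").fetchall()

# EXCEPT on all columns: also catches rows whose non-key columns differ.
different_rows = conn.execute("""
    SELECT Id, Title FROM Comic
    EXCEPT
    SELECT Id, Title FROM Comic_Copy
    ORDER BY Id
""").fetchall()

print(missing_by_key)   # only the row missing by key
print(different_rows)   # the missing row plus the row with a changed title
```

EXCEPT compares whole rows, so a changed Title shows up without writing a column-by-column predicate.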

The Value (and Cost) of DATETRUNC

Published 2022-12-21 by Kevin Feasel

Brent Ozar points out the ups and downs of DATETRUNC():

The first one, passing in a specific start & end date, gets the best plan, runs the most quickly, and does the least logical reads (4,299). It’s a winner by every possible measure except ease of writing the query. When SQL Server is handed a specific start date, it can seek to that specific part of the index, and read only the rows that matched.

DATETRUNC and YEAR both produce much less efficient plans. They scan the entire index (19,918 pages), reading every single row in the table, and run the function against every row, burning more CPU.

SQL Server’s thought process is, and has always been, “I have no idea what’s the first date that would produce YEAR(2017). There’s just no way I could possibly guess that. I might as well read every date since the dawn of time.”
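The rewrite Brent describes amounts to turning "rows in year N" into a half-open date range. Here is a minimal sketch of that idea in Python, with the standard-library sqlite3 module as a stand-in engine. The posts table, the created_on column, the year_bounds helper, and the sample dates are all hypothetical, and the plan behavior quoted above is SQL Server's, not SQLite's. The sketch only shows that the two predicates select the same rows.

```python
import sqlite3
from datetime import date


def year_bounds(year):
    """Return the half-open range [start, end) covering one calendar year."""
    return date(year, 1, 1), date(year + 1, 1, 1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_on TEXT)")
conn.execute("CREATE INDEX ix_posts_created_on ON posts (created_on)")
conn.executemany(
    "INSERT INTO posts (created_on) VALUES (?)",
    [("2016-12-31",), ("2017-01-01",), ("2017-06-15",), ("2017-12-31",), ("2018-01-01",)],
)

start, end = year_bounds(2017)

# Range predicate on the bare column: an engine can seek to the start of the range.
range_count = conn.execute(
    "SELECT COUNT(*) FROM posts WHERE created_on >= ? AND created_on < ?",
    (start.isoformat(), end.isoformat()),
).fetchone()[0]

# Function wrapped around the column: the function runs against every row.
function_count = conn.execute(
    "SELECT COUNT(*) FROM posts WHERE strftime('%Y', created_on) = '2017'"
).fetchone()[0]

print(range_count, function_count)
```

Using an exclusive upper bound avoids having to guess the last representable moment of December 31.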

Read on for the upshot.

Window Functions in DAX

Published 2022-12-15 by Kevin Feasel

Jeffrey Wang is speaking my language:

The December 2022 release of Power BI Desktop includes three new DAX functions: OFFSET, INDEX, and WINDOW. They are collectively called window functions because they are closely related to SQL window functions, a powerful feature of the SQL language that allows users to perform calculations on a set of rows that are related to the current row. Because these functions are often used for data analysis, they are sometimes called analytical functions. In contrast, DAX, a language invented specifically for data analysis, had been missing similar functionalities. As a result, users found it hard to write cross-row calculations, such as calculating the difference of the values of a column between two rows or the moving average of the values of a column over a set of rows.
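The two cross-row calculations mentioned at the end of that excerpt can be sketched in plain Python. This is not DAX syntax and says nothing about how OFFSET, INDEX, or WINDOW are implemented. The function names and the sales figures are invented for the example.

```python
def row_differences(values):
    """Difference between each row and the row before it."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


def moving_average(values, window):
    """Average over the current row and up to window - 1 preceding rows."""
    if window < 1:
        raise ValueError("window must be at least 1")
    result = []
    for i in range(len(values)):
        start = max(0, i - window + 1)   # the frame shrinks near the first row
        frame = values[start:i + 1]
        result.append(sum(frame) / len(frame))
    return result


sales = [10, 20, 60, 30]                 # hypothetical values, already ordered
differences = row_differences(sales)
averages = moving_average(sales, 3)

print(differences)
print(averages)
```

In SQL terms, the first function plays the role of a LAG-style lookup and the second a frame of the current row plus two preceding rows.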

Read on to learn more about how these functions work and how they differ from their SQL Server counterparts.

Column Exclusion and Rename in Snowflake

Published 2022-12-14 by Kevin Feasel

Kevin Wilkie plays duck-duck-goose with columns:

With Snowflake, we could do many different things that we’re not used to seeing with a SELECT statement. We’re all used to seeing this – SELECT * and it shows all kinds of columns.

With Snowflake, we can tell Snowflake NOT to show certain columns by using the EXCLUDE operator.

Read on to see how it works and specific requirements around operation. In addition, Kevin shows a way to perform aliasing.

Semi-Colons in Snowflake

Published 2022-12-07 by Kevin Feasel

Kevin Wilkie punctuates the statement:

With our last blog post, we started discussing Snowflake and the SELECT statement. Now, if you remember, there is this great thing called a semi-colon.

The main reason you should use the semicolon is to terminate all of your queries. Snowflake does this great thing by default, letting you run one query at a time.

I remember back when Microsoft deprecated T-SQL statements which did not end with semi-colons. It was fun speculating for about 5 minutes regarding the carnage which would happen if they carried out the deprecation notice, not least of which we’d find in Microsoft-developed code.

Bit Twiddling in T-SQL

Published 2022-12-05 by Kevin Feasel

Louis Davidson explains how bit operations work in T-SQL:

I expect that 99% of the people reading this would probably expect there to be a status table that contained the values of status. Seeing that this is a base 2 number, you may be in that 1% that thinks this might be a bitmask, but unless you have an eidetic memory, you probably don’t know what all of the bits mean.

A bitmask is a type of denormalization of values where instead of having a set of columns that have on or off values (no Null values), you encode it like:

Bitmasks make me break out the angry nun ruler. You can almost guarantee you’re doing something wrong if you design a bitmask as a column in a table.
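For anyone who has not worked with bitmasks, here is a minimal sketch of the mechanics in Python. The flag names and bit assignments are hypothetical and are not taken from Louis's post. Python's `&`, `|`, and `~` operators behave like their T-SQL counterparts for these non-negative values.

```python
# Hypothetical flag layout: one bit per on/off attribute.
FLAGS = {
    "ACTIVE": 0b0001,
    "VERIFIED": 0b0010,
    "LOCKED": 0b0100,
    "ARCHIVED": 0b1000,
}


def decode(status):
    """List the names of the flags whose bits are set in status."""
    return [name for name, bit in FLAGS.items() if status & bit]


status = 0b0101                                        # stored as the integer 5

is_locked = bool(status & FLAGS["LOCKED"])             # test one bit with AND
is_verified = bool(status & FLAGS["VERIFIED"])
with_verified = status | FLAGS["VERIFIED"]             # set a bit with OR
without_locked = status & ~FLAGS["LOCKED"]             # clear a bit with AND NOT

print(decode(status))
print(is_locked, is_verified, with_verified, without_locked)
```

The need for a lookup like FLAGS just to read the value is a fair summary of why a set of separate columns tends to be easier to live with.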

GENERATE_SERIES and Data Types

Published 2022-11-23 by Kevin Feasel

Bill Fellows runs into an issue:

Perfect, now I have a row for each second from midnight to approximately 5.5 hours later. What if my duration needs to vary because I’m going to compute these ranges for a number of different scenarios? I should make that 19565 into a variable and let’s overengineer this by making it a bigint.

Things don’t work out quite the way you might have expected there. Read on and see what Bill found and how you can circumvent the problem.

Full-Text Search in Postgres

Published 2022-11-16 by Kevin Feasel

Adam Zegelin takes us through full-text search options in PostgreSQL:

Full-text Search is a PostgreSQL® feature that facilitates the indexing of natural language text documents and the identification of indexed documents that match a given query. Matching documents can be sorted based on their relevance to the query, and document excerpts can be generated with the matching terms highlighted. A set of SQL data types, operators, and functions are provided to assist with the indexing, querying, and ranking of documents.

PostgreSQL uses the term document to mean any fragment of natural language text: essentially, strings containing human-readable words separated by whitespace and punctuation. Documents are often stored as text columns but can also be generated dynamically, such as by concatenating multiple columns together (even from multiple tables).

Click through for the tutorial.

Deleting Data from MySQL

Published 2022-11-15 by Kevin Feasel

Robert Sheldon burns it all down:

In the last few articles in this series, you learned about three important data manipulation language (DML) statements: SELECT, INSERT, and UPDATE. The statements make it possible to retrieve, add, and modify data in a MySQL database. Another DML statement that is just as important is DELETE, which lets you remove one or more rows from a table, including temporary tables. In this article, I focus exclusively on the DELETE statement to help round out our discussion on the core DML statements in MySQL. Overall, the DELETE statement is fairly basic, but one that’s no less necessary to have in your arsenal of DML tools.

Read on to see how the DELETE statement works and the minor differences from SQL Server.
