Press "Enter" to skip to content

Curated SQL Posts

A Primer on SparkSQL and PySpark

Anurag K covers the basics of PySpark:

In the era of big data, efficient data processing is critical for insights-driven decision-making. PySpark SQL, a part of Apache Spark, enables data engineers and analysts to work with structured data at massive scale. Combining SQL’s simplicity with Spark’s processing power, it opens a gateway to handling vast datasets seamlessly. This comprehensive guide walks you through PySpark SQL, from foundational concepts to advanced querying techniques, with detailed code examples. Let’s dive in and master PySpark SQL for data-driven analytics.

Click through for examples covering a variety of operations you can perform.
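
To give a flavor of the pattern the article walks through, here is a minimal sketch of registering a DataFrame as a temporary view and querying it with Spark SQL. The table and column names are made up for illustration and are not taken from the article.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("pyspark-sql-primer").getOrCreate()

# A tiny illustrative DataFrame; the column names are hypothetical
sales = spark.createDataFrame(
    [("2024-01-01", "Widget", 120.0), ("2024-01-02", "Gadget", 75.5)],
    ["order_date", "product", "amount"],
)

# Expose the DataFrame to the SQL engine as a temporary view
sales.createOrReplaceTempView("sales")

# Standard SQL against the view; the result is just another DataFrame
totals = spark.sql("""
    SELECT product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY product
    ORDER BY total_amount DESC
""")
totals.show()
```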

Vector Search Performance Optimizations in Elasticsearch

Venkata Gummadi works on vector search response times:

As data engineers, we are tasked with implementing these sophisticated solutions, ensuring organizations can derive actionable insights from vast datasets. This article explores the intricacies of vector search using Elasticsearch, focusing on effective techniques and best practices to optimize performance. By examining case studies on image retrieval for personalized marketing and text analysis for customer sentiment clustering, we demonstrate how optimizing vector search can lead to improved customer interactions and significant business growth.

Read on for a vector search primer and some guidance on how you can improve the performance of vector search queries. I’d expect that much of this can also apply to Azure AI Search and Amazon OpenSearch.
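
As a rough illustration of the kind of query being tuned, here is a sketch of an approximate kNN search using the official Python Elasticsearch client. The index name, vector field, dimensionality, and sample documents are assumptions made for the sake of a runnable example, not details from the article.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details are placeholders

# Hypothetical index with a dense_vector field; real embeddings would have
# hundreds of dimensions rather than the toy 3 used here
es.indices.create(
    index="products",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 3,
                          "index": True, "similarity": "cosine"},
        }
    },
)

# Index a couple of sample documents and make them searchable
es.index(index="products", document={"title": "red running shoes",
                                     "embedding": [0.11, -0.02, 0.85]})
es.index(index="products", document={"title": "blue denim jacket",
                                     "embedding": [-0.40, 0.67, 0.10]})
es.indices.refresh(index="products")

# Approximate kNN search: k controls how many hits come back, while
# num_candidates trades recall for latency and is a common tuning knob
resp = es.search(
    index="products",
    knn={
        "field": "embedding",
        "query_vector": [0.12, -0.03, 0.88],
        "k": 2,
        "num_candidates": 10,
    },
    source=["title"],
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```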

Converting SQL Audit FileTime to DateTime Format

Patrick Keisler helps a customer:

One of my customers recently wanted to rename each of the SQL audit files with the datetime stamp of when it was created. I explained to them that the filename already contains a datetime stamp. While it does not look like a typical timestamp, it is based on the Windows FileTime data structure, which is a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601 (UTC). Nonetheless, they still wanted a traditional datetime stamp in the file name.

Read on to see how. I can understand the displeasure in adding redundancy to a filename, though I also understand the reasoning from the customer’s point of view: FileTime isn’t human-readable in any meaningful way.
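
For a sense of the arithmetic involved, here is a hedged sketch in Python of converting a FileTime value (100-nanosecond intervals since January 1, 1601 UTC) into a readable timestamp; the sample value is hypothetical, and Patrick's post shows his actual renaming approach.

```python
from datetime import datetime, timedelta, timezone

# FileTime counts 100-nanosecond ticks from this epoch
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a Windows FileTime value to a timezone-aware datetime."""
    # 1 tick = 100 ns = 0.1 microseconds, so integer-divide by 10
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)

# Hypothetical FileTime value of the kind embedded in an audit file name
example_filetime = 133646600000000000
print(filetime_to_datetime(example_filetime).strftime("%Y%m%d_%H%M%S"))
```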

Move Data between Lakehouses and Workspaces in Microsoft Fabric

Gilbert Quevauvilliers performs an exfiltration:

With the new schemas in a Lakehouse, it is now possible to read from Lakehouse A (in Workspace A) and write to Lakehouse B (in Workspace B).

Here are more details about the Schema preview: Lakehouse schemas (Preview) – Microsoft Fabric | Microsoft Learn

This opens a whole new world of possibilities.

I also really like the fact that I can simply use the names, and I do not need to get the actual GUIDs!

For example, I can use the following format: WorkspaceName.LakehouseName.SchemaName.TableName

Click through to see it in action.
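
As a hedged sketch of what that four-part naming enables in a Fabric Spark notebook (the workspace, lakehouse, schema, and table names below are placeholders, and the exact syntax is worth confirming against Gilbert's post and the schema preview documentation):

```python
# Inside a Microsoft Fabric Spark notebook, where `spark` is already available.
# All names below are placeholders for illustration only.

# Read a table from Lakehouse A in Workspace A using the four-part name
df = spark.read.table("WorkspaceA.LakehouseA.dbo.Sales")

# Apply whatever transformations are needed
df_clean = df.where("Amount > 0")

# Write the result to Lakehouse B in Workspace B, again by name rather than GUID
df_clean.write.mode("overwrite").saveAsTable("WorkspaceB.LakehouseB.dbo.SalesClean")
```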

The NOT IN Operator in R

Steven Sanderson does not want these things:

In R programming, data filtering and manipulation are essential skills for any developer. One of the most useful operations you’ll frequently encounter is checking whether elements are NOT present in a given set. While R doesn’t have a built-in “NOT IN” operator like SQL, we can easily create and use this functionality. This comprehensive guide will show you how to implement and use the “NOT IN” operator effectively in R.

Read on for examples of how to use %in% and its negation, !(x %in% y).

Fabric List Connections API in Semantic Link Labs

Sandeep Pawar has an update for us:

In case you missed it, the List Connections Admin API is now live in Fabric. It was shipped in Semantic Link Labs v0.7.4 a few weeks ago, but at the time of the release it was still private. This API returns all the connections set up in the tenant and requires admin privileges. I still can’t find documentation on it, so wait for the official details. Note that this API is different from the Item – List Connections API, which lists the connections used by an item.

Read on to see what you can get from it.
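
If you want to poke at it from a Fabric notebook, a minimal sketch might look like the following. Note that the function name and return shape are assumptions based on how semantic-link-labs typically exposes these APIs, so check the library's documentation for your version.

```python
# Run inside a Microsoft Fabric notebook after installing the library:
# %pip install semantic-link-labs
import sempy_labs as labs

# Assumption: the List Connections Admin API is surfaced as a function that
# returns a pandas DataFrame of tenant-level connections (admin rights required)
connections = labs.list_connections()
print(connections.head())
```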

Bullet Charts in Power BI Reports

Kurt Buhler is number one with a bullet (chart):

A report visual is useful when it displays information in a meaningful context. This context refers to other relevant data that helps someone interpret figures in a visual and use it to make decisions or take actions. The most common way to provide context is to compare actuals to a target.

There are many ways to compare actuals to a target, and in this article, we will explain in detail one way to do this by using a popular variant of the bar chart known as a bullet chart.

Kurt lays out several ways to implement a bullet chart in Power BI, so check that out.

Column Eviction in Power BI and Direct Lake

Paul Turley talks about fashion:

One of the core best practice guidance principles for Power BI modeling is to avoid including columns that aren’t absolutely necessary for analytic reporting. Every column uses precious memory, especially long, unique values that don’t compress very well. When consulting clients bring me large models that require expensive capacity licensing and pose report performance issues, my first inclination is to see what column data can be carved out of the model, and perhaps moved to another table for a drill-through report.

The product team came up with a very clever way to reduce the in-memory footprint of a Direct Lake semantic model: hold a popularity contest! The semantic model engine will only keep columns in memory based on their hotness. I mean this literally…

Read on to learn a bit more about the algorithm in play and how it differs from a naive Least Recently Used cache.
