Day: November 6, 2024

A Primer on SparkSQL and PySpark

Anurag K covers the basics of PySpark:

In the era of big data, efficient data processing is critical for insights-driven decision-making. PySpark SQL, a part of Apache Spark, enables data engineers and analysts to work with structured data at massive scale. Combining SQL’s simplicity with Spark’s processing power, it opens a gateway to handling vast datasets seamlessly. This comprehensive guide walks you through PySpark SQL, from foundational concepts to advanced querying techniques, with detailed code examples. Let’s dive in and master PySpark SQL for data-driven analytics.

Click through for examples covering a variety of operations you can perform.

Vector Search Performance Optimizations in Elasticsearch

Venkata Gummadi works on vector search response times:

As data engineers, we are tasked with implementing these sophisticated solutions, ensuring organizations can derive actionable insights from vast datasets. This article explores the intricacies of vector search using Elasticsearch, focusing on effective techniques and best practices to optimize performance. By examining case studies on image retrieval for personalized marketing and text analysis for customer sentiment clustering, we demonstrate how optimizing vector search can lead to improved customer interactions and significant business growth.

Read on for a vector search primer and some guidance on how you can improve the performance of vector search queries. I’d expect that much of this can also apply to Azure AI Search and Amazon OpenSearch.

Converting SQL Audit FileTime to DateTime Format

Patrick Keisler helps a customer:

One of my customers recently wanted to rename each of the SQL audit files with the datetime stamp of when it was created. I explained to them the filename already contains a datetime stamp. While it does not look like a typical timestamp, it is based on the Windows Filetime data structure, which is a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601 (UTC). Nonetheless, they still wanted a traditional datetime stamp in the file name.

Read on to see how. I can understand the displeasure in adding redundancy to a filename, though I also understand the reasoning from the customer’s point of view: FileTime isn’t human-readable in any meaningful way.
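To make the FileTime definition concrete, here is a minimal Python sketch of the arithmetic. It is not Patrick's method (click through for his approach); the function name `filetime_to_datetime` is my own, and the only input assumed is the raw 64-bit integer.

```python
from datetime import datetime, timedelta, timezone

# FileTime counts 100-nanosecond intervals from this instant (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a Windows FileTime integer to a timezone-aware UTC datetime.

    Hypothetical helper for illustration. Python datetimes resolve to
    microseconds, so the last 100 ns digit is truncated.
    """
    # 10 intervals of 100 ns make 1 microsecond; integer division avoids
    # floating-point rounding on such large values.
    microseconds = filetime // 10
    return FILETIME_EPOCH + timedelta(microseconds=microseconds)


# Reference value: the Unix epoch expressed as a FileTime.
print(filetime_to_datetime(116444736000000000).isoformat())
# 1970-01-01T00:00:00+00:00
```

The value used above is the well-known FileTime for the Unix epoch, not one taken from a real audit file name.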

Move Data between Lakehouses and Workspaces in Microsoft Fabric

Gilbert Quevauvilliers performs an exfiltration:

With the new schemas in a Lakehouse, it is now possible to read from Lakehouse A (in Workspace A) and write to Lakehouse B (in Workspace B).

Here are more details about the Schema preview: Lakehouse schemas (Preview) – Microsoft Fabric | Microsoft Learn

This opens a whole new world of possibilities.

I also really like the fact that I can simply use the names, and I do not need to get the actual GUIDs!

For example, I can use the four-part format WorkspaceName.LakehouseName.SchemaName.TableName
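As a rough illustration of that naming format, the sketch below builds a four-part name in Python. The helper `qualified_table_name` and the workspace, lakehouse, schema, and table names are hypothetical. The commented-out lines assume a Fabric notebook where a `spark` session already exists and the schema preview is enabled; treat them as an assumption rather than a tested recipe.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in backticks, doubling any backticks inside it.

    Backtick quoting is the Spark SQL convention for identifiers that
    contain spaces or other special characters.
    """
    return "`" + name.replace("`", "``") + "`"


def qualified_table_name(workspace: str, lakehouse: str, schema: str, table: str) -> str:
    """Build WorkspaceName.LakehouseName.SchemaName.TableName (hypothetical helper)."""
    return ".".join(quote_identifier(part) for part in (workspace, lakehouse, schema, table))


# Hypothetical names, for illustration only.
source = qualified_table_name("Workspace A", "LakehouseA", "dbo", "sales")
target = qualified_table_name("Workspace B", "LakehouseB", "dbo", "sales")
print(source)

# Assumed usage inside a Fabric notebook (not executed here):
# df = spark.sql(f"SELECT * FROM {source}")
# df.write.mode("overwrite").saveAsTable(target)
```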

Click through to see it in action.
