Curated SQL Posts

Writing a Single JSON File in Databricks

Falek Miah performs a surprisingly difficult task:

When writing to a JSON destination using the DataFrameWriter, the dataset is split into multiple files to reflect the number of RDD partitions in the DataFrame when in memory – this is the most efficient way for Spark to write data out.

However, this creates a directory containing the data files, as well as Spark metadata files…but what if you just wanted a single JSON file? It’s a scenario that comes up a lot with our clients and, despite it not being the most efficient way to use Spark, we need to implement it all the same.

Click through to see how to do this, including the removal of all metadata files (committed, started, and success files).
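As a rough illustration of the shape of the trick (not necessarily Falek's exact code), here's a minimal PySpark sketch, assuming a Databricks notebook where `spark` and `dbutils` are predefined and with placeholder paths:

```python
# Minimal sketch of the single-file pattern. Coalescing to one partition is
# what makes this inefficient for large data: everything funnels through a
# single task.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

tmp_dir = "/tmp/single_json_out"   # placeholder paths
target = "/tmp/result.json"

df.coalesce(1).write.mode("overwrite").json(tmp_dir)

# Spark wrote one part-*.json plus metadata files (_committed, _started,
# _SUCCESS); keep the data file and drop the rest.
part_file = [f.path for f in dbutils.fs.ls(tmp_dir)
             if f.name.startswith("part-") and f.name.endswith(".json")][0]

dbutils.fs.cp(part_file, target)   # single clean JSON file
dbutils.fs.rm(tmp_dir, True)       # remove directory and metadata files
```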

Configuring FIDO2 for Azure Active Directory

Joey D’Antoni takes us through a process:

If this sounds scary (and it does to me, who is by far not an expert in all things security, but knows a little bit), you may ask: what are some alternative solutions? The answer to that question is FIDO2, a different protocol for MFA and auth. Remember all of that stuff Microsoft talks about with passwordless login? That’s all based around FIDO2. I configured this for DCAC’s Azure Active Directory yesterday, and I wanted to walk you through the steps.

Click through to learn how.
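Joey's walkthrough uses the Azure portal. As an aside, the same policy is also exposed through Microsoft Graph; here's a hedged sketch of reading it, assuming you already have a token with the Policy.Read.All permission (the endpoint and property names are my best understanding of the Graph API, so verify before relying on them):

```python
# Hedged sketch: read the tenant's FIDO2 authentication method policy via
# Microsoft Graph. TOKEN is a placeholder; acquire one via MSAL or Azure CLI.
import requests

TOKEN = "<access-token>"
url = ("https://graph.microsoft.com/v1.0/policies/"
       "authenticationMethodsPolicy/authenticationMethodConfigurations/fido2")

resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
policy = resp.json()
print(policy.get("state"))  # expected to be "enabled" or "disabled"
```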

Intelligent Cache for Spark in Synapse

Avinanda Chattapadday makes an announcement:

Traditionally, when querying a file or table from your data lake, the Apache Spark engine in Synapse makes a call to your remote ADLS Gen2 storage for each read of the data. For workloads with frequent repeat queries, this process can be redundant and add latency to the overall processing time. Although Apache Spark provides a great caching feature, it must be manually set and released to minimize the latency and improve overall performance. It can also result in queries of stale data if the underlying data changes. This is where the intelligent cache in Azure Synapse can simplify the process; by automatically detecting changes to the underlying files and automatically refreshing them in the cache, you ensure you have access to the most recent data. When the cache reaches its size limit, it will automatically release the least-read data to make space for more recent data.

Click through to see how you can enable this, as well as a few more details on the process.
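For session-level control, something like the following sketch may work in a Synapse notebook; the config key is my recollection of the documented setting name, so treat it as an assumption and verify against the Synapse docs (the cache size itself is set on the Spark pool in the Azure portal):

```python
# Hedged sketch: toggle the intelligent cache for the current Spark session.
# The key name below is an assumption worth double-checking.
spark.conf.set("spark.synapse.vegas.useCache", "true")    # enable
# spark.conf.set("spark.synapse.vegas.useCache", "false") # or disable
```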

The Practical Costs of Index Fragmentation

Tibor Karaszi digs into index performance:

See numbers and diagrams at the end, or at the top. I measured a few cases: the difference between no external fragmentation and severe external fragmentation (over 99%). I have both a narrow index and a wide index, and I read one (1), 10,000, and 100,000 rows using index searches (“range scan”). There was obviously no difference reading 1 row, so I exclude that from my discussion below. For the other cases, the extra time with an extreme level of external fragmentation is (from lowest impact to highest) 7%, 10%, 13%, and 32%. The highest number (32%) is when reading many rows from a narrow index, i.e., many rows per page. Again, this is with an extreme level of fragmentation.

What’s interesting is that, for the most part, there’s a negligible difference between ~0% internal fragmentation and ~99% internal fragmentation. The follow-on question is: how much are defrag operations costing you in performance, and when is the benefit worth the cost?
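To see where one of your own indexes sits on both dimensions, here's a generic sketch using the standard DMV (this is not Tibor's test harness; the table name and connection string are placeholders):

```python
# Generic sketch: external and internal fragmentation for one table's
# indexes via sys.dm_db_index_physical_stats (SAMPLED mode populates the
# page-space column). Connection and object names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=TestDb;Trusted_Connection=yes;TrustServerCertificate=yes"
)

query = """
SELECT i.name,
       ps.avg_fragmentation_in_percent,   -- external (logical) fragmentation
       ps.avg_page_space_used_in_percent  -- inverse of internal fragmentation
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'),
                                    NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;
"""
for name, ext_frag, page_fill in conn.execute(query):
    print(f"{name}: external {ext_frag:.1f}%, page fullness {page_fill:.1f}%")
```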

Finding Free Space in a SQL Server Filegroup

John McCormack does some digging:

I just realised that, in all the scripts I use on a regular basis, I didn’t have one for working out free space in SQL Server filegroups. It’s not something that comes up too often, but it’s handy to know. For methods of working out space in individual files, you could refer to this post on mssqltips.

Click through for the query and congrats to John on 100 posts.
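As a generic sketch of the idea (not necessarily John's query; connection details are placeholders), FILEPROPERTY gives you used pages per file, which you can roll up by filegroup:

```python
# Generic sketch: size and free space per filegroup. SQL Server pages are
# 8 KB, so dividing page counts by 128 yields megabytes. The inner join to
# sys.filegroups naturally excludes the log file.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=TestDb;Trusted_Connection=yes;TrustServerCertificate=yes"
)

query = """
SELECT fg.name AS filegroup_name,
       SUM(df.size) / 128.0 AS size_mb,
       SUM(df.size - CAST(FILEPROPERTY(df.name, 'SpaceUsed') AS int))
           / 128.0 AS free_mb
FROM sys.database_files AS df
JOIN sys.filegroups AS fg
  ON fg.data_space_id = df.data_space_id
GROUP BY fg.name;
"""
for fg, size_mb, free_mb in conn.execute(query):
    print(f"{fg}: {size_mb:.0f} MB total, {free_mb:.0f} MB free")
```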

Determining if SQL Server Needs More Memory

Erik Darling breaks Betteridge’s Law of Headlines:

In this post, we’ll talk about how to figure out if your SQL Server needs more memory, and if there’s anything you can do to make better use of memory at the same time.

After all, you could be doing just fine.

(You’re probably not.)

I have a simple flow chart: do you have all of the memory created since the 1990s? If not, then you need more memory. If so, may I please borrow a cup of RAM?
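Erik walks through his own diagnostics in the post; as one commonly cited (and much-debated) signal, here's a generic sketch that reads page life expectancy from the performance counter DMV. This is a heuristic, not Erik's method, and the connection details are placeholders:

```python
# Generic heuristic sketch: page life expectancy, in seconds, from
# sys.dm_os_performance_counters. What counts as "low" depends heavily on
# how much memory the server has; treat it as a clue, not a verdict.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "Trusted_Connection=yes;TrustServerCertificate=yes"
)

query = """
SELECT cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';
"""
ple_seconds = conn.execute(query).fetchval()
print(f"Page life expectancy: {ple_seconds} seconds")
```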

The Performance Cost of CAST/CONVERT in a WHERE Clause

Monica Rathbun does the math:

Remove CONVERT/CAST from your WHERE clauses and JOINs when comparing to variables of different data types. Set their data types to match your table definitions before using them as a filter. Optimizing your queries this way will greatly reduce the amount of CPU time, reads, and I/O generated by your queries and allow your code to take better advantage of indexes.

This can quietly be a major performance issue.
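A hypothetical before-and-after sketch of the pattern (table, column, and variable names invented for illustration):

```python
# Anti-pattern: wrapping the column in CONVERT makes the predicate
# non-sargable, so an index on OrderDate can't be seeked and the function
# runs against every row.
slow_sql = """
DECLARE @d varchar(10) = '2022-01-01';
SELECT OrderId FROM dbo.Orders
WHERE CONVERT(varchar(10), OrderDate, 120) = @d;  -- scan + per-row convert
"""

# Fix: declare the variable with the column's own type and leave the
# column untouched.
fast_sql = """
DECLARE @d date = '2022-01-01';
SELECT OrderId FROM dbo.Orders
WHERE OrderDate = @d;  -- sargable, index seek possible
"""
```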

Replacing Zookeeper in Kafka

Guozhang Wang explains the decision-making behind a major change in Apache Kafka:

Why replace ZooKeeper with an internal log for Apache Kafka® metadata management? This post explores the rationale behind the replacement, examines why a quorum-based consensus protocol like Raft was utilized and altered to become KRaft, and describes the new Quorum Controller built on top of KRaft protocols.

Click through for the reasoning, which includes a considerably faster shutdown in large environments.

Data Visualization in Python

Mehreen Saeed uses a few data visualization libraries in Python:

Data visualization is an important aspect of all AI and machine learning applications. You can gain key insights into your data through different graphical representations. In this tutorial, we’ll talk about a few options for data visualization in Python. We’ll use the MNIST dataset and the TensorFlow library for number crunching and data manipulation. To illustrate various methods for creating different types of graphs, we’ll use Python’s graphing libraries, namely Matplotlib, Seaborn, and Bokeh.

Bokeh results can look really nice, although it does feel like it requires a lot more developer time and effort to get it right. Click through for examples of each of the three libraries.
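The tutorial itself pulls MNIST through TensorFlow; as a tiny self-contained taste of the Matplotlib portion, here's a sketch with random arrays standing in for the digit images:

```python
# Tiny self-contained sketch of the Matplotlib side, with random 28x28
# arrays standing in for the MNIST digits the tutorial actually loads.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
images = rng.random((4, 28, 28))

fig, axes = plt.subplots(1, 4, figsize=(8, 2))
for ax, img in zip(axes, images):
    ax.imshow(img, cmap="gray")
    ax.axis("off")
plt.tight_layout()
plt.show()
```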

Azure ML Well-Architected Framework Review

Ben Brauer has good news:

Microsoft offers prescriptive guidance called the Well-Architected Framework that optimizes workloads implemented and deployed on Azure. This guidance has been generalized for most workloads and creates a basis for reliable and secure applications that are cost optimized.

We have begun to build on this base content set to include more precise guidance for specific workload types, such as machine learning, data services and analytics, IoT, SAP, mission-critical apps, and web apps. Machine Learning was the first branch from the base content, which came to fruition in the fall of 2021.

In case you have never used the Azure Well-Architected Review assessment tool, it’s really useful. It can take hours (or days) to go through the review, but if you take it seriously and have the right people in the room giving answers, you’ll get concrete guidance on how to optimize your Azure-based solutions.
