Press "Enter" to skip to content

Day: March 29, 2022

Visualizing Air Pressure Spikes from the Hunga Tonga Eruption in R

Neil Saunders reviews some personal weather station data:

Wow. Now, pause for a moment and try to recall the last time you read any news about Tonga since the event.
The eruption sent an atmospheric pressure wave, clearly visible in this imagery, around the world. Friends online reported that this was detected by their personal weather stations (PWS) which made me wonder: was the wave apparent in online weather station data and can it be visualized using R?

The answers are yes and yes again.

Read on to see how.

Comments closed

Writing a Single JSON File in Databricks

Falek Miah performs a surprisingly difficult task:

When writing to a JSON destination using the DataFrameWriter the dataset is split into multiple files to reflect the number of RDD partitions in the dataframe when in memory – this is the most efficient way for Spark to write data out.

However, this creates a directory containing the data files, as well as Spark metadata files…but what if you just wanted a single JSON file? It’s a scenario that comes up a lot with our clients and, despite it not being the most efficient way to use Spark, we need to implement it all the same.

Click through to see how to do this, including the removal of all metadata files (committed, started, and success files).

Comments closed

Configuring FIDO2 for Azure Active Directory

Joey D’Antoni takes us through a process:

If this sounds scary, and it does to me, who is by far not an expert in all things security, but knows a little bit, you may ask, what are some alternative solutions? The answer to that question is Fido2, a different protocol for MFA and auth. Remember all of that stuff Microsoft talks about with passwordless login? That’s all based around Fido2. I configured this for DCAC’s Azure Active Directory yesterday, and I wanted to walk you through the steps.

Click through to learn how.

Comments closed

Intelligent Cache for Spark in Synapse

Avinanda Chattapadday makes an announcement:

Traditionally, when querying a file or table from your data lake, the Apache Spark engine in Synapse makes a call to your remote ADLS Gen2 storage for each read of the data. For workloads with frequent repeat queries, this process can be redundant and add latency to the overall processing time. Although Apache Spark provides a great caching feature, it must be manually set and released to minimize the latency and improve overall performance. It can also result in queries of stale data if the underlying data changes. This is where the intelligent cache in Azure Synapse can simplify the process; by automatically detecting changes to the underlying files and automatically refreshing them in the cache, you ensure you have access to the most recent data. When the cache reaches its size limit, it will automatically release the least-read data to make space for more recent data.

Click through to see how you can enable this, as well as a few more details on the process.

Comments closed

The Practical Costs of Index Fragmentation

Tibor Karaszi digs into index performance:

See numbers and diagrams at the end, or at the top. I measured a few cases: the difference between no external fragmentation and severe external fragmentation (over 99%). I have both a narrow index and a wide index, and I read one (1), 10,000 and 100,000 rows using index searches (“range scan”). There were obviously no difference reading 1 row so I exclude that from my discussion below. For the other cases the extra time with an extreme level of external fragmentation is (from lowest impact to highest) 7%, 10%, 13% and 32%. The highest number (32%) is when reading many rows from a narrow index, i.e. many rows per page. Again, this is with an extreme level of fragmentation.

What’s interesting is that for the most part, there’s a negligible difference between ~0% internal fragmentation and ~99% internal fragmentation. The follow-on question is, how much are defrag operations costing you in performance and when is the benefit worth the cost?

Comments closed

Finding Free Space in a SQL Server Filegroup

John McCormack does some digging:

I just realised that in all my scripts that I use on a regular basis, I didn’t have one for working out free space in SQL Server filegroups. It’s not something that comes up too often but it’s handy to know. For methods of working out space in individual files, you could refer to this post on mssqltips.

Click through for the query and congrats to John on 100 posts.

Comments closed

Determining if SQL Server Needs More Memory

Erik Darling breaks Betteridge’s Law of Headlines:

In this post, we’ll talk about how to figure out if your SQL Server needs more memory, and if there’s anything you can do to make better use of memory at the same time.

After all, you could be doing just fine.

(You’re probably not.)

I have a simple flow chart: do you have all of the memory created since the 1990s? If not, then you need more memory. If so, may I please borrow a cup of RAM?

Comments closed

The Performance Cost of CAST/CONVERT in a WHERE Clause

Monica Rathbun does the math:

Remove CONVERT/CAST from your WHERE clauses and JOINS when comparing to variables of different data types. Set their data types to match your table definitions before using them as a filter. Optimizing your queries this way will greatly reduce the amount of CPU time, reads, and I/O generated in your queries and allow your code to take better advantage of indexes.

This can quietly be a major performance issue.

Comments closed