Press "Enter" to skip to content

Month: December 2024

Session, DataFrameWriter, and Table Configurations in Spark

Miles Cole makes a configuration change:

With Spark and Delta Lake, just like with Hudi and Iceberg, there are several ways to enable or disable settings that impact how tables are created. These settings may affect data layout or table format features, but it can be confusing to understand why different methods exist, when each should be used, and how property inheritance works.

While platform defaults should account for most use cases, Spark provides flexibility to optimize various workloads, whether adjusting for read or write performance, or for hot or cold path data processing. Inevitably, the need to adjust configurations from the default will arise. So, how do we do this effectively?

Read on to learn how.
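Before you do, here is a minimal PySpark sketch (not from Miles's post) of the three scopes in question: session configuration, DataFrameWriter options, and table properties. The property and table names are illustrative only, and it assumes a Spark session with Delta Lake available.

```python
# Minimal sketch of the three configuration scopes; property and table names
# are illustrative, not recommendations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# 1. Session scope: a default applied to every Delta table created afterward.
spark.conf.set(
    "spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true"
)

# 2. DataFrameWriter scope: applies to this one write only.
(df.write
    .format("delta")
    .option("mergeSchema", "true")  # relax schema enforcement for this write
    .mode("append")
    .saveAsTable("demo_table"))

# 3. Table scope: persisted in the table metadata and inherited by later writes.
spark.sql(
    "ALTER TABLE demo_table SET TBLPROPERTIES ('delta.appendOnly' = 'false')"
)
```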


Securing a Kafka Ecosystem

Riya has a breakdown of how to protect your Apache Kafka installation and resources around it:

Apache Kafka is the backbone of many real-time data pipelines, making security an essential aspect of its deployment. Protecting your Kafka ecosystem involves implementing encryption to safeguard data, authentication to verify user identities, and authorization to control access. This guide provides a comprehensive overview of these three pillars of securing Kafka, complete with code examples to help you implement best practices.

Click through for demonstrations of encryption, authentication, and authorization.
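To give a sense of what the client side of those pillars looks like, here is a minimal Python sketch using confluent-kafka with SASL/SCRAM over TLS; the broker address, credentials, CA path, and topic are placeholders, and the mechanism shown is just one of several options.

```python
# Minimal sketch: TLS encryption plus SASL/SCRAM authentication from the
# client side; broker, credentials, CA path, and topic are placeholders.
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "broker1:9093",
    "security.protocol": "SASL_SSL",         # TLS for data in transit
    "ssl.ca.location": "/etc/kafka/ca.pem",  # CA used to verify the brokers
    "sasl.mechanism": "SCRAM-SHA-512",       # authentication mechanism
    "sasl.username": "orders-service",
    "sasl.password": "change-me",
}

producer = Producer(conf)

# Authorization is enforced broker-side: this principal needs a WRITE ACL on
# the topic, otherwise the produce request is rejected.
producer.produce("orders", key=b"42", value=b'{"status": "shipped"}')
producer.flush()
```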


The Cost of Everything, Cloud Edition

Kevin Sookocheff noodles on a core concept:

At AWS re:Invent 2023, Amazon CTO Werner Vogels delivered a talk on the laws of frugal architecture. While I initially filed away those insights to review later, a year of cloud architecture experience crystallized a fundamental truth: in cloud computing, cost isn’t just a financial consideration — it is a first-class architectural concern through which we should design and optimize our systems.

Cloud providers charge for every conceivable resource: servers, API calls, data transfer, and computational milliseconds. But cost is more than just a line item on a monthly bill; it is a powerful forcing function that drives better architectural decisions.

Read on for more thoughts on the matter.


Writing Loops in Powershell

Adam Bertram is spinning in circles:

One of PowerShell’s key features is its ability to loop through data collections, enabling you to execute a code block repeatedly.

In this tutorial, we will explore three fundamental loop structures in PowerShell: foreach, for, and do loops.

By the end of this tutorial, you will have a solid understanding of how to use these loops to handle repetitive tasks and process collections of data more effectively.

Click through for the tutorial.


Filter and Parameter Passing from Power BI to Power Apps

Jon Voege runs into a problem:

A simple comment App embedded in Power BI was supposed to show all the comments in a SharePoint List which matched the filters set in Power BI. But… not all comments were showing up, even though the SharePoint List only held about 10 comments at the time!

Why, you ask? Well, the issue stems from the fact that the table of data being sent from Power BI to Power Apps, the PowerBIIntegration.Data object, has a hard limitation of 1000 rows. And those 1000 rows are all that we get to work with.

Read on to learn more, as well as a workaround to this issue.


Models and Endpoints in Azure AI Foundry

Tomaz Kastrun continues a series on Azure AI:

Models from the model catalog can be deployed using programming languages or using the Foundry studio.

Model deployment has two types: deploy from the base model or deploy from the fine-tuned model. The difference is that a fine-tuned model is a model taken from the model catalog and later tuned on an additional dataset, whereas the base model is the model as it is available in Azure AI Foundry.

Click through for a bit more information on the process.
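Once a deployment exists, calling it from code is the easy part. Here is a minimal sketch that assumes an Azure OpenAI-style deployment in the Foundry resource; the endpoint, key, API version, and deployment name are all placeholders.

```python
# Minimal sketch of calling a deployed model; endpoint, key, API version, and
# deployment name are placeholders for your own Foundry resource.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-foundry-resource.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="my-gpt4o-mini-deployment",  # the deployment name, not the catalog model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```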


Switching between Python and PySpark Notebooks in Fabric

Sandeep Pawar wants to save some money:

File this under a test I have been wanting to do for some time. If I am exploring some data in a Fabric notebook using PySpark, can I switch between Python and PySpark engines with minimal code changes in an interactive session? The goal is to use the Python notebook for some exploration or use existing PySpark/SparkSQL or develop the logic in a low compute environment (to save CUs) and scale it in a distributed Spark environment. Understandably, there will be limitations with this approach given the difference in environments, configs etc., but can it be done?

Read on for the answer, as well as plenty of notes around it.
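As a rough illustration of the idea (and not necessarily Sandeep's exact approach), the pattern is to keep the logic engine-agnostic and flip between a Spark read and a single-node read; the table name and lakehouse path below are placeholders.

```python
# Hypothetical sketch of flipping between engines with minimal code changes;
# the table name and lakehouse path are placeholders.
USE_SPARK = False

if USE_SPARK:
    # PySpark notebook: distributed read, returns a Spark DataFrame.
    # (The `spark` session object is predefined in Fabric Spark notebooks.)
    df = spark.read.table("sales")
    row_count = df.count()
else:
    # Python notebook: single-node read via polars + delta-rs, no Spark compute.
    import polars as pl
    df = pl.read_delta("/lakehouse/default/Tables/sales")
    row_count = df.height

print(row_count)
```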


Partitioned Tables and Faster Performance

Henrietta Dombrovskaya provides a warning:

I am unsure whether others have this problem, but I find myself explaining the benefits (and what’s not) of partitioning tables in Postgres over and over again.

Most times, developers have unrealistic expectations that if they partition a gigantic table, “it will be faster to select from it.” I always have to explain that the goal and the purpose of partitioning is improving maintenance, not query optimization, and if the execution speed of short queries remains the same as before partitioning, they should consider it a success (yes, there are no rules without exceptions, and there are cases of performance improvement, but those are exceptions indeed).

Henrietta’s example is specifically around Postgres, but it applies to SQL Server all the same: partitioning is primarily a maintenance benefit rather than a performance benefit.
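To make the maintenance angle concrete, here is a small sketch (using psycopg, with made-up table and partition names) of the operation partitioning is really for: retiring old data by detaching and dropping a partition rather than running an enormous DELETE.

```python
# Sketch of the maintenance win: dropping a month of data as a partition
# instead of deleting rows. Table and partition names are made up; assumes
# psycopg 3 and a range-partitioned table.
import psycopg

with psycopg.connect("dbname=app") as conn:
    with conn.cursor() as cur:
        # The non-partitioned alternative would be a long, bloat-inducing:
        #   DELETE FROM events WHERE created_at < '2024-01-01';
        cur.execute("ALTER TABLE events DETACH PARTITION events_2023_12")
        cur.execute("DROP TABLE events_2023_12")
    # Leaving the connection block commits the transaction.
```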


The Benefits of Read Committed Snapshot Isolation

Haripriya Naidu gets optimistic:

In this article, we are going to take a look at how write operations block read operations under the default isolation level, Read Committed, and how to avoid this using the optimistic isolation level, Read Committed Snapshot.

I do wish that this were the default for new databases in SQL Server, like it is in Azure SQL Database.

H/T Brent Ozar’s newsletter for pointing me in Haripriya’s direction.
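If you want to flip it on yourself, it is a single ALTER DATABASE. Here is a sketch run from Python via pyodbc, with placeholder server and database names; note that the change wants no other active sessions in the database, which is what WITH ROLLBACK IMMEDIATE forces.

```python
# Sketch of enabling the setting; server, database, and driver are
# placeholders. ALTER DATABASE needs autocommit, and WITH ROLLBACK IMMEDIATE
# rolls back any other active sessions in the target database.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=sql01;DATABASE=master;"
    "Trusted_Connection=yes;Encrypt=yes;TrustServerCertificate=yes;",
    autocommit=True,
)
conn.execute(
    "ALTER DATABASE Sales SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE"
)
conn.close()
```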


Calling IFTTT from Powershell

Denny Cherry triggers an action:

Sometimes, you want/need to have PowerShell do an action, but it’s hard to do through PowerShell, or it can’t be done at all. If the website If This Then That (IFTTT) can do it instead, you can still do it through PowerShell, with a little extra work. The way to do this will be to call a webhook on an IFTTT Applet and pass the information you need to send to the third-party service. A great example of this would be if you need to automate posting a new post to Facebook. Posting to Facebook is shockingly hard as you have to write all the pieces to post to Facebook manually, and there’s no current way to post to Facebook via PowerShell. However, you can call an IFTTT Action and have that action do the actual posting to Facebook for you.

Click through for a generic form of the process.
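Denny's examples use Invoke-RestMethod; purely for comparison, the same webhook call sketched in Python looks like this, with the event name, key, and payload values as placeholders.

```python
# Sketch of triggering an IFTTT Webhooks applet; event name, key, and payload
# values are placeholders.
import requests

event = "new_blog_post"
key = "<your-ifttt-webhooks-key>"

resp = requests.post(
    f"https://maker.ifttt.com/trigger/{event}/with/key/{key}",
    json={"value1": "Post title", "value2": "https://example.com/post"},
    timeout=10,
)
resp.raise_for_status()
```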
