2023-06-22 – Curated SQL

Read and Write Data with PySpark

Published 2023-06-22 by Kevin Feasel

Dustin Vannnoy has two of the three R’s down:

Every Spark pipeline involves reading data from a data source or table. For data engineers we usually end the pipelines by writing the transformed data. In this tutorial we walk through some of the most common format and cloud storage locations for reading and writing with Spark. We’ll save some of the advanced Delta Lake capabilities for another tutorial.

Click through to see how to read from and write to CSV, JSON, and Parquet formats. Dustin has examples of working with Azure Blob Storage, S3, and Google Cloud Storage, and even some database examples with JDBC.

Comments closed

Creating Crosstabs in R

Published 2023-06-22 by Kevin Feasel

Steven Sanderson shows off the ability to perform crosstabs in R:

As a programmer, you’re constantly faced with the task of organizing and analyzing data. One powerful tool in your R arsenal is the xtabs() function. In this blog post, we’ll explore the versatility and simplicity of xtabs() for aggregating data. We’ll use the mtcars dataset and the healthyR.data::healthyR_data dataset to illustrate its functionality. Get ready to dive into the world of data aggregation with xtabs()!

Click through for two examples of it creating a crosstab (or, in SSRS/Power BI terms, a matrix) from your data.

Comments closed

Measuring Bandwidth with Powershell

Published 2023-06-22 by Kevin Feasel

Patrick Gruenauer checks download speed:

Short explanation: A file will be downloaded from my homepage. It’s a video about Powershell Profiles in German language. During the download of the file the process is measured with the Measure-Command cmdlet. At the end we get the final result and the time it took to download the file.

You can, of course, change the file location to whatever makes sense, but it’s nice to have something handy and not need to go to any websites for a speed test.

Comments closed

The Cost and Difficulty Level of Changes

Published 2023-06-22 by Kevin Feasel

Richard Swinbank has an image for us:

I spent some time working at a property portal, where users could look at online listings of homes for sale or rent, then go on to book a viewing appointment with an estate agent. On one occasion we were asked to build a Power BI report showing:

the number of appointments booked by portal users

the percentage of appointments where the user had viewed the online listing more than once.

Sounds easy enough, right?

Click through for the image. It makes intuitive sense, but is a good visual depiction of why some data requests are more challenging than others.

Comments closed

Running SqlBulkCopy in Parallel from Powershell

Published 2023-06-22 by Kevin Feasel

Jose Manuel Jurado Diaz has a script for us:

Today, we encountered an interesting service request of attempting to reduce the load times for 100,000 records from a table with 97 varchar(320) fields in an Azure SQL HyperScale database. Following, I would like to share my lessons learned here.

The idea is to split in different concurrent process the execution of multiples SqlBulkCopy. In this case, we are going to split this process in 5 processes running in parallel inserting 20,000 rows, let’s try to know the total size.

Read on for the script, as well as a rough idea of how long it’ll take inserting into an Azure SQL DB Hyperscale instance.

Comments closed

An Overview of Statistics in SQL Server

Published 2023-06-22 by Kevin Feasel

Matthew McGiffen has a primer for us:

Statistics are vitally important in allowing SQL Server to find the most efficient way to execute your queries. In this post we learn more about them, what they are and how they are used.

Read on for plenty of information about stats. One thing I’d emphasize in this is that, for the most part, auto-generated stats are fine. The only time I’ve ever seen any value in creating my own stats is in multi-column statistics (which auto-generated stats doesn’t do), and even that’s of fairly limited value, at least in my experience. I’m sure that there’s a point in which hand-crafting your own statistics makes a tiny but noticeable marginal difference, but the systems I’ve worked with tend not to be anywhere near that point. There are usually much bigger fish to fry.

Comments closed

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30

Day: June 22, 2023

Read and Write Data with PySpark

Creating Crosstabs in R

Measuring Bandwidth with Powershell

The Cost and Difficulty Level of Changes

Running SqlBulkCopy in Parallel from Powershell

An Overview of Statistics in SQL Server