2023-09-12 – Curated SQL

Plotting SVM Decision Boundaries in R

Published 2023-09-12 by Kevin Feasel

Steven Sanderson goes right up to the edge:

Support Vector Machines (SVM) are a powerful tool in the world of machine learning and classification. They excel in finding the optimal decision boundary between different classes of data. However, understanding and visualizing these decision boundaries can be a bit tricky. In this blog post, we’ll explore how to plot an SVM object using the e1071 library in R, making it easier to grasp the magic happening under the hood.

Read on to see how you can perform this analysis as well.

Comments closed

Running Apache Kafka in Windows

Published 2023-09-12 by Kevin Feasel

Jim Galasyn gives up the ghost:

Is Windows your favorite development environment? Do you want to run Apache Kafka® on Windows? Thanks to the Windows Subsystem for Linux 2 (WSL 2), now you can, and with fewer tears than in the past. Windows still isn’t the recommended platform for running Kafka with production workloads, but for trying out Kafka, it works just fine. Let’s take a look at how it’s done.

There was a time in which running Kafka on Windows meant downloading Windows-specific installers, workaround executables to deal with NTFS, and all the attendant problems of being the third operating system on the list. Using WSL2 is definitely a better approach.

Comments closed

Finding Object Counts for S3 Buckets

Published 2023-09-12 by Kevin Feasel

The Big Data in Real World team sees a problem:

There is no separate command in AWS CLI to find the number of objects in an S3 bucket but there is a workaround.

Read on for the solution to this. The way that S3 and Azure Blob Storage (without hierarchical namespaces) store files as tags and treat folders as cosmetic is neat from a technical standpoint, though it goes counter to how we’d expect a file system to behave.

Comments closed

TINYINT Casts in Spark SQL vs T-SQL

Published 2023-09-12 by Kevin Feasel

Bill Fellows runs into an interesting oddity:

Yet another thing that has bitten me working in SparkSQL in Databricks—this time it’s data types.

In SQL Server, a tinyint ranges from 0 to 255 but both of them allow for 256 total values. If you attempt to cast a value that doesn’t fit in that range, you’re going to raise an error.

SQL Server’s TINYINT data type is an unsigned one-byte number, whereas TINYINT in Spark SQL is a signed one-byte number. But that’s not the biggest difference Bill finds, so check out the post to learn more.

Comments closed

Controlling Power BI Chart Ranges with DAX

Published 2023-09-12 by Kevin Feasel

Marco Russo and Alberto Ferrrari control the horizontal, Marco Russo and Alberto Ferrari control the vertical:

DAX is a powerful tool in the hands of a Power BI developer. Using simple DAX formulas, you can not only compute interesting metrics but also customize the behavior of Power BI visuals. In this article, we use DAX to control the range of charts to obtain more coherent visualizations.

Read on to see how.

Comments closed

Documenting Power BI Workspaces with Fabric Notebooks

Published 2023-09-12 by Kevin Feasel

Prathy Kamasami shares a use case for notebooks in Microsoft Fabric:

If you are a consultant like me, you know how hard it can be to access Power BI Admin API or Service Principal. Sometimes, you need to see all the workspaces you have permission for and what’s inside them. Well, I found with MS Fabric, we can use notebooks and achieve it with a few steps:

Read on for an enumeration of those four steps, as well as detailed instructions for each.

Comments closed

Sending Azure Cost Management Data to Azure Data Explorer

Published 2023-09-12 by Kevin Feasel

Brad Watts writes out some cost data:

Understanding your Azure Spend is one of the most important things you do as an Azure customer. Azure Cost Management is built into the platform to provide you insights. But we live in a world of data and looking at the Azure Cost Management data in a silo may not meet your organization’s needs. In those situations, we can solve that need by putting your Cost Management data into an analytical platform like Azure Data Explorer or Microsoft Fabric KQL Database. Here we can bring in or join additional data that’s useful, run ad-hoc queries and build visualization tying it all together.

Using the below repository, you’ll be able to utilize Azure Cost Management exports to setup an automated process that ingests the cost data into ADX or Fabric KQL Database.

There are several steps involved, but as Brad points out, you can do this either with Microsoft Fabric or with classic Azure Data Factory + Azure Data Explorer. I’d also throw in Azure Synapse Analytics, but that’s not as in vogue anymore.

Werner Zirkel also has a great comment showing how you can cut out most of the steps with Event Grid.

Comments closed

Thoughts on NOLOCK

Published 2023-09-12 by Kevin Feasel

Erik Darling has some thoughts:

And generally, the more NOLOCK hints I see, the more money I know I’m going to make.

It shows me four things right off the bat:

The developers need a lot of training

The code needs a lot of tuning

The indexes need a lot of adjusting

There are probably some serious bugs in the software

Perhaps the only other thing that signals just how badly someone needs a lot of help is hearing “we’re an Entity Framework only shop”.

Cha-ching.

I have to admit, even being a consultant doesn’t soften the pain of walking into a place and seeing people use NOLOCK like they picked up a fresh pallet of it from Costco and need to use it up before it goes bad.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30

Day: September 12, 2023