Press "Enter" to skip to content

Month: January 2024

Preserving Non-Occurring Levels in R

Sebastian Sauer saves the levels:

The summary table does not show the level TRUE, as it does not occur in the data. This can be problematic if the data is unknown before summarizing and you expect that both/all levels (TRUE, FALSE) occur. Just imagine that a subsequent function will count the level TRUE and the level FALSE. If one level is missing, your system may break down.

Click through for a solution in which, even if your dataset is missing a particular level (i.e., a value of a categorical variable), you will still see it in the final output. That way, if you train a model on this data and the missing level shows up in your test dataset or in the wild, it won’t cause an error.
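
Sebastian’s post works in R, but the same idea carries over to pandas: declare the full set of levels up front so zero-frequency categories survive the count. A minimal sketch of that analogous approach (the data here is made up for illustration, and this is not Sebastian’s code):

```python
import pandas as pd

# Toy data: the column only ever contains False; True never occurs.
flags = pd.Series([False, False, False])

# A naive count silently drops the missing level.
print(flags.value_counts())
# False    3

# Declaring both levels as explicit categories keeps True around with a
# count of 0, so downstream code that expects both levels still works.
flags_cat = flags.astype(pd.CategoricalDtype(categories=[False, True]))
print(flags_cat.value_counts())
# False    3
# True     0
```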


The Art of the Code Review

Phil Booth shares some recommendations:

First, let’s establish what the point of code review is and also what it isn’t.

The number one, most important reason to review code is shared ownership. “Ownership” can be tricky to define in code terms, but mostly it’s a feeling. It means you understand the code, that you feel empowered to change it, and that you feel a responsibility to maintain it.

Click through for Phil’s thoughts on what makes for a good code review. I’ve found that the over-the-shoulder code review isn’t nearly as effective as you’d hope, and a proper code review can take a considerable amount of time, up to hours or days for a large change.


Buy that Keyboard

Andy Levy shares some good advice:

The holidays have passed and it’s a new year. You probably have a gift card or two and haven’t decided how to use it yet. Allow me to help:

Buy that fancy keyboard you’ve been coveting. Yes, the $100+ model. And get the good mouse/trackball while you’re at it. Just do it.

Back in my formative days, I would often get the cheapest keyboard and mouse to add a little “budget” flair to my custom PC builds. But nowadays, I highly recommend against that approach for the same reasons Andy does. A $100 keyboard isn’t guaranteed to be better than a $50 keyboard, but both are typically going to be better than a $10 keyboard. And if you have a nice enough computer store around, go try some of these out and see what fits best. I love mechanical keyboards—especially when I had the chance to annoy the people around me with a buckling spring keyboard—and there are a variety of switch types requiring different levels of pressure. Do a little digging and find the keyboard and mouse that work best for you.


Working with Erik Darling’s Stored Procedures in Azure SQL DB

Josephine Bush tries out some stored procedures:

Erik Darling, founder of Darling Data, has created these fantastic stored procedures to query SQL Server more efficiently to get health, log, or performance information. I will go through using them in an Azure SQL Database here, since I don’t manage any SQL Servers anymore.

Read on to see which ones you can use in Azure SQL DB and which require SQL Server.


Linked Servers from SQL MI using Azure Entra ID

Luis Aranda has the first of a two-part series:

Lately, we have seen some customers interested in the options available to use linked servers from Managed Instance with Entra Authentication (formerly Azure Active Directory). It is certainly possible to create linked servers on SQL Managed Instance (SQL MI) to connect to other PaaS databases, such as other SQL MIs, Azure SQL Databases, or Synapse databases, using Entra Authentication.

Click through to see how you can do this using a managed identity. In the next article, Luis promises to show us how to do it with pass-through authentication, so you use your credentials instead of the managed identity’s credentials to access the remote server.


VARCHAR() in Microsoft Fabric Lakehouses and SQL Endpoints

Gerhard Brueckl models some data:

Defining data types and knowing the schema of your data has always been a crucial factor for performant data platforms, especially when it comes to string data types, which can potentially consume a lot of space and memory. For lakehouses in general (not only Fabric Lakehouses), there is usually only one data type for text data: a generic STRING of arbitrary length. In terms of Apache Spark, this is StringType(). While this applies to Spark dataframes, it is not entirely true for Spark tables – here is what the docs say:

Read through for more information on that, as well as how to define a table in a Microsoft Fabric lakehouse using VARCHAR(). The display is a little weird, but Greg Low explains why in the comments.
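
To make the distinction concrete, here is a minimal sketch of declaring a VARCHAR column from a Spark notebook (the table and column names are hypothetical; see Gerhard’s post for the Fabric-specific behavior):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# VARCHAR(n) is accepted in Spark SQL DDL (Spark 3.1+): the column is stored
# as a plain string under the hood, but writes longer than n characters
# are rejected.
spark.sql("""
    CREATE TABLE customers (
        customer_id   INT,
        customer_name VARCHAR(100)
    ) USING DELTA
""")

# DESCRIBE surfaces the declared type.
spark.sql("DESCRIBE TABLE customers").show()
```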


Exploring the gRPC API in Spark Connect with .NET

Ed Elliott continues a series on Spark Connect. First, Ed builds out something DataFrame API-ish:

So there are two goals of this post: the first is to take a look at Apache Arrow and how we can do things like show the output from DataFrame.Show; the second is to start to create objects that look more familiar to us, i.e. the DataFrame API.

If that’s not enough for you, Ed then shows how you can analyze a plan:

In this post we will continue looking at the gRPC API, specifically the AnalyzePlan method, which takes a plan and analyzes it. To be honest, I expected this to be longer but decided just to do the AnalyzePlan method.

This has been a really fun series so far from Ed, so check these out. The only downside is that the people demand more F#. And by “the people,” I mostly mean that I would love to see F# examples.
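
Ed builds his .NET client by hand, but the stock PySpark client speaks the same protocol, so you can watch the same gRPC methods fire without writing any plumbing. A minimal sketch, assuming a local Spark Connect server on the default port and pyspark installed with the connect extra:

```python
from pyspark.sql import SparkSession

# Connect to a Spark Connect server over gRPC (15002 is the default port).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(5)

# Reading the schema sends an AnalyzePlan request: the unresolved plan goes
# over the wire and the resolved schema comes back.
print(df.schema)

# show() sends an ExecutePlan request; the rows return as Apache Arrow
# record batches, which the client renders as a text table.
df.show()
```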
