
Day: January 19, 2024

tidyAML Updates

Steven Sanderson has been busy. First up, a post on tidyAML updates:

One of the standout features in this release is the addition of extract_regression_residuals(). This function empowers users to delve deeper into regression models, providing a valuable tool for analyzing and understanding residuals. Whether you’re fine-tuning your models or gaining insights into data patterns, this enhancement adds a crucial layer to your analytical arsenal.

Then, Steven goes into detail on .drop_na:

In the newest release of tidyAML there has been an addition of a new parameter to the functions fast_classification() and fast_regression(). The parameter is .drop_na and it is a logical value that defaults to TRUE. This parameter is used to determine if the function should drop rows with missing values from the output if a model cannot be built for some reason. Let’s take a look at the function and its arguments.

After that, we get to see an updated function:

In response to user feedback, we’ve enhanced the internal_make_wflw_predictions() function to provide a comprehensive set of predictions. Now, when you make a call to this function, it includes:

  1. The Actual Data: This is the real-world data that your model aims to predict. Having access to this information helps you assess how well your model is performing on unseen instances.
  2. Training Predictions: Predictions made on the training dataset. This is essential for understanding how well your model fits the data it was trained on.
  3. Testing Predictions: Predictions made on the testing dataset. This is crucial for evaluating the model’s performance on data it hasn’t seen during the training phase.
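To make the three pieces concrete, here is a minimal sketch in Python using only the standard library. It is not tidyAML's implementation, and it is not R: the function names (`fit_line`, `make_predictions`), the row layout, and the toy data are all assumptions made for illustration. It fits a one-variable least-squares line on the training rows, then returns the actual values alongside predictions tagged as training or testing.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x on one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def make_predictions(xs, ys, n_train):
    """Return rows tagged 'actual', 'training', or 'testing' (hypothetical layout)."""
    # Fit only on the first n_train observations.
    intercept, slope = fit_line(xs[:n_train], ys[:n_train])
    # 1. The actual data, kept so predictions can be compared against it.
    rows = [{"type": "actual", "x": x, "value": y} for x, y in zip(xs, ys)]
    # 2 and 3. Predictions, tagged by which split each observation came from.
    for i, x in enumerate(xs):
        kind = "training" if i < n_train else "testing"
        rows.append({"type": kind, "x": x, "value": intercept + slope * x})
    return rows


# Toy data that follows y = 2x exactly (an assumption for the example).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
rows = make_predictions(xs, ys, n_train=4)
```

Keeping everything in one long table with a type column means a single filter or group-by is enough to compare training error against testing error.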

You can also check out the package’s GitHub repository and see more.


Data Reading and Writing with arrow

Colin Gillespie performs two of the three R’s:

Apache Arrow is a cross-language development platform for in-memory data. As it’s in-memory (as opposed to data stored on disk), it provides additional speed boosts. It’s designed for efficient analytic operations, and uses a standardised language-independent columnar memory format for flat and hierarchical data. The {arrow} R package provides an interface to the ‘Arrow C++’ library – an efficient package for analytic operations on modern hardware.

There are many great tutorials on using {arrow} (see the links at the bottom of the post for example). The purpose of this blog post isn’t to simply reproduce a few examples, but to understand some of what’s happening behind the scenes. In this particular post, we’re interested in understanding the reading/writing aspects of {arrow}.

Read on to see it in action in R.
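If the phrase "columnar memory format" is new to you, the following is a rough sketch of the idea in Python with only the standard library. It does not use Arrow at all, and the table and column names are made up for the example. It stores the same small table once as a list of row records and once as one typed, contiguous buffer per column.

```python
from array import array

# Row-oriented: one record per row, with the fields of each row kept together.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented: one contiguous, typed buffer per column.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}

# Summing one column: the row layout visits every record,
# while the column layout scans a single buffer.
row_total = sum(r["price"] for r in rows)
col_total = sum(columns["price"])

# Size of the price column's buffer in bytes (3 values * 8 bytes each).
price_bytes = columns["price"].itemsize * len(columns["price"])
```

Both totals are the same; the difference is that an analytic query touching one column only has to read that column's buffer, which is the access pattern a columnar format is designed around.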


Exporting to CSV in Azure ML Designer

Tom LaRock saves a file:

The most popular feature in any application is an easy-to-find button saying “Export to CSV.” If this button is not visibly available, a simple right-click of your mouse should present such an option. You really should not be forced to spend any additional time on this Earth looking for a way to export your data to a CSV file.

Well, in Azure ML Studio, exporting to a CSV file should be simple, but is not, unless you already know what you are doing and where to look. I was reminded of this recently, and decided to write a quick post in case a person new to ML Studio was wondering how to export data to a CSV file.

Click through for one false start and then the correct answer.


Globs of Tabs in SSMS

Warwick Rudd has cramped environs:

Working in SQL Server Management Studio is potentially an everyday occurrence for you! And having to work with many queries open at the same time is probably the norm. Depending on the size of the screen you are working on, you are limited in the amount of screen real estate you have to work with.

Personally, I get frustrated with having to continually go to the open query drop down window to see what queries I have open and be able to cycle through them to make my life easier and be more productive.

Warwick shows off one built-in way to solve this problem. When I was a database developer, I would have 40-50 tabs open at a time sometimes. I used Tabs Studio (commercial product but it’s not that expensive if you’re buying for yourself) to manage all of that.


Storing SQL Server Backups in Cloudflare R2

Daniel Hutmacher saves a buck:

R2 is Cloudflare’s own implementation of AWS S3 storage, with some big benefits – one of them being no egress fees, which is great if you want to publish or distribute a lot of data (like I did with this demo database). In this post, I thought I’d briefly document how to set up R2, and how to use it to back up and restore your SQL Server databases.

You’ll need a Cloudflare account to follow along. The account and a lot of their services are free, but R2 storage obviously comes with a small cost. For scale, I’m running an almost-terabyte bucket at just a couple of dollars per month.

Given the number of times I’ve pushed Daniel’s excellent Chicago parking tickets database (including right now—it’s a great database that I’ve used in several presentations and videos!), the lack of egress charges is pretty big.


Hash Aggregates and Hash Joins in Postgres

Muhammad Ali plays matchmaker:

PostgreSQL employs various techniques for data joining and aggregation in its queries, among which the hash-based method stands out for its efficiency in particular situations and different data sizes. We will discuss hash joins and hash aggregates in PostgreSQL, providing insights on how they work and the parameters which influence these algorithms.

Read on to learn more. This looks fundamentally similar to hash matches in SQL Server, so if you’re familiar with that, the concepts should be pretty clear.
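For anyone who has not seen either engine's version, the general build-and-probe idea can be sketched in a few lines of Python. This is a textbook in-memory illustration, not PostgreSQL's executor code: the function names and sample tables are invented for the example, and it ignores memory limits (such as PostgreSQL's work_mem setting) and spilling to disk.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Inner join: hash the (ideally smaller) build side, then probe it."""
    # Build phase: bucket every build-side row by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe phase: look up each probe-side row's key in the hash table.
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


def hash_aggregate(rows, key, value):
    """SUM(value) GROUP BY key, keeping one running total per group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Hypothetical sample tables.
customers = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Grace"}]
orders = [
    {"cust_id": 1, "amount": 10.0},
    {"cust_id": 1, "amount": 5.0},
    {"cust_id": 2, "amount": 7.5},
    {"cust_id": 3, "amount": 1.0},  # no matching customer, so it drops out
]

joined = hash_join(customers, orders, "cust_id")
totals = hash_aggregate(joined, "name", "amount")
```

Both operations make a single pass over their input and rely on the hash table fitting in memory, which is why memory-related settings matter so much for these plan nodes.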
