Press "Enter" to skip to content

Day: July 8, 2022

The Seedy Underbelly of Machine Learning Fitting

John Mount is not impressed with a fair amount of machine learning:

For this to actually happen we need the actual system to be in our concept space, a lot of training data, and an abundance of caution.

In practice what we see more and more is the training procedure in fact attacks the evaluation procedure. It doesn’t just improve the quality of the fit artifact, but through mere optimization accidentally exploits weaknesses in the measurement system itself. When this happens, fitting does the following.

In ML training, we often accidentally “teach to the test”: comparing models against the same test data over and over selects for models which happen to fit that test data better. As John notes, this can come about in two separate ways, and if you don’t define your optimization strategy carefully, you can accidentally train models which optimize for artifacts of the data rather than anything realistic. A classic example is the neural network which could pick out malignant tumors from non-malignant tumors not because of any property of the tumor itself, but because the malignant tumor images all had rulers in them and the non-malignant images did not. Read the whole thing for a second pitfall you can hit when training models.

Recreating a Shiny App with Plumber and ReactJS

Liam Kalita starts a new series:

Being able to host static content on RStudio Connect means we can host ReactJS applications on the platform. React is a great framework for developing web applications, with a lot of power and flexibility when creating user interfaces. Separating {shiny} applications into a user interface and a data processing API has its advantages.

In this blog series, we will guide you through creating the application from the RStudio tutorial for creating a {shiny} app, except we’ll be attempting it using ReactJS and an R {plumber} API instead of {shiny}. In this blog, part 1, we will be introducing you to the technologies we will need for the tutorial.

Read on for the essentials of what plumber and ReactJS are and why you might use each of them.

Building Custom Widgets for Azure Data Studio

Esat Erkec builds a widget:

One of the most advantageous features of ADS is that it allows the creation of customized widgets. With the help of the widgets, we can easily visualize the result of the queries using different graph types. In this context, building the performance monitoring widgets can be a reasonable approach so that we can track the performance metrics readily. Now, let’s learn how to build a custom widget with a very straightforward example.

I haven’t tried this before in Azure Data Studio but I can see the benefit, especially if you have a common set of queries you intend to run to observe the status of a given server.
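
Esat covers the mechanics in the post; roughly speaking, a widget is a bit of JSON in your ADS settings pointing at a saved query file and a chart type, so the interesting part is the query itself. Purely as an illustration (this example is mine, not from the post), a performance-monitoring widget might sit on top of something like this:

-- Illustrative backing query for a monitoring widget: top waits on the instance.
-- A widget could chart wait_time_seconds by wait_type.
SELECT TOP (5)
    wait_type,
    wait_time_ms / 1000.0 AS wait_time_seconds,
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE N'SLEEP%'
ORDER BY wait_time_ms DESC;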

Using JSON_PATH_EXISTS() in SQL Server

Hasan Savran shows how the JSON_PATH_EXISTS() function works in SQL Server:

Schemas can easily change if you save your data in JSON format. It is very easy to add or remove properties from JSON documents. When the data model changes quickly, you might need to worry about if the property you are looking for exists in the documents. If the path you are looking for does not exist in some documents, you need to handle the exception in some way. JSON_PATH_EXISTS comes to your help in situations like that. It tests whether a specified path exists in the input JSON.

Read on for the syntax and examples of use.
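
As a quick taste of the behavior (using a made-up document; the function arrived with Azure SQL Database and SQL Server 2022), it returns 1 when the path exists and 0 when it does not:

DECLARE @doc nvarchar(max) = N'{"customer": {"name": "Ada", "address": {"city": "Lincoln"}}}';

SELECT
    JSON_PATH_EXISTS(@doc, '$.customer.address.city') AS HasCity,  -- 1: the path exists
    JSON_PATH_EXISTS(@doc, '$.customer.phone') AS HasPhone;        -- 0: the path is missing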

Comparing Column-Level Encryption to Always Encrypted

Tom Collins compares and contrasts:

A common question from developers & data owners is what benefits does Always Encrypted offer over column level encryption, aka cell level encryption? First thing to understand is what are the basic differences between the two methods: column-level encryption vs Always Encrypted.

For as much as I appreciate Always Encrypted, it seems I use column-level encryption about an order of magnitude more often.
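
If you haven’t seen the column-level (cell-level) side of the comparison, here is a rough sketch of its shape, with object names invented for illustration. The key contrast is that you manage the keys and call the encryption functions yourself in T-SQL, and the database engine sees the plaintext; with Always Encrypted, the client driver does that work and the engine never does.

-- Cell-level encryption sketch; assumes a database master key already exists.
CREATE CERTIFICATE SsnCert WITH SUBJECT = 'SSN protection';
CREATE SYMMETRIC KEY SsnKey
    WITH ALGORITHM = AES_256
    ENCRYPTION BY CERTIFICATE SsnCert;

OPEN SYMMETRIC KEY SsnKey DECRYPTION BY CERTIFICATE SsnCert;

-- Encrypt on write (dbo.Person and its columns are hypothetical)...
UPDATE dbo.Person
SET SSNEncrypted = ENCRYPTBYKEY(KEY_GUID('SsnKey'), SSN);

-- ...and decrypt explicitly on read; the engine sees the plaintext here.
SELECT CONVERT(varchar(11), DECRYPTBYKEY(SSNEncrypted)) AS SSN
FROM dbo.Person;

CLOSE SYMMETRIC KEY SsnKey;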

Creating Goal Post Tables

Aaron Bertrand solves a problem of unchecked growth:

Many of us deal with logging tables that grow unchecked for years, while reporting queries against them are expected to continue running quickly regardless of the size of the table. A common issue when querying by a date range is that the clustered index is on something else (say, an IDENTITY column). This will often result in a full clustered index scan, since SQL Server doesn’t have an efficient way to find the first or last row within the specified range. This means the same query will get slower and slower as the table grows.

I like this solution but only in cases where you expect no after-the-fact updates to dates, such as late-arriving date information or “fixing” the date later. With Aaron’s log example, where we expect log entries to be immutable, this can work really well in a “pseudo-materialized view” sort of way.
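
Aaron’s actual implementation is in the post; as a rough sketch of the idea (with table and column names invented here), you maintain a small table mapping each day to the first identity value logged that day, then rewrite date-range filters as a seekable range on the clustered key:

-- Hypothetical immutable log, clustered on an IDENTITY column.
CREATE TABLE dbo.EventLog
(
    LogID   int IDENTITY PRIMARY KEY CLUSTERED,
    LogDate datetime2 NOT NULL,
    Message nvarchar(400) NOT NULL
);

-- "Goal post" table: one row per day, kept in sync as log rows arrive.
CREATE TABLE dbo.EventLogGoalPost
(
    LogDay     date NOT NULL PRIMARY KEY,
    FirstLogID int  NOT NULL
);

-- A date-range report becomes a range seek on the clustered index.
DECLARE @From date = '20220701', @To date = '20220708';

SELECT l.LogID, l.LogDate, l.Message
FROM dbo.EventLog AS l
WHERE l.LogID >= (SELECT MIN(FirstLogID) FROM dbo.EventLogGoalPost WHERE LogDay >= @From)
  AND l.LogID <  COALESCE(
        (SELECT MIN(FirstLogID) FROM dbo.EventLogGoalPost WHERE LogDay > @To),
        2147483647);  -- open-ended upper bound when @To is the latest day logged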

Expanding Column Width in Powershell Results

Kenneth Fisher supersizes the screen:

Notice the ellipsis (the three dots). That’s showing us that the name was too long and ended up being truncated. Given that I’ve been doing this for a little while now I’m almost completely certain that if I send this as it is the users are going to want to know full names. And with my luck I’ll end up having to give them each truncated string individually. On the theory that if I have time to do it twice I probably have time to do it right the first time, let’s figure out how to expand the columns. Fortunately, as with most things Powershell, there’s a cmdlet for that.

Read on to see what the process looks like.

Buffer Pool Parallel Scans in SQL Server 2022

David Pless talks about an internal optimization in SQL Server 2022:

Operations such as database startup/shutdown, creating a new database, file drop operations, backup/restore operations, Always On failover events, DBCC CHECKDB and DBCC Check Table, log restore operations, and other internal operations (e.g., checkpoint) will all benefit from Buffer Pool Parallel Scan.

In SQL Server 2019 and previous releases, operations that require scanning the buffer pool can be slow, especially on large memory machines such as the M-series Azure SQL virtual machine and large on-premises SQL Server environments. Even log restore operations and availability group failover operations can be impacted. Currently, there’s no way to eliminate this issue prior to SQL Server 2022, and dropping buffers using DBCC DROPCLEANBUFFERS would likely result in some degree of performance degradation as any subsequent query executions will have to reread the data from the database files increasing I/O.

Read on to understand why these operations can be slow on high-memory boxes and how much of a benefit you might get on certain administrative activities.
