
Day: April 22, 2020

Automated ML and Data Scientists

Sophia Rowland takes us through an experiment:

Ever since automated machine learning has entered the scene, people are asking, “Will automated machine learning replace data scientists?” I personally don’t think we need to be worried about losing our jobs any time soon. Automated machine learning is great at efficiently trying a lot of different options and can save a data scientist hours of work. The caveat is that automated machine learning cannot replace all tasks. Automated machine learning does not understand context as well as a human being. This means that it may not know to create specific features that are the norm for certain tasks. Also, automated machine learning may not know when it has created things that are irrelevant or unhelpful.

To strengthen my points, I created a competition between myself and two SAS Viya tools for automated machine learning. To show that we are really better together, I combined my work with automated machine learning and compared all approaches. Before we dive into the results, let’s discuss the task.

The results are in line with my expectations: a good automated ML tool will make life easier, but doesn’t replace the expert system of a human.


R Checkpoint Package Update Now in Beta

Hong Ooi announces that a revamp of the checkpoint package is now in beta:

Checkpoint has been around for nearly 6 years now, helping R users solve the reproducible research puzzle. In that time, it’s seen many changes, new features, and, inevitably, bug reports. Some of these bugs have been fixed, while others remain outstanding in the too-hard basket.

Many of these issues spring from the fact that it uses only base R functions, in particular install.packages, to do its work. The problem is that install.packages is meant for interactive use, and as an API, is very limited. For starters, it doesn’t return a result to the caller—instead, checkpoint has to capture and parse the printed output to determine whether the installation succeeded. This causes a host of problems, since the printout will vary based on how R is configured. Similarly, install.packages refuses to install a package if it’s in use, which means checkpoint must unload it first—an imperfect and error-prone process at best.

In addition to these, checkpoint’s age means that it has accumulated a significant amount of technical debt over the years. For example, there is still code to handle ancient versions of R that couldn’t use HTTPS, even though the MRAN site (in line with security best practice) now accepts HTTPS connections only.

Click through to see what’s in the new checkpoint package.
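
To make the "parse the printed output" problem concrete, here is a minimal sketch in Python rather than R. Everything in it is hypothetical: `noisy_install`, its messages, and the package name `examplepkg` are made up for illustration and are not how checkpoint or install.packages actually behave. The point is only that a caller who has to read printed text gets the wrong answer as soon as the configuration changes the wording.

```python
import contextlib
import io
import re


def noisy_install(package: str, locale: str = "en") -> None:
    """Hypothetical installer that reports success only by printing a message."""
    messages = {
        "en": f"* DONE ({package})",
        "de": f"* FERTIG ({package})",  # made-up translated message
    }
    print(messages[locale])


def install_succeeded(package: str, locale: str) -> bool:
    """Capture the printed output and guess the outcome from its wording."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        noisy_install(package, locale)
    pattern = rf"DONE \({re.escape(package)}\)"
    return re.search(pattern, buffer.getvalue()) is not None


# Same successful install, two configurations, two different answers.
english_result = install_succeeded("examplepkg", "en")
german_result = install_succeeded("examplepkg", "de")
print(english_result, german_result)
```

An installer that returned a status value to its caller would not need the regular expression at all, which is the limitation the excerpt describes.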


Fun with tempdb

Andy Mallon walks us through setting up tempdb:

There are three problems I’ve got to fix. I need to (1) remove those two extra files, (2) grow the tempdb log file, and (3) even out the size of the data files (and shrink them a little to make room for the larger log file). We’re going to tackle these in the reverse order from how I listed them–partially out of necessity, and partially because it’s going to be easier.

Click through to see how Andy sets up tempdb; it is a good approach to follow.


The Pain of Code Noise

Chris Johnson talks about a concept dear to me:

Basically code noise is anything that pulls your attention away from what the code is supposed to be doing, or obscures the true nature of the code in some way. It’s not something we consider enough when writing T-SQL code, but I think there is a lot to be said for writing code the next person will be able to read.

As a small example, I was debugging something recently and found that all of the insert statements had ORDER BY clauses. I couldn’t work out why these were making me so angry, after all it’s not doing anything to hurt performance, and in fact isn’t doing anything at all, until one of the other devs in the office pointed out that it’s one example of the code noise that the whole code base is filled with.

Chris provides a couple of examples of noise. My bottom line: develop to the minimum standard the computer needs (accurate data, fast enough, and so on), then spend the rest of your effort giving the humans who maintain the code a fighting chance. That makes everybody’s life easier.
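
As a rough illustration of the ORDER BY example, here is a small sketch that uses SQLite through Python's built-in sqlite3 module as a stand-in for SQL Server. The table names are made up, and SQLite is an assumption on my part, so this says nothing about SQL Server specifics such as identity columns or query plans. It only shows that the two inserts load the same set of rows, so the extra clause gives the reader something to puzzle over without changing what ends up in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (id INTEGER, name TEXT);
    INSERT INTO source VALUES (3, 'c'), (1, 'a'), (2, 'b');

    CREATE TABLE target_plain (id INTEGER, name TEXT);
    CREATE TABLE target_noisy (id INTEGER, name TEXT);

    -- The insert as it needs to be written.
    INSERT INTO target_plain SELECT id, name FROM source;

    -- The same insert with an ORDER BY that adds nothing to the result.
    INSERT INTO target_noisy SELECT id, name FROM source ORDER BY name;
    """
)


def rows_as_set(table: str) -> set:
    """Return the table's rows as a set, since a table has no inherent order."""
    return set(conn.execute(f"SELECT id, name FROM {table}"))


same_contents = rows_as_set("target_plain") == rows_as_set("target_noisy")
print(same_contents)
```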


Ownership Chaining in SQL Server

K. Brian Kelley walks us through the concept of ownership chaining in SQL Server:

Ownership chaining is a security feature in SQL Server which occurs when all of the following conditions are true:

– A user (which could be an app through a login/service account) tries to access an object that makes a reference to another object. For instance, the user tries to execute a stored procedure that accesses other objects or a SELECT from a view that accesses other tables.
– The user has access to the first object, such as EXECUTE rights on the stored procedure or SELECT rights on the view.
– Both objects have the same owner.

In this case, SQL Server will see the chain between the object the user called and the object being referenced. SQL Server will also determine that the owner for both objects is the same. When those conditions are met, SQL Server will create the ownership chain.

Read on for an in-depth example of ownership chaining and how it solves certain problems around managing database rights.
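
The three conditions can be sketched as a toy model in Python. This is a simplification for illustration only, not how SQL Server implements permission checks: the `DbObject` class, the `can_access` function, and the user and object names are all hypothetical, and the model assumes every referenced object needs SELECT. When the referenced object has the same owner as the calling object, the permission check on it is skipped; when the owners differ, the chain breaks and the user needs an explicit grant.

```python
from dataclasses import dataclass, field


@dataclass
class DbObject:
    name: str
    owner: str
    references: list = field(default_factory=list)  # objects this one accesses
    grants: set = field(default_factory=set)         # (user, permission) pairs


def can_access(user, obj, permission, caller_owner=None):
    """Toy permission check that honors an ownership chain."""
    # The chain holds when the calling object and this object share an owner.
    chained = caller_owner is not None and caller_owner == obj.owner
    if not chained and (user, permission) not in obj.grants:
        return False
    # Simplifying assumption: every referenced object requires SELECT.
    return all(
        can_access(user, ref, "SELECT", caller_owner=obj.owner)
        for ref in obj.references
    )


orders = DbObject("Orders", owner="dbo")  # alice has no rights on the table
same_owner_view = DbObject(
    "vOrders", owner="dbo", references=[orders], grants={("alice", "SELECT")}
)
other_owner_view = DbObject(
    "vOrdersSales", owner="sales", references=[orders], grants={("alice", "SELECT")}
)

direct = can_access("alice", orders, "SELECT")
through_chain = can_access("alice", same_owner_view, "SELECT")
broken_chain = can_access("alice", other_owner_view, "SELECT")
print(direct, through_chain, broken_chain)
```

In this model, alice cannot read the table directly but can read it through the view with the matching owner, which is the rights-management benefit the excerpt points to.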


Tempdb Issues You Might Have

Fabiano Amorim walks through a few of the tricky issues around tempdb:

One of many performance improvements that came with SQL Server 2014 is that it doesn’t flush dirty pages created in a minimally logged operation on tempdb. This gives you the benefit of having faster (compared to prior versions) inserts, but it caused another problem as those allocated pages may take a lot of time to be removed from the buffer pool data cache. Before discussing the problem, quickly look at the benefit and then understand some important concepts of flush dirty pages on tempdb.

Click through to learn more about the fix, and then a fix to the fix…which opened up a new avenue to fix. This kind of thing is why operating at scale is so difficult: the solution to one problem often becomes the avenue to a new problem.


Refreshing Selected Tables in Power BI Desktop

Gilbert Quevauvilliers answers one of the great mysteries in life:

When working with Power BI datasets I often want to refresh multiple tables, BUT not all the tables in my model. (This often happens when I made a change to underlying data sources and I want to make sure it is working as expected in Power BI Desktop)

I recently found that I could do it by doing the following steps below.

When you have one or two big tables and you don’t make any changes to those tables, it can be a pain waiting for them to refresh. This is a great alternative.


Operations Testing with Pester

Sheldon Hull takes us through using Pester to automate operations tasks:

In my example, let’s start small and say you just have PowerShell, and some servers.

What I’ve discovered is that to actually validate DevOps oriented work is completed, you typically go through the equivalent of what a Cucumber test would have. This “checklist” of validations is often manually performed, lacking consistency and the ability to scale or repeat with minimal effort.

Consider an alternative approach to helping solve this issue, and expanding your ability to automate the tedious testing and validation of changes made.

Read on for an example as well as some additional thoughts from Sheldon.
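
To give a flavor of the idea without Pester itself, here is a minimal Python sketch of a validation checklist turned into repeatable checks. It is a stand-in, not Sheldon's code: the `server_state` snapshot, the check names, and the thresholds are all made up, and a real run would gather the state from the servers. In Pester, each entry would be an It block inside a Describe block.

```python
from typing import Callable

# Hypothetical snapshot of one server's state; a real check would query the server.
server_state = {
    "service_running": True,
    "free_disk_gb": 12.5,
    "open_ports": [1433],
}

# Each checklist item is a name plus a function that returns True when it passes.
checks: dict[str, Callable[[dict], bool]] = {
    "SQL service is running": lambda s: s["service_running"],
    "At least 20 GB free disk": lambda s: s["free_disk_gb"] >= 20,
    "Port 1433 is open": lambda s: 1433 in s["open_ports"],
}


def run_checklist(state: dict, checklist: dict) -> dict:
    """Run every check against the state and record pass or fail by name."""
    return {name: bool(check(state)) for name, check in checklist.items()}


results = run_checklist(server_state, checks)
failed = [name for name, ok in results.items() if not ok]

for name, ok in results.items():
    print(f"[{'PASS' if ok else 'FAIL'}] {name}")
```

Because the checklist is data, the same checks can be run against every server after every change, which is where the consistency and repeatability come from.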
