
Author: Kevin Feasel

Working with R and the Windows Command Line

Tomaz Kastrun takes us through calling CMD commands from R:

From time to time, when developing in R, working and wrangling data, or preparing for machine learning projects, the time comes when one still needs to access operating system commands from within R.

In this blog post, let’s take a look at some of the most useful cmd commands to call from R. Please note that these cmd commands apply only to the Windows environment; for Linux/macOS, the system commands need to change slightly, but the wrapper R code should remain the same.

The need does come up, so it’s good to have that knowledge at hand.
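
As a quick taste of the pattern (this sketch is mine, not from Tomaz’s post, and assumes a Windows machine where cmd.exe is available):

# Minimal sketch: wrapping Windows cmd commands in R (assumes a Windows environment)
shell("dir", intern = FALSE)                                          # run dir and print its output
files <- system2("cmd", args = c("/c", "dir", "/b"), stdout = TRUE)   # capture output as a character vector
print(files)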


2020 Data Professional Salary Survey Results

Brent Ozar has another year of salary data for us:

A few things to know about it:

– The data is public domain. The license tab makes it clear that you can use this data for any purpose, and you don’t have to credit or mention anyone.
– The spreadsheet includes the results for all 4 years (2017-2020). We’ve gradually asked different questions over time, so if a question wasn’t asked in a year, the answers are populated with Not Asked.
– The postal code field was totally optional, and may be wildly unreliable. Folks asked to be able to put in small portions of their zip code, like the leading numbers.
– Frankly, anytime you let human beings enter data directly, the data can be pretty questionable – for example, there were 14 folks this year who entered annual salaries below $500. If you’re doing analysis on this, you’re going to want to discard some outliers.

It’s on my agenda (somewhere…probably a bit further back than I’d like) to dig into this year’s data and try to come up with something a little more comprehensive now that there are four years of data.
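
As a purely hypothetical starting point for that kind of analysis (the file name and column name below are assumptions, not the survey spreadsheet’s actual layout), discarding the obviously bad entries might look like this in R:

# Hypothetical sketch: load the survey results and drop implausible salaries before analysis
library(readxl)
library(dplyr)
salaries <- read_excel("data_professional_salary_survey.xlsx")    # assumed file name
cleaned <- salaries %>%
  filter(SalaryUSD >= 10000, SalaryUSD <= 500000)                 # assumed column name; trims obvious outliers
summary(cleaned$SalaryUSD)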


Not All Cursors are Bad

Erik Darling doesn’t want to mess with your cursors (that much):

Read the code. Understand the requirements.

I tune queries all day long. The number of times someone has said THIS CURSOR IS A REAL BIG PROBLEM and been right is pretty small.

Often, there was a tweak to the cursor options, or a tweak to the query the cursor was calling (or the indexes available to it) that made things run in a more immediate fashion. I want to tune queries, not wrestle with logic that no one understands. Old code is full of that.

I’ll grant the premise (and add my own case where a cursor was necessary to solve the problem), though I did work at one company where the entire product logic was driven by nested cursors 5 or 6 levels deep. Those were really big problems. I think you’ll find the problem most frequently in shops with a heavy dose of Oracle, as Oracle cursors do perform well.
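
For reference, the kind of option tweak Erik alludes to often amounts to declaring the cursor as LOCAL FAST_FORWARD instead of taking the defaults; the table and column names here are made up for illustration:

-- Sketch of a leaner cursor declaration; table and column names are placeholders
DECLARE c CURSOR LOCAL FAST_FORWARD   -- forward-only, read-only, usually much cheaper than the defaults
FOR
    SELECT SomeID FROM dbo.SomeTable ORDER BY SomeID;

OPEN c;
DECLARE @SomeID int;
FETCH NEXT FROM c INTO @SomeID;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row work goes here
    FETCH NEXT FROM c INTO @SomeID;
END;
CLOSE c;
DEALLOCATE c;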


Execution Plans: Check the First Operator

Grant Fritchey reminds us to look at the first operation when viewing an execution plan:

The first time you see a new execution plan that you’re examining to fix a performance problem, something broken, whatever, you should always start by looking at the first operator.

First Operator

The first operator is easily discerned (with an exception). It’s the very first thing you see in a graphical execution plan, at the top, on the left. It says SELECT in this case (the screenshot is in the original post).

It’s easy to overlook, but Grant gives some good reasons not to do so.
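
If you want some plans to practice on, a generic query along these lines (my sketch, not Grant’s) pulls cached plans you can open in SSMS and inspect, starting with that first operator’s properties:

-- Generic sketch: grab a few expensive cached plans to open and inspect
SELECT TOP (10)
       qs.execution_count,
       qs.total_worker_time,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_worker_time DESC;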


Working with App Locks in SQL Server

Nisarg Upadhyay explains how to use two built-in stored procedures to take and release application locks:

For example, I want to prevent users from executing the same stored procedure at the same time. To demonstrate the scenario, I have created a stored procedure named procInsertEmployees, which inserts data into the tblEmployee table. I want to make sure that no one can access the stored procedure until it has inserted the data into the tblEmployee table. Moreover, I have added a WAITFOR DELAY '00:00:15' statement to simulate blocking for 15 seconds.

Application locks also allow for more complicated scenarios and you can get a bit creative when assigning resources (such as combining a process name and a parent ID).
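
The core pattern is sp_getapplock and sp_releaseapplock around the protected work; the resource name and the body below are placeholders rather than Nisarg’s actual procedure:

-- Sketch of the application-lock pattern; resource name and protected work are placeholders
BEGIN TRANSACTION;

DECLARE @lockResult int;
EXEC @lockResult = sp_getapplock
     @Resource    = N'procInsertEmployees',
     @LockMode    = N'Exclusive',
     @LockOwner   = N'Transaction',
     @LockTimeout = 15000;            -- wait up to 15 seconds for the lock

IF @lockResult >= 0
BEGIN
    -- protected work goes here (e.g. the INSERT into the employee table)
    PRINT 'Lock acquired; doing the work.';
END
ELSE
    PRINT 'Could not acquire the lock; another session holds it.';

COMMIT TRANSACTION;                   -- a transaction-owned app lock is released at commit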


Spark on Docker on YARN on Cloud

Adam Antal has included all of the layers:

Bringing your own libraries to run a Spark job on a shared YARN cluster can be a huge pain. In the past, you had to install the dependencies independently on each host or use different Python package management software. Nowadays Docker provides a much simpler way of packaging and managing dependencies so users can easily share a cluster without running into each other, or waiting for central IT to install packages on every node. Today, we are excited to announce the preview of Spark on Docker on YARN available on the CDP DataCenter 1.0 release.

Joking about stack length aside, this looks really useful.
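
For a flavor of what submission looks like, something along these lines asks YARN to run the driver and executors inside a Docker image; the image name and script are placeholders, and the exact configuration keys can vary by version and distribution:

# Sketch of a spark-submit call using YARN's Docker runtime; image, script, and keys may vary by version
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=my-registry/pyspark-deps:latest \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=my-registry/pyspark-deps:latest \
  my_job.py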


Upgrading SQL Server Windows Docker Containers

Emanuele Meazzo shows how you can upgrade SQL Server if you are using a Windows Docker container instead of Linux:

With the 1st CU for SQL 2019 released just yesterday, and Microsoft updating the docker image right away, the only natural response for me was to update the docker instance that I showed you how to deploy a few months back.

In theory, a docker container can’t really be “updated”; containers are meant to be stateless machines that you spin up and down in response to changes in demand. What we’re technically doing is creating a new container, based on a new image, that has the same configuration and uses the same persistent storage as the old one.

Read on to see how you can perform this upgrade without losing your data.
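
The general shape of the operation is pull, stop, remove, and re-run against the same storage; the container name, image tag, and paths below are placeholders, not the ones from Emanuele’s setup:

# Sketch of the replace-the-container approach on Windows; names, tags, and paths are placeholders
docker pull your-registry/mssql-windows:2019-CU1     # grab the updated image
docker stop sql2019
docker rm sql2019                                    # removes the container, not the data volume
docker run -d --name sql2019 -p 1433:1433 `
  -v C:\SQLData:C:\SQLData `
  your-registry/mssql-windows:2019-CU1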


Optimal Kafka Partitioning

Paul Brebner is on a quest:

This blog provides an overview of the two fundamental concepts in Apache Kafka: Topics and Partitions. While developing and scaling our Anomalia Machina application we have discovered that distributed applications using Kafka and Cassandra clusters require careful tuning to achieve close to linear scalability, and critical variables included the number of Kafka topics and partitions. In this blog, we test that theory and answer questions like “What impact does increasing partitions have on throughput?” and “Is there an optimal number of partitions for a cluster to maximize write throughput?” And more!

Read on for some interesting findings.
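
If you want to run your own partition experiments, the stock tooling makes it easy to create and then grow a topic; the broker address and topic name here are placeholders:

# Sketch: create a topic with 6 partitions, then raise the count to test throughput at a higher number
kafka-topics.sh --bootstrap-server broker:9092 \
  --create --topic anomaly-events --partitions 6 --replication-factor 3
kafka-topics.sh --bootstrap-server broker:9092 \
  --alter --topic anomaly-events --partitions 12    # partition counts can be increased, never decreased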


Partitioning on Columnstore Table Loading

Aaron Bertrand continues a series around learning about columnstore indexes:

In part 1, I showed how both page and columnstore compression could reduce the size of a 1TB table by 80% or more. While I was impressed I could shrink a table from 1TB to 50GB, I wasn’t very happy with the amount of time it took (anywhere from 2 to 14 hours). With some tips graciously borrowed from folks like Joe Obbish, Lonny Niederstadt, Niko Neugebauer, and others, in this post I will try to make some changes to my original attempt to get better load performance. Since the regular columnstore index didn’t compress better than page compression on this data set, and took 13 hours longer to get there, I’ll focus solely on the more advanced solution using COLUMNSTORE_ARCHIVE compression.

Click through for part 2.
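
For context, archive compression is just a compression option on the columnstore index itself; the table and index names in this sketch are placeholders, not Aaron’s:

-- Sketch: build (or rebuild) a clustered columnstore index with archive compression
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- Or switch an existing columnstore index to archive compression
ALTER INDEX CCI_FactSales ON dbo.FactSales
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);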
