Month: July 2018

Sharing R Notebooks

Published 2018-07-10 by Kevin Feasel

Hanyu Cui and Hossein Falaki show how to share a notebook using RMarkdown:

RMarkdown is the dynamic document format RStudio uses. It is normal Markdown plus embedded R (or any other language) code that can be executed to produce outputs, including tables and charts, within the document. Hence, after changing your R code, you can just rerun all code in the RMarkdown file rather than redo the whole run-copy-paste cycle. And an RMarkdown file can be directly exported into multiple formats, including HTML, PDF, and Word.

Click through for the demo.

Comments closed

Pipe-Friendly Functions In R

Published 2018-07-10 by Kevin Feasel

William Doane gives some tips on writing pipe-friendly functions in R:

Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator and is most often pronounced as “then”. For example, “take the mtcarsdata.frame, THEN take the head of it, THEN…” and so on.

For a function to be pipe friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type—possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side.

Click through for a couple of examples. H/T R-Bloggers

Comments closed

Using Azure Logic Apps For Database Tasks

Published 2018-07-10 by Kevin Feasel

Arun Sirpal shows off a technique he has developed to run maintenance jobs against Azure SQL Database databases:

I have been using Azure Logic apps recently to build some workflows to gather data from external sources ultimately inserting it into a database for reporting, I then got thinking, how can this be useful for “DBA” based tasks? Let’s take a step back for a minute, what are logic apps? It is a technology that helps integrate apps, data, systems, and services across enterprises. Key parts of a logic app solution are connectors, triggers and actions.

I decided that I wanted to execute a stored procedure every 6 hours to capture wait statistics for my Azure SQL Database and log the information in a table.

This is what my workflow looks like.

There are a few alternatives available, so it’s nice to see an example of one of them.

Comments closed

Grouping And Aggregating In SQL, R, And Python

Published 2018-07-10 by Kevin Feasel

Dejan Sarka has a few examples of aggregation in different languages, including SQL, R, and Python:

The query calculates the coefficient of variation (defined as the standard deviation divided the mean) for the following groups, in the order as they are listed in the GROUPING SETS clause:

Country and education – expression (g.EnglishCountryRegionName, c.EnglishEducation)

Country only – expression (g.EnglishCountryRegionName)

Education only – expression (c.EnglishEducation)

Over all dataset- expression ()

Note also the usage of the GROUPING() function in the query. This function tells you whether the NULL in a cell comes because there were NULLs in the source data and this means a group NULL, or there is a NULL in the cell because this is a hyper aggregate. For example, NULL in the Education column where the value of the GROUPING(Education) equals to 1 indicates that this is aggregated in such a way that education makes no sense in the context, for example aggregated over countries only, or over the whole dataset. I used ordering by NEWID() just to shuffle the results. I executed query multiple times before I got the desired order where all possibilities for the GROUPING() function output were included in the first few rows of the result set. Here is the result.

GROUPING SETS is an underappreciated bit of SQL syntax.

Comments closed

Index That Column Or Include It?

Published 2018-07-10 by Kevin Feasel

Jeanne Combrinck lays out her recommendations on whether to make a particular column part of an index or have it be an included column:

The original question we wanted to answer was whether we would see a performance difference when a query used the index with all columns in the key, versus the index with most of the columns included in the leaf level. In our first set of tests there was no difference, but in our third and fourth tests there was. It ultimately depends on the query. We only looked at two variations – one had an additional predicate, the other had an ORDER BY – many more exist.

What developers and DBAs need to understand is that there are some great benefits to including columns in an index, but they will not always perform the same as indexes that have all columns in the key. It may be tempting to move columns that are not part of predicates and joins out of the key, and just include them, to reduce the overall size of the index. However, in some cases this requires more resources for query execution and may degrade performance. The degradation may be insignificant; it may not be…you will not know until you test. Therefore, when designing an index, it’s important to think about the columns after the leading one – and understand whether they need to be part of the key (e.g. because keeping the data ordered will provide benefit) or if they can serve their purpose as included columns.

Read the whole thing and be willing to test different approaches.

Comments closed

The Best Of The Underground Toolbox

Published 2018-07-10 by Kevin Feasel

Adrian Buckman shares some of his favorite creations:

sp_AGreconfigure

This is a great goto proc for an alternative to the Always on availability group GUI for changing Failover mode, Synchronous mode or even Readable options.

When you manage multiple servers with multiple Availability groups this stored procedure can save you alot of time, sometimes I find the GUI can take a long time to open but equally it can take some time to execute the command.

sp_AGreconfigure can speed this process up for you, we tend to use this as our goto for switching synchronous settings when patching/rebooting replicas but also I tend to use it in @Checkonly = 1 mode for giving the Availability group settings a once over.

Click through for this and several other useful tools.

Comments closed

The Power Of Registered Servers In SSMS

Published 2018-07-10 by Kevin Feasel

Nisarg Upadhyay introduces us to Central Management Server and Registered Servers in SQL Server Management Studio:

We will:

Designate a SQL Server instance as “Central Management server”.
Create server groups named Production servers and Development servers under CMS server.
Register “QA Testing server” and “Staging server” under the development server group, and “HR DB” and “Sales DB” under the production server group.
Connect and access the CMS server from another server inside the same domain using SSMS.

Once you get past a few SQL Server instances, having a CMS in place is a fantastic idea.

Comments closed

sp_ctrl3 — Getting Information On Database Objects

Published 2018-07-10 by Kevin Feasel

Daniel Hutmacher shows off his sp_ctrl3 solution:

My absolute number one favorite homebrew tool is without a doubt sp_ctrl3. I started building it a long time ago to replace the built-in sp_help procedure in SQL Server, which is accessible using the Alt+F1 shortcut in Management Studio.

sp_help shows you object information on database objects, such as column definitions, parameters, indexes, etc, but it’s old (I remember it in SQL Server 2000, but it’s probably way older than that) and it hasn’t really aged that well since then. What’s more, sp_help won’t show you the more technical details and features from newer versions (like filters and included columns on indexes) , so you can’t really just copy and paste information from it very reliably or effortlessly.

Like the name implies, cp_ctrl3 aims to address some of those issues, and years later, I find myself adding features to it whenever there’s something I miss.

Check it out as a worthy replacement to sp_help.

Comments closed

Reading JSON Data Using httr

Published 2018-07-09 by Kevin Feasel

Akshay Mahale shows us how to use a few tidyverse-friendly packages to read JSON data from web endpoints:

When it comes to R to consume such APIS we focus majorly on the package below:

httr This package takes it very seriously when we have to work with We data by exposing some very useful functions. It provides us with HTTP client to access APIS with GET/POST methods, passing query parameters, verifying fetched response wrt to data format and if error-free.
jsonlite In order to convert received JSON response to readable R Object or a data frame, jsonlite helps to convert json to R object and vice versa.
rlist To perform some additional manipulation on data structure of received JSON response rlist exposes some important methods list.select and list.stack. This methods are useful to get parsed json data into a tibble.

Read on for an example extracting data from a web endpoint.

Comments closed

Dealing With Heteroskedasticity

Published 2018-07-09 by Kevin Feasel

Bruno Rodrigues explains the notion of heteroskedasticity and shows ways of dealing with this issue in a linear regression:

This test shows that we can reject the null that the variance of the residuals is constant, thus heteroskedacity is present. To get the correct standard errors, we can use the vcovHC() function from the {sandwich} package (hence the choice for the header picture of this post):
lmfit %>% 
    vcovHC() %>% 
    diag() %>% 
    sqrt()
##       (Intercept)   regionnortheast       regionsouth        regionwest 
##      311.31088691       25.30778221       23.56106307       24.12258706 
##         residents   young_residents per_capita_income 
##        0.09184368        0.68829667        0.02999882
By default vcovHC() estimates a heteroskedasticity consistent (HC) variance covariance matrix for the parameters. There are several ways to estimate such a HC matrix, and by default vcovHC() estimates the “HC3” one. You can refer to Zeileis (2004) for more details.

We see that the standard errors are much larger than before! The intercept and regionwest variables are not statistically significant anymore.

The biggest problem with heteroskedasticity is that it can introduce bias in error terms. That’s not the end of the world, but if the level of heteroskedasticity is serious enough, we want to find ways to account for it. H/T R-Bloggers.

Comments closed