Partitioning Data For Performance Improvement In R

Kevin Feasel

2018-07-10

R

John Mount shares a few examples of partitioning and parallelizing data operations in R:

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatabledata.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallelconveniently bundles all of these steps together when working with rquery pipelines.

There were some interesting results.  I expected data.table to be fast, but did not expect dplyr to parallelize so well.

Sharing R Notebooks

Hanyu Cui and Hossein Falaki show how to share a notebook using RMarkdown:

RMarkdown is the dynamic document format RStudio uses. It is normal Markdown plus embedded R (or any other language) code that can be executed to produce outputs, including tables and charts, within the document. Hence, after changing your R code, you can just rerun all code in the RMarkdown file rather than redo the whole run-copy-paste cycle. And an RMarkdown file can be directly exported into multiple formats, including HTML, PDF,  and Word.

Click through for the demo.

Pipe-Friendly Functions In R

Kevin Feasel

2018-07-10

R

William Doane gives some tips on writing pipe-friendly functions in R:

Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator and is most often pronounced as “then”. For example, “take the mtcarsdata.frame, THEN take the head of it, THEN…” and so on.

For a function to be pipe friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type—possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side.

Click through for a couple of examples.  H/T R-Bloggers

Using Azure Logic Apps For Database Tasks

Arun Sirpal shows off a technique he has developed to run maintenance jobs against Azure SQL Database databases:

I have been using Azure Logic apps recently to build some workflows to gather data from external sources ultimately inserting it into a database for reporting, I then got thinking, how can this be useful for “DBA” based tasks? Let’s take a step back for a minute, what are logic apps? It is a technology that helps integrate apps, data, systems, and services across enterprises. Key parts of a logic app solution are connectors, triggers and actions.

I decided that I wanted to execute a stored procedure every 6 hours to capture wait statistics for my Azure SQL Database and log the information in a table.

This is what my workflow looks like.

There are a few alternatives available, so it’s nice to see an example of one of them.

Grouping And Aggregating In SQL, R, And Python

Dejan Sarka has a few examples of aggregation in different languages, including SQL, R, and Python:

The query calculates the coefficient of variation (defined as the standard deviation divided the mean) for the following groups, in the order as they are listed in the GROUPING SETS clause:

  • Country and education – expression (g.EnglishCountryRegionName, c.EnglishEducation)
  • Country only – expression (g.EnglishCountryRegionName)
  • Education only – expression (c.EnglishEducation)
  • Over all dataset- expression ()

Note also the usage of the GROUPING() function in the query. This function tells you whether the NULL in a cell comes because there were NULLs in the source data and this means a group NULL, or there is a NULL in the cell because this is a hyper aggregate. For example, NULL in the Education column where the value of the GROUPING(Education) equals to 1 indicates that this is aggregated in such a way that education makes no sense in the context, for example aggregated over countries only, or over the whole dataset. I used ordering by NEWID() just to shuffle the results. I executed query multiple times before I got the desired order where all possibilities for the GROUPING() function output were included in the first few rows of the result set. Here is the result.

GROUPING SETS is an underappreciated bit of SQL syntax.

Index That Column Or Include It?

Jeanne Combrinck lays out her recommendations on whether to make a particular column part of an index or have it be an included column:

The original question we wanted to answer was whether we would see a performance difference when a query used the index with all columns in the key, versus the index with most of the columns included in the leaf level. In our first set of tests there was no difference, but in our third and fourth tests there was. It ultimately depends on the query. We only looked at two variations – one had an additional predicate, the other had an ORDER BY – many more exist.

What developers and DBAs need to understand is that there are some great benefits to including columns in an index, but they will not always perform the same as indexes that have all columns in the key. It may be tempting to move columns that are not part of predicates and joins out of the key, and just include them, to reduce the overall size of the index. However, in some cases this requires more resources for query execution and may degrade performance. The degradation may be insignificant; it may not be…you will not know until you test. Therefore, when designing an index, it’s important to think about the columns after the leading one – and understand whether they need to be part of the key (e.g. because keeping the data ordered will provide benefit) or if they can serve their purpose as included columns.

Read the whole thing and be willing to test different approaches.

The Best Of The Underground Toolbox

Adrian Buckman shares some of his favorite creations:

sp_AGreconfigure

This is a great goto proc for an alternative to the Always on availability group GUI for changing Failover mode, Synchronous mode or even Readable options.

When you manage multiple servers with multiple Availability groups this stored procedure can save you alot of time, sometimes I find the GUI can take a long time to open but equally it can take some time to execute the command.

sp_AGreconfigure can speed this process up for you, we tend to use this as our goto for switching synchronous settings when patching/rebooting replicas but also I tend to use it in @Checkonly = 1 mode for giving the Availability group settings a once over.

Click through for this and several other useful tools.

The Power Of Registered Servers In SSMS

Nisarg Upadhyay introduces us to Central Management Server and Registered Servers in SQL Server Management Studio:

We will:

  1. Designate a SQL Server instance as “Central Management server”.

  2. Create server groups named Production servers and Development servers under CMS server.

  3. Register “QA Testing server” and “Staging server” under the development server group, and “HR DB” and “Sales DB” under the production server group.

  4. Connect and access the CMS server from another server inside the same domain using SSMS.

Once you get past a few SQL Server instances, having a CMS in place is a fantastic idea.

sp_ctrl3 — Getting Information On Database Objects

Daniel Hutmacher shows off his sp_ctrl3 solution:

My absolute number one favorite homebrew tool is without a doubt sp_ctrl3. I started building it a long time ago to replace the built-in sp_help procedure in SQL Server, which is accessible using the Alt+F1 shortcut in Management Studio.

sp_help shows you object information on database objects, such as column definitions, parameters, indexes, etc, but it’s old (I remember it in SQL Server 2000, but it’s probably way older than that) and it hasn’t really aged that well since then. What’s more, sp_help won’t show you the more technical details and features from newer versions (like filters and included columns on indexes) , so you can’t really just copy and paste information from it very reliably or effortlessly.

Like the name implies, cp_ctrl3 aims to address some of those issues, and years later, I find myself adding features to it whenever there’s something I miss.

Check it out as a worthy replacement to sp_help.

Categories

July 2018
MTWTFSS
« Jun Aug »
 1
2345678
9101112131415
16171819202122
23242526272829
3031