In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different packages, including data.table and dplyr. The methods shown will also work with base R and other packages.
For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together.
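The partition-then-parallelize pattern can be sketched with base R's parallel package (this illustrates the idea, not the wrapr API; the example data, grouping column, and per-partition function are made up):

```r
library(parallel)

# Example data: rows are independent across groups, so the groups can be
# processed on separate worker processes.
d <- data.frame(group = rep(1:4, each = 25), x = runif(100))

# Partition the data.frame into unrelated row blocks, one per group.
parts <- split(d, d$group)

# A per-partition task (illustrative): summarize each block.
f <- function(di) data.frame(group = di$group[1], mean_x = mean(di$x))

# Distribute the partitions across a small cluster of R processes.
cl <- makeCluster(2)
res <- parLapply(cl, parts, f)
stopCluster(cl)

# Reassemble the per-partition results into one data.frame.
result <- do.call(rbind, res)
```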
There were some interesting results. I expected data.table to be fast, but did not expect dplyr to parallelize so well.
RMarkdown is the dynamic document format RStudio uses. It is normal Markdown plus embedded R (or any other language) code that can be executed to produce outputs, including tables and charts, within the document. Hence, after changing your R code, you can just rerun all code in the RMarkdown file rather than redo the whole run-copy-paste cycle. And an RMarkdown file can be directly exported into multiple formats, including HTML, PDF, and Word.
Click through for the demo.
Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator, most often pronounced as “then”. For example: “take the mtcars data.frame, THEN take the head of it, THEN…” and so on.
For a function to be pipe friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type, possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side.
Click through for a couple of examples. H/T R-Bloggers
I have been using Azure Logic Apps recently to build workflows that gather data from external sources, ultimately inserting it into a database for reporting. I then got thinking: how can this be useful for “DBA” tasks? Let’s take a step back for a minute: what are Logic Apps? It is a technology that helps integrate apps, data, systems, and services across enterprises. The key parts of a Logic App solution are connectors, triggers, and actions.
I decided that I wanted to execute a stored procedure every 6 hours to capture wait statistics for my Azure SQL Database and log the information in a table.
This is what my workflow looks like.
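The database side of such a setup might look something like this sketch (the table and procedure names here are assumptions; sys.dm_db_wait_stats is the database-scoped wait statistics DMV available in Azure SQL Database):

```sql
-- Illustrative logging table; names are made up.
CREATE TABLE dbo.WaitStatsLog
(
    CaptureTime       datetime2    NOT NULL DEFAULT SYSUTCDATETIME(),
    WaitType          nvarchar(60) NOT NULL,
    WaitingTasksCount bigint       NOT NULL,
    WaitTimeMs        bigint       NOT NULL,
    SignalWaitTimeMs  bigint       NOT NULL
);
GO

-- The procedure the Logic App would call on its six-hour trigger.
CREATE PROCEDURE dbo.CaptureWaitStats
AS
BEGIN
    SET NOCOUNT ON;

    INSERT dbo.WaitStatsLog (WaitType, WaitingTasksCount, WaitTimeMs, SignalWaitTimeMs)
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_db_wait_stats;  -- database-scoped waits in Azure SQL Database
END;
```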
There are a few alternatives available, so it’s nice to see an example of one of them.
The query calculates the coefficient of variation (defined as the standard deviation divided by the mean) for the following groups, in the order in which they are listed in the GROUPING SETS clause:
- Country and education – expression (g.EnglishCountryRegionName, c.EnglishEducation)
- Country only – expression (g.EnglishCountryRegionName)
- Education only – expression (c.EnglishEducation)
- Over the whole dataset – expression ()
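The four grouping sets above could be expressed along these lines (a sketch against AdventureWorksDW-style tables; the fact table, measure, and join columns are assumptions):

```sql
SELECT g.EnglishCountryRegionName AS Country,
       c.EnglishEducation         AS Education,
       GROUPING(g.EnglishCountryRegionName) AS GroupingCountry,
       GROUPING(c.EnglishEducation)         AS GroupingEducation,
       STDEV(f.SalesAmount) / AVG(f.SalesAmount) AS CoefficientOfVariation
FROM dbo.FactInternetSales AS f
     INNER JOIN dbo.DimCustomer  AS c ON f.CustomerKey  = c.CustomerKey
     INNER JOIN dbo.DimGeography AS g ON c.GeographyKey = g.GeographyKey
GROUP BY GROUPING SETS
(
    (g.EnglishCountryRegionName, c.EnglishEducation),  -- country and education
    (g.EnglishCountryRegionName),                      -- country only
    (c.EnglishEducation),                              -- education only
    ()                                                 -- the whole dataset
)
ORDER BY NEWID();  -- shuffle the results, as in the original post
```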
Note also the use of the GROUPING() function in the query. This function tells you whether a NULL in a cell comes from NULLs in the source data (a group NULL), or whether the NULL is there because the row is a hyper-aggregate. For example, a NULL in the Education column where the value of GROUPING(Education) equals 1 indicates that the row is aggregated in such a way that education makes no sense in the context, for example aggregated over countries only, or over the whole dataset. I used ordering by NEWID() just to shuffle the results. I executed the query multiple times before I got the desired order, where all possibilities for the GROUPING() function output were included in the first few rows of the result set. Here is the result.
GROUPING SETS is an underappreciated bit of SQL syntax.
The original question we wanted to answer was whether we would see a performance difference when a query used the index with all columns in the key, versus the index with most of the columns included in the leaf level. In our first set of tests there was no difference, but in our third and fourth tests there was. It ultimately depends on the query. We only looked at two variations – one had an additional predicate, the other had an ORDER BY – many more exist.
What developers and DBAs need to understand is that there are some great benefits to including columns in an index, but they will not always perform the same as indexes that have all columns in the key. It may be tempting to move columns that are not part of predicates and joins out of the key, and just include them, to reduce the overall size of the index. However, in some cases this requires more resources for query execution and may degrade performance. The degradation may be insignificant; it may not be…you will not know until you test. Therefore, when designing an index, it’s important to think about the columns after the leading one – and understand whether they need to be part of the key (e.g. because keeping the data ordered will provide benefit) or if they can serve their purpose as included columns.
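As a concrete illustration of the two designs being compared (the table and column names here are hypothetical):

```sql
-- All columns in the index key:
CREATE NONCLUSTERED INDEX IX_Orders_AllKey
    ON dbo.Orders (CustomerID, OrderDate, OrderStatus, Quantity);

-- Only the leading column in the key; the rest stored at the leaf level:
CREATE NONCLUSTERED INDEX IX_Orders_Included
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, OrderStatus, Quantity);
```

Both cover the same query columns, but only the first keeps the data ordered by all four columns, which matters for predicates and ORDER BY clauses on the non-leading columns.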
Read the whole thing and be willing to test different approaches.
This is a great go-to procedure as an alternative to the Always On Availability Group GUI for changing failover mode, synchronous mode, or even readable options.
When you manage multiple servers with multiple Availability Groups, this stored procedure can save you a lot of time. Sometimes I find the GUI can take a long time to open, but equally it can take some time to execute the command.
sp_AGreconfigure can speed this process up for you. We tend to use this as our go-to for switching synchronous settings when patching/rebooting replicas, but I also tend to use it in @Checkonly = 1 mode to give the Availability Group settings a once-over.
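The check-only mode mentioned above might be invoked like this (only @Checkonly appears in the excerpt; see the procedure's own documentation for the full parameter list):

```sql
-- Report current Availability Group settings without changing anything.
EXEC dbo.sp_AGreconfigure @Checkonly = 1;
```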
Click through for this and several other useful tools.
Designate a SQL Server instance as “Central Management server”.
Create server groups named Production servers and Development servers under CMS server.
Register “QA Testing server” and “Staging server” under the development server group, and “HR DB” and “Sales DB” under the production server group.
Connect and access the CMS server from another server inside the same domain using SSMS.
Once you get past a few SQL Server instances, having a CMS in place is a fantastic idea.
My absolute number one favorite homebrew tool is without a doubt sp_ctrl3. I started building it a long time ago to replace the built-in sp_help procedure in SQL Server, which is accessible using the Alt+F1 shortcut in Management Studio.
sp_help shows you information on database objects, such as column definitions, parameters, indexes, etc., but it’s old (I remember it in SQL Server 2000, but it’s probably way older than that) and it hasn’t really aged well since then. What’s more, sp_help won’t show you the more technical details and features from newer versions (like filters and included columns on indexes), so you can’t really just copy and paste information from it very reliably or effortlessly.
Like the name implies, sp_ctrl3 aims to address some of those issues, and years later, I find myself adding features to it whenever there’s something I miss.
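Both procedures take an object name, so a side-by-side comparison might look like this (the table name is hypothetical):

```sql
EXEC sp_help 'dbo.MyTable';   -- the built-in procedure (Alt+F1 in SSMS)
EXEC sp_ctrl3 'dbo.MyTable';  -- the replacement described above
```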
Check it out as a worthy replacement to sp_help.