R For Apache Impala

Kevin Feasel

2017-07-24

Hadoop, R

Ian Cook describes implyr, an R interface for Apache Impala:

dplyr provides a grammar of data manipulation, consisting of set of verbs (including mutate()select()filter()summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed on large distributed datasets while using the same familiar dplyr syntax that they are accustomed to using on local data frames and other data sources. R users can also choose to directly write SQL commands and execute them on Impala using implyr.

implyr builds upon recent work from RStudio and other contributors, including major updates to the packages dplyr and DBI, and new packages dbplyr and odbc. implyr together with these packages enables data scientists and data engineers to more easily interact with Impala through self-service data science tools like Cloudera Data Science Workbench.

It looks like this project is off to a good start already.

Data Cleanup Using Drools

Kevin Feasel

2017-07-24

Data

Rathnadevi Manivannan gives an example of using Drools to create rule-based data cleansing processes:

The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.

The raw data contains invalid values such as null, invalid date, invalid drill type, and duplicate well and invalid well information with modified dates.

This raw data from the source is transformed to MS SQL for further filtering and normalization. To download raw data, look at the Reference section.

This is an example of applying several constraints and rules to a single data set.  Each individual rule would probably be easier to do in T-SQL, but the whole bunch becomes easier to understand with a procedural language.

Whitelisting SQL CLR Assemblies

Niels Berglund walks through the process of whitelisting a CLR assembly in SQL Server 2017:

What Microsoft introduces in SQL Server 2017 RC1, is something I refer to as whitelisting. It is somewhat similar to the TRUSTWORTHY setting, where you indicate that a database is to be trusted. But instead of doing it on the database level, you do it per assembly.

To whitelist in SQL Server 2017 RC1, you use the system stored procedure sys.sp_add_trusted_assembly. As the name implies the procedure adds an assembly to a list of “trusted” assemblies. By marking an assembly as trusted, SQL Server will allow it to be loaded when clr strict security is on (on by default), even if:

  • the assembly is not signed, and

  • the database where you want to deploy it to is not TRUSTWORTHY.

With the elimination of the CAS model finally hitting CLR, this is probably going to be one of the easier ways for DBAs to move forward with CLR in the future.

Finding The Largest Tables

Andy Galbraith has a script to find the largest tables in a database:

In the previous post in this series “Toolbox – Where Did All My Space Go?” I shared a script for finding which database files consumed the most space and which of those files had free space in them.  The next step after finding out which databases are using the space is determining which tables in those databases are occupying that space to consider purge/archive opportunities.  In many cases you will find a single large table (often called a “mother table”) taking up most of the space in your database.

There are cases when you wouldn’t want to use the Disk Usage By Table report, so here is a viable alternative.

Narrowing Down Deprecated Feature Usage

Dave Mason now skips msdb when he looks for deprecated feature usage:

In my previous post, I took a stab at monitoring deprecation events for SQL Server. It didn’t go so well. A deprecation event occurred more than 5,000 times in a very short period of time, and I got one email for every occurrence. Not good. Here’s what I kept seeing over and over:

USER_ID will be removed from a future version of SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use the feature. Use DATABASE_PRINCIPAL_ID instead.

It turns out the system stored proc msdb.dbo.sp_send_dbmail has a USER_ID() reference. I suspect an unrelated alert/email happened once, which executed sp_send_dbmail, which generated a DEPRECATION_FINAL_SUPPORT event, which ultimately led to another execution of sp_send_dbmail, which generated yet another DEPRECATION_FINAL_SUPPORT event, and round and round we go.

Click through for examples of deprecated features that various Microsoft products, including Reporting Services and Team Foundation Server, still use.

How Long Was That Event?

Taiob Ali shows that some extended event times are in milliseconds, some are in microseconds, and some are in unknownaseconds:

There was a question in dba.stackexchange.com titled “module_end extended events duration in microseconds?” that I answered and it was Microsecond in that instance.
Later on I questioned myself if the duration is always in Microseconds for extended events. I found, it is a mix of Millisecond, Microsecond and some are unknown meaning famous NULL. I wrote below tsql code to determine which is what.

Click through for the script that Taiob used to determine the answer.

Message Queues For The DBA

Drew Furgiuele explains message queueing theory and puts together a nice demo:

Please note, there’s one thing I need to make super abundantly clear for this demo: You’d never, ever configure these components like this for production. There’s so much more to consider, like setting up RabbitMQ to use SSL, writing actual applications and services instead of PowerShell to handle message listeners, and putting more effort into Service Broker. This should serve fine as a proof of concept, but if you want to actually implement something like this, make sure you do you research and follow best practices for each component.

This is the warning label on the chainsaw that says “normally you’d grab it by the other end.”  This is a great post, giving an introduction to tying Service Broker to RabbitMQ.  My biases lead me to Apache Kafka over RabbitMQ in most cases, but that’s just me.

Batch Execution Mode And Window Functions

Kevin Feasel

2017-07-24

T-SQL

Chris Adkin shows how taking advantage of batch execution mode on rowstore tables can lead to faster performance as degree of parallelism increases:

The SQL Server execution engine fundamentally acts like a cursor, control flow is exerted from the root node of the plan down to right most child node or iterators. The (logical) flow of data through the plan is in the opposite direction.

I say ‘Logical’ because in practise the run time uses buffers in order to minimise the data movement whilst executing the plan. However, up until SQL Server 2012, which first introduced batch mode, execution plans were executed by iterators processing data in a row by agonising row manner. Batch mode changes all of this, if we can ferry rows around in batches, this reduces the number of CPU cycles it takes to process an iterator in the plan (providing it supports batch mode). Also by sizing batches such that they fit inside the level 2 cache of the CPU, we gain even more performance by minimizing CPU last level cache misses or worse still main memory.

Adkin credits Niko Neugebauer for the insight and shows how you can use this on normal rowstore tables.

Categories

July 2017
MTWTFSS
« Jun Aug »
 12
3456789
10111213141516
17181920212223
24252627282930
31