Inline Operators In R With wrapr

John Mount shows how to use inline operators in R with the wrapr package:

The above code is assuming you have the wrapr package attached via already having run library('wrapr').

Notice we picked R-related operator names. We stayed away from overloading the + operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t to make R more like Python, but to adapt a good idea from Python to improve R.

Also, it’s a little late to pick up the discount (though Manning has discounts pretty much every day so be patient and you’ll find 40+ percent off) but check out the second edition of Practical Data Science with R by Mount and Nina Zumel. I’ve held off on reading it so far because I want to wait until it’s closer to completion, but it is on my to-read list.

Feature And Text Classification Using Naive Bayes In R

I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way:

Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, which is under 3MB zipped in 2000 reviews in total.

Unike last time, I’m going to break this out into sections with commentary in between. If you want the full script with notebook, check out the GitHub repo I put together for this talk.

Assuming I ever get a chance to do this talk again, I’m probably going to change the data sets in the example given how overplayed iris is.

An Explanation Of Convolutional Neural Networks

Shirin Glander explains some of the mechanics behind Convolutional Neural Networks:

Convolutional Neural Nets are usually abbreviated either CNNs or ConvNets. They are a specific type of neural network that has very particular differences compared to MLPs. Basically, you can think of CNNs as working similarly to the receptive fields of photoreceptors in the human eye. Receptive fields in our eyes are small connected areas on the retina where groups of many photo-receptors stimulate much fewer ganglion cells. Thus, each ganglion cell can be stimulated by a large number of receptors, so that a complex input is condensed into a compressed output before it is further processed in the brain.

If you’re interested in understanding why a CNN will classify the way it does, chapter 5 of Deep Learning with R is a great reference.

Getting CSV Row Counts

Dave Mason shares a few techniques for getting row counts of CSV files:

I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
Now, none to my surprise, it was slow. In my short experience, I’ve found R isn’t particularly snappy–even when the data can fit comfortably in memory. I was pretty sure I could beat R’s record count performance handily with C#. And I did. I found some related questions on StackOverflow. A small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).

To fact-check Dave (because this blog is about nothing other than making sure Dave is right), I checked the source code to the wc command. That command also streams through the entire file, so Dave’s premise looks good.

CosmosDB Time To Live Support

Hasan Savran explains the Time To Live (TTL) counter in CosmosDB:

Another great feature of Cosmos DB is, TTL (Time To Live) support. This is a great option to have if you need a database system with Caching option, or you need to purge your data and you don’t want to develop a function to remove data from your dataset. CosmosDB’s TTL feature is pretty simple, all you need to do is, turn this feature on and declare when data should be removed from your dataset. Best part about TTL in CosmosDB is, CosmosDB does not charge you when it removes the data from your containers so you can use your Request Units for other transactions!

There are two ways to set Time to Live value. You can set the TTL value on a container or you can set it on a specific item by using CosmosDB SDK. TTL value must be in seconds.
 TTL timer resets if data gets modified for any reason.

Click through for an example of it in action.

Miminal Rights For Bulk Inserts

Timothy Smith takes us through least privilege while allowing bulk insert operations:

While this file path serves as a useful location for us to load flat files, we should consider that the user account that is executing the underlying insert statement must be able to read (and possibly write to) that file location. The writing part of the equation comes in when it involves logging, even if the permissions of the written logging data are tied down strictly in the output, in that the user doesn’t control what gets written, but that errors are written. In the least, we want to ensure that a separate folder with strict permissions exists for any flat file import to restrict the account access – notice that we’re not reading off the root drive, as we’ve seen that we can insert an entire file of data – think about using SQL bulk insert to view files through SQL Server by inserting the file’s data and reviewing it.

It’s more than just “check the box for the server-level role.”

Bubbling Up HTTP Status Errors In Power Query

Tony McGovern takes us through a method involving Power Query + M of giving end users useful information when a web request fails:

So how does this relate to error tables? Like most well-documented APIs, the U.S. Census Bureau API has a page devoted to listing and describing all the possible response codes that can be returned by their service. I take this information and build an internal table within the query that defines and describes these response codes in my own words. I’m now able to throw custom messages that make the difference between a 400 response code and a 404 response code more obvious.

For example, in the code below, I use the Error.Record function to create individual records that allow me to catch these unsuccessful requests and throw my own custom error messages to the user. I then create an extra field in each record called ‘Status’, which maps each HTTP response code returned by the API to a corresponding error message of my choosing:

There’s a bit of work, but the end result is a fairly simple explanation for end users.

Auditing SQL Agent Jobs

Jason Brimhall has some clever techniques for auditing SQL Agent Jobs with Extended Events:

Once upon a time, I was in the position of trying to figure out why a job failed. After a bunch of digging and troubleshooting, it was discovered that the job had changed but nobody knew when or why. Because of that, I was asked to provide a low cost audit solution to try and at least provide answers to the when and who of the change.

Tracking who made a change to an agent job should be a task added to each database professionals checklist / toolbox. Being caught off guard from a change to a system under your purview isn’t necessarily a fun conversation – nor is it pleasant to be the one to find that somebody changed your jobs without notice – two weeks after the fact! Usually, that means that there is little to no information about the change and you find yourself getting frustrated.

Click through to see how Jason does it.

When SLEEP_TASK Waits Are Important

Ginger Keys gives an example where the SLEEP_TASK wait indicates a performance problem:

Ordinarily SLEEP_TASK is a nonspecific wait type in SQL Server which occurs when a task sleeps while waiting for a generic event to occur, according to Microsoft documentation.  This wait type can usually be safely ignored, however on some occasions it can happen when a script does not execute completely or hangs up for long periods of time.

The SLEEP_TASK wait means that a thread is waiting on a resource or waiting for some event to occur, and could indicate background task scheduling, a query plan exchange operator that isn’t tracked by CXPACKET, or it could be a hashing operation that spills to tempdb.

Read the whole thing. For a bit more information, check out the SQLskills description of this wait type.

Testing Kafka Streams Applications

Yeva Byzek continues her series on testing Kafka-based streaming applications:

When you create a stream processing application with Kafka’s Streams API, you create a Topologyeither using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running Kafka cluster and making sure to clean up state between tests adds a lot of complexity and time.

Instead, developers can unit test their Kafka Streams applications with utilities provided by kafka-streams-test-utils. Introduced in KIP-247, this artifact was specifically created to help developers test their code, and it can be added into your continuous integration and continuous delivery (CI/CD) pipeline.

Streaming applications need tested just like any other.


January 2019
« Dec