Text Featurizing With Microsoft R Server

David Smith has a post summarizing sentiment analysis with Microsoft R Server:

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here’s one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It’s great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms (“useful”, “indespensable”, “highly recommended”), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in thje Microsoft ML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

If you’re thinking about sentiment analysis, read the whole thing.

Kafka Cruise Control

Jiangjie Qin announces Cruise Control, an automated workload management system for Kafka:

Intelligent automation is critical under these circumstances, which is why we developed Cruise Control: a general-purpose system that continually monitors our clusters and automatically adjusts the resources allocated to them to meet pre-defined performance goals. In essence, users specify goals, Cruise Control monitors for violations of these goals, analyzes the existing workload on the cluster, and automatically executes administrative operations to satisfy those goals. You can see a video here about Cruise Control at the Stream Processing Meet Up last fall.

Today we are pleased to announce that we have open sourced Cruise Control and it is now available on Github. In this post, we’ll describe Cruise Control’s uses both generally and at LinkedIn, its architecture, and some unique challenges we faced when creating it. For further details about Kafka terminology used throughout this post, this reference can be a helpful guide.

This isn’t a monitoring tool per se, but rather a resource balancing tool.  And it’s now freely available to all.

Getting Distinct Rows In R

Rob J. Hyndman shows four different techniques (one “classic” and three tidyverse) for getting a distinct subset of a data set in R:

So that looks much better — clean, short, and easy to understand. But is it fast? Rather than grabbing the first lines of each group, it has to go searching for duplicates. But avoiding grouping and ungrouping must save some time.

So I ran some microbenchmark timings:

Click through for techniques and timings.  I’m not surprised that the “classic” method won out in terms of time, but for explanatory value, I’d definitely prefer trying to explain the tidyverse distinct version.  H/T R-Bloggers

Name Your Powershell Parameters

Shane O’Neill warns us to name our Powershell parameters:


Cannot process argument because the value of argument “path” is null.

The “path” argument is NULL? And I’m back to the very root of “I:\”?

Thankfully, at this stage, I know to go back and read the help file again before I let slip any curses.

As usual, it’s a fun read and a good lesson.

Azure SQL Database Multi-Factor Authentication

Arun Sirpal notes that the latest version of SQL Server Management Studio supports Multi-Factor Authentication with Azure Active Directory:

Quite a mouth full for a title but never the less very exciting. With the new version of SQL Server Management Studio (SSMS) 17.2 You now have the option to use Azure AD authentication for Universal Authentication with Multi-factor authentication (MFA) enabled, by that I mean use a login via SSMS that is enabled for MFA where below I will show you the two step verification using a push notification to my iPhone. (Yes iPhone I love it)

Download SSMS 17.2 from this link. https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms

Once installed you will see new Authentication options, the option that I want is the one highlighted below – “Active Directory – Universal with MFA support”

Click through for a demo of this.  I wonder if (when?) something like this comes to on-prem, maybe in conjunction with a third-party multi-factor authentication service.

The Downside Of Trusted Assemblies

Solomon Rutzky does not like the Trusted Assembly solution to SQL Server 2017 CLR:

Hopefully, Microsoft removes all traces of “Trusted Assemblies” (as I have suggested here). In either case, please just use Certificates (and possibly Asymmetric Keys, depending on your preference and situation) as I have demonstrated in these past three posts (i.e. Parts 2, 3, and 4). Even better, especially for those using SSDT, would be if Microsoft implemented my suggestion to allow Asymmetric Keys to be created from a binary hex bytes string. But, even without that convenience, there is still no reason to ever, ever, use the “Trusted Assemblies” feature.

He’s given three alternatives so far, so if you’re interested in CLR security, there’s plenty of food for thought.

SQL Code Smells

Phil Factor has updated his code smells compendium:

Some time ago, Phil Factor wrote his booklet ‘SQL Code Smells’, collecting together a whole range of SQL Coding practices that could be considered to indicate the need for a review of the code. It was published as 119 code smells, even though there were 120 of them at the time. Phil Factor has continued to collect them and the current state of the art is reflected in this article. There are now around 150 of these smells and SQL Code Guard is committed to cover as many as possible of them.

I loved this booklet when it came out almost as much as I loved his Confessions of an IT Manager.  If you’re looking for some light reading over a long weekend, you can do a lot worse than this.

Searching In Windbg

Ewald Cress shows us how to search for a four-byte pattern in the Windows debugger:

Cracking open Windbg on 2016 SP1 with the s command to look for byte patterns yielded nothing of value. Maybe something has changed with conventions or indirection? Nope, no joy in 2014 either.

In the end, it took the extremely brave step of RTFM, in this case the Windbg online help, to realise where I was going wrong. I was searching for a four-byte pattern by searching for doublewords. Sounds reasonable on the face of it, but what I had missed was that this specifically required the doublewords to be doubleword-aligned, i.e. starting on an address divisible by four. My method only had a 25% chance of working, so it’s sheer luck I ever got good results with it.

Changing to a byte search for four consecutive bytes gave me the non-aligned semantics my taste buds craved, and the results came pouring in.

This is in the context of gathering information on an uncommon wait type related to columnstore indexes.

Azure Archive Blob Storage

James Serra talks about a new tier of blob storage:

Last year Microsoft introduced Azure Cool Blob storage, which cost customers a penny per GB per month in some Azure regions.  Now, users have another, lower-cost option in Azure Archive Blob Storage, along with new Blob-Level Tiering data lifecycle management capabilities.  So there are now three Azure blog storage tiers: Hot, Cool, and Archive.

Azure Archive Blob Storage costs 0.18 cents per GB per month when the service is delivered through its cloud data center in the East US 2 (for comparison, in the same region hot is 1.8 cents and cool is 1.0 cents per GB per month) .  Customers can expect a 99 percent availability SLA (service level agreement) when the service makes its way out of the preview stage.

This is Azure’s response to AWS Glacier.  The immediate sticker price is a bit higher, but if there aren’t any incremental costs associated with deletion, uploading, or retrieving files, then it could end up matching Glacier in TCO.


September 2017
« Aug