
Author: Kevin Feasel

Get Security Update List In PowerShell

Jana Sattainathan builds a detailed CSV of Microsoft monthly security updates using PowerShell:

Once I understood the data well, I realized that the raw data had to be flattened out to expand collections (like KB) at the row level into their own rows, so that every row has a single value in each column. Then, the grouping is easy.

It made more sense to allow grouping not just by KB but by other columns like Product or CVE. Group-Object works fine for most cases, but since there will be duplicates after the data is grouped, it is easier to just do it with hashtables.

Jana provides the entire solution on his site. When reading it, I felt the urge to switch to a language that offers easier pivoting and aggregation, but the code was clear and understandable.
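
For a sense of the shape of the approach without the PowerShell specifics, here is a minimal Python sketch of the same flatten-then-group idea; the field names and sample values below are invented for illustration and are not Jana’s actual schema.

    from collections import defaultdict

    # Hypothetical update records: one product and CVE per row, plus a collection of KBs.
    updates = [
        {"Product": "Product A", "CVE": "CVE-2017-0001", "KB": ["KB0000001", "KB0000002"]},
        {"Product": "Product B", "CVE": "CVE-2017-0002", "KB": ["KB0000003"]},
    ]

    # Step 1: flatten the KB collection so that every row holds exactly one KB value.
    flattened = [dict(row, KB=kb) for row in updates for kb in row["KB"]]

    # Step 2: group by any column with a plain dictionary (the hashtable approach).
    grouped = defaultdict(list)
    for row in flattened:
        grouped[row["KB"]].append(row)

    for kb, rows in grouped.items():
        print(kb, [r["Product"] for r in rows])

The dictionary here stands in for the hashtables Jana uses; his post has the real property names, the grouping by Product and CVE, and the CSV export.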


Gigantic Row Custom U-SQL Extractor

Phillip Seamark has created a custom U-SQL extractor which handles rows larger than 4 MB:

It seemed some of the rows in my CSV files exceeded an upper limit on how much the Extractor.Csv function can handle, and adding the silent:true parameter didn’t solve the issue.

I dug a bit deeper and found rows in some of the files that are long – really long. One in particular was 47 MB long just for the row, and this was valid data. I could have edited these out by hand but thought I’d see if I could solve it another way.

After some internet research and a couple of helpful tweets to and from Michael Rys, I decided to have a go at making my own custom U-SQL extractor.

Phillip has included the custom extractor code, so if you find yourself needing to parse very large rows of data in U-SQL, you’ll definitely be interested in this.


Counting Rows In Spark With Dplyr

John Mount discusses the difficulty of using dplyr to count rows in Spark:

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

  • Checking if a table is empty.

  • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).

  • Confirming data size is the same as reported in other sources (Spark, database, and so on).

  • Reporting amount of work performed or rows-per-second processed.

Read the whole thing; this seems unnecessarily complicated.
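
For comparison, here is roughly how the count and the emptiness check look in PySpark rather than sparklyr; this is only a sketch against a local session, not code from John’s post.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.range(1000).toDF("id")

    # Full row count: this triggers a job over the whole table.
    print(df.count())

    # Emptiness check: head(1) lets Spark stop after finding a single row.
    print(len(df.head(1)) == 0)

John’s complaint is that nrow() on a Spark-backed table returns NA rather than doing the equivalent of count().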


The Power Of Out-GridView

Mike F. Robbins shows off Out-GridView in the context of working with Azure:

Although the Out-GridView cmdlet existed in PowerShell version 2.0, you’ll need PowerShell version 3.0 to use it as shown in this blog article. The OutputMode parameter of Out-GridView which is used in this blog article was added in PowerShell version 3.0. Also, Out-GridView can only be used on operating systems with a GUI (it cannot be used on server core).

As far as I know, there’s no way to set a default region in Azure like there is with the AWS Initialize-AWSDefaultConfiguration cmdlet. I guess if you really wanted to set a default, you could always use $PSDefaultParameterValues to set a default value for the Location parameter for the cmdlets that require it.

Out-GridView is great for what it is: an easy UI within PowerShell, with sorting and filtering built in.


Creating Minidumps In SQL Server

Joe Obbish shows how to read the call stack by creating a minidump:

We can use minidumps to generate small files that contain SQL Server call stacks. Note that if you aren’t careful you can end up writing the full contents of memory to a dump file. I believe that writing the file is a single-threaded process, so this can take hours and hundreds of GBs on a real server. Also SQL Server is frozen while the dump is happening, so don’t do it in production.

Dumps are most useful when you want to examine a long running, very simple query or a query that seems to be stuck at a certain point. They aren’t useful to see a full call stack of a non-simple query or to see something that happens just once during execution, like a specific task during query compilation. All that you can get is a snapshot and the snapshot may not contain the information that you’re looking for.

There are multiple ways to generate and read dump files. For this blog post I’ll be using sqldumper.exe and WinDbg.

Click through to read the whole thing.


Working With AT TIME ZONE In SQL Server

Louis Davidson has a post up showing how to use AT TIME ZONE:

I have a series of blogs about temporal tables in SQL Server 2016, which you can see here. However, one thing I never mentioned was that the temporal row start and end times are always stored in UTC (which is generally inconvenient to work with unless you are from the Great Britain area). However, when you are looking at your data and want to see what it looked like at ‘yesterday at 12:00’ at the computer you are working from, this can be a bit more work.

In SQL Server 2016, a new feature was added that totally passed me by: AT TIME ZONE. It is used to manipulate the time zone offset of a point-in-time value, either to append a time zone to a value or to shift the time zone of a DATETIMEOFFSET value. The most common practical use that I will show is converting a value in your local time to UTC (and vice versa).

This is something I tend to forget about, but it can be useful when building quick reports from UTC data. I’d store all data in UTC, just to make sure dates and times are consistent across servers in different locations, but with AT TIME ZONE you don’t have to do those conversions in your head all the time.
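
The feature itself is T-SQL, but the two behaviors Louis describes (appending a time zone to a value with no offset versus shifting a value that already has one) map onto the same two operations in most languages. Here is a rough Python analog of the idea, with arbitrarily chosen zone names; it is not a substitute for the AT TIME ZONE syntax in the post.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # "Yesterday at 12:00" as a naive value, with no time zone information.
    naive_local = datetime(2017, 8, 28, 12, 0)

    # Append a time zone (what AT TIME ZONE does to a datetime2-style value).
    aware_local = naive_local.replace(tzinfo=ZoneInfo("America/Chicago"))

    # Shift an offset-aware value to UTC (what AT TIME ZONE does to a DATETIMEOFFSET).
    as_utc = aware_local.astimezone(ZoneInfo("UTC"))
    print(aware_local, "->", as_utc)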


Join Elimination

Lukas Eder has a nice post explaining different forms of automatic join elimination:

We intended to fetch all customers and their addresses. But observe: We project only columns from the CUSTOMER table and we don’t have any predicates at all, specifically not predicates using the ADDRESS table. So, we’re completely ignoring any contributions from the ADDRESS table. We never really needed the JOIN in the first place!

And in fact, the optimiser can prove this too, because of the FOREIGN KEY constraint on C.ADDRESS_ID, which guarantees that every CUSTOMER record has exactly one corresponding ADDRESS record. The JOIN does not duplicate or remove any CUSTOMER rows, so it is unneeded and thus eliminated (by some, not all, databases; I will list each database at the end of the article).

So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY constraint.

Read on for a comparison across different products as well.
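
Lukas’s SQL isn’t reproduced above, so as a stand-in, here is a small Python sketch (using sqlite3, which is not one of the databases in his comparison) of the equivalence the optimizer relies on: with a NOT NULL foreign key pointing at the ADDRESS primary key, the join neither duplicates nor removes CUSTOMER rows, so a query projecting only CUSTOMER columns returns the same rows without the join. The table and column names are simplified.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE address (address_id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            first_name  TEXT,
            address_id  INTEGER NOT NULL REFERENCES address(address_id)
        );
        INSERT INTO address VALUES (1, 'A-Town'), (2, 'B-Town');
        INSERT INTO customer VALUES (10, 'Jon', 1), (11, 'Rob', 2), (12, 'Ann', 1);
    """)

    with_join = con.execute("""
        SELECT c.first_name
        FROM customer AS c
        JOIN address AS a ON a.address_id = c.address_id
        ORDER BY c.customer_id
    """).fetchall()

    without_join = con.execute(
        "SELECT first_name FROM customer ORDER BY customer_id"
    ).fetchall()

    # Same result either way, which is what lets an optimizer drop the join.
    assert with_join == without_join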


Text Featurizing With Microsoft R Server

David Smith has a post summarizing sentiment analysis with Microsoft R Server:

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here’s one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It’s great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms (“useful”, “indispensable”, “highly recommended”), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the MicrosoftML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one- and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

If you’re thinking about sentiment analysis, read the whole thing.
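
If you want to experiment with the general technique outside of Microsoft R Server, here is a rough scikit-learn analog (not the featurizeText API David describes): extract one- and two-word ngrams, encode their presence or absence, and fit a linear model against the star ratings. The sample reviews are invented and tiny; the post’s 500-occurrence cutoff corresponds to the min_df parameter here.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LinearRegression

    reviews = [
        "what a useful reference, highly recommended",
        "not useful at all, very disappointed",
        "indispensable for quickly looking up grammar rules",
        "waste of money, would not recommend",
    ]
    stars = [5, 1, 5, 2]

    # One- and two-word ngrams, encoded as presence/absence rather than counts.
    # With real data you would set min_df to something like 500, as in the post.
    vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
    X = vectorizer.fit_transform(reviews)

    model = LinearRegression().fit(X, stars)

    # The most positive coefficients are the ngrams most associated with high
    # ratings; the most negative ones are associated with low ratings.
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, model.coef_), key=lambda tc: tc[1], reverse=True)
    print(ranked[:3], ranked[-3:])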


Kafka Cruise Control

Jiangjie Qin announces Cruise Control, an automated workload management system for Kafka:

Intelligent automation is critical under these circumstances, which is why we developed Cruise Control: a general-purpose system that continually monitors our clusters and automatically adjusts the resources allocated to them to meet pre-defined performance goals. In essence, users specify goals, Cruise Control monitors for violations of these goals, analyzes the existing workload on the cluster, and automatically executes administrative operations to satisfy those goals. You can see a video here about Cruise Control at the Stream Processing Meet Up last fall.

Today we are pleased to announce that we have open sourced Cruise Control and it is now available on GitHub. In this post, we’ll describe Cruise Control’s uses both generally and at LinkedIn, its architecture, and some unique challenges we faced when creating it. For further details about Kafka terminology used throughout this post, this reference can be a helpful guide.

This isn’t a monitoring tool per se, but rather a resource-balancing tool. And it’s now freely available to all.


Getting Distinct Rows In R

Rob J. Hyndman shows four different techniques (one “classic” and three tidyverse) for getting a distinct subset of a data set in R:

So that looks much better — clean, short, and easy to understand. But is it fast? Rather than grabbing the first lines of each group, it has to go searching for duplicates. But avoiding grouping and ungrouping must save some time.

So I ran some microbenchmark timings:

Click through for techniques and timings. I’m not surprised that the “classic” method won out in terms of time, but for explanatory value, I’d definitely prefer the tidyverse distinct version. H/T R-Bloggers
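
For readers coming from Python rather than R, the approaches Rob compares look roughly like this in pandas; this is an analog for orientation only, not Rob’s code, and his timings apply to the R versions.

    import pandas as pd

    df = pd.DataFrame({"grp": [1, 1, 2, 2, 2], "val": ["a", "a", "b", "b", "c"]})

    # The "classic" approach: filter out rows flagged as duplicates.
    classic = df[~df.duplicated()]

    # The distinct()-style one-liner.
    tidy = df.drop_duplicates()

    # Grab the first row of each group, grouping by every column.
    first_per_group = df.groupby(["grp", "val"], sort=False).head(1)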
