Press "Enter" to skip to content

Author: Kevin Feasel

Using Hive For Data Virtualization

Gunther Hagleitner, et al., walk us through some reasons why we might want to use Apache Hive for data virtualization:

Assume you want to execute a Hive query that accesses data from an external RDBMS behind a JDBC connection. A possible naïve way of doing this would treat the JDBC source as a “dumb” storage system, reading all the raw data over JDBC and processing it in Hive. In this case you would ignore the query capabilities of the RDBMS and pull too much data over the JDBC link, thus ending up with poor performance and an overloaded system.
For that reason, Hive implements smart push-down to other systems by relying on its storage handler interfaces and cost-based optimizer (CBO) powered by Apache Calcite. In particular, Calcite provides rules that match a subset of operators in the logical representation of the query and generates a new equivalent representation with more operations executed in the external system. Hive includes those rules that push computation to the external systems in its query planner, and then relies on Calcite to generate a valid query in the language that those systems support. The storage handler implementations are responsible for sending the generated query to the external system, retrieving its results, and transforming the incoming data into Hive's internal representation so it can be processed further if needed.
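
To make that concrete, here is a hedged sketch of what the setup can look like; the table definition, connection string, and credentials are all placeholders:

```sql
-- Expose an external MySQL table to Hive through the JDBC storage handler.
CREATE EXTERNAL TABLE store_sales_jdbc (
  ss_item_sk  INT,
  ss_quantity INT,
  ss_net_paid DECIMAL(7,2)
)
STORED BY 'org.apache.hive.storage.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver"   = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url"      = "jdbc:mysql://rdbms-host/sales",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "********",
  "hive.sql.table"         = "store_sales"
);

-- With the CBO enabled, the filter and aggregation below are candidates for
-- push-down: Calcite can generate the equivalent MySQL query instead of
-- pulling every raw row over the JDBC link.
SELECT ss_item_sk, SUM(ss_net_paid) AS total_paid
FROM store_sales_jdbc
WHERE ss_quantity > 10
GROUP BY ss_item_sk;
```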

A lot of platforms are moving toward data virtualization (e.g., SQL Server with its Big Data Clusters). That appears to be the next product battleground.

Comments closed

$null In PowerShell

Kevin Marquette goes into great detail on PowerShell's $null concept:

When a $null value is used in a numeric equation, your results will be invalid if they don't give an error. Sometimes the $null will evaluate to 0 and other times it will make the whole result $null. Here is an example with multiplication that gives 0 or $null depending on the order of the values.
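
To see that order dependence concretely, a quick sketch (behavior as observed in Windows PowerShell):

```powershell
$null * 5    # evaluates to $null: the left-hand operand drives the operation
5 * $null    # evaluates to 0: the integer on the left coerces $null to 0
```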

Nulls are tricky to handle in any language, making their nuances important to understand.

Comments closed

Becoming An Expert

Adrian Colyer wraps up The Morning Paper for the year by reviewing a big picture paper on developer expertise:

You’ll know an expert programmer by the quality of the code that they write. Experts have good communication skills, both sharing their own knowledge and soliciting input from others. They are self-aware, understanding the kinds of mistakes they can make, and reflective. They are also fast (but not at the expense of quality).
Experience should be measured not just on its quantity (i.e., number of years in the role), but on its quality. For example, working on a variety of different code bases, shipping significant amounts of code to production, and working on shared code bases. The knowledge of an expert is T-shaped with depth in the programming language and domain at hand, and a broad knowledge of algorithms, data structures, and programming paradigms.

Click through for the full review.

Comments closed

The Ultimate PowerShell Telemetry Prompt

Jeffery Hicks might have taken things a bit too far:

Well, I knew I wouldn’t be satisfied. The other day I shared a PowerShell prompt function that could display telemetry like information for a few remote servers. One of the drawbacks was the limited amount of information I could display. I’ve revised that function and have a new version that displays additional information via a few performance counters. I’ve also reorganized the function to make it a bit more efficient. Want to see it?

My jokey lede aside, this is really cool. Click through for details and to get a link to the code.
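
If you want a taste of the idea before clicking through, here is a deliberately stripped-down sketch (not Jeffery's function) of a prompt that samples one performance counter each time it renders:

```powershell
# A minimal sketch, not Jeffery's code. Note that Get-Counter takes roughly a
# second per sample, so this makes every prompt noticeably slower.
function prompt {
    $cpu = (Get-Counter '\Processor(_Total)\% Processor Time').CounterSamples[0].CookedValue
    Write-Host ("[CPU {0:N1}%] " -f $cpu) -ForegroundColor Cyan -NoNewline
    "PS $($executionContext.SessionState.Path.CurrentLocation)> "
}
```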

Comments closed

Displaying Human-Readable Month Sets With DAX

Alberto Ferrari wants to show sets of contiguous months using DAX:

Today I woke up with an interesting question, about how to show a selection of months in a nice way, detecting contiguous selection. You can easily understand the desired solution from the following figure:

I enjoyed writing a quick solution, which is worth sharing. The code is somewhat verbose, but this is mainly for educational purposes (meaning I did not want to spend time optimizing it). I will likely write a full article on it; for now, just enjoy some DAX code:

I removed the image, but to give you the gist (and to get you to click through and see it in its full beauty): it reads "January, March-April, August-December".

Click through for Alberto’s quick-and-dirty solution and then Chris Webb’s improvement.
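
If you'd like to experiment before reading their versions, here is a rough sketch of the general technique (my own, not Alberto's or Chris's code): find the months that begin a contiguous run, then walk each run forward to its end. 'Date'[MonthNumber] is a placeholder column holding 1-12.

```dax
Months Shown :=
VAR SelectedMonths = VALUES ( 'Date'[MonthNumber] )
RETURN
    CONCATENATEX (
        -- a month starts a run when its predecessor is not selected
        FILTER ( SelectedMonths, NOT ( ( 'Date'[MonthNumber] - 1 ) IN SelectedMonths ) ),
        VAR StartMonth = 'Date'[MonthNumber]
        -- the run ends at the first selected month with no selected successor
        VAR EndMonth =
            MINX (
                FILTER (
                    SelectedMonths,
                    'Date'[MonthNumber] >= StartMonth
                        && NOT ( ( 'Date'[MonthNumber] + 1 ) IN SelectedMonths )
                ),
                'Date'[MonthNumber]
            )
        RETURN
            IF (
                StartMonth = EndMonth,
                FORMAT ( DATE ( 2000, StartMonth, 1 ), "mmmm" ),
                FORMAT ( DATE ( 2000, StartMonth, 1 ), "mmmm" ) & "-"
                    & FORMAT ( DATE ( 2000, EndMonth, 1 ), "mmmm" )
            ),
        ", ",
        'Date'[MonthNumber], ASC
    )
```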

Comments closed

The Bitmap Operator

Hugo Kornelis describes a new operator:

The Bitmap operator is used to build a bitmap that, based on a hash, represents which values may be present in a data flow. Due to the chance of hash collisions in the hash function used, the Bitmap process can produce false positives but not false negatives – so a match based on a bitmap is not guaranteed to be a match to the actual data, but a non-match based on a bitmap is guaranteed to not be a match in the actual data.
The generated bitmap is typically used in other operators to remove rows for which there is no match in the bitmap, and hence guaranteed no match in the original set of data processed by the Bitmap operator. The use of Bitmap operators is most common in execution plans for star join queries in large data warehouses. An example can be seen here.
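
Those semantics are the same idea a Bloom-style filter relies on. A toy Python sketch of the concept (not SQL Server's implementation):

```python
# Toy sketch of bitmap semantics: false positives possible, false negatives not.
BITMAP_SIZE = 8  # deliberately tiny, so hash collisions are likely

def bit(value):
    return hash(value) % BITMAP_SIZE

build_side = [101, 205, 309]
bitmap = {bit(v) for v in build_side}  # the "on" bits

def may_match(value):
    return bit(value) in bitmap

# A miss is definitive (the row can be discarded early); a hit only means
# "maybe". 413 is absent but collides with the build values: a false positive.
for probe in [101, 999, 413]:
    print(probe, "maybe present" if may_match(probe) else "definitely absent")
```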

Click through for details on how it works and plenty of good information on it.

Comments closed

Tips When Writing Extended Events To Files

Jason Brimhall has some tips to help you use the file target in Extended Events:

This first little tip comes from a painful experience. It is common sense to only try and create files in a directory that exists, but sometimes that directory has to be different on different systems. Then comes a little copy and paste of the last code used that worked. You think you are golden but forgot that one little tweak for the directory to be used. Oops.

Read on to see how SQL Server exposes that error, and then Jason shows us a different how-not-to with file targets.
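
For reference, a minimal event session with a file target looks something like this; the path is a placeholder, and the directory must already exist before the session starts:

```sql
-- If C:\XE does not exist, starting the session fails with an error.
CREATE EVENT SESSION TrackErrors
ON SERVER
ADD EVENT sqlserver.error_reported
ADD TARGET package0.event_file
    (SET filename = N'C:\XE\TrackErrors.xel');

ALTER EVENT SESSION TrackErrors ON SERVER STATE = START;
```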

Comments closed

Parallel Processing With The Pool Object In Python

Sanjay Kumar takes us through parallel processing in Python:

Parallel processing holds two varieties of execution: synchronous and asynchronous.
In synchronous execution, once a process starts, it puts a lock over the main program until it completes.
Asynchronous execution doesn't require locking; it performs tasks quickly, but the outcomes can come back in a rearranged order.

Click through for a few examples using Pool.
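
As a quick taste, a minimal sketch contrasting the two modes with a trivial worker function:

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Synchronous: map blocks until every result is back, in input order.
        print(pool.map(square, range(10)))

        # Asynchronous: apply_async returns a handle immediately; workers
        # complete in whatever order they finish, and we collect results later.
        handles = [pool.apply_async(square, (x,)) for x in range(10)]
        print([h.get() for h in handles])
```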

Comments closed

Training A Text Classifier Against Books

Julia Silge builds a text classifier to differentiate Pride and Prejudice from War of the Worlds:

Now it’s time to train our classification model! Let’s use the glmnet package to fit a logistic regression model with LASSO regularization. It’s a great fit for text classification because the variable selection that LASSO regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing with very little hassle, so we can train on multiple cores with cross-validation on the training set using cv.glmnet().
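
The core fitting call is compact. A minimal sketch, where sparse_words (a sparse document-term matrix) and is_austen (a 0/1 label vector) are placeholders for the objects built earlier in the post:

```r
library(glmnet)
library(doMC)            # on Windows, doParallel plays the same role
registerDoMC(cores = 4)

# Cross-validated logistic regression with LASSO regularization.
cv_model <- cv.glmnet(sparse_words, is_austen,
                      family = "binomial",  # logistic regression
                      parallel = TRUE)      # alpha = 1 (LASSO) is the default
```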

Hot take: Jane Austen was the best English-language novelist of the 19th century. I’d say “all-time” but the world isn’t ready for a take that hot.

Comments closed

Load Multiple Input Data Sets For ML Services

Niels Berglund shows us a way to get more than one input data set passed into SQL Server Machine Learning Services:

This post came about due to a question on the Microsoft Machine Learning Server forum. The question was whether there are any plans by Microsoft to support more than one input dataset (@input_data_1) in sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database and retrieve data.
However, the poster was well aware of that, but for certain reasons he did not want to do it that way – he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter, and then used the binary representation as an input parameter (yeah – this sounds confusing, but bear with me). I did some research (read: Googling) and found this StackOverflow question and answer. So for future questions, and for me to remember, I decided to write a blog post about it.

This has been a point of frustration for me. We can name the one input data set, so I'd really like to see true support for multiple input data sets without the need for hacks.
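
For the curious, a hedged sketch of the serialize-and-push-back trick Niels describes (table and parameter names are placeholders):

```sql
-- Step 1: serialize the "extra" data set to varbinary via an output parameter.
DECLARE @serialized varbinary(max);
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'serialized <- serialize(InputDataSet, NULL)',
    @input_data_1 = N'SELECT * FROM dbo.SecondDataSet',
    @params = N'@serialized varbinary(max) OUTPUT',
    @serialized = @serialized OUTPUT;

-- Step 2: pass the binary blob in alongside the "real" @input_data_1
-- and unserialize it back into a data frame inside the script.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
first  <- InputDataSet
second <- unserialize(secondDataSet)
# ... work with both data frames here ...
OutputDataSet <- first',
    @input_data_1 = N'SELECT * FROM dbo.FirstDataSet',
    @params = N'@secondDataSet varbinary(max)',
    @secondDataSet = @serialized;
```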

Comments closed