Author: Kevin Feasel

Now it’s time to train our classification model! Let’s use the glmnet package to fit a logistic regression model with LASSO regularization. It’s a great fit for text classification because the variable selection that LASSO regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing with very little hassle, so we can train on multiple cores with cross-validation on the training set using cv.glmnet().

Hot take: Jane Austen was the best English-language novelist of the 19th century. I’d say “all-time” but the world isn’t ready for a take that hot.

Comments closed

Load Multiple Input Data Sets For ML Services

Published 2018-12-27 by Kevin Feasel

Niels Berglund shows us a way to get more than one input data set passed into SQL Server Machine Learning Services:

This post came about due to a question on the Microsoft Machine Learning Server forum. The question was if there are any plans by Microsoft to support more the one input dataset (@input_data_1) in sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database, and retrieve data.
However, the poster was well aware of that, but due to certain reasons he did not want to do it that way – he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago, where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter and then used the binary representation as in input parameter (yeah – this sounds confusing, but bear with me). I did some research (read Googling), and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.

This has been a point of frustration for me. We can name the one input data set, so I’d really like to see true support for input multiple data sets without the need for hacks.

Comments closed

Returning NULL on NULL Input In UDFs

Published 2018-12-27 by Kevin Feasel

Jonathan Kehayias shows us a performance improvement you can get if your user-defined function is expected to return NULL if you pass in NULLs for inputs:

I was really curious about the RETURNS NULL ON NULL INPUT function option so I decided to do some testing. I was very surprised to find out that it’s actually a form of scalar UDF optimization that has been in the product since at least SQL Server 2008 R2.
It turns out that if you know that a scalar UDF will always return a NULL result when a NULL input is provided then the UDF should ALWAYS be created with the RETURNS NULL ON NULL INPUT option, because then SQL Server doesn’t even run the function definition at all for any rows where the input is NULL – short-circuiting it in effect and avoiding the wasted execution of the function body.

The more often you pass in NULL to that function, the better your performance will be relative to the default case.

Comments closed

Exporting To Text With SQL Server: Comparing Methods

Published 2018-12-27 by Kevin Feasel

Phil Factor shows us several ways of exporting data from SQL Server to files and gives us size and time comparisons:

I enjoy pulling the data out of AdventureWorks. It is a great test harness. What is the quickest way of doing it? Well, everyone knows it is native BCP, but how much faster is that format than tab-delimited or comma-delimited BCP? Can we quickly output data in XML? Is there a way of outputting array-in-array JSON reasonably quickly? Of course, the answer is going to vary from system to system, and across versions, but any data about this is usually welcome.
In addition to these questions, I wanted to know more about how much space these files take up, either raw or zipped. We’re about to find out. We’ll test all that, using good ol’ BCP and SQLCMD.
My motivation for doing this was to explore ways of quickly transferring data to MongoDB. to test out a way of producing array-in-array JSON at a respectable turn of speed. It turned out to be tricky. The easy and obvious ways were slow.

As is usual for Phil, this article is done quite well.

Comments closed

Generating Tally Tables With Snowflake DB

Published 2018-12-27 by Kevin Feasel

Koen Verbeeck shows us how to use the GENERATOR function to build a tally table in Snowflake:

I’m starting a little series on some of the nice features/capabilities in Snowflake (the cloud data warehouse). In each part, I’ll highlight something that I think it’s interesting enough to share. It might be some SQL function that I’d really like to be in SQL Server, it might be something else.
In the first part: the GENERATOR function. In short, it lets you generate a virtual table with a specified number of rows, or in database lingo: a tally table (or numbers table). A tally table can help you solve a lot of problems in SQL, but the TL;DR version is that it replaces loops/cursors most of the time and allows you to tackle the issue in a true set-based manner.

Naturally, as I read the article, I got the Bad Religion song stuck in my head. That’s an occupational hazard, I suppose.

Comments closed

Power BI Tab Order

Published 2018-12-27 by Kevin Feasel

Meagan Longoria shows us how to set the tab order for our Power BI elements:

Tab order is the order in which users interact with the items on a page using the keyboard. Generally, we want tab order to be predictable and to closely match the visual order on the page (unless there is a good reason to deviate). If you press the tab key in a Power BI report, you might be surprised at the seemingly random order in which you move from visual to visual on the page. The default order is the order in which the visuals were placed on the page in Power BI Desktop, or the last modified order in PowerBI.com if you have edited your report there.

I had not actually tried to tab through a dashboard in Power BI, so this is news to me.

Comments closed

Reasons Why We Get Identity Column Gaps

Published 2018-12-27 by Kevin Feasel

Steve Jones walks through several reasons why you might see gaps in identity columns:

Deleting Rows
This is noted in the tweet as a cause, but let’s test this.
One of the common ways that we get gaps in identity values is when rows are deleted. Let’s remove the row with Steve in it.

Steve explains a few others, but even that’s not complete: identity columns can jump after the service restarts as well.

In short, please do not use identity values in cases where you need to guarantee sequentiality (like check numbers or invoice numbers). Don’t use sequences either, as they’ll behave similarly.

Comments closed

Vectorization With Apache Hive And Parquet Tables

Published 2018-12-26 by Kevin Feasel

Vihang Karajgaonkar, et al, take us through using a performance improvement in Apache Hive using Parquet tables:

The performance benchmarks on CDH 6.0 show that enabling Parquet vectorization significantly improves performance for a typical ETL workload. In the test workload (TPC-DS), enabling parquet vectorization gave 26.5% performance improvement on average (geomean value of runtime for all the queries). Vectorization achieves these performance improvements by reducing the number of virtual function calls and leveraging the SIMD instructions on modern processors. A query is vectorized in Hive when certain conditions like supported column data-types and expressions are satisfied. However, if the query cannot be vectorized its execution falls back to a non-vectorized execution. Overall, for workloads which use the Parquet file format on most modern processors, enabling Parquet vectorization can lead to better query performance in CDH 6.0 and beyond.

This is worth looking into, especially if you are on the Cloudera stack.

Comments closed

More On Confluent’s Licensing Change

Published 2018-12-26 by Kevin Feasel

Alex Woodie has an article covering Confluent’s recent licensing change:

Confluent this month became the latest commercial open source software company to restrict the use of its software in the cloud. The move prevents cloud companies from using parts of the Confluent Platform, such as the KSQL component that uses SQL to process streaming data, as standalone software as a service (SaaS) offerings.
Jay Kreps, the co-creator of Apache Kafka and the CEO of Confluent, explained the significance of switching the Confluent Platform from the Apache 2.0 license to the new Confluent Community License.

Over at Aiven, CTO Heikki Nousiainen shares his thoughts:

The new Confluent Community License is a proprietary software license, specifically excluding “making available any software-as-a-service, platform-as-a-service, infrastructure-as-a-service or other similar online service that competes with Confluent products or services that provide the Software.”
While the license change does apply to all future versions of the specific software, it doesn’t alter the licensing status of the components in the versions that have been released and utilized by Aiven.

I believe it would be best to read the latter article looking for the significant silences.

Comments closed

Forecasting Field Goal Percentages With Prophet

Published 2018-12-26 by Kevin Feasel

Marlon Ribunal uses the Prophet library in R to forecast critical information:

I’ve been looking for an easy way to get to learning predictive analysis and forecasting. Prophet provides that path. Prophet is released by Facebook’s Core Data Science Team.
“Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.”
Just to dip my toes into the waters, I tried Prophet’s Quick Start Guide in R.
Let’s forecast the Field Goal Percentage (FG%) of Kyle Kuzma of the Los Angeles Lakers for the next 6 Months.

It’d be critical and important if it were hockey data. Or football data or baseball data or maybe even cricket data (but I don’t understand cricket data and why is that guy still running didn’t he get thrown out or something I don’t get it?).

As far as Prophet goes, it’s a useful library and works well if you’re looking at seasonal time series data.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31