Category: R

Using FreeTDS To Connect To SQL Server

Steph Locke embraces the pain of FreeTDS:

If you use SQL Server (or Azure SQL DB) as your data store and you need to connect to the database from, you’re presently stuck with FreeTDS. If you have any control over infrastructure I cannot recommend highly enough the actual ODBC Driver on Linux for ease. Alas, does not let you control the infrastructure. We have to make do with FreeTDS and it can be pretty painful to get right.

Given how obtuse the error messages you get back from FreeTDS in your Shiny app are, and how long it takes to deploy an app, you might just want to cry a little. I know I did. Determined to succeed, here is my solution to getting a working database connection that you can also use to test you’re doing it right. If you’re on a particularly old version of SQL Server though, I can’t guarantee this will work for you.

Read on for more.  I also have an older post on working with FreeTDS, though I ended up using TDS_Version = 8.0 instead of 7.4.
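
As a rough sketch of where the TDS version setting ends up, here is one way to assemble a FreeTDS ODBC connection string in base R. The helper name build_freetds_conn_string, the server, database, and credential values, and the driver name {FreeTDS} (which depends on how the driver is registered in odbcinst.ini) are all assumptions for illustration, not taken from either post.

```r
# Hypothetical helper: assembles a FreeTDS ODBC connection string.
# The driver name "{FreeTDS}" is an assumption; it must match odbcinst.ini.
build_freetds_conn_string <- function(server, database, uid, pwd,
                                      port = 1433, tds_version = "7.4") {
  parts <- c(
    Driver      = "{FreeTDS}",
    Server      = server,
    Port        = port,
    Database    = database,
    UID         = uid,
    PWD         = pwd,
    TDS_Version = tds_version
  )
  # Join as key=value pairs separated by semicolons
  paste0(paste0(names(parts), "=", parts, collapse = ";"), ";")
}

# Placeholder values only; substitute your own server and credentials
conn_str <- build_freetds_conn_string(
  server   = "myserver.example.com",
  database = "MyDatabase",
  uid      = "my_user",
  pwd      = "my_password"
)

# With the DBI and odbc packages installed, the string could then be used as:
# con <- DBI::dbConnect(odbc::odbc(), .connection_string = conn_str)
```

Changing tds_version is then a one-argument change, which makes it easier to test which protocol version a given server accepts.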

Comments closed

Structural Topic Models In R

Julia Silge has a great post on building Structural Topic Models in R using stm and tidytext:

The stm package has a summary() method for trained topic models like these that will print out some details to your screen, but I want to get back to a tidy data frame so I can use dplyr and ggplot2 for data manipulation and data visualization. I can use tidy() on the output of an stm model, and then I will get the probabilities that each word is generated from each topic.

I haven’t watched the video yet, but that’s on my to-do list for today.

The Grammar of Graphics

I’ve started a new series:

Instead, we will start with Wickham’s paper on ggplot2.  This gives us the basic motivation behind the grammar of graphics by covering what a grammar does for us:  “A grammar provides a strong foundation for understanding a diverse range of graphics. A grammar may also help guide us on what a well-formed or correct graphic looks like, but there will still be many grammatically correct but nonsensical graphics. This is easy to see by analogy to the English language: good grammar is just the first step in creating a good sentence” (3).

With a language, we have different language components like nouns (which can be subjects, direct objects, or indirect objects), verbs, adjectives, adverbs, etc.  We put together combinations of those individual components to form complete sentences and transmit ideas.  Our particular word choice and language component usage will affect the likelihood of success in idea transmission, but to an extent, we can work iteratively on a sentence, switching words or adding phrases to get the point across the way we desire.

With graphics, we can do the same thing.  Instead of thinking of “a graph” as something which exists in and of itself, we should think of different objects that we combine into its final product:  a graph.

I call this first post the poor man’s literature review.  The rest of the series is code- and visual-heavy.
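
To make the idea of combining objects concrete, here is a minimal sketch using ggplot2 and the built-in mtcars data set. The choice of data set, columns, and layers is my own illustration, not an example from the series.

```r
library(ggplot2)

# Data plus aesthetic mappings: which columns drive x and y
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# Geometric objects: each one is a separate component we can add or drop
points <- geom_point()
trend  <- geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Axis labels are yet another independent component
axis_labels <- labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# The graph is the combination of the pieces
graph <- base_plot + points + trend + axis_labels
```

Calling print(graph) renders the plot. Dropping trend from the sum, or swapping points for a different geom, changes one part of the sentence without rewriting the rest.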

Convenience Functions In wrapr

John Mount walks us through some of the language conveniences available in the wrapr library:

wrapr supplies additional q*() methods.

  • qae() “quote assignment expression” where both sides of an assignment are taken as un-evaluated. I.e.: qae(x = 5+1) yields c('x' = '5 + 1') regardless of whether x is bound or unbound in the environment. This is a bit of a complement to := which looks up bindings/references (i.e.: x = "z"; x := 5+1 returns c('z' = '6')).

  • qe() “quote expressions” for quoting complex expressions. Similar to quote(), except it returns a list of strings (not a language object). The qe() method is not as interesting to end-users as the other methods mentioned; it is designed to help in implementation of methods that take a non-assignment expression or list of expressions such as rquery::select_rows_nse().

Read the whole thing.  := probably gives the most obvious immediate benefit but the whole set seems useful.

Fun With ML Services And VARBINARY

I wrap up my ML Services mini-series by building out a process to predict sales for multiple products using different models:

I have my model as an input and want to spit it out at the end as well. But when I try that, I get an error:

Msg 39017, Level 16, State 3, Line 239
Input data query returns column #1 of type ‘varbinary(max)’ which is not supported by the runtime for ‘R’ script. Unsupported types are binary, varbinary, timestamp, datetime2, datetimeoffset, time, text, ntext, image, hierarchyid, xml, sql_variant and user-defined type.

So there goes that plan—I can output a VARBINARY(MAX) model, but I cannot input one.

Click through to see my workaround.

R Or Python

Tomaz Kastrun shares his thoughts on the topic of R versus Python:

Imag[in]e I ask you, would you prefer Apple iPhone over Samsung Galaxy, respectively? Or if I would ask you, would you prefer BMW over Audi, respectively? In all the cases, both phones or both cars will get the job done. So will Python or R, R or Python. So instead of asking which one I prefer, ask yourself, which one suits my environment better? If your background is more statistics and less programming, take R; if you are more into programming and less into statistics, take Python; in both cases you will have faster time to accomplish results with your preferred language. If you ask me, can I do gradient boosting or ANOVA or MDS in Python or in R, the answer will be yes, you can do both in any of the languages.

This graf hits the crux of my opinion on the topic, but as I’ve gone deeper into the topic over the past year, I think the correct answer is probably “both” for a mature organization and “pick the one which suits you better” for beginners.

Building ML Services Models

I have part two of my three-part series on SQL Server Machine Learning Services modeling:

We used sp_execute_external_script to build a model and generate some number of days worth of predictions.  Now I’d like to break that up into two operations:  training a model and generating predictions.  The biggest reason I might want to do this is if model generation is very expensive relative to prediction generation.  It’s obviously not very expensive to build a random number generator following a Poisson distribution, but imagine we had a complicated neural net that took hours or days to train—we’d want to use that model over and over, as long as the model came close enough to representing the underlying reality.

So now we’re going to create a model.
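
The split between training and prediction can be sketched in plain R, outside of SQL Server. This is an illustration under my own assumptions (simulated sales data, an intercept-only Poisson GLM standing in for the model), not the code from the post. The raw vector produced by serialize() is the kind of value that would be stored in a VARBINARY(MAX) column.

```r
set.seed(42)

# Simulated daily sales counts (made-up data for illustration)
sales <- data.frame(units = rpois(90, lambda = 3.5))

# Step 1: train once. An intercept-only Poisson GLM stands in for the model.
model <- glm(units ~ 1, data = sales, family = poisson())

# Serialize the trained model to a raw vector so it can be stored and reused
model_bytes <- serialize(model, connection = NULL)

# Step 2: later, restore the model and generate predictions without retraining
restored <- unserialize(model_bytes)

# The GLM uses a log link, so exponentiate the intercept to recover lambda
lambda_hat <- exp(coef(restored)[[1]])

# Generate seven days of predicted sales from the restored model
predictions <- rpois(7, lambda = lambda_hat)
```

For a model this cheap the split buys nothing, but the same two steps apply unchanged when training takes hours.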

Click through to see a more complete example, something closer to production-ready.

Including A Progress Bar In R

Peter Solymos has an update to his pbapply library:

The pbapply R package that adds a progress bar to vectorized functions has been known to accumulate overhead when calling parallel::mclapply with forking (see this post for more background on the issue). Strangely enough, a GitHub issue held the key to the solution that I am going to outline below. Long story short: forking is no longer expensive with pbapply, and as it turns out, it never was.

H/T R-Bloggers

Building A Model Using SQL Server ML Services

I have a post which shows how to build a simple R model to predict demand for an item:

I am a huge fan of the Poisson distribution.  It is special in that its one parameter (lambda) represents both the mean and the variance of the distribution.  As lambda grows large, a Poisson distribution approaches a normal distribution.  But it’s most useful in helping us model infrequently-occurring events, such as selling 3-4 watches per day.

Estimating a Poisson is also easy in R:  lambda is simply the mean of your sample counts, and there is a function called rpois() which takes two parameters:  the number of random values you want to generate and the value of lambda.

So what I want to do is take my data from SQL Server, feed it into R, and return back a prediction for the next seven days.
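
The R half of that plan fits in a few lines. The sketch below uses made-up daily sales numbers in place of data pulled from SQL Server; the variable names are mine.

```r
# Hypothetical daily unit sales (made-up numbers standing in for SQL Server data)
daily_sales <- c(3, 4, 2, 5, 3, 4, 3, 6, 2, 4, 3, 5, 4, 3)

# For a Poisson distribution, the estimate of lambda is the sample mean
lambda_hat <- mean(daily_sales)

# Fix the seed so the simulated forecast is reproducible
set.seed(2018)

# rpois(n, lambda): n random draws from a Poisson distribution with rate lambda
next_week <- rpois(7, lambda = lambda_hat)

# Shape the result as a data frame, one row per forecast day
forecast <- data.frame(day = 1:7, predicted_units = next_week)
print(forecast)
```

Because the draws are random, each run with a different seed gives a different week; the values cluster around lambda_hat rather than repeating it.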

This was a simple post, but the next two in the series will expand upon it and build out a full implementation.

Beware Multi-Assignment dplyr::mutate() Statements

John Mount hits on an issue when using dplyr backed by a database in R:

Notice the above gives an incorrect result: all of the x_i columns are identical, and all of the y_i columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs). If this is truly “incorrect dplyr code” we should have seen an error or exception. Unless you can be certain you have no code like that in a database-backed dplyr project, you cannot be certain you have not run into the problem producing silent data and result corruption.

The issue is: dplyr on databases does not seem to have strong enough order of assignment statement execution guarantees. The running counter “delta” is taking only one value for the entire lifetime of the dplyr::mutate() statement (which is clearly not what the user would want).

Read on for a couple of suggested solutions.
