Author: Kevin Feasel

This works, and the paste() pattern is so useful we suggest researching and memorizing it.

However the “call” portion of the model is reported as “formula = f” (the name of the variable carrying the formula) instead of something more detailed. Frankly this printing issue never bothered us. None of our tools or workflows currently use the model call item, and for a very large number of variables formatting the call contents in the model report becomes unweildy. We also already have the formula in a variable, so if we need it we can save it or pass it along.

There is a much better place on many models to get model structure information from than the model call item: the model terms item. This item carries a lot of information and formats up quite nicely:
format(terms(model))
# [1] "mpg ~ cyl + disp + hp + carb"

Be sure to check out the comments too, as there are several solutions to this problem.

Comments closed

Lists In R

Published 2018-09-05 by Kevin Feasel

Dave Mason continues his process of learning about R data structures with a survey of lists:

In previous lessons, we’ve noted vectors and matrices consist of data elements of the same class. R will coerce data elements to a single class if we attempt to create a vector or matrix with data elements of differing classes. Lists, on the other hand, can hold data elements of different classes, such as the integer, character, or logical class. In fact, a list can hold most anything in R, including vectors, matrices, and many more! None to my surprise, lists can be created with the list() function:

And if you want to work with lists, purrr is a great package to learn.

Comments closed

Using JSON In Azure Data Lake Analytics

Published 2018-09-05 by Kevin Feasel

Jeffrey Verheul shows how to register .NET assemblies in Azure Data Lake Analytics:

The power of Azure Data Lake is that you can use a variety of different file types to process data (from Azure Data Lake Analytics). But in order to use JSON, you need to register some assemblies first.

Downloading assemblies
The assemblies are available on Github for download. Unfortunately you need to download the solution, and compile it on your machine. So I’ve also made the 2 DLL’s you need available via direct download:

Click through for links to the assemblies and instructions on how to register them. And to continue my long-running joke that every .NET project has as a core requirement Newtonsoft.Json.

Comments closed

Joining Your SQL Server On Linux VM To A Domain

Published 2018-09-05 by Kevin Feasel

Dylan Gray and Tejas Shah provides some tips on joining a SQL Server on Linux instance to an existing Active Directory domain:

AD authentication is a popular mechanism for login and user authentication. It works very well in many scenarios, especially for enterprise applications. AD authentication is a supported scenario on SQL Server on Linux. Configuring the Linux VM to join with Active Directory (AD) can be a little tricky at sometimes though, especially in a complex enterprise environment.

One error message you may see from “realm join” is “realm: Couldn’t join realm: This computer’s host name is not set correctly.” This is due to a generic hostname (e.g. “localhost”), an incorrect domain in your hostname (e.g. “host1.abcd.com” instead of “host1.contoso.com”), or a duplicate hostname on the domain. To fix this, edit /etc/hostname to have a unique hostname and reboot the machine. On Ubuntu, it can also be helpful to put the fully qualified domain name in /etc/hostname (e.g. “host1.contoso.com” instead of “host1”).

Another possibility is that if the DNS is configured incorrectly, the host may be unable to resolve the domain. This will result in the message “realm: No such realm found” being returned from “realm join” command. To fix this, you need to use a DNS server on the realm you wish to join (can be on the same machine as the domain controller). The steps to fix this are described here: https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-active-directory-authentication?view=sql-server-2017#join

They provide in this post some of the low-hanging fruit answers, where the problem is in basic server configuration.

Comments closed

Column Order Matters For Indexes

Published 2018-09-05 by Kevin Feasel

Bert Wagner violates Betteridge’s Law of Headlines:

When beginning to learn SQL, at some point you learn that indexes can be created to help improve the performance of queries.

Creating your first few indexes can be intimidating though, particularly when trying to understand what order to put your key columns in.

Today we’ll look at how row store indexes work to understand whether index column order matters.

Despite the flagrant violation, you should check out Bert’s post, as it’s a good one.

Comments closed

Changing Language In SQL Server

Published 2018-09-05 by Kevin Feasel

Jeff Mlakar shows how to switch languages in SQL Server:

The SET LANGUAGE command allows us to choose a language for a session. By session here I mean by SPID. Each query tab you open in SSMS is another thread to the database and receives a SPID. This can be called by almost anyone who has permissions to access the database because it only requires membership in the public role to execute.

Now let us change the session language to Russian.

You can change the default language for all sessions, as well as switching language for a specific session.

Comments closed

The Two-Digit Cutoff For Years

Published 2018-09-05 by Kevin Feasel

Claudio Silva explains why a two-digit year may be interpreted differently in SQL Server versus Excel:

What do you read when you see some date in a format like “01-Jan-00 00:00:00.000”? Keep in mind that I’m talking about the output directly from the table and without any formatting.
1st of January seems to leave no doubt (just because there is no default date format starting with two digits for the year), but…what about the year part ’00’?
It stands for 1900 and the 3rd column is wrong?
Or it stands for 2000 and the DATEPART function is returning the wrong value?

This is why you want to stick with four-digit years. But if you’re stuck with two-digit years for some reason, Claudio explains how you can get Excel and SQL Server to return the same results.

Comments closed

Network Analysis With Python In Power BI

Published 2018-09-05 by Kevin Feasel

Tori Tompkins shows us how to use the NetworkX package in Power BI:

The data I used was created to demonstrate this task in Power BI but there are many real-world network datasets to experiment with provided by Stanford Network Analysis Project. This small dummy dataset represents a co-purchasing network of books.

The data I loaded into Power BI consisted of two separate CSVs. One, Books.csv, consisted of metadata pertaining to the top 40 bestselling books according to Wikipedia and their assigned IDs. The other, Relationship.csv, was an edgelist of the book IDs which is a popular method for storing/ delivering network data. The graph I wanted to create was an undirected, unweighted graph which I wanted to be able to cross-filter accurately. Because of this, I duplicated this edgelist and reversed the columns so the ToNodeId and FromNodeId were swapped. Adding this new edge list onto the end of the original edgelist has created a dataset with can be filtered on both columns later down the line. For directed graphs, this step is unnecessary and can be ignored.

Once loaded into Power BI, I duplicated the Books table to create the following relationship diagram as it isn’t possible to replicate the relationship between FromNodeId to Book ID and ToNodeId to Book ID with only one Books table.

Read on for an example using this data set.

Comments closed

Kalman Filters With Spark And Kafka

Published 2018-09-04 by Kevin Feasel

Konur Unyelioglu goes deep into Kalman filters:

In simple terms, a Kalman filter is a theoretical model to predict the state of a dynamic system under measurement noise. Originally developed in the 1960s, the Kalman filter has found applications in many different fields of technology including vehicle guidance and control, signal processing, transportation, analysis of economic data, and human health state monitoring, to name a few (see the Kalman filter Wikipedia page for a detailed discussion). A particular application area for the Kalman filter is signal estimation as part of time series analysis. Apache Spark provides a great framework to facilitate time series stream processing. As such, it would be useful to discuss how the Kalman filter can be combined with Apache Spark.

In this article, we will implement a Kalman filter for a simple dynamic model using the Apache Spark Structured Streaming engine and an Apache Kafka data source. We will use Apache Spark version 2.3.1 (latest, as of writing this article), Java version 1.8, and Kafka version 2.0.0. The article is organized as follows: the next section gives an overview of the dynamic model and the corresponding Kalman filter; the following section will discuss the application architecture and the corresponding deployment model, and in that section we will also review the Java code comprising different modules of the application; then, we will show graphically how the Kalman filter performs by comparing the predicted variables to measured variables under random measurement noise; we’ll wrap up the article by giving concluding remarks.

This is going on my “reread carefully” list; it’s very interesting and goes deep into the topic.

Comments closed

Values Belong In Columns

Published 2018-09-04 by Kevin Feasel

John Mount argues that to reduce ambiguity, ensure that your values are columns on appropriate data frames:

Here is an (artificial) example.
chamber_sizes <- mtcars$disp/mtcars$cyl
form <- hp ~ chamber_sizes
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
#
# Coefficients:
#   (Intercept)  chamber_sizes  
#         2.937          4.104  
Notice: one of the variables came from a vector in the environment, not from the primary data.frame. chamber_sizes was first looked for in the data.frame, and then in the environment the formula was defined (which happens to be the global environment), and (if that hadn’t worked) in the executing environment (which is again the global environment).

Our advice is: do not do that. Place all of your values in columns. Make it unambiguous all variables are names of columns in your data.frame of interest. This allows you to write simple code that works over explicit data. The style we recommend looks like the following.

Read the whole thing.

Comments closed

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31