The above XML produces the following output (don’t worry too much about the call of xplain(); we will discuss later on in more detail how to work with the library):

library(car)
library(xplain)
xplain(call="lm(education ~ young + income + urban, data=Anscombe)",
       xml="http://www.zuckarelli.de/xplain/example_lm_foreach.xml")
## lm(formula = education ~ young + income + urban, data = Anscombe)
## (Intercept) young income urban
## -286.83876 0.81734 0.08065 -0.10581
## Interpreting the coefficients
## Your coefficient ‘(Intercept)’ is smaller than zero.
## Your coefficient ‘young’ is larger than zero. This means that the
## value of your dependent variable ‘education’ changes by 0.82 for
## any increase of 1 in your independent variable ‘young’.
## Your coefficient ‘income’ is larger than zero. This means that the
## value of your dependent variable ‘education’ changes by 0.081 for
## any increase of 1 in your independent variable ‘income’.
## Your coefficient ‘urban’ is smaller than zero. This means that the
## value of your dependent variable ‘education’ changes by -0.11 for
## any increase of 1 in your independent variable ‘urban’.
I’ll be interested in looking at this in more detail, though my first-glance impression is that it’ll be useful mostly in large shops where different teams create and use models.
Sentiment analysis is a natural language processing method which classifies the words in a document by whether each word is positive or negative, or by whether it relates to a set of basic human emotions; the exact results differ based on the sentiment lexicon selected. The tidytext R package supports four sentiment lexicons:
- “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
- “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
- “loughran” for Loughran-McDonald – intended mostly for financial and nonfiction works, which classifies words as positive or negative, as well as into the topics uncertainty, litigious, modal, and constraining
- “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment
Sentiment analysis works on unigrams – single words – but you can aggregate across multiple words to look at sentiment across a text.
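As a minimal sketch of that unigram-then-aggregate workflow (the two lines of text and the column names here are illustrative, not taken from the post), you tokenize a document into single words and join against a lexicon — here “bing”, which ships with tidytext:

```r
library(dplyr)
library(tidytext)

# A tiny two-line "document" (illustrative text only)
doc <- tibble(line = 1:2,
              text = c("such a lovely place",
                       "this could be heaven or this could be hell"))

doc %>%
  unnest_tokens(word, text) %>%                       # split into unigrams
  inner_join(get_sentiments("bing"), by = "word") %>% # keep lexicon words only
  count(sentiment)                                    # aggregate across the text
```

Words not present in the lexicon (like “place”) simply drop out of the join, which is worth remembering when interpreting aggregate counts.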
To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.
Read the whole thing, though you can’t check out afterward.
In the code above we use the RxInSqlServer() function to indicate we want to execute in a SQL context. The connectionString property defines where we execute, and the numTasks property sets the number of tasks (processes) to run for each computation; in Code Snippet 4 it is set to 1, which from a processing perspective should match what we do in Code Snippet 3. Before we execute the code in Code Snippet 4, we do what we did before we ran the code in Code Snippet 3:
- Run Process Explorer as admin.
- Navigate to the devenv.exe process in Process Explorer.
- In addition, look at the Launchpad.exe process in Process Explorer.
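For context, the compute-context configuration described above looks roughly like this. This is a sketch, not the article’s actual Code Snippet 4: the connection string is a placeholder, and the RevoScaleR package is only available with Microsoft R Client/Server or SQL Server R Services.

```r
library(RevoScaleR)

# Placeholder connection string -- adjust server, database, and credentials
connStr <- "Driver=SQL Server;Server=myServer;Database=myDB;Trusted_Connection=True"

# numTasks = 1: one task per computation, matching Code Snippet 3's behaviour
sqlCompute <- RxInSqlServer(connectionString = connStr, numTasks = 1)

# Subsequent rx* calls now execute inside SQL Server rather than locally
rxSetComputeContext(sqlCompute)
```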
When we execute, we see that the BxlServer.exe processes under the Microsoft.R.Host.exe processes are idling, but when we look at the Launchpad.exe process we see this:
This is a bit deep but interesting reading.
This post shows how easy it can be to build a visually pleasing plot. We will use hrbrmstr’s ggcounty, an R package available at this Github repo. The graphics engine, as in most of my plots, is Hadley Wickham’s ggplot2. All built on R. Standing on shoulders…
Disclaimer: This example draws heavily on hrbrmstr’s example on this page. All credit is due to Rudis, and to those on whose work he built.
In just a few lines of code, you can have a pretty nice map.
The most common solution I see offered is along the lines of a SWITCH statement that lists 12 conditions (one for each month). This works, but it can also be done with existing functions.
While DAX lacks a dedicated function to convert a number to its text equivalent, such as DATENAME in T-SQL, we can get there with two functions: DATEVALUE wrapped in FORMAT.
To demonstrate, I will create a simple table with 13 values (1 through 13) using the following calculated table.
Table = GENERATESERIES(1,13)
This creates a single column table with 13 rows.
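Against that table, the two-function conversion can be sketched as a calculated column along these lines. The column name is mine, the [Value] column is what GENERATESERIES produces, and the date-string format assumes a US-style locale; note that the 13th row will fail, since there is no 13th month.

```dax
Month Name =
FORMAT(
    DATEVALUE('Table'[Value] & "/1/2020"),  -- build a date string like "3/1/2020"
    "MMMM"                                  -- full month name, e.g. "March"
)
```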
Read on for the rest of the story.
Disconnected Table method:
This method is geared more toward Power BI modelers. The idea is to put a field from an independent table (one with no relationships to other tables) on a slicer holding your measure choices, and then create a measure that uses the SELECTEDVALUE function to switch dynamically among measures based on the slicer selection.
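A rough sketch of the disconnected-table pattern, with illustrative names throughout ([Total Sales] and [Total Profit] are stand-ins for whatever measures your model actually has):

```dax
-- Disconnected table holding the measure choices (no relationships to other tables)
Measure Choice = DATATABLE("Choice", STRING, {{"Sales"}, {"Profit"}})

-- Measure that switches based on the slicer selection
Selected Metric =
SWITCH(
    SELECTEDVALUE('Measure Choice'[Choice], "Sales"),  -- default when nothing selected
    "Sales",  [Total Sales],
    "Profit", [Total Profit]
)
```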
Click through for both methods.
And then I had to write about it in my book Introducing Microsoft SQL Server 2016 (which is free to download) when JSON support was added to SQL Server 2016. But I still didn’t have clients using JSON. It was interesting to me that I could use SQL Server to work with JSON data, but it was still theoretical to me rather than practical.
Therefore, I never thought much about how I would handle it in SQL Server Integration Services (SSIS). I just didn’t have a reason.
Until now. This seems to be the year that I am bumping into JSON left and right. It’s everywhere!
Read on for those methods as well as Stacia’s recommendation.
This week I had a user come to me asking how fields were defined on a few tables he was using in writing some reports. Long story short, he’s been tasked with writing some new reports and updating existing ones, but he doesn’t have visibility into the database itself, so he’s left with the “ok, let’s try this” approach and then reading error messages to debug when things go sideways. Very tedious.
I asked him for a list of the tables he was most interested in, and while he worked on that I set to work on (of course) a quick & dirty PowerShell script to collect the data he needed – field names, types, and whether they’re nullable.
Ideally these analysts would have data model documentation, but it’s not an ideal world.
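As a rough sketch of that approach — this is not the author’s actual script; the server, database, and table names are placeholders, and it assumes the SqlServer module’s Invoke-Sqlcmd is available — the metadata lives in INFORMATION_SCHEMA.COLUMNS:

```powershell
# Tables the analyst cares about (placeholders)
$tables = @('dbo.Orders', 'dbo.Customers')

# Field names, types, and nullability from the standard metadata views
$query = @"
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA + '.' + TABLE_NAME IN ('$($tables -join "','")')
ORDER BY TABLE_NAME, ORDINAL_POSITION
"@

Invoke-Sqlcmd -ServerInstance 'MyServer' -Database 'MyDb' -Query $query |
    Format-Table -AutoSize
```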
You have many choices when it comes to storing and processing data on Hadoop, which can be both a blessing and a curse. The data may arrive in your Hadoop cluster in a human readable format like JSON or XML, or as a CSV file, but that doesn’t mean that’s the best way to actually store data.
In fact, storing data in Hadoop using those raw formats is terribly inefficient. Plus, those file formats cannot be stored in a parallel manner. Since you’re using Hadoop in the first place, it’s likely that storage efficiency and parallelism are high on the list of priorities, which means you need something else.
Luckily for you, the big data community has basically settled on three optimized file formats for use in Hadoop clusters: Optimized Row Columnar (ORC), Avro, and Parquet. While these file formats share some similarities, each of them are unique and bring their own relative advantages and disadvantages.
Read the whole thing. I’m partial to ORC and Avro but won’t blink if someone recommends Parquet.
The best place to start when exploring the purrr package is the map function. The reader will notice that these functions are utilised in a very similar way to the apply family of functions. The subtle difference is that the purrr functions are consistent, and the user can be assured of the output – as opposed to some cases when using, for example, sapply, as I demonstrate later on.
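A quick sketch of that consistency point (my own toy example, not the author’s): sapply’s return type depends on its input, while the typed map variants always return what they promise.

```r
library(purrr)

# sapply's return type depends on the input it receives:
sapply(1:3, function(x) x * 2)     # a numeric vector: 2 4 6
sapply(list(), function(x) x * 2)  # an empty list -- the type changed silently

# map_dbl() always returns a double vector, or fails loudly:
map_dbl(1:3, function(x) x * 2)    # 2 4 6
```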
My considered belief is: Always Be Purrring. H/T R-bloggers