FileTable has been around for quite some time, and it is useful for storing files, documents, pictures and other binary files in a designated SQL Server table: a FileTable. The best part of FileTable is that one can access the files from Windows or another application as if they were stored on the file system (because they are), without making any other changes on the client.
This feature is especially handy for storing outputs from Microsoft R Server. In this blog post I will focus mainly on persistently storing charts from statistical analysis.
I can see this being quite useful for things like automatically sampling data for quality control.
Note that the parameters of xgboost used here fall into three categories:
General parameters
- nthread (number of threads used, here 8 = the number of cores in my laptop)
Booster parameters
- max.depth (maximum depth of each tree)
Learning task parameters
- objective: type of learning task (softmax for multiclass classification)
- num_class: needed for the “softmax” algorithm: how many classes to predict?
Command Line Parameters
- nround: number of rounds for boosting
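As a rough illustration, the grouping above could be written down as a plain Python dictionary. This is only a sketch: the post's own code is in R, the names max_depth and num_boost_round are Python-style stand-ins for max.depth and nround, and the values for max_depth, num_class and num_boost_round are hypothetical placeholders, since the excerpt does not give them.

```python
# Parameters grouped by category. Only nthread = 8 and the softmax
# objective come from the excerpt; every other value is a placeholder.
params = {
    # General parameters
    "nthread": 8,                  # number of threads (8 cores in the author's laptop)
    # Booster parameters
    "max_depth": 6,                # hypothetical maximum tree depth
    # Learning task parameters
    "objective": "multi:softmax",  # softmax for multiclass classification
    "num_class": 3,                # hypothetical number of classes to predict
}

# The number of boosting rounds is kept apart from the params dict here.
num_boost_round = 50               # hypothetical value for nround
```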
Read the whole thing.
For this post I decided to go with a simple example: how many steps I walked per day in the month of August. My goal is 10,000 steps per day. That has been my average over the year, but is it true of the data I gathered in August? I have a simple table with two columns, day and steps. Each record holds how many steps I took on one day in August, for 30 days. SELECT AVG(steps) FROM [dbo].[mala-steps] gives me 8262 as my average number of steps per day in August. I want to know if I am consistently underperforming my goal, or if this is a result of my being less active in August alone. Let me state my problem first, or rather state what is called the ‘null hypothesis’:
I walk 10,000 steps per day on average over the year.
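To make the test concrete, here is a minimal sketch of a one-sample t statistic in Python, using only the standard library. It is not the code from the linked post (which uses T-SQL and R), and the step counts below are hypothetical, not the author's August data.

```python
import math
import statistics


def one_sample_t(sample, hypothesized_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t test."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample standard deviation (n - 1 denominator)
    standard_error = sd / math.sqrt(n)
    t = (mean - hypothesized_mean) / standard_error
    return t, n - 1


# Hypothetical daily step counts, NOT the author's data.
steps = [8100, 9200, 7600, 8800, 7900, 8500, 9100, 7700]
t_stat, dof = one_sample_t(steps, 10_000)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

A t statistic far below zero would count as evidence against the null hypothesis; comparing it with a t distribution on n - 1 degrees of freedom gives the p-value.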
Read on for T test operations in T-SQL (although not all operations are available) and R.
According to Stack Overflow documentation, these are the categories of questions that may be closed by the community users:
- off topic
- too broad
- primarily opinion-based
Not everyone in the Stack Overflow community is able to close a question. In fact, users need to have a certain reputation, expressed in points (more details here).
Calculating the overall website closure rate is easy. Just use the original “questions_2016” dataset and count how many questions have the field “Closed Date” populated. Over 10% of questions asked in 2016 have been closed so far.
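The counting step can be sketched in a few lines of Python. The rows below are hypothetical stand-ins for the “questions_2016” dataset, and the sketch assumes that an unpopulated “Closed Date” shows up as None or an empty string.

```python
def closure_rate(questions):
    """Fraction of questions whose "Closed Date" field is populated."""
    if not questions:
        return 0.0
    closed = sum(1 for q in questions if q.get("Closed Date"))
    return closed / len(questions)


# Hypothetical rows, not real Stack Overflow data.
sample_questions = [
    {"Id": 1, "Closed Date": "2016-03-01"},
    {"Id": 2, "Closed Date": None},
    {"Id": 3, "Closed Date": ""},
    {"Id": 4, "Closed Date": None},
]
print(f"{closure_rate(sample_questions):.0%} of the sample questions were closed")
```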
If you’re interested in learning more about data analysis, walk through the exercise as well and play around with the data set too. Hat tip, R-Bloggers.
Now that we can separate data for each group(s), we can fit a model to each tibble with map() from the purrr package (also tidyverse). We’re going to add the results to our existing tibble using mutate() from the dplyr package (again, tidyverse). Here’s a generic version of our pipe with adjustable parts in caps:
Read the whole thing. Hat tip, R-Bloggers.
Date time rounding (with ceiling_date()) now supports unit multipliers, like “3 days” or “2 months”:
ceiling_date(ymd_hms("2016-09-12 17:10:00"), unit = "5 minutes")
#> [1] "2016-09-12 17:10:00 UTC"
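The idea behind a unit multiplier can be sketched outside R as well. The Python function below is not lubridate's implementation; ceiling_minutes is a hypothetical helper that rounds a datetime up to the next multiple of a number of minutes, counted from the start of the hour, and assumes the multiple divides 60 evenly.

```python
from datetime import datetime, timedelta


def ceiling_minutes(moment, multiple):
    """Round a datetime up to the next multiple of `multiple` minutes.

    Values that already sit on a boundary are returned unchanged.
    """
    start_of_hour = moment.replace(minute=0, second=0, microsecond=0)
    elapsed = moment - start_of_hour
    step = timedelta(minutes=multiple)
    # -(-a // b) is ceiling division; timedelta // timedelta yields an int.
    steps_needed = -(-elapsed // step)
    return start_of_hour + steps_needed * step


on_boundary = ceiling_minutes(datetime(2016, 9, 12, 17, 10, 0), 5)
between = ceiling_minutes(datetime(2016, 9, 12, 17, 12, 30), 5)
print(on_boundary, between)
```

The first call mirrors the example above: 17:10:00 is already a multiple of five minutes, so it comes back unchanged, while 17:12:30 moves up to 17:15:00.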
If you handle date and time data in R, Lubridate is a tremendous asset.
To illustrate the scenario, we will focus on companies that operate machines which encounter mechanical failures. These failures lead to downtime, which has cost implications for any business, so most companies are interested in predicting failures ahead of time and proactively preventing them. This scenario is aligned with an existing R Notebook published in the Cortana Intelligence Gallery but works with a larger dataset, where we focus on predicting component failures of a machine using raw telemetry, maintenance logs, previous errors/failures and additional information about the make/model of the machine. The scenario is widely applicable to almost any industry that uses machines needing maintenance. A quick overview of typical feature engineering techniques, as well as how to build a model, is given below.
Understanding when machines are likely to break down is a very interesting statistical problem. Check out the template.
For any dataset to lend itself to the Chi Square test it has to fit the following conditions –
1. Both variables are categorical (in this case, exposure to smoking (yes/no) and health condition (sick/not sick) are both categorical).
2. Researchers used a random sample to collect data.
3. Researchers had an adequate sample size. Generally the sample size should be at least 100.
4. The number of respondents in each cell should be at least 5.
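As a minimal sketch of how the two sample-size conditions and the test statistic fit together, the Python function below checks the conditions on a table of observed counts and then computes the plain Pearson chi-square statistic. The function name, the thresholds as parameters, and the counts are all hypothetical; no continuity correction is applied, so the result may differ from what a statistics package reports for a 2x2 table.

```python
def chi_square_statistic(table, min_total=100, min_cell=5):
    """Check the sample-size conditions, then return the chi-square statistic.

    `table` is a list of rows of observed counts, e.g. [[a, b], [c, d]].
    """
    total = sum(sum(row) for row in table)
    if total < min_total:
        raise ValueError("sample size is below the suggested minimum")
    if any(cell < min_cell for row in table for cell in row):
        raise ValueError("at least one cell has too few respondents")

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows = exposed / not exposed, columns = sick / not sick.
observed = [[30, 20], [20, 30]]
print(chi_square_statistic(observed))
```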
This is an easy case for using R over T-SQL—the Chi Square test is built in, whereas you have to roll your own T-SQL code. Mala does show you how to do this from within SQL Server R Services as well.
If your Shiny app contains computations that take a long time to complete, a progress bar can improve the user experience by communicating how far along the computation is, and how much is left. Progress bars were added in Shiny 0.10.2. In Shiny 0.14, we’ve changed them to use the notifications system, which gives them a different look.
Important note: If you were already using progress bars and had customized them with your own CSS, you can add the style = "old" argument to your Progress$new() call. This will result in the same appearance as before. You can also call shinyOptions(progress.style = "old") in your app’s server function to make all progress indicators use the old styling.
It looks like they’ve made some good progress with Shiny.
The statistical definition of Pearson’s R Coefficient, as it is called, can be found in detail here for those interested. A value of 1 indicates a perfect positive correlation (the two variables in question increase together), 0 indicates no correlation between them, and -1 indicates a perfect negative correlation (one variable decreases as the other increases). But you rarely get a perfect -1, 0 or 1. Most values are fractional and interpreted as follows:
High correlation: 0.5 to 1.0 or -0.5 to -1.0.
Medium correlation: 0.3 to 0.5 or -0.3 to -0.5.
Low correlation: 0.1 to 0.3 or -0.1 to -0.3.
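For readers who want to see the arithmetic, here is a minimal Python sketch that computes Pearson's coefficient from two lists and maps it onto the bands above by absolute value. It is not the R or T-SQL code from the linked post; the data are hypothetical, and the "negligible" label for values under 0.1 is an assumption, since the list does not name that range.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def describe(r):
    """Map a coefficient onto the high / medium / low bands by magnitude."""
    strength = abs(r)
    if strength >= 0.5:
        return "high"
    if strength >= 0.3:
        return "medium"
    if strength >= 0.1:
        return "low"
    return "negligible"  # assumed label; not part of the original list


# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 60, 68]
r = pearson_r(hours, scores)
print(f"r = {r:.2f} ({describe(r)} correlation)")
```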
Mala includes R and T-SQL code so you can follow along.