Unlike most other statistical software packages, R doesn’t have a native data file format. You can certainly import and export data in any number of formats, but there’s no native “R data file format”. The closest equivalent is the
loadRDSfunction pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don’t hew to a standardized format (it’s essentially a dump of R in-memory representation of the object), and so you can’t read the data with any software other than R.
The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file), and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it’s a column-oriented file format, which matches R’s internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files with R, it must spend signfican time translating the data from column format to row format and back again; with feather the entire second step in the process below is eliminated.
Given the big speedup in read time, I can see this file format being rather useful. I just can’t see it catching on as a common external data format, though, unless most tools get retrofitted to support the file. So instead, it’d end up closer to something like Avro or Parquet: formats we use in our internal tools because they’re so much faster, but not formats we send across to other companies because they’re probably using a different set of tools.
It’s not fast. The above piece of T-SQL took ~4 seconds to execute. This is on an Azure A3 VM. Not a great machine admittedly, but the R code, which just returns the first 6 rows of a built-in data set, ran in under a second on my desktop. This is likely not something you’ll be doing as part of an OLTP process.
I hope this external_script method is temporary. It’s ugly, hard to troubleshoot, and it means I have to write my R somewhere else, probably R Studio, maybe Visual Studio, and move it over once tested and working. I’d much rather see something like
I agree with the sp_execute_external_script mess. It’s the worst of dynamic SQL combined with multiple languages (T-SQL for the stored procedure & R for the contents, but taking care to deal with T-SQL single-quoting). Still, even with these issues, I think this will be a very useful tool for data analysts, particularly when dealing with rather large data sets on warehouse servers with plenty of RAM.
The code is executed as an external script, specifying that the language used should be R. @script contains the R code, which is a simple command to take the mean of the data coming from the InputDataSet. @Input_Data_1 contains the location of the data to be processed. In this case the data set is a table containing Amazon review data, where the overall field is the rating field. Of course the R code could of course be more complicated, but I was hoping that this example was generic enough that many people would be able to duplicate it and run their first R code.
This is quite a bit easier to install in RTM(ish) than it was back in CTP 3, so good job Microsoft.
From the above observation, a natural next step for us was to explore the possibility of automatically generating this handwritten code at runtime, which we are calling “whole-stage code generation.” This idea is inspired by Thomas Neumann’s seminal VLDB 2011 paper onEfficiently Compiling Efficient Query Plans for Modern Hardware. For more details on the paper, Adrian Colyer has coordinated with us to publish a review on The Morning Paper blog today.
The goal is to leverage whole-stage code generation so the engine can achieve the performance of hand-written code, yet provide the functionality of a general purpose engine. Rather than relying on operators for processing data at runtime, these operators together generate code at runtime and collapse each fragment of the query, where possible, into a single function and execute that generated code instead.
The possibility of getting an order of magnitude better performance is certainly enticing.
Grafana provides a powerful and customizable dashboard builder for visualizing time series data. Ambari installs Grafana v2.6 as a Master Component of AMS and adds a datasource for AMS to Grafana. The dashboard builder is supported through a Metadata API in AMS that allows easy discovery of metrics, applications and hosts which are the key components that formalize an API call to AMS. There has been significant work put into creating templated dashboards for Hadoop ecosystem services tailored towards analyzing issues and performance bottlenecks on the Hadoop cluster. The following is an image of the dashboard builder highlighting the metric name drop down with type ahead and auto complete along with options to apply aggregate functions as needed based on whether the metric is a GAUGE or a COUNTER.
This is the beginning of a good visualization system for Hadoop metrics.
This is logically equivalent to the first version of the code, but I find it makes for more readable code. It just looks cleaner.
For those of us who are
lazylooking to maximize efficiency, this could save a whole lot of key strokes.
This is true, but if you’re on SQL Server 2012 or later, check out CONCAT for concatenation, as it handles NULL values more elegantly.
Now since Power BI Custom Visualizations are not provided by Microsoft, they feel compelled to give you a warning message letting users know this. Here is the message box you get in Power BI Desktop when using a custom visualization. Notice that I clicked on the check box next to the text Don’t show this dialog again. As Words mean things, checking this box means the warning message never appears again. When you import the visualization into Power BI, no warning messages. Now I can use and propose custom visualizations to clients because they really are neat, and now they contain no warnings. Thanks so much to the Power BI Product team for fixing this major issue.
This is good news.
In every statistical analysis, the first thing one should do is try and visualise the data before any modeling. In microarray studies, a common visualisation is a heatmap of gene expression data.
In this post I simulate some gene expression data and visualise it using the
Rby Tal Galili. This package extends the plotly engine to heatmaps, allowing you to inspect certain values of the data matrix by hovering the mouse over a cell. You can also zoom into a region of the heatmap by drawing a rectangle over an area of your choice
This went way past my rudimentary heatmap skills, so it’s nice to see what an advanced user can do.
If you notice, in the above cmdlet the where-clause I’m selecting to use the Column1 property instead of a reasonable label. In my scenario the data in the CSV file contain variable columns fopr its different data types such as: Info, Error, and System. So, it was easy to identify the total number of columns to be 15 columns.
Now, using the cmdlet “Import-Csv” using the parameter “-Header”, you can define a list columns when you build the $Logdata object. We create the $header variable with the column-names separated by comma.
Keep an eye out for part 3. In the meantime, check out part 1 if you haven’t already.
I’ve made a few changes to the XQueryPlanPath project. The project parses query plans into xml and then using xpath to find the value of one or more nodes. This could then be used in testing to verify that any changes made to a query would retain a query plan that is considered optimal, and then if any changes break the test you can verify if the change causes sub-optimal effect on your query.
There was however one issue – query plans are like opinions; every SQL Server Instance has one, and none of them think that theirs stinks. So running a test on a dev box will potentially produce a different query plan from that on the build server, to that of production etc. This broadly because of 3 reasons:
Check it out, especially if your XML parsing skills aren’t top-notch.