Feather

Kevin Feasel

2016-05-25

R

David Smith discusses Feather:

Unlike most other statistical software packages, R doesn’t have a native data file format. You can certainly import and export data in any number of formats, but there’s no native “R data file format”. The closest equivalent is the saveRDS/loadRDS function pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don’t hew to a standardized format (it’s essentially a dump of R in-memory representation of the object), and so you can’t read the data with any software other than R.

The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file), and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it’s a column-oriented file format, which matches R’s internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files with R, it must spend signfican time translating the data from column format to row format and back again; with feather the entire second step in the process below is eliminated.

Given the big speedup in read time, I can see this file format being rather useful.  I just can’t see it catching on as a common external data format, though, unless most tools get retrofitted to support the file.  So instead, it’d end up closer to something like Avro or Parquet:  formats we use in our internal tools because they’re so much faster, but not formats we send across to other companies because they’re probably using a different set of tools.

Related Posts

Conjoint Analysis In R

Abhijit Telang introduces the concept of conjoint analysis and shows how you can implement this in R: We will need to typically transform the problem of utility modeling from its intangible, abstract form to something that is measurable. That is, we wish to assign a numeric value to the perceived utility by the consumer, and […]

Read More

Improving Plots With ggformula

Sebastian Sauer shows how you can use the ggformula package combined with ggplot2 to enhance your R visuals: Since some time, there’s a wrapper for ggplot2 available, bundled in the package ggformula. One nice thing is that in that it plays nicely with the popular R package mosaic. mosaic provides some useful functions for modeling along with a tamed and consistent […]

Read More

Categories

May 2016
MTWTFSS
« Apr Jun »
 1
2345678
9101112131415
16171819202122
23242526272829
3031