Feather

Kevin Feasel

2016-05-25

R

David Smith discusses Feather:

Unlike most other statistical software packages, R doesn’t have a native data file format. You can certainly import and export data in any number of formats, but there’s no native “R data file format”. The closest equivalent is the saveRDS/readRDS function pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don’t hew to a standardized format (it’s essentially a dump of R’s in-memory representation of the object), and so you can’t read the data with any software other than R.

The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file), and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it’s a column-oriented file format, which matches R’s internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files, R must spend significant time translating the data from column format to row format and back again; with feather, that translation step is eliminated entirely.
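To make the size claim concrete, here is a minimal sketch in plain Python comparing a text encoding of a few 10-digit integers with a fixed-width binary encoding. It does not use Feather or its actual on-disk layout; the sample values and variable names are made up for illustration, and the only assumption is that each value fits in a signed 32-bit integer.

```python
import struct

# Hypothetical sample column: 10-digit values that still fit in a signed 32-bit integer.
values = [1234567890, 2000000000, 1999999999, 1000000001]

# Text encoding, roughly as a one-column CSV would store it:
# one ASCII character per digit, plus a newline between values.
as_text = "\n".join(str(v) for v in values).encode("ascii")

# Native binary encoding: each value packed as a little-endian signed 32-bit integer.
as_binary = struct.pack(f"<{len(values)}i", *values)

# Reading the binary column back is a fixed-width unpack, with no digit parsing.
round_trip = list(struct.unpack(f"<{len(values)}i", as_binary))

print(f"text bytes:   {len(as_text)}")
print(f"binary bytes: {len(as_binary)}")
print(f"round trip ok: {round_trip == values}")
```

Four values take 16 bytes in binary against 43 bytes as text, and the binary read skips the text-to-number conversion entirely, which is the effect the quoted passage describes.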

Given the big speedup in read time, I can see this file format being rather useful. I just can’t see it catching on as a common external data format, though, unless most tools get retrofitted to support it. So instead, it’d end up closer to something like Avro or Parquet: formats we use in our internal tools because they’re so much faster, but not formats we send across to other companies because they’re probably using a different set of tools.
