I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
Now, none to my surprise, it was slow. In my short experience, I’ve found R isn’t particularly snappy–even when the data can fit comfortably in memory. I was pretty sure I could beat R’s record count performance handily with C#. And I did. I found some related questions on StackOverflow. A small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).
To fact-check Dave (because this blog is about nothing other than making sure Dave is right), I checked the source code to the wc command. That command also streams through the entire file, so Dave’s premise looks good.