Combining Files In C#

Chris Koester shows how to combine a set of CSVs without duplicating their header rows:

The timings in this post came from combining 8 csv files with 13 columns and a combined total of 9.2 million rows.

I first tried combining the files with the PowerShell technique described here. It was painfully slow and took an hour and a half! This is likely because it is deserializing and then serializing every bit of data in the files, which adds a lot of unnecessary overhead.

Next I tried the C# script below using LINQPad. When reading from and writing to a network share, it took 3 minutes and 56 seconds. Much better! Next I tried it on a local SSD drive and it took just 44 seconds.

Read on for the script itself.  The ReadAllLines method works fine as long as the file isn’t larger than your working memory.

Related Posts

Problems Distributed Systems Experience

RJ Zaworski gives us examples of the types of problems you can run into with distributed systems: Time limits: ending the neverendingHere’s one to ponder: how long can a long-running action go on before the customer (even a very patient, very digital customer) loses all interest in the outcome?Pull up a chair. With no upper […]

Read More

Spark+AI Summit 2019 Announcements

Victoria Holt creates a roundup of Spark+AI Summit 2019 announcements: Rohan Kumar  of Microsoft announced .NET for Apache Spark, making Apache Spark accessible to .NET developers – Git Hub I’m very happy that the Spark for .NET team added F# support. Spark is so much nicer when using functional programming.

Read More

Categories

January 2017
MTWTFSS
« Dec Feb »
 1
2345678
9101112131415
16171819202122
23242526272829
3031