image by Becky Wetherington
I like to think of myself as a code artist, writing code that is beautiful, modern, and full of value. There are times when that reality is shattered by the task at hand; file processing anyone?
Most systems, if not all, have to deal with external data needing to be extracted, transformed, and loaded in. This process is known as ETL. My best guess is it was this way 20 years ago, and will be 20 years from now. If you find yourself having to write an ETL, what can you do to stop yourself from going insane? Follow these tips and tricks and you should be well on your way to success.
Know Your Format, But Don't Trust Your Format
When starting an ETL, we most likely will be dealing with a file format that is different than the current systems format of storing data. This is where the E(xtract) part of ETL is important. Files might come in several data formats: CSV, Excel, Fixed Length, Xml, Yaml, etc. Take the time to understand the files you are receiving. Once the format is understood, we need to see where it can fall apart. Integer columns in CSV documents are notorious for having non-integer values, accommodate for this weirdness. Sometimes these format errors are minor and recoverable, and other times it is cause for a hard stop.
Separate Your Reads From Writes
I've learned that reading the contents of a file is much easier than transforming the contents of a file. If you are dealing with flat files (files on disk), eat them up as fast as possible and put them into another data storage system that is easily readable. I would suggest SQL Server or a document database like RavenDB. Dumping files into a data storage engine will allow you to do complex grouping and querying later. Additionally, separating your reads will allow you to keep track of what was read from what has already been processed and loaded into your system. It might feel like an extra step, but you'll thank me later.
Buffer Dreaded Groups of Files
Sometimes files come in as groups. These files need to be processed and linked together. Also, the process writing the group files will most likely have a lock on them (one or all). The best approach is to buffer based on created time of all each individual file. If you are processing files in real time, give yourself a one or two minute window before snapping up files (adjust for your situation). Give the other process time to complete writing the group. Being too eager to process incoming files will just create big headaches sooner than later.
Always Have an Error Queue
Files will inevitably have issues: Encoding, incomplete data, bad data, the list goes on. When these issues occur, and they will, move the bad files into an error queue to view and process later. This makes it easier to absolve yourself of blame :).
Remember how I mentioned you should split your reads and writes? Following that tip will help you think transactionally. If you are having trouble saving a file into your data storage engine, then just leave the files as is and come back later to process them. This is a system error and not a file error, so be more resilient here and try a couple of times before giving up.
Finally, build metrics into your ETL process. Knowing how many files are currently in need of processing, how many have been processed, and how many errors are outstanding is very important and will help you sniff out issues before the files pile up. Leveraging a data storage engine for your file processing will make it much easier to watch your process.
There you have it, tips and tricks that I have learned from working for one of the most well known newspapers in country, processing *Associated Press stories and images that come in by the thousands. My favorite trick is to separate the reads from writes. Separating my read and write concerns saves a ton of headache and simplifies the code greatly. So go forth my friend, processing files might not be glamorous but it most likely will be core functionality that will be used daily, if not hourly. If all goes well, and you follow these tips, most people will never notice how great a job you did :)