A Lesson in Simplicity
How I made a program run five times faster by rethinking the approach and removing a lot of code


A couple of years ago, I wrote a small tool to convert laser-scan-generated terrain elevation data from a text-based file format to binary raster map files. The input files contain a series of point records, consisting of geographical coordinates and associated elevation values, one point per line. This format is widely used to store and exchange data acquired by laser scanning because it is very easy to write and read, with no need for special applications or libraries. However, the files are huge, due to the inefficient text-based encoding and the explicit storage of point coordinates (which is required for many applications of laser scanning, but redundant for "2.5D" terrain elevation maps). Also, the format can't be used with tools that work with raster maps. That's why I wanted to convert the data to a binary raster file format that is both more memory-efficient and more compatible with common GIS software.
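To illustrate, here is a minimal sketch of a reader for such files, assuming the common whitespace-separated "x y z" layout; real files may carry additional columns (intensity, classification, ...), which are simply ignored here:

```python
# Minimal sketch of reading the text-based point format described above.
# The whitespace-separated "x y z" column layout is an assumption.

def read_points(path):
    """Yield (x, y, z) tuples from a text file with one point per line."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            x, y, z = (float(v) for v in parts[:3])
            yield x, y, z
```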

The Original Approach and Why It Was Really Bad

Originally, I wrote the tool to process data for one specific project. The input data for the entire project area totaled about 2 TB and were split up into several hundred files with unknown and possibly irregular shapes and extents. Back then, my intuitive approach to the conversion consisted of the following steps:

  • Read through all input files, determine the geographic extent of each file and write the extents to an "index file" for later fast lookup.
  • Determine the extent of the entire project area (that's the min/max of the extents of all files).
  • Subdivide the extent of the entire project area into a regular grid of same-sized rectangular cells (Figure 1). Each cell will be stored as one output raster file.
  • For each cell, look up in the index file all input files whose extents overlap the cell's extent. Read through each of these files again and write all elevation values that lie within the cell's extent to the output raster file. (A sketch of this flow follows below Figure 1.)
Figure 1: Exemplary input file extents (grey) and output file raster (green, dashed) as it was set up by the first version of the tool.
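For concreteness, here is a heavily compressed sketch of that original flow. The helper names - compute_extent(), union_of_extents(), subdivide(), empty_raster(), overlaps(), contains(), write_raster() - are placeholders and not taken from the tool's actual code; read_points() is the reader sketched above:

```python
def convert_original(input_files, cell_size):
    # Pass 1: read every input file completely, just to record its extent
    # (an in-memory dict stands in for the index file here).
    index = {path: compute_extent(read_points(path)) for path in input_files}

    # The extent of the entire project area is the min/max over all file extents.
    project_extent = union_of_extents(index.values())

    # Pass 2: for each grid cell, re-read every overlapping input file in full.
    for cell in subdivide(project_extent, cell_size):
        raster = empty_raster(cell, cell_size)
        for path, extent in index.items():
            if overlaps(extent, cell):
                for x, y, z in read_points(path):  # yet another full read of the file
                    if contains(cell, x, y):
                        raster.set_value(x, y, z)
        write_raster(cell, raster)
```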

If you think about this for a moment, you'll probably notice that it is highly inefficient. In a typical scenario, each input file needs to be read completely several times: once to determine its extent and write it to the index file, then again - at least once, but more likely two, four or even more times - to write the output raster files of the cells it overlaps. Since I/O is very slow and the whole process is heavily I/O-bound (there's almost nothing else happening at all), that solution was really suboptimal.

I was aware of this from the beginning, but back then, I had no substantially better idea. The best thing that came to my mind was to split up the input files even further to make them smaller, or to transform them into some clever data structure to avoid having to read through an entire file just to find the points that lie within a given sub-extent. This would have helped a bit, but I knew it was still not optimal. Most importantly, it would have meant writing the whole 2 terabytes of input data to disk again before the actual raster file creation even started. This was questionable not only in terms of time, but also in terms of required disk space. Since the tool was originally needed only for this one specific project, and it was unclear if and when it would be needed again, I left it as it was.

Rethinking the Problem

But it was needed again over the course of the following years, and more often than I had originally expected. At one point, I decided to go back to the drawing board and see if I could finally find a good way to improve it. So I started thinking about it again, and for some time, my thoughts still ran along the lines of introducing more sophisticated ways to handle the data. I thought that I needed to increase the complexity of the process to make it faster - until I suddenly had an epiphany. I realized that I could drastically speed up (and also simplify) the process if I "turned around" the whole approach: instead of iterating over the output cells once and finding the overlapping input files for each cell, I would iterate over the input files once and find the overlapping output cells instead!

What might at first sound like a merely symmetric change has huge consequences.

It starts with the fact that I can completely omit the creation of the index file. Instead of reading through all the input files just to find out their extents, the same one-time full read is already enough to produce all output files: I simply read each point's coordinates and determine its destination output cell. And reading each input file only once instead of two, four or more times means a huge reduction in processing time! But how do I know the extents of the output cells if I don't know the extent of the entire area? Here lies the very heart of the whole idea:

Nobody ever said that the extents of the output cells need to be defined by subdividing the extent of the entire input area. That was just my initial assumption; they can be defined arbitrarily. I can just as well define the borders of the output cells to be, for example, the lines of every 1/10th-degree circle of latitude and longitude (the cell size is configurable in the tool). The output cell surrounding each input point can then be determined by simple modulo arithmetic on its coordinates. With this approach, the area covered by the output cells will in most cases not match exactly the area covered by the input files (Figure 2), but that is no problem at all.
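To make the cell lookup concrete, here is a minimal sketch, assuming the cell size is given in the same unit as the coordinates (0.1 degrees for the 1/10th-degree grid mentioned above); the names are illustrative and not taken from the tool:

```python
import math

CELL_SIZE = 0.1  # 1/10th of a degree, as in the example above; configurable in the tool

def cell_index(x, y, cell_size=CELL_SIZE):
    """Return the integer (column, row) of the grid cell containing point (x, y)."""
    return math.floor(x / cell_size), math.floor(y / cell_size)

def cell_origin(x, y, cell_size=CELL_SIZE):
    """Return the lower-left corner of that cell, here via the modulo remainder."""
    return x - x % cell_size, y - y % cell_size

# Example: a point at longitude 13.4567, latitude 52.5123 lands in cell (134, 525),
# whose lower-left corner is roughly (13.4, 52.5).
print(cell_index(13.4567, 52.5123), cell_origin(13.4567, 52.5123))
```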

Figure 2: Exemplary input file extents (grey) and output file raster (green, dashed) as it is set up by the improved version of the tool.

Now, there's one thing left to discuss: analogous to the old approach, where, in a typical case, each input file had to be read multiple times to write the output files, I now need to read and write each output file several times. An output cell can - and in most cases, will - still intersect the extents of multiple source files, after all.

Why read and write? When an input point is located in a grid cell for which no output raster file has been written yet, the output file for that cell is newly created. It is not immediately written to disk, but kept in RAM instead, so that additional elevation data can be added quickly. This buffer/caching system is designed to hold multiple output files in RAM simultaneously, so the procedure can be repeated a couple of times as input data for new output cells is processed. But what happens when the available RAM is used up? Then the program removes the least recently modified output file from the buffer to make room for a new one - though before the file can be removed from the buffer, it needs to be written to disk. It might also happen that more data needs to be added to an output file that was already written to disk. In this case, the file is loaded into the buffer again and then handled in the same way as if it were newly created.
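The following is a minimal sketch of such a buffer, not the tool's actual implementation: rasters live in a plain dict keyed by cell index, and the least recently modified one is flushed to disk when the buffer is full. new_raster(), load_raster(), save_raster() and the raster's set_value() method are placeholders for the tool's real binary raster I/O; cell_index() is the helper sketched further above, and raster_path() uses a hypothetical naming scheme:

```python
import os

class RasterBuffer:
    """Sketch of an in-RAM buffer for output rasters with least-recently-modified eviction."""

    def __init__(self, max_cells, cell_size):
        self.max_cells = max_cells
        self.cell_size = cell_size
        self.rasters = {}    # cell index -> in-RAM raster object
        self.mod_order = []  # cell indices, least recently modified first

    def add_point(self, x, y, z):
        cell = cell_index(x, y, self.cell_size)
        raster = self.rasters.get(cell)
        if raster is None:
            if len(self.rasters) >= self.max_cells:
                self._evict_one()
            path = raster_path(cell)
            # Re-load a partially written raster, or start a fresh one.
            raster = load_raster(path) if os.path.exists(path) else new_raster(cell, self.cell_size)
            self.rasters[cell] = raster
        raster.set_value(x, y, z)
        self._touch(cell)

    def _touch(self, cell):
        if cell in self.mod_order:
            self.mod_order.remove(cell)
        self.mod_order.append(cell)  # most recently modified at the end

    def _evict_one(self):
        cell = self.mod_order.pop(0)  # least recently modified
        save_raster(raster_path(cell), self.rasters.pop(cell))

    def flush(self):
        """Write all rasters that are still buffered at the end of the run."""
        for cell, raster in self.rasters.items():
            save_raster(raster_path(cell), raster)
        self.rasters.clear()
        self.mod_order.clear()


def raster_path(cell):
    col, row = cell
    return f"cell_{col}_{row}.raster"  # hypothetical naming scheme for the output files
```

With such a buffer in place, the whole conversion collapses into a single loop over the input files: every point is routed into the right in-RAM raster via add_point(), and a final flush() writes whatever is still buffered once the last input file has been read.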

Over the whole process of writing the output raster files, incomplete versions of the files are usually written to and re-loaded from disk several times. However, this is much less time-consuming than the repeated reading of input files in the old version of the tool, because a single output file is much smaller than a typical input file: first, the output files usually cover smaller geographic extents; second, they store the data much more efficiently (a binary raster vs. a textual representation of points with explicitly stored coordinates for each point). And thanks to the buffer system, the writing does not need to happen every time the program switches to another cell. In a way, this is a revival of the old idea of reducing the size of the input files - with the significant difference that I don't need to rewrite the whole input data first.

Results and Conclusion

The exact speed-up of the new method depends on the sizes and shapes of the input and output files. In my tests, the new version was roughly five to seven times faster than the old one - which is a lot, especially since processing a typical data set can easily take dozens of hours. As a nice side effect, the source code of the new version is also shorter and easier to understand, since the code that was required to work with the "index file" and to calculate the extents of the output grid cells from the overall extent of the input data set could be removed. A win for simplicity on all levels.

While I have had many similar experiences over the course of my programming career, I find this one especially remarkable. The problem did not require knowledge of any complicated math, algorithms or sophisticated libraries. By the standards of an experienced software developer, it was quite simple from the beginning, yet I first failed to find an adequately simple (and efficient) solution. This is not surprising, though. Even as a good programmer, you often write bad code at first when it comes to solving a type of problem that you haven't encountered before. Your general knowledge and experience go a long way, but they're not enough. To me, this phenomenon is one of the most fascinating things about programming. No matter how experienced you are, the expanse of uncharted wisdom that lies ahead of you remains infinite, and you learn new things every day - sometimes very complex, sometimes (seemingly?) simple. How cool is that?