Over the last several months, IGIS has been working to collect, digitize and spatially reference historical project data from the RECs. Recently, we were given some very detailed data for projects at the Hopland REC dating back to 1952. Unfortunately, the data came in the form of two binders filled with typed annual project lists, well over one hundred pages in all. Somehow I needed to find a way to get all of these historical projects into a database that could then be attached to our existing spatial data sets.
First, I tried the manual entry route, simply typing each project entry from the historical lists into the database. I quickly realized that with my mediocre typing skills, this approach would take far too long and be incredibly frustrating. So I decided to search for a technological solution to speed up the process, and Optical Character Recognition (OCR) software seemed like an obvious path to take.
There are numerous web-based OCR options out there that are both free and reasonably accurate, but they all seem to limit you to uploading and converting a single page of text at a time. If you just have a page or two of text to work with, that is fine, but it's a less than optimal solution for processing very large amounts of data.
Fortunately, I discovered that Drive, Google's free cloud storage service, will automatically detect text and perform OCR on any PDF that you upload. The process is incredibly simple, fast and also quite accurate. The formatting was not always perfect, and it seemed to have issues with certain numbers and symbols, but in a very short time I had all my data loaded into Google Drive and converted to real text that I could simply copy and paste into my database. It drastically cut the time it took to complete the project, and I would recommend Google Drive to anyone who needs to do OCR on large datasets.
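For anyone who would rather script the copy-and-paste step, here is a minimal Python sketch of one way to turn a block of pasted OCR text into rows ready for a database import. It is not what I actually did, and it rests on assumptions: that each project entry sits on its own line, and that a table with year, entry_number and entry_text columns is what you want. The function names and the sample text are made up for illustration.

```python
import csv
import io
import re


def ocr_text_to_rows(text, year):
    """Turn pasted OCR text into (year, entry_number, entry_text) rows.

    Assumes one project entry per line, which may not match your lists.
    """
    rows = []
    for line in text.splitlines():
        # Collapse runs of spaces/tabs left over from the typed layout.
        cleaned = re.sub(r"\s+", " ", line).strip()
        if not cleaned:
            continue  # skip blank lines between entries
        rows.append((year, len(rows) + 1, cleaned))
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text that a database import tool can read."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year", "entry_number", "entry_text"])  # hypothetical columns
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up placeholder text standing in for one pasted page.
sample = "Project   entry one\n\n  Project entry two  \n"
rows = ocr_text_to_rows(sample, 1952)
print(rows_to_csv(rows))
```

A script like this only tidies whitespace and numbering. The misread numbers and symbols mentioned above would still need to be checked by eye against the original pages.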