OCR Processed Guantanamo Protocols

I was curious about the protocols that the US Department of Defense published some days ago. The originals are in PDF format, but the pages are just embedded images rather than normal PDF text, which prevents searching and indexing, and they don't print well. Last but not least, they are hard to read. I therefore ran optical character recognition software (ABBYY FineReader) on the roughly 4700 pages, which took the better part of the last 24 hours. The resulting PDFs are in this directory. A few random samples that I looked at indicate that the OCR quality is not perfect but usable: the error rate is low enough, and the layout is reasonably well preserved. The visual readability of the resulting PDFs is greatly improved as well.

I don't know how well proper names are preserved, though (probably one of the most important search criteria), since dictionary correction cannot be applied to them. There are also some dozen handwritten pages that need to be processed manually; the automated output for these pages is just garbage (cf. set 5). This still needs to be done.

The resulting PDF files are named like the originals with an "_ocr" suffix for easy sorting and comparison.
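For anyone who wants to pair originals with their OCR counterparts in a script, the naming rule can be sketched as follows. This is only an illustration: it assumes the "_ocr" suffix sits directly before the ".pdf" extension, and the file name "set_1.pdf" and both function names are hypothetical, not taken from the actual directory.

```python
from pathlib import PurePath

OCR_SUFFIX = "_ocr"  # assumed to be appended to the base name, before ".pdf"


def ocr_name(original: str) -> str:
    """Return the assumed OCR file name for an original file name."""
    p = PurePath(original)
    return p.stem + OCR_SUFFIX + p.suffix


def original_name(ocr: str) -> str:
    """Return the original file name for an OCR file name."""
    p = PurePath(ocr)
    if not p.stem.endswith(OCR_SUFFIX):
        raise ValueError(f"{ocr!r} does not carry the {OCR_SUFFIX!r} suffix")
    return p.stem[: -len(OCR_SUFFIX)] + p.suffix


# "set_1.pdf" is a made-up example name.
print(ocr_name("set_1.pdf"))
print(original_name("set_1_ocr.pdf"))
```

Because the suffix comes after the original base name, an alphabetical listing puts each OCR file directly next to its original.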

I believe that copying and processing these files is OK, since the DefenseLINK website specifically states that "Information presented on DefenseLINK is considered public information and may be distributed or copied unless otherwise specified." If anybody feels that their privacy has been compromised, though, please drop me a short note and I'll remove the pages in question.

I hope the files are of some use.

Last change: March 8th, 2006