Tuesday 22 January 2019

Search PDFs with non-standard character encodings

Some PDF files render correctly but produce garbage ("mojibake") when you copy text from them. This also makes them impossible to search, because whatever you search for is compared against the garbage rather than the text you see on screen.


Does anyone have an easy workaround?


Examples:



  1. TEAC TV manual EU2816STF (shows the problems above in Adobe Reader on both Windows and a Mac, but works fine in Preview on a Mac)

  2. Leadtek Winfast PVR2 manual (FTP link; also has problems in Preview on a Mac)

  3. Swann TV tuner card manual (FTP link; also has problems in Preview on a Mac)

  4. Phonedisc license agreement (from the now-defunct DTMS)

  5. Macquarie IFP quarterly fund review

  6. BAN-TACS Small Business Booklet (archived version)

  7. Easterfest 2004 flyer (also from the archive)


I am using Adobe Reader (latest version) for Windows - perhaps an alternative viewer might help? I'm looking for a free solution for Windows. Open-source would be even better.


Edit: The docs for the Multivalent Extract Text tool (last modified Jan 2006) have a good summary of why things can go wrong, including:




  • Text may not have a Unicode mapping. PDF Type 3 fonts often do not, and TeX DVI has characters that do not have Unicode equivalents.

  • The Unicode encoding may be buggy. Open Office maps some characters into the same Unicode, resulting in apparent letter dropping and doubling.
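
To make those two failure modes concrete, here is a minimal Python sketch. It is a toy model, not how any particular viewer works: it assumes a hypothetical subset font whose glyph codes were assigned in order of first use, and a viewer that falls back to reading those codes as Latin-1 when no Unicode mapping is present. The names `build_subset_font` and `extract` are made up for illustration.

```python
# Toy model of copy/paste from a PDF: the page stores glyph codes, not characters.
# All names and data here are hypothetical illustrations.

def build_subset_font(text):
    """Assign glyph codes in order of first use, as a subsetting embedder might."""
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = 0x21 + len(char_to_code)  # first code is 0x21 ('!')
    return char_to_code

def extract(codes, to_unicode=None):
    """Turn glyph codes back into text.

    With a code-to-character map the text is looked up; without one we ASSUME
    the viewer falls back to treating each code as a Latin-1 byte.
    """
    if to_unicode is None:
        return bytes(codes).decode("latin-1")
    return "".join(to_unicode[c] for c in codes)

original = "search me"
char_to_code = build_subset_font(original)
codes = [char_to_code[ch] for ch in original]

# Case 0: a correct mapping recovers the visible text.
to_unicode = {code: ch for ch, code in char_to_code.items()}
with_map = extract(codes, to_unicode)

# Case 1: no Unicode mapping at all, so copy/paste yields garbage.
without_map = extract(codes)

# Case 2: a buggy mapping sends two different glyphs to the same character.
buggy = dict(to_unicode)
buggy[char_to_code["a"]] = "e"
with_buggy_map = extract(codes, buggy)

print(repr(with_map), repr(without_map), repr(with_buggy_map))
```

In this toy, the page still draws the right glyphs in every case, which is why the document renders fine while searching for "search" finds nothing in the second and third cases.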



I guess the ultimate solution in these cases would be to OCR each glyph in a font to figure out what character it really is. Note that this would be easier than OCRing a noisy scanned document because the exact shape of the glyph is available (at infinite resolution since it's a "vector" image).
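
As a rough sketch of that idea in Python (under heavy assumptions): glyphs are represented here as tiny hand-made 3x5 bitmaps, and each embedded glyph is matched to the closest reference shape by counting differing pixels. A real tool would have to rasterise the font's vector outlines and use a far larger reference set; `REFERENCE`, `embedded_font`, and the function names are hypothetical.

```python
# Hypothetical sketch: recover a code-to-character map by comparing glyph shapes.
# Glyphs are tiny 3x5 bitmaps; a real tool would rasterise the font outlines.

REFERENCE = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}

def distance(a, b):
    """Count differing pixels between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def recognise(glyph):
    """Return the reference character whose shape is closest to the glyph."""
    return min(REFERENCE, key=lambda ch: distance(glyph, REFERENCE[ch]))

# Embedded font with meaningless codes; one shape differs slightly from the reference.
embedded_font = {
    0x21: ("###", ".#.", ".#.", ".#.", ".#."),  # same shape as reference T
    0x22: ("#..", "#..", "#..", "#..", "##."),  # L with one pixel missing
    0x23: ("###", ".#.", ".#.", ".#.", "###"),  # same shape as reference I
}

# Build the missing mapping, then use it to decode a run of glyph codes.
recovered = {code: recognise(shape) for code, shape in embedded_font.items()}
text = "".join(recovered[c] for c in [0x21, 0x23, 0x22, 0x22])
print(recovered, text)
```

Because the mapping is built once per font rather than once per character on the page, the cost would scale with the number of distinct glyphs, not with the length of the document.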
