I have a few PDFs that contain ligatures in the text (e.g., ff is combined into a single character, ﬀ). Is there an easy way to remove them when copying the text from the PDF? (i.e., when I paste, I'd like the ﬀ to be pasted as ff). I copy a lot of text from these PDFs into answers on Stack Overflow and I find the ligatures at best obnoxious (ok, I admit, I'm really picky :-P); the ligatures also do not show up correctly when copied into other places (e.g., if I copy them into Notepad, they show up as blocks). I cannot modify the PDFs. I use both Adobe Acrobat Reader and Foxit Reader, but I'd be open to trying a new PDF reader.
asked Jul 18, 2010 at 19:54 by James McNellis

In Python this would be:
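    import unicodedata

    # \uFB00 is the ff ligature.
    # NFKD splits it into its component letters; the final decode() keeps
    # the result as text rather than bytes on Python 3.
    unicodedata.normalize('NFKD', u'\uFB00').encode('ascii', 'ignore').decode('ascii')

You could combine this with pyPdf to read the PDF files.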
answered Jul 18, 2010 at 22:10

The Evince reader seemed to decode ligatures when I tested this.
By the way, for pdflatex documents you can use this in the preamble to display ligatures in the PDF document but copy individual characters:

    \input \pdfgentounicode=1 %

answered Aug 19, 2011 at 19:35
One possibility would be to paste the text into your favorite text editor and simply search and replace the ligatures.
Another way would be to write a script that uses sed, but I fear that would work on *NIX systems only.
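A scripted replacement does not have to depend on sed. As a minimal sketch, assuming the pasted text still contains the Unicode ligature code points (U+FB00 to U+FB04) rather than having lost them, a small Python lookup table does the same job on any platform. The names `LIGATURES` and `expand_ligatures` are made up for this illustration.

```python
# Hypothetical helper: expand the five f-ligature code points into plain letters.
# Assumes the ligatures survived the copy as real Unicode characters.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with each ligature character replaced by its letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("e\uFB03cient \uFB02u\uFB00"))
```

This will not help if the PDF reader drops the ligature entirely when copying; in that case there is nothing left in the text to replace.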
You can replace the "broken" words in the copied text if you have a mapping from broken words to original words. I wrote a script to generate this mapping by removing ligatures from words and checking whether the resulting word is unique. For my dictionary of English words, 99.5% of all possible broken words are replaceable, and 92.3% of words that contain a ligature sequence (ff, fi, fl, ffi, or ffl) can be recovered. The difference between these two percentages is due to the surprisingly large number of legitimate words that are created by removing ligatures from other legitimate words (like butterfly --> buttery, fluffs --> us, and misfits --> mists).
Here's a CSV of guaranteed-replaceable "broken" words (and the words they used to be): http://www.filedropper.com/brokenligaturewordfixes
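For anyone who would rather build the mapping locally than download it, the idea can be sketched in a few lines of Python. This is not the original script, only an illustration under two assumptions: every ligature sequence in a word is dropped when it is copied, and a broken word counts as replaceable only if exactly one dictionary word produces it and it is not itself a dictionary word. The names `break_word`, `build_fix_map` and `fix_text` are made up here.

```python
import re
from collections import defaultdict

# Longest sequences first, so "ffi" is not consumed as "ff" followed by "i".
LIGATURE_RE = re.compile(r"ffi|ffl|ff|fi|fl")


def break_word(word):
    """Return the word as it looks once every ligature sequence is dropped."""
    return LIGATURE_RE.sub("", word)


def build_fix_map(words):
    """Map each unambiguous broken word to the single word it came from."""
    vocabulary = set(words)
    candidates = defaultdict(set)
    for word in vocabulary:
        broken = break_word(word)
        if broken != word:
            candidates[broken].add(word)
    # Keep a broken form only if one word produces it and it is not a real word.
    return {
        broken: next(iter(originals))
        for broken, originals in candidates.items()
        if len(originals) == 1 and broken not in vocabulary
    }


def fix_text(text, fix_map):
    """Replace each alphabetic token that has an entry in fix_map (case-sensitive)."""
    return re.sub(r"[A-Za-z]+", lambda m: fix_map.get(m.group(0), m.group(0)), text)


if __name__ == "__main__":
    words = ["butterfly", "buttery", "fluffs", "us", "misfits", "mists", "office"]
    print(build_fix_map(words))
```

With that toy word list, only "office" yields a safe entry; "butterfly", "fluffs" and "misfits" are skipped because their broken forms are real words.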
answered Aug 28, 2015 at 7:52 by Jan Van Bruggen

It's great that you're offering the file. Realistically, though, nobody with common sense would download an unknown file (especially from a brand new user). Don't take it personally if the file doesn't get much traffic. It doesn't mean your efforts aren't appreciated.
Commented Aug 28, 2015 at 8:00

Yeah, I understand. I wish there was a simple way to verify links like that, or even just to guarantee the file type. Thanks!