Description
Text extracted from PDF files may contain incomplete values for ligatures (and probably other glyphs) if they have composite names. More precisely, only the first Unicode character of the sequence is present. In the file skakvattenbad_bs_milmedtek.pdf the problem affects fi in "Specifications & Ordering Information" ("Specifcations"), fl in "the bath fluid" ("fuid") and tt in "Multiple LED displays for setting various values"/"various ways of glassware settings" ("seting"/"setings").
Apparently, the issue is with decoding non-standard glyph names (names that are not in the Adobe Glyph List). The examples above are the /f_i, /f_l and /t_t glyphs, respectively. According to the procedure described at https://github.com/adobe-type-tools/agl-specification#2-the-mapping, they should be mapped to the sequences /f/i, /f/l and /t/t.
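For illustration, here is a minimal sketch of the mapping procedure referenced above, written in plain Python. It is not PyMuPDF's or MuPDF's code. The names `AGL_EXCERPT`, `component_to_text` and `glyph_name_to_text` are hypothetical, the glyph table is a tiny hand-picked excerpt rather than the full Adobe Glyph List, and the ZapfDingbats special case from the specification is left out.

```python
import re

# Tiny excerpt of the Adobe Glyph List, for illustration only (assumption:
# a real implementation would load the complete list).
AGL_EXCERPT = {
    "f": "f", "i": "i", "l": "l", "t": "t",
    "fi": "\uFB01",  # standard AGL name for the fi ligature
    "fl": "\uFB02",  # standard AGL name for the fl ligature
}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")  # "uni" + groups of 4 hex digits
_U = re.compile(r"u([0-9A-F]{4,6})")         # "u" + 4 to 6 hex digits


def _is_scalar(cp):
    """True for code points outside the surrogate range and within Unicode."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def component_to_text(component):
    """Map one underscore-free component of a glyph name to a string."""
    if component in AGL_EXCERPT:
        return AGL_EXCERPT[component]
    m = _UNI.fullmatch(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return "".join(map(chr, cps)) if all(map(_is_scalar, cps)) else ""
    m = _U.fullmatch(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _is_scalar(cp) else ""
    return ""  # unknown component maps to the empty string


def glyph_name_to_text(name):
    """Map a glyph name such as 'f_i' or '/f_i.alt' to its character sequence."""
    name = name.lstrip("/")          # tolerate PDF name syntax (convenience only)
    base = name.split(".", 1)[0]     # drop everything from the first period on
    return "".join(component_to_text(c) for c in base.split("_"))


for glyph in ("/f_i", "/f_l", "/t_t"):
    print(glyph, "->", glyph_name_to_text(glyph))
```

The key step for this issue is the split on underscores: each component is looked up separately and the results are concatenated, so /f_i yields two characters instead of one.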
This behavior does not seem to depend on the presence or absence of explicit /ToUnicode mappings.
How to reproduce
Read text from the PDF using any extraction method (get_text, get_texttrace, etc.).
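A reproduction script could look roughly like the sketch below. It assumes the file named in the description sits in the current directory. The helpers `extract_text` and `missing_words` are hypothetical names made up for this example. Only `fitz.open` and `Page.get_text` are PyMuPDF calls.

```python
def extract_text(path):
    """Return the plain text of every page in the PDF at `path`."""
    import fitz  # PyMuPDF; imported here so the helper below works without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def missing_words(text, expected_words):
    """Return the expected words that do not occur in the extracted text."""
    return [word for word in expected_words if word not in text]


if __name__ == "__main__":
    # Assumption: the sample file from the description is available locally.
    text = extract_text("skakvattenbad_bs_milmedtek.pdf")
    print(missing_words(text, ["Specifications", "fluid", "setting"]))
```

With the behavior described in this report, all three words would be expected to show up in the printed list.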
Expected behavior
Text extracted from the PDF contains all Unicode characters of each glyph, for example "Specifications" rather than "Specifcations".
Configuration
- Linux Mint
- Python 3.8
- PyMuPDF 1.23.5, installed via pip