Incomplete text values of ligatures #2785

@pulsar314

Description

Text extracted from PDF files may contain incomplete values for ligatures (and probably other glyphs) whose glyph names are composite: only the first Unicode character is present. In the file skakvattenbad_bs_milmedtek.pdf this affects fi in "Specifications & Ordering Information" (extracted as "Specifcations"), fl in "the bath fluid" ("fuid"), and tt in "Multiple LED displays for setting various values" / "various ways of glassware settings" ("seting" / "setings").

Apparently, the issue lies in decoding non-standard glyph names (names that are not in the Adobe Glyph List). The examples above are the /f_i, /f_l and /t_t glyphs, respectively. According to the procedure described at https://github.com/adobe-type-tools/agl-specification#2-the-mapping, they should be mapped to the sequences /f/i, /f/l and /t/t.
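For illustration, here is a minimal sketch of the mapping procedure from the AGL specification as I read it: drop everything from the first period, split the rest on underscores, and map each component separately. The function names and the tiny `AGL_SUBSET` table are hypothetical stand-ins (a real implementation would load the full glyph list), and this is not PyMuPDF or MuPDF code.

```python
import re

# Hypothetical, tiny subset of the Adobe Glyph List, for illustration only.
AGL_SUBSET = {
    "f": "f", "i": "i", "l": "l", "t": "t",
    "fi": "\ufb01", "fl": "\ufb02",
}


def _is_scalar(cp):
    """True for code points that are valid and not surrogates."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def component_to_text(component):
    """Map one underscore-separated component to a string."""
    # 1. Name listed in the glyph list.
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    # 2. "uni" followed by one or more groups of four uppercase hex digits.
    m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(_is_scalar(cp) for cp in cps):
            return "".join(chr(cp) for cp in cps)
    # 3. "u" followed by four to six uppercase hex digits.
    m = re.fullmatch(r"u([0-9A-F]{4,6})", component)
    if m and _is_scalar(int(m.group(1), 16)):
        return chr(int(m.group(1), 16))
    # 4. Anything else maps to the empty string.
    return ""


def glyph_name_to_text(name):
    """Map a glyph name such as '/f_i' to its character sequence."""
    base = name.lstrip("/").split(".", 1)[0]  # drop suffix after first period
    return "".join(component_to_text(c) for c in base.split("_"))
```

With this sketch, `glyph_name_to_text("/f_i")` yields the two characters "fi" rather than only "f", which is the behavior expected in this report.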

This behavior does not seem to depend on the presence or absence of explicit /ToUnicode mappings.

How to reproduce

Read text from the PDF using any extraction method (get_text, get_texttrace, etc.).
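A minimal reproduction sketch, assuming the file from the description is in the working directory. The helper names `extract_text` and `missing_words` and the word list are my own, chosen to match the examples above. `fitz` is imported inside the function so the checker can be used without PyMuPDF installed.

```python
# Words that should appear intact in the extracted text (from the description).
EXPECTED_WORDS = ["Specifications", "fluid", "setting", "settings"]


def missing_words(text, expected=EXPECTED_WORDS):
    """Return the expected words that do not occur in the extracted text."""
    return [word for word in expected if word not in text]


def extract_text(path):
    """Extract plain text from every page of the PDF with Page.get_text()."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


# Usage (requires PyMuPDF and the PDF file):
#   text = extract_text("skakvattenbad_bs_milmedtek.pdf")
#   print(missing_words(text))
```

An empty list from `missing_words` would mean all the ligatures were extracted completely.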

Expected behavior

Text extracted from the PDF contains all Unicode characters of each glyph, including every component of a ligature.

Configuration

  • Linux Mint
  • Python 3.8
  • PyMuPDF 1.23.5, installed via pip

Metadata

    Labels

    not a bug: not a bug / user error / unable to reproduce
