Converting a Chinese PDF to an image produces garbled text #1523

@guoxiaolu

The PDF preview is wrong in Chrome and in the macOS/Ubuntu viewers, but correct in Adobe.
Converting the PDF to an image also fails. It seems fitz cannot find the correct font. I have installed pdfminer's cmap, and the extracted characters are correct, but the rendered image is not.
Calling page.getFontList() outputs the following:

0:(7, 'n/a', 'Type0', 'ËÎÌå', 'F0', 'Identity-H')
1:(8, 'n/a', 'Type0', 'ËÎÌå,Bold', 'F1', 'Identity-H')
Is it possible to set a replacement font so that the output image is correct? Or is there another way?
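
A side note on the font names in that output: they look like GBK-encoded bytes that were read as Latin-1. The sketch below round-trips the two names under that assumption; `repair_mojibake` is a hypothetical helper written for this illustration, not a PyMuPDF function, and the GBK guess is an assumption rather than something the PDF confirms. It only decodes the names and does not change how the page is rendered.

```python
def repair_mojibake(name: str, wrong: str = "latin-1", right: str = "gbk") -> str:
    """Re-decode a string whose bytes were read with the wrong codec.

    Assumption: the original bytes were GBK but were decoded as Latin-1.
    If the round trip fails, the name is returned unchanged.
    """
    try:
        return name.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


# Font names copied from the getFontList() output above
font_names = ["ËÎÌå", "ËÎÌå,Bold"]
repaired = [repair_mojibake(n) for n in font_names]
print(repaired)  # ['宋体', '宋体,Bold']
```

If the assumption holds, the PDF refers to the font 宋体 (SimSun) in regular and bold variants.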

import fitz  # PyMuPDF

pdf_path = '/Users/guoxiaolu/work/code/haiguan/pdf/20211011_0108016_ONO44030120218202018202211011000118_18000.00_21090105469_0.pdf'
pdf = fitz.Document(pdf_path)
img_pages = pdf.page_count
for pg in range(0, img_pages):
    page = pdf[pg]
    # Extract the text character by character (this part is correct)
    blocks = page.get_text("rawdict", flags=0)["blocks"]
    # Render at 3x zoom with no rotation
    trans = fitz.Matrix(3, 3).prerotate(0)
    chars = [char['c'] for b in blocks for l in b["lines"] for s in l["spans"] for char in s["chars"]]
    cstr = ''.join(chars)
    print(cstr)
    # Render the page to a PNG (this is where the glyphs come out wrong)
    pm = page.get_pixmap(matrix=trans, alpha=False)
    name = './%d.png' % pg
    pm.save(name)

cstr: "预算单位电子支付印章:财政授权支付凭证凭证号:018202211011000118付款人名称账号开户银行收款人名称账号开户银行南方科技大学8110301014300402010中信银行深圳分行营业部深圳市智新仪器有限公司773171711672中国银行股份有限公司深圳曦湾支行壹万捌仟元整凭证金额人民币(小写) ¥18,000.00资金性质111一般公共预算资金业务类型1普通业务基层预算单位0108016001南方科技大学代理银行电子支付印章:银行会计分录用途21090105469 重付发票00272433凭证日期:凭证金额人民币(大写)功能分类科目项目名称2050205高等教育基础科研经费2021年10月11日支付日期: 2021年10月11日部门经济分类科目30227委托业务费壹万捌仟元整实际支付金额人民币(小写) ¥18,000.00实际支付金额人民币(大写)"
20211011_0108016_ONO44030120218202018202211011000118_18000.00_21090105469_0.pdf

Labels

not a bug (user error / unable to reproduce)
