Description
The PDF preview is wrong in Chrome and in the macOS/Ubuntu viewers, but correct in Adobe.
Converting the PDF to an image also fails: fitz does not seem to pick up the correct font. I have installed pdfminer's CMap files and the extracted characters are correct, but the rendered image is not.
page.getFontList() outputs the following:
0:(7, 'n/a', 'Type0', 'ËÎÌå', 'F0', 'Identity-H')
1:(8, 'n/a', 'Type0', 'ËÎÌå,Bold', 'F1', 'Identity-H')
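A note on the font names above: they look like GBK bytes that were decoded as Latin-1. The sketch below assumes that is what happened and reverses it; `decode_font_name` is a hypothetical helper, not part of PyMuPDF, and the tuples are copied from the output above. As far as I understand the PyMuPDF documentation, an extension of 'n/a' marks a font that cannot be extracted, which would be consistent with the fonts not being embedded and each viewer substituting its own.

```python
def decode_font_name(name, wrong="latin-1", right="gbk"):
    """Undo a suspected mis-decoding of a font name.

    Assumption: the bytes were GBK but were read as Latin-1.
    If the name does not round-trip cleanly, return it unchanged.
    """
    try:
        return name.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


# Font list entries copied from the getFontList() output above.
fonts = [
    (7, 'n/a', 'Type0', 'ËÎÌå', 'F0', 'Identity-H'),
    (8, 'n/a', 'Type0', 'ËÎÌå,Bold', 'F1', 'Identity-H'),
]

for xref, ext, ftype, basefont, ref, encoding in fonts:
    print(xref, ext, decode_font_name(basefont))
# 7 n/a 宋体
# 8 n/a 宋体,Bold
```

If this reading is right, both entries refer to 宋体 (SimSun), regular and bold.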
Is it possible to set a new font and then output a correct image? Or is there another way?
from fitz import fitz

pdf_path = '/Users/guoxiaolu/work/code/haiguan/pdf/20211011_0108016_ONO44030120218202018202211011000118_18000.00_21090105469_0.pdf'
pdf = fitz.Document(pdf_path)
img_pages = pdf.pageCount
for pg in range(0, img_pages):
    page = pdf[pg]
    blocks = page.get_text("rawdict", flags=0)["blocks"]
    trans = fitz.Matrix(3, 3).preRotate(0)
    chars = [char['c'] for b in blocks for l in b["lines"] for s in l["spans"] for char in s["chars"]]
    cstr = ''.join(chars)
    print(cstr)
    pm = page.getPixmap(matrix=trans, alpha=False)
    name = './%d.png' % (pg)
    pm.writeImage(name)
cstr: "预算单位电子支付印章:财政授权支付凭证凭证号:018202211011000118付款人名称账号开户银行收款人名称账号开户银行南方科技大学8110301014300402010中信银行深圳分行营业部深圳市智新仪器有限公司773171711672中国银行股份有限公司深圳曦湾支行壹万捌仟元整凭证金额人民币(小写) ¥18,000.00资金性质111一般公共预算资金业务类型1普通业务基层预算单位0108016001南方科技大学代理银行电子支付印章:银行会计分录用途21090105469 重付发票00272433凭证日期:凭证金额人民币(大写)功能分类科目项目名称2050205高等教育基础科研经费2021年10月11日支付日期: 2021年10月11日部门经济分类科目30227委托业务费壹万捌仟元整实际支付金额人民币(小写) ¥18,000.00实际支付金额人民币(大写)"
20211011_0108016_ONO44030120218202018202211011000118_18000.00_21090105469_0.pdf