Pymupdf grouping same text of different pages in different text_blocks #2898

@vignesh0710

Description of the bug

Goal: highlight the differences between two PDF pages with PyMuPDF (version 1.22.1).

I am trying to compare two PDF pages, p1 and p2, and highlight the differences in p1.

Algorithm:

1. Get the text blocks and their bounding boxes from each page.
2. Compare the text blocks of p1 with those of p2.
3. For every text block that differs, use its bounding box to highlight the difference.

Code:

import fitz  # PyMuPDF

def get_text_blocks(page):
    """Return the text and the bounding box of every block on the page."""
    texts = []
    bboxes = []
    # each block is a tuple (x0, y0, x1, y1, text, block_no, block_type)
    for block in page.get_text("blocks"):
        # append the bounding box of the block
        bboxes.append(block[0:4])
        # append the text from the block
        texts.append(block[4])
    return texts, bboxes

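Difference code (p1 and p2 are the two pages):

p1_texts, p1_bboxes = get_text_blocks(p1)
p2_texts, _ = get_text_blocks(p2)
# text blocks that are in p1 and not in p2, with their bounding boxes
diff = [(text, bbox) for text, bbox in zip(p1_texts, p1_bboxes) if text not in p2_texts]
for text, bbox in diff:
    # highlight the bounding box of the differing block
    rect = fitz.Rect(bbox)
    annot = p1.add_highlight_annot(rect)
    annot.update()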

This works in general. In certain cases, however, identical content is grouped into different text blocks on the two pages, so the comparison highlights text that has not actually changed.

Example:

p1:

block_1: line1, line2
block_2: line3

p2:

block_1: line1, line2, line3

Although the same three consecutive lines (line1, line2, line3) are present on both p1 and p2, they are flagged as different because the block boundaries differ.
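The mismatch in this example can be reproduced without a PDF. The following is a minimal sketch that uses hand-written block tuples in the layout that `get_text("blocks")` returns, where the text field holds the block's lines separated by newlines. The sample data and the helper names `block_texts` and `block_lines` are hypothetical and only serve to illustrate the problem.

```python
def block_texts(blocks):
    """Return the text field of each block tuple (x0, y0, x1, y1, text, ...)."""
    return [b[4] for b in blocks]


def block_lines(blocks):
    """Flatten blocks into stripped, non-empty lines, ignoring block boundaries."""
    lines = []
    for b in blocks:
        for line in b[4].splitlines():
            line = line.strip()
            if line:
                lines.append(line)
    return lines


# Hypothetical sample data mirroring the example: same lines, different grouping.
p1_blocks = [
    (50, 50, 300, 80, "line1\nline2\n", 0, 0),
    (50, 90, 300, 105, "line3\n", 1, 0),
]
p2_blocks = [
    (50, 50, 300, 105, "line1\nline2\nline3\n", 0, 0),
]

# Block-level comparison: neither p1 block equals the single p2 block.
p2_block_set = set(block_texts(p2_blocks))
block_diff = [t for t in block_texts(p1_blocks) if t not in p2_block_set]

# Line-level comparison: every p1 line is also present in p2.
p2_line_set = set(block_lines(p2_blocks))
line_diff = [ln for ln in block_lines(p1_blocks) if ln not in p2_line_set]
```

Here the block-level comparison reports both p1 blocks as different, while the line-level comparison reports nothing.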

I also tried extracting the text with get_text and comparing it line by line, but that did not work either.

Any suggestions on how to fix this would be helpful.
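For reference, a line-level comparison that keeps a bounding box per line could be sketched as follows. It assumes the dictionary layout returned by `page.get_text("dict")`: a `"blocks"` list, text blocks with `"type"` 0, each holding `"lines"` with a `"bbox"` and a list of `"spans"` that carry `"text"`. The helper names `line_boxes` and `changed_line_boxes` and the sample dictionaries are hypothetical, and exact string matching after stripping whitespace is an assumption that may be too strict for real documents.

```python
from collections import Counter


def line_boxes(page_dict):
    """Return (text, bbox) for each non-empty text line in a dict-style page."""
    result = []
    for block in page_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # skip non-text (image) blocks
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                result.append((text, tuple(line["bbox"])))
    return result


def changed_line_boxes(dict1, dict2):
    """Return bboxes of lines in dict1 that have no matching line in dict2."""
    remaining = Counter(text for text, _ in line_boxes(dict2))
    changed = []
    for text, bbox in line_boxes(dict1):
        if remaining[text] > 0:
            remaining[text] -= 1  # consume one matching occurrence
        else:
            changed.append(bbox)
    return changed


def _line(text, bbox):
    # Hypothetical helper to build one sample line entry.
    return {"bbox": bbox, "spans": [{"text": text}]}


# Hypothetical sample data: p1 splits the lines over two blocks and has one extra line.
d1 = {"blocks": [
    {"type": 0, "lines": [_line("line1", (0, 0, 10, 10)), _line("line2", (0, 10, 10, 20))]},
    {"type": 0, "lines": [_line("line3", (0, 20, 10, 30)), _line("extra", (0, 30, 10, 40))]},
]}
d2 = {"blocks": [
    {"type": 0, "lines": [_line("line1", (0, 0, 10, 10)), _line("line2", (0, 10, 10, 20)),
                          _line("line3", (0, 20, 10, 30))]},
]}

changed = changed_line_boxes(d1, d2)
```

With real pages, the inputs would be `p1.get_text("dict")` and `p2.get_text("dict")`, and each returned bounding box could be passed to `p1.add_highlight_annot(fitz.Rect(bbox))`.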

How to reproduce the bug

explained above

PyMuPDF version

1.23.5 or earlier

Operating system

Windows

Python version

3.8
