PyMuPDF grouping same text of different pages in different text_blocks #2898

### Description of the bug

Goal: use PyMuPDF (version 1.22.1) to highlight the differences between 2 PDF pages.

Trying to compare 2 PDF pages, `p1` and `p2`, and highlight the differences in `p1`.

Algorithm:
  
    1. Get text_blocks with bounding_box from each_page
    2. Compare text_blocks of p1 with p2
    3. For every text_block which is different, use the respective bounding_box to highlight the difference

Code:

    def get_text_blocks(page):
        """Return the text and the bounding box of every block on the page."""
        texts = []
        bboxes = []
        # each block is (x0, y0, x1, y1, text, block_no, block_type)
        for block in page.get_text("blocks"):
            # appending the bounding box of the block
            bboxes.append(block[:4])
            # appending the text from the block
            texts.append(block[4])
        return texts, bboxes

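Highlighting the differences:

    import fitz  # PyMuPDF

    p1_texts, p1_bboxes = get_text_blocks(p1)
    p2_texts, _ = get_text_blocks(p2)
    # text_blocks that are IN p1 and NOT IN p2
    for text, bbox in zip(p1_texts, p1_bboxes):
        if text not in p2_texts:
            # use the bounding_box of the differing block
            rect = fitz.Rect(bbox)
            annot = p1.add_highlight_annot(rect)
            annot.update()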

This works in general. But in certain cases, even though the contents are identical, they get grouped into different text blocks, so the comparison highlights text that has not actually changed.

Example:

p1:

    block_1: line1, line2
    block_2: line3

p2:

    block_1: line1, line2, line3

Although the same 3 consecutive lines (`line1, line2, line3`) are present on both pages `p1` and `p2`, they are flagged as different because they are grouped into different blocks.
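The mismatch can be reproduced without any PDF. The sketch below uses made-up block texts that mirror the example, assuming the lines inside one block are joined with newlines; `blocks_not_in` is a hypothetical helper for illustration, not a PyMuPDF function.

```python
# Hypothetical block texts mirroring the example above
# (assumption: lines inside one block are separated by "\n").
p1_blocks = ["line1\nline2", "line3"]
p2_blocks = ["line1\nline2\nline3"]


def blocks_not_in(blocks_a, blocks_b):
    """Return the entries of blocks_a whose text is not an entry of blocks_b."""
    other = set(blocks_b)
    return [block for block in blocks_a if block not in other]


# Every p1 block is reported, although all three lines also exist on p2,
# because no p1 block text equals the single p2 block text.
flagged = blocks_not_in(p1_blocks, p2_blocks)
print(flagged)
```

Both blocks of `p1` end up in `flagged`, which is the wrong highlighting described above.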

I also tried extracting the text with `get_text` and comparing it line by line, but that did not work either.
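For reference, a line-level comparison that keeps the bounding boxes needed for highlighting could look like the sketch below. It is not necessarily the approach that was tried. It assumes the nested layout returned by `page.get_text("dict")` (blocks containing lines, lines containing spans); `page_lines` and `lines_only_in_first` are hypothetical helpers, and the sample dictionaries are made up so the snippet runs without a PDF.

```python
def page_lines(page_dict):
    """Collect (text, bbox) for every text line of a get_text("dict")-style result."""
    lines = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                lines.append((text, tuple(line["bbox"])))
    return lines


def lines_only_in_first(lines_a, lines_b):
    """Return the (text, bbox) entries of lines_a whose text is absent from lines_b."""
    texts_b = {text for text, _ in lines_b}
    return [(text, bbox) for text, bbox in lines_a if text not in texts_b]


def make_line(text, y):
    """Build one made-up line entry with a single span (illustration only)."""
    return {"bbox": (50, y, 200, y + 12), "spans": [{"text": text}]}


# Made-up sample data: p1 splits the lines into two blocks, p2 uses one block.
p1_dict = {"blocks": [
    {"type": 0, "lines": [make_line("line1", 50), make_line("line2", 64)]},
    {"type": 0, "lines": [make_line("line3", 90), make_line("line4", 104)]},
]}
p2_dict = {"blocks": [
    {"type": 0, "lines": [make_line("line1", 50), make_line("line2", 64),
                          make_line("line3", 78)]},
]}

diff = lines_only_in_first(page_lines(p1_dict), page_lines(p2_dict))
print(diff)  # only line4 is reported, regardless of the block grouping

# With real pages this would become, for example:
#   diff = lines_only_in_first(page_lines(p1.get_text("dict")),
#                              page_lines(p2.get_text("dict")))
#   for _, bbox in diff:
#       p1.add_highlight_annot(fitz.Rect(bbox))
```

Comparing by exact line text ignores the block grouping, but it still treats a line as changed if it is wrapped differently or differs only in whitespace, so it is a starting point rather than a complete fix.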

Any suggestions on how to fix this would be helpful.

### How to reproduce the bug

Explained in the description above.

### PyMuPDF version

1.23.5 or earlier

### Operating system

Windows

### Python version

3.8
