Not able to parse information in correct sequence

### Description of the bug

Hi Team,

I am using PyMuPDF to parse data from a PDF that contains text, tables, and images.

When I use the code below to parse only the text, the text comes out in the right sequence:

def extract_text_from_pdf(pdf_path):
    import fitz
    doc = fitz.open(pdf_path)
    text = ''

    for page_number in range(doc.page_count):
        page = doc[page_number]
        text += page.get_text()
    doc.close()
    return text


However, when I alter the code as below, the table content is listed twice (once by get_text() and once by find_tables()). Also, the text and tables do not come out in the correct sequence. Is there any way I can parse the table data just once?

import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import pandas as pd

def parse_pdf(pdf_path):
    doc = fitz.open(pdf_path)

    # Initialize variables to store extracted data
    parsed_data=[]

    for page_num in range(doc.page_count):
        page = doc[page_num]

        # Extract text
        text = page.get_text()
        if text:
            parsed_data.append({'type': 'text', 'content': text})
        
        #Find Tables
        tabs = page.find_tables()
        #print(tabs)
        if tabs:
            
            for tab in tabs:
                table=[]
                for line in tab.extract():
                    table.append(line)
                parsed_data.append({'type': 'table', 'content': table})

    doc.close()
    return parsed_data

# Calling the function:
pdf_path = "EOS-User-Manual.pdf"
parse_data=parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100000]

# Access the parsed data & display it
for entry in parsed_data:
    if entry['type'] == 'text':
        print(entry['content'])
    elif entry['type'] == 'table':
        data=entry['content']
        df=pd.DataFrame(data)
        print(df)
        
    print()

Can you please advise how I can parse text, tables, and images in the correct sequence using PyMuPDF?

Thank you
Reema Jain

### How to reproduce the bug

Complete Code:
!pip install fitz
!pip install PyMuPDF
!pip install PyMuPDF Pillow

import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
import io
from io import BytesIO

def parse_pdf(pdf_path):
    doc = fitz.open(pdf_path)

    # Initialize variables to store extracted data
    parsed_data=[]

    for page_num in range(doc.page_count):
        page = doc[page_num]

        # Extract text
        text = page.get_text()
        if text:
            parsed_data.append({'type': 'text', 'content': text})
        
        #Find Tables
        tabs = page.find_tables()
        #print(tabs)
        if tabs:
            
            for tab in tabs:
                table=[]
                for line in tab.extract():
                    table.append(line)
                parsed_data.append({'type': 'table', 'content': table})

    doc.close()
    return parsed_data

# Calling the function:
pdf_path = "EOS-User-Manual.pdf"
parse_data=parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100000]

# Access the parsed data & display it
for entry in parsed_data:
    if entry['type'] == 'text':
        print(entry['content'])
    elif entry['type'] == 'table':
        data=entry['content']
        df=pd.DataFrame(data)
        print(df)
        
    print()

### PyMuPDF version

1.23.7

### Operating system

Windows

### Python version

3.10
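
One possible way to get each table only once and keep text and tables in page order is sketched below. It is a rough heuristic, not an official PyMuPDF recipe: it assumes a single-column layout, assumes that a text block belongs to a table when its center lies inside the table's bounding box, and orders items by the top-left corner of their bounding boxes. The helper names `rect_center_inside`, `merge_in_reading_order` and `parse_page` are made up for this illustration. The PyMuPDF calls used are `page.get_text("blocks")`, `page.find_tables().tables`, `Table.bbox` and `Table.extract()`.

```python
def rect_center_inside(inner, outer):
    """Return True if the center of rectangle `inner` lies within `outer`.

    Rectangles are (x0, y0, x1, y1) tuples in page coordinates.
    """
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def merge_in_reading_order(text_blocks, tables):
    """Combine text blocks and tables into one list sorted top to bottom.

    text_blocks: list of (bbox, text) pairs
    tables:      list of (bbox, rows) pairs
    Text blocks whose center falls inside a table bbox are dropped
    (assumption: they are table cells already covered by the table rows).
    """
    table_boxes = [bbox for bbox, _ in tables]
    items = []
    for bbox, text in text_blocks:
        if any(rect_center_inside(bbox, tb) for tb in table_boxes):
            continue  # skip text that is part of a table
        items.append({'type': 'text', 'bbox': bbox, 'content': text})
    for bbox, rows in tables:
        items.append({'type': 'table', 'bbox': bbox, 'content': rows})
    # Sort by the top edge first, then by the left edge (single-column assumption).
    items.sort(key=lambda item: (item['bbox'][1], item['bbox'][0]))
    return items


def parse_page(page):
    """Extract text and tables from one PyMuPDF page in approximate page order."""
    tables = [(tuple(tab.bbox), tab.extract()) for tab in page.find_tables().tables]
    # Each block is (x0, y0, x1, y1, text, block_no, block_type); type 0 is text.
    text_blocks = [(b[:4], b[4]) for b in page.get_text("blocks") if b[6] == 0]
    return merge_in_reading_order(text_blocks, tables)
```

To use it, `parse_page(page)` would be called for every page of `fitz.open(pdf_path)` and the results appended to one list. Image positions could probably be merged the same way, for example from the `bbox` entries returned by `page.get_image_info()`. Multi-column pages would need a more careful sort key than the one assumed here.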


Not able to parse information in correct sequence #2930
