Not able to parse information in correct sequence #2930

@reema93jain

Description of the bug

Hi Team,

I am using PyMuPDF to parse data from a PDF that contains text, tables, and images.

When I use the code below to parse only the text, the text comes out in the right sequence:

def extract_text_from_pdf(pdf_path):
    import fitz  # PyMuPDF
    doc = fitz.open(pdf_path)
    text = ''

    # Concatenate the plain text of every page, in page order
    for page_number in range(doc.page_count):
        page = doc[page_number]
        text += page.get_text()
    doc.close()
    return text

However, when I alter the code as below, the table content is listed twice (once by get_text() and once by find_tables()). Also, the text and tables do not come out in the correct sequence. Is there any way to parse the table data just once?

import fitz # PyMuPDF
import matplotlib.pyplot as plt
import pandas as pd

def parse_pdf(pdf_path):
    doc = fitz.open(pdf_path)

    # Initialize variables to store extracted data
    parsed_data = []

    for page_num in range(doc.page_count):
        page = doc[page_num]

        # Extract text (this also returns the text inside tables)
        text = page.get_text()
        if text:
            parsed_data.append({'type': 'text', 'content': text})

        # Find tables
        tabs = page.find_tables()
        if tabs:
            for tab in tabs:
                table = []
                for line in tab.extract():
                    table.append(line)
                parsed_data.append({'type': 'table', 'content': table})

    doc.close()
    return parsed_data

Calling the function:

pdf_path = "EOS-User-Manual.pdf"
parse_data=parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100000]

Access the parsed data & display it

for entry in parsed_data:
    if entry['type'] == 'text':
        print(entry['content'])
    elif entry['type'] == 'table':
        data = entry['content']
        df = pd.DataFrame(data)
        print(df)

    print()

Can you please advise how I can parse text, tables, and images in the correct sequence using PyMuPDF?

Thank you
Reema Jain
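
One possible direction, as a minimal sketch rather than a confirmed fix: get_text("blocks") returns each text block together with its bounding box, and every table found by find_tables() has a bbox as well. Text blocks whose center falls inside a table's bbox can be skipped, since the table extraction already contains that content, and the remaining items can be sorted by vertical position. The helper names below (_center_inside, merge_in_reading_order, parse_page) are hypothetical, and the top-to-bottom sort assumes a single-column page layout. A text block that spans both a table and the surrounding paragraph would not be handled cleanly.

```python
def _center_inside(inner, outer):
    """True if the center of rectangle `inner` lies inside `outer`.

    Both rectangles are (x0, y0, x1, y1) tuples.
    """
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def merge_in_reading_order(text_blocks, tables):
    """Combine text blocks and tables top to bottom, without duplicates.

    text_blocks: list of (bbox, text); tables: list of (bbox, rows).
    Text blocks centered inside a table's bbox are dropped, because
    the table rows already carry that text.
    """
    items = [{'type': 'table', 'bbox': bbox, 'content': rows}
             for bbox, rows in tables]
    for bbox, text in text_blocks:
        if not any(_center_inside(bbox, t_bbox) for t_bbox, _ in tables):
            items.append({'type': 'text', 'bbox': bbox, 'content': text})
    # Sort by top edge, then left edge (assumption: single-column layout)
    items.sort(key=lambda it: (it['bbox'][1], it['bbox'][0]))
    return items


def parse_page(page):
    """Hypothetical wrapper; `page` is expected to be a fitz.Page."""
    # Each block is (x0, y0, x1, y1, text, block_no, block_type); type 0 is text
    text_blocks = [(b[:4], b[4]) for b in page.get_text("blocks") if b[6] == 0]
    tables = [(tuple(t.bbox), t.extract()) for t in page.find_tables().tables]
    return merge_in_reading_order(text_blocks, tables)
```

Calling parse_page(page) for each page in the loop, in place of the separate get_text() and find_tables() steps, would give one list per page with 'text' and 'table' entries interleaved by position. Images could be added the same way, for example from the bbox entries returned by page.get_image_info(), though that is not shown here.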

How to reproduce the bug

Complete Code:
# PyMuPDF provides the fitz module; the separate "fitz" package on PyPI is unrelated and should not be installed alongside it
!pip install PyMuPDF Pillow

import fitz # PyMuPDF
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
import io
from io import BytesIO

def parse_pdf(pdf_path):
    doc = fitz.open(pdf_path)

    # Initialize variables to store extracted data
    parsed_data = []

    for page_num in range(doc.page_count):
        page = doc[page_num]

        # Extract text (this also returns the text inside tables)
        text = page.get_text()
        if text:
            parsed_data.append({'type': 'text', 'content': text})

        # Find tables
        tabs = page.find_tables()
        if tabs:
            for tab in tabs:
                table = []
                for line in tab.extract():
                    table.append(line)
                parsed_data.append({'type': 'table', 'content': table})

    doc.close()
    return parsed_data

Calling the function:

pdf_path = "EOS-User-Manual.pdf"
parse_data=parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100000]

Access the parsed data & display it

for entry in parsed_data:
    if entry['type'] == 'text':
        print(entry['content'])
    elif entry['type'] == 'table':
        data = entry['content']
        df = pd.DataFrame(data)
        print(df)

    print()

PyMuPDF version

1.23.7

Operating system

Windows

Python version

3.10
