Description
For large dataframes of strings (around 500k rows and 60+ columns), pyexcelerate generates invalid .xlsx files that cannot be opened in Excel. Excel reports: "We found a problem with some content in yourfile.xlsx. Do you want us to try to recover as much as we can?". Answering yes removes all invalid content, leaving an empty worksheet.
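from string import ascii_lowercase
from uuid import uuid4
# random.choice(..., size=...) below is numpy's API, so import random from numpy
from numpy import random
from pandas import DataFrame
from pyexcelerate import Workbook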
num_rows = 501000
num_cols = 64
# generate dataframe with column names
column_names = [f"Col{idx}" for idx in range(num_cols)]
df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)), columns=column_names)
# write to file
output_filename = f"/tmp/{uuid4()}.xlsx"
wb = Workbook()
ws = wb.new_sheet("sheet name", data=df.values.tolist())
wb.save(output_filename)
The .xlsx file is written to disk without error and is 138 MB in size, but opening it in Excel triggers the above-mentioned error. Inspecting the file in Python reveals that the Zip archive is invalid.
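The report does not say how the archive was inspected. One possible way to check it, using only the standard-library `zipfile` module, is sketched below. The helper name `check_xlsx_archive` and its return format are made up for illustration and are not part of pyexcelerate.

```python
import zipfile


def check_xlsx_archive(path):
    """Return (ok, detail) describing whether `path` is a readable Zip archive.

    Hypothetical helper for illustration; an .xlsx file is a Zip container,
    so a structurally broken archive cannot be a valid workbook.
    """
    # is_zipfile() looks for the end-of-central-directory record.
    if not zipfile.is_zipfile(path):
        return False, "not a Zip archive (no valid end-of-central-directory record)"
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the name of the first
            # one with a bad CRC or header, or None if all members are fine.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as exc:
        return False, f"corrupt archive: {exc}"
    if bad_member is not None:
        return False, f"corrupt member: {bad_member}"
    return True, "archive is intact"
```

Calling `check_xlsx_archive(output_filename)` after `wb.save(output_filename)` should show whether the failure is at the Zip container level or inside the worksheet XML. Which of the two messages appears for the file from this report is not known here.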
Interestingly, the problem only occurs for large dataframes filled with strings. The problem does not occur:
- For smaller dataframes
- For dataframes of the same size filled only with integers
Versions:
- OS: macOS Catalina 10.15.7; Darwin Kernel Version 19.6.0
- PyExcelerate==0.10.0
- pandas==1.3.5