Would it be possible to use a video codec as chunk compressor? #1086

@FirefoxMetzger

I have a video dataset of around 500k (half a million) videos that I want to do ML on, so I am looking for an efficient data format that lets me read the data quickly. Each video is 10 sec long and subsampled to 256x256@10Hz, i.e., when decoded a video can be viewed as a (frame, height, width, channel) ndarray of shape (100, 256, 256, 3) and dtype uint8, and the entire dataset as an ndarray of shape (500k, 100, 256, 256, 3) and dtype uint8.

The naive approach to storing this data would be to keep it as individual video files. This is the format I have now, but it isn't ideal because each file is only about 1 MB on disk. This makes loading a pain since I can't keep that many file handles open. Instead, I am constantly opening and closing small files, and I am wondering if this is really the best way to go.
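For scale, here is my own back-of-envelope math (not a measured benchmark) comparing the raw uint8 size against the ~1 MB encoded files I have on disk:

```python
# Rough size estimate for the dataset described above.
frames, height, width, channels = 100, 256, 256, 3
bytes_per_video = frames * height * width * channels       # uint8 -> 1 byte per value

print(bytes_per_video / 2**20)                 # ~18.75 MiB raw per video
print(500_000 * bytes_per_video / 2**40)       # ~9 TiB raw for the full dataset

# At roughly 1 MB per encoded video on disk, the video codec gives
# about a 19x reduction, and the whole dataset stays near 500 GB.
```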

My other idea would be to see if I can store the dataset (or shards of it) as zarr arrays where each chunk is compressed using a video codec. This way I could keep the excellent compression ratio of video codecs while also getting Python's nice ndarray semantics. I realize that this might be a crazy idea, but part of me thinks that it sounds like the kind of crazy that deserves a try.

Would something like this be achievable with Zarr?
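For concreteness, here is a minimal sketch of what I have in mind. The `Codec` subclass and `register_codec` call are real numcodecs machinery, but `encode_video` / `decode_video` are hypothetical placeholders for whatever library would actually do the video encoding (PyAV, imageio-ffmpeg, ...), and laying out one video per chunk is just my assumption about how the chunking would map onto encoded bitstreams.

```python
import numpy as np
import zarr
from numcodecs.abc import Codec
from numcodecs.compat import ensure_contiguous_ndarray
from numcodecs.registry import register_codec


class VideoCodec(Codec):
    """Sketch of a Zarr-compatible codec where each chunk holds exactly
    one video of shape (100, 256, 256, 3), dtype uint8."""

    codec_id = "video"

    def encode(self, buf):
        frames = ensure_contiguous_ndarray(buf).view(np.uint8)
        frames = frames.reshape(100, 256, 256, 3)
        # encode_video is a placeholder for a real encoder that turns the
        # uint8 frames into a compressed bitstream (bytes).
        return encode_video(frames)

    def decode(self, buf, out=None):
        # decode_video is the matching placeholder: bytes -> uint8 frames.
        frames = decode_video(buf)
        if out is not None:
            out[...] = frames
            return out
        return frames.tobytes()


register_codec(VideoCodec)

# One video per chunk, so each chunk corresponds to one encoded bitstream.
z = zarr.open(
    "videos.zarr",
    mode="w",
    shape=(500_000, 100, 256, 256, 3),
    chunks=(1, 100, 256, 256, 3),
    dtype="u1",
    compressor=VideoCodec(),
)
```

Whether Zarr would accept a codec like this, and whether per-chunk random access stays fast enough for training, is exactly what I am unsure about.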
