I have a video dataset of around 500k (half a million) videos that I want to run ML on, so I am looking for an efficient data format that lets me read the data quickly. Each video is 10 sec long and subsampled to 256x256@10Hz, i.e., when decoded a video can be viewed as a (frame, height, width, channel) ndarray of shape (100, 256, 256, 3) and dtype uint8, and the entire dataset as an ndarray of shape (500k, 100, 256, 256, 3) and dtype uint8.
The naive approach to storing this data would be to store it as individual videos. This is the format I have now, but it isn't ideal because each file is just about 1 MB on disk. This makes loading a pain since I can't keep that many open file handles. Instead, I am constantly opening and closing small files and I am wondering if this is really the best way to go.
My other idea would be to see if I can store the dataset (or shards of it) as zarr arrays where each chunk is compressed using a video codec. This way I can keep the amazing compression rate of video codecs while also getting some of Python's nice ndarray semantics. I realize that this might be a crazy idea, but part of me thinks that it sounds like the kind of crazy that deserves a try.
Would something like this be achievable with Zarr?
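Roughly, this is what I have in mind: a minimal sketch of a custom numcodecs codec, with one chunk per video so the codec's temporal compression stays intact. The `encode_video` / `decode_video` helpers are hypothetical stand-ins for a real video encoder binding (e.g. PyAV or ffmpeg), and the code assumes the zarr v2-style `compressor=` API; I haven't tested any of this.

```python
import numpy as np
import zarr
from numcodecs.abc import Codec
from numcodecs.compat import ndarray_copy
from numcodecs.registry import register_codec


class VideoCodec(Codec):
    """Hypothetical codec that compresses one video per chunk."""

    codec_id = "video"  # made-up id, just for illustration

    def encode(self, buf):
        # buf is the raw bytes of one chunk; reshape it back into
        # (frame, height, width, channel) before handing it to the encoder.
        frames = np.frombuffer(buf, dtype="uint8").reshape(100, 256, 256, 3)
        return encode_video(frames)  # hypothetical: returns compressed bytes

    def decode(self, buf, out=None):
        frames = decode_video(buf)  # hypothetical: returns a uint8 ndarray
        return ndarray_copy(frames, out)


register_codec(VideoCodec)

# One video per chunk: chunk shape matches a single (100, 256, 256, 3) video.
z = zarr.open(
    "videos.zarr",
    mode="w",
    shape=(500_000, 100, 256, 256, 3),
    chunks=(1, 100, 256, 256, 3),
    dtype="uint8",
    compressor=VideoCodec(),
)
```

Does registering a custom compressor like this sound like a sane way to do it, or is there a better-supported route?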