Description
Trying to download the DGS Corpus on a very powerful machine with plenty of RAM, I still get a ValueError: a single serialized example exceeds the 2 GB protobuf limit. Apparently this can be fixed with the proto splitter?
https://discuss.ai.google.dev/t/fix-the-notorious-graphdef-2gb-limitation/29392
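Possible workaround (untested): until the 2 GB limit itself is lifted, it might be enough to build the dataset without the raw video, along the lines of the commented-out config in the script below. The config name, the version string, and whether dgs_corpus honors include_video=False are assumptions on my part, not verified:

import tensorflow_datasets as tfds
import sign_language_datasets.datasets
from sign_language_datasets.datasets.config import SignDatasetConfig

# Untested sketch: skip the raw video so no single example blows past the 2 GiB protobuf limit.
# The version string is copied from the commented-out autsl example and may need adjusting for dgs_corpus.
config = SignDatasetConfig(name="only-annotations", version="1.0.0", include_video=False)
dgs = tfds.load(name="dgs_corpus", builder_kwargs={"config": config})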
Traceback:
Traceback (most recent call last):
File "sldata_download.py", line 20, in <module>
dataset = tfds.load(name=str(args.dataset_name), data_dir=args.data_dir)
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 166, in __call__
return function(*args, **kwargs)
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/load.py", line 639, in load
_download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/load.py", line 498, in _download_and_prepare_builder
dbuilder.download_and_prepare(**download_and_prepare_kwargs)
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 166, in __call__
return function(*args, **kwargs)
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 691, in download_and_prepare
self._download_and_prepare(
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1583, in _download_and_prepare
future = split_builder.submit_split_generation(
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/split_builder.py", line 341, in submit_split_generation
return self._build_from_generator(**build_kwargs)
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/split_builder.py", line 418, in _build_from_generator
writer.write(key, example)
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/writer.py", line 238, in write
serialized_example = self._serializer.serialize_example(example=example)
File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/example_serializer.py", line 98, in serialize_example
return self.get_tf_example(example).SerializeToString()
ValueError: Message tensorflow_copy.Example exceeds maximum protobuf size of 2GB: 15513563426
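For scale, the size reported in the ValueError works out to about 14.4 GiB for a single serialized example, roughly 7x the 2 GiB protobuf hard limit:

example_bytes = 15_513_563_426      # size reported in the ValueError above
proto_limit = 2 * 1024 ** 3         # 2 GiB protobuf serialization limit
print(example_bytes / 1024 ** 3)    # -> ~14.4 GiB
print(example_bytes / proto_limit)  # -> ~7.2x over the limit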
Download script and command
Installed the env as noted in #89, with Python 3.8, webvtt-py, and lxml.
# https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/autsl/autsl.py
import tensorflow_datasets as tfds
import sign_language_datasets.datasets
import itertools
from sign_language_datasets.datasets.config import SignDatasetConfig
from pathlib import Path
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="attempt to download a dataset from sign-language-datasets, e.g. 'dgs_corpus/holistic'")
    parser.add_argument("dataset_name", help="something like 'dgs_corpus'")
    parser.add_argument("--data_dir", type=Path, default=Path("~/tfds_sign_language_datasets"))
    args = parser.parse_args()

    # config = SignDatasetConfig(name="only-annotations", version="1.0.0", include_video=False)
    # config = SignDatasetConfig(name="holistic")
    # autsl = tfds.load(name='autsl', data_dir=data_dir, builder_kwargs={"config": config})
    # autsl = tfds.load(name='autsl/holistic', data_dir=data_dir)
    dataset = tfds.load(name=str(args.dataset_name), data_dir=args.data_dir)

    # print the first 10 examples of the train split
    for datum in itertools.islice(dataset["train"], 0, 10):
        print("datum")
        print(datum)

Command
python sldata_download.py dgs_corpus --data_dir /data/petabyte/cleong/data/tfds_sign_language_datasets/