@Gautzilla
Contributor

🐳 What's new?

This PR lets you pick the specific files to include in the public dataset build, with an option to either move or copy them to the dataset folder.

🐳 How does it work?

Let's say we have the following file structure:

cool
├── stuff
│   ├── 2007-11-05_00-00-00.wav
│   ├── 2007-11-05_00-01-00.wav
│   └── 2007-11-05_00-02-00.wav
└── things
    ├── 2007-11-05_00-03-00.wav
    ├── 2007-11-05_00-04-00.wav
    └── 2007-11-05_00-05-00.wav

And we want to build a dataset from specific files (let's say those recorded at an odd minute for some reason 🥸):

from pathlib import Path
from osekit.public_api.dataset import Dataset

from osekit import setup_logging
setup_logging()

# Pick the files you want to include in the dataset
files = (
    r"cool\stuff\2007-11-05_00-01-00.wav",
    r"cool\things\2007-11-05_00-03-00.wav",
    r"cool\things\2007-11-05_00-05-00.wav"
)

# Set the DESTINATION folder of the dataset
dataset = Dataset(
    folder=Path(r"cool\odd_hours"),
    strptime_format="%Y-%m-%d_%H-%M-%S",
)

dataset.build_from_files(
    files=files,
    move_files=False,
)
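
The list of files above is written by hand, but it could also be generated. Here is a minimal sketch using only the standard library, assuming the filenames follow the same strptime_format as in the example. The helper name pick_odd_minute_files is hypothetical and is not part of OSEkit:

```python
from datetime import datetime
from pathlib import Path

# Same format as the strptime_format passed to Dataset in the example
STRPTIME_FORMAT = "%Y-%m-%d_%H-%M-%S"


def pick_odd_minute_files(root: Path) -> list[Path]:
    """Return the .wav files under root whose filename timestamp has an odd minute.

    Hypothetical helper for illustration only (not an OSEkit function).
    """
    picked = []
    for path in sorted(root.rglob("*.wav")):
        try:
            timestamp = datetime.strptime(path.stem, STRPTIME_FORMAT)
        except ValueError:
            continue  # Skip files whose name does not match the format
        if timestamp.minute % 2 == 1:
            picked.append(path)
    return picked
```

The returned paths could then be passed as the files argument of build_from_files. Note that the search is recursive, so if the dataset folder lives under the same root (as cool\odd_hours does here), a second run would also pick up the files already copied into it.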

This produces the following structure. The original files stay in place because move_files=False; with move_files=True, they would have been moved from their original location rather than copied:

cool
├── stuff
│   ├── 2007-11-05_00-00-00.wav
│   ├── 2007-11-05_00-01-00.wav
│   └── 2007-11-05_00-02-00.wav
├── things
│   ├── 2007-11-05_00-03-00.wav
│   ├── 2007-11-05_00-04-00.wav
│   └── 2007-11-05_00-05-00.wav
└── odd_hours
    ├── data
    │   └── audio
    │       └── original
    │           ├── 2007-11-05_00-01-00.wav
    │           ├── 2007-11-05_00-03-00.wav
    │           └── 2007-11-05_00-05-00.wav
    ├── log
    │   └── logs.log
    └── dataset.json
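
To make the copy-versus-move distinction concrete, here is a rough standard-library sketch of what the move_files flag means for the source files. This is not OSEkit's implementation: the function gather_files is a hypothetical name, and it ignores everything else the build does (logging, dataset.json, and so on):

```python
import shutil
from pathlib import Path


def gather_files(files, destination: Path, move_files: bool = False) -> list[Path]:
    """Copy (default) or move the given files into destination.

    Hypothetical illustration of the move_files semantics, not OSEkit code.
    """
    destination.mkdir(parents=True, exist_ok=True)
    gathered = []
    for file in map(Path, files):
        target = destination / file.name
        if move_files:
            # The file leaves its original location
            shutil.move(str(file), str(target))
        else:
            # The original file is left untouched
            shutil.copy2(file, target)
        gathered.append(target)
    return gathered
```

With move_files=False, the sources remain in cool\stuff and cool\things, as in the tree above; with move_files=True, they would only exist under the destination folder afterwards.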

🐬 Related issue

Merging this PR will close #302

Development

Successfully merging this pull request may close these issues.

Add an option to not move or only copy the original files when using dataset.build()
