Treat Metadata Program
The Treat Metadata program combines and treats the metadata associated with the fastq files, based on the treatment information provided by a Treatment Template file. This program belongs to the Optional Programs group, which means that this step can be skipped if your metadata needs no further treatment.
There are two execution modes available:
- ENA Mode. This mode treats the ENA Metadata Table based on the treatment information provided by a Treatment Template file. In this case, the program takes the fastq file names indicated in the Treatment Template and checks whether they are present in the provided ENA Download column (--ena_download_column parameter).
- Generic Mode. This mode treats the Generic Metadata Table based on the treatment information provided by a Treatment Template file. In this case, the program compares the provided common columns from the Metadata Table (--generic_common_column_mt parameter) and the Treatment Template (--generic_common_column_tt parameter). If the provided generic_common_column_tt is the "fastq_file_name" column, the program takes the fastq file names of the Treatment Template and compares them to the provided generic_common_column_mt in a similar fashion to the ENA mode. However, if the provided generic_common_column_tt is the "sample_name" column, a direct match is carried out between the two common columns.
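The two comparison strategies described above can be sketched as follows. This is only an illustration, not the program's actual implementation: the function names `matches_by_file_name` and `matches_directly` are hypothetical, the example column value is made up, and it assumes that "checking the presence" means a plain substring test.

```python
def matches_by_file_name(fastq_file_name: str, column_value: str) -> bool:
    """Presence check (assumed): the fastq file name appears inside the column value."""
    return fastq_file_name in column_value


def matches_directly(sample_name: str, column_value: str) -> bool:
    """Direct match: both values must be identical."""
    return sample_name == column_value


# Made-up download column value listing two files separated by ";".
ena_value = "ftp.example.org/ERR12233_1.fastq.gz;ftp.example.org/ERR12233_2.fastq.gz"

print(matches_by_file_name("ERR12233_1.fastq.gz", ena_value))  # True
print(matches_directly("Sample0", "Sample0"))                  # True
```

The presence check applies to ENA mode and to Generic mode with g_tt=fastq_file_name; the direct match applies to Generic mode with g_tt=sample_name.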
The provided metadata can be treated in three different modes:
- Merge Mode. This treatment will merge the metadata with the same sample name (as indicated in the "sample_name" column of the Treatment Template) generating a unique entry in the resulting Metadata Table. As in the case of the Treat Fastqs program, this will be done if the combination of fastq types is a permitted configuration. For further details about the permitted configurations, see Treat Fastqs Program Documentation.
- Rename Mode. This treatment will merge the metadata with the same sample name (as indicated in the "sample_name" column of the Treatment Template) generating a unique entry in the resulting Metadata Table. As in the case of the Treat Fastqs program, this will be done if the combination of fastq types is a permitted configuration. For further details about the permitted configurations, see Treat Fastqs Program Documentation.
- Copy Mode. This treatment will merge the metadata ignoring the "sample_name" column of the Treatment Template, using instead the original sample name of the fastq files as reference for merging metadata.
The resulting PROJECT_treated_metadata.tsv file will have the following structure:
- Treatment Template derived columns:
  - Final Sample Name. The "final_files_sample_name" column indicates the sample names of the final treated fastq files.
  - Original Sample Name. The "original_files_sample_names" column indicates the sample names of the original fastq files. These names are determined based on a sample name separator (-sep parameter), a sample name separator appearance (-n_sep parameter) and the associated Fastq Pattern (-p parameter). For example, for a file named "accession_1.fastq.gz" using -sep="_" and -n_sep=1 we would get "accession" as the original file sample name. In the case of SINGLE fastq files without a separator, the program will try to get the original sample name by removing the Fastq Pattern. For example, for a file named "accession.fastq.gz" and -p=".fastq.gz" we would get "accession" as the original file sample name.
  - Treatment Sample Name. The "treatment_sample_name" column indicates the sample name provided in the "sample_name" column of the Treatment Template.
  - Treatment Fastq Type. The "treatment_fastq_type" column indicates the different fastq types (pair1, pair2 or single) associated with the final sample names as represented in the Treatment Template.
- Metadata Table derived columns. These include the merged metadata for the different columns of the Metadata Table. In ENA mode, some unnecessary metadata columns are not included in the final PROJECT_treated_metadata.tsv file (nominal_length, read_count, base_count, fastq_bytes, fastq_md5, fastq_ftp, fastq_aspera, fastq_galaxy, submitted_bytes, submitted_md5, submitted_ftp, submitted_aspera, submitted_galaxy, submitted_format, sra_bytes, sra_md5, sra_ftp, sra_aspera, sra_galaxy, cram_index_ftp, cram_index_aspera, cram_index_galaxy, nominal_sdev, Read depth).
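The derivation of the original sample name from the separator, the separator appearance and the Fastq Pattern can be sketched as below. This is a minimal illustration under assumptions, not the program's code: the function name `original_sample_name` is hypothetical, and the exact fallback logic the program uses may differ.

```python
def original_sample_name(file_name, sep="_", n_sep=1, fastq_pattern=".fastq.gz"):
    """Derive an original sample name from a fastq file name (illustrative only)."""
    parts = file_name.split(sep)
    if len(parts) > n_sep:
        # Keep everything before the n_sep-th appearance of the separator.
        return sep.join(parts[:n_sep])
    # Too few separators (e.g. SINGLE files): fall back to removing the Fastq Pattern.
    if fastq_pattern and file_name.endswith(fastq_pattern):
        return file_name[: -len(fastq_pattern)]
    return file_name


print(original_sample_name("accession_1.fastq.gz"))  # accession
print(original_sample_name("accession.fastq.gz"))    # accession
```

Both calls reproduce the two examples given for the "original_files_sample_names" column.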
When combining information for a final sample name, if multiple different values are found for a column, they are separated by a semicolon (;) and the program generates a warning report that should be used to check for possible metadata inconsistencies. In ENA mode, the program does not raise this warning for certain columns, because differing values there are expected and non-problematic (sample_accession, secondary_sample_accession, experiment_accession, run_accession, library_name, experiment_title, experiment_alias, run_alias, sample_alias, broker_name, sample_title, first_public, last_updated, ENA-FIRST-PUBLIC, ENA-LAST-UPDATE, first_created). More non-problematic columns can be provided using the Extra No Warning Columns option (-e parameter).
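The combination rule can be sketched as follows. This is only an illustration: the function name `combine_values` and the "tissue" column are hypothetical, and the sketch assumes that identical values are collapsed into one while their first-seen order is kept.

```python
def combine_values(values, column, no_warning_columns=frozenset()):
    """Join distinct values with ';' and report whether a warning is due."""
    distinct = list(dict.fromkeys(values))  # drop duplicates, keep first-seen order
    combined = ";".join(distinct)
    # Warn only when values differ and the column is not marked as safe to merge.
    warn = len(distinct) > 1 and column not in no_warning_columns
    return combined, warn


print(combine_values(["gut", "gut"], "tissue"))    # ('gut', False)
print(combine_values(["gut", "skin"], "tissue"))   # ('gut;skin', True)
print(combine_values(["ERR1", "ERR2"], "run_accession", {"run_accession"}))  # ('ERR1;ERR2', False)
```

In this sketch, `no_warning_columns` plays the role of the built-in ENA list plus any columns passed with the -e parameter.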
For instance, if we had the following PROJECT_treatment_template.tsv:
| sample_name | fastq_file_name | fastq_type | treatment |
|---|---|---|---|
| Sample0 | ERR12233.fastq.gz | single | copy |
| Sample1 | ERR12234.fastq.gz | single | rename |
| Sample2 | ERR12235.fastq.gz | single | merge |
| Sample2 | ERR12236.fastq.gz | single | merge |
The program would perform the following treatment on the metadata table:
| final_files_sample_name | original_files_sample_names | treatment_sample_name | treatment_fastq_type | metadata_columns[...] |
|---|---|---|---|---|
| ERR12233 | ERR12233 | Sample0 | single | ERR12233 metadata values |
| Sample1 | ERR12234 | Sample1 | single | Sample1 metadata combined values |
| Sample2 | ERR12235;ERR12236 | Sample2 | single | Sample2 metadata combined values |
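How the final sample names in this example follow from the treatment column can be sketched as below. This is a simplified illustration, not the program's implementation: the function name `group_template` is hypothetical, it only handles SINGLE files, and it assumes that "copy" keeps the original sample name while "rename" and "merge" use the "sample_name" column.

```python
template = [
    ("Sample0", "ERR12233.fastq.gz", "single", "copy"),
    ("Sample1", "ERR12234.fastq.gz", "single", "rename"),
    ("Sample2", "ERR12235.fastq.gz", "single", "merge"),
    ("Sample2", "ERR12236.fastq.gz", "single", "merge"),
]


def group_template(rows, fastq_pattern=".fastq.gz"):
    """Map each final sample name to its ';'-joined original sample names."""
    grouped = {}
    for sample_name, file_name, _fastq_type, treatment in rows:
        # SINGLE files: the original name is the file name without the Fastq Pattern.
        if file_name.endswith(fastq_pattern):
            original = file_name[: -len(fastq_pattern)]
        else:
            original = file_name
        final = original if treatment == "copy" else sample_name
        grouped.setdefault(final, []).append(original)
    return {final: ";".join(originals) for final, originals in grouped.items()}


print(group_template(template))
# {'ERR12233': 'ERR12233', 'Sample1': 'ERR12234', 'Sample2': 'ERR12235;ERR12236'}
```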
For further details, check the treated_filtered_merged_PRJEB10949_ENA_metadata.tsv file generated by the Test omdctk program.
Input Elements:
| Input | Type | Description |
|---|---|---|
| PROJECT_metadata.tsv | File | Metadata Table. One of the Metadata Tables generated in the different steps of the workflow by the Download Metadata ENA program (PROJECT_ENA_metadata.tsv), the Merge Metadata program (PROJECT_merged_metadata.tsv) or the Filter Metadata program (PROJECT_filtered_metadata.tsv). A Generic Metadata Table (GENERIC_metadata_file.tsv) can also be provided. |
| PROJECT_treatment_template.tsv | File | Final Curated Treatment Template |
Output Elements:
| Output | Type | Description |
|---|---|---|
| PROJECT_treated_metadata.tsv | File | Treated Metadata Table |
| warnings_report.tsv | File | Warnings Report. Only produced if possible metadata inconsistencies were detected. |
The resulting PROJECT_treated_metadata.tsv file is the final treated metadata table. To get a general idea of the optional treatment steps of the workflow, check the workflow's diagram.
Usage:
treat_metadata [-h] -m METADATA_TABLE -t TREATMENT_TEMPLATE [-s {ENA,Generic}]
[-c {fastq_ftp,fastq_aspera,fastq_galaxy,submitted_ftp,submitted_aspera,submitted_galaxy}]
[-g_mt GENERIC_COMMON_COLUMN_MT] [-g_tt {sample_name,fastq_file_name}]
[-e EXTRA_NO_WARNING_COLUMNS [EXTRA_NO_WARNING_COLUMNS ...]]
[-sep SAMPLE_NAME_SEP] [-n_sep SAMPLE_NAME_SEP_APPEREANCE]
[-p FASTQ_PATTERN] [-r1 R1_PATTERN] [-r2 R2_PATTERN]
[-o OUTPUT_DIRECTORY] [-x] [-v]
Options:
| Parameter | Description |
|---|---|
| -h, --help | Show help message and exit. |
| -m, --metadata_table | Metadata Table [Expected sep=TABS]. Indicate the path to the Metadata Table file. |
| -t, --treatment_template | Treatment Template [Expected sep=TABS]. Indicate the path to the Treatment Template file. |
| -s, --mode | Execution Mode (Optional) [Default:ENA]. Options: 1) ENA Metadata Table File [Expected sep=TABS] or 2) Generic Manifest Table File [Expected sep=TABS]. Permitted options are {ENA, Generic}. |
| -c, --ena_download_column | ENA Download Column (Optional) [Default:fastq_ftp]. Indicate the ENA Metadata Table column that was used to download Fastq files. Permitted options are {fastq_ftp, fastq_aspera, fastq_galaxy, submitted_ftp, submitted_aspera, submitted_galaxy}. This parameter will be skipped if Generic mode is used. |
| -g_mt, --generic_common_column_mt | Generic Common Metadata Column (Optional) [Default:sample_id]. Indicate the name of the Common Column in Metadata Table to compare Metadata Table and Treatment Template Files. This parameter will be skipped if ENA mode is used. |
| -g_tt, --generic_common_column_tt | Generic Common Treatment Template Column (Optional) [Default:sample_name]. Indicate the name of the Common Column in Treatment Template to compare Metadata Table and Treatment Template Files. Permitted options are {sample_name,fastq_file_name}. This parameter will be skipped if ENA mode is used. |
| -e, --extra_no_warning_columns | Extra No Warning Columns (Optional). Indicate the column names of the Metadata Table that can be safely merged without warning. Provide column names separated by spaces (If a column name has spaces, quote it). |
| -sep, --sample_name_sep | Sample Name separator (Optional). Indicate sample name separator for "fastq_file_name" column in Treatment Template [Default="_"]. |
| -n_sep, --sample_name_sep_appearance | Sample Name separator appearance (Optional). Indicate by which appearance of the separator the file name can be divided in sample_name + rest [Default=1 appearance]. |
| -p, --fastq_pattern | Fastq File Pattern (Optional) [Default:".fastq.gz"]. Indicate the pattern to identify Fastq files. |
| -r1, --r1_pattern | R1 File Pattern (Optional) [Default:"_1.fastq.gz"]. Indicate the pattern to identify R1 PAIRED Fastq files. |
| -r2, --r2_pattern | R2 File Pattern (Optional) [Default:"_2.fastq.gz"]. Indicate the pattern to identify R2 PAIRED Fastq files. |
| -o, --output_directory | Output Directory. Indicate the path to the Output Directory to save the resulting files. |
| -x, --plain_text | Plain Text Mode (Optional). If indicated, it will enable Plain Text mode, and text will appear without colors. |
| -v, --version | Show program's version number and exit. |
Commands:
- Treat Metadata with colored text stdout:
treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv
- Treat Metadata with plain text stdout:
treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv --plain_text
- Treat Metadata indicating extra no warning columns:
treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv --extra_no_warning_columns Run Sample run_accessions run_label
- Treat Metadata using "fq.gz" instead of the default "fastq.gz" Fastq Pattern:
treat_metadata -t treatment_template_PROJECT_metadata_files_other_fastq_extension.tsv -m PROJECT_metadata_files_other_fastq_extension.tsv -p ".fq.gz" -r1 "_1.fq.gz" -r2 "_2.fq.gz"
- Treat Metadata using "submitted_ftp" instead of the default "fastq_ftp" as ENA Download Column:
treat_metadata -t treatment_template_PROJECT_metadata_submitted_ftp.tsv -m PROJECT_metadata_submitted_ftp.tsv -c submitted_ftp
- Treat Metadata using sep="." and n_sep=2 instead of the defaults:
treat_metadata -t treatment_template_PROJECT_metadata_sep_point_nsep_2.tsv -m PROJECT_metadata_sep_point_nsep_2.tsv -sep "." -n_sep 2
- Treat Metadata and save results in the specified directory (Example):
treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv -o /home/user/Desktop/Example
- Treat Metadata in Generic mode (default: g_mt=sample_id and g_tt=sample_name):
treat_metadata -t treatment_template_filtered_manifest_CRA001372_example.tsv -m filtered_merged_merged_merged_CRA001372_run_clean.tsv -s Generic -p ".fq.gz" -r1 "_1.fq.gz" -r2 "_2.fq.gz"
- Treat Metadata in Generic mode (g_tt=fastq_file_name):
treat_metadata -t treatment_template_filtered_manifest_CRA001372_example.tsv -m filtered_merged_merged_merged_CRA001372_run_clean.tsv -s Generic -p ".fq.gz" -r1 "_1.fq.gz" -r2 "_2.fq.gz" -g_tt fastq_file_name -g_mt 'Read filename 1 mod'
For a full and detailed example of dataset curation, see the Tutorial Full Example page, which is particularly recommended for this program.