Treat Metadata Program

Description

The Treat Metadata program combines and treats the metadata associated with the fastq files, based on the treatment information provided by a Treatment Template file. This program belongs to the Optional Programs group, which means that this step can be skipped if your metadata needs no further treatment.

There are two execution modes available:

ENA Mode. This mode treats the ENA Metadata Table based on the treatment information provided by a Treatment Template file. In this case, the program will use the fastq file names indicated in the Treatment Template and compare them to the provided ENA Download column (--ena_download_column parameter) by checking the presence of the fastq file names.
Generic Mode. This mode treats the Generic Metadata Table based on the treatment information provided by a Treatment Template file. In this case, the program will compare the provided common columns from the Metadata Table (--generic_common_column_mt parameter) and the Treatment Template (--generic_common_column_tt parameter). If the provided generic_common_column_tt is the "fastq_file_name" column, the program will use the fastq file names of the Treatment Template and compare them to the provided generic_common_column_mt in a similar fashion to the ENA mode. However, if the provided generic_common_column_tt is the "sample_name" column, a direct match will be carried out between the two common columns.
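
As a rough illustration of the file-name comparison described for ENA mode, the following minimal sketch looks up which metadata rows mention a given fastq file name in the download column. It is not the program's actual code: the function name `find_matching_rows`, the example URLs, and the choice to compare against the last path component of each semicolon-separated URL (rather than a plain substring check) are all assumptions made for this example.

```python
def find_matching_rows(fastq_file_name, download_column_values):
    """Return the indices of rows whose download column mentions the file name.

    Hypothetical helper: the real program may check presence differently.
    """
    matches = []
    for index, value in enumerate(download_column_values):
        # A download column value may hold several URLs separated by ';'
        urls = [url.strip() for url in str(value).split(";") if url.strip()]
        # Compare the file name with the last path component of each URL
        if any(url.rsplit("/", 1)[-1] == fastq_file_name for url in urls):
            matches.append(index)
    return matches


# Made-up values for a "fastq_ftp" column (one single-end run, one paired-end run)
fastq_ftp = [
    "ftp.example.org/fastq/ERR12233/ERR12233.fastq.gz",
    "ftp.example.org/fastq/ERR12234/ERR12234_1.fastq.gz;"
    "ftp.example.org/fastq/ERR12234/ERR12234_2.fastq.gz",
]

print(find_matching_rows("ERR12234_2.fastq.gz", fastq_ftp))
```

Comparing whole file names instead of substrings avoids accidental matches when one file name is contained in another.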

The provided metadata can be treated in three different modes:

Merge Mode. This treatment will merge the metadata with the same sample name (as indicated in the "sample_name" column of the Treatment Template) generating a unique entry in the resulting Metadata Table. As in the case of the Treat Fastqs program, this will be done if the combination of fastq types is a permitted configuration. For further details about the permitted configurations, see Treat Fastqs Program Documentation.
Rename Mode. This treatment will merge the metadata with the same sample name (as indicated in the "sample_name" column of the Treatment Template) generating a unique entry in the resulting Metadata Table. As in the case of the Treat Fastqs program, this will be done if the combination of fastq types is a permitted configuration. For further details about permitted configurations, see Treat Fastqs Program Documentation.
Copy Mode. This treatment will merge the metadata ignoring the "sample_name" column of the Treatment Template, using instead the original sample name of the fastq files as reference for merging metadata.

The resulting PROJECT_treated_metadata.tsv file will have the following structure:

Treatment Template derived columns. These are the following columns:
- Final Sample Name. The "final_files_sample_name" column indicates the sample names of the final treated fastq files.
- Original Sample Name. The "original_files_sample_names" column indicates the sample names of the original fastq files. These names are determined based on a sample name separator (-sep parameter), a sample name separator appearance (-n_sep parameter) and the associated Fastq Pattern (-p parameter). For example, for a file named "accession_1.fastq.gz" using -sep="_" and -n_sep=1 we would get "accession" as the original file sample name. In the case of SINGLE fastq files without a separator, the program will try to get the original sample name by removing the Fastq Pattern. For example, for a file named "accession.fastq.gz" and -p=".fastq.gz" we would get "accession" as the original file sample name.
- Treatment Sample Name. The "treatment_sample_name" column indicates the sample name provided in the "sample_name" column of the Treatment Template.
- Treatment Fastq Type. The "treatment_fastq_type" column indicates the different fastq types (pair1, pair2 or single) associated to the final sample names as represented in the Treatment Template.
Metadata Table derived columns. These include the merged metadata for the different columns of the Metadata Table. In ENA mode, some unnecessary metadata columns are excluded from the final PROJECT_treated_metadata.tsv file (nominal_length, read_count, base_count, fastq_bytes, fastq_md5, fastq_ftp, fastq_aspera, fastq_galaxy, submitted_bytes, submitted_md5, submitted_ftp, submitted_aspera, submitted_galaxy, submitted_format, sra_bytes, sra_md5, sra_ftp, sra_aspera, sra_galaxy, cram_index_ftp, cram_index_aspera, cram_index_galaxy, nominal_sdev, Read depth).
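
The way the original sample name is derived from a fastq file name can be sketched as follows. This is only an illustration under stated assumptions, not the program's implementation: the function name `original_sample_name` is hypothetical, and the fallback order (split on the separator first, then strip the Fastq Pattern) is inferred from the two examples given for the "original_files_sample_names" column.

```python
def original_sample_name(file_name, sep="_", n_sep=1, pattern=".fastq.gz"):
    """Derive a sample name from a fastq file name (hypothetical helper)."""
    parts = file_name.split(sep)
    if len(parts) > n_sep:
        # The separator appears at least n_sep times:
        # keep everything before its n_sep-th appearance
        return sep.join(parts[:n_sep])
    if pattern and file_name.endswith(pattern):
        # No usable separator (e.g. SINGLE files): strip the Fastq Pattern
        return file_name[: -len(pattern)]
    # Assumption: leave the name untouched when neither rule applies
    return file_name


print(original_sample_name("accession_1.fastq.gz"))             # accession
print(original_sample_name("accession.fastq.gz"))               # accession
print(original_sample_name("sample.A.fq.gz", sep=".", n_sep=2))  # sample.A
```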

When combining information for a final sample name, if multiple different values are found for a column, they will be separated by a semicolon (;) and the program will generate a warnings report that should be used to check for possible metadata inconsistencies. In ENA mode, the program ignores this circumstance for some columns, because differing values there are expected and non-problematic (sample_accession, secondary_sample_accession, experiment_accession, run_accession, library_name, experiment_title, experiment_alias, run_alias, sample_alias, broker_name, sample_title, first_public, last_updated, ENA-FIRST-PUBLIC, ENA-LAST-UPDATE, first_created). Additional non-problematic columns can be provided using the Extra No Warning Columns option (-e parameter).
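
The combination rule described above can be sketched as a small function that joins the distinct values of one column and reports whether a warning would be warranted. The function name `combine_values`, the order-preserving de-duplication, and the returned tuple are assumptions made for this example; the program's internal logic and the layout of its warnings report may differ.

```python
def combine_values(column, values, no_warning_columns=frozenset()):
    """Join the distinct values of one column for a final sample name.

    Returns (combined_value, warn). Hypothetical helper for illustration.
    """
    unique = []
    for value in values:
        # Keep each distinct value once, in order of first appearance
        if value not in unique:
            unique.append(value)
    combined = ";".join(unique)
    # Warn only when values differ and the column is not whitelisted
    warn = len(unique) > 1 and column not in no_warning_columns
    return combined, warn


print(combine_values("country", ["Spain", "Spain"]))
print(combine_values("country", ["Spain", "Italy"]))
print(combine_values("run_accession", ["ERR12235", "ERR12236"],
                     no_warning_columns={"run_accession"}))
```

In this sketch, the third call mimics a column that is expected to differ between merged runs, so no warning is raised even though two values are combined.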

For instance, if we had the following PROJECT_treatment_template.tsv:

sample_name	fastq_file_name	fastq_type	treatment
Sample0	ERR12233.fastq.gz	single	copy
Sample1	ERR12234.fastq.gz	single	rename
Sample2	ERR12235.fastq.gz	single	merge
Sample2	ERR12236.fastq.gz	single	merge

The program would perform the following treatment on the metadata table:

final_files_sample_name	original_files_sample_names	treatment_sample_name	treatment_fastq_type	metadata_columns[...]
ERR12233	ERR12233	Sample0	single	ERR12233 metadata values
Sample1	ERR12234	Sample1	single	Sample1 metadata combined values
Sample2	ERR12235;ERR12236	Sample2	single	Sample2 metadata combined values

For further details, check the treated_filtered_merged_PRJEB10949_ENA_metadata.tsv file generated by the Test omdctk program.
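
The example above can be reproduced with a short sketch that derives the four Treatment Template derived columns from the template rows. It assumes SINGLE files only and that the "copy" treatment keeps the original sample name while "rename" and "merge" use the "sample_name" column, as the example table suggests; the function name `treated_rows` is hypothetical and the metadata columns themselves are left out.

```python
import csv
import io

# The example Treatment Template, rebuilt as tab-separated text
TEMPLATE = "\n".join("\t".join(fields) for fields in [
    ["sample_name", "fastq_file_name", "fastq_type", "treatment"],
    ["Sample0", "ERR12233.fastq.gz", "single", "copy"],
    ["Sample1", "ERR12234.fastq.gz", "single", "rename"],
    ["Sample2", "ERR12235.fastq.gz", "single", "merge"],
    ["Sample2", "ERR12236.fastq.gz", "single", "merge"],
])


def treated_rows(template_text, pattern=".fastq.gz"):
    """Build the Treatment Template derived columns (hypothetical helper)."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(template_text), delimiter="\t"):
        name = row["fastq_file_name"]
        # SINGLE files: strip the Fastq Pattern to get the original name
        original = name[: -len(pattern)] if name.endswith(pattern) else name
        # Assumption: "copy" keeps the original name, other treatments rename
        final = original if row["treatment"] == "copy" else row["sample_name"]
        entry = grouped.setdefault(final, {
            "originals": [],
            "treatment_sample_name": row["sample_name"],
            "fastq_type": row["fastq_type"],
        })
        if original not in entry["originals"]:
            entry["originals"].append(original)
    return [
        (final, ";".join(e["originals"]), e["treatment_sample_name"], e["fastq_type"])
        for final, e in grouped.items()
    ]


for treated in treated_rows(TEMPLATE):
    print("\t".join(treated))
```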

Input Elements:

Input	Type	Description
`PROJECT_metadata.tsv`	`File`	Metadata Table. One of the Metadata Tables generated in the different steps of the workflow by the Download Metadata ENA program (`PROJECT_ENA_metadata.tsv`), the Merge Metadata program (`PROJECT_merged_metadata.tsv`) or the Filter Metadata program (`PROJECT_filtered_metadata.tsv`). A Generic Metadata Table (`GENERIC_metadata_file.tsv`) can also be used.
`PROJECT_treatment_template.tsv`	`File`	Final Curated Treatment Template

Output Elements:

Output	Type	Description
`PROJECT_treated_metadata.tsv`	`File`	Treated Metadata Table
`warnings_report.tsv`	`File`	Warnings Report. Only produced if possible metadata inconsistencies were detected.

The resulting PROJECT_treated_metadata.tsv file would be the final treated metadata table. To get a general idea of the optional treatment steps of the workflow, check the workflow's diagram.

Arguments

Usage:

treat_metadata [-h] -m METADATA_TABLE -t TREATMENT_TEMPLATE [-s {ENA,Generic}] 
                   [-c {fastq_ftp,fastq_aspera,fastq_galaxy,submitted_ftp,submitted_aspera,submitted_galaxy}]
                   [-g_mt GENERIC_COMMON_COLUMN_MT] [-g_tt {sample_name,fastq_file_name}]
                   [-e EXTRA_NO_WARNING_COLUMNS [EXTRA_NO_WARNING_COLUMNS ...]] 
                   [-sep SAMPLE_NAME_SEP] [-n_sep SAMPLE_NAME_SEP_APPEREANCE] 
                   [-p FASTQ_PATTERN] [-r1 R1_PATTERN] [-r2 R2_PATTERN]
                   [-o OUTPUT_DIRECTORY] [-x] [-v]

Options:

Parameter	Description
`-h, --help`	Show help message and exit.
`-m, --metadata_table`	Metadata Table [Expected sep=TABS]. Indicate the path to the Metadata Table file.
`-t, --treatment_template`	Treatment Template [Expected sep=TABS]. Indicate the path to the Treatment Template file.
`-s, --mode`	Execution Mode (Optional) [Default:ENA]. Options: 1) ENA Metadata Table File [Expected sep=TABS] or 2) Generic Manifest Table File [Expected sep=TABS]. Permitted options are {ENA, Generic}.
`-c, --ena_download_column`	ENA Download Column (Optional) [Default:fastq_ftp]. Indicate the ENA Metadata Table column that was used to download Fastq files. Permitted options are {fastq_ftp, fastq_aspera, fastq_galaxy, submitted_ftp, submitted_aspera, submitted_galaxy}. This parameter will be skipped if Generic mode is used.
`-g_mt, --generic_common_column_mt`	Generic Common Metadata Column (Optional) [Default:sample_id]. Indicate the name of the Common Column in Metadata Table to compare Metadata Table and Treatment Template Files. This parameter will be skipped if ENA mode is used.
`-g_tt, --generic_common_column_tt`	Generic Common Treatment Template Column (Optional) [Default:sample_name]. Indicate the name of the Common Column in Treatment Template to compare Metadata Table and Treatment Template Files. Permitted options are {sample_name,fastq_file_name}. This parameter will be skipped if ENA mode is used.
`-e, --extra_no_warning_columns`	Extra No Warning Columns (Optional). Indicate the column names of the Metadata Table that can be safely merged without warning. Provide column names separated by spaces (If a column name has spaces, quote it).
`-sep, --sample_name_sep`	Sample Name separator (Optional). Indicate sample name separator for "fastq_file_name" column in Treatment Template [Default="_"].
`-n_sep, --sample_name_sep_appearance`	Sample Name separator appearance (Optional). Indicate by which appearance of the separator the file name can be divided in sample_name + rest [Default=1 appearance].
`-p, --fastq_pattern`	Fastq File Pattern (Optional) [Default:".fastq.gz"]. Indicate the pattern to identify Fastq files.
`-r1, --r1_pattern`	R1 File Pattern (Optional) [Default:"_1.fastq.gz"]. Indicate the pattern to identify R1 PAIRED Fastq files.
`-r2, --r2_pattern`	R2 File Pattern (Optional) [Default:"_2.fastq.gz"]. Indicate the pattern to identify R2 PAIRED Fastq files.
`-o, --output_directory`	Output Directory. Indicate the path to the Output Directory to save the resulting files.
`-x, --plain_text`	Plain Text Mode (Optional). If indicated, it will enable Plain Text mode, and text will appear without colors.
`-v, --version`	Show program's version number and exit.

Examples

Commands:

Treat Metadata with colored text stdout:

treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv

Treat Metadata with plain text stdout:

treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv --plain_text

Treat Metadata indicating extra no warning columns:

treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv --extra_no_warning_columns Run Sample run_accessions run_label

Treat Metadata using ".fq.gz" instead of the default ".fastq.gz" Fastq Pattern:

treat_metadata -t treatment_template_PROJECT_metadata_files_other_fastq_extension.tsv -m PROJECT_metadata_files_other_fastq_extension.tsv -p ".fq.gz" -r1 "_1.fq.gz" -r2 "_2.fq.gz"

Treat Metadata using "submitted_ftp" instead of the default "fastq_ftp" as ENA Download Column:

treat_metadata -t treatment_template_PROJECT_metadata_submitted_ftp.tsv -m PROJECT_metadata_submitted_ftp.tsv -c submitted_ftp

Treat Metadata using sep="." and n_sep=2 instead of the defaults:

treat_metadata -t treatment_template_PROJECT_metadata_sep_point_nsep_2.tsv -m PROJECT_metadata_sep_point_nsep_2.tsv -sep "." -n_sep 2

Treat Metadata and save results in the specified directory (Example):

treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv -o /home/user/Desktop/Example

Treat Metadata in Generic mode (default: g_mt=sample_id and g_tt=sample_name):

treat_metadata -t treatment_template_filtered_manifest_CRA001372_example.tsv -m filtered_merged_merged_merged_CRA001372_run_clean.tsv -s Generic -p ".fq.gz" -r1 "_1.fq.gz" -r2 "_2.fq.gz"

Treat Metadata in Generic mode (g_tt=fastq_file_name):

treat_metadata -t treatment_template_filtered_manifest_CRA001372_example.tsv -m filtered_merged_merged_merged_CRA001372_run_clean.tsv -s Generic -p ".fq.gz" -r1 "_1.fq.gz" -r2 "_2.fq.gz" -g_tt fastq_file_name -g_mt 'Read filename 1 mod'

For a full and detailed example of dataset curation, see the Tutorial Full Example page, which is particularly recommended for this program.
