Feature/DIMS refactor GenerateViolinPlots #82
base: develop
…tor_GenerateViolinPlots
rernst
left a comment
First of all, lots of work done, good job! I feel there is room for improvement. Some general thoughts:
- Many parameters have names like metab_interest_sorted. In the context of a function, it is not relevant whether the input is "of interest" or "sorted". Use neutral, descriptive names that reflect the data type or role.
- Several functions are named after their use case rather than their functionality. Name functions based on what they do, not where they are used.
- When breaking function calls across lines, maintain a consistent style. Preferred format:

      function1(
        function_2(param_a),
        param_b,
        param_c
      )

- There is no error catching for missing files or invalid paths. Currently the code will crash without a clear message, making debugging difficult.
- There seem to be a lot of ad-hoc data transformations. It feels like the DIMS application is missing a standardized data format for saving and reusing data between steps.
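To illustrate the point about error catching, here is a minimal sketch of a checked file reader. The name read_tsv_checked and its parameter are hypothetical and not part of this PR; the read.table arguments are copied from the calls in the reviewed code.

```r
# Hypothetical helper (not in the PR): read a tab-separated file and fail
# early with a readable message when the path is missing or unreadable.
read_tsv_checked <- function(file_path) {
  # file.exists() is also TRUE for directories, so exclude those explicitly
  if (!file.exists(file_path) || dir.exists(file_path)) {
    stop("Input file not found: ", file_path, call. = FALSE)
  }
  tryCatch(
    read.table(file_path, sep = "\t", header = TRUE, quote = ""),
    error = function(err) {
      # Re-raise with the file path included, so the failing input is obvious
      stop("Could not read '", file_path, "': ", conditionMessage(err), call. = FALSE)
    }
  )
}
```

With something like this, a wrong path stops the step with a message that names the file instead of failing somewhere further down the script.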
    #' @param intensity_cols: names of the columns that contain the intensities (string)
    #'
    #' @returns fraction_side_intensity: a vector of intensities (vector of integers)
    get_intentities_for_ratios <- function(ratios_metabs_df, row_index, intensities_zscore_df, fraction_side, intensity_cols) {
This function would be clearer with more descriptive comments, for example before each if/else block. Second, the name get_intentities_for_ratios implies that we get multiple intensities for multiple ratios, whereas the return object fraction_side_intensity implies only one value.
    #' @param intensity_cols: names of the columns that contain the intensities (string)
    #'
    #' @returns fraction_side_intensity: a vector of intensities (vector of integers)
    get_intentities_for_ratios <- function(ratios_metabs_df, row_index, intensities_zscore_df, fraction_side, intensity_cols) {
The function name contains a typo: intentities -> intensities.
    get_zscore_columns <- function(colnames_zscore, intensity_cols) {
      sample_intersect <- intersect(paste0(intensity_cols, "_Zscore"), grep("_Zscore", colnames_zscore, value = TRUE))
      return(sample_intersect)
    }
The function name get_zscore_columns implies that we get columns (data or index?) with z-scores, while the description says we get sample IDs.
A better name would be something like get_sample_ids_with_zscore.
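As a rough sketch of what the rename could look like: the parameter names column_names and sample_ids are my own suggestion. The grep from the original is dropped on the assumption that it is redundant, since every candidate name already ends in "_Zscore" and the intersection therefore gives the same result.

```r
# Sketch of the renamed function; returns the "<sample_id>_Zscore" column
# names that are actually present in column_names.
get_sample_ids_with_zscore <- function(column_names, sample_ids) {
  intersect(paste0(sample_ids, "_Zscore"), column_names)
}
```

Note that the result still carries the _Zscore suffix, as in the original. If bare sample IDs are intended, the description or the return value would need to change as well.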
    get_list_metabolites <- function(metab_group_dir) {
      # get a list of all metabolite files
      metabolite_files <- list.files(metab_group_dir, pattern = "*.txt", full.names = FALSE, recursive = FALSE)
      # put all metabolites into one list
      metab_list_all <- lapply(paste(metab_group_dir, metabolite_files, sep = "/"),
                               read.table, sep = "\t", header = TRUE, quote = "")
      names(metab_list_all) <- gsub(".txt", "", metabolite_files)

      return(metab_list_all)
    }
- Use the same word for metabolite throughout, not the abbreviation metab.
- The function is named after its use, not its functionality. I think it just creates a list of dataframes from a directory containing .txt files, so a better (and reusable) name would be something like get_dataframes_from_dir.
    # get a list of all metabolite files
    metabolite_files <- list.files(metab_group_dir, pattern = "*.txt", full.names = FALSE, recursive = FALSE)
    # put all metabolites into one list
    metab_list_all <- lapply(paste(metab_group_dir, metabolite_files, sep = "/"),
Set full.names to TRUE to get rid of the paste on line 48.
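Putting the two comments on this function together, it could look roughly like the sketch below. The name get_dataframes_from_dir comes from the suggestion above; the parameter names, the directory check and the use of tools::file_path_sans_ext are my assumptions. The pattern is written as a regular expression ("\\.txt$"), because list.files interprets pattern as a regex rather than a glob.

```r
# Sketch: read every .txt file in a directory into a named list of dataframes.
get_dataframes_from_dir <- function(dir_path, pattern = "\\.txt$") {
  if (!dir.exists(dir_path)) {
    stop("Directory not found: ", dir_path, call. = FALSE)
  }
  # full.names = TRUE returns complete paths, so no paste() is needed
  file_paths <- list.files(dir_path, pattern = pattern, full.names = TRUE, recursive = FALSE)
  dataframes <- lapply(file_paths, read.table, sep = "\t", header = TRUE, quote = "")
  # Name each dataframe after its file, without directory and extension
  names(dataframes) <- tools::file_path_sans_ext(basename(file_paths))
  dataframes
}
```

Because nothing in it is specific to metabolites, the same function could be reused for any directory of tab-separated files.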
    # Remove columns, move HMDB_code & HMDB_name column to the front, change intensity columns to numeric
    intensities_zscore_df <- intensities_zscore_df %>%
      select(-c(plots, HMDB_name_all, HMDB_ID_all, sec_HMDB_ID, HMDB_key, sec_HMBD_ID_rlvnc, name,
                relevance, descr, origin, fluids, tissue, disease, pathway, nr_ctrls)) %>%
      relocate(c(HMDB_code, HMDB_name)) %>%
      rename(mean_controls = avg_ctrls, sd_controls = sd_ctrls) %>%
      mutate(across(!c(HMDB_name, HMDB_code), as.numeric))

    # Get the controls and patient IDs, select the intensity columns
    controls <- colnames(intensities_zscore_df)[grepl("^C", colnames(intensities_zscore_df)) &
                                                  !grepl("_Zscore$", colnames(intensities_zscore_df))]
    control_intensities_cols_index <- which(colnames(intensities_zscore_df) %in% controls)
    nr_of_controls <- length(controls)

    patients <- colnames(intensities_zscore_df)[grepl("^P", colnames(intensities_zscore_df)) &
                                                  !grepl("_Zscore$", colnames(intensities_zscore_df))]
    patient_intensities_cols_index <- which(colnames(intensities_zscore_df) %in% patients)
    nr_of_patients <- length(patients)

    intensity_cols_index <- c(control_intensities_cols_index, patient_intensities_cols_index)
    intensity_cols <- colnames(intensities_zscore_df)[intensity_cols_index]
This could be one (or more) 'prepare_data' functions.
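For example, the duplicated control and patient selection could be folded into one small helper. The name get_intensity_columns and its parameters are hypothetical; the logic mirrors the grepl("^C", ...) & !grepl("_Zscore$", ...) pattern used in this block.

```r
# Sketch: select the intensity columns for one sample group by name prefix,
# excluding the matching "_Zscore" columns.
get_intensity_columns <- function(column_names, prefix) {
  is_in_group <- startsWith(column_names, prefix)
  is_zscore <- endsWith(column_names, "_Zscore")
  column_names[is_in_group & !is_zscore]
}
```

The main script would then reduce to calls such as get_intensity_columns(colnames(intensities_zscore_df), "C") for controls and the same with "P" for patients, with intensity_cols being the two results concatenated.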
    intensity_cols_index <- c(control_intensities_cols_index, patient_intensities_cols_index)
    intensity_cols <- colnames(intensities_zscore_df)[intensity_cols_index]

    #### Calculate ratios of intensities for metabolites ####
Parts of this block can be 'calculate' functions.
    zscore_patients_df <- intensities_zscore_ratios_df %>% select(HMDB_code, HMDB_name, any_of(paste0(patients, "_Zscore")))
    zscore_controls_df <- intensities_zscore_ratios_df %>% select(HMDB_code, HMDB_name, any_of(paste0(controls, "_Zscore")))

    #### Make violin plots #####
And this could be a 'create violin plot PDF' function.
    save_prob_scores_to_excel(diem_probability_score, output_dir, run_name)

    #### Generate dIEM plots #########
This could also be a function.
    # metabs_iems <- lapply(top_iems, function(iem) {
    #   iem_probablity <- patient_top_iems_probs %>% filter(Disease == iem) %>% pull(!!sym(patient_id))
    #   metabs_iems_names <- c(metabs_iems_names, paste0(iem, ", probability score ", iem_probablity))
    #   metab_iem <- expected_biomarkers_df %>% filter(Disease == iem) %>% select(HMDB_code, HMDB_name)
    #   return(metab_iem)
    # })
    # names(metabs_iems) <- metabs_iems_names
Remove old? code.
The refactor of GenerateViolinPlots: