@jtarraga jtarraga commented Jul 1, 2024

Update data sources for CellBase 6.2

@jtarraga jtarraga requested a review from imedina July 1, 2024 06:37
@j-coll j-coll changed the title TASK-5564 Update data sources for CellBase 6.2 TASK-5564 - Update data sources for CellBase 6.2 Jul 1, 2024
jtarraga and others added 28 commits July 24, 2024 16:10
…fixing sonar issues, #TASK-5576, #TASK-5564
…on file for species (e.g., mmusculus), and update the variant downloader according to these changes, #TASK-6426, #TASK-5564
…by the different data, e.g., repeats, #TASK-6142, #TASK-5564
  - Re-using the function loadJsonFile
  - Adding a mongodb index for the collection genome info
  - Adding log messages
… files, and rename some constants, #TASK-5776, #TASK-5564
j-coll and others added 25 commits November 7, 2025 15:38
TASK-7809 - Upgrade avro version from 1.9.1 to 1.11.4
And fix checkstyle after merging
@j-coll j-coll requested a review from Copilot December 19, 2025 10:19
Copilot AI left a comment

Pull request overview

This pull request updates data sources for CellBase 6.2, involving a significant refactoring of the builder infrastructure and modernization of clinical variant processing.

Key Changes:

  • Refactored builder class hierarchy by replacing CellBaseBuilder with AbstractBuilder as the base class
  • Updated clinical variant indexers to handle new data formats and sources (ClinVar, COSMIC, CIViC, GWAS)
  • Added new builders for polygenic scores (PGS Catalog)
  • Enhanced gene annotation with additional data sources (imprinted genes, gene fusions from ChimerDB, gnomAD constraints)
  • Updated data source versions and file formats (e.g., UniProt to version 202502, new ClinVar XML structure)

Reviewed changes

Copilot reviewed 107 out of 226 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| CosmicBuilder.java | Changed parent class from CellBaseBuilder to AbstractBuilder |
| ClinicalVariantBuilder.java | Major refactoring: added file validation, updated to process new ClinVar/COSMIC/CIViC/GWAS formats with version tracking |
| ClinicalIndexer.java | Added version/assembly fields, updated normalization config, removed inner SequenceLocation class, added constant for original property ID |
| ClinVarParser.java | Changed parent class from CellBaseBuilder to AbstractBuilder |
| ClinVarIndexer.java | Updated to handle new ClinVar format with version parameter, added import for SequenceLocation, improved error handling |
| CivicIndexerCallback.java | New file implementing CIViC data indexing callback with evidence entry creation |
| CivicIndexer.java | New file for CIViC data source integration |
| VariationBuilder.java | Complete rewrite to handle VCF files for non-human species |
| SpliceBuilder.java | Updated parent class and constant references |
| RocksDbManager.java | Added methods for gene imprinting and gene fusion data retrieval, added missing imports |
| RevelScoreBuilder.java | Updated to use new data model ProteinSubstitutionPrediction, enhanced error handling |
| RepeatsBuilder.java | Added configuration-based file validation and support for multiple repeat data sources |
| RegulatoryRegionBuilder.java | File deleted (deprecated) |
| RegulatoryFeatureBuilder.java | Complete rewrite with PFM matrix download and new file format handling |
| RefSeqGeneBuilderIndexer.java | Simplified to delegate to common gene builder methods |
| RefSeqGeneBuilder.java | Major refactoring with configuration-based file validation and improved indexing |
| PubMedBuilder.java | Enhanced with configuration-based validation and improved logging |
| ProteinBuilder.java | Updated to UniProt 202502 format with InterPro integration and chunk processing |
| PolygenicScoreBuilder.java | New file for PGS Catalog data processing |
| OntologyBuilder.java | Refactored with configuration-based file validation |
| MiRTarBaseIndexer.java | New file extracting miRTarBase indexing logic |
| InteractionBuilder.java | Changed parent class to AbstractBuilder |
| GenomeSequenceFastaBuilder.java | Updated parent class and improved logging |
| GeneExpressionAtlasBuilder.java | Changed parent class to AbstractBuilder |
| GeneBuilderUtils.java | File deleted (deprecated) |
| GeneBuilderIndexer.java | Extensive additions for constraints, imprinted genes, gene fusions, and ChimerDB integration |
| GeneBuilder.java | Complete rewrite delegating to Ensembl and RefSeq gene builders |
| DbSnpBuilder.java | Updated constant name and parent class |
| CellBaseBuilder.java | File deleted (replaced by AbstractBuilder) |
| CaddAllAnnotationBuilder.java | Changed parent class to AbstractBuilder |
| pom.xml | Version bump to 6.7.0-SNAPSHOT, added dependencies for commons-compress and commons-csv |

Comment on lines +79 to +81
if (checked) {
    return;
}

Copilot AI Dec 19, 2025

The boolean field checked is referenced but not declared in this class. It should be declared as a class field (e.g., private boolean checked = false;) or inherited from AbstractBuilder.
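
One way to address this is sketched below, assuming the guard belongs in the shared base class so that every builder inherits it. The names SketchAbstractBuilder, SketchRefSeqBuilder, and doCheck are hypothetical and are not taken from the CellBase code; the real AbstractBuilder is not shown in this review, so it may already declare the field.

```java
// Hypothetical stand-in for the shared base class (not the real AbstractBuilder).
abstract class SketchAbstractBuilder {
    // Declared once here so that subclasses can reference it safely.
    protected boolean checked = false;

    // Runs the validation at most once; later calls return immediately.
    public final void check() {
        if (checked) {
            return;
        }
        doCheck();
        checked = true;
    }

    // Subclasses put their own file validation here.
    protected abstract void doCheck();
}

// Hypothetical subclass that counts how often its validation really runs.
class SketchRefSeqBuilder extends SketchAbstractBuilder {
    private int validations = 0;

    @Override
    protected void doCheck() {
        validations++;
    }

    int getValidations() {
        return validations;
    }
}
```

With this layout a subclass never touches the flag directly, so a missing declaration cannot occur in the subclass.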

Comment on lines +79 to +83
if (1 == 1) {
    CellBaseJsonFileSerializer refSeqGeneSerializer = new CellBaseJsonFileSerializer(buildPath, REFSEQ_GENE_BASENAME);
    this.refSeqGeneBuilder = new RefSeqGeneBuilder(downloadPath.resolve(REFSEQ_DATA), speciesConfiguration, configuration,
            refSeqGeneSerializer);
}

Copilot AI Dec 19, 2025

The condition if (1 == 1) is always true and suggests incomplete or placeholder code. This should either be replaced with a proper condition or the block should be unconditional.
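
A minimal sketch of the "proper condition" option, assuming the placeholder was meant to stand for a per-species support check. The class SketchGeneBuildPlan and the flag refSeqSupported are hypothetical; the real condition would have to come from the species configuration, which this review does not show.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plan of which gene builders to run for a species.
class SketchGeneBuildPlan {
    private final List<String> steps = new ArrayList<>();

    // refSeqSupported is an assumed flag replacing the always-true "1 == 1".
    SketchGeneBuildPlan(boolean refSeqSupported) {
        // Assumption for this sketch: the Ensembl step always runs.
        steps.add("ensembl");
        // The RefSeq step now depends on a named, meaningful condition.
        if (refSeqSupported) {
            steps.add("refseq");
        }
    }

    List<String> getSteps() {
        return steps;
    }
}
```

If RefSeq should always be built, the simpler fix is to drop the if entirely and keep the body unconditional.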

Comment on lines +91 to +93
if (1 == 1) {
    refSeqGeneBuilder.check();
}

Copilot AI Dec 19, 2025

The condition if (1 == 1) is always true and suggests incomplete or placeholder code. This should either be replaced with a proper condition or the block should be unconditional.

Comment on lines +105 to 111
if (1 == 1) {
if (!Files.exists(downloadPath.resolve(REFSEQ_DATA).resolve(REFSEQ_GENE_OUTPUT_FILENAME))) {
refSeqGeneBuilder.parse();
} else {
tabixReader = new TabixReader(tfbsFile.toAbsolutePath().toString(), tabixFile.toAbsolutePath().toString());
}

// Preparing the fasta file for fast accessing
// System.out.println("genomeSequenceFilePath.toString() = " + genomeSequenceFilePath.toString());
FastaIndex fastaIndex = new FastaIndex(genomeSequenceFilePath);

// Empty transcript and exon dictionaries
transcriptDict.clear();
exonDict.clear();
logger.info("Parsing gtf...");
GtfReader gtfReader = new GtfReader(gtfFile);

// Gene->Transcript->Feature->GTF line
Map<String, Map<String, Map<String, Object>>> gtfMap = null;
if (flexibleGTFParsing) {
gtfMap = loadGTFMap(gtfReader);
initializePointers(gtfMap);
}

Gtf gtf;
while ((gtf = getGTFEntry(gtfReader, gtfMap)) != null) {

if (gtf.getFeature().equals("gene") || gtf.getFeature().equals("transcript")
|| gtf.getFeature().equals("UTR") || gtf.getFeature().equals("Selenocysteine")) {
continue;
}

String geneId = gtf.getAttributes().get("gene_id");
String transcriptId = gtf.getAttributes().get("transcript_id");
String geneName = gtf.getAttributes().get("gene_name");
if (newGene(gene, geneId)) {
// If new geneId is different from the current then we must serialize before data new gene
if (gene != null) {
serializer.serialize(gene);
}

GeneAnnotation geneAnnotation = new GeneAnnotation(indexer.getExpression(geneId), indexer.getDiseases(geneName),
indexer.getDrugs(geneName), indexer.getConstraints(geneId), indexer.getMirnaTargets(geneName),
indexer.getCancerGeneCensus(geneName), indexer.getCancerHotspot(geneName));

gene = new Gene(geneId, geneName, gtf.getSequenceName().replaceFirst("chr", ""),
gtf.getStart(), gtf.getEnd(), gtf.getStrand(), gtf.getAttributes().get("gene_version"),
gtf.getAttributes().get("gene_biotype"), "KNOWN", SOURCE, indexer.getDescription(geneId),
new ArrayList<>(), indexer.getMirnaGene(transcriptId), geneAnnotation);
}

// Check if Transcript exist in the Gene Set of transcripts
if (!transcriptDict.containsKey(transcriptId)) {
transcript = getTranscript(gene, indexer, tabixReader, gtf, transcriptId);
} else {
transcript = gene.getTranscripts().get(transcriptDict.get(transcriptId));
}

// At this point gene and transcript objects are set up
// Update gene and transcript genomic coordinates, start must be the
// lower, and end the higher
updateTranscriptAndGeneCoords(transcript, gene, gtf);

String transcriptIdWithoutVersion = transcript.getId().split("\\.")[0];
if (gtf.getFeature().equalsIgnoreCase("exon")) {
// Obtaining the exon sequence
String exonId = gtf.getAttributes().get("exon_id") + "." + gtf.getAttributes().get("exon_version");
String exonSequence = fastaIndex.query(gtf.getSequenceName(), gtf.getStart(), gtf.getEnd());

exon = new Exon(exonId, gtf.getSequenceName().replaceFirst("chr", ""),
gtf.getStart(), gtf.getEnd(), gtf.getStrand(), 0, 0, 0, 0, 0, 0, -1, Integer.parseInt(gtf
.getAttributes().get("exon_number")), exonSequence);
transcript.getExons().add(exon);

exonDict.put(transcriptIdWithoutVersion + "_" + exon.getExonNumber(), exon);
if (gtf.getAttributes().get("exon_number").equals("1")) {
cdna = 1;
cds = 1;
} else {
// with every exon we update cDNA length with the previous exon length
cdna += exonDict.get(transcriptIdWithoutVersion + "_" + (exon.getExonNumber() - 1)).getEnd()
- exonDict.get(transcriptIdWithoutVersion + "_" + (exon.getExonNumber() - 1)).getStart() + 1;
}
} else {
exon = exonDict.get(transcriptIdWithoutVersion + "_" + exon.getExonNumber());
if (gtf.getFeature().equalsIgnoreCase("CDS")) {
// Protein ID is only present in CDS lines
String proteinId = gtf.getAttributes().get("protein_id") != null
? gtf.getAttributes().get("protein_id") + "." + gtf.getAttributes().get("protein_version")
: "";
transcript.setProteinId(proteinId);
transcript.setProteinSequence(indexer.getProteinFasta(proteinId));

if (gtf.getStrand().equals("+") || gtf.getStrand().equals("1")) {
// CDS states the beginning of coding start
exon.setGenomicCodingStart(gtf.getStart());
exon.setGenomicCodingEnd(gtf.getEnd());

// cDNA coordinates
exon.setCdnaCodingStart(gtf.getStart() - exon.getStart() + cdna);
exon.setCdnaCodingEnd(gtf.getEnd() - exon.getStart() + cdna);
// Set cdnaCodingEnd to prevent those cases without stop_codon

transcript.setCdnaCodingEnd(gtf.getEnd() - exon.getStart() + cdna);
exon.setCdsStart(cds);
exon.setCdsEnd(gtf.getEnd() - gtf.getStart() + cds);

// increment in the coding length
cds += gtf.getEnd() - gtf.getStart() + 1;
transcript.setCdsLength(cds - 1); // Set cdnaCodingEnd to prevent those cases without stop_codon

exon.setPhase(Integer.parseInt(gtf.getFrame()));

if (transcript.getGenomicCodingStart() == 0 || transcript.getGenomicCodingStart() > gtf.getStart()) {
transcript.setGenomicCodingStart(gtf.getStart());
}
if (transcript.getGenomicCodingEnd() == 0 || transcript.getGenomicCodingEnd() < gtf.getEnd()) {
transcript.setGenomicCodingEnd(gtf.getEnd());
}
// only first time
if (transcript.getCdnaCodingStart() == 0) {
transcript.setCdnaCodingStart(gtf.getStart() - exon.getStart() + cdna);
}
// strand -
} else {
// CDS states the beginning of coding start
exon.setGenomicCodingStart(gtf.getStart());
exon.setGenomicCodingEnd(gtf.getEnd());
// cDNA coordinates
// cdnaCodingStart points to the same base position than genomicCodingEnd
exon.setCdnaCodingStart(exon.getEnd() - gtf.getEnd() + cdna);
// cdnaCodingEnd points to the same base position than genomicCodingStart
exon.setCdnaCodingEnd(exon.getEnd() - gtf.getStart() + cdna);
// Set cdnaCodingEnd to prevent those cases without stop_codon
transcript.setCdnaCodingEnd(exon.getEnd() - gtf.getStart() + cdna);
exon.setCdsStart(cds);
exon.setCdsEnd(gtf.getEnd() - gtf.getStart() + cds);

// increment in the coding length
cds += gtf.getEnd() - gtf.getStart() + 1;
transcript.setCdsLength(cds - 1); // Set cdnaCodingEnd to prevent those cases without stop_codon
exon.setPhase(Integer.parseInt(gtf.getFrame()));

if (transcript.getGenomicCodingStart() == 0 || transcript.getGenomicCodingStart() > gtf.getStart()) {
transcript.setGenomicCodingStart(gtf.getStart());
}
if (transcript.getGenomicCodingEnd() == 0 || transcript.getGenomicCodingEnd() < gtf.getEnd()) {
transcript.setGenomicCodingEnd(gtf.getEnd());
}
// only first time
if (transcript.getCdnaCodingStart() == 0) {
// cdnaCodingStart points to the same base position than genomicCodingEnd
transcript.setCdnaCodingStart(exon.getEnd() - gtf.getEnd() + cdna);
}
}

}
// if (gtf.getFeature().equalsIgnoreCase("start_codon")) {
// // nothing to do
// System.out.println("Empty block, this should be redesigned");
// }
if (gtf.getFeature().equalsIgnoreCase("stop_codon")) {
// setCdnaCodingEnd = false; // stop_codon found, cdnaCodingEnd will be set here,
// no need to set it at the beginning of next feature
if (exon.getStrand().equals("+")) {
updateStopCodingDataPositiveExon(exon, cdna, cds, gtf);

cds += gtf.getEnd() - gtf.getStart();
// If stop_codon appears, overwrite values
transcript.setGenomicCodingEnd(gtf.getEnd());
transcript.setCdnaCodingEnd(gtf.getEnd() - exon.getStart() + cdna);
transcript.setCdsLength(cds - 1);

} else {
updateNegativeExonCodingData(exon, cdna, cds, gtf);

cds += gtf.getEnd() - gtf.getStart();
// If stop_codon appears, overwrite values
transcript.setGenomicCodingStart(gtf.getStart());
// cdnaCodingEnd points to the same base position than genomicCodingStart
transcript.setCdnaCodingEnd(exon.getEnd() - gtf.getStart() + cdna);
transcript.setCdsLength(cds - 1);
}
}
}
}

// last gene must be serialized
serializer.serialize(gene);

// cleaning
gtfReader.close();
serializer.close();
fastaIndex.close();
indexer.close();
} catch (Exception e) {
indexer.close();
throw e;
}
}

private Transcript getTranscript(Gene gene, EnsemblGeneBuilderIndexer indexer, TabixReader tabixReader, Gtf gtf, String transcriptId)
throws IOException, RocksDBException {
Map<String, String> gtfAttributes = gtf.getAttributes();

// To match Ensembl, we set the ID as transcript+version. This also matches the Ensembl website.
String transcriptIdWithVersion = transcriptId + "." + gtfAttributes.get("transcript_version");
String biotype = gtfAttributes.get("transcript_biotype") != null ? gtfAttributes.get("transcript_biotype") : "";
String transcriptChromosome = gtf.getSequenceName().replaceFirst("chr", "");
List<TranscriptTfbs> transcriptTfbses = getTranscriptTfbses(gtf, transcriptChromosome, tabixReader);

List<FeatureOntologyTermAnnotation> ontologyAnnotations = getOntologyAnnotations(indexer.getXrefs(transcriptId), indexer);
TranscriptAnnotation transcriptAnnotation = new TranscriptAnnotation(ontologyAnnotations, indexer.getConstraints(transcriptId));

Transcript transcript = new Transcript(transcriptIdWithVersion, gtfAttributes.get("transcript_name"), transcriptChromosome,
gtf.getStart(), gtf.getEnd(), gtf.getStrand(), biotype, "KNOWN",
0, 0, 0, 0, 0,
indexer.getCdnaFasta(transcriptIdWithVersion), "", "", "",
gtfAttributes.get("transcript_version"), SOURCE, new ArrayList<>(), indexer.getXrefs(transcriptId), transcriptTfbses,
new HashSet<>(), transcriptAnnotation);

// Adding Ids appearing in the GTF to the xrefs is required, since for some unknown reason the ENSEMBL
// Perl API often doesn't return all genes resulting in an incomplete xrefs.txt file. We must ensure
// that the xrefs array contains all ids present in the GTF file
addGtfXrefs(transcript, gene, gtfAttributes);

// Add HGNC ID mappings, with this we can know which Ensembl and Refseq transcripts match to HGNC ID
String hgncId = indexer.getHgncId(gene.getName());
if (StringUtils.isNotEmpty(hgncId)) {
transcript.getXrefs().add(new Xref(hgncId, "hgnc_id", "HGNC ID"));
}

// Add MANE Select mappings, with this we can know which Ensembl and Refseq transcripts match according to MANE
for (String suffix: Arrays.asList("refseq", "refseq_protein")) {
String maneRefSeq = indexer.getMane(transcriptIdWithVersion, suffix);
if (StringUtils.isNotEmpty(maneRefSeq)) {
transcript.getXrefs().add(new Xref(maneRefSeq, "mane_select_" + suffix,
"MANE Select RefSeq" + (suffix.contains("_") ? " Protein" : "")));
logger.info(DATA_ALREADY_BUILT, getDataName(REFSEQ_DATA) + " gene");
}
}

Copilot AI Dec 19, 2025

The condition if (1 == 1) is always true and suggests incomplete or placeholder code. This should either be replaced with a proper condition or the block should be unconditional.
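
For this hunk, the intent appears to be "parse only when the output file is missing, otherwise log that the data is already built". A sketch of that logic without the placeholder wrapper follows; SketchParseOrSkip and parseIfMissing are hypothetical names, and the Runnable stands in for the real parse call.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper showing the parse-or-skip decision on its own.
class SketchParseOrSkip {
    // Returns true if the parser ran, false if the output file already existed.
    static boolean parseIfMissing(Path outputFile, Runnable parser) {
        if (Files.exists(outputFile)) {
            // Mirrors the "already built" log message in the reviewed code.
            System.out.println("Data already built: " + outputFile.getFileName());
            return false;
        }
        parser.run();
        return true;
    }
}
```

Returning a boolean is a choice made for this sketch so the outcome can be tested; the reviewed code only logs.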

Comment on lines +127 to 130
if (1 == 1) {
    if (isHSapiens || isDataSupported(configuration.getDownload().getManeSelect(), prefixId)) {
        dataList.add(MANE_SELECT_DATA);
    }

Copilot AI Dec 19, 2025

The condition if (1 == 1) is always true and suggests incomplete or placeholder code. This should either be replaced with a proper condition or the block should be unconditional.

@gpveronica gpveronica requested a review from dapregi December 19, 2025 12:05
dapregi previously approved these changes Dec 19, 2025
@jtarraga jtarraga dismissed dapregi’s stale review December 19, 2025 12:35

The merge-base changed after approval.
