@kaysiz kaysiz commented May 30, 2025

Summary

This PR adds a Python script that processes EuropePMC CSV files and retrieves DOI information. The script maps PMCIDs to DOIs, standardizes repository names, and formats the data for ingestion into our Corpus database.

Key Features

  • Maps PMCIDs to DOIs using three sources in order of preference:
    1. Local DOI mapping file
    2. In-memory cache to avoid duplicate API calls
    3. EuropePMC API with rate limiting
  • Fetches repository information for each DOI from the DataCite API (with rate limiting)
  • Implements robust error handling and timeout protection
  • Creates standardized CSV output with three columns:
    • repository: Standardized repository name
    • dataset: Dataset ID
    • publication: Full DOI URL
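
The lookup order and output shape listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `eupmc_reformat_csv.py`: the names `resolve_doi`, `format_row`, and `fetch_remote` are hypothetical, and the `https://doi.org/` prefix is an assumption about what "Full DOI URL" means.

```python
from typing import Callable, Dict, Optional


def resolve_doi(
    pmcid: str,
    local_map: Dict[str, str],
    cache: Dict[str, Optional[str]],
    fetch_remote: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the DOI for a PMCID, trying the cheapest source first."""
    # 1. Local DOI mapping file, assumed to be loaded into a dict beforehand.
    if pmcid in local_map:
        return local_map[pmcid]
    # 2. In-memory cache; a cached None records an earlier failed lookup,
    #    so the remote API is not asked again for the same PMCID.
    if pmcid in cache:
        return cache[pmcid]
    # 3. Remote lookup (the EuropePMC API in this PR); store the result either way.
    doi = fetch_remote(pmcid)
    cache[pmcid] = doi
    return doi


def format_row(repository: str, dataset: str, doi: str) -> Dict[str, str]:
    """Build one output row with the three columns described above."""
    # Assumption: the publication column is the DOI behind the doi.org resolver.
    return {
        "repository": repository,
        "dataset": dataset,
        "publication": f"https://doi.org/{doi}",
    }
```

Passing the remote lookup in as a callable keeps the ordering logic testable without network access; a real run would supply a function that calls the EuropePMC API.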

Technical Details

  • Added request timeout handling to prevent script hanging
  • Implemented rate limiting for both EuropePMC (10 req/sec) and DataCite (50 req/sec) APIs
  • Created persistent caching to improve performance on subsequent runs
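
One way to enforce per-API request rates like those above is a minimum-interval limiter. The sketch below is an assumption about the approach, not the script's actual implementation; the class name `RateLimiter` and its injectable `clock` and `sleep` parameters are hypothetical.

```python
import time
from typing import Callable


class RateLimiter:
    """Spaces calls so that at most `max_per_second` happen per second."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may proceed

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.min_interval
        return max(delay, 0.0)


# One limiter per API, using the rates quoted in this PR.
europepmc_limiter = RateLimiter(10)
datacite_limiter = RateLimiter(50)
```

Each request would be preceded by a call to the matching limiter's `wait()`. If the script uses the `requests` library (an assumption), the timeout protection corresponds to passing `timeout=` to each call, since `requests` otherwise waits indefinitely for a response.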

@kaysiz kaysiz self-assigned this May 30, 2025
@kaysiz kaysiz marked this pull request as ready for review June 8, 2025 17:10
@kaysiz kaysiz requested a review from Copilot June 8, 2025 17:10
Copilot AI left a comment

Pull Request Overview

This PR adds tools to download, process, and reformat EuropePMC CSV data into a standardized CSV for ingestion into the Corpus database.

  • Introduces a JSON mapping for repository name standardization
  • Provides a Python script to map PMCIDs to DOIs, fetch publisher info, and format output
  • Adds a shell script to download raw EuropePMC CSVs and DOI metadata
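
As a rough illustration of how a JSON mapping can drive repository name standardization, the sketch below assumes a flat object mapping name variants to standard names. The actual structure of `repository_mapping.json` is not shown in this PR, and the function names and sample entries are hypothetical.

```python
import json
from typing import Dict


def load_mapping(text: str) -> Dict[str, str]:
    """Parse a JSON object of {variant: standard_name} with normalized keys."""
    raw = json.loads(text)
    # Keys are lowercased and stripped so lookups ignore case and padding.
    return {key.strip().lower(): value for key, value in raw.items()}


def standardize(name: str, mapping: Dict[str, str]) -> str:
    """Return the standard repository name, or the cleaned input if unmapped."""
    cleaned = name.strip()
    return mapping.get(cleaned.lower(), cleaned)


# Hypothetical sample entries, for illustration only.
sample = '{"pdbe": "PDB", "Protein Data Bank": "PDB"}'
mapping = load_mapping(sample)
print(standardize(" PDBe ", mapping))
```

Falling back to the cleaned input keeps unmapped repositories visible in the output instead of dropping them silently.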

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| corpus-v4/repository_mapping.json | Added standardized repository name mappings |
| corpus-v4/eupmc_reformat_csv.py | New script for processing EuropePMC CSVs, DOI mapping, and formatting |
| corpus-v4/eupmc_file_downloader.sh | Shell script to download raw EuropePMC CSV files and metadata |

Comments suppressed due to low confidence (1)

corpus-v4/eupmc_reformat_csv.py:1

  • This new script contains substantial logic (DOI mapping, API cache, CSV formatting) but lacks unit tests; consider adding tests for key functions.
#!/usr/bin/env python3
