Mike Trizna edited this page Jun 5, 2015 · 25 revisions

The goal of this project is to prototype an automated workflow that enriches all existing iDigBio specimen records with, or links them to, external taxonomies, ontologies, or vocabularies.

The result makes it possible to answer requests like:

Give me all iDigBio records that correspond to a fungal taxon with MycoBank ID 123.

Give me all names that did not match (e.g. because of a misspelling or typo) against either MycoBank or any of the Global Names data sources.

Give me all iDigBio records that contain outdated names.

Give me all records that might describe a species interaction.

Sequence Diagram

[Image: sequence diagram]

Components

| Component | Name | Status |
| --- | --- | --- |
| Archive Processor | Jorrit | |
| Name Normalizer | Dima | |
| GUID Generator | Dima | |
| Global Names Resolver | Dima | |
| MycoBank Resolver | Scott | |
| iDigBio LD | ? | |
| iDigBio LD Web App | John | |

Technologies

Python, Node.js, Apache Spark or similar.

Pseudo code for processors

Something like:

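def process(specimen_record: dict) -> dict:
    """Return a copy of the record enriched with an external identifier."""
    enriched_record = specimen_record.copy()

    # Do something like this; lookup() stands in for a resolver
    # (e.g. Global Names or MycoBank) and is not defined here.
    some_name = specimen_record.get("dwc:scientificName")
    enriched_record.update({"external:id": lookup(some_name)})
    return enriched_record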
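
Since every processor takes a record and returns an enriched copy, several processors can in principle be chained. The following is only a minimal sketch of that idea: `normalize_name`, `run_pipeline`, and the `normalized:scientificName` key are hypothetical names chosen for illustration, not part of the project.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Processor = Callable[[Record], Record]


def normalize_name(record: Record) -> Record:
    """Hypothetical processor: collapse extra whitespace in the scientific name."""
    enriched = record.copy()
    name = record.get("dwc:scientificName")
    if name is not None:
        # The output key is an assumption made for this sketch.
        enriched["normalized:scientificName"] = " ".join(name.split())
    return enriched


def run_pipeline(record: Record, processors: List[Processor]) -> Record:
    """Apply each processor in order; each one returns a new, enriched copy."""
    for processor in processors:
        record = processor(record)
    return record


if __name__ == "__main__":
    specimen = {"dwc:scientificName": "  Homo   sapiens "}
    print(run_pipeline(specimen, [normalize_name]))
```

Because each step copies its input, the original specimen record is left untouched, which keeps individual processors easy to test in isolation.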

UUID version 5

We create version 5 UUIDs from scientific name strings and use them as identifiers for those strings.
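
As a rough illustration, a version 5 UUID can be derived with Python's standard `uuid` module. A version 5 UUID is a hash of a namespace UUID plus a name, so the same name string always yields the same identifier. The namespace used below (derived from the DNS name `globalnames.org`) is an assumption made for this sketch, and `name_uuid` is a hypothetical helper; the namespace actually used by the project should be checked against the resources linked below.

```python
import uuid

# Assumption: a namespace derived from a DNS name. The value actually used
# by the project may differ.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")


def name_uuid(name_string: str) -> uuid.UUID:
    """Return a deterministic version 5 UUID for a scientific name string."""
    return uuid.uuid5(NAMESPACE, name_string)


if __name__ == "__main__":
    print(name_uuid("Homo sapiens"))
```

Note that the identifier depends on the exact string: the same name with different authorship, spelling, or whitespace produces a different UUID.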

UUID v5 blog post

UUID v5 examples

UUIDs and URLs

| Source | UUID | URL |
| --- | --- | --- |
| GlobalNames | 16f235a0-e4a3-529c-9b83-bd15fe722110 | http://gni.globalnames.org/name_strings/16f235a0-e4a3-529c-9b83-bd15fe722110 |
| GlobalNames | 813583ad-c364-5c15-b01a-43eaa1446fee | http://gni.globalnames.org/name_strings/813583ad-c364-5c15-b01a-43eaa1446fee |
| GBIF | 215 | http://www.gbif.org/species/215 |
| GBIF Image | GBIF:215 | http://api.globalbioticinteractions.org/images/GBIF:215 |
| iDigBio | 00f8efa0-75ee-45c1-a88d-8a853705c6dd | http://beta-search.idigbio.org/v2/view/records/00f8efa0-75ee-45c1-a88d-8a853705c6dd |
| GenBank | 9a7d8ad8-60ec-48a0-9b36-a9cc0cf0b223 | http://www.ncbi.nlm.nih.gov/nuccore/?term=AY803322+OR+AV50248+OR+HM583371 |
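
Most of the rows above follow a simple pattern: a fixed URL prefix followed by the identifier. A minimal sketch of that mapping is shown here; the `URL_TEMPLATES` dictionary and the `identifier_url` helper are hypothetical names, and the templates are simply read off the example rows. The GenBank row is left out because its URL is built from accession numbers rather than from the UUID.

```python
# URL templates inferred from the example rows; "{id}" marks where the
# identifier is substituted.
URL_TEMPLATES = {
    "GlobalNames": "http://gni.globalnames.org/name_strings/{id}",
    "GBIF": "http://www.gbif.org/species/{id}",
    "GBIF Image": "http://api.globalbioticinteractions.org/images/{id}",
    "iDigBio": "http://beta-search.idigbio.org/v2/view/records/{id}",
}


def identifier_url(source: str, identifier: str) -> str:
    """Build the URL for an identifier from a known source.

    Raises KeyError if the source has no template.
    """
    return URL_TEMPLATES[source].format(id=identifier)


if __name__ == "__main__":
    print(identifier_url("GBIF", "215"))
```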

GenBank Accession extraction

Here is the regular expression for extracting GenBank accession numbers:

[a-zA-Z]{1,2}\-?_?\d{5,6}

| iDigBio UUID | dwc:associatedSequences field | Extracted Accessions | NCBI Search Link |
| --- | --- | --- | --- |
| 4c5f122d-4686-4514-bf94-c38ecb4e98ab | GenBank FJ266907 (cytb) GenBank FJ267193 (ND4) | FJ266907\|FJ267193 | http://www.ncbi.nlm.nih.gov/nuccore/?term=FJ266907+OR+FJ267193 |
| 4d4b08ca-a552-481b-b3c4-7819548880eb | http://www.ncbi.nlm.nih.gov/nuccore/AF285919 ; http://www.ncbi.nlm.nih.gov/nuccore/AF285941 | AF285919\|AF285941 | http://www.ncbi.nlm.nih.gov/nuccore/?term=AF285919+OR+AF285941 |
| 0b089c97-e451-4f0d-a8ea-940582096f38 | , , , , , , , | null | null |
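
The extraction in the table above can be reproduced with a few lines of Python. This is only a sketch: the regular expression and the NCBI search URL prefix are taken from this page, while the function names `extract_accessions` and `ncbi_search_link` are hypothetical.

```python
import re
from typing import List, Optional

# Regular expression from this page for GenBank accession numbers.
ACCESSION_RE = re.compile(r"[a-zA-Z]{1,2}\-?_?\d{5,6}")

# Search URL prefix as it appears in the example rows.
NCBI_SEARCH_PREFIX = "http://www.ncbi.nlm.nih.gov/nuccore/?term="


def extract_accessions(associated_sequences: Optional[str]) -> List[str]:
    """Return all accession-like substrings of a dwc:associatedSequences value."""
    return ACCESSION_RE.findall(associated_sequences or "")


def ncbi_search_link(accessions: List[str]) -> Optional[str]:
    """Join accessions with +OR+ into a search link; None if there are none."""
    if not accessions:
        return None
    return NCBI_SEARCH_PREFIX + "+OR+".join(accessions)


if __name__ == "__main__":
    field = "GenBank FJ266907 (cytb) GenBank FJ267193 (ND4)"
    accessions = extract_accessions(field)
    print("|".join(accessions))
    print(ncbi_search_link(accessions))
```

One caveat of the pattern as written: `\d{5,6}` accepts at most six digits, so an accession with a longer numeric part would be captured only up to its sixth digit.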

Files

iDigBio archive

Crossmap of iDigBio with GBIF
