serptools/ecosystem

A. Source map (what to scrape for each silo)

| Schema | Primary sources (stable) | What you can pull cleanly | Notes / gaps |
| --- | --- | --- | --- |
| filetype (ext, name, category, mime, magic, open_with, related) | IANA media types, mime-db (npm), freedesktop/shared-mime-info XML, PRONOM/DROID (signature files), libmagic database, Wikidata | ext↔MIME, canonical names, categories; magic signatures (PRONOM/libmagic); aliases; common apps (Wikidata) | FileInfo is useful but semi-structured; use it last for titles/“how to open” blurbs |
| mime | mime-db JSON (GitHub/npm), IANA registries | type/subtype, extensions, notes | mime-db already merges IANA + community; treat as ground truth for ext lists |
| magic_signature | PRONOM (DROID ZIP/XML), libmagic text db, community file-signature DB repos | hex patterns, offsets, description, associated extensions | PRONOM is very complete but uses bureaucratic IDs; libmagic is pragmatic for detection |
| container | FFmpeg docs/ffprobe -formats, Matroska & MP4 official docs, Wikipedia container pages | extensions, MIME, supported stream types | FFmpeg reports read/write/pipe support; pair with official specs for MIME |
| codec | ffmpeg -codecs / docs, Wikipedia codec pages, AOM/SVT/x264/x265 repos | names, kind (video/audio), common containers, profiles/levels, hw support | HW support is best taken from vendor docs (NVIDIA/Intel/Apple); keep limited to “common” |
| software (openers/handlers) | Wikidata (SPARQL), app vendor pages, chocolatey/homebrew formulae | app name, platforms, homepage, supported extensions | Wikidata has many app→filetype relations |
| support (browser/OS) | Can I Use (for AVIF/WebP etc.), Apple/Android docs, MDN | “yes/partial/no” by OS/browser | Don’t over-promise; use “partial” when a decoder exists but the UI is missing |
| subtitle/archives/raw clusters | Wikipedia format pages, Matroska specs, vendor docs | format descriptions, typical containers | Good for cross-linking (“VTT in HLS”) |
| manifests | HLS/DASH specs, MDN, player docs (Shaka/HLS.js) | tags/attributes and examples | More textual than tabular; scrape for examples, not truth tables |

B. First-pass ETL plan (fast + reproducible)

  1. Bootstrap with machine-readable sources

    • mime-db → seed MIME ↔ extensions (single JSON).
    • shared-mime-info → parse XML for categories + magic patterns.
    • PRONOM/DROID → unzip signature files; extract hex patterns + offsets + PUIDs.
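    • libmagic → parse the magic text database (often /usr/share/file/magic or /usr/share/misc/magic, depending on the distribution); use as a secondary source.
    • ffprobe → programmatically list containers/codecs from your own FFmpeg build (ffprobe -formats, -codecs, -protocols). These flags print plain-text tables and are not affected by -of json, so capture the text and parse it.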
  2. Enrich with semi-structured sources

    • Wikidata → SPARQL queries for “software X supports extension Y”, “format family”, “developer”, etc.
    • Wikipedia → per-page infobox scrape for missing descriptions/aliases (cache and manual review).
  3. Human-curate deltas

    • Where sources disagree, keep priority order: IANA/mime-db > shared-mime > PRONOM/libmagic > Wikidata > Wikipedia > FileInfo.
    • Open a “review” sheet for oddities (e.g., .bin overlapping meanings).
  4. Normalization rules

    • ext lowercase, no dot.
    • mime unique, lowercase; prefer mime-db entries.
    • category map from shared-mime-info <generic-icon> names (image-x-generic, video-x-generic, audio-x-generic, text-x-generic, … → collapse to your enum).
    • magic store as normalized hex with wildcards; keep source in notes.
    • open_with: cap to 3–5 popular, per OS; source = Wikidata/vendor.
  5. Versioning

    • Store raw snapshots in /sources/… with dates.
    • Build a deterministic pipeline (same inputs → same JSONs). Emit a build_id with timestamps + git SHA.
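
The normalization rules in step 4 can be sketched as a few pure helper functions. This is a minimal illustration, not the repository's actual code: the function names and the `??` wildcard convention are assumptions introduced here.

```python
import re

def normalize_ext(ext: str) -> str:
    """Lowercase and drop leading dots: '.HEIC' -> 'heic'."""
    return ext.strip().lstrip(".").lower()

def normalize_mime(mime: str) -> str:
    """Lowercase and drop parameters: 'Image/HEIC; charset=binary' -> 'image/heic'."""
    return mime.split(";", 1)[0].strip().lower()

def normalize_hex(pattern: str) -> str:
    """Return uppercase, space-separated byte pairs; '??' marks a wildcard byte (assumed convention)."""
    # Remove '0x' / '\x' prefixes and all whitespace, then uppercase.
    compact = re.sub(r"(0x|\\x|\s)", "", pattern, flags=re.IGNORECASE).upper()
    if len(compact) % 2 or not re.fullmatch(r"([0-9A-F]{2}|\?\?)*", compact):
        raise ValueError(f"not a hex pattern: {pattern!r}")
    return " ".join(compact[i:i + 2] for i in range(0, len(compact), 2))
```

Keeping these as pure functions means every parser can call them before joining, so keys from different sources compare equal.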

C. Concrete pulls you can implement immediately

  • mime-db (JS/JSON): gives you mime.full and extensions[]. Map into mime.schema.json and use reverse index to tee up filetype.mime.

  • shared-mime-info (XML db):

    • Fields: <mime-type type="image/heic">, <glob pattern="*.heic"/>, <magic> with <match value="..." offset="...">.
    • Use this to fill filetype.magic[], category, and alternate extensions.
  • PRONOM/DROID (signature files):

    • XML with byte sequences (Pos, ByteSequenceValue) and PUIDs.
    • Perfect for magic_signature.schema.json; map PUID → id, include extensions.
  • ffprobe (your build):

    • ffprobe -hide_banner -formats → mux/demux flags for containers.
    • ffprobe -hide_banner -codecs → codec names + decoders/encoders.
    • ffprobe -hide_banner -protocols → protocols list.
    • These flags print plain-text tables (-of json only affects probe output such as -show_format), so parse the text. Populate container, codec, and your taxonomy buckets 10–11–14.
  • Wikidata SPARQL (JSON results, no scraping):

    • Query: apps that open a given extension; or file formats with filename extension “.heic”.
    • Populates software.handles_extensions[] and filetype.open_with[].

D. Minimal extractor specs (so your scrapers are small)

  • I/O: always save raw → /sources/{provider}/{date}/… (don’t parse in-place).
  • Parser: pure functions from raw to normalized records matching your schemas.
  • Joiners: merge by normalized keys (ext, mime.full) with priority rules.
  • Emitted: one JSON per entity type (/build/filetypes.json, mimes.json, …) and (optionally) one JSON-per-item for static page generation.
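
A joiner that applies the priority order from section B might look like the sketch below. The source labels, the `(source, key, fields)` record shape, and the "first non-empty value wins" policy are assumptions made for illustration; list-valued fields could instead be unioned.

```python
# Lower rank = higher priority, following:
# IANA/mime-db > shared-mime > PRONOM/libmagic > Wikidata > Wikipedia > FileInfo.
# The short source labels are hypothetical.
RANK = {
    "iana": 0, "mime-db": 0,
    "shared-mime-info": 1,
    "pronom": 2, "libmagic": 2,
    "wikidata": 3,
    "wikipedia": 4,
    "fileinfo": 5,
}

def merge_records(records):
    """Merge (source, key, fields) tuples; return (merged, origin).

    merged maps key -> fields, origin maps key -> {field: winning source}.
    """
    merged, origin = {}, {}
    # sorted() is stable, so records of equal rank keep their input order.
    for source, key, fields in sorted(records, key=lambda r: RANK[r[0]]):
        target = merged.setdefault(key, {})
        for name, value in fields.items():
            if value in (None, "", []):
                continue  # an empty value never blocks a lower-priority source
            if name not in target:
                target[name] = value
                origin.setdefault(key, {})[name] = source
    return merged, origin
```

Recording the winning source per field gives the "review" sheet something concrete to audit when sources disagree.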

E. Example: tiny pipelines (pseudo-Python)

mime-db → mime + filetype seeds

import requests

DB_URL = "https://raw.githubusercontent.com/jshttp/mime-db/master/db.json"
resp = requests.get(DB_URL, timeout=30)
resp.raise_for_status()  # fail loudly instead of parsing an error page
db = resp.json()

mimes, ext_to_mimes = [], {}
for full, meta in db.items():
    mime_type, _, subtype = full.partition("/")
    exts = [e.lower() for e in meta.get("extensions", [])]
    mimes.append({"type": mime_type, "subtype": subtype, "full": full, "extensions": exts})
    for e in exts:
        ext_to_mimes.setdefault(e, set()).add(full)  # reverse index: ext -> MIME types

# seed filetypes from ext_to_mimes; sort so repeated builds emit identical output
filetypes = [
    {"id": ext, "ext": ext, "name": f".{ext.upper()} file", "category": "other", "mime": sorted(m)}
    for ext, m in sorted(ext_to_mimes.items())
]

ffprobe → containers/codecs

# -formats/-codecs/-protocols print plain-text tables; -of json does not apply to them
ffprobe -hide_banner -formats   > formats.txt
ffprobe -hide_banner -codecs    > codecs.txt
ffprobe -hide_banner -protocols > protocols.txt
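
Since the format listing is a text table, a small parser is needed before it can populate the container schema. The sketch below assumes the layout printed by recent FFmpeg builds: a legend, a separator line of dashes, then rows with one flag column each for demuxing and muxing (some newer builds appear to add a third "device" column, which the pattern tolerates). `parse_formats` and `list_formats` are hypothetical names.

```python
import re
import subprocess

# Leading space, D flag, E flag, optional third flag, a space, name(s), description.
ROW = re.compile(r"^ ([D ])([E ])([d ]?) (\S+)\s*(.*)$")

def parse_formats(text: str) -> list[dict]:
    """Parse the text printed by `ffmpeg -formats` into one record per format name."""
    rows, in_table = [], False
    for line in text.splitlines():
        if not in_table:
            # Everything up to the dashed separator is heading and legend.
            in_table = line.strip().startswith("--")
            continue
        m = ROW.match(line)
        if not m:
            continue
        demux, mux, _device, names, desc = m.groups()
        for name in names.split(","):  # e.g. "matroska,webm" lists two names
            rows.append({
                "name": name,
                "demux": demux == "D",
                "mux": mux == "E",
                "description": desc.strip(),
            })
    return rows

def list_formats(binary: str = "ffmpeg") -> list[dict]:
    """Run the local FFmpeg build and parse its format table."""
    out = subprocess.run(
        [binary, "-hide_banner", "-formats"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_formats(out)
```

Parsing is kept separate from the subprocess call so the parser can be tested against a saved snapshot under /sources/.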

shared-mime-info XML → magic

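from lxml import etree

# shared-mime-info elements live in this namespace; un-namespaced lookups match nothing
NS = {"smi": "http://www.freedesktop.org/standards/shared-mime-info"}
root = etree.parse("freedesktop.org.xml")
sigs = []
for mt in root.findall(".//smi:mime-type", NS):
    t = mt.get("type")
    # keep simple "*.ext" globs only; removeprefix (Python 3.9+) avoids lstrip's per-character stripping
    exts = [g.get("pattern").removeprefix("*.").lower()
            for g in mt.findall("smi:glob", NS) if g.get("pattern", "").startswith("*.")]
    for m in mt.findall(".//smi:magic//smi:match", NS):
        value = m.get("value", "")
        # offset may be a range such as "0:64", so split it before converting
        start, _, end = m.get("offset", "0").partition(":")
        sigs.append({
            "id": f"{t}-{value[:8]}",
            "type": m.get("type"),   # string, big32, little16, ...; the value is not always hex
            "value": value,          # raw value; convert to normalized hex in a later step
            "offset": int(start),
            "offset_end": int(end) if end else int(start),
            "meaning": t,
            "extensions": exts,
        })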
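
Match values in shared-mime-info are typed (string, big32, little16, and so on) and string values use C-style escapes, so storing magic as normalized hex needs a conversion step. The following is a sketch under those assumptions; `match_to_hex` is a hypothetical helper, it ignores the optional mask attribute, and it returns None for types it does not handle (for example host-endian ones).

```python
# Byte width and byte order for the numeric match types handled here.
SIZES = {
    "byte": (1, "big"),
    "big16": (2, "big"),
    "big32": (4, "big"),
    "little16": (2, "little"),
    "little32": (4, "little"),
}

def match_to_hex(kind: str, value: str):
    """Return 'AA BB ...' for a <match> type/value pair, or None if unsupported."""
    try:
        if kind == "string":
            # Decode escapes such as \x89, \211 or \n into raw bytes.
            raw = value.encode("latin-1").decode("unicode_escape").encode("latin-1")
        elif kind in SIZES:
            size, order = SIZES[kind]
            raw = int(value, 0).to_bytes(size, order)  # accepts "0x..." and decimal
        else:
            return None
    except (UnicodeError, ValueError, OverflowError):
        return None  # leave unparseable values for manual review
    return raw.hex(" ").upper()
```

Returning None instead of raising keeps the pipeline running and leaves the odd cases for the review sheet.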

Wikidata SPARQL (apps that open HEIC)

SELECT ?app ?appLabel WHERE {
  ?fmt wdt:P1195 "heic".          # filename extension
  ?app wdt:P1072 ?fmt.            # software supports file format
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
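
One way to run this query from the pipeline is sketched below using only the standard library. It assumes the public Wikidata Query Service endpoint and its standard SPARQL JSON result layout; the function names and the User-Agent string are placeholders, not part of this repository.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?app ?appLabel WHERE {
  ?fmt wdt:P1195 "%s".
  ?app wdt:P1072 ?fmt.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def build_url(ext: str) -> str:
    """Build the GET URL for one extension; reject anything that is not alphanumeric."""
    if not ext.isalnum():
        raise ValueError(f"unexpected extension: {ext!r}")
    params = {"query": QUERY % ext.lower(), "format": "json"}
    return ENDPOINT + "?" + urllib.parse.urlencode(params)

def parse_apps(payload: dict) -> list[tuple[str, str]]:
    """Extract sorted (QID, label) pairs from a SPARQL JSON result."""
    pairs = set()
    for b in payload["results"]["bindings"]:
        qid = b["app"]["value"].rsplit("/", 1)[-1]  # entity URI ends with the QID
        pairs.add((qid, b.get("appLabel", {}).get("value", qid)))
    return sorted(pairs)

def fetch_apps(ext: str, user_agent: str = "filetype-etl-sketch/0.1 (placeholder contact)"):
    """Query the live endpoint; send a descriptive User-Agent and cache the raw response."""
    req = urllib.request.Request(build_url(ext), headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_apps(json.load(resp))
```

Sorting the pairs keeps the emitted open_with lists stable between builds.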

F. Filling your schemas (coverage matrix)

  • filetype.ext/name/mime/category/magic/open_with/related: mime-db + shared-mime-info + PRONOM + Wikidata
  • codec.kind/profiles/common_containers: ffprobe + Wikipedia
  • container.extensions/mime/streams_supported: ffprobe formats + specs
  • magic_signature.hex/offset/meaning: PRONOM/libmagic/shared-mime-info
  • software.platforms/handles_extensions: Wikidata + vendor docs
  • support.browsers/oses: Can I Use/MDN (only for a handful like AVIF/WebP/HEVC)

G. Practical cautions

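  • Licensing: PRONOM is free to reuse with attribution; mime-db is MIT; shared-mime-info is GPL-licensed (GPL-2.0-or-later in current releases; confirm against the COPYING file of the snapshot you store); Wikipedia is CC BY-SA and Wikidata is CC0. Attribute where required.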
  • Rate limits: cache requests, backoff; for Wikipedia/Wikidata use official APIs, not HTML.
  • Consistency: extensions are many-to-many to MIME; your UI must handle multiple MIME per ext.
  • Ambiguity: generic extensions (.bin, .dat) → keep but mark category: other, and do not auto-suggest risky openers.
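
The ambiguity rule can be applied as a final annotation pass. This sketch assumes the filetype record shape used in the seed example (ext, mime, category, open_with); the `GENERIC_EXTS` set and the `ambiguous` flag are illustrative names, and the set would in practice be maintained through the review sheet.

```python
# Generic extensions named in the caution above; extend during review.
GENERIC_EXTS = {"bin", "dat"}

def annotate(filetypes):
    """Return copies of the records with an 'ambiguous' flag applied."""
    out = []
    for ft in filetypes:
        ambiguous = ft["ext"] in GENERIC_EXTS
        rec = dict(ft, ambiguous=ambiguous)  # shallow copy; the input is not modified
        if ambiguous:
            rec["category"] = "other"
            rec["open_with"] = []  # never auto-suggest openers for generic extensions
        out.append(rec)
    return out
```

Records that are not ambiguous keep their full MIME list, so the UI can still show several MIME types for one extension.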
