A. Source map (what to scrape for each silo)

the chat convo: https://chatgpt.com/share/68d0a6be-f704-800d-ac58-540a42b6bc35

| Schema | Primary sources (stable) | What you can pull cleanly | Notes / gaps |
| --- | --- | --- | --- |
| filetype (`ext`, name, category, mime, magic, open_with, related) | IANA media types, mime-db (npm), freedesktop/shared-mime-info XML, PRONOM/DROID (signature files), libmagic database, Wikidata | ext↔MIME, canonical names, categories; magic signatures (PRONOM/libmagic); aliases; common apps (Wikidata) | FileInfo is useful but semi-structured; use it last for titles/“how to open” blurbs |
| mime | mime-db JSON (GitHub/npm), IANA registries | `type/subtype`, extensions, notes | mime-db already merges IANA + community; treat as ground truth for ext lists |
| magic_signature | PRONOM (DROID ZIP/XML), libmagic text db, community file-signature DB repos | hex patterns, offsets, description, associated extensions | PRONOM is very complete but uses bureaucratic IDs; libmagic is pragmatic for detection |
| container | FFmpeg docs/`ffprobe -formats`, Matroska & MP4 official docs, Wikipedia container pages | extensions, MIME, supported stream types | FFmpeg reports demux/mux support per format; pair with official specs for MIME |
| codec | `ffmpeg -codecs` / docs, Wikipedia codec pages, AOM/SVT/x264/x265 repos | names, kind (video/audio), common containers, profiles/levels, hw support | HW support is best taken from vendor docs (NVIDIA/Intel/Apple); keep limited to “common” |
| software (openers/handlers) | Wikidata (SPARQL), app vendor pages, chocolatey/homebrew formulae | app name, platforms, homepage, supported extensions | Wikidata has many app→filetype relations |
| support (browser/OS) | Can I Use (for AVIF/WebP etc.), Apple/Android docs, MDN | “yes/partial/no” by OS/browser | Don’t over-promise; use “partial” when a decoder exists but UI support is missing |
| subtitle/archives/raw clusters | Wikipedia format pages, Matroska specs, vendor docs | format descriptions, typical containers | Good for cross-linking (“VTT in HLS”) |
| manifests | HLS/DASH specs, MDN, player docs (Shaka/HLS.js) | tags/attributes and examples | More textual than tabular; scrape for examples, not truth tables |

B. First-pass ETL plan (fast + reproducible)

Bootstrap with machine-readable sources
- mime-db → seed MIME ↔ extensions (single JSON).
- shared-mime-info → parse XML for categories + magic patterns.
- PRONOM/DROID → unzip signature files; extract hex patterns + offsets + PUIDs.
- libmagic → parse the magic text database (e.g. /usr/share/file/magic; the path varies by distribution) as a secondary source.
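- ffprobe → programmatically list containers/codecs/protocols from your own FFmpeg build (ffprobe -formats, ffprobe -codecs, ffprobe -protocols, one option per run). These listings are plain text; -of json applies only to probe output, so parse the columns yourself.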
Enrich with semi-structured sources
- Wikidata → SPARQL queries for “software X supports extension Y”, “format family”, “developer”, etc.
- Wikipedia → per-page infobox scrape for missing descriptions/aliases (cache and manual review).
Human-curate deltas
- Where sources disagree, keep priority order: IANA/mime-db > shared-mime > PRONOM/libmagic > Wikidata > Wikipedia > FileInfo.
- Open a “review” sheet for oddities (e.g., .bin overlapping meanings).
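The priority order above can be expressed as a small resolver. This is a minimal sketch, not a fixed design: the source labels in `RANK` and the function name `resolve_field` are hypothetical, and anything that loses to a higher-priority source is returned as a conflict so it can go to the review sheet.

```python
# Lower rank wins. Labels are illustrative names for the sources listed above.
RANK = {
    "iana": 0, "mime-db": 0,
    "shared-mime-info": 1,
    "pronom": 2, "libmagic": 2,
    "wikidata": 3,
    "wikipedia": 4,
    "fileinfo": 5,
}

def resolve_field(candidates):
    """Pick one value from [(source, value), ...] by source priority.

    Returns (winning_value, conflicts), where conflicts lists the
    lower-priority (source, value) pairs that disagree with the winner.
    """
    present = [(s, v) for s, v in candidates if v not in (None, "", [])]
    if not present:
        return None, []
    # sorted() is stable, so equally ranked sources keep their input order.
    ordered = sorted(present, key=lambda sv: RANK[sv[0]])
    winner = ordered[0][1]
    conflicts = [(s, v) for s, v in ordered[1:] if v != winner]
    return winner, conflicts
```

A non-empty `conflicts` list is the signal to add a row to the review sheet rather than silently dropping the losing value.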
Normalization rules
- ext lowercase, no dot.
- mime unique, lowercase; prefer mime-db entries.
- category map from shared-mime-info <generic-icon> names, falling back to the top-level media type (image, video, audio, text, application → collapse to your enum).
- magic store as normalized hex with wildcards; keep source in notes.
- open_with: cap to 3–5 popular, per OS; source = Wikidata/vendor.
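The ext, mime, and magic rules above could look like the following. This is a sketch under assumptions: the function names are hypothetical, and `??` as the single-byte wildcard token is one possible convention, since the notes do not fix a wildcard syntax.

```python
import re

def normalize_ext(ext: str) -> str:
    """'.HEIC ' -> 'heic' (lowercase, no dot)."""
    return ext.strip().lower().lstrip(".")

def normalize_mime(mime: str) -> str:
    """'Image/HEIC; q=1' -> 'image/heic' (lowercase, parameters dropped)."""
    return mime.split(";", 1)[0].strip().lower()

_TOKEN = re.compile(r"^(?:[0-9A-F]{2}|\?\?)$")

def normalize_hex(pattern: str) -> str:
    """Return space-separated uppercase byte tokens, e.g. '89 50 4E 47'.

    Accepts '0x89 50 4e 47', '89504E47', or '\\x89\\x50'; '??' marks any byte.
    """
    # Drop 0x / \x prefixes and common separators, then uppercase.
    compact = re.sub(r"0x|\\x|[\s,:]", "", pattern, flags=re.IGNORECASE).upper()
    if len(compact) % 2:
        raise ValueError(f"odd number of hex digits: {pattern!r}")
    tokens = [compact[i:i + 2] for i in range(0, len(compact), 2)]
    for tok in tokens:
        if not _TOKEN.match(tok):
            raise ValueError(f"bad byte token {tok!r} in {pattern!r}")
    return " ".join(tokens)
```

Raising on malformed patterns, rather than guessing, keeps bad signatures out of the build and pushes them toward manual review.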
Versioning
- Store raw snapshots in /sources/… with dates.
- Build a deterministic pipeline (same inputs → same JSONs). Emit a build_id with timestamps + git SHA.
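One way to keep the build deterministic while still stamping it is to derive the build_id from the dated snapshots and the git SHA instead of wall-clock time. The sketch below assumes that choice; `canonical_json` and `make_build_id` are hypothetical names, and the caller is expected to supply the SHA (for example from `git rev-parse HEAD`).

```python
import hashlib
import json

def canonical_json(records) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(records, sort_keys=True, ensure_ascii=False, separators=(",", ":"))

def make_build_id(git_sha: str, snapshot_dates: dict[str, str]) -> str:
    """Build an id from the newest snapshot date, the short SHA, and a digest of all inputs.

    snapshot_dates maps provider -> ISO date of the raw snapshot under /sources/.
    """
    digest = hashlib.sha256(canonical_json(snapshot_dates).encode("utf-8")).hexdigest()[:8]
    latest = max(snapshot_dates.values())  # ISO dates sort lexicographically
    return f"{latest}-{git_sha[:7]}-{digest}"
```

Because nothing here reads the clock, rebuilding from the same snapshots and commit yields the same id and the same JSON bytes.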

C. Concrete pulls you can implement immediately

mime-db (JS/JSON): keys are full MIME types and each entry may carry extensions[]. Map these into mime.schema.json and use the reverse index (ext → MIME) to seed filetype.mime.
shared-mime-info (XML db):
- Fields: <mime-type type="image/heic">, <glob pattern="*.heic"/>, <magic> with <match value="..." offset="...">.
- Use this to fill filetype.magic[], category, and alternate extensions.
PRONOM/DROID (signature files):
- XML with byte sequences (Pos, ByteSequenceValue) and PUIDs.
- Perfect for magic_signature.schema.json; map PUID → id, include extensions.
ffprobe (your build):
- ffprobe -hide_banner -formats → demux/mux (D/E) flags for containers.
- ffprobe -hide_banner -codecs → codec names + decoder/encoder flags.
- ffprobe -hide_banner -protocols → input/output protocol lists.
These listings are plain text (-of json affects only probe output such as -show_format), so parse the columns. Populate container, codec, and your taxonomy buckets 10–11–14.
Wikidata SPARQL (JSON results, no scraping):
- Query: apps that open a given extension; or file formats with filename extension “.heic”.
- Populates software.handles_extensions[] and filetype.open_with[].

D. Minimal extractor specs (so your scrapers are small)

I/O: always save raw → /sources/{provider}/{date}/… (don’t parse in-place).
Parser: pure functions from raw to normalized records matching your schemas.
Joiners: merge by normalized keys (ext, mime.full) with priority rules.
Emitted: one JSON per entity type (/build/filetypes.json, mimes.json, …) and (optionally) one JSON-per-item for static page generation.
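A skeleton for these specs might look like this. It is a sketch with hypothetical function names, shown for mime-db only; the joiner step is omitted, and `root` stands in for wherever /sources and /build live.

```python
import json
from pathlib import Path

def save_raw(root: Path, provider: str, date: str, filename: str, payload: bytes) -> Path:
    """I/O step: store the untouched download under sources/{provider}/{date}/."""
    path = root / "sources" / provider / date / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path

def parse_mime_db(raw: bytes) -> list[dict]:
    """Parser step: a pure function from raw bytes to normalized records (no I/O)."""
    db = json.loads(raw)
    return [
        {"full": full.lower(),
         "extensions": sorted(e.lower() for e in meta.get("extensions", []))}
        for full, meta in sorted(db.items())
    ]

def emit(root: Path, name: str, records: list[dict]) -> Path:
    """Emit step: write one JSON file per entity type under build/."""
    out = root / "build" / f"{name}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records, sort_keys=True, indent=2) + "\n", encoding="utf-8")
    return out
```

Keeping the parser free of I/O means it can be tested against a stored snapshot without touching the network.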

E. Example: tiny pipelines (pseudo-Python)

mime-db → mime + filetype seeds

import requests

# mime-db is one JSON object: {"type/subtype": {"extensions": [...], ...}, ...}
resp = requests.get("https://raw.githubusercontent.com/jshttp/mime-db/master/db.json", timeout=30)
resp.raise_for_status()
db = resp.json()
mimes, ext_to_mimes = [], {}
for full, meta in db.items():
    mime_type, subtype = full.split("/", 1)
    exts = [e.lower() for e in meta.get("extensions", [])]
    mimes.append({"type": mime_type, "subtype": subtype, "full": full, "extensions": exts})
    for e in exts:
        # reverse index: one extension can map to several MIME types
        ext_to_mimes.setdefault(e, set()).add(full)
# seed filetypes from ext_to_mimes; name and category are placeholders until enrichment
filetypes = [
    {"id": ext, "ext": ext, "name": f".{ext.upper()} file", "category": "other", "mime": sorted(m)}
    for ext, m in sorted(ext_to_mimes.items())
]

ffprobe → containers/codecs

# These listing options print plain text; -of json only affects probe output (-show_format, -show_streams).
ffprobe -hide_banner -formats   > formats.txt
ffprobe -hide_banner -codecs    > codecs.txt
ffprobe -hide_banner -protocols > protocols.txt
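Because ffprobe prints the -formats listing as text columns, a small parser is needed to turn it into container records. The sketch below assumes the usual layout (a legend, a line of dashes, then rows with D/E flag columns and an optional device column); that layout differs slightly between FFmpeg versions, so check it against your own build. `parse_formats` and `list_formats` are hypothetical names.

```python
import re
import subprocess

# One row: leading space, D flag, E flag, optional device flag, name(s), description.
ROW = re.compile(r"^ ([D ])([E ])[d ]? +(\S+)\s*(.*)$")

def parse_formats(text: str) -> list[dict]:
    """Parse the text printed by `ffprobe -formats` into container records."""
    records, in_table = [], False
    for line in text.splitlines():
        if not in_table:
            # The legend ends with a line of dashes; data rows follow it.
            in_table = line.strip().startswith("--")
            continue
        m = ROW.match(line)
        if not m:
            continue
        demux, mux, names, description = m.groups()
        records.append({
            "names": names.split(","),   # e.g. "matroska,webm" lists aliases
            "demux": demux == "D",
            "mux": mux == "E",
            "description": description.strip(),
        })
    return records

def list_formats() -> list[dict]:
    """Run the local ffprobe and parse its listing (requires ffprobe on PATH)."""
    out = subprocess.run(["ffprobe", "-hide_banner", "-formats"],
                         capture_output=True, text=True, check=True).stdout
    return parse_formats(out)
```

The same approach should carry over to the -codecs and -protocols listings, each with its own row pattern.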

shared-mime-info XML → magic

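from lxml import etree

# shared-mime-info elements are namespaced; un-namespaced lookups match nothing.
NS = {"smi": "http://www.freedesktop.org/standards/shared-mime-info"}
root = etree.parse("freedesktop.org.xml")
sigs = []
for mt in root.findall(".//smi:mime-type", NS):
    t = mt.get("type")
    # Keep only simple "*.ext" globs; literal names such as "Makefile" are not extensions.
    exts = [g.get("pattern")[2:].lower() for g in mt.findall("smi:glob", NS)
            if g.get("pattern", "").startswith("*.")]
    for i, m in enumerate(mt.findall(".//smi:magic//smi:match", NS)):
        if m.get("type") != "string":
            continue  # numeric matches (byte, big32, ...) need their own encoding
        # Best effort: turn C-style escapes (\x89, \0, \n) into raw bytes.
        raw = m.get("value").encode("utf-8").decode("unicode_escape").encode("latin-1")
        sigs.append({
            "id": f"{t}-{i}",
            "hex": raw.hex(" ").upper(),
            "offset": int(m.get("offset", "0").split(":")[0]),  # "0:64" is a range; keep its start
            "meaning": t,
            "extensions": exts,
        })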
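To sanity-check extracted signatures, it helps to run them against known bytes. The following sketch assumes records with a space-separated `hex` string and an integer `offset`, as in the records built above, and treats `??` as a hypothetical any-byte wildcard; `compile_signature` and `matches` are illustrative names.

```python
def compile_signature(hex_pattern: str) -> list:
    """'FF D8 ?? E0' -> [0xFF, 0xD8, None, 0xE0]; None means any byte."""
    return [None if tok == "??" else int(tok, 16) for tok in hex_pattern.split()]

def matches(data: bytes, sig: dict) -> bool:
    """True if data carries the signature at the record's offset."""
    pattern = compile_signature(sig["hex"])
    start = sig.get("offset", 0)
    window = data[start:start + len(pattern)]
    if len(window) < len(pattern):
        return False  # file too short to hold the signature
    return all(p is None or p == b for p, b in zip(pattern, window))

# Usage: the well-known 8-byte PNG signature at offset 0.
png = {"hex": "89 50 4E 47 0D 0A 1A 0A", "offset": 0, "meaning": "image/png"}
print(matches(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, png))
```

This only checks a single fixed offset; offset ranges and nested match conditions from the source databases would need more than this sketch covers.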

Wikidata SPARQL (apps that open HEIC)

SELECT ?app ?appLabel WHERE {
  ?fmt wdt:P1195 "heic".          # P1195 = file extension
  ?app wdt:P1072 ?fmt.            # P1072 = readable file format (software → format)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
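The query can be sent to the public Wikidata Query Service over HTTP and read back as JSON. This is a minimal standard-library sketch: the helper names are hypothetical, the User-Agent string is a placeholder you should replace with your own contact details, and whether the query returns rows depends on how the format is modelled in Wikidata.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?app ?appLabel WHERE {
  ?fmt wdt:P1195 "heic".
  ?app wdt:P1072 ?fmt.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def build_request(query: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that asks the endpoint for JSON results."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(url, headers={
        "User-Agent": user_agent,
        "Accept": "application/sparql-results+json",
    })

def parse_bindings(payload: dict) -> list[dict]:
    """Flatten SPARQL JSON results into {app, label} records."""
    return [
        {"app": row["app"]["value"], "label": row.get("appLabel", {}).get("value", "")}
        for row in payload["results"]["bindings"]
    ]

def fetch_apps(query: str, user_agent: str) -> list[dict]:
    """Network call: run the query and return parsed records."""
    with urllib.request.urlopen(build_request(query, user_agent), timeout=30) as resp:
        return parse_bindings(json.load(resp))
```

Saving the raw response under /sources/wikidata/{date}/ before parsing keeps this step consistent with the snapshot rule.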

F. Filling your schemas (coverage matrix)

filetype.ext/name/mime/category/magic/open_with/related → mime-db + shared-mime-info + PRONOM + Wikidata
codec.kind/profiles/common_containers → ffprobe + Wikipedia
container.extensions/mime/streams_supported → ffprobe formats + specs
magic_signature.hex/offset/meaning → PRONOM/libmagic/shared-mime-info
software.platforms/handles_extensions → Wikidata + vendor docs
support.browsers/oses → Can I Use/MDN (only for a handful like AVIF/WebP/HEVC)

G. Practical cautions

Licensing: PRONOM is free to use with credit; mime-db is MIT; shared-mime-info is GPL-2.0-or-later; Wikipedia text is CC BY-SA and Wikidata is CC0. Attribute where required, and confirm each license against the snapshot you store.
Rate limits: cache responses and back off on errors; for Wikipedia/Wikidata use the official APIs, not HTML scraping.
Consistency: extensions map many-to-many to MIME types; your UI must handle multiple MIME types per extension.
Ambiguity: generic extensions (.bin, .dat) → keep but mark category: other, and do not auto-suggest risky openers.

https://github.com/serpdownloaders/codecs
