Commit e52181c

Improve README doc
Signed-off-by: Philippe Ombredanne <pombredanne@aboutcode.org>
1 parent ae09fe0 commit e52181c

2 files changed: +91 -24 lines changed

aboutcode/federated/README.rst

Lines changed: 53 additions & 0 deletions
@@ -4,6 +4,59 @@ aboutcode.federated
44
This is a library of utilities to compute ids and file paths for AboutCode
55
federated data based on Package URL
66

7+
8+
The goal of the federated data utilities is to handle content-defined and hash-addressable
9+
Package data keyed by PURL stored in many Git repositories. This approach to
10+
federate decentralized data is called FederatedCode.
11+
12+
13+
Overview
14+
========
15+
16+
The main design elements for these utilities are:
17+
18+
1. **Data Federation**: A Data Federation is a database, representing a consistent,
19+
non-overlapping set of data kind clusters (like scans, vulnerabilities or SBOMs)
20+
across many package ecosystems, a.k.a. PURL types.
21+
A Federation is similar to a traditional database.
22+
23+
2. **Data Cluster**: A Data Federation contains Data Clusters, where a Data Cluster's
24+
purpose is to store the data of a single kind (like scans) across multiple PURL
25+
types. The cluster name is the data kind name and is used as the prefix for
26+
repository names. A Data Cluster is akin to a table in a traditional database.
27+
28+
3. **Data Repository**: A DataCluster contains one or more Git Data Repositories,
29+
each storing datafiles of the cluster data kind and one PURL type, spreading
30+
the datafiles in multiple Data Directories. The name is data-kind+PURL-
31+
type+hashid. A Repository is similar to a shard or tablespace in a traditional
32+
database.
33+
34+
4. **Data Directory**: In a Repository, a Data Directory contains the datafiles for
35+
PURLs. The directory name is PURL-type+hashid.
36+
37+
5. **Data File**: This is a Data File of the DataCluster's Data Kind that is
38+
stored in subdirectories structured after the PURL components::
39+
40+
namespace/name/version/qualifiers/subpath:
41+
42+
- Either at the level of a PURL name: namespace/name,
43+
- Or at the PURL version level namespace/name/version,
44+
- Or at the PURL qualifiers+PURL subpath level.
45+
46+
A Data File can be for instance a JSON scan results file, or a list of PURLs in
47+
YAML.
48+
49+
For example, a list of PURLs as a Data Kind would be stored at the name
50+
subdirectory level::
51+
52+
gem-0107/gem/random_password_generator/purls.yml
53+
54+
Or a ScanCode scan as a Data Kind at the version subdirectory level::
55+
56+
gem-0107/npm/file/3.24.3/scancode.yml
57+
58+
59+
760
License
861
-------
962
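
Reviewer note on the README layout: the sketch below shows one way a relative data file path could be assembled from the Data Directory name and the PURL components. It is a minimal illustration under stated assumptions, not the library's implementation: `datafile_path` and its parameters are hypothetical names, and the hashid is passed in as a ready-made string because its computation is not shown in this diff.

```python
from pathlib import PurePosixPath


def datafile_path(purl_type, hashid, name, filename, namespace=None, version=None):
    """
    Return a relative data file path built from PURL components.
    Hypothetical helper for illustration only; not part of aboutcode.federated.
    """
    # The Data Directory name is PURL-type+hashid, for example "gem-0107".
    segments = [f"{purl_type}-{hashid}"]
    # PURL subdirectories follow the namespace/name/version order.
    if namespace:
        segments.append(namespace)
    segments.append(name)
    if version:
        segments.append(version)
    # The data file itself, such as "purls.yml", comes last.
    segments.append(filename)
    return str(PurePosixPath(*segments))


# A list of PURLs stored at the name level.
print(datafile_path("gem", "0107", "rails", "purls.yml"))
```

With these inputs the call prints `gem-0107/rails/purls.yml`. Passing `version="3.24.3"` would add a version subdirectory before the file name.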

aboutcode/federated/__init__.py

Lines changed: 38 additions & 24 deletions
@@ -32,8 +32,9 @@
3232

3333
"""
3434
Federated data utilities to handle content-defined and hash-addressable Package
35-
data keyed by PURL stored in many Git repositories. This approach to federate
36-
decentralized data is called FederatedCode.
35+
The goal of the federated data utilities is to handle content-defined and hash-addressable
36+
Package data keyed by PURL stored in many Git repositories. This approach to
37+
federate decentralized data is called FederatedCode.
3738
3839
3940
Overview
@@ -61,7 +62,8 @@
6162
PURLs. The directory name is PURL-type+hashid.
6263
6364
5. Data File: This is a Data File of the DataCluster's Data Kind that is
64-
stored in subdirectories structured after the PURL components:
65+
stored in subdirectories structured after the PURL components::
66+
6567
namespace/name/version/qualifiers/subpath:
6668
6769
- Either at the level of a PURL name: namespace/name,
@@ -71,7 +73,7 @@
7173
A Data File can be for instance a JSON scan results file, or a list of PURLs in
7274
YAML.
7375
74-
For example, a list of PURLs as a Data Kind would sored at the name
76+
For example, a list of PURLs as a Data Kind would be stored at the name
7577
subdirectory level::
7678
7779
gem-0107/gem/random_password_generator/purls.yml
@@ -131,14 +133,19 @@
131133
Object hierarchy
132134
----------------
133135
134-
**federation**: defined by its name and a Git repo with a config file with
135-
clusters configuration for data kind and PURL type parameters, enabling pointing
136-
to multiple repositories.
137-
**cluster**: identified by the data kind name, prefixing its data repos
138-
**repo**: data repo (Git) identified by datakind+PURL-type+hashid
139-
**directory**: dir in a repo, identified by PURL-type+PURL-hashid
140-
**PURL path**: ns/name/version/extra_path derived from the PURL
141-
**datafile**: file storing the data as text JSON/YAML/XML
136+
- **federation**: defined by its name and a Git repo with a config file with
137+
clusters configuration for data kind and PURL type parameters, enabling pointing
138+
to multiple repositories
139+
140+
- **cluster**: identified by the data kind name, prefixing its data repos
141+
142+
- **repo**: data repo (Git) identified by datakind+PURL-type+hashid
143+
144+
- **directory**: dir in a repo, identified by PURL-type+PURL-hashid
145+
146+
- **PURL path**: ns/name/version/extra_path derived from the PURL
147+
148+
- **datafile**: file storing the data as text JSON/YAML/XML
142149
143150
Example
144151
-------
@@ -147,32 +154,34 @@
147154
versions, we would have:
148155
149156
- data federation definition git repo, with its config file.
150-
aboutcode-data/aboutcode-data
151-
aboutcode-federation-config.yml
157+
- aboutcode-data/aboutcode-data
158+
- aboutcode-federation-config.yml
152159
153160
- data cluster repos name prefix is the data kind
154-
aboutcode-data/purls
161+
- aboutcode-data/purls
155162
156163
- data repository git repo, with a purl sub dir tree and datafile.
157164
The first repo name has a hash of 0000 which is the first PURL hashid of the
158165
range of PURL hashid stored in this repo's dirs.
159-
aboutcode-data/purls-gem-0000/
166+
167+
- aboutcode-data/purls-gem-0000/
160168
161169
- data directory, with a purl sub dir tree and datafile. The dir name is
162170
composed of type+hashid.
163-
aboutcode-data/purls-gem-0000/gem-0107/
171+
172+
- aboutcode-data/purls-gem-0000/gem-0107/
164173
165174
- PURL subdirectory, and datafile, here list of PURLs for the gem named rails:
166-
aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml
175+
- aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml
167176
168177
In this example, if the base URL for this cluster is at the aboutcode-data
169178
GitHub organization, then the URL to the purls.yml datafile is inferred this way
170-
based on the cluster config:
179+
based on the cluster config::
171180
172-
https://github.com/
173-
aboutcode-data/purls-gem-0000/
174-
raw/refs/heads/main/
175-
gem-0107/rails/purls.yml
181+
https://github.com/
182+
aboutcode-data/purls-gem-0000/
183+
raw/refs/heads/main/
184+
gem-0107/rails/purls.yml
176185
177186
178187
More Design details
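
Reviewer note on the URL example in this hunk: the four URL fragments shown in the docstring can be joined mechanically. The sketch below is a minimal illustration assuming a GitHub-style raw URL layout as in the docstring. `datafile_url` and its parameter names are hypothetical and are not claimed to match the module's API.

```python
def datafile_url(base_url, org, repo, dir_name, purl_path, filename, branch="main"):
    """
    Return the raw URL to a data file in a data repository.
    Hypothetical helper for illustration only; not part of aboutcode.federated.
    """
    parts = [
        base_url.rstrip("/"),      # e.g. "https://github.com"
        org,                       # organization holding the cluster repos
        repo,                      # data repository: datakind+PURL-type+hashid
        "raw/refs/heads",          # raw content prefix, as in the docstring
        branch,                    # branch name, "main" in the docstring
        dir_name,                  # data directory: PURL-type+hashid
        purl_path.strip("/"),      # PURL subdirectories, e.g. "rails"
        filename,                  # data file, e.g. "purls.yml"
    ]
    return "/".join(parts)


print(
    datafile_url(
        base_url="https://github.com/",
        org="aboutcode-data",
        repo="purls-gem-0000",
        dir_name="gem-0107",
        purl_path="rails",
        filename="purls.yml",
    )
)
```

This prints `https://github.com/aboutcode-data/purls-gem-0000/raw/refs/heads/main/gem-0107/rails/purls.yml`, the same URL that the docstring spells out across four lines.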
@@ -290,19 +299,23 @@
290299
using these starting values:
291300
292301
1. For super large ecosystems (with ~5M packages):
302+
293303
- one dir per repo, yielding 1,024 repos
294304
- github, npm
295305
296306
2. For large ecosystems (with ~500K packages)
307+
297308
- eight dirs per repo, yielding 128 repos
298309
- golang, maven, nuget, perl, php, pypi, ruby, huggingface
299310
300311
3. For medium ecosystems (with ~50K packages)
312+
301313
- 32 dirs per repo, yielding 32 Git repositories
302314
- alpm, bitbucket, cocoapods, composer, deb, docker, gem, generic,
303315
mlflow, pub, rpm, cargo
304316
305317
4. For small ecosystems (with ~2K packages)
318+
306319
- 1,024 directories in one git repository
307320
- all others
308321
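
Reviewer note on the starting values in this hunk: the repo counts follow from dividing a fixed number of directories by the number of directories per repo. The sketch below assumes 1,024 directories per PURL type, taken from the "1,024 directories in one git repository" figure. `TOTAL_DIRS` and `repos_count` are hypothetical names used only for this illustration.

```python
# Assumed total number of hashid directories per PURL type (from the docstring figures).
TOTAL_DIRS = 1024


def repos_count(dirs_per_repo, total_dirs=TOTAL_DIRS):
    """
    Return how many Git repositories are needed for a PURL type.
    Hypothetical helper for illustration only; not part of aboutcode.federated.
    """
    if dirs_per_repo <= 0 or total_dirs % dirs_per_repo != 0:
        raise ValueError("dirs_per_repo must be a positive divisor of total_dirs")
    return total_dirs // dirs_per_repo


# Ecosystem size -> directories per repo, as listed in the docstring.
for label, dirs_per_repo in [("super large", 1), ("large", 8), ("medium", 32), ("small", 1024)]:
    print(label, repos_count(dirs_per_repo))
```

The loop prints 1024, 128, 32 and 1 repositories respectively, matching the four tiers listed in the docstring.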
@@ -321,7 +334,7 @@
321334
322335
323336
Rebalancing and splitting a DataCluster repos
324-
------------------------------------------
337+
------------------------------------------------
325338
326339
We can rebalance a cluster, like when we first store the data in a cluster with
327340
a single Git repository for a given PURL type, and later split this repo to more
@@ -365,6 +378,7 @@
365378
from 1024 to 2048, 4096 or 8192. This would imply moving all the files around
366379
as the directory structure would change from the new hashids. This is likely
367380
to be an exceptional operation.
381+
368382
"""
369383

370384
PACKAGE_REPOS_NAME_PREFIX = "aboutcode-packages"
