|
32 | 32 |
|
33 | 33 | """ |
34 | 34 | Federated data utilities to handle content-defined and hash-addressable Package |
35 | | -data keyed by PURL stored in many Git repositories. This approach to federate |
36 | | -decentralized data is called FederatedCode. |
| 35 | +The goal of the federated data utilities is to handle content-defined and |
| 36 | +hash-addressable Package data keyed by PURL stored in many Git repositories. |
| 37 | +This approach to federating decentralized data is called FederatedCode. |
37 | 38 |
|
38 | 39 |
|
39 | 40 | Overview |
|
61 | 62 | PURLs. The directory name PURL-type+hashid |
62 | 63 |
|
63 | 64 | 5. Data File: This is a Data File of the DataCluster's Data Kind that is |
64 | | -stored in subdirectories structured after the PURL components: |
| 65 | +stored in subdirectories structured after the PURL components:: |
| 66 | +
|
65 | 67 | namespace/name/version/qualifiers/subpath: |
66 | 68 |
|
67 | 69 | - Either at the level of a PURL name: namespace/name, |
|
71 | 73 | A Data File can be for instance a JSON scan results file, or a list of PURLs in |
72 | 74 | YAML. |
73 | 75 |
|
74 | | -For example, a list of PURLs as a Data Kind would sored at the name |
| 76 | +For example, a list of PURLs as a Data Kind would be stored at the name |
75 | 77 | subdirectory level:: |
76 | 78 |
|
77 | 79 | gem-0107/gem/random_password_generator/purls.yml |
|
131 | 133 | Object hierarchy |
132 | 134 | ---------------- |
133 | 135 |
|
134 | | -**federation**: defined by its name and a Git repo with a config file with |
135 | | -clusters configuration for data kind and PURL type parameters, enabling pointing |
136 | | -to multiple repositories. |
137 | | - **cluster**: identified by the data kind name, prefixing its data repos |
138 | | - **repo**: data repo (Git) identified by datakind+PURL-type+hashid |
139 | | - **directory**: dir in a repo, identified by PURL-type+PURL-hashid |
140 | | - **PURL path**: ns/name/version/extra_path derived from the PURL |
141 | | - **datafile**: file storing the data as text JSON/YAML/XML |
| 136 | +- **federation**: defined by its name and a Git repo with a config file with |
| 137 | + clusters configuration for data kind and PURL type parameters, enabling pointing |
| 138 | + to multiple repositories |
| 139 | +
|
| 140 | + - **cluster**: identified by the data kind name, prefixing its data repos |
| 141 | +
|
| 142 | + - **repo**: data repo (Git) identified by datakind+PURL-type+hashid |
| 143 | +
|
| 144 | + - **directory**: dir in a repo, identified by PURL-type+PURL-hashid |
| 145 | +
|
| 146 | + - **PURL path**: ns/name/version/extra_path derived from the PURL |
| 147 | +
|
| 148 | + - **datafile**: file storing the data as text JSON/YAML/XML |
142 | 149 |
|
143 | 150 | Example |
144 | 151 | ------- |
|
147 | 154 | versions, we would have: |
148 | 155 |
|
149 | 156 | - data federation definition git repo, with its config file. |
150 | | - aboutcode-data/aboutcode-data |
151 | | - aboutcode-federation-config.yml |
| 157 | + - aboutcode-data/aboutcode-data |
| 158 | + - aboutcode-federation-config.yml |
152 | 159 |
|
153 | 160 | - data cluster repos name prefix is the data kind |
154 | | - aboutcode-data/purls |
| 161 | + - aboutcode-data/purls |
155 | 162 |
|
156 | 163 | - data repository git repo, with a purl sub dir tree and datafile. |
157 | 164 | The first repo name has a hash of 0000 which is the first PURL hashid of the |
158 | 165 | range of PURL hashid stored in this repo's dirs. |
159 | | - aboutcode-data/purls-gem-0000/ |
| 166 | +
|
| 167 | + - aboutcode-data/purls-gem-0000/ |
160 | 168 |
|
161 | 169 | - data directory, with a purl sub dir tree and datafile. The dir name |
162 | 170 | is composed of type+hashid. |
163 | | - aboutcode-data/purls-gem-0000/gem-0107/ |
| 171 | +
|
| 172 | + - aboutcode-data/purls-gem-0000/gem-0107/ |
164 | 173 |
|
165 | 174 | - PURL subdirectory, and datafile, here a list of PURLs for the gem named rails: |
166 | | - aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml |
| 175 | + - aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml |
167 | 176 |
|
168 | 177 | In this example, if the base URL for this cluster is at the aboutcode-data |
169 | 178 | GitHub organization, then the URL to the purls.yml datafile is inferred this way |
170 | | -based on the cluster config: |
| 179 | +based on the cluster config:: |
171 | 180 |
|
172 | | -https://github.com/ |
173 | | - aboutcode-data/purls-gem-0000/ |
174 | | - raw/refs/heads/main/ |
175 | | - gem-0107/rails/purls.yml |
| 181 | + https://github.com/ |
| 182 | + aboutcode-data/purls-gem-0000/ |
| 183 | + raw/refs/heads/main/ |
| 184 | + gem-0107/rails/purls.yml |
176 | 185 |
|
177 | 186 |
|
178 | 187 | More Design details |
|
290 | 299 | using these starting values: |
291 | 300 |
|
292 | 301 | 1. For super large ecosystems (with ~5M packages): |
| 302 | +
|
293 | 303 | - one dir per repo, yielding 1,024 repos |
294 | 304 | - github, npm |
295 | 305 |
|
296 | 306 | 2. For large ecosystems (with ~500K packages) |
| 307 | +
|
297 | 308 | - eight dirs per repo, yielding 128 repos |
298 | 309 | - golang, maven, nuget, perl, php, pypi, ruby, huggingface |
299 | 310 |
|
300 | 311 | 3. For medium ecosystems (with ~50K packages) |
| 312 | +
|
301 | 313 | - 32 dirs per repo, yielding 32 Git repositories |
302 | 314 | - alpm, bitbucket, cocoapods, composer, deb, docker, gem, generic, |
303 | 315 | mlflow, pub, rpm, cargo |
304 | 316 |
|
305 | 317 | 4. For small ecosystems (with ~2K packages) |
| 318 | +
|
306 | 319 | - 1,024 directories in one git repository |
307 | 320 | - all others |
308 | 321 |
|
|
321 | 334 |
|
322 | 335 |
|
323 | 336 | Rebalancing and splitting a DataCluster repos |
324 | | ------------------------------------------- |
| 337 | +------------------------------------------------ |
325 | 338 |
|
326 | 339 | We can rebalance a cluster, like when we first store the data in a cluster with |
327 | 340 | a single Git repository for a given PURL type, and later split this repo to more |
|
365 | 378 | from 1024 to 2048, 4096 or 8192. This would imply moving all the files around |
366 | 379 | as the directory structure would change from the new hashids. This is likely |
367 | 380 | to be an exceptional operation. |
| 381 | +
|
368 | 382 | """ |
369 | 383 |
|
370 | 384 | PACKAGE_REPOS_NAME_PREFIX = "aboutcode-packages" |
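The URL inference described in the docstring's example (organization, repo, raw branch path, hashid directory, PURL path, datafile) can be sketched as a simple path join. This is only an illustration under stated assumptions: the function name `datafile_raw_url` and its parameters are hypothetical, and the `raw/refs/heads/<branch>` layout is copied from the docstring example rather than from the actual cluster config handling.

```python
def datafile_raw_url(
    org,
    repo,
    directory,
    purl_path,
    filename,
    branch="main",
    base_url="https://github.com",
):
    """
    Return the raw URL of a datafile, joining the segments in the same order
    as the docstring example. All parameters are illustrative assumptions.
    """
    segments = [
        base_url.rstrip("/"),
        org,
        repo,
        "raw/refs/heads",
        branch,
        directory,
        purl_path.strip("/"),
        filename,
    ]
    return "/".join(segments)


url = datafile_raw_url(
    org="aboutcode-data",
    repo="purls-gem-0000",
    directory="gem-0107",
    purl_path="rails",
    filename="purls.yml",
)
```

Here `url` reproduces the four-line URL shown in the docstring once its segments are joined on a single line.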
|