feat: implement coarse versioning for future matching against #4590

michaelkedar · 2026-01-13T05:56:11Z

Implemented a string-comparable encoding of ecosystem versions to allow us to do filtering on the database queries.
This should help reduce the number of entities needed to fetch and compare in many ecosystems.
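
To illustrate why a fixed-width, zero-padded encoding can be compared as a plain string, here is a minimal sketch. The `encode` helper is hypothetical: it borrows the `00:` prefix and the three 8-digit fields from the format string quoted later in this review, and is not the PR's actual implementation.

```python
def encode(major: int, minor: int, patch: int) -> str:
    """Hypothetical helper: fixed-width, zero-padded encoding of three ints."""
    return f'00:{major:08d}.{minor:08d}.{patch:08d}'


versions = [(1, 10, 0), (1, 9, 3), (2, 0, 0), (1, 9, 12)]

# Naive dotted strings sort incorrectly: '1.10.0' < '1.9.3' as strings.
naive = sorted('.'.join(map(str, v)) for v in versions)

# Padded strings sort in the same order as the integer tuples.
encoded = sorted(encode(*v) for v in versions)
expected = [encode(*v) for v in sorted(versions)]


def in_range(coarse: str, low: str, high: str) -> bool:
    """A range filter then reduces to two string comparisons."""
    return low <= coarse <= high
```

Because every field has the same width, lexicographic order and numeric order agree, which is what lets a datastore apply inequality filters to the encoded string.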

I've also slightly changed the behaviour of _sort_key, which previously returned a 'maximal' value for invalid versions. It now raises a ValueError, and the sort_key wrapper (which was already handling 0 values) converts those errors into maximal version objects.
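
A rough sketch of that split between `_sort_key` and `sort_key` follows. The `MaxVersion` sentinel and `SimpleEcosystem` class are hypothetical stand-ins, not the classes in the PR, and the handling of 0 values mentioned above is not modelled here.

```python
class MaxVersion:
    """Hypothetical sentinel that sorts after every other sort key."""

    def __lt__(self, other):
        return False

    def __gt__(self, other):
        return not isinstance(other, MaxVersion)

    def __eq__(self, other):
        return isinstance(other, MaxVersion)


class SimpleEcosystem:
    """Hypothetical ecosystem with dotted, all-numeric versions."""

    def _sort_key(self, version: str) -> tuple[int, ...]:
        parts = version.split('.')
        if not all(p.isdecimal() for p in parts):
            # Invalid input is reported rather than silently mapped.
            raise ValueError(f'Invalid version: {version}')
        return tuple(int(p) for p in parts)

    def sort_key(self, version: str):
        # The public wrapper turns the error into a maximal key.
        try:
            return self._sort_key(version)
        except ValueError:
            return MaxVersion()


eco = SimpleEcosystem()
ordered = sorted(['1.10', 'bogus', '1.9'], key=eco.sort_key)
```

Keeping the raising variant private means callers that need to distinguish invalid versions can do so, while sorting code still gets a total order.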

I've added hypothesis to do some fuzzing tests to make sure the logic retains the ordering rules of the ecosystems. (Please let me know if after this is merged you run into more failure cases so I can fix them). Fuzzing also helped identify some potential uncaught errors in our version parsing (particularly, using isdigit instead of isdecimal), which I've fixed where applicable.
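
For context on the isdigit/isdecimal point, here is a small sketch of the difference. The two `parse_component*` functions are hypothetical and exist only to show the failure mode; they are not taken from the PR.

```python
def parse_component_unsafe(text: str):
    """Guards with isdigit(), which also accepts characters int() rejects."""
    if text.isdigit():
        return int(text)  # raises ValueError for e.g. superscript two
    return None


def parse_component(text: str):
    """Guards with isdecimal(), which only accepts characters int() can parse."""
    return int(text) if text.isdecimal() else None


superscript_two = '\u00b2'  # the character '²'
arabic_indic_three = '\u0663'  # the character '٣'

# isdigit() says yes to '²', but isdecimal() says no.
flags = (superscript_two.isdigit(), superscript_two.isdecimal())
```

Decimal digits from other scripts, such as the Arabic-Indic three above, pass `isdecimal()` and are accepted by `int()`, so the guard and the conversion agree.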

I need to write a script to populate the existing AffectedVersions entities with the newly generated values, then work on making the API use the coarse versions for querying.

We have some PRs open on adding new ecosystems - they don't necessarily need the coarse version method, but I'm happy to merge those first then add the code for this later.

michaelkedar · 2026-01-13T05:56:22Z

/gemini review

michaelkedar · 2026-01-13T05:58:48Z

osv/ecosystems/coarse_version_monotonicity_test.py

+
+# Matches RPM versions: optional epoch, alternating alphanumeric segments.
+rpm_version_strategy = st.from_regex(
+    re.compile(r'^([0-9]+:)?(([0-9]+|[A-Za-z]+)((?![0-9A-Za-z])[ -~])*)+$',


hypothesis uses the regex to generate patterns. It doesn't match with them afaik.

gemini-code-assist

Code Review

This is a substantial and well-executed pull request that introduces a coarse versioning scheme to optimize database queries. The implementation is thorough, covering a wide range of ecosystems, and the addition of property-based testing with hypothesis is a fantastic way to ensure the correctness of the version ordering logic. The refactoring of _sort_key methods to raise ValueError on invalid versions is a solid design improvement. I've identified one potential issue in the Debian coarse versioning logic and a minor opportunity for optimization in osv/models.py. Overall, this is an excellent contribution.

gemini-code-assist · 2026-01-13T05:59:32Z

osv/ecosystems/debian.py

+    parts = re.findall(r'^$|\d+|\D+', v)
+    int_parts = []
+    for couple in batched(parts, 2):
+      if not couple[0].isdecimal():
+        # This is probably handled by is_valid
+        break
+      int_parts.append(int(couple[0]))
+      if len(couple) == 1:
+        break
+      sep = couple[1]
+      if sep == '.':
+        continue
+      if sep[0] == '.':
+        # Bigger than the max int, so we overflow
+        int_parts.append(9999999999)
+      break


The logic for parsing Debian versions in coarse_version appears to be incorrect for versions containing alphabetic characters. For an input like '10.20.a30', re.findall(r'^$|\d+|\D+', v) will produce ['10', '.', '20', '.', 'a', '30']. The loop processing these parts will break when it encounters the non-decimal part 'a'. This results in int_parts being [10, 20], which then produces a coarse version ending in .00000000.

However, the corresponding test expects a result ending in .99999999, implying that the alphabetic character should cause an 'overflow' to a maximal value for that component to preserve ordering. The current implementation doesn't seem to handle this case; it only considers special separators for overflow, not alphabetic components in the version string itself.

For an input like '10.20.a30', re.findall(r'^$|\d+|\D+', v) will produce ['10', '.', '20', '.', 'a', '30'].

Um, no:

>>> re.findall(r'^$|\d+|\D+', '10.20.a30')
['10', '.', '20', '.a', '30']

which would find its way to the int_parts.append(9999999999) line
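
The trace above can be reproduced with a standalone sketch of the loop from the diff. The function name `leading_int_parts`, the `OVERFLOW` constant name, and the local `batched` stand-in (for `itertools.batched`, available from Python 3.12) are assumptions made for illustration; the surrounding function in the PR may do more than this.

```python
import re
from itertools import islice

OVERFLOW = 9999999999  # same literal as in the diff; exceeds an 8-digit field


def batched(iterable, n):
    """Stand-in for itertools.batched (Python 3.12+)."""
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk


def leading_int_parts(v: str) -> list[int]:
    """Collects leading dot-separated integers, as in the quoted loop."""
    # Tokens alternate between digit runs and non-digit runs.
    parts = re.findall(r'^$|\d+|\D+', v)
    int_parts = []
    for couple in batched(parts, 2):
        if not couple[0].isdecimal():
            break
        int_parts.append(int(couple[0]))
        if len(couple) == 1:
            break
        sep = couple[1]
        if sep == '.':
            continue
        if sep[0] == '.':
            # A dot followed by non-digits (such as '.a') overflows.
            int_parts.append(OVERFLOW)
        break
    return int_parts


tokens = re.findall(r'^$|\d+|\D+', '10.20.a30')
```

Since `\D+` is greedy, the dot and the letter land in one token, so the separator check sees `'.a'` and takes the overflow branch.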

osv/models.py

another-rex · 2026-01-14T00:38:16Z

osv/ecosystems/alpine.py

+      raise ValueError(f'Invalid version: {version}')
+    # is_valid uses a $ regex anchor (which can match a newline),
+    # so we need to remove the newline if one exists.
+    if version[-1] == '\n':


Does $ also match \r\n? It might be worth calling strip(), since I don't believe any versioning scheme allows whitespace as part of the spec.

does $ also match \r\n?

No, at least not on Linux.
I would rather not call strip, just in case
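
A quick way to check the `$` behaviour in Python's `re` module is sketched below. The pattern is a hypothetical validity regex, not the one in alpine.py. Without `re.MULTILINE`, `$` matches at the end of the string and also just before a single trailing newline, whereas `\Z` matches only at the very end.

```python
import re

# Hypothetical validity patterns for dotted numeric versions.
LOOSE = re.compile(r'^\d+(\.\d+)*$')
STRICT = re.compile(r'^\d+(\.\d+)*\Z')


def is_valid_loose(version: str) -> bool:
    return LOOSE.match(version) is not None


def is_valid_strict(version: str) -> bool:
    return STRICT.match(version) is not None


results = {
    'trailing \\n, loose': is_valid_loose('1.2\n'),
    'trailing \\r\\n, loose': is_valid_loose('1.2\r\n'),
    'trailing \\n, strict': is_valid_strict('1.2\n'),
}
```

With `'1.2\r\n'` the `\r` is not consumed by the pattern and `$` cannot match in front of it, so only the bare `\n` case slips through the loose check.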

another-rex · 2026-01-14T00:43:49Z

osv/ecosystems/ecosystems_base.py

+  while len(components) < 3:
+    components.append(pad_value)
+
+  return f'00:{components[0]:08d}.{components[1]:08d}.{components[2]:08d}'


nit, let's make this construction a function that just takes in 3 int args. then we can replace 374, 337, and 300 with this func.

I've made coarse_version_from_ints be the generic construction function for this, and consolidated some of the logic into it.
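
As a sketch of what such a construction function could look like: the signature, the `MAX_COMPONENT` constant name, and the overflow rule below are assumptions, and only the output format and the padding loop come from the diff. Components are assumed to be non-negative.

```python
MAX_COMPONENT = 99999999  # largest value that fits in an 8-digit field


def coarse_version_from_ints(components, pad_value: int = 0) -> str:
    """Hypothetical sketch; the real signature in the PR may differ."""
    parts = []
    for value in list(components)[:3]:
        if value > MAX_COMPONENT:
            # Assumption: saturate this field and every later one, so two
            # overflowing versions can never be ordered the wrong way round.
            parts.append(MAX_COMPONENT)
            pad_value = MAX_COMPONENT
            break
        parts.append(value)
    # Pad missing components, as in the quoted while-loop.
    while len(parts) < 3:
        parts.append(pad_value)
    return f'00:{parts[0]:08d}.{parts[1]:08d}.{parts[2]:08d}'
```

Saturating the remaining fields trades precision for safety: distinct oversized versions collapse to the same coarse value, which keeps the mapping order-preserving (though not strictly so).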

another-rex · 2026-01-14T00:45:08Z

osv/ecosystems/ecosystems_base.py

+  Args:
+    version: The version string to convert.
+    separators_regex: Regex for separators (default: r'[.]').
+    trim_regex: Regex for characters to trim after (default: r'[-+]'). 


rename to trim_suffix_regex?

'suffix' is a bit misleading here, because it implies it'd match the whole end of the string rather than a point at which to truncate from.
I've renamed it to truncate_regex, which might be clearer?

another-rex · 2026-01-14T00:46:00Z

osv/ecosystems/ecosystems_base.py

+    if not p.isdecimal():
+      break
+    val = int(p)
+    if val > 99999999:


Create a const for MAX

another-rex · 2026-01-14T00:47:37Z

osv/ecosystems/alpine.py

+        separators_regex=r'[.]',
+        # in APK, 1.02.1 < 1.1.1, so we must treat everything after .0x as 0
+        # also split off the _rc, _p, or -r suffixes
+        trim_regex=r'(?:\.0|[_-])',


does this also trim 1.0.1?

Might need \.0\d

It does trim 1.0.1, but that's intentional because 1.0.2 < 1.01.1 < 1.1.0
Will mention that in the comment.
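
To make the truncation concrete, here is a small sketch that applies the regex from the diff with a hypothetical `truncate` helper. Everything from the first match onward is dropped; how the remaining prefix is turned into a coarse version is not shown.

```python
import re

# Regex copied from the diff: cut at the first '.0', '_' or '-'.
TRUNCATE = re.compile(r'(?:\.0|[_-])')


def truncate(version: str) -> str:
    """Hypothetical helper: keeps only the text before the first match."""
    return TRUNCATE.split(version, maxsplit=1)[0]


examples = {v: truncate(v) for v in
            ['1.0.2', '1.01.1', '1.1.0', '1.02.1', '1.10.1', '1.2.3_rc1']}
```

Both `1.0.2` and `1.01.1` collapse to the prefix `1`, so their coarse values tie instead of contradicting the ordering 1.0.2 < 1.01.1 < 1.1.0 given above.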

another-rex · 2026-01-14T00:56:18Z

osv/ecosystems/pypi.py

+    """
+    # legacy versions are less than non-legacy versions, thus mapped to 0
+    ver = packaging_legacy.version.parse(version)
+    if isinstance(ver, packaging_legacy.version.LegacyVersion):


Hmm, how common are these? Might it be worth bumping all legacy versions up by 1 so we can still represent them? (Probably not worth it unless they're really common.)

I don't think they're that common.
Plus, legacy versions can be arbitrary strings and I don't want to work out the comparison rules for them.

another-rex · 2026-01-14T00:57:21Z

osv/third_party/univers/alpine.py

+  # if not search:
+  #   return False
+
+  # s = search.group(1)
+  # left, _, _ = s.partition(".")
+  # # handle the suffix case
+  # left, _, _ = left.partition("-")
+  # if not left.isdecimal():
+  #   return True
+  # i = int(left)
+  # return str(i) == left


Should this still be here?

I kinda wanted to signpost the original third-party code that I've changed

another-rex · 2026-01-14T00:58:33Z

osv/models.py


 _MAX_GIT_VERSIONS_TO_INDEX = 5000

+MIN_COARSE_VERSION = '00:00000000.00000000.00000000'


this can use the helper function suggested in ecosystems.

another-rex · 2026-01-14T00:59:02Z

osv/models.py


-def affected_from_bug(entity: Bug) -> list[AffectedVersions]:
-  """Compute the AffectedVersions from a Bug entity."""
+def _get_coarse_min_max(events, e_helper, db_id):


nit: add types

another-rex · 2026-01-14T01:01:56Z

osv/models.py

+
+  # Add the enumerated versions
+  # We need at least a package name to perform matching.
+  if pkg_name and affected.versions:


Should we be logging something if there is no pkg_name but there are ranges or affected.versions?

I'm pretty sure this would be expected for GIT ranges, so probably not.

michaelkedar added 4 commits January 8, 2026 11:33

coarse version matching WIP

d8ec74d

2nd coarse

cef9b23

📄

99a202a

[CURRENT YEAR]

289a15f

github-advanced-security bot found potential problems Jan 13, 2026

gemini-code-assist bot reviewed Jan 13, 2026

michaelkedar added 2 commits January 14, 2026 09:55

🧼

cbc3c5d

my line is too long

82d2b78

michaelkedar marked this pull request as ready for review January 14, 2026 00:25

michaelkedar requested a review from another-rex January 14, 2026 00:25

another-rex reviewed Jan 14, 2026

review comments

31cf004

