Metadata samples #3345

hyanwong · 2025-12-02T14:46:32Z

While fixing #3344 I thought I might as well address this long standing issue (which is still open). This allows

samples = ts.samples(population={"name": "pop_0"})
# or shorten to
samples = ts.samples({"name": "pop_0"})

It doesn't promote "name" (or any other metadata key) to any special status, but simply matches the passed in dictionary against the metadata for each population, which seems neater to me than having to define special metadata fields. So you can equally do:

samples = ts.samples(population={"description": "Population zero"})

(Fixes #1697)

codecov · 2025-12-02T14:52:10Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.76%. Comparing base (25a26a7) to head (325c64e).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3345   +/-   ##
=======================================
  Coverage   89.76%   89.76%           
=======================================
  Files          29       29           
  Lines       31292    31304   +12     
  Branches     5738     5744    +6     
=======================================
+ Hits        28089    28101   +12     
  Misses       1794     1794           
  Partials     1409     1409

Flag	Coverage Δ
c-tests	`86.77% <ø> (ø)`
lwt-tests	`80.38% <ø> (ø)`
python-c-tests	`87.12% <ø> (ø)`
python-tests	`98.85% <100.00%> (+<0.01%)`	⬆️
python-tests-no-jit	`33.46% <0.00%> (-0.05%)`	⬇️
python-tests-numpy1	`50.38% <100.00%> (+0.09%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
python/tskit/trees.py	`98.90% <100.00%> (+<0.01%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

petrelharp · 2025-12-02T16:41:57Z

I like it! And looking at #1697 it sounds like others do/did also. I can think of basically no downsides (will someone accidentally pass in a dict when they meant an int?), and I can't think of another, better way to do this. (We could have this be a new argument instead of overloading population, but I think this is good.)

python/CHANGELOG.rst

python/tskit/trees.py

petrelharp · 2025-12-02T16:51:43Z

Looks to me like you need some more tests - test for matching more than one metadata key (so eg it's the combination of metadata keys that uniquely specifies the population, not a single one).

Also, I guess if you do populations={} and there's only a single population, this will select that population. I think that's correct, but this could be a test.

hyanwong · 2025-12-02T17:59:22Z

Great, thanks Peter - all helpful comments which I'll address now that this seems like the way to go.

python/tests/test_highlevel.py

python/tskit/trees.py

petrelharp · 2025-12-03T05:36:58Z

You'll want to double-check my last commit there; I committed the suggestions so we could see if the tests still passed.

python/tskit/trees.py

hyanwong · 2025-12-05T13:44:23Z

Committed and squashed. Do you think you've looked at it enough to mark as approved @petrelharp ? Then I can merge (it would be nice to get this in!)

Fixes tskit-dev#1697

petrelharp

LGTM!

petrelharp · 2025-12-06T02:29:14Z

Yep; I just had another look. I think we're good. I've just rebased.

jeromekelleher · 2025-12-07T14:42:40Z

LGTM also. You need to formally "approve" the PR during a review to unblock merging @petrelharp. When you click the "merge when ready" button then github will merge once CI passes.

benjeffery · 2025-12-08T10:43:42Z

I'm not sure what to do about this - maybe just put a note in the docstring? It
won't be obvious what is wrong to a lot of users.

  import tskit

  tc = tskit.TableCollection(sequence_length=1)
  tc.populations.metadata_schema = tskit.MetadataSchema.permissive_json()
  p0 = tc.populations.add_row(metadata={"labels": ["a", "b"]})
  p1 = tc.populations.add_row(metadata={"labels": ["c"]})

  tc.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0, population=p0)
  tc.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0, population=p1)
  root = tc.nodes.add_row(time=1)
  tc.edges.add_row(left=0, right=1, parent=root, child=0)
  tc.edges.add_row(left=0, right=1, parent=root, child=1)
  tc.sort()
  ts = tc.tree_sequence()

  # Raises: TypeError: unhashable type: 'list'
  ts.samples(population={"labels": ["a", "b"]})

jeromekelleher · 2025-12-08T11:55:31Z

Good spot @benjeffery, I didn't think of that.

I think we're trying to solve a more general problem than we need to here by looking at all keys (which seems to be a common pattern). Why not just say that the metadata has to have one key that we match on? Can we think of a situation in actual code that's out there where people would want to match to more than one key?

hyanwong · 2025-12-08T12:47:36Z

Hmm, good spot @benjeffery. Re @jeromekelleher's suggestion, I'm happy to either match a single key or multiple keys, as long as it's consistent. I chose multiple keys as it seemed to be the more intuitive interpretation for the end user. Either way, I'm not sure that solves Ben's problem above, which only involves a single labels key? So don't we need some means of either comparing values or bombing out if a comparison is not possible?

Re real-world examples, the more useful thing would be to match e.g. population={'superpopulation': 'AFR'}, which we can't do using the PR here anyway, as we require matching with only one population.

hyanwong · 2025-12-08T12:56:15Z

So I think the solutions are either:

We check that the input is a dict with a single item, and compare that, or
We loop through the input dictionary and check equality of each value

Both of them bomb out if the value cannot be compared. In the first we need to put a check in about single items in the input. In the second we need to put a loop in. That feels about the same effort to me, so either would be fine: the second gives the user more flexibility, but is therefore less tightly constrained.

petrelharp · 2025-12-08T21:24:44Z

I agree that the 'checking multiple keys' is more than is needed, but the implementation was so clean, I figured "why not".

I don't like (1) since many real-world examples will have more than one key in the dictionaries.

How about wrap the comparison in a try-except-pass clause, so it'll skip any values that would cause this error, and then just say in the documentation that for this to work the values have to be hashable (for instance: not lists)?

petrelharp · 2025-12-08T21:25:04Z

So, this points out that we need more weird tests.

jeromekelleher

I've made a suggestion of how to do this in a testable way.

This really isn't that straightforward, so if you're not prepared to spend ~1/2 a day writing test cases for this and thinking about it, then please just close the PR and leave it alone.

jeromekelleher · 2025-12-09T09:34:59Z

python/tskit/trees.py

+                for pop in self.populations():
+                    if not isinstance(pop.metadata, dict):
+                        raise ValueError("Population metadata is not a dictionary")
+                    if set(population.items()).issubset(pop.metadata.items()):


We're trying to be too concise and "clever" here. Clever code that tries to do too much at once is bad. Here's a way to do it that's clearer, more obvious and less clever.

# If the query keys are a subset of the population metadata keys, we try to compare. # Note that *all* keys must match. pops = [] if set(population.keys()).issubset(pop.metadata.keys()): for key, query_value in population.items(): # This requires the values are comparable, which should work for nested dictionaries # and so on. if query_value != pop[key]: break else: pops.append(pop.id)

Note that we've ended up with some quite tricky logic which needs to be tested now. So, I think we should separate this out into it's own function that can be unit tested. Something like

def match_metadata(table, query): """ Return the row IDs of the specified table that match the specified query dictionary. All rows matching *all* key-value pairs will be returned. """ # implementation above class TestMatchMetadata: # test all the quirky combos of stuff here.

petrelharp · 2025-12-10T01:00:18Z

Thanks, @jeromekelleher. My proposal is at this point to put this aside (and close it), as it's distracting from more urgent stuff. I think good idea @hyanwong - it seemed like low-hanging fruit - but there were some hidden cans of worms. We could get this through to the end, but probably that's not (currently) worth it because we really want a differently named function that lets us solve this use case also:

Re real-world examples, the more useful thing would be to match e.g. population={'superpopulation': 'AFR'}, which we can't do using the PR here anyway, as we require matching with only one population.

benjeffery · 2025-12-10T11:52:54Z

I think why this seems simple but it is isn't is that what you are trying to do is a subset of "a query language over data/metadata". Probably means bringing in a dep (humph) like jmespath to do things like:

for ind in ts.individuals_query("superpopulation.continent == `AFR` && contains(tags, 'quality-checked')"):
  ind.something()

Trying to do a minimal version of this is pain from rolling your query code.

hyanwong · 2025-12-10T12:34:26Z

I do want to do this, and am willing to put effort in (for teaching reasons), but agree that we have other priorities ATM, and don't want to suck up time from others for reviewing it etc. There is useful discussion here, which I don't want to lose, so I suggest closing this PR but keeping the issue itself live, and I'll address it in 2026.

petrelharp reviewed Dec 2, 2025

View reviewed changes

python/CHANGELOG.rst Show resolved Hide resolved

petrelharp reviewed Dec 2, 2025

View reviewed changes

python/tskit/trees.py Show resolved Hide resolved

petrelharp reviewed Dec 2, 2025

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

hyanwong force-pushed the metadata-samples branch from f8b3d3e to 4d287bc Compare December 2, 2025 19:14

hyanwong mentioned this pull request Dec 2, 2025

Raise error if pop passed to samples is not an integer #3344

Merged

3 tasks

hyanwong force-pushed the metadata-samples branch from 4d287bc to db925cc Compare December 2, 2025 23:08

petrelharp reviewed Dec 3, 2025

View reviewed changes

python/tests/test_highlevel.py Outdated Show resolved Hide resolved

petrelharp reviewed Dec 3, 2025

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

petrelharp reviewed Dec 3, 2025

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

hyanwong force-pushed the metadata-samples branch from e981b49 to 36755b3 Compare December 5, 2025 13:41

Allow metadata keys to be specified in samples()

325c64e

Fixes tskit-dev#1697

petrelharp reviewed Dec 6, 2025

View reviewed changes

petrelharp force-pushed the metadata-samples branch from 36755b3 to 325c64e Compare December 6, 2025 02:24

hyanwong enabled auto-merge December 7, 2025 17:41

benjeffery disabled auto-merge December 8, 2025 10:39

jeromekelleher reviewed Dec 9, 2025

View reviewed changes

hyanwong closed this Dec 10, 2025

hyanwong mentioned this pull request Dec 10, 2025

specify samples by population name #1697

Open

Metadata samples #3345

Metadata samples #3345

Uh oh!

Conversation

hyanwong commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

petrelharp commented Dec 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrelharp commented Dec 2, 2025

Uh oh!

hyanwong commented Dec 2, 2025

Uh oh!

Uh oh!

Uh oh!

petrelharp commented Dec 3, 2025

Uh oh!

Uh oh!

hyanwong commented Dec 5, 2025

Uh oh!

petrelharp left a comment

Choose a reason for hiding this comment

Uh oh!

petrelharp commented Dec 6, 2025

Uh oh!

jeromekelleher commented Dec 7, 2025

Uh oh!

benjeffery commented Dec 8, 2025

Uh oh!

jeromekelleher commented Dec 8, 2025

Uh oh!

hyanwong commented Dec 8, 2025

Uh oh!

hyanwong commented Dec 8, 2025

Uh oh!

petrelharp commented Dec 8, 2025

Uh oh!

petrelharp commented Dec 8, 2025

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

jeromekelleher Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

petrelharp commented Dec 10, 2025

Uh oh!

benjeffery commented Dec 10, 2025

Uh oh!

hyanwong commented Dec 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

hyanwong commented Dec 2, 2025 •

edited

Loading

codecov bot commented Dec 2, 2025 •

edited

Loading