Skip to content

Conversation

@hyanwong
Copy link
Member

@hyanwong hyanwong commented Dec 2, 2025

While fixing #3344 I thought I might as well address this long standing issue (which is still open). This allows

samples = ts.samples(population={"name": "pop_0"})
# or shorten to
samples = ts.samples({"name": "pop_0"})

It doesn't promote "name" (or any other metadata key) to any special status, but simply matches the passed in dictionary against the metadata for each population, which seems neater to me than having to define special metadata fields. So you can equally do:

samples = ts.samples(population={"description": "Population zero"})

(Fixes #1697)

@codecov
Copy link

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.76%. Comparing base (25a26a7) to head (325c64e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3345   +/-   ##
=======================================
  Coverage   89.76%   89.76%           
=======================================
  Files          29       29           
  Lines       31292    31304   +12     
  Branches     5738     5744    +6     
=======================================
+ Hits        28089    28101   +12     
  Misses       1794     1794           
  Partials     1409     1409           
Flag Coverage Δ
c-tests 86.77% <ø> (ø)
lwt-tests 80.38% <ø> (ø)
python-c-tests 87.12% <ø> (ø)
python-tests 98.85% <100.00%> (+<0.01%) ⬆️
python-tests-no-jit 33.46% <0.00%> (-0.05%) ⬇️
python-tests-numpy1 50.38% <100.00%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/tskit/trees.py 98.90% <100.00%> (+<0.01%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@petrelharp
Copy link
Contributor

I like it! And looking at #1697 it sounds like others do/did also. I can think of basically no downsides (will someone accidentally pass in a dict when they meant an int?), and I can't think of another, better way to do this. (We could have this be a new argument instead of overloading population, but I think this is good.)

@petrelharp
Copy link
Contributor

Looks to me like you need some more tests - test for matching more than one metadata key (so eg it's the combination of metadata keys that uniquely specifies the population, not a single one).

Also, I guess if you do populations={} and there's only a single population, this will select that population. I think that's correct, but this could be a test.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 2, 2025

Great, thanks Peter - all helpful comments which I'll address now that this seems like the way to go.

@petrelharp
Copy link
Contributor

You'll want to double-check my last commit there; I committed the suggestions so we could see if the tests still passed.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 5, 2025

Committed and squashed. Do you think you've looked at it enough to mark as approved @petrelharp ? Then I can merge (it would be nice to get this in!)

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@petrelharp
Copy link
Contributor

Yep; I just had another look. I think we're good. I've just rebased.

@jeromekelleher
Copy link
Member

LGTM also. You need to formally "approve" the PR during a review to unblock merging @petrelharp. When you click the "merge when ready" button then github will merge once CI passes.

@hyanwong hyanwong enabled auto-merge December 7, 2025 17:41
@benjeffery benjeffery disabled auto-merge December 8, 2025 10:39
@benjeffery
Copy link
Member

I'm not sure what to do about this - maybe just put a note in the docstring? It
won't be obvious what is wrong to a lot of users.

  import tskit

  tc = tskit.TableCollection(sequence_length=1)
  tc.populations.metadata_schema = tskit.MetadataSchema.permissive_json()
  p0 = tc.populations.add_row(metadata={"labels": ["a", "b"]})
  p1 = tc.populations.add_row(metadata={"labels": ["c"]})

  tc.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0, population=p0)
  tc.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0, population=p1)
  root = tc.nodes.add_row(time=1)
  tc.edges.add_row(left=0, right=1, parent=root, child=0)
  tc.edges.add_row(left=0, right=1, parent=root, child=1)
  tc.sort()
  ts = tc.tree_sequence()

  # Raises: TypeError: unhashable type: 'list'
  ts.samples(population={"labels": ["a", "b"]})

@jeromekelleher
Copy link
Member

Good spot @benjeffery, I didn't think of that.

I think we're trying to solve a more general problem than we need to here by looking at all keys (which seems to be a common pattern). Why not just say that the metadata has to have one key that we match on? Can we think of a situation in actual code that's out there where people would want to match to more than one key?

@hyanwong
Copy link
Member Author

hyanwong commented Dec 8, 2025

Hmm, good spot @benjeffery. Re @jeromekelleher's suggestion, I'm happy to either match a single key or multiple keys, as long as it's consistent. I chose multiple keys as it seemed to be the more intuitive interpretation for the end user. Either way, I'm not sure that solves Ben's problem above, which only involves a single labels key? So don't we need some means of either comparing values or bombing out if a comparison is not possible?

Re real-world examples, the more useful thing would be to match e.g. population={'superpopulation': 'AFR'}, which we can't do using the PR here anyway, as we require matching with only one population.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 8, 2025

So I think the solutions are either:

  1. We check that the input is a dict with a single item, and compare that, or
  2. We loop through the input dictionary and check equality of each value

Both of them bomb out if the value cannot be compared. In the first we need to put a check in about single items in the input. In the second we need to put a loop in. That feels about the same effort to me, so either would be fine: the second gives the user more flexibility, but is therefore less tightly constrained.

@petrelharp
Copy link
Contributor

I agree that the 'checking multiple keys' is more than is needed, but the implementation was so clean, I figured "why not".

I don't like (1) since many real-world examples will have more than one key in the dictionaries.

How about wrap the comparison in a try-except-pass clause, so it'll skip any values that would cause this error, and then just say in the documentation that for this to work the values have to be hashable (for instance: not lists)?

@petrelharp
Copy link
Contributor

So, this points out that we need more weird tests.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a suggestion of how to do this in a testable way.

This really isn't that straightforward, so if you're not prepared to spend ~1/2 a day writing test cases for this and thinking about it, then please just close the PR and leave it alone.

for pop in self.populations():
if not isinstance(pop.metadata, dict):
raise ValueError("Population metadata is not a dictionary")
if set(population.items()).issubset(pop.metadata.items()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're trying to be too concise and "clever" here. Clever code that tries to do too much at once is bad. Here's a way to do it that's clearer, more obvious and less clever.

# If the query keys are a subset of the population metadata keys, we try to compare. 
# Note that *all* keys must match.
pops = []
if set(population.keys()).issubset(pop.metadata.keys()):
       for key, query_value in population.items():
              # This requires the values are comparable, which should work for nested dictionaries
              # and so on. 
              if query_value != pop[key]:
                      break
        else:
                pops.append(pop.id)

Note that we've ended up with some quite tricky logic which needs to be tested now. So, I think we should separate this out into it's own function that can be unit tested. Something like

def match_metadata(table, query):
     """
     Return the row IDs of the specified table that match the specified query dictionary. All   
     rows matching *all* key-value pairs will be returned.
      """
      # implementation above

class TestMatchMetadata:
       # test all the quirky combos of stuff here.

@petrelharp
Copy link
Contributor

Thanks, @jeromekelleher. My proposal is at this point to put this aside (and close it), as it's distracting from more urgent stuff. I think good idea @hyanwong - it seemed like low-hanging fruit - but there were some hidden cans of worms. We could get this through to the end, but probably that's not (currently) worth it because we really want a differently named function that lets us solve this use case also:

Re real-world examples, the more useful thing would be to match e.g. population={'superpopulation': 'AFR'}, which we can't do using the PR here anyway, as we require matching with only one population.

@benjeffery
Copy link
Member

I think why this seems simple but it is isn't is that what you are trying to do is a subset of "a query language over data/metadata". Probably means bringing in a dep (humph) like jmespath to do things like:

for ind in ts.individuals_query("superpopulation.continent == `AFR` && contains(tags, 'quality-checked')"):
  ind.something()

Trying to do a minimal version of this is pain from rolling your query code.

@hyanwong
Copy link
Member Author

I do want to do this, and am willing to put effort in (for teaching reasons), but agree that we have other priorities ATM, and don't want to suck up time from others for reviewing it etc. There is useful discussion here, which I don't want to lose, so I suggest closing this PR but keeping the issue itself live, and I'll address it in 2026.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

specify samples by population name

4 participants