-
Notifications
You must be signed in to change notification settings - Fork 79
Metadata samples #3345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata samples #3345
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3345 +/- ##
=======================================
Coverage 89.76% 89.76%
=======================================
Files 29 29
Lines 31292 31304 +12
Branches 5738 5744 +6
=======================================
+ Hits 28089 28101 +12
Misses 1794 1794
Partials 1409 1409
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
I like it! And looking at #1697 it sounds like others do/did also. I can think of basically no downsides (will someone accidentally pass in a dict when they meant an int?), and I can't think of another, better way to do this. (We could have this be a new argument instead of overloading |
|
Looks to me like you need some more tests - test for matching more than one metadata key (so eg it's the combination of metadata keys that uniquely specifies the population, not a single one). Also, I guess if you do |
|
Great, thanks Peter - all helpful comments which I'll address now that this seems like the way to go. |
f8b3d3e to
4d287bc
Compare
4d287bc to
db925cc
Compare
|
You'll want to double-check my last commit there; I committed the suggestions so we could see if the tests still passed. |
e981b49 to
36755b3
Compare
|
Committed and squashed. Do you think you've looked at it enough to mark as approved @petrelharp ? Then I can merge (it would be nice to get this in!) |
petrelharp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
36755b3 to
325c64e
Compare
|
Yep; I just had another look. I think we're good. I've just rebased. |
|
LGTM also. You need to formally "approve" the PR during a review to unblock merging @petrelharp. When you click the "merge when ready" button then github will merge once CI passes. |
|
I'm not sure what to do about this - maybe just put a note in the docstring? It import tskit
tc = tskit.TableCollection(sequence_length=1)
tc.populations.metadata_schema = tskit.MetadataSchema.permissive_json()
p0 = tc.populations.add_row(metadata={"labels": ["a", "b"]})
p1 = tc.populations.add_row(metadata={"labels": ["c"]})
tc.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0, population=p0)
tc.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0, population=p1)
root = tc.nodes.add_row(time=1)
tc.edges.add_row(left=0, right=1, parent=root, child=0)
tc.edges.add_row(left=0, right=1, parent=root, child=1)
tc.sort()
ts = tc.tree_sequence()
# Raises: TypeError: unhashable type: 'list'
ts.samples(population={"labels": ["a", "b"]}) |
|
Good spot @benjeffery, I didn't think of that. I think we're trying to solve a more general problem than we need to here by looking at all keys (which seems to be a common pattern). Why not just say that the metadata has to have one key that we match on? Can we think of a situation in actual code that's out there where people would want to match to more than one key? |
|
Hmm, good spot @benjeffery. Re @jeromekelleher's suggestion, I'm happy to either match a single key or multiple keys, as long as it's consistent. I chose multiple keys as it seemed to be the more intuitive interpretation for the end user. Either way, I'm not sure that solves Ben's problem above, which only involves a single Re real-world examples, the more useful thing would be to match e.g. |
|
So I think the solutions are either:
Both of them bomb out if the value cannot be compared. In the first we need to put a check in about single items in the input. In the second we need to put a loop in. That feels about the same effort to me, so either would be fine: the second gives the user more flexibility, but is therefore less tightly constrained. |
|
I agree that the 'checking multiple keys' is more than is needed, but the implementation was so clean, I figured "why not". I don't like (1) since many real-world examples will have more than one key in the dictionaries. How about wrap the comparison in a try-except-pass clause, so it'll skip any values that would cause this error, and then just say in the documentation that for this to work the values have to be hashable (for instance: not lists)? |
|
So, this points out that we need more weird tests. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a suggestion of how to do this in a testable way.
This really isn't that straightforward, so if you're not prepared to spend ~1/2 a day writing test cases for this and thinking about it, then please just close the PR and leave it alone.
| for pop in self.populations(): | ||
| if not isinstance(pop.metadata, dict): | ||
| raise ValueError("Population metadata is not a dictionary") | ||
| if set(population.items()).issubset(pop.metadata.items()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're trying to be too concise and "clever" here. Clever code that tries to do too much at once is bad. Here's a way to do it that's clearer, more obvious and less clever.
# If the query keys are a subset of the population metadata keys, we try to compare.
# Note that *all* keys must match.
pops = []
if set(population.keys()).issubset(pop.metadata.keys()):
for key, query_value in population.items():
# This requires the values are comparable, which should work for nested dictionaries
# and so on.
if query_value != pop[key]:
break
else:
pops.append(pop.id)Note that we've ended up with some quite tricky logic which needs to be tested now. So, I think we should separate this out into it's own function that can be unit tested. Something like
def match_metadata(table, query):
"""
Return the row IDs of the specified table that match the specified query dictionary. All
rows matching *all* key-value pairs will be returned.
"""
# implementation above
class TestMatchMetadata:
# test all the quirky combos of stuff here.
|
Thanks, @jeromekelleher. My proposal is at this point to put this aside (and close it), as it's distracting from more urgent stuff. I think good idea @hyanwong - it seemed like low-hanging fruit - but there were some hidden cans of worms. We could get this through to the end, but probably that's not (currently) worth it because we really want a differently named function that lets us solve this use case also:
|
|
I think why this seems simple but it is isn't is that what you are trying to do is a subset of "a query language over data/metadata". Probably means bringing in a dep (humph) like for ind in ts.individuals_query("superpopulation.continent == `AFR` && contains(tags, 'quality-checked')"):
ind.something()Trying to do a minimal version of this is pain from rolling your query code. |
|
I do want to do this, and am willing to put effort in (for teaching reasons), but agree that we have other priorities ATM, and don't want to suck up time from others for reviewing it etc. There is useful discussion here, which I don't want to lose, so I suggest closing this PR but keeping the issue itself live, and I'll address it in 2026. |
While fixing #3344 I thought I might as well address this long standing issue (which is still open). This allows
It doesn't promote "name" (or any other metadata key) to any special status, but simply matches the passed in dictionary against the metadata for each population, which seems neater to me than having to define special metadata fields. So you can equally do:
(Fixes #1697)