Skip to content

Conversation

@BryanFauble
Copy link
Member

Problem:

  • There is a new validation_file_handle_id field for the RecordSet that points to a validation CSV file denoting the specific issues with the validation rules.
  • Static type checking had issues with the TypeVar strings that we were using before, and not actually rendering out to Pandas types

Solution:

  • Adding the new field, and allowing the detailed results to come back via a Pandas DataFrame for analysis

Testing:

  • Integration testing to verify logic, I also verified that suggestions in the how-to guide for curator as well

@BryanFauble BryanFauble requested a review from a team as a code owner November 24, 2025 21:49
Comment on lines +182 to +186
## Step 4: Work with metadata and validate (Record-based workflow)

After creating a record-based metadata task, collaborators can enter metadata through the Grid interface. Once metadata entry is complete, you'll want to validate the data against your schema and identify any issues.

### The metadata curation workflow
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created https://sagebionetworks.jira.com/browse/SYNPY-1712 so we can revisit this and make doing this work a bit easier for folks

def create(
self,
attach_to_previous_session=True,
attach_to_previous_session=False,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change because if you do have a previous session open, and you are not aware that this defaults to True it could drop you into a session you did not intend to and overwrite, or delete data in the RecordSet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so the default behavior is always going to be creation of a new session? Which means data within the working session previously won't be included?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case that is correct unless they change that boolean to True.

Users may also use the list method as well to get a list of their active grid sessions and select the correct one from there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 Thanks for the great work here. I tagged @aditigopalan for a review as well, but I'll defer a final review to another person on the team.

@BryanFauble BryanFauble requested a review from a team November 25, 2025 18:28
…ple modules to ensure pandas is installed where needed
@thomasyu888 thomasyu888 requested a review from SageGJ November 26, 2025 22:26
Copy link
Contributor

@linglp linglp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @BryanFauble and @thomasyu888 ! The code looks good to me, but I am confused when I tested out the functionality. I started by creating a record set, a task, and a grid by using a test data schema:

record_set, task, grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn71816385",
    folder_id="syn71816388",
    record_set_name="BiospecimenMetadata_RecordSet",
    record_set_description="RecordSet for biospecimen metadata curation",
    curation_task_name="BiospecimenMetadataTemplate",
    upsert_keys=["specimenID"],
    instructions="Please curate this metadata according to the schema requirements",
    schema_uri="sage.schemas.v2571-nf.BiospecimenTemplate.schema-9.14.0"
)

I entered some data in the grid on the UI, but when I tried to download the validation result:

async def main():
    syn = Synapse()
    syn.login()
    # Assuming you have a RecordSet with a bound schema
    record_set = await RecordSet(id="syn71816405").get_async()
    # Create and export Grid session to generate validation results
    grid = await Grid(record_set_id=record_set.id).create_async()
    await grid.export_to_record_set_async()
    await grid.delete_async()
    # Re-fetch the RecordSet to get updated validation_file_handle_id
    record_set = await record_set.get_async()
    # Get the detailed validation results
    results_df = await record_set.get_detailed_validation_results_async(download_location="/Users/lpeng/code/synapsePythonClient")
    print('result df', results_df)
    # Analyze the results
    print(f"Total rows: {len(results_df)}")
    print(f"Columns: {results_df.columns.tolist()}")
    # Filter for valid and invalid rows
    # Note: is_valid is boolean (True/False) for validated rows
    valid_rows = results_df[results_df['is_valid'] == True]  # noqa: E712
    invalid_rows = results_df[results_df['is_valid'] == False]  # noqa: E712
    print(f"Valid rows: {len(valid_rows)}")
    print(f"Invalid rows: {len(invalid_rows)}")
    # View invalid rows with their error messages
    if len(invalid_rows) > 0:
        print(invalid_rows[['row_index', 'validation_error_message']])
asyncio.run(main())

Without modifying anything in the grid, I got different validation results:
The first time running the code:

"row_index","is_valid","validation_error_message","all_validation_messages"
"0","true",,

The second time running the code:

"row_index","is_valid","validation_error_message","all_validation_messages"
"0",,,

Any ideas why this is happening?

row["all_validation_messages"]
), f"Row {idx} is marked invalid but has no all_validation_messages"

def test_get_validation_results_validation_error(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the test name is misleading. The name suggests that an error is raised, but in this case, it only raises a warning?

syn.login()

# Get your RecordSet (must have a schema bound)
record_set = RecordSet(id="syn987654321").get()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be helpful for this id to link to an actual record set? right now it leads to a Page Unavailable message

Comment on lines +25 to +30
else:
# At runtime, use object as a placeholder
DataFrame = object
Series = object
np = object # type: ignore[misc, assignment]
nx = object # type: ignore[misc, assignment]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we want to display a message in the cases of ImportErrors or not type checking to notify users that the specific types weren't available?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants