
Conversation

@omkar-334

This is a multimodal math benchmark consisting of two question types: free-form and MCQ.
I've separated each type into its own subset.

The benchmark can be evaluated in two ways: either by providing the problem solution or by providing the problem code.
For now I'm implementing the solution method.
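
For reference, here's a rough sketch of how the two subsets would be loaded from the Hub; the repo id and the subset names `mcq` and `freeform` are placeholders, not the actual ones:

```python
from datasets import load_dataset

# Placeholder repo id and subset names; substitute the real ones.
mcq = load_dataset("org/multimodal-math-benchmark", name="mcq", split="test")
freeform = load_dataset("org/multimodal-math-benchmark", name="freeform", split="test")

# Each row is assumed to carry the question text, the image(s), and the
# gold answer (plus the answer choices for the MCQ subset).
print(mcq[0].keys())
print(freeform[0].keys())
```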

I still need to figure out the proper metric for this. I've tried `Metrics.expr_gold_metric` and `Metrics.exact_match`, but neither is working; I'm looking into it now.

NathanHB (Member) commented Nov 24, 2025

hey @omkar-334!

> I still need to figure out the proper metric for this. I've tried `Metrics.expr_gold_metric` and `Metrics.exact_match`, but neither is working; I'm looking into it now.

Don't worry about this; what's important for new evals like this is the inspect-ai implementation :)

There are examples here, documentation on how to write an eval here, and the inspect-ai documentation is here.
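
For the free-form subset, a minimal inspect-ai task could look roughly like this; the dataset path, subset name, and field names (`question`, `answer`) are placeholders for the actual schema:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate


def record_to_sample(record) -> Sample:
    # Map one raw dataset row to an inspect-ai Sample.
    # Multimodal inputs would use a ChatMessageUser with image content
    # instead of a plain string; left out to keep the sketch minimal.
    return Sample(
        input=record["question"],
        target=str(record["answer"]),
    )


@task
def multimodal_math_freeform() -> Task:
    return Task(
        dataset=hf_dataset(
            path="org/multimodal-math-benchmark",  # placeholder repo id
            name="freeform",                       # placeholder subset name
            split="test",
            sample_fields=record_to_sample,
        ),
        solver=generate(),
        scorer=match(numeric=True),  # numeric answer matching for free-form
    )
```

The MCQ subset would follow the same shape, but with `choices` set on each `Sample`, the `multiple_choice()` solver, and the `choice()` scorer.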

