[EVAL] Long Horizon Execution #1074

akshathmangudi · 2025-11-21T10:57:49Z

I screwed up my previous git clone, so I had to redo the changes 😅

Description:
Approach described within #1056.

Tasks:

Initial scaffolding of /tasks/tasks/long_horizon_execution.py
Implement a custom scorer to parse <answer> tags.
Complete implementation of /tasks/tasks/long_horizon_execution.py
Evaluation and Testing

STATUS: ready for review.

Current behavior:

When we run `lighteval tasks inspect long_horizon_execution`, the output is as shown below:

... more lines
           "'basic', 'alive', 'cream', 'dress', 'black', 'brown', 'drama', "
           "'black', 'audio', 'brown', 'album', 'cover', 'avoid', 'aware', "
           "'event', 'dream', 'clean', 'clock', 'apple', 'above', 'close', "
           "'begin', 'allow', 'album', 'draft', 'brain', 'civil', 'faith', "
           "'death', 'coach', 'below', 'doubt', 'aware', 'cover', 'final', "
           "'allow', 'avoid', 'ahead', 'cross', 'child', 'cream', 'error', "
           "'break', 'brief', 'clock', 'final', 'dance', 'award', 'every', "
           "'chief', 'could', 'dream', 'begin', 'burst', 'audio', 'album', "
           "'cross', 'doubt', 'blood', 'child', 'brand', 'brand', 'extra', "
           "'broad', 'cloud', 'check', 'after', 'chart', 'basic', 'child', "
           "'coach', 'chair', 'faith', 'earth', 'audio', 'basic', 'field', "
           "'cloud', 'draft', 'apply', 'court', 'black', 'ahead', 'burst', "
           "'crowd', 'depth', 'enemy', 'drink', 'first', 'could', 'false', "
           "'could', 'blame', 'first', 'album', 'crowd', 'first', 'broad', "
           "'extra', 'clock', 'chart', 'fiber', 'board', 'earth', 'being', "
           "'alive', 'chart', 'avoid', 'dress', 'cloud', 'clean', 'avoid', "
           "'crash', 'clean', 'arise', 'death', 'brand', 'error']\n"
           '\n'
           'Your task: Calculate the cumulative sum after each key. The first '
           'sum is just the value of the first key. The second sum is the '
           'first value plus the second value, and so on.\n'
           '\n'
           'IMPORTANT:\n'
           '- Output your answer as a single line with comma-separated values '
           'inside <answer></answer> tags\n'
           '- Do not include any other text outside the answer tags\n'
           '- Format: <answer>value1,value2,value3,...</answer>\n'
           '- Example: If the cumulative sums are [5, 8, 12], output: '
           '<answer>5,8,12</answer>\n'
           '\n'
           'Your answer:',
  'sampling_methods': [],
  'specific': None,
  'stop_sequences': (),
  'task_name': 'long_horizon_execution',
  'unconditioned_query': None,
  'use_logits': False}
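
For reviewers who want to sanity-check the format the prompt above asks for, here is a minimal sketch of how the expected cumulative sums and the `<answer>` tag parsing could fit together. The names `expected_cumulative_sums`, `parse_answer`, and the toy dictionary are hypothetical and are not taken from the PR; the actual scorer in this branch may differ.

```python
import re
from itertools import accumulate

# Hypothetical helper names; the PR's scorer may be structured differently.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def expected_cumulative_sums(dictionary, keys):
    """Running total after each key, as described in the prompt."""
    return list(accumulate(dictionary[key] for key in keys))


def parse_answer(completion):
    """Return the integers inside the last <answer> tag, or None if absent or malformed."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    try:
        return [int(part.strip()) for part in matches[-1].split(",")]
    except ValueError:
        return None


# Toy data (an assumption, chosen to reproduce the [5, 8, 12] example in the prompt).
toy_dict = {"apple": 5, "black": 3, "cream": 4}
toy_keys = ["apple", "black", "cream"]

expected = expected_cumulative_sums(toy_dict, toy_keys)
predicted = parse_answer("<answer>5,8,12</answer>")
is_correct = predicted == expected
```

Taking the last `<answer>` match is a design choice in this sketch, so that a model that restates the format before answering is still scored on its final answer.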

akshathmangudi · 2025-11-21T10:59:50Z

cc: @NathanHB

NathanHB · 2025-11-21T12:42:28Z

looking good ! Will run locally and review today or start of next week :)
Can you share a Hugging Face Space with the samples as described here to make it easier to verify ? 🤗

akshathmangudi · 2025-11-22T12:17:27Z

i ran the benchmark on HF Inference's gpt-4o but a lot of the results I am seeing are quite poor. is this expected or something wrong with the prompting that I haven't looked at yet?

https://huggingface.co/spaces/akshathmangudi/lhe-gpt4o-single


NathanHB

Hey ! Thanks for the hard work on this, i'm testing it locally right now. I have some small nits but it's looking almost ready !

src/lighteval/tasks/tasks/long_horizon_execution/__init__.py

src/lighteval/tasks/tasks/long_horizon_execution/single_turn.py

src/lighteval/tasks/tasks/long_horizon_execution/constants.py

src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py

NathanHB

Tested on single turn, working great with the few nits I added above. However, I cannot seem to get the multi-turn to work, can you ping me when it's ready?

akshathmangudi · 2025-11-25T12:50:59Z

@NathanHB it should be working now, ive created a link below that tests both single and multi-turn.

https://huggingface.co/spaces/akshathmangudi/lhe-gpt

NathanHB · 2025-12-04T15:17:48Z

hey @akshathmangudi that's amazing !!
The link seems broken or maybe the dataset is private ? :)

HuggingFaceDocBuilderDev · 2025-12-04T15:20:16Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

akshathmangudi · 2025-12-04T15:21:07Z

sorry! it was private. made it public now :)

NathanHB · 2025-12-04T15:25:07Z

great ! Maybe i'm mistaken but i only see single turn eval ?

NathanHB · 2025-12-09T13:36:09Z

hey @akshathmangudi we are planning a release this week and would love the tasks you started implementing to be in it. I was just wondering if you were planning on finishing those or if i could take over ? Thanks ! 🤗

akshathmangudi · 2025-12-09T13:38:07Z

hey @NathanHB!

sorry, been traveling all week. i'll have some space today and tomorrow, since a lot of the comments are nits and just things i accidentally overlooked (sorry for that), ill get them ready ASAP!

akshathmangudi · 2025-12-09T15:44:33Z

https://huggingface.co/spaces/akshathmangudi/lhe-gpt

ive updated the space to have multi-turn evaluation. please let me know if any changes have to be made 🤗

Copilot

Pull request overview

This PR implements the Long Horizon Execution benchmark for evaluating language models' ability to maintain state and perform cumulative operations over long sequences. The implementation follows a research paper approach with both single-turn (process all keys at once) and multi-turn (incremental key processing) evaluation modes.

Key Changes

Added complete task implementation with support for 7 context sizes (1024-65536) and 3 turn complexities (K=1, 2, 10)
Implemented custom answer tag parsing scorers for extracting <answer> formatted responses
Used binary search optimization to fit maximum items within prompt length constraints
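
As a rough illustration of the binary search mentioned in the last point, the sketch below finds the largest number of keys whose prompt still fits a length budget. It assumes the prompt length grows monotonically with the number of keys. `max_items_that_fit`, `build_prompt`, and `count_tokens` are hypothetical names, and the whitespace-based token count is only a stand-in for a real tokenizer; the logic in `utils.py` may differ.

```python
def max_items_that_fit(keys, build_prompt, count_tokens, budget):
    """Largest n such that the prompt built from keys[:n] has at most `budget` tokens.

    Assumes token count is non-decreasing in n. Returns 0 if even one key does not fit.
    """
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if count_tokens(build_prompt(keys[:mid])) <= budget:
            lo = mid  # mid keys fit, try more
        else:
            hi = mid - 1  # mid keys overflow, try fewer
    return lo


# Toy stand-ins (assumptions): a one-word header plus one word per key,
# and whitespace splitting instead of a real tokenizer.
def toy_build_prompt(selected_keys):
    return "header " + " ".join(selected_keys)


def toy_count_tokens(text):
    return len(text.split())


toy_keys = ["k"] * 10
fits_small = max_items_that_fit(toy_keys, toy_build_prompt, toy_count_tokens, budget=6)
fits_large = max_items_that_fit(toy_keys, toy_build_prompt, toy_count_tokens, budget=100)
```

This takes O(log n) prompt builds instead of one per candidate length, which matters when each build is followed by a tokenizer call.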

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `src/lighteval/tasks/tasks/long_horizon_execution/constants.py` | Defines prompt templates and configuration constants for context sizes and turn complexities |
| `src/lighteval/tasks/tasks/long_horizon_execution/utils.py` | Implements binary search logic and prompt building functions for both single and multi-turn modes |
| `src/lighteval/tasks/tasks/long_horizon_execution/main.py` | Provides single-turn task implementation with scorer and creates task configurations |
| `src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py` | Implements multi-turn evaluation with conversation state tracking and fractional accuracy scoring |

Comments suppressed due to low confidence (2)

src/lighteval/tasks/tasks/long_horizon_execution/utils.py:130

Surplus named argument for string format. An argument named 'num_keys' is provided, but it is not required by the format string:

You are an AI assistant. I will provide you with a dictionary and then give you keys in groups of {k}.
Your task is to keep a running total (starting from 0) by adding the values associated with the keys I provide.
In each turn, I'll provide {k} keys (comma-separated).
Respond with the current running sum, enclosed in tags.

Dictionary to maintain:
{dict_str}

Ready to start!
User: {keys_str}
Assistant:

        return PROMPT_TEMPLATE_MULTI_START.format(
            dict_str=dict_str, keys_str=keys_str, k=k, num_keys=len(first_turn_keys)
        )

src/lighteval/tasks/tasks/long_horizon_execution/utils.py:194

Surplus named argument for string format. An argument named 'num_keys' is provided, but it is not required by the format string:

You are an AI assistant. I will provide you with a dictionary and then give you keys in groups of {k}.
Your task is to keep a running total (starting from 0) by adding the values associated with the keys I provide.
In each turn, I'll provide {k} keys (comma-separated).
Respond with the current running sum, enclosed in tags.

Dictionary to maintain:
{dict_str}

Ready to start!
User: {keys_str}
Assistant:

    initial_prompt = PROMPT_TEMPLATE_MULTI_START.format(
        dict_str=dict_str, keys_str=first_turn_keys_str, k=k, num_keys=len(turn_chunks[0])
    )


Copilot · 2025-12-09T15:51:07Z

src/lighteval/tasks/tasks/long_horizon_execution/main.py

+from lighteval.tasks.tasks.long_horizon_execution.constants import CONTEXT_SIZES
+from lighteval.tasks.tasks.long_horizon_execution.multi_turn import create_multi_turn_tasks
+from lighteval.tasks.tasks.long_horizon_execution.utils import _build_prompt_and_target
+
+
+# Single-turn prompt template
+PROMPT_TEMPLATE_SINGLE = """You are an AI assistant. I will provide you with a dictionary and then give you a list of keys.
+Your task is to calculate the final cumulative sum after processing all keys in order.
+
+For each key in the list, you need to:
+1. Look up the value in the dictionary
+2. Add it to the running sum
+3. After processing all keys, output the final cumulative sum
+
+Dictionary to use:
+{dict_str}
+
+Keys to process in order:
+{keys_str}
+
+Your task: Process all keys in order and calculate the final cumulative sum after processing all {num_keys} keys.
+
+IMPORTANT:
+- Output your answer as a single integer value inside <answer></answer> tags
+- Do not include any other text outside the answer tags
+- Format: <answer>final_sum</answer>
+- Example: If the final cumulative sum is 42, output: <answer>42</answer>
+
+Your answer:"""
+
+


PROMPT_TEMPLATE_SINGLE is duplicated in both main.py and constants.py. Since constants.py already defines this template, remove the duplicate definition here and use the imported constant from constants.py instead. This reduces code duplication and ensures consistency.

Suggested change

-from lighteval.tasks.tasks.long_horizon_execution.constants import CONTEXT_SIZES
-from lighteval.tasks.tasks.long_horizon_execution.multi_turn import create_multi_turn_tasks
-from lighteval.tasks.tasks.long_horizon_execution.utils import _build_prompt_and_target
-
-
-# Single-turn prompt template
-PROMPT_TEMPLATE_SINGLE = """You are an AI assistant. I will provide you with a dictionary and then give you a list of keys.
-Your task is to calculate the final cumulative sum after processing all keys in order.
-
-For each key in the list, you need to:
-1. Look up the value in the dictionary
-2. Add it to the running sum
-3. After processing all keys, output the final cumulative sum
-
-Dictionary to use:
-{dict_str}
-
-Keys to process in order:
-{keys_str}
-
-Your task: Process all keys in order and calculate the final cumulative sum after processing all {num_keys} keys.
-
-IMPORTANT:
-- Output your answer as a single integer value inside <answer></answer> tags
-- Do not include any other text outside the answer tags
-- Format: <answer>final_sum</answer>
-- Example: If the final cumulative sum is 42, output: <answer>42</answer>
-
-Your answer:"""
+from lighteval.tasks.tasks.long_horizon_execution.constants import CONTEXT_SIZES, PROMPT_TEMPLATE_SINGLE
+from lighteval.tasks.tasks.long_horizon_execution.multi_turn import create_multi_turn_tasks
+from lighteval.tasks.tasks.long_horizon_execution.utils import _build_prompt_and_target
+# Single-turn prompt template

Copilot · 2025-12-09T15:51:08Z

src/lighteval/tasks/tasks/long_horizon_execution/constants.py

+
+PROMPT_TEMPLATE_MULTI_START = """You are an AI assistant. I will provide you with a dictionary and then give you keys in groups of {k}.
+Your task is to keep a running total (starting from 0) by adding the values associated with the keys I provide.
+In each turn, I'll provide {k} keys (comma-separated).


Grammar issue when k=1: "I'll provide 1 keys (comma-separated)" is incorrect. Consider using conditional text: "I'll provide {k} key{'s' if k > 1 else ''}" or rephrasing to "I'll provide {k} key(s)" to handle both singular and plural cases correctly.

Suggested change

-In each turn, I'll provide {k} keys (comma-separated).
+In each turn, I'll provide {k} key(s) (comma-separated).
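
One caveat on the first option in the comment above: an inline conditional such as `{'s' if k > 1 else ''}` is only evaluated inside an f-string. With `str.format()` on a module-level template, a replacement field is looked up by name and is not evaluated as an expression. A possible alternative, sketched here with a hypothetical `{key_word}` placeholder that is not in the PR's template, is to choose the noun before formatting:

```python
# Hypothetical one-line template; the real PROMPT_TEMPLATE_MULTI_START is longer
# and does not contain a {key_word} placeholder.
TURN_INSTRUCTION = "In each turn, I'll provide {k} {key_word} (comma-separated)."


def render_turn_instruction(k):
    """Pick the singular or plural noun first, then fill the template."""
    key_word = "key" if k == 1 else "keys"
    return TURN_INSTRUCTION.format(k=k, key_word=key_word)


singular = render_turn_instruction(1)
plural = render_turn_instruction(10)
```

The simpler `key(s)` wording from the suggested change avoids the extra placeholder entirely, at the cost of slightly less natural text in the prompt.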

Copilot · 2025-12-09T15:51:08Z

src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py

+            value={
+                "turn_accuracy": fractional_accuracy,
+                "fractional_accuracy": fractional_accuracy,
+                "correct_turns": correct_turns,
+                "total_turns": len(expected_per_turn),
+            },


The Score value returns a dictionary with multiple metrics (lines 180-185), which is inconsistent with other scorers in the codebase that return simple values like "C"/"I" or floats. The @scorer decorator on line 132 uses a dictionary for metrics, but the typical pattern shows scorers returning a single value. Consider either: (1) returning just the fractional_accuracy as the value, or (2) verifying this dictionary pattern is supported by inspect_ai's scorer API.

Suggested change

-            value={
-                "turn_accuracy": fractional_accuracy,
-                "fractional_accuracy": fractional_accuracy,
-                "correct_turns": correct_turns,
-                "total_turns": len(expected_per_turn),
-            },
+            value=fractional_accuracy,
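
For context on what the `fractional_accuracy` value represents, here is a framework-free sketch of the per-turn bookkeeping: keys are split into groups of `k`, the expected running sum is recorded after each group, and the score is the fraction of turns whose predicted sum matches. The function names are hypothetical and the code deliberately avoids the inspect_ai API, so it says nothing about whether a dictionary-valued score is supported there.

```python
def chunk_keys(keys, k):
    """Split keys into consecutive groups of k (the last group may be shorter)."""
    return [keys[i:i + k] for i in range(0, len(keys), k)]


def expected_running_sums(dictionary, turn_chunks):
    """Running total after each turn, starting from 0."""
    total = 0
    sums = []
    for chunk in turn_chunks:
        total += sum(dictionary[key] for key in chunk)
        sums.append(total)
    return sums


def fractional_accuracy(predicted_per_turn, expected_per_turn):
    """Correct turns divided by total turns; missing predictions count as wrong."""
    if not expected_per_turn:
        return 0.0
    correct_turns = sum(
        1 for pred, exp in zip(predicted_per_turn, expected_per_turn) if pred == exp
    )
    return correct_turns / len(expected_per_turn)


# Toy data (an assumption for illustration only).
toy_dict = {"a": 1, "b": 2, "c": 3}
toy_chunks = chunk_keys(["a", "b", "c", "a", "b"], k=2)
toy_expected = expected_running_sums(toy_dict, toy_chunks)
toy_score = fractional_accuracy([3, 8, 9], toy_expected)
```

Note that in this sketch a wrong turn does not automatically invalidate later turns: each turn is compared against the true running sum, so a model that recovers is credited for the recovered turns.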

Copilot · 2025-12-09T15:51:09Z

src/lighteval/tasks/tasks/long_horizon_execution/constants.py

+**User**: {keys_str}
+**Assistant**:"""
+
+PROMPT_TEMPLATE_MULTI_FOLLOWUP = """Here are the next keys to process:
+**User**: {keys_str}
+**Assistant**:"""


[nitpick] The template uses **User**: and **Assistant**: formatting which appears to be markdown bold syntax. If this is intentional for the model to understand turn structure, it's fine. However, if these are meant to be actual speaker labels, consider using plain text like "User:" or using the chat message structure properly instead of embedding these labels in the template text.

Suggested change

-**User**: {keys_str}
-**Assistant**:"""
-
-PROMPT_TEMPLATE_MULTI_FOLLOWUP = """Here are the next keys to process:
-**User**: {keys_str}
-**Assistant**:"""
+User: {keys_str}
+Assistant:"""
+
+PROMPT_TEMPLATE_MULTI_FOLLOWUP = """Here are the next keys to process:
+User: {keys_str}
+Assistant:"""

Copilot · 2025-12-09T15:51:09Z

src/lighteval/tasks/tasks/long_horizon_execution/utils.py

+        keys_str = ", ".join(first_turn_keys)
+
+        return PROMPT_TEMPLATE_MULTI_START.format(
+            dict_str=dict_str, keys_str=keys_str, k=k, num_keys=len(first_turn_keys)


The num_keys parameter is passed to PROMPT_TEMPLATE_MULTI_START.format() but the template in constants.py (lines 29-39) doesn't include a {num_keys} placeholder. str.format() ignores surplus keyword arguments, so this does not raise an error, but the argument is unused and misleading. Either remove this parameter from the format call or add {num_keys} to the template if it's needed.

Suggested change

-            dict_str=dict_str, keys_str=keys_str, k=k, num_keys=len(first_turn_keys)
+            dict_str=dict_str, keys_str=keys_str, k=k
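
A quick check of how `str.format()` treats keyword arguments that have no matching placeholder, using a shortened stand-in template (an assumption, not the real `PROMPT_TEMPLATE_MULTI_START`). In CPython the surplus argument is ignored, while a placeholder with no matching argument raises `KeyError`, so the cleanup is about dead code and not about a crash:

```python
# Shortened stand-in for the real template (an assumption for illustration).
TEMPLATE = "Keys in groups of {k}: {keys_str}"

# Surplus keyword argument: accepted and ignored.
rendered = TEMPLATE.format(k=2, keys_str="apple, black", num_keys=2)

# Missing keyword argument: this is the case that actually fails.
try:
    "Total keys: {num_keys}".format(k=2)
    missing_raises = False
except KeyError:
    missing_raises = True
```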

Copilot · 2025-12-09T15:51:10Z

src/lighteval/tasks/tasks/long_horizon_execution/utils.py

+
+    first_turn_keys_str = ", ".join(turn_chunks[0])
+    initial_prompt = PROMPT_TEMPLATE_MULTI_START.format(
+        dict_str=dict_str, keys_str=first_turn_keys_str, k=k, num_keys=len(turn_chunks[0])


The num_keys parameter is passed to PROMPT_TEMPLATE_MULTI_START.format() but the template in constants.py (lines 29-39) doesn't include a {num_keys} placeholder. str.format() ignores surplus keyword arguments, so this does not raise an error, but the argument is unused and misleading. Either remove this parameter from the format call or add {num_keys} to the template if it's needed.

Suggested change

-        dict_str=dict_str, keys_str=first_turn_keys_str, k=k, num_keys=len(turn_chunks[0])
+        dict_str=dict_str, keys_str=first_turn_keys_str, k=k

Copilot · 2025-12-09T15:51:10Z

src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py

+async def _process_single_turn(state, turn_chunk, generate):
+    """Process a single turn: add user message, get model response, add assistant message."""
+    keys_str = ", ".join(turn_chunk)
+    followup_prompt = PROMPT_TEMPLATE_MULTI_FOLLOWUP.format(keys_str=keys_str)
+    state.messages.append(ChatMessageUser(content=followup_prompt))
+
+    # generate() takes the state and returns updated state with assistant message added
+    updated_state = await generate(state)


[nitpick] The parameter name generate in _process_single_turn shadows the generate function imported from inspect_ai.solver (line 14). This is likely intentional since the function receives generate as a parameter from the solver, but it could be clearer to name it generate_fn or generator to avoid confusion with the imported function.

Suggested change

-async def _process_single_turn(state, turn_chunk, generate):
-    """Process a single turn: add user message, get model response, add assistant message."""
-    keys_str = ", ".join(turn_chunk)
-    followup_prompt = PROMPT_TEMPLATE_MULTI_FOLLOWUP.format(keys_str=keys_str)
-    state.messages.append(ChatMessageUser(content=followup_prompt))
-
-    # generate() takes the state and returns updated state with assistant message added
-    updated_state = await generate(state)
+async def _process_single_turn(state, turn_chunk, generate_fn):
+    """Process a single turn: add user message, get model response, add assistant message."""
+    keys_str = ", ".join(turn_chunk)
+    followup_prompt = PROMPT_TEMPLATE_MULTI_FOLLOWUP.format(keys_str=keys_str)
+    state.messages.append(ChatMessageUser(content=followup_prompt))
+
+    # generate_fn() takes the state and returns updated state with assistant message added
+    updated_state = await generate_fn(state)

Copilot · 2025-12-09T15:51:10Z

src/lighteval/tasks/tasks/long_horizon_execution/utils.py

+    # Use the maximum n that fits
+    input_keys = input_keys[:max_n]
+    input_values = input_values[:max_n]
+    expected_output = expected_output[:max_n]


Variable expected_output is not used.

Suggested change

-    expected_output = expected_output[:max_n]

akshathmangudi · 2025-12-09T15:57:10Z

it seems there are a few valid nits that copilot has raised, will be fixing them in a few hours

ready for review

cef0b0f

akshathmangudi mentioned this pull request Nov 21, 2025

[EVAL] Long Horizon Execution #1072

Closed

4 tasks

akshathmangudi marked this pull request as ready for review November 21, 2025 10:59

Merge branch 'main' into akshath/issue-1056-v2

fdc9288

akshathmangudi added 2 commits November 22, 2025 17:47

some fixes

2c0ceae

Merge branch 'akshath/issue-1056-v2' of github.com:akshathmangudi/lig…

3d8ac1b

…hteval into akshath/issue-1056-v2