This repository was archived by the owner on Nov 19, 2025. It is now read-only.

serve_reward_model goes down #351

@AtsunoriFujita

Describe the bug

When we start serve_reward_model.py and run annotation, the server goes down during processing. It crashes on specific samples, all of which have a long context.

error.log

What we did

  • We built from source, but the issue persists.
  • We also tried nvidia/Llama2-13B-SteerLM-RM, but ran into the same issue.
  • It runs without issue on nvcr.io/nvidia/nemo:24.05.01 (the critic speedup in #219 is the main difference).
  • The estimated processing time has also increased from 2 hours (nvcr.io/nvidia/nemo:24.05.01) to 7 hours (nvcr.io/nvidia/nemo:24.07).

Steps/Code to reproduce bug

export HYDRA_FULL_ERROR=1
export MODEL="/workspace/models/Llama3-70B-SteerLM-RM"

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${MODEL} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.inference_micro_batch_size=2 \
    inference.port=1424

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
    --input-file=data/oasst/train.jsonl \
    --output-file=data/oasst/train_labeled.jsonl \
    --port=1424

Before running attribute_annotate.py, apply #350.
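
Since the crash only shows up on long-context samples, it may help to shrink the reproduction to just those records. The following is a minimal sketch, not part of NeMo-Aligner: `longest_records` and `write_subset` are hypothetical helper names, and raw character length per JSONL line is used only as a rough proxy for token count, so no assumption is made about the record schema.

```python
import json


def longest_records(jsonl_path, top_k=5):
    """Return (line_number, char_length) pairs for the top_k longest records."""
    sizes = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            json.loads(line)  # fail early if a record is not valid JSON
            sizes.append((line_number, len(line)))
    # Longest first; character count is only a proxy for context length.
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top_k]


def write_subset(jsonl_path, line_numbers, out_path):
    """Copy only the selected 1-indexed lines into a smaller JSONL file."""
    wanted = set(line_numbers)
    with open(jsonl_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line_number, line in enumerate(src, start=1):
            if line_number in wanted:
                dst.write(line if line.endswith("\n") else line + "\n")
```

For example, `longest_records("data/oasst/train.jsonl", top_k=20)` followed by `write_subset(...)` with the returned line numbers would produce a small file that can be passed to attribute_annotate.py via `--input-file`, which should make it quicker to check whether the longest samples alone bring the server down.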

Expected behavior

The process is completed without the server going down.

Environment overview (please complete the following information)

  • DGX-C A100 * 8
  • nvcr.io/nvidia/nemo:24.07

Labels: bug
