serve_reward_model goes down

**Describe the bug**

When we start `serve_reward_model.py` and run annotation, the server goes down partway through processing. It crashes on specific samples, all of which have a long context.

[error.log](https://github.com/user-attachments/files/17437822/error.log)

**What we did**

- We built from source, but that did not resolve the issue.
- We also tried `nvidia/Llama2-13B-SteerLM-RM`, but ran into the same issue.
- It runs without issue on `nvcr.io/nvidia/nemo:24.05.01` (the main difference is #219).
- The estimated processing time has also increased from 2 hours (`nvcr.io/nvidia/nemo:24.05.01`) to 7 hours (`nvcr.io/nvidia/nemo:24.07`).

**Steps/Code to reproduce bug**

```
export HYDRA_FULL_ERROR=1
export MODEL="/workspace/models/Llama3-70B-SteerLM-RM"

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${MODEL} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.inference_micro_batch_size=2 \
    inference.port=1424

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
      --input-file=data/oasst/train.jsonl \
      --output-file=data/oasst/train_labeled.jsonl \
      --port=1424
```

Before running `attribute_annotate.py`, apply #350.
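Since the crash appears only on long-context samples, ranking the input by record length may help narrow down a minimal reproduction. The following is a rough sketch, not part of NeMo-Aligner: `longest_records` is a hypothetical helper, it assumes only that the input is a JSON Lines file, and it uses the raw character count of each line as a proxy for context length (this is not a token count).

```python
import heapq
import json


def longest_records(path, top_k=5):
    """Return up to top_k (char_count, line_number) pairs, longest first.

    Hypothetical helper: character count of the raw JSON line is used
    as a rough stand-in for context length, not the tokenized length.
    """
    sizes = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            json.loads(line)  # raises ValueError if the line is not valid JSON
            sizes.append((len(line), line_number))
    return heapq.nlargest(top_k, sizes)
```

For example, `longest_records("data/oasst/train.jsonl", top_k=10)` lists the ten longest records by line number, which can then be extracted into a smaller input file for `attribute_annotate.py`.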

**Expected behavior**

The annotation run completes without the server going down.

**Environment overview (please complete the following information)**

 - DGX-C A100 * 8
 - `nvcr.io/nvidia/nemo:24.07`

serve_reward_model goes down #351

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

serve_reward_model goes down #351

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions