This repository was archived by the owner on Nov 19, 2025. It is now read-only.
serve_reward_model goes down #351
Open
Labels
bug: Something isn't working
Description
Describe the bug
When we start serve_reward_model.py and run annotation, the server goes down during processing. It crashes on specific samples, and these samples have a long context.
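To check which samples are the long-context ones, a small script along these lines could rank the records in the input file by size. This is only a sketch: it assumes the file is JSON Lines, it makes no assumption about field names (it sums the length of every string value in a record), and it uses character count as a rough proxy for token count. The helper names `record_length` and `longest_records` are hypothetical and not part of NeMo-Aligner.

```python
import json
from pathlib import Path


def record_length(line: str) -> int:
    """Return the total character count of all string values in one JSON line."""
    def walk(value):
        if isinstance(value, str):
            return len(value)
        if isinstance(value, dict):
            return sum(walk(v) for v in value.values())
        if isinstance(value, list):
            return sum(walk(v) for v in value)
        return 0  # numbers, booleans, and null do not add to the length

    return walk(json.loads(line))


def longest_records(path, top_k=10):
    """Return (line_number, length) pairs for the top_k longest records."""
    lengths = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if line.strip():  # skip blank lines
                lengths.append((line_number, record_length(line)))
    return sorted(lengths, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Path taken from the reproduction steps; adjust as needed.
    path = Path("data/oasst/train.jsonl")
    if path.exists():
        for line_number, length in longest_records(path):
            print(f"line {line_number}: {length} characters")
```

Comparing the reported line numbers against the samples being processed when the server goes down would show whether the failures line up with the longest records.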
What we did
- We built the source, but the issue has not been solved.
- We also tried nvidia/Llama2-13B-SteerLM-RM, but ran into the same issue.
- It runs without an issue on nvcr.io/nvidia/nemo:24.05.01 (critic speedup #219 is the main difference).
- The estimated processing time has also increased from 2 hours (nvcr.io/nvidia/nemo:24.05.01) to 7 hours (nvcr.io/nvidia/nemo:24.07).
Steps/Code to reproduce bug
export HYDRA_FULL_ERROR=1
export MODEL="/workspace/models/Llama3-70B-SteerLM-RM"
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst
python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${MODEL} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.inference_micro_batch_size=2 \
    inference.port=1424
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
    --input-file=data/oasst/train.jsonl \
    --output-file=data/oasst/train_labeled.jsonl \
    --port=1424
Before running attribute_annotate.py, apply #350.
Expected behavior
The process is completed without the server going down.
Environment overview (please complete the following information)
- DGX-C, A100 * 8
- nvcr.io/nvidia/nemo:24.07