llama-70b SFT OSError: [Errno 5] Input/output error #502
Labels: bug
Description
Describe the bug
Any potential directions on what could cause this error would be appreciated.
OSError: [Errno 5] Input/output error
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 173, in main
sft_trainer.fit()
File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 249, in fit
self.save(metrics, is_train_end=is_train_end)
File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 269, in save
self.ckpt_callback.custom_save(monitor_candidates=monitor_candidates, is_train_end=is_train_end)
File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 73, in custom_save_ckpt_func
super(NeMoModelCheckpoint, self)._save_last_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 696, in _save_last_checkpoint
self._save_checkpoint(trainer, filepath)
File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 544, in _save_checkpoint
trainer.save_checkpoint(filepath, self.save_weights_only, storage_options=storage_options)
File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1365, in save_checkpoint
self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 416, in save_checkpoint
self.checkpoint_io.save_checkpoint(checkpoint, ckpt_to_dir(filepath), storage_options=storage_options)
File "/usr/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py", line 271, in save_checkpoint
return dist_checkpointing.save(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 395, in save
sharded_strategy.save(sharded_state_dict, checkpoint_dir)
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/base.py", line 227, in save
async_calls.maybe_finalize_async_calls(blocking=True)
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/async_utils.py", line 209, in maybe_finalize_async_calls
finalize_fn()
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/torch.py", line 706, in finalize_fn
save_state_dict_async_finalize(*save_state_dict_ret)
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/state_dict_saver.py", line 144, in save_state_dict_async_finalize
write_results = storage_writer.retrieve_write_results()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/filesystem_async.py", line 328, in retrieve_write_results
raise RuntimeError(f'Worker failure: {write_results_or_exc}') from write_results_or_exc
RuntimeError: Worker failure: [Errno 5] Input/output error
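The [Errno 5] is raised by the filesystem worker in filesystem_async.py while flushing shards of the distributed checkpoint, so it usually points at the storage layer (a full or unhealthy shared filesystem) rather than at the training code itself. A minimal sketch, assuming the checkpoint target is the RESULTS_DIR used in the launch command below, to try to reproduce the raw write failure outside of NeMo:

import os

# Hypothetical probe (not from the original report): write a large file to the
# same directory the checkpoint shards go to and fsync it, to see whether the
# filesystem itself returns EIO. RESULTS_DIR is an assumed environment variable
# matching the launch script.
results_dir = os.environ.get("RESULTS_DIR", "./results")
test_path = os.path.join(results_dir, "io_probe.bin")

chunk = b"\0" * (64 * 1024 * 1024)  # 64 MiB per write
try:
    with open(test_path, "wb") as f:
        for _ in range(16):          # ~1 GiB total, roughly one shard's worth
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())         # force the data through to the storage layer
    print("write + fsync succeeded")
except OSError as e:
    print(f"filesystem error: {e}")  # an [Errno 5] here implicates storage, not NeMo
finally:
    if os.path.exists(test_path):
        os.remove(test_path)

If this probe also fails with Errno 5 on any node, the issue is in the mount or the underlying disk rather than in the checkpoint saving code.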
Steps/Code to reproduce bug
torchrun \
--nnodes ${WORLD_SIZE} \
--nproc_per_node=${GPU_NUM} \
--node_rank ${RANK} \
--master_addr ${MASTER_ADDR} \
--master_port ${MASTER_PORT} \
${GPFS}/examples/nlp/gpt/train_gpt_sft.py \
trainer.precision=bf16 \
trainer.num_nodes=${WORLD_SIZE} \
trainer.devices=${GPU_NUM} \
trainer.sft.max_steps=-1 \
trainer.sft.max_epochs=2 \
trainer.sft.limit_val_batches=40 \
trainer.sft.val_check_interval=$SAVE_STEPS \
model.megatron_amp_O2=True \
model.restore_from_path=${PRETRAINED_ACTOR_NEMO_FILE} \
model.tensor_model_parallel_size=${TP_SIZE} \
model.pipeline_model_parallel_size=${PP_SIZE} \
model.sequence_parallel=False \
model.encoder_seq_length=${SEQUENCE_LENGTH} \
model.optim.lr=${LEARNING_RATE} \
model.answer_only_loss=True \
model.data.num_workers=0 \
model.use_flash_attention=True \
++model.data.train_ds.packed_sequence=True \
++model.data.train_ds.file_path=${TRAIN_DATA_PATH} \
model.data.train_ds.max_seq_length=${SEQUENCE_LENGTH} \
model.data.train_ds.micro_batch_size=1 \
model.data.train_ds.global_batch_size=${GLOBAL_BATCH_SIZE} \
++model.data.validation_ds.packed_sequence=True \
++model.data.validation_ds.file_path=${VALID_DATA_PATH} \
model.data.validation_ds.max_seq_length=${SEQUENCE_LENGTH} \
model.data.validation_ds.micro_batch_size=1 \
model.data.validation_ds.global_batch_size=${GLOBAL_BATCH_SIZE} \
exp_manager.create_wandb_logger=False \
exp_manager.explicit_log_dir=${RESULTS_DIR} \
exp_manager.wandb_logger_kwargs.project=sft_run_instruct_data \
exp_manager.wandb_logger_kwargs.name=${RUN_NAME} \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
++exp_manager.checkpoint_callback_params.always_save_nemo=True \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=validation_loss
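Not part of the original reproduction, but a preflight sketch that could be run on every node before torchrun to rule out the two most common causes of checkpoint-time EIO: a read-only or unhealthy mount and a full volume. RESULTS_DIR is the same variable as in the launch command; the 500 GiB headroom for 70B checkpoint shards is an assumption.

import os
import shutil
import socket

# Hypothetical preflight check: verify this node can create files under
# RESULTS_DIR and that enough free space remains for the checkpoint shards.
results_dir = os.environ["RESULTS_DIR"]   # same env var as the launch script
min_free_gib = 500                        # assumed headroom for llama-70b shards

os.makedirs(results_dir, exist_ok=True)
probe = os.path.join(results_dir, f".write_probe_{socket.gethostname()}")
with open(probe, "w") as f:               # raises OSError if the mount is unhealthy
    f.write("ok")
os.remove(probe)

free_gib = shutil.disk_usage(results_dir).free / 2**30
if free_gib < min_free_gib:
    raise SystemExit(f"{socket.gethostname()}: only {free_gib:.0f} GiB free in {results_dir}")
print(f"{socket.gethostname()}: write probe ok, {free_gib:.0f} GiB free")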
Environment details
- Docker container: registry.zoomdev.us/languagetech/nemo:24.12.01
- OS version: Linux
- PyTorch version: 2.5.0a0+e000cf0ad9.nv24.10
- Python version: 3.10.12