Crash on empty whisper segment

Overview
========

Subaligner as of https://github.com/baxtree/subaligner/commit/fc18afa823dc5637eeee0103c8a1569badf53c74 (current head of master) crashes when using Whisper to transcribe audio if Whisper outputs an empty segment. The crash output looks like:

```
subaligner.transcriber - INFO - MainThread - Finished transcribing the audio
ERROR: list index out of range
  File "subaligner/.venv/bin/subaligner", line 10, in <module>
    sys.exit(main())
  File "subaligner/.venv/lib/python3.11/site-packages/subaligner/__main__.py", line 438, in main
    "ERROR: {}\n{}".format(str(e), "".join(traceback.format_stack()) if FLAGS.debug else "")
  File "subaligner/.venv/lib/python3.11/site-packages/subaligner/__main__.py", line 377, in main
    subtitle, frame_rate = transcriber.transcribe(video_file_path=local_video_path,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "subaligner/.venv/lib/python3.11/site-packages/subaligner/transcriber.py", line 106, in transcribe
    f"{Utils.format_timestamp(segment['words'][0]['start'])} --> {Utils.format_timestamp(segment['words'][-1]['end'])}\n" \
                              ~~~~~~~~~~~~~~~~^^^
subaligner.transcriber - INFO - MainThread - process shutting down
subaligner.transcriber - DEBUG - MainThread - running all "atexit" finalizers with priority >= 0
subaligner.transcriber - DEBUG - MainThread - running the remaining "atexit" finalizers
```

Reproduction
===========

I think it's probably input-file dependent and my file is large and private, but here's the command-line invocation that crashed:

`CUDA_VISIBLE_DEVICES= subaligner -m transcribe -v '/tmp/myfile.webm' -ml deu -mr whisper -mf large-v3 -o '/tmp/myfile.de.srt' --debug`

Analysis
=======

I did some debugging to print out the `result` from https://github.com/baxtree/subaligner/blob/fc18afa823dc5637eeee0103c8a1569badf53c74/subaligner/transcriber.py#L84 and found a segment that looks like this (in and amongst other more conventional segments that have transcribed content in the `text` field and a list of `words`):

```
{
    'id': 70,
    'seek': 14668,
    'start': 163.36,
    'end': 163.36,
    'text': '',
    'tokens': [],
    'temperature': 0.0,
    'avg_logprob': -0.785851426011934,
    'compression_ratio': 1.8442211055276383,
    'no_speech_prob': 0.034043002873659134,
    'words': []
}
```

The crash is happening in https://github.com/baxtree/subaligner/blob/fc18afa823dc5637eeee0103c8a1569badf53c74/subaligner/transcriber.py#L106 because `words` is an empty array.

Possible Fix
=========

In my local install , I patched up the conditions in the for-loop that processes segments to skip over empty segments like so:

```
                for segment in result["segments"]:
                    if max_char_length is not None and len(segment["text"]) > max_char_length:
                        srt_str, srt_idx = self._chunk_segment(segment, srt_str, srt_idx, max_char_length)
                    elif with_word_time_codes:
                        for word in segment["words"]:
                            srt_str += f"{srt_idx}\n" \
                                        f"{Utils.format_timestamp(word['start'])} --> {Utils.format_timestamp(word['end'])}\n" \
                                        f"{word['word'].strip().replace('-->', '->')}\n" \
                                        "\n"
                            srt_idx += 1
                    elif segment["words"]:
                        srt_str += f"{srt_idx}\n" \
                                    f"{Utils.format_timestamp(segment['words'][0]['start'])} --> {Utils.format_timestamp(segment['words'][-1]['end'])}\n" \
                                    f"{segment['text'].strip().replace('-->', '->')}\n" \
                                    "\n"
                        srt_idx += 1
```

I don't know much about whisper's output format and I can imagine lots of ways this could still bail ungracefully... but handling empty segments in some fashion seems necessary for at least some input files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Crash on empty whisper segment #101

Overview

Reproduction

Analysis

Possible Fix

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Crash on empty whisper segment #101

Description

Overview

Reproduction

Analysis

Possible Fix

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions