Description
An internal CUDA function can return CUDA_ERROR_ILLEGAL_ADDRESS during Nvidia transcoding. This means that the process is in an inconsistent state and needs to be restarted. The original context in which we encountered this issue is documented in livepeer/go-livepeer#1921. In #267 we implemented a panic whenever an unrecoverable error is encountered. In livepeer/go-livepeer#2057 we bumped the LPMS version to include this update, and then in livepeer/go-livepeer#2094 and livepeer/go-livepeer#2352 we moved the unrecoverable error check into go-livepeer.
The problem is that LPMS marks any unknown error (indicated by AVERROR_UNKNOWN) as unrecoverable. As a result, some CUDA errors that do not warrant a process restart are also marked as unrecoverable, and go-livepeer panics for those errors.
For example, a CUDA OOM error is also treated as an unknown error by the libav code:
[AVHWDeviceContext @ 0x7f43ec093800] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
ERROR: decoder.c:251] Unable to open hardware context for decoding : Unknown error occurred
ERROR: decoder.c:285] Unable to open video decoder : Error number -1448234581 occurred
E0111 23:30:50.790498 1 ffmpeg.go:503] Transcoder Return : Unrecoverable state, restart process
panic: Unrecoverable state, restart process
goroutine 6108 [running]:
github.com/livepeer/lpms/ffmpeg.(*Transcoder).Transcode(0xc013aa0ec0, 0xc001f6de50, 0xc00056c3c0, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/livepeer/lpms@v0.0.0-20211022165630-1d91ede415fa/ffmpeg/ffmpeg.go:505 +0x25a7
github.com/livepeer/go-livepeer/core.(*NvidiaTranscoder).Transcode(0xc013aa0ee0, 0xc0001db4a0, 0x2, 0x1, 0xc0001db401)
/build/core/transcoder.go:88 +0x1ce
github.com/livepeer/go-livepeer/core.(*transcoderSession).loop(0xc012e54d00)
/build/core/lb.go:186 +0x184
github.com/livepeer/go-livepeer/core.(*LoadBalancingTranscoder).createSession.func2(0xc012e54d00, 0xc012e54d80)
/build/core/lb.go:127 +0x2b
created by github.com/livepeer/go-livepeer/core.(*LoadBalancingTranscoder).createSession
/build/core/lb.go:126 +0x61c
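The error number in the log above can be decoded by hand, which helps when checking which code path produced it. FFmpeg builds its tagged error codes with the FFERRTAG macro in libavutil/error.h: four ASCII bytes packed little-endian into an integer, then negated. The sketch below reimplements that packing in Go; the function names `ffErrTag` and `decodeErrTag` are illustrative and do not exist in LPMS or FFmpeg. Decoding -1448234581 gives the bytes `UNRV`, which suggests it is an LPMS-side "unrecoverable" tag rather than AVERROR_UNKNOWN itself (that reading is an inference from the value, not something stated in the log).

```go
package main

import "fmt"

// ffErrTag mirrors FFmpeg's FFERRTAG macro: the four bytes are packed
// little-endian into a 32-bit value, which is then negated.
// The name is illustrative; it is not part of LPMS or FFmpeg.
func ffErrTag(a, b, c, d byte) int32 {
	return -int32(uint32(a) | uint32(b)<<8 | uint32(c)<<16 | uint32(d)<<24)
}

// decodeErrTag reverses ffErrTag and returns the four tag bytes as a string.
// It is only meaningful for negative codes that were built from a tag.
func decodeErrTag(code int32) string {
	v := uint32(-code)
	return string([]byte{byte(v), byte(v >> 8), byte(v >> 16), byte(v >> 24)})
}

func main() {
	// AVERROR_UNKNOWN is defined as FFERRTAG('U','N','K','N').
	fmt.Println(ffErrTag('U', 'N', 'K', 'N')) // -1313558101

	// The error number printed by decoder.c:285 in the log.
	fmt.Println(decodeErrTag(-1448234581)) // UNRV
}
```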
We should mark only CUDA_ERROR_ILLEGAL_ADDRESS errors as unrecoverable, so that go-livepeer panics only for those errors.