
Conversation


@Raveline Raveline commented Mar 15, 2024

This PR mostly builds upon the open draft PR on the original tracing repo (mtth#9) to ensure that the library gracefully survives a failure to connect to the backend.

The main change is to use a reasonably sized bounded queue to store samples to be sent to the backend, and to add logic to re-enqueue messages that could not be sent (a rough sketch of the idea follows the list below).

As a result:

  • Exceptions raised while connecting to the backend are properly silenced;
  • When the connection can be re-established, the missing messages are resent, so no information should be lost.
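
To sketch the general idea (an illustrative sketch only; the function and names below are not the code in this PR, see the diff for the actual implementation):

{-# LANGUAGE ScopedTypeVariables #-}
import Control.Concurrent.STM
import Control.Exception (SomeException, catch)
import Control.Monad (unless)

-- Sketch: drain the bounded queue, try to publish the batch, and re-enqueue
-- whatever could not be sent (silently dropping items if the queue has
-- filled up again in the meantime).
drainAndPublish :: TBQueue a -> ([a] -> IO ()) -> IO ()
drainAndPublish queue send = do
  batch <- atomically (flushTBQueue queue)
  send batch `catch` \(_ :: SomeException) ->
    atomically (mapM_ requeue batch)
  where
    requeue s = do
      full <- isFullTBQueue queue
      unless full (writeTBQueue queue s)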

I tested with the following dummy code:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
module Main where

import Control.Monad (forever)
import Control.Monad.Base (MonadBase, liftBase)
import Control.Concurrent.Lifted (threadDelay)
import Monitor.Tracing
import Control.Monad.Trace.Class (rootSpanWith)
import qualified Monitor.Tracing.Zipkin as ZPK 

main :: IO ()
main = ZPK.with settings $ ZPK.run runner

settings :: ZPK.Settings
settings =
  ZPK.defaultSettings
    { ZPK.settingsEndpoint =
        Just $ ZPK.defaultEndpoint { ZPK.endpointService = Just "test" }
    , ZPK.settingsPublishPeriod = Just 1 -- second
    }

runner :: (MonadBase IO m, MonadTrace m) => m ()
runner = rootSpanWith (ZPK.addInheritedTag "id" "1234") alwaysSampled "example" $
  forever $ do
    childSpan "wakeup" wakeup
    threadDelay 5000000

wakeup :: (MonadBase IO m, MonadTrace m) => m ()
wakeup = do
  liftBase . putStrLn $ "Woke up"
  threadDelay 3000000
  childSpan "subcall" doubleWakeup

doubleWakeup :: (MonadBase IO m, MonadTrace m) => m ()
doubleWakeup = liftBase . putStrLn $ "Double wake up"

I did some profiling on this to ensure this didn't introduce any leak.

With no connection loss, we seem to have quasi-constant space:
[heap profile: no-loss-of-backend]

With a connection loss, oddly enough, there is a clear variation:
[heap profile: loss-of-backend]

I understand the spike (the queue gets filled), but I don't get why memory drops so much afterwards; at least it doesn't leak, I guess, but I'm still a bit puzzled by the result.


Edit after changes: now that I've switched to the SBQueue, the results are much closer to what I expected (though PostScript did some odd things in rendering; apologies for the lack of readability).

With no loss, we have - again - constant space:

[heap profile: tracer-test-no-loss]

When the backend is lost and then plugged back in, we observe a spike as the queue fills (a very long name in the key shrank the graph, but the spike followed by a return to normal is clearly visible):

[heap profile: tracer-test-loss]

else flush
pure $ Zipkin mgr req tracer mbEpt
zpk = Zipkin mgr req tracer mbEpt
for_ (microSeconds <$> mfilter (> 0) mbPrd) $ \delay -> fork $ forever $ do
@Raveline (Author)

This part is the one I'm the least satisfied with; I took it "as is" from the original PR, but I think it is flawed.
If the mbPrd parameter is Nothing, the behaviour is messed up: the flush is only performed at the termination of the process (i.e., in most cases, when quitting the program) - I suspect (without being 100% sure) because of the intricacies of mfilter. Though we typically set this parameter, it would be better not to leave such aberrant behaviour possible.
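
For context: mfilter (> 0) maps Nothing (and non-positive periods) to Nothing, and for_ over Nothing runs its body zero times, so no background publishing thread is ever forked. A quick GHCi illustration:

ghci> import Control.Monad (mfilter)
ghci> mfilter (> 0) (Just 2 :: Maybe Int)
Just 2
ghci> mfilter (> 0) (Nothing :: Maybe Int)
Nothing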

@arybczak

Yeah, that looks really bad. What is mbPrd for? You should not be allowed to not spawn a publishing thread, let's fix this.

@Raveline (Author)

The documentation for the field reads:

 , settingsPublishPeriod :: !(Maybe NominalDiffTime)
  -- ^ If set to a positive value, traces will be flushed in the background every such period.

Which, I guess, can be interpreted as "if set to Nothing, there is no regular flush", but that makes very little sense (except perhaps for very specific setups, like daemons that get restarted frequently?). I suggest replacing the Nothing case with a sensible default (every second, e.g.).
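
A hypothetical sketch of what that default could look like (the 1-second fallback and the helper below are assumptions, not this PR's code):

import Data.Maybe (fromMaybe)
import Data.Time.Clock (NominalDiffTime)

-- Hypothetical: fall back to a 1-second publish period when none is configured.
publishPeriod :: Maybe NominalDiffTime -> NominalDiffTime
publishPeriod = fromMaybe 1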

@arybczak

Yeah, let's just get rid of the Maybe there.

@Raveline (Author)

Shall we also remove the ignore flag, since it isn't used anymore?

let unspan (ZipkinSpan _ sample) = sample
queue = spanSamples tracer
fillOne item = void . atomically $ do
cannotPush <- isFullTBQueue queue
@Raveline (Author)

This check is very important, lest an uncaught exception be raised. Of course, as a result, if the queue is full, some traces will be lost (which is why I set a fairly high limit on the queue).

tags <- readTVar tagsTV
logs <- sortOn (\(t, k, _) -> (t, k)) <$> readTVar logsTV
writeTChan (tracerChannel tracer) (Sample spn tags logs start (end - start))
writeTBQueue (tracerQueue tracer) (Sample spn tags logs start (end - start))

@arybczak

writeTBQueue blocks when the queue is full; it should instead do nothing (see SBQueue, in particular writeSBQueue).
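
For reference, the drop-instead-of-block behaviour can also be emulated with plain TBQueue primitives (an illustrative sketch only, not the SBQueue API):

import Control.Concurrent.STM
import Control.Monad (unless)

-- Sketch: a write that silently drops the item when the bounded queue is full.
writeOrDrop :: TBQueue a -> a -> STM ()
writeOrDrop q x = do
  full <- isFullTBQueue q
  unless full (writeTBQueue q x)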


publish zpk `catch` handler
pure zpk

refillQueue :: (MonadIO m, MonadBaseControl IO m) => Tracer -> [ZipkinSpan] -> m ()

@arybczak

For comparison, in log-elasticsearch we simply flush the queue, take all the messages, and try to send them to ES. If this fails with any exception, we just retry in place with the messages we got; there's no need to put these items back into the queue unless I'm missing something.

You also then won't need the PublishFailed exception IIUC.
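
A minimal sketch of that in-place retry (the names, the standalone function, and the fixed one-second back-off are illustrative assumptions, not the log-elasticsearch code):

{-# LANGUAGE ScopedTypeVariables #-}
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, catch)

-- Sketch: keep the drained batch and retry sending it in place until it succeeds.
retryInPlace :: [a] -> ([a] -> IO ()) -> IO ()
retryInPlace batch send =
  send batch `catch` \(_ :: SomeException) -> do
    threadDelay 1000000  -- back off for a second before retrying
    retryInPlace batch send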

@Raveline (Author)

I'll see how this behaves with an in-place retry.

-- | Creates a new 'Tracer'.
newTracer :: MonadIO m => m Tracer
newTracer = liftIO $ Tracer <$> newTChanIO <*> newTVarIO 0
newTracer = liftIO $ Tracer <$> newTBQueueIO 10000 <*> newTVarIO 0

@arybczak

10000 sounds a bit small; log uses a default of 1000000.

@arybczak

> I understand the spike (the queue gets filled), but I don't get why memory drops so much afterwards; at least it doesn't leak, I guess, but I'm still a bit puzzled by the result.

This graph suggests that when the publisher fails, traces are dropped for some reason 🤔

@Raveline (Author)

> This graph suggests that when the publisher fails, traces are dropped for some reason 🤔

That is my understanding too, but it is not the behaviour I observe in Zipkin; the messages are properly retrieved when the server is restarted. This is why I am so puzzled by the profile; it makes no sense to me. I'll retry it once I've switched to the SBQueue you recommended.

@Raveline Raveline force-pushed the survive-publishing-exceptions branch from af7dd7f to 58abc4e Compare March 21, 2024 09:44
import Data.Maybe (fromMaybe)
import Data.Time.Clock (NominalDiffTime)
import Data.Time.Clock.POSIX (POSIXTime, getPOSIXTime)
import GHC.Conc (retry)

@arybczak arybczak Mar 21, 2024


While we're at it, can you get rid of the stm-lifted dependency? It's abandoned and not really needed here; AFAIK the only thing it gives us is a lifted atomically.

Then you can use import Control.Concurrent.STM, which exports retry.
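
(For reference, a lifted atomically is a one-liner without stm-lifted; a sketch, assuming MonadBase from transformers-base:)

import Control.Concurrent.STM (STM, atomically)
import Control.Monad.Base (MonadBase, liftBase)

-- Sketch: a lifted atomically without the stm-lifted package.
atomicallyL :: MonadBase IO m => STM a -> m a
atomicallyL = liftBase . atomically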

@arybczak

Nevermind, let's do it in a different PR.

You can use Control.Monad.STM instead of GHC.Conc, though.

@Raveline (Author)

I can, but (a) it's already just a re-export from GHC.Conc, and (b) it requires adding the stm dependency (which is admittedly only hidden for now). I will gladly do a PR dropping stm-lifted as a follow-up to this one.


@arybczak arybczak Mar 21, 2024


Ok, let's leave it then. I see you already did it ;)

@Raveline (Author)

Yes, I was explaining why I had used GHC.Conc, not refusing the requested change! If the state of the PR seems satisfactory to you, I think we can merge, and I'll do the PR dropping stm-lifted right after.

@arybczak

Sounds good 👍

@Raveline Raveline force-pushed the survive-publishing-exceptions branch from 58abc4e to f38e4e6 Compare March 21, 2024 15:56
@Raveline Raveline changed the title from "Ensure backend failure don't terminate the publishing process" to "[CORE-6788] Ensure backend failure don't terminate the publishing process" Mar 22, 2024

@arybczak arybczak left a comment


Looks good, thanks 👍

@@ -1,4 +1,5 @@
{-# LANGUAGE CPP #-}
{-# LANGUAGE ExistentialQuantification #-}

@arybczak

This is no longer necessary, right?

@Raveline (Author)

Indeed, removed.


@Raveline Raveline force-pushed the survive-publishing-exceptions branch from f38e4e6 to a756751 Compare March 25, 2024 08:38
@Raveline Raveline merged commit 4023a7b into master Mar 25, 2024