Manager2

The Manager2 runtime component manages a specified set of executables. Typically it starts all Ratatosk executables on a host and handles associated periodic tasks, such as sending files to shore. Specifically, it provides functionality for:

  • Starting selected components.
  • Monitoring applications and restarting them based on their console output.
  • Restarting selected applications as specified, depending on how the application terminated.

It is configured using a YAML input file with the following format:

  • ManagedProcess: A list of processes to manage.

Each element of ManagedProcess has the following fields:

  • PrgName: The application name. The path must be included if the application is not on the search path of the environment.
  • Arguments: A yaml array with the process arguments, e.g. '[/home/sintef/vesseldeployments/config/myvessel/modbus.yml, 2]'
  • RestartPeriod: If this optional argument is given, the application will be restarted after the specified number of seconds if the return value indicates success. If unset, the application will not be restarted after successful execution.
  • DelayOnError: If this optional argument is set, the application will be restarted after the specified number of seconds, if the return value indicates an error, the application fails to launch or some other error is detected. If this argument is not given, the application will not be restarted if it fails.
  • DelayOnErrorLong: If this argument is set in addition to DelayOnError, a schedule for restarting the application on successive failures will be used. After a number of successive failed restarts with the (short) delay DelayOnError, the (longer) delay DelayOnErrorLong will be used. After this delay, the restart schedule will again run a number of attempts using DelayOnError, and so on. See also MaxRestartsOnError and ResetFailCounterAfter.
  • MaxRestartsOnError: This optional field limits the number of restarts on unsuccessful execution. If DelayOnError is set and DelayOnErrorLong is not set, the number of restarts will be limited to MaxRestartsOnError if set. If both DelayOnError and DelayOnErrorLong are set, MaxRestartsOnError specifies the number of restarts using DelayOnError for each restart using DelayOnErrorLong. In this case, a default value of three will be used if MaxRestartsOnError is unset.
  • ResetFailCounterAfter: [3] This optional field specifies how long the application must run without returning an error code or otherwise failing before it is considered to have started successfully. After the application has been running for this duration, the restart schedule will be reset, and a subsequent error will trigger a restart following the reset schedule. If unset, a default value of three seconds will be used. For applications that need more time to, e.g., set up communication with remote peers, it is recommended to increase this value.
  • MaxQuietPeriod: If this optional field is set, the manager will monitor the application's output to standard out and check every MaxQuietPeriod seconds whether there has been any output within this period. If there has been no output, the application will be terminated immediately and subsequently restarted according to the restart on error schedule. Note that a minimum period of two seconds will be used!
  • WatchdogPeriod: Run a watchdog periodically with the specified period (in seconds). The watchdog will try to detect if the application has somehow terminated without triggering handling of the failure. Typically, this would mean that no notification was sent to the exit handler. If the watchdog detects a failure, the process will be restarted according to the aforementioned restart on error schedule. Note that if MaxQuietPeriod is set, the watchdog period specified here will be overridden by MaxQuietPeriod.
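
The interplay of DelayOnError, DelayOnErrorLong, MaxRestartsOnError and ResetFailCounterAfter can be easier to follow as a small model. The Python sketch below is not Manager2 code. The class RestartSchedule and its method next_delay are hypothetical names, and two details are assumptions: that the fail counter is reset when the application has run for at least ResetFailCounterAfter seconds, and that a long delay starts a new round of short delays.

```python
from typing import Optional


class RestartSchedule:
    """Hypothetical model of the restart-on-error schedule (not Manager2 code)."""

    def __init__(
        self,
        delay_on_error: Optional[float] = None,
        delay_on_error_long: Optional[float] = None,
        max_restarts_on_error: Optional[int] = None,
        reset_fail_counter_after: float = 3.0,
    ) -> None:
        self.delay_on_error = delay_on_error
        self.delay_on_error_long = delay_on_error_long
        # With both delays set, MaxRestartsOnError defaults to three.
        if delay_on_error_long is not None and max_restarts_on_error is None:
            max_restarts_on_error = 3
        self.max_restarts_on_error = max_restarts_on_error
        self.reset_fail_counter_after = reset_fail_counter_after
        self.failed_restarts = 0

    def next_delay(self, run_duration: float) -> Optional[float]:
        """Return the delay before the next restart, or None to give up."""
        if self.delay_on_error is None:
            return None  # DelayOnError unset: never restart after a failure
        # Assumption: running at least ResetFailCounterAfter seconds resets the counter.
        if run_duration >= self.reset_fail_counter_after:
            self.failed_restarts = 0
        limit = self.max_restarts_on_error
        if limit is None or self.failed_restarts < limit:
            self.failed_restarts += 1
            return self.delay_on_error
        if self.delay_on_error_long is None:
            return None  # restart budget exhausted and no long delay configured
        self.failed_restarts = 0  # assumption: a long delay starts a new round
        return self.delay_on_error_long


# DelayOnError: 3, DelayOnErrorLong: 8, MaxRestartsOnError: 2, task fails after 2 s
schedule = RestartSchedule(3, 8, 2)
print([schedule.next_delay(2.0) for _ in range(6)])  # [3, 3, 8, 3, 3, 8]
```

Calling next_delay with a run duration of at least ResetFailCounterAfter always yields the short delay in this model, which corresponds to an application that keeps failing only after it has been considered successfully started.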

An annotated example input file is shown below.

ManagedProcess:
# The following examples should run from the build directory
# 'dummytask' is a simple test program that takes two arguments: a delay [seconds] and an action.
# After the specified delay 'dummytask' will end its execution according to the specified action:
# - success: return EXIT_SUCCESS
# - error: return EXIT_FAILURE
# - throw: throw an unhandled exception
# - sigkill: send a KILL signal to itself
#
# The following task will start once
- PrgName: tools/dummytask/dummytask
  Arguments: [4, success]
# The following task will return successfully after four seconds and then restart after ten more
# seconds. This periodic execution will run indefinitely, unless some failure arises, in which case
# the task will not be restarted.
- PrgName: tools/dummytask/dummytask
  Arguments: [4, success]
  RestartPeriod: 10
# The following task will return failure after two seconds and then restart after thirty more
# seconds. The number of attempts to restart the task is unlimited.
- PrgName: tools/dummytask/dummytask
  Arguments: [2, error]
  RestartPeriod: 600 # If the task were to return success, this would trigger a new execution of the task after ten minutes
  DelayOnError: 30
# The following task will return failure after two seconds and then restart after three more
# seconds. The manager will attempt to restart the task two times (three starts in total) before giving up.
- PrgName: tools/dummytask/dummytask
  Arguments: [2, error]
  RestartPeriod: 15 # If the task were to return success, this would trigger a new execution of the task after 15 seconds
  DelayOnError: 3
  MaxRestartsOnError: 2
# The following task will fail after two seconds and then restart after three more seconds. After two
# restart attempts, the next restart attempt will be delayed by 8 seconds.
- PrgName: tools/dummytask/dummytask
  Arguments: [2, throw]
  RestartPeriod: 15 # If the task were to return success, this would trigger a new execution of the task after 15 seconds
  DelayOnError: 3
  MaxRestartsOnError: 2
  DelayOnErrorLong: 8 # This should be set to a value not smaller than 'DelayOnError'.
# The following task will fail after 8 seconds and then restart after three more seconds.
# 'ResetFailCounterAfter: 6' will trigger a reset of the fail counter after the task has been running for 6 seconds.
# So in this case the task will be restarted after 'DelayOnError' each time it fails (since it fails after 8 seconds),
# indefinitely.
- PrgName: tools/dummytask/dummytask
  Arguments: [8, throw]
  RestartPeriod: 15 # If the task were to return success, this would trigger a new execution of the task after 15 seconds
  DelayOnError: 3
  MaxRestartsOnError: 2
  DelayOnErrorLong: 8 # This should be set to a value not smaller than 'DelayOnError'.
  ResetFailCounterAfter: 6 # A default value of three seconds will be used if this is not set.
# The following example includes a watchdog timer. The watchdog timer is started on successful start of the task.
# The watchdog is intended to restart the task in cases in which the task execution is somehow aborted without
# the normal exit handler being triggered (i.e. the RatatoskManager will not be alerted to the halted execution).
- PrgName: tools/dummytask/dummytask
  Arguments: [60, sigkill]
  RestartPeriod: 600 # If the task were to return success, this would trigger a new execution of the task after ten minutes
  DelayOnError: 30
  WatchdogPeriod: 10
# The following example includes monitoring of output from dummytask to stdout.
# A watchdog will check every 'MaxQuietPeriod' seconds whether the last registered output from the task
# was received more than 'MaxQuietPeriod' seconds ago. If no output has been received within that period, the manager will terminate dummytask and
# restart it. The restart can be delayed after termination of the task by adding 'DelayOnError'.
- PrgName: tools/dummytask/dummytask
  Arguments: [5, success, 500]
  MaxQuietPeriod: 2
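
The quiet-period check in the last example, together with the documented two-second minimum and the rule that MaxQuietPeriod takes precedence over WatchdogPeriod, can be sketched as follows. This is an illustrative Python model, not the Manager2 implementation. The function names effective_check_period and is_too_quiet are hypothetical, and the strict "greater than" comparison is an assumption.

```python
from typing import Optional

MIN_QUIET_PERIOD_S = 2.0  # the documentation states a minimum of two seconds


def effective_check_period(
    max_quiet_period: Optional[float] = None,
    watchdog_period: Optional[float] = None,
) -> Optional[float]:
    """Return the period (seconds) of the periodic check, or None if none runs."""
    if max_quiet_period is not None:
        # MaxQuietPeriod takes precedence over WatchdogPeriod and is clamped to the minimum.
        return max(MIN_QUIET_PERIOD_S, float(max_quiet_period))
    return None if watchdog_period is None else float(watchdog_period)


def is_too_quiet(last_output_time: float, now: float, max_quiet_period: float) -> bool:
    """True if the last output is older than the (clamped) quiet period."""
    period = max(MIN_QUIET_PERIOD_S, float(max_quiet_period))
    # Assumption: strictly more than one period without output counts as "quiet".
    return (now - last_output_time) > period


print(effective_check_period(1, 10))  # 2.0
print(is_too_quiet(last_output_time=0.0, now=3.0, max_quiet_period=2))  # True
```

In this model, an application that is judged too quiet would be terminated and then handed to the restart on error schedule, as described for MaxQuietPeriod.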