Retries
Retries allow us to automatically re-submit tasks which have failed due to failure in submission or execution.
Purpose
Retries can be useful for tasks that occasionally fail for known, fixable reasons. Cylc can rerun a failing job multiple times, with user-defined delays between tries.
Tasks that fail because of temporary hardware or network outages may succeed if simply resubmitted after a delay. Others might succeed if configured differently on the retry.
A job environment variable $CYLC_TASK_TRY_NUMBER
increments with each try,
to allow try-dependent behaviour in the task script.
Note
Tasks only enter the submit-failed
state if job submission fails with no
retries left. Otherwise they return to the waiting state, to wait on the
next try.
Tasks only enter the failed
state if job execution fails with no retries
left. Otherwise they return to the waiting state, to wait on the next try.
Example
Create a new workflow by running the following commands:
mkdir -p ~/cylc-src/retries-tutorial
cd ~/cylc-src/retries-tutorial
And paste the following code into a flow.cylc
file. This workflow has a
roll_doubles
task that simulates trying to roll doubles using two dice:
[scheduler]
UTC mode = True # Ignore DST
[scheduling]
[[graph]]
R1 = start => roll_doubles => win
[runtime]
[[start]]
[[win]]
[[roll_doubles]]
script = """
sleep 10
RANDOM=$$ # Seed $RANDOM
DIE_1=$((RANDOM%6 + 1))
DIE_2=$((RANDOM%6 + 1))
echo "Rolled $DIE_1 and $DIE_2..."
if (($DIE_1 == $DIE_2)); then
echo "doubles!"
else
exit 1
fi
"""
Running Without Retries
Let’s see what happens when we run the workflow as it is. Look at the workflow with The Cylc GUI or The Cylc TUI
Then validate install and run the workflow:
cylc validate .
cylc install
cylc play retries-tutorial
Unless you’re lucky, the workflow should fail at the roll_doubles task.
Stop the workflow:
cylc stop retries-tutorial
Configuring Retries
We need to tell Cylc to retry it a few times. To do this, add the following
to the end of the [[roll_doubles]]
task section in the flow.cylc
file:
execution retry delays = 5*PT6S
This means that if the roll_doubles
task fails, Cylc expects to
retry running it 5 times before finally failing. Each retry will have
a delay of 6 seconds.
We can apply multiple retry periods with the execution retry delays
setting
by separating them with commas, for example the following line would tell Cylc
to retry a task four times, once after 15 seconds, then once after 10 minutes,
then once after one hour then once after three hours.
execution retry delays = PT15S, PT10M, PT1H, PT3H
Running With Retries
Look at the workflow with The Cylc GUI or The Cylc TUI
Re-install and run the workflow:
cylc validate .
cylc install
cylc play retries-tutorial
What you should see is Cylc retrying the roll_doubles
task. Hopefully,
it will succeed (there is only about a 1 in 3 chance of every task
failing) and the workflow will continue.
Altering Behaviour
We can alter the behaviour of the task based on the number of retries, using
$CYLC_TASK_TRY_NUMBER
.
Change the script
setting for the roll_doubles
task to this:
sleep 10
RANDOM=$$ # Seed $RANDOM
DIE_1=$((RANDOM%6 + 1))
DIE_2=$((RANDOM%6 + 1))
echo "Rolled $DIE_1 and $DIE_2..."
if (($DIE_1 == $DIE_2)); then
echo "doubles!"
elif (($CYLC_TASK_TRY_NUMBER >= 2)); then
echo "look over there! ..."
echo "doubles!" # Cheat!
else
exit 1
fi
If your workflow is still running, stop it, then run it again.
This time, the task should definitely succeed before the third retry.
Further Reading
For more information see the Cylc User Guide.