cylc-admin

Proposal (Accepted and Implemented): Cylc 8 Spawn-on-Demand

Hilary Oliver, December 2019, and February/March 2020

Implementation PR merged July 2020 cylc-flow#3515

Table of Contents

Introduction

Implementation PR cylc-flow#3515.

See below for terminology.

Background

We first considered replacing task-cycle-based Spawn-On-Submit (SoS) with graph-based Spawn-On-Demand (SoD) in cylc/cylc-flow#993. The ultimate solution might be cylc/cylc-flow#3304 but that requires major Cylc 9 refactoring. It turns out we can implement SoD by itself within the current system just by getting task defs (during graph parsing) to record who depends on each output, so that task proxies can spawn downstream children as outputs are completed - still a self-evolving pool of task proxies with no graph computation at run time.

Advantages

Implementation summary

At start-up:

Then, as outputs get completed:

Implementation details

Spawn-on-outputs or spawn-when-ready?

If a task depends on the outputs of multiple parents (AND trigger) there will be a period when it has partially satisfied prerequisites, because the outputs will not all be completed at the same time (in fact the parents might not even pass through the task pool at the same time). That being the case, when exactly does SoD “demand” that we spawn the task? The options are:

Spawn-on-outputs was easiest to implement first because it leverages our task proxies’ existing ability to manage their own prerequisites. This generates some waiting tasks that might seem to create other problems, but a bit of thought shows there is actually no fundamental difference between spawn-on-outputs and spawn-when-ready. They are just slightly different ways of doing exactly the same thing, which is to manage partially satisfied prerequisites. To see this, note that we could effectively implement spawn-when-ready by spawning waiting task proxies on outputs but keeping them in a hidden pool until satisfied. Even in the main pool they don’t have to affect the workflow any differently than separately-managed prerequisites (e.g. if ignored during runahead computation they won’t stall the workflow). It should not make any difference to users either, whether or not a yet-to-run task is backed by a waiting task proxy or is just an abstract task with real prerequisite data attached to it. Housekeeping is the same too: we either remove satisfied prerequisites to avoid a memory leak, or remove spent task proxies that contain the same satisfied prerequisites.

However, we should change later to spawn-when-ready with separate management of prerequisites, just because it’s cleaner and slightly more efficient (no excess task proxy baggage, just the prerequisites) and doesn’t invite confusion about the implications of waiting task proxies. The scheduler could (e.g.) maintain a dict of (task-id, [prerequisites]): create a new entry when the first parent output of task-id is completed, update it with subsequent outputs, and delete it when the prerequisites are satisfied and the task can run. The dict would also have to be stored in the DB for restarts.

For completeness, see also can we use the database or datastore to satisfy prerequisites?

Preventing conditional reflow

If a task depends conditionally on the outputs of multiple parents (OR trigger), we need to stop the task being spawned separately by each parent ouput (because the outputs will not all be completed at the same time, and in fact the parents may not even pass through the pool at the same time).

A | B => C

Here, if C triggers off of A first, we need to stop it spawning again when B completes later on. We can’t rely on the existence of C in the task pool to stop this because C could be finished and gone before B completes.

We can prevent this “conditional reflow” very simply, by spawning only if the database says the task was not previously spawned in this flow. This requires an extra DB query for every late output in a conditional trigger (where late means the downstream task has finished and gone before the next parent output is generated). It has no impact on AND triggers where the task will remain in the pool until it runs, and a spawn-time DB query is needed to get submit number anyway.

It seems unlikely this extra DB access will be a problem, but see possible in-memory prevention of conditional reflow.

Prerequisite housekeeping?

The minimum housekeeping requirement is to delete partially satisified prerequisites once satisfied (or for spawn-on-outputs, reset the waiting tasks that contain those prerequisites, to ready) then remove task proxies once finished.

The more interesting question is: can prerequisites get stuck forever as partially satisfied (or stuck waiting tasks for spawn-on-outputs), and if so what should we do about that?

Note this is not a SoD-specific problem. It is even worse in SoS where have to worry about tasks gettings stuck wholly unsatisfied as well. There we force users to clean them all up with suicide triggers, or else let them stall the workflow.

We have provisionally decided not to attempt automatic removal of stuck prerequisites on grounds that partial completion of the requirements for a task to run probably means something is wrong that the user needs to know about. It can only happen if some of the associated outputs were generated while others were not. That could mean:

If any of the above do prove problematic we could leave this as a remaining niche case for suicide triggers, or consider possible stuck-prerequisite housekeeping.

Note that reflow can also generate tasks that are stuck waiting on off-flow outputs, but these should not be removed.

Spawning parentless tasks

Tasks with no prerequisites (or which only depend on clock or external triggers) need to be auto-spawned, because they have no parents to “demand” them. At start-up we auto-spawn the first instances of these tasks, then whenever one is released from the runahead pool, spawn its next instance.

(In future, we may want to spawn tasks “on demand” in response to external trigger events too, but for now a waiting task with an xtrigger is fine).

Absolute dependence

At first glance this is not a great fit for the spawn-on-demand model:

...
      R1/2 = "start"
      P1 = "start[2] => bar"

Above, with no final cycle point a single event (start.2 succeeded) should spawn an infinite number of children (bar.1,2,3,...), which obviously is not feasible.

(Note without the first line above start is not defined on any sequence, so with SoS the scheduler will stall with bar unsatisfied, but with SoD it will shut down immediately with nothing to do because bar never gets spawned.)

Worse, bar could also have other non-absolute triggers, and it could be those that spawn initial instances of bar before start is finished:

...
      R1/2 = "start"
      P1 = "start[2] & foo => bar"

Above, if foo runs 3x (say) faster than start then roughly speaking the first three instances of foo will spawn the first three instances of bar before start.2 is done. This suggests we might need to do repeated checking of unsatisfied absolute triggers until they become satisfied. However, it turns out that’s not necessary…

Absolute trigger implementation:

  1. Whenever an absolute output (e.g. start.2:succeed above) is completed the scheduler should:
    • Remember it forever, via a list held by the task pool module and an associated DB table (to load the data again on restart).
    • Spawn (if not already spawned) the first downstream child (bar.1 above) and update the prerequisites of potentially multiple subsequent children already in the pool due to spawning by others (e.g. foo above).
  2. Whenever a task with an absolute trigger (bar above) is spawned, check to see if the trigger is already satisfied. If it is, satisfy it. If it isn’t then it will be satisfied later when the absolute parent finishes and updates all child instances present in the pool as described just above.
  3. Whenever a task with an absolute trigger is released from the runahead pool spawn its next instance (strictly speaking this is only necessary for tasks with no non-absolute triggers to spawn them on demand).

Retriggering

Retriggering a finished absolute parent (with --reflow) causes it to spawn its first child, and subsequent instances of the parent will also continue to be spawned as they drop out of the runahead pool as described above.

Failed task handling

We have provisionally decided to remove failed tasks immediately as finished if handled by a :fail trigger, but otherwise leave them in the pool as unfinished, in anticipation of user intervention.

However we may want to reconsider this later; see better failed task handling?

Workflow stop

If the workflow stalls, that means there is nothing else to do (given the graph and what transpired at run time)

This works fine but I think we could do better - see better workflow completion handling.

Task pool content at shutdown

With a stop point prior to the final cycle point (stop now etc.) the scheduler will stop with waiting tasks in the runahead pool beyond the stop point. This allows a restart to carry on as normal, beyond the stop point.

The task pool will be empty on stopping at the final cycle point because that represents the end of the graph where all recurrence expressions end. It is not possible to restart from here. To go beyond the final cycle point you need to change the final cycle point in the workflow definition and restart from an earlier checkpoint, or warm start prior to the new final point.

Suicide triggers

Suicide triggers are no longer needed for workflow branching because waiting tasks do not get spawned on the “other” branch. Below, if A succeeds only C gets spawned, and if A fails only B gets spawned:

A:fail => B
A => C

Tim W’s case: B triggers and determines that xtrigger1 can never be satisfied, so we need to remove the waiting A (the two xtriggers are watching two mutually exclusive data directories for a new file):

@xtrigger1 => A
@xtrigger2 => B

As described this remains a case for suicide triggers (although they won’t be needed here either if we can spawn xtriggered-tasks “on demand” too, in response to xtrigger outputs).

We will keep suicide triggers for backward compatibility, and in case they are still useful for rare edge cases like this.

Suicide trigger implementation

Suicide triggers are just like normal triggers except for what happens once they are satisfied (i.e. the task gets removed instead of submitting). In particular, the suicide-triggered task has to keep track of its own prerequisites so it knows when it can be removed.

foo & bar => baz 

Above, if foo (say) succeeds before bar, then baz will be spawned (if not already spawned elsewhere) by foo:succeed and its prerequisites updated, before later being updated again by bar:succeed, and then it can run.

Simliarly for suicide triggers:

foo & bar => !baz 

Here, if foo (say) succeeds before bar, then baz will be spawned (if not already spawned elsewhere) by foo:succeed and its prerequisites updated, before later being updated again by bar:succeed, and then it can be removed.

Normally in the case of suicide triggers baz will already have been spawned earlier, so both foo and bar will just update its prerequisites and log a message about not spawning it again in this flow, and it will be removed once both are done. If it was spawned earlier but removed as finished, then only the messages will be logged (in this case baz doesn’t need to track its own prerequisites and remove itself once they are satisfied, because it has already been removed).

Submit number

Submit number should increment linearly with any re-submission, automatic retry or forced retriggering.

In SoS retriggering a finished task that is still in the pool just increments the in-memory submit number. Otherwise submit number gets looked up in the DB, but this is broken: re-inserted tasks get the right submit number from the DB, but their spawned next-cycle successors start at submit number 1 again.

In SoD we have the additional complications that finished tasks don’t stay in the pool, and reflows can be triggered over the top of an earlier flow. We can’t assume that submit number always starts at 1 in the original flow because it is possible to trigger a “reflow” out the front of the main flow. The only foolproof method to always get the right submit number (and thus never clobber older job logs) is to look it up in the DB whenever a new task is spawned. As it turns out the same DB query can also determine associated flow labels (next).

Reflow

Reflow means having the workflow continue to “flow on” according to the graph from a manually re-triggered task.

Reflow in SoS

A restricted form of reflow occurs automatically in SoS if you (re-)insert and trigger a task with only previous-instance dependence (foo[-P1] => foo).

Otherwise it requires laborious manual insertion of downstream task proxies and/or state reset of existing ones to set up the reflow.

Reflow in SoD

Reflow naturally happens in SoD. Consider the cycling workflow graph:

reflow-example

Retriggering a finite sub-graph from a bottleneck task is straightforward and safe:

Retriggering an ongoing cycling workflow from a bottleneck task is equally straightforward:

Retrigging a non-bottleneck task will cause a stalled reflow without manual intervention, because previous-flow outputs are not automatically available

Conditional reflow prevention must be flow-specific. In A | B => C if C spawns off of A in flow-1, we should not spawn another C off of B in flow-1, but we should spawn another C if flow-2 hits the trigger later on.

Flow labels and merging

To distinguish different flows, tasks associated with the same flow have a common flow label that is transmitted downstream to spawned children.

Users should probably be advised to avoid running multiple flows in the same part of the graph at the same time, unless they know what they’re doing, because that would likely involve multiple sets of identical or near-identical jobs running at nearly the same time, which would likely be pointless. However, the system should ensure that nothing pathological happens if one flow does run into another.

We could make flows independent (able to pass through each other unimpeded) but that just seems like a perverse way of running multiple instances of the same workflow at once.

Instead, if one flow catches up to another, they should merge together. The only way to achieve this, realistically, is to merge flow labels wherever a task from one flow encounters the same task from another flow already present in the task pool. So flows merge gradually, not all at once. When this happens the task will take forward a new merged flow label that can act as either of the original flows downstream.

Flow labels are currently implemented as a random character chosen from the set [a-zA-Z], with merged labels represented by joining the upstream labels. So if flow “a” catches up with flow “U”, the merged flow will be labelled “aU” and will act as if it belongs to “a”, “U”, or “aU”. This allows for 52 concurrent flows, which should be plenty, but is super easy to read and type (c.f. verbose UUIDs merging to a list of UUIDs).

There’s no need to keep track of which is the original flow, it doesn’t matter.

We should periodically and infrequently (main loop plugin) remove flow labels that are common to the entire task pool. If flows ‘a’ and ‘U’ merge, eventually all labels will contain ‘aU’ (and some may carry additional labels as well from more recent triggering). At that point we might as well replace ‘aU’ with a single character label.

Reflow restrictions

CLI implications

CLI task globbing:

In SoS we glob on:

In SoD we need to glob on:

Web UI implications

Primarily:

Approaching a stop point:

Graph isolates: expanding the n-distance window will not reveal tasks that are not connected (via the graph) to n=0 tasks (nasty example: cycling workflows with no inter-cycle dependencies).How should the UI make these easily discoverable?

Reflow:

Task Outputs:

Appendices

Possible in-memory conditional reflow prevention

(For comparision with DB-based conditional reflow prevention above.

This would involve remembering that a task was previously spawned until all of its parents are finished - at which there are no parents left to cause the problem.

My initial implementation involved keeping finished task proxies in the pool until their parents were finished, but that required spawning on completion as well as on outputs, and task proxies can only hold the finished status of all their parents if spawned on completion as well as on outputs (e.g. to catch parents that finished by failure instead of success). This was a departure from the SoD concept, and it created extra (non on-demand) waiting tasks that also required parents-finished housekeeping. Too complicated.

A simpler in-memory solution is feasible though: maintain (e.g.) a dict of (spawned-task-id, [list of unfinished parents]). Create a new entry when the first parent finishes, and update it on subsequent parent completions. Remove the entry once the list of unfinished parents is empty. This dict would also have to be stored in the DB for restarts. Some tricky housekeeping might be needed though: what to do with dict entries that get “stuck” because one or more parents never run at all. That could happen after upstream failures, in which case fixing the failure might fix the problem, but it could also happen when the parent tasks are on different alternate paths.

Can we use the datastore instead of the DB?: parents are one edge from children, so if we keep n=1 tasks in the scheduler datastore, can we use that to see if the child was spawned already? Kind of: the child can drop out of n=1 if there is a gap between parents running, in which case it has to be loaded into n=1 again from the DB when the next parent shows up. So this avoids the in-memory housekeeping problem but only by secretly using the DB as long term memory. Plus, we want to keep only n=0 in the scheduler datastore if possible.

Can we use the database or datastore to satisfy prerequisites?

We’ve chosen in-memory management of partially satisfied prerequisites above, but for completeness here are two other possible solutions.

Pure DB: when an output is completed we could try to satisfy the prerequisites of the downstream task using the DB. This would involve looking up all of the other contributing parent outputs each time. If the prerequisites are completely satisfied, spawn the task, otherwise don’t. This would avoid any need for prerequisite housekeeping, but at the cost of a lot more DB access for tasks with many parents.

Can we use the datastore to satisfy prerequisites? Prerequisites are n=1 edges, so could we keep n=1 in the scheduler datastore and use that to check prerequisites before spawning? The answer is no because the as-yet-unspawned child task is n=1 from the parent that just generated the output, but all the other parents have to be checked as well to resolve its prerequisites, and they are n=2 from the original parent. So what about keeping n=2 in the datastore? Kind of: we can check all parents at n=2 whenever an output is generated, but parents don’t have to run concurrently so they can drop out of n=2 between one parent running and the next, in which case they would have to be loaded into n=2 from the DB again when the next parent shows up. So this would work, but with back-door use of the DB as long term memory. Plus we want to keep only n=0 in the scheduler datastore if possible.

Possible stuck-prerequisite housekeeping

We have provisionally decided not do to automatic housekeeping of stuck prerequisites - see above.

If we change our minds about this, we could potentially remove partially satisfied prerequisites (spawn-on-ready) or waiting tasks (spawn-on-outputs) if their parent tasks are all finished, and perhaps also if the rest of the workflow has moved on beyond them, because at that point nothing could satisfy them automatically.

Stuck prerquisites are not an immediate problem, it only matters if they accumulate over time. We don’t have to let them stall the workflow either (although they are often the result of a failure that causes it to stall). So housekeeping could be done by checking the status of the relevant parent tasks via the DB at stall time, or perhaps by infrequent periodic main loop plugin.

Better failed task handling?

We currently remove failed tasks as finished if handled by a :fail trigger, but otherwise leave them in the pool as unfinished, in anticipation of user intervention.

However it would be simpler just to remove all failed tasks, and that would be fine because:

Handling all failed tasks equally would also make for simpler and more consistent workflow completion handling (below).

Better workflow completion handling?

Completion and shutdown handling as described (above) leaves a bit to be desired.

In SoD the following workflow will complete and shut down after running only a, i.e. it won’t be identified as a stall (it would in SoS):

graph = "a:out1 => bar"
[runtime]
   [[a]]
       script = true

That’s the correct thing to do according to the graph. If a does not generate the out1 output, there is no path to bar, and the workflow is complete.

This on the other hand:

graph = """a & b => bar
          # and optionally:
        b:fail => whatever"""
[runtime]
   [[a]]
       script = true
   [[b]]
       script = false

… will be identified as a stall, even though there is no path to bar here either. There will be a partially satisfied (waiting) bar, and (depending on how we handle failed tasks) a failed b in the pool.

That seems inconsistent. In fact there’s really no need to make potentially-incorrect value judgements about the meaning of partially satisfied prerequisites, handled or unhandled failed tasks, or stall vs completion. An unexpected failure might cause the worklfow to stall by taking a side path, but it is impossible for us to know what the “intended” path is (who are we to say that the supposed side path was not actually the intended path?)

Instead (simple!):

Aside on current flawed SoS completion logic: tasks are spawned ahead as waiting regardless of what is “demanded” by the graph at run time, but the workflow still proceeds “on demand”. If the scheduler stalls with waiting tasks present we infer a premature stall, but that might not be a valid conclusion. The scheduler got to this point by doing exactly what the graph said to do under the circumstances that transpired, and who are we to say that the user did not intend that? But unfortunately in SoS we may end up with a bunch of off-piste waiting tasks to deal with, and we use those to infer an intention to follow a different path.

Automatic use of off-flow outputs?

This is possible in principle but we have decided against it as difficult and dangerous. Also, it is really not necessary. SoD reflow is a big improvement even without this (less intervention, straightfoward and consistent).

(Note we actually have a similar-but-worse problem in SoS: previous-flow task outputs will not be used automatically unless those tasks happen to still be present in the task pool - and users do not generally understand what determines whether or not those tasks will still be present).

(For the record, in case we ever reconsider automatic use of off-reflow outputs, graph traversal would be required to determine whether or not an unsatisfied prerequisite can be satisfied later within the reflow or not, and if not, to go the database. As Oliver noted: this could probably be worked out once at the start of the reflow; it could also help us show users graphically the consequences of their intended reflow).

TODO

The following have been transferred to sod-follow-up Issues in cylc-flow

Possible Future Enhancements

Follow-on changes that may or may not be worthwhile before Cylc 9:

Terminology