Remote Job Management
Managing tasks in a workflow requires more than just job execution: Cylc performs additional actions with rsync for file transfer, and direct execution of cylc sub-commands over non-interactive SSH.
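For a concrete flavor, the kinds of commands involved might look like the following sketch; the host name `jobhost` and workflow name `my-flow` are made up, and these are not Cylc's exact invocations:

```shell
# Illustrative only: hostname, paths, and arguments are hypothetical.

# File transfer to a job host with rsync:
rsync -a ~/cylc-run/my-flow/ jobhost:cylc-run/my-flow/

# Direct execution of a cylc sub-command over non-interactive SSH
# (BatchMode=yes fails fast instead of prompting for a password):
ssh -oBatchMode=yes jobhost cylc version
```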
SSH-free Job Management?
Some sites may want to restrict access to job hosts by whitelisting SSH connections to allow only rsync for file transfer, and allowing job execution only via a local job runner that sees the job hosts [1].
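As a generic (non-Cylc) illustration of such whitelisting, OpenSSH forced commands can restrict a key to rsync transfers only, for example via the `rrsync` helper script shipped with rsync; the key, target directory, and script location below are assumptions:

```shell
# Sketch of an ~/.ssh/authorized_keys entry on a job host that limits
# the workflow host's key to rsync into one directory tree only.
# The rrsync script location varies by distribution; the key is truncated.
restrict,command="/usr/share/rsync/scripts/rrsync ~/cylc-run" ssh-ed25519 AAAA... user@workflowhost
```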
We are investigating the feasibility of SSH-free job management when a local job runner is available, but this is not yet possible unless your workflow and job hosts also share a filesystem, which allows Cylc to treat jobs as entirely local [2].
SSH-based Job Management
Cylc does not have persistent agent processes running on job hosts to act on instructions received over the network [3], so instead we execute job management commands directly on job hosts over SSH. Reasons for this include:
- It works equally for job runner and background jobs.
- SSH is required for background jobs, and for jobs in other job runners if the job runner is not available on the workflow host.
- Querying the job runner alone is not sufficient for full job polling functionality:
  - Jobs can complete (and then be forgotten by the job runner) while the network, workflow host, or scheduler is down (e.g. between workflow shutdown and restart).
  - To handle this, the automatic job wrapper code writes job messages and exit status to job status files, which schedulers interrogate during job polling operations.
  - Job status files reside on the job host, so the interrogation is done over SSH.
- Job status files also hold the job runner name and job ID; these are written by the job submit command and read by the job poll and kill commands.
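The status-file mechanism can be sketched as follows; the file location and KEY=VALUE field names are illustrative, not Cylc's actual schema:

```shell
# The job wrapper appends records to a status file as the job runs:
status_file=$(mktemp)
echo "JOB_RUNNER_NAME=slurm" >> "$status_file"
echo "JOB_ID=123456"         >> "$status_file"
echo "JOB_EXIT=SUCCEEDED"    >> "$status_file"

# Later, a polling scheduler (over SSH in reality) can recover the
# outcome from the file alone, even if the job runner has already
# forgotten the job:
job_exit=$(grep '^JOB_EXIT=' "$status_file")
echo "$job_exit"
rm -f "$status_file"
```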
Other Cases Where Cylc Uses SSH Directly
- To see if a workflow is running on another host with a shared filesystem - see
Footnotes

[1] A malicious script could be rsync’d and run from a batch job, but jobs in job runners are considered easier to audit.
[2] The job ID must also be valid to query and kill the job via the local job runner. This is not the case for Slurm unless the --cluster option is explicitly used in job query and kill commands; otherwise the job ID is not recognized by the local Slurm instance.
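For illustration, a hedged sketch of what that looks like with Slurm's multi-cluster flags; the cluster name and job ID here are made up, and note that Slurm spells the option `--clusters` (shorthand `-M`):

```shell
# Query and kill a job that was submitted to a different cluster:
squeue --clusters=other_cluster --jobs=123456
scancel --clusters=other_cluster 123456

# Without --clusters, the local Slurm instance would not recognize
# this job ID.
```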
[3] This would be a more complex solution, in terms of implementation, administration, and security.