Remote Job Management

Managing tasks in a workflow requires more than job execution alone: Cylc also transfers files with rsync and executes cylc sub-commands directly over non-interactive SSH.
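The non-interactive SSH execution can be sketched roughly as follows. This is an illustrative assumption, not Cylc's exact invocation: the host name is hypothetical, and the particular SSH options shown are one common way to make a connection non-interactive.

```python
import subprocess

def ssh_command(host: str, command: list[str]) -> list[str]:
    """Build a non-interactive SSH invocation for a remote command.

    -oBatchMode=yes disables password prompts, so the call fails fast
    if key-based authentication is not set up; -n stops ssh from
    reading stdin.
    """
    return ["ssh", "-oBatchMode=yes", "-n", host] + list(command)

def run_remote(host: str, command: list[str]) -> "subprocess.CompletedProcess[str]":
    """Run a command on a job host over SSH, capturing its output."""
    return subprocess.run(ssh_command(host, command), capture_output=True, text=True)

# Hypothetical usage: run a cylc sub-command on a remote job host.
# run_remote("job-host-01", ["cylc", "jobs-poll", "..."])
```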

SSH-free Job Management?

Some sites may want to restrict access to job hosts by whitelisting SSH connections so that only rsync is allowed for file transfer, with job execution permitted only via a local job runner that can see the job hosts [1]. We are investigating the feasibility of SSH-free job management when a local job runner is available, but this is not yet possible unless your workflow and job hosts also share a filesystem, which allows Cylc to treat jobs as entirely local [2].

SSH-based Job Management

Cylc does not have persistent agent processes running on job hosts to act on instructions received over the network [3], so instead we execute job management commands directly on job hosts over SSH. Reasons for this include:

  • It works equally for job runner and background jobs.

  • SSH is required for background jobs, and for jobs in other job runners if the job runner is not available on the workflow host.

  • Querying the job runner alone is not sufficient for full job polling functionality.

    • This is because jobs can complete (and then be forgotten by the job runner) while the network, workflow host, or scheduler is down (e.g. between workflow shutdown and restart).

    • To handle this, the automatic job wrapper code writes job messages and exit status to job status files, which schedulers interrogate during job polling operations.

    • Job status files reside on the job host, so the interrogation is done over SSH.

  • Job status files also hold the job runner name and job ID; these are written by the job submit command and read by the job poll and kill commands.
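The interrogation step above can be sketched as follows. The KEY=VALUE layout and the specific key names are hypothetical, standing in for whatever Cylc's job wrapper actually writes; the point is that a poll can recover the runner name, job ID, and exit status from the file alone.

```python
def parse_job_status(text: str) -> dict:
    """Parse KEY=VALUE lines from a job status file into a dict.

    Lines without an '=' separator are ignored.
    """
    status = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            status[key.strip()] = value.strip()
    return status

# Hypothetical status file content: the runner name and job ID were
# written at submit time, the exit status by the job wrapper, so the
# result survives even if the job finished while the scheduler was down.
example = """\
JOB_RUNNER_NAME=slurm
JOB_ID=123456
JOB_EXIT=SUCCEEDED
"""
print(parse_job_status(example)["JOB_EXIT"])  # SUCCEEDED
```

In the real system the file resides on the job host, so reading it happens over SSH as described above.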

Other Cases Where Cylc Uses SSH Directly

  • To see if a workflow is running on another host with a shared filesystem; see cylc/flow/workflow_files:detect_old_contact_file.

[1] A malicious script could be rsync'd and run from a batch job, but jobs in job runners are considered easier to audit.

[2] The job ID must also be valid for querying and killing the job via the local job runner. This is not the case for Slurm unless the --cluster option is explicitly used in job query and kill commands; otherwise the job ID is not recognized by the local Slurm instance.

[3] This would be a more complex solution in terms of implementation, administration, and security.