15.5. Remote Job Management

Managing tasks in a workflow requires more than just job execution: Cylc performs other actions on job hosts, using rsync for file transfer and direct execution of cylc sub-commands over non-interactive SSH [4].

15.5.1. SSH-free Job Management?

Some sites may want to restrict access to job hosts by whitelisting SSH connections to allow only rsync for file transfer, and allowing job execution only via a local batch system that sees the job hosts [5]. We are investigating the feasibility of SSH-free job management when a local batch system is available, but this is not yet possible unless your suite and job hosts also share a filesystem, which allows Cylc to treat jobs as entirely local [6].

15.5.2. SSH-based Job Management

Cylc does not have persistent agent processes running on job hosts to act on instructions received over the network [7], so instead we execute job management commands directly on job hosts over SSH. Reasons for this include:

  • it works equally for batch system and background jobs
  • SSH is required for background jobs, and for batch jobs if the batch system is not available on the suite host
  • querying the batch system alone is not sufficient for full job polling functionality because jobs can complete (and then be forgotten by the batch system) while the network, suite host, or suite server program is down (e.g. between suite shutdown and restart)
    • to handle this, the automatic job wrapper code writes job messages and exit status to job status files, which suite server programs interrogate during job polling operations
    • job status files reside on the job host, so the interrogation is done over SSH
  • job status files also hold the batch system name and job ID; these are written by the job submit command and read by the job poll and kill commands (all over SSH)
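
For illustration only, this minimal shell sketch shows how the batch system name and job ID might be read back from a job status file. The path and the CYLC_BATCH_SYS_* field names follow the Cylc 7 layout shown later in this section, but they are assumptions here rather than a definitive interface.

# Sketch: read the batch system name and job ID recorded in a job status file.
# Path and field names are illustrative of Cylc 7 behaviour.
JOB_STATUS_FILE="$HOME/cylc-run/suitex/log/job/1/remote-a/01/job.status"
BATCH_SYS_NAME=$(sed -n 's/^CYLC_BATCH_SYS_NAME=//p' "$JOB_STATUS_FILE")
BATCH_SYS_JOB_ID=$(sed -n 's/^CYLC_BATCH_SYS_JOB_ID=//p' "$JOB_STATUS_FILE")
echo "${BATCH_SYS_NAME}:${BATCH_SYS_JOB_ID}"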

15.5.3. A Concrete Example

The following suite, registered as suitex, is used to illustrate our current SSH-based remote job management. It submits two jobs to a remote host, and a local task views a remote job log, then polls and kills the remote jobs.

# suite.rc
[scheduling]
    [[dependencies]]
        graph = "delayer => master & REMOTES"
[runtime]
    [[REMOTES]]
        script = "sleep 30"
        [[[remote]]]
            host = wizard
            owner = hobo
    [[remote-a, remote-b]]
        inherit = REMOTES
    [[delayer]]
        script = "sleep 10"
    [[master]]
        script = """
            sleep 5
            cylc cat-log -m c -f o $CYLC_SUITE_NAME remote-a.1
            sleep 2
            cylc poll $CYLC_SUITE_NAME REMOTES.1
            sleep 2
            cylc kill $CYLC_SUITE_NAME REMOTES.1
            sleep 2
            cylc remove $CYLC_SUITE_NAME REMOTES.1"""

The delayer task just separates suite start-up from remote job submission, for clarity when watching the job host (e.g. with watch -n 1 find ~/cylc-run/suitex).

Global config specifies the path to the remote Cylc executable, says to retrieve job logs, and not to use a remote login shell:

# global.rc
[hosts]
   [[wizard]]
       cylc executable = /opt/bin/cylc
       retrieve job logs = True
       use login shell = False
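
To confirm what the suite host will actually use for a given job host, the effective global configuration can be queried. This assumes Cylc 7's get-global-config command and its --item option:

# Print the effective global config section for the job host (options assumed).
cylc get-global-config --item='[hosts][wizard]'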

On running the suite, remote job host actions were captured in the transcripts below by wrapping the ssh, scp, and rsync executables in scripts that log their command lines before taking action.
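
A minimal wrapper of this kind, placed ahead of the real executable on the PATH of the account running the commands, might look like the following (the log file location is arbitrary; the same pattern applies to scp and rsync):

#!/bin/bash
# Logging wrapper for ssh: record the command line, then hand over to the
# real executable so behaviour is otherwise unchanged. Paths are illustrative.
echo "$(date -u +%FT%TZ) ssh $*" >> "$HOME/command-transcript.log"
exec /usr/bin/ssh "$@"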

15.5.3.1. Create suite run directory and install source files

Done by rose suite-run before suite start-up (though the command will be migrated into Cylc soon).

  • with --new, it invokes bash over SSH with a raw shell expression to delete files left over from a previous run
  • it invokes itself over SSH to create top level suite directories and install source files
    • skips installation if server UUID file is found on the job host (indicates a shared filesystem)
  • uses rsync for suite source file installation

Note

The same directory structure is used on suite and job hosts, for consistency and simplicity, and because the suite host can also be a job host.

# rose suite-run --new only: initial clean-out
ssh -oBatchMode=yes -oConnectTimeout=10 hobo@wizard bash -l -O extglob -c 'cd; echo '"'"'673d7a0d-7816-42a4-8132-4b1ab394349c'"'"'; ls -d -r cylc-run/suitex/work cylc-run/suitex/share/cycle cylc-run/suitex/share cylc-run/suitex; rm -fr cylc-run/suitex/work cylc-run/suitex/share/cycle cylc-run/suitex/share cylc-run/suitex; (cd ; rmdir -p cylc-run/suitex/work cylc-run/suitex/share/cycle cylc-run/suitex/share cylc-run 2>/dev/null || true)'

# rose suite-run: test for shared filesystem and create share/cycle directories
ssh -oBatchMode=yes -oConnectTimeout=10 -n hobo@wizard env ROSE_VERSION=2018.02.0 CYLC_VERSION=7.6.x bash -l -c '"$0" "$@"' rose suite-run -vv -n suitex --run=run --remote=uuid=231cd6a1-6d61-476d-96e1-4325ef9216fc,now-str=20180416T042319Z

# rose suite-run: install suite source directory to job host
rsync -a --exclude=.* --timeout=1800 --rsh=ssh -oBatchMode=yes -oConnectTimeout=10 --exclude=231cd6a1-6d61-476d-96e1-4325ef9216fc --exclude=log/231cd6a1-6d61-476d-96e1-4325ef9216fc --exclude=share/231cd6a1-6d61-476d-96e1-4325ef9216fc --exclude=share/cycle/231cd6a1-6d61-476d-96e1-4325ef9216fc --exclude=work/231cd6a1-6d61-476d-96e1-4325ef9216fc --exclude=/.* --exclude=/cylc-suite.db --exclude=/log --exclude=/log.* --exclude=/state --exclude=/share --exclude=/work ./ hobo@wizard:cylc-run/suitex
   # (internal rsync)
   ssh -oBatchMode=yes -oConnectTimeout=10 -l hobo wizard rsync --server -logDtpre.iLsfx --timeout=1800 . cylc-run/suitex
   # (internal rsync, back from hobo@wizard)
   rsync --server -logDtpre.iLsfx --timeout=1800 . cylc-run/suitex

Result:

 ~/cylc-run/suitex
|__log->log.20180418T025047Z  # LOG DIRECTORIES
|__log.20180418T025047Z  # log directory for current suite run
|__suite.rc
|__xxx  # any suite source sub-dirs or files
|__work  # JOB WORK DIRECTORIES
|__share  #  SUITE SHARE DIRECTORY
   |__cycle
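
This layout can be checked from the suite host with a one-off remote listing, using the same host and owner as in the example (a rough sketch, not part of Cylc itself):

# List the top two levels of the remote run directory.
ssh hobo@wizard find cylc-run/suitex -maxdepth 2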

15.5.3.2. Server installs service directory

  • server address and credentials, so that clients such as cylc message executed by jobs can connect
  • done just before the first job is submitted to a remote, and at suite restart for the remotes of jobs running when the suite went down (server host, port, etc. may change at restart)
  • uses SSH to invoke cylc remote-init on job hosts. If the remote command does not find a server-side UUID file (which would indicate a shared filesystem) it reads a tar archive of the service directory from stdin and unpacks it to install (a plain-shell approximation of this transfer is sketched at the end of this subsection).
# cylc remote-init: install suite service directory
ssh -oBatchMode=yes -oConnectTimeout=10 hobo@wizard env CYLC_VERSION=7.6.x /opt/bin/cylc remote-init '066592b1-4525-48b5-b86e-da06eb2380d9' '$HOME/cylc-run/suitex'

Result:

 ~/cylc-run/suitex
|__.service  # SUITE SERVICE DIRECTORY
|  |__contact  # server address information
|  |__passphrase  # suite passphrase
|  |__ssl.cert  # suite SSL certificate
|__log->log.20180418T025047Z  # LOG DIRECTORIES
|__log.20180418T025047Z  # log directory for current suite run
|__suite.rc
|__xxx  # any suite source sub-dirs or files
|__work  # JOB WORK DIRECTORIES
|__share  #  SUITE SHARE DIRECTORY
   |__cycle
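
The tar-over-stdin transfer performed by cylc remote-init when no shared filesystem is detected can be approximated in plain shell as follows. The archive options, paths, host, and owner here are illustrative; this is a sketch of the idea, not the actual implementation:

# Sketch: stream the suite service directory to the job host over SSH stdin
# and unpack it there.
tar -C "$HOME/cylc-run/suitex" -cf - .service \
    | ssh hobo@wizard 'mkdir -p cylc-run/suitex && tar -C cylc-run/suitex -xf -'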

15.5.3.3. Server submits jobs

  • done when tasks are ready to run, for multiple jobs at once
  • uses SSH to invoke cylc jobs-submit on the remote - to read job scripts from stdin, write them to disk, and submit them to run
# cylc jobs-submit: submit two jobs
ssh -oBatchMode=yes -oConnectTimeout=10 hobo@wizard env CYLC_VERSION=7.6.x /opt/bin/cylc jobs-submit '--remote-mode' '--' '$HOME/cylc-run/suitex/log/job' '1/remote-a/01' '1/remote-b/01'

Result:

 ~/cylc-run/suitex
|__.service  # SUITE SERVICE DIRECTORY
|  |__contact  # server address information
|  |__passphrase  # suite passphrase
|  |__ssl.cert  # suite SSL certificate
|__log->log.20180418T025047Z  # LOG DIRECTORIES
|__log.20180418T025047Z  # log directory for current suite run
|  |__job  # job logs (to be distinguished from log/suite/ on the suite host)
|     |__1  # cycle point
|        |__remote-a  # task name
|        |  |__01  # job submit number
|        |  |  |__job  # job script
|        |  |  |__job.out  # job stdout
|        |  |  |__job.err  # job stderr
|        |  |  |__job.status  # job status
|        |  |__NN->01  # symlink to latest submit number
|        |__remote-b  # task name
|           |__01  # job submit number
|           |  |__job  # job script
|           |  |__job.out  # job stdout
|           |  |__job.err  # job stderr
|           |  |__job.status  # job status
|           |__NN->01  # symlink to latest submit number
|__suite.rc
|__xxx  # any suite source sub-dirs or files
|__work  # JOB WORK DIRECTORIES
|  |__1  # cycle point
|     |__remote-a  # task name
|     |  |__xxx  # (any files written by job to PWD)
|     |__remote-b  # task name
|        |__xxx  # (any files written by job to PWD)
|__share  #  SUITE SHARE DIRECTORY
   |__cycle
   |__xxx  # (any job-created sub-dirs and files)

15.5.3.4. Server tracks job progress

  • jobs send messages back to the server program on the suite host
    • directly: client-server HTTPS over the network (requires service files installed - see above)
    • indirectly: re-invoke clients on the suite host (requires reverse SSH)
  • OR server polls jobs at intervals (requires job polling - see below)
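
Regular polling can be configured per task in the suite definition. The following sketch assumes Cylc 7's [runtime] job settings for polling intervals; the interval values are arbitrary:

# suite.rc (sketch): poll submitted and running jobs at intervals, e.g. for
# job hosts that cannot message back to the suite host.
[runtime]
    [[root]]
        [[[job]]]
            submission polling intervals = PT1M
            execution polling intervals = PT1M, PT5M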

15.5.3.5. User views job logs

  • the cylc cat-log command, via CLI or GUI, invokes itself over SSH on the remote host
  • suites will serve job logs in future, but this will still be needed (e.g. if the suite is down)
# cylc cat-log: view a job log
ssh -oBatchMode=yes -oConnectTimeout=10 -n hobo@wizard env CYLC_VERSION=7.6.x /opt/bin/cylc cat-log --remote-arg='$HOME/cylc-run/suitex/log/job/1/remote-a/NN/job.out' --remote-arg=cat --remote-arg='tail -n +1 -F %(filename)s' suitex
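
For comparison, the user-facing form of the same operation, as used in the master task's script above, is simply:

# View the remote job's stdout from the suite host (cat mode).
cylc cat-log -m c -f o suitex remote-a.1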

15.5.3.6. Server cancels or kills jobs

  • done automatically or via user command cylc kill, for multiple jobs at once
  • uses SSH to invoke cylc jobs-kill on the remote, with job log paths on the command line. Reads job ID from the job status file.
# cylc jobs-kill: kill two jobs
ssh -oBatchMode=yes -oConnectTimeout=10 hobo@wizard env CYLC_VERSION=7.6.x /opt/bin/cylc jobs-kill '--' '$HOME/cylc-run/suitex/log/job' '1/remote-a/01' '1/remote-b/01'

15.5.3.7. Server polls jobs

  • done automatically or via user command cylc poll, for multiple jobs at once
  • uses SSH to invoke cylc jobs-poll on the remote, with job log paths on the command line. Reads job ID from the job status file.
# cylc jobs-poll: poll two jobs
ssh -oBatchMode=yes -oConnectTimeout=10 hobo@wizard env CYLC_VERSION=7.6.x /opt/bin/cylc jobs-poll '--' '$HOME/cylc-run/suitex/log/job' '1/remote-a/01' '1/remote-b/01'

15.5.3.8. Server retrieves job logs

  • done at job completion, according to global config
  • uses rsync
# rsync: retrieve two job logs
rsync -a --rsh=ssh -oBatchMode=yes -oConnectTimeout=10 --include=/1 --include=/1/remote-a --include=/1/remote-a/01 --include=/1/remote-a/01/** --include=/1/remote-b --include=/1/remote-b/01 --include=/1/remote-b/01/** --exclude=/** hobo@wizard:$HOME/cylc-run/suitex/log/job/ /home/vagrant/cylc-run/suitex/log/job/
   # (internal rsync)
   ssh -oBatchMode=yes -oConnectTimeout=10 -l hobo wizard rsync --server --sender -logDtpre.iLsfx . $HOME/cylc-run/suitex/log/job/
   # (internal rsync, back from hobo@wizard)
   rsync --server --sender -logDtpre.iLsfx . /home/hobo/cylc-run/suitex/log/job/
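
Retrieval behaviour is controlled per job host in the global configuration. The extra settings sketched below (maximum size and retry delays) are assumed to sit alongside the retrieve job logs switch used earlier; the values are arbitrary:

# global.rc (sketch): job log retrieval settings for the job host.
[hosts]
    [[wizard]]
        retrieve job logs = True
        retrieve job logs max size = 10M
        retrieve job logs retry delays = PT10S, PT1M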

15.5.3.9. Server tidies job remote at shutdown

  • removes .service/contact so that clients won’t repeatedly try to connect
# cylc remote-tidy: remove the remote suite contact file
ssh -oBatchMode=yes -oConnectTimeout=10 hobo@wizard env CYLC_VERSION=7.6.x /opt/bin/cylc remote-tidy '$HOME/cylc-run/suitex'

15.5.4. Other Use of SSH in Cylc

  • see if a suite is running on another host with a shared filesystem - see detect_old_contact_file() in lib/cylc/suite_srv_files_mgr.py
  • cat the content of a remote service file over SSH, if possible, for clients on hosts that do not have suite credentials installed - see _load_remote_item() in suite_srv_files_mgr.py
[4] Cylc used to run bare shell expressions over SSH, which required a bash shell and made whitelisting difficult.
[5] A malicious script could be rsync’d and run from a batch job, but batch jobs are considered easier to audit.
[6] The job ID must also be valid to query and kill the job via the local batch system. This is not the case for Slurm unless the --cluster option is used explicitly in job query and kill commands; otherwise the job ID is not recognized by the local Slurm instance.
[7] This would be a more complex solution, in terms of implementation, administration, and security.