Leveraging Systemd Integration with Gridware Cluster Scheduler

August 5, 2025

The Power of Modern Linux System Management

In today’s world of high-performance computing, efficient resource management is critical.
Gridware Cluster Scheduler and its open-source base, Open Cluster Scheduler (formerly Sun Grid Engine), now offer deep integration with Systemd, the modern init system for Linux. This integration opens new possibilities for system administrators to control and monitor their cluster environments with greater precision.

What is Systemd and Why It Matters

Systemd is a system and service manager for Linux that controls system processes, services, and resources. Its cgroup-based architecture provides advanced features for resource control and monitoring.

The integration between Cluster Scheduler and Systemd happens in two main areas:

  1. Cluster Scheduler Daemons as Systemd Services: Enhanced management of cluster core components
  2. Job Execution Under Systemd and Cgroup Control: Precise resource control and detailed monitoring of computation jobs. Cgroup versions 1 and 2 are supported, with version 2 being the default in modern Linux distributions.

Let’s take a detailed look at both aspects and how they can optimize your cluster environment.

Cluster Scheduler Daemons as Systemd Services

Automatic Installation and Configuration

When you install Cluster Scheduler, it provides systemd service files for the master/shadow daemon and the execution daemons. These are placed in the /etc/systemd/system/ directory and automatically enabled and started during installation.

All Cluster Scheduler components run within a parent systemd slice. The slice name is requested during installation and defaults to ocs<qmaster_port>.slice. For the qmaster daemon running on port 8012, this would be ocs8012.slice.

The individual services are named accordingly as ocs8012-qmaster.service for the qmaster daemon and ocs8012-execd.service for the execution daemon.
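
To get an overview of all Cluster Scheduler units on a host, the usual systemctl queries can be pointed at this naming scheme. A minimal sketch, assuming the default naming for a qmaster port of 8012 as used throughout this article:

# List all Cluster Scheduler units on this host (slice, services, job scopes)
$ systemctl list-units 'ocs8012*'

# Show the unit hierarchy below the parent slice
$ systemctl list-dependencies ocs8012.slice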

Managing Cluster Scheduler Daemons with Systemd

To query the status of all Cluster Scheduler services on a host, you can use the following command:

$ systemctl status ocs8012.slice
● ocs8012.slice - Slice /ocs8012
     Loaded: loaded
    Drop-In: /etc/systemd/system.control/ocs8012.slice.d
             └─50-TasksAccounting.conf
     Active: active since Sun 2025-06-29 16:30:44 CEST; 13min ago
      Tasks: 28
     Memory: 27.1M (peak: 34.9M)
        CPU: 3.178s
     CGroup: /ocs8012.slice
             ├─ocs8012-execd.service
             │ └─15491 /scratch/joga/clusters/master/bin/lx-amd64/sge_execd
             └─ocs8012-qmaster.service
               ├─23568 /scratch/joga/clusters/master/bin/lx-amd64/sge_qmaster
               └─23631 /scratch/joga/clusters/master/bin/lx-amd64/sge_shadowd

Standard systemd commands can be used for management:

  • systemctl stop ocs8012-qmaster.service to stop the qmaster service
  • systemctl start ocs8012-execd.service to start the execution daemon
  • systemctl status ocs8012-qmaster.service to check status
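
Log inspection and boot-time behavior follow the same pattern. These are plain systemd commands, shown here with the unit names from the example above:

# Follow systemd's view of the qmaster service (state changes and any output captured by the journal)
$ journalctl -u ocs8012-qmaster.service -f

# Enable or disable automatic start of the execution daemon at boot
$ systemctl enable ocs8012-execd.service
$ systemctl disable ocs8012-execd.service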

Enhanced Monitoring Through Systemd Accounting

The accounting information provided by systemd offers valuable insights into the resource usage of the Cluster Scheduler daemons. CPU and memory accounting are enabled by default. Additionally, IO and network (IP) accounting can be enabled by adding the following settings to the service files in /etc/systemd/system, e.g., /etc/systemd/system/ocs8012-qmaster.service:

IOAccounting=true
IPAccounting=true
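
If you prefer not to modify the installed unit file in place, the same settings can be applied through a systemd drop-in. This is plain systemd tooling rather than anything Cluster Scheduler specific; the service name matches the example above:

# Create a drop-in override for the qmaster service and add the settings
# in the editor that opens (IOAccounting/IPAccounting go into the [Service] section)
$ sudo systemctl edit ocs8012-qmaster.service

# If the unit file was edited directly instead, reload systemd first
$ sudo systemctl daemon-reload

# Restart the service so the new accounting settings take effect
$ sudo systemctl restart ocs8012-qmaster.service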

After restarting the service, detailed usage data will be displayed:

$ systemctl status ocs8012-qmaster.service
● ocs8012-qmaster.service - Open Cluster Scheduler sge_qmaster service
     Loaded: loaded (/etc/systemd/system/ocs8012-qmaster.service; enabled; preset: enabled)
     Active: active (running) since Sun 2025-07-06 09:18:49 CEST; 30min ago
       Docs: man:sge_qmaster(8)
    Process: 736019 ExecStart=/scratch/joga/clusters/master/default/common/sgemaster start (code=exited, status=0/SUCCESS)
         IP: 46.1M in, 128.6M out
         IO: 2.8M read, 426.9M written
      Tasks: 24 (limit: 4605)
     Memory: 1.2G (peak: 1.2G)
        CPU: 1min 45.750s
     CGroup: /ocs8012.slice/ocs8012-qmaster.service
             ├─736174 /scratch/joga/clusters/master/bin/lx-amd64/sge_qmaster
             └─736237 /scratch/joga/clusters/master/bin/lx-amd64/sge_shadowd
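
The same figures can also be read in a script-friendly form via systemctl show. The property names below are standard systemd accounting properties; which of them are populated depends on the systemd version and on the accounting settings that are enabled:

# Query selected accounting properties of the qmaster service
$ systemctl show ocs8012-qmaster.service \
    -p MemoryCurrent -p CPUUsageNSec \
    -p IOReadBytes -p IOWriteBytes \
    -p IPIngressBytes -p IPEgressBytes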

Jobs Under Systemd Control

On hosts with systemd installed, Cluster Scheduler jobs can be run under systemd control. This enables advanced features for:

  • Resource control: CPU and memory limits, core binding, and device isolation
  • Job management: Signaling all processes of a job, suspending and resuming jobs (see the sketch after this list)
  • Job monitoring: Status checking and resource usage tracking
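
The scheduler drives these operations itself (e.g., via qmod and qdel), but because every job ends up in its own systemd unit, they can be illustrated with plain systemctl commands acting on a job's scope. This is a sketch for illustration only, assuming the scope naming described below, cgroup v2, and a reasonably recent systemd for freeze/thaw:

# Send a signal to all processes of a job (replace <job_id> with a real job id)
$ systemctl kill --signal=SIGUSR1 ocs8012.<job_id>.scope

# Suspend and resume all processes of a job
$ systemctl freeze ocs8012.<job_id>.scope
$ systemctl thaw ocs8012.<job_id>.scope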

Hierarchical Organization of Cluster Scheduler Jobs in Systemd

Cluster Scheduler jobs are presented as a hierarchy of systemd units. A simple example illustrates this structure:

$ qsub $SGE_ROOT/examples/jobs/sleeper.sh
Your job 4 ("Sleeper") has been submitted

$ systemctl status ocs8012.slice
● ocs8012.slice - Slice /ocs8012
     Loaded: loaded
    Drop-In: /etc/systemd/system.control/ocs8012.slice.d
             └─50-TasksAccounting.conf
     Active: active since Sun 2025-06-29 16:30:44 CEST; 1h 27min ago
      Tasks: 32
     Memory: 50.9M (peak: 51.6M)
        CPU: 14.171s
     CGroup: /ocs8012.slice
             ├─ocs8012-execd.service
             │ └─30003 /scratch/joga/clusters/master/bin/lx-amd64/sge_execd
             ├─ocs8012-jobs.slice
             │ └─ocs8012.4.scope
             │   ├─33430 /bin/sh /usr/local/testsuite/8012/execd/ubuntu-24-amd64-1/job_scripts/4
             │   └─33433 sleep 60
             └─ocs8012-shepherds.scope
               └─33429 sge_shepherd-4 -bg

Note the clear separation:

  • The sge_shepherd process runs in a separate scope (ocs8012-shepherds.scope)
  • The job itself runs in its own scope within the jobs slice (ocs8012.4.scope)

This separation provides enhanced stability and allows the sge_execd and sge_shepherd processes and the actual job processes to be managed independently. For array jobs, each array task runs in its own scope:

$ qsub -t 1-3 $SGE_ROOT/examples/jobs/sleeper.sh
Your job-array 6.1-3:1 ("Sleeper") has been submitted

$ systemctl status ocs8012-jobs.slice
● ocs8012-jobs.slice - Slice /ocs8012/jobs
     Loaded: loaded
     Active: active since Sun 2025-06-29 16:56:21 CEST; 17h ago
      Tasks: 6
     Memory: 1.1M (peak: 1.9M)
        CPU: 50ms
     CGroup: /ocs8012.slice/ocs8012-jobs.slice
             ├─ocs8012.6.1.scope
             │ ├─42278 /bin/sh /usr/local/testsuite/8012/execd/ubuntu-24-amd64-1/job_scripts/6
             │ └─42283 sleep 60
             ├─ocs8012.6.2.scope
             │ ├─42280 /bin/sh /usr/local/testsuite/8012/execd/ubuntu-24-amd64-1/job_scripts/6
             │ └─42286 sleep 60
             └─ocs8012.6.3.scope
               ├─42279 /bin/sh /usr/local/testsuite/8012/execd/ubuntu-24-amd64-1/job_scripts/6
               └─42289 sleep 60
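
Because each array task has its own scope, the resource usage of an individual task can be inspected directly. A minimal sketch, using the job and task id from the example above and standard systemd accounting properties:

# Show current memory, CPU and task counts for array task 6.2
$ systemctl show ocs8012.6.2.scope -p MemoryCurrent -p CPUUsageNSec -p TasksCurrent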

For tightly integrated parallel jobs, a systemd slice is created per job, containing one scope for the master task and one scope for each slave task:

$ qsub -cwd -o testmpi.log -j y -l a=lx-amd64 -pe mpich.pe 4 \
    -v MPIR_HOME=~/3rd_party/mpi/mpich-4.3.0/lx-amd64 $SGE_ROOT/mpi/examples/testmpi.sh
Your job 13 ("testmpi.sh") has been submitted

# on the master host:
$ systemctl status ocs8012-jobs.slice
● ocs8012-jobs.slice - Slice /ocs8012/jobs
     Loaded: loaded
     Active: active since Sun 2025-06-29 16:56:21 CEST; 17h ago
      Tasks: 11
     Memory: 19.5M (peak: 30.0M)
        CPU: 1min 6.510s
     CGroup: /ocs8012.slice/ocs8012-jobs.slice
             └─ocs8012-jobs-13.slice
               └─ocs8012.13.master.scope
                 ├─42518 mpirun ./testmpi
                 ├─42519 /home/joga/3rd_party/mpi/mpich-4.3.0/lx-amd64/bin/hydra_pmi_proxy --control-port ubuntu-24-amd64-1:34879 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 ->
                 ├─42520 /scratch/joga/clusters/master/bin/lx-amd64/qrsh -inherit -V ubuntu-22-amd64-2 "\"/home/joga/3rd_party/mpi/mpich-4.3.0/lx-amd64/bin/hydra_pmi_proxy\"" --control-port u>
                 ├─42521 ./testmpi
                 └─42522 ./testmpi

# on a slave host:
$ systemctl status ocs8012-jobs.slice
● ocs8012-jobs.slice - Slice /ocs8012/jobs
     Loaded: loaded
     Active: active since Sun 2025-06-29 16:56:17 CEST; 17h ago
      Tasks: 8
     Memory: 15.4M
        CPU: 14.627s
     CGroup: /ocs8012.slice/ocs8012-jobs.slice
             └─ocs8012-jobs-13.slice
               └─ocs8012.13-1.ubuntu-22-amd64-2.scope
                 ├─12273 /scratch/joga/clusters/master/utilbin/lx-amd64/qrsh_starter /usr/local/testsuite/8012/execd/ubuntu-22-amd64-2/active_jobs/13.1/1.ubuntu-22-amd64-2
                 ├─12280 /home/joga/3rd_party/mpi/mpich-4.3.0/lx-amd64/bin/hydra_pmi_proxy --control-port ubuntu-24-amd64-1:34879 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 ->
                 ├─12281 ./testmpi
                 └─12282 ./testmpi

Advanced Resource Management for Jobs

Job Resource Limits via Systemd

Cluster Scheduler jobs can be configured with various resource limits, either set in queue configuration or overridden at job submission. With systemd integration, certain limits are directly enforced through systemd mechanisms:

  OCS Limit    Systemd Limit
  s_rss        MemoryHigh
  h_rss        MemoryMax
  s_vmem       MemoryHigh
  h_vmem       MemoryMax
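
As an illustration, a hard memory limit requested at submission should then appear as MemoryMax on the job's scope. The job id below is hypothetical; the scope name follows the naming pattern shown earlier:

# Request a hard memory limit of 2 GiB for the job
$ qsub -l h_vmem=2G $SGE_ROOT/examples/jobs/sleeper.sh

# Verify the limit on the job's scope (assuming job id 42)
$ systemctl show ocs8012.42.scope -p MemoryMax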

Core Binding with Systemd

Core binding (processor affinity) can be requested at job submission with the `-binding` option. With systemd support, this is implemented via the `AllowedCPUs` property of the job’s systemd scope unit:

$ qsub -binding linear:2 $SGE_ROOT/examples/jobs/sleeper.sh
Your job 6 ("Sleeper") has been submitted

$ systemctl show ocs8012.6.scope | grep EffectiveCPUs
EffectiveCPUs=0-1
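
Since the binding is applied through the AllowedCPUs property of the scope, that property can also be queried directly (same job as above):

# Show the CPU set configured on the job scope
$ systemctl show ocs8012.6.scope -p AllowedCPUs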

Device Isolation for Secure Job Execution

Device isolation allows restricting a job’s access to specific devices, which is particularly important for jobs requiring exclusive access to hardware resources like GPUs. It is achieved by setting the DevicePolicy and DeviceAllow properties on the job’s systemd scope unit. Device isolation will be applied automatically for resources defined as RSMAP once the device is configured as a resource property (planned for the next minor release). Until then, the feature can be tested by submitting jobs with the environment variable SGE_DEBUG_DEVICES_ALLOW set to a list of devices and access modes: SGE_DEBUG_DEVICES_ALLOW="<device>=<mode>[;<device>=<mode>...]".

$ qsub -v SGE_DEBUG_DEVICES_ALLOW="/dev/nvidia0=r;/dev/nvidiactl=w" \
    $SGE_ROOT/examples/jobs/sleeper.sh
Your job 8 ("Sleeper") has been submitted

`systemctl show ocs8012.<job_id>.scope | grep ^Device` will show the settings for device isolation of the job scope unit:

$ systemctl show ocs8012.8.scope | grep ^Device
DevicePolicy=closed
DeviceAllow=/dev/nvidiactl w
DeviceAllow=/dev/nvidia0 r

Enhanced Job Monitoring and Resource Collection

Cluster Scheduler can collect resource usage data either from systemd or through its internal PDC (Portable Data Collector) module. The data collection method can be configured via the `execd_params` parameter `USAGE_COLLECTION`:

  • FALSE: No online usage information is collected
  • PDC: Usage information is collected by the PDC module, even if systemd is available
  • HYBRID: Usage information is collected via both systemd and the PDC module
  • TRUE: Default mode – usage information is collected via systemd if available, otherwise via the PDC module
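
USAGE_COLLECTION is set as part of execd_params in the cluster configuration. A sketch of how it could be inspected and changed with the standard qconf commands; the actual execd_params in your cluster will likely contain additional entries:

# Show the current configuration and look for execd_params
$ qconf -sconf | grep execd_params

# Edit the global configuration and set, for example:
#   execd_params   USAGE_COLLECTION=HYBRID
$ qconf -mconf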

Conclusion: Benefits of Systemd Integration for Cluster Scheduler

The integration of Cluster Scheduler with systemd provides numerous benefits for cluster administrators:

  • Enhanced daemon management: Easier control and detailed monitoring of cluster services
  • Precise resource control: More accurate enforcement of resource limits for jobs
  • Robust job management: Improved process group management and clean termination
  • Detailed monitoring: Comprehensive insights into resource usage of daemons and jobs
  • Modern Linux integration: Leveraging the latest Linux features for cluster management

If you’re deploying Cluster Scheduler in your environment, systemd integration offers a powerful way to manage your cluster more efficiently and gain deeper insights into its performance.

Availability

The systemd integration is available in the daily builds of Gridware Cluster Scheduler and in the Open Cluster Scheduler main development branch, which will be released as the next minor version (9.1.0).

It can be downloaded from the Cluster Scheduler download page
or built from the Open Cluster Scheduler GitHub repository.

Discover More

To learn more about how Gridware can revolutionize your HPC and AI workloads, contact us for a personalized consultation at jgabler@hpc-gridware.com.