Systemd Hardening

Evaluate Security of Service

Overall security scores for all running services:

$ systemd-analyze security
ModemManager.service                      5.9 MEDIUM    😐
NetworkManager.service                    7.8 EXPOSED   🙁
systemd-journald.service                  4.3 OK        🙂
systemd-logind.service                    2.6 OK        🙂
…
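
To focus on the most exposed units, the overview can be filtered by score. The following is a minimal sketch: exposed_units is a hypothetical helper name, and it assumes the table layout shown above (unit name in the first column, numeric exposure score in the second).

```shell
# Hypothetical helper: print units whose exposure score is at or above a threshold.
# Reads the table produced by `systemd-analyze security` from stdin.
exposed_units() {
    threshold=${1:-7.0}
    # Only lines whose second field is a plain number are considered,
    # which skips the header line and any trailing remarks.
    awk -v min="$threshold" \
        '$2 ~ /^[0-9]+(\.[0-9]+)?$/ && $2 + 0 >= min + 0 { print $1 }'
}

# Usage on a systemd host:
#   systemd-analyze security --no-pager | exposed_units 7.0
```

The threshold defaults to 7.0 here purely for illustration; pick whatever cut-off suits your review.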

Detailed security analysis for a single service:

$ systemd-analyze security tor@default.service
  NAME                                       DESCRIPTION                                EXPOSURE
✗ PrivateNetwork=                            Service has access to the host's network        0.5
✗ User=/DynamicUser=                         Service runs as root user                       0.4
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PC… Service may change UID/GID identities/cap…      0.3
✓ CapabilityBoundingSet=~CAP_SYS_ADMIN       Service has no administrator privileges
✓ CapabilityBoundingSet=~CAP_SYS_PTRACE      Service has no ptrace() debugging abiliti…
✗ RestrictAddressFamilies=~AF_(INET|INET6)   Service may allocate Internet sockets           0.3
✓ RestrictNamespaces=~CLONE_NEWUSER          Service cannot create user namespaces
✓ RestrictAddressFamilies=~…                 Service cannot allocate exotic sockets
✓ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|… Service cannot change file ownership/acce…
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|… Service may override UNIX file/IPC permis…      0.2
✓ CapabilityBoundingSet=~CAP_NET_ADMIN       Service has no network configuration priv…
✓ CapabilityBoundingSet=~CAP_SYS_MODULE      Service cannot load kernel modules
✓ CapabilityBoundingSet=~CAP_SYS_RAWIO       Service has no raw I/O access
✓ CapabilityBoundingSet=~CAP_SYS_TIME        Service processes cannot change the syste…
✗ DeviceAllow=                               Service has a device ACL with some specia…      0.1
✗ IPAddressDeny=                             Service does not define an IP address all…      0.2
✓ KeyringMode=                               Service doesn't share key material with o…
✓ NoNewPrivileges=                           Service processes cannot acquire new priv…
…
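
The detailed report can be long, so it may help to list only the failed checks, highest exposure first. A minimal sketch, assuming the layout shown above (a ✗ marker in the first column and the exposure value in the last); failed_checks is a hypothetical helper name.

```shell
# Hypothetical helper: list failed checks as "<exposure> <setting>", highest first.
# Reads the per-service report of `systemd-analyze security <unit>` from stdin.
failed_checks() {
    # $1 is the marker, $2 the setting name, $NF the exposure value.
    awk '$1 == "✗" { print $NF, $2 }' | sort -rn
}

# Usage on a systemd host:
#   systemd-analyze security --no-pager tor@default.service | failed_checks
```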

Global Hardening Options

Global options apply to systemd itself and to all processes it starts. A reboot is required for the settings to take effect.

Example config (/etc/systemd/system.conf.d/99-custom.conf):

[Manager]
SystemCallArchitectures=native

See systemd-system.conf(5)

CapabilityBoundingSet

Identical to CapabilityBoundingSet for services but applied to systemd itself and all processes it starts.

Danger

Dropping capabilities which are required for the system to boot will leave you with an unbootable system.

Note

As of this writing, I do not believe that any capability can be dropped easily. Some capabilities are rarely needed, such as CAP_CHECKPOINT_RESTORE or CAP_PERFMON, but for historical reasons the operations they permit are also covered by CAP_SYS_ADMIN. This renders dropping them moot.

SystemCallArchitectures

Allow native calls only:

SystemCallArchitectures=native

See also service-level SystemCallArchitectures below.

Service Options

Options that can be used in the [Service] section of systemd service units.

Example, extending the exim4 service with some custom hardening (/etc/systemd/system/exim4.service.d/99-custom.conf):

[Service]
PrivateTmp=yes
ProtectSystem=strict
TemporaryFileSystem=/run/exim4
ReadWritePaths=/var/lib/exim4
ReadWritePaths=/var/log/exim4
ReadWritePaths=/var/spool/exim4

See also More Examples below.
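
Installing such a drop-in could be scripted roughly as follows. This is a sketch under assumptions: write_dropin is a hypothetical helper, the file name 99-custom.conf is taken from the example above, and the root argument exists only so the helper can be tried against a scratch directory first.

```shell
# Hypothetical helper: write a drop-in for a unit below the given root directory.
# Pass "/" as root on a real host. The drop-in content is read from stdin.
write_dropin() {
    root=$1
    unit=$2
    dir="${root%/}/etc/systemd/system/${unit}.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/99-custom.conf"
}

# Usage on a real host (as root):
#   write_dropin / exim4.service <<'EOF'
#   [Service]
#   PrivateTmp=yes
#   ProtectSystem=strict
#   EOF
#   systemctl daemon-reload             # make systemd re-read unit files
#   systemctl restart exim4.service     # apply the new settings
#   systemctl cat exim4.service         # show the unit together with its drop-ins
```

Alternatively, systemctl edit exim4.service opens an editor on a drop-in and reloads the configuration afterwards.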

AppArmorProfile

Enforce an AppArmor MAC profile for the service.

Enforce profile <profile_name>:

AppArmorProfile=<profile_name>

Capabilities

On Linux, super-user privileges are divided into capabilities. Available capabilities are listed in capabilities(7) and systemd-analyze capability lists all capabilities known to systemd.

CapabilityBoundingSet

Restrict available capabilities (i.e. restrict super-user privileges).

Drop all capabilities:

CapabilityBoundingSet=

Retain only capabilities CAP_SETGID and CAP_SETUID:

CapabilityBoundingSet=CAP_SETGID CAP_SETUID

Drop only capabilities CAP_SETGID and CAP_SETUID:

CapabilityBoundingSet=~CAP_SETGID CAP_SETUID
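
To check which capabilities a running service actually ends up with, the kernel exposes the capability sets of every process in /proc/<pid>/status. A minimal sketch, where cap_masks is a hypothetical helper name and the file argument defaults to the calling shell's own status file:

```shell
# Hypothetical helper: print the capability sets recorded in a status file
# as hexadecimal bit masks (inheritable, permitted, effective, bounding, ambient).
cap_masks() {
    grep -E '^Cap(Inh|Prm|Eff|Bnd|Amb):' "${1:-/proc/self/status}"
}

# Usage on a systemd host:
#   pid=$(systemctl show -p MainPID --value exim4.service)
#   cap_masks "/proc/$pid/status"
```

The masks are bit fields indexed by capability number. CAP_SETGID is number 6 and CAP_SETUID is number 7, so a bounding set limited to these two reads as 0xc0 in CapBnd. If libcap's capsh is installed, capsh --decode=<mask> translates a mask into names.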

See systemd.exec(5) → CapabilityBoundingSet=

AmbientCapabilities

By default, all capabilities are dropped when a service runs as a non-root user. This directive can be used to grant a non-root user limited super-user capabilities.

Grant user backup-daemon capability CAP_DAC_READ_SEARCH:

User=backup-daemon
AmbientCapabilities=CAP_DAC_READ_SEARCH

This should generally be preferred to running a service as root and dropping capabilities via CapabilityBoundingSet because root will still have (write) access to most files as it owns most of them. Also, some services do permission checks based on UID. For instance, Postgres will check the UID/name of the connecting user.

See systemd.exec(5) → AmbientCapabilities=

MemoryDenyWriteExecute

Prevent memory allocations that are writeable and executable at the same time:

MemoryDenyWriteExecute=yes
SystemCallFilter=~memfd_create

This protection can be circumvented if the service can call memfd_create or can write to a filesystem that is not mounted noexec. To close both routes:

  • Filter the memfd_create syscall (as shown above).

  • Deny write access to files and directories, or set the noexec mount option on every filesystem the service can write to. The latter may be achieved via NoExecPaths.
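
Before enabling this option, it can help to check whether a service currently relies on mappings that are both writable and executable (JIT compilers commonly do), since such a service would likely fail once the option is set. A minimal sketch, assuming the /proc/<pid>/maps layout described in proc(5), where the second field holds the permissions; wx_mappings is a hypothetical helper name.

```shell
# Hypothetical helper: print mappings that are writable and executable at the same time.
# The argument is a maps file; it defaults to the calling shell's own mappings.
wx_mappings() {
    awk '$2 ~ /w/ && $2 ~ /x/' "${1:-/proc/self/maps}"
}

# Usage on a systemd host (reading another user's process requires privileges):
#   pid=$(systemctl show -p MainPID --value exim4.service)
#   wx_mappings "/proc/$pid/maps"
```

Empty output only means that no such mapping exists at the moment of the check; a service may still create one later.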

See systemd.exec(5) → MemoryDenyWriteExecute=

LockPersonality

Disable emulation of different behaviors to support non-Linux-native binaries.

Lock personality:

LockPersonality=yes

See systemd.exec(5) → LockPersonality=

NoNewPrivileges

Prevent the service from escalating privileges:

NoNewPrivileges=yes

In particular, the service process and all its children will ignore the setuid and setgid bits that programs like su and sudo rely on to gain privileges.

Note about the systemd socket:

Services that are able to start other services via systemd (e.g. via systemd-run) may be able to get around this restriction.
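
Whether the flag is in effect for a running process can be read from the NoNewPrivs field of /proc/<pid>/status, on kernels that expose this field. A minimal sketch; no_new_privs is a hypothetical helper name.

```shell
# Hypothetical helper: print the NoNewPrivs value recorded in a status file
# (1 means the process can no longer gain privileges through execve).
no_new_privs() {
    awk '$1 == "NoNewPrivs:" { print $2 }' "${1:-/proc/self/status}"
}

# Usage on a systemd host:
#   pid=$(systemctl show -p MainPID --value exim4.service)
#   no_new_privs "/proc/$pid/status"
```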

See systemd.exec(5) → NoNewPrivileges=

Devices

DeviceAllow

Allow access to /dev/loop-control and /dev/loop[0-9]:

DeviceAllow=/dev/loop-control
DeviceAllow=block-loop

Allow read-only access to /dev/sda:

DeviceAllow=/dev/sda:r

Use PrivateDevices when only the default set of pseudo-devices like /dev/null, /dev/zero and /dev/urandom is needed.

By default, access to common pseudo-devices like /dev/null or /dev/urandom is always granted. This behavior can be changed using systemd.resource-control(5) → DevicePolicy=.

See systemd.resource-control(5) → DeviceAllow=

PrivateDevices

Only provide a minimal set of devices like /dev/null, /dev/zero or /dev/urandom to the service. Systemd will also take other measures to prevent device creation and access.

Enable private devices:

PrivateDevices=yes

See systemd.exec(5) → PrivateDevices=

PrivateIPC

Create a private IPC namespace for the service:

PrivateIPC=yes

Multiple services can be made to share their IPC namespace using JoinsNamespaceOf.

See systemd.exec(5) → PrivateIPC=

Availability: systemd 248

RemoveIPC

Remove IPC objects when service is stopped:

User=exampled
RemoveIPC=yes

This removes all System V and POSIX IPC objects owned by the service’s user, including objects that were not created by the service itself, when the service is stopped.

See systemd.exec(5) → RemoveIPC=

Availability: systemd 232

/proc/ and /sys/ Filesystem

ProcSubset

Only allow access to PID information in /proc (i.e. /proc/<pid>/):

ProcSubset=pid

See systemd.exec(5) → ProcSubset=

ProtectProc

Control access to processes in /proc.

Deny access to other users’ processes:

ProtectProc=noaccess

Hide other users’ processes:

ProtectProc=invisible

Hide non-ptraceable processes:

ProtectProc=ptraceable

You should usually prefer invisible over noaccess as many services do not handle being denied access well.

This directive corresponds to the hidepid= mount option of proc. See proc(5)#Mount_options

See systemd.exec(5) → ProtectProc=

ProtectKernelTunables

Protect kernel variables accessible in /proc, /sys or via sysctl(8)/sysctl.conf(5):

ProtectKernelTunables=yes

See systemd.exec(5) → ProtectKernelTunables=

ProtectClock

Prevent service from manipulating clock:

ProtectClock=yes

See systemd.exec(5) → ProtectClock=

ProtectControlGroups

Prevent modifications to the cgroup hierarchies by the service:

ProtectControlGroups=yes

See systemd.exec(5) → ProtectControlGroups=

Filesystem Access

NoExecPaths

Only allow execution of /usr/bin/serviced:

NoExecPaths=/
ExecPaths=/usr/bin/serviced

This, in combination with MemoryDenyWriteExecute, may be used to make arbitrary code execution harder.

See systemd.exec(5) → NoExecPaths=

Availability: systemd 248

PrivateTmp

Create private, empty /tmp/ and /var/tmp/ for the service:

PrivateTmp=yes

Multiple services can be made to share their /tmp/ and /var/tmp/ using JoinsNamespaceOf.

Temporary files are cleaned when the service is stopped.

See systemd.exec(5) → PrivateTmp=

ProtectHome

Restrict access to /home/, /root/ and /run/user/ for a service.

Make /home/ inaccessible:

ProtectHome=yes

Make /home/ read-only:

ProtectHome=read-only

ReadWritePaths may be used to lift read-only restriction on subdirectories.

Replace /home/ with an empty, read-only directory:

ProtectHome=tmpfs

See systemd.exec(5) → ProtectHome=

InaccessiblePaths

Make directory/files at /etc/hidden, /hidden/ and /home/ inaccessible:

InaccessiblePaths=/etc/hidden /hidden/
InaccessiblePaths=/home/

See systemd.exec(5) → InaccessiblePaths=

ReadOnlyPaths

Make directory/files at /etc/hidden, /hidden/ and /home/ read-only:

ReadOnlyPaths=/etc/hidden /hidden/
ReadOnlyPaths=/home/

See:

  • systemd.exec(5) → ReadOnlyPaths=

  • ExecStart (full write access in ExecStart=, ExecStartPre=, etc.)

ReadWritePaths

Make directory/files at /etc/hidden, /hidden/ and /home/ readable/writable:

ReadWritePaths=/etc/hidden /hidden/
ReadWritePaths=/home/

Directories otherwise read-only or inaccessible due to the use of ProtectHome or ProtectSystem may be made readable/writable.

Subdirectories or files specified in ReadOnlyPaths may be made writable. However, this does not extend to InaccessiblePaths.

See systemd.exec(5) → ReadWritePaths=

RestrictFileSystems

Only allow opening files on an ext4 or tmpfs filesystem:

RestrictFileSystems=ext4 tmpfs

Only deny access to network filesystems:

RestrictFileSystems=~@network

Obtain a list of all known filesystems and groups:

$ systemd-analyze filesystems

See systemd.exec(5) → RestrictFileSystems=

Availability: systemd 250

ProtectSystem

Mount /usr/, /boot/ and /efi/ read-only:

ProtectSystem=yes

Additionally mount /etc/ read-only:

ProtectSystem=full

Mount everything read-only except /dev/, /proc/ and /sys/:

ProtectSystem=strict

Use ReadWritePaths to allow write access to specific files or directories.

See systemd.exec(5) → ProtectSystem=

ProtectHostname

Prevent service from manipulating hostname (UTS namespace):

ProtectHostname=yes

See systemd.exec(5) → ProtectHostname=

ProtectKernelLogs

Deny service access to kernel logs (e.g. via dmesg(1)):

ProtectKernelLogs=yes

See systemd.exec(5) → ProtectKernelLogs=

ProtectKernelModules

Prevent loading of kernel modules by service:

ProtectKernelModules=yes

See systemd.exec(5) → ProtectKernelModules=

TemporaryFileSystem

Place an empty tmpfs filesystem at /path/directory:

TemporaryFileSystem=/path/directory

The same but make the directory read-only:

TemporaryFileSystem=/path/directory:ro

This is often useful when a service can’t deal with a directory being read-only or inaccessible but is fine with it being empty.

See systemd.exec(5) → TemporaryFileSystem=

Networking

PrivateNetwork

Create a private network namespace with only a private loopback interface:

PrivateNetwork=yes

Multiple services can be made to share their network namespace using JoinsNamespaceOf. Restricting access to the (global) loopback interface, or any other interface, can be done using RestrictNetworkInterfaces.

See systemd.exec(5) → PrivateNetwork=

Availability: RestrictNetworkInterfaces requires systemd 250. PrivateNetwork itself is available in much older versions.

RestrictAddressFamilies

Restrict socket access to the IPv4, IPv6 and Unix socket families:

RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX

Allow no address family:

RestrictAddressFamilies=none

The special value of none is only supported starting with systemd 249.

Often required:

Family       Reason
AF_NETLINK   Enumerating network interfaces, for instance, to be able to bind to specific interfaces.
AF_UNIX      Logging via syslog(3).

See systemd.exec(5) → RestrictAddressFamilies=

RestrictNetworkInterfaces

Restrict access to the loopback (lo) interface:

RestrictNetworkInterfaces=lo

Deny access to interface eth0 only:

RestrictNetworkInterfaces=~eth0

When no network access is needed use PrivateNetwork.

See systemd.resource-control(5) → RestrictNetworkInterfaces=

Availability: systemd 250

SocketBindAllow

Important

Without also specifying SocketBindDeny=any, the service may bind to all ports.

Allow service to bind to TCP ports 80 and 443 only:

SocketBindAllow=tcp:80 tcp:443
SocketBindDeny=any

Omit protocol to allow TCP and UDP:

SocketBindAllow=80 443
SocketBindDeny=any

Port ranges, like 1200-1300, are accepted too.

Allow an unprivileged service to bind to TCP ports 80 and 443 only:

AmbientCapabilities=CAP_NET_BIND_SERVICE
SocketBindAllow=tcp:80 tcp:443
SocketBindDeny=any
User=www-data

Capability CAP_NET_BIND_SERVICE is required to bind to any port lower than 1024. SocketBindAllow can be used to restrict this privilege to certain ports.

See systemd.resource-control(5) → SocketBindAllow=

Availability: systemd 249

IPAddressAllow

Important

Without also specifying IPAddressDeny=any, the service will be allowed to connect to any address.

Only allow connecting to CIDR networks 10.0.0.0/8 and fc00::/7:

IPAddressAllow=10.0.0.0/8 fc00::/7
IPAddressDeny=any

The value localhost can be used to restrict access to 127.0.0.1 and ::1. If you wish to restrict access to localhost only, consider using RestrictNetworkInterfaces=lo in addition.

See systemd.resource-control(5) → IPAddressAllow=

IPAddressDeny

Deny access to CIDR networks 10.0.0.0/8 and fc00::/7:

IPAddressDeny=10.0.0.0/8 fc00::/7

See systemd.resource-control(5) → IPAddressDeny=

RestrictNamespaces

Deny any namespace change:

RestrictNamespaces=yes

Only allow access to namespace types ipc and net:

RestrictNamespaces=ipc net

Only deny access to namespace types ipc and net:

RestrictNamespaces=~ipc net

See systemd.exec(5) → RestrictNamespaces=

RestrictRealtime

Deny access to any realtime scheduling functionality:

RestrictRealtime=yes

See systemd.exec(5) → RestrictRealtime=

RestrictSUIDSGID

Prevent setting of SUID and SGID bits for file permissions:

RestrictSUIDSGID=yes

See systemd.exec(5) → RestrictSUIDSGID=

System Call Filtering (seccomp)

SystemCallArchitectures

Allow native calls only:

SystemCallArchitectures=native

This disables system call ABIs other than the native one. For example, it disables support for 32-bit x86 binaries on x86_64.

SystemCallFilter

Allow only syscalls in group @system-service:

SystemCallFilter=@system-service

Allow syscalls in group @system-service and syscall seccomp except those in group @chown:

SystemCallFilter=@system-service seccomp
SystemCallFilter=~@chown

Deny syscalls in group @chown with error EPERM rather than terminating the process:

SystemCallFilter=~@chown:EPERM

Many services can cope with EPERM, or another error code, for calls that are only used for optional functionality.

A list of all known syscalls and groups can be obtained like this:

$ systemd-analyze syscall-filter

Rather than killing the process, systemd can also be instructed to return an error code like EPERM for all violations:

SystemCallErrorNumber=EPERM

Services using SystemCallFilter should also use SystemCallArchitectures=native.
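
When deciding which group to allow or deny, it is useful to check whether a group contains a particular syscall. A minimal sketch, assuming that systemd-analyze syscall-filter <group> prints one syscall name per line (possibly indented); lists_syscall is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the listing on stdin contains the given
# syscall name on a line of its own (leading/trailing whitespace is ignored).
lists_syscall() {
    grep -Eq "^[[:space:]]*$1[[:space:]]*\$"
}

# Usage on a systemd host:
#   if systemd-analyze syscall-filter @network-io | lists_syscall socket; then
#       echo "socket is part of @network-io"
#   fi
```

Because the match is anchored to the whole line, a name like chown does not accidentally match fchown.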

See systemd.exec(5) → SystemCallFilter=

UMask

Create files and directories that are only accessible by the user/owner if permissions are not explicitly set during creation:

UMask=0077

Allow user and group only:

UMask=0007
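
The effect of a mask can be tried out in a shell before putting it into a unit. The sketch below assumes GNU or BusyBox stat (for the -c option); modes_under_umask is a hypothetical helper name. It relies on the usual defaults: programs request mode 0666 for new files and 0777 for new directories, and the mask removes bits from that.

```shell
# Hypothetical helper: show the permissions a new file and a new directory
# receive under the given umask. The function body runs in a subshell, so
# the caller's own umask is left untouched.
modes_under_umask() (
    umask "$1" || exit 1
    tmp=$(mktemp -d) || exit 1
    touch "$tmp/file" && mkdir "$tmp/dir" || exit 1
    printf 'file=%s dir=%s\n' "$(stat -c %a "$tmp/file")" "$(stat -c %a "$tmp/dir")"
    rm -rf "$tmp"
)

# Usage:
#   modes_under_umask 0077    # prints: file=600 dir=700
#   modes_under_umask 0007    # prints: file=660 dir=770
```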

See systemd.exec(5) → UMask=

User / Group

DynamicUser

Dynamically allocate a Unix user that the service runs as:

DynamicUser=yes

This is not suitable for services that write persistent data to disk or have to read private data. This is because the UID/GID will be unpredictable and may be shared (though not at the same time) with other services.

Read systemd.exec(5) → DynamicUser= before use.

See also ExecStart (run ExecStart=, ExecStartPre=, etc. with full privileges)

PrivateUsers

Run service in a private user namespace:

PrivateUsers=yes

See systemd.exec(5) → PrivateUsers=

User

Run process as user serviced:

User=serviced

Unless specified via Group, the primary group is taken from the passwd database. Supplementary groups are taken from the group database.

See systemd.exec(5) → User=

Group

Set the service’s primary group to serviced:

Group=serviced

See systemd.exec(5) → Group=

SupplementaryGroups

On Unix, any process belongs to a user (UID) and group (GID) but it may also belong to additional/supplementary groups. Such supplementary groups are shown in groups= by id:

$ id user
uid=1000(user) gid=1000(user) groups=1000(user),999(qubes),126(docker)

Add service to supplementary group inet:

SupplementaryGroups=inet

Groups from the system’s group database are left untouched and SupplementaryGroups are appended.

See systemd.exec(5) → SupplementaryGroups=

Exec{Start,Stop}{,Pre,Post}

Prefixes + and ! can be used to execute commands with elevated privileges. With +, the command runs with full privileges (without User/Group/etc. being applied) and without filesystem access restrictions (ProtectHome/ReadOnlyPaths/etc.) being applied.

Call mkdir /etc/directory/ as root and with /etc/ being writable:

ExecStartPre=+mkdir /etc/directory
ExecStart=serviced --foreground
ReadOnlyPaths=/etc/
User=serviced

Use ! to only revert the effects of User, Group and SupplementaryGroups.

These prefixes can be used with ExecStart, ExecStartPre, ExecStartPost, ExecStop and ExecStopPost.

See systemd.service(5) → ExecStart=

Audit Seccomp Violations

SystemCallFilter and other directives employ seccomp(2) filters and terminate processes that violate the filter. You can use auditd to diagnose filter violations.

  1. Install auditd:

    apt install auditd
    
  2. Try to start the service.

  3. Check exit status:

    $ systemctl --user status remote-ssh-agent.service
    ● remote-ssh-agent.service - Connect to SSH agent on remote machine.
         Loaded: loaded (/home/user/.config/systemd/user/remote-ssh-agent.service; enabled; vendor preset: enabled)
         Active: failed (Result: signal) since Sat 2022-01-08 17:12:17 CET; 2min 41s ago
        Process: 41342 ExecStartPre=rm /var/run/user/1000/remote-ssh-agent.socket (code=exited, status=0/SUCCESS)
        Process: 41343 ExecStart=/usr/bin/ncat -k -l -U /var/run/user/1000/remote-ssh-agent.socket -c qrexec-client-vm svc-ssh-agent-git qubes.SshAgent (code=killed, signal=SYS)
       Main PID: 41343 (code=killed, signal=SYS)
            CPU: 12ms
    
    Jan 08 17:12:17 dev systemd[822]: Starting Connect to SSH agent on remote machine....
    Jan 08 17:12:17 dev systemd[822]: Started Connect to SSH agent on remote machine..
    Jan 08 17:12:17 dev systemd[822]: remote-ssh-agent.service: Main process exited, code=killed, status=31/SYS
    Jan 08 17:12:17 dev systemd[822]: remote-ssh-agent.service: Failed with result 'signal'.

    Processes that violate the seccomp policy are terminated with signal SIGSYS.

  4. Find recent (i.e. last 10 minutes) audit logs with a message type SECCOMP:

    $ ausearch -i -m SECCOMP -ts recent
    ---
    type=SECCOMP msg=audit(01/08/2022 17:12:17.214:96) : auid=user uid=user gid=user ses=1 subj==unconfined pid=41343 comm=ncat exe=/usr/bin/ncat sig=SIGSYS arch=x86_64 syscall=socket compat=0 ip=0x7b9e06e59477 code=kill

    This log entry indicates that the process was terminated with SIGSYS because the syscall socket was denied.

  5. Fix the issue:

    If the call is in fact needed, allow it. An alternative, in some cases, is to disable the features of the service that require the syscall.

    You can allow the syscall explicitly:

    SystemCallFilter=socket
    

    Alternatively, you can allow a group that contains the socket syscall:

    SystemCallFilter=@network-io
    

    See SystemCallFilter for more details.
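
When a service trips over several syscalls, reading the audit records one by one gets tedious. The records from step 4 can be condensed into a per-program summary. A minimal sketch, assuming the interpreted (-i) record format shown above with comm= and syscall= fields; denied_syscalls is a hypothetical helper name.

```shell
# Hypothetical helper: summarise SECCOMP audit records read from stdin
# as "<count> <comm> <syscall>", most frequent first.
denied_syscalls() {
    awk '/type=SECCOMP/ {
        comm = sys = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^comm=/)    comm = substr($i, 6)   # text after "comm="
            if ($i ~ /^syscall=/) sys  = substr($i, 9)   # text after "syscall="
        }
        if (sys != "") count[comm " " sys]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}

# Usage on a host with auditd:
#   ausearch -i -m SECCOMP -ts recent | denied_syscalls
```

Each syscall in the summary can then be allowed individually or via a group that contains it.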

More Examples

Apache2

Apache2 serving static content only (/etc/systemd/system/apache2.service.d/99-custom.conf):

[Service]
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_CHOWN CAP_SETUID CAP_SETGID CAP_KILL

MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
LockPersonality=yes
ProtectClock=yes

ProtectSystem=strict
ReadWritePaths=/var/log/apache2/
ReadWritePaths=/var/run

ProtectHome=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
RemoveIPC=yes
RestrictAddressFamilies=AF_INET AF_INET6
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@keyring

Cyrus IMAP

/etc/systemd/system/cyrus-imapd.service.d/50-custom.conf:

[Service]
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
LockPersonality=yes
ProtectHome=yes
RemoveIPC=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX AF_NETLINK
RestrictNamespaces=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
ProtectKernelTunables=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=@system-service

Exim4

Harden exim4 configured for delivery only: roles/common/templates/exim4.conf

Postfix

/etc/systemd/system/postfix@.service.d/50-custom.conf:

[Service]
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
LockPersonality=yes
ProtectHome=yes
RemoveIPC=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX AF_NETLINK
RestrictNamespaces=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
ProtectKernelTunables=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=@system-service chroot

Unbound

Recursive DNS resolver: roles/dns_resolver/templates/50-hardening.conf