Systemd Hardening¶
Evaluate Security of Service¶
Overall security scores for all running services:
$ systemd-analyze security
ModemManager.service 5.9 MEDIUM 😐
NetworkManager.service 7.8 EXPOSED 🙁
systemd-journald.service 4.3 OK 🙂
systemd-logind.service 2.6 OK 🙂
…
Detailed security analysis for a single service:
$ systemd-analyze security tor@default.service
NAME DESCRIPTION EXPOSURE
✗ PrivateNetwork= Service has access to the host's network 0.5
✗ User=/DynamicUser= Service runs as root user 0.4
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PC… Service may change UID/GID identities/cap… 0.3
✓ CapabilityBoundingSet=~CAP_SYS_ADMIN Service has no administrator privileges
✓ CapabilityBoundingSet=~CAP_SYS_PTRACE Service has no ptrace() debugging abiliti…
✗ RestrictAddressFamilies=~AF_(INET|INET6) Service may allocate Internet sockets 0.3
✓ RestrictNamespaces=~CLONE_NEWUSER Service cannot create user namespaces
✓ RestrictAddressFamilies=~… Service cannot allocate exotic sockets
✓ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|… Service cannot change file ownership/acce…
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|… Service may override UNIX file/IPC permis… 0.2
✓ CapabilityBoundingSet=~CAP_NET_ADMIN Service has no network configuration priv…
✓ CapabilityBoundingSet=~CAP_SYS_MODULE Service cannot load kernel modules
✓ CapabilityBoundingSet=~CAP_SYS_RAWIO Service has no raw I/O access
✓ CapabilityBoundingSet=~CAP_SYS_TIME Service processes cannot change the syste…
✗ DeviceAllow= Service has a device ACL with some specia… 0.1
✗ IPAddressDeny= Service does not define an IP address all… 0.2
✓ KeyringMode= Service doesn't share key material with o…
✓ NoNewPrivileges= Service processes cannot acquire new priv…
…
List of All Options¶
Global Hardening Options¶
Global options are applied to systemd itself and to all processes it starts. A reboot is required for changed settings to take effect.
Example config (/etc/systemd/system.conf.d/99-custom.conf):
[Manager]
SystemCallArchitectures=native
CapabilityBoundingSet¶
Identical to CapabilityBoundingSet for services but applied to systemd itself and all processes it starts.
Danger
Dropping capabilities which are required for the system to boot will leave you with an unbootable system.
Note
As of today, I do not believe that any capability can be dropped easily. There are some capabilities which aren’t usually needed, such as CAP_CHECKPOINT_RESTORE or CAP_PERFMON, but for historical reasons the same privileges can also be obtained via CAP_SYS_ADMIN. This renders dropping them moot.
SystemCallArchitectures¶
Allow native calls only:
SystemCallArchitectures=native
See also service-level SystemCallArchitectures below.
Service Options¶
Options that can be used in the [Service] section of systemd services.
Example, extending the exim4 service with some custom hardening (/etc/systemd/system/exim4.service.d/99-custom.conf):
[Service]
PrivateTmp=yes
ProtectSystem=strict
TemporaryFileSystem=/run/exim4
ReadWritePaths=/var/lib/exim4
ReadWritePaths=/var/log/exim4
ReadWritePaths=/var/spool/exim4
See also More Examples below.
AppArmorProfile¶
Enforce an AppArmor MAC profile for the service.
Enforce profile <profile_name>:
AppArmorProfile=<profile_name>
Capabilities¶
On Linux, super-user privileges are divided into capabilities. Available capabilities are listed in capabilities(7), and systemd-analyze capability lists all capabilities known to systemd.
CapabilityBoundingSet¶
Restrict available capabilities (i.e. restrict super-user privileges).
Drop all capabilities:
CapabilityBoundingSet=
Retain only capabilities CAP_SETGID and CAP_SETUID:
CapabilityBoundingSet=CAP_SETGID CAP_SETUID
Drop only capabilities CAP_SETGID and CAP_SETUID:
CapabilityBoundingSet=~CAP_SETGID CAP_SETUID
See:
systemd.exec(5) → SecureBits= (subtly changes behavior of capabilities)
AmbientCapabilities¶
By default, all capabilities are dropped when running a service as a non-root user. This directive can be used to grant a non-root user limited super-user capabilities.
Grant user backup-daemon capability CAP_DAC_READ_SEARCH:
User=backup-daemon
AmbientCapabilities=CAP_DAC_READ_SEARCH
This should generally be preferred to running a service as root and dropping capabilities via CapabilityBoundingSet because root will still have (write) access to most files as it owns most of them. Also, some services do permission checks based on UID. For instance, Postgres will check the UID/name of the connecting user.
See:
systemd.exec(5) → SecureBits= (subtly changes behavior of capabilities)
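The following drop-in is a minimal sketch of how AmbientCapabilities might be combined with CapabilityBoundingSet and NoNewPrivileges, so that the unprivileged user gets exactly one capability and cannot regain others. The account name backup-daemon is taken from the example above; whether a given service works with this combination is an assumption that has to be tested.

```ini
[Service]
# Unprivileged account the service runs as (must already exist).
User=backup-daemon
# Grant only the capability the service actually needs ...
AmbientCapabilities=CAP_DAC_READ_SEARCH
# ... and bound the set so nothing beyond it can be acquired later.
CapabilityBoundingSet=CAP_DAC_READ_SEARCH
# Ignore setuid/setgid bits and file capabilities on executed binaries.
NoNewPrivileges=yes
```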
MemoryDenyWriteExecute¶
Prevent memory allocations that are writeable and executable at the same time:
MemoryDenyWriteExecute=yes
SystemCallFilter=~memfd_create
It may be possible to circumvent this protection if the service can call memfd_create or can write to a filesystem that is not mounted noexec. To prevent this, make sure that:
The memfd_create syscall is filtered (as shown above).
Write access to all files and directories is denied, or the noexec mount option is set on every filesystem the service can write to. The latter may be achieved via NoExecPaths.
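As a sketch only, the conditions above could be combined in one drop-in like this. The binary path /usr/bin/serviced and the state directory /var/lib/serviced are hypothetical, and the library paths in ExecPaths are an assumption that depends on the distribution and on what the service loads.

```ini
[Service]
MemoryDenyWriteExecute=yes
# Block anonymous in-memory files that could be mapped executable.
SystemCallFilter=~memfd_create
# Read-only file system, with one writable state directory.
ProtectSystem=strict
ReadWritePaths=/var/lib/serviced
PrivateTmp=yes
# Mount everything noexec except the service binary and its libraries.
# The leading "-" ignores a path that does not exist on this system.
NoExecPaths=/
ExecPaths=/usr/bin/serviced /usr/lib -/usr/lib64
```

After applying such a drop-in, check that the service still starts; a missing library directory in ExecPaths typically shows up as an immediate exec failure.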
LockPersonality¶
Disable emulation of different behaviors to support non-Linux-native binaries.
Lock personality:
LockPersonality=yes
NoNewPrivileges¶
Prevent the service from escalating privileges:
NoNewPrivileges=yes
In particular, the service process and all its children will ignore the setuid and setgid bits used by programs like su and sudo to gain privileges.
Note about systemd socket:
Services that are able to run services via systemd (e.g. via systemd-run) may be able to get around this restriction.
Devices¶
DeviceAllow¶
Allow access to devices /dev/loop-control and /dev/loop[0-9]:
DeviceAllow=/dev/loop-control
DeviceAllow=block-loop
Allow read-only access to /dev/sda (the access mode is separated from the device by a space):
DeviceAllow=/dev/sda r
Use PrivateDevices when only the default set of pseudo-devices like /dev/null, /dev/zero and /dev/urandom is needed.
By default, access to common pseudo-devices like /dev/null or /dev/urandom is always granted. This behavior can be changed using systemd.resource-control(5) → DevicePolicy=.
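To illustrate the DevicePolicy= setting mentioned above, here is a small sketch that removes the implicit pseudo-device allowance and lists every permitted device explicitly. The selection of devices is an assumption; a real service may need more (for example /dev/zero or a terminal).

```ini
[Service]
# strict: no device is accessible unless listed via DeviceAllow=.
DevicePolicy=strict
# Device node, then a space, then any combination of r, w and m.
DeviceAllow=/dev/null rw
DeviceAllow=/dev/urandom r
```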
PrivateDevices¶
Only provide a minimal set of devices like /dev/null, /dev/zero or /dev/urandom to the service. Systemd will also take other measures to prevent device creation and access.
Enable private devices:
PrivateDevices=yes
PrivateIPC¶
Create a private IPC namespace for the service:
PrivateIPC=yes
Multiple services can be made to share their IPC namespace using JoinsNamespaceOf.
See systemd.exec(5) → PrivateIPC=
Availability: systemd 248
RemoveIPC¶
Remove IPC objects when service is stopped:
User=exampled
RemoveIPC=yes
Remove all System V and POSIX IPC objects owned by the user (and not the service) when the service is stopped.
See systemd.exec(5) → RemoveIPC=
Availability: systemd 248
/proc/ and /sys/ Filesystem¶
ProcSubset¶
Only allow access to PID information in /proc (i.e. /proc/<pid>/):
ProcSubset=pid
ProtectProc¶
Control access to processes in /proc.
Deny access to other users' processes:
ProtectProc=noaccess
Hide other users' processes:
ProtectProc=invisible
Hide non-ptraceable processes:
ProtectProc=ptraceable
You should usually prefer invisible over noaccess as many services do not handle being denied access well.
This directive corresponds to the hidepid= mount option of proc. See proc(5)#Mount_options
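ProtectProc and ProcSubset complement each other, so they are often set together. The sketch below assumes a service running as the hypothetical unprivileged user serviced that does not need to read system-wide files such as /proc/meminfo or /proc/cpuinfo.

```ini
[Service]
User=serviced
# Hide processes that belong to other users.
ProtectProc=invisible
# Expose only the per-process directories, not the rest of /proc.
ProcSubset=pid
```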
ProtectKernelTunables¶
Protect kernel variables accessible in /proc, /sys or via sysctl(8)/sysctl.conf(5):
ProtectKernelTunables=yes
ProtectClock¶
Prevent service from manipulating clock:
ProtectClock=yes
ProtectControlGroups¶
Prevent modifications to the cgroup hierarchies by the service:
ProtectControlGroups=yes
Filesystem Access¶
NoExecPaths¶
Only allow execution of /usr/bin/serviced:
NoExecPaths=/
ExecPaths=/usr/bin/serviced
This, in combination with MemoryDenyWriteExecute, may be used to make arbitrary code execution harder.
See systemd.exec(5) → NoExecPaths=
Availability: systemd 248
PrivateTmp¶
Create private, empty /tmp/ and /var/tmp/ for the service:
PrivateTmp=yes
Multiple services can be made to share their /tmp/ and /var/tmp/ using JoinsNamespaceOf.
Temporary files are cleaned when the service is stopped.
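A minimal sketch of sharing a private /tmp/ between two units via JoinsNamespaceOf. The unit names serviced.service and worker.service are hypothetical. Note that JoinsNamespaceOf belongs in the [Unit] section and that both units need PrivateTmp=yes for the sharing to take effect.

```ini
# Drop-in for the hypothetical worker.service
[Unit]
# Join the namespaces of serviced.service instead of creating new ones.
JoinsNamespaceOf=serviced.service

[Service]
# Must also be enabled in serviced.service.
PrivateTmp=yes
```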
ProtectHome¶
Restrict access to /home/, /root, /run/user for a service.
Make /home/ inaccessible:
ProtectHome=yes
Make /home/ read-only:
ProtectHome=read-only
ReadWritePaths may be used to lift read-only restriction on subdirectories.
Replace /home/ with an empty, read-only directory:
ProtectHome=tmpfs
See:
ExecStart (full write access in ExecStart=, ExecStartPre=, etc.)
InaccessiblePaths¶
Make directory/files at /etc/hidden, /hidden/ and /home/ inaccessible:
InaccessiblePaths=/etc/hidden /hidden/
InaccessiblePaths=/home/
See:
ExecStart (full write access in ExecStart=, ExecStartPre=, etc.)
ReadOnlyPaths¶
Make directory/files at /etc/hidden, /hidden/ and /home/ read-only:
ReadOnlyPaths=/etc/hidden /hidden/
ReadOnlyPaths=/home/
See:
systemd.exec(5) → ReadOnlyPaths=
ExecStart (full write access in ExecStart=, ExecStartPre=, etc.)
ReadWritePaths¶
Make directory/files at /etc/hidden, /hidden/ and /home/ readable/writable:
ReadWritePaths=/etc/hidden /hidden/
ReadWritePaths=/home/
Directories otherwise read-only or inaccessible due to the use of ProtectHome or ProtectSystem may be made readable/writable.
Subdirectories or files specified in ReadOnlyPaths may be made writable. However, this does not extend to InaccessiblePaths.
See:
ExecStart (full write access in ExecStart=, ExecStartPre=, etc.)
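The nesting rules described above can be sketched as follows. All paths under /home/serviced are hypothetical and only serve to show which restriction wins.

```ini
[Service]
ProtectSystem=strict
ProtectHome=read-only
# Lifts the read-only restriction for this one subdirectory.
ReadWritePaths=/home/serviced/spool
# Stays inaccessible; ReadWritePaths= cannot re-open paths below it.
InaccessiblePaths=/home/serviced/.ssh
```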
RestrictFileSystems¶
Only allow opening files on a ext4 or tmpfs filesystem:
RestrictFileSystems=ext4 tmpfs
Only deny access to network filesystems:
RestrictFileSystems=~@network
Obtain a list of all known filesystems and groups:
$ systemd-analyze filesystems
See systemd.exec(5) → RestrictFileSystems=
Availability: systemd 250
Note
Not available in Debian 12 “bookworm” because BPF_FRAMEWORK is disabled. Check systemctl --version.
ProtectSystem¶
Mount /usr/, /boot/ and /efi/ read-only:
ProtectSystem=yes
Additionally mount /etc/ read-only:
ProtectSystem=full
Mount everything read-only except /dev/, /proc/ and /sys/:
ProtectSystem=strict
Use ReadWritePaths to allow write access to specific files or directories.
See
ExecStart (full write access in ExecStart=, ExecStartPre=, etc.)
ProtectHostname¶
Prevent service from manipulating hostname (UTS namespace):
ProtectHostname=yes
ProtectKernelLogs¶
Deny service access to kernel logs (e.g. via dmesg(1)):
ProtectKernelLogs=yes
ProtectKernelModules¶
Prevent loading of kernel modules by service:
ProtectKernelModules=yes
TemporaryFileSystem¶
Place an empty tmpfs filesystem at /path/directory:
TemporaryFileSystem=/path/directory
The same but make the directory read-only:
TemporaryFileSystem=/path/directory:ro
This is often useful when a service can’t deal with a directory being read-only or inaccessible but is fine with it being empty.
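One pattern described in systemd.exec(5) is to hide a whole directory tree behind an empty tmpfs and then bind-mount only the parts the service needs back into it. The sketch below assumes the service only needs /var/lib/serviced, a hypothetical path.

```ini
[Service]
# Replace /var with an empty, read-only tmpfs ...
TemporaryFileSystem=/var:ro
# ... and make a single subdirectory visible again, read-only.
BindReadOnlyPaths=/var/lib/serviced
```

BindPaths= can be used instead of BindReadOnlyPaths= when the service needs write access to the directory.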
Networking¶
PrivateNetwork¶
Create a private network namespace with only a private loopback interface:
PrivateNetwork=yes
Multiple services can be made to share their network namespace using JoinsNamespaceOf. Restricting access to the (global) loopback interface, or any other interface, can be done using RestrictNetworkInterfaces.
See systemd.exec(5) → PrivateNetwork=
Availability: systemd 250
RestrictAddressFamilies¶
Restrict socket access to the IPv4, IPv6 and Unix socket families respectively:
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
Allow no address family:
RestrictAddressFamilies=none
The special value none is only supported starting with systemd 249.
Often required:
Family | Reason
---|---
AF_NETLINK | Enumerating network interfaces, for instance, to be able to bind to specific interfaces.
AF_UNIX | Logging via syslog(3).
See
List of socket types in socket(2)
RestrictNetworkInterfaces¶
Restrict access to the loopback (lo) interface:
RestrictNetworkInterfaces=lo
Deny access to interface eth0 only:
RestrictNetworkInterfaces=~eth0
When no network access is needed use PrivateNetwork.
See systemd.resource-control(5) → RestrictNetworkInterfaces=
Availability: systemd 250
Note
Not available in Debian 12 “bookworm” because BPF_FRAMEWORK is disabled. Check systemctl --version.
SocketBindAllow¶
Important
Without also specifying SocketBindDeny=any, the service may bind to all ports.
Allow service to bind to TCP ports 80 and 443 only:
SocketBindAllow=tcp:80
SocketBindAllow=tcp:443
SocketBindDeny=any
Omit protocol to allow TCP and UDP:
SocketBindAllow=80
SocketBindAllow=443
SocketBindDeny=any
Port ranges, like 1200-1300, are accepted too.
Allow an unprivileged service to bind to TCP ports 80 and 443 only:
AmbientCapabilities=CAP_NET_BIND_SERVICE
SocketBindAllow=tcp:80
SocketBindAllow=tcp:443
SocketBindDeny=any
User=www-data
Capability CAP_NET_BIND_SERVICE is required to bind to any port lower than 1024. SocketBindAllow can be used to restrict this privilege to certain ports.
See systemd.resource-control(5) → SocketBindAllow=
Availability: systemd 249
IPAddressAllow¶
Important
Without also specifying IPAddressDeny=any, the service will be allowed to connect to any address.
Only allow connecting to CIDR networks 10.0.0.0/8 and fc00::/7:
IPAddressAllow=10.0.0.0/8 fc00::/7
IPAddressDeny=any
The value localhost can be used to restrict access to 127.0.0.1 and ::1. If you wish to restrict access to localhost only, consider using RestrictNetworkInterfaces=lo in addition.
See systemd.resource-control(5) → IPAddressAllow=
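A sketch of the localhost-only setup described above, combining the address filter with an interface restriction. It assumes a systemd build with BPF support for RestrictNetworkInterfaces.

```ini
[Service]
# Permit traffic to and from 127.0.0.0/8 and ::1 only.
IPAddressAllow=localhost
IPAddressDeny=any
# Additionally restrict the service to the loopback interface.
RestrictNetworkInterfaces=lo
```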
Note
Not available in Debian 12 “bookworm” because BPF_FRAMEWORK is disabled. Check systemctl --version.
IPAddressDeny¶
Deny access to CIDR networks 10.0.0.0/8 and fc00::/7:
IPAddressDeny=10.0.0.0/8 fc00::/7
RestrictNamespaces¶
Deny any namespace change:
RestrictNamespaces=yes
Only allow access to namespaces ipc and net:
RestrictNamespaces=ipc net
Only deny access to namespaces ipc and net:
RestrictNamespaces=~ipc net
RestrictRealtime¶
Deny access to any realtime scheduling functionality:
RestrictRealtime=yes
RestrictSUIDSGID¶
Prevent setting of SUID and SGID bits for file permissions:
RestrictSUIDSGID=yes
See
Details about SUID/SGID (AKA set-user-ID/set-group-ID) in execve(2)
System Call Filtering (seccomp)¶
SystemCallArchitectures¶
Allow native calls only:
SystemCallArchitectures=native
Disable ABI for non-native system calls. Namely, this disables support for x86 binaries on x86_64.
SystemCallFilter¶
Allow only syscalls in group @system-service:
SystemCallFilter=@system-service
Allow syscalls in group @system-service and syscall seccomp except those in group @chown:
SystemCallFilter=@system-service seccomp
SystemCallFilter=~@chown
Deny syscalls in group @chown with error EPERM rather than terminating the process:
SystemCallFilter=~@chown:EPERM
Many services can deal with an EPERM, and other error codes, for certain calls only used for optional functionality.
A list of all known syscalls and groups can be obtained like this:
systemd-analyze syscall-filter
Rather than killing the process, systemd can also be instructed to return an error code like EPERM for all violations:
SystemCallErrorNumber=EPERM
Services using SystemCallFilter should also use SystemCallArchitectures=native.
See
systemd.exec(5) → SystemCallFilter= (includes a list of important syscall groups)
errno(3) (available error codes)
Audit Seccomp Violations for debugging filters.
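Putting the pieces of this section together, a commonly seen starting point might look like the sketch below. The choice of the @privileged and @resources groups is an assumption; check with systemd-analyze syscall-filter which calls they contain and whether the service needs any of them.

```ini
[Service]
# Without this, filters could be bypassed via a non-native ABI.
SystemCallArchitectures=native
# Start from the allow-list for typical system services ...
SystemCallFilter=@system-service
# ... then remove groups the service is not expected to need.
SystemCallFilter=~@privileged @resources
# Return EPERM on violations instead of killing the process with SIGSYS.
SystemCallErrorNumber=EPERM
```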
UMask¶
Create files and directories that are only accessible by the user/owner if permissions are not explicitly set during creation:
UMask=0077
Allow user and group only:
UMask=0007
User / Group¶
DynamicUser¶
Dynamically allocate a Unix user that the service runs as:
DynamicUser=yes
This is not suitable for services that write persistent data to disk or have to read private data. This is because the UID/GID will be unpredictable and may be shared (though not at the same time) with other services.
Read systemd.exec(5) → DynamicUser= before use.
See also ExecStart (run ExecStart=, ExecStartPre=, etc. with full privileges)
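If a service with a dynamic user does need a small amount of persistent state, systemd.exec(5) documents StateDirectory= and related directives, which let systemd create the directory and adjust its ownership on each start. The following is a sketch under that assumption; the directory name serviced is hypothetical.

```ini
[Service]
DynamicUser=yes
# Creates /var/lib/serviced for the service; ownership is fixed up
# on every start to match the dynamically allocated UID/GID.
StateDirectory=serviced
# Same mechanism for /var/cache/serviced.
CacheDirectory=serviced
```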
PrivateUsers¶
Run service in a private user namespace:
PrivateUsers=yes
User¶
Run process as user serviced:
User=serviced
The primary group is taken from the passwd database unless specified via Group, and supplementary groups are taken from the group database.
See:
ExecStart (run ExecStart=, ExecStartPre=, etc. with full privileges)
Group¶
Set the service's group to serviced:
Group=serviced
See:
ExecStart (run ExecStart=, ExecStartPre=, etc. with full privileges)
SupplementaryGroups¶
On Unix, any process belongs to a user (UID) and group (GID) but it may also belong to additional/supplementary groups. Such supplementary groups are shown in groups= by id:
$ id user
uid=1000(user) gid=1000(user) groups=1000(user),999(qubes),126(docker)
Add service to supplementary group inet:
SupplementaryGroups=inet
Groups from the system’s group database are left untouched and SupplementaryGroups are appended.
Exec{Start,Stop}{,Pre,Post}¶
Prefixes + and ! can be used to execute commands with full privileges (without User/Group/etc. being applied) and without filesystem access restrictions being applied (ProtectHome/ReadOnlyPaths/etc.).
Call mkdir /etc/directory/ as root and with /etc/ being writable:
ExecStartPre=+mkdir /etc/directory
ExecStart=serviced --foreground
ReadOnlyPaths=/etc/
User=serviced
Use ! to only revert the effects of User, Group and SupplementaryGroups.
These prefixes can be used with ExecStart, ExecStartPre, ExecStartPost, ExecStop and ExecStopPost.
Audit Seccomp Violations¶
SystemCallFilter and other directives employ seccomp(2) filters and terminate processes that violate the filter. You can use auditd to diagnose filter violations.
Install auditd:
apt install auditd
Try to start the service.
Check exit status:
$ systemctl --user status remote-ssh-agent.service
● remote-ssh-agent.service - Connect to SSH agent on remote machine.
Loaded: loaded (/home/user/.config/systemd/user/remote-ssh-agent.service; enabled; vendor preset: enabled)
Active: failed (Result: signal) since Sat 2022-01-08 17:12:17 CET; 2min 41s ago
Process: 41342 ExecStartPre=rm /var/run/user/1000/remote-ssh-agent.socket (code=exited, status=0/SUCCESS)
Process: 41343 ExecStart=/usr/bin/ncat -k -l -U /var/run/user/1000/remote-ssh-agent.socket -c qrexec-client-vm svc-ssh-agent-git qubes.SshAgent (code=killed, signal=SYS)
Main PID: 41343 (code=killed, signal=SYS)
CPU: 12ms

Jan 08 17:12:17 dev systemd[822]: Starting Connect to SSH agent on remote machine....
Jan 08 17:12:17 dev systemd[822]: Started Connect to SSH agent on remote machine..
Jan 08 17:12:17 dev systemd[822]: remote-ssh-agent.service: Main process exited, code=killed, status=31/SYS
Jan 08 17:12:17 dev systemd[822]: remote-ssh-agent.service: Failed with result 'signal'.
Processes that violate the seccomp policy are terminated with signal SIGSYS.
Find recent (i.e. last 10 minutes) audit logs with a message type SECCOMP:
$ ausearch -i -m SECCOMP -ts recent
---
type=SECCOMP msg=audit(01/08/2022 17:12:17.214:96) : auid=user uid=user gid=user ses=1 subj==unconfined pid=41343 comm=ncat exe=/usr/bin/ncat sig=SIGSYS arch=x86_64 syscall=socket compat=0 ip=0x7b9e06e59477 code=kill
These logs indicate that the process was terminated with SIGSYS because the syscall socket was denied.
Fix the issue:
If the call is in fact needed, allow it. An alternative, in some cases, is to disable the features of the service that require the syscall.
You can allow the syscall explicitly:
SystemCallFilter=socket
Alternatively, you can allow a group that contains the socket syscall:
SystemCallFilter=@network-io
See SystemCallFilter for more details.
More Examples¶
Apache2¶
Apache2 serving static content only (/etc/systemd/system/apache2.service.d/99-custom.conf):
[Service]
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_CHOWN CAP_SETUID CAP_SETGID CAP_KILL
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
LockPersonality=yes
ProtectClock=yes
ProtectSystem=strict
ReadWritePaths=/var/log/apache2/
ReadWritePaths=/var/run
ProtectHome=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
RemoveIPC=yes
RestrictAddressFamilies=AF_INET AF_INET6
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@keyring
Cyrus IMAP¶
/etc/systemd/system/cyrus-imapd.service.d/50-custom.conf:
[Service]
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
LockPersonality=yes
ProtectHome=yes
RemoveIPC=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX AF_NETLINK
RestrictNamespaces=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
ProtectKernelTunables=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=@system-service
Exim4¶
Harden exim4 configured for delivery only: roles/common/templates/exim4.conf
Postfix¶
/etc/systemd/system/postfix@.service.d/50-custom.conf:
[Service]
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
LockPersonality=yes
ProtectHome=yes
RemoveIPC=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX AF_NETLINK
RestrictNamespaces=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
ProtectKernelTunables=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=@system-service chroot
Tor¶
Tor relay / onion service: roles/tor_server/templates/51-ansible-hardening.conf
Unbound¶
Recursive DNS resolver: roles/dns_resolver/templates/50-hardening.conf