Performance tuning
Rudder provides many options to tune its configuration to support either very large or very small systems. This part describes the different options and their impact, on Disk usage, Memory and CPU.
Reports retention
Reports are sent by default from the nodes to the Rudder Server at each run - on large installation, or on systems with little available disk space, this may be problematic. There are two strategies, that can be combined, to lower the requirement on disk usage.
First, you can lower the retention duration for reports from nodes, by setting in file
/opt/rudder/etc/rudder-web.properties
the options:
-
rudder.batch.reportscleaner.archive.TTL=0
-
rudder.batch.reportscleaner.delete.TTL=3
Second possibility is to make the agents send less reports to the Rudder server, by switching to Non compliance only
compliance reporting mode. In this mode, only reports of changes or errors will be sent to the Rudder server. his mode saves a lot of log space and bandwidth, but leads to some assumptions about actual configuration status in reporting.
This can be configured in the Settings/General page, section Compliance reporting mode
Finally, compliance levels are also historized over time; disabling their saving also save disk space:
/opt/rudder/etc/rudder-web.properties
:
-
rudder.batch.reportscleaner.compliancelevels.delete.TTL=0
Jetty
The Jetty application server is the service that runs Rudder web application and inventory endpoint. It uses the Java runtime environment (JRE).
The default settings fit the basic recommendations for standard Rudder hardware requirements, but there are some configuration switches that you might need to tune to obtain better performance with Rudder, or correct e.g. timezone issues.
To look at the available optimization knobs, please take a look at /etc/default/rudder-jetty
on your Rudder server.
Java "Out Of Memory Error"
It may happen that you get java.lang.OutOfMemoryError
.
They can be of several types,
but the most common is: java.lang.OutOfMemoryError: Java heap space
.
This error means that the web application needs more RAM than what was given. It may be linked to a bug where some process consumed much more memory than needed, but most of the time, it simply means that your system has grown and needs more memory.
You can follow the configuration steps described in the following paragraph.
Configure RAM allocated to Jetty
To change the RAM given to Jetty, you have to edit /etc/default/rudder-jetty to modify JAVA_XMX value:
# modify JAVA_XMX to set the value to your need. # The value is given in MB by default, but you can also use the "G" unit to specify a size in GB. JAVA_XMX=2G # save your changes, and restart Jetty: systemctl restart rudder-jetty
This file is alike to /opt/rudder/etc/rudder-jetty.conf, which is the file with default values. /opt/rudder/etc/rudder-jetty.conf should never be modified directly because the modification would be erased by the next Rudder upgrade. |
On standard installations (less than 1000 nodes), the amount of memory should be the half of the RAM of the server, rounded up to the nearest GB. For example, if the server has 5GB of RAM, 3GB should be allocated to Jetty.
On installations with more than 1000 nodes, two-third (rounded up to the nearest GB) should be allocated to Jetty. For example, if the server has 16GB of RAM, 11GB should be allocated to Jetty
Garbage collector
When RAM allocated to jetty reaches 4 to 6GB (or higher), you may experience long freeze of Rudder, up to several tens of seconds. If this is the case, you can change the JVM garbage collector to one better fitted for larger memory footprint by editing /etc/default/rudder-jetty:
# Uncomment the lines related to "Java Garbage collector option" JAVA_GC="-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=500 -XX:+UseStringDeduplication" # save your changes, and restart Jetty: systemctl restart rudder-jetty
This option is also a good solution if you are constrained by the amount of memory available.
The String deduplication option of G1GC
saves between one fifth to one fourth of memory
used by rudder-jetty.
Configure Jetty for very small instances
On very small instances, you may be both memory and CPU constrained. With less than 20 nodes, the following options in /etc/default/rudder-jetty should fit on small systems
JAVA_XMX=2G JAVA_GC="-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=500 -XX:+UseStringDeduplication" JETTY_START_TIMEOUT=360
Make better use of resources
Most operations can be parallelized within Rudder, to take most advantage of the available resources. These parameters can be set through the setting API Most notable settings, with recommended values are:
Name | Default value | Recommended value for large instance | Recommended value for very small instance |
---|---|---|---|
rudder_generation_max_parallelism |
x0.5 |
x0.5 |
1 |
rudder_compute_dyngroups_max_parallelism |
1 |
4 |
1 |
rudder_generation_delay |
"0 seconds" |
5 seconds |
10 seconds |
rudder_report_protocol_default |
HTTPS |
HTTPS |
HTTPS |
reporting_mode |
full-compliance |
|
changes-only |
rudder_compute_changes |
true |
true |
false |
rudder_save_db_compliance_levels |
true |
false |
false |
Note: x0.5 means half the number of available CPUs
LDAP connection pool configuration
By default, there are 2 availables connections to the internal LDAP in Rudder. On large systems, or systems with a high load, that may not be sufficient. A good heuristic is "Number of threads for dynamic group computation" + "half the number of CPUs available" + 2
This value is set in file /opt/rudder/etc/rudder-web.properties, with the value ldap.maxPoolSize
.
For a large system with 16 CPUs, 4 threads allocated to dynamic groups updates, this would result in
ldap.maxPoolSize=14
Configure PostgreSQL server
The default out-of-the-box configuration of PostgreSQL server is really not adapted for high end or even normal by todays standard servers, as it uses a really small amount of memory.
The location of the PostgreSQL server configuration file is usually:
On a Debian system:
/etc/postgresql/X.Y/main/postgresql.conf
On a SUSE or RHEL/CentOS system:
/var/lib/pgsql/data/postgresql.conf
Suggested values for a setup with more than 3000 nodes
# # Amount of System V shared memory # -------------------------------- # shared_buffers = 256MB # On old versions of PostgreSQL, you may need to set the proper amount of shared memory on the system. # # $ sysctl -w kernel.shmmax=268435456 # # Reference: # http://www.postgresql.org/docs/9.2/interactive/kernel-resources.html#SYSVIPC # # Memory for complex operations # ----------------------------- # # Complex query: temp_buffers = 32MB work_mem = 6MB max_stack_depth = 4MB # Complex maintenance: index, vacuum: maintenance_work_mem = 2GB # Write ahead log # --------------- # # Size of the write ahead log: wal_buffers = 4MB # Number of checkpoints checkpoint_segments = 16 # Query planner # ------------- # # Gives hint to the query planner about the size of disk cache. # # Setting effective_cache_size to 1/2 of remaining memory would be a normal # conservative setting: effective_cache_size = 1024MB
Suggested values on a standard server
shared_buffers = 64MB work_mem = 4MB maintenance_work_mem = 256MB wal_buffers = 1MB checkpoint_segments = 8 effective_cache_size = 128MB
Maximum number of file descriptors
If you manage thousands of nodes with Rudder, you should increase the open file limits as policy generation opens and write a lot of file. If you experience the error
ERROR com.normation.rudder.services.policies.ParallelSequence - Failure in boxToEither: Error when trying to open template template name
it means that you should increase the limit of open files
You can change the system-wide maximum number of file descriptors in /etc/sysctl.conf
if necessary:
fs.file-max = 3247518
Then you have to get the rudder application enough file descriptors. To do so, you have to:
-
Have a high enough hard limit for rudder
-
Set the limit used by rudder and root
These can be set in /etc/security/limits.conf
:
rudder soft nofile 8192 root soft nofile 8192 root hard nofile 8192
You have to restart rudder-jetty for these settings to take effect.
You can check current soft and hard limits by running the following commands as the user you want to check:
ulimit -Sn ulimit -Hn
Apache web server
The Apache web server is used by Rudder as a proxy, to connect to the Jetty application server, and to receive inventories using the WebDAV protocol.
There are tons of documentation about Apache performance tuning available on the Internet, but the defaults should be enough for most setups.
Rsyslog
On very large installation, with many reports sent to the rudder servers, some messages may be lost because the default UDP buffer is too small or because rsyslog doesn’t consume the reports fast enough. If you experience random missing reports, there are several changes that will improve the situation.
Increase the UDP buffer
sysctl -w net.core.rmem_max=26214400 sysctl -w net.core.rmem_default=26214400
Upgrade rsyslog to a more recent version
The version of rsyslog included in some distributions can be have trouble handling more than 1000 reports/seconds; our tests show that versions 8.1901.0 and later are necessay to consumme more than 1400 reports/seconds. List of rsyslog supported repositories can be found here: rsyslog repositories
Increase the number of threads for rsyslog
Edit the file /var/rudder/configuration-repository/techniques/system/distributePolicy/1.0/rudder-rsyslog-root.st
on the rudder server, and replace the lines
$ModLoad imudp $UDPServerRun &SYSLOGPORT&
by
module(load="imudp" threads="2") input(type="imudp" port="&SYSLOGPORT&")
and increase the number of action queue workers
$ActionQueueWorkerThreads 2
by
$ActionQueueWorkerThreads 4
and then commit the change
cd /var/rudder/configuration-repository/techniques git add system/distributePolicy/1.0/rudder-rsyslog-root.st git commit -m "Increase number of threads allocated to rsyslog" rudder server reload-techniques
Using syslog over TCP (not recommended)
If you are using syslog over TCP as reporting protocol - which is not recommended (it is set in Administration → Settings → Protocol), you can experience issues with rsyslog on Rudder policy servers (root or relay) when managing a large number of nodes. This happens because using TCP implies the system has to keep track of the connections. It can lead to reach some limits, especially:
-
max number of open files for the user running rsyslog
-
size of network backlogs
-
size of the conntrack table
You have two options in this situation:
-
Switch to UDP (in Administration → Settings → Protocol). It is less reliable than TCP and you can lose reports in case of networking or load issues, but it will prevent breaking your server, and allow to manage more Nodes. Note that this is the default setting.
-
Stay on TCP. Do this only if you need to be sure you will get all your reports to the server. You will should follow the instructions below to tune your system to handle more connections.
All settings needing to modify /etc/sysctl.conf
require to run sysctl -p
to be applied.
Maximum number of TCP sessions in rsyslog
You may need to increase the maximum number of TCP sessions that rsyslog will accept.
Add to your /etc/rsyslog.conf
:
$ModLoad imtcp # 500 for example, depends on the number of nodes and the agent run frequency $InputTCPMaxSessions 500
Note: You can use MaxSessions
instead of InputTCPMaxSessions
on rsyslog >= 7.
Network backlog
You can also have issues with the network queues (which may for example lead to sending SYN cookies):
-
You can increase the maximum number of connection requests awaiting acknowledgment by changing
net.ipv4.tcp_max_syn_backlog = 4096
(for example, the default is 1024) in/etc/sysctl.conf
. -
You may also have to increase the socket listen() backlog in case of bursts, by changing
net.core.somaxconn = 1024
(for example, default is 128) in/etc/sysctl.conf
.
Conntrack table
You may reach the size of the conntrack table, especially if you have other applications
running on the same server. You can increase its size in /etc/sysctl.conf
,
see the Netfilter FAQ
for details.
Agent
If you are using Rudder on a highly stressed machine, which has especially slow or busy I/O’s, you might experience a sluggish agent run every time the system evaluates the policies.
This is because the agent tries to update its internal databases every time the agent
executes a policy (the .lmdb
files in the /var/rudder/cfengine-community/state directory
),
which even if the database is very light, takes some time if the machine has a very high iowait.
In this case, here is a workaround you can use to restore the agent’s full speed: you can use a RAMdisk to store its states.
You might use this solution either temporarily, to examine a slowness problem, or permanently, to mitigate a known I/O problem on a specific machine. We do not recommend as of now to use this on a whole IT infrastructure.
Be warned, this solution has a drawback: you should backup and restore the content of this directory manually in case of a machine reboot because all the persistent states are stored here, so in case you are using, for example the jobScheduler Technique, you might encounter an unwanted job execution because the agent will have "forgotten" the job state.
Also, note that the mode=0700 is important as agent will refuse to run correctly if the state directory is world readable, with an error like:
error: UNTRUSTED: State directory /var/rudder/cfengine-community (mode 770) was not private!
Here is the command line to use:
# How to mount the RAMdisk manually, for a "one shot" test:
mount -t tmpfs -o size=128M,nr_inodes=2k,mode=0700,noexec,nosuid,noatime,nodiratime tmpfs /var/rudder/cfengine-community/state
# How to put this entry in the fstab, to make the modification permanent
echo "tmpfs /var/rudder/cfengine-community/state tmpfs defaults,size=128M,nr_inodes=2k,mode=0700,noexec,nosuid,noatime,nodiratime 0 0" >> /etc/fstab
mount /var/rudder/cfengine-community/state
← Maintenance procedures Monitoring →