Monitoring

This section gives recommendations for:

  • Monitoring Rudder itself (besides standard monitoring)

  • Monitoring the state of your configuration management

Monitoring Rudder itself

Monitoring a Node

Monitoring a Node mainly consists of checking that the Node can communicate with its policy server and that the agent runs regularly.

You can use the 'rudder agent health' command to check for communication errors. It checks the agent configuration and looks for connection errors in the logs of the last run. By default it outputs detailed results, but you can pass the '-n' option to enable "nrpe" mode, which follows the Nagios plugin conventions and also works with other monitoring tools. In this mode, it displays a single-line result and exits with:

  • 0 for a success

  • 1 for a warning

  • 2 for an error

If you are using nrpe, you can put this line in your 'nrpe.cfg' file:

command[check_rudder]=/opt/rudder/bin/rudder agent health -n
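For monitoring tools that do not use nrpe, the exit codes listed above can be mapped to a status in a small wrapper. The following is a minimal Python sketch, not an official Rudder integration: the function names, the "UNKNOWN" label for undocumented exit codes, and the 60-second timeout are assumptions made for illustration. The command path is the one from the 'nrpe.cfg' line above.

```python
import subprocess

# Exit codes documented for 'rudder agent health -n'.
STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING", 2: "ERROR"}

# Command taken from the nrpe.cfg example; adjust if your install path differs.
HEALTH_COMMAND = ("/opt/rudder/bin/rudder", "agent", "health", "-n")


def interpret_exit_code(code):
    """Map an exit code to a status label; undocumented codes become UNKNOWN."""
    return STATUS_BY_EXIT_CODE.get(code, "UNKNOWN")


def run_health_check(command=HEALTH_COMMAND, timeout=60):
    """Run the health check and return (status label, command output).

    The timeout is an arbitrary example value, not a Rudder recommendation.
    """
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return interpret_exit_code(completed.returncode), completed.stdout.strip()
```

Calling run_health_check() on a node returns a pair such as ("OK", "...") that can be forwarded to whichever monitoring tool you use.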

To get the last run time, you can look up the modification date of /var/rudder/cfengine-community/last_successful_inputs_update.
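The modification date can also be checked from a script. The sketch below, in Python, reports a node as stale when the file is missing or older than a threshold. The function names and the 1800-second default are assumptions chosen for illustration; pick a threshold that matches your own agent run interval.

```python
import os
import time

# Path given in the documentation above.
LAST_UPDATE_FILE = "/var/rudder/cfengine-community/last_successful_inputs_update"


def seconds_since_last_update(path=LAST_UPDATE_FILE, now=None):
    """Return the age in seconds of the file's modification time."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_agent_stale(path=LAST_UPDATE_FILE, max_age_seconds=1800, now=None):
    """Return True when the file is missing or older than max_age_seconds.

    max_age_seconds is an arbitrary example value, not a Rudder default.
    """
    try:
        return seconds_since_last_update(path, now) > max_age_seconds
    except FileNotFoundError:
        return True
```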

Monitoring a Server

You can use regular API calls to check that the server is running and has access to its data. For example, you can issue the following command to get the list of currently defined rules:

curl -X GET -H "X-API-Token: yourToken" http://your.rudder.server/rudder/api/latest/rules

You can then check the status code (which should be 200). See the API documentation for more information.
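The same status-code check can be scripted. Here is a minimal Python sketch using only the standard library; the function names and the 10-second timeout are hypothetical, while the endpoint path and the X-API-Token header are the ones from the curl command above.

```python
import urllib.error
import urllib.request


def get_rules_status(base_url, token, timeout=10):
    """Return the HTTP status code of GET <base_url>/rudder/api/latest/rules."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/rudder/api/latest/rules",
        headers={"X-API-Token": token},
        method="GET",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status (for example 401 or 500).
        return error.code


def server_is_healthy(base_url, token, timeout=10):
    """Return True only when the rules endpoint answers with status 200."""
    try:
        return get_rules_status(base_url, token, timeout) == 200
    except OSError:
        # Connection refused, DNS failure, timeout and similar network errors
        # (urllib.error.URLError is a subclass of OSError).
        return False
```

For example, server_is_healthy("http://your.rudder.server", "yourToken") returns False both when the server is unreachable and when it answers with a status other than 200.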

You can also check the webapp logs (in /var/log/rudder/webapp/year_month_day.stderrout.log) for error messages.

Monitoring your configuration management

There are two interesting types of information:

  • Events: all the changes made by the agents on your Nodes

  • Compliance: the current state of your Nodes compared with the expected configuration

Monitor compliance

You can use the Rudder API to get the current compliance state of your infrastructure. It can be used to simply check for configuration errors, or be integrated in other tools.

Here is a very simple example of an API call to check for errors (it exits with 1 when there is an error):

curl -s -H "X-API-Token: yourToken" -X GET 'https://your.rudder.server/rudder/api/latest/compliance/rules' | grep -qv '"status": "error"'

See the API documentation for more information about general API usage, and the compliance API documentation for a list of available calls.
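When integrating with other tools, it can be more robust to decode the JSON response than to match it as text, since the result then does not depend on how the JSON is formatted. The Python sketch below applies the same criterion as the command above (a "status" equal to "error") anywhere in the decoded document. It deliberately assumes nothing about the layout of the compliance response, and the function names are hypothetical.

```python
import json


def contains_error_status(value):
    """Return True if any object nested in value has "status" equal to "error"."""
    if isinstance(value, dict):
        if value.get("status") == "error":
            return True
        return any(contains_error_status(item) for item in value.values())
    if isinstance(value, list):
        return any(contains_error_status(item) for item in value)
    return False


def compliance_has_errors(response_body):
    """Decode a JSON response body and search it for an error status."""
    return contains_error_status(json.loads(response_body))
```

The response body can come from any HTTP client; see the compliance API documentation for the actual fields returned by each call.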

Monitor events

The Web interface gives access to events, but this section shows how to process them automatically. They are available on the root server, in /var/log/rudder/compliance/non-compliant-reports.log. This file contains two types of reports about all the nodes managed by this server:

  • All the modifications made by the agent

  • All the errors that prevented the application of a policy

The lines have the following format:

[%DATE%] N: %NODE_UUID% [%NODE_NAME%] S: [%RESULT%] R: %RULE_UUID% [%RULE_NAME%] D: %DIRECTIVE_UUID% [%DIRECTIVE_NAME%] T: %TECHNIQUE_NAME%/%TECHNIQUE_VERSION% C: [%COMPONENT_NAME%] V: [%KEY%] %MESSAGE%

In particular, the 'RESULT' field contains the type of event (change or error, respectively 'result_repaired' and 'result_error').

You can use the following regex to match the different fields:

^\[(?P<Date>[^\]]+)\] N: (?P<NodeUUID>[^ ]+) \[(?P<NodeFQDN>[^\]]+)\] S: \[(?P<Result>[^\]]+)\] R: (?P<RuleUUID>[^ ]+) \[(?P<RuleName>[^\]]+)\] D: (?P<DirectiveUUID>[^ ]+) \[(?P<DirectiveName>[^\]]+)\] T: (?P<TechniqueName>[^/]+)/(?P<TechniqueVersion>[^ ]+) C: \[(?P<ComponentName>[^\]]+)\] V: \[(?P<ComponentKey>[^\]]+)\] (?P<Message>.+)$
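The regex uses Python-style named groups, so it can be used directly with the 're' module. The sketch below parses a line into a dictionary and counts events per 'RESULT' value. The function names are hypothetical, and the sample line is made up to follow the documented format: its values, including the date layout, are illustrative and not taken from a real log.

```python
import re

# The regular expression above, split across lines for readability.
REPORT_PATTERN = re.compile(
    r"^\[(?P<Date>[^\]]+)\] N: (?P<NodeUUID>[^ ]+) \[(?P<NodeFQDN>[^\]]+)\] "
    r"S: \[(?P<Result>[^\]]+)\] R: (?P<RuleUUID>[^ ]+) \[(?P<RuleName>[^\]]+)\] "
    r"D: (?P<DirectiveUUID>[^ ]+) \[(?P<DirectiveName>[^\]]+)\] "
    r"T: (?P<TechniqueName>[^/]+)/(?P<TechniqueVersion>[^ ]+) "
    r"C: \[(?P<ComponentName>[^\]]+)\] V: \[(?P<ComponentKey>[^\]]+)\] "
    r"(?P<Message>.+)$"
)


def parse_report_line(line):
    """Return the fields of one report line as a dict, or None if it does not match."""
    match = REPORT_PATTERN.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def count_results(lines):
    """Count parsed lines per RESULT value (for example result_repaired)."""
    counts = {}
    for line in lines:
        fields = parse_report_line(line)
        if fields is not None:
            counts[fields["Result"]] = counts.get(fields["Result"], 0) + 1
    return counts


# Made-up sample line in the documented format; every value is illustrative.
SAMPLE_LINE = (
    "[2024-01-15 10:23:45+0000] N: node-uuid-1 [web01.example.com] "
    "S: [result_repaired] R: rule-uuid-1 [Base rule] "
    "D: directive-uuid-1 [NTP settings] T: ntpConfiguration/3.0 "
    "C: [NTP service] V: [None] The ntp service was restarted"
)
```

Passing an open file object to count_results() works as well, since it only iterates over lines; lines that do not match the format are ignored.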

Below is a basic Logstash configuration file for parsing Rudder events. You can then use Kibana to explore the data, and create graphs and dashboards to visualize the changes in your infrastructure.

input {
   file {
      path => "/var/log/rudder/compliance/non-compliant-reports.log"
   }
}

filter {
   grok {
      match => { "message" => "^\[%{DATA:date}\] N: %{DATA:node_uuid} \[%{DATA:node}\] S: \[%{DATA:result}\] R: %{DATA:rule_uuid} \[%{DATA:rule}\] D: %{DATA:directive_uuid} \[%{DATA:directive}\] T: %{DATA:technique}/%{DATA:technique_version} C: \[%{DATA:component}\] V: \[%{DATA:key}\] %{DATA:message}$" }
      # Replace the raw line with the parsed message instead of appending to it
      overwrite => [ "message" ]
   }
   # Replace the space in the date by a "T" to make it parseable by Logstash
   mutate {
      gsub => [ "date", " ", "T" ]
   }
   # Parse the event date
   date {
      match => [ "date" , "ISO8601" ]
   }
   # Remove the date field
   mutate { remove_field => [ "date" ] }
   # Remove the key field if it has the "None" value
   if [key] == "None" {
      mutate { remove_field => [ "key" ] }
   }
}

output {
    stdout { codec => rubydebug }
}
