Set Up System Monitoring with the TIG Stack and Kapacitor
One method to monitor an Ocient system is to use the TIG (Telegraf, InfluxDB, and Grafana) Stack. Although these instructions are written specifically for this toolset, you can also use this guide as the foundation for setting up an alternate stack if necessary.
This guide uses InfluxDB, Telegraf, Kapacitor, and Grafana for a complete monitoring solution.
This guide focuses specifically on host-level and software metrics. Monitoring additional components is outside the scope of this document.
These instructions were created using the following versions of each component.
Component | Version |
---|---|
InfluxDB | 1.8 (OSS) |
Telegraf | 1.16.1 |
Kapacitor | 1.5.6 (OSS) |
Grafana | 7.4.3 |
This guide assumes that:
- SSH access and root-level privileges are available on each server running Ocient software.
- InfluxDB, Kapacitor, and Grafana can be installed in a virtual machine or container. Run each of these systems independently. For example, do not run InfluxDB, Kapacitor, and Grafana on the same virtual machine.
- All Ocient software is currently deployed and running as expected.
- The following ports are open for InfluxDB, Kapacitor, Telegraf, and Grafana. InfluxDB is the most critical because every other component communicates with it. Always refer to the latest documentation for each product as the definitive reference.
Component | Default Ports | Components Requiring Access |
---|---|---|
InfluxDB | 8086, 8088 (optional, used for backup utilities) | Telegraf, Kapacitor, Grafana |
Telegraf | None | None |
Kapacitor | 9092 | None |
Grafana | 3000 | None |
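Before installing anything, it can help to confirm that the ports in the table are reachable from the machines that need them. The following is a minimal sketch, not part of the Ocient tooling: the function names and the `DEFAULT_PORTS` mapping are illustrative, and the port numbers are the defaults from the table above.

```python
import socket

# Default ports from the table above. Telegraf has no listening port.
DEFAULT_PORTS = {
    "influxdb": 8086,
    "kapacitor": 9092,
    "grafana": 3000,
}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(hosts: dict) -> dict:
    """Map each component name to whether its default port is reachable.

    `hosts` maps a component name (a key of DEFAULT_PORTS) to a host or IP.
    """
    return {
        name: port_is_open(host, DEFAULT_PORTS[name])
        for name, host in hosts.items()
    }
```

For example, `check_ports({"influxdb": "10.6.0.4"})` run from a node that will host Telegraf reports whether that node can reach InfluxDB on port 8086. A `False` result only shows that the TCP connection failed; it does not say whether a firewall or a stopped service is the cause.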
InfluxDB is a time series database that provides persistence for the host-level and Ocient software metrics.
1. Refer to current instructions from InfluxDB on how to install the service: https://docs.influxdata.com/influxdb/v2.0/install/ https://docs.influxdata.com/influxdb/v2.0/get-started/
2. Document the IP address of the machine where InfluxDB is installed.
3. Ensure that InfluxDB is running properly.
4. Set the InfluxDB service to start on boot.
5. It is highly recommended that critical components of the monitoring infrastructure, including InfluxDB, are also monitored. Monitoring the monitoring stack is outside the scope of this document. Refer to available resources for potential solutions to monitor InfluxDB.
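One way to confirm that InfluxDB is running is to call its `/ping` HTTP endpoint, which answers with status 204 when the service is up. The sketch below assumes the InfluxDB URL is reachable from where the script runs; the function name `influx_is_up` is illustrative.

```python
import urllib.request


def influx_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/ping answers with HTTP 204."""
    url = base_url.rstrip("/") + "/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 204
    except OSError:
        # urllib.error.URLError and HTTPError are subclasses of OSError,
        # as are refused connections and socket timeouts.
        return False
```

For example, `influx_is_up("http://10.6.0.4:8086")` uses the same URL form as the INFLUX_URL placeholder in this guide. This check covers only reachability of the HTTP API, not the health of storage or retention policies.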
Telegraf is used for metrics collection. It runs on each of the Admin, Loader, Foundation (LTS), and SQL Nodes. The following instructions highlight the common configuration elements, followed by the elements that are specific to different node types.
1. Install Telegraf on each of the Admin, Loader, Foundation (LTS), and SQL Nodes. Refer to the Telegraf installation instructions available from InfluxDB. At this point, do not start the Telegraf service or generate the default configuration.
2. For all nodes, create the Telegraf configuration file (typically located at /etc/telegraf/telegraf.conf). Open the file in an editor.
- Copy and paste the following contents into the file.
- Replace the designated placeholders (denoted with a $) with the specific values for the given environment:
- CLUSTER_NAME = Customer selected name of a given cluster for monitoring identification
- ROLE = One of admin, loader, lts, sql, stream_proc depending on the node type
- INFLUX_URL = URL of the InfluxDB instance (e.g., http://10.6.0.4:8086)
Save the file.
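When many nodes need the same file, the placeholder replacement can be scripted. The sketch below uses Python's `string.Template`, whose `$NAME` syntax matches the placeholders described above. The template text is a hypothetical fragment for illustration only and is not the Ocient-provided telegraf.conf; the tag names `cluster` and `role` are assumptions, while `[global_tags]` and `[[outputs.influxdb]]` are standard Telegraf sections.

```python
from string import Template

# Hypothetical fragment for illustration; NOT the Ocient-provided file.
# Replace it with the contents supplied for this step.
TELEGRAF_TEMPLATE = Template("""\
[global_tags]
  cluster = "$CLUSTER_NAME"
  role = "$ROLE"

[[outputs.influxdb]]
  urls = ["$INFLUX_URL"]
  database = "telegraf"
""")

# Role values listed in this guide.
VALID_ROLES = {"admin", "loader", "lts", "sql", "stream_proc"}


def render_telegraf_conf(cluster_name: str, role: str, influx_url: str) -> str:
    """Fill in the $ placeholders, rejecting roles this guide does not list."""
    if role not in VALID_ROLES:
        raise ValueError(f"ROLE must be one of {sorted(VALID_ROLES)}, got {role!r}")
    # substitute() raises KeyError if the template has a placeholder
    # that is not supplied, so no stray $NAME is left behind silently.
    return TELEGRAF_TEMPLATE.substitute(
        CLUSTER_NAME=cluster_name, ROLE=role, INFLUX_URL=influx_url
    )
```

For example, `render_telegraf_conf("demo", "sql", "http://10.6.0.4:8086")` returns the text to write to /etc/telegraf/telegraf.conf on a SQL Node, where `demo` is a made-up cluster name.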
3. For all nodes, create a new file under the telegraf.d directory (typically /etc/telegraf/telegraf.d) named host.conf.
- Copy and paste the following contents into the file.
- Replace the designated placeholders (denoted with a $) with the specific values for the given environment:
- BOOT_DISK = Device name of the boot disk (e.g., sda). You can determine the boot disk by running an lsblk command.
- NET_INT_BOND0_WILDCARD = The wildcard for matching the underlying network interfaces within bond0 (e.g., eno*). If it is unclear, refer to /proc/net/bonding/bond0 and reference the interfaces noted in the Telegraf documentation.
- NET_INT_BOND1_WILDCARD = The wildcard for matching the underlying network interfaces within bond1 (e.g., enp*). If it is unclear, refer to /proc/net/bonding/bond1 and reference the interfaces noted in the Telegraf documentation.
Save the file.
4. For all nodes, create a new file under the telegraf scripts directory (typically under /etc/telegraf/scripts) named uio_pci_generic.sh.
- Copy and paste the following contents into the file.
Create a new file under the telegraf.d directory (typically /etc/telegraf/telegraf.d) named rolehostd.conf.
- Copy and paste the following contents into the file.
- Replace the designated placeholders (denoted with a $) with the specific values for the given environment:
- NODE_BOND0_IP_ADDRESS = IP address associated with bond0
You can also use Filebeat to forward the query.json file to a monitoring platform of your choice.
Create a new file under the telegraf.d directory (typically /etc/telegraf/telegraf.d) named rolehostd.conf.
- Copy and paste the following contents into the file.
- Replace the designated placeholders (denoted with a $) with the specific values for the given environment:
- NODE_BOND0_IP_ADDRESS = IP address associated with bond0
Loader Nodes also run the Loading and Transformation service and require additional Telegraf configuration to capture those metrics.
- Create a new file under the telegraf.d directory (typically /etc/telegraf/telegraf.d) named lat.conf.
- Copy and paste the following contents into the file.
After the configuration is in place on all nodes, start Telegraf.
1. On each node, start the Telegraf service.
2. Ensure the service came up properly by checking the status.
3. Set the Telegraf service to start on boot.
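The three steps above can be sketched as a small helper. This assumes the node uses systemd and that the package installed a unit named `telegraf`; neither is stated in this guide, so adjust for other init systems. The `run` parameter is an illustrative hook that lets the function be exercised without systemd.

```python
import subprocess


def start_and_enable(unit: str, run=subprocess.run) -> bool:
    """Start a systemd unit, confirm it is active, then enable it at boot.

    Returns True when `systemctl is-active` exits with status 0.
    The unit is enabled only if it came up, so a broken configuration
    is not set to start on every boot.
    """
    run(["systemctl", "start", unit], check=True)
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    active = result.returncode == 0
    if active:
        run(["systemctl", "enable", unit], check=True)
    return active
```

Calling `start_and_enable("telegraf")` requires root privileges. If it returns `False`, the service log is the next place to look for configuration errors.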
Kapacitor is used to process time series data, detect anomalies, and trigger alerts. Because preferred alerting mechanisms vary by customer, refer to the Kapacitor documentation for how to send alerts through specific tools.
1. Install Kapacitor using the instructions from InfluxDB. https://docs.influxdata.com/kapacitor/v1.6/introduction/installation/
2. Create the Kapacitor configuration file (typically located at /etc/kapacitor/kapacitor.conf). Open the file in an editor.
- Copy and paste the following contents into the file.
- Replace the designated placeholders (denoted with a $) with the specific values for the given environment:
- KAPACITOR_HOST_OR_IP = Hostname or IP address associated with the Kapacitor instance
- INFLUX_URL = URL of the InfluxDB instance (e.g., http://10.6.0.4:8086)
Ocient provides sample Kapacitor alerts and templates upon request. Contact Ocient Support for details.
Within the bundle, relevant contents are located within the kapacitor directory.
- Copy the files located under kapacitor/load/tasks to the Kapacitor system in /etc/kapacitor/load/tasks.
- Copy the files located under kapacitor/load/templates to the Kapacitor system in /etc/kapacitor/load/templates.
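The two copy steps can be scripted as follows. This is a minimal sketch that assumes the bundle has been unpacked locally and that the script runs with permission to write under /etc/kapacitor; the function name and the `bundle_root` parameter are illustrative.

```python
import shutil
from pathlib import Path


def install_kapacitor_load(bundle_root, kapacitor_etc="/etc/kapacitor"):
    """Copy <bundle_root>/kapacitor/load/{tasks,templates} into <kapacitor_etc>/load/.

    Returns the files present under the two destination directories afterwards.
    """
    present = []
    for sub in ("tasks", "templates"):
        src = Path(bundle_root) / "kapacitor" / "load" / sub
        dst = Path(kapacitor_etc) / "load" / sub
        # dirs_exist_ok (Python 3.8+) lets the copy merge into an existing
        # directory; files with the same name are overwritten.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        present.extend(sorted(p for p in dst.rglob("*") if p.is_file()))
    return present
```

Passing a scratch directory as `kapacitor_etc` is a convenient way to preview what would be copied before touching the real configuration directory.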
When the configuration is complete, Kapacitor must be started.
1. On the Kapacitor system, start the Kapacitor service.
2. Ensure the service came up properly by checking the status.
3. Set the Kapacitor service to start on boot.
Grafana provides the dashboard and visualization capability for metrics. This tool is used instead of Chronograf, which traditionally completes the TICK (Telegraf/InfluxDB/Chronograf/Kapacitor) stack. Grafana is favored over Chronograf due to its versatility beyond the InfluxDB ecosystem, as well as the higher frequency of contributions and updates from the broader community.
Install Grafana. Follow the Grafana instructions to ensure that configuration options for items such as users, permissions, and logging all correspond to the environment. https://grafana.com/docs/grafana/latest/installation/
1. Open Grafana in a web browser.
2. In the Grafana UI, click the Settings tab on the left side of the screen. Select Configuration > Data Sources.
3. Click Add data source.
4. Under the data source list, select InfluxDB. Click the Select button.
5. Enter all of the pertinent configuration details associated with the InfluxDB instance:
- Ensure the data source is marked as the Default.
- Under the HTTP section, the URL should be the same as the INFLUX_URL used in previous sections.
- The query language should be set to the default (InfluxQL).
6. Under the InfluxDB Details section, specify telegraf as the database name.
7. When the configuration appears to be correct, click Save and Test.
8. A green banner should appear that indicates the connection to the data source is working properly. If this message does not display, consult the Grafana documentation or contact Ocient Support.
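The same data source can also be created through the Grafana HTTP API instead of the UI. The sketch below only builds the request for `POST /api/datasources` and does not send it. The data source name, the `proxy` access mode, and the use of an API key are assumptions; the database name `telegraf` and the default flag come from the steps above.

```python
import json
import urllib.request


def influx_datasource_payload(influx_url: str, name: str = "InfluxDB") -> dict:
    """Build the JSON body describing the InfluxDB data source."""
    return {
        "name": name,            # assumed display name
        "type": "influxdb",
        "access": "proxy",       # Grafana server makes the requests (assumption)
        "url": influx_url,       # same value as INFLUX_URL
        "database": "telegraf",
        "isDefault": True,
    }


def build_create_request(grafana_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare, but do not send, the request that creates the data source."""
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. The API key must belong to a Grafana user with permission to manage data sources.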
Ocient provides sample Grafana dashboards upon request. Contact Ocient Support for details.
Within the bundle, relevant contents are located within the grafana directory.
Grafana allows importing only one dashboard file at a time.
- Open Grafana in a web browser.
- In the Grafana UI, click the + ("Create") tab on the left side of the screen. Select the Import option.
- On the Import page, click the Upload JSON File button.
- Select the file representing the dashboard you wish to import.
- On the dashboard summary page, verify the settings are correct. Click Import to complete the process.
- Repeat the import steps above for each dashboard that needs to be added.
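Because the UI imports one file at a time, a directory of dashboards can be prepared for the Grafana HTTP API instead. The sketch below reads every JSON file in a directory and wraps it in the body shape used by `POST /api/dashboards/db`; it does not send anything. The function name is illustrative, and dashboards that were exported with data source input placeholders may still need the UI import so those inputs can be mapped.

```python
import json
from pathlib import Path


def dashboard_payloads(directory):
    """Yield (file name, request body) for every *.json dashboard in `directory`."""
    for path in sorted(Path(directory).glob("*.json")):
        dashboard = json.loads(path.read_text(encoding="utf-8"))
        # A null id asks Grafana to create a new dashboard rather than
        # update one that happens to share the exported numeric id.
        dashboard["id"] = None
        yield path.name, {"dashboard": dashboard, "overwrite": False}
```

Each body can then be posted to the Grafana instance with the same authentication used for other API calls. Keeping `overwrite` set to `False` makes Grafana reject a dashboard that would replace an existing one.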
Filebeat is a trademark of Elasticsearch BV.