# Set Up System Monitoring with the TIG Stack and Kapacitor
One method to monitor the {{ocienthyperscaledatawarehouse}} is to use the TIG ({{telegraf}}, {{influxdb}}, and {{grafana}}) stack. While these instructions are written specifically for this toolset, you can also use this guide as the foundation for setting up an alternate stack if necessary. This guide uses InfluxDB, Telegraf, {{kapacitor}}, and Grafana for a complete monitoring solution.

*Monitoring example with Telegraf data collection into InfluxDB, with a Grafana dashboard and Kapacitor alerts*

This guide focuses specifically on host-level and {{ocient}} software metrics. Monitoring additional components (e.g., {{kafka}}) is outside the scope of this document.

These instructions were created using these versions of each component:

| Component | Version |
| --- | --- |
| InfluxDB | 1.8 (OSS) |
| Telegraf | 1.16.1 |
| Kapacitor | 1.5.6 (OSS) |
| Grafana | 7.4.3 |

## Prerequisites

This guide assumes that:

- SSH access and root-level privileges are available on each server running Ocient software.
- InfluxDB, Kapacitor, and Grafana are available to be installed in a virtual machine or container. It is important to note that each of these systems should run independently. For example, do not run InfluxDB, Kapacitor, and Grafana on the same virtual machine.
- All Ocient software is currently deployed and running as expected.
- Logging configuration is set up as specified in Log Monitoring (docid\:goypkivlud77sz0sx_ekq).

The following ports must be open for InfluxDB, Kapacitor, Telegraf, and Grafana. InfluxDB is the most critical, as each component communicates with it. Always refer to the latest documentation for each product as the definitive reference.

| Component | Default Ports | Components Requiring Access |
| --- | --- | --- |
| InfluxDB | 8086; 8088 (optional, used for backup utilities) | Telegraf, Kapacitor, Grafana |
| Telegraf | None | None |
| Kapacitor | 9092 | None |
| Grafana | 3000 | None |

## Step 1: Deploy InfluxDB

InfluxDB is a time-series database that provides persistence for the host-level and Ocient software metrics.

1. Refer to the current instructions from InfluxDB on how to install the service:
   - https://docs.influxdata.com/influxdb/v2.0/install/
   - https://docs.influxdata.com/influxdb/v2.0/get-started/
2. Document the IP address of the machine where InfluxDB is installed.
3. Ensure that InfluxDB is running properly:

```
/> sudo systemctl status influxdb
● influxdb.service - InfluxDB is an open-source, distributed, time series database
   Loaded: loaded (/lib/systemd/system/influxdb.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-09-14 17:17:20 UTC; 3 weeks 2 days ago
 Main PID: 32687 (influxd)
    Tasks: 18 (limit: 4915)
   CGroup: /system.slice/influxdb.service
           └─32687 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
```

4. Set the InfluxDB service to start on boot:

```
/> sudo systemctl enable influxdb
```

5. It is highly recommended that critical components of the monitoring infrastructure, including InfluxDB, are themselves monitored. Monitoring the monitoring stack is outside the scope of this document; refer to available resources on potential solutions for monitoring InfluxDB.
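Before configuring collectors, it can help to confirm that the InfluxDB HTTP API is reachable from the nodes that will run Telegraf. The check below is a minimal sketch that assumes InfluxDB 1.8 listening on its default port; the example IP matches the placeholder value used later in this guide. The `telegraf` database itself is created automatically by the Telegraf InfluxDB output plugin on first write.

```
# InfluxDB 1.x exposes a lightweight health endpoint; an HTTP 204 response
# confirms the service is up and reachable on port 8086.
/> curl -i http://10.6.0.4:8086/ping
HTTP/1.1 204 No Content
```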
## Step 2: Deploy and Configure Telegraf

Telegraf is used for metrics collection. It runs on each of the Admin, Loader, Foundation, and SQL nodes. The following instructions highlight the common configuration elements, followed by the elements that are specific to different node types.

### All Nodes

1. Install Telegraf on each of the Admin, Loader, Foundation, and SQL nodes. Refer to the Telegraf installation instructions available from InfluxDB. At this point, do not start the Telegraf service or generate the default configuration.

2. For all nodes, create the Telegraf configuration file (typically located at `/etc/telegraf/telegraf.conf`). Open the file in an editor, then copy and paste the following contents into the file. Replace the designated placeholders (denoted with a `$`) with the specific values for the specified environment:

   - `$cluster_name` = customer-selected name of a provided cluster, for monitoring identification
   - `$role` = one of `admin`, `loader`, `lts`, `sql`, or `stream_proc`, depending on the node type
   - `$influx_url` = URL of the InfluxDB instance (e.g., `http://10.6.0.4:8086`)

```
; /etc/telegraf/telegraf.conf
[global_tags]
  cluster = "$cluster_name"
  cluster_role = "$role"  # one of admin, loader (aka indexer), lts (aka foundation), sql, stream_proc

[agent]
  interval = "15s"              # time
  metric_batch_size = 1000      # size
  metric_buffer_limit = 100000  # size
  collection_jitter = "3s"      # time
  flush_interval = "7s"         # time
  flush_jitter = "3s"           # time
  round_interval = false

[outputs]

  [[outputs.influxdb]]
    urls = [ "$influx_url" ]
    database = "telegraf"
    precision = "1s"
    timeout = "10s"
```

   Save the file.

3. For all nodes, create a new file under the `telegraf.d` directory (typically `/etc/telegraf/telegraf.d`) named `host.conf`. Copy and paste the following contents into the file. Replace the designated placeholders (denoted with a `$`) with the specific values for the given environment:

   - `$boot_disk` = device name of the boot disk (e.g., `sda`). You can determine the boot disk by running an `lsblk` command.
   - `$net_int_bond0_wildcard` = the wildcard for matching the underlying network interfaces within `bond0` (e.g., `eno*`). If it is unclear, refer to `/proc/net/bonding/bond0` and reference the interfaces noted in the Telegraf documentation.
   - `$net_int_bond1_wildcard` = the wildcard for matching the underlying network interfaces within `bond1` (e.g., `enp*`). If it is unclear, refer to `/proc/net/bonding/bond1` and reference the interfaces noted in the Telegraf documentation.

```
; /etc/telegraf/telegraf.d/host.conf
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  fielddrop = [ "time_*" ]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "vfat"]

[[inputs.diskio]]
  skip_serial_number = true
  devices = ["$boot_disk"]
  [inputs.diskio.tags]
    disk_type = "os"

[[inputs.mem]]

[[inputs.net]]
  interfaces = ["$net_int_bond0_wildcard", "$net_int_bond1_wildcard"]

[[inputs.net]]
  interfaces = ["bond0"]
  [inputs.net.tags]
    primary_interface_10gb = "true"
  [inputs.net.tagdrop]
    interface = [ "all" ]

[[inputs.net]]
  interfaces = ["bond1"]
  [inputs.net.tags]
    primary_interface_100gb = "true"
  [inputs.net.tagdrop]
    interface = [ "all" ]

[[inputs.system]]

[[inputs.swap]]

[[inputs.linux_sysctl_fs]]
```

   Save the file.

4. For all nodes, create a new file under the Telegraf scripts directory (typically under `/etc/telegraf/scripts`) named `uio_pci_generic.sh`. Copy and paste the following contents into the file:

```
#!/bin/bash
# /etc/telegraf/scripts/uio_pci_generic.sh
# A script used to enumerate the PCI devices on the system.
devices=$(find /sys/bus/pci/drivers/uio_pci_generic/0000* | awk -F'/' '{ print $NF }')
for d in $devices; do
    echo pci_devices,device=\"$d\" present=1
done
```

   Save the file.
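You can run the script by hand to confirm that it emits valid InfluxDB line protocol before Telegraf invokes it through `inputs.exec`. This is a quick sanity check; the PCI addresses shown below are hypothetical examples and will differ per system.

```
# Each device bound to the uio_pci_generic driver should produce one line of
# line protocol in the form: pci_devices,device="<address>" present=1
/> bash /etc/telegraf/scripts/uio_pci_generic.sh
pci_devices,device="0000:3b:00.0" present=1
pci_devices,device="0000:3c:00.0" present=1
```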
### Foundation and SQL Nodes

Create a new file under the `telegraf.d` directory (typically `/etc/telegraf/telegraf.d`) named `rolehostd.conf`. Copy and paste the following contents into the file. Replace the designated placeholder (denoted with a `$`) with the specific value for the specified environment:

- `$node_bond0_ip_address` = IP address associated with `bond0`

```
; /etc/telegraf/telegraf.d/rolehostd.conf
[[inputs.http]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/status" ]
  method = "GET"
  timeout = "5s"
  name_override = "rolehostd_status"
  data_format = "value"
  data_type = "string"

[[inputs.http_response]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/status" ]
  method = "GET"
  response_timeout = "5s"
  name_override = "rolehostd_status_response"
  response_string_match = "active"

[[inputs.http]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/stats" ]
  method = "GET"
  timeout = "5s"
  name_override = "rolehostd"
  json_time_key = "time"
  json_time_format = "unix_us"
  json_string_fields = ["units"]
  data_format = "json"
  tagexclude = ["url", "node"]
  tag_keys = [ "name", "socket", "device" ]

[[inputs.http]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/version" ]
  method = "GET"
  timeout = "5s"
  name_override = "rolehostd_version"
  json_query = "{build_type,version}"
  json_string_fields = ["build_type","version"]
  data_format = "json"

[[inputs.http]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/operatorsummary" ]
  method = "GET"
  timeout = "5s"
  name_override = "rolehostd_operatorsummary"
  data_format = "json"

[[inputs.exec]]
  commands = ['sh /etc/telegraf/scripts/uio_pci_generic.sh']
  data_format = "influx"

[[inputs.tail]]
  name_override = "tail_query_json"
  files = ["/var/opt/ocient/log/query.json","/var/opt/ocient/query.json"]
  data_format = "json"
  json_query = "{src,msg.user,msg.database,msg.service_class,msg.client_version,msg.client_ip,msg.timestamp_start,msg.timestamp_execstart,msg.timestamp_optimizationcomplete,msg.timestamp_complete,msg.code,msg.priority,msg.runtime,msg.parallelism,msg.cost_estimate,msg.heuristic_cost,msg.rows_returned,msg.bytes_returned,msg.queue_time,msg.optimization_time,msg.default_schema,msg.major_driver_version,msg.minor_driver_version,msg.total_time,msg.resultset_cached,msg.first_byte_time,msg.bytes_per_second_sent}"
  tag_keys = ["src","user","database","service_class","code","major_driver_version","resultset_cached"]
  json_string_fields = ["client_version","client_ip","timestamp_start","timestamp_execstart","timestamp_complete"]

[[processors.converter]]
  namepass = ["tail_query_json"]
  [processors.converter.fields]
    integer = ["timestamp_start","timestamp_execstart","timestamp_complete"]
```

You can also use Filebeat to forward the `query.json` file to a monitoring platform of your choice.
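Before relying on these inputs, you can spot-check the rolehostd HTTP endpoints directly from the node. This sketch uses only the endpoints referenced in the configuration above; substitute the node's `bond0` IP address, and expect the status endpoint to return a value containing `active` (the string that `inputs.http_response` matches on) when the node is healthy.

```
# Status endpoint: a healthy node should report a status containing "active".
/> curl -s http://$node_bond0_ip_address:9090/v1/status

# Version and stats endpoints should return JSON documents.
/> curl -s http://$node_bond0_ip_address:9090/v1/version
/> curl -s http://$node_bond0_ip_address:9090/v1/stats
```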
### Admin and Loader Nodes

Create a new file under the `telegraf.d` directory (typically `/etc/telegraf/telegraf.d`) named `rolehostd.conf`. Copy and paste the following contents into the file. Replace the designated placeholder (denoted with a `$`) with the specific value for the given environment:

- `$node_bond0_ip_address` = IP address associated with `bond0`

```
; /etc/telegraf/telegraf.d/rolehostd.conf
[[inputs.http]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/status" ]
  method = "GET"
  timeout = "5s"
  name_override = "rolehostd_status"
  data_format = "value"
  data_type = "string"

[[inputs.http_response]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/status" ]
  method = "GET"
  response_timeout = "5s"
  name_override = "rolehostd_status_response"
  response_string_match = "active"

[[inputs.http]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/stats" ]
  method = "GET"
  timeout = "5s"
  name_override = "rolehostd"
  json_time_key = "time"
  json_time_format = "unix_us"
  json_string_fields = ["units"]
  data_format = "json"
  tagexclude = ["url", "node"]
  tag_keys = [ "name", "socket", "device" ]

[[inputs.http]]
  urls = [ "http://$node_bond0_ip_address:9090/v1/version" ]
  method = "GET"
  timeout = "5s"
  name_override = "rolehostd_version"
  json_query = "{build_type,version}"
  json_string_fields = ["build_type","version"]
  data_format = "json"

[[inputs.exec]]
  commands = ['sh /etc/telegraf/scripts/uio_pci_generic.sh']
  data_format = "influx"

[[inputs.tail]]
  name_override = "tail_query_json"
  files = ["/var/opt/ocient/log/query.json","/var/opt/ocient/query.json"]
  data_format = "json"
  json_query = "{src,msg.user,msg.database,msg.service_class,msg.client_version,msg.client_ip,msg.timestamp_start,msg.timestamp_execstart,msg.timestamp_optimizationcomplete,msg.timestamp_complete,msg.code,msg.priority,msg.runtime,msg.parallelism,msg.cost_estimate,msg.heuristic_cost,msg.rows_returned,msg.bytes_returned,msg.queue_time,msg.optimization_time,msg.default_schema,msg.major_driver_version,msg.minor_driver_version,msg.total_time,msg.resultset_cached,msg.first_byte_time,msg.bytes_per_second_sent}"
  tag_keys = ["src","user","database","service_class","code","major_driver_version","resultset_cached"]
  json_string_fields = ["client_version","client_ip","timestamp_start","timestamp_execstart","timestamp_complete"]

[[processors.converter]]
  namepass = ["tail_query_json"]
  [processors.converter.fields]
    integer = ["timestamp_start","timestamp_execstart","timestamp_complete"]
```

### Loader Nodes

Loader nodes also run the Loading and Transformation Service and require additional Telegraf configuration to capture those metrics. Create a new file under the `telegraf.d` directory (typically `/etc/telegraf/telegraf.d`) named `lat.conf`. Copy and paste the following contents into the file:

```
; /etc/telegraf/telegraf.d/lat.conf
[[inputs.jolokia2_agent]]
  urls = ["http://localhost:8080/v2/metrics"]

  [[inputs.jolokia2_agent.metric]]
    name = "pipeline"
    mbean = "lat:type=pipeline"

  [[inputs.jolokia2_agent.metric]]
    name = "partitions"
    mbean = "lat:type=partitions"

[[inputs.procstat]]
  systemd_unit = "lat.service"
```

### Start Telegraf

When the configuration is properly entered on all nodes, Telegraf must be started.

1. On each node, start the Telegraf service:

```
/> sudo systemctl start telegraf
```

2. Ensure the service came up properly by checking the status:

```
/> sudo systemctl status telegraf
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-08-05 22:04:06 UTC; 2 months 3 days ago
     Docs: https://github.com/influxdata/telegraf
 Main PID: 2651775 (telegraf)
    Tasks: 103 (limit: 1589476)
   Memory: 221.2M
   CGroup: /system.slice/telegraf.service
           └─2651775 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
```

3. Set the Telegraf service to start on boot:

```
/> sudo systemctl enable telegraf
```
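Once Telegraf is running, you can verify end to end that metrics are arriving in InfluxDB. The queries below are a minimal sketch that assumes the `influx` CLI is available on the InfluxDB host and that the default `autogen` retention policy is in use; measurement names correspond to the inputs configured above.

```
# List the measurements Telegraf has written to the telegraf database.
/> influx -execute 'SHOW MEASUREMENTS ON telegraf'

# Confirm recent host metrics (e.g., CPU samples from the last five minutes).
/> influx -execute 'SELECT count(*) FROM "telegraf"."autogen"."cpu" WHERE time > now() - 5m'
```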
## Step 3: Deploy and Configure Kapacitor

Kapacitor is used to process time-series data, detect anomalies, and trigger alerts. As the preferred alerting mechanism varies for every customer, refer to the Kapacitor documentation on how to send alerts through specific tools: https://docs.influxdata.com/kapacitor/v1.6/working/alerts/

1. Install Kapacitor using the instructions from InfluxDB: https://docs.influxdata.com/kapacitor/v1.6/introduction/installation/

2. Create the Kapacitor configuration file (typically located at `/etc/kapacitor/kapacitor.conf`). Open the file in an editor, then copy and paste the following contents into the file. Replace the designated placeholders (denoted with a `$`) with the specific values for the given environment:

   - `$kapacitor_host_or_ip` = hostname or IP address associated with the Kapacitor instance
   - `$influx_url` = URL of the InfluxDB instance (e.g., `http://10.6.0.4:8086`)

```
; /etc/kapacitor/kapacitor.conf
hostname = "$kapacitor_host_or_ip"
data_dir = "/var/lib/kapacitor"
skip-config-overrides = false

[logging]
  # Destination for logs. Can be a path to a file or 'STDOUT', 'STDERR'.
  file = "/var/log/kapacitor/kapacitor.log"
  level = "ERROR"

[load]
  enabled = true
  dir = "/etc/kapacitor/load"

[replay]
  # Where to store replay files, aka recordings.
  dir = "/var/lib/kapacitor/replay"

[storage]
  boltdb = "/var/lib/kapacitor/kapacitor.db"

[[influxdb]]
  enabled = true
  default = true
  name = "influx"
  urls = ["$influx_url"]
  username = "admin"
  password = ""
  timeout = "10s"
  startup-timeout = "60s"
  # Subscription mode is either "cluster" or "server".
  subscription-mode = "server"
  # Which protocol to use for subscriptions: one of 'udp', 'http', or 'https'.
  subscription-protocol = "http"
  # Subscriptions resync time interval. Useful if you want to subscribe to newly
  # created databases without restarting kapacitord.
  subscriptions-sync-interval = "1m0s"

; Insert alert destination configs here.

[stats]
  enabled = true
  stats-interval = "10s"
  database = "_kapacitor"
  retention-policy = "autogen"
```

### Add Kapacitor Alerts and Templates

Ocient provides sample Kapacitor alerts and templates upon request; contact Ocient Support for details. Within the bundle, the relevant contents are located within the `kapacitor` directory.

- Copy the files located under `kapacitor/load/tasks` to the Kapacitor system in `/etc/kapacitor/load/tasks`.
- Copy the files located under `kapacitor/load/templates` to the Kapacitor system in `/etc/kapacitor/load/templates`.

### Start Kapacitor

When the configuration is complete, Kapacitor must be started.

1. On each node, start the Kapacitor service:

```
/> sudo systemctl start kapacitor
```

2. Ensure the service came up properly by checking the status:

```
/> sudo systemctl status kapacitor
● kapacitor.service - Time series data processing engine.
   Loaded: loaded (/lib/systemd/system/kapacitor.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-09-03 18:30:37 UTC; 1 months 9 days ago
     Docs: https://github.com/influxdb/kapacitor
 Main PID: 16177 (kapacitord)
    Tasks: 16 (limit: 4915)
   CGroup: /system.slice/kapacitor.service
           └─16177 /usr/bin/kapacitord -config /etc/kapacitor/kapacitor.conf

Sep 03 18:30:37 kapacitor systemd[1]: Started Time series data processing engine.
Sep 03 18:30:37 kapacitor kapacitord[16177]: [Kapacitor ASCII art startup banner]
Sep 03 18:30:37 kapacitor kapacitord[16177]: 2021/09/03 18:30:37 Using configuration at: /etc/kapacitor/kapacitor.conf
```

3. Set the Kapacitor service to start on boot:

```
/> sudo systemctl enable kapacitor
```
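After Kapacitor is running, you can confirm that the tasks and templates placed under `/etc/kapacitor/load` were registered. This sketch assumes the `kapacitor` CLI is run on the Kapacitor host; the actual task names depend on the bundle provided by Ocient Support.

```
# List registered tasks with their type, status, and execution state.
/> kapacitor list tasks

# List templates loaded from /etc/kapacitor/load/templates.
/> kapacitor list templates

# Inspect a single task in detail (replace <task_id> with a name from the list).
/> kapacitor show <task_id>
```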
## Step 4: Deploy and Configure Grafana

Grafana provides the dashboard and visualization capability for metrics. This tool is used instead of {{chronograf}}, which is traditionally used to complete the TICK (Telegraf/InfluxDB/Chronograf/Kapacitor) stack. Grafana is favored over Chronograf due to its versatility beyond the InfluxDB ecosystem, as well as the higher frequency of contributions and updates from the broader community.

### Install Grafana

Follow the Grafana installation instructions, and ensure that configuration options for items such as users, permissions, and logging all correspond to the environment: https://grafana.com/docs/grafana/latest/installation/

### Add InfluxDB as a Data Source

1. Open Grafana in a web browser.
2. In the Grafana UI, click the settings tab on the left side of the screen. Select **Configuration > Data Sources**.

   *Configuration of data sources*

3. Click **Add data source**.
4. Under the data source list, select **InfluxDB**, then click the **Select** button.

   *Add data source*

5. Enter all of the pertinent configuration details associated with the InfluxDB instance, and ensure the data source is marked as the default. Under the **HTTP** section, the URL should be the same as the `$influx_url` used in previous sections. The query language should be set to the default (InfluxQL).

   *Configuration settings*

6. Under the **InfluxDB Details** section, specify `telegraf` as the database name.

   *InfluxDB details for database access*

7. When the configuration appears to be correct, click **Save & Test**.
8. A green banner should appear, indicating that the connection to the data source is working properly. If this message does not display, consult the Grafana documentation or contact Ocient Support.

   *Connection status*

### Import Dashboards

Ocient provides sample Grafana dashboards upon request; contact Ocient Support for details. Within the bundle, the relevant contents are located within the `grafana` directory. Grafana only allows you to import one file at a time.

1. Open Grafana in a web browser.
2. In the Grafana UI, click the **+** ("Create") tab on the left side of the screen. Select the **Import** option.
3. On the Import page, click the **Upload JSON file** button.
4. Select the file representing the dashboard you wish to import.
5. On the dashboard summary page, verify that the settings are correct.
6. Click **Import** to complete the process.

Repeat steps 3-6 for each dashboard that needs to be added.

### Related Links

- Log Monitoring (docid\:goypkivlud77sz0sx_ekq)
- Statistics Monitoring (docid\:ihe4f_5epw1cfbzgcudga)

Filebeat is a trademark of Elasticsearch BV.