Monitoring AWS with Telegraf and InfluxDB Cloud

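Lately I've been using InfluxDB Cloud and Telegraf for synthetic monitoring of Amazon Web Services API endpoints, and I wanted to share my configuration notes.

What are InfluxDB and Telegraf?

InfluxDB is an open source time series database, and Telegraf is a software agent for sending time series data to InfluxDB. Telegraf has hundreds of plugins for collecting different types of time series data.

One of these plugins, inputs.http_response, sends HTTP/HTTPS requests to a list of URLs and sends metrics about the responses (status code, response time, and so on) to InfluxDB.

Why monitor AWS?

Let's say you run applications on Amazon Web Services and you want a record of whether the specific AWS services you use are up. Sure, you could check the AWS Service Health Dashboard. But while AWS has plenty of great people working there, that's a bit like letting the fox guard the henhouse.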

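So it's better to monitor AWS services independently, and the same goes for any cloud service or SaaS application you depend on. Here are some quick tips on getting started with InfluxDB Cloud and the Telegraf http_response plugin.

Telegraf configuration for synthetic monitoring

First, get Telegraf running on your machine and set up an InfluxDB Cloud instance. Here is a step-by-step tutorial on how to do that. Next, read this overview of synthetic monitoring with Telegraf and InfluxDB.

Once you've done that, copy this Telegraf configuration file to your local machine and run it with the following command: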

telegraf --config ./telegraf-synthetic-aws.conf --debug

Assuming everything starts up fine, you'll see terminal output similar to the following:

[Screenshot: Telegraf startup output in the terminal]

From here, log in to the InfluxDB Cloud Data Explorer (the second icon from the top) and start graphing your data.

[Screenshot: Data Explorer]

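Let's walk through some key parts of telegraf-synthetic-aws.conf. First, we tag our time series so they're easier to query in Data Explorer. We declare that these metrics come from a company named Amazon and a service named AWS. This would come in handy if we were also tracking metrics from Google Cloud or Azure.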

# Global tags can be specified here in key="value" format. 
[global_tags] 
company = "Amazon" 
service = "AWS" # will tag all metrics with service=AWS

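Since we may be monitoring a lot of endpoints, I like to conserve my data usage. So I leave the hostname out of my data stream, since I won't be querying on it.

## If set to true, do not set the "host" tag in the telegraf agent.
omit_hostname = true

We need to specify that we're using InfluxDB Cloud version 2, rather than a version 1 cloud or on-premises instance:

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb_v2]]

We declare where our InfluxDB Cloud instance can be found (here is a current list of InfluxDB Cloud regions):

## The URLs of the InfluxDB cluster nodes.
urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]

We list our secret token, which authorizes Telegraf to post data to our InfluxDB instance.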

## Token for authentication.
token = "$INFLUX_TOKEN"

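$INFLUX_TOKEN means the value comes from an environment variable of the same name. If you haven't set this variable yet, you can do so with this terminal command: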

export INFLUX_TOKEN=[your unique API token]
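
If the variable is missing, the problem would most likely only show up later as failed writes to InfluxDB, so it can help to check for it before launching Telegraf. The snippet below is a minimal sketch of such a guard, not part of the original setup; the function name require_token is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): fail early when the
# INFLUX_TOKEN environment variable is missing or empty.
require_token() {
  if [ -z "${INFLUX_TOKEN:-}" ]; then
    echo "INFLUX_TOKEN is not set; export it before starting Telegraf" >&2
    return 1
  fi
  echo "INFLUX_TOKEN is set"
}
```

One way to use it is to chain it in front of the launch command, for example: require_token && telegraf --config ./telegraf-synthetic-aws.conf --debug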

If you don't have an InfluxDB Cloud token yet, see how to find your InfluxDB Cloud token.

Back to telegraf-synthetic-aws.conf. The next few lines describe the organization and bucket you want to write to.

## Organization is the name of the organization you wish to write to; must exist. 
organization = "[email protected]" 
## Destination bucket to write into. 
bucket = "my-bucket"

Your organization name appears at the top of every screen in InfluxDB Cloud.

[Screenshot: InfluxDB Cloud organization name]

You can find your list of buckets (a bucket list?) under Load Data.

[Screenshot: InfluxDB Cloud Load Data command]

OK, now for the heart of the matter: the Telegraf plugin we're using and the URLs we want to monitor.

# HTTP/HTTPS request given an address a method and a timeout 
[[inputs.http_response]] 
## Server address (default http://localhost) 
## List of urls to query. 
urls = [ 
"http://aws.amazon.com" 
]

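[[inputs.http_response]] tells Telegraf to run the http_response plugin. Each URL must be wrapped in quotes, and URLs must be separated by commas.

In the example above, we monitor only the main AWS URL, http://aws.amazon.com. But if we wanted to monitor more URLs, we could easily do so. For example, here's how to monitor some EC2 endpoints: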

# HTTP/HTTPS request given an address a method and a timeout 
[[inputs.http_response]] 
## Server address (default http://localhost) 
## List of urls to query.
urls = [
"https://ec2.ca-central-1.amazonaws.com",
"https://ec2.eu-central-1.amazonaws.com", 
"https://ec2.eu-west-1.amazonaws.com",
"https://ec2.sa-east-1.amazonaws.com",
"https://ec2.us-east-1.amazonaws.com",
"https://ec2.us-gov-east-1.amazonaws.com", 
"https://ec2.us-west-1.amazonaws.com", 
]

Here's how to monitor some S3 endpoints:

# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
## Server address (default http://localhost) 
## List of urls to query. 
urls = [
"https://s3.amazonaws.com", 
"https://s3.ca-central-1.amazonaws.com",
"https://s3.eu-central-1.amazonaws.com",
"https://s3.sa-east-1.amazonaws.com", 
"https://s3.us-east-1.amazonaws.com", 
"https://s3.us-west-1.amazonaws.com", 
]

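You probably get the idea by now. Here is a list of all AWS endpoints. Pick the ones you care about and plug them into the section above.

The default http_response configuration doesn't follow redirects. I prefer to follow redirects to make sure they don't lead to a 5xx error, which would mean the service is down. Here's how to do that: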

## Whether to follow redirects from the server (defaults to false) 
follow_redirects = true

Putting all of the above together, here's the full configuration file:

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)


# Global tags can be specified here in key="value" format.
[global_tags]
  company = "Amazon"
  service = "AWS" # will tag all metrics with service=AWS
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "30s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "30s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = true
  # Changed to true since we don't care about the hostname (localhost) and we want to make our payloads smaller.



###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb_v2]]	
  ## The URLs of the InfluxDB cluster nodes.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  ## urls exp: http://127.0.0.1:9999
  urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]

  ## Token for authentication.
  token = "$INFLUX_TOKEN"

  ## Organization is the name of the organization you wish to write to; must exist.
  organization = "[email protected]"

  ## Destination bucket to write into.
  bucket = "my-bucket"
  


###############################################################################
#                            PROCESSOR PLUGINS                                #
###############################################################################


###############################################################################
#                            AGGREGATOR PLUGINS                               #
###############################################################################


###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
  ## Server address (default http://localhost)
  ## List of urls to query.
  urls = [
      "http://aws.amazon.com"
    ]

  ## Set http_proxy (telegraf uses the system wide proxy settings if it is not set)
  # http_proxy = "http://localhost:8888"

  ## Set response_timeout (default 5 seconds)
  # response_timeout = "5s"

  ## HTTP Request Method
  # method = "GET"

  ## Whether to follow redirects from the server (defaults to false)
  follow_redirects = true
  # Changed this to see if a URL leads to a successful HTTP request.

  ## Optional HTTP Request Body
  # body = '''
  # {'fake':'data'}
  # '''

  ## Optional substring or regex match in body of the response
  # response_string_match = ""

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## HTTP Request Headers (all values must be strings)
  # [inputs.http_response.headers]
  #   Host = "github.com"

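With that, you should have everything you need to monitor AWS and any other cloud services and SaaS applications that are reachable over the web.

Appendix: List of AWS endpoints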

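While writing this post, I was surprised at how many Amazon Web Services there are now. Even more startling: AWS delivers its various APIs through more than 2,000 HTTP endpoints.

Now, if you only use a few AWS endpoints, you can skip this section. But if you want to monitor AWS endpoints at scale, read on.

(Warning: monitoring dozens of URLs or more may push you beyond the InfluxDB Cloud free tier and require a paid plan.)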
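
To get a rough sense of the volume involved, the number of HTTP probes per day is the URL count multiplied by the number of collection intervals in a day. The sketch below does that arithmetic in the shell. probes_per_day is a made-up helper name, and it counts probes only, not the fields or bytes each probe writes, so treat the result as a rough indicator rather than a billing estimate.

```shell
#!/bin/sh
# Hypothetical helper: probes per day for a given URL count and
# collection interval in seconds (the config in this post uses 30s).
probes_per_day() {
  urls=$1
  interval_seconds=$2
  # 86400 seconds in a day; integer arithmetic is enough for an estimate.
  echo $(( urls * 86400 / interval_seconds ))
}

probes_per_day 50 30   # 50 URLs polled every 30 seconds prints 144000
```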

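Here is a plain-text list of AWS service endpoints that I compiled from this AWS documentation page.

Since this list will change over time, here's how to compile the list of AWS endpoints from the terminal: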

curl https://docs.aws.amazon.com/general/latest/gr/rande.html | grep amazonaws | sort -u > aws-endpoints.txt

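Let's break down what's happening here:

  • curl fetches the web page.
  • https://docs.aws.amazon.com/general/latest/gr/rande.html is the web page that lists all AWS endpoints.
  • Using a Unix pipe, we send that page to the grep command, which extracts every line containing amazonaws. Thankfully, as far as I can tell, the AWS convention is to always use amazonaws in the URL of a service endpoint, which makes things easier.
  • sort puts all the lines of text in alphabetical order, and the -u flag keeps only unique lines, so we don't get duplicates.
  • Finally, we send the output of sort to a new file named aws-endpoints.txt.

From here, you can use the text editor of your choice to clean up any extra cruft on each line.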
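
Some of that cleanup can also be scripted. The following is a minimal sketch that pulls hostnames out of aws-endpoints.txt and prints them as quoted, comma-terminated https URLs ready to paste into a urls = [ ... ] list. The function name format_endpoints is made up, and the pattern assumes every hostname of interest ends in .amazonaws.com, so it will truncate names with a longer suffix (such as a country-specific domain) and may pick up placeholder names from the documentation text.

```shell
#!/bin/sh
# Hypothetical helper: extract *.amazonaws.com hostnames from a file and
# format each one as a quoted https URL followed by a comma.
format_endpoints() {
  # -o prints only the matching part of each line, one match per output line.
  grep -oE '[A-Za-z0-9.-]+\.amazonaws\.com' "$1" |
    sort -u |
    sed 's|.*|  "https://&",|'
}
```

Running format_endpoints aws-endpoints.txt prints one entry per hostname. The output still deserves a quick manual review before it goes into the Telegraf config.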