Monitoring AWS with Telegraf and InfluxDB Cloud

By Al Sargent / Use Cases, Product, Developer
November 19, 2019

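Lately I've been using InfluxDB Cloud and Telegraf for synthetic monitoring of Amazon Web Services API endpoints, and I wanted to share my notes on the configuration.

What are InfluxDB and Telegraf?

InfluxDB is an open source time series database, and Telegraf is a software agent for sending time series data to InfluxDB. Telegraf has hundreds of plugins for collecting different types of time series data.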

One of these plugins, inputs.http_response, sends HTTP/HTTPS requests to a list of URLs and sends metrics about the responses (status code, response time, and so on) to InfluxDB.
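
To make those metrics concrete, here is a minimal Python sketch of a single synthetic check. It is not Telegraf's implementation: the function name probe and its injectable opener parameter are hypothetical, and the dictionary keys are only meant to resemble the kind of fields the plugin reports.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Request `url` once and report its status code and response time.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib's urlopen, which
    follows redirects.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            code = resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses still carry a status code worth recording.
        code = err.code
    except OSError:
        # DNS failure, refused connection, timeout: no HTTP status at all.
        code = None
    return {
        "url": url,
        "http_response_code": code,
        "response_time": time.monotonic() - start,  # seconds
    }
```

Calling probe("https://s3.amazonaws.com") from a machine with internet access would return one such record; Telegraf does the equivalent on a schedule for every URL in its list and ships the results to InfluxDB.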

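Why monitor AWS?

Let's say you're running an application on Amazon Web Services and you want to keep a record of whether the specific AWS services you use are up. Sure, you could look at the AWS Service Health Dashboard. But while AWS has plenty of good people working there, that's a bit like asking the fox to guard the henhouse.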

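So it's best to monitor AWS services independently, and the same goes for any cloud service or SaaS application you depend on. With that in mind, here are some quick tips on getting started with InfluxDB Cloud and the Telegraf http_response plugin.

Telegraf configuration for synthetic monitoring

First, get Telegraf running on your machine and an InfluxDB Cloud instance up. Here is a step-by-step tutorial on how to do that. Next, check out this overview of synthetic monitoring with Telegraf and InfluxDB.

Once you've done that, copy this Telegraf configuration file to your local machine and run it with the following command: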

telegraf --config ./telegraf-synthetic-aws.conf --debug

Assuming everything starts up fine, you'll see terminal output similar to the following:

[Screenshot: Telegraf terminal output]

From here, log into the InfluxDB Cloud Data Explorer (the second icon from the top) and start graphing your data.

[Screenshot: Data Explorer]

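Let's walk through some key parts of telegraf-synthetic-aws.conf. First, we tag the time series so that they're easier to query in Data Explorer. We declare that these metrics come from a company called Amazon and a service called AWS. This could be helpful if we were also tracking metrics from Google Cloud or Azure.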

# Global tags can be specified here in key="value" format. 
[global_tags] 
company = "Amazon" 
service = "AWS" # will tag all metrics with service=AWS

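Since we may be monitoring a lot of endpoints, I like to economize on my data usage. So I leave the hostname out of my data stream, since I won't be querying on it.

## If set to true, do not set the "host" tag in the telegraf agent.
omit_hostname = true

We need to specify that we're using InfluxDB Cloud version 2, rather than a version 1 cloud or on-premises instance:

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb_v2]]

We state where our InfluxDB Cloud instance can be found (here is a current list of InfluxDB Cloud regions):

## The URLs of the InfluxDB cluster nodes.
urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]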

We list our secret token, which authorizes Telegraf to post data to our InfluxDB instance.

## Token for authentication.
token = "$INFLUX_TOKEN"

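$INFLUX_TOKEN means the value comes from the environment variable of the same name. If you haven't set this variable yet, you can do so with this terminal command: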

export INFLUX_TOKEN=[your unique API token]

If you don't have an InfluxDB Cloud token yet, see how to find your InfluxDB Cloud token.

Back to telegraf-synthetic-aws.conf. The next few lines describe the organization and bucket you want to write to.

## Organization is the name of the organization you wish to write to; must exist. 
organization = "[email protected]" 
## Destination bucket to write into. 
bucket = "my-bucket"

Your organization name is at the top of every screen in InfluxDB Cloud.

You can find your list of buckets (bucket list?) under the Load Data command.

[Screenshot: InfluxDB Load Data command]

OK, now we get to the heart of the matter: the Telegraf plugin we're using and the URLs we want to monitor.

# HTTP/HTTPS request given an address a method and a timeout 
[[inputs.http_response]] 
## Server address (default https://) 
## List of urls to query. 
urls = [ 
"http://aws.amazon.com" 
]

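The line [[inputs.http_response]] tells Telegraf to run the http_response plugin. Each URL needs to be wrapped in quotes, and URLs are separated by commas.

In the example above, we monitor only the main AWS URL, http://aws.amazon.com. But if we wanted to monitor more URLs, we easily could. For example, here's how to monitor some EC2 endpoints: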

# HTTP/HTTPS request given an address a method and a timeout 
[[inputs.http_response]] 
## Server address (default https://) 
## List of urls to query.
urls = [
"https://ec2.ca-central-1.amazonaws.com",
"https://ec2.eu-central-1.amazonaws.com", 
"https://ec2.eu-west-1.amazonaws.com",
"https://ec2.sa-east-1.amazonaws.com",
"https://ec2.us-east-1.amazonaws.com",
"https://ec2.us-gov-east-1.amazonaws.com", 
"https://ec2.us-west-1.amazonaws.com", 
]

And here's how to monitor some S3 endpoints:

# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
## Server address (default https://) 
## List of urls to query. 
urls = [
"https://s3.amazonaws.com", 
"https://s3.ca-central-1.amazonaws.com",
"https://s3.eu-central-1.amazonaws.com",
"https://s3.sa-east-1.amazonaws.com", 
"https://s3.us-east-1.amazonaws.com", 
"https://s3.us-west-1.amazonaws.com", 
]

You probably get the idea by now. Here is a list of all AWS endpoints. Pick the ones you care about, and insert them into the section above.

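The default http_response configuration does not follow redirects. I prefer to follow redirects, to make sure they don't lead to 5xx errors, which mean that a service is down. Here's how to do that: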

## Whether to follow redirects from the server (defaults to false) 
follow_redirects = true

Putting all of the above together, here is the complete configuration file:

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)


# Global tags can be specified here in key="value" format.
[global_tags]
  company = "Amazon"
  service = "AWS" # will tag all metrics with service=AWS
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "30s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "30s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = true
  # Changed to true since we don't care about the hostname (localhost) and we want to make our payloads smaller.



###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb_v2]]	
  ## The URLs of the InfluxDB cluster nodes.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  ## urls exp: http://127.0.0.1:9999
  urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]

  ## Token for authentication.
  token = "$INFLUX_TOKEN"

  ## Organization is the name of the organization you wish to write to; must exist.
  organization = "[email protected]"

  ## Destination bucket to write into.
  bucket = "my-bucket"
  


###############################################################################
#                            PROCESSOR PLUGINS                                #
###############################################################################


###############################################################################
#                            AGGREGATOR PLUGINS                               #
###############################################################################


###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
  ## Server address (default https://)
  ## List of urls to query.
  urls = [
      "http://aws.amazon.com"
    ]

  ## Set http_proxy (telegraf uses the system wide proxy settings if it is not set)
  # http_proxy = "https://:8888"

  ## Set response_timeout (default 5 seconds)
  # response_timeout = "5s"

  ## HTTP Request Method
  # method = "GET"

  ## Whether to follow redirects from the server (defaults to false)
  follow_redirects = true
  # Changed this to see if a URL leads to a successful HTTP request.

  ## Optional HTTP Request Body
  # body = '''
  # {'fake':'data'}
  # '''

  ## Optional substring or regex match in body of the response
  # response_string_match = ""

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## HTTP Request Headers (all values must be strings)
  # [inputs.http_response.headers]
  #   Host = "github.com"

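At this point, you should have everything you need to monitor AWS, along with any other cloud service or SaaS application that's reachable over the web.

Appendix: list of AWS endpoints

While writing this post, I was surprised by how many Amazon Web Services there are now. Even more shocking: AWS delivers its various APIs through more than 2,000 HTTP endpoints.

Now, if you only use a handful of AWS endpoints, you can skip this section. But if you want to monitor AWS endpoints at scale, read on.

(Warning: monitoring dozens of URLs or more may push you past the InfluxDB Cloud free tier and require a paid plan.)

Here is a list of AWS service endpoints in plain text format, which I compiled from this AWS documentation page.

Since this list will change over time, here's how to compile a list of AWS endpoints from the terminal: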

curl https://docs.aws.amazon.com/general/latest/gr/rande.html | grep amazonaws | sort -u > aws-endpoints.txt

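Let's break down what's going on here:

curl fetches the web page.
https://docs.aws.amazon.com/general/latest/gr/rande.html is the web page that lists all AWS endpoints.
Using a Unix pipe, we send that web page to the grep command, which in turn pulls out every line containing amazonaws. Thankfully, as far as I can tell, the AWS convention is to always use amazonaws in the URL of a service endpoint. That makes things easier.
sort puts all the lines of text in alphabetical order, and the -u flag outputs only unique lines, so there are no duplicates.
Then, we send the output of sort to a new file called aws-endpoints.txt.

From here, you can use the text editor of your choice to clean up any extra cruft on each line.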
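
If you would rather script that cleanup than do it by hand, here is a rough Python sketch. It assumes each hostname appears as plain text somewhere in the lines of aws-endpoints.txt; the regular expression and the function names extract_endpoints and to_toml_urls are my own assumptions, not part of any AWS or Telegraf tooling, and the pattern may need adjusting to whatever the page actually contains.

```python
import re

# Assumed pattern: a run of lowercase letters, digits, dots and hyphens
# ending in ".amazonaws.com", optionally followed by ".cn".
HOST_RE = re.compile(r"[a-z0-9][a-z0-9.-]*\.amazonaws\.com(?:\.cn)?")


def extract_endpoints(lines):
    """Return the sorted, de-duplicated hostnames found in `lines`."""
    hosts = set()
    for line in lines:
        hosts.update(HOST_RE.findall(line))
    return sorted(hosts)


def to_toml_urls(hosts):
    """Format hostnames as a urls array for an [[inputs.http_response]] section."""
    body = "\n".join(f'    "https://{host}",' for host in hosts)
    return f"  urls = [\n{body}\n  ]"


# Small inline sample standing in for the contents of aws-endpoints.txt.
sample_lines = [
    "<td>ec2.us-east-1.amazonaws.com</td>",
    "<p>s3.amazonaws.com and ec2.us-east-1.amazonaws.com</p>",
]
print(to_toml_urls(extract_endpoints(sample_lines)))
```

To run it against the real file, replace sample_lines with open("aws-endpoints.txt", encoding="utf-8"), then paste the printed array into the config and trim it down to the endpoints you actually depend on.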
