How to Parse Your XML Data with Telegraf

By Samantha Wang / Use Cases, Product, Developer
April 14, 2021

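In March, we released Telegraf 1.18, which included a wide range of new input and output plugins. One exciting new addition is the XML parser plugin, which adds support for another input data format that can be parsed into InfluxDB metrics.

What is XML?

XML stands for eXtensible Markup Language. It is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XML is similar to HTML, and is also a markup language, but it was designed to be self-descriptive and to store and transport data. For example, when you exchange data between incompatible systems and have to convert it, any incompatible data may be lost. XML aims to simplify data sharing and transport because it is stored in plain-text format. This provides a software- and hardware-independent way of storing, transporting, and sharing data.

Understanding your XML data

We will use the terms root, child, and subchild throughout this blog to help you understand the data points you are trying to parse.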

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

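An XML document must contain exactly one root element, which is the parent of all other elements.

This XML weather sample from OpenWeather is a good basic example for understanding the structure of XML data and how to parse it.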

<current>
   <city id="5004223" name="Oakland">
      <coord lon="-83.3999" lat="42.6667" />
      <country>US</country>
      <timezone>-14400</timezone>
      <sun rise="2021-03-24T11:29:19" set="2021-03-24T23:50:05" />
   </city>
   <temperature value="62.26" min="61" max="64.4" unit="fahrenheit" />
   <feels_like value="54.63" unit="fahrenheit" />
   <humidity value="59" unit="%" />
   <pressure value="1007" unit="hPa" />
   <wind>
      <speed value="12.66" unit="mph" name="Moderate breeze" />
      <gusts value="24.16" />
      <direction value="200" code="SSW" name="South-southwest" />
   </wind>
   <clouds value="75" name="broken clouds" />
   <visibility value="10000" />
   <precipitation mode="no" />
   <weather number="803" value="broken clouds" icon="04d" />
   <lastupdate value="2021-03-24T16:15:35" />
</current>

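In our weather data, current is the root element, and city, temperature, wind, and the other elements at the same level are its children.

An XML element is everything from the element's start tag <element> to its end tag </element>. Some tags can be self-closing, such as <coord />. An element can contain:

Text - US in <country>US</country>
Attributes - lon="-83.3999" and lat="42.6667" in the <coord> element <coord lon="-83.3999" lat="42.6667"/>
- Attributes are designed to hold data related to a specific element. This is especially important when we parse data values. They can also be emitted in ways that look odd but are still valid, such as <foo _="dance"></foo>.
Child elements - <city> and <coord> are further elements inside the <current> element.

The relationships between elements are described with the terms parent, child, and sibling.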
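
To make text, attributes, and child elements concrete, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree, not anything from Telegraf, and the trimmed document and variable names are mine, for illustration only.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the weather document shown above.
doc = """
<current>
   <city id="5004223" name="Oakland">
      <coord lon="-83.3999" lat="42.6667" />
      <country>US</country>
   </city>
   <humidity value="59" unit="%" />
</current>
""".strip()

root = ET.fromstring(doc)        # the single root element
city = root.find("city")         # a child of the root
country = city.find("country")   # a subchild that carries text
coord = city.find("coord")       # a self-closing element with attributes

print(root.tag)                       # current
print([child.tag for child in root])  # ['city', 'humidity']
print(country.text)                   # US
print(coord.attrib)                   # {'lon': '-83.3999', 'lat': '42.6667'}
```

Note that the text and the attribute values all come back as strings, which matters later when we assign field types.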

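What is XPath?

The Telegraf XML parser uses XPath expressions to break an XML string down into metric fields, and it supports most XPath 1.0 functionality. The parser uses XPath syntax to identify and navigate XPath nodes in your XML data. XPath supports over 200 functions; the ones supported by the Telegraf XML parser are listed in the underlying library's repository.

Note: XPath expressions usually select a node or a set of nodes, and you need to call functions such as string() or number() to access a node's content. However, as you will see when we cover the Telegraf XML parser plugin in more detail below, the plugin handles this as follows for convenience: metric_selection and field_selection only select a node or node set, so they are plain XPath expressions. All other queries return the node's "string-value" as defined by the XPath specification. You can convert the type with functions, as shown below.

I found this XPath tutorial particularly helpful for understanding XPath terminology and expressions. There is also this XPath cheat sheet, a one-page overview of how to use XPath selectors, expressions, functions, and more.

Before parsing any data, take a look at your XML and understand the nodes and node sets of the data you want to parse. This XPath tester is very handy for testing XPath functions and making sure you are querying the right path to parse a specific XML node.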

Path	Description	XML returned
current	Selects the child nodes named `current` relative to the current (context) node. It does not descend the node tree and only searches the children of the context node	`<current> <city id="5004223" name="Oakland"> <coord lon="-83.3999" lat="42.6667" /> <country>US</country> <timezone>-14400</timezone> <sun rise="2021-03-24T11:29:19" set="2021-03-24T23:50:05" /> </city> <temperature value="62.26" min="61" max="64.4" unit="fahrenheit" /> <feels_like value="54.63" unit="fahrenheit" /> <humidity value="59" unit="%" /> <pressure value="1007" unit="hPa" /> <wind> <speed value="12.66" unit="mph" name="Moderate breeze" /> <gusts value="24.16" /> <direction value="200" code="SSW" name="South-southwest" /> </wind> <clouds value="75" name="broken clouds" /> <visibility value="10000" /> <precipitation mode="no" /> <weather number="803" value="broken clouds" icon="04d" /> <lastupdate value="2021-03-24T16:15:35" /> </current>`
/current	Selects the root element `current`
current/city	Selects all `city` elements that are children of `current`	`<city id="5004223" name="Oakland"> <coord lon="-83.3999" lat="42.6667" /> <country>US</country> <timezone>-14400</timezone> <sun rise="2021-03-24T11:29:19" set="2021-03-24T23:50:05" /> </city>`
//city	Selects all `city` elements, no matter where they are in the document
current//country	Selects all `country` elements inside the `current` element, no matter where they are in the XML tree	`<country>US</country>`
current//@name	Selects all attributes named `name`	`name="Oakland" name="Moderate breeze" name="South-southwest" name="broken clouds"`
current/city/@name or //city/@name	Selects the attribute named `name` on the `city` element	`name="Oakland"`
current/city/*	Selects all child element nodes of the `city` element	`<coord lon="-83.3999" lat="42.6667"/> <country>US</country> <timezone>-14400</timezone> <sun rise="2021-03-24T11:29:19" set="2021-03-24T23:50:05"/>`
current/city/@*	Selects all attributes of the `city` element	`id="5004223" name="Oakland"`

W3Schools offers an extensive list of XPath syntax and goes into depth on XPath axes, along with more examples.
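
If you don't have an XPath tester at hand, a few of the paths from the table can be approximated in Python. This is only a sketch: the standard library's xml.etree.ElementTree implements a small subset of XPath (it cannot return attributes from a path, for instance), so it is not the library Telegraf uses and the attribute lookups below are emulated by hand. The trimmed document is my own shortened copy of the weather sample.

```python
import xml.etree.ElementTree as ET

doc = """<current>
   <city id="5004223" name="Oakland">
      <coord lon="-83.3999" lat="42.6667" />
      <country>US</country>
   </city>
   <wind>
      <speed value="12.66" unit="mph" name="Moderate breeze" />
      <direction value="200" code="SSW" name="South-southwest" />
   </wind>
   <clouds value="75" name="broken clouds" />
</current>"""

root = ET.fromstring(doc)  # root is the <current> element itself

# current/city: relative to the root element this is simply "city"
cities = root.findall("city")

# current//country: country elements at any depth below the root
countries = [e.text for e in root.findall(".//country")]

# current/city/*: all child elements of city
city_children = [e.tag for e in root.findall("city/*")]

# current/city/@*: ElementTree paths cannot select attributes,
# so read them from the element instead
city_attrs = dict(root.find("city").attrib)

# current//@name: every name attribute on the root or below it
names = [e.get("name") for e in root.iter() if e.get("name") is not None]

print(countries)      # ['US']
print(city_children)  # ['coord', 'country']
print(city_attrs)     # {'id': '5004223', 'name': 'Oakland'}
print(names)
```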

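Configuring Telegraf to ingest XML

XML is one of the many input data formats Telegraf currently supports. This means that any input plugin with a data_format option can be set to xml to start parsing your XML data, like this: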

data_format = "xml"

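Let's walk through how to configure this correctly so your XML data gets into InfluxDB. As mentioned above, the XML parser uses XPath expressions to break an XML string down into metric fields. XPath expressions are what the parser uses to identify and navigate nodes in your XML data.

Below is the default configuration for a plugin using the XML parser. As with other Telegraf configurations, commented lines start with a pound sign (#).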

[[inputs.tail]]
  files = ["example.xml"]

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "xml"

  ## Multiple parsing sections are allowed
  [[inputs.tail.xml]]
    ## Optional: XPath-query to select a subset of nodes from the XML document.
    #metric_selection = "/Bus/child::Sensor"

    ## Optional: XPath-query to set the metric (measurement) name.
    #metric_name = "string('example')"

    ## Optional: Query to extract metric timestamp.
    ## If not specified the time of execution is used.
    #timestamp = "/Gateway/Timestamp"
    ## Optional: Format of the timestamp determined by the query above.
    ## This can be any of "unix", "unix_ms", "unix_us", "unix_ns" or a valid Golang
    ## time format. If not specified, a "unix" timestamp (in seconds) is expected.
    #timestamp_format = "2006-01-02T15:04:05Z"

    ## Tag definitions using the given XPath queries.
    [inputs.tail.xml.tags]
      name   = "substring-after(Sensor/@name, ' ')"
      device = "string('the ultimate sensor')"

    ## Integer field definitions using XPath queries.
    [inputs.tail.xml.fields_int]
      consumers = "Variable/@consumers"

    ## Non-integer field definitions using XPath queries.
    ## The field type is defined using XPath expressions such as number(), boolean() or string(). If no conversion is performed the field will be of type string.
    [inputs.tail.xml.fields]
      temperature = "number(Variable/@temperature)"
      power       = "number(Variable/@power)"
      frequency   = "number(Variable/@frequency)"
      ok          = "Mode != 'ok'"

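Let's step through all the steps and components that make up an XML parser configuration. Whenever you set an XPath query in the configuration, the specified path can be absolute (starting with /) or relative. Relative paths use the currently selected node as their reference.

Select a subset of nodes to parse (optional). If you only want to parse a subset of your XML data, use the metric_selection field to specify which part to parse. In our weather example, if we only wanted to parse the data under the wind element, we would set this to current//wind. Let's go ahead and read the entire weather XML document, so I set metric_selection = "/current". There will be one metric for each node selected by metric_selection. A benefit of setting this field is that I don't need to prefix "current/" to the paths in the queries of the subsequent configuration fields.
Set the metric name (optional). You can override the default metric name (most likely the plugin name) by setting the metric_name field. I'll set metric_name = "'weather'" to change my metric name from http to weather. You can also set metric_name to an XPath query that derives the metric name directly from a node in the XML document.
Set the value you want as your timestamp, and its format (optional). If your XML data contains a specific timestamp you want to assign to your metrics, set an XPath query for that value. Our weather data has a lastupdate value that indicates the exact time the weather data was recorded. I'll set timestamp = "lastupdate/@value" to read that value as my timestamp. If the timestamp field is not set, the current time is used as the timestamp for all created metrics.
From there, you can specify the format of the timestamp you just selected. timestamp_format can be set to unix, unix_ms, unix_us, unix_ns, or an accepted Go "reference time" layout. If timestamp_format is not configured, Telegraf assumes the result of your timestamp query is in unix format.
Set the tags you want from your XML data. To specify which values in your XML should become tags, configure a tags subsection, [inputs.http.xml.tags]. In the subsection, add one line per tag in the form tag-name = query, where query is an XPath query. For our weather data, I'll add the city and country names as tags with city = "city/@name" and country = "city/country". Multiple tags can be set under one subsection.
Configure the integer fields you want from your XML data. For XML values that are integers you want to read as fields, you must configure the field names and XPath queries in the fields_int subsection (for example, [inputs.tail.xml.fields_int]). This is because XML values are limited to a single type, string, so all of your data will be of type string unless converted by an XPath function. The entries follow the field_name = query format. In our weather data, values such as humidity and clouds are always integers, so we configure them in this subsection. The results of these fields_int queries are always converted to int64.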
```
[inputs.http.xml.fields_int]
humidity = "humidity/@value"
clouds = "clouds/@value"
```
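Configure the rest of your fields, making sure to indicate the data type in the XPath function. To add non-integer fields to the metric, add the appropriate XPath queries in field_name = query format to the general fields subsection (for example, [inputs.http.xml.fields]). Here it is crucial to specify the field's data type in the XPath query using XPath's type-conversion functions, such as number(), boolean(), or string(). If no conversion is performed in the query, the field will be of type string. Our weather data contains a mix of numeric and string values. For example, our wind speed is a number and is specified as wind_speed = "number(wind/speed/@value)", while the wind description is text and is formatted as a string with wind_desc = "string(wind/speed/@name)".
Select a set of nodes in your XML data to parse as fields (optional). If you have a large XML file with a large number of fields that would otherwise need to be configured individually, you can select a subset of them by configuring field_selection with an XPath query that selects the nodes. This setting is also commonly used when the node names are not known in advance (for example, precipitation values that aren't populated unless it is raining). Each node selected by field_selection forms a new field within the metric.
You can set the name and value of each field with the optional field_name and field_value XPath queries. If these queries are not specified, the field name defaults to the node name and the field value defaults to the content of the selected field node. It is important to note that field_name and field_value queries are only used if field_selection is specified. You can also combine these settings with the other field-specification subsections. Based on the multi-node London bicycle example below, to retrieve all of the attributes in the info elements, your field_selection settings would be configured as: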
```
field_selection = "child::info"
field_name = "name(@*[1])"
field_value = "number(@*[1])"
```
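Expand field names to a path relative to the selected node (optional). If you want the names of fields selected with field_selection to be expanded to a path relative to the selected node, set field_name_expansion = true. This setting lets you flatten nodes with non-unique names in a subtree. It is necessary if we select all leaf nodes as fields and those leaf nodes do not have unique names. If field_name_expansion is not set, we end up with duplicate names in the fields.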
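
The point about field types in the steps above is easy to see outside of Telegraf. The sketch below is a rough Python analogy, using the standard library's xml.etree.ElementTree, of what the conversions amount to. It is not how the parser is implemented, and the comments only map each line to the query it loosely mirrors.

```python
import xml.etree.ElementTree as ET

doc = """<current>
   <humidity value="59" unit="%" />
   <wind><speed value="12.66" unit="mph" name="Moderate breeze" /></wind>
   <precipitation mode="no" />
</current>"""
root = ET.fromstring(doc)

# No conversion: every value read from XML is a string.
raw_humidity = root.find("humidity").get("value")

# fields_int, humidity = "humidity/@value": the result becomes an integer.
humidity = int(raw_humidity)

# wind_speed = "number(wind/speed/@value)": the result becomes a float.
wind_speed = float(root.find("wind/speed").get("value"))

# wind_desc = "string(wind/speed/@name)": the text is kept as a string.
wind_desc = root.find("wind/speed").get("name")

# A comparison such as "precipitation/@mode = 'yes'" yields a boolean.
is_it_raining = root.find("precipitation").get("mode") == "yes"

print(repr(raw_humidity), humidity, wind_speed, wind_desc, is_it_raining)
```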

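Examples!

Basic parsing example: OpenWeather XML data

So far in this blog, I have been referring to the OpenWeatherMap XML API response while explaining XML concepts and the steps for configuring the XML parser. This configuration should help you understand how to parse some simple XML data with Telegraf. There is also a 5-day OpenWeather forecast test case in the plugin's testcases folder.

You can sign up for a free API key to retrieve this XML data over HTTP. Once you have your API key (it may take a few hours after signing up), you can set your URLs to specify the weather locations. In the configuration below, I retrieve the current weather data for Oakland, New York, and London in imperial units (blame us Americans for not understanding the metric system :). If you want to test the example below, make sure to set your API_KEY as an environment variable so the Telegraf configuration can read it.

Weather configuration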

[[inputs.http]]
  ## OpenWeatherMap API, need to register for $API_KEY: https://openweathermap.org/api
  urls = [
    "http://api.openweathermap.org/data/2.5/weather?q=Oakland&appid=$API_KEY&mode=xml&units=imperial",
    "http://api.openweathermap.org/data/2.5/weather?q=New%20York&appid=$API_KEY&mode=xml&units=imperial",
    "http://api.openweathermap.org/data/2.5/weather?q=London&appid=$API_KEY&mode=xml&units=imperial"
  ]
  data_format = "xml"
  ## Drop url and hostname from list of tags
  tagexclude = ["url", "host"]

  ## Multiple parsing sections are allowed
  [[inputs.http.xml]]
    ## Optional: XPath-query to set the metric (measurement) name.
    metric_name = "'weather'"
    ## Optional: XPath-query to select a subset of nodes from the XML document.
    metric_selection = "/current"
    ## Optional: Query to extract metric timestamp.
    ## If not specified the time of execution is used.
    timestamp = "lastupdate/@value"
    ## Optional: Format of the timestamp determined by the query above.
    ## This can be any of "unix", "unix_ms", "unix_us", "unix_ns" or a valid Golang
    ## time format. If not specified, a "unix" timestamp (in seconds) is expected.
    timestamp_format = "2006-01-02T15:04:05"
    
    ## Tag definitions using the given XPath queries.
    [inputs.http.xml.tags]
      city = "city/@name"
      country = "city/country"

    ## Integer field definitions using XPath queries.
    [inputs.http.xml.fields_int]
      humidity = "humidity/@value"
      clouds = "clouds/@value"

    ## Non-integer field definitions using XPath queries.
    ## The field type is defined using XPath expressions such as number(), boolean() or string(). If no conversion is performed the field will be of type string.
    [inputs.http.xml.fields]
      temperature = "number(temperature/@value)"
      precipitation = "number(precipitation/@value)"
      wind_speed = "number(wind/speed/@value)"
      wind_desc = "string(wind/speed/@name)"
      clouds_desc = "string(clouds/@name)"
      lat = "number(city/coord/@lat)"
      lon = "number(city/coord/@lon)"
      ## If "precipitation/@mode" value returns "no", is_it_raining will return false
      is_it_raining = "precipitation/@mode = 'yes'"

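Most of the settings in this weather configuration are explained above. The last field, is_it_raining, shows how XPath operators can be used in the configuration to return a node set, string, boolean, or number: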

is_it_raining = "precipitation/@mode = 'yes'"

Weather output

weather,city=New\ York,country=US clouds=1i,clouds_desc="clear sky",humidity=38i,is_it_raining=false,lat=40.7143,lon=-74.006,precipitation=0,temperature=58.15,wind_desc="Gentle Breeze",wind_speed=8.05 1617128228000000000
weather,city=London,country=GB clouds=0i,clouds_desc="clear sky",humidity=24i,is_it_raining=false,lat=51.5085,lon=-0.1257,precipitation=0,temperature=66.56,wind_desc="Light breeze",wind_speed=5.75 1617128914000000000
weather,city=Oakland,country=US clouds=90i,clouds_desc="overcast clouds",humidity=34i,is_it_raining=false,lat=42.6667,lon=-83.3999,precipitation=0,temperature=64.54,wind_desc="Moderate breeze",wind_speed=17.27 1617128758000000000

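Multi-node selection example: COVID-19 Vaccine Distribution Allocations by Jurisdiction

Your XML data will often contain similar metrics for multiple sections (each section could be a different device; in this example, each section represents a different jurisdiction). You can use the XML parser's multi-node selection to generate a metric for each block of data.

Given that this blog was written in spring 2021, there is a lot of COVID-19 data out there. To stay optimistic, let's look at some COVID-19 vaccine XML data from the Centers for Disease Control and Prevention (CDC). The CDC publishes weekly vaccine allocations by jurisdiction. There is an HTTP XML file for each vaccine manufacturer: Moderna, Pfizer, and Janssen/Johnson & Johnson. Each vaccine has its own personality type!

This COVID vaccine XML data is a good example of how to use the XML parser for multi-node selection.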

<response>
   <row>
      <row _id="row-vuan~mg8h_vwjk" _uuid="00000000-0000-0000-9614-D811B3DD0141" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-vuan~mg8h_vwjk">
         <jurisdiction>Connecticut</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>50310</_1st_dose_allocations>
         <_2nd_dose_allocations>50310</_2nd_dose_allocations>
      </row>
      <row _id="row-suay.uwx5_hiiz" _uuid="00000000-0000-0000-C448-E7F5D3B8E3CA" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-suay.uwx5_hiiz">
         <jurisdiction>Maine</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>19890</_1st_dose_allocations>
         <_2nd_dose_allocations>19890</_2nd_dose_allocations>
      </row>
      <row _id="row-dhdq_gsf8~rzrd" _uuid="00000000-0000-0000-6882-622E1430CDFA" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-dhdq_gsf8~rzrd">
         <jurisdiction>Massachusetts</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>95940</_1st_dose_allocations>
         <_2nd_dose_allocations>95940</_2nd_dose_allocations>
      </row>
      <row _id="row-jehx-8sxy_8dma" _uuid="00000000-0000-0000-56CD-DCA4760B56BC" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-jehx-8sxy_8dma">
         <jurisdiction>New York</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>153270</_1st_dose_allocations>
         <_2nd_dose_allocations>153270</_2nd_dose_allocations>
      </row>
      <row _id="row-chrx-6f37~qbn9" _uuid="00000000-0000-0000-30C3-4B8A23B1DF14" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-chrx-6f37~qbn9">
         <jurisdiction>New York City</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>117000</_1st_dose_allocations>
         <_2nd_dose_allocations>117000</_2nd_dose_allocations>
      </row>
   </row>
</response>

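The snippet above is taken from CDC COVID-19 Vaccine Distribution Allocations by Jurisdiction - Pfizer

This multi-node dataset doesn't have many child values for us to configure, but it has many parent sections. We'll use week_of_allocations as our timestamp, jurisdiction as a tag, and _1st_dose_allocations and _2nd_dose_allocations as fields. Even though the Janssen/Johnson & Johnson data does not include _2nd_dose_allocations (one and done), we don't need a separate configuration for it; the parser simply won't emit that field.

I included processors.enum in my configuration. Nothing in the XML data itself, other than the URL, indicates which manufacturer the data belongs to. The enum processor I configured adds a tag with the manufacturer name for the corresponding URL.

Configuration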

[[inputs.http]]
  urls = [
    "https://data.cdc.gov/api/views/b7pe-5nws/rows.xml", # Moderna
    "https://data.cdc.gov/api/views/saz5-9hgg/rows.xml", # Pfizer
    "https://data.cdc.gov/api/views/w9zu-fywh/rows.xml" # Janssen/Johnson & Johnson

    ]
  data_format = "xml"
  ## Drop hostname from list of tags
  tagexclude = ["host"]

    [[inputs.http.xml]]
        metric_selection = "//row"
        metric_name = "'cdc-vaccines'"
        timestamp = "week_of_allocations"
        timestamp_format = "2006-01-02T15:04:05"

        [inputs.http.xml.tags]
            state   = "jurisdiction"

        [inputs.http.xml.fields_int]
            1st_dose_allocations = "_1st_dose_allocations"
            2nd_dose_allocations = "_2nd_dose_allocations"


[[processors.enum]]
  [[processors.enum.mapping]]
    ## Name of the tag to map. Globs accepted.
    tag = "url"

    ## Destination tag or field to be used for the mapped value.  By default the
    ## source tag or field is used, overwriting the original value.
    dest = "vaccine_type"

    ## Table of mappings
    [processors.enum.mapping.value_mappings]
      "https://data.cdc.gov/api/views/b7pe-5nws/rows.xml" = "Moderna"
      "https://data.cdc.gov/api/views/saz5-9hgg/rows.xml" = "Pfizer"
      "https://data.cdc.gov/api/views/w9zu-fywh/rows.xml" = "Janssen"

Output (a snippet based on the XML vaccine data sample above; the full configuration produces much more output)

cdc-vaccines,state=Connecticut,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=60840i,2nd_dose_allocations=60840i 1617580800000000000
cdc-vaccines,state=Maine,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=23400i,2nd_dose_allocations=23400i 1617580800000000000
cdc-vaccines,state=Massachusetts,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=117000i,2nd_dose_allocations=117000i 1617580800000000000
cdc-vaccines,state=New\ York,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=188370i,2nd_dose_allocations=188370i 1617580800000000000
cdc-vaccines,state=New\ York\ City,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=143910i,2nd_dose_allocations=143910i 1617580800000000000
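
The trailing number on each output line is the metric timestamp in nanoseconds. As a sanity check, here is a small Python sketch that reproduces it from week_of_allocations. It assumes a timestamp without a zone is read as UTC, which is consistent with the output above; the helper name to_unix_ns is mine and is not part of Telegraf.

```python
from datetime import datetime, timezone


def to_unix_ns(value: str) -> int:
    """Parse a zone-less timestamp as UTC and return Unix nanoseconds."""
    # "%Y-%m-%dT%H:%M:%S" is the strptime counterpart of the Go layout
    # "2006-01-02T15:04:05" used for timestamp_format in the config.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    seconds = int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return seconds * 1_000_000_000


print(to_unix_ns("2021-04-05T00:00:00"))  # 1617580800000000000
```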

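Batch field processing with field selectors (example: London bicycle data)

Your XML data will often contain metrics with so many fields that configuring each one in the [inputs.tail.xml.fields] subsection would be tedious. Your XML data may also produce fields that are unknown at configuration time. In these cases, you can use field selectors to parse these metrics.

For our example, we'll use London's hire bicycle data from Transport for London. The data contains the time it was last updated (lastUpdate), which we'll use as the timestamp. The info nodes contain bike station status information, which we'll use as fields.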

<stations lastUpdate="1617397861012" version="2.0">
</stations>
<response>
   <location id="1" name="River Street , Clerkenwell">
      <info terminalName="001023" />
      <info lat="51.52916347" />
      <info long="-0.109970527" />
      <info installDate="1278947280000" />
      <temporary>false</temporary>
      <info nbBikes="10" />
      <info nbEmptyDocks="9" />
      <info nbDocks="19" />
   </location>
   <location id="2" name="Phillimore Gardens, Kensington">
      <info terminalName="001018" />
      <info lat="51.49960695" />
      <info long="-0.197574246" />
      <info installDate="1278585780000" />
      <temporary>false</temporary>
      <info nbBikes="28" />
      <info nbEmptyDocks="9" />
      <info nbDocks="37" />
   </location>
   <location id="3" name="Christopher Street, Liverpool Street">
      <info terminalName="001012" />
      <info lat="51.52128377" />
      <info long="-0.084605692" />
      <info installDate="1278240360000" />
      <temporary>false</temporary>
      <info nbBikes="2" />
      <info nbEmptyDocks="30" />
      <info nbDocks="32" />
   </location>
</response>

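In our configuration, we still use the metric_selection option to select all location nodes. For each location, we then use field_selection to select all info child nodes of the location as field nodes. This field selection is relative to the selected node. For each selected field node, we configure field_name and field_value to determine the field's name and value, respectively. field_name pulls the name of the node's first attribute, while field_value pulls the value of the first attribute and converts the result to a number.

For our non-numeric fields, we can still use [inputs.tail.xml.fields] in combination with field_selection. We still set the temporary node, which contains a string, to be read as a field. Also, note that my timestamp is outside my metric_selection, so I have to make sure the XPath query that pulls lastUpdate is an absolute path starting with /.

Configuration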

[[inputs.tail]]
  files = ["/pathname/london-cycle-for-hire.xml"]
  data_format = "xml"

  [[inputs.tail.xml]]
    metric_selection = "response/child::location"
    metric_name = "string('bikes')"

    timestamp = "/stations/@lastUpdate"
    timestamp_format = "unix_ms"

    field_selection = "child::info"
    field_name = "name(@*[1])"
    field_value = "number(@*[1])"

    [inputs.tail.xml.tags]
      address = "@name"
      id = "@id"

    [inputs.tail.xml.fields]
      placement = "string(temporary)"

Output

bikes,address=River\ Street\ \,\ Clerkenwell,host=MBP15-SWANG.local,id=1 installDate=1278947280000,lat=51.52916347,long=-0.109970527,nbBikes=10,nbDocks=19,nbEmptyDocks=9,placement="false",terminalName=1023 1617397861000000000
bikes,address=Phillimore\ Gardens\,\ Kensington,host=MBP15-SWANG.local,id=2 installDate=1278585780000,lat=51.49960695,long=-0.197574246,nbBikes=28,nbDocks=37,nbEmptyDocks=9,placement="false",terminalName=1018 1617397861000000000
bikes,address=Christopher\ Street\,\ Liverpool\ Street,host=MBP15-SWANG.local,id=3 installDate=1278240360000,lat=51.52128377,long=-0.084605692,nbBikes=2,nbDocks=32,nbEmptyDocks=30,placement="false",terminalName=1012 1617397861000000000
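
To see why terminalName comes out as 1023 and how the info nodes turn into fields, here is a rough Python emulation of the selection logic. It is a sketch only, built on the standard library's xml.etree.ElementTree and a shortened copy of the bike data; the loops mimic what metric_selection, field_selection, field_name, and field_value express, not how Telegraf implements them.

```python
import xml.etree.ElementTree as ET

doc = """<response>
   <location id="1" name="River Street , Clerkenwell">
      <info terminalName="001023" />
      <info lat="51.52916347" />
      <temporary>false</temporary>
      <info nbBikes="10" />
   </location>
</response>"""
root = ET.fromstring(doc)

metrics = []
for location in root.findall("location"):   # one metric per selected node
    fields = {}
    for info in location.findall("info"):   # field_selection = "child::info"
        # First attribute of the node: name(@*[1]) and number(@*[1])
        name, value = next(iter(info.attrib.items()))
        fields[name] = float(value)
    # placement = "string(temporary)" from the fields subsection
    fields["placement"] = location.findtext("temporary")
    tags = {"address": location.get("name"), "id": location.get("id")}
    metrics.append((tags, fields))

print(metrics)
```

Because the value is converted to a number, the leading zeros of "001023" are lost, which matches the output above.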

Quick tips and other helpful resources

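If you need to do some general troubleshooting, be sure to set debug = true in your agent settings. When a selection is empty, the parser will then (for *_selection settings) walk up the nodes and print the number of children it finds. This will help you see which part of the query may be causing the problem.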

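An XPath tester such as XPather or Code Beautify's XPath Tester will be your best friend when configuring the XML parser, helping you make sure you select the right XPath queries for your data. Being able to see which nodes your XPath query selects makes the configuration process far less frustrating.

One syntax point worth reiterating: when you set an XPath query in the configuration, the specified path can be absolute (starting with /) or relative. This is important to remember if you are querying a node outside of your metric selection. If you leave out the leading /, you will end up querying for a node that may not exist within your selected metric.

Finally, when querying attributes (for example, lon and lat in <coord lon="-83.3999" lat="42.6667"/>), the problem I kept running into was remembering to include the / before the @. I would accidentally query current/city/coord@lat, which returns nothing, when the correct query is current/city/coord/@lat.

Here are some resources that will help you better understand the Telegraf XML parser and XPath

A huge thank you to Sven Rebhan for building this plugin!

If you end up with any questions about parsing your XML data, reach out to us in the #telegraf channel of our InfluxData Community Slack (@Sven Rebhan if you want to chat with Sven specifically), or post any questions on our Community Site.

Want to learn more about data acquisition with Telegraf? Register for free for InfluxDays EMEA to attend Jess Ingrassellino's "Data Acquisition" talk on May 18, 2021, which covers Telegraf, CLI integrations for Cloud, and client libraries.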
