Troubleshooting

Overview

Related reading: the CNCF WeChat official account article 《记一次远程写性能问题引发的Prometheus版本升级事件》 (a Prometheus version upgrade triggered by a remote-write performance problem).

compaction failed

compaction failed is an error that Prometheus reports while compacting TSDB data. Many factors can trigger it; the most common one is using NFS as the backend storage for the Prometheus time-series database.

The official documentation explicitly states that NFS filesystems are not supported for Prometheus's local storage, and warns that non-POSIX-compliant filesystems may lead to unrecoverable corruption.

The problem typically shows up as missing files, for example a block's meta.json file going missing:

msg="compaction failed" err="plan compaction: open /prometheus/01FHHPS3NR7M2E8MAV37S61ME6/meta.json: no such file or directory"

msg="Failed to read meta.json for a block during reloadBlocks. Skipping" dir=/prometheus/01FHHPS3NR7M2E8MAV37S61ME6 err="open /prometheus/01FHHPS3NR7M2E8MAV37S61ME6/meta.json: no such file or directory"

Filtering the logs shows that the problem originated in a compact blocks step that ran right after a Deleting obsolete block operation, that is, blocks were compacted immediately after expired blocks had been deleted. The failing operation was:

msg="compaction failed" err="delete compacted block after failed db reloadBlocks:01FHHPS3NR7M2E8MAV37S61ME6: unlinkat /prometheus/01FHHPS3NR7M2E8MAV37S61ME6/chunks: directory not empty"

These error messages come from ./prometheus/tsdb/db.go in the Prometheus source.

Solution

You can simply delete the 01FHHPS3NR7M2E8MAV37S61ME6 block, that is, remove that directory, and then restart Prometheus. Note that deleting the block also discards the samples stored in it.

Error on ingesting out-of-order result from rule evaluation

This warning is usually caused by NaN values in the results of recording rules: all time series whose values are NaN are dropped and are not written to the database.
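As a rough illustration of how NaN can end up in a rule's result, here is a minimal recording-rule group sketch. The group name mirrors the logs below, but the record name and expression are made up for illustration and are not the actual kube-prometheus rules:

groups:
  - name: kube-apiserver.rules
    rules:
      # Illustrative rule only; not the real kube-prometheus rule set.
      - record: apiserver_request:error_ratio_rate5m
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
          /
          sum(rate(apiserver_request_total[5m]))

A division like this evaluates to NaN whenever both the numerator and the denominator are 0, so the evaluation result can contain NaN series of the kind described above.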

The exact error output is shown below. numDropped is the number of time series that were dropped from the new series generated by evaluating the recording rule's query:

caller=manager.go:651 component="rule manager" group=kube-apiserver.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=231

caller=manager.go:651 component="rule manager" group=kube-apiserver-availability.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=121

This log message comes from the Group.Eval() method in ./prometheus/rules/manager.go in the Prometheus source: on every rule evaluation, the warning is emitted whenever any resulting series has a sample that cannot be appended (out-of-order or duplicate). The individual discarded samples are only logged at debug level ("Rule evaluation result discarded"), so running Prometheus with --log.level=debug shows exactly which series were dropped.

// Eval runs a single evaluation cycle in which all rules are evaluated sequentially.
func (g *Group) Eval(ctx context.Context, ts time.Time) {
	var samplesTotal float64
	for i, rule := range g.rules {
		select {
		case <-g.done:
			return
		default:
		}

		func(i int, rule Rule) {
            ......
			for _, s := range vector {
				if _, err := app.Append(0, s.Metric, s.T, s.V); err != nil {
					rule.SetHealth(HealthBad)
					rule.SetLastError(err)

					switch errors.Cause(err) {
					case storage.ErrOutOfOrderSample:
						numOutOfOrder++
						level.Debug(g.logger).Log("msg", "Rule evaluation result discarded", "err", err, "sample", s)
					case storage.ErrDuplicateSampleForTimestamp:
						numDuplicates++
						level.Debug(g.logger).Log("msg", "Rule evaluation result discarded", "err", err, "sample", s)
					default:
						level.Warn(g.logger).Log("msg", "Rule evaluation result discarded", "err", err, "sample", s)
					}
				} else {
					seriesReturned[s.Metric.String()] = s.Metric
				}
			}
			if numOutOfOrder > 0 {
				level.Warn(g.logger).Log("msg", "Error on ingesting out-of-order result from rule evaluation", "numDropped", numOutOfOrder)
			}
			if numDuplicates > 0 {
				level.Warn(g.logger).Log("msg", "Error on ingesting results from rule evaluation with different value but same timestamp", "numDropped", numDuplicates)
			}
			......
		}(i, rule)
	}
	if g.metrics != nil {
		g.metrics.GroupSamples.WithLabelValues(GroupKey(g.File(), g.Name())).Set(samplesTotal)
	}
	g.cleanupStaleSeries(ctx, ts)
}

Error on ingesting out-of-order samples

Symptom

The logs show many warn-level Error on ingesting out-of-order samples messages.

Cause

This problem easily occurs when a single job scrapes the same metrics from multiple Prometheus servers, for example in the setup sketched below.

When the scrape targets are multiple Prometheus servers that hold the same data and are scraped in turn, this problem is very likely to appear.
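A minimal sketch of such a configuration, assuming a federation-style job; the job name, target hostnames, and the match[] selector are illustrative, not taken from the original setup:

scrape_configs:
  - job_name: "prometheus-federate"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job!=""}'
    static_configs:
      - targets:
          # Both servers expose the same series with their original timestamps,
          # so alternating scrapes can deliver an older timestamp after a newer one.
          - prometheus-0:9090
          - prometheus-1:9090

Because /federate exposes samples with their original timestamps, scraping two copies of the same data in turn can produce a sample whose timestamp is older than the one already ingested for the same series, which Prometheus rejects as out-of-order.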

Fix

Add honor_timestamps: false to each such job's configuration.
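Continuing the sketch above (names still illustrative), the job would then look like this:

scrape_configs:
  - job_name: "prometheus-federate"
    honor_labels: true
    # Ignore the timestamps exposed by the targets and use the scrape time instead.
    honor_timestamps: false
    metrics_path: /federate
    params:
      "match[]":
        - '{job!=""}'
    static_configs:
      - targets:
          - prometheus-0:9090
          - prometheus-1:9090

With honor_timestamps: false, Prometheus stamps ingested samples with its own scrape time rather than the timestamps exposed by the targets, so samples from the two servers no longer arrive out of order; the trade-off is that the original timestamps from the source servers are lost.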

