
fix(metrics): export late-created thread reject and error code metrics to Prometheus #16167

Open
Wubabalala wants to merge 1 commit into apache:3.3 from
Wubabalala:fix/metrics-name-count-sampler-registration

Conversation

@Wubabalala

What is the purpose of the change?

Fixes #16148

Dynamically added metric series in MetricsNameCountSampler subclasses (e.g., ThreadRejectMetricsCountSampler, ErrorCodeSampler) are not exported to Prometheus when the first event arrives after the initial reporter sync cycle.

Root Cause Analysis

MetricsNameCountSampler.samplesChanged is initialized to true and consumed by the reporter's first calSamplesChanged() poll (CAS true → false). At that point, sample() returns nothing because no actual metric series exist yet.

When the first real event arrives later (e.g., a thread pool rejection), SimpleMetricsCountSampler.inc() creates a new metric series entry in metricCounter. However, MetricsNameCountSampler.samplesChanged is only updated during metric name registration — it is never set back to true when a new metric series is first created at runtime. The reporter sees false on subsequent polls and never re-registers the new metric series to the Prometheus registry.
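The sequence described above can be reproduced with a minimal sketch. `StaleFlagSampler` is a hypothetical stand-in, not Dubbo's actual class: it assumes string keys instead of Dubbo's metric objects and models only the flag and the counter map.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the pre-fix behaviour; names and key type are assumptions.
class StaleFlagSampler {
    private final ConcurrentHashMap<String, AtomicLong> metricCounter = new ConcurrentHashMap<>();
    // Starts as true, as described for MetricsNameCountSampler.samplesChanged.
    private final AtomicBoolean samplesChanged = new AtomicBoolean(true);

    // Reporter poll: returns true once and resets the flag (CAS true -> false).
    boolean calSamplesChanged() {
        return samplesChanged.compareAndSet(true, false);
    }

    // Pre-fix inc(): creates the series on first use but never touches the flag.
    void inc(String series) {
        metricCounter.computeIfAbsent(series, k -> new AtomicLong()).incrementAndGet();
    }

    int seriesCount() {
        return metricCounter.size();
    }

    public static void main(String[] args) {
        StaleFlagSampler sampler = new StaleFlagSampler();
        System.out.println(sampler.calSamplesChanged()); // true: initial flag consumed, no series yet
        sampler.inc("thread.pool.reject");                // late first event creates a series
        System.out.println(sampler.seriesCount());        // 1
        System.out.println(sampler.calSamplesChanged()); // false: reporter never learns of the series
    }
}
```

In this model the last poll returns false even though a series now exists, which matches the reported symptom.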

Brief changelog

  • SimpleMetricsCountSampler: Refactored getAtomicCounter() into incrementAndGetCreated(), which returns true only when the call creates a new metric series (detected via reference equality against the candidate AtomicLong).
  • MetricsNameCountSampler: Overrode inc() to call incrementAndGetCreated() and set samplesChanged = true only when a new metric series is created. This avoids unnecessary re-registration when an already-registered series is updated. Also made addMetricName() idempotent.
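A rough sketch of the creation check, assuming string keys and a plain `ConcurrentHashMap`; the method names follow the changelog, but the bodies are illustrative and are not copied from the patch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the fix; bodies and key type are assumptions.
class CreatedAwareSampler {
    private final ConcurrentHashMap<String, AtomicLong> metricCounter = new ConcurrentHashMap<>();
    private final AtomicBoolean samplesChanged = new AtomicBoolean(true);

    // Increments the series counter and reports whether this call created the series.
    boolean incrementAndGetCreated(String series) {
        AtomicLong counter = metricCounter.get(series);
        boolean created = false;
        if (counter == null) {
            AtomicLong candidate = new AtomicLong();
            AtomicLong existing = metricCounter.putIfAbsent(series, candidate);
            // putIfAbsent returns null only to the caller whose candidate was stored.
            counter = (existing == null) ? candidate : existing;
            // Reference equality: true only if our own candidate ended up in the map.
            created = (counter == candidate);
        }
        counter.incrementAndGet();
        return created;
    }

    // Sets the flag only when a new series appears, not on every increment.
    void inc(String series) {
        if (incrementAndGetCreated(series)) {
            samplesChanged.set(true);
        }
    }

    // Reporter poll: returns true once per change and resets the flag.
    boolean calSamplesChanged() {
        return samplesChanged.compareAndSet(true, false);
    }

    long count(String series) {
        AtomicLong counter = metricCounter.get(series);
        return counter == null ? 0L : counter.get();
    }
}
```

When two threads race on the first event, only the one whose candidate wins `putIfAbsent` reports creation, so the flag is raised once per new series.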

How to verify

Thread pool reject path:

  1. Start a Dubbo 3.3.x provider with dubbo.metrics.enable-threadpool: true and dubbo.protocol.threads: 2.
  2. Wait 3+ minutes (so the initial samplesChanged flag is consumed).
  3. Send enough concurrent requests to trigger thread pool exhaustion.
  4. Check /actuator/prometheus for dubbo_thread_pool_reject_thread_count — it should now appear.
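Step 4 could be scripted rather than checked by eye. The helper below is a hypothetical sketch that scans a scrape body in the Prometheus text exposition format for a metric name; fetching the body from /actuator/prometheus is left out, and `ScrapeCheck` is not part of Dubbo or of this PR.

```java
// Hypothetical helper for checking scrape output; not part of Dubbo.
class ScrapeCheck {
    // Returns true if any sample line in the scrape body belongs to metricName.
    static boolean containsMetric(String scrapeBody, String metricName) {
        for (String line : scrapeBody.split("\n")) {
            String trimmed = line.trim();
            // Skip blank lines and "# HELP" / "# TYPE" comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // A sample line is: name, optional {labels}, whitespace, value.
            int end = 0;
            while (end < trimmed.length()
                    && trimmed.charAt(end) != '{'
                    && !Character.isWhitespace(trimmed.charAt(end))) {
                end++;
            }
            if (trimmed.substring(0, end).equals(metricName)) {
                return true;
            }
        }
        return false;
    }
}
```

Matching on the full name before `{` or whitespace avoids false positives from comment lines and from metrics that merely share a prefix.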

Error code path is covered by the unit test ErrorCodeSampleTest.testErrorCodeMetricChangesAfterFirstLateEvent, which verifies the same timing sequence with error code events.

New tests

  • ErrorCodeSampleTest.testErrorCodeMetricChangesAfterFirstLateEvent: Verifies the exact timing sequence — initial flag consumed, then first error event sets flag, repeat of same error does not, new error code sets flag again.
  • PrometheusMetricsThreadPoolTest.testThreadPoolRejectMetricsExportedAfterLateFirstEvent: End-to-end test — simulates the full startup → first poll → late event timeline and asserts that Prometheus scrape output contains the late-arriving reject metric series.

…s to Prometheus

MetricsNameCountSampler.samplesChanged is consumed by the first reporter
poll before any actual metric series exist. When the first real event
arrives later, SimpleMetricsCountSampler.inc() creates a new counter
entry but samplesChanged is never set back to true, so the reporter
never re-registers the new series to Prometheus.

Refactor getAtomicCounter() into incrementAndGetCreated() that returns
whether a new Metric->AtomicLong entry was created. Override inc() in
MetricsNameCountSampler to set samplesChanged only on new series
creation, avoiding unnecessary re-registration for existing series.

Fixes apache#16148

Development

Successfully merging this pull request may close these issues.

[Bug] Dubbo ThreadPoolRejectMetric & ErrorCodeMetric cannot be exported to Prometheus
