Observability
Ratatoskr integrates with OpenTelemetry for distributed tracing and metrics. All instrumentation uses the standard System.Diagnostics APIs — ActivitySource for tracing and Meter for metrics.
Setup
Register the Ratatoskr activity source and meter with the .NET OpenTelemetry SDK:
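
    using OpenTelemetry.Metrics;
    using OpenTelemetry.Trace;

    builder.Services.AddOpenTelemetry()
        // Collect Ratatoskr spans alongside ASP.NET Core request spans.
        .WithTracing(tracing => tracing
            .AddSource(RatatoskrDiagnostics.ActivitySourceName)
            .AddAspNetCoreInstrumentation())
        // Collect Ratatoskr metrics alongside ASP.NET Core metrics.
        .WithMetrics(metrics => metrics
            .AddMeter(RatatoskrDiagnostics.MeterName)
            .AddAspNetCoreInstrumentation());
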
The constants RatatoskrDiagnostics.ActivitySourceName and RatatoskrDiagnostics.MeterName are both "Ratatoskr".
Tracing
Ratatoskr creates Activity spans at key pipeline stages:
- Publish: A span is created when PublishDirectAsync is called, covering serialization and transport send
- Consume: A span is created when a message is received from the transport, covering routing and dispatch
- Dispatch: A child span covers handler invocation
- Outbox/Inbox: Spans cover background processor batch operations
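To see which spans are emitted without wiring up an exporter, an in-process ActivityListener can subscribe to the "Ratatoskr" source. The sketch below is a minimal illustration using only System.Diagnostics. The local ActivitySource and the span names "consume" and "dispatch" are stand-ins invented for the example; the names Ratatoskr actually uses are not listed in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Operation names of spans, recorded as each one stops.
var stopped = new List<string>();

using var listener = new ActivityListener
{
    // Subscribe only to the source name documented in Setup.
    ShouldListenTo = source => source.Name == "Ratatoskr",
    // Record every span; a real application would normally sample.
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = activity =>
    {
        stopped.Add(activity.OperationName);
        Console.WriteLine(
            $"{activity.Kind} {activity.OperationName} parent={activity.ParentSpanId}");
    },
};
ActivitySource.AddActivityListener(listener);

// Stand-in for the library (assumption): a local source with the same name
// and made-up span names, so the sketch produces output without a broker.
using var standIn = new ActivitySource("Ratatoskr");
using (standIn.StartActivity("consume", ActivityKind.Consumer))
using (standIn.StartActivity("dispatch", ActivityKind.Internal))
{
    // Handler work would run here; "dispatch" is a child of "consume".
}
```

The inner span stops first, so the listener sees the dispatch span before the consume span that encloses it.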
W3C Trace Context Propagation
Trace context is automatically propagated through messages:
- On publish, Activity.Current.Id and TraceStateString are injected into MessageProperties as traceparent and tracestate
- On consume, the trace context is extracted and used to create a child activity, continuing the distributed trace across services
This means your existing APM tools (Jaeger, Zipkin, Azure Monitor, Datadog, etc.) will show end-to-end traces spanning publish → transport → consume → handle.
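The inject and extract steps can be sketched with the standard System.Diagnostics APIs. This is an illustration of the mechanism, not Ratatoskr's implementation: the dictionary stands in for MessageProperties, and the source name "Example.Messaging" and the span names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// W3C is the default id format on modern .NET; set explicitly for clarity.
Activity.DefaultIdFormat = ActivityIdFormat.W3C;

// Without a listener StartActivity returns null, so sample everything.
using var listener = new ActivityListener
{
    ShouldListenTo = source => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);

using var source = new ActivitySource("Example.Messaging"); // hypothetical name

// Stand-in for MessageProperties (assumption): a plain header dictionary.
var headers = new Dictionary<string, string?>();

string publisherTraceId;
using (var publish = source.StartActivity("publish", ActivityKind.Producer))
{
    // Inject: in W3C format, Activity.Id is already a valid traceparent value.
    headers["traceparent"] = Activity.Current?.Id;
    headers["tracestate"] = Activity.Current?.TraceStateString;
    publisherTraceId = publish!.TraceId.ToString();
}

// Extract: parse the headers back into a parent context on the consumer side.
headers.TryGetValue("traceparent", out var traceParent);
headers.TryGetValue("tracestate", out var traceState);
if (!ActivityContext.TryParse(traceParent, traceState, out var parentContext))
{
    parentContext = default; // no usable context: the consumer starts a new trace
}

string consumerTraceId;
using (var consume = source.StartActivity("consume", ActivityKind.Consumer, parentContext))
{
    consumerTraceId = consume!.TraceId.ToString();
}

Console.WriteLine($"same trace: {publisherTraceId == consumerTraceId}");
```

Because the consumer span is started with the parsed parent context, both spans share one trace id, which is what lets an APM backend stitch them into a single trace.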
Metrics Reference
All metrics are emitted on the "Ratatoskr" meter.
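For a quick local check that measurements are flowing, a MeterListener can subscribe to the meter by name. The sketch below is a minimal example under assumptions: the local Meter and counter are stand-ins that reuse a documented metric name so the code runs on its own, and only long-valued instruments are handled (a double callback would be registered the same way).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Running totals per instrument name, filled in by the callback below.
var totals = new Dictionary<string, long>();

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Enable only instruments that belong to the "Ratatoskr" meter.
    if (instrument.Meter.Name == "Ratatoskr")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
// Callbacks are registered per numeric type; this one handles long values.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    totals[instrument.Name] = totals.GetValueOrDefault(instrument.Name) + value;
});
listener.Start();

// Stand-in for the library (assumption): a local meter with the same name.
using var standIn = new Meter("Ratatoskr");
var retries = standIn.CreateCounter<long>("ratatoskr.retry.messages", unit: "{message}");
retries.Add(1);
retries.Add(2);

foreach (var (name, total) in totals)
{
    Console.WriteLine($"{name} = {total}");
}
```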
Standard OpenTelemetry Messaging Metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| messaging.client.operation.duration | Histogram | s | Duration of messaging operation initiated by a producer or consumer client |
| messaging.client.sent.messages | Counter | {message} | Number of messages producer attempted to send to the broker |
| messaging.client.consumed.messages | Counter | {message} | Number of messages delivered to the application |
| messaging.process.duration | Histogram | s | Duration of processing operation |
Lag Metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| ratatoskr.receive.lag | Histogram | s | Time from message creation to reception |
| ratatoskr.process.lag | Histogram | s | Time from message creation to processing completion |
Reliability Metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| ratatoskr.retry.messages | Counter | {message} | Messages scheduled for retry |
| ratatoskr.dead_letter.messages | Counter | {message} | Messages sent to DLQ |
Outbox Metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| ratatoskr.outbox.process.count | Counter | {message} | Messages processed from the outbox (tagged status=success\|failure) |
| ratatoskr.outbox.poison.count | Counter | {message} | Outbox messages marked as poisoned |
| ratatoskr.outbox.process.duration | Histogram | s | Duration of outbox processing batch |
| ratatoskr.outbox.batch.size | Histogram | {message} | Messages picked up per outbox batch |
| ratatoskr.outbox.pending.messages | Observable gauge | 1 | Rows still in the outbox (ProcessedAt is null, not poisoned); tagged db_context |
| ratatoskr.outbox.poisoned.messages | Observable gauge | 1 | Rows still in the outbox, poisoned and not yet processed; tagged db_context |
Inbox Metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| ratatoskr.inbox.deliver.count | Counter | {message} | Inbox handler deliveries attempted (tagged status) |
| ratatoskr.inbox.poison.count | Counter | {message} | Inbox handler statuses marked as poisoned |
| ratatoskr.inbox.process.duration | Histogram | s | Duration of inbox processing batch |
| ratatoskr.inbox.batch.size | Histogram | {message} | Handler statuses picked up per inbox batch |
| ratatoskr.inbox.pending.statuses | Observable gauge | 1 | Handler status rows not completed and not poisoned; tagged db_context |
| ratatoskr.inbox.poisoned.statuses | Observable gauge | 1 | Handler status rows poisoned and not completed; tagged db_context |
Note
EF Core backlog gauges: When you register durability with AddEfCoreDurability, the four observable gauges above (pending and poisoned, for both outbox and inbox) are registered. If only the outbox or only the inbox is enabled for a context, all four gauge names are still exposed and the gauges for the disabled side stay at zero. Values are refreshed in the background on a configurable interval (default 30 seconds) using no-tracking COUNT queries, each with its own cancellation timeout (default 5 seconds), so a metric scrape reads the cached value instead of querying the database. Override the defaults with WithMetricsPollingInterval and WithMetricsQueryTimeout on the durability builder. The db_context tag is the DbContext type's full name (for example MyApp.OrderDbContext).
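The cached-gauge pattern described in this note can be sketched as follows. This is an illustration of the general technique, not Ratatoskr's code: the meter name, gauge name, RefreshAsync, and FakeCount are all hypothetical, and FakeCount stands in for the no-tracking COUNT query.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading;
using System.Threading.Tasks;

// Last known backlog size; the gauge callback only ever reads this value.
long cachedPending = 0;

using var meter = new Meter("Example.Backlog"); // hypothetical meter name
meter.CreateObservableGauge<long>(
    "example.outbox.pending.messages", // hypothetical gauge name
    () => new Measurement<long>(
        Interlocked.Read(ref cachedPending),
        new KeyValuePair<string, object?>("db_context", "MyApp.OrderDbContext")));

// One refresh cycle: run the count under its own timeout and keep the
// previous value if the query is cancelled.
async Task RefreshAsync(
    Func<CancellationToken, Task<long>> countPending,
    TimeSpan queryTimeout)
{
    using var cts = new CancellationTokenSource(queryTimeout);
    try
    {
        var count = await countPending(cts.Token);
        Interlocked.Exchange(ref cachedPending, count);
    }
    catch (OperationCanceledException)
    {
        // Timed out: the gauge keeps reporting the previous value.
    }
}

// Stand-in for a COUNT query against the outbox table (assumption).
Task<long> FakeCount(CancellationToken ct) => Task.FromResult(42L);

await RefreshAsync(FakeCount, TimeSpan.FromSeconds(5));
Console.WriteLine(Interlocked.Read(ref cachedPending));
```

In a real service a background loop would call the refresh on the polling interval. The gauge callback itself never touches the database, which keeps scrapes cheap regardless of how often they run.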
Cleanup Metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| ratatoskr.outbox.cleanup.count | Counter | {message} | Processed outbox messages deleted by cleanup |
| ratatoskr.outbox.cleanup.duration | Histogram | s | Duration of outbox cleanup operation |
| ratatoskr.inbox.cleanup.status.count | Counter | {status} | Completed inbox handler statuses deleted by cleanup |
| ratatoskr.inbox.cleanup.message.count | Counter | {message} | Orphaned inbox messages deleted by cleanup |
| ratatoskr.inbox.cleanup.duration | Histogram | s | Duration of inbox cleanup operation |
Distributed Lock Metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| ratatoskr.lock.acquisition.failure | Counter | {attempt} | Failed lock acquisitions |
| ratatoskr.lock.lost | Counter | {event} | Lock losses during processing |
Message Activity Observers
IMessageActivityObserver implementations are notified at pipeline stages for custom instrumentation:
| Stage | When |
|---|---|
| Published | After each send attempt during PublishDirectAsync |
| Sent | After bytes are sent to the transport |
| Received | When the consumer receives a message from the transport |
| Dispatched | After handler invocation completes |
| OutboxStaged | When a message is serialized into an outbox entity during SaveChanges |
| OutboxSent | When the outbox processor sends a message to the transport |
| OutboxPoisoned | When an outbox message exceeds max retries |
| InboxQueued | When a message is accepted into the inbox |
| InboxDispatched | When an inbox handler delivery is attempted (success or failure) |
| InboxPoisoned | When an inbox handler status exceeds max retries |
Note
Observers are designed for instrumentation and testing — not for reliable side effects. Observer exceptions are caught and logged at Warning level. They never affect the message pipeline.
The Ratatoskr.Testing package uses IMessageActivityObserver internally to power MessageTrackingSession. See Testing for details.
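The isolation guarantee in the note (observer exceptions are caught, logged, and never reach the pipeline) can be illustrated with a small sketch. It does not use the real IMessageActivityObserver interface, whose members are not shown in this document: observers are modelled as plain delegates, and Notify and the in-memory warning list are hypothetical stand-ins for the library's dispatch and logging.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins (assumptions): observers as delegates taking a stage name,
// and a list in place of a logger writing at Warning level.
var warnings = new List<string>();
var seen = new List<string>();

var observers = new List<Action<string>>
{
    stage => seen.Add(stage),
    stage => throw new InvalidOperationException("observer bug"),
    stage => seen.Add(stage + " (second)"),
};

// Notify every observer; a failing observer is logged and skipped so the
// remaining observers and the message pipeline continue unaffected.
void Notify(string stage)
{
    foreach (var observer in observers)
    {
        try
        {
            observer(stage);
        }
        catch (Exception ex)
        {
            warnings.Add($"Observer failed at {stage}: {ex.Message}");
        }
    }
}

Notify("Published");
Console.WriteLine($"{seen.Count} observers ran, {warnings.Count} warning(s) logged");
```

Because failures are swallowed this way, an observer cannot signal a problem back to the publisher or consumer, which is why observers are unsuitable for side effects that must happen reliably.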
Example Prometheus Queries
Monitor key health indicators with these PromQL queries:

    # Outbox poison rate (should be 0)
    rate(ratatoskr_outbox_poison_count_total[5m])
    # Inbox poison rate (should be 0)
    rate(ratatoskr_inbox_poison_count_total[5m])
    # Outbox processing throughput
    rate(ratatoskr_outbox_process_count_total{status="success"}[5m])
    # Average receive lag
    rate(ratatoskr_receive_lag_sum[5m]) / rate(ratatoskr_receive_lag_count[5m])
    # Lock acquisition failures (infrastructure issues)
    rate(ratatoskr_lock_acquisition_failure_total[5m])
    # P99 message processing duration
    histogram_quantile(0.99, rate(messaging_process_duration_bucket[5m]))
    # Outbox backlog depth (gauges; names depend on your Prometheus/OTel mapping)
    ratatoskr_outbox_pending_messages
    ratatoskr_outbox_poisoned_messages
    # Inbox backlog depth
    ratatoskr_inbox_pending_statuses
    ratatoskr_inbox_poisoned_statuses

What's Next
- Operations — Alert thresholds and monitoring runbook
- Testing — Using message tracking sessions for test assertions
- Architecture — Where tracing spans are created in the pipeline