Operations
This page covers day-to-day operational concerns: monitoring, handling failures, data retention, distributed lock providers, and deployment considerations.
Monitoring
Key Metrics
Ratatoskr exposes OpenTelemetry metrics via System.Diagnostics.Metrics. See Observability for the complete metrics reference and setup.
| Metric | Type | Alert Threshold |
|---|---|---|
| ratatoskr.outbox.process.count | Counter | High failure rate |
| ratatoskr.outbox.poison.count | Counter | Any increment |
| ratatoskr.outbox.batch.size | Histogram | Sustained high values (backlog) |
| ratatoskr.outbox.process.duration | Histogram | > 30s |
| ratatoskr.inbox.deliver.count | Counter | High failure rate |
| ratatoskr.inbox.poison.count | Counter | Any increment |
| ratatoskr.inbox.batch.size | Histogram | Sustained high values (backlog) |
| ratatoskr.inbox.process.duration | Histogram | > 30s |
| ratatoskr.lock.acquisition.failure | Counter | Sustained failures |
| ratatoskr.lock.lost | Counter | Any increment |
| ratatoskr.receive.lag | Histogram | Growing lag |
| ratatoskr.process.lag | Histogram | Growing lag |
Critical Alerts
- Poison count > 0: Any poisoned message requires investigation
- Lock lost: Indicates infrastructure issues (database connection drops, network partitions)
- Growing backlog: If batch size consistently equals BatchSize, the processor cannot keep up
- Receive/process lag trending up: Throughput is insufficient for message volume
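Alerting on these signals normally belongs in the metrics backend, but the "any increment" rule for the poison counters can also be sketched in-process with the standard MeterListener API from System.Diagnostics.Metrics. The class name PoisonAlerts is hypothetical, and the sketch assumes the counters are published as Counter<long>; neither detail is confirmed by this page.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical helper: writes a line to stderr whenever a poison counter increments.
public static class PoisonAlerts
{
    // Keep the returned listener alive for the lifetime of the application;
    // disposing it stops the callbacks.
    public static MeterListener Start()
    {
        var listener = new MeterListener();

        listener.InstrumentPublished = (instrument, l) =>
        {
            // Instrument names are taken from the Key Metrics table.
            if (instrument.Name is "ratatoskr.outbox.poison.count"
                                or "ratatoskr.inbox.poison.count")
            {
                l.EnableMeasurementEvents(instrument);
            }
        };

        // Assumption: the counters are Counter<long>. If they are Counter<int>,
        // register SetMeasurementEventCallback<int> instead.
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            Console.Error.WriteLine($"ALERT: {instrument.Name} incremented by {value}");
        });

        listener.Start();
        return listener;
    }
}
```

Call PoisonAlerts.Start() once during startup and hold the returned listener in a long-lived field or singleton.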
Health Checks
Ratatoskr exposes ASP.NET Core health checks intended for Kubernetes readiness probes. They should not be mapped to liveness probes, for the reason given under Probes.
Register them explicitly per component:
services.AddHealthChecks()
.AddRatatoskrRabbitMq()
.AddRatatoskrOutbox<AppDbContext>()
.AddRatatoskrInbox<AppDbContext>();
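Once registered, the checks can be exposed on separate endpoints filtered by tag, using the standard ASP.NET Core health check middleware. The following is a minimal sketch: the paths /health/ready and /health/live are illustrative assumptions, and the Ratatoskr registrations shown above are left out for brevity.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Add the Ratatoskr checks to this builder as shown above.
builder.Services.AddHealthChecks();

var app = builder.Build();

// Readiness endpoint: runs only the checks tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

// Liveness endpoint: runs no checks at all, so it reports healthy
// whenever the process is able to answer HTTP requests.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false
});

app.Run();
```

Point the Kubernetes readinessProbe at the first path and the livenessProbe at the second, so that a broker or database outage takes the pod out of rotation without restarting it.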
Probes:
- Readiness probe ("ready" tag): By default, all Ratatoskr components include the "ready" tag. This keeps Kubernetes from sending traffic to a pod whose RabbitMQ consumer is disconnected, or whose outbox/inbox processors have crashed or failed to complete a loop within the unhealthyThreshold (default: 2 minutes).
- Liveness probe ("live" tag): Do not map Ratatoskr health checks to liveness probes. If a downstream service is down and the processor is backing off, restarting the container will not fix it. Liveness probes should only check fundamental process health.
Handling Poisoned Messages
Investigation
Poisoned messages have exhausted their retry budget and remain in the database for manual investigation.
Outbox (PostgreSQL):
SELECT "Id", "TransportName", "ErrorCount", "Error", "CreatedAt", "FailedAt"
FROM "OutboxMessages"
WHERE "IsPoisoned" = true
ORDER BY "FailedAt" DESC;
Outbox (SQL Server):
SELECT [Id], [TransportName], [ErrorCount], [Error], [CreatedAt], [FailedAt]
FROM [OutboxMessages]
WHERE [IsPoisoned] = 1
ORDER BY [FailedAt] DESC;
Inbox (PostgreSQL):
SELECT s."Id", s."MessageId", s."HandlerKey", s."ErrorCount", s."LastError", s."CreatedAt",
m."SerializedProperties"
FROM "InboxHandlerStatuses" s
JOIN "InboxMessages" m ON m."Id" = s."MessageId"
WHERE s."IsPoisoned" = true
ORDER BY s."CreatedAt" DESC;
Inbox (SQL Server):
SELECT s.[Id], s.[MessageId], s.[HandlerKey], s.[ErrorCount], s.[LastError], s.[CreatedAt],
m.[SerializedProperties]
FROM [InboxHandlerStatuses] s
JOIN [InboxMessages] m ON m.[Id] = s.[MessageId]
WHERE s.[IsPoisoned] = 1
ORDER BY s.[CreatedAt] DESC;
Manual Retry
Reset a poisoned message's state to retry it:
Outbox (PostgreSQL):
UPDATE "OutboxMessages"
SET "IsPoisoned" = false,
"ErrorCount" = 0,
"NextAttemptAt" = NULL,
"ProcessingStartedAt" = NULL,
"Version" = "Version" + 1
WHERE "Id" = '<message-id>';
Outbox (SQL Server):
UPDATE [OutboxMessages]
SET [IsPoisoned] = 0,
[ErrorCount] = 0,
[NextAttemptAt] = NULL,
[ProcessingStartedAt] = NULL,
[Version] = [Version] + 1
WHERE [Id] = '<message-id>';
Inbox (PostgreSQL):
UPDATE "InboxHandlerStatuses"
SET "IsPoisoned" = false,
"ErrorCount" = 0,
"NextAttemptAt" = NULL,
"ProcessingStartedAt" = NULL,
"Version" = "Version" + 1
WHERE "Id" = '<status-id>';
Inbox (SQL Server):
UPDATE [InboxHandlerStatuses]
SET [IsPoisoned] = 0,
[ErrorCount] = 0,
[NextAttemptAt] = NULL,
[ProcessingStartedAt] = NULL,
[Version] = [Version] + 1
WHERE [Id] = '<status-id>';
The processor picks up the message on its next polling cycle.
Data Retention
Automatic Cleanup
Configure retention on the outbox and inbox builders. The cleanup service runs as a background IHostedService and deletes old processed messages in batches. Poisoned messages are never auto-deleted.
bus.AddEfCoreDurability<OrderDbContext>(d => d
.UseOutbox(outbox => outbox
.WithRetention(TimeSpan.FromDays(7))
.WithCleanupInterval(TimeSpan.FromHours(1))
.WithCleanupBatchSize(10_000))
.UseInbox(inbox => inbox
.WithRetention(TimeSpan.FromDays(30))));
The inbox cleanup also removes orphaned InboxMessages rows with no remaining handler statuses.
Note
The cleanup service waits one full CleanupInterval (default: 1 hour) before its first run. Use the manual SQL below for initial cleanup on large existing tables.
WithoutBackgroundProcessing() disables the cleanup service even when WithRetention() is configured.
In multi-instance deployments, cleanup services use a distributed lock to ensure only one instance runs cleanup per cycle. Other instances skip the cycle and try again at the next interval. This reduces unnecessary database I/O — cleanup is idempotent, so concurrent execution would be safe but wasteful.
Manual Cleanup
Outbox (PostgreSQL):
DELETE FROM "OutboxMessages"
WHERE "ProcessedAt" IS NOT NULL
AND "ProcessedAt" < NOW() - INTERVAL '7 days';
DELETE FROM "OutboxMessages"
WHERE "IsPoisoned" = true
AND "FailedAt" < NOW() - INTERVAL '30 days';
Outbox (SQL Server):
DELETE FROM [OutboxMessages]
WHERE [ProcessedAt] IS NOT NULL
AND [ProcessedAt] < DATEADD(DAY, -7, GETUTCDATE());
DELETE FROM [OutboxMessages]
WHERE [IsPoisoned] = 1
AND [FailedAt] < DATEADD(DAY, -30, GETUTCDATE());
Inbox (PostgreSQL):
DELETE FROM "InboxHandlerStatuses"
WHERE "CompletedAt" IS NOT NULL
AND "CompletedAt" < NOW() - INTERVAL '30 days';
DELETE FROM "InboxHandlerStatuses"
WHERE "IsPoisoned" = true
AND "CreatedAt" < NOW() - INTERVAL '30 days';
DELETE FROM "InboxMessages"
WHERE NOT EXISTS (
SELECT 1 FROM "InboxHandlerStatuses"
WHERE "MessageId" = "InboxMessages"."Id"
);
Inbox (SQL Server):
DELETE FROM [InboxHandlerStatuses]
WHERE [CompletedAt] IS NOT NULL
AND [CompletedAt] < DATEADD(DAY, -30, GETUTCDATE());
DELETE FROM [InboxHandlerStatuses]
WHERE [IsPoisoned] = 1
AND [CreatedAt] < DATEADD(DAY, -30, GETUTCDATE());
DELETE FROM [InboxMessages]
WHERE NOT EXISTS (
SELECT 1 FROM [InboxHandlerStatuses]
WHERE [MessageId] = [InboxMessages].[Id]
);
For large tables, use batched deletes:
PostgreSQL:
DELETE FROM "OutboxMessages"
WHERE "Id" IN (
SELECT "Id" FROM "OutboxMessages"
WHERE "ProcessedAt" IS NOT NULL
AND "ProcessedAt" < NOW() - INTERVAL '7 days'
LIMIT 10000
);
SQL Server:
DELETE TOP (10000) FROM [OutboxMessages]
WHERE [ProcessedAt] IS NOT NULL
AND [ProcessedAt] < DATEADD(DAY, -7, GETUTCDATE());
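A single batched statement removes at most one batch, so it has to be repeated until nothing is left to delete. The following sketch drives the PostgreSQL statement above through EF Core's ExecuteSqlRawAsync, which returns the number of affected rows. The class name OutboxManualCleanup and the 200 ms pause are illustrative assumptions, and the raw string literal requires C# 11 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Hypothetical helper for a one-off cleanup of a large outbox table.
public static class OutboxManualCleanup
{
    // Same batched DELETE as the PostgreSQL example above.
    private const string DeleteBatchSql = """
        DELETE FROM "OutboxMessages"
        WHERE "Id" IN (
            SELECT "Id" FROM "OutboxMessages"
            WHERE "ProcessedAt" IS NOT NULL
              AND "ProcessedAt" < NOW() - INTERVAL '7 days'
            LIMIT 10000
        );
        """;

    // Repeats the batch until no rows are deleted; returns the total removed.
    public static async Task<int> DeleteProcessedAsync(
        DbContext db, CancellationToken ct = default)
    {
        var total = 0;
        while (true)
        {
            // Each call runs as its own statement, so locks are held per batch.
            var deleted = await db.Database.ExecuteSqlRawAsync(DeleteBatchSql, ct);
            if (deleted == 0)
            {
                return total;
            }

            total += deleted;

            // Short pause between batches to leave room for regular traffic.
            await Task.Delay(TimeSpan.FromMilliseconds(200), ct);
        }
    }
}
```

For SQL Server, the same loop works with the DELETE TOP (10000) statement in place of the PostgreSQL one.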
Distributed Lock Provider
Ratatoskr uses Medallion.Threading for distributed locks. Choose a provider based on your deployment topology.
Single Machine / Development
services.AddSingleton<IDistributedLockProvider>(_ =>
new FileDistributedSynchronizationProvider(
new DirectoryInfo("/var/locks/ratatoskr")));
Multi-Instance (PostgreSQL)
services.AddSingleton<IDistributedLockProvider>(_ =>
new PostgresDistributedSynchronizationProvider(connectionString));
Multi-Instance (SQL Server)
services.AddSingleton<IDistributedLockProvider>(_ =>
new SqlDistributedSynchronizationProvider(connectionString));
Multi-Instance (Redis)
services.AddSingleton<IDistributedLockProvider>(sp =>
new RedisDistributedSynchronizationProvider(
sp.GetRequiredService<IDatabase>()));
Important
File-based locks do not work across machines. For horizontally scaled deployments, use a database or Redis-backed provider. Without a shared lock provider, multiple processors may run concurrently, causing duplicate message processing.
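The Redis registration earlier in this section resolves an IDatabase from the container, so one has to be registered as well. This is a minimal sketch using StackExchange.Redis; the extension method name AddRedisDatabase and the redisConnectionString parameter are assumptions made for illustration.

```csharp
using Microsoft.Extensions.DependencyInjection;
using StackExchange.Redis;

// Hypothetical helper that makes IDatabase resolvable for the lock provider.
public static class RedisRegistration
{
    public static IServiceCollection AddRedisDatabase(
        this IServiceCollection services, string redisConnectionString)
    {
        // One multiplexer is shared across the whole application.
        services.AddSingleton<IConnectionMultiplexer>(_ =>
            ConnectionMultiplexer.Connect(redisConnectionString));

        // IDatabase is a lightweight handle over the shared multiplexer.
        services.AddSingleton<IDatabase>(sp =>
            sp.GetRequiredService<IConnectionMultiplexer>().GetDatabase());

        return services;
    }
}
```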
Lock Names
Lock names are auto-generated per DbContext type:
- Outbox processor: OutboxProcessor_{DbContextTypeName}
- Inbox processor: InboxProcessor_{DbContextTypeName}
- Outbox cleanup: OutboxCleanup_{DbContextTypeName}
- Inbox cleanup: InboxCleanup_{DbContextTypeName}
Override with WithLockName("custom-name") or WithCleanupLockName("custom-name") if needed.
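As a rough sketch, an override might look like the fragment below. It assumes that WithLockName and WithCleanupLockName are called on the same outbox and inbox builders as the retention settings earlier on this page; this page does not confirm their exact placement, and the lock names are made up for illustration.

```csharp
// Assumption: both methods live on the outbox/inbox builders used for
// WithRetention(...) earlier on this page. Lock names are illustrative.
bus.AddEfCoreDurability<OrderDbContext>(d => d
    .UseOutbox(outbox => outbox
        .WithLockName("orders-outbox")
        .WithCleanupLockName("orders-outbox-cleanup"))
    .UseInbox(inbox => inbox
        .WithLockName("orders-inbox")
        .WithCleanupLockName("orders-inbox-cleanup")));
```

Every instance that shares a database must use the same names, otherwise the instances stop excluding each other.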
Disaster Recovery
Stuck Messages
If a processor crashes mid-batch, messages may be left in "processing" state. Stuck message detection automatically recovers them after the configured threshold (default: 5 minutes).
To manually clear stuck messages:
PostgreSQL:
-- Outbox
UPDATE "OutboxMessages"
SET "ProcessingStartedAt" = NULL, "Version" = "Version" + 1
WHERE "ProcessingStartedAt" IS NOT NULL
AND "ProcessedAt" IS NULL AND "IsPoisoned" = false;
-- Inbox
UPDATE "InboxHandlerStatuses"
SET "ProcessingStartedAt" = NULL, "Version" = "Version" + 1
WHERE "ProcessingStartedAt" IS NOT NULL
AND "CompletedAt" IS NULL AND "IsPoisoned" = false;
SQL Server:
-- Outbox
UPDATE [OutboxMessages]
SET [ProcessingStartedAt] = NULL, [Version] = [Version] + 1
WHERE [ProcessingStartedAt] IS NOT NULL
AND [ProcessedAt] IS NULL AND [IsPoisoned] = 0;
-- Inbox
UPDATE [InboxHandlerStatuses]
SET [ProcessingStartedAt] = NULL, [Version] = [Version] + 1
WHERE [ProcessingStartedAt] IS NOT NULL
AND [CompletedAt] IS NULL AND [IsPoisoned] = 0;
Processor Not Running
If no processor is picking up messages:
- Check that WithoutBackgroundProcessing() is not called in production
- Check distributed lock acquisition: another instance may hold the lock. Monitor ratatoskr.lock.acquisition.failure
- Check database connectivity
- Check logs for OutboxProcessor / InboxProcessor at Warning and Error levels
RabbitMQ Consumer Disconnection
The RabbitMqConsumer automatically reconnects with exponential backoff (1s to 30s with jitter). If persistently disconnected:
- Check RabbitMQ connectivity and credentials
- Check consumer logs for error details
- Verify queue and exchange topology matches configuration
Graceful Shutdown
On SIGTERM or application shutdown:
- The OutboxProcessor and InboxProcessor stop accepting new batches and wait for the current batch to complete
- The RabbitMqConsumer stops consuming and waits for in-flight messages to be acknowledged
- If a handler is running during inbox shutdown, the CancellationToken is triggered. The attempt is not counted as a failure; it is recovered by stuck message detection on next startup
For rolling deployments, ensure StuckMessageThreshold is longer than your shutdown grace period.
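On the application side, the grace period is governed by the .NET generic host's HostOptions.ShutdownTimeout, along with whatever the orchestrator allows. The following sketch sets it explicitly; the extension method name and the 60-second value are illustrative assumptions, chosen to stay well below the 5-minute default stuck-message threshold mentioned above.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Hypothetical helper that pins the host's shutdown grace period.
public static class ShutdownConfiguration
{
    public static IServiceCollection AddShutdownGracePeriod(
        this IServiceCollection services)
    {
        services.Configure<HostOptions>(options =>
        {
            // Time hosted services get to finish their current work after a
            // stop signal. Keep this shorter than StuckMessageThreshold.
            options.ShutdownTimeout = TimeSpan.FromSeconds(60);
        });

        return services;
    }
}
```

In Kubernetes, the pod's terminationGracePeriodSeconds should be at least as long as this timeout, or the container may be killed before the current batch completes.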
EF Core Migrations
When upgrading Ratatoskr versions, the outbox/inbox database schema may change. Generate a new migration after updating the package:
dotnet ef migrations add UpgradeRatatoskr
dotnet ef database update
Review the generated migration to understand schema changes before applying to production.
Deployment Safety
Rolling Deployment Checklist
Before deploying a new version that changes handler configuration:
- [ ] Handler keys stable: If renaming a handler key, use legacy keys to drain in-flight messages. See Inbox: Handler Key Renaming.
- [ ] Message types backward-compatible: Only additive field changes. No renames or removals. See Architecture: Schema Evolution.
- [ ] EF Core migrations applied: Run dotnet ef migrations add and dotnet ef database update before deploying the new application version.
- [ ] Monitoring in place: Verify that the ratatoskr.outbox.poison.count and ratatoskr.inbox.poison.count counters are being collected. A spike after deployment indicates a compatibility issue.
Monitoring After Deployment
After deploying a new version, monitor these signals for 15-30 minutes:
| Signal | What it means | Action |
|---|---|---|
| ratatoskr.inbox.poison.count spike | Handler key mismatch or deserialization failure | Rollback or investigate poisoned rows |
| ratatoskr.outbox.poison.count spike | Transport misconfiguration or serialization mismatch | Rollback or investigate poisoned rows |
| Health check unhealthy | Processor or consumer not running | Check logs for startup errors |
| ratatoskr.lock.acquisition.failure spike | Distributed lock contention from old instances | Wait for old instances to drain |
What's Next
- Observability — Complete metrics reference and setup
- Configuration Reference — All configuration options at a glance
- Outbox — Outbox configuration and processing details
- Inbox — Inbox configuration and processing details