Monitoring

Ratchet provides two monitoring channels: Micrometer metrics for quantitative dashboards and alerts, and the event system for real-time programmatic observation.

Micrometer Integration

Setup

Add the Micrometer module:

<dependency>
    <groupId>run.ratchet</groupId>
    <artifactId>ratchet-micrometer</artifactId>
    <version>0.1.0-SNAPSHOT</version>
</dependency>

Ensure a MeterRegistry is available as a CDI bean. If you're running on a framework like Quarkus or Spring Boot (via CDI bridge), this is typically provided automatically. Otherwise, produce one:

@Produces
@ApplicationScoped
public MeterRegistry meterRegistry() {
    return new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
}

The MicrometerMetricsCollector is annotated @Alternative @Priority(1000), so it automatically overrides the default NoOpMetricsCollector when the module is on the classpath.

Published Metrics

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| ratchet.jobs.started | Counter | type, priority | Jobs that began execution |
| ratchet.jobs.completed | Counter | type | Jobs that finished successfully |
| ratchet.jobs.failed | Counter | type, exception | Jobs that failed (exception class name as tag) |
| ratchet.jobs.duration | Timer | type | Job execution time (recorded in milliseconds, exported to Prometheus in seconds) |

The type tag corresponds to JobType (e.g., SINGLE, RECURRING, BATCH_CHILD). The priority tag corresponds to JobPriority (e.g., NORMAL, HIGH, CRITICAL).
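Because these tags are attached to every sample, they can be used directly in queries. For example, assuming Micrometer's default Prometheus naming (dots become underscores and counters gain a _total suffix), the start rate of CRITICAL jobs is:

rate(ratchet_jobs_started_total{priority="CRITICAL"}[5m])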

Grafana Dashboard Queries

Job throughput (Prometheus):

rate(ratchet_jobs_completed_total[5m])

Failure rate:

rate(ratchet_jobs_failed_total[5m])
/ (rate(ratchet_jobs_completed_total[5m]) + rate(ratchet_jobs_failed_total[5m]))

P95 execution time:

histogram_quantile(0.95, rate(ratchet_jobs_duration_seconds_bucket[5m]))

Failures by exception type:

topk(5, sum by (exception) (rate(ratchet_jobs_failed_total[5m])))
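To operationalize the failure-rate query as an alert (matching the 5% threshold in Alerting Recommendations below), a Prometheus alerting rule might look like the following sketch; the rule name and labels are illustrative:

groups:
  - name: ratchet
    rules:
      - alert: RatchetHighJobFailureRate
        expr: |
          rate(ratchet_jobs_failed_total[5m])
            / (rate(ratchet_jobs_completed_total[5m]) + rate(ratchet_jobs_failed_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ratchet job failure rate above 5% for 5 minutes"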

Custom MetricsCollector

If you need different metric names, additional tags, or a non-Micrometer backend, implement the MetricsCollector SPI directly:

public interface MetricsCollector {
    void jobStarted(UUID jobId, JobType type, JobPriority priority);
    void jobCompleted(UUID jobId, JobType type, long executionTimeMs);
    void jobFailed(UUID jobId, JobType type, Throwable cause, int attempt);
}

Register your implementation as a CDI alternative with a higher priority than 1000 to override the Micrometer collector.
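For example, here is a minimal sketch of a collector that logs instead of publishing metrics; the class name, the priority value of 1100, and the SLF4J logger are illustrative choices:

@Alternative
@Priority(1100) // must be higher than the Micrometer collector's 1000
@ApplicationScoped
public class LoggingMetricsCollector implements MetricsCollector {

    private static final Logger log = LoggerFactory.getLogger(LoggingMetricsCollector.class);

    @Override
    public void jobStarted(UUID jobId, JobType type, JobPriority priority) {
        log.debug("job {} started (type={}, priority={})", jobId, type, priority);
    }

    @Override
    public void jobCompleted(UUID jobId, JobType type, long executionTimeMs) {
        log.debug("job {} completed in {}ms (type={})", jobId, executionTimeMs, type);
    }

    @Override
    public void jobFailed(UUID jobId, JobType type, Throwable cause, int attempt) {
        log.warn("job {} failed on attempt {} (type={})", jobId, attempt, type, cause);
    }
}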

Event-Based Monitoring

The event system provides fine-grained lifecycle notifications. Unlike metrics (which are aggregated counters/timers), events carry full context about individual job executions.

CDI Observers

@ApplicationScoped
public class JobMonitor {

    public void onStarted(@Observes JobStartedEvent event) {
        log.info("Job {} started on node {}", event.getJobId(), event.getNodeId());
    }

    public void onCompleted(@Observes JobCompletedEvent event) {
        log.info("Job {} completed in {}ms",
                event.getJobId(), event.getExecutionTimeMs());
    }

    public void onFailed(@Observes JobFailedEvent event) {
        log.error("Job {} failed (attempt {}): {}",
                event.getJobId(), event.getRetryAttempt(), event.getErrorMessage());
    }

    public void onRetrying(@Observes JobRetryingEvent event) {
        log.warn("Job {} retrying (attempt {}): {}",
                event.getJobId(), event.getRetryAttempt(), event.getErrorMessage());
    }

    public void onDlq(@Observes JobDlqEvent event) {
        // Alert: job exhausted all retries
        alertOps("Job " + event.getJobId() + " moved to DLQ: " + event.getErrorMessage());
    }
}

Programmatic Listeners

For dynamic registration (useful in frameworks or libraries that can't use CDI observers):

scheduler.addEventListener(event -> {
    if (event instanceof PerformanceMetricsEvent perf) {
        log.info("Poll cycle: claimed={}, executed={}, duration={}ms",
                perf.getClaimedCount(), perf.getExecutedCount(), perf.getDurationMs());
    }
});

Event Types

| Event | When fired |
| --- | --- |
| JobStartedEvent | Worker begins executing a job |
| JobCompletedEvent | Job finished successfully |
| JobFailedEvent | Job threw an exception |
| JobRetryingEvent | Job failed but will be retried |
| JobDlqEvent | Job exhausted retries, moved to dead-letter queue |
| JobCancelledEvent | Job was cancelled via API |
| BatchCompletingEvent | All children in a batch have finished |
| ChainStartedEvent | A chained job was triggered by its parent |
| WorkflowBranchTriggeredEvent | A conditional branch was activated |
| PerformanceMetricsEvent | End-of-cycle summary from the poller |

Health Checks

Node Heartbeat Check

Query the scheduler_node table to verify all expected nodes are alive:

SELECT node_id, last_heartbeat,
       TIMESTAMPDIFF(SECOND, last_heartbeat, NOW()) AS seconds_stale
FROM scheduler_node
WHERE last_heartbeat < NOW() - INTERVAL 2 MINUTE;

Any rows returned indicate stale nodes. For programmatic health checks:

@ApplicationScoped
public class RatchetHealthCheck {

    @Inject
    NodeStore nodeStore;

    public boolean isHealthy() {
        Instant cutoff = Instant.now().minus(Duration.ofMinutes(2));
        return nodeStore.findInactiveNodesSince(cutoff).isEmpty();
    }
}
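On a MicroProfile-compatible runtime such as Quarkus, the same logic can be exposed as a readiness probe. A sketch assuming the MicroProfile Health API is on the classpath; the check name is illustrative:

@Readiness
@ApplicationScoped
public class RatchetReadinessCheck implements HealthCheck {

    @Inject
    NodeStore nodeStore;

    @Override
    public HealthCheckResponse call() {
        Instant cutoff = Instant.now().minus(Duration.ofMinutes(2));
        boolean allAlive = nodeStore.findInactiveNodesSince(cutoff).isEmpty();
        return HealthCheckResponse.named("ratchet-nodes").status(allAlive).build();
    }
}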

Queue Depth Check

Monitor the number of pending jobs to detect backlog:

SELECT priority, COUNT(*) AS pending_count
FROM scheduler_job
WHERE status = 'PENDING'
GROUP BY priority;

Alert if CRITICAL or HIGH jobs have been pending for more than a few seconds.
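To feed queue depth into the same Micrometer dashboards, you can register a gauge backed by that query. A minimal sketch, assuming an injectable DataSource; the metric name ratchet.jobs.pending is illustrative:

@ApplicationScoped
public class QueueDepthGauge {

    @Inject
    DataSource dataSource;

    @Inject
    MeterRegistry registry;

    @PostConstruct
    void register() {
        // The gauge is sampled on every scrape, so keep the query cheap
        // (the status column should be indexed).
        Gauge.builder("ratchet.jobs.pending", this::countPending)
                .description("Jobs currently in PENDING status")
                .register(registry);
    }

    private double countPending() {
        try (var con = dataSource.getConnection();
             var stmt = con.createStatement();
             var rs = stmt.executeQuery(
                     "SELECT COUNT(*) FROM scheduler_job WHERE status = 'PENDING'")) {
            rs.next();
            return rs.getDouble(1);
        } catch (SQLException e) {
            return -1; // surface scrape problems as a sentinel instead of throwing
        }
    }
}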

DLQ Check

Jobs in the dead-letter queue need human attention:

SELECT COUNT(*) FROM scheduler_job WHERE status = 'DEAD_LETTER';

Wire this into your alerting system — a non-zero DLQ count means something failed permanently.

Alerting Recommendations

| Condition | Severity | Action |
| --- | --- | --- |
| DLQ count > 0 | Critical | Investigate and retry or archive |
| Failure rate > 5% (5 min window) | Warning | Check logs for systematic errors |
| P95 duration > 2x normal | Warning | Check for resource contention or slow dependencies |
| Node heartbeat stale > 2 min | Critical | Node may be down; check process health |
| Pending CRITICAL jobs > 30 s | Critical | Cluster may be overloaded |
| Pending queue depth growing | Warning | Scale up nodes or increase the thread pool |

See Also