# Monitoring
Ratchet provides two monitoring channels: Micrometer metrics for quantitative dashboards and alerts, and the event system for real-time programmatic observation.
## Micrometer Integration

### Setup
Add the Micrometer module:
```xml
<dependency>
    <groupId>run.ratchet</groupId>
    <artifactId>ratchet-micrometer</artifactId>
    <version>0.1.0-SNAPSHOT</version>
</dependency>
```
Ensure a MeterRegistry is available as a CDI bean. If you're running on a framework like Quarkus or Spring Boot (via CDI bridge), this is typically provided automatically. Otherwise, produce one:
```java
@Produces
@ApplicationScoped
public MeterRegistry meterRegistry() {
    return new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
}
```
The MicrometerMetricsCollector is annotated @Alternative @Priority(1000), so it automatically overrides the default NoOpMetricsCollector when the module is on the classpath.
### Published Metrics

| Metric | Type | Tags | Description |
|---|---|---|---|
| ratchet.jobs.started | Counter | type, priority | Jobs that began execution |
| ratchet.jobs.completed | Counter | type | Jobs that finished successfully |
| ratchet.jobs.failed | Counter | type, exception | Jobs that failed (exception class name as tag) |
| ratchet.jobs.duration | Timer | type | Job execution time (recorded in milliseconds; exported in the registry's base unit, e.g. seconds for Prometheus) |
The type tag corresponds to JobType (e.g., SINGLE, RECURRING, BATCH_CHILD). The priority tag corresponds to JobPriority (e.g., NORMAL, HIGH, CRITICAL).
### Grafana Dashboard Queries
Job throughput (Prometheus):
```promql
rate(ratchet_jobs_completed_total[5m])
```
Failure rate:
```promql
rate(ratchet_jobs_failed_total[5m])
  / (rate(ratchet_jobs_completed_total[5m]) + rate(ratchet_jobs_failed_total[5m]))
```
P95 execution time:
```promql
histogram_quantile(0.95, rate(ratchet_jobs_duration_seconds_bucket[5m]))
```
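Note that histogram_quantile needs bucketed data, and Micrometer timers only publish Prometheus _bucket series when percentile histograms are enabled. If the query above returns nothing, a MeterFilter like the following sketch can turn buckets on; registering the filter on your own registry is an assumption here, not something the module configures for you:

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;

// Enable histogram buckets for the job-duration timer so that
// histogram_quantile() has _bucket series to work with.
public final class DurationHistogramFilter {

    public static void apply(MeterRegistry registry) {
        registry.config().meterFilter(new MeterFilter() {
            @Override
            public DistributionStatisticConfig configure(Meter.Id id,
                    DistributionStatisticConfig config) {
                if ("ratchet.jobs.duration".equals(id.getName())) {
                    return DistributionStatisticConfig.builder()
                            .percentilesHistogram(true)
                            .build()
                            .merge(config);
                }
                return config;
            }
        });
    }
}
```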
Failures by exception type:
```promql
topk(5, sum by (exception) (rate(ratchet_jobs_failed_total[5m])))
```
### Custom MetricsCollector
If you need different metric names, additional tags, or a non-Micrometer backend, implement the MetricsCollector SPI directly:
```java
public interface MetricsCollector {
    void jobStarted(UUID jobId, JobType type, JobPriority priority);
    void jobCompleted(UUID jobId, JobType type, long executionTimeMs);
    void jobFailed(UUID jobId, JobType type, Throwable cause, int attempt);
}
```
Register your implementation as a CDI alternative with a higher priority than 1000 to override the Micrometer collector.
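As an illustration, here is a minimal sketch of such an alternative that keeps in-memory per-type counters. The class name and counter logic are invented for the example; the method signatures come from the SPI above, and the CDI annotations are shown in the jakarta namespace:

```java
import jakarta.annotation.Priority;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.Alternative;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Overrides the Micrometer collector: priority 2000 > 1000.
@Alternative
@Priority(2000)
@ApplicationScoped
public class InMemoryMetricsCollector implements MetricsCollector {

    private final ConcurrentHashMap<JobType, LongAdder> completed = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<JobType, LongAdder> failed = new ConcurrentHashMap<>();

    @Override
    public void jobStarted(UUID jobId, JobType type, JobPriority priority) {
        // No-op in this sketch; a real collector might track in-flight jobs here.
    }

    @Override
    public void jobCompleted(UUID jobId, JobType type, long executionTimeMs) {
        completed.computeIfAbsent(type, t -> new LongAdder()).increment();
    }

    @Override
    public void jobFailed(UUID jobId, JobType type, Throwable cause, int attempt) {
        failed.computeIfAbsent(type, t -> new LongAdder()).increment();
    }

    public long completedCount(JobType type) {
        LongAdder adder = completed.get(type);
        return adder == null ? 0 : adder.sum();
    }
}
```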
## Event-Based Monitoring
The event system provides fine-grained lifecycle notifications. Unlike metrics (which are aggregated counters/timers), events carry full context about individual job executions.
### CDI Observers
```java
@ApplicationScoped
public class JobMonitor {

    private static final Logger log = LoggerFactory.getLogger(JobMonitor.class);

    public void onStarted(@Observes JobStartedEvent event) {
        log.info("Job {} started on node {}", event.getJobId(), event.getNodeId());
    }

    public void onCompleted(@Observes JobCompletedEvent event) {
        log.info("Job {} completed in {}ms",
                event.getJobId(), event.getExecutionTimeMs());
    }

    public void onFailed(@Observes JobFailedEvent event) {
        log.error("Job {} failed (attempt {}): {}",
                event.getJobId(), event.getRetryAttempt(), event.getErrorMessage());
    }

    public void onRetrying(@Observes JobRetryingEvent event) {
        log.warn("Job {} retrying (attempt {}): {}",
                event.getJobId(), event.getRetryAttempt(), event.getErrorMessage());
    }

    public void onDlq(@Observes JobDlqEvent event) {
        // Alert: job exhausted all retries
        alertOps("Job " + event.getJobId() + " moved to DLQ: " + event.getErrorMessage());
    }

    private void alertOps(String message) {
        // Application-specific hook: page on-call, post to chat, etc.
    }
}
```
### Programmatic Listeners
For dynamic registration (useful in frameworks or libraries that can't use CDI observers):
```java
scheduler.addEventListener(event -> {
    if (event instanceof PerformanceMetricsEvent perf) {
        log.info("Poll cycle: claimed={}, executed={}, duration={}ms",
                perf.getClaimedCount(), perf.getExecutedCount(), perf.getDurationMs());
    }
});
```
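For example, a listener can bridge poll-cycle statistics into Micrometer, which the built-in collector does not do (it publishes per-job metrics only). In this sketch, registry is the MeterRegistry bean from the setup section, scheduler is the same instance as above, and the metric names are ours, not part of the ratchet-micrometer module (Counter and Timer are io.micrometer.core.instrument types, Duration is java.time):

```java
// Illustrative metric names; not published by ratchet-micrometer itself.
Counter claimed = registry.counter("ratchet.poll.claimed");
Counter executed = registry.counter("ratchet.poll.executed");
Timer cycleTimer = registry.timer("ratchet.poll.duration");

scheduler.addEventListener(event -> {
    if (event instanceof PerformanceMetricsEvent perf) {
        claimed.increment(perf.getClaimedCount());
        executed.increment(perf.getExecutedCount());
        cycleTimer.record(Duration.ofMillis(perf.getDurationMs()));
    }
});
```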
### Event Types

| Event | When fired |
|---|---|
| JobStartedEvent | Worker begins executing a job |
| JobCompletedEvent | Job finished successfully |
| JobFailedEvent | Job threw an exception |
| JobRetryingEvent | Job failed but will be retried |
| JobDlqEvent | Job exhausted retries, moved to dead-letter queue |
| JobCancelledEvent | Job was cancelled via API |
| BatchCompletingEvent | All children in a batch have finished |
| ChainStartedEvent | A chained job was triggered by its parent |
| WorkflowBranchTriggeredEvent | A conditional branch was activated |
| PerformanceMetricsEvent | End-of-cycle summary from the poller |
## Health Checks

### Node Heartbeat Check
Query the scheduler_node table to verify all expected nodes are alive (MySQL dialect shown):
```sql
SELECT node_id, last_heartbeat,
       TIMESTAMPDIFF(SECOND, last_heartbeat, NOW()) AS seconds_stale
FROM scheduler_node
WHERE last_heartbeat < NOW() - INTERVAL 2 MINUTE;
```
Any rows returned indicate stale nodes. For programmatic health checks:
```java
@ApplicationScoped
public class RatchetHealthCheck {

    @Inject
    NodeStore nodeStore;

    public boolean isHealthy() {
        Instant cutoff = Instant.now().minus(Duration.ofMinutes(2));
        return nodeStore.findInactiveNodesSince(cutoff).isEmpty();
    }
}
```
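If your runtime implements MicroProfile Health (Quarkus does, for example), the same logic can be exposed as a readiness probe. A sketch under that assumption, reusing NodeStore and findInactiveNodesSince from above:

```java
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import java.time.Duration;
import java.time.Instant;
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Readiness;

@Readiness
@ApplicationScoped
public class RatchetReadinessCheck implements HealthCheck {

    @Inject
    NodeStore nodeStore;

    @Override
    public HealthCheckResponse call() {
        Instant cutoff = Instant.now().minus(Duration.ofMinutes(2));
        boolean allAlive = nodeStore.findInactiveNodesSince(cutoff).isEmpty();
        // Surfaced under /health/ready by MicroProfile-compatible runtimes
        return HealthCheckResponse.named("ratchet-node-heartbeats")
                .status(allAlive)
                .build();
    }
}
```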
### Queue Depth Check
Monitor the number of pending jobs to detect backlog:
```sql
SELECT priority, COUNT(*) AS pending_count
FROM scheduler_job
WHERE status = 'PENDING'
GROUP BY priority;
```
Alert if CRITICAL or HIGH jobs stay pending beyond your latency budget; the recommendations below use 30 seconds for CRITICAL. A periodic probe like the sketch below can feed this check into your alerting.
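Here is a minimal sketch of such a probe using plain JDBC and a ScheduledExecutorService. The DataSource, the 30-second period, the alertOps hook, and the assumption that priority is stored as its enum name are all ours to adapt; alerting on pending *age* rather than count would additionally need the job table's enqueue timestamp column, which we do not assume here:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

public final class QueueDepthProbe {

    private final DataSource dataSource; // your application's connection pool

    public QueueDepthProbe(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void start() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(this::checkBacklog, 0, 30, TimeUnit.SECONDS);
    }

    private void checkBacklog() {
        String sql = "SELECT COUNT(*) FROM scheduler_job "
                + "WHERE status = 'PENDING' AND priority = 'CRITICAL'";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            if (rs.next() && rs.getLong(1) > 0) {
                alertOps(rs.getLong(1) + " CRITICAL jobs are pending");
            }
        } catch (Exception e) {
            // Log and continue; a transient DB error should not kill the probe
        }
    }

    private void alertOps(String message) {
        // Application-specific hook: page on-call, post to chat, etc.
    }
}
```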
### DLQ Check
Jobs in the dead-letter queue need human attention:
```sql
SELECT COUNT(*) FROM scheduler_job WHERE status = 'DEAD_LETTER';
```
Wire this into your alerting system — a non-zero DLQ count means something failed permanently.
## Alerting Recommendations
| Condition | Severity | Action |
|---|---|---|
| DLQ count > 0 | Critical | Investigate and retry or archive |
| Failure rate > 5% (5min window) | Warning | Check logs for systematic errors |
| P95 duration > 2x normal | Warning | Check for resource contention or slow dependencies |
| Node heartbeat stale > 2min | Critical | Node may be down — check process health |
| Pending CRITICAL jobs > 30s | Critical | Cluster may be overloaded |
| Pending queue depth growing | Warning | Scale up nodes or increase thread pool |
## See Also
- Event System — Complete event type reference
- Metrics Collection — Advanced custom metrics patterns
- Troubleshooting — Diagnosing common deployment issues