Add job management capabilities to your flow with checkpoints
This document outlines the functionality and benefits of qibb’s Job Management capabilities, enabled by Checkpoint Nodes strategically placed within your flows. This feature empowers you with enhanced job tracking, monitoring, and control, providing granular management over job processing for increased reliability and flexibility.
Minimum Version Requirements
Checkpoint Nodes require your app to be running on flow v4.0.0 or higher. Please ensure your app is up-to-date before installing these nodes.
Job management features in the Portal require the platform to be running on qibb v1.42.0 or higher.
Understanding Checkpoint Nodes
qibb’s Checkpoint Nodes act as control points within your flow, allowing you to observe and manipulate jobs as they progress. They provide a robust framework for managing the lifecycle of your jobs, from initiation to completion or termination.
Key Capabilities
Here's a breakdown of the powerful controls offered by Checkpoint Nodes:
Job Checkpoints: Clearly mark significant stages in your job's journey, such as the start, intermediate progress points, and final completion for success and failure cases. This provides a visual and auditable trail of execution.
Job Scheduling & Timeouts: Plan the execution of individual jobs for a future time and define maximum execution durations. Jobs exceeding these timeouts can be automatically flagged as failed.
Retries & Cancellation: Implement automatic retry mechanisms for failed jobs, with configurable limits and backoff strategies. You also have the ability to manually retry or cancel jobs as needed.
Queue Management: Optimize job processing with features like priority queuing (processing higher-priority jobs first), rate limiting (controlling the rate of jobs leaving a checkpoint), and the ability to temporarily pause and resume checkpoint queues.
Approval-Based "Wait" Checkpoints: Introduce manual approval steps into your flow. Jobs reaching a "Wait" checkpoint will be held until explicitly released via the user interface in the Portal, an injected message command in the flow, or an API call.
Browse & Search: Effortlessly track and analyze all jobs of an app through a dedicated interface in the qibb Portal, offering multiple filters and full-text search capabilities. This includes searching across job metadata and the complete job lifecycle. For audit purposes, you can inspect the event history and snapshotted message properties for each job.
Benefits of Using Checkpoint Nodes
Integrating Checkpoint Nodes into your flows offers several significant advantages:
Enhanced Job Tracking: Gain complete visibility into the status and progression of your jobs.
Improved Reliability: Implement automatic retries and timeouts to handle transient failures and prevent indefinite job hangs.
Increased Control: Manually intervene in job processing through cancellation, retries, and approval mechanisms.
Optimized Resource Utilization: Manage concurrent job execution with rate limiting and prioritize critical tasks with priority queuing.
Greater Flexibility: Schedule jobs for future execution and introduce manual approval steps for critical processes.
Resilience: Jobs persist through platform downtime or upgrades, ensuring continuity of processing. Stalled jobs are automatically flagged for attention.
Comprehensive Auditing: Maintain a detailed history of job progression at each checkpoint, including custom metadata.
Using Checkpoint Nodes in Your Flows
When designing your flows, you can strategically place different types of checkpoint nodes to implement the desired control mechanisms:
Start Checkpoint: Typically placed at the beginning of a job's flow. It can handle automatic retries for failed jobs.
Update Checkpoint: Used to mark progress or intermediate stages of a job. Supports pausing and resuming the queue.
Wait Checkpoint: Holds incoming jobs until they are manually or programmatically released.
Success Checkpoint: Indicates the successful completion of a job.
Fail Checkpoint: Indicates the failed completion of a job (see the sketch below).
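To make the wiring concrete, here is a minimal sketch of a processing step placed between a Start Checkpoint and the Success/Fail Checkpoints: a function node with two outputs, where output 1 is wired to a Success Checkpoint and output 2 to a Fail Checkpoint. The wiring and the `msg.payload.error` convention are assumptions for this illustration, not part of the checkpoint nodes themselves.

```javascript
// Hypothetical function node with two outputs.
// Output 1 is wired to a Success Checkpoint, output 2 to a Fail Checkpoint.
// The message arrives from an upstream Start or Update Checkpoint.
if (msg.payload && msg.payload.error) {
    // Processing failed: route the job towards the Fail Checkpoint.
    return [null, msg];
}
// Processing succeeded: route the job towards the Success Checkpoint.
return [msg, null];
```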
(Figures: jobs can be viewed and managed from the Portal; an example flow using checkpoints along a typical media flow involving wait/approval steps; an example flow with commands to control checkpoints, such as pausing/resuming a checkpoint.)
Managing Jobs in the Portal
The qibb Portal provides a dedicated interface for managing jobs processed through Checkpoint Nodes at the space level:
Job Overview: View a comprehensive list of jobs with their current status, description, and relevant metadata.
Filtering and Searching: Utilize multiple filters (e.g., status, checkpoint, creation time) and full-text search to quickly locate specific jobs.
Job Actions: Perform control actions on individual jobs, including:
Approve: Release jobs held at "Wait" checkpoints.
Retry: Manually trigger a retry attempt for a failed job.
Cancel: Terminate a pending or running job.
Delete: Remove a job from the system.
Edit: Modify certain job properties (if supported).
Implementation Details and Considerations
Resilience: Checkpoints ensure job persistence during downtime and app upgrades. Automatic retry rules aid in recovering from failures. The platform's architecture minimizes dependencies, allowing job and queue operation even during network disruptions. Cluster backups provide disaster recovery for job states and queues.
Pause & Resume Checkpoints: Start and Update Checkpoints can be paused and resumed via message commands, temporarily halting or restarting job processing at that point.
Approve Jobs: "Wait" Checkpoints hold jobs until a release command is issued through the UI or API.
Cancel Jobs: Jobs can be canceled individually using a specific message command referencing the job ID.
Delete Jobs: Individual jobs can be permanently removed using a message command with the job ID.
Schedule Jobs: Jobs can be scheduled for future execution by including an ISO date string in the `msg.scheduled_at` property (see the example after this list).
Timeouts for Jobs: Define a timeout period in seconds using the `msg.timeout_in_sec` property. Jobs exceeding this duration from their creation time will be flagged as failed.
Priority Queuing: Assign a priority level to jobs using the `msg.priority` property (a higher number indicates higher priority). The queue will dynamically reorder to process higher-priority jobs first.
Rate Limiting: Configure the maximum number of jobs concurrently leaving a checkpoint (options: 1, 10, 50, 100 jobs/second).
Automatic Retry for Jobs: The Start Checkpoint can be configured to automatically retry failed jobs. Configure the maximum number of attempts in the node properties. The original message at the time of job creation will be re-injected during retry.
Manual Retry for Jobs: Trigger a manual retry for a specific job ID via a message command. This bypasses the automatic retry limits, allowing for troubleshooting.
Job Tracking: The system automatically tracks job status (pending, running, success, fail) along with relevant metadata and a history of checkpoint transitions. Custom metadata structures for external IDs are also supported.
Flow Editor Interface: The flow editor provides visual indicators and interactive elements for Checkpoint Nodes:
See the current queue count for each checkpoint in the node status label.
View the job queue of a checkpoint in the debug sidebar.
Perform actions like deleting, canceling, releasing, pausing, resuming, manually retrying, and scheduling jobs using inject nodes with specific commands.
Configure rate limits per checkpoint.
Configure automatic retry attempts for the Start Checkpoint (for other checkpoint types, this setting is ignored).
Automatic Cleanup of Jobs: Future updates will introduce automatic deletion of old jobs based on configurable data retention policies to manage storage and prevent overflow.
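The job properties described above can be set in a function node immediately before a Start Checkpoint. The property names `msg.scheduled_at`, `msg.timeout_in_sec`, and `msg.priority` are taken from this document; the surrounding values are illustrative assumptions, not recommended defaults.

```javascript
// Hypothetical function node placed directly before a Start Checkpoint.

// Schedule the job to start 30 minutes from now (ISO date string).
msg.scheduled_at = new Date(Date.now() + 30 * 60 * 1000).toISOString();

// Flag the job as failed if it has not completed 2 hours after creation.
msg.timeout_in_sec = 2 * 60 * 60;

// Higher numbers are processed first by priority queuing.
msg.priority = 10;

return msg;
```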
Flows triggered by HTTP requests (via the HTTP-in node) are supported by checkpoints under the following restrictions:
To leverage all checkpoint features, immediately handle incoming HTTP requests and remove the `msg.res` object before the message reaches the first checkpoint, as sketched below. Otherwise, certain features will be automatically disabled for that specific job, including queuing, rate limiting, wait/approve, and retries, and a warning will be displayed in the debug sidebar. Jobs carrying the `msg.res` object will bypass queues and paused checkpoints.
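A minimal sketch of that pattern, assuming the incoming request has already been answered (for example with a 202 Accepted via an HTTP Response node) before this function node runs; the wiring and status code are assumptions for this example.

```javascript
// Hypothetical function node placed after the HTTP request has been answered
// and before the first Checkpoint Node.
// Removing msg.res lets the job use queuing, rate limiting,
// wait/approve, and retries.
delete msg.res;
return msg;
```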
Custom Metadata and Job Events
Checkpoint Nodes provide powerful capabilities for auditing and tracking your jobs, primarily through the use of custom metadata and automatically generated job events.
Custom Metadata for Enhanced Tracking
You can enrich your job records with custom metadata by adding properties to the `msg` object (e.g., `msg.payload.customer_id`, `msg.asset.file_type`). This allows you to store specific business-relevant information alongside your job.
Beyond arbitrary custom fields, qibb also supports a set of standardized metadata fields within `msg.qibb` for common tracking needs:
`owner_id`, `owner_name`, `owner_url`
`asset_id`, `asset_name`, `asset_url`
`external_id`, `external_name`, `external_url`
These standardized fields are specifically displayed in a dedicated column of the Jobs Table within the qibb Portal and are searchable, making it easier to filter and find jobs related to particular external systems, assets, or owners. Custom metadata (both standardized and user-defined) enhances job searchability, provides crucial context in the Portal, and forms a key part of your job's auditable trail.
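A minimal sketch of setting both standardized and custom metadata in a function node before a checkpoint. The field names under `msg.qibb` are the standardized fields listed above; the example values and the custom fields are illustrative assumptions.

```javascript
// Hypothetical function node that enriches a job with metadata
// before the message reaches a checkpoint.

// Standardized fields (shown in a dedicated column of the Jobs Table).
msg.qibb = msg.qibb || {};
msg.qibb.asset_id = "asset-1234";
msg.qibb.asset_name = "trailer_v2.mxf";
msg.qibb.asset_url = "https://mam.example.com/assets/asset-1234";
msg.qibb.external_id = "order-5678";
msg.qibb.external_name = "Trailer transcode order";
msg.qibb.external_url = "https://orders.example.com/orders/order-5678";

// Arbitrary custom fields are stored with the job as well and are searchable.
msg.payload = msg.payload || {};
msg.payload.customer_id = "cust-42";

return msg;
```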
Comprehensive Job Event History
As jobs progress through Checkpoint Nodes, the system automatically generates job events. Each event records a significant transition or action in the job's lifecycle, such as:
CREATED: When a job is initiated at a Start Checkpoint.
STARTED: When a job leaves a Start Checkpoint and begins processing.
CHECKED_OUT: When a job leaves an Update Checkpoint.
WAIT: When a job enters a Wait Checkpoint.
APPROVED: When a job is released from a Wait Checkpoint.
SCHEDULED_FOR_RETRY: When a job is scheduled for a retry attempt.
SUCCEEDED: When a job reaches a Success Checkpoint.
FAILED: When a job reaches a Fail Checkpoint or times out.
CANCELLED: When a job has been cancelled by a command or user action.
DELETED: When a job has been deleted by a command or user action.
Each event typically includes:
Timestamp of the event.
The `job_id` and the `checkpoint_id`/`name`/`type` it occurred at.
The `queue_type` and `job_state` at the time of the event.
A plain-text summary of the event.
A snapshot of the `msg` object (only if Event Data Storage is set to “Full”).
The current `attempt` number for the job.
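For illustration, a single job event might look roughly like the object below. The field names mirror the list above; the exact structure, naming, and values are assumptions rather than the documented storage format.

```javascript
// Illustrative shape of one job event (not the exact storage format).
const exampleJobEvent = {
    timestamp: "2024-05-01T12:34:56.789Z",
    job_id: "job-abc123",
    checkpoint_id: "checkpoint-1",
    checkpoint_name: "Transcode start",
    checkpoint_type: "START",
    queue_type: "IMMEDIATE",
    job_state: "running",
    event: "STARTED",
    summary: "Job left the Start Checkpoint",
    attempt: 1,
    // Present only when Event Data Storage is set to "Full":
    msg_snapshot: { payload: { customer_id: "cust-42" } }
};
```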
Configurable Event Data Storage
To manage storage consumption, you can configure the level of detail stored for each job event via the `EVENT_DATA_STORAGE` setting:
'Compact': Stores only essential event metadata, suitable for general tracking and auditing.
'Full': Includes a complete snapshot of the `msg` object at the time the event occurred. This provides deep debugging capabilities by allowing you to inspect the message content at any point in the job's history, but consumes significantly more storage.
Control Commands
The Checkpoint node can be controlled by sending it a message with a `msg.control_cmd` property. These commands allow for dynamic management of the queue and individual jobs.
Command | Description | Applies to Checkpoint Type
---|---|---
PAUSE_CHECKPOINT | Pauses the checkpoint, preventing it from processing new jobs from its queue. | Start, Update
RESUME_CHECKPOINT | Resumes a paused checkpoint, allowing it to continue processing jobs. | Start, Update
PUSH_JOB_EVENT | Adds a custom event to a specific job's history. Requires a job ID. | All
GET_GROUPED_QUEUE_LIST | Retrieves a list of queued jobs, grouped by queue type (e.g., IMMEDIATE, WAITING). | All
GET_FLAT_QUEUE_LIST | Retrieves a single flat list of all queued jobs. The result is sent to the node's second output. | All
RESET_QUEUE | Deletes all jobs currently queued at this specific checkpoint. | All
DELETE_JOB | Permanently deletes a single job from the database, regardless of its state. Requires a job ID. | All
CANCEL_JOB | Cancels a pending or running job. Requires a job ID. | All
RETRY_JOB | Manually retries a failed job, bypassing the automatic retry limits. Requires a job ID. | Start
RELEASE_WAITING_JOB | Releases a single job held at a Wait Checkpoint. Requires a job ID. | Wait
RELEASE_ALL_WAITING_JOBS | Releases all jobs currently held at a Wait Checkpoint. | Wait
CLEAN_DATABASE | Deletes all jobs from the database. | All
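A minimal sketch of issuing control commands, for example from an inject or function node wired into the checkpoint. The `msg.control_cmd` property is documented above; the `msg.job_id` property used here to reference a specific job is an assumed name for this example.

```javascript
// Hypothetical function node wired into a Start or Update Checkpoint.

// Pause the checkpoint queue:
msg.control_cmd = "PAUSE_CHECKPOINT";
return msg;

// To target a single job instead (e.g., cancel it), a message could carry:
// msg.control_cmd = "CANCEL_JOB";
// msg.job_id = "job-abc123"; // assumed property name for the job reference
```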
Reactivity and Timing of Checkpoints
Checkpoint Nodes leverage an internal, asynchronous scheduler to manage job queues, process events, and maintain job states. This design ensures robustness and resilience, but it also means that job processing is not instantaneous. Understanding the timing characteristics of the two available queue modes is essential for optimizing your flows.
Understanding the Two Queue Modes
The most significant configuration for a Checkpoint is its Queue Mode. This choice fundamentally changes how jobs are ingested, processed, and sent to the next node.
Queue Mode Option | How it Works | Output Pattern | Best Use Case |
---|---|---|---|
Batch Burst (Durable) DEFAULT | Persists all jobs to the database first, then the adaptive scheduler releases the entire queue in a single, powerful operation. | All jobs in a batch are sent out at roughly the same time. | Processing entire datasets as a single unit; ensuring no data is lost on crash; critical jobs. |
Steady Stream (High-Throughput) NEW | Buffers jobs in memory and releases them at a constant, configured rate (e.g., 10 per second). Jobs are written to the database as they are sent. | A smooth, steady flow of individual jobs. | High-volume APIs; preventing downstream overload; fastest latency for non-batch traffic. |
How Checkpoints Process Jobs
1. Core Queue Processing
In Batch Burst Mode: The checkpoint periodically evaluates its internal database queue. The frequency of this check dynamically adapts: it speeds up when there are many jobs waiting (as often as every 3 seconds) and slows down when the queue is empty (up to every 15 seconds), releasing all ready jobs in one go.
In Steady Stream Mode: The checkpoint uses a fixed-interval scheduler (typically every 1 second) to process its in-memory buffer. It sends out a number of jobs that adheres to the configured `RATE_LIMIT`, creating a predictable, constant flow rather than a burst.
2. High-Volume Ingestion
In Batch Burst Mode: To ensure durability, incoming jobs are collected in a temporary batch. When this batch meets a threshold (e.g., 50 messages) or a time limit is reached (e.g., 5 seconds), the entire batch is written to the database queue for persistence before it is considered for processing.
In Steady Stream Mode: Incoming jobs are added to a lightweight in-memory buffer with minimal overhead. They are only persisted to the database at the moment they are processed and sent out of the node by the rate-limiter.
3. Common Scheduled and Maintenance Tasks
The following background tasks run independently of the chosen queue mode and apply to specific checkpoint types:
Scheduled Jobs (`Start` Checkpoints): The scheduler periodically checks for jobs with a `msg.qibb.scheduled_at` time. Once the scheduled time arrives, jobs are moved into the appropriate queue (`IMMEDIATE` for Batch Burst, or the in-memory buffer for Steady Stream). Expect a delay of 5 seconds to a minute for jobs to be picked up after their scheduled time.
Automatic Retries (`Start` Checkpoints): When a job fails, the Start Checkpoint re-schedules it for a future retry using an exponential backoff strategy (e.g., 10s, 20s, 40s, ...; see the sketch after this list). The scheduler checks for jobs to retry approximately every 30 seconds.
Waiting Jobs (`Wait` Checkpoints): `Wait` Checkpoints inherently operate in a burst-like fashion. Jobs are held durably in a `WAITING` state. When approved, they are moved to the `IMMEDIATE` queue and are picked up by the scheduler in the next processing burst, typically within 5 to 60 seconds.
Flagging Stalled Jobs (`Start` Checkpoints): The system periodically (approx. every 5 minutes) identifies jobs that have been in a `RUNNING` state for too long (default: 5 minutes) and automatically flags them as `STALLED` for review.
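For illustration, the exponential backoff described above (10s, 20s, 40s, ...) corresponds to doubling the delay with each attempt. A minimal sketch, assuming a 10-second base delay and that the doubling continues for every further attempt:

```javascript
// Illustrative only: retry delay for a given attempt number,
// assuming a 10-second base and doubling per attempt (10s, 20s, 40s, ...).
function retryDelaySeconds(attempt, baseSeconds = 10) {
    return baseSeconds * Math.pow(2, attempt - 1);
}

// retryDelaySeconds(1) === 10
// retryDelaySeconds(2) === 20
// retryDelaySeconds(3) === 40
```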
Latency and Responsiveness
Your choice of queue mode directly impacts the latency profile of your flow:
Batch Burst Latency (The "Floodgate"): Jobs experience a predictable delay while they are queued, determined by the adaptive scheduler's interval (3-15 seconds). The key benefit is that all jobs in a large batch will have a similar latency and will be delivered as a cohesive group. The system is highly reactive to load, as the processing interval shortens automatically to clear backlogs faster.
Steady Stream Latency (The "Conveyor Belt"): This mode offers the lowest possible latency for the first job in a batch (typically ~1 second). However, for a large batch, the last job will have a higher latency as it waits its turn on the conveyor belt (e.g., in a batch of 50 with a rate limit of 10/s, the last job will have a latency of ~5 seconds). This mode provides excellent responsiveness for single messages and smooths out large bursts to protect downstream systems.
Performance Tip: Consider the Bigger Picture
Remember, the performance of any Checkpoint is influenced by its environment. An undersized app or a heavy, concurrent workload in other parts of your flow can impact the processing speed of jobs and increase latency. Always consider the overall system load when tuning your checkpoints.