Declarative Data Sync Pipelines: Ship Integrations as Config, Not Code
Replace warehouse-centric ETL with declarative pass-through sync pipelines. Complete guide with manifests, JSONata mappings, checkpoints, overrides, and a migration checklist.
If you are spending engineering sprints writing Python or Node.js scripts to pull data from your customers' SaaS tools into your product, you already know the pattern. A new enterprise prospect needs Zendesk, so you write a Zendesk sync script. Then someone needs Jira. Then ServiceNow. Building imperative data pipelines means maintaining brittle state, handling endless edge cases, and updating code every time a vendor changes an endpoint.
Each script has its own pagination logic, its own incremental sync cursor, its own error handling, and its own maintenance burden. Six months later, you have a graveyard of brittle ETL scripts that nobody wants to touch. When an upstream provider deprecates a field, your team drops product work to push an emergency fix. This cycle makes scaling integrations a massive engineering bottleneck.
The pressure to support more platforms is a mathematical reality for modern software companies. The average company uses 106 SaaS applications in 2024, according to BetterCloud's State of SaaS report. Every one of those apps is a potential integration point your customers will ask about during a sales call. Furthermore, 60% of all SaaS sales cycles now involve integration discussions, and for 84% of customers, integrations are either very important or a dealbreaker. Customers with five or more integrations are up to 80% less likely to churn.
The math is simple: if each integration requires a dedicated sync script, and each script takes an engineer two to four weeks to build, test, and harden for production, your integration backlog will outpace your hiring plan by an order of magnitude. To capture this revenue without expanding engineering headcount, you must fundamentally change how you build integrations. You need to move away from imperative scripts and adopt declarative data sync pipelines.
This guide breaks down the architectural shift from imperative integration scripts to declarative sync pipelines, walks through complete pipeline manifests with incremental checkpoint design and JSONata transforms, explains why pass-through architectures beat double-ETL for embedded B2B use cases, and provides a practical migration checklist for teams ready to eliminate their ETL toolchain.
The Embedded Integration Bottleneck: Why Imperative ETL Scripts Fail at Scale
Building integrations in-house usually starts with a logical approach: you write a dedicated API client for each provider. A HubSpotAdapter class handles HubSpot's nested properties objects. A SalesforceAdapter class handles SOQL queries and PascalCase field names. Engineers write imperative logic to fetch, paginate, transform, and load the data.
This approach fails spectacularly at scale. Managing incremental loads, handling dependencies, and dealing with late-arriving data are all "undifferentiated heavy lifting" that distract from core business logic. As data sources evolve, imperative pipelines become brittle.
Here is what this looks like in practice. A typical imperative sync script for pulling tickets from Zendesk might contain:
- A `fetch_all_tickets()` function with manual cursor management
- A database table storing the last sync timestamp per customer account
- Custom retry logic for 429 errors (that will inevitably drift from Zendesk's actual behavior)
- A transformation layer mapping Zendesk's field names to your internal schema
- Scheduled execution via cron or a task queue
Multiply that by 20 integrations and you have thousands of lines of integration-specific code, each with its own edge cases and failure modes. Schema drift from upstream API changes silently breaks these scripts. An engineer leaves the company, and nobody remembers why the Salesforce sync has a special sleep(3) call on line 247. When you maintain separate code paths for each integration, adding a new CRM means writing new endpoint handler functions, creating new database schemas, adding conditional branches in shared code, and going through a full CI/CD deployment cycle.
Embedded ETL platforms like Hotglue attempt to solve this by giving you a platform to host these scripts, but they are still fundamentally code-first. You are still writing Python transformation scripts for every integration. Conversely, internal data warehousing tools like Airbyte or Fivetran offer zero-maintenance pipelines, but they are built for internal analytics, not embedded B2B SaaS use cases. Fivetran operates on a usage-based pricing model where costs are based on Monthly Active Rows (MAR) - the number of rows inserted, updated, or deleted. As of March 2025, MAR is calculated per connection, meaning each data source has its own MAR count. This change significantly impacts costs for companies syncing multiple high-volume data sources for thousands of individual customer tenants.
To truly reduce technical debt from maintaining dozens of API integrations, you must abstract the integration logic out of your runtime code entirely.
Why Declarative Pass-Through Syncs Beat Double-ETL at Scale
The warehouse-centric pattern - land data in a data warehouse, transform with SQL or dbt, then push it back out via reverse ETL - was designed for internal analytics teams. When B2B SaaS companies repurpose this "double-ETL" pattern for embedded customer integrations, three problems compound at scale.
```mermaid
graph TD
    subgraph doubleETL["Double-ETL: 5 hops, 3 tools, data at rest"]
        A1[Customer SaaS API] --> B1[ETL Tool]
        B1 --> C1[Data Warehouse]
        C1 --> D1[SQL / dbt Transform]
        D1 --> E1[Reverse ETL Tool]
        E1 --> F1[Your App Database]
    end
    subgraph passThrough["Pass-Through: 1 hop, 1 config, zero retention"]
        A2[Customer SaaS API] --> B2[Declarative Sync<br>Pipeline]
        B2 --> C2[Your App Database]
    end
```

Latency accumulates at every hop. Data flows from your customer's SaaS API into an ETL tool, lands in a warehouse, gets transformed on a schedule, then gets pushed back out via reverse ETL to your application database. Each stage adds its own sync interval. If your ETL tool syncs every 6 hours and your reverse ETL runs hourly, your customers could be looking at data that is 7+ hours stale before it reaches your application.
Cost scales per tenant, not per integration. Warehouse-centric architectures charge based on data volume - rows synced, storage consumed, compute used. When you are syncing data for thousands of customer tenants, each with their own SaaS accounts, those per-row costs multiply fast. If you have 500 customers each with a Zendesk connection averaging 50,000 monthly active rows, you are paying for 25 million rows - and that is before the warehouse compute for transformation and the reverse ETL tool's own pricing layer.
Every intermediate data store is a compliance surface. When your customer's HRIS data lands in your data warehouse before reaching your application, that warehouse becomes a data processor under GDPR. You now need data processing agreements, retention policies, and deletion procedures for an intermediate system that exists purely as plumbing. We cover the compliance implications in detail later in this guide.
A pass-through declarative pipeline eliminates the intermediate warehouse entirely. Data flows from the customer's SaaS API through a declarative pipeline directly into your application's datastore. No warehouse hop. No reverse ETL. One tool, one config, zero intermediate retention. This is the core of an ETL-free data sync strategy for embedded B2B SaaS.
This is not about replacing data warehouses for analytics. If your data team needs to build dashboards across customer data, a warehouse is the right tool. But if your product simply needs to pull your customer's tickets, contacts, or employees into your own database for operational use, the double-ETL path adds cost, latency, and compliance overhead that a pass-through pipeline avoids entirely.
What Are Declarative Data Sync Pipelines?
Declarative data pipelines shift the engineering focus from how to execute a sync to what data needs to be synced.
A declarative data sync pipeline defines what data to fetch, from which resources, and in what order - without specifying the procedural steps for HTTP calls, pagination, authentication, or error handling.
In a declarative system, you define the desired end state using configuration data. The underlying execution engine reads this configuration and determines the most efficient execution plan. It handles the HTTP requests, pagination, authentication, and error formatting automatically. Adding a new integration becomes a data operation, not a code deployment.
Truto's RapidBridge is a configuration-driven engine that eliminates integration-specific code. With RapidBridge, you define a Sync Job as a JSON object. This object specifies which resources to fetch, how they depend on each other, and where to send the data. The same engine that syncs Zendesk tickets syncs Jira issues, ServiceNow incidents, or Asana tasks - because all integration-specific behavior lives as data-only configuration, not compiled code.
Here is an example of a declarative Sync Job definition that fetches Zendesk users, contacts, tickets, and the comments associated with each ticket:
```json
{
  "integration_name": "zendesk",
  "resources": [
    {
      "resource": "ticketing/users",
      "method": "list"
    },
    {
      "resource": "ticketing/tickets",
      "method": "list"
    },
    {
      "resource": "ticketing/comments",
      "method": "list",
      "depends_on": "ticketing/tickets",
      "query": {
        "ticket_id": "{{resources.ticketing.tickets.id}}"
      }
    }
  ]
}
```

Notice the `depends_on` field. Instead of writing nested for loops in Node.js to fetch comments for every ticket, you simply declare the dependency graph. The RapidBridge engine resolves the execution graph, replaces the `{{resources.ticketing.tickets.id}}` placeholder dynamically, and handles the orchestration. Comments cannot be fetched until tickets are available, and the engine ensures this execution order automatically.
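To make the orchestration concrete, here is a simplified JavaScript sketch of how a dependent resource's query template could expand into one request per parent record. This is an illustration of the concept, not RapidBridge's internals; `expandDependentQueries` and the ticket data are hypothetical:

```javascript
// Expand a dependent resource's query template into one concrete query
// per parent record, substituting the {{...}} placeholder with each id.
function expandDependentQueries(template, parentRecords, placeholder) {
  return parentRecords.map((record) => {
    const query = {};
    for (const [key, value] of Object.entries(template)) {
      query[key] = value === placeholder ? record.id : value;
    }
    return query;
  });
}

// Hypothetical tickets already fetched by the parent resource node
const tickets = [{ id: 'T-1' }, { id: 'T-2' }];
const queries = expandDependentQueries(
  { ticket_id: '{{resources.ticketing.tickets.id}}' },
  tickets,
  '{{resources.ticketing.tickets.id}}'
);
// queries → [{ ticket_id: 'T-1' }, { ticket_id: 'T-2' }]
```

The engine does the equivalent expansion for you, so the manifest stays a flat declaration while the runtime fans out one comments request per ticket.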
This architecture allows you to sync unified API data directly to a datastore or route it to a webhook endpoint, entirely driven by configuration. Compare this to code-first embedded ETL platforms where you write and maintain Python transformation scripts for each integration. Code-first gives you maximum flexibility; declarative gives you maximum leverage. At 5 integrations, the difference is marginal. At 50, it is the difference between a sustainable system and an engineering crisis.
End-to-End Architecture: Source to Datastore Without Retention
Here is the complete data flow for a pass-through declarative sync pipeline. The integration platform never persists your customer's data - it processes records in memory, applies transforms, and delivers them to your system.
```mermaid
sequenceDiagram
    participant App as Your Application
    participant RB as RapidBridge Engine
    participant API as Customer SaaS API
    App->>RB: Trigger Sync Job Run
    RB->>RB: Load manifest + per-tenant checkpoint
    RB->>API: GET resources (incremental, paginated)
    API-->>RB: Records batch
    RB->>RB: JSONata transform (in memory)
    RB->>App: POST to your webhook endpoint
    App-->>RB: 200 OK
    Note over RB: Zero customer data retained
    RB->>RB: Update per-tenant checkpoint
```

Every step in this flow is driven by the JSON manifest. The engine reads the configuration, resolves the resource dependency graph, fetches data incrementally using the stored checkpoint, applies JSONata transforms in memory, and delivers the results to your webhook or datastore. After successful delivery, it updates the per-tenant checkpoint. At no point does customer data get written to an intermediate database.
For pipelines that need to assemble fragmented API responses - such as knowledge base pages split across hundreds of block objects - a Spool Node collects all paginated results into a single batch in temporary memory before passing them to a Transform Node. This is the complete pipeline architecture including spool nodes:
```mermaid
flowchart TD
    A[JSON Sync Job<br>Manifest] --> B[RapidBridge Engine]
    B --> C[Resolve Resource<br>Dependency Graph]
    C --> D[Fetch Resources<br>Incremental Filters Applied]
    D --> E{Nested or<br>Fragmented?}
    E -->|Nested| F[Recursive Fetch<br>Child Resources]
    E -->|Fragmented| G[Spool Node<br>Collect All Pages]
    E -->|Simple| H[Transform Node<br>JSONata Expression]
    F --> G
    G --> H
    H --> I{persist: true?}
    I -->|Yes| J[Deliver to<br>Webhook or Datastore]
    I -->|No| K[Feed to<br>Downstream Nodes]
    J --> L[Update Per-Tenant<br>Checkpoint]
```

The pass-through guarantee holds even with spool nodes - temporary memory is cleared after the batch is delivered. No customer records touch persistent storage within the integration layer.
Complete Pipeline Walkthrough: Manifest, Mapping, and Transforms
Let's walk through a complete sync job that ties together resource dependencies, incremental sync, JSONata transforms, and spool nodes for nested content.
This manifest syncs HRIS data from BambooHR. It fetches departments, incrementally syncs employees, filters to active employees using a JSONata transform, and loops dependent time-off requests for each employee:
```json
{
  "integration_name": "bamboohr",
  "args_schema": {
    "sync_start_date": {
      "type": "string",
      "format": "date-time"
    }
  },
  "resources": [
    {
      "resource": "hris/departments",
      "method": "list"
    },
    {
      "name": "all-employees",
      "resource": "hris/employees",
      "method": "list",
      "query": {
        "updated_at": {
          "gt": "{{args.sync_start_date|previous_run_date}}"
        }
      },
      "persist": false
    },
    {
      "name": "active-employees",
      "type": "transform",
      "config": {
        "expression": "resources.hris.employees[employment_status = 'active'].{ 'id': id, 'full_name': first_name & ' ' & last_name, 'primary_email': email_addresses[is_primary].email, 'department_id': department.id, 'custom_fields': custom_fields }"
      },
      "depends_on": "all-employees",
      "persist": true
    },
    {
      "resource": "hris/time-off-requests",
      "method": "list",
      "depends_on": "all-employees",
      "query": {
        "employee_id": "{{resources.hris.employees.id}}"
      }
    }
  ]
}
```

Here is what happens when this pipeline executes:
- Checkpoint resolution. The engine loads the `previous_run_date` for this specific integrated account. Tenant A's checkpoint is completely independent of tenant B's.
- Departments fetch. A full list of departments is fetched (no incremental filter - departments change rarely).
- Incremental employee fetch. Only employees updated since the last successful run (or since `sync_start_date` if provided as an argument) are fetched. The `all-employees` node sets `persist: false`, so raw employee data is not sent to your webhook.
- JSONata transform. The `active-employees` transform filters to active employees and reshapes the output. The expression filters on `employment_status`, concatenates name fields, extracts the primary email, and carries through custom fields. This node sets `persist: true`, so your webhook receives only clean, filtered records.
- Dependent loop. For each employee fetched, the engine fetches their time-off requests by substituting `{{resources.hris.employees.id}}` with each employee's ID.
- Delivery. Departments, active employees, and time-off requests are delivered to your webhook as separate batched events.
- Checkpoint update. After all resources complete successfully, the engine updates `previous_run_date` for this tenant.
The JSONata expression in the transform node deserves closer attention:
```jsonata
resources.hris.employees[employment_status = 'active'].{
  'id': id,
  'full_name': first_name & ' ' & last_name,
  'primary_email': email_addresses[is_primary].email,
  'department_id': department.id,
  'custom_fields': custom_fields
}
```

This single expression performs three operations: it filters the array to active employees, restructures the output shape, and extracts nested values (primary email from an array of email objects). In imperative code, this would be a filter, a map, and a find - at least 10 lines of JavaScript. As a JSONata expression, it is a single storable string in your sync job configuration.
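For comparison, here is a rough imperative JavaScript equivalent of that expression. The `activeEmployees` function and the sample records are illustrative, assuming employees shaped like the unified HRIS model in the manifest:

```javascript
// Imperative equivalent of the JSONata transform: filter to active
// employees, reshape each record, and pull out the primary email.
function activeEmployees(employees) {
  return employees
    .filter((e) => e.employment_status === 'active')
    .map((e) => ({
      id: e.id,
      full_name: `${e.first_name} ${e.last_name}`,
      primary_email: (e.email_addresses.find((a) => a.is_primary) || {}).email,
      department_id: e.department.id,
      custom_fields: e.custom_fields,
    }));
}

// Hypothetical sample data in the unified employee shape
const sample = [
  {
    id: 'e1', first_name: 'Ada', last_name: 'Lovelace',
    employment_status: 'active',
    email_addresses: [{ email: 'ada@example.com', is_primary: true }],
    department: { id: 'd1' }, custom_fields: { badge: 'A-100' },
  },
  {
    id: 'e2', first_name: 'Bob', last_name: 'Smith',
    employment_status: 'terminated',
    email_addresses: [], department: { id: 'd2' }, custom_fields: {},
  },
];
// activeEmployees(sample) keeps only Ada, reshaped
```

The logic is identical, but this version lives in your deployed codebase; the JSONata version lives as data in the sync job configuration.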
Mastering Incremental Sync: Escaping the Full-Refresh Trap
Full-refresh syncs - pulling every record from a source on every run - are the default in most hand-rolled integration scripts because they are the simplest to implement. They are also highly wasteful. If your customer has 500,000 Zendesk tickets and only 200 changed since your last sync, fetching all 500,000 every six hours is burning API quota, bandwidth, and processing time for no reason.
In production environments, you must implement incremental syncing to fetch only the records that have changed since the last successful run. The trick is reliable state tracking: you need to know when the last sync completed successfully and use that timestamp as a filter.
In imperative scripts, managing incremental state is notoriously difficult. You have to read the last sync timestamp from a database, pass it to the API request, handle timezone conversions, and update the timestamp only if the entire batch succeeds. If a script crashes halfway through, you risk data duplication or missed records.
Declarative pipelines handle state tracking automatically. RapidBridge exposes a system-managed context variable called previous_run_date. This variable stores the exact UTC timestamp of the last successful sync job for a specific integrated account.
To convert a full refresh into an incremental sync, you bind this variable to the upstream API's filter parameters:
```json
{
  "resource": "ticketing/tickets",
  "method": "list",
  "query": {
    "updated_at": {
      "gt": "{{previous_run_date}}"
    }
  }
}
```

That is the entire incremental sync implementation. No database migration for a sync_cursors table. No custom logic to handle the first-run edge case. No manual timestamp bookkeeping. On the very first run, RapidBridge evaluates previous_run_date as 1970-01-01T00:00:00.000Z, triggering a historical sync. On subsequent runs, it injects the precise timestamp of the last success.
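The checkpoint semantics can be modeled in a few lines of JavaScript. This is an illustrative sketch of the behavior described above, not engine code; `buildIncrementalQuery` and `EPOCH` are hypothetical names:

```javascript
// Illustrative model of the first-run fallback: with no stored checkpoint,
// the filter defaults to the Unix epoch, which matches every record and
// produces a full historical sync.
const EPOCH = '1970-01-01T00:00:00.000Z';

function buildIncrementalQuery(previousRunDate) {
  return { updated_at: { gt: previousRunDate || EPOCH } };
}

buildIncrementalQuery(null);
// → { updated_at: { gt: '1970-01-01T00:00:00.000Z' } }

buildIncrementalQuery('2025-04-17T08:00:00.000Z');
// → { updated_at: { gt: '2025-04-17T08:00:00.000Z' } }
```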
Overrides and Dynamic Arguments
There are scenarios where you need to override this automatic state. If a customer reports stale data and you want to force a full historical sync on demand, you can pass an ignore_previous_run flag in the API request. This bypasses the stored state without altering the pipeline definition:
```json
{
  "sync_job_id": "7279a917-b447-4629-9e46-a1eeb791ad6b",
  "integrated_account_id": "7ae7b0ab-c6a7-4f29-aec1-1f123517af5d",
  "webhook_id": "a5b21886-3b4d-4fd0-9956-ffc0714d701c",
  "ignore_previous_run": true
}
```

Sometimes, you need to sync data based on user input rather than automated timestamps, such as during targeted backfill operations. RapidBridge supports dynamic arguments via an args_schema definition. If you want to allow users to specify a custom start date for their ticket sync, you define the schema and reference it in your query:
```json
{
  "args_schema": {
    "ticket_sync_start_date": {
      "type": "string",
      "format": "date-time"
    }
  }
}
```

When triggering the job, you pass the argument:
```json
{
  "args": {
    "ticket_sync_start_date": "2025-01-15T00:00:00.000Z"
  }
}
```

You can then use a conditional placeholder to fall back to the previous run date if the user does not provide a custom argument:
```json
{
  "query": {
    "updated_at": {
      "gt": "{{args.ticket_sync_start_date|previous_run_date}}"
    }
  }
}
```

This declarative flexibility completely removes the need for conditional logic and state management in your application backend.
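The pipe fallback is easy to model. Below is an illustrative resolver, not Truto's implementation, showing how a placeholder like `args.ticket_sync_start_date|previous_run_date` would resolve against a run context:

```javascript
// Try each pipe-separated path in order; the first defined value wins.
function resolvePlaceholder(expression, context) {
  for (const path of expression.split('|')) {
    const value = path
      .split('.')
      .reduce((obj, key) => (obj == null ? undefined : obj[key]), context);
    if (value !== undefined) return value;
  }
  return undefined;
}

// Hypothetical run context: no user argument was passed
const context = {
  args: {},
  previous_run_date: '2025-04-17T08:00:00.000Z',
};
resolvePlaceholder('args.ticket_sync_start_date|previous_run_date', context);
// → '2025-04-17T08:00:00.000Z' (the checkpoint wins when no argument is set)
```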
Per-Tenant, Per-Resource Checkpoint Isolation
A question that comes up immediately in multi-tenant architectures: what happens to the checkpoint when one tenant's sync fails but another's succeeds?
RapidBridge isolates checkpoints at the intersection of sync job and integrated account. Tenant A's previous_run_date for a Zendesk tickets sync is completely independent of tenant B's. If tenant A's sync fails due to an expired OAuth token, tenant B's checkpoint still advances normally. When tenant A's token is refreshed and the sync retries, it picks up from tenant A's last successful checkpoint - not from some global cursor.
| Tenant | Integration | Last Successful Sync | Next Run Fetches |
|---|---|---|---|
| Acme Corp | Zendesk | 2025-04-17T08:00:00Z | Records updated after 08:00 |
| Globex Inc | Zendesk | 2025-04-17T14:30:00Z | Records updated after 14:30 |
| Initech | Jira | 2025-04-16T22:00:00Z | Records updated after 22:00 |
This isolation extends to the job level: the checkpoint for an entire sync job only advances after all resources in the manifest complete successfully. If your manifest fetches both ticketing/tickets and ticketing/users, and the users fetch fails due to a rate limit, the checkpoint does not advance - even though tickets fetched successfully. This prevents partial sync states where tickets are up to date but the users they reference are stale. The next retry re-fetches everything from the same checkpoint.
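The isolation rules above can be captured in a small sketch. This is an illustrative in-memory model, not RapidBridge's actual persistence layer; the class and method names are hypothetical:

```javascript
// Checkpoints are keyed by (sync job, integrated account) and advance
// only when every resource in the run succeeds.
class CheckpointStore {
  constructor() {
    this.checkpoints = new Map();
  }
  get(syncJobId, accountId) {
    return this.checkpoints.get(`${syncJobId}:${accountId}`) || null;
  }
  recordRun(syncJobId, accountId, startedAt, resourceResults) {
    // All-or-nothing: a single failed resource keeps the old checkpoint,
    // so the next retry re-fetches everything from the same point.
    if (resourceResults.every((r) => r.ok)) {
      this.checkpoints.set(`${syncJobId}:${accountId}`, startedAt);
    }
  }
}

const store = new CheckpointStore();
// Tenant A fails on one resource; tenant B succeeds on all
store.recordRun('zendesk-tickets', 'acme', '2025-04-17T08:00:00Z', [
  { resource: 'ticketing/tickets', ok: true },
  { resource: 'ticketing/users', ok: false }, // rate limited
]);
store.recordRun('zendesk-tickets', 'globex', '2025-04-17T08:00:00Z', [
  { resource: 'ticketing/tickets', ok: true },
  { resource: 'ticketing/users', ok: true },
]);
store.get('zendesk-tickets', 'acme');   // → null (did not advance)
store.get('zendesk-tickets', 'globex'); // → '2025-04-17T08:00:00Z'
```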
Per-Tenant Overrides: Examples and Best Practices
In B2B SaaS, your customers' SaaS instances are rarely identical. One customer's BambooHR might have custom fields for visa_status and equity_grant_date. Another might use a non-standard department taxonomy. A declarative pipeline needs to handle this variability without forking your entire configuration.
RapidBridge supports a three-level override hierarchy that lets you customize sync behavior at increasing levels of specificity:
| Level | Scope | Use Case |
|---|---|---|
| Platform | All customers, all accounts | Default mappings and sync configurations |
| Environment | A specific customer environment | Custom field mappings, modified query logic |
| Account | A single integrated account | Account-specific field remapping, endpoint overrides |
Each level deep-merges on top of the previous one. An account-level override does not replace the entire configuration - it patches only the fields you specify.
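Deep-merge semantics matter here, because they determine what an override can and cannot clobber. The following is an illustrative sketch over a simplified config shape (the real merge may handle arrays and deletions differently):

```javascript
// Recursively patch `base` with `patch`: nested objects merge key-by-key,
// while scalars and arrays in the patch replace the base value outright.
function deepMerge(base, patch) {
  const out = { ...base };
  for (const [key, value] of Object.entries(patch)) {
    out[key] =
      value && typeof value === 'object' && !Array.isArray(value)
        ? deepMerge(out[key] || {}, value)
        : value;
  }
  return out;
}

// Hypothetical platform-level defaults and an account-level patch
const platform = { tickets: { fields: { priority: 'priority' }, page_size: 100 } };
const account = { tickets: { fields: { priority: 'custom_fields.priority_override' } } };
deepMerge(platform, account);
// → page_size: 100 survives; only the priority mapping is patched
```

This is why an account-level override can remap one field without restating, or risking, the rest of the platform configuration.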
For example, if a customer's Zendesk instance uses a custom field priority_override that you want mapped to your schema's priority field, you can apply an account-level override to the response mapping without touching the platform-level configuration:
```json
{
  "unified_model_override": {
    "ticketing": {
      "tickets": {
        "list": {
          "response_mapping": "$.{ 'priority': custom_fields.priority_override ? custom_fields.priority_override : priority }"
        }
      }
    }
  }
}
```

This override is stored on the individual integrated account record. Every other customer's Zendesk sync continues using the platform-level mapping. No code change, no deployment, no impact on other tenants.
Best practices for managing overrides:
- Start with broad platform-level mappings that cover 80% of use cases. Only create overrides when a specific customer's data structure demands it.
- Use environment-level overrides for customer segments, not individual accounts. If all enterprise customers need a specific field, apply it at the environment level.
- Document every account-level override. These are the hardest to track because they are invisible to other customers. Treat them like feature flags - review them quarterly and promote recurring patterns to the platform level.
- Test overrides against real API responses. Use the JSONata Exerciser with sample data from the specific account before deploying to production.
Transforming Data on the Fly with JSONata
Pulling raw data from a third-party API is only half the job. Unified schemas are highly effective for standardizing core data models across providers, but enterprise customers frequently use custom fields, unique naming conventions, or highly specific data structures that fall outside standard schemas. The data almost always needs filtering, reshaping, or enrichment before it is useful in your system.
Transforming deeply nested data in plain JavaScript or Python usually means writing layers of nested loops and null checks. Those mentally taxing implementations soak up development time that should go toward business logic, the part that actually delivers value to customers; maintaining complicated mapping scripts only slows velocity.
RapidBridge solves this using JSONata. JSONata is a lightweight, Turing-complete functional query and transformation language purpose-built for JSON data, inspired by the location path semantics of XPath 3.1. It allows you to express complex transformations - filtering, aggregation, string manipulation, and conditional logic - as compact strings that live as configuration.
In RapidBridge, you implement these transformations using Transform Nodes. A Transform Node depends on a resource node and applies a JSONata expression to the fetched data before passing it downstream.
Consider a real use case: you fetch a massive list of contacts from Zendesk, but you only want to persist contacts that were updated after the previous_run_date. While some APIs support updated_at filtering at the query level, many older APIs do not. You have to fetch everything and filter it in memory.
Here is how you handle that declaratively with a Transform Node:
```json
{
  "integration_name": "zendesk",
  "resources": [
    {
      "name": "all-contacts",
      "resource": "ticketing/contacts",
      "method": "list",
      "persist": false
    },
    {
      "name": "filtered-contacts",
      "type": "transform",
      "config": {
        "expression": "resources.ticketing.contacts[updated_at >= %.%.%.previous_run_date]"
      },
      "depends_on": "all-contacts",
      "persist": true
    }
  ]
}
```

In this configuration:
- The `all-contacts` node fetches the data but sets `persist: false`, meaning this raw data will not be sent to your webhook or datastore.
- The `filtered-contacts` node depends on `all-contacts`.
- The JSONata expression filters the array, keeping only objects where `updated_at` is greater than or equal to the `previous_run_date` context variable.
- The filtered node sets `persist: true`, ensuring only the relevant records are delivered.
Transform nodes also chain. You can have a transform that depends on another transform, building multi-step processing pipelines without writing a single line of procedural code. JSONata expressions have full access to a rich context object, including the arguments passed to the run, the data fetched by previous resources, and any custom variables stored on the integrated account.
JSONata is a Domain Specific Language and does require some ramp-up to onboard developers. Its compact syntax has a learning curve, and debugging complex expressions is different from stepping through JavaScript in a debugger. However, for teams unfamiliar with functional expression languages, tools like the JSONata Exerciser are invaluable for iterating on expressions interactively, and the long-term maintenance benefits far outweigh the initial learning curve.
Handling Complex Pagination and Spooling
Not all APIs return clean, flat arrays of data. Document APIs, knowledge bases, and file storage systems often return heavily nested or fragmented data structures.
Some APIs fragment a single logical resource across many paginated responses. Notion is the poster child for this: fetching a page does not return the full document text. It returns a paginated list of block IDs. You have to recursively fetch the content of each block, handle pagination for each request, and stitch the text back together. Assembling that into a single coherent document in an imperative script is a recursive nightmare requiring significant memory management.
RapidBridge handles this with two features that work together: recursive fetching and Spool Nodes.
Recursive fetching follows parent-child relationships within a single resource automatically. If a block has children, the engine fetches them based on a condition:
```json
{
  "resource": "file-storage/drive-items",
  "method": "list",
  "recurse": {
    "if": "{{resources.file-storage.drive-items.has_children:bool}}",
    "config": {
      "query": {
        "parent": {
          "id": "{{resources.file-storage.drive-items.id}}"
        }
      }
    }
  }
}
```

Spool Nodes collect all paginated results into a single batch before forwarding them. Instead of receiving 15 separate webhook events for 15 pages of content blocks, you receive one event with the complete, assembled collection. You can then pass the entire collection to a Transform Node to be combined into a single document.
```mermaid
sequenceDiagram
    participant Truto as RapidBridge
    participant API as Third-Party API
    participant Spool as Spool Node
    participant Transform as Transform Node
    Truto->>API: Fetch Block List (Page 1)
    API-->>Truto: Blocks 1-50 + Next Cursor
    Truto->>Spool: Store Blocks 1-50
    Truto->>API: Fetch Block List (Page 2)
    API-->>Truto: Blocks 51-100
    Truto->>Spool: Store Blocks 51-100
    Spool->>Transform: Pass all 100 blocks
    Transform->>Transform: Evaluate JSONata (Combine Text)
    Transform-->>Truto: Single Markdown Document
```

To prevent memory exhaustion, Spool Nodes limit temporary storage to 128KB per block. If your pages contain massive embedded content or remote data payloads, you should pair the Spool Node with an intermediate Transform Node that strips out heavy, unnecessary metadata before the data hits the spool.
By chaining a resource node, a stripping transform node, a spool node, and a combining transform node, you can convert highly fragmented, paginated API responses into clean, single-document webhook events - entirely through configuration.
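The spool-then-combine pattern can be sketched in a few lines. This is a conceptual model, not Truto code; the page fetcher and block shape are hypothetical:

```javascript
// Collect every page into temporary memory, then run one combining
// transform over the full set and deliver a single result.
async function spoolAndCombine(fetchPage, combine) {
  const spooled = [];
  let cursor = null;
  do {
    const { records, nextCursor } = await fetchPage(cursor);
    spooled.push(...records); // spool: accumulate, do not deliver yet
    cursor = nextCursor;
  } while (cursor);
  return combine(spooled); // one delivery for the assembled collection
}

// Hypothetical paginated block fetcher (two pages of text blocks)
const pages = [
  { records: [{ text: 'Hello' }, { text: 'world' }], nextCursor: 'p2' },
  { records: [{ text: 'from Notion' }], nextCursor: null },
];
function makePageFetcher(pagesArr) {
  let i = 0;
  return async () => pagesArr[i++];
}

spoolAndCombine(
  makePageFetcher(pages),
  (blocks) => blocks.map((b) => b.text).join(' ')
).then((doc) => console.log(doc)); // → 'Hello world from Notion'
```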
The Reality of API Rate Limits and 429 Errors
If you are building data sync pipelines at scale, you will hit rate limits. There is no avoiding it. How your integration platform handles these limits dictates the reliability of your entire architecture.
Many integration platforms market "automatic retries" and "built-in exponential backoff" as selling points. They intercept HTTP 429 (Too Many Requests) errors, hold the connection open, and retry the request behind the scenes. This sounds great until you realize that opaque retry logic makes debugging impossible and creates cascading failures in distributed systems. Your background workers time out waiting for a response, queue backlogs explode, and you lose all visibility into the actual health of the upstream API.
Here is how rate limits actually work in the real world: every major API gateway and proxy - Red Hat 3scale, Kong, Envoy, Azure API Management - implements rate limiting differently. Salesforce uses a rolling 24-hour window. HubSpot enforces per-second and daily limits. Zendesk returns Retry-After headers. Each vendor expresses their limits in different response headers with different naming conventions.
Truto takes a radically transparent approach: it does not retry, throttle, or absorb rate limit errors.
If an upstream API returns an HTTP 429, Truto passes that error directly back to your caller. You are explicitly responsible for handling the failure and implementing your own retry logic.
To solve the chaos of inconsistent upstream headers, Truto normalizes all upstream rate limit data into standardized response headers based on the IETF RateLimit header fields draft:
| Header | Meaning |
|---|---|
| `ratelimit-limit` | The maximum number of requests permitted in the current window |
| `ratelimit-remaining` | The number of requests remaining in the current window |
| `ratelimit-reset` | The number of seconds until the rate limit window resets |
Why is this better than automatic retry? Because you - the caller - know your business context. You know whether a failed sync can wait 30 seconds or needs to be escalated immediately. You know whether you should back off exponentially or switch to a different customer's sync job while the rate limit window resets. An opaque retry layer inside your integration platform strips you of that control.
A practical implementation looks like this:
```javascript
// Minimal sleep helper so the retry loop below is self-contained
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function syncWithRateLimitHandling(syncFn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await syncFn();
    if (response.status === 429) {
      // Truto normalizes upstream limits into IETF-style ratelimit-* headers
      const resetIn = parseInt(
        response.headers.get('ratelimit-reset') || '60',
        10
      );
      const remaining = response.headers.get('ratelimit-remaining');
      console.log(`Rate limited. ${remaining} requests left. Waiting ${resetIn}s.`);
      await sleep(resetIn * 1000);
      continue;
    }
    return response;
  }
  throw new Error('Rate limit retries exhausted');
}
```

The standardized headers mean this retry logic works identically whether the upstream API is Salesforce, Zendesk, or BambooHR. You write it once and it works across every integration. This explicit failure state gives your engineering team complete, deterministic control, which is the only way to adhere to best practices for handling API rate limits across multiple third-party APIs.
## Concurrency Controls for Multi-Tenant Sync
When you are syncing data for hundreds of customers against the same API provider, concurrency becomes a real concern. If 50 of your customers use Zendesk and all 50 sync jobs trigger simultaneously, you could be hitting Zendesk's API with 50 parallel request streams.
Most SaaS APIs enforce rate limits per OAuth token or per API key, not across all API consumers globally. This means individual tenant syncs are unlikely to interfere with each other at the API level. However, some providers (notably Salesforce) enforce org-wide concurrent request limits that apply to all API consumers against a single Salesforce org. If your customer's own internal integrations are also active, your sync job competes for the same concurrency pool.
The practical approach: stagger your sync job schedules. Instead of triggering all sync jobs at the top of the hour, distribute them across the scheduling window. If you have 200 Zendesk syncs scheduled every 6 hours, randomize each job's start time within a 30-minute window to smooth out the API load.
For sync jobs running via RapidBridge, rate limit events are reported as sync_job_run:rate_limited webhook events. By monitoring these events across your tenant base, you can detect provider-level throttling patterns and adjust your scheduling accordingly. If a specific customer's Salesforce org consistently hits concurrency limits, you can shift that customer's sync to an off-peak window without changing the pipeline configuration for anyone else.
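A minimal in-memory tracker for those webhook events might look like the following sketch. The event payload shape (`integration` field) is illustrative - adapt it to the actual body your platform delivers:

```javascript
// Aggregates rate-limit webhook events per provider so you can spot
// which integrations need rescheduling into off-peak windows.
class RateLimitMonitor {
  constructor() {
    this.counts = new Map(); // integration name -> rate-limit hit count
  }

  // Call this from your sync_job_run:rate_limited webhook handler.
  record(event) {
    const key = event.integration;
    this.counts.set(key, (this.counts.get(key) || 0) + 1);
  }

  // Providers whose syncs hit rate limits at least `threshold` times
  // are candidates for staggered or off-peak scheduling.
  hotProviders(threshold = 10) {
    return [...this.counts.entries()]
      .filter(([, count]) => count >= threshold)
      .map(([name]) => name);
  }
}
```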
Architectural Takeaway: Never trust a system that hides rate limits from your application logic. Explicit 429 passthrough plus normalized headers and webhook events gives your system the signals it needs to track and react to throttling across all your connected accounts programmatically.
## Compliance and Zero-Retention Validation
For B2B SaaS companies handling customer data, every system that stores or processes personal information expands your compliance surface. In a double-ETL architecture, your intermediate data warehouse becomes a data processor under GDPR, subject to storage limitation requirements under Article 5(1)(e) and data subject access requests under Article 15.
GDPR Article 5(1)(e) requires that personal data be "kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed." A data warehouse that exists only as a relay between your customer's SaaS API and your application database is hard to justify as a necessary processing purpose - it is architectural convenience masquerading as data infrastructure.
A pass-through declarative pipeline sidesteps this entirely. Customer records flow through the pipeline's transform layer in memory and are delivered directly to your application. The integration platform retains no customer data after delivery. Only sync metadata - job status, checkpoint timestamps, error counts - persists, and none of that is customer personal data.
To validate and prove this architecture during security reviews and SOC 2 audits:
- Confirm no intermediate persistence. Verify that your sync pipeline does not write customer records to any database, object store, or log file within the integration layer.
- Audit webhook delivery payloads. Ensure that records delivered to your webhook contain only the fields your application needs. Use JSONata transforms to strip unnecessary personal data before delivery, directly supporting the GDPR data minimization principle under Article 5(1)(c).
- Document data flow paths. Map every integration to show: source API → pipeline (in-memory transform) → your datastore. This documentation is what your SOC 2 auditor or customer's InfoSec team will ask for during a security review.
- Implement delivery confirmation. Ensure your webhook endpoint returns a 200 status only after successfully persisting the records. If delivery fails, the pipeline retries - but the data never sits in an intermediate queue within the integration platform.
- Verify rate limit passthrough. Since Truto passes 429 errors directly to your caller rather than queuing retries internally, there are no retry queues holding customer data. This eliminates a subtle data retention vector that many integration platforms introduce when they implement automatic retries with persistent queues.
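The delivery-confirmation and idempotency points above can be sketched as a framework-agnostic handler: persist the batch first, acknowledge with 200 only on success, and upsert on a per-record key so redelivered batches are safe. The `id` field and the store interface here are assumptions for illustration:

```javascript
// Persist a delivered batch, returning 200 only after a successful write.
// A non-2xx result tells the pipeline to retry the whole batch later.
async function handleSyncBatch(records, store) {
  try {
    for (const record of records) {
      // Upsert keyed on record.id makes redelivery idempotent.
      await store.upsert(record.id, record);
    }
    return { status: 200 };
  } catch (err) {
    return { status: 500, error: err.message };
  }
}

// Minimal in-memory store standing in for your application database.
function memoryStore() {
  const rows = new Map();
  return {
    upsert: async (id, record) => { rows.set(id, record); },
    size: () => rows.size,
  };
}
```

Because the upsert is keyed, processing the same batch twice leaves the store unchanged - exactly the property a retrying pipeline requires of your endpoint.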
SOC 2 Scope Reduction: In a store-and-sync architecture, your integration middleware enters your SOC 2 audit scope as a system that processes and stores customer data. A pass-through architecture keeps it out of scope for data-at-rest controls, reducing your audit surface and the number of controls you need to demonstrate.
## Migration Checklist: From ETL Tools to Declarative Pass-Through
If you are currently running imperative ETL scripts or using warehouse-centric tools for embedded customer integrations, here is a structured migration path to no-ETL data pipelines.
### Phase 1: Audit and Catalog
- Inventory every active sync script. List the integration, resources synced, sync frequency, and data volume per tenant.
- Identify incremental sync parameters. For each API, document which fields support date-based filtering (`updated_at`, `modified_since`, etc.) and which require full refresh.
- Map transformation logic. Catalog every field mapping, filter, and enrichment step in your current scripts. These become JSONata expressions in the declarative manifest.
- Document rate limit behavior. Record each provider's rate limit headers, limits, and retry patterns. You will need this for your client-side retry logic.
### Phase 2: Build and Validate
- Create declarative manifests. Convert each sync script to a JSON Sync Job definition. Start with the simplest integration (fewest resources, no dependencies) and work up to complex ones with transforms and spool nodes.
- Write JSONata transforms. Replace imperative transformation code with JSONata expressions. Test every expression against real API responses using the JSONata Exerciser.
- Set up webhook endpoints. Your backend needs an endpoint that accepts batched records from the sync pipeline and writes them to your database. Handle idempotency - the same record may be delivered more than once during retries.
- Configure per-tenant overrides for any customer accounts that required special handling in your old scripts (custom field mappings, non-standard query logic, etc.).
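As a sketch of what a per-tenant override might look like - the field names below are illustrative, not the platform's actual schema - an override layers tenant-specific mapping on top of the shared manifest without forking it:

```json
{
  "connected_account_id": "acct_123",
  "overrides": {
    "resources": {
      "ticket": {
        "transform": "{ \"id\": id, \"priority\": custom_fields.escalation_level }"
      }
    }
  }
}
```

The shared sync job definition stays untouched; only the accounts that needed special handling in your old scripts carry an override record.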
### Phase 3: Parallel Run and Cutover
- Run old and new pipelines in parallel. For at least two full sync cycles, run both systems and compare the output. Diff the records to catch mapping discrepancies.
- Monitor rate limit events. Watch for `sync_job_run:rate_limited` webhook events and compare against your old scripts' retry patterns.
- Validate checkpoint behavior. Confirm that `previous_run_date` advances correctly per tenant after each successful run, and that failed runs do not advance the checkpoint.
- Cut over one integration at a time. Disable the old script, verify the declarative pipeline produces identical results for at least one full cycle, then move to the next integration.
- Decommission old infrastructure. Remove cron jobs, task queues, sync cursor tables, and any intermediate databases that supported the old scripts.
Start with read-only syncs. Migrate your data pull pipelines first (list operations). Write-back operations (create, update) are typically more sensitive and benefit from longer parallel testing periods.
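For the parallel-run comparison, a simple record diff catches most mapping discrepancies. This sketch matches records on an assumed `id` key and compares serialized fields - note that `JSON.stringify` comparison is key-order-sensitive, so normalize records first if your two pipelines emit keys in different orders:

```javascript
// Compare one sync cycle's output from the old and new pipelines.
// Returns one entry per record that is missing or differs field-by-field.
function diffRecords(oldRecords, newRecords) {
  const byId = new Map(newRecords.map((r) => [r.id, r]));
  const mismatches = [];
  for (const oldRec of oldRecords) {
    const newRec = byId.get(oldRec.id);
    if (!newRec) {
      mismatches.push({ id: oldRec.id, issue: 'missing in new pipeline' });
    } else if (JSON.stringify(oldRec) !== JSON.stringify(newRec)) {
      mismatches.push({ id: oldRec.id, issue: 'field mismatch' });
    }
  }
  return mismatches;
}
```

Run it after each parallel cycle and treat any nonempty result as a blocker for cutover on that integration.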
## Where Declarative Pipelines Sit in the Integration Stack
Declarative sync pipelines are not a silver bullet. They work best when:
- You need to pull data from customers' SaaS accounts into your own data store - the classic embedded integration use case for B2B SaaS.
- The data model is well-defined - tickets, contacts, employees, invoices - resources that map cleanly to unified schemas.
- You are syncing at scheduled intervals - every 15 minutes, every hour, every 6 hours.
They are less well-suited for:
- Complex bidirectional sync with conflict resolution - that requires application-level logic that goes beyond what a declarative config can express.
- Real-time event-driven workflows where sub-second latency matters - webhooks plus a real-time unified API are a better fit.
- Highly custom ETL where every customer needs radically different transformation logic - though RapidBridge's JSONata transforms handle more of this than you might expect.
Within that sweet spot, combined with scheduled cron triggers, a single declarative pipeline can run every 6 hours, fetch only incremental changes, transform the data to your schema, and deliver it to your database - all defined as a JSON configuration that fits in a single screen.
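As an illustrative sketch of such a configuration - the field names are hypothetical, not the platform's exact schema - the whole pipeline reduces to something like:

```json
{
  "name": "zendesk-tickets-sync",
  "schedule": "0 */6 * * *",
  "resources": [
    {
      "resource": "ticket",
      "query": {
        "updated_after": "{{previous_run_date}}"
      },
      "transform": "{ \"id\": id, \"subject\": subject, \"status\": status }"
    }
  ],
  "delivery": {
    "type": "webhook",
    "url": "https://api.example.com/sync/ingest"
  }
}
```

The cron expression drives the schedule, the `previous_run_date` binding makes the fetch incremental, and the JSONata string reshapes each record before delivery.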
## Shipping Integrations as Data Operations
The pattern here is the same one that transformed infrastructure management: move from imperative scripts to declarative specifications. Just as Terraform replaced shell scripts for provisioning servers, declarative sync pipelines replace hand-rolled ETL scripts for provisioning integrations. Relying on imperative ETL scripts, custom Python functions, and code-heavy orchestration platforms creates a maintenance burden that eventually stalls product development.
The practical impact for engineering leaders:
- New integrations become data operations, not engineering sprints. Adding a new CRM or HRIS source means writing a JSON configuration, not a new codebase.
- Incremental sync is a configuration flag, not an architecture project. Binding `previous_run_date` to a query filter takes 30 seconds.
- Transform logic lives alongside sync definitions, not in separate codebases. JSONata expressions are part of the sync job config - versioned, auditable, and hot-swappable.
- Rate limit handling is explicit and under your control. Standardized IETF headers give you the data you need to implement retry logic that matches your business requirements.
For product managers staring at a backlog of integration requests blocking six-figure deals, this architecture changes the economics. The question shifts from "How many engineers do we need to hire for integrations?" to "How quickly can we add a new sync configuration?"
Stop writing custom API clients. Start shipping integrations as data operations.
## FAQ
- What is a declarative data sync pipeline?
- A declarative data sync pipeline defines what data to pull from third-party APIs using configuration (typically JSON) rather than procedural code. The pipeline engine handles pagination, authentication, state tracking, and delivery automatically.
- How does RapidBridge handle incremental syncs?
- RapidBridge tracks state using a previous_run_date context variable per job and connected account. You bind this variable to a query filter in your config, allowing the engine to fetch only records updated since the last successful run.
- What is JSONata and why is it used for data transformation?
- JSONata is a functional query and transformation language for JSON data. It allows you to filter, reshape, and aggregate API responses using compact expressions stored as configuration strings, eliminating the need for complex nested loops in code.
- How does Truto handle API rate limits in sync pipelines?
- Truto does not automatically retry or absorb HTTP 429 errors. Instead, it passes the error back to the caller and normalizes upstream rate limits into IETF-standard headers (ratelimit-limit, ratelimit-remaining, ratelimit-reset) so you can implement precise retry logic.
- What is the difference between declarative sync pipelines and tools like Fivetran?
- Tools like Fivetran are built for internal analytics data warehousing and often charge per row (MAR). Declarative sync pipelines like RapidBridge are designed for embedded B2B integrations, syncing your customers' SaaS data directly into your product's operational datastore.