Staging Data Refresh¶

How and why production data is occasionally cloned onto staging, plus the safeguards that keep the result safe to run.

Looking for the procedure? See Replicate Prod → Staging. This page explains the design; that one is the runbook.

Why this exists¶

Staging is normally seeded with synthetic data, which is fine for unit tests and feature work but a poor proxy for real-world load and edge cases. Two recurring needs push us to occasionally mirror production:

QA against realistic volume. Performance and UX issues that only surface at production scale (organizations with thousands of products, generation histories spanning months) cannot be reproduced from a fresh seed.
Bug reproduction. Some defects only manifest against the specific shape of a real account's data — orphaned shoots, exotic style configurations, near-quota subscriptions.

The trade-off is obvious: cloning prod onto staging puts real PII on a less-hardened environment, and brings live partner-integration credentials with it. The procedure below is shaped end-to-end to remove those hazards before staging is unfrozen.

What gets replicated¶

Layer	Source	Destination	Mechanism
Backend PostgreSQL (`app` database)	latest snapshot in `s3://sartiq-snapshots/snapshots/backend/production/`	staging backend DB	`pg_dump` snapshot → drop+recreate → `psql <` restore, run inside the staging `pg-backup` container
R2 media bucket	`r2://shootify-media`	`r2://shootify-media-staging`	`rclone sync --server-side-across-configs --delete-during`

Two stores, two mechanisms — but both are read-only against production: only the snapshot bucket and the prod media bucket are ever read, never written. The compute server's PostgreSQL and Redis are deliberately out of scope: compute holds task / workflow state that has no meaning detached from the live API surface that produced it.

The PostgreSQL pipeline¶

flowchart LR
    subgraph Prod["Production VM"]
        ProdPg[(PostgreSQL)]
        ProdBackup[pg-backup]
        ProdPg -->|"every 3h<br/>pg_dump + zstd"| ProdBackup
    end

    subgraph R2["Cloudflare R2"]
        Snap[(sartiq-snapshots)]
        ProdBackup -->|S3 PUT| Snap
    end

    subgraph Staging["Staging VM"]
        StagePg[(PostgreSQL)]
        StageBackup[pg-backup]
    end

    Snap -->|"GET<br/>(restore-db.sh)"| StageBackup
    StageBackup -->|"DROP DATABASE app;<br/>CREATE DATABASE app;<br/>psql < dump.sql"| StagePg
    StageBackup -->|"anonymize SQL<br/>(anonymize-staging.sh)"| StagePg

Step-by-step¶

Choose a snapshot. Production's pg-backup (the tiredofit/db-backup sidecar — see Backup) writes a fresh dump every 3 hours. We pick one by listing r2://sartiq-snapshots/snapshots/backend/production/ and choosing the most recent that suits the QA need.
Restore via restore-db.sh --source-env production. The shared script (source) drops and recreates the staging app database, then streams the decompressed dump through psql. The --source-env flag was added so the script can read a snapshot from a directory that doesn't match the target environment — the restore-from-same-env case is still the default.
Anonymize via anonymize-staging.sh. A small post-restore shell script (source) connects to the staging DB through the pg-backup container and runs three UPDATEs in a single transaction (see below).

The FK-anchored admin re-key¶

The single most important invariant in this workflow is that the prod admin user's UUID must survive the restore unchanged — it is the foreign-key parent of nearly every owned resource (product.user_id, shooting.user_id, generation.user_id, external_system_integration.user_id, …; see app/models/user.py in shootify-backend). Deleting the row and re-inserting a fresh staging admin would orphan everything; we'd be staring at a working login but an empty UI.

The pattern is therefore: rewrite, don't replace.

UPDATE "user"
SET email = :staging_admin_email,
    full_name = 'Staging Admin',
    hashed_password = crypt(:staging_admin_password, gen_salt('bf', 12)),
    is_superuser = true,
    is_active = true
WHERE email = :prod_admin_email;

id is untouched; email and hashed_password are swapped to the staging credentials; the row keeps its place as the owner of every prod-side FK. The prod admin email is supplied as a script parameter (default info@shootify.io) so the lookup is explicit and can be audited from the command line.

The script aborts early if the lookup finds zero rows. This is intentional: a silent miss would have left staging without a working admin and with the prod admin's stale credentials still in the DB.

Anonymizing the rest of the user table¶

UPDATE "user"
SET email = 'anon_' || replace(id::text, '-', '') || '@staging.invalid',
    full_name = 'Anonymized User',
    hashed_password = crypt(gen_random_uuid()::text, gen_salt('bf', 12))
WHERE email <> :prod_admin_email;

A few choices worth calling out:

@staging.invalid uses the RFC 6761 reserved TLD, which never resolves — even if a celery task slips past our other guards and tries to send mail through the staging SMTP creds, delivery will fail at the resolver instead of reaching a real inbox.
crypt(gen_random_uuid()::text, gen_salt('bf', 12)) produces a fresh bcrypt hash per user with a random secret nobody knows. Cost factor 12 matches the backend's bcrypt.gensalt() default in app/core/security.py, so the column stays compatible with the verifier the API uses on login.
Why pgcrypto? The backend hashes via Python bcrypt, but doing 100k hashes in Python from the operator's laptop would be slow, fragile across environments, and would require shipping plaintext into the script. pgcrypto lets us do everything server-side in one transaction.

Disabling external integrations¶

UPDATE external_system_integration
SET push_products = false,
    push_service_base_url = NULL,
    push_service_api_key = NULL;

external_system_integration rows arrive in the dump with real partner credentials and push_products = true. Per app/services/image_push/service.py, the push code path early-returns whenever either push_service_base_url or push_service_api_key is null, so nulling both gives belt-and-braces protection: even if a configuration update somewhere flips push_products back on, no push can happen without the URL and key. Setting push_products = false on top of that ensures the push isn't even attempted.

The user table being anonymized first means external_system_integration.owner is now an anonymized account, but the integration row itself was created by the prod admin and is still owned by the now-staged admin UUID — fine, because integrations were never going to fire anyway.

Why not just truncate the user table?¶

Truncating would invalidate every FK pointing at user.id. We could detach FKs, truncate, and re-create — but the goal is to inspect prod-shaped data, and tearing out the user graph removes the very thing QA wants to look at. The "rewrite admin in place, anonymize others" pattern preserves the entire ownership graph while making it impossible to log in as anyone but the staging admin.

The R2 media pipeline¶

flowchart LR
    Prod[("r2://shootify-media<br/>(prod)")]
    Stage[("r2://shootify-media-staging<br/>(staging)")]
    Prod -->|"server-side CopyObject<br/>(rclone sync --server-side-across-configs)"| Stage
    Stage -.->|"--delete-during<br/>removes stale staging keys"| Stage

The two buckets live in the same Cloudflare R2 account, which makes the whole copy a sequence of S3 CopyObject calls. R2 executes them server-side: no bytes traverse the operator's machine, no egress is billed, and a 1.4 TB / 700k-object bucket finishes in ~30–60 minutes.

rclone sync is preferred over rclone copy because the destination should end up equal to the source. After the DB restore, the staging DB references exactly the prod set of media keys; any staging-only object is now dead data with no DB row pointing at it. --delete-during walks both listings and trims those orphans during the run.

Why no DB rewrite for image URLs?¶

Per app/services/storage/storage.py, image references in the database are stored as relative R2 keys (media/<resource_id>/file.<ext>). The serving URL is composed at read time from R2_PUBLIC_URL in the environment's .env — staging's already points at https://staging-media.sartiq.com, which is CNAME'd to staging's R2 bucket. So the rows we restored from prod work as-is on staging the moment the bucket sync completes.

If a future model ever stores absolute URLs, that assumption breaks and we'll need a rewrite step. Until then, the schema has actively chosen to keep the host out of the row, and this procedure leans on that choice.

What this procedure does NOT do¶

Compute server. The compute DB is not cloned. Cross-DB FKs are intentionally avoided in the schema so the two DBs can drift; the staging compute keeps its own, unrelated state.
Redis. Staging Redis is FLUSHALL-ed before bringing the backend back up. Any in-flight Celery jobs from before the cutover would have run against the now-deleted state, so dropping them is correct.
Caddy / TLS / DNS. None of this changes. Staging continues to be served at staging-api.sartiq.com and staging-media.sartiq.com.
Outbound email at the SMTP layer. The staging .env keeps live SMTP credentials. We rely on the anonymized email domain (@staging.invalid) being unresolvable rather than disabling SMTP, because the staging admin (you) still wants password-reset and similar transactional emails to deliver.

Risk model and what we trust¶

Risk	Mitigation	Residual
Real customer email leaks via outbound mail	All non-admin users get an `@staging.invalid` address (RFC 6761 reserved, unroutable)	Mail to a real address can only happen via the one preserved staging admin account
Real partner API push from staging	`push_service_base_url`, `push_service_api_key`, `push_products` all neutralised; the push code path requires the first two to be non-null	Code change that bypasses the early-return would reintroduce the risk; integration-test with the staging copy before the next refresh
Lost FK targets after restore	Admin row is rewritten in place; if it isn't found the script aborts before any other change is committed	None if the script completes
Stale orphan media on staging	`rclone sync --delete-during` removes any key not present in prod	Concurrent writes to staging during the sync would race; we mitigate by stopping the backend in step 3 of the runbook
Operator running it against prod by accident	`restore-db.sh` reads the SSH target from `GCP-Backend-staging/.env.destination`; `anonymize-staging.sh` does the same; the runbook never sources a prod env file	Only an explicit edit to the destination env file would redirect the procedure at prod

Replicate Prod → Staging — Step-by-step runbook
DB Recovery — Same restore script, narrower use case
Backup — Where the snapshots come from, how often, retention
Storage — R2 buckets and key conventions