Staging Data Refresh¶
How and why production data is occasionally cloned onto staging, plus the safeguards that keep the result safe to run.
Looking for the procedure? See Replicate Prod → Staging. This page explains the design; that one is the runbook.
Why this exists¶
Staging is normally seeded with synthetic data, which is fine for unit tests and feature work but a poor proxy for real-world load and edge cases. Two recurring needs push us to occasionally mirror production:
- QA against realistic volume. Performance and UX issues that only surface at production scale (organizations with thousands of products, generation histories spanning months) cannot be reproduced from a fresh seed.
- Bug reproduction. Some defects only manifest against the specific shape of a real account's data — orphaned shoots, exotic style configurations, near-quota subscriptions.
The trade-off is obvious: cloning prod onto staging puts real PII on a less-hardened environment, and brings live partner-integration credentials with it. The procedure below is shaped end-to-end to remove those hazards before staging is unfrozen.
What gets replicated¶
| Layer | Source | Destination | Mechanism |
|---|---|---|---|
Backend PostgreSQL (app database) |
latest snapshot in s3://sartiq-snapshots/snapshots/backend/production/ |
staging backend DB | pg_dump snapshot → drop+recreate → psql < restore, run inside the staging pg-backup container |
| R2 media bucket | r2://shootify-media |
r2://shootify-media-staging |
rclone sync --server-side-across-configs --delete-during |
Two stores, two mechanisms — but both are read-only against production: only the snapshot bucket and the prod media bucket are ever read, never written. The compute server's PostgreSQL and Redis are deliberately out of scope: compute holds task / workflow state that has no meaning detached from the live API surface that produced it.
The PostgreSQL pipeline¶
flowchart LR
subgraph Prod["Production VM"]
ProdPg[(PostgreSQL)]
ProdBackup[pg-backup]
ProdPg -->|"every 3h<br/>pg_dump + zstd"| ProdBackup
end
subgraph R2["Cloudflare R2"]
Snap[(sartiq-snapshots)]
ProdBackup -->|S3 PUT| Snap
end
subgraph Staging["Staging VM"]
StagePg[(PostgreSQL)]
StageBackup[pg-backup]
end
Snap -->|"GET<br/>(restore-db.sh)"| StageBackup
StageBackup -->|"DROP DATABASE app;<br/>CREATE DATABASE app;<br/>psql < dump.sql"| StagePg
StageBackup -->|"anonymize SQL<br/>(anonymize-staging.sh)"| StagePg
Step-by-step¶
- Choose a snapshot. Production's
pg-backup(thetiredofit/db-backupsidecar — see Backup) writes a fresh dump every 3 hours. We pick one by listingr2://sartiq-snapshots/snapshots/backend/production/and choosing the most recent that suits the QA need. - Restore via
restore-db.sh --source-env production. The shared script (source) drops and recreates the stagingappdatabase, then streams the decompressed dump throughpsql. The--source-envflag was added so the script can read a snapshot from a directory that doesn't match the target environment — the restore-from-same-env case is still the default. - Anonymize via
anonymize-staging.sh. A small post-restore shell script (source) connects to the staging DB through thepg-backupcontainer and runs threeUPDATEs in a single transaction (see below).
The FK-anchored admin re-key¶
The single most important invariant in this workflow is that the prod admin user's UUID must survive the restore unchanged — it is the foreign-key parent of nearly every owned resource (product.user_id, shooting.user_id, generation.user_id, external_system_integration.user_id, …; see app/models/user.py in shootify-backend). Deleting the row and re-inserting a fresh staging admin would orphan everything; we'd be staring at a working login but an empty UI.
The pattern is therefore: rewrite, don't replace.
UPDATE "user"
SET email = :staging_admin_email,
full_name = 'Staging Admin',
hashed_password = crypt(:staging_admin_password, gen_salt('bf', 12)),
is_superuser = true,
is_active = true
WHERE email = :prod_admin_email;
id is untouched; email and hashed_password are swapped to the staging credentials; the row keeps its place as the owner of every prod-side FK. The prod admin email is supplied as a script parameter (default info@shootify.io) so the lookup is explicit and can be audited from the command line.
The script aborts early if the lookup finds zero rows. This is intentional: a silent miss would have left staging without a working admin and with the prod admin's stale credentials still in the DB.
Anonymizing the rest of the user table¶
UPDATE "user"
SET email = 'anon_' || replace(id::text, '-', '') || '@staging.invalid',
full_name = 'Anonymized User',
hashed_password = crypt(gen_random_uuid()::text, gen_salt('bf', 12))
WHERE email <> :prod_admin_email;
A few choices worth calling out:
@staging.invaliduses the RFC 6761 reserved TLD, which never resolves — even if a celery task slips past our other guards and tries to send mail through the staging SMTP creds, delivery will fail at the resolver instead of reaching a real inbox.crypt(gen_random_uuid()::text, gen_salt('bf', 12))produces a fresh bcrypt hash per user with a random secret nobody knows. Cost factor 12 matches the backend'sbcrypt.gensalt()default inapp/core/security.py, so the column stays compatible with the verifier the API uses on login.- Why
pgcrypto? The backend hashes via Pythonbcrypt, but doing 100k hashes in Python from the operator's laptop would be slow, fragile across environments, and would require shipping plaintext into the script.pgcryptolets us do everything server-side in one transaction.
Disabling external integrations¶
UPDATE external_system_integration
SET push_products = false,
push_service_base_url = NULL,
push_service_api_key = NULL;
external_system_integration rows arrive in the dump with real partner credentials and push_products = true. Per app/services/image_push/service.py, the push code path early-returns whenever either push_service_base_url or push_service_api_key is null, so nulling both gives belt-and-braces protection: even if a configuration update somewhere flips push_products back on, no push can happen without the URL and key. Setting push_products = false on top of that ensures the push isn't even attempted.
The user table being anonymized first means external_system_integration.owner is now an anonymized account, but the integration row itself was created by the prod admin and is still owned by the now-staged admin UUID — fine, because integrations were never going to fire anyway.
Why not just truncate the user table?¶
Truncating would invalidate every FK pointing at user.id. We could detach FKs, truncate, and re-create — but the goal is to inspect prod-shaped data, and tearing out the user graph removes the very thing QA wants to look at. The "rewrite admin in place, anonymize others" pattern preserves the entire ownership graph while making it impossible to log in as anyone but the staging admin.
The R2 media pipeline¶
flowchart LR
Prod[("r2://shootify-media<br/>(prod)")]
Stage[("r2://shootify-media-staging<br/>(staging)")]
Prod -->|"server-side CopyObject<br/>(rclone sync --server-side-across-configs)"| Stage
Stage -.->|"--delete-during<br/>removes stale staging keys"| Stage
The two buckets live in the same Cloudflare R2 account, which makes the whole copy a sequence of S3 CopyObject calls. R2 executes them server-side: no bytes traverse the operator's machine, no egress is billed, and a 1.4 TB / 700k-object bucket finishes in ~30–60 minutes.
rclone sync is preferred over rclone copy because the destination should end up equal to the source. After the DB restore, the staging DB references exactly the prod set of media keys; any staging-only object is now dead data with no DB row pointing at it. --delete-during walks both listings and trims those orphans during the run.
Why no DB rewrite for image URLs?¶
Per app/services/storage/storage.py, image references in the database are stored as relative R2 keys (media/<resource_id>/file.<ext>). The serving URL is composed at read time from R2_PUBLIC_URL in the environment's .env — staging's already points at https://staging-media.sartiq.com, which is CNAME'd to staging's R2 bucket. So the rows we restored from prod work as-is on staging the moment the bucket sync completes.
If a future model ever stores absolute URLs, that assumption breaks and we'll need a rewrite step. Until then, the schema has actively chosen to keep the host out of the row, and this procedure leans on that choice.
What this procedure does NOT do¶
- Compute server. The compute DB is not cloned. Cross-DB FKs are intentionally avoided in the schema so the two DBs can drift; the staging compute keeps its own, unrelated state.
- Redis. Staging Redis is
FLUSHALL-ed before bringing the backend back up. Any in-flight Celery jobs from before the cutover would have run against the now-deleted state, so dropping them is correct. - Caddy / TLS / DNS. None of this changes. Staging continues to be served at
staging-api.sartiq.comandstaging-media.sartiq.com. - Outbound email at the SMTP layer. The staging
.envkeeps live SMTP credentials. We rely on the anonymized email domain (@staging.invalid) being unresolvable rather than disabling SMTP, because the staging admin (you) still wants password-reset and similar transactional emails to deliver.
Risk model and what we trust¶
| Risk | Mitigation | Residual |
|---|---|---|
| Real customer email leaks via outbound mail | All non-admin users get an @staging.invalid address (RFC 6761 reserved, unroutable) |
Mail to a real address can only happen via the one preserved staging admin account |
| Real partner API push from staging | push_service_base_url, push_service_api_key, push_products all neutralised; the push code path requires the first two to be non-null |
Code change that bypasses the early-return would reintroduce the risk; integration-test with the staging copy before the next refresh |
| Lost FK targets after restore | Admin row is rewritten in place; if it isn't found the script aborts before any other change is committed | None if the script completes |
| Stale orphan media on staging | rclone sync --delete-during removes any key not present in prod |
Concurrent writes to staging during the sync would race; we mitigate by stopping the backend in step 3 of the runbook |
| Operator running it against prod by accident | restore-db.sh reads the SSH target from GCP-Backend-staging/.env.destination; anonymize-staging.sh does the same; the runbook never sources a prod env file |
Only an explicit edit to the destination env file would redirect the procedure at prod |
Related documentation¶
- Replicate Prod → Staging — Step-by-step runbook
- DB Recovery — Same restore script, narrower use case
- Backup — Where the snapshots come from, how often, retention
- Storage — R2 buckets and key conventions