Replicate Production Data onto Staging¶
Runbook for cloning the backend PostgreSQL database and the R2 media bucket from production into staging, then sanitizing the result.
Scope. Backend only. The compute server's PostgreSQL and Redis are out of scope and must not be touched.
Frequency. Expected to be run rarely (typically once, when staging needs to mirror real data for QA or training). This runbook exists so it can be reproduced safely.
Conceptual background. See Staging Data Refresh for the design rationale, what is preserved vs anonymized, and why each step is shaped the way it is.
What gets cloned¶
| Layer | Source | Destination | Tool |
|---|---|---|---|
PostgreSQL backend.app |
latest snapshot under s3://sartiq-snapshots/snapshots/backend/production/ |
staging backend DB on VM 34.154.100.70 |
scripts/restore-db.sh ... --source-env production |
| R2 media | r2://shootify-media |
r2://shootify-media-staging |
rclone sync --server-side-across-configs |
R2 image keys are stored as relative keys (e.g. media/<uuid>/photo.jpg) — they don't include a hostname, so no DB rewrite is needed. The serving URL is built at runtime from R2_PUBLIC_URL, which on staging already points at https://staging-media.sartiq.com.
Production safety guarantees¶
This procedure leaves production data, code, and images completely untouched. The only operations against prod-owned resources are:
LISTandGETons3://sartiq-snapshots/snapshots/backend/production/LISTand server-sideCopyObject(aGET) onr2://shootify-media
Every mutation (DROP DATABASE, anonymization SQL, docker compose stop, redis FLUSHALL, bucket writes, deletes) is scoped to staging. The runbook never SSHes into a prod VM and never authenticates with write scope on a prod resource.
Prerequisites¶
Tools¶
R2 credentials (S3-compatible)¶
Both buckets live in the same Cloudflare account (d17a38676462caea7952a0be1262f0be). One credential pair with read on shootify-media and read+write on shootify-media-staging is sufficient.
export R2_ACCOUNT_ID=d17a38676462caea7952a0be1262f0be
export R2_ACCESS_KEY_ID=...
export R2_SECRET_ACCESS_KEY=...
mkdir -p ~/.config/rclone
cat >> ~/.config/rclone/rclone.conf <<EOF
[r2-prod]
type = s3
provider = Cloudflare
access_key_id = $R2_ACCESS_KEY_ID
secret_access_key = $R2_SECRET_ACCESS_KEY
endpoint = https://$R2_ACCOUNT_ID.r2.cloudflarestorage.com
acl = private
[r2-staging]
type = s3
provider = Cloudflare
access_key_id = $R2_ACCESS_KEY_ID
secret_access_key = $R2_SECRET_ACCESS_KEY
endpoint = https://$R2_ACCOUNT_ID.r2.cloudflarestorage.com
acl = private
EOF
SSH access to the staging backend VM¶
The IP and user come from shootify-deployment/GCP-Backend-staging/.env.destination. Quick check:
Inputs you will need to supply¶
- Prod admin email — value of
FIRST_SUPERUSERin production's.env. Used byanonymize-staging.shto find the row to re-key. Defaults toinfo@shootify.io; pass--prod-admin-email <other>only if production was bootstrapped with a different superuser. Don't confuse with staging'sFIRST_SUPERUSER. - Staging admin email + password — read automatically from
/home/ubuntu/services/shootify-deploy/.envon the staging VM (FIRST_SUPERUSERandFIRST_SUPERUSER_PASSWORD). Override with--staging-admin-*flags only if you want to avoid sourcing the remote file.
Procedure¶
All commands run from the shootify-deployment repo root unless noted otherwise.
1. Announce a maintenance window¶
Staging is unusable for the duration of the restore (~10–60 min depending on DB and bucket size). Notify whoever uses staging-app.sartiq.com and staging-api.sartiq.com.
2. Pick the prod snapshot¶
# List recent prod backup snapshots, newest last
rclone lsf r2-prod:sartiq-snapshots/snapshots/backend/production/ | sort | tail
Production is on a 3-hour schedule (0 0,6,9,12,15,18,21 * * *, UTC), so a recent snapshot is always available. Hold the chosen filename:
3. Stop staging writers and drain Redis¶
ssh ubuntu@34.154.100.70 \
'cd /home/ubuntu/services/shootify-deploy && docker compose stop backend scheduler worker pg-backup'
ssh ubuntu@34.154.100.70 \
'cd /home/ubuntu/services/shootify-deploy && docker compose exec -T redis redis-cli FLUSHALL'
Flushing Redis prevents queued Celery jobs from running against the freshly-restored data after the cutover. The db and caddy containers stay up — db because we restore into it, caddy so users hitting the URL get a clean connection refused on :8000 rather than a TLS error.
4. Restore the database¶
The script prompts before dropping the staging DB; confirm only after double-checking the snapshot path printed in the banner.
In order, it:
- SSHes into staging.
- Inside the staging
pg-backupcontainer:aws s3 cpfroms3://sartiq-snapshots/snapshots/backend/production/$SNAPSHOTDROP DATABASE app; CREATE DATABASE app;zstd -dandpsql < dump.sql
The dump is taken with --clean --if-exists, so any CREATE EXTENSION / role mismatches between prod and staging surface as warnings rather than aborts. Skim the output: only schema/table-level errors after the fresh DB is created should be fatal.
5. Anonymize the restore¶
# Default --prod-admin-email is info@shootify.io
./scripts/anonymize-staging.sh
# or, if production's FIRST_SUPERUSER differs:
./scripts/anonymize-staging.sh --prod-admin-email someone-else@shootify.io
The script prompts once before mutating, then in a single transaction:
UPDATE "user" SET email='anon_<uuid>@staging.invalid', full_name='Anonymized User', hashed_password=<random bcrypt> WHERE email <> :prod_admin_emailUPDATE "user" SET email=:staging_admin_email, full_name='Staging Admin', hashed_password=<bcrypt of :staging_admin_password>, is_superuser=true, is_active=true WHERE email = :prod_admin_emailUPDATE external_system_integration SET push_products=false, push_service_base_url=NULL, push_service_api_key=NULL
Bcrypt hashes are produced server-side via pgcrypto (crypt(value, gen_salt('bf', 12))), matching the cost factor used by the backend's bcrypt.gensalt().
The post-check at the end prints row counts. Sanity-check before continuing:
staging_admin_rows = 1still_push_products = 0still_have_push_url = 0
If the script aborts with No user found with prod admin email ..., do not continue: either you passed the wrong email or the dump isn't what you think it is. The admin row is the FK anchor for almost all owned data; a miss here would leave staging without a working admin login.
6. Sync the media bucket¶
R2 server-side copy stays inside the same Cloudflare account, so this is fast and incurs no egress charges.
# Dry run — eyeball the count
rclone sync r2-prod:shootify-media r2-staging:shootify-media-staging \
--server-side-across-configs \
--transfers 64 --checkers 64 \
--fast-list \
--dry-run
# Real run. --delete-during removes anything in staging that isn't in prod.
rclone sync r2-prod:shootify-media r2-staging:shootify-media-staging \
--server-side-across-configs \
--transfers 64 --checkers 64 \
--fast-list \
--delete-during \
--progress
We use sync (not copy) so the destination ends up equal to the source. Stale staging keys would no longer be referenced by the freshly-restored DB, so leaving them behind would be dead data — and --delete-during prevents that.
For a bucket the size of shootify-media (~1.4 TB / 700K objects per the r2-backup-worker README), expect ~30–60 minutes. The transfer can be interrupted and restarted: rclone sync is incremental.
7. Bring staging back up¶
Tail the backend logs until it stabilises:
ssh ubuntu@34.154.100.70 \
'cd /home/ubuntu/services/shootify-deploy && docker compose logs -f backend'
If the staging container image's schema is ahead of the prod dump, alembic at startup will apply the missing migrations on top of the restored data. A migration failure is not normal — abort and investigate.
8. Smoke test¶
-
curl https://staging-api.sartiq.com/api/v1/utils/health-check/returns 200. - Log in to
https://staging-app.sartiq.comwith the staging admin credentials. The dashboard should show prod-volume data. - Open one product / shoot — its image should load from
staging-media.sartiq.com. - Spot-check
SELECT count(*)on"user",product,shooting,generationagainst expectations from prod. -
SELECT email FROM "user" WHERE email NOT LIKE 'anon_%' LIMIT 5;should return only the staging admin email. -
SELECT count(*) FROM external_system_integration WHERE push_products = true;should be0.
9. Re-enable backups on staging¶
docker compose up -d already restarted pg-backup. Verify the next scheduled window writes to s3://sartiq-snapshots/snapshots/backend/staging/. The schedule is 0 0 * * * UTC for staging — see Backup.
Rollback¶
We don't take an explicit pre-restore snapshot of the staging DB — staging's own backup runs daily at 00:00 UTC and is retained for 30 days, so the previous day's dump is in s3://sartiq-snapshots/snapshots/backend/staging/. To revert:
rclone lsf r2-prod:sartiq-snapshots/snapshots/backend/staging/ | sort | tail
./scripts/restore-db.sh backend staging <staging_snapshot_filename>
For the media bucket there is no automatic rollback — rclone sync --delete-during removed any staging-only objects. The production R2 backup worker (sartiq-media-backup) only covers shootify-media, not the staging bucket. If you anticipate needing a rollback, omit --delete-during in step 6 and accept some stale objects in staging instead.
Troubleshooting¶
No user found with prod admin email <...> from anonymize-staging.sh¶
Either the dump didn't restore, the prod admin email is different from what was passed, or the dump came from a different application. Verify by querying the staging DB:
ssh ubuntu@34.154.100.70 \
'cd /home/ubuntu/services/shootify-deploy && \
docker compose exec -T pg-backup bash -c "PGPASSWORD=\$DB01_PASS psql -h \$DB01_HOST -U \$DB01_USER -d \$DB01_NAME -c \"SELECT email, is_superuser FROM \\\"user\\\" WHERE is_superuser = true;\""'
Pick the right email and re-run with --prod-admin-email.
permission denied to create extension "pgcrypto"¶
The DB role used by pg-backup (POSTGRES_USER, typically postgres) is a superuser on stock postgres:17 images, so this should not happen. If it does, exec a psql shell as a superuser and run CREATE EXTENSION pgcrypto; once, then re-run the anonymizer.
rclone sync hits "Forbidden" or 0 transfers¶
Re-check the credentials in ~/.config/rclone/rclone.conf. The same key needs read on shootify-media and write on shootify-media-staging. Both buckets are in the same account, so a token scoped to that account works for both.
Related Documentation¶
- Staging Data Refresh — Design rationale, FK-anchored admin re-key, anonymization guarantees
- DB Recovery — Same
restore-db.shscript, used here under--source-env production - Backup — Backup schedule, retention, R2 bucket layout
- Storage —
shootify-mediaandshootify-media-stagingbucket details