Replicate Production Data onto Staging¶

Runbook for cloning the backend PostgreSQL database and the R2 media bucket from production into staging, then sanitizing the result.

Scope. Backend only. The compute server's PostgreSQL and Redis are out of scope and must not be touched.

Frequency. Expected to be run rarely (typically once, when staging needs to mirror real data for QA or training). This runbook exists so it can be reproduced safely.

Conceptual background. See Staging Data Refresh for the design rationale, what is preserved vs anonymized, and why each step is shaped the way it is.

What gets cloned¶

Layer	Source	Destination	Tool
PostgreSQL `backend.app`	latest snapshot under `s3://sartiq-snapshots/snapshots/backend/production/`	staging backend DB on VM `34.154.100.70`	`scripts/restore-db.sh ... --source-env production`
R2 media	`r2://shootify-media`	`r2://shootify-media-staging`	`rclone sync --server-side-across-configs`

R2 image keys are stored as relative keys (e.g. media/<uuid>/photo.jpg) — they don't include a hostname, so no DB rewrite is needed. The serving URL is built at runtime from R2_PUBLIC_URL, which on staging already points at https://staging-media.sartiq.com.

Production safety guarantees¶

This procedure leaves production data, code, and images completely untouched. The only operations against prod-owned resources are:

LIST and GET on s3://sartiq-snapshots/snapshots/backend/production/
LIST and server-side CopyObject (a GET) on r2://shootify-media

Every mutation (DROP DATABASE, anonymization SQL, docker compose stop, redis FLUSHALL, bucket writes, deletes) is scoped to staging. The runbook never SSHes into a prod VM and never authenticates with write scope on a prod resource.

Prerequisites¶

Tools¶

sudo dnf install -y rclone   # or: brew install rclone

R2 credentials (S3-compatible)¶

Both buckets live in the same Cloudflare account (d17a38676462caea7952a0be1262f0be). One credential pair with read on shootify-media and read+write on shootify-media-staging is sufficient.

export R2_ACCOUNT_ID=d17a38676462caea7952a0be1262f0be
export R2_ACCESS_KEY_ID=...
export R2_SECRET_ACCESS_KEY=...

mkdir -p ~/.config/rclone
cat >> ~/.config/rclone/rclone.conf <<EOF
[r2-prod]
type = s3
provider = Cloudflare
access_key_id = $R2_ACCESS_KEY_ID
secret_access_key = $R2_SECRET_ACCESS_KEY
endpoint = https://$R2_ACCOUNT_ID.r2.cloudflarestorage.com
acl = private

[r2-staging]
type = s3
provider = Cloudflare
access_key_id = $R2_ACCESS_KEY_ID
secret_access_key = $R2_SECRET_ACCESS_KEY
endpoint = https://$R2_ACCOUNT_ID.r2.cloudflarestorage.com
acl = private
EOF

SSH access to the staging backend VM¶

The IP and user come from shootify-deployment/GCP-Backend-staging/.env.destination. Quick check:

ssh -o ConnectTimeout=5 ubuntu@34.154.100.70 'echo ok'

Inputs you will need to supply¶

Prod admin email — value of FIRST_SUPERUSER in production's .env. Used by anonymize-staging.sh to find the row to re-key. Defaults to info@shootify.io; pass --prod-admin-email <other> only if production was bootstrapped with a different superuser. Don't confuse with staging's FIRST_SUPERUSER.
Staging admin email + password — read automatically from /home/ubuntu/services/shootify-deploy/.env on the staging VM (FIRST_SUPERUSER and FIRST_SUPERUSER_PASSWORD). Override with --staging-admin-* flags only if you want to avoid sourcing the remote file.

Procedure¶

All commands run from the shootify-deployment repo root unless noted otherwise.

1. Announce a maintenance window¶

Staging is unusable for the duration of the restore (~10–60 min depending on DB and bucket size). Notify whoever uses staging-app.sartiq.com and staging-api.sartiq.com.

2. Pick the prod snapshot¶

# List recent prod backup snapshots, newest last
rclone lsf r2-prod:sartiq-snapshots/snapshots/backend/production/ | sort | tail

Production is on a 3-hour schedule (0 0,6,9,12,15,18,21 * * *, UTC), so a recent snapshot is always available. Hold the chosen filename:

SNAPSHOT=pgsql_app_db_YYYYMMDD-HHMMSS.sql.zst

3. Stop staging writers and drain Redis¶

ssh ubuntu@34.154.100.70 \
  'cd /home/ubuntu/services/shootify-deploy && docker compose stop backend scheduler worker pg-backup'

ssh ubuntu@34.154.100.70 \
  'cd /home/ubuntu/services/shootify-deploy && docker compose exec -T redis redis-cli FLUSHALL'

Flushing Redis prevents queued Celery jobs from running against the freshly-restored data after the cutover. The db and caddy containers stay up — db because we restore into it, caddy so users hitting the URL get a clean connection refused on :8000 rather than a TLS error.

4. Restore the database¶

./scripts/restore-db.sh backend staging "$SNAPSHOT" --source-env production

The script prompts before dropping the staging DB; confirm only after double-checking the snapshot path printed in the banner.

In order, it:

SSHes into staging.
Inside the staging pg-backup container:
- aws s3 cp from s3://sartiq-snapshots/snapshots/backend/production/$SNAPSHOT
- DROP DATABASE app; CREATE DATABASE app;
- zstd -d and psql < dump.sql

The dump is taken with --clean --if-exists, so any CREATE EXTENSION / role mismatches between prod and staging surface as warnings rather than aborts. Skim the output: only schema/table-level errors after the fresh DB is created should be fatal.

5. Anonymize the restore¶

# Default --prod-admin-email is info@shootify.io
./scripts/anonymize-staging.sh
# or, if production's FIRST_SUPERUSER differs:
./scripts/anonymize-staging.sh --prod-admin-email someone-else@shootify.io

The script prompts once before mutating, then in a single transaction:

UPDATE "user" SET email='anon_<uuid>@staging.invalid', full_name='Anonymized User', hashed_password=<random bcrypt> WHERE email <> :prod_admin_email
UPDATE "user" SET email=:staging_admin_email, full_name='Staging Admin', hashed_password=<bcrypt of :staging_admin_password>, is_superuser=true, is_active=true WHERE email = :prod_admin_email
UPDATE external_system_integration SET push_products=false, push_service_base_url=NULL, push_service_api_key=NULL

Bcrypt hashes are produced server-side via pgcrypto (crypt(value, gen_salt('bf', 12))), matching the cost factor used by the backend's bcrypt.gensalt().

The post-check at the end prints row counts. Sanity-check before continuing:

staging_admin_rows = 1
still_push_products = 0
still_have_push_url = 0

If the script aborts with No user found with prod admin email ..., do not continue: either you passed the wrong email or the dump isn't what you think it is. The admin row is the FK anchor for almost all owned data; a miss here would leave staging without a working admin login.

6. Sync the media bucket¶

R2 server-side copy stays inside the same Cloudflare account, so this is fast and incurs no egress charges.

# Dry run — eyeball the count
rclone sync r2-prod:shootify-media r2-staging:shootify-media-staging \
  --server-side-across-configs \
  --transfers 64 --checkers 64 \
  --fast-list \
  --dry-run

# Real run. --delete-during removes anything in staging that isn't in prod.
rclone sync r2-prod:shootify-media r2-staging:shootify-media-staging \
  --server-side-across-configs \
  --transfers 64 --checkers 64 \
  --fast-list \
  --delete-during \
  --progress

We use sync (not copy) so the destination ends up equal to the source. Stale staging keys would no longer be referenced by the freshly-restored DB, so leaving them behind would be dead data — and --delete-during prevents that.

For a bucket the size of shootify-media (~1.4 TB / 700K objects per the r2-backup-worker README), expect ~30–60 minutes. The transfer can be interrupted and restarted: rclone sync is incremental.

7. Bring staging back up¶

ssh ubuntu@34.154.100.70 \
  'cd /home/ubuntu/services/shootify-deploy && docker compose up -d'

Tail the backend logs until it stabilises:

ssh ubuntu@34.154.100.70 \
  'cd /home/ubuntu/services/shootify-deploy && docker compose logs -f backend'

If the staging container image's schema is ahead of the prod dump, alembic at startup will apply the missing migrations on top of the restored data. A migration failure is not normal — abort and investigate.

8. Smoke test¶

curl https://staging-api.sartiq.com/api/v1/utils/health-check/ returns 200.
Log in to https://staging-app.sartiq.com with the staging admin credentials. The dashboard should show prod-volume data.
Open one product / shoot — its image should load from staging-media.sartiq.com.
Spot-check SELECT count(*) on "user", product, shooting, generation against expectations from prod.
SELECT email FROM "user" WHERE email NOT LIKE 'anon_%' LIMIT 5; should return only the staging admin email.
SELECT count(*) FROM external_system_integration WHERE push_products = true; should be 0.

9. Re-enable backups on staging¶

docker compose up -d already restarted pg-backup. Verify the next scheduled window writes to s3://sartiq-snapshots/snapshots/backend/staging/. The schedule is 0 0 * * * UTC for staging — see Backup.

Rollback¶

We don't take an explicit pre-restore snapshot of the staging DB — staging's own backup runs daily at 00:00 UTC and is retained for 30 days, so the previous day's dump is in s3://sartiq-snapshots/snapshots/backend/staging/. To revert:

rclone lsf r2-prod:sartiq-snapshots/snapshots/backend/staging/ | sort | tail
./scripts/restore-db.sh backend staging <staging_snapshot_filename>

For the media bucket there is no automatic rollback — rclone sync --delete-during removed any staging-only objects. The production R2 backup worker (sartiq-media-backup) only covers shootify-media, not the staging bucket. If you anticipate needing a rollback, omit --delete-during in step 6 and accept some stale objects in staging instead.

Troubleshooting¶

`No user found with prod admin email <...>` from `anonymize-staging.sh`¶

Either the dump didn't restore, the prod admin email is different from what was passed, or the dump came from a different application. Verify by querying the staging DB:

ssh ubuntu@34.154.100.70 \
  'cd /home/ubuntu/services/shootify-deploy && \
   docker compose exec -T pg-backup bash -c "PGPASSWORD=\$DB01_PASS psql -h \$DB01_HOST -U \$DB01_USER -d \$DB01_NAME -c \"SELECT email, is_superuser FROM \\\"user\\\" WHERE is_superuser = true;\""'

Pick the right email and re-run with --prod-admin-email.

`permission denied to create extension "pgcrypto"`¶

The DB role used by pg-backup (POSTGRES_USER, typically postgres) is a superuser on stock postgres:17 images, so this should not happen. If it does, exec a psql shell as a superuser and run CREATE EXTENSION pgcrypto; once, then re-run the anonymizer.

`rclone sync` hits "Forbidden" or 0 transfers¶

Re-check the credentials in ~/.config/rclone/rclone.conf. The same key needs read on shootify-media and write on shootify-media-staging. Both buckets are in the same account, so a token scoped to that account works for both.

Staging Data Refresh — Design rationale, FK-anchored admin re-key, anonymization guarantees
DB Recovery — Same restore-db.sh script, used here under --source-env production
Backup — Backup schedule, retention, R2 bucket layout
Storage — shootify-media and shootify-media-staging bucket details