MediaResource Lifecycle

This document describes the MediaResource system — the centralized, content-addressable media management layer introduced to replace per-entity storage path conventions with a unified, deduplicated approach.


Overview

Why MediaResource Exists

Before MediaResource, each entity type (Product, Subject, Style, etc.) managed its own storage paths with bespoke naming conventions (images/products/{id}/..., images/subjects/...). This led to:

  • No deduplication — identical files uploaded twice consumed double storage
  • Fragile path coupling — entity-specific path formats baked into multiple services
  • No centralized cleanup — orphaned files accumulated when entities were deleted
  • Inconsistent metadata — dimensions, hashes, and file info tracked differently per entity

MediaResource solves these problems with:

  • Content-addressable storage — files identified by MD5 content hash, automatic dedup
  • Entity-agnostic canonical paths — all files stored at media/{resource_id}/file.{ext}
  • Polymorphic attachments — one join table links any entity to any media resource
  • Resilient deletion — deferred post-commit deletions with dead-letter retry and orphan scanning

Data Model

Entity-Relationship Diagram

erDiagram
    MediaResource ||--o{ MediaResourceAttachment : "has many"
    MediaResource {
        uuid id PK
        enum resource_type "IMAGE | VIDEO"
        string storage_key "R2 path"
        string content_hash "MD5 for dedup"
        string extension
        int file_size
        int width
        int height
        string dominant_color
        bool protected
        string alt_text
        string caption
        string color_profile "images only"
        float duration "videos only"
        datetime created_at
    }

    MediaResourceAttachment {
        uuid id PK
        uuid media_resource_id FK
        enum entity_type "PRODUCT | SUBJECT | ..."
        uuid entity_id
        string slot "e.g. cover_image"
        int position "NULL for single slots"
        datetime created_at
    }

    MediaDeletionDeadLetter {
        uuid id PK
        string storage_key
        string error_message
        int attempts
        datetime created_at
        datetime last_attempted_at
    }

MediaResource

The central entity representing a unique media file in storage.

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| resource_type | IMAGE \| VIDEO | Single-table inheritance discriminator |
| storage_key | string | Relative R2 path (e.g., media/{id}/file.webp) |
| content_hash | string | MD5 hash for deduplication |
| extension | string | File extension (e.g., webp, png) |
| file_size | int | Size in bytes |
| width / height | int? | Pixel dimensions |
| dominant_color | string? | Hex color code |
| protected | bool | Prevents deletion by orphan scanner |
| alt_text / caption | string? | Accessibility and display text |
| color_profile | string? | ICC profile (images only) |
| duration | float? | Length in seconds (videos only) |

Computed properties:

  • aspect_ratio: width / height (if both present)
  • orientation: "portrait", "landscape", or "square"
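
The computed properties can be sketched in plain Python. This is an illustrative stand-in for the actual model class (which this document does not show in full), keeping only the width/height logic:

```python
from typing import Optional

class MediaDimensions:
    """Illustrative stand-in for the computed properties on MediaResource."""

    def __init__(self, width: Optional[int], height: Optional[int]):
        self.width = width
        self.height = height

    @property
    def aspect_ratio(self) -> Optional[float]:
        # width / height, defined only when both dimensions are known
        if self.width and self.height:
            return self.width / self.height
        return None

    @property
    def orientation(self) -> Optional[str]:
        ratio = self.aspect_ratio
        if ratio is None:
            return None
        if ratio < 1:
            return "portrait"
        if ratio > 1:
            return "landscape"
        return "square"
```

With the dimensions from the sample response below (1024 × 1536), this yields a "portrait" orientation and an aspect ratio of roughly 0.667.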

MediaResourceAttachment

Polymorphic join table connecting any entity to any MediaResource.

| Field | Type | Description |
| --- | --- | --- |
| media_resource_id | UUID (FK) | Points to media_resource.id |
| entity_type | MediaEntityType | Which entity table (e.g., PRODUCT, PREDICTION) |
| entity_id | UUID | The owning entity's primary key |
| slot | string | Field name on the entity (e.g., cover_image, reference_images) |
| position | int? | Order within list slots; NULL for single-value slots |

Uniqueness constraints:

  • Single slots: UNIQUE(entity_type, entity_id, slot) WHERE position IS NULL
  • List slots: UNIQUE(entity_type, entity_id, slot, position) WHERE position IS NOT NULL
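
These two constraints can be expressed as partial unique indexes. A sketch using stdlib sqlite3 for illustration only — the production schema presumably lives in Postgres migrations, and the index names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE media_resource_attachment (
        id INTEGER PRIMARY KEY,
        media_resource_id TEXT,
        entity_type TEXT,
        entity_id TEXT,
        slot TEXT,
        position INTEGER
    )
""")
# Single-value slots: at most one row per (entity, slot) when position is NULL.
conn.execute("""
    CREATE UNIQUE INDEX uq_attachment_single_slot
    ON media_resource_attachment (entity_type, entity_id, slot)
    WHERE position IS NULL
""")
# List slots: at most one row per (entity, slot, position) when position is set.
conn.execute("""
    CREATE UNIQUE INDEX uq_attachment_list_slot
    ON media_resource_attachment (entity_type, entity_id, slot, position)
    WHERE position IS NOT NULL
""")
```

Partial indexes let both slot shapes coexist in one table: a second cover_image row for the same entity is rejected, while reference_images rows with distinct positions are accepted.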

MediaDeletionDeadLetter

Records failed R2 storage deletions for retry by the scheduler.

| Field | Type | Description |
| --- | --- | --- |
| storage_key | string | R2 path that failed to delete |
| error_message | string | Failure reason |
| attempts | int | Retry counter (starts at 1) |
| last_attempted_at | datetime | Most recent retry timestamp |

Canonical Path

All new media files are stored at:

media/{resource_id}/file.{ext}

Regex: ^media/[0-9a-f\-]{36}/file\.\w+$

Benefits:

  • Entity-agnostic — renaming or re-typing an entity requires no file move
  • Predictable — any resource_id maps to exactly one path
  • No naming collisions — UUID guarantees uniqueness

Legacy paths (images/products/..., images/subjects/..., shooting_looks/...) remain valid for pre-existing data. Both legacy and canonical paths resolve correctly via CDN.
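A minimal sketch of the path rule. The regex comes from this document; the canonical_storage_key helper name is hypothetical, not taken from the codebase:

```python
import re
import uuid

# Canonical-path regex as documented above.
CANONICAL_PATH_RE = re.compile(r"^media/[0-9a-f\-]{36}/file\.\w+$")

def canonical_storage_key(resource_id: uuid.UUID, extension: str) -> str:
    # Hypothetical helper: every resource_id maps to exactly one path.
    return f"media/{resource_id}/file.{extension}"
```

A key built this way always matches the canonical regex, while legacy paths such as images/products/... do not — which is how code can distinguish the two formats during the transition.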


Ingestion Flows

Path A: User Uploads (via ingest_all)

Used when creating or updating entities from API requests with file URLs.

sequenceDiagram
    participant FE as Frontend
    participant R2 as R2 Storage
    participant BE as Backend
    participant DB as Database

    FE->>R2: PUT file to presigned URL
    Note over R2: File at temp/{file_id}_{ts}_{name}

    FE->>BE: POST /products (with temp URLs)
    BE->>R2: HEAD temp file (get ETag, size)
    BE->>BE: Resolve content_hash from ETag

    alt Dedup hit (hash exists)
        BE->>R2: DELETE temp file
        BE->>DB: Reuse existing MediaResource
    else Dedup miss (new file)
        BE->>DB: CREATE MediaResource
        BE->>R2: COPY temp → media/{resource_id}/file.{ext}
        BE->>DB: UPDATE storage_key to canonical path
    end

    BE->>DB: CREATE MediaResourceAttachment
    BE->>DB: COMMIT
    BE->>R2: Delete temp file (deferred)

Entities using Path A: Product, Subject, Style, Organization, ShotType, GuidelinesShotType, ShootingLookOutfit, BackgroundPreset

Key method: media_service.ingest_all(entity, input_schema)

Path B: Compute Results (via ingest_from_storage_key, relocate=True)

Used when the Backend ingests results from the Compute Server.

sequenceDiagram
    participant CS as Compute Server
    participant R2 as R2 Storage
    participant BE as Backend
    participant DB as Database

    CS->>R2: Save result at compute/{type}s/{task_id}/result_0.webp
    CS-->>BE: task.completed event

    BE->>R2: HEAD permanent key (get ETag, size)
    BE->>BE: Resolve content_hash

    alt Dedup hit
        BE->>DB: Reuse existing MediaResource, attach
        Note over BE: Source key queued for deferred deletion
    else Dedup miss
        BE->>DB: CREATE MediaResource
        BE->>R2: COPY source → media/{resource_id}/file.{ext}
        BE->>DB: UPDATE storage_key to canonical
        BE->>DB: CREATE attachment
        Note over BE: Source key queued for deferred deletion
    end

    BE->>DB: COMMIT
    BE->>R2: Delete source key (deferred, dead-letter on failure)

Entities using Path B: Prediction (result_image), Refine

Path B.2: Shared References (via ingest_from_storage_key, relocate=False)

Used for files that should remain at their source path because multiple entities reference them.

  • File stays at the existing R2 path (no copy/move)
  • MediaResource created with storage_key pointing to the source path
  • No deferred deletion of source

Entities using Path B.2: Generation snapshot fields (e.g., product_snapshot_url)


Schema System

Input Schemas

The MediaInput schema standardizes how media fields are received in API requests:

import uuid
from typing import Optional

from sqlmodel import SQLModel

class MediaInput(SQLModel):
    url: Optional[str] = None                       # Presigned temp URL (new upload)
    media_resource_id: Optional[uuid.UUID] = None   # Reuse existing resource
    width: Optional[int] = None
    height: Optional[int] = None
    # ... other metadata fields

Validation: Exactly one of url or media_resource_id must be provided.
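
The exactly-one-of rule can be illustrated as a standalone check. This is a hypothetical function, not the actual model validator:

```python
import uuid
from typing import Optional

def validate_media_input(url: Optional[str],
                         media_resource_id: Optional[uuid.UUID]) -> None:
    """Reject inputs unless exactly one of url / media_resource_id is set."""
    # Both None, or both set, are equally invalid.
    if (url is None) == (media_resource_id is None):
        raise ValueError("provide exactly one of 'url' or 'media_resource_id'")
```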

Slot Annotations

MediaSlot is a PEP 593 Annotated metadata tag that decouples Python field names from persisted slot identifiers:

# Schema field definition
cover_img: Annotated[Optional[MediaInput], MediaSlot("cover_image")] = None

Renaming cover_img to main_image on the Python side requires no data migration — the persisted slot name remains "cover_image".

Type Factories

Convenience factories reduce boilerplate:

| Factory | Produces | Used for |
| --- | --- | --- |
| MediaField("slot") | Annotated[Optional[MediaInput], MediaSlot("slot")] | Single image input |
| MediaListField("slot") | Annotated[Optional[List[MediaInput]], MediaSlot("slot")] | Multiple images input |
| MediaOutput("slot") | Annotated[Optional[MediaResourcePublic], MediaSlot("slot")] | Single image response |
| MediaListOutput("slot") | Annotated[Optional[List[MediaResourcePublic]], MediaSlot("slot")] | Multiple images response |

Auto-Discovery

discover_media_fields(schema_class) introspects a Pydantic schema to find all MediaInput (or MediaResourcePublic) fields, unwrapping Annotated, Optional, Union, and List layers. Returns a dict of field_name → MediaFieldInfo(slot, is_list, field_name, config).

extract_media_inputs(input_data) reads values from a schema instance, using model_fields_set to distinguish between "not provided" (skip) and "explicitly set to None" (detach).
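
The unwrapping step can be sketched with only stdlib typing. This is a simplified illustration of the traversal — the real discover_media_fields also builds MediaFieldInfo records and handles configuration, which is omitted here:

```python
from typing import Annotated, List, Optional, Union, get_args, get_origin

class MediaSlot:
    """Stand-in for the real PEP 593 annotation tag."""
    def __init__(self, slot: str):
        self.slot = slot

def unwrap_media_slot(annotation):
    """Walk Annotated / Optional / Union / List layers; return (slot, is_list)."""
    slot, is_list = None, False
    stack = [annotation]
    while stack:
        ann = stack.pop()
        origin = get_origin(ann)
        if origin is Annotated:
            base, *metadata = get_args(ann)
            for meta in metadata:
                if isinstance(meta, MediaSlot):
                    slot = meta.slot
            stack.append(base)  # keep unwrapping the annotated type
        elif origin in (Union, list):
            if origin is list:
                is_list = True
            # Descend into the inner types, skipping the None of Optional.
            stack.extend(a for a in get_args(ann) if a is not type(None))
    return slot, is_list
```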

Response Population

populate_media_response(entity, response) auto-populates MediaResourcePublic output fields on a response schema by matching slot names to fetched media data. A batch variant populate_media_responses_batch() handles list endpoints efficiently.


Response Format

The MediaResourcePublic schema is returned in API responses:

{
  "id": "a1b2c3d4-...",
  "resource_type": "IMAGE",
  "url": "https://media.sartiq.com/media/a1b2c3d4-.../file.webp",
  "content_hash": "d41d8cd98f00b204e9800998ecf8427e",
  "extension": "webp",
  "file_size": 245760,
  "width": 1024,
  "height": 1536,
  "aspect_ratio": 0.667,
  "orientation": "portrait",
  "dominant_color": "#2a3f5f",
  "protected": false,
  "alt_text": null,
  "caption": null
}

The url field contains the full CDN URL resolved from the internal storage_key.


Deduplication

Content Hash

Every file is identified by its MD5 content hash, derived from the S3/R2 ETag:

  1. Standard upload — ETag is the MD5 hash; use directly
  2. Multipart upload — ETag is a composite hash (contains -); download file and compute true MD5
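
The two cases above can be sketched as follows. The resolve_content_hash name is hypothetical; the quote-stripping and the "-" multipart marker follow standard S3/R2 ETag conventions:

```python
import hashlib
from typing import Optional

def resolve_content_hash(etag: str, body: Optional[bytes] = None) -> str:
    etag = etag.strip('"')     # S3/R2 return ETags wrapped in double quotes
    if "-" not in etag:
        return etag            # standard upload: the ETag is the MD5 digest
    if body is None:
        # multipart ETag is a composite hash, not the file's MD5
        raise ValueError("multipart ETag: file contents needed to compute true MD5")
    return hashlib.md5(body).hexdigest()
```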

Race Condition Handling

When two concurrent requests upload the same file:

  1. Both pass the dedup check (no existing hash yet)
  2. First INSERT succeeds; second hits IntegrityError (unique constraint on content_hash)
  3. On IntegrityError: rollback, SELECT the existing resource, reuse it, delete the duplicate file
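
The insert-then-recover control flow can be sketched with stdlib sqlite3 (the production code runs SQLAlchemy against Postgres; only the pattern is illustrated here):

```python
import sqlite3

def get_or_create_resource(conn: sqlite3.Connection,
                           content_hash: str, storage_key: str) -> int:
    """Let the unique constraint on content_hash serialize concurrent uploads."""
    try:
        cur = conn.execute(
            "INSERT INTO media_resource (content_hash, storage_key) VALUES (?, ?)",
            (content_hash, storage_key),
        )
        conn.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        conn.rollback()  # the other writer won the race
        row = conn.execute(
            "SELECT id FROM media_resource WHERE content_hash = ?",
            (content_hash,),
        ).fetchone()
        # the caller would also delete the now-duplicate file from storage here
        return row[0]
```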

Resilience

Deferred Deletions

Storage deletions are never performed inside a database transaction. Instead:

  1. During the transaction, storage keys are appended to _pending_deletions
  2. After session.commit(), execute_deferred_deletions() is called
  3. Each key is deleted from R2; failures are recorded in the dead-letter table

This prevents inconsistency where a transaction rolls back but files were already deleted.
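
A minimal sketch of this pattern. The defer/execute names are illustrative, not the actual service API:

```python
from typing import Callable, List, Tuple

class DeferredDeletions:
    """Collect storage keys during the transaction; delete only after commit."""

    def __init__(self, delete_fn: Callable[[str], None]):
        self._delete = delete_fn          # e.g. a thin wrapper over the R2 client
        self._pending: List[str] = []
        self.dead_letters: List[Tuple[str, str]] = []

    def defer(self, storage_key: str) -> None:
        # Called inside the transaction; nothing touches storage yet.
        self._pending.append(storage_key)

    def execute(self) -> None:
        # Called only after session.commit(); failures are dead-lettered.
        for key in self._pending:
            try:
                self._delete(key)
            except Exception as exc:
                self.dead_letters.append((key, str(exc)))
        self._pending.clear()
```

If the transaction rolls back, execute() is simply never called, so no file is lost; if a deletion fails after commit, the key survives in the dead-letter list for the scheduler to retry.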

Dead-Letter Table

Failed deletions are recorded in media_deletion_dead_letter with:

  • An incrementing attempts counter
  • An updated last_attempted_at timestamp
  • A cap of max_attempts retries (default 10), after which the entry is abandoned

A scheduler job calls retry_media_deletions() periodically to retry pending entries.

Orphan Scanner

find_orphans_batch() identifies MediaResource records with zero attachments:

  • Age-gated — only considers resources older than a threshold (default 60 minutes) to avoid deleting resources mid-transaction
  • FOR UPDATE locking — prevents TOCTOU races where an attachment is created between the orphan check and deletion
  • Orphaned resources are deleted from both the database and R2 storage
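
The shape of the orphan query, sketched with stdlib sqlite3. The real query runs on Postgres and adds SELECT ... FOR UPDATE row locking, which sqlite has no equivalent for:

```python
import sqlite3

def find_orphans_batch(conn: sqlite3.Connection,
                       max_age_minutes: int = 60, batch_size: int = 100):
    """Age-gated scan: unprotected resources with zero attachments,
    older than the threshold."""
    return conn.execute(
        """
        SELECT mr.id, mr.storage_key
        FROM media_resource AS mr
        WHERE mr.protected = 0
          AND mr.created_at < datetime('now', ?)
          AND NOT EXISTS (
              SELECT 1 FROM media_resource_attachment AS a
              WHERE a.media_resource_id = mr.id
          )
        LIMIT ?
        """,
        (f"-{max_age_minutes} minutes", batch_size),
    ).fetchall()
```

The age gate (created_at older than the threshold) is what keeps a resource created moments ago, whose attachment has not yet committed, out of the candidate set.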

Transaction Safety

  • All CRUD flush() without committing — the caller controls the transaction boundary
  • SELECT ... FOR UPDATE used when checking orphan status or reusing existing resources
  • Relocated files tracked in _pending_relocations for rollback cleanup on transaction failure

Entity Slot Map

All entities that participate in the MediaResource system and their registered slots:

| Entity | media_entity_type | Slots |
| --- | --- | --- |
| Product | PRODUCT | cover_image, back_image, reference_images (list) |
| Subject | SUBJECT | cover_image |
| Style | STYLE | cover_image, reference_images (list) |
| Organization | ORGANIZATION | logo |
| ShotType | SHOT_TYPE | reference_image |
| GuidelinesShotType | GUIDELINES_SHOT_TYPE | reference_image |
| BackgroundPreset | BACKGROUND_PRESET | reference_image |
| ShootingLookOutfit | SHOOTING_LOOK_OUTFIT | reference_images (list) |
| Prediction | PREDICTION | result_image |
| Refine | REFINE | result_image |
| Generation | GENERATION | (snapshot fields; not yet fully integrated) |
| ShotRevisionImage | SHOT_REVISION_IMAGE | (not yet integrated) |

Migration Status

The MediaResource system is in a dual-format transition period:

What's Complete

  • All entities listed above (except noted exceptions) use MediaResource for new uploads
  • Canonical media/{resource_id}/file.{ext} paths for all new files
  • Deduplication, resilient deletion, orphan scanning
  • MediaResourcePublic in API responses alongside legacy URL strings

Dual-Format Writeback

During the transition, services write to both:

  • The new MediaResourceAttachment records
  • The old *_url string fields on entities (e.g., product.cover_image_url)

This ensures backward compatibility while frontends migrate to the new response format.

Backfill Script

A backfill script at scripts/backfill_media_resources.py creates MediaResource records for pre-existing data that only has legacy *_url fields.

Not Yet Integrated

| Entity / Field | Status |
| --- | --- |
| ShotRevisionImage | Manual uploads not yet routed through MediaResource |
| Subject.base_images | JSON field with image URLs; requires schema migration |
| Generation snapshot fields | 5 snapshot URL fields (product_snapshot_url, etc.) use relocate=False, but full integration is pending |

Key Source Files

| File | Purpose |
| --- | --- |
| app/services/media_resource_service.py | Central orchestrator: ingestion, dedup, deletion, response population |
| app/models/media_resource.py | Database models: MediaResource, MediaResourceAttachment, MediaDeletionDeadLetter |
| app/schemas/media_resource.py | Schemas: MediaInput, MediaResourcePublic, MediaSlot, type factories |
| app/crud/media_resource.py | CRUD: queries, attachments, dead-letter, orphan scanner |
| app/utils/media_introspection.py | Schema auto-discovery: discover_media_fields(), extract_media_inputs() |