MediaResource Lifecycle

This document describes the MediaResource system — the centralized, content-addressable media management layer introduced to replace per-entity storage path conventions with a unified, deduplicated approach.


Overview

Why MediaResource Exists

Before MediaResource, each entity type (Product, Subject, Style, etc.) managed its own storage paths with bespoke naming conventions (images/products/{id}/..., images/subjects/...). This led to:

  • No deduplication — identical files uploaded twice consumed double storage
  • Fragile path coupling — entity-specific path formats baked into multiple services
  • No centralized cleanup — orphaned files accumulated when entities were deleted
  • Inconsistent metadata — dimensions, hashes, and file info tracked differently per entity

MediaResource solves these problems with:

  • Content-addressable storage — files identified by MD5 content hash, automatic dedup
  • Entity-agnostic canonical paths — all files stored at media/{resource_id}/file.{ext}
  • Polymorphic attachments — one join table links any entity to any media resource
  • Resilient deletion — deferred post-commit deletions with dead-letter retry and orphan scanning

Data Model

Entity-Relationship Diagram

erDiagram
    MediaResource ||--o{ MediaResourceAttachment : "has many"
    MediaResource {
        uuid id PK
        enum resource_type "IMAGE | VIDEO"
        string storage_key "R2 path"
        string content_hash "MD5 for dedup"
        string extension
        int file_size
        int width
        int height
        string dominant_color
        bool protected
        string alt_text
        string caption
        string color_profile "images only"
        float duration "videos only"
        datetime created_at
    }

    MediaResourceAttachment {
        uuid id PK
        uuid media_resource_id FK
        enum entity_type "PRODUCT | SUBJECT | ..."
        uuid entity_id
        string slot "e.g. cover_image"
        int position "NULL for single slots"
        datetime created_at
    }

    MediaDeletionDeadLetter {
        uuid id PK
        string storage_key
        string error_message
        int attempts
        datetime created_at
        datetime last_attempted_at
    }

MediaResource

The central entity representing a unique media file in storage.

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| resource_type | IMAGE \| VIDEO | Single-table inheritance discriminator |
| storage_key | string | Relative R2 path (e.g., media/{id}/file.webp) |
| content_hash | string | MD5 hash for deduplication |
| extension | string | File extension (e.g., webp, png) |
| file_size | int | Size in bytes |
| width / height | int? | Pixel dimensions |
| dominant_color | string? | Hex color code |
| protected | bool | Prevents deletion by orphan scanner |
| alt_text / caption | string? | Accessibility and display text |
| color_profile | string? | ICC profile (images only) |
| duration | float? | Length in seconds (videos only) |

Computed properties:

  • aspect_ratio: width / height (if both present)
  • orientation: "portrait", "landscape", or "square"
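
The computed properties can be sketched in plain Python. This is an illustrative stand-in for the actual model class (which this document does not show in full), keeping only the width/height logic:

```python
from typing import Optional

class MediaDimensions:
    """Illustrative stand-in for the computed properties on MediaResource."""

    def __init__(self, width: Optional[int], height: Optional[int]):
        self.width = width
        self.height = height

    @property
    def aspect_ratio(self) -> Optional[float]:
        # width / height, defined only when both dimensions are known
        if self.width and self.height:
            return self.width / self.height
        return None

    @property
    def orientation(self) -> Optional[str]:
        ratio = self.aspect_ratio
        if ratio is None:
            return None
        if ratio < 1:
            return "portrait"
        if ratio > 1:
            return "landscape"
        return "square"
```

With the dimensions from the sample response below (1024 × 1536), this yields a "portrait" orientation and an aspect ratio of roughly 0.667.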

MediaResourceAttachment

Polymorphic join table connecting any entity to any MediaResource.

| Field | Type | Description |
| --- | --- | --- |
| media_resource_id | UUID (FK) | Points to media_resource.id |
| entity_type | MediaEntityType | Which entity table (e.g., PRODUCT, PREDICTION) |
| entity_id | UUID | The owning entity's primary key |
| slot | string | Field name on the entity (e.g., cover_image, reference_images) |
| position | int? | Order within list slots; NULL for single-value slots |

Uniqueness constraints:

  • Single slots: UNIQUE(entity_type, entity_id, slot) WHERE position IS NULL
  • List slots: UNIQUE(entity_type, entity_id, slot, position) WHERE position IS NOT NULL
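
These two constraints can be expressed as partial unique indexes. A sketch using stdlib sqlite3 for illustration only — the production schema presumably lives in Postgres migrations, and the index names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE media_resource_attachment (
        id INTEGER PRIMARY KEY,
        media_resource_id TEXT,
        entity_type TEXT,
        entity_id TEXT,
        slot TEXT,
        position INTEGER
    )
""")
# Single-value slots: at most one row per (entity, slot) when position is NULL.
conn.execute("""
    CREATE UNIQUE INDEX uq_attachment_single_slot
    ON media_resource_attachment (entity_type, entity_id, slot)
    WHERE position IS NULL
""")
# List slots: at most one row per (entity, slot, position) when position is set.
conn.execute("""
    CREATE UNIQUE INDEX uq_attachment_list_slot
    ON media_resource_attachment (entity_type, entity_id, slot, position)
    WHERE position IS NOT NULL
""")
```

Partial indexes let both slot shapes coexist in one table: a second cover_image row for the same entity is rejected, while reference_images rows with distinct positions are accepted.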

MediaDeletionDeadLetter

Records failed R2 storage deletions for retry by the scheduler.

| Field | Type | Description |
| --- | --- | --- |
| storage_key | string | R2 path that failed to delete |
| error_message | string | Failure reason |
| attempts | int | Retry counter (starts at 1) |
| last_attempted_at | datetime | Most recent retry timestamp |

Canonical Path

All new media files are stored at:

media/{resource_id}/file.{ext}

Regex: ^media/[0-9a-f\-]{36}/file\.\w+$

Benefits:

  • Entity-agnostic — renaming or re-typing an entity requires no file move
  • Predictable — any resource_id maps to exactly one path
  • No naming collisions — UUID guarantees uniqueness

Legacy paths (images/products/..., images/subjects/..., shooting_looks/...) remain valid for pre-existing data. Both legacy and canonical paths resolve correctly via CDN.
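A minimal sketch of the path rule. The regex comes from this document; the canonical_storage_key helper name is hypothetical, not taken from the codebase:

```python
import re
import uuid

# Canonical-path regex as documented above.
CANONICAL_PATH_RE = re.compile(r"^media/[0-9a-f\-]{36}/file\.\w+$")

def canonical_storage_key(resource_id: uuid.UUID, extension: str) -> str:
    # Hypothetical helper: every resource_id maps to exactly one path.
    return f"media/{resource_id}/file.{extension}"
```

A key built this way always matches the canonical regex, while legacy paths such as images/products/... do not — which is how code can distinguish the two formats during the transition.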


Ingestion Flows

Path A: User Uploads (via ingest_all)

Used when creating or updating entities from API requests with file URLs.

sequenceDiagram
    participant FE as Frontend
    participant R2 as R2 Storage
    participant BE as Backend
    participant DB as Database

    FE->>R2: PUT file to presigned URL
    Note over R2: File at temp/{file_id}_{ts}_{name}

    FE->>BE: POST /products (with temp URLs)
    BE->>R2: HEAD temp file (get ETag, size)
    BE->>BE: Resolve content_hash from ETag

    alt Dedup hit (hash exists)
        BE->>R2: DELETE temp file
        BE->>DB: Reuse existing MediaResource
    else Dedup miss (new file)
        BE->>DB: CREATE MediaResource
        BE->>R2: COPY temp → media/{resource_id}/file.{ext}
        BE->>DB: UPDATE storage_key to canonical path
    end

    BE->>DB: CREATE MediaResourceAttachment
    BE->>DB: COMMIT
    BE->>R2: Delete temp file (deferred)

Entities using Path A: Product, Subject, Style, Organization, ShotType, GuidelinesShotType, ShootingLookOutfit, BackgroundPreset

Key method: media_service.ingest_all(entity, input_schema)

Path B: Compute Results (via ingest_from_storage_key, relocate=True)

Used when the Backend ingests results from the Compute Server.

sequenceDiagram
    participant CS as Compute Server
    participant R2 as R2 Storage
    participant BE as Backend
    participant DB as Database

    CS->>R2: Save result at compute/{type}s/{task_id}/result_0.webp
    CS-->>BE: task.completed event

    BE->>R2: HEAD permanent key (get ETag, size)
    BE->>BE: Resolve content_hash

    alt Dedup hit
        BE->>DB: Reuse existing MediaResource, attach
        Note over BE: Source key queued for deferred deletion
    else Dedup miss
        BE->>DB: CREATE MediaResource
        BE->>R2: COPY source → media/{resource_id}/file.{ext}
        BE->>DB: UPDATE storage_key to canonical
        BE->>DB: CREATE attachment
        Note over BE: Source key queued for deferred deletion
    end

    BE->>DB: COMMIT
    BE->>R2: Delete source key (deferred, dead-letter on failure)

Entities using Path B: Prediction (result_image), Refine

Path B.2: Shared References (via ingest_from_storage_key, relocate=False)

Used for files that should remain at their source path because multiple entities reference them.

  • File stays at the existing R2 path (no copy/move)
  • MediaResource created with storage_key pointing to the source path
  • No deferred deletion of source

Entities using Path B.2: Generation snapshot fields (e.g., product_snapshot_url)


Schema System

Input Schemas

The MediaInput schema standardizes how media fields are received in API requests:

import uuid
from typing import Optional

from sqlmodel import SQLModel

class MediaInput(SQLModel):
    url: Optional[str] = None                       # Presigned temp URL (new upload)
    media_resource_id: Optional[uuid.UUID] = None   # Reuse existing resource
    width: Optional[int] = None
    height: Optional[int] = None
    # ... other metadata fields

Validation: Exactly one of url or media_resource_id must be provided.
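
The exactly-one-of rule can be illustrated as a standalone check. This is a hypothetical function, not the actual model validator:

```python
import uuid
from typing import Optional

def validate_media_input(url: Optional[str],
                         media_resource_id: Optional[uuid.UUID]) -> None:
    """Reject inputs unless exactly one of url / media_resource_id is set."""
    # Both None, or both set, are equally invalid.
    if (url is None) == (media_resource_id is None):
        raise ValueError("provide exactly one of 'url' or 'media_resource_id'")
```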

Slot Annotations

MediaSlot is a PEP 593 Annotated metadata tag that decouples Python field names from persisted slot identifiers:

# Schema field definition
cover_img: Annotated[Optional[MediaInput], MediaSlot("cover_image")] = None

Renaming cover_img to main_image on the Python side requires no data migration — the persisted slot name remains "cover_image".

Type Factories

Convenience factories reduce boilerplate:

| Factory | Produces | Used for |
| --- | --- | --- |
| MediaField("slot") | Annotated[Optional[MediaInput], MediaSlot("slot")] | Single image input |
| MediaListField("slot") | Annotated[Optional[List[MediaInput]], MediaSlot("slot")] | Multiple images input |
| MediaOutput("slot") | Annotated[Optional[MediaResourcePublic], MediaSlot("slot")] | Single image response |
| MediaListOutput("slot") | Annotated[Optional[List[MediaResourcePublic]], MediaSlot("slot")] | Multiple images response |

Auto-Discovery

discover_media_fields(schema_class) introspects a Pydantic schema to find all MediaInput (or MediaResourcePublic) fields, unwrapping Annotated, Optional, Union, and List layers. Returns a dict of field_name → MediaFieldInfo(slot, is_list, field_name, config).

extract_media_inputs(input_data) reads values from a schema instance, using model_fields_set to distinguish between "not provided" (skip) and "explicitly set to None" (detach).
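
The unwrapping step can be sketched with only stdlib typing. This is a simplified illustration of the traversal — the real discover_media_fields also builds MediaFieldInfo records and handles configuration, which is omitted here:

```python
from typing import Annotated, List, Optional, Union, get_args, get_origin

class MediaSlot:
    """Stand-in for the real PEP 593 annotation tag."""
    def __init__(self, slot: str):
        self.slot = slot

def unwrap_media_slot(annotation):
    """Walk Annotated / Optional / Union / List layers; return (slot, is_list)."""
    slot, is_list = None, False
    stack = [annotation]
    while stack:
        ann = stack.pop()
        origin = get_origin(ann)
        if origin is Annotated:
            base, *metadata = get_args(ann)
            for meta in metadata:
                if isinstance(meta, MediaSlot):
                    slot = meta.slot
            stack.append(base)  # keep unwrapping the annotated type
        elif origin in (Union, list):
            if origin is list:
                is_list = True
            # Descend into the inner types, skipping the None of Optional.
            stack.extend(a for a in get_args(ann) if a is not type(None))
    return slot, is_list
```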

Response Population

populate_media_response(entity, response) auto-populates MediaResourcePublic output fields on a response schema by matching slot names to fetched media data. A batch variant populate_media_responses_batch() handles list endpoints efficiently.


Response Format

The MediaResourcePublic schema is returned in API responses:

{
  "id": "a1b2c3d4-...",
  "resource_type": "IMAGE",
  "url": "https://media.sartiq.com/media/a1b2c3d4-.../file.webp",
  "content_hash": "d41d8cd98f00b204e9800998ecf8427e",
  "extension": "webp",
  "file_size": 245760,
  "width": 1024,
  "height": 1536,
  "aspect_ratio": 0.667,
  "orientation": "portrait",
  "dominant_color": "#2a3f5f",
  "protected": false,
  "alt_text": null,
  "caption": null
}

The url field contains the full CDN URL resolved from the internal storage_key.


Deduplication

Content Hash

Every file is identified by its MD5 content hash, derived from the S3/R2 ETag:

  1. Standard upload — ETag is the MD5 hash; use directly
  2. Multipart upload — ETag is a composite hash (contains -); download file and compute true MD5
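
The two cases above can be sketched as follows. The resolve_content_hash name is hypothetical; the quote-stripping and the "-" multipart marker follow standard S3/R2 ETag conventions:

```python
import hashlib
from typing import Optional

def resolve_content_hash(etag: str, body: Optional[bytes] = None) -> str:
    etag = etag.strip('"')     # S3/R2 return ETags wrapped in double quotes
    if "-" not in etag:
        return etag            # standard upload: the ETag is the MD5 digest
    if body is None:
        # multipart ETag is a composite hash, not the file's MD5
        raise ValueError("multipart ETag: file contents needed to compute true MD5")
    return hashlib.md5(body).hexdigest()
```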

Race Condition Handling

When two concurrent requests upload the same file:

  1. Both pass the dedup check (no existing hash yet)
  2. First INSERT succeeds; second hits IntegrityError (unique constraint on content_hash)
  3. On IntegrityError: rollback, SELECT the existing resource, reuse it, delete the duplicate file
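
The insert-then-recover control flow can be sketched with stdlib sqlite3 (the production code runs SQLAlchemy against Postgres; only the pattern is illustrated here):

```python
import sqlite3

def get_or_create_resource(conn: sqlite3.Connection,
                           content_hash: str, storage_key: str) -> int:
    """Let the unique constraint on content_hash serialize concurrent uploads."""
    try:
        cur = conn.execute(
            "INSERT INTO media_resource (content_hash, storage_key) VALUES (?, ?)",
            (content_hash, storage_key),
        )
        conn.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        conn.rollback()  # the other writer won the race
        row = conn.execute(
            "SELECT id FROM media_resource WHERE content_hash = ?",
            (content_hash,),
        ).fetchone()
        # the caller would also delete the now-duplicate file from storage here
        return row[0]
```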

Resilience

Deferred Deletions

Storage deletions are never performed inside a database transaction. Instead:

  1. During the transaction, storage keys are appended to _pending_deletions
  2. After session.commit(), execute_deferred_deletions() is called
  3. Each key is deleted from R2; failures are recorded in the dead-letter table

This prevents inconsistency where a transaction rolls back but files were already deleted.
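
A minimal sketch of this pattern. The defer/execute names are illustrative, not the actual service API:

```python
from typing import Callable, List, Tuple

class DeferredDeletions:
    """Collect storage keys during the transaction; delete only after commit."""

    def __init__(self, delete_fn: Callable[[str], None]):
        self._delete = delete_fn          # e.g. a thin wrapper over the R2 client
        self._pending: List[str] = []
        self.dead_letters: List[Tuple[str, str]] = []

    def defer(self, storage_key: str) -> None:
        # Called inside the transaction; nothing touches storage yet.
        self._pending.append(storage_key)

    def execute(self) -> None:
        # Called only after session.commit(); failures are dead-lettered.
        for key in self._pending:
            try:
                self._delete(key)
            except Exception as exc:
                self.dead_letters.append((key, str(exc)))
        self._pending.clear()
```

If the transaction rolls back, execute() is simply never called, so no file is lost; if a deletion fails after commit, the key survives in the dead-letter list for the scheduler to retry.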

Dead-Letter Table

Failed deletions are recorded in media_deletion_dead_letter with:

  • An incrementing attempts counter
  • An updated last_attempted_at timestamp
  • A cap of max_attempts retries (default 10), after which the entry is abandoned

A scheduler job calls retry_media_deletions() periodically to retry pending entries.

Orphan Scanner

find_orphans_batch() identifies MediaResource records with zero attachments:

  • Age-gated — only considers resources older than a threshold (default 60 minutes) to avoid deleting resources mid-transaction
  • FOR UPDATE locking — prevents TOCTOU races where an attachment is created between the orphan check and deletion
  • Orphaned resources are deleted from both the database and R2 storage
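
The shape of the orphan query, sketched with stdlib sqlite3. The real query runs on Postgres and adds SELECT ... FOR UPDATE row locking, which sqlite has no equivalent for:

```python
import sqlite3

def find_orphans_batch(conn: sqlite3.Connection,
                       max_age_minutes: int = 60, batch_size: int = 100):
    """Age-gated scan: unprotected resources with zero attachments,
    older than the threshold."""
    return conn.execute(
        """
        SELECT mr.id, mr.storage_key
        FROM media_resource AS mr
        WHERE mr.protected = 0
          AND mr.created_at < datetime('now', ?)
          AND NOT EXISTS (
              SELECT 1 FROM media_resource_attachment AS a
              WHERE a.media_resource_id = mr.id
          )
        LIMIT ?
        """,
        (f"-{max_age_minutes} minutes", batch_size),
    ).fetchall()
```

The age gate (created_at older than the threshold) is what keeps a resource created moments ago, whose attachment has not yet committed, out of the candidate set.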

Transaction Safety

  • All CRUD flush() without committing — the caller controls the transaction boundary
  • SELECT ... FOR UPDATE used when checking orphan status or reusing existing resources
  • Relocated files tracked in _pending_relocations for rollback cleanup on transaction failure

Entity Slot Map

All entities that participate in the MediaResource system and their registered slots:

| Entity | media_entity_type | Slots |
| --- | --- | --- |
| Product | PRODUCT | cover_image, back_image, reference_images (list) |
| Subject | SUBJECT | cover_image |
| Style | STYLE | cover_image, reference_images (list) |
| Organization | ORGANIZATION | logo |
| ShotType | SHOT_TYPE | reference_image |
| GuidelinesShotType | GUIDELINES_SHOT_TYPE | reference_image |
| BackgroundPreset | BACKGROUND_PRESET | reference_image |
| ShootingLookOutfit | SHOOTING_LOOK_OUTFIT | reference_images (list) |
| Prediction | PREDICTION | result_image |
| Refine | REFINE | result_image |
| Generation | GENERATION | (snapshot fields; not yet fully integrated) |
| ShotRevisionImage | SHOT_REVISION_IMAGE | (not yet integrated) |

Migration Status

The MediaResource system is in a dual-format transition period:

What's Complete

  • All entities listed above (except noted exceptions) use MediaResource for new uploads
  • Canonical media/{resource_id}/file.{ext} paths for all new files
  • Deduplication, resilient deletion, orphan scanning
  • MediaResourcePublic in API responses alongside legacy URL strings

Dual-Format Writeback

During the transition, services write to both:

  • The new MediaResourceAttachment records
  • The old *_url string fields on entities (e.g., product.cover_image_url)

This ensures backward compatibility while frontends migrate to the new response format.

Backfill Script

A backfill script at scripts/backfill_media_resources.py creates MediaResource records for pre-existing data that only has legacy *_url fields.

Not Yet Integrated

| Entity / Field | Status |
| --- | --- |
| ShotRevisionImage | Manual uploads not yet routed through MediaResource |
| Subject.base_images | JSON field with image URLs; requires schema migration |
| Generation snapshot fields | 5 snapshot URL fields (product_snapshot_url, etc.) use relocate=False, but full integration is pending |

Key Source Files

| File | Purpose |
| --- | --- |
| app/services/media_resource_service.py | Central orchestrator: ingestion, dedup, deletion, response population |
| app/models/media_resource.py | Database models: MediaResource, MediaResourceAttachment, MediaDeletionDeadLetter |
| app/schemas/media_resource.py | Schemas: MediaInput, MediaResourcePublic, MediaSlot, type factories |
| app/crud/media_resource.py | CRUD: queries, attachments, dead-letter, orphan scanner |
| app/utils/media_introspection.py | Schema auto-discovery: discover_media_fields(), extract_media_inputs() |