Unsupervised Learning¶
Unsupervised learning techniques allow the Sartiq pipeline to analyze garment images without labeled training data. The two core tasks covered here are clustering -- grouping visually similar pixel regions into coherent garment parts (sleeves, collar, body panel, etc.) -- and optimal assignment -- establishing one-to-one correspondences between those parts across different views of the same garment (e.g., a still-life flat lay and a model-worn photograph).
Together, these techniques enable downstream operations such as color transfer between views, quality comparison of matching regions, and compositing alignment.
K-Means Clustering¶
What It Solves¶
Given a garment image, we need to partition pixels into \(K\) coherent regions that correspond to visually distinct parts. K-Means achieves this by grouping pixels based on similarity in a combined color-and-spatial feature space, without any prior knowledge of what the garment parts look like.
Algorithm -- Lloyd's Iteration¶
K-Means follows a simple iterative procedure:
1. Initialize \(K\) centroids (cluster centers) in feature space.
2. Assign each data point to the nearest centroid (using Euclidean distance).
3. Update each centroid to the mean of its assigned points.
4. Repeat steps 2--3 until assignments no longer change (or a maximum iteration count is reached).
Convergence is guaranteed because each step monotonically decreases the total within-cluster variance (the objective function). However, the algorithm converges to a local minimum, which depends on the initial centroid positions.
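The iteration can be sketched in plain NumPy. This is a minimal didactic version with random initialization (the function name and the empty-cluster handling are illustrative choices; the pipeline itself uses a library implementation):

```python
import numpy as np

def lloyd_kmeans(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    """Minimal Lloyd's iteration: X is (n, d); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids (and hence assignments) stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Each pass performs exactly the assign/update pair from the list above, so the within-cluster variance never increases between iterations.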
Feature Space¶
Raw pixel colors alone are insufficient -- two regions may share similar colors but belong to different garment parts (e.g., both sleeves are the same color but spatially separated). The feature vector for each pixel combines color and position:

$$
\mathbf{f} = \left[\, L^*,\; a^*,\; b^*,\; w \cdot \frac{x}{S},\; w \cdot \frac{y}{S} \,\right]
$$

where:
- \((L^*, a^*, b^*)\) are the pixel's coordinates in CIELAB color space, which is approximately perceptually uniform (equal numeric distances roughly correspond to equal perceived color differences).
- \((x, y)\) are the pixel's spatial coordinates, normalized by image dimension \(S\).
- \(w\) is a spatial weighting factor that controls the trade-off between color similarity and spatial proximity. A larger \(w\) produces more compact, spatially contiguous clusters; a smaller \(w\) allows clusters to be driven primarily by color.
This formulation is closely related to SLIC superpixels, but applied at the full-image level with a fixed \(K\) tuned to the number of expected garment parts.
Choosing K¶
The number of clusters \(K\) must be specified in advance. Two common data-driven methods help select it:
- Elbow method -- Plot total within-cluster variance (inertia) as a function of \(K\). The "elbow" where marginal improvement flattens indicates a reasonable \(K\).
- Silhouette score -- Measures how similar each point is to its own cluster versus the nearest neighboring cluster. Values range from \(-1\) to \(+1\); higher is better.
In practice, a domain heuristic works well: typical apparel items decompose into 4--8 distinct regions (front body, back body, left/right sleeves, collar, hem, pocket). The pipeline defaults to \(K = 6\) and adjusts based on garment category metadata when available.
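Both criteria are straightforward to compute with scikit-learn. The sketch below uses synthetic 2-D data with three well-separated groups as a stand-in for pixel features; the candidate range of \(K\) values is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for pixel features: three well-separated blobs.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 3.0, 6.0)])

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}: inertia={km.inertia_:.1f}, silhouette={sil:.3f}")
```

Inertia decreases monotonically with \(K\) (hence the need to look for an elbow), while the silhouette score peaks at the natural number of groups.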
Limitations¶
| Limitation | Description | Mitigation |
|---|---|---|
| Spherical cluster assumption | K-Means assumes clusters are roughly spherical (isotropic) in feature space | Adequate for color+spatial features; elongated structures are rare |
| Initialization sensitivity | Different starting centroids can yield different results | Use k-means++ initialization, which spreads initial centroids apart |
| Fixed K | The number of clusters must be specified up front | Use the elbow method or domain knowledge |
| Uniform cluster size | Has a bias toward roughly equal-sized clusters even when natural groups differ in size | Acceptable for garments where parts are roughly comparable in area |
Code Example¶
```python
import cv2
import numpy as np
from sklearn.cluster import KMeans


def segment_garment_kmeans(
    image_bgr: np.ndarray,
    k: int = 6,
    spatial_weight: float = 0.5,
) -> tuple[np.ndarray, np.ndarray]:
    """Segment a garment image into K regions using color + spatial features.

    Returns the label map and the LAB-only cluster centroids (K, 3).
    """
    h, w = image_bgr.shape[:2]

    # Convert to CIELAB for approximately perceptually uniform color distances.
    # OpenCV uint8 LAB: L in [0, 255] (standard: [0, 100]),
    # a in [0, 255] (standard: ~[-128, 127], offset by +128),
    # b in [0, 255] (standard: ~[-128, 127], offset by +128).
    # Normalize to [0, 1] so color and spatial channels are comparable.
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float64) / 255.0

    # Build feature matrix: [L, a, b, weighted_x, weighted_y]
    yy, xx = np.mgrid[0:h, 0:w]
    scale = max(h, w)
    features = np.column_stack([
        lab.reshape(-1, 3),
        spatial_weight * xx.ravel()[:, np.newaxis] / scale,
        spatial_weight * yy.ravel()[:, np.newaxis] / scale,
    ])

    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = kmeans.fit_predict(features)
    label_map = labels.reshape(h, w)

    # Return only the LAB columns of the centroids (first 3 of 5),
    # converted from normalized OpenCV LAB back to standard CIELAB.
    raw = kmeans.cluster_centers_[:, :3] * 255.0  # back to OpenCV [0, 255] range
    lab_centroids = np.column_stack([
        raw[:, 0] * (100.0 / 255.0),  # L*: [0, 255] -> [0, 100]
        raw[:, 1] - 128.0,            # a*: [0, 255] -> [-128, 127]
        raw[:, 2] - 128.0,            # b*: [0, 255] -> [-128, 127]
    ])
    return label_map, lab_centroids
```
/// details | Using OpenCV's built-in K-Means

```python
import cv2
import numpy as np


def segment_garment_cv2(
    image_bgr: np.ndarray,
    k: int = 6,
    spatial_weight: float = 0.5,
) -> tuple[np.ndarray, np.ndarray]:
    """K-Means segmentation using OpenCV (faster for large images).

    Returns the label map and LAB-only centroids (K, 3).
    """
    h, w = image_bgr.shape[:2]
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float32) / 255.0
    yy, xx = np.mgrid[0:h, 0:w]
    scale = max(h, w)
    features = np.column_stack([
        lab.reshape(-1, 3),
        (spatial_weight * xx.ravel() / scale).astype(np.float32)[:, np.newaxis],
        (spatial_weight * yy.ravel() / scale).astype(np.float32)[:, np.newaxis],
    ])
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.2)
    _, labels, centers = cv2.kmeans(
        features, k, None, criteria, 10, cv2.KMEANS_PP_CENTERS,
    )
    raw = centers[:, :3] * 255.0  # back to OpenCV [0, 255] range
    lab_centroids = np.column_stack([
        raw[:, 0] * (100.0 / 255.0),  # L*: [0, 255] -> [0, 100]
        raw[:, 1] - 128.0,            # a*: [0, 255] -> [-128, 127]
        raw[:, 2] - 128.0,            # b*: [0, 255] -> [-128, 127]
    ])
    return labels.reshape(h, w), lab_centroids
```

///
From clustering to matching -- K-Means gives us coherent regions within a single image, but the pipeline processes pairs of images (still-life and worn). The next step is establishing which cluster in one view corresponds to which cluster in the other -- a combinatorial optimization problem that the Hungarian algorithm solves exactly.
Hungarian Algorithm (Optimal Assignment)¶
The Assignment Problem¶
After segmenting both a still-life image and a worn image of the same garment into \(K\) clusters each, we need to determine which cluster in one view corresponds to which cluster in the other. This is the linear assignment problem: find a one-to-one mapping between two sets of \(K\) elements that minimizes total cost.
Cost Matrix Construction¶
The cost matrix \(C\) is a \(K \times K\) matrix where entry \(C_{ij}\) is the distance between cluster \(i\) from the still-life image and cluster \(j\) from the worn image. The distance is computed in CIELAB color space using the Delta-E (\(\Delta E\)) metric on the cluster centroids:

$$
C_{ij} = \Delta E_{ij} = \sqrt{(L^*_i - L^*_j)^2 + (a^*_i - a^*_j)^2 + (b^*_i - b^*_j)^2}
$$

This is the CIE76 formula. For higher accuracy, CIEDE2000 (\(\Delta E_{00}\)) can be used, which accounts for perceptual non-uniformities in the LAB space, but the simpler form is sufficient when cluster centroids are well-separated.
Algorithm Overview¶
The Hungarian method (Kuhn--Munkres algorithm) solves the assignment problem in \(O(n^3)\) time. The key steps are:
1. Row reduction -- Subtract the row minimum from each row of the cost matrix.
2. Column reduction -- Subtract the column minimum from each column.
3. Cover zeros -- Find the minimum number of lines (rows and columns) needed to cover all zeros. If this equals \(K\), an optimal assignment exists among the zeros.
4. Augment -- If fewer than \(K\) lines suffice, adjust the matrix: subtract the smallest uncovered value from all uncovered elements and add it to doubly-covered elements. Repeat from step 3.
The result is a permutation that minimizes the total assignment cost.
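The two reduction steps can be sketched in NumPy on a small example (the matrix values are arbitrary, chosen only for illustration):

```python
import numpy as np

C = np.array([
    [4.0, 2.0, 8.0],
    [4.0, 3.0, 7.0],
    [3.0, 1.0, 6.0],
])

# Step 1: row reduction -- subtract each row's minimum.
reduced = C - C.min(axis=1, keepdims=True)
# Step 2: column reduction -- subtract each column's minimum.
reduced -= reduced.min(axis=0, keepdims=True)

# Every row and every column now contains at least one zero; the remaining
# steps (covering zeros, augmenting) search for a zero-cost permutation.
print(reduced)
```

After both reductions the matrix has a zero in every row and column, which is the precondition for the zero-covering search in steps 3 and 4.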
Why Not Greedy?¶
A greedy approach -- repeatedly picking the smallest available entry in the cost matrix -- can produce suboptimal results because a locally optimal choice may block a better global assignment. Consider this \(3 \times 3\) cost matrix:

$$
C = \begin{bmatrix} 4 & 1 & 3 \\ 2 & 0 & 5 \\ 3 & 2 & 2 \end{bmatrix}
$$
Greedy (always pick the smallest remaining entry):

1. Pick \((1, 1)\) with cost 0 (global minimum).
2. Remaining rows \(\{0, 2\}\), columns \(\{0, 2\}\). Smallest is \((2, 2)\) with cost 2.
3. Forced into \((0, 0)\) with cost 4.
4. Total = \(0 + 2 + 4 = 6\).

Hungarian optimal: the assignment \((0, 1) + (1, 0) + (2, 2)\) yields \(1 + 2 + 2 = 5\).
By greedily claiming column 1 for row 1, the greedy algorithm locked row 0 out of its cheapest option (column 1, cost 1), forcing a more expensive assignment overall. The Hungarian method avoids this by considering all possible assignments simultaneously. With \(K = 6\) garment parts the gap can be larger, and the \(O(n^3)\) cost is negligible at this scale.
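The gap can be checked with SciPy's `linear_sum_assignment`. The matrix below is one \(3 \times 3\) instance consistent with the worked numbers above (the entries not pinned down by the trace are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

C = np.array([
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
])

def greedy_assign(cost: np.ndarray) -> float:
    """Repeatedly take the smallest remaining entry, removing its row and column."""
    cost = cost.copy()
    total = 0.0
    for _ in range(len(cost)):
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        total += cost[i, j]
        cost[i, :] = np.inf  # row i is used up
        cost[:, j] = np.inf  # column j is used up
    return total

rows, cols = linear_sum_assignment(C)
print(greedy_assign(C), C[rows, cols].sum())  # greedy: 6.0, optimal: 5.0
```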
Code Example¶
```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_clusters(
    centroids_still: np.ndarray,
    centroids_worn: np.ndarray,
) -> list[tuple[int, int]]:
    """Find optimal one-to-one matching between cluster centroids.

    Parameters
    ----------
    centroids_still : array of shape (K, 3)
        LAB centroids from the still-life segmentation.
    centroids_worn : array of shape (K, 3)
        LAB centroids from the worn-image segmentation.

    Returns
    -------
    List of (still_index, worn_index) pairs forming the optimal assignment.
    """
    # Build cost matrix using Euclidean distance in LAB (CIE76 Delta-E)
    k = centroids_still.shape[0]
    cost_matrix = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            diff = centroids_still[i] - centroids_worn[j]
            cost_matrix[i, j] = np.sqrt(np.sum(diff ** 2))

    row_ind, col_ind = linear_sum_assignment(cost_matrix)
    return list(zip(row_ind.tolist(), col_ind.tolist()))
```
/// details | Vectorized cost matrix construction

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment


def match_clusters_fast(
    centroids_still: np.ndarray,
    centroids_worn: np.ndarray,
) -> list[tuple[int, int]]:
    """Optimized version using scipy.spatial.distance.cdist."""
    cost_matrix = cdist(centroids_still, centroids_worn, metric="euclidean")
    row_ind, col_ind = linear_sum_assignment(cost_matrix)
    return list(zip(row_ind.tolist(), col_ind.tolist()))
```

///
From pairwise matching to the full pipeline -- With clustering and optimal assignment in hand, we can now trace the complete flow: background removal feeds into per-view segmentation, the cost matrix connects the two views, and the matched pairs drive all downstream operations.
Pipeline Integration¶
End-to-End Flow¶
The clustering and matching stages fit into the broader garment processing pipeline as follows:
```mermaid
flowchart LR
    A[Input Images] --> B[Background Removal]
    B -- "edge / mask" --> C[K-Means Segmentation]
    C --> D[Build Cost Matrix]
    D --> E[Hungarian Matching]
    E --> F[Downstream Tasks]
    F --> G[Color Transfer]
    F --> H[Quality Comparison]
    F --> I[Compositing Alignment]
```
1. Segment garment from background -- using mask generation techniques (see Computer Vision Techniques for edge detection and mask refinement concepts used at this stage).
2. K-Means on each view -- cluster the masked garment pixels in LAB + spatial feature space to identify garment parts.
3. Build cost matrix -- compute pairwise Delta-E distances between centroids from the two views.
4. Hungarian match -- find the optimal one-to-one assignment.
5. Use matched pairs downstream -- the correspondences enable:
    - Color transfer -- map the color distribution of each region from one view to the matching region in the other.
    - Quality comparison -- compare texture sharpness, color consistency, or exposure between matched parts.
    - Compositing alignment -- use matched regions as anchor points when aligning layers for final output.
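As one illustration of a downstream step, a per-region color transfer can shift each matched worn region's mean LAB value onto its still-life counterpart. The function below is a minimal sketch under that simple mean-shift assumption, not the pipeline's actual color-transfer implementation:

```python
import numpy as np

def transfer_region_means(
    lab_worn: np.ndarray,          # (H, W, 3) worn image in LAB, float
    labels_worn: np.ndarray,       # (H, W) cluster label map for the worn view
    centroids_still: np.ndarray,   # (K, 3) LAB centroids, still-life view
    centroids_worn: np.ndarray,    # (K, 3) LAB centroids, worn view
    matches: list[tuple[int, int]],  # (still_index, worn_index) pairs
) -> np.ndarray:
    """Shift each matched worn region's mean color to its still-life centroid."""
    out = lab_worn.copy()
    for still_idx, worn_idx in matches:
        mask = labels_worn == worn_idx
        # Per-region additive shift in LAB moves the region mean exactly
        # onto the matched centroid while preserving within-region variation.
        out[mask] += centroids_still[still_idx] - centroids_worn[worn_idx]
    return out
```

Richer transfers (matching variance or full distributions per region) follow the same pattern: iterate over the matched pairs and remap pixels within each worn-region mask.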
Comparison with Other Clustering Approaches¶
| Method | Time Complexity | Requires K? | Cluster Shape | Noise Handling | Why / Why Not for Garments |
|---|---|---|---|---|---|
| K-Means | \(O(nKt)\) | Yes | Spherical | Poor | Best fit -- fast, controllable K matches known garment structure, sufficient for color+spatial features |
| DBSCAN | \(O(n^2)\) | No | Arbitrary | Excellent | K is not controllable; tends to over-segment or under-segment depending on density parameters |
| Mean Shift | \(O(n^2)\) | No | Arbitrary | Good | Too slow for full-resolution garment images; K not controllable |
| GMM (EM) | \(O(nK t d^3)\) | Yes | Ellipsoidal | Moderate | More flexible cluster shapes but significantly slower; ellipsoidal modeling is unnecessary for this feature space |
K-Means is the right choice here because the number of garment parts is known (or narrowly bounded), the feature space is low-dimensional, clusters are approximately spherical in the LAB+spatial representation, and speed matters for production throughput.