No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son*, and Sung-Eui Yoon*

*Corresponding Authors


Figure 1: Distribution of membership scores across different conditioning types: (a) ground-truth (GT) captions, (b) VLM-generated captions, and (c) our model-fitted embeddings. (d) and (e) illustrate $\mathcal{L}_{\text{cond}}$ values under GT and VLM-generated caption conditions, and $\mathcal{L}_{\text{uncond}}$ values for member and hold-out samples, respectively.

Observation

Replacing GT captions with VLM-generated captions leads to a clear drop in MIA performance, despite their semantic similarity to the images. We attribute this to an asymmetric sensitivity to conditioning:

Members: $\mathcal{L}_{\text{cond}}$ increases sharply when GT captions are replaced by VLM-generated ones (Fig. 1(d)).
Hold-outs: $\mathcal{L}_{\text{cond}}$ shows only a modest increase under the same substitution (Fig. 1(e)).
Both groups: $\mathcal{L}_{\text{uncond}}$ remains approximately unchanged.

This asymmetry causes a systematic upward shift in the membership score (i.e., $\mathcal{L}_{\text{cond}} - \mathcal{L}_{\text{uncond}}$) for members, explaining the decreased separability observed in Fig. 1(b).
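To make the effect of this shift concrete, the short Python sketch below computes the membership score for one member and one hold-out sample under GT and VLM-generated captions. All loss values are hypothetical and chosen only to mirror the qualitative trend described above; they are not measurements from the paper, and the names `TOY_LOSSES`, `membership_score`, and `score_gap` are introduced here for illustration.

```python
# Hypothetical loss values (not from the paper), chosen to mirror the trend:
# swapping GT for VLM captions raises L_cond sharply for the member and only
# slightly for the hold-out, while L_uncond is unaffected.
TOY_LOSSES = {
    "member":  {"uncond": 0.30, "cond_gt": 0.10, "cond_vlm": 0.25},
    "holdout": {"uncond": 0.40, "cond_gt": 0.38, "cond_vlm": 0.40},
}


def membership_score(loss_cond, loss_uncond):
    """Membership score as defined in the text: L_cond - L_uncond."""
    return loss_cond - loss_uncond


def score_gap(losses, cond_key):
    """Hold-out score minus member score under one conditioning type."""
    member = membership_score(losses["member"][cond_key],
                              losses["member"]["uncond"])
    holdout = membership_score(losses["holdout"][cond_key],
                               losses["holdout"]["uncond"])
    return holdout - member


gap_gt = score_gap(TOY_LOSSES, "cond_gt")
gap_vlm = score_gap(TOY_LOSSES, "cond_vlm")
print(f"score gap with GT captions:  {gap_gt:.2f}")
print(f"score gap with VLM captions: {gap_vlm:.2f}")
```

In this toy setting the gap between the two groups narrows once VLM captions are substituted, which is the loss of separability the observation refers to.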

Intuition

We propose to exploit the observed sensitivity difference to improve membership inference in the caption-free setting. Specifically, we generate conditioning embeddings that induce the following loss behavior:

Member samples:
- $\mathcal{L}_{\text{cond}}$: increases sharply with fitted embeddings.
- $\mathcal{L}_{\text{uncond}}$: stays low (due to direct involvement in training).
Hold-out samples:
- $\mathcal{L}_{\text{cond}}$: rises only modestly.
- $\mathcal{L}_{\text{uncond}}$: remains relatively high.

As in Fig. 1(c), MoFit generates embeddings that amplify $\mathcal{L}_{\text{cond}} - \mathcal{L}_{\text{uncond}}$ for members while suppressing it for hold-outs, thereby reinstating a reliable separability signal in the caption-free setting.

Methodology

Figure 2: Overview of our proposed method. (a) Given a query image $x_0$, we first optimize a perturbation $\delta$ to overfit to the learned representation from the model. (b) From the resulting surrogate image $x_0 + \delta^*$, we extract a model-fitted embedding $\phi^*$, which is then used as a synthetic condition to amplify the disparity between member and hold-out samples in (c).
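The two optimization stages in Figure 2 can be sketched as two nested-free gradient-descent loops. The Python sketch below is a toy illustration only: it assumes that stage (a) minimizes an unconditional loss over $\delta$ and that stage (b) minimizes a conditional loss over $\phi$ on the surrogate image, which this page does not state explicitly. The quadratic losses, the vector `MODEL_MODE`, and the functions `fit_perturbation` and `fit_embedding` are hypothetical stand-ins; a real attack would use the diffusion model's denoising losses, and this toy is not meant to reproduce member/hold-out separation.

```python
# Hypothetical stand-in for what the model has "learned"; not a diffusion model.
MODEL_MODE = [0.5, -0.2, 0.1]


def loss_uncond(x):
    """Toy unconditional loss: squared distance to MODEL_MODE."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, MODEL_MODE))


def loss_cond(x, phi):
    """Toy conditional loss: squared distance between image and embedding."""
    return sum((xi - pi) ** 2 for xi, pi in zip(x, phi))


def fit_perturbation(x0, steps=200, lr=0.1):
    """Stage (a), assumed objective: minimise loss_uncond(x0 + delta) over delta."""
    delta = [0.0] * len(x0)
    for _ in range(steps):
        # Analytic gradient of (x0_i + delta_i - m_i)^2 with respect to delta_i.
        grad = [2 * (xi + di - mi) for xi, di, mi in zip(x0, delta, MODEL_MODE)]
        delta = [di - lr * gi for di, gi in zip(delta, grad)]
    return delta


def fit_embedding(surrogate, steps=200, lr=0.1):
    """Stage (b), assumed objective: minimise loss_cond(surrogate, phi) over phi."""
    phi = [0.0] * len(surrogate)
    for _ in range(steps):
        # Analytic gradient of (s_i - phi_i)^2 with respect to phi_i.
        grad = [-2 * (si - pi) for si, pi in zip(surrogate, phi)]
        phi = [pi - lr * gi for pi, gi in zip(phi, grad)]
    return phi


x0 = [1.0, 0.0, -1.0]                                    # toy query image
delta_star = fit_perturbation(x0)
surrogate = [xi + di for xi, di in zip(x0, delta_star)]  # x0 + delta*
phi_star = fit_embedding(surrogate)                      # model-fitted embedding

print("uncond loss, query vs. surrogate:", loss_uncond(x0), loss_uncond(surrogate))
print("cond loss on surrogate with fitted embedding:", loss_cond(surrogate, phi_star))
```

The point of the sketch is the ordering: the embedding is fitted to the surrogate $x_0 + \delta^*$ rather than to the query image itself, and only afterwards used as the condition when scoring the query.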

Experimental Result

Table 1: Comparison of membership inference performance under the caption-free setting, where baseline methods are conditioned using either ground-truth or VLM-generated captions. Bold numbers denote the best, and underlined numbers indicate the second-best results.

We evaluate on latent diffusion models (LDMs) fine-tuned on the Pokemon, MS-COCO, and Flickr datasets under a caption-free threat model.
Key results from Tab. 1:

VLM captions fall short: Replacing GT captions with VLM-generated alternatives causes a substantial drop in MIA performance; e.g., CLiD's ASR on Pokemon decreases by nearly 29%.
MoFit outperforms all caption-free baselines: Model-fitted embeddings consistently achieve higher ASR and AUC across all three datasets.
Surpasses GT-caption upper bound: On MS-COCO, MoFit even exceeds CLiD conditioned on GT captions, demonstrating that surrogate-based misalignment can be a competitive alternative.
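
For readers unfamiliar with the two metrics, the Python sketch below shows one common way to compute AUC and ASR from per-sample membership scores. It rests on two assumptions not stated on this page: that a higher score indicates membership, and that ASR is the best accuracy achievable with a single threshold. The scores are hypothetical, and the functions `auc` and `asr` are illustrative rather than the authors' evaluation code.

```python
def auc(member_scores, holdout_scores):
    """Probability that a random member outscores a random hold-out (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for h in holdout_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(member_scores) * len(holdout_scores))


def asr(member_scores, holdout_scores):
    """Best accuracy over all thresholds, predicting 'member' when score >= threshold."""
    total = len(member_scores) + len(holdout_scores)
    # Candidate thresholds: every observed score, plus one above the maximum.
    candidates = sorted(set(member_scores) | set(holdout_scores))
    candidates.append(float("inf"))
    best = 0.0
    for t in candidates:
        correct = sum(s >= t for s in member_scores)
        correct += sum(s < t for s in holdout_scores)
        best = max(best, correct / total)
    return best


# Hypothetical membership scores, for illustration only.
members = [0.9, 0.8, 0.4, 0.7]
holdouts = [0.1, 0.5, 0.3, 0.2]

print("AUC:", auc(members, holdouts))
print("ASR:", asr(members, holdouts))
```

With these toy scores, one hold-out (0.5) outranks one member (0.4), so neither metric reaches 1.0.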