No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings


Figure 1: Distribution of membership scores under different conditioning types: (a) ground-truth (GT) captions, (b) VLM-generated captions, and (c) our model-fitted embeddings. (d) and (e) show \(\mathcal{L}_{\text{cond}}\) under GT and VLM-generated captions, alongside \(\mathcal{L}_{\text{uncond}}\), for member and hold-out samples, respectively.


Observation

Replacing GT captions with VLM-generated captions leads to a clear drop in MIA performance, despite their semantic similarity to the images. We attribute this to an asymmetric sensitivity to conditioning:

  • Members: \(\mathcal{L}_{\text{cond}}\) increases sharply when GT captions are replaced by VLM-generated ones (Fig. 1(d)).
  • Hold-outs: \(\mathcal{L}_{\text{cond}}\) shows only a modest increase under the same substitution (Fig. 1(e)).
  • Both groups: \(\mathcal{L}_{\text{uncond}}\) remains approximately unchanged.

This asymmetry causes a systematic upward shift in the membership score (i.e., \(\mathcal{L}_{\text{cond}} - \mathcal{L}_{\text{uncond}}\)) for members while hold-out scores barely move, which narrows the gap between the two groups and explains the decreased separability observed in Fig. 1(b).
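To make the effect on the score concrete, the following minimal sketch computes \(\mathcal{L}_{\text{cond}} - \mathcal{L}_{\text{uncond}}\) for one member and one hold-out sample under both caption types. The loss values are made-up numbers chosen only to mimic the trend in Fig. 1(d) and (e), and `membership_score` and `score_gap` are hypothetical helper names, not part of any released code.

```python
def membership_score(l_cond: float, l_uncond: float) -> float:
    """Membership score as defined in the text: L_cond - L_uncond."""
    return l_cond - l_uncond


# Illustrative (made-up) losses: the member's conditional loss rises sharply
# when the GT caption is swapped for a VLM caption; the hold-out's barely moves.
losses = {
    "member":   {"gt": 0.10, "vlm": 0.18, "uncond": 0.20},
    "hold-out": {"gt": 0.19, "vlm": 0.20, "uncond": 0.21},
}


def score_gap(caption_type: str) -> float:
    """Hold-out score minus member score under one conditioning type."""
    member = membership_score(losses["member"][caption_type],
                              losses["member"]["uncond"])
    hold_out = membership_score(losses["hold-out"][caption_type],
                                losses["hold-out"]["uncond"])
    return hold_out - member


gap_gt = score_gap("gt")
gap_vlm = score_gap("vlm")
print(f"score gap with GT captions:  {gap_gt:.3f}")
print(f"score gap with VLM captions: {gap_vlm:.3f}")
```

With these toy numbers the gap between the two groups shrinks once the member's conditional loss shifts upward, which is the loss of separability described above.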

Intuition

We propose to exploit this sensitivity difference to improve membership inference in the caption-free setting. Specifically, we generate conditioning embeddings under which the two groups respond differently:

  • Member samples:
    • $\mathcal{L}_{\text{cond}}$: increases sharply with fitted embeddings.
    • $\mathcal{L}_{\text{uncond}}$: stays low, since these samples were seen during training.
  • Hold-out samples:
    • $\mathcal{L}_{\text{cond}}$: rises only modestly.
    • $\mathcal{L}_{\text{uncond}}$: remains relatively high.

As shown in Fig. 1(c), MoFit generates embeddings that amplify $\mathcal{L}_{\text{cond}} - \mathcal{L}_{\text{uncond}}$ for members while suppressing it for hold-outs, thereby restoring a reliable separability signal in the caption-free setting.
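Written out, the score evaluated with a conditioning embedding \(\phi\) can be sketched as follows. This assumes the standard noise-prediction loss of diffusion models; the symbols \(\epsilon_\theta\), \(\bar{\alpha}_t\), the null condition \(\varnothing\), and the score name \(s\) are notation introduced here for illustration, and the exact loss used by MoFit may differ.

\[
\mathcal{L}_{\text{cond}}(x_0, \phi) = \mathbb{E}_{t,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, \phi) \rVert_2^2\right],
\qquad
\mathcal{L}_{\text{uncond}}(x_0) = \mathbb{E}_{t,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, \varnothing) \rVert_2^2\right],
\]
\[
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad
s(x_0; \phi) = \mathcal{L}_{\text{cond}}(x_0, \phi) - \mathcal{L}_{\text{uncond}}(x_0).
\]

Under this notation, the goal stated above is an embedding \(\phi\) for which \(s(x_0; \phi)\) is large when \(x_0\) is a member and small when it is a hold-out.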

Methodology

Figure 2: Overview of our proposed method. (a) Given a query image $x_0$, we first optimize a perturbation $\delta$ so that the perturbed image overfits to the representation learned by the model. (b) From the resulting surrogate image $x_0 + \delta^*$, we extract a model-fitted embedding $\phi^*$, which is then used in (c) as a synthetic condition to amplify the disparity between member and hold-out samples.
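The three steps in Fig. 2 can be sketched on a toy model. Everything below is an assumption made for illustration: the linear noise predictor stands in for the LDM, stage (a) is taken to minimize the unconditional loss over $\delta$, stage (b) is taken to minimize the conditional loss over $\phi$ on the surrogate, and a single fixed noise draw and timestep replace the usual expectation. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_X, DIM_C = 8, 4
ALPHA_BAR = 0.5  # hypothetical noise-schedule value at one fixed timestep

# Toy linear stand-in for a conditional noise predictor eps_theta(x_t, c).
# A is low-rank so the perturbation alone cannot drive the loss to zero.
A = 0.1 * rng.standard_normal((DIM_X, 3)) @ rng.standard_normal((3, DIM_X))
B = 0.3 * rng.standard_normal((DIM_X, DIM_C))
NULL_COND = np.zeros(DIM_C)  # assumed "unconditional" (null) embedding


def residual(x0, cond, eps):
    """Predicted noise minus true noise for the noised sample x_t."""
    x_t = np.sqrt(ALPHA_BAR) * x0 + np.sqrt(1.0 - ALPHA_BAR) * eps
    return A @ x_t + B @ cond - eps


def denoise_loss(x0, cond, eps):
    """Squared-error denoising loss at the fixed timestep."""
    r = residual(x0, cond, eps)
    return float(r @ r)


def fit_perturbation(x0, eps, steps=200, lr=0.05):
    """Stage (a), assumed objective: minimize the unconditional loss over delta."""
    delta = np.zeros_like(x0)
    for _ in range(steps):
        r = residual(x0 + delta, NULL_COND, eps)
        # Analytic gradient of r.r with respect to delta for the toy model.
        delta -= lr * 2.0 * np.sqrt(ALPHA_BAR) * (A.T @ r)
    return delta


def fit_embedding(x_surrogate, eps, steps=200, lr=0.05):
    """Stage (b), assumed objective: minimize the conditional loss over phi."""
    phi = np.zeros(DIM_C)
    for _ in range(steps):
        r = residual(x_surrogate, phi, eps)
        phi -= lr * 2.0 * (B.T @ r)
    return phi


x0 = rng.standard_normal(DIM_X)   # query image as a flattened toy vector
eps = rng.standard_normal(DIM_X)  # one fixed noise draw, for determinism

delta_star = fit_perturbation(x0, eps)
phi_star = fit_embedding(x0 + delta_star, eps)

# Stage (c): score the ORIGINAL query under the model-fitted condition.
score = denoise_loss(x0, phi_star, eps) - denoise_loss(x0, NULL_COND, eps)
print("membership score:", score)
```

The toy model has no notion of members or hold-outs, so the printed score carries no meaning by itself; the sketch only shows the order of the optimizations and that the embedding fitted to the surrogate is then applied to the original query.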




Experimental Results

Table 1: Comparison of membership inference performance under the caption-free setting, where baseline methods are conditioned using either ground-truth or VLM-generated captions. Bold numbers denote the best, and underlined numbers indicate the second-best results.

We evaluate on latent diffusion models (LDMs) fine-tuned on the Pokemon, MS-COCO, and Flickr datasets under a caption-free threat model.
Key results from Tab. 1:

  • VLM captions fall short: Replacing GT captions with VLM-generated alternatives causes a substantial drop in MIA performance; $\textit{e.g.}$, CLiD's ASR on Pokemon decreases by nearly 29%.
  • MoFit outperforms all caption-free baselines: Model-fitted embeddings consistently achieve higher ASR and AUC across all three datasets.
  • Surpasses the GT-caption reference: On MS-COCO, MoFit even exceeds CLiD conditioned on GT captions, demonstrating that surrogate-based misalignment can be a competitive alternative to GT captions.
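For readers who want to reproduce the two metrics in Tab. 1 from a list of membership scores, a minimal sketch follows. This summary does not define ASR, so the sketch assumes the common convention of taking the best accuracy over all decision thresholds, and it assumes that a higher score indicates a member (the direction MoFit produces). The scores are made-up values and the function names are hypothetical.

```python
def auc(member_scores, holdout_scores):
    """Probability that a random member outscores a random hold-out (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for h in holdout_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(member_scores) * len(holdout_scores))


def asr(member_scores, holdout_scores):
    """Assumed ASR: best accuracy of the rule 'score >= threshold => member'."""
    total = len(member_scores) + len(holdout_scores)
    # Candidate thresholds: every observed score, plus +inf (predict no members).
    thresholds = sorted(set(member_scores) | set(holdout_scores)) + [float("inf")]
    best = 0.0
    for thr in thresholds:
        true_pos = sum(s >= thr for s in member_scores)
        true_neg = sum(s < thr for s in holdout_scores)
        best = max(best, (true_pos + true_neg) / total)
    return best


# Made-up scores for four members and four hold-outs.
members = [0.9, 0.8, 0.7, 0.4]
holdouts = [0.6, 0.3, 0.2, 0.1]

print("AUC:", auc(members, holdouts))
print("ASR:", asr(members, holdouts))
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be preferable for large evaluation sets.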