The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023
Woo Jae Kim,
Yoonki Cho,
Junsik Jung, and
Sung-Eui Yoon
Korea Advanced Institute of Science and Technology (KAIST)
Deep neural networks are susceptible to adversarial attacks due to the accumulation of perturbations at the feature level, and numerous works have boosted model robustness by deactivating the non-robust feature activations that cause model mispredictions. However, we claim that these malicious activations still contain discriminative cues and that, with recalibration, they can capture additional useful information for correct model predictions. To this end, we propose a novel, easy-to-plug-in approach named Feature Separation and Recalibration (FSR) that recalibrates the malicious, non-robust activations for more robust feature maps through Separation and Recalibration. The Separation part disentangles the input feature map into the robust feature, with activations that help the model make correct predictions, and the non-robust feature, with activations that are responsible for model mispredictions upon adversarial attack. The Recalibration part then adjusts the non-robust activations to restore the potentially useful cues for model predictions. Extensive experiments verify the superiority of FSR compared to traditional deactivation techniques and demonstrate that it improves the robustness of existing adversarial training methods by up to 8.57% with small computational overhead. Code is available at this https URL.
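As a rough illustration of the idea, the sketch below shows what an FSR-style module might look like in PyTorch: a separation stage that scores the robustness of each activation and splits the feature map, followed by a recalibration stage applied to the non-robust part. The layer choices and the sigmoid soft-mask gating are assumptions made for illustration, not the authors' reference implementation.

```python
# Minimal sketch of an FSR-style module (illustrative only; the gating and
# layer choices are assumptions, not the authors' reference implementation).
import torch
import torch.nn as nn

class FSR(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Separation: score the robustness of each activation in [0, 1].
        self.separation = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Recalibration: adjust the non-robust activations to restore useful cues.
        self.recalibration = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        m = self.separation(f)                     # per-activation robustness scores
        f_robust = m * f                           # cues that support correct predictions
        f_nonrobust = (1.0 - m) * f                # activations behind mispredictions
        f_recal = self.recalibration(f_nonrobust)  # restored useful cues
        return f_robust + f_recal                  # combined feature for subsequent layers
```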
Figure 1: Adversarial attacks disrupt the intermediate feature maps learned by deep neural networks, leading to mispredictions. (a) Conventional approaches have focused on suppressing or deactivating these disrupted activations, which can discard potentially useful cues that the model exploits on natural images. (b) We instead propose to restore useful cues from these disrupted activations that are otherwise neglected. This additional useful information better guides the model to make correct predictions under attack.
Figure 2: We propose the Feature Separation and Recalibration (FSR) module to restore useful cues for predictions from disrupted feature activations. The Separation stage disentangles the input feature into the robust feature responsible for correct model predictions and the non-robust feature responsible for model mispredictions. The Recalibration stage then recalibrates the non-robust feature into the recalibrated feature to restore useful cues for correct model predictions. The combined output feature, built from the robust and recalibrated features, is passed down to subsequent layers of the model. FSR is attachable to any CNN model and can be trained with any adversarial training technique in an end-to-end manner.
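The sketch below illustrates this plug-in property: the FSR module sketched above is attached after an intermediate stage of a standard torchvision ResNet-18 (a hypothetical placement after layer3, which outputs 256 channels) and trained end-to-end with plain PGD adversarial training. The attack budget, layer choice, and optimizer settings are illustrative assumptions, and any auxiliary losses used by the full method are omitted here.

```python
# Sketch: attach the FSR module above to a ResNet-18 and train with PGD-AT.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=10)
model.layer3 = nn.Sequential(model.layer3, FSR(channels=256))  # plug FSR in after layer3
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Craft untargeted L-inf PGD adversarial examples for training."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def train_step(x, y):
    model.eval()
    x_adv = pgd_attack(model, x, y)          # generate adversarial examples on the fly
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)  # FSR is trained jointly with the backbone
    loss.backward()
    optimizer.step()
    return loss.item()
```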
Figure 3: We design the Separation Net to learn the robustness of each feature activation based on its relevance to correct predictions. Based on this robustness, we disentangle the input feature activation-wise into the robust feature and the non-robust feature, as sketched below.
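A standalone sketch of how such a separation stage might be implemented follows; it could replace the inline separation branch in the earlier FSR sketch. The conv-BN-ReLU scoring stack and the sigmoid soft mask are assumptions, not the exact Separation Net design.

```python
# Sketch of a Separation-Net-style block (design details are assumptions).
import torch
import torch.nn as nn

class SeparationNet(nn.Module):
    """Scores the robustness of each activation and splits the feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, f: torch.Tensor):
        m = torch.sigmoid(self.score(f))  # per-activation robustness in [0, 1]
        return m * f, (1.0 - m) * f       # robust feature, non-robust feature

# Example: split a batch of intermediate feature maps.
f = torch.randn(8, 256, 16, 16)
f_robust, f_nonrobust = SeparationNet(256)(f)
```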
Figure 4: We design the Recalibration Net to learn the recalibrating units that restore useful cues for correct predictions from the non-robust feature.
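Analogously, the sketch below shows one plausible form for such a recalibration stage, interchangeable with the inline recalibration branch in the earlier FSR sketch; the additive residual form of the recalibrating units is an assumption for illustration.

```python
# Sketch of a Recalibration-Net-style block (the residual form is an assumption).
import torch
import torch.nn as nn

class RecalibrationNet(nn.Module):
    """Learns recalibrating units from the non-robust feature and applies them."""
    def __init__(self, channels: int):
        super().__init__()
        self.units = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, f_nonrobust: torch.Tensor) -> torch.Tensor:
        r = self.units(f_nonrobust)  # recalibrating units
        return f_nonrobust + r       # recalibrated feature with restored cues

# Example: recalibrate the non-robust feature produced by the separation stage.
f_nonrobust = torch.randn(8, 256, 16, 16)
f_recalibrated = RecalibrationNet(256)(f_nonrobust)
```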
Table 1: Robustness (accuracy, %) of various adversarial training strategies (AT, TRADES, and MART) upon adding our FSR module (+ FSR). We evaluate each method with ResNet-18 on the CIFAR-10 and SVHN datasets. Our FSR module consistently improves the robustness against various attacks. Better results are marked in bold. Please refer to our paper for more results on other models (VGG16, WideResNet-34-10) and datasets (CIFAR-100, Tiny ImageNet).
Figure 5: Visualization of attention maps on the features of natural images (Natural) and on the robust, non-robust, and recalibrated features of adversarial images. The robust feature captures discriminative cues regarding the ground-truth class, while the non-robust feature captures irrelevant cues. To further boost feature robustness, we recalibrate the non-robust feature to capture additional useful cues for model predictions.