An Overview of Adversarial Attacks and Defenses


Although neural networks have grown significantly more capable over the past decade, they have yet to become foolproof. These networks can be forced to misclassify simple images by adding perturbations that may be indiscernible to the human eye. Many of today’s most prominent services, like those provided by Facebook and Google, are tightly integrated with neural networks; as such, it is critical that these systems are built to withstand attacks. In this post, I review four types of attacks and three defenses that I have implemented on the MNIST dataset using the Cleverhans library [1] and the Keras framework. The code for my work is available at

One of the first examples of an adversarial attack was demonstrated by Goodfellow et al. [2], who added a perturbation to an image of a panda that GoogLeNet’s classifier had labeled with 57.7% confidence. Post-attack, GoogLeNet classified the same image as a gibbon with 99.3% confidence. A large-scale application of adversarial attacks is fooling a facial recognition system: in an attack implemented by Sharif et al. [3], users wearing eyeglasses printed with adversarial perturbations were recognised as entirely different people.

These attacks could leave a network vulnerable and be used for malicious purposes. In order to defend and maintain security, one should gain a better understanding of how these attacks work.


Types of Attacks

  1. Fast gradient sign method

    In general, a classifier is taught which input sample falls in which category by being penalized for misclassifications. The penalty corresponds to the severity of the misclassification, and over time, the classifier makes fewer and fewer mistakes. This practice is encompassed in a neural network’s cost function, whose output is the penalty applied to the network. A simple attack would try to maximize the error produced by the cost function, which would then force a misclassification. The Fast Gradient Sign Method (FGSM) does just this by computing the sign of the cost function’s gradient and adding a certain error to each element in the data sample to maximize the loss.

    The adversarial sample is computed as x′ = x + ε · sgn(∇x J(w, x, y)), where x refers to the original input sample and sgn(∇x J(w, x, y)) is the sign of the gradient of the cost function J, which is a function of the input x, the output classification y, and the classifier weights w. ε is the error, or perturbation, added uniformly to each dimension of x. Using an ε of 0.25, the following was the result of the FGSM attack on the MNIST dataset.

    Classifying these attacked images gave an accuracy of only 8.28%.
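The core of FGSM fits in a few lines. The following toy sketch (my own illustration, not the Cleverhans implementation) attacks a hand-built logistic-regression classifier, for which the cross-entropy gradient with respect to the input has a closed form:

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One-step FGSM attack on a logistic-regression classifier.

    The binary cross-entropy gradient w.r.t. the input x is
    (sigmoid(w.x + b) - y) * w; the attack adds eps times its sign.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability of class 1
    grad_x = (p - y) * w                    # dJ/dx for the cross-entropy loss
    return x + eps * np.sign(grad_x)        # step that maximizes the loss

# toy classifier and a sample it correctly classifies as class 1
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])                    # logit = 1.5 > 0 -> class 1
x_adv = fgsm(x, y=1.0, w=w, b=b, eps=1.0)
print(x_adv @ w + b)                        # -> -1.5, i.e. now misclassified
```

With the actual MNIST model, the same gradient is obtained from the network by backpropagation rather than in closed form.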

  2. Projected Gradient Descent

    Projected Gradient Descent is very similar to FGSM, but instead of calculating the gradient of the cost function once and adding the full error, a new gradient is calculated at every iteration and a much smaller perturbation is added to each element in the data sample. By iterating, this algorithm essentially takes mini FGSM steps towards a misclassification, making it easier to choose an effective attacking step size. This is represented as:

    x^{t+1} = Π_{x+S}( x^t + α · sgn(∇x J(w, x^t, y)) )

    x^{t+1} refers to the data sample at the current iteration, x^t refers to the data sample at the previous iteration, Π_{x+S} denotes projection onto the set of allowed perturbed samples x + S (for example, an ℓ∞ ball of radius ε around the original sample), and α is a small perturbation constant.

    Classifying these attacked images gave an accuracy of only 0.99%.
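As a sketch (again on a hand-built logistic-regression classifier rather than the Cleverhans API), each PGD step is a small signed-gradient move followed by a projection back onto the ℓ∞ ball around the original sample:

```python
import numpy as np

def pgd(x0, y, w, b, eps, alpha, steps):
    """Iterative PGD attack on a logistic-regression classifier.

    Each step is a small FGSM-style move; the result is then
    projected back onto the l-infinity ball of radius eps around x0.
    """
    x = x0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # current prediction
        grad_x = (p - y) * w                     # gradient of the loss w.r.t. x
        x = x + alpha * np.sign(grad_x)          # mini FGSM step
        x = np.clip(x, x0 - eps, x0 + eps)       # projection onto the allowed set
    return x

w = np.array([2.0, -1.0])
b = 0.0
x0 = np.array([1.0, 0.5])                        # logit = 1.5 > 0 -> class 1
x_adv = pgd(x0, y=1.0, w=w, b=b, eps=1.0, alpha=0.3, steps=10)
print(x_adv @ w + b)                             # logit driven negative
```

For image data, the same `np.clip` call typically also enforces the valid pixel range.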

  3. Carlini Wagner Attack

    Carlini–Wagner (CW) is one of the most effective adversarial attacks, though it comes with a high computational cost [4]. CW is an iterative attack that adds a perturbation to a sample under the ℓ2 constraint, maximizing the probability that the prediction lands on a target class t that is not equal to the true class. This is implemented by finding a w in the following optimization problem using gradient descent:

    minimize ‖ ½(tanh(w) + 1) − x ‖₂² + c · f( ½(tanh(w) + 1) )

    where f is

    f(x′) = max( max{ Z(x′)ᵢ : i ≠ t } − Z(x′)ₜ, −κ )

    In the above equations, x is the true sample, x′ = ½(tanh(w) + 1) is the adversarial sample, c weighs the misclassification term against the distance term, κ is a constant that allows the adversarial sample to be misclassified with high confidence, and Z(⋅) is the output of the classifier before applying the softmax activation.

    Classifying these attacked images gave an accuracy of only 1.49%.
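The objective f is easy to sketch on its own. The helper below (a hypothetical name of my own, not part of Cleverhans) evaluates f on a vector of pre-softmax logits; the full attack adds this term, weighted by the constant c, to the ℓ2 distance and minimizes over w by gradient descent:

```python
import numpy as np

def cw_f(logits, target, kappa=0.0):
    """Carlini-Wagner objective f for a target class.

    Returns the best non-target logit minus the target logit, floored
    at -kappa; the value is negative exactly when the target class
    beats every other class by a margin greater than kappa.
    """
    z_t = logits[target]                       # target-class logit Z(x')_t
    z_other = np.max(np.delete(logits, target))  # best competing logit
    return max(z_other - z_t, -kappa)

# attack not yet successful: class 0 still beats the target class 1
print(cw_f(np.array([5.0, 2.0, 1.0]), target=1))             # -> 3.0
# successful with margin >= kappa: the objective bottoms out at -kappa
print(cw_f(np.array([1.0, 5.0, 2.0]), target=1, kappa=2.0))  # -> -2.0
```

The −κ floor is what lets the attacker trade extra perturbation for higher-confidence misclassification.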

  4. DeepFool

    DeepFool is also an iterative attack in which small perturbations are added to the sample. At each iteration, the perturbed sample is checked for adversariality: if the neural network still classifies it correctly, the iterations continue, and when the network finally misclassifies the sample, the loop terminates. In the original paper [5], the authors attacked an image of a whale with a barely perceptible perturbation, causing it to be misclassified as a turtle.

    The original paper also provides pseudocode generalizing the attack to multi-class classifiers.

    Classifying these attacked images gave an accuracy of only 1.63%.
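For an affine binary classifier f(x) = w·x + b, the minimal step to the decision boundary has a closed form, which makes DeepFool easy to sketch (my own simplified illustration of the paper's binary case, not the Cleverhans implementation; for deep networks the same step is applied to a local linearization at each iteration):

```python
import numpy as np

def deepfool_binary(x, w, b, overshoot=0.02, max_iter=50):
    """DeepFool for an affine binary classifier f(x) = w.x + b.

    Each iteration steps to the closest point on the decision
    boundary, with a small overshoot to actually cross it; the loop
    stops as soon as the predicted sign flips (adversariality check).
    """
    x_adv = x.copy()
    orig_sign = np.sign(x @ w + b)
    for _ in range(max_iter):
        f = x_adv @ w + b
        if np.sign(f) != orig_sign:          # misclassified: done
            break
        r = -(f / (w @ w)) * w               # minimal step to the boundary
        x_adv = x_adv + (1 + overshoot) * r  # overshoot to cross it
    return x_adv

w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])                     # f(x) = 1.5 > 0
x_adv = deepfool_binary(x, w, b)
print(x_adv @ w + b)                         # slightly negative: sign flipped
```

Because it stops at the first sign flip, the perturbation DeepFool finds is close to the smallest one that works.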

Types of Defenses

  1. Adversarial Detection using Denoising Autoencoder

    Essentially, an autoencoder compresses input data down to its most robust features and recreates the original image from those features. Forcing the data through this bottleneck prevents the network from simply learning the identity function.

    At first glance, a neural network that simply reproduces its input at its output seems useless. But its use becomes apparent in the context of removing perturbations added to input images: a denoising autoencoder produces a replica of an input at its output, maintaining its most important features while removing perturbations imposed by an attack [6].

    Traditional defenses rely on removing perturbations from attacked samples so that a neural network may correctly classify the data. While this is essential in making networks more robust, there will inevitably be samples that even the defenses cannot fix. In these extreme cases, it becomes necessary that a network at the very least knows that a sample has been attacked so that necessary precautions can be taken. This is known as adversarial detection.  

    In my implementation, I defended against the fast gradient sign method (FGSM) attack by training an autoencoder on adversarial training data and computing the mean squared error (MSE) between the adversarial samples and their reconstructions. During testing, the minimum error found during training, 0.020346, is set as the threshold. After feeding test data through the autoencoder and finding the reconstructions, we calculate the MSE for the testing set; any sample whose error is higher than the threshold is counted as a false positive, and anything lower is counted as a true positive. On defending against the FGSM attack, the ratio of true positives to false positives was 10000:1.
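The detection idea can be sketched without training an actual autoencoder. Below, as an assumption-laden stand-in, the "autoencoder" is a projection onto the top principal directions of clean data, and (unlike the threshold convention in my implementation above) a sample is flagged as adversarial when its reconstruction error exceeds a threshold calibrated on clean samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# "clean" samples lie in a 2-D subspace of a 10-D space
basis = rng.normal(size=(10, 2))
clean = rng.normal(size=(200, 2)) @ basis.T
attacked = clean + 0.5 * np.sign(rng.normal(size=clean.shape))  # FGSM-like noise

# stand-in for the denoising autoencoder: reconstruct each sample from
# the top principal directions of the clean data
_, _, vt = np.linalg.svd(clean, full_matrices=False)

def reconstruct(x):
    return (x @ vt[:2].T) @ vt[:2]   # keep only the 2 robust directions

def mse(a, b):
    return ((a - b) ** 2).mean(axis=1)

# calibrate the threshold on clean reconstruction errors, then flag any
# test sample whose error exceeds it
threshold = mse(clean, reconstruct(clean)).max()
detected = mse(attacked, reconstruct(attacked)) > threshold
print(detected.mean())               # fraction of attacked samples flagged
```

The perturbation mostly lives outside the subspace of robust features, so reconstruction error separates attacked from clean samples cleanly in this toy setting.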

  2. Adversarial Training

    This defense mechanism works by generating adversarial examples using the gradient of the target classifier and re-training the classifier on both the perturbed and unperturbed samples with their original labels. This defense mechanism, however, is only viable under two assumptions:

    • The target classifier (or at least its gradients) is known.
    • No further, unseen types of attacks will follow.

    Though this method is more thorough than others and yields better accuracy, it comes with a high computational cost.

    In my implementation, I defended against the projected gradient descent (PGD) attack and obtained an accuracy of 97.73%.
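A minimal sketch of the loop, using a hand-rolled logistic-regression classifier and one-step FGSM in place of the Keras model and PGD from my implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy 2-class data: feature 0 carries the signal, feature 1 is noise
n = 400
y = (rng.random(n) < 0.5).astype(float)
x = rng.normal(scale=0.3, size=(n, 2))
x[:, 0] += 2.0 * y - 1.0                   # class means at +1 / -1

def train(x, y, steps=500, lr=0.5):
    """Plain gradient descent on the cross-entropy loss."""
    w, b = np.zeros(2), 0.0
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        w -= lr * ((p - y) @ x) / len(x)
        b -= lr * (p - y).mean()
    return w, b

def fgsm(x, y, w, b, eps):
    p = sigmoid(x @ w + b)
    return x + eps * np.sign((p - y)[:, None] * w)

# 1. generate adversarial examples with the target classifier's gradient
w0, b0 = train(x, y)
x_adv = fgsm(x, y, w0, b0, eps=0.3)

# 2. re-train on clean and perturbed samples with their original labels
w1, b1 = train(np.vstack([x, x_adv]), np.concatenate([y, y]))

acc = lambda w, b, xs: ((sigmoid(xs @ w + b) > 0.5) == (y > 0.5)).mean()
robust_acc = acc(w1, b1, fgsm(x, y, w1, b1, eps=0.3))
print(robust_acc)                          # accuracy under a fresh attack
```

In practice this generate-and-retrain step is repeated during training (attacking the current model each epoch), which is where the high computational cost comes from.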

  3. Principal Component Analysis

    Principal Component Analysis is a technique for dimensionality reduction that identifies patterns in data based on the correlation between features. It then transforms the data by projecting it onto a set of orthogonal axes with equal or fewer dimensions. The goal is to extract the most impactful dimensions of the data, such that noise, which does not greatly affect those dimensions, can be filtered out.

    In this implementation, I took the first 100 principal components of each sample in the training set and trained a classifier on the new feature vectors. To defend against an attack, I projected the perturbed data onto the same 100 principal components; on classification, this gave an accuracy of 83.65%.
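The pipeline can be sketched end to end with NumPy alone. As assumptions of this toy version, the classifier is a nearest-centroid rule standing in for the network, the data is synthetic rather than MNIST, and the "attack" is sign noise rather than a gradient-based perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)

# two classes in 20-D; the signal is the direction between the class means
centers = rng.normal(size=(2, 20))
y = rng.integers(0, 2, size=300)
x = centers[y] + 0.3 * rng.normal(size=(300, 20))

# fit PCA on the training data and keep the top k principal components
mean = x.mean(axis=0)
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
k = 5
project = lambda a: (a - mean) @ vt[:k].T

# nearest-centroid classifier trained in PCA space
cents = np.stack([project(x[y == c]).mean(axis=0) for c in (0, 1)])
predict = lambda a: np.argmin(
    ((project(a)[:, None, :] - cents) ** 2).sum(axis=-1), axis=1)

# FGSM-like sign noise; projecting onto the components filters most of it out
x_attacked = x + 0.2 * np.sign(rng.normal(size=x.shape))
acc = (predict(x_attacked) == y).mean()
print(acc)                                 # accuracy on the attacked samples
```

Because the perturbation is spread across all input dimensions while the class signal is concentrated in the leading components, the projection discards most of the attack energy.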


This review delineates some of the ways neural networks may be exploited for malicious intent and some fundamental defenses against those exploitations. The original FGSM finds an inventive way to elicit a high cost-function response from a neural network that would otherwise have accurately identified an image, while its iterative follow-ups incrementally modify images until the network misclassifies them.

These attacks often trade off computational cost against effectiveness, just as their defense counterparts do. On the flip side of the coin, we have defenses like denoising autoencoders, which break down the input image and recreate it to remove noise, or PCA, which retains only the most important dimensions of the input sample. Together, these explanations constitute a gentle introduction to the world of neural network security, which will only become more important as the world trends toward a smarter, more connected future.

The code for my work on the above-mentioned attacks and defenses is available at


  1. Cleverhans library repository
  2. Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples
  3. Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, Michael K. Reiter. Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition
  4. Nicholas Carlini, David Wagner. Towards Evaluating the Robustness of Neural Networks
  5. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Pascal Frossard. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks
  6. Rajeev Sahay, Rehana Mahfuz, Aly El Gamal. Combatting Adversarial Attacks through Denoising and Dimensionality Reduction: A Cascaded Autoencoder Approach