Adversarial Attack Using Genetic Algorithm

Fooling Convolution Neural Network into prediction what you want

Photo by Daniele Levis Pelusi on Unsplash

Adversarial Attack Using Genetic Algorithm

Adversarial attacks on machine learning models has been a hot research topic for the last year. While many teams are working on understanding the implications of adversarial approach, it is still a new area.

There are two main approaches: White-box optimization requires access to trained model architecture and weights, and uses it’s differentiability property to generate adversarial sample. Black-box optimization treats model as an object with unknown hidden state that only has some I\O interface. While it’s generally more computationally involved, black-box scenario is much closer to real-life event.

In this article I will explain how to generate adversarial examples using genetic programming.

Imagine having your identity stolen by adding unnoticeable noise to your social profile picture. Or having spam email to pass the provider filter by using special wording or other minor changed to the content. This is what a typical adversarial attack would look like.

Setup the model

Without further ado, we’ll take a MNIST dataset. It contains 60000 grayscale images of handwritten digits, each image is 28x28 and contains a single digit.
I will use PyTorch because it is highly flexible library that allows for fast prototyping and is ideal for such kinds of experimenting. The architecture is typical for small neural networks - three [Convolution -> Pooling -> Activation] blocks are followed by fully-connected “head” that ends with softmax layer to output probability of image to belong to each class.

import torch as pt

class CNN(pt.nn.Module):
    def __init__(self):
        self.model = pt.nn.Sequential(
        pt.nn.Conv2d(1, 30, 3),
        pt.nn.Conv2d(30, 30, 3),
        pt.nn.Conv2d(30, 15, 3),
        pt.nn.Linear(15, 64),
        pt.nn.Linear(64, 10),
    def predict(self, X):
        return self.model(X)

After training for a couple of epochs, model achieves 95% accuracy on holdout dataset. Here are some of the predictions for test images - it works like a charm.
CNN predictions.png

Genetic Algorithm

In order to generate adversarial example, we need to define the optimization problem. Given any input image, our model outputs probability of each class. Suppose, that we want it to believe that image contains 7 with high confidence. Then, we need to generate such image that the probability of it being classified as 7 is maximized.

Genetic algorithm is a general approach that requires to define several things beforehand:
  • Gene - single instance of problem solution. In our case it’s 28x28 matrix that represents an image
  • Population - a list of genes
  • Fitness function - this evaluates how good each candidate is. In our setup it’s $P(candidate = 7)$
  • Mutation - function that adds randomness to candidate solution. In our case, randomly changing pixel intensities
  • Crossover - routine to combine several candidate solutions. In our case, we randomly choose half of pixels from one image, and the remaining half from the other

Once we define this, the algorithm is straightforward:

  1. Initialize population randomly
  2. For n epochs:
    • Score each candidate using fitness function
    • Select top $k$ candidates, the others are discarded with a large probability
    • Randomly mutate fraction of them (except for the best one)
    • For the remaining places, add random crossovers until the population size is restored

I won’t go into the implementation details, since you can check code in the repository. Suffice to say, that we have GeneticSolver class that requires image size for initialization. solve method is used as an entry point for optimization and takes fitness function as an argument. It also accepts n_generations - number of generations to run.

Adversarial Attack

Class probability

We start with simple objective: maximize the probability of adversarial target adv_target. We set adv_target=7 for this example.

def fitness_class_probability(X):
    """ Maximize probability of adversarial target class """
    y = model.predict(pt.Tensor(X).unsqueeze(1)).detach().numpy()
    y_target = y[:, adv_target]
    return y_target

The result is pretty impressive: after 1400 epochs, we have persuaded the model that there is 7 on the image with probability of 90%. Here is the visualization of the optimization process:

Class probability with non-zero content penalty

Now let’s do something more fun. For example, let’s not only persuade the model that noise on the image is some digit, but let’s do it with minimal number of active pixels. Transforming this into fitness function:

def fitness_class_probability_empty(X):
    """ Maximize probability of adversarial target class, penalizing mean pixel intensity """
    y = model.predict(pt.Tensor(X).unsqueeze(1)).detach().numpy()
    y_target = y[:, adv_target]
    X_mean = X.mean(axis=1).mean(axis=1)
    return y_target - X_mean

Let’s take adv_target=5 as our next target label. To obtain visually good result I increased the population size from default 100 to 300. Evolution over 1800 epochs:

Change class with minimum change in content

Next, let’s take some adversarial sample from holdout dataset and persuade neural network in the wrong label, say adv_target=2.
Formalizing this into fitness function, we get:

def fitness_similarity(X):
    """ Maximize adv_target probability while minimizing MSE from adv_sample """
    y = model.predict(pt.Tensor(X).unsqueeze(1)).detach().numpy()
    y_target = y[:, adv_target]
    mse = np.sqrt(np.power(X - adv_sample, 2).mean(axis=1).mean(axis=1))
    return y_target - mse



As you can see, one can “fool” model into pretty much whatever output you want, given the appropriate function to optimize. There are a couple of reasons of such behavior:

Distributed representation

What makes neural networks particularly susceptible to such kind of attack is distributed representation - knowledge extracted from the training dataset is shared across all nodes and in a hierarchical manner. Thus, small change in the input can have an avalanche-like effect on the output.

Poor architecture

In the setup I choose, there is no way of incorporating information about how “real” the input looks. One way of making model more robust is to add second task of adversary sample detection. Another approach would be to add dummy class for invalid input. If the architecture change is costly, one can add preprocessing step of using train data for anomaly detection, assuming that input distribution is stationary.

Problem irreducibility

It’s not hard to argue that even when precautionary measures are taken, there will always be inherent susceptibility in black-box attack. Similarly to Information Security, new exploits and countermeasures will be developed in an arms race fashion.


I have set up a repository for you to reproduce and enhance this example.


There are a couple of papers on this subject that I have found interesting and inspiring:

Pavel Tyshevskyi
Pavel Tyshevskyi
Machine Learning Consultant

My interests include machine learning, probability theory, and programming.