Adversarial Attack Using Genetic Algorithm
Fooling a convolutional neural network into predicting what you want
Adversarial attacks on machine learning models have been a hot research topic for the past year. While many teams are working on understanding the implications of adversarial approaches, it is still a new area.
There are two main approaches. White-box optimization requires access to the trained model's architecture and weights, and exploits its differentiability to generate adversarial samples. Black-box optimization treats the model as an object with unknown hidden state that exposes only an I/O interface. While it is generally more computationally involved, the black-box scenario is much closer to real life.
In this article I will explain how to generate adversarial examples using a genetic algorithm.
Imagine having your identity stolen by adding imperceptible noise to your social profile picture. Or having a spam email pass the provider's filter thanks to special wording or other minor changes to its content. This is what a typical adversarial attack would look like.
Setting up the model
Without further ado, we'll take the MNIST dataset. It contains 60,000 grayscale images of handwritten digits; each image is 28x28 pixels and contains a single digit.
I will use PyTorch because it is a highly flexible library that allows for fast prototyping and is ideal for this kind of experimenting. The architecture is typical for small neural networks: three [Convolution -> Pooling -> Activation] blocks are followed by a fully connected "head" that ends with a softmax layer to output the probability of the image belonging to each class.
```python
import torch as pt

class CNN(pt.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = pt.nn.Sequential(
            pt.nn.Conv2d(1, 30, 3),
            pt.nn.MaxPool2d(2),
            pt.nn.ReLU(),
            pt.nn.Conv2d(30, 30, 3),
            pt.nn.MaxPool2d(2),
            pt.nn.ReLU(),
            pt.nn.Conv2d(30, 15, 3),
            pt.nn.MaxPool2d(2),
            pt.nn.ReLU(),
            pt.nn.Flatten(),
            pt.nn.Linear(15, 64),
            pt.nn.ReLU(),
            pt.nn.Linear(64, 10),
            pt.nn.Softmax(dim=1),  # dim=1: softmax over the 10 classes
        )

    def predict(self, X):
        return self.model(X)
```
After training for a couple of epochs, the model achieves 95% accuracy on the holdout dataset. Here are some of the predictions for test images - it works like a charm.
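The training code itself isn't shown in the article; a minimal sketch could look like the following. The optimizer choice is my assumption, and random tensors stand in for MNIST batches so the snippet runs standalone. Note that since the model already ends in a softmax, we take the log of its output and use NLL loss; CrossEntropyLoss would apply softmax a second time.

```python
import torch as pt

# The same CNN as above, repeated here so this sketch is self-contained.
model = pt.nn.Sequential(
    pt.nn.Conv2d(1, 30, 3), pt.nn.MaxPool2d(2), pt.nn.ReLU(),
    pt.nn.Conv2d(30, 30, 3), pt.nn.MaxPool2d(2), pt.nn.ReLU(),
    pt.nn.Conv2d(30, 15, 3), pt.nn.MaxPool2d(2), pt.nn.ReLU(),
    pt.nn.Flatten(),
    pt.nn.Linear(15, 64), pt.nn.ReLU(),
    pt.nn.Linear(64, 10), pt.nn.Softmax(dim=1),
)
optimizer = pt.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step; in practice, loop over MNIST batches
# for a couple of epochs. Random tensors stand in for a real batch here.
X = pt.randn(32, 1, 28, 28)   # stand-in for a batch of 28x28 images
y = pt.randint(0, 10, (32,))  # stand-in for digit labels 0-9
probs = model(X)              # (32, 10) class probabilities
# Model outputs probabilities already, so use NLL loss on their log.
loss = pt.nn.functional.nll_loss(pt.log(probs + 1e-8), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```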
In order to generate an adversarial example, we need to define the optimization problem. Given any input image, our model outputs a probability for each class. Suppose we want it to believe, with high confidence, that the image contains a 7. Then we need to generate an image for which the predicted probability of class 7 is maximized. To solve this with a genetic algorithm, let's recall its vocabulary:
- Gene - a single instance of a problem solution. In our case, a 28x28 matrix that represents an image
- Population - a list of genes
- Fitness function - evaluates how good each candidate is. In our setup it's $P(candidate = 7)$
- Mutation - a function that adds randomness to a candidate solution. In our case, randomly changing pixel intensities
- Crossover - a routine to combine several candidate solutions. In our case, we randomly choose half of the pixels from one image and the remaining half from the other
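As a sketch of the last two operators (helper names are mine, not from the repository), mutation and crossover on 28x28 images could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(img, rate=0.05):
    """Re-draw a random fraction of pixel intensities."""
    mask = rng.random(img.shape) < rate
    return np.where(mask, rng.random(img.shape), img)

def crossover(a, b):
    """Take each pixel from parent a or parent b with equal probability."""
    mask = rng.random(a.shape) < 0.5
    return np.where(mask, a, b)
```

Per-pixel uniform crossover matches the description above; a single split point would also work.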
Once we define this, the algorithm is straightforward:
- Initialize population randomly
- For n epochs:
- Score each candidate using fitness function
- Select top $k$ candidates, the others are discarded with a large probability
- Randomly mutate a fraction of them (except for the best one)
- Fill the remaining places with random crossovers until the population size is restored
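The loop above can be sketched in a couple dozen lines. This is a simplified stand-in, not the repository's actual GeneticSolver; it assumes the fitness function scores a whole population at once, as in the examples below.

```python
import numpy as np

rng = np.random.default_rng(42)

def evolve(fitness, shape=(28, 28), pop_size=100, n_generations=100,
           top_k=10, mutation_rate=0.05):
    """Simplified genetic loop: score, select top-k, crossover, mutate."""
    pop = rng.random((pop_size, *shape))           # random initial population
    for _ in range(n_generations):
        scores = fitness(pop)                      # score each candidate
        elite = pop[np.argsort(scores)[::-1][:top_k]]  # keep the best top_k
        children = []
        while len(children) < pop_size - top_k:
            i, j = rng.integers(0, top_k, size=2)
            mask = rng.random(shape) < 0.5         # uniform crossover
            child = np.where(mask, elite[i], elite[j])
            mut = rng.random(shape) < mutation_rate
            child = np.where(mut, rng.random(shape), child)  # mutation
            children.append(child)
        pop = np.concatenate([elite, np.array(children)])
    return pop[np.argmax(fitness(pop))]            # best candidate found
```

For example, `evolve(lambda P: P.mean(axis=(1, 2)))` would evolve images toward maximum mean brightness.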
I won't go into the implementation details, since you can check the code in the repository. Suffice it to say that we have a GeneticSolver class that takes the image size for initialization. Its solve method is the entry point for the optimization: it takes the fitness function as an argument and also accepts n_generations, the number of generations to run.
We start with a simple objective: maximize the probability of the adversarial target class adv_target. We set adv_target=7 for this example.
```python
def fitness_class_probability(X):
    """Maximize probability of adversarial target class"""
    y = model.predict(pt.Tensor(X).unsqueeze(1)).detach().numpy()
    y_target = y[:, adv_target]
    return y_target
```
The result is pretty impressive: after 1400 epochs, we have persuaded the model that there is a 7 in the image with a probability of 90%. Here is a visualization of the optimization process:
Class probability with non-zero content penalty
Now let's do something more fun. For example, let's not only persuade the model that the noise in the image is some digit, but do so with a minimal number of active pixels. Transforming this into a fitness function:
```python
def fitness_class_probability_empty(X):
    """Maximize probability of adversarial target class,
    penalizing mean pixel intensity"""
    y = model.predict(pt.Tensor(X).unsqueeze(1)).detach().numpy()
    y_target = y[:, adv_target]
    X_mean = X.mean(axis=1).mean(axis=1)
    return y_target - X_mean
```
We set adv_target=5 as our next target label. To obtain a visually good result, I increased the population size from the default 100 to 300. Here is the evolution over 1800 epochs:
Change class with minimum change in content
Next, let's take a sample adv_sample from the holdout dataset and persuade the neural network to assign it the wrong label adv_target, while keeping the image visually close to the original. Formalizing this into a fitness function, we get:
```python
def fitness_similarity(X):
    """Maximize adv_target probability while minimizing
    RMSE distance from adv_sample"""
    y = model.predict(pt.Tensor(X).unsqueeze(1)).detach().numpy()
    y_target = y[:, adv_target]
    rmse = np.sqrt(np.power(X - adv_sample, 2).mean(axis=1).mean(axis=1))
    return y_target - rmse
```
As you can see, one can "fool" the model into pretty much whatever output one wants, given an appropriate function to optimize. There are a couple of reasons for such behavior:
What makes neural networks particularly susceptible to this kind of attack is distributed representation: knowledge extracted from the training dataset is shared across all nodes in a hierarchical manner. Thus, a small change in the input can have an avalanche-like effect on the output.
In the setup I chose, there is no way to incorporate information about how "real" the input looks. One way of making the model more robust is to add a second task of adversarial sample detection. Another approach would be to add a dummy class for invalid input. If changing the architecture is costly, one can add a preprocessing step that uses the training data for anomaly detection, assuming the input distribution is stationary.
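As an illustration of that last idea (my sketch, not code from the article), a simple preprocessing gate could flag inputs whose distance to the training mean image is unusually large before they ever reach the classifier:

```python
import numpy as np

def fit_anomaly_gate(X_train, quantile=0.999):
    """Precompute the mean training image and a distance threshold."""
    mean_img = X_train.mean(axis=0)
    # Frobenius distance of every training image to the mean image
    dists = np.linalg.norm(X_train - mean_img, axis=(1, 2))
    return mean_img, np.quantile(dists, quantile)

def is_suspicious(x, mean_img, threshold):
    """Reject inputs far from the training distribution."""
    return np.linalg.norm(x - mean_img) > threshold
```

A real deployment would use a stronger detector (e.g. an autoencoder's reconstruction error), but even this crude gate would reject the saturated-noise images the genetic algorithm tends to produce.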
It's not hard to argue that even when precautionary measures are taken, there will always be inherent susceptibility to black-box attacks. As in information security, new exploits and countermeasures will be developed in an arms-race fashion.
I have set up a repository for you to reproduce and enhance this example.
There are a couple of papers on this subject that I have found interesting and inspiring: