The CW Attack Algorithm
- Brief introduction
- Original CW attack algorithm
- Constructing the function
- Formulating the loss function
- Solving the last constraint
The CW attack algorithm is a very typical adversarial attack, which utilizes two separate losses:
- An adversarial loss to make the generated image actually adversarial, i.e., is capable of fooling image classifiers.
- An image distance loss to constraint the quality of the adversarial examples so as not to make the perturbation too obvious to the naked eye.
This paradigm makes CW attack and its variants capable of being integrated with many other image quality metrics like the PSNR or the SSIM.
Original CW attack algorithm
When adversarial examples were first discovered in 2013, the optimization problem to craft adversarial examples was formulated as:
- is the input image, is the perturbation, is the dimension of the image and is the target class.
- Function serves as the distance metric between the adversarial and the real image, and function is the classifier function.
Traditionally well known ways to solve this optimization problem is to define an objective function and to perform gradient descent on it, which will eventually guide us to an optimal point in the function. However, the formula above is difficult to solve because is highly non-linear (the classifier is not a straight forward linear function).
In CW, we express Constraint 1 in a different form as an objective function such that when is satisfied, is also satisfied.
Constructing the function
Conceptually, the objective function tells us how close we are getting to being classified as . One simple but not a very good choice for function is:
Where is the probability of being classified as . If the probability is low, then the value of is closer to 1 whereas when it is classified as , is equal to 0. This is how the objective function works, but clearly we can't use this in real world implementations.
In the original paper, seven different objective functions are assessed, and the best among them is given by:
- is the logit (the unnormalized raw probability predictions of the model for each class / a vector of probabilities) when the input is an adversarial .
- is the probability of the target class (which represents how confident the model is on misclassifying the adversarial as the target).
- So, is the difference between what the model thinks the current image most probably is and what we want it to think.
The above term is essentially the difference of two probability values, so when we specify another term and take a max, we are setting a lower limit on the value of loss. Hence, by controlling the parameter we can specify how confident we want our adversarial to be classified as.
Formulating the loss function
We then reformulates the original optimization problem by moving the difficult of the given constraints into the minimization function.
Here we introduce a constant to formulate our final loss function, and by doing so we are left with only one of the prior two constraints. Constant is best found by doing a binary search, where the most often lower bound is and the upper bound is .
The best constant
I personally found that the best constant is often found lying between 1 or 2 through my personal experiments.
Solving the last constraint
After formulating our final loss function, we are presented with this final constraint:
This constraint is expressed in this particular form known as the "box constraint", which means that there is an upper bound and a lower bound set to this constraint. In order to solve this, we will need to apply a method called "change of variable", in which we optimize over instead of the original variable , where is given by:
Where is the hyperbolic tangent function, so when varies from -1 to 1, varies from 0 to 1.
Therefore, our final optimization problem is:
The CW attack is the solution to the optimization problem (optimized over ) given above using Adam optimizer. To avoid gradient descent getting stuck, we use multiple starting point gradient descent in the solver.