An Adversarial (re)Analysis of Zhou/Firestone 2019
This post is a reanalysis of Zhenglong Zhou and Chaz Firestone's paper "Humans can decipher adversarial images," so let's get some links out of the way:
- Paper: Zhou Z., Firestone C. (2019). Humans can decipher adversarial images. Nature Communications, 10, 1334. DOI:10.1038/s41467-019-08931-6
- Preprint
- Data and Code on OSF
Overview
Chaz Firestone came and presented this data at the UO's cognitive neuroscience seminar series this winter, just before the paper came out. The idea is compelling: convolutional neural nets trained to classify images are vulnerable to adversarial attacks, where images can be manipulated or synthesized to trigger a specific categorization.
On their face, these adversarial images highlight the dramatic differences between the human/mammalian visual system and machine vision: the types of things that fool us are very different from tactically adding static throughout an image. However, the authors argue, if there were any overlap between the types of adversarial image manipulations that fool us and those that fool CNNs, it would point to a possible mechanistic overlap.
They use two types of adversarial images:
Fooling images are otherwise meaningless patterns that are classified as familiar objects by a machine-vision system.
ya fooled me doc
and
Perturbed images are images that would normally be classified accurately and straightforwardly […] but that are perturbed only slightly to produce a completely different classification by the machine
youch wat trickery!
They present 8 experiments with 5 image sets that mostly ask human subjects to guess what a computer would classify the images as (what they call "machine theory of mind"), although the degree to which that is different from just having people classify images is ambiguous in their data.
Clarifying Hypotheses
There seems to be an unusually large air gap between the data and their interpretation here, which I think is worth handling carefully. The ultimate motivation is to detect some similarity between CNNs and the mammalian/human visual system: since CNNs are vulnerable to adversarially manipulated images, if there is mechanistic overlap, humans should be vulnerable too. The authors aim to fill the empirical gap where, according to them, no one has actually tested whether humans misclassify these images:
A primary factor that makes adversarial images so intriguing is the intuitive assumption that a human would not classify the image as the machine does. (Indeed, this is part of what makes an image "adversarial" in the first place, though that definition is not yet fully settled.) However, surprisingly little work has actively explored this assumption by testing human performance on such images, even though it is often asserted that adversarial images are "totally unrecognizable to human eyes"
There are a number of very similar hypotheses and possible results here, so we should delineate between them.
The most straightforward test of a hypothesis would be:
Base image, well classified by humans and CNNs -> Perturbed image, CNNs consistently misclassify -> If humans misclassify in the same way and at the same rate, implied mechanistic similarity.
A critical component of this is that the image is misclassified according to humans, or classified in a way that is not "what it looks like." There is a tautology that makes this experiment impossible (as the authors note): if the adversarially manipulated image didn't still "look like" the base image, it wouldn't be an adversarially useful image. A less strong test of the hypothesis would be to relax the requirement that there be some well-classified base image:
Generated or perturbed image, not obviously classified by humans, CNNs consistently classify -> when forced, if humans classify in the same way and at the same rate, implied mechanistic similarity. (similar to experiments 3a and 3b)
In this case, the images cannot obviously resemble the classes they are assigned by the CNN, as that would just mean the CNNs correctly learned some abstract representation of the way images look to humans. Such a result is not uninteresting; it is just the same as finding that CNNs can classify images, and we know that already. To the degree that an image resembles the class the CNN assigned it, that image is not suitable for testing this hypothesis. Another subtlety here is that the humans should have to classify in the same way as the CNNs, i.e., choose from a list of all possible categories. Giving additional structure to the humans would require giving the same structure to the CNNs to make the results comparable.
There are several implicit hypotheses tested in this paper that are essentially unrelated to the central question of machine/human overlap.
Humans are told images were misclassified and choose the misclassification from an array of all possible image classes. If the misclassification is correctly identified, humans can recognize the visual features that drive misclassifications in CNNs. (experiment 5)
This is a distinct hypothesis from "humans have the same visual processing as CNNs": since the human subjects are told there is a misclassification, they are looking not for what they think the image actually is, but for what would have driven the mistake. The interpretation should be that humans are capable of inferring what makes a machine misclassify an image, not that we process images similarly.
Humans are given an image class and examples; if they choose the image that was categorized as that class, they can recognize some element of the image class in the adversarial image. (experiment 4)
This is another separate question: in this case the subjects are asked to recognize some feature from the example images in the perturbed images. Their being able to see those features does not indicate that they process the images in the same way as the CNN, only that, among an array of imperfect candidates, they can pick the image most similar to the examples.
The optimal outcome for all of these experiments is:
- The subjects all categorize with high accuracy - the subjects should all have the same performance as the machine to affirm the hypothesis.
- The images are all categorized with equal accuracy - since the question is about human/machine agreement in general, that overlap should be true of all images. Having images with wildly different accuracy rates is useful for assessing the visual features that drive human/machine agreement, but for the same reason it points to specific qualities that make some images more or less accurately categorized, rather than to a general human/machine overlap. Remember: the machines classify all of these images incorrectly with a high degree of confidence, so humans should too.
Reanalysis Details
Aside from the structure of the hypotheses, I had questions about the data analysis itself. During his presentation I was confused about why the data was reported as it was: the main results they report are the % of subjects whose categorization agreed with the machine and the % of images where the majority categorization agreed with the machine. It seemed like that analysis would obscure the actual rates of categorization, i.e., the actual rate of "correct" responses grouped by subject and image. The percentages of subject and image agreement were also counted just by their mean categorization being above chance, rather than being statistically distinguishable from chance (i.e., confidence intervals excluding the chance threshold). I will also report those as "corrected accuracies" using Pearson-Klopper (exact) binomial 95% confidence intervals.
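To make the distinction concrete, here is the criterion in miniature, a sketch with made-up numbers rather than anything from the actual data: a group's mean agreement can sit above chance while its exact binomial interval still straddles chance, in which case it does not count as above chance in the corrected tally.

```r
library(binom)

# Illustrative numbers only: 39 responses agreeing with the machine
# out of 66 two-alternative trials (chance = 0.5)
ci <- binom.confint(x = 39, n = 66, conf.level = 0.95, methods = "exact")

ci$mean  > 0.5  # the paper's criterion: mean agreement above chance
ci$lower > 0.5  # the corrected criterion: the entire interval above chance
```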
Because I think this is such a potentially cool line of research, I thought I would do the reanalysis myself. Thankfully, the authors released their (very clean!) data. I think the results are quite a bit more subtle than initially reported.
Code Boilerplate
We'll use the following libraries:
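The original listing isn't reproduced here, so this is a minimal sketch of what the boilerplate presumably needs: the tidyverse for wrangling and plotting, and the binom package, which provides the Pearson-Klopper exact intervals used below.

```r
# Data wrangling and plotting
library(tidyverse)

# Exact (Pearson-Klopper / Clopper-Pearson) binomial confidence intervals
library(binom)
```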
I've put the data in a directory in my website structure, "/assets/data/adv". We'll load them all into variable names `expt_1`, `expt_2`, etc. and do some cleanup.
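The exact filenames aren't shown in this writeup, so the following loading step is only a sketch; the per-experiment file names (`expt_1.csv`, etc.) are hypothetical placeholders for the actual files in the OSF repository.

```r
data_dir <- "/assets/data/adv"

# Hypothetical filenames: substitute the actual files from the OSF repository
expt_1 <- read_csv(file.path(data_dir, "expt_1.csv"))
expt_2 <- read_csv(file.path(data_dir, "expt_2.csv"))
# ...and so on through the remaining experiments, followed by any renaming
# or recoding needed to give them a common structure.
```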
Summarizing Functions
Since so much of the data has the same structure, we'll write functions to summarize the image responses by image and subject. They'll return:

- `n_trials` - the number of trials per group
- `n_correct` - the number of "correct" trials, i.e. those matching the categorization of the CNN
- `meancx` - the proportion of correct answers per group
- `cilo`, `cihi` - the 95% confidence interval around the mean correct
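A sketch of what such a function might look like, built around `binom::binom.confint`; the grouping columns and the 0/1 `correct` column are assumptions about the released data, not the authors' actual code.

```r
# Summarize agreement with the CNN per group, where `group` is either
# the image or the subject column of the experiment's data frame.
summarize_by <- function(df, group, chance = 0.5) {
  df %>%
    group_by({{ group }}) %>%
    summarize(
      n_trials  = n(),
      n_correct = sum(correct),        # `correct` assumed to be coded 0/1
      meancx    = n_correct / n_trials,
      .groups   = "drop"
    ) %>%
    mutate(
      # Pearson-Klopper (exact) 95% confidence interval on the proportion
      cilo = binom.confint(n_correct, n_trials, methods = "exact")$lower,
      cihi = binom.confint(n_correct, n_trials, methods = "exact")$upper,
      # corrected above-chance criterion: the whole interval sits above chance
      above_chance = cilo > chance
    )
}

# e.g. by_image   <- summarize_by(expt_1, image_num)
#      by_subject <- summarize_by(expt_1, subject)
```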
Experiment 1
The first experiment uses images from Nguyen A, et al. 2014, which were generated using a "compositional pattern-producing network" that
can produce images that both humans and DNNs can recognize.
Importantly though,
These images were produced on PicBreeder.org, a site where users serve as the fitness function in an evolutionary algorithm by selecting images they like, which become the parents of the next generation.
So using these images may make the results particularly difficult to interpret, as it's not clear how aesthetic preference interacts with the preference for recognizable objects. It could be that, during the image-generation process, people pick images to preserve precisely because they look like real objects, so they aren't "adversarial" images, strictly speaking. Indeed, the authors of the image-generation paper note
the generated images do often contain some features of the target class
so a human classifying an image as the same class as a machine might be unsurprising for these images. Since some of the images do indeed resemble the "target" classes, those images are unsuitable for assessing the degree to which the human visual system makes the same "errors" as machine vision.
The subjects in this task saw one of 48 "fooling images," and were presented with the "correct" label and a randomly selected label from the other 47 images. The primary result they report for this experiment is that
98% of observers chose the machine's label at above-chance rates. […] Additionally, 94% of the images showed above-chance human-machine agreement
Reanalyzing by image and subject, however…
So far so good, although if we use the binomial confidence intervals rather than just the mean response rate (what I'll call corrected accuracies), we get a more valid description of above-chance accuracy.
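Something like the following, a sketch built on the summarizing function above (chance is 0.5 here because each trial offered two labels):

```r
# Proportion of images / subjects whose exact 95% CI lies entirely above chance
by_image_1   <- summarize_by(expt_1, image_num, chance = 0.5)
by_subject_1 <- summarize_by(expt_1, subject,   chance = 0.5)

mean(by_image_1$above_chance)    # corrected by-image rate
mean(by_subject_1$above_chance)  # corrected by-subject rate
```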
Only 85.4% of images were categorized above chance, and only 81.2% of subjects categorized above chance, as opposed to the reported 94% and 98%, respectively.
Experiment 2 - 1st vs 2nd best labels
Of course, not all foil labels are created equal, so a more conservative test for human/machine overlap is to compare the highest and second highest labels predicted by the machine.
This looks much worse, and the corrected accuracies reflect that
Only 54.2% of images and 31.3% of subjects were above chance, as opposed to the reported 71% and 91%, respectively.
Collapsing across all images and subjects, only 60.6% of responses agreed with the top category of the CNN.
We can see the accuracy-inflating strength of having bad foils by comparing experiment 1 vs 2. Images whose classifications remained high in experiment 2 are robust to their next-best label, while those that are significantly worse in experiment 2 are vulnerable.
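One way to see this in the data, again only a sketch assuming the by-image summaries above and a shared image identifier column, is to join the per-image accuracies from the two experiments and look at the drop:

```r
# Per-image accuracy with a random foil (expt 1) vs. the 2nd-best label (expt 2);
# `image_num` is an assumed shared identifier column.
foil_comparison <- summarize_by(expt_1, image_num) %>%
  select(image_num, meancx_random = meancx) %>%
  left_join(
    summarize_by(expt_2, image_num) %>%
      select(image_num, meancx_second = meancx),
    by = "image_num"
  ) %>%
  mutate(drop = meancx_random - meancx_second) %>%
  arrange(desc(drop))  # images at the top depended most on a weak foil
```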
Experiments 3a and 3b
Experiments 3a and 3b presented all possible labels instead of two. 3a was the "machine theory of mind" task, and 3b asked subjects to rate what they thought the images were. First, the overall summaries of 3a:
Again, only 60.4% of images and 47.4% of subjects were above the chance accuracy of 1/48, as opposed to the reported 79% and 88%, respectively. Experiment 3b has qualitatively the same results, but their interpretation doesn't necessarily follow from the data:
These results suggest that the humans' ability to decipher adversarial images doesn't depend on the peculiarities of our machine-theory-of-mind task, and that human performance reflects a more general agreement with machine (mis)classification.
There are actually two degenerate interpretations here: either human performance is the same as machine performance, or the subjects were just rating what they thought the images were in all tasks. No further differentiating experiments were done to tease these interpretations apart, so this point is a wash.
Further, if one looks at the most accurately categorized images…
Chainlink fence
Spotlight
Monarch Butterfly
⌠we can easily see why they were. Remember the argument here is that these are supposedly adversarial images that fool a classifier. A finding that humans and image classification algorithms similarly categorize things that really do look like those categories is unremarkable.
Experiment 4 - Static images
Experiment 4 uses "static" images (from the same source paper), but also changes the task in a meaningful way. Rather than asking what category an image was, the subject is presented with the category and a set of representative images and asked "which image has this category?"
This experiment only has 8 images in the set of static images, and each is presented in every trial. The authors note that
upon very close inspection, you may notice a small, often central, "object" within each image.
and they are actually quite pronounced. Even if the central "objects" don't look recognizably like the categorized object, they are distinguishable enough that subjects should be able to recognize them between trials. Since the subjects are asked to choose one category for each of the images, it seems possible for them to use that information to exclude images from later trials. In other words, the trials are not independent. This is reflected in the positive slope of accuracy over trial number.
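A quick way to check that claim, sketched here with assumed column names (`trial_num` for trial order, `correct` coded 0/1), is to regress correctness on trial number and look at the sign of the slope:

```r
# Logistic regression of agreement on trial order; a positive, reliable
# coefficient on trial_num is consistent with learning across trials,
# i.e. non-independent trials. Column names are assumptions.
trial_trend <- glm(correct ~ trial_num, data = expt_4, family = binomial)
summary(trial_trend)
```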
Again, the corrected accuracies are much lower than they report: accounting for uncertainty, only 75% of images and 8.4% of subjects had accuracies above chance, rather than the reported 100% and 81%, respectively. This is exceptionally troubling for their interpretation of their results, as it is subject accuracy that matters, not image accuracy.
Experiment 5 - Digit classification
In this experiment, perturbed MNIST digits are given, and the subjects are told they were miscategorized; i.e., they must choose the mistaken digit, NOT the one it looks like.
As a first pass, things look ok…
But something is off with the confusion matrix
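Here is a sketch of how one might tabulate it, assuming columns named `target` (the digit the machine chose) and `response` (the digit the subject chose); neither name is from the authors' code.

```r
# Rows: the machine's target digit; columns: the subjects' responses,
# as proportions of responses within each target digit.
confusion <- expt_5 %>%
  count(target, response) %>%
  group_by(target) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# A quick heatmap of the confusion matrix
ggplot(confusion, aes(x = response, y = target, fill = prop)) +
  geom_tile()
```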
It looks like everyone just said everything was an 8. In the plot below, the rate of "8" responses is colored in red.
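The per-subject rate of "8" responses behind a plot like that can be computed along these lines (again assuming a numeric `response` column and a `subject` identifier):

```r
# Proportion of trials on which each subject answered "8", regardless of target
rate_of_8 <- expt_5 %>%
  group_by(subject) %>%
  summarize(p_eight = mean(response == 8))
```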
The interpretation of this experiment given in the paper is straightforwardly inaccurate. Most subjects did not agree with the machine classification; they just classified everything as an 8. The "above chance accuracy" of the target labels was only due to the very low rates of other digit responses.
This is reflected in the subject accuracy, where only 15.1% of subjects had accuracies significantly better than chance.
Experiment 6
These images have surprisingly high accuracy! At last! This one seems solid.
However, the image perturbation introduces into the images small patches that look exactly like the target category. Some examples:
The "rock beauty" fish with the highest accuracy:
and the milk jug:
This is especially problematic since the task was to choose one of two labels: as we saw in comparing experiments 1 and 2, even when the primary label isn't immediately obvious, the categorization becomes trivial if the foil label is significantly worse.
Experiment 7
Again, only 44.3% of images and 16.3% of subjects had accuracy significantly above chance, as opposed to the reported 78% of images and 83% of subjects. Overall, across all images and subjects, the total accuracy was 58.9%.
The image synthesis technique is tuned to minimize perceptual perturbations, but does impart a recognizable texture to the objects in the image. This was especially problematic in examples where the original image and the target class were semantically related, or had a similar texture, for example:
dog_191, whose adversarial target was "airedale"
In others, the texture was so obvious that the perturbation was no longer visually undetectable.
Dog 63, whose adversarial target was "indian cobra."
Overall summary

Across the experiments, the corrected by-image and by-subject accuracies are consistently and substantially lower than the headline percentages reported in the paper. Where above-chance human/machine agreement does appear, it is plausibly attributable to weak foil labels, adversarial images that genuinely resemble their target classes, non-independent trials, or degenerate response strategies like answering "8" for everything, rather than to a general mechanistic overlap between human vision and CNNs.