Peer review is at the heart of the scientific process. As I have written about before, scientific results are deemed publishable by top journals and conferences only once they are given a stamp of approval by a panel of expert reviewers (“peers”). These reviewers act as a critical quality control, rejecting bogus or uninteresting results.
But peer review involves human judgment and as such it is subject to bias. One source of bias is a scientific paper’s authorship: reviewers may judge the work of unknown or minority authors more negatively, or judge the work of famous authors more positively, independent of the merits of the work itself.
The double-blind review process aims to mitigate authorship bias by withholding the identity of authors from reviewers. Unfortunately, simply removing author names from the paper (along with other straightforward prescriptions) may not be enough to prevent the reviewers from guessing who the authors are. If reviewers often guess and are correct, the benefits of blinding may not be worth the costs.
While I am a believer in double-blind reviewing, I have often wondered about its efficacy. So as part of the review process of CSF’16, I carried out an experiment:[ref]The structure of this experiment was inspired by the process Emery Berger put in place for PLDI’16, following a suggestion by Kathryn McKinley.[/ref] I asked reviewers to indicate, after reviewing a paper, whether they had a good guess about the authors of the paper, and if so to name the author(s). This post presents the results. In sum, reviewers often (2/3 of the time) had no good guess about authorship, but when they did, they were often correct (4/5 of the time). I think these results support using a double-blind process, as I discuss at the end.
The peer review process, reviewed
Here is a quick refresher on the peer review process.
Some scientists carry out some research on a particular topic and author a paper explaining their results. This paper is submitted to a venue that publishes scientific results, which in turn solicits the opinions of several experts (i.e., “peers” of the submitting scientists). After reading the paper, these experts render a judgment about whether the paper should be accepted for publication. This judgment is based on whether, in their view, the result is sufficiently important or thought-provoking, whether it is correct, and whether it is presented well enough to be understood by the community that the publication venue caters to. Whether or not the paper is accepted, reviewer comments are sent back to the authors anonymously so that the authors can improve the paper and the work. Reviewer identities are kept hidden so that reviewers can provide honest judgments without fear of retribution.
Single blind review
While reviewers are usually anonymous to authors, authors are often not anonymous to reviewers. In a single-blind review (SBR) process, authors’ identities may appear in the paper text itself and in metadata associated with the paper. The disadvantage with SBR is that despite reviewers’ best intentions, implicit (unconscious) bias can creep into their judgment. As such, the result of review may favor various groups, e.g., men over women, famous authors over unknown ones, authors at famous institutions over those at “lesser” ones, etc. Such biases have been observed in many contexts outside of peer review, and psychological studies have shown that human beings exhibit implicit bias systematically; check out the implicit association test to see the effect in yourself.
Double blind review
In a double-blind review (DBR) process, author identities are withheld from reviewers. Typically, authors’ names and affiliations are redacted from the paper’s text, and are hidden by the review management software. The paper’s text should also not reveal the authors’ identities indirectly; e.g., authors may be required to cite their own prior work in the third person (as though it could have been done by someone else), and/or they may be restricted from broadly advertising their work while it is under review. The intention is that reviewers should consider the merits of a paper based purely on its content, and not on preconceptions about the paper’s authors.
Author blinding, revealed
An important assumption of DBR is that the steps taken to blind the paper, like removing the authors’ names, actually succeed in masking authorship from the reviewers. If a reviewer can infer the paper’s authors in spite of these steps then one might wonder whether bias will creep back in. To test the effectiveness of blinding, I conducted an experiment to measure how often reviewers could guess author identities.
Experiment
The experiment was carried out as part of the review process of the 2016 Computer Security Foundations Symposium (CSF), of which I was the program co-Chair, along with Boris Köpf. CSF’16 employed a light form of double-blind review. Authors were asked to redact their names from the paper and to cite their own prior work in the third person. They were not required to change the names of well-known research systems they might have been writing about: while renaming a system might better conceal authorship, it could create confusion about related work. Authors were also permitted to post their work to their web pages, give talks about it, etc., as part of the normal scientific process.
For each paper reviewed, a reviewer fills out a form describing their judgment of the paper. For the experiment, I extended the form to ask, first of all, if the reviewer had a guess about one or more authors of the paper they had just reviewed. If so, the form asked them to list the apparent authors. They could also optionally describe the basis of their guess.
Results
For the 87 papers submitted, the program committee (and a handful of outsiders) performed 270 reviews. In 90 out of 270 cases, a reviewer had a guess about the paper’s authors. 74 times out of 90, the reviewer guessed at least one author correctly. In the remaining 16 cases, all guesses of authorship were incorrect.
In sum, most (67%) of the time, reviewers were not sufficiently confident about authorship to offer a guess. In these cases, blinding helped avoid bias based on casual knowledge of the authors, their institution, their gender or nationality, etc. In the cases where reviewers did have a guess, they were right 82% of the time. But every once in a while (roughly 1 time in 5) they were wrong.
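For readers who want to check the arithmetic, here is a minimal Python sketch that re-derives these percentages from the counts reported above (the counts come from the review data; the variable names are just illustrative):

reviews = 270   # total reviews across all submitted papers
guesses = 90    # reviews in which the reviewer offered a guess at authorship
correct = 74    # guesses that named at least one author correctly

no_guess_rate = (reviews - guesses) / reviews   # ~0.67: no guess offered
accuracy = correct / guesses                    # ~0.82: guesses that were right
error_rate = (guesses - correct) / guesses      # ~0.18: roughly 1 in 5 wrong

print(f"no guess: {no_guess_rate:.0%}, right when guessing: {accuracy:.0%}, "
      f"wrong when guessing: {error_rate:.0%}")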
Some other studies also consider blinding efficacy; Snodgrass summarizes some of these. I previously conducted a survey of POPL’12 reviewers, asking them to recall whether they had guessed (correctly or incorrectly) a submitted paper’s authorship. In that survey, 77% of those who guessed did so correctly. A flaw with this result was that it was based on recollection well after the fact. The experiment I report here ought to be more reliable, since reviewers made their guess at the time they wrote their review.
Source of unmasking
Returning to the CSF’16 experiment, I asked reviewers to optionally indicate the reasons for their guess. Very often, reviewers stated that citations in the paper were a strong indicator. In particular, they assumed that the most closely related prior work was by the same authors. Many times they were right, but sometimes they were wrong. In one case, two different reviewers incorrectly guessed the authors to be those of the closest prior work. Another common basis for a guess was that a reviewer had seen an unblinded, prior version of the same paper.
Guesses and expertise
I also looked into how expertise correlates with guessing, and with guessing correctly. In particular, reviewers are asked to state, on the review form, their level of expertise in the subject area of the paper; the options are ‘X’ for expert, ‘Y’ for knowledgeable, and ‘Z’ for interested outsider. I found the expertise breakdown for the 90 guesses to be X=43, Y=35, and Z=12, and for the 180 non-guesses to be X=74, Y=75, and Z=31. Just eyeballing these numbers, it does seem that those with higher expertise are a bit more likely to guess authorship.
For the 16 guesses that were incorrect, the breakdown was X=6, Y=8, and Z=2. Among those who guessed correctly, about half were experts and the other half were knowledgeable or outsiders; among those who guessed incorrectly, fewer were experts (6 experts vs. 10 non-experts). So perhaps higher expertise correlates with the likelihood of guessing, and of guessing correctly, but not by very much.
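To make that comparison concrete, here is a small sketch in the same spirit as the one above, computing the guess rate and accuracy per expertise level from these breakdowns (the counts are from the post; the code is only illustrative):

guessed = {"X": 43, "Y": 35, "Z": 12}      # reviews with a guess, by expertise
not_guessed = {"X": 74, "Y": 75, "Z": 31}  # reviews without a guess
wrong = {"X": 6, "Y": 8, "Z": 2}           # guesses that were incorrect

for level in "XYZ":
    total = guessed[level] + not_guessed[level]
    guess_rate = guessed[level] / total
    accuracy = (guessed[level] - wrong[level]) / guessed[level]
    print(f"{level}: guessed on {guess_rate:.0%} of reviews, "
          f"right in {accuracy:.0%} of those guesses")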
Guesses and acceptance
One interesting question, originally raised in the comments below, is how guessing relates to decisions about acceptance. In particular, we might wonder whether accepted papers are more likely to be written by authors whose identities are readily guessed.
Of the 31 accepted papers, 25 had a reviewer who guessed the authors correctly, while for 5 no guesses were offered. In 6 cases, accepted papers had at least one incorrect guess, though in all but one of these there was also a correct guess. Considering individual reviews: of the 100 reviews done for the 31 accepted papers, 39 reviewers guessed right, 7 guessed wrong, and 54 had no guess.
Of the 56 rejected papers, 22 had a reviewer who guessed the authors correctly, while for 28 no guesses were offered. In 7 cases, rejected papers had at least one incorrect guess, and in 6 of these no correct guesses were offered. Considering individual reviews: of the 170 reviews done for the 56 rejected papers, 35 reviewers guessed right, 9 guessed wrong, and 126 had no guess.
Looking at these numbers, we can see that reviewers of accepted papers were more likely to offer a guess (46/100 reviews vs. 44/170 reviews), and nearly all accepted papers had at least one of their authors guessed correctly (25/31 papers, as compared to 22/56 for rejected papers). Reviewers who guessed wrong did so a bit more often for rejected papers (7/46 guesses for accepted ones vs. 9/44 for rejected ones). Also, accepted papers were more likely to have multiple reviewers correctly guess authorship.
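As a sanity check, here is one more small sketch computing those per-review rates from the counts just given (again, the counts are from the post; the code is mine):

accepted = {"right": 39, "wrong": 7, "none": 54}    # reviews of accepted papers
rejected = {"right": 35, "wrong": 9, "none": 126}   # reviews of rejected papers

for label, counts in (("accepted", accepted), ("rejected", rejected)):
    total = sum(counts.values())
    offered = counts["right"] + counts["wrong"]
    print(f"{label}: {offered}/{total} reviews offered a guess ({offered/total:.0%}); "
          f"{counts['wrong']}/{offered} of those guesses were wrong")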
Discussion
What should we take from these results?
Those against DBR might suggest that the cost of DBR to reviewers and authors is not worth the benefit. They might say that reviewers, when they guess, are very often right. The rest of the time, reviewers did not know the work well enough to guess its authors, meaning that, had they known the authors, that knowledge may not have influenced their judgment anyway. But I think it’s hard to justify the latter statement without more evidence.
Indeed, those in favor of DBR might point out that very often (67% of the time) reviewers could not (or did not) guess authorship, meaning that authorship could not be a source of bias. Even those reviewers who did guess authorship did so incorrectly 1 time in 5, on average. For reviewers who regularly participate in a double-blind process, knowing that they are not always right may sow sufficient doubt that any guesses they have do not rise to the level of biasing judgment.
I find these arguments in favor of DBR convincing. Though DBR is far from perfect, it creates an expectation that authorship is not a factor in review, and it meets that expectation often enough. Moreover, the light form of DBR I used at CSF and POPL is not particularly costly, and both reviewers and authors seemed to feel it worked, as detailed in my Chair report for POPL’12.
That said, the analysis also showed that the paper of a guessed author was more likely to be accepted than a paper whose authorship was not guessed. Perhaps guessing (correctly) materially affected the final judgment? Or, perhaps being known within the community correlates with paper quality; after all, a history of publishing within a community should say something about the quality of the submitted work (we just don’t want that history to be a proxy for direct assessment).
Ideally we could go beyond studying the process and instead measure outcomes, e.g., whether DBR yields more accepted papers from women, minorities, and others who might otherwise be discriminated against. Unfortunately, it is very hard for me to see how to measure this effect directly, in a controlled manner. Until we can, we will have to do our best to strive for quality, fairness, and low cost.
Update, 11am EDT, June 28: In response to a comment below, I updated the post to discuss the correlation between paper acceptance/rejection and guessing authorship.
Thanks for doing this, it is good to have numbers.
Two questions: What is the correlation between papers for which the PC members did guess the authors and acceptance? (Is it the case that PC members guess the authors of most of the “good” papers?) How many of the papers that were not guessed correctly were from newcomers? (Guesses are based on track record; authors without a track record in the field are harder to identify.)
Good question! I have updated the post to answer at least some of these questions. For the question I didn’t answer: of the 5 papers accepted without a guess, all were by existing members of the community (probably just working in new areas).
I would suggest investigating the unconscious bias in favor of DBR by its zealous advocates, who have never shown a problem to exist and who can therefore never show that they are solving one, and who are determined to interpret all data as being in favor of DBR, no matter what.
I think there is plenty of evidence that shows human beings are, in general, subject to unconscious bias. See the work of Daniel Kahneman, for example (“Thinking, Fast and Slow”). More evidence has been gathered in the FAQ I have put together for CSF at http://csf2016.tecnico.ulisboa.pt/FAQ.html
That said, I think you may be right that the effects of bias in reviewing have not been shown directly and conclusively, at least not in CS communities I know well.
Given this, we can either say “there is no problem” or we can say “the evidence suggests there may be a problem that’s hard to see directly.” If the latter, then we must ask whether the costs of acting to address any problem are too high compared to the possible benefits. My opinion is that for the light form of DBR the costs (to authors and reviewers) are low, with the benefits of an “unbiased first look,” which Kahneman’s work seems to show is useful, and of good perception: we are trying to encourage underrepresented groups. Therefore, overall, it is worth doing.
“Blinding of reviews is another fertile area of study. In 1998, my colleagues and I conducted a five-journal trial of double-blind peer review (neither author nor reviewer knows the identity of the other). We found no difference in the quality of reviews. What’s more, attempts to mask authors’ identities were often ineffective and imposed a considerable bureaucratic burden. We concluded that the only potential benefit to a (largely unsuccessful) policy of masking is the appearance, not the reality, of fairness. Since then, online technologies for blinding have increased, as have numbers of scientists (and thus the difficulty of guessing who authors may be). It will be interesting to see how similar studies work out now, and whether double-blind reviewing affects acceptance rates for women and under-represented minorities.”
http://www.nature.com/news/let-s-make-peer-review-scientific-1.20194?WT.mc_id=TWT_NatureNews
(Forwarded by Alex Potanin)
Here are the basic stats from PLDI 2016, which had “strong” double-blind – authorship was only revealed for accepted papers after all decisions had been concluded.
Each reviewer (who never directly learned the identity of any authors until the process was complete) was asked three questions (whose answers only I could see): (1) Do you think you know one or more authors of this submission? (2) If your answer was Yes, please enter the authors you think wrote this paper, one per line. (3) Did the authors’ use of citations reveal their identities?
There were 1177 reviews with 915 responses. Of these, only 32% said yes to question (1). I have processed the data with scripts created by Mike to get some stats and hope to release these soon. In any event, it’s clear that most papers remained anonymous to most reviewers during the review process, despite past claims by some that reviewers invariably know who wrote the papers they are reviewing.
Having run this process, I personally find the notion that double-blind reviewing adds significant (or even modest) burden to be without foundation.
In addition, respondents to a PLDI 2016 survey preferred or strongly preferred this form of double-blind (67%). Only 34% preferred or strongly preferred “light” double-blind, and just 14% preferred single-blind reviewing.