Consider this claim:
Quality is more important than quantity
I expect few people would disagree with it, and yet we do not always act as if it were true. In academia, when considering candidates to hire or promote, we count their papers, their citations, their funding, their software download rates, their graduated students, the number of their committee memberships or journal editorships, and more.
Researchers are getting the message: quantity matters. Ugo Bardi identifies the economic underpinnings of this apparent trend, cleverly arguing that scientific papers are currency, subject to phenomena like inflation (more papers!), assaying (peer review validates papers, which support funding proposals, which finance more papers), and counterfeiting (papers published without review by unscrupulous publishers). Moshe Vardi, in a recent blog post, concurs that “we have slid down the slippery path of using quantity as a proxy for quality” and that “the inflationary pressure to publish more and more encourages speed and brevity, rather than careful scholarship.”[ref]Update 8/21/2016: As more evidence of the problem, here’s a great retrospective from the editor of a top journal in sociology that points to quantity greatly devaluing quality.[/ref]
In this post I consider the problem of incentivizing, and assessing, research quality, starting with a recent set of guidelines put out by the CRA. I conclude with a set of questions—I hope you will share your opinion.
Encouraging quality by reducing quantity
To help address this problem, Batya Friedman and Fred B. Schneider recently published a CRA report, Incentivizing Quality and Impact: Evaluating Scholarship in Hiring, Tenure, and Promotion. It was developed over an 18-month period by a diverse Committee on Best Practices for Hiring, Promotion, and Scholarship, which Friedman and Schneider co-chaired.
They make a simple suggestion: Take quantity out of the evaluation process by asking candidates to put forward a handful of papers for careful consideration, rather than the entirety of their CVs. In particular,
- For hiring, identify 1-2 pubs “read by hiring committees in determining whom to invite to campus for an interview and, ultimately, whom to hire.”
- For promotion, identify 3-5 pubs; “Tenure and promotion committees should invite external reviewers to comment on impact, depth, and scholarship of these publications or artifacts.”
The goal is to “focus on producing high quality research, with quantity being a secondary consideration.” Viewing papers as currency, this recommendation aims to combat inflation by fixing prices. I like the idea. Importantly, it implies that we can determine “quality” by a careful, direct examination of a handful of papers (assessing “impact, depth, and scholarship”), rather than an at-a-distance examination of many papers. I think that doing so can be challenging.
Defining quality
What features does a high-quality research paper have? I can think of several:
- Problem being addressed is important
- Approach/description is elegant and/or insightful
- Clever/novel techniques are employed
- Results are impressive
- Methods are convincing; e.g., robust experimental evaluation, proof of correctness (perhaps mechanically verified), etc.
Program committees are often instructed to look for these features when deciding whether to accept a paper. Of course, not all must be present. For example, one of the tenets of “basic” research is that some problems may have no clear application as yet, so we must judge papers addressing them using the other features.
Impact
The features above elaborate on the CRA’s criteria of “depth” and “scholarship”; the third thing they mention is “impact.” In a sense, my list of features comprises intrinsic judgments of a paper, whereas impact is an extrinsic measure of how things played out. A commonly used judgment of impact is citations: if a paper is cited a lot, it has evidently impacted the research world. We might assume that it did so because it exhibited some or all of the intrinsic measures above. Another measure is adoption; e.g., if the results of a paper are incorporated into a major industrial effort or product, then that would show that the problem was, indeed, important, and the results were convincing. There are many others.
Assessing impact (or its potential) is desirable because intrinsic features are telling, but not necessarily discerning. We can imagine that many people will write intrinsically good papers, but few of these will be significantly “impactful.” We would like to hire/promote those researchers with a penchant for impact.
But assessing impact is also difficult. For one, it takes time—perhaps a long time. A paper I co-authored on the safe manual memory management scheme in the Cyclone programming language appeared in 2005, but the emergence of Mozilla’s Rust programming language, which incorporates many of the ideas in that paper, didn’t occur until recently, well after my tenure case. We can also imagine that impact changes with time; e.g., ideas once viewed as groundbreaking can lose their luster (one example that comes to mind is software transactions for concurrency control).
Less qualitatively, citation counts are not always a good proxy or predictor of impact. My Cyclone paper was not cited much when I went up for tenure, and still hasn’t been. (Fortunately, my tenure case did not rest on this one paper!) As a more high-profile example, Sumit Gulwani’s FlashFill work was incorporated into Excel — this is a major impact. While the paper is cited a respectable 145 times in 5 years (as of 4 Nov 2015), it is certainly not among the most cited of that timeframe (e.g., consider the seL4 paper).
Improving measures
Nevertheless, it is hard to get away entirely from measures like citation counts. While citation counts can mislead, we know from the literature on behavioral economics that qualitative human judgment is not completely trustworthy either. The book Thinking, Fast and Slow, by Nobel laureate Daniel Kahneman, is particularly eye-opening. Among many other interesting results, the book shows that people can be powerfully persuaded at the start of an evaluation process by superfluous details (which is a reason I support light double-blind review); that they will carry over a positive impression about one aspect of a person, based on good evidence, to another aspect of that person for which they have no evidence; that being heavily invested in a project or movement clouds judgment about that project’s prospects; and more. Therefore, quantitative measures offer an important data point, even if that point cannot be completely relied upon.
Improving these measures, while not neglecting direct, thoughtful assessment (as per the CRA’s recommendations), seems useful. For example, rather than treating all citations as equal, we could look at the citing text to understand whether references are positive or negative and what their purpose is, e.g., as a comparison to related work or as a reference to a prior technique being employed. We can thus construct a more nuanced picture. We could even use NLP techniques to automate the process.[ref]More radically, we might wonder whether NLP techniques could examine the text of a whole paper and then try to predict impact, down the line, assuming you choose an impact measure like citation counts. My colleague, Hal Daumé III, pointed me to a paper that does this, but adds “I guess I’m slightly dubious of automating fine-grained predictions because I think we as humans are pretty bad at making them (eg when deciding what papers to accept/reject). I totally think we can do coarse-grained automatically (at a sub-sub-sub-field level) but individual paper is hard.”[/ref]
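To make the automation idea a bit more concrete, here is a minimal sketch, in Python, of what a citation-context classifier might look like. The category names and cue phrases are purely illustrative assumptions on my part (a real system would use trained NLP models over the full citing context rather than keyword matching), but even a crude pass like this distinguishes a “compared against” citation from a “builds directly on” citation, which tells us more than a raw count does.

```python
# A minimal, illustrative sketch (not a real tool): label the sentence
# surrounding a citation by matching cue phrases. The categories and cue
# lists below are assumptions for demonstration only; a real system would
# use trained NLP models rather than keyword matching.

CUES = {
    "comparison": ["compared to", "in contrast to", "unlike", "outperforms"],
    "uses_technique": ["we use", "we adopt", "we build on", "following the approach of"],
    "negative": ["however,", "fails to", "does not scale", "a limitation of"],
}

def classify_citation_context(sentence: str) -> str:
    """Return a coarse label for a sentence that contains a citation."""
    text = sentence.lower()
    for label, phrases in CUES.items():
        if any(phrase in text for phrase in phrases):
            return label
    return "background"  # default: a neutral, perfunctory reference

if __name__ == "__main__":
    examples = [
        "Unlike [12], our analysis handles higher-order functions.",
        "We build on the region inference of [7].",
        "Type soundness was established syntactically [3].",
    ]
    for s in examples:
        print(classify_citation_context(s), "<-", s)
```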
How to promote high quality PL research?
I will close with three questions.
First, which departments are actually adopting the CRA recommendations, at least in some form? My understanding is that Cornell is adopting them, which would make sense given that a Cornell professor co-authored them. Others?
Second, what PL papers serve as models of quality and impact? There are several lists of “classic” PL papers, such as Pierce’s 2004 survey, Might’s top-10 static analysis papers (scroll to the bottom), and Aldrich’s classic papers list. One of my favorite papers is Wright and Felleisen’s A Syntactic Approach to Type Soundness. As far as papers exemplifying depth, clarity, convincing methods, and impact go, it’s hard to beat this one. What are your choices for great papers, and why?
Third, aside from limiting the number of papers considered at hiring/promotion time, are there other ways to incentivize great research? One boring, but potentially effective thing we could do is to have higher, and clearer, standards for review. For example, a complaint I have heard a lot is that the experiments done in PL papers are sketchy: They use poor statistics and/or cherry-picked benchmark suites. Thus we could imagine asking reviewers to more carefully confirm that good methods are used.[ref]Raising review standards may be problematic in PL’s conference-oriented culture, since we might like to accept flawed/incomplete papers to conferences that report potentially path-breaking ideas. Even if we left official review standards alone, the CRA guidelines might serve as a counter-incentive, pushing authors to write less flawed, more complete papers, since they know fewer papers will be considered when they are evaluated.[/ref] Another idea would be to nudge people towards certain types of quality with explicit rewards, e.g., best paper awards with clear criteria: “Best idea paper” or “Best empirical paper” or “Best tool paper” or “Best new problem paper”. What else could we do to incentivize quality (and make it clear when we’ve seen it)?
Promoting research quality is extremely important for the success of science and all who rely upon it. How can we do better?
Thanks to Jeff Foster and Swarat Chaudhuri who provided thoughtful comments on a draft of this post. Thanks to David Walker for the idea to write it.
We’ll have to implement Matthias Felleisen’s idea. You’re hired as an assistant professor. Upon publishing a paper, you’re promoted to associate. Next time you publish, you’re promoted to full professor. Finally, after publishing the third paper, you’re out on the street.
(That’s not quite MF’s proposal. You start at Full and get demoted every time you publish.)
Observation: the lists of classic papers are quite skewed. They contain exactly zero papers from PLDI. The only paper mentioned (including recursively) in the entire post that appeared at PLDI is Cyclone.
Sorry, just came from a quick search. What classics would you provide? I would suggest that your DieHard (PLDI’06) paper is a model of high quality!
Kathryn McKinley put together this collection of 20 years of PLDI (1979-1999): https://www.cs.utexas.edu/users/mckinley/20-years.html
That list has lots of *excellent* and incredibly influential papers.
I should add that Kathryn spearheaded this joint effort with a committee of leaders in our field.
It has zero papers because it has zero papers that would be considered greatest of the great… know any?
Although I appreciate the spirit behind the CRA guidelines, there’s at least one thing I don’t like, or at least don’t understand, about them. Why are *papers* taken as the unit of impactful result? Rome wasn’t built in a day. At least the way I do research, it takes years of slaving away at a topic to slowly and *incrementally* build up scalable solutions to problems. (Yes, *incrementally*. As Bertrand Meyer said, “One and one-tenth cheer for incremental research!!!”) If hiring/tenure committees want to evaluate someone, they should really look at their research trajectory, not individual papers.
There’s something else that troubles me about the CRA guidelines. Reading between the lines, my takeaway is that these guidelines imply that our conference review processes are failing us: people are under pressure to write more papers, so they’re slicing the salami too thin, and top conferences are accepting these thin salami slices — if they weren’t, the salami-slicing wouldn’t be an issue — so we should leave the important decisions about the quality of people’s research to tenure/hiring committees, who are more reliable. But why exactly do people trust the peer review processes that go into tenure/hiring committee decisions over the ones that go into conference acceptance decisions? Tenure/hiring committee decisions are far less open and far less fairly determined, and in particular they only take into account the opinions of senior researchers.
In short, quality is in principle more important than quantity, but I think quantity is still a useful indicator. If someone publishes consistently at top conferences, it is usually a sign that they are interested in publishing in solid increments so as to communicate ideas with their fellow researchers (an important aspect of being a good scientist, which the CRA guidelines do not mention at all!), and in my experience it is usually also a sign that they are doing good work.
Write a long-form journal paper or book that describes the arc, cuts out all of the blind alleys, etc.
Writing long-form stuff is great, but it’s very time-consuming, and it’s not something I can really ask a student to collaborate with me on, so it just doesn’t get done. I would love to have a sabbatical in which to write a book, but I don’t get sabbaticals.
Funny story, though. Lars and I were trying to write a book a year or two ago. As we, together with our students, postdocs, etc., were trying to work out the streamlined story, we realized there was an even simpler way of setting everything up. We ended up publishing that as a new research contribution (our Iris logic, in POPL’15). The book has fallen by the wayside, but the push for the book resulted in a nice result, on which we are now building in several ways.
I don’t buy that quantity predicts quality. Publishing at top conferences says a lot about your writing skills and your technical abilities, but little about the importance of the ideas.
Recently, I have started to ask myself the question: if I were to publish only one paper a year, would I do some things differently? The answer is yes. Lots.
Hmm. At least in the areas I am expert in, if I look at the people who publish the most in top venues, they are among the people doing the best work in the area. Of course, there are a number of other really good people, too, who don’t publish as much.
As far as publishing less, we probably already argued on Facebook about it. If I were forced to publish less, it would suck. I would be forced to put all my eggs in one basket, sequentialize my research, save up for big, impressive papers that are too dense for anyone to really appreciate rather than writing more modest but comprehensible papers. Why is this desirable?
Derek, I think if one takes time to refine and work through ideas, one can produce simpler papers rather than overly dense ones. I always thought Hoare and Reynolds set great examples of this. Neither published all that many papers, but both would get ideas to a point where they were easily publishable, then not publish and instead improve and simplify, and iterate that process until much simpler and more memorable and impactful papers were the result.
There is an amazing story where the paper “Proof of a Program: Find”, which kicked off the idea of developing programs and proofs together, started as a much more complex tour de force of which one would be extremely proud, but Hoare relentlessly simplified instead of publishing, until he got to a tiny and seemingly “easy” paper that achieved so much more.
Yeah, the times were different… but any encouragement to publish less could be taken as an opportunity to strive for greater simplicity, thereby making the spread of ideas much easier to achieve.
[My supervisor, Tennent, advised me long ago not to go for quantity, against what seemed then and still seems now to be the trend; his argument stuck, and I am really happy to see these exhortations now from the CRA.]
I was waiting for someone to bring up Reynolds. 🙂 Look, there will always be outliers, and you know I love Reynolds and Hoare, but in general I think the practice of refining and polishing work until it is “just right” is counterproductive to science. Research is an evolving process of slowly understanding things better. That process is (generally) best carried out by communicating with others through incremental papers and getting feedback on them, not through an iteration with one’s desk drawer. I write a paper once I think I’ve found something interesting and I’ve worked it out well enough that I feel comfortable sharing it with others. (This is a particularly good strategy for a perfectionist like me: if I waited till my papers were perfect, I would never submit anything.) Eventually, it’d be great to write a summary paper that streamlines the whole research arc as Greg suggested, but I would argue that’s just another natural step in the iterative, incremental research process, and tenure/hiring decisions should not be dependent on such papers. (FWIW, Reynolds didn’t write many such retrospective papers either.)
Let me also be clear: I’m not arguing to “go for quantity”! I’m arguing to structure one’s research around results that make steady incremental progress on important problems. And that means writing interesting, useful, and imperfect papers at a consistent rate rather than saving up for one big, perfect, groundbreaking paper.
For example (and I hope he doesn’t mind my using his name in vain), take Lars Birkedal [disclosure: he is a frequent collaborator of mine]. He has published a gazillion papers, many elaborating on the same basic themes. A number of these are clearly incremental papers which are subsumed by later ones, but they serve a real purpose. He (together with his students, postdocs, and collaborators, of course) is putting complex technical ideas to work and showing in detail how they actually pan out in a variety of different settings, for solving a number of related problems. I find this enormously valuable.
To take another example you know very well, look at the history of papers on separation logic. There are tons of incremental results there (including several with Reynolds’s name on them). I think it’s fascinating to read those intermediate papers to understand how the ideas evolved, and the deluge of intermediate results reflected the excitement around the topic. I don’t see it as something to be frowned upon.
It starts with hiring.
One reason most senior people (except Derek who can’t help himself) publish lots is that each and every one of their students needs lots of papers to get a job.
The problem with hiring is that most places want to weigh apples and oranges on the same scale (and please don’t say “but we don’t do that” at Ivy School X; I don’t care, you are in the noise of the hiring process). And the only scale that everyone seems to agree on is the length of your CV weighted by the impact factor of the venue.
Given that there are fields where PhD candidates seem to be able to pump out several papers a year, this sets the bar for everyone.
Fix hiring and we have a shot.
I’m a bit confused about how the proposed fix, to have hiring committees only consider a certain number of papers, could work. My impression is that hiring now is pretty free-form: a number of candidates are interviewed and the committee says which one they think is best. So if productive researchers get hired, that seems to indicate that universities want to hire productive researchers. Even if you make a rule that people should submit their best papers, it seems that the information about who is productive and who isn’t will leak through pretty easily–it’s not like you are going to place the candidate behind a screen when taking them out to dinner?
Productivity is not impact. It’s more often than not just gaming the system (weoDoc). It’s getting your name on as many projects in the lab as you can.
(weoDoc = with the exception of Derek of course)
Look, we all know people who publish a lot of papers that they themselves haven’t even read, let alone contributed to (weoJoc). It doesn’t mean this is the norm, and in my experience it is not, but maybe I hang out with people who are unusually responsible.
In any case, no one is arguing that productivity = impact, but I think the focus on impact is a red herring. By the time people come up for hiring/tenure decisions, at least in PL it is very unusual for their research to have had major direct impact. PL impact tends to be much more diffuse and to take decades to seep into mainstream languages and systems (cf. Mike’s mention of Cyclone’s indirect impact as one of several influences on Rust ten years later). So hiring/tenure decisions will be based instead largely on the same kind of criteria used to judge peer-reviewed papers — is the work interesting, well-executed, and well-presented? — and additionally does the researcher have a compelling research trajectory? I would argue that positive answers to these questions tend to be correlated with productivity. That’s in the best-case scenario. In the worst case, these decisions will be made based on the particular prejudices of the hiring/tenure committee concerning what research directions are worthwhile.
I think the major problem with this approach is that it isn’t really incentive-compatible. That is, everyone has the incentive to defect, and there’s no punishment for defecting. As Vilhelm says, we’ll all know if the postdoc we’re considering for a TT position, or the Assistant Professor up for tenure also has some other papers, and it will hardly hurt your case. And since people outside our field will care, it benefits everyone to hire/tenure/promote more productive people. So students/faculty/etc still have the incentive to publish more.
One way to change this would be to significantly _increase_ selectivity for publication. For example, in some fields a single Cell/Nature/Science paper will get you a job. If one POPL paper got you a job, people would publish fewer of them. That would require having POPL accept many many fewer papers. Perhaps the criterion is that a PLDI best paper award gets you a job, and a second one gets you tenure. But I don’t think that would be better than the current system, for a variety of reasons (for example, I wouldn’t have gotten a job :).
First, I have proposed to use the Friedman-Schneider CRA report as essential input for formulating ‘tenure’ criteria for ‘tenure-track’ junior positions at DIKU. DIKU has adopted Google Scholar profiles (proposed by Lars Birkedal, chair at Aarhus) as a metric basis for impact, as an alternative to Scopus and Web of Science, which are used by our dean’s office. For hiring junior faculty, the hiring process itself (the same application material and process across the Faculty of Science) makes adopting CRA-style criteria unrealistic.
Third, (re)emphasize the inherently fuzzy furthering of science and enlightenment as being the core mission of free research, not only meeting by-the-second submission deadlines, page limits, page layout rules, coolest graphics etc. Just say it out loud once in a while when Ph.D. students, junior researchers, senior researchers (I often forget, nice to be reminded) and especially when spreadsheet-armed university/research lab administrators/managers are present. Have this core mission in mind when adding per se well-meant rules to scientific conference organization that lead to increased organizational complexity, inflexibility, fragility and proceduralization.
Fritz
Acting department chair, DIKU
I like the proposal in the CRA report.
Sam and Vilhelm: Of course, it wouldn’t fix all problems with the current system but it would help to reduce the importance of quantitative measures. At least, it would be an argument for committee members who speak in favor of candidates with fewer publications. I also don’t see how this proposal would make things worse.
Derek: I don’t understand why it would be hard to get students to write a journal paper if this was one of the most important documents for the job search.
Another thought: Submitting one paper to get a tenure-track position and another paper to get tenure is not far from the traditional model in Germany. These two papers are called dissertation and habilitation. I never understood why dissertations play no role in the hiring process anymore. There are so many ways to get ‘top’ publications. What do you really learn about the writing skills, technical skills, or creativity of a person that published a paper with 5 co-authors? Of course, you can and should collaborate for a dissertation as well. However, it is expected that you take the lead on all aspects and show that you can own a research project for a longer period. If a dissertation is well written then you also don’t have to read the whole thing to find out what’s in there.
It’s not hard to get students to help write a journal paper — I do it all the time. But we’re not talking about an ordinary journal paper. We’re talking about a very-long-form journal article or, better yet, a book laying out a streamlined version of a many-paper research arc that I (the senior researcher) have been pursuing over a number of years. I can’t ask students to devote months and months to helping me with that book rather than focusing on their own PhDs. They need to be focused on producing interesting new papers of their own. So to write such a book, I need to find time myself, which is hard.
Concerning what you learn about the qualities of a person who has published a paper with 5 co-authors, the point is you don’t look at one paper: you look at the spectrum of papers they have published. Is there a pattern or an agenda emerging? If someone has a bunch of papers in top venues, it is highly unlikely they have made minimal contributions to most of them. Otherwise, why would anyone want to work with them? You look at their papers in context: What do the papers en masse reveal about the way the person approaches research? Have they given insightful talks about them that place the papers in perspective? Again, this is all based on a more holistic appreciation of what it means to be a good researcher than “show me your 2 best papers”.
Thanks for the great discussion, everyone! I just want to make a couple more points:
– I think that the phenomenon of counting papers is most likely to bite during hiring, rather than promotion. If you are hiring a fresh PhD, and you assume that their more recent work reflects their full, mature character (where earlier work might have been driven by the advisor or more senior collaborators), it’s hard to assess the impact of that work. But, if the student has some highly cited work (perhaps earlier in their career), or has a lot of work (e.g., many papers in top conferences, perhaps produced with many collaborators), it is very tempting to point to this as a plus, and the lack of it in other candidates as a minus. But doing so may not lead to the best outcomes. So focusing on the intrinsic qualities of the work itself, perhaps by reading a few papers, while also considering letters of reference and various other forms of evidence, might be a better approach (the CRA thinks so).
- The CRA guidelines, in a sense, are not saying we should only read two papers, and ignore everything else. They are saying you should pay close attention to the few papers the candidate has identified as their best (and ask letter writers to do the same), while also doing all of the other things you already do. (E.g., they say quantity should be a “secondary consideration,” not a non-consideration.) Following their recommendation can create an incentive for researchers to try to produce some great papers rather than to scattershot-produce many one-off papers that maximize the LPU count. Producing great papers can be done in sequence, e.g., as a long-form version of a conference paper or series of papers, that presents a fuller picture. I am reminded of a comment that Peter O’Hearn made to me in his interview: “I am of the opinion that academic computer science estimates the significance of novelty maybe a little too much; from an industrial perspective we would have more to go on if there were more corroboration and less uncorroborated novelty.” To me this is speaking to one aspect of the question of quality. Conferences allow you to “get the ideas out there” and be productive. But to have impact and really move society forward, we need deep/lasting results. These require a lot of work, with lots of pitfalls dodged and lessons learned. We should try to incentivize people producing this kind of work (perhaps as a synthesis of related conference papers).
Mike:
– I completely agree with your first point. Relying primarily on quantitative metrics for tenure-track hires is a very bad idea.
– Regarding your second point, I am merely saying that “paper” is the wrong unit of “important thing”. I wish the CRA guidelines instead asked candidates to list what they think are their two or three most important ideas or research arcs, which should be explainable at a high level in a few paragraphs (e.g. a 2-page research statement).
– If we want to move away from the obsessive focus on novelty rather than corroboration, we should be encouraging publication of papers that show how to actually scale up existing ideas to more realistic systems. When I speak about (the good kind of) “incremental research”, this is what I am talking about.