Science 7 February 2014:
Vol. 343 no. 6171 pp. 596-598
News Focus

Peering Into Peer Review

Jeffrey Mervis | 18 Comments

Why don't proposals given better scores by the National Institutes of Health lead to more important research outcomes?


These postings do not necessarily represent the views/opinions of Science.

There are four issues that need to be considered before drawing conclusions about the fairness of the funding peer-review system.

First, scientific research has its own cycle, which may vary depending on the topic of study. Even within the same topic, one cannot predict when a promising result will be obtained. Science is largely unpredictable, and good ideas may at times be proven wrong.

Second, it is very likely that a proposal received a low score during peer review merely because its topic was considered "outdated," leading reviewers to judge it as having lower impact than the hot topics. However, there is no reason that not-so-hot research areas should produce less influential papers in terms of scientific contribution.

Third, the proposals that stood out with high scores could themselves be very impactful. However, the measuring standards - the number, timing, and impact of the resulting publications - do not necessarily reflect the true value of that specific research.

Last but not least, writing a good proposal is only the first step in carrying out good science. A highly ranked proposal needs fulfillment - continuous effort by researchers, postdocs, and students. Besides, as the project progresses, the PIs need to adjust its direction in a timely manner as circumstances change. None of these factors can be evaluated during peer review.

In any case, the peer-review system itself is not to blame. It probably has some flaws, but it may be the best tool we have so far for evaluating proposals.

Submitted on Wed, 07/30/2014 - 01:30

While not gainsaying the advantages of peer review, it is worth noting that the peer-review process can be used in rather nefarious ways.

Firstly, it is blind to context: the resources, the intellectual and monetary support researchers receive, and the circumstances governing the research.

Secondly, experts conservatively judge incoming research within their field's already established research paradigms, favoring "Group Think" and disfavoring the emergence of the new ideas and tools that work in tandem to drive science (1, 2), resulting in a paucity of paradigm shifts. In this regard, Jessen, Matheson, and Lacey seem warranted in submitting that Darwin's theory would not have been published had it been subjected to the group thinking of his peers (1). Thus, group thinking might squelch novel ideas. This is congruent with John Locke's view that "Knowledge is the perception of the agreement or disagreement of two ideas." Peer review should allow for the possibility of an emerging paradigm shift, however unlikely it might be, which accords with both common sense and Albert Einstein, who once stated, "For an idea that does not at first seem insane, there is no hope."

Thirdly, peer review should analyse both results and non-results. Nevertheless, a major proportion of peer review may be biased against non-results, which might later prove equally important in replication studies.

Fourthly, peer review can be used to reveal irrelevant knowledge, to self-glorify, to disparage, and to grandstand rather than to understand. Eminent scientists can change perspectives; nonscientists are unable to observe the phenomenon through the same lens. The two are in sharp contrast in terms of their rhetoric.

1. J. K. Jessen, L. Matheson, F. M. Lacey. Doing Your Literature Review. London: Sage, pp. 21-22 (2011).
2. F. J. Dyson. Is Science Mostly Driven by Ideas or by Tools? Science 338, 1426-1427 (2012).

Submitted on Thu, 06/19/2014 - 19:22

I was very disappointed that no attempt was made to judge research quality by anything other than citation impact. Of course, there are citation cartels where people cite each other simply out of habit, or because it seems convenient to cite a well-known author's paper. Citation, as we now know, is not the same as downloading or viewing a paper, and there are many reasons to cite beyond having read the work. Finally, a low-impact paper may spawn one or many high-impact papers.

Submitted on Sat, 04/12/2014 - 14:05

The results presented in this News Focus story on peer review at NIH are surprising and far from convincing, since they are not consistent with similar analyses based on Canadian data. The author of the cited studies, Michael Lauer, states that "Peer review should be able to tell us what research projects will have the highest impacts". But this has never been the basic function of peer review. Given that only a small minority of applications (around 20%) get grants, the first task of peer review is to select those "winners". As a consequence, the first question to ask is: did the winners have (before and after the grants) a larger impact than the losers? Studies based on Canadian data consistently show that the productivity and scientific impact of grantees are systematically higher than those of non-grantees.

The next question is whether the marks given by committee members correlate with scientific impact. In a recent study prepared for the Canadian Institutes of Health Research (CIHR), our organization (OST) found that top-ranked funded projects have higher productivity and impact than lower-ranked but also funded projects. This study also showed that a high degree of consensus among peers increases the likelihood that they have made the right choice. Indeed, "true positives" (those chosen as "grantees" by reviewers as well as by the committees) had systematically higher impact (in terms of average relative citations) than false positives and false negatives, while true negatives (rejected by both reviewers and committees) showed the lowest impact (1). This confirmed an older study we did in 1996 for the Natural Sciences and Engineering Research Council of Canada (NSERC), in which those with larger grants were also the ones with higher impact, while rejected applicants had the lowest impact (2).

As the recently published report concluded: "The evidence from this analysis therefore provides support to the hypothesis that peer review committees are selecting the 'best research ideas' as measured by resulting outcomes - subsequent publications and their impact" (1).

1. See
2. See

Yves Gingras, Scientific Director of the Observatoire des sciences et des technologies (OST), Canada Research Chair in History and Sociology of Science, UQAM, C.P. 8888, Succ. Centre-Ville, Montréal, Canada, H3C 3P8

Submitted on Mon, 03/31/2014 - 09:52

Sorry, at the start of the last paragraph I of course meant:

In finance they face a similar challenge of estimating a future return from future prospects/past performance ..

Submitted on Fri, 03/28/2014 - 17:01

Adding to Chris Waters' point: the past productivity of researchers should play a much bigger role. The problem, as I see it, is that researchers are evaluated mainly in absolute terms (e.g., total citation count) rather than relative terms (total citation count / total funding). This raises a question: how can the competition be expected to produce an increase in productivity when productivity is not part of the competition?

In finance they face a similar challenge of estimating a future return from past performance, and there companies are of course mainly evaluated in relative terms (e.g., P/E or earnings-per-share ratios). Evaluation in absolute terms is akin to 'momentum investment', where the best company (in absolute terms) is assumed to be identical to the best investment (in relative terms). The problem, however, is the presence of the other investors: perhaps they have already reached the same opinion and driven the price/activity up to a level where an additional supply of resources would make a relatively smaller difference. In that case, a value-based or contrarian strategy would be better, favoring the undervalued or overlooked company/researcher, who simply has more time and energy to dedicate to the additional project. Here, historical productivity in relation to the current activity level has to be a decisive factor, rather than absolute merits. So one has to abandon the implicit assumption that the best (in absolute terms) is identical to the most productive.
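The commenter's distinction between absolute and relative evaluation can be made concrete with a small sketch. The names and numbers below are entirely made up for illustration; the only point is that ranking by raw citations and ranking by citations per dollar of funding can produce opposite orderings:

```python
# Hypothetical researchers: (name, total citations, total funding in dollars).
# All figures are invented for illustration only.
researchers = [
    ("A", 5000, 2_000_000),
    ("B", 1200, 300_000),
    ("C", 300, 50_000),
]

# Absolute metric: raw citation count.
by_absolute = sorted(researchers, key=lambda r: r[1], reverse=True)

# Relative metric proposed in the comment: citations per dollar of funding.
by_relative = sorted(researchers, key=lambda r: r[1] / r[2], reverse=True)

print([r[0] for r in by_absolute])  # A leads on raw citations
print([r[0] for r in by_relative])  # C leads on citations per dollar
```

With these numbers, A dominates in absolute terms (5000 citations) but has the lowest citations-per-dollar (0.0025), while C, last in absolute terms, is first in relative terms (0.006) - the "overlooked researcher" the contrarian strategy would favor.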

Submitted on Fri, 03/28/2014 - 16:51

We also found that peer review scores are a poor predictor of science outcomes in a study based at the National Science Foundation [S. M. Scheiner, L. M. Bouchie. Frontiers Ecol. Environ. 11, 406 (2013)]. However, we drew a different conclusion. These results demonstrate that review panels are less conservative than typically portrayed. Reviewers and panelists get excited about research that may not succeed, but might provide a big payoff if it does. Exactly that willingness to take a chance on risky research would lead to the lack of correlation seen in our analysis and that of Michael Lauer. Peer review does separate the wheat from the chaff, although this is hard to prove unless we are willing to do the experiment of funding some of the projects in the 60th or 80th percentiles. It provides valuable services and should not be abandoned. However, recognizing its limitations can allow us to improve how it is used.

Submitted on Fri, 03/21/2014 - 07:34

Every scientist who is worth his/her salt can recount experiences where an NIH study section trashed a grant on a topic that later turned out to be quite important. It has been pointed out many times that NIH study sections are inherently conservative and tend to reflect the existing consensus in a field. Major breakthroughs, however, come from science that disrupts consensus. This has been understood ever since Thomas Kuhn and The Structure of Scientific Revolutions. The issue is how to identify breakthrough science and differentiate it from misguided error. To give the NIH some credit, in recent years it has implemented a number of granting programs that try to stress innovation (e.g. the Pioneer awards) as well as funding for beginning investigators who may have some new ideas.

Almost the converse situation prevails in publication, where Nature, Cell, and other high-profile journals try to focus exclusively on 'hot' topics. Over time, some of the designated hot areas prove not so hot and dwindle away, but the 'hot' articles will nonetheless have generated high impact factors. No wonder there is a discrepancy between NIH funding and publication impact!

Although there is no need to obsess about the grant/publication disparity, it is clear that the NIH peer review system could use some renovation (as well as more money to dispense). The titles and functions of study sections still primarily reflect a disease/organ system specific orientation. However, current biomedical science is developing information and insights that cut across traditional boundaries. If the study section system were more reflective of the thrust of current research there would probably be fewer misfires on grant funding.

Submitted on Thu, 02/20/2014 - 11:18

Adding to Christopher Wills' points, I wonder whether the bibliometric indicator used was first shown to be a sufficient indicator of actual research-outcome impact: was it validated? Was it also shown to be stable over the study period? This can't just be presumed, I think.

Also, doesn't the quality of the research publication - as a communication of research outcomes - have something to do with impact? Important outcomes poorly communicated will tend to have low future impact, but this does not necessarily mean that the research was poor, nor that the outcome is unimportant. It just makes things harder for other researchers and for interested users of research outcomes.

If an experiment returns an unexpected outcome, isn't the first thing to do to carefully check the experimental design? Perhaps the outcome obtained is the only outcome the experiment can in fact deliver - which is what Christopher Wills is suggesting.

Submitted on Thu, 02/13/2014 - 06:37

There are two layers of review involved in this study, each with a large component of noise. The first is the grant peer-review process; the second is the peer review involved in publication. Even if each layer carries, say, 30% signal and 70% noise, after the research has passed through both, the remaining 9% of signal is unlikely to be detectable.
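The compounding the commenter describes can be checked with a one-line calculation, under the assumption (the commenter's, made explicit here) that each review layer passes on only its signal fraction independently, so the fractions multiply:

```python
def remaining_signal(layer_fractions):
    """Compound signal fraction after passing through several independent
    review layers: the per-layer fractions simply multiply."""
    result = 1.0
    for fraction in layer_fractions:
        result *= fraction
    return result

# Two layers, each 30% signal / 70% noise:
print(remaining_signal([0.3, 0.3]))  # ~0.09, i.e. only about 9% of signal survives
```

Note how quickly this decays: a third 30%-signal layer would leave under 3%, which is why even a modestly noisy evaluation chain can wash out any correlation between scores and outcomes.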

Submitted on Tue, 02/11/2014 - 17:18

I disagree that citation numbers and time to publish are appropriate measures of how important a research outcome is. "Research importance" is subjective too.

Submitted on Tue, 02/11/2014 - 13:38

We test experiments with rigorous sample sizes and double-blind approaches and subject them to fine-tuned statistics, but grant quality is checked by a few white-haired men chosen mostly by the number of years spent in the field. Peer review is a relic of 19th- and early 20th-century science, which was largely subjective. No wonder it gives poor results.

Submitted on Mon, 02/10/2014 - 05:19

I continue to be amazed at how well-trained scientists will abandon scientific principles when they have a narrative of their own that they wish to promote. I am even more astounded by how scientists who are members of entitled social groups actually believe that evaluation systems that favor them are fair to others who were not born to the same entitlement. Look! How on earth could a dependency study comparing two such similarly problematic scientific evaluation processes - NIH grant review versus scientific journal review - yield anything but meaningless mush? Both systems contain the same subjective bias, lack of diligent review, prejudice, nepotism, unfair discrimination...Need I go on? Moreover, the analysis has a huge omission bias, as it has no accounting for the impact of research that managed to get done although rejected by NIH study sections.

So, I encourage everyone who might be doing the usual perfunctory hand-wringing about this report to move on to a better use of their effort. No peer-review system is ever going to be perfect in practice, but many could be better. Better is a goal everyone can work toward by acting according to the ideal principles of scientific review: diligence, scientific rigor, objectivity, fairness, open-mindedness, recusal for improper conflicts, and above all no unfair social discrimination.

Submitted on Fri, 02/07/2014 - 16:18

During the time period studied, the majority of funded R01 grants were resubmissions (A1s and A2s).*1 However, NIAID data show that more than 80 percent of resubmissions get better scores.

It is possible that revised applications cloud the predictive ability of peer reviewers.*2 Score improvements in revised applications are largely based on an applicant's response to reviewers' suggestions, potentially masking the initial assessment of the applicant and of the fundamental idea behind the proposal. It may be that initial impressions are more accurate in predicting a project's potential impact.

In any case, this is pioneering work given the paucity of data on the peer review process.

1.
2.

Submitted on Fri, 02/07/2014 - 16:17

I think a better model would be for the NIH (and other federal agencies) to fund people, not projects. Rather than trying to project forward based on a specific set of proposed experiments, funding agencies could look retrospectively and instead ask whether the applicant has been productive during the last cycle. If the applicant has made an impact on the field (in terms of publications, citations, intellectual property, etc.), then the applicant should be awarded more funds without having to specifically propose what they will do. If not, then the applicant should be awarded less. I lay out my argument here:

Submitted on Fri, 02/07/2014 - 14:07

I'm surprised that the author doesn't consider an alternative hypothesis consistent with the data. The implicit assumption is that there is an actual difference between high- and low-ranked scores. If that assumption is wrong (i.e., the study sections are being asked to find differences where none exist), then one would get equivalent results. It's my impression that so much effort goes into grants nowadays that very few "bad" grants are being submitted, and the sections are being asked to impose a normal curve on the top 0.1% of the population - which violates the basic assumption of the "normal curve"!

Submitted on Fri, 02/07/2014 - 13:03

I would agree with Scott - one interpretation of the data is that the number of unproductive grants has fallen to a very low level, because the US is not funding science to the point that we have run out of important avenues of research. In my own field, oceanography, grant success rates have dropped from around 40% in the 1970s to the teens now. In my own reviews, I find very few bad proposals being submitted.

Submitted on Wed, 02/12/2014 - 06:11