Oracles, Not Just Scientists. The Seemingly Exceptional Batting Average of Pre-Registered Hypotheses


The model of scientists as explorers suggests endless defeat followed, only on occasion, by a lightning bolt. The model doesn't readily apply to social scientists. Going by published research, empirical social scientists enjoy a lot of success, finding lots of results that are 'consistent with [their] hypothesis.' It could be that this is merely a case of scientists flattering themselves by writing hypotheses that are consistent with their results. Or scientists (albeit the cargo-cult variety) analyzing the data in a motivated manner. Or scientists being good at predicting the results. (Though the data from replication attempts suggests otherwise.) If the first two explanations do most of the work, you would expect the batting average of pre-registered hypotheses to be low, since pre-registration fixes the hypotheses before the results are in. But, as I discovered (see below), the success seems to extend to 'pre-registered' hypotheses. It could be that pre-registration is not much of an insurance against rigging the analyses or the interpretation. Or the explanation could be more benign. It could be that scientists are busy proving 'obvious' things (though hypotheses are almost always framed as being novel and non-obvious). (Of course, things that are obvious to them may not be obvious to others.) Either way, it is peculiar and deserves closer study.

I Know What I Will Find

I started by looking at ten recently published open-access papers in Psychological Science. Eight of the ten got a perfect score. One missed on one of its four hypotheses. I found only one paper where the results were mostly inconsistent with the pre-registered hypotheses.

  1. In An Illusion of Time Caused by Repeated Experience, the authors ask, "Overall, do more repetitions result in items being remembered as even farther back in time?" The intuition they set their result against is that "in the absence of explicit cues, people remember on the basis of memory strength. If a memory is fuzzy, it likely occurred longer ago than a memory that is vivid." Instead, they find "a robust illusion of time that stands in stark contrast with this prediction." This counterintuitive pattern is exactly what they had pre-registered: "repeated exposure to a stimulus causes it to be remembered as having initially been seen longer ago (despite having in actuality been seen more recently)."
  2. In Participating in a Digital-History Project Mobilizes People for Symbolic Justice and Better Intergroup Relations Today, the authors find "Participants in the digital-history condition were more likely to engage in collective action compared with those in the information-only control condition, consistent with our predictions." (Italics are mine.) (The pre-registration is here.)
  3. In Where Do Children Look When Watching Videos With Same-Language Subtitles?, the pre-registered hypotheses and obtained results were as follows:
    1. "We hypothesize that some degree of reading instruction will be necessary before children look at and show evidence of reading the subtitles. Thus, most of the poor readers, e.g. children from year 1, may ignore the subtitles or spend time looking at them without evidence of linguistic processing." They find: "Critically, children in Year 1 ignored 59% of whole subtitles and skipped 66% of the words in the remaining subtitles."
    2. "In contrast, older children or children with better reading skills may show evidence of reading the subtitles, similarly to adults." They find: "However, our findings also showed a progressive increase across Years 2 and 3 in children’s attention to subtitles. By the end of Year 3, the children in our sample spent substantial time looking at subtitles, and their behavior could not be distinguished from that of older children."
  4. In Doubling-Back Aversion: A Reluctance to Make Progress by Undoing It, the authors find: "support for doubling-back aversion, a reluctance to pursue more efficient means to a goal when they entail undoing progress already made." They had predicted that "participants will be less likely to switch paths to choose the shorter path in the doubling-back condition than in the subtle-switch condition."
  5. In What We Think Others Think and Do About Climate Change: A Multicountry Test of Pluralistic Ignorance and Public-Consensus Messaging, the authors find results that are consistent "with Hypothesis 1a", "Supporting Hypothesis 1b", "Supporting Hypothesis 1c", "Consistent with Hypothesis 1d", "supporting Hypothesis 3", etc. But in Table 4, they summarize that four of the hypotheses were supported and five were not.
  6. In Economic Consequences of Numerical Adaptation, the authors find that "people make substantially more mistakes in their nonadapted range." They had predicted "that people in specific countries will be better at making economically relevant decisions in the numerical currency range with which they are familiar, compared to when the involved monetary amounts fall outside of their habitual range."
  7. In Co-Speech Hand Gestures Are Used to Predict Upcoming Meaning, the authors find "gestures improved explicit predictions of upcoming target words" and "gestures reduced alpha and beta power during the pause, indicating anticipation, and reduced N400 amplitudes, demonstrating facilitated semantic processing." They had predicted "iconic hand gestures (compared to meaningless self-adaptors) should result in stronger pre-stimulus alpha and beta desynchronization before hearing the target word and smaller N400 amplitude upon hearing the target word."
  8. In Disagreeing Perspectives Enhance Inner-Crowd Wisdom for Difficult (but Not Easy) Questions, the authors find, "The results support the notion that taking a disagreeing perspective is beneficial for difficult questions, yet harmful for easier questions." (Also: "We subsequently tested the main prediction of interest, namely the interaction between question difficulty and perspective taking. In both experiments, the interaction estimates’ credible intervals did not include zero.") They had predicted "in general, the benefit of averaging increases as a function of question difficulty (with more benefit for increasingly difficult questions), this pattern is expected to be more pronounced when aggregating first guesses with second estimates made from a disagreeing perspective (vs. simply aggregating first guesses with people’s second guesses)." (There are multiple pre-registrations.) Or if you were to refer to the paper: "[W]e expected that, in general, the benefit of averaging would increase as a function of question difficulty with more benefit for increasingly difficult questions, yet this pattern was expected to be more pronounced when aggregating first guesses with second estimates made from a disagreeing perspective (vs. simply aggregating first guesses with second guesses)."
  9. In Reluctance to Downplay: Asymmetric Sensitivity to Differences in the Severity of Moral Transgressions, the authors found "When scaling up from a less severe transgression to a more severe one, people readily express stronger condemnation of the worse transgression. But when scaling down from a more severe transgression to a less severe one, they differentiate less, often condemning the lesser transgression just as strongly as one that is transparently worse... Observers’ moral-character judgments reveal a similar pattern..." The authors had predicted (in the paper) "...a directional asymmetry in people’s willingness to differentiate between bad acts: To avoid expressing insufficient condemnation, people scale down condemnation less than they scale up. In line with this account, we expect the asymmetry to be more pronounced for judgments that implicate moral character to a greater extent and for transgressions that seem especially important to condemn. Moreover, we expect a similar asymmetry to emerge in observers’ judgments of the morality of scaling up versus scaling down." While the authors do not explicitly hypothesize in the pre-registration, the way they write the pre-registered research question gives a strong hint as to their prediction: "whether people's moral judgments of two wrongdoings differ less when they evaluate these wrongdoings in descending order of wrongness (i.e., when the greater wrongdoing comes first) compared to when they evaluate the same two wrongdoings in ascending order of wrongness (i.e., when the lesser wrongdoing comes first)." (More pre-registrations with a similar flavor can be found here.)
  10. In Differences Between Lifelong Singles and Ever-Partnered Individuals in Big Five Personality Traits and Life Satisfaction, the authors find that "lifelong singles were less extraverted, less conscientious, less open to experiences (dependent on singlehood definition), and less satisfied with their lives." The authors had pre-registered four hypotheses (I am using the quote from the paper): "Compared to ever-partnered individuals, lifelong singles self-report lower levels of extraversion (Hypothesis 1), higher levels of neuroticism (Hypothesis 2), lower levels of conscientiousness (Hypothesis 3), and lower levels of life satisfaction (Hypothesis 4)." The result on neuroticism didn't make it to the abstract. The paper reports that "no significant differences emerged between lifelong singles and ever-partnered respondents in neuroticism (Hypothesis 2), with only minimal differences depending on the three singlehood definitions or the included covariates."
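
For a rough sense of how much ten papers can tell us, here is a minimal sketch that tallies the list above and puts an approximate 95% Wilson score interval around the share of papers with a perfect score. The coding of each paper as perfect or not is my reading of the list, the variable and function names are made up for illustration, and treating these ten papers as a random sample is an assumption that almost certainly doesn't hold.

```python
from math import sqrt

# True = every pre-registered hypothesis was supported, per the list above.
# Papers 5 and 10 are the two exceptions. (The coding is an assumption.)
perfect_score = [True, True, True, True, False, True, True, True, True, False]

# Hypothesis-level counts (supported, total), only where the list states them.
stated_counts = {
    5: (4, 9),   # Table 4: four supported, five not
    10: (3, 4),  # neuroticism (Hypothesis 2) not supported
}


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


n_perfect = sum(perfect_score)
n_papers = len(perfect_score)
share = n_perfect / n_papers
low, high = wilson_interval(n_perfect, n_papers)

print(f"{n_perfect}/{n_papers} papers with a perfect score ({share:.0%})")
print(f"Approximate 95% interval: {low:.2f} to {high:.2f}")
for paper, (supported, total) in stated_counts.items():
    print(f"Paper {paper}: {supported}/{total} hypotheses supported")
```

With a sample this small, the interval is wide, which is one more reason to want a bigger tally.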

It goes without saying that we could do with more comprehensive data analysis. But I know what I will find.
