The Modal Social Science Dataset is Under-Analyzed

When estimates can be combined to yield a useful, consistent, asymptotically unbiased estimate, should theory limit which analyses we perform?

Preamble

On statmodeling, Gelman posted a great question: "A study is conducted on two groups. When does it make sense to report two separate estimates, and when does it make sense to just report the pooled estimate?" When Kabir and I faced this question in our own work, we chose to present disaggregated analyses, partly because we felt it was more 'honest' and partly because some of the disaggregations were in line with what was done before. But there is a broader question at stake: does the modal empirical social science paper publish too few or too many statistical summaries?

Set aside the constraints of the paper (both the physical object and the unit of scientific publication) and the need to tailor it to human readers. (For one, think of LLMs and other algorithms as additional consumers.) What, then, is the optimal number of statistical summaries?

What Do We Mean By Optimal?

If our goal is to minimize regret from missing important data-driven patterns, then every additional summary can only help, so long as it adds nonnegative expected information. This limits us to 'credible' estimates, which include descriptive analyses, heterogeneous treatment effects, etc. In this framing, there is no intrinsic harm in generating more outputs: the only barrier is controlling spurious findings, which we already know how to address (e.g., FDR correction, hierarchical shrinkage). The benefits are likely numerous. More comprehensive descriptive statistics can reveal surprising regularities that challenge theoretical assumptions or suggest new mechanisms. Even concerns about publication bias can be somewhat mitigated by more 'mindless' analyses.
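To make the error-control point concrete, here is a minimal sketch (with made-up data and arbitrary subgroup variables, not any particular study) of running many subgroup comparisons and then applying off-the-shelf Benjamini-Hochberg FDR correction with statsmodels:

```python
# A sketch: enumerate many subgroup comparisons on fake data, then control the
# false discovery rate with Benjamini-Hochberg. Variables and data are made up.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Made-up dataset: one outcome, a treatment indicator, and a few subgroup columns.
n = 5000
df = pd.DataFrame({
    "y": rng.normal(size=n),
    "treated": rng.integers(0, 2, size=n),
    "gender": rng.choice(["woman", "man"], size=n),
    "region": rng.choice(["north", "south", "east", "west"], size=n),
    "age_group": rng.choice(["18-34", "35-54", "55+"], size=n),
})

# Enumerate simple subgroup analyses: treated vs. control difference in means
# within every level of every subgroup variable.
rows = []
for col in ["gender", "region", "age_group"]:
    for level, g in df.groupby(col):
        treated, control = g.loc[g.treated == 1, "y"], g.loc[g.treated == 0, "y"]
        _, p = stats.ttest_ind(treated, control)
        rows.append({"subgroup": f"{col}={level}",
                     "estimate": treated.mean() - control.mean(),
                     "p_value": p})
results = pd.DataFrame(rows)

# Benjamini-Hochberg keeps the expected share of false discoveries at 5 percent.
reject, p_adj, _, _ = multipletests(results["p_value"], alpha=0.05, method="fdr_bh")
results["p_adjusted"] = p_adj
results["significant"] = reject
print(results.sort_values("p_adjusted"))
```

Hierarchical shrinkage (partially pooling the subgroup estimates toward a common mean) is the other standard option mentioned above and plays the same role here.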

Current Status

In empirical social science, theory has long served as both compass and filter. As compass, it guides what to study, how to operationalize concepts, and how to interpret results. As filter, it limits what gets analyzed: on many datasets, only a small fraction of the possible analyses, even restricting ourselves to those yielding 'credible' estimates, is ever conducted.

Rethinking the Pipeline

With cheap compute and automated tooling, we can:

  1. Enumerate "all" (simple?) descriptive and subgroup analyses (including the kinds of descriptive analyses that, say, an LLM suggests; see here),
  2. Automatically document each statistic with provenance (“data, model, and code used”),
  3. Apply off‐the‐shelf error‐control methods to keep false discoveries in check,
  4. Expose results in both human‐readable narratives and machine‐readable formats.

This isn’t a free‐for‐all: the pipeline itself must be well‐specified (families of tests, dependency structure, shrinkage priors) so that validity is preserved.
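As a sketch of what steps 2 and 4 could look like, each statistic can be wrapped in a record that carries its provenance and written out in a machine-readable format. The field names and helper functions below are illustrative assumptions, not an established standard:

```python
# A sketch of provenance-carrying, machine-readable output for each statistic.
# The record fields and helpers are illustrative; adapt them to your pipeline.
import json
import hashlib
import subprocess
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class StatRecord:
    name: str            # e.g., "diff_in_means[gender=woman]"
    estimate: float
    std_error: float
    p_adjusted: float    # after FDR correction
    data_sha256: str     # hash of the exact dataset analyzed
    code_commit: str     # git commit of the analysis code
    model: str           # e.g., "ttest_ind" or "OLS y ~ treated"
    created_at: str

def data_hash(path: str) -> str:
    """Hash the dataset file so the estimate is tied to an exact input."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def code_commit() -> str:
    """Record the git commit of the code that produced the estimate."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def emit_record(record: StatRecord, path: str) -> None:
    """Append one JSON line per statistic."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example usage with made-up numbers; assumes a survey.csv file exists and the
# code is running inside a git repository.
rec = StatRecord(
    name="diff_in_means[gender=woman]",
    estimate=0.12, std_error=0.05, p_adjusted=0.03,
    data_sha256=data_hash("survey.csv"),
    code_commit=code_commit(),
    model="ttest_ind",
    created_at=datetime.now(timezone.utc).isoformat(),
)
emit_record(rec, "results.jsonl")
```

One JSON line per statistic keeps the output trivially diffable and easy for downstream tools, including LLMs, to consume alongside any human-readable narrative.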

This kind of system also opens options for automated theory building: if we specify our theory and its predictions, data that is inconsistent with those predictions can seed new theory generation, plausibly with help from LLMs.

After all, the ultimate goal isn't to fit our findings into predetermined boxes but to understand the world as it actually is.
