LLM Agents Can Reproduce Social Science Findings from Paper Descriptions Alone
Researchers tested whether AI agents could reproduce the results of social‑science studies using only a paper’s written methods description and the original data, without access to the original analysis code or reported results. Across 48 papers, agents often matched published outcomes, but success varied widely with the model, agent design, and paper clarity. The study highlights both the promise of automated reproducibility and the lingering problem of underspecified methods in scholarly writing.
## What Was Studied
The paper investigates whether large language model (LLM) agents can reproduce empirical results from social‑science studies when given only the paper’s methods description and the raw data, but not the original analysis code, results, or even the full article. The authors built an agentic reproduction pipeline that:
1. extracts a structured, step‑by‑step methods description from each target paper;
2. isolates the agent from any knowledge of the original implementation;
3. lets the agent write and execute its own code to generate outputs; and
4. performs a deterministic, cell‑by‑cell comparison between the agent’s output and the published results.
When mismatches occur, an error‑attribution step traces the discrepancy back through the pipeline to pinpoint whether it arose from the agent’s reasoning, code generation, or ambiguities in the source text.
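To make the final step concrete, here is a minimal sketch of what a deterministic, cell‑by‑cell comparison within a numerical tolerance might look like. It assumes both the agent’s output and the published table are numeric pandas DataFrames with matching labels; the function name, tolerance value, and alignment policy are illustrative assumptions, not the authors’ implementation.

```python
# Illustrative sketch only; not the paper's actual comparison code.
import numpy as np
import pandas as pd

def compare_tables(agent_df: pd.DataFrame,
                   published_df: pd.DataFrame,
                   rel_tol: float = 1e-2) -> pd.DataFrame:
    """Return a boolean mask marking which cells of the agent's table
    match the published table within a relative tolerance."""
    # Align on the union of rows/columns so missing cells count as mismatches.
    agent_aligned, pub_aligned = agent_df.align(published_df, join="outer")
    return pd.DataFrame(
        np.isclose(agent_aligned.to_numpy(dtype=float),
                   pub_aligned.to_numpy(dtype=float),
                   rtol=rel_tol, equal_nan=False),
        index=pub_aligned.index,
        columns=pub_aligned.columns,
    )

# A reproduction of a given table "succeeds" if every cell matches:
# success = compare_tables(agent_out, published).all().all()
```

Because the comparison is deterministic, any mismatched cell can be handed to the error‑attribution step to trace where in the pipeline the discrepancy arose.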
## Key Findings
Across 48 papers with human‑verified reproducibility benchmarks, the system showed that LLM agents can often recover the original findings: in many cases the reproduced tables, regression coefficients, or statistical tests matched the published values within tolerance. However, performance was highly uneven. Differences emerged along three axes:
- **Model capability:** More advanced LLMs (e.g., GPT‑4‑turbo, Claude‑3 Opus) achieved higher success rates than smaller or older models.
- **Agent scaffold:** Certain prompting and reasoning architectures (e.g., ReAct‑style with iterative self‑critique) outperformed simpler zero‑shot code‑generation prompts; a minimal sketch of such a loop follows this list.
- **Paper clarity:** Articles with detailed, unambiguous procedural descriptions yielded better reproduction, whereas vague or underspecified methods led to systematic failures.
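For concreteness, the sketch below shows what a ReAct‑style generate → execute → self‑critique loop could look like. `call_llm` and `run_code` are hypothetical stand‑ins; nothing here should be read as the authors’ actual scaffold.

```python
# Hypothetical sketch of an iterative generate -> execute -> self-critique loop.
import subprocess
import sys
import tempfile

def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat-completion API the agent uses."""
    raise NotImplementedError

def run_code(code: str) -> tuple[bool, str]:
    """Execute generated analysis code in a subprocess; return (ok, combined output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def reproduce(methods_text: str, max_rounds: int = 5) -> str:
    """Draft analysis code from the methods text, then revise it until it runs cleanly."""
    code = call_llm(f"Write Python that implements these methods:\n{methods_text}")
    output = ""
    for _ in range(max_rounds):
        ok, output = run_code(code)
        if ok:
            break  # candidate results, ready for cell-by-cell comparison
        # Self-critique: feed the error trace back and ask for a revised script.
        code = call_llm(f"This script failed with:\n{output}\nRevise it:\n{code}")
    return output
```

In practice such a loop would also be bounded by a token or wall‑clock budget, and its final output would feed into a comparison step like the one sketched above.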
Root‑cause analysis indicated that failures split roughly evenly between agent mistakes (misinterpretation of steps, faulty code) and missing or ambiguous details in the original paper that forced the agent to guess.
## Why It Matters
Automated reproducibility could dramatically accelerate scientific verification, reduce the burden on researchers to re‑run complex analyses, and help detect errors or fraud at scale. By showing that LLMs can act as independent reproducers when given only a methods narrative, the work suggests a future where AI agents routinely audit published claims. At the same time, the study underscores a persistent weakness in scholarly communication: many papers omit critical implementation details, making replication difficult even for humans. Improving methods transparency would not only aid human reproducibility but also boost the reliability of AI‑driven verification pipelines.
## Limitations
The evaluation is confined to social‑science papers that provide shareable data and have been manually vetted for reproducibility; results may not generalize to fields with proprietary data, simulations, or highly specialized hardware. The agent system itself depends on the quality of the methods‑extraction step, which could introduce bias if the source text is poorly structured. Finally, the study measures success by numerical equivalence of outputs, which may miss substantive differences in interpretation or robustness checks that are not captured in the reported tables.