What is this claim about AI proposing and testing scientific hypotheses?
The headline refers to scientists’ prediction that advanced AI systems will soon act as semi‑autonomous “AI scientists,” not just helping with coding or writing, but actually suggesting new scientific ideas and running experiments to check them. In a recent Q&A, EPFL professor Robert West and collaborator Ágnes Horvát argue that within roughly five years, AI will plausibly generate the hypotheses that humans then investigate, marking a shift from AI as a helper to AI as an originator of research questions.
These claims sit within a broader movement to build “autonomous agents for scientific discovery,” systems that can read papers, spot gaps, design experiments, control lab equipment, analyze results, and refine their own ideas in loops. Companies and labs, including OpenAI’s “OpenAI for Science” team, now explicitly aim to use frontier models like GPT‑5 to accelerate each step of research, from ideation through simulation to lab work.
The key question is not whether AI will assist science (that is already happening) but how far it will go toward independently choosing which hypotheses to pursue and how much human scientists will remain “in the loop” for judgment, ethics, and interpretation. For everyday readers, the confusion often centers on whether “AI doing science” means replacing human researchers, generating completely new theories, or mainly speeding up routine tasks in specialized domains such as drug discovery or materials design.
Key Takeaways
- Researchers at EPFL and other institutions say it is realistic that AI systems will generate and test scientific hypotheses within about five years, at least in some fields.
- Early systems already combine language models with lab robots and scientific databases to suggest experiments and run them with limited human input, especially in materials science and chemistry.
- Current studies find that AI‑generated hypotheses are often novel but still trail humans when tested in real experiments, showing both rapid progress and clear limitations.
How did we get to the point where AI can talk about “doing science”?
Over the past decade, AI has moved from narrow tools (such as pattern‑matching algorithms) to large language models (LLMs) that can read and write technical text, generate code, and synthesize literature across many fields. As these models improved, scientists began using them for tasks such as drafting sections of papers, writing analysis code, or scanning vast databases such as Semantic Scholar to find underexplored research directions.
In parallel, “self‑driving labs” emerged in materials science and chemistry, where robotic systems physically handle samples and instruments under algorithmic control. For example, MIT’s “CRESt” platform integrates machine learning, scientific literature, and robotic hardware so that the system can plan and run experiments to discover new materials, while monitoring conditions via cameras and vision‑language models. Researchers report that CRESt can make its own observations, suggest hypotheses about issues such as irreproducible results, and propose corrections that improve experiment consistency.
On the conceptual side, AI researchers have started to formalize “levels” of autonomous scientific agents, from basic assistants to fully autonomous systems that can manage an entire research cycle. One 2023–2025 line of work proposes a five‑level framework spanning hypothesis discovery, experimental design, tool use, tool creation, and analysis, with Level 5 describing agents that could autonomously turn abstract goals into validated scientific findings. While nobody claims current systems are at that top level, published surveys and experimental platforms show steady progress through the earlier stages, especially in tightly defined domains.
What does this “AI doing science” idea actually mean in practice?
In practice, the near‑term vision is that AI systems will take on more of the “closed loop” that runs from scientific reading through hypothesis generation, experiment design, and execution to analysis, particularly where tasks are scorable, data‑rich, and automatable. For example, in software engineering research, one AI‑based system described in 2024 automatically rewrites code to improve quality, generates multiple candidate solutions, and uses search to decide which versions deserve further testing, effectively exploring hypotheses about better algorithms on its own.
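To make the "closed loop" idea concrete, here is a minimal Python sketch of a propose, test, and refine cycle. It is a toy illustration only, not the method of any system named in this explainer. Everything in it is an assumption made for the example: the `Hypothesis` class, the `run_experiment` scoring function (a stand-in for a simulation or lab run), and the simple local search used to pick the next candidates.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    """Toy hypothesis: a guess that setting a parameter to x gives the best outcome."""
    x: float
    score: Optional[float] = None  # filled in once the experiment has run


def run_experiment(h: Hypothesis) -> float:
    """Hypothetical stand-in for a simulation or lab run (higher is better).

    The optimum at x = 3.0 is an arbitrary choice for this sketch and is
    hidden from the proposer, which only ever sees scores.
    """
    return -(h.x - 3.0) ** 2


def propose(best: Hypothesis, n: int, step: float, rng: random.Random) -> List[Hypothesis]:
    """Generate n candidate hypotheses close to the current best one."""
    return [Hypothesis(best.x + rng.uniform(-step, step)) for _ in range(n)]


def closed_loop(rounds: int = 20, n: int = 8, step: float = 1.0, seed: int = 0) -> Hypothesis:
    """Repeat: propose candidates, test each one, keep whichever scores best."""
    rng = random.Random(seed)
    best = Hypothesis(x=0.0)
    best.score = run_experiment(best)
    for _ in range(rounds):
        for candidate in propose(best, n, step, rng):
            candidate.score = run_experiment(candidate)
            if candidate.score > best.score:
                best = candidate
        # In a real workflow, a human review step could sit here before
        # the next round of proposals is generated.
    return best


if __name__ == "__main__":
    result = closed_loop()
    print(f"best x = {result.x:.3f}, score = {result.score:.4f}")
```

The sketch works only because the task is "scorable" in the sense described above: each candidate gets a number, and the loop can compare numbers without human judgment. Where outcomes are ambiguous or expensive to measure, the comparison step is exactly where human oversight would be expected to remain.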
In lab‑based science, platforms like CRESt already show how AI can integrate scientific literature, sensor data, and robotic control to discover new materials or troubleshoot experimental setups with limited human guidance. The system can, for instance, notice millimeter‑scale deviations in sample shapes or pipette positioning, hypothesize that these cause irreproducibility, and recommend adjustments that researchers can choose to adopt, which has improved consistency in reported tests.
Commercial and research groups also report that when given access to specialized tools and enough time, frontier models such as GPT‑5 can reason through complex problems more deeply than in short chats, suggesting concrete experimental directions or analysis pipelines. Benchmarks like OpenAI’s FrontierScience evaluate how well these systems perform expert‑level reasoning across physics, chemistry, biology, and other disciplines, with the goal of tracking progress toward AI that can reliably assist or partially automate scientific workflows. The “within five years” claim usually refers to reaching the point where, in some well‑structured areas, AI can autonomously propose testable hypotheses and run many of the required simulations or experiments, while humans supervise and interpret results.
Who is likely to be affected first, and how would this change their work?
The main groups directly affected in the near term are researchers in data‑rich, automation‑friendly fields such as computational biology, materials science, chemistry, and parts of computer science and engineering. In these areas, a combination of large digital datasets, established simulation tools, and robotic labs makes it technically feasible for AI agents to propose and test large numbers of hypotheses at relatively low cost, shifting human work toward designing goals, interpreting unexpected findings, and checking validity.
A 2025 study in Science looked at AI‑generated hypotheses in natural language processing and found that, when judged blindly, experts rated machine‑generated ideas as highly novel and often comparable in perceived feasibility to human ideas. However, when a new group of specialists ran real experiments on 24 AI‑generated and 19 human‑generated hypotheses, the human ideas performed better overall, suggesting that AI still struggles to reliably estimate what will work in practice. This implies that, at least for now, human researchers remain crucial for filtering and grounding AI‑suggested hypotheses.
Beyond academic labs, pharmaceutical companies, advanced manufacturing firms, and large tech companies are likely to use AI scientists to accelerate R&D pipelines where each experiment is expensive but highly automatable. Over time, universities and funding agencies may need to adapt how they evaluate scientific contributions, as some “ideas” or experimental campaigns originate in AI agents rather than individual human investigators. For students and early‑career researchers, this could shift training priorities toward skills such as prompt engineering for scientific tools, experiment oversight, and critical evaluation of AI‑driven results.
What does this trend not mean, and where are the current limits?
The “AI will do science in five years” line does not mean that machines will fully replace human scientists or independently drive all of science across every field. Current systems still have well‑documented weaknesses: they can exaggerate the importance of their own ideas, misjudge whether a hypothesis is actually testable, and make subtle errors in experiment design or interpretation that humans may catch only after hands‑on review.
The Science study on AI‑generated hypotheses highlights that while AI can produce large volumes of novel ideas, turning those into successful real‑world experiments still demands substantial human expertise and time. Similarly, in self‑driving labs like CRESt, the AI’s suggestions about irreproducibility or setup tweaks require human scientists to decide which changes to implement and how to interpret the outcomes in the context of broader theory. Surveys of autonomous scientific agents emphasize that reaching full Level 5 autonomy would require robust tool creation, deep causal understanding, and highly reliable self‑correction mechanisms, all of which remain research challenges rather than solved problems.
Ethically and socially, the trend does not automatically resolve questions about scientific priorities, fairness, or value judgments in research. EPFL’s West and Horvát note that if AI starts choosing which hypotheses to pursue, societies risk losing some control over what questions get asked and whose interests are reflected in research agendas. Human institutions (universities, funders, regulators) still set the norms, constraints, and goals that shape how AI tools are built and used.
What should people watch next as AI moves deeper into scientific research?
Over the next few years, key signals to watch include how often AI‑generated hypotheses lead to peer‑reviewed, experimentally validated discoveries, especially in benchmarked domains like materials, drug discovery, and machine learning research itself. Conferences such as “Agents4Science 2025,” which plans to feature AI tools as both authors and reviewers, will also show how far the research community is willing to integrate AI into core scientific communication and evaluation.
Technical roadmaps for “AI scientists” suggest that progress will hinge on tighter integration between language models, symbolic reasoning, simulation engines, and physical robotics to create reliable closed‑loop discovery systems. Industry initiatives like OpenAI for Science, and similar efforts at major tech and pharma companies, will test whether such systems can consistently improve R&D productivity without sacrificing reliability or safety. Policymakers and professional societies are likely to respond with new guidelines around authorship, accountability, data access, and safety when AI agents design or run experiments, an area that remains in early discussion stages.
How we know this: This explainer draws on interviews with EPFL researchers on AI’s role in hypothesis generation, peer‑reviewed and preprint studies on AI‑generated scientific hypotheses, technical reports and news releases on self‑driving labs such as MIT’s CRESt platform, surveys of autonomous scientific agents, and documentation and benchmarks from AI labs developing tools like GPT‑5 and the FrontierScience evaluation suite.



