Researchers create novel framework for large-scale observational studies
A mother’s health during pregnancy, childbirth and the postpartum period is the foundation of lifelong well-being, directly influencing a child’s development and long-term outcomes. Yet most electronic health record (EHR) systems lack a reliable, standardized method to link mothers with their children.
Researchers from Regenstrief Institute, the Indiana University School of Medicine and partner institutions have developed and validated a first-of-its-kind, large-scale probabilistic maternal–child record linkage algorithm using routinely collected EHR data. The retrospective cohort study, conducted as part of the real-world evidence core of the Maternal and Pediatric Precision in Therapeutics (MPRINT) collaborative, demonstrated that machine learning can reliably identify maternal–child relationships across different health systems, a breakthrough for understanding how a mother’s health influences her child’s outcomes over time.
With the evidence collected in this study, researchers can more effectively pursue large-scale observational studies on maternal–child medication effects, congenital conditions and other health outcomes across expansive EHR populations.
“The health of a mother — including medications she takes and illnesses she has — can affect a child immediately, or not until years later. Without reliable linkages, it’s been hard for researchers to follow these relationships over time,” said lead author Colin Rogerson, M.D., MPH, Regenstrief and IU School of Medicine Research Scientist.
By establishing accurate maternal–child linkages at scale, researchers can now examine how prenatal exposures influence childhood development and long-term outcomes, including congenital diseases, neurodevelopmental disorders such as autism and ADHD, chronic conditions like asthma and allergies, and rarer diseases that have historically been difficult to study.
While several prior studies have attempted to link mothers and children using state or national datasets with mixed success, this study is the first to accurately achieve large-scale maternal–child linkage by applying machine learning to universally collected EHR demographic data across an expansive, statewide health information exchange.
“No one before this has been able to do what we’ve done here,” said Dr. Rogerson. “Other researchers have tried this with administrative or state-level data, but it has been hard to generalize their results. Our approach uses standard information that every hospital collects, which means other states and health systems should be able to use the same algorithm and achieve similar results.”
Using demographic features such as name, birthdate, phone number and address, the research team applied an XGBoost machine learning model to more than 82 million records, evaluating 6.2 billion potential maternal–child pairs. The algorithm achieved 92 percent accuracy, 98 percent precision and an F1-score of 92 percent, indicating strong performance in identifying true maternal–child connections at scale.
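To illustrate the general idea of comparing demographic fields across a candidate pair, the following is a minimal Python sketch. It is not the study's implementation: the `Record` fields, the `normalize` helper, the `pair_features` function and the specific comparisons are all assumptions made for illustration, and the sample values are invented. In a pipeline of this kind, numeric features like these would typically be passed to a classifier such as XGBoost, which then scores how likely the pair is to be a true mother and child.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    # Hypothetical, simplified demographic record; field names are illustrative only.
    last_name: str
    birthdate: date
    phone: str
    address: str


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits so formatting differences are ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def pair_features(mother: Record, child: Record) -> dict[str, float]:
    """Turn one candidate mother-child pair into numeric comparison features."""
    age_gap_years = (child.birthdate - mother.birthdate).days / 365.25
    return {
        "same_last_name": float(normalize(mother.last_name) == normalize(child.last_name)),
        "same_phone": float(normalize(mother.phone) == normalize(child.phone)),
        "same_address": float(normalize(mother.address) == normalize(child.address)),
        "age_gap_years": round(age_gap_years, 1),
    }


# Invented example records; formatting differs but the underlying values agree.
mother = Record("Smith", date(1990, 5, 1), "(317) 555-0100", "12 Oak St.")
child = Record("SMITH", date(2020, 5, 1), "317-555-0100", "12 Oak St")
features = pair_features(mother, child)
print(features)
```

Exact-match comparisons after normalization are the simplest possible choice here; a real linkage system would more plausibly use fuzzy string similarity and many more fields, which this sketch leaves out.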
“Linking mothers and children in electronic health records has been a longstanding challenge,” said Regenstrief Vice President for Data and Analytics Shaun Grannis, M.D., M.S. “By leveraging high-quality real-world data and modern machine learning, this work demonstrates how we can responsibly apply AI to answer questions that matter for public health. The ability to generate reliable maternal-child linkages across systems opens the door to discoveries that weren’t possible before.”
“Derivation and validation of an algorithm for maternal–child linkage in electronic health records” is published in the Journal of the American Medical Informatics Association. This work was supported by grant 5P30HD106451-04 from the National Institutes of Health’s National Institute of Child Health and Human Development.
Authors and affiliations, as listed in the publication
Colin M. Rogerson¹,², Christopher W. Bartlett³,⁴, John Price², Lang Li⁴, Eneida A. Mendonca⁵,⁶, Shaun Grannis¹,²
¹ Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN, 46202, United States.
² Regenstrief Institute, Indianapolis, IN, 46202, United States.
³ Department of Pediatrics, The Ohio State University, Columbus, OH, 43210, United States.
⁴ Office of Data Sciences, The Steve & Cindy Rasmussen Institute for Genomic Medicine, Abigail Wexner Research Institute, Nationwide Children’s Hospital, Columbus, OH, 43205, United States.
⁵ Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, 45229, United States.
⁶ Department of Pediatrics, University of Cincinnati, Cincinnati, OH, 45229, United States.
Colin Rogerson, M.D., MPH
In addition to his role as a research scientist with the Clem McDonald Center for Biomedical Informatics at Regenstrief Institute, Colin Rogerson, M.D., is an assistant professor of pediatrics at Indiana University School of Medicine and a practicing pediatric intensive care physician.
Shaun Grannis, M.D., M.S.
In addition to his role as vice president for data and analytics and research scientist with the Clem McDonald Center for Biomedical Informatics at Regenstrief Institute, Shaun Grannis, M.D., M.S., is the Regenstrief Professor of Medical Informatics and a professor of family medicine at Indiana University School of Medicine. He is also an adjunct professor with the Indiana University Richard M. Fairbanks School of Public Health at IU Indianapolis and at the Indiana University Indianapolis Luddy School of Informatics, Computing and Engineering.




