
Why We Need to Fix Our Data Before We Fix Health Care with AI

March 27, 2026

Kevin Wiley, Ph.D., M.P.H., is an assistant professor in the Department of Healthcare Leadership and Management at the Medical University of South Carolina College of Health Professions. His work sits at the intersection of health informatics, data quality, and health services research, exploring how clinical data, particularly electronic health records, shape care delivery, analytics, and the responsible use of emerging technologies such as artificial intelligence.

For several years, I have taught in the Master of Science in Health Informatics (MSHI) program at the Medical University of South Carolina (MUSC), and a familiar pattern emerges each semester. Students arrive eager to work with artificial intelligence (AI) and machine learning (ML), aiming to build predictive models, automate quality measurement, and develop clinical decision support tools. From their perspective, the technical work seems the most difficult: selecting algorithms, tuning parameters, and validating performance.

In practice, that is rarely where the real challenge lies.

When students are given real health care datasets such as electronic health records (EHR), administrative claims files, or survey data and asked to build something meaningful, progress often stalls. Not because the modeling techniques are too complex, but because the data themselves are extraordinarily difficult to work with.

Lab results may be missing for large portions of the population without explanation. Diagnosis codes may reflect billing practices rather than clinical reality. Utilization metrics change depending on query methods, and variables that appear straightforward often conceal clinical nuance. With EHR data, phenomena such as informed presence bias, in which patients who interact with the health system more often appear sicker simply because they are observed more frequently, further complicate interpretation. Model performance can vary dramatically across race, insurance status, or geography, illustrating deeper measurement issues.
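A small sketch illustrates how structural missingness can hide in aggregate statistics. The dataset, column names (`a1c`, `insurance`), and values below are hypothetical, not drawn from any real EHR extract:

```python
# Hypothetical sketch: quantifying missingness by subgroup in an EHR extract.
# All data here are synthetic; column names are illustrative only.
import numpy as np
import pandas as pd

ehr = pd.DataFrame({
    "patient_id": range(8),
    "insurance": ["private"] * 4 + ["uninsured"] * 4,
    "a1c": [6.1, 5.8, 7.2, 6.4, np.nan, np.nan, 6.9, np.nan],
})

# The overall missingness rate looks like a single, manageable number...
overall = ehr["a1c"].isna().mean()  # 0.375

# ...but grouping by coverage reveals that the gaps are concentrated
# in one population, a pattern no global imputation step will explain.
by_group = ehr.groupby("insurance")["a1c"].apply(lambda s: s.isna().mean())
print(by_group)  # private: 0.0, uninsured: 0.75
```

A check like this takes minutes, yet it surfaces exactly the kind of measurement problem that would otherwise emerge only after a model underperforms for a specific population.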

These challenges are not artifacts of classroom exercises. They reflect the reality of health care data.

Clinical and administrative records were designed for billing, documentation, and workflow—not research or analytics. Missing data are often structural, coding practices vary across institutions, and social and environmental factors affecting health outcomes are inconsistently captured. Even when social needs are documented, they may not inform meaningful interventions.

Students, therefore, learn two lessons at once. The first is technical: acquiring skills in data cleaning, feature engineering, model validation, and ethical considerations. Equally important are the professional norms they absorb along the way. They learn that messy data are normal, statistical adjustments are expected, and strong performance metrics may mask underlying gaps.

AI in health care is no longer confined to academic exercises. These systems are embedded in real-world health care decisions, from identifying high-risk patients to guiding interventions and informing reimbursement. If data quality is treated as just another analytic hurdle, sophisticated systems risk being built on unstable foundations. Poor data can skew performance metrics, weaken risk models for certain populations, and unintentionally reinforce health care disparities while appearing objective.

In MUSC’s MSHI program, classroom exercises illustrate these dynamics. Evaluating model performance across subgroups and comparing structured fields to clinical notes shows how context is lost in discrete variables. These experiences reveal that many analytic limitations stem from deeper measurement challenges.

For AI and ML to fulfill their promise in health care, data quality must be prioritized. In education, this means emphasizing data literacy alongside modeling techniques: assessing patterns of missingness, evaluating representativeness, and documenting limitations. Focusing exclusively on performance metrics such as the area under the receiver operating characteristic curve (AUROC) or the area under the precision-recall curve (AUPRC) without examining data integrity risks reinforcing the wrong priorities.
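The gap between those two metrics is itself instructive. The sketch below, on fully synthetic labels and scores, shows how a rare clinical outcome can produce a comfortable-looking AUROC while the AUPRC tells a much more sobering story:

```python
# Hedged illustration: AUROC vs. AUPRC on an imbalanced, synthetic outcome.
# Prevalence and score distributions are invented for demonstration.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
y = rng.binomial(1, 0.02, n)  # a rare outcome: ~2% prevalence

# Scores with only modest separation between cases and non-cases.
scores = np.where(
    y == 1,
    rng.normal(0.7, 0.2, n),
    rng.normal(0.5, 0.2, n),
)

auroc = roc_auc_score(y, scores)
auprc = average_precision_score(y, scores)
print(f"AUROC: {auroc:.2f}  AUPRC: {auprc:.2f}")
```

On data like these, the AUROC lands in the mid-0.70s while the AUPRC sits far lower, because precision is dragged down by the 98% of patients who never experience the outcome. Neither number, of course, says anything about whether the underlying labels were measured well in the first place.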

At the organizational level, the same principle applies. Quality measurement systems require clearer and more explicit data standards. As digital measures and automated analytics become more common, the frameworks for data quality must evolve. Reporting should transparently address missingness, inter-institution variation, and measurement uncertainty. Validation must assess whether the underlying data environment supports accurate measurement, not just whether specifications were followed.

Professional societies, accreditation bodies, and regulatory agencies shape these standards, influencing what data are collected and how institutions allocate resources for governance. If AI is to improve rather than undermine health care quality, the supporting infrastructure must be reliable.

My MSHI students are thoughtful and deeply motivated to improve health care through informatics. Yet the individuals who will ultimately have the greatest impact may not be those who develop the most sophisticated models. They will be the ones who pause and ask a more fundamental question: “Can the available data meaningfully answer the question we are trying to address?”

The future of health care quality will not be determined solely by advances in algorithms. It will depend on whether the health care system is willing to invest as seriously in data integrity, transparency, and governance as it does in AI. The classroom reveals this tension each semester. The broader health care system must decide whether it is prepared to confront it.

Currently, we train powerful algorithms on imperfect data. Until we address that foundation, even the most advanced AI will fall short of building the health care system we aspire to.

For professionals who want to lead the next era of health care transformation, the Master of Science in Health Informatics (MSHI) program at the Medical University of South Carolina prepares graduates to navigate the complex intersection of health care data, technology, and decision-making. Through hands-on experience with real clinical datasets, students develop the technical expertise and critical data literacy needed to responsibly design, evaluate, and implement advanced analytics, artificial intelligence, and digital health solutions.

Graduates of MUSC’s MSHI program are prepared to lead in hospitals, health systems, health technology companies, consulting organizations, and public health agencies, ensuring that the data powering tomorrow’s health care innovations are accurate, reliable, and meaningful.


Explore the MUSC MSHI program and learn how you can become a leader in health informatics and data-driven health care.
