Deep Patient and Deeper Problems: Barriers to AI-powered Diagnoses in Healthcare
7-minute read
The value of Electronic Health Records (EHRs) has been evident during the COVID-19 pandemic. But so too has the reality that structural biases in healthcare can disproportionately affect marginalized populations. If proper safeguards are not deployed when Artificial Intelligence (AI) is used to predictively diagnose patients, it will further exclude certain populations from adequate medical care and resources.
AI predictive models are set to play a prominent role in healthcare research and practice. The opportunities offered by AI-powered analytics and, more specifically, Machine Learning (ML) in the medical arena are already being realized by organizations like Google Cloud. A 2020 study by Google Health and clinical researchers suggested that AI models could make some radiology diagnoses more accurately than physicians. AI has also been found effective at detecting conditions that are typically difficult for physicians to identify, such as rare hereditary and neurodegenerative diseases. However, since IBM effectively “dump[ed]” Watson Health, the AI system that seemed poised to revolutionize cancer diagnosis, and sold it for parts to a private equity firm at the start of 2022, experts have warned against using AI in healthcare without proper safeguards. The relatively disappointing outcomes achieved by Watson Health, a far cry from the diagnostic acceleration it promised, reflect how difficult it is to source representative patient data to train accurate and appropriate AI models.
Despite the recent buzz about Watson Health, or rather its failures, the use of big data analytics and AI to predict healthcare outcomes has long been a reality. In 2015, Miotto and his colleagues applied deep learning to train an AI system, Deep Patient, to recognize patterns across patients’ health records and thereby predict disease. Like Watson Health, Deep Patient was trained on patients’ EHRs; specifically, the researchers trained it on over 700,000 patient records taken from New York’s Mount Sinai Hospital databases.
By training the AI on these EHRs, which included information on individuals’ doctor’s visits, medications, and test results, the team enabled Deep Patient to recognize patterns across records and predict disease. Deep Patient has since proven highly successful at analyzing records to predict conditions like severe diabetes, schizophrenia, and various cancers. But despite the benefits that systems like Deep Patient could bring to resource-strained healthcare systems, their efficiency comes at a cost. Issues with models’ training data, selection criteria, and historic biases risk embedding those biases into the decisions of predictive AIs, causing fatal misdiagnoses or under-diagnoses and further marginalizing communities that already have strained relationships with healthcare institutions. In the case of Deep Patient, the “black box” nature of its predictive analyses further magnifies the potential impact of those biases.
How do systems like Deep Patient work? Deep Patient is built on a neural network trained through unsupervised Machine Learning (ML): it produces an output by passing signals across interconnected layers of simulated neurons. Through such layer-by-layer learning, neural network applications can train themselves without significant human oversight. Deep Patient’s ability to quickly process thousands of records to diagnose patients, a feat that would be unachievable by doctors alone, could relieve pressure on healthcare institutions across the US, which have seen record highs of COVID-19 cases, employee burnout, and staffing shortages since the start of the pandemic. The technology’s scalability and its centralized nature could let providers use the time and money saved through predictive analysis to standardize better healthcare across states.
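To make the idea concrete, here is a minimal sketch, written in PyTorch, of how a Deep Patient-style pipeline could be structured: an unsupervised autoencoder learns a compact representation of each patient’s EHR features, and a simple classifier then predicts disease risk from that representation. The feature counts, noise level, and training loop below are illustrative assumptions, not the published system.

```python
# Minimal sketch (not the authors' code) of a Deep Patient-style pipeline:
# an unsupervised denoising autoencoder learns a compact representation of
# EHR feature vectors, and a separate classifier predicts disease risk from
# that representation. Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn

N_FEATURES = 500   # e.g., counts of diagnoses, medications, lab tests per patient
N_HIDDEN = 100     # size of the learned patient representation

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_features: int, n_hidden: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_features), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Randomly zero out about 10% of each record's entries so the network
        # must learn robust patterns rather than copy the record back out.
        noisy = x * (torch.rand_like(x) > 0.1).float()
        return self.decoder(self.encoder(noisy))

# Unsupervised phase: reconstruct raw EHR vectors (no diagnosis labels needed).
ehr = torch.rand(1024, N_FEATURES)            # stand-in for preprocessed EHR data
dae = DenoisingAutoencoder(N_FEATURES, N_HIDDEN)
optimizer = torch.optim.Adam(dae.parameters(), lr=1e-3)
for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(dae(ehr), ehr)
    loss.backward()
    optimizer.step()

# Supervised phase: a simple classifier maps the learned "deep patient"
# representation to a disease risk score; it would be trained on labeled diagnoses.
with torch.no_grad():
    representation = dae.encoder(ehr)
classifier = nn.Linear(N_HIDDEN, 1)           # one disease; real systems predict many
```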
But there are barriers to Deep Patient’s valiant goals. For one, algorithms used for predictive healthcare analyses risk producing decisions that are systematically less accurate for patients with certain protected characteristics like race, gender, ethnicity, or socio-economic status. Neural network AI is not designed to be transparent, lacking what the AI community calls interpretability. Moreover, systems like Deep Patient struggle to justify their diagnoses, meaning the rationales behind their decisions are not explainable. Given Deep Patient’s lack of interpretability and explainability, it is hard to establish, or to rule out, the possibility that the system harbors biases based on its design or training data. In practice, this means that even if doctors suspect discriminatory outcomes from predictive AI, they are unable to challenge the algorithm’s diagnoses or identify the mechanisms that produce the bias.
But how does predictive health AI end up biased in the first place? Bias can be introduced through differences between the patient data used to train a predictive AI and the patient data it later makes predictions about. Hospitals that use predictive AI models may serve populations with different demographics and different healthcare needs from those on which the model was trained. Imagine, for example, that Deep Patient had been deployed in states where certain minority groups, underrepresented in the Mount Sinai training data, made up a large share of the population. In that case, Deep Patient’s disease prediction algorithms could perform worse at recognizing health patterns among those minorities. This disparity could lead to disparate impacts for minority groups, including underdiagnosis. Recent trials of AI-driven healthcare prediction algorithms suggest they can systematically underdiagnose female, Black, and Hispanic patients, younger patients, and patients of lower socioeconomic status. Underdiagnosing or misdiagnosing patients can have fatal consequences when it comes to high-risk healthcare predictions like cancer.
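The mechanism is easy to demonstrate with synthetic data. The sketch below (a toy simulation, not real patient data) trains a simple classifier on a dataset in which one group makes up only a small fraction of the records and has slightly different risk factors; the model’s recall, its ability to catch true cases, ends up noticeably lower for the underrepresented group.

```python
# Toy illustration of how underrepresentation in training data can degrade
# performance for a minority group: group B makes up only 5% of the training
# set and its risk factors differ slightly from group A's.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def make_group(n, weights):
    """Simulate EHR-like features and a disease label driven by group-specific weights."""
    X = rng.normal(size=(n, 5))
    y = (X @ weights + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

w_a = np.array([1.0, 0.8, 0.0, 0.0, 0.0])   # group A: disease driven by features 0-1
w_b = np.array([0.0, 0.0, 1.0, 0.8, 0.0])   # group B: disease driven by features 2-3

Xa, ya = make_group(9500, w_a)               # group A dominates the training data
Xb, yb = make_group(500, w_b)                # group B is underrepresented
model = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# Evaluate on fresh, equally sized test sets for each group.
for name, w in [("A", w_a), ("B", w_b)]:
    X_test, y_test = make_group(2000, w)
    print(name, "recall:", round(recall_score(y_test, model.predict(X_test)), 2))
# Recall comes out much lower for group B: the model misses more true cases
# (underdiagnosis) in the group it rarely saw during training.
```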
The risk that AI diagnoses discriminate against marginalized communities is amplified by historic human biases that may exist in the EHRs used to train models. Due to historic barriers to accessing healthcare resources, oppression, and discrimination, under-served populations, especially people of color, are less likely to have comprehensive medical records. If models are trained on EHRs that are inaccurate or incomplete primarily for people of color, they will be disproportionately worse at diagnosing these individuals. Moreover, certain algorithms have been trained to predict disease using variables like healthcare costs as proxies for health. When this proxy selection occurs, algorithms have been shown to systematically underestimate the needs of Black patients compared to white patients, even when those patients had similar markers of health. Such discriminatory outcomes, rooted in existing racial disparities in healthcare access, could deepen the historically motivated lack of trust between minority populations and healthcare providers.
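A toy calculation shows how the cost proxy goes wrong. In the sketch below, both groups have identical underlying illness, but one group incurs lower costs for the same level of illness; ranking patients by predicted cost then systematically under-flags that group for extra care. The numbers are invented purely for illustration.

```python
# Toy illustration of the cost-as-proxy problem: two groups have the same
# underlying illness burden, but group B incurs lower costs per unit of
# illness (e.g., because of reduced access to care). A model that targets
# cost will then rank group B as "healthier" than it really is.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

illness = rng.gamma(shape=2.0, scale=1.0, size=n)   # true health need, same for both groups
group_b = rng.random(n) < 0.5                        # membership flag for group B
spending_rate = np.where(group_b, 700.0, 1000.0)     # group B spends less per unit of illness
cost = illness * spending_rate + rng.normal(scale=200, size=n)

# A "risk score" trained to predict cost is, in this toy setup, just cost itself.
# Flag the top 20% of predicted cost for extra care resources.
flagged = cost >= np.quantile(cost, 0.80)

for name, mask in [("A", ~group_b), ("B", group_b)]:
    print(f"group {name}: mean illness among flagged = "
          f"{illness[mask & flagged].mean():.2f}, "
          f"share of group flagged = {flagged[mask].mean():.1%}")
# Group B patients must be sicker to clear the same cost threshold, so equally
# ill group B patients are less likely to be flagged for additional care.
```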
Regardless of the efficiency and staffing benefits that AI could bring to healthcare institutions, healthcare decisions are too high-risk to be automated using potentially biased models. Due diligence tests and safeguards must be enforced before predictive AI is widely adopted for medical diagnoses. There is increasing recognition that defining fairness in ML applications is difficult and that not all fairness metrics can be satisfied simultaneously. Nonetheless, audits should be conducted to assess whether any AI system used for diagnosis is biased. In particular, a complete and robust analysis must be made of the accuracy of such predictive models for each group in their training data. Breaking down a model’s accuracy by gender, age, race, and ethnicity is possible because the EHRs used as training data typically retain such demographic details. However, attention must be paid to the privacy and other legal implications of such analysis. Audits should also assess the composition of EHR training data to gauge whether it adequately represents the populations on which the AI system will be deployed. Where this auditing can be performed by an external entity, such as the FDA, that could provide additional impartiality.
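In practice, the per-group analysis described above can be a short script run over a model’s predictions, wherever the demographic fields are available and their use is legally permitted. The sketch below assumes a hypothetical table of predictions with columns such as has_disease and predicted_disease; it reports each group’s share of the data alongside the model’s accuracy and missed-diagnosis rate for that group.

```python
# A minimal sketch of the kind of audit described above: given a model's
# predictions, the true outcomes, and the demographic columns retained in the
# EHR data, report each group's share of the data and the model's accuracy and
# false-negative (missed-diagnosis) rate within that group. Column names are
# assumptions for illustration.
import pandas as pd

def audit_by_group(df: pd.DataFrame, group_col: str,
                   label_col: str = "has_disease",
                   pred_col: str = "predicted_disease") -> pd.DataFrame:
    """Per-group representation and performance metrics for a diagnostic model."""
    rows = []
    for group, g in df.groupby(group_col):
        positives = g[g[label_col] == 1]
        rows.append({
            group_col: group,
            "share_of_data": len(g) / len(df),
            "accuracy": (g[pred_col] == g[label_col]).mean(),
            # False-negative rate: the fraction of truly ill patients the model misses.
            "false_negative_rate": (positives[pred_col] == 0).mean() if len(positives) else float("nan"),
        })
    return pd.DataFrame(rows)

# Example usage with a hypothetical predictions table:
# results = audit_by_group(predictions_df, group_col="race")
# print(results.sort_values("false_negative_rate", ascending=False))
```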
Performing audits and analyzing performance metrics are necessary steps toward accountability for predictive AI systems in healthcare. Human oversight will also be essential. Any attempt to integrate AI-based diagnoses into healthcare decision-making should: center patient consent and autonomy; adequately inform doctors of the AI’s expected accuracy and performance metrics; and ensure that doctors always deliver the final judgement on any diagnosis the model produces.