Validation of a machine learning model for predicting early deterioration in the emergency department.

doi:10.1016/j.ajem.2026.05.007

What this article is about

Quick Answer

This 2026 study evaluates two XGBoost models that combine structured ED triage data with BioClinicalBERT embeddings of nursing triage notes to predict early deterioration (ICU admission or death within 7 days). Model B, which weights the early deterioration class more heavily, achieved a recall of 0.77, a precision of 0.22, and an ROC-AUC of 0.90, compared with 0.66, 0.17, and 0.75 for Model A. Triage note embeddings added incremental value, but SHAP analysis identified age, respiratory rate, and systolic blood pressure as the dominant individual predictors. The authors caution that such tools should complement clinical judgment and require prospective validation before clinical use.

Student takeaways

Key Takeaways

Two XGBoost models were developed using structured ED triage data (demographics, vital signs, eCTAS) combined with BioClinicalBERT embeddings from free-text nursing notes.
Model A used standard class weighting; Model B applied increased weighting to the 'early deterioration' class (ICU admission or death within 7 days).
The prevalence of early deterioration in the study cohort was 4.5%.
Model B showed improved recall (0.77 vs. 0.66 for Model A) and precision (0.22 vs. 0.17 for Model A), with a higher ROC-AUC (0.90 vs. 0.75).
Although XGBoost's internal feature importance attributed most of the predictive weight to the free-text embeddings, SHAP analysis identified age, respiratory rate, and systolic blood pressure as the dominant individual predictors.

Student summary

Why This Research Matters

This article, published in The American Journal of Emergency Medicine on January 1st, 2026 (DOI: 10.1016/j.ajem.2026.05.007), explores a critical issue in emergency department (ED) care: the early identification of patients at risk for deterioration. The study focuses on developing and evaluating machine learning models to predict such deterioration before an initial physician assessment, aiming to create a tool that can help prioritize patient care effectively.

The research problem is framed around the limitations of traditional early warning scores in dynamic ED environments. These scores often rely solely on structured triage data (like vital signs) and may not capture the full clinical picture conveyed by nursing assessments or free-text notes, leading to poor performance. The authors propose that integrating natural language processing with machine learning could offer a more nuanced approach.

The study analyzed 17,481 consecutive adult ED visits over six months. Structured variables such as demographics (age), vital signs (respiratory rate, systolic blood pressure), and eCTAS scores were combined with BioClinicalBERT-derived embeddings from free-text nursing triage notes to form a multimodal feature representation. Two XGBoost models were trained on the same binary classification task: predicting 'early deterioration' (defined as ICU admission or death within 7 days, which had a prevalence of 4.5% in this dataset) versus all other outcomes. The models differed only in class weighting. Model A used standard class weighting, while Model B applied increased weighting to the early deterioration class to prioritize identifying high-risk patients.
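The effect of class weighting can be illustrated with a weighted log loss. The sketch below is a minimal illustration, not the authors' code. The abstract does not report the weights used for either model, so the weight values here (1.0 and the negative-to-positive ratio implied by the 4.5% prevalence), the toy data, and the helper name `weighted_log_loss` are all assumptions. In the XGBoost library, a positive-class weight of this kind is usually supplied through the `scale_pos_weight` parameter.

```python
import math

def weighted_log_loss(labels, probs, pos_weight=1.0):
    """Binary log loss averaged over visits, with positive-class errors scaled by pos_weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1 - p)
    return total / len(labels)

# Reported prevalence of early deterioration.
prevalence = 0.045
# Assumed heuristic (not from the paper): weight positives by the negative-to-positive ratio.
balanced_weight = (1 - prevalence) / prevalence

# Hypothetical toy data: one missed deteriorating patient and three correct negatives.
labels = [1, 0, 0, 0]
probs = [0.10, 0.05, 0.05, 0.05]

unweighted = weighted_log_loss(labels, probs, pos_weight=1.0)
upweighted = weighted_log_loss(labels, probs, pos_weight=balanced_weight)
print(round(balanced_weight, 1), unweighted < upweighted)
```

Raising the positive-class weight makes a missed deteriorating patient cost more during training, which pushes the model toward higher recall, usually at the price of more false alarms.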

The reported metrics show how the models perform. Model A achieved a recall (sensitivity) of 0.66, meaning it correctly identified 66% of patients who actually deteriorated; its precision was 0.17, meaning that only 17% of the patients it flagged as high-risk were true positives; and its ROC-AUC (Area Under the Receiver Operating Characteristic Curve) was 0.75. Model B improved on all three: recall rose to 0.77, precision to 0.22, and ROC-AUC to 0.90. Precision remains low for both models, so most flagged patients do not go on to deteriorate. With a prevalence of only 4.5%, however, a precision of 0.22 still means that a flagged patient is roughly five times more likely to deteriorate than an average ED patient.
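The arithmetic behind recall and precision can be worked through by backing out the counts they imply. The sketch below assumes a hypothetical cohort of 10,000 visits at the reported 4.5% prevalence. The resulting counts are derived for illustration only and are not figures from the study, and the function name `implied_counts` is an invention of this example.

```python
def implied_counts(n_visits, prevalence, recall, precision):
    """Back out confusion-matrix counts implied by prevalence, recall, and precision."""
    positives = n_visits * prevalence
    tp = positives * recall          # deteriorating patients correctly flagged
    fn = positives - tp              # deteriorating patients missed
    flagged = tp / precision         # all patients flagged as high risk
    fp = flagged - tp                # flagged patients who did not deteriorate
    tn = n_visits - positives - fp   # unflagged patients who did not deteriorate
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn, "flagged": flagged}

# Hypothetical cohort size; prevalence, recall, and precision are the reported values.
model_a = implied_counts(10_000, 0.045, recall=0.66, precision=0.17)
model_b = implied_counts(10_000, 0.045, recall=0.77, precision=0.22)

for name, counts in (("A", model_a), ("B", model_b)):
    print(name, {k: round(v) for k, v in counts.items()})
```

Under these assumptions, Model B misses fewer deteriorating patients than Model A while also flagging fewer patients overall, which is what higher recall combined with higher precision implies.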

The authors highlight an important point about feature importance: although XGBoost's internal analysis attributed the majority of predictive weight to the free-text embeddings from nursing notes, SHAP (SHapley Additive exPlanations) analysis identified age, respiratory rate, and systolic blood pressure as the dominant individual contributors. In other words, the text data adds value beyond the structured variables, as confirmed by the authors' ablation analysis, while age and these two vital signs remain the strongest individual predictors.
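To make the SHAP result more concrete, the following sketch computes exact Shapley values for a single patient under a toy risk score. It is an illustration under stated assumptions, not the authors' analysis: the function `toy_risk`, its coefficients, the patient values, and the reference values are all invented, and the brute-force enumeration stands in for the tree-specific algorithms that SHAP tooling applies to models such as XGBoost.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction; absent features take baseline values."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features in `subset` take the patient's values; the rest take baseline values.
        mixed = {k: (x[k] if k in subset else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

# Hypothetical toy risk score; the coefficients are invented for illustration only.
def toy_risk(f):
    score = 0.01 * f["age"] + 0.03 * f["resp_rate"] - 0.005 * f["systolic_bp"]
    if f["resp_rate"] > 24 and f["systolic_bp"] < 100:
        score += 0.5  # interaction: fast breathing together with low blood pressure
    return score

patient = {"age": 82, "resp_rate": 28, "systolic_bp": 92}      # hypothetical patient
reference = {"age": 50, "resp_rate": 16, "systolic_bp": 125}   # hypothetical baseline

phi = shapley_values(toy_risk, patient, reference)
print({k: round(v, 3) for k, v in phi.items()})
```

The attributions sum to the difference between the patient's score and the reference score. That additivity is what allows SHAP to rank individual features by their contribution to a specific prediction.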

For students appraising this research, it's essential to consider several aspects. First, the study is a validation of machine learning models in an ED setting, which has direct implications for nursing practice and patient safety. The use of free-text notes is particularly relevant as nurses are often responsible for documenting these initial assessments. However, students should also critically evaluate the limitations mentioned by the authors: the primary outcome was defined retrospectively (ICU admission or death within 7 days), models were trained on data from a single institution over six months, and while they show promise, safe clinical adoption requires prospective shadow testing in real-time workflows to assess operational feasibility and impact on decision-making. The study does not claim these tools should replace clinical judgment but rather function as an adjunct layer of situational awareness.

When considering source and rights cautions, the paper is sourced from PubMed with a high confidence score (98) for its metadata record. It's important to verify publisher access details before making strong reuse claims or decisions about monetization due to incomplete rights/access information in the provided metadata. The journal, *The American Journal of Emergency Medicine*, is a reputable source.

A nurse would reason from this evidence by understanding that AI-driven risk prioritization tools could help identify patients who might otherwise be missed by traditional scores, potentially leading to earlier interventions and improved outcomes. However, nurses must remember these are adjuncts; their clinical expertise remains paramount for interpreting the model's outputs and making final care decisions. The study suggests a future where such technology supports nursing vigilance in the fast-paced ED environment.

In summary, this research demonstrates promising steps towards using machine learning to enhance early deterioration detection in emergency departments by leveraging both structured data and free-text nursing notes. While not perfect, these models offer potential for improved patient safety when integrated thoughtfully into clinical workflows.

Source abstract

Study Overview

Early recognition of patients at risk for deterioration in the emergency department (ED) is critical for patient safety. Traditional early warning scores rely on structured triage data and often perform poorly in the dynamic ED environment. We developed and evaluated two machine learning models integrating structured triage data with transformer-based embeddings of free-text nursing triage notes to predict early clinical deterioration prior to initial physician assessment, designed as a risk-based prioritization tool to rank patients by predicted probability of adverse outcome. We analyzed 17,481 consecutive adult ED visits over six months. Structured variables (demographics, vital signs, eCTAS scores) were combined with BioClinicalBERT-derived embeddings from free-text nursing triage notes to form a multimodal feature representation. Two XGBoost models (A, B) were trained on the same binary classification task, predicting "early deterioration" (ICU admission or death within 7 days, prevalence 4.5%) versus all other outcomes, differing only in class weighting. Model A used standard class weighting; Model B applied increased weighting to the early deterioration class to prioritize identification of high-risk patients. Model A achieved a recall of 0.66 (95% CI: 0.59-0.73), precision of 0.17 (95% CI: 0.15-0.20), and ROC-AUC of 0.75 (95% CI: 0.72-0.79). Model B improved recall to 0.77 (95% CI: 0.72-0.84), precision to 0.22 (95% CI: 0.19-0.25), and ROC-AUC to 0.90 (95% CI: 0.88-0.92). While XGBoost's internal feature importance attributed the majority of predictive weight to free-text embeddings, SHAP analysis identified age, respiratory rate, and systolic blood pressure as the dominant individual contributors, with triage note embeddings providing meaningful incremental value confirmed by structured-variable ablation. 
These findings suggest that AI-driven risk prioritization may function as an adjunct layer of situational awareness in the ED, complementing clinical judgement rather than replacing it. Safe clinical adoption will require prospective shadow testing in real-time workflows to quantify ranking accuracy, assess operational feasibility, and evaluate impact on decision-making before any clinician-facing implementation.

Study type: Journal Article Validation Study

Evidence appraisal

Main Findings

Two XGBoost models were developed using structured ED triage data (demographics, vital signs, eCTAS) combined with BioClinicalBERT embeddings from free-text nursing notes.
Model A used standard class weighting; Model B applied increased weighting to the 'early deterioration' class (ICU admission or death within 7 days).
The prevalence of early deterioration in the study cohort was 4.5%.
Model B showed improved recall (0.77 vs. 0.66 for Model A) and precision (0.22 vs. 0.17 for Model A), with a higher ROC-AUC (0.90 vs. 0.75).
Although XGBoost's internal feature importance attributed most of the predictive weight to the free-text embeddings, SHAP analysis identified age, respiratory rate, and systolic blood pressure as the dominant individual predictors.

Practice transfer

Clinical Relevance

AI-driven risk prioritization tools could serve as an adjunct layer of situational awareness in the ED.
Such models may help identify patients at high risk for deterioration earlier than traditional scores alone.
Incorporating free-text nursing notes into predictive analytics can add value beyond structured data, highlighting their importance.
The findings suggest a potential pathway to improve patient safety through early intervention by flagging deteriorating patients sooner.
Prospective shadow testing in real-time ED workflows is crucial before any clinician-facing implementation of such models.

Faculty notes

Educational Relevance

This article presents a validation study of two XGBoost-based machine learning models designed to predict early clinical deterioration (defined as ICU admission or death within 7 days) in the emergency department (ED), prior to initial physician assessment. The primary aim is to develop an AI-driven risk prioritization tool that integrates structured triage data with transformer-based embeddings from free-text nursing triage notes, aiming to improve upon traditional early warning scores which often perform poorly in dynamic ED environments.

The study analyzed 17,481 consecutive adult ED visits over a six-month period. A multimodal feature representation was created by combining structured variables (demographics such as age, vital signs such as respiratory rate and systolic blood pressure, and eCTAS scores) with BioClinicalBERT-derived embeddings from free-text nursing triage notes. Two models were trained: Model A used standard class weighting for the binary classification task, while Model B applied increased weighting to the 'early deterioration' class (prevalence 4.5%) to prioritize identification of high-risk patients.
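One simple way to picture a multimodal feature representation is as a concatenation of the structured values with the note embedding. The sketch below assumes that approach. The abstract does not describe how the features were assembled, so the field names, the helper `build_feature_vector`, and the zero-filled placeholder embedding are hypothetical. The length 768 is the hidden size of BERT-base architectures; whether the authors used the full vector or a reduced one is not stated.

```python
def build_feature_vector(structured, note_embedding):
    """Concatenate ordered structured triage values with a note embedding."""
    ordered_names = sorted(structured)  # fixed column order across patients
    values = [float(structured[name]) for name in ordered_names]
    return ordered_names, values + list(note_embedding)

# Hypothetical structured triage record; the field names are assumptions.
structured = {"age": 82, "resp_rate": 28, "systolic_bp": 92, "ectas": 2}

# Placeholder for a note embedding. A real pipeline would obtain this vector
# from the language model rather than filling it with zeros.
EMBEDDING_DIM = 768
note_embedding = [0.0] * EMBEDDING_DIM

names, features = build_feature_vector(structured, note_embedding)
print(len(names), len(features))
```

In a layout like this, the embedding contributes hundreds of columns while each structured variable contributes one, which is worth keeping in mind when comparing importance summed over a group of features with importance per individual feature.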

The key performance metrics are as follows:

Model A: Recall = 0.66 (95% CI: 0.59-0.73), Precision = 0.17 (95% CI: 0.15-0.20), ROC-AUC = 0.75 (95% CI: 0.72-0.79).
Model B: Recall = 0.77 (95% CI: 0.72-0.84), Precision = 0.22 (95% CI: 0.19-0.25), ROC-AUC = 0.90 (95% CI: 0.88-0.92).
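Confidence intervals like those above are often obtained by resampling. The sketch below shows a percentile bootstrap for recall as one possible approach. The abstract does not say how the authors computed their intervals, so the method, the helper names, and the test-set counts are all assumptions made for illustration.

```python
import random

def recall(labels, preds):
    """Fraction of actual positives that were flagged."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    pos = sum(labels)
    return tp / pos if pos else float("nan")

def bootstrap_ci(labels, preds, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric, resampling visits with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if sum(ys) == 0:
            continue  # skip resamples with no positives, where recall is undefined
        stats.append(metric(ys, [preds[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Hypothetical test set: 45 deteriorating visits out of 1,000, of which 35 were flagged.
labels = [1] * 45 + [0] * 955
preds = [1] * 35 + [0] * 10 + [1] * 120 + [0] * 835

point = recall(labels, preds)
low, high = bootstrap_ci(labels, preds, recall)
print(round(point, 2), round(low, 2), round(high, 2))
```

Because deterioration is rare, the interval for recall rests on a small number of positive cases and is correspondingly wide, which is a useful check when reading point estimates on imbalanced outcomes.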

These results indicate that Model B, which weights the early deterioration class more heavily, outperforms Model A on recall, precision, and ROC-AUC. Model B's ROC-AUC of 0.90 indicates strong discrimination, whereas Model A's 0.75 is more modest. The study also reports a nuanced finding regarding feature importance: XGBoost's internal analysis attributed the majority of predictive weight to the free-text embeddings (suggesting the value of nursing notes), whereas SHAP analysis pinpointed age, respiratory rate, and systolic blood pressure as the dominant individual contributors. This implies that age and these vital signs remain strong predictors even when combined with text data.
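ROC-AUC can be read as the probability that a randomly chosen deteriorating patient receives a higher predicted risk than a randomly chosen non-deteriorating patient. The sketch below computes it directly from that definition by comparing every positive-negative pair. The labels and scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks for four visits; two deteriorated (label 1).
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive-negative pairs are ordered correctly
```

This ranking interpretation fits the intended use of the models as a prioritization tool, since ROC-AUC measures how well patients are ordered by risk rather than how well any single alert threshold performs.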

The authors frame their findings cautiously, stating that such AI-driven tools should function as an adjunct layer of situational awareness in the ED, complementing rather than replacing clinical judgment. They emphasize that safe clinical adoption requires prospective shadow testing in real-time workflows to quantify ranking accuracy, assess operational feasibility, and evaluate impact on decision-making before any clinician-facing implementation.

For nursing education and practice, this study underscores the potential of AI/ML tools in enhancing patient safety by identifying at-risk patients earlier. It highlights the importance of nursing documentation (free-text notes) as a valuable data source for such predictive models. However, it also stresses that these technologies are not standalone solutions; they require careful integration into existing workflows and must be validated rigorously before clinical deployment to ensure they do not introduce new risks or biases.

The paper's strengths include its use of a large dataset from an ED setting, the innovative combination of structured data with NLP-derived features (BioClinicalBERT), and clear reporting of model performance metrics. Limitations acknowledged by the authors include retrospective outcome definition, single-institution training data, and the need for prospective validation.

This research contributes to the growing body of literature on clinical decision support systems in emergency care and highlights an area where nursing practice can intersect with advanced technology to improve patient outcomes.

Critical appraisal

Limitations

Retrospectively defined primary outcome (ICU admission or death within 7 days).
Models trained on data from a single institution over six months, limiting generalizability without external validation.
Need for prospective shadow testing to assess operational feasibility and impact on decision-making before clinical adoption.

Classroom use

Discussion Questions

How might the integration of AI-driven risk prioritization tools affect nurse-physician communication in the ED?
What are the potential ethical considerations if these models have biases or errors, particularly concerning patient subgroups not well-represented in the training data?
Can you think of other types of free-text clinical notes (e.g., progress notes) that might be valuable for similar predictive modeling tasks?
How would you design a study to prospectively validate such an AI tool in a real-world ED setting, including how to measure its impact on patient outcomes and workflow efficiency?
What specific training or education would nurses need to effectively use and interpret outputs from these types of machine learning models?

Knowledge check

Quiz

1. What was the primary goal of developing and evaluating the machine learning models in this study?

To predict patient discharge times from the ED.
To classify patients based on their preferred treatment options.
To predict early clinical deterioration prior to initial physician assessment, designed as a risk-based prioritization tool.
To analyze nursing triage note sentiment.

Answer: To predict early clinical deterioration prior to initial physician assessment, designed as a risk-based prioritization tool.
Rationale: The abstract explicitly states: "We developed and evaluated two machine learning models...to predict early clinical deterioration prior to initial physician assessment, designed as a risk-based prioritization tool" (Abstract).

2. What was the total number of consecutive adult ED visits analyzed in this study?

17,480
17,482
17,481
17,500

Answer: 17,481
Rationale: The abstract states: "We analyzed 17,481 consecutive adult ED visits over six months" (Abstract).

3. Which two types of data were integrated to form the multimodal feature representation for the machine learning models?

Structured triage data and transformer-based embeddings of free-text nursing triage notes.
Only structured triage data.
Only free-text nursing triage notes.
Patient medical history and lab results.

Answer: Structured triage data and transformer-based embeddings of free-text nursing triage notes.
Rationale: The abstract mentions: "Structured variables (demographics, vital signs, eCTAS scores) were combined with BioClinicalBERT-derived embeddings from free-text nursing triage notes to form a multimodal feature representation" (Abstract).

4. What was the prevalence of the outcome 'early deterioration' in this study?

4.5%
10%
20%
30%

Answer: 4.5%
Rationale: The abstract states: "predicting 'early deterioration' (ICU admission or death within 7 days, prevalence 4.5%) versus all other outcomes" (Abstract).

5. Which machine learning algorithm was used to train the two models in this study?

Random Forest
Logistic Regression
XGBoost
Support Vector Machine

Answer: XGBoost
Rationale: The abstract states: "Two XGBoost models (A, B) were trained on the same binary classification task" (Abstract).

6. What was the recall of Model A?

0.66
0.75
0.90
0.17

Answer: 0.66
Rationale: The abstract states: "Model A achieved a recall of 0.66 (95% CI: 0.59-0.73)" (Abstract).

7. What was the precision of Model B?

0.17
0.22
0.66
0.77

Answer: 0.22
Rationale: The abstract states: "Model B improved recall to 0.77...precision to 0.22" (Abstract).

8. What was the ROC-AUC for Model A?

0.75
0.90
0.66
0.17

Answer: 0.75
Rationale: The abstract states: "Model A achieved a recall of 0.66...and ROC-AUC of 0.75" (Abstract).

9. Which individual variable was identified by SHAP analysis as a dominant contributor to model predictions?

Age
Patient's name
Nurse ID
ED arrival time

Answer: Age
Rationale: The abstract states: "SHAP analysis identified age, respiratory rate, and systolic blood pressure as the dominant individual contributors" (Abstract).

10. What is one of the key limitations or considerations mentioned for safe clinical adoption of this AI-driven risk prioritization tool?

It can replace clinical judgement entirely.
Prospective shadow testing in real-time workflows will be required to quantify ranking accuracy, assess operational feasibility, and evaluate impact on decision-making before any clinician-facing implementation.
The models are only effective for pediatric patients.
High recall guarantees high precision.

Answer: Prospective shadow testing in real-time workflows will be required to quantify ranking accuracy, assess operational feasibility, and evaluate impact on decision-making before any clinician-facing implementation.
Rationale: The abstract concludes: "Safe clinical adoption will require prospective shadow testing...before any clinician-facing implementation" (Abstract).

Study cards

Flashcards

What was the primary objective of this research study?

The primary objective was to develop and evaluate machine learning models that integrate structured triage data with transformer-based embeddings of free-text nursing triage notes to predict early clinical deterioration in emergency department (ED) patients prior to initial physician assessment.

Which two types of variables were combined to form the multimodal feature representation?

Structured variables (demographics, vital signs, eCTAS scores) and BioClinicalBERT-derived embeddings from free-text nursing triage notes were combined.

What was the definition of "early deterioration" used in this study?

Early deterioration was defined as ICU admission or death within 7 days.

How many consecutive adult ED visits were analyzed for this study?

17,481

What was the prevalence of early deterioration (ICU admission or death within 7 days) in the studied population?

The prevalence was 4.5%.

Which machine learning algorithm was used to train the predictive models?

XGBoost was used to train the two predictive models.

What were the key performance metrics reported for Model A (standard class weighting)?

Model A achieved a recall of 0.66, precision of 0.17, and ROC-AUC of 0.75.

How did Model B's performance compare to Model A in terms of recall?

Model B improved recall from 0.66 (for Model A) to 0.77.

What was the precision achieved by Model B?

0.22

Which model demonstrated a higher ROC-AUC value, and what was that value?

Model B demonstrated a higher ROC-AUC of 0.90 compared to Model A's 0.75.

What key difference distinguished Model A from Model B in terms of training methodology?

The key difference was class weighting; Model A used standard class weighting, while Model B applied increased weighting to the early deterioration class.

According to XGBoost's internal feature importance analysis, what type of data contributed the majority of predictive weight?

Free-text embeddings (from nursing triage notes) contributed the majority of predictive weight according to XGBoost's internal feature importance analysis.

Which three individual variables were identified by SHAP analysis as dominant contributors to predictions?

Age, respiratory rate, and systolic blood pressure were identified as the dominant individual contributors by SHAP analysis.

What was the purpose of applying increased weighting to the early deterioration class in Model B?

The purpose was to prioritize identification of high-risk patients.

How did triage note embeddings contribute to the model's performance, according to the study?

Triage note embeddings provided meaningful incremental value confirmed by structured-variable ablation.

What is one key limitation or caution mentioned regarding the clinical adoption of these AI-driven risk prioritization tools?

Safe clinical adoption will require prospective shadow testing in real-time workflows before any clinician-facing implementation, to quantify ranking accuracy and assess operational feasibility.

How did the study characterize the role of such AI-driven risk prioritization tools in ED practice?

The study characterized them as an adjunct layer of situational awareness in the ED, complementing clinical judgement rather than replacing it.

What was the main reason cited for traditional early warning scores often performing poorly in the dynamic ED environment?

The abstract notes that traditional early warning scores rely on structured triage data and often perform poorly in the dynamic ED environment.

Which specific transformer-based model was used to derive embeddings from free-text nursing triage notes?

BioClinicalBERT was used to derive embeddings from free-text nursing triage notes.

What is one critical next step suggested for safe clinical adoption of these predictive models, as per the study's conclusion?

Prospective shadow testing in real-time workflows to quantify ranking accuracy and assess operational feasibility before any clinician-facing implementation is a critical next step.

Search-ready answers

Frequently asked questions

What was the main goal of this research study?

The main goal was to develop and evaluate machine learning models that integrate structured triage data with transformer-based embeddings from free-text nursing triage notes. These models were designed as a risk-based prioritization tool to predict early clinical deterioration in emergency department (ED) patients prior to their initial physician assessment, aiming to rank patients by predicted probability of adverse outcome.

What specific outcomes did the study aim to predict?

The study aimed to predict "early deterioration," defined as ICU admission or death within 7 days. This was a binary classification task against all other ED outcomes for adult patients.

How many consecutive adult emergency department visits were analyzed in this study?

The study analyzed data from 17,481 consecutive adult emergency department (ED) visits over a six-month period.

What types of data were combined to create the multimodal feature representation for the machine learning models?

Structured variables such as demographics, vital signs, and eCTAS scores were combined with BioClinicalBERT-derived embeddings from free-text nursing triage notes to form the multimodal feature representation used by the XGBoost models.

What are the names of the two XGBoost models developed in this study?

Two XGBoost models were developed: Model A and Model B. They differed only in their class weighting strategy for handling the imbalanced dataset (where early deterioration was a minority outcome).

How did the performance metrics differ between Model A and Model B, specifically regarding recall?

Model A achieved a recall of 0.66 (95% CI: 0.59-0.73). Model B, which applied increased weighting to the early deterioration class, improved recall to 0.77 (95% CI: 0.72-0.84).

What was the prevalence of "early deterioration" in the study population?

The prevalence of "early deterioration" (defined as ICU admission or death within 7 days) in the analyzed ED visits was 4.5%.

According to SHAP analysis, what were identified as dominant individual contributors to patient risk prediction?

SHAP analysis identified age, respiratory rate, and systolic blood pressure as the dominant individual contributors to patient risk prediction. The free-text nursing triage note embeddings provided meaningful incremental value, confirmed by structured-variable ablation.

What does the study suggest about how AI-driven risk prioritization might function in an emergency department setting?

The study suggests that AI-driven risk prioritization may function as an adjunct layer of situational awareness in the ED. It is intended to complement clinical judgement rather than replace it, by helping to rank patients by predicted probability of adverse outcome.

What does the abstract recommend for safe clinical adoption of such a tool?

The abstract recommends that safe clinical adoption will require prospective shadow testing in real-time workflows. This testing should aim to quantify ranking accuracy, assess operational feasibility, and evaluate impact on decision-making before any clinician-facing implementation can occur.

Access	Not specified
License	Not specified
Copyright	Not specified
Full text	Not stored