# Machine learning models for identifying predictors of clinical outcomes with first-line immune checkpoint inhibitor therapy in advanced non-small cell lung cancer | Scientific Reports – Nature.com

### Data source and patient selection

This was a retrospective cohort study of patients with aNSCLC initiating 1L ICI treatment in the Flatiron Health aNSCLC database. The Flatiron Health EHR-derived database is a longitudinal data source comprising de-identified patient-level structured and unstructured data curated via technology-enabled abstraction 20, 21. During the study period (2015–2021), the Flatiron Health network consisted of approximately 280 US cancer clinics (~800 sites of care). The data are subject to obligations to prevent re-identification and protect patient confidentiality. The institutional review board of WCG IRB, Puyallup, WA, approved the study protocol for data collection from the real-world cohort prior to conduct of the study and waived the need for informed consent.

A cohort of patients was selected who were newly diagnosed with aNSCLC between January 1, 2015, and November 30, 2020, and met the following inclusion and exclusion criteria: age ≥ 18 years at the time of aNSCLC diagnosis, received an ICI-containing regimen as 1L treatment within 90 days after aNSCLC diagnosis (index date = 1L initiation date), had ≥ 1 PD-L1 test on or before the index date, had no positive test results or receipt of targeted therapies for ALK/EGFR/ROS1/BRAF/KRAS oncogene alteration(s), and had no clinical trial participation.
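The selection criteria above amount to a per-patient filter. The following is a minimal sketch only: the `Patient` record and all of its field names are hypothetical and do not reflect the Flatiron Health schema, and the handling of boundary days (day 0 and day 90 counted as eligible) is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

DIAGNOSIS_START = date(2015, 1, 1)
DIAGNOSIS_END = date(2020, 11, 30)


@dataclass
class Patient:
    # All field names are hypothetical, chosen for illustration only.
    age_at_diagnosis: int
    diagnosis_date: date
    first_line_start: Optional[date]          # index date when eligible
    first_line_contains_ici: bool
    earliest_pdl1_test_date: Optional[date]
    driver_positive_or_targeted_tx: bool      # ALK/EGFR/ROS1/BRAF/KRAS
    in_clinical_trial: bool


def is_eligible(p: Patient) -> bool:
    """Return True if the patient meets every criterion listed in the text."""
    if not (DIAGNOSIS_START <= p.diagnosis_date <= DIAGNOSIS_END):
        return False
    if p.age_at_diagnosis < 18:
        return False
    if p.first_line_start is None or not p.first_line_contains_ici:
        return False
    # 1L ICI must start within 90 days after diagnosis (bounds assumed inclusive).
    days_to_first_line = (p.first_line_start - p.diagnosis_date).days
    if not (0 <= days_to_first_line <= 90):
        return False
    # At least one PD-L1 test on or before the index date.
    if p.earliest_pdl1_test_date is None:
        return False
    if p.earliest_pdl1_test_date > p.first_line_start:
        return False
    return not (p.driver_positive_or_targeted_tx or p.in_clinical_trial)
```

Each criterion is a separate early return so that an attrition table (patients lost at each step) could be produced by counting failures per check.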

Two clinical outcomes were evaluated. The first was OS, defined as time from the index date to death. Patient-level structured data (EHRs, obituaries, and the Social Security Death Index) and unstructured EHR data (abstracted) had already been linked to generate a composite mortality variable that has high sensitivity and specificity when compared to the National Death Index (NDI) 22. The second outcome was PFS, defined as time from the index date to the first real-world progression event or death. Real-world progression was available based on already abstracted information from the medical charts and was defined as distinct episodes in the patient journey at which the treating physician or clinician concluded that there was spread or worsening of the disease. Flatiron Health uses a clinician-anchored approach supported by radiology reports for assessing real-world progression, since this has been reported to be the optimal and most practical method for such assessment 23.

### Variable pre-processing

Several categories of candidate variables were considered in our models for predicting clinical outcomes, including demographics, medical history, tumor characteristics, comorbidities, metastatic sites, types of 1L treatment, concomitant medications, and laboratory measurements. The assessment time windows, which were determined by clinical experts, varied across variables; their details are described below and in Table S1.

The demographic variables considered included age on the index date (i.e., initiation of 1L therapy), sex, payer type, race, and geographic region. Medical history included year of aNSCLC diagnosis, smoking status, number of different types of medical visits in the 90 days prior to the index date, and the baseline Eastern Cooperative Oncology Group (ECOG) score, which was defined as the most recent value before or on the index date, or the highest of the values if more than one ECOG score was documented on the same day. Tumor characteristics included diagnosis status, histology, and PD-L1 expression level, which was assessed based on all valid PD-L1 percentage staining results on or before the index date; the highest PD-L1 percentage staining level was abstracted for each patient. Comorbidities, presence of other primary tumors, and sites of metastasis were assessed based on all ICD-10 and ICD-9 diagnoses documented on or prior to the index date. Furthermore, comorbidities based on ICD-10 and ICD-9 codes were summarized using the Elixhauser comorbidity index score and categorized into 29 different groups, excluding the 2 groups of metastatic cancer and solid tumor without metastasis 24. The concomitant medications used during the 90 days before or on the index date were grouped by the third level of anatomical therapeutic chemical (ATC3) codes, and the number of different drugs within each class was captured. For example, if a patient had been on abacavir (J05AF06), dolutegravir (J05AJ03), and lamivudine (J05AF05), then the ATC3 variable J05A for this patient was 3. ATC3 codes were removed if the ATC3 class was taken by < 10% of patients.
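The ATC3 counting rule in the example above can be sketched as follows. This is an illustration, assuming that full ATC codes are seven characters long and that the ATC3 class is the first four characters; the function names and the dictionary layout are hypothetical.

```python
from collections import Counter


def atc3_counts(atc_codes):
    """Count distinct drugs per ATC3 class (first four characters of the code)."""
    distinct_drugs = set(atc_codes)  # the same drug recorded twice counts once
    return dict(Counter(code[:4] for code in distinct_drugs))


def drop_rare_classes(per_patient, min_share=0.10):
    """Remove ATC3 classes taken by fewer than `min_share` of patients.

    `per_patient` maps a patient id to that patient's atc3_counts() result.
    """
    n_patients = len(per_patient)
    users = Counter()
    for counts in per_patient.values():
        users.update(counts.keys())
    keep = {cls for cls, k in users.items() if k / n_patients >= min_share}
    return {
        pid: {cls: c for cls, c in counts.items() if cls in keep}
        for pid, counts in per_patient.items()
    }
```

For the patient in the text, `atc3_counts(["J05AF06", "J05AJ03", "J05AF05"])` gives `{"J05A": 3}`.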
Vital signs and laboratory tests were limited to those most frequently measured among the study population and evaluated in > 50% of patients within 90 days prior to or on the index date. Outliers were defined as the lowest and highest 0.1% of the distribution for each assessed laboratory value, and as the lowest 10% and highest 0.1% of the distribution for each assessed vital value, using empirical analysis; outliers were then set as missing. Missing values were imputed using the following rules: (1) the mode was imputed for categorical and binary variables; (2) the mean was imputed for most continuous variables; (3) zero was imputed for the PD-L1 level variable, and a new binary variable was introduced to indicate missingness; (4) "no" was imputed for metastatic sites and comorbidities. If multiple values were available during the 90-day window before or on the index date, the frequency of assessments, average value, variation (i.e., standard deviation), and direction and magnitude of changes (i.e., slope) were calculated. Categorical integers were used for initial stage at diagnosis, with Stage 0/I = 0, Stage II = 1, Stage IIIA = 2, Stage IIIB/C = 3, and Stage IV = 4. The other categorical variables were one-hot encoded, and one category from each categorical variable was dropped to minimize collinearity. We further excluded a set of features that showed > 85% correlation with other features as measured using Pearson correlation.
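The four summary features computed from repeated measurements (frequency, average, standard deviation, slope) could look like the sketch below. The choices here are assumptions rather than details given in the text: the sample standard deviation is used, the slope is an ordinary least-squares slope of value against day relative to the index date, and the standard deviation and slope are left as `None` when they cannot be computed.

```python
from statistics import mean, stdev


def summarize_measurements(days, values):
    """Summarize one patient's repeated measurements in the 90-day window.

    days:   day offsets relative to the index date (e.g. -60, -30, 0)
    values: measured values, in the same order as `days`
    """
    n = len(values)
    if n == 0:
        return {"count": 0, "mean": None, "sd": None, "slope": None}
    avg = mean(values)
    if n < 2:
        return {"count": n, "mean": avg, "sd": None, "slope": None}
    day_avg = mean(days)
    sxx = sum((d - day_avg) ** 2 for d in days)
    sxy = sum((d - day_avg) * (v - avg) for d, v in zip(days, values))
    # The slope is undefined when all measurements fall on the same day.
    slope = sxy / sxx if sxx > 0 else None
    return {"count": n, "mean": avg, "sd": stdev(values), "slope": slope}
```

A positive slope indicates a value rising toward the index date, and a negative slope a falling one.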

### Models

Survival modeling for time-to-event prediction was necessary due to right-censoring, or drop-out of patients from the cohort prior to event occurrence. Survival modeling requires a set of data in the form $$D=\{(x_i,\delta_i,t_i)\}_{i=1}^{N}$$, where $$N$$ is the total number of patients in the cohort, $$x_i$$ represents the features, $$\delta_i$$ represents the indicator variable, with $$\delta_i=1$$ indicating that an event occurred and $$\delta_i=0$$ indicating right-censoring, and $$t_i$$ is either the time of censoring or the time of the event for patient $$i$$ 25. We used 5 different approaches to perform time-to-event prediction in the presence of right-censored data. The median survival time predicted by the models was used for the evaluation.
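As an illustration of the data format, each patient contributes one triple of features, event indicator, and observed time. The sketch below assumes time is measured in days from the index date and that patients without a recorded event are censored at their last follow-up date; `SurvivalRecord` and `make_record` are hypothetical names.

```python
from datetime import date
from typing import NamedTuple, Optional, Sequence


class SurvivalRecord(NamedTuple):
    x: Sequence[float]  # feature vector x_i
    delta: int          # 1 = event observed, 0 = right-censored
    t: int              # days from index date to event or censoring


def make_record(
    x: Sequence[float],
    index_date: date,
    event_date: Optional[date],
    last_follow_up: date,
) -> SurvivalRecord:
    """Build (x_i, delta_i, t_i) for one patient."""
    if event_date is not None:
        return SurvivalRecord(x, 1, (event_date - index_date).days)
    # No event recorded: the patient is censored at last follow-up.
    return SurvivalRecord(x, 0, (last_follow_up - index_date).days)
```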

The Cox proportional hazards (CPH) model is a standard semi-parametric approach that computes the effect of a set of given features on the risk of an event occurring, and assumes the features are independent 26. We used a penalized Cox regression in which the regularization parameter and the method for handling tied event times were tuned 27, 28.
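To make the objective concrete, here is a minimal sketch of a penalized negative log partial likelihood. It is not the implementation used in the study: the ridge (L2) penalty and the Breslow treatment of tied event times are assumptions chosen for brevity, whereas the text states only that the regularization parameter and the tie-handling method were tuned.

```python
import math


def neg_log_partial_likelihood(beta, X, delta, t, penalty=0.0):
    """Penalized Cox negative log partial likelihood (Breslow ties, ridge penalty).

    beta:    coefficient vector
    X:       list of feature vectors, one per patient
    delta:   event indicators (1 = event, 0 = right-censored)
    t:       observed times
    penalty: ridge regularization strength (assumed form of the penalty)
    """
    eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
    nll = 0.0
    for i in range(len(t)):
        if delta[i] != 1:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t[i].
        risk_sum = sum(math.exp(eta[j]) for j in range(len(t)) if t[j] >= t[i])
        nll -= eta[i] - math.log(risk_sum)
    return nll + penalty * sum(b * b for b in beta)
```

The same partial-likelihood loss, without the explicit coefficient vector, is what Cox-based tree ensembles and neural networks minimize with respect to their risk scores.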

The accelerated failure time (AFT) model is a parametric model that can be used as an alternative to CPH models 29. Several known distributions have been used for this model, including Weibull, log-normal, log-logistic, and exponential. We used quantile–quantile (QQ) plots to examine which distribution fit our two outcomes, and chose the log-logistic AFT model, which treats the relationship between the logarithm of survival time and the covariates as linear. The rate of false positives and the weights of penalization were tuned.
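A sketch of how a log-logistic AFT model yields a predicted median survival time, assuming the common parameterization $$S(t\mid x)=1/(1+(t/\alpha(x))^{k})$$ with scale $$\alpha(x)=\exp(b_0+b^\top x)$$ and shape $$k>0$$. The coefficient names are hypothetical and no fitted values from the study are implied.

```python
import math


def loglogistic_survival(time, x, coef, intercept, shape):
    """Survival probability S(time | x) under a log-logistic AFT model."""
    scale = math.exp(intercept + sum(c * v for c, v in zip(coef, x)))
    return 1.0 / (1.0 + (time / scale) ** shape)


def predicted_median(x, coef, intercept):
    """Median survival time: the time at which S(time | x) equals 0.5.

    Under this parameterization the median equals the scale alpha(x),
    so it does not depend on the shape parameter.
    """
    return math.exp(intercept + sum(c * v for c, v in zip(coef, x)))
```

Because the covariates act on the log of the time scale, a one-unit increase in a feature multiplies the predicted median by `exp(coef)` for that feature.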

The survival support vector machine (SSVM) used in the study is able to handle right-censored survival data by combining ranking-based and regression-based loss, and its computational efficiency was improved by the use of kernel functions 30. Weights of penalization, the mixing parameter between ranking and regression loss, and optimizers were tuned.

Gradient-boosted decision tree (GBDT) was used to evaluate whether non-linear relationships identified by increasing model complexity would improve model performance 31. We used Cox loss for GBDT (GBDT-CPH). Learning rate, number of regression trees, maximum depth of the individual regression estimators, and the fraction of samples used for fitting the individual regression estimators were tuned.

DeepSurv is a CPH deep neural network and state-of-the-art survival method that can model increasingly complex relationships between patients' characteristics and their risk of failure 18. We used modern deep learning techniques to optimize the training of the network, including tuning the hyper-parameters of learning rate, dropout rate, number of hidden layers, and number of nodes in each hidden layer.

The related hyperparameters for the above models were fine-tuned using randomized searches of different parameter settings and fivefold cross-validation, and the results were reported based on the held-out validation sets from cross-validation, also known as testing sets.
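The tuning procedure can be outlined generically. In this sketch, `fit_and_score` is a hypothetical callable supplied by the user: it trains a model with the given parameters on the training indices and returns a score on the held-out indices, where higher is better. The number of sampled settings and the seed are arbitrary assumptions.

```python
import random
from statistics import mean


def k_fold_indices(n, k=5, seed=0):
    """Split range(n) into k (train, validation) index pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        (sorted(set(idx) - set(fold)), sorted(fold))
        for fold in folds
    ]


def randomized_search(fit_and_score, space, n, n_iter=20, k=5, seed=0):
    """Sample n_iter parameter settings and keep the best mean held-out score.

    space: dict mapping a parameter name to a list of candidate values
    n:     number of patients in the data set
    """
    rng = random.Random(seed)
    splits = k_fold_indices(n, k, seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in space.items()}
        scores = [fit_and_score(params, train, val) for train, val in splits]
        if mean(scores) > best_score:
            best_params, best_score = params, mean(scores)
    return best_params, best_score
```

Every patient appears in exactly one validation fold, so the reported held-out scores cover the whole cohort once.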

### Evaluation

Several metrics were applied to quantify how well the model fit the data. First, we used the concordance index (c-index), which measures how well the model ranks patients based on risk score compared to the clinical outcomes of interest 32. Second, we used 2 metrics derived from Haider et al. 33 called margin loss and hinge loss. Specifically, these are metrics that can handle censored data and quantify model performance in terms of the distance between predicted time-to-events and actual time-to-events. Third, since most patients experienced an event within 2 years post-index, hinge and margin loss scores for patients with an event prior to 1 and 2 years post-index were also reported to reveal whether a model could predict time-to-events accurately.
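The two families of metrics can be illustrated with short reference implementations. Both are simplified sketches under stated assumptions: the c-index below is Harrell's form and skips pairs with tied observed times, and the hinge loss follows one common formulation (absolute error for observed events, and a penalty for censored patients only when the prediction falls before the censoring time). The margin loss, which substitutes an estimated event time for censored patients, is not shown.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index: share of comparable pairs ranked correctly.

    A pair is comparable when the patient with the shorter observed time
    had an event; it is concordant when that patient has the higher risk.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != 1:
            continue
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def l1_hinge_loss(times, events, predicted_times):
    """Mean absolute error that penalizes censored patients one-sidedly."""
    total = 0.0
    for t, d, p in zip(times, events, predicted_times):
        if d == 1:
            total += abs(t - p)
        else:
            total += max(0.0, t - p)  # no penalty if predicted >= censoring time
    return total / len(times)
```

When a model outputs predicted survival times instead of risk scores, the negated predicted time can serve as the risk score for the c-index.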

### Explainability

Understanding how models generate their predictions is important in the medical domain. We applied 2 different approaches to identify important predictors. First, model-based importance scores were generated by different models, such as coefficients from Cox regression or tree-based feature importance scores from GBDT. However, a limitation of model-based scores is that some models only report whether a feature is important but do not show the directionality of the association, such as whether higher values lead to higher risk. To address this, we used SHAP values, which are model agnostic 11. Specifically, we used KernelSHAP to assign SHAP values for important variables based on test data.
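To show how SHAP values convey directionality, the sketch below computes exact Shapley values for a single prediction by enumerating all feature subsets. This is not KernelSHAP, which approximates the same quantities through a weighted regression over sampled subsets; as a further simplifying assumption, features absent from a subset are replaced with a single baseline vector instead of being averaged over a background data set.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at x, relative to `baseline`.

    predict:  any callable mapping a feature list to a number
    x:        the instance to explain
    baseline: reference values used for features outside the subset
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

A positive value means the feature pushed this patient's prediction above the baseline prediction and a negative value means it pushed the prediction below, and the values sum to the difference between the two predictions. Exact enumeration grows exponentially with the number of features, which is why approximations such as KernelSHAP are used in practice.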
