Abstract
Background: Microsatellite stability (MSS) colorectal cancers (CRCs) have a limited response to immune checkpoint inhibitors (ICIs) compared to microsatellite instability-high (MSI-H) CRCs. Nevertheless, previous studies have shown that some MSS CRCs are sensitive to ICIs, although established criteria for treatment justification are still lacking.
Objective: This study aimed to characterize the tumor-infiltrating lymphocyte (TIL) features of MSS CRCs and to develop a novel computational tool that predicts the similarity between MSS and MSI-H status in patients with CRC based on multiple factors.
Methods: We collected and analyzed data from 188 patients with CRC, including MSI status, immune cell distributions, clinical features, and gene mutations, using statistical methods and Cox regression. An ensemble machine learning–based MSI-H score was developed using stacked extreme gradient boosting classifiers to quantify the similarity of patient data to MSI-H data based on immune cell distributions, clinical features, and gene mutations. The model was robust and could address missing input data for immune cell distributions and gene mutations.
Results: The scorer performed well (mean Cohen κ of 0.40, SD 0.05, over 10 random seeds) in identifying MSI-H–like MSS samples with TIL distributions similar to genuine MSI-H CRCs. No significant difference was observed between the TIL features of MSI-H–like MSS CRCs and MSI-H CRCs. The disparity between MSI-H–like MSS CRCs and MSS CRCs potentially lies in the T regulatory cells (P=.09) and macrophage (P=.16) populations within the tumor stromal region.
Conclusions: Some patients with MSS CRC presented similar immune cell distributions with high immunoactivity compared to patients with MSI-H CRC. The MSI-H score serves as a metric to quantify the similarity of MSS CRCs to MSI-H CRCs and presents a promising avenue for more personalized and effective cancer immunotherapy treatment, offering a clinical reference for potential ICI targets in MSS CRCs.
doi:10.2196/66960
Introduction
Colorectal cancer (CRC) is one of the leading causes of cancer death worldwide []. Microsatellite status divides CRCs into two subtypes: (1) deficient mismatch repair or microsatellite instability-high (MSI-H) tumors and (2) proficient mismatch repair or microsatellite stability (MSS) and microsatellite instability-low tumors []. These 2 subtypes are distinct in terms of clinicopathological factors, gene mutations, and the immune microenvironment [,]. One of the pivotal treatment modalities in the field of CRC is immunotherapy, especially the administration of immune checkpoint inhibitors (ICIs), including antiprogrammed cell death-1 (PD-1) and antiprogrammed cell death ligand 1 (PD-L1) antibodies []. A reliable predictor of immunotherapy response and immunoactivity is MSI-H status; notably, the Food and Drug Administration and the European Medicines Agency granted approval for the use of ICIs to treat MSI-H CRC in 2017 and 2021, respectively [-].
While ICIs offer an alternative to surgery and chemotherapy for MSI-H CRCs, the use of ICIs to treat MSS CRCs still lacks justification, and no guidelines exist for identifying MSS CRCs with high immunoactivity; thus, a large population of patients with MSS CRC lacks effective treatment options []. However, MSS status is not an absolute marker for excluding immunotherapy: a subset of MSS CRCs has responded to ICI therapies []. A meta-analysis provides evidence for the application of ICI therapies in nonmetastatic MSS CRCs and highlights the safety of this approach and its potential for organ preservation []. In addition, Motta et al [] demonstrated that some MSS CRCs (up to 20%) harbor a profile similar to that of MSI-H tumors, including immunological, genetic, pathological, and clinical characteristics. Therefore, identifying MSS CRCs with profiles similar to those of MSI-H CRCs could be a reasonable approach, and a strategy for achieving this is urgently needed. Tumor-infiltrating lymphocytes (TILs), a polymorphic group consisting primarily of effector T lymphocytes, regulatory T lymphocytes, natural killer cells, dendritic cells, and macrophages, are a critical feature of CRC immunology []. TILs are useful in predicting immunotherapy response and immunoactivity []. Notably, the intratumoral spatial heterogeneity of TILs is an important factor for precisely stratifying prognostic immune subgroups of MSI-H CRC [].
In this study, we developed a novel MSI-H score based on ensemble machine learning to quantify the degree of similarity of immunoactivity between patients with MSS CRC and patients with MSI-H CRC. A subgroup of patients with MSS CRC with high MSI-H scores was defined as patients with MSI-H–like MSS CRC, exhibiting MSI-H–like features in immune cell distributions, gene mutations, pathological reports, and clinical characteristics. This work paves the way for more personalized, accurate, and effective cancer immunotherapy treatments, delivering a clinical reference for identifying potential ICI targets and advancing patient care.
Methods
Recruitment
Data from 188 patients with stage II CRC and tissue samples were collected from the institutional database of the Fudan University Shanghai Cancer Center between 2013 and 2019. The American Joint Committee on Cancer staging system was used to determine each patient’s stage []. Based on next-generation sequencing, 24 patients were classified as MSI-H. None of the patients had received radiation therapy, chemotherapy, or immunotherapy before tumor resection. Clinical and pathological data were obtained from patient records and postoperative pathology reports.
Multiplex Immunohistochemistry Staining
Sections (4 μm thick) were cut from formalin-fixed, paraffin-embedded CRC tissue and control tonsil tissue for multiplex immunohistochemistry (mIHC). The slides were dewaxed in xylene, rehydrated, and rinsed in graded ethanol solutions and tap water. Antibody diluent/block (72424205; PerkinElmer) was applied to block endogenous peroxidase. The slides were boiled in a Tris-EDTA buffer (pH: 9; 643901; Klinipath) and underwent microwave treatment (MWT) for antigen retrieval. Information on the primary antibodies and the corresponding fluorophores is provided in Table S1 in , including 2 panels (). Each antigen required 1 round of labeling, including primary antibody incubation, secondary antibody incubation, and tyramide signal amplification (TSA) visualization, followed by labeling of the subsequent antibody. After incubation with the primary antibody for 1 hour at room temperature, the slides were incubated with Opal Polymer HRP Ms+Rb (2414515; PerkinElmer) at 37 °C for 10 minutes. TSA visualization was performed with the Opal 7-Color IHC Kit (NEL797B001KT; PerkinElmer) containing the fluorophores 4′,6-diamidino-2-phenylindole (DAPI; Thermo Scientific) and the TSA Coumarin system (NEL703001KT; PerkinElmer). MWT was performed to remove the antibody-TSA complex with the Tris-EDTA buffer (pH: 9). TSA single-stained slides were finished with MWT, counterstained with DAPI for 5 minutes, and enclosed in Antifade mounting medium (I0052; NobleRyder).

Image Acquisition and Analysis
Multiplexed and single-color control slides were scanned at an absolute magnification of 200× by the PerkinElmer Vectra automated multispectral microscope. Representative fields from the single-color slides were imaged, and a spectral library for unmixing was generated by inForm image analysis software (version 2.1; PerkinElmer). Index cases were stained using the multiplex method and then imaged. Channels were unmixed using the spectral library. All settings were saved within an algorithm to allow for batch analysis of multiple original multispectral images of the same tissue [].
Quantification of Immune Cell Densities and Classification
The nuclear morphological features were based on DAPI staining. The numbers of immune cells in each image were scored as percent cellularity (number of positive cells/number of nucleated cells). Five representative fields at 200× magnification of tissue area were selected. The densities of immune cells were segmented independently by 2 pathologists. Immune variables were classified based on the patterns of fluorochrome intensity.
Patient Follow-Up
Patients were monitored every 3‐6 months for 3 years, then every 6‐12 months up to 5 years. Follow-ups included rectal exams, carcinoembryonic antigen (CEA) tests, annual radiological studies, and colonoscopies as needed.
Test of MSI and CRC-Relevant Mutations
The ColonCore panel (Burning Rock) is designed for simultaneous detection of MSI status and mutations in 37 CRC-related genes (Table S2 in ). The MSI detection method was a read-count-distribution–based approach, using the coverage ratio of a specific set of repeat lengths as the main characteristic of each microsatellite locus. The MSI status of a sample was determined by the percentage of unstable loci in the given sample [].
Statistical Tests and Survival Analysis
Statistical analysis was performed and visualized by R (version 3.4.3; R Foundation for Statistical Computing), SPSS software (version 25.0; IBM Corp), and GraphPad Prism 7 software (GraphPad Software Inc). All group-wise comparisons were conducted by the 2-sided unpaired Mann-Whitney U test, followed by the Bonferroni procedure. The Cox proportional hazards regression model was used to assess the hazard ratios, 95% CIs, and P values for univariate and multivariate analysis. Variables with P<.10 after adjusting for common clinicopathological parameters were included in the multivariate analysis. Survival times were compared using the log-rank test. A P value of <.05 was considered statistically significant, and all P values corresponded to 2-sided statistical tests.
Feature Engineering
The process from feature engineering to model evaluation is depicted in . Categorical features were one-hot encoded as dummy variables. The mutation landscape was also one-hot encoded based on gene classes, with 2 classification stringencies, using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) gene functional classification tool []. To further engineer the mutation landscape, we calculated the joint posterior mutation probability with the following equation:
$$P(\text{MSI-H} \mid G) = \frac{P(\text{MSI-H}) \prod_{g \in G} f_{\text{MSI-H}}(g)}{P(\text{MSI-H}) \prod_{g \in G} f_{\text{MSI-H}}(g) + P(\text{MSS}) \prod_{g \in G} f_{\text{MSS}}(g)} \tag{1}$$
where $P(\text{MSI-H} \mid G)$ is the probability that a patient has MSI-H status given their mutation landscape and is based on prior probabilities and the frequencies of mutated genes in MSI-H and MSS populations, $g$ is a mutated gene in a patient’s sample, $G$ is the set of all detected mutated genes in the sample, $P(\text{MSI-H})$ is the probability of MSI-H in a CRC sample (0.17), $P(\text{MSS})$ is the probability of MSS in a CRC sample (0.83), $f_{\text{MSI-H}}(g)$ is the frequency of a mutated gene in a CRC MSI-H population, and $f_{\text{MSS}}(g)$ is the frequency of a mutated gene in a CRC MSS population. $f_{\text{MSI-H}}(g)$ and $f_{\text{MSS}}(g)$ were based on the findings of Serebriiskii et al [] and 2 datasets [,] on cBioPortal [,]. Our Bayesian-based metric can explicitly incorporate prior biological knowledge, including MSI-H/MSS prevalence in populations with CRC and microsatellite status–specific mutation frequencies, and probabilistic reasoning into the modeling process. Leveraging these priors potentially enhances the model’s ability to distinguish MSI-H from MSS cases. This metric was included in the dataset along with the other features and used for model training.
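The joint posterior described above can be sketched in Python as a naive Bayes calculation. This is a minimal sketch that assumes mutated genes are conditionally independent given microsatellite status, which is an assumption of the example rather than a statement of the authors' implementation. The function name, gene names, frequencies, and priors are hypothetical placeholders, not values from the study.

```python
from math import prod


def msi_h_posterior(mutated_genes, freq_msi_h, freq_mss, prior_msi_h, prior_mss):
    """Posterior probability of MSI-H given the set of mutated genes.

    Assumes (for this sketch only) that genes are conditionally independent
    given microsatellite status, so per-gene frequencies are multiplied.
    """
    weight_msi_h = prior_msi_h * prod(freq_msi_h[g] for g in mutated_genes)
    weight_mss = prior_mss * prod(freq_mss[g] for g in mutated_genes)
    total = weight_msi_h + weight_mss
    if total == 0:
        # No evidence either way: fall back to the prior.
        return prior_msi_h
    return weight_msi_h / total


# Hypothetical placeholder inputs (NOT taken from the study).
freq_msi_h = {"GENE_A": 0.40, "GENE_B": 0.10}
freq_mss = {"GENE_A": 0.05, "GENE_B": 0.40}

score_a = msi_h_posterior(["GENE_A"], freq_msi_h, freq_mss, 0.2, 0.8)
score_none = msi_h_posterior([], freq_msi_h, freq_mss, 0.2, 0.8)
print(round(score_a, 4), round(score_none, 4))
```

With no detected mutations the empty product equals 1, so the posterior reduces to the prior; a gene that is more frequent in the MSI-H population pushes the score above the prior.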
Although the dataset contained no missing data, our model can handle missing user input because we trained several models of varied complexity, as elaborated in the following section.
Model Training and Deployment
Sample microsatellite status was binary encoded (MSI-H as 1 and MSS as 0). Multiple extreme gradient boosting (XGBoost) models [] were trained to identify MSI-H–like tumors with different combinations of features, that is, patient metainformation, mutational landscape–derived features, and mIHC results, such as PD-L1, CD163, and CD8 mIHC staining results. The probability of MSI-H predicted by the models was defined as the MSI-H score. Specifically, 44 models, with different combinations of features shown in Table S3 in , were trained. To address potential bias caused by class imbalance, the scale_pos_weight parameter was set to the class ratio during model training.
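As an illustration of the class-imbalance setting, the sketch below computes a class ratio from binary labels and places it in a parameter dictionary of the kind passed to XGBoost. It assumes the ratio is negatives to positives, which is the convention recommended in the XGBoost documentation; the exact ratio and other hyperparameters used in the study are not stated here, and the helper name `class_ratio` is hypothetical.

```python
def class_ratio(labels):
    """Return the negative-to-positive ratio for binary labels (1 = MSI-H, 0 = MSS)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("at least one positive sample is required")
    return negatives / positives


# Cohort composition reported in the study: 24 MSI-H and 164 MSS samples.
labels = [1] * 24 + [0] * 164

# Only scale_pos_weight is grounded in the text; the objective is an
# assumption of this sketch for a binary classifier.
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": class_ratio(labels),
}
print(round(params["scale_pos_weight"], 2))
```

In a cross-validated setting, the ratio would normally be recomputed from each training fold rather than from the full cohort.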
The models were then deployed as a public web interface. As users’ data privacy is prioritized, users’ input is never stored on our server. Each model ensemble consists of 10 submodels, each trained with a distinct random seed. The final prediction for any user input is the average of the middle 6 submodel outputs, excluding the extremes. One of the XGBoost tree models in pseudocode is shown in .
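The ensemble averaging rule can be sketched as a trimmed mean. This example assumes that "excluding the extremes" means dropping the 2 lowest and 2 highest of the 10 submodel outputs; the function name and the example outputs are hypothetical.

```python
def ensemble_score(submodel_outputs, keep=6):
    """Average the middle `keep` outputs after sorting, dropping an equal
    number of the lowest and highest values."""
    ordered = sorted(submodel_outputs)
    trim = (len(ordered) - keep) // 2
    middle = ordered[trim:len(ordered) - trim]
    return sum(middle) / len(middle)


# Hypothetical outputs of 10 submodels for a single patient.
outputs = [0.95, 0.10, 0.20, 0.22, 0.25, 0.27, 0.30, 0.31, 0.60, 0.05]
score = ensemble_score(outputs)
print(round(score, 4))
```

Trimming makes the final score less sensitive to a single submodel that produces an unusually high or low prediction.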
Visualization and Clustering of TILs
To understand the feature importance in an unbiased and holistic way, a large exploratory XGBoost model with all features was trained. These features include patient metainformation, mutational landscape–derived features, and all mIHC results. This model can capture the full spectrum of the MSI-H score variation and avoid the potential bias or noise introduced by the feature selection process. The feature importance was then computed using both the XGBoost built-in function and the Shapley additive explanations package [].
Following classification by this model (threshold=0.3 defining MSI-H, chosen so that the predicted MSI-H proportion approximates the epidemiologically documented prevalence of MSI-H CRC), we visualized all mIHC features of all samples using grouped box plots. For each cell type, we performed 4 pairwise group comparisons (all MSS vs MSI-H, other MSS vs MSI-H, other MSS vs MSI-H–like MSS, and MSI-H–like MSS vs MSI-H) using the 2-sided unpaired Mann-Whitney U test with Benjamini-Hochberg adjustment []. To cluster cell types based on these 4 comparisons, we projected each cell type into a 4D latent space using a formula measuring the similarity between the cell percentages of the former and latter populations:
(2)
where $r_{\text{former}}$ and $r_{\text{latter}}$ represent the percentages of a specific cell type (eg, CD8+ cells) measured in individual samples belonging to the “former” and “latter” groups, respectively, and $p_{\text{adj}}$ is the adjusted P value of a comparison. We then performed hierarchical clustering based on Euclidean distance in latent space with complete linkage [].
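The Benjamini-Hochberg adjustment applied to the group comparisons can be sketched as follows. This is a generic implementation of the standard step-up procedure, not the authors' code, and the raw P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values from 4 comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```

Each raw P value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.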
Feature and Model Evaluation
Model generalizability was assessed by training models with identical hyperparameters through stratified 5-fold cross-validation. Cohen κ coefficients were computed on each hold-out fold, and the mean Cohen κ was computed based on the 5 κ’s, with greater κ’s indicating better model performance. This training and validation process was repeated 10 times with different random stratified splits and model initializations.
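The evaluation metric can be illustrated with a small pure-Python Cohen κ, averaged over hold-out folds. The toy labels and predictions below are hypothetical and only show the arithmetic; they are not results from the study.

```python
from statistics import mean


def cohen_kappa(y_true, y_pred):
    """Cohen kappa: agreement between two label lists, corrected for chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels
    )
    if expected == 1.0:
        # Kappa is undefined when only one label occurs; this sketch
        # returns 0.0 by convention.
        return 0.0
    return (observed - expected) / (1 - expected)


# Two hypothetical hold-out folds: (true labels, predicted labels).
folds = [
    ([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]),
    ([1, 0, 0, 0], [1, 0, 0, 0]),
]
fold_kappas = [cohen_kappa(t, p) for t, p in folds]
mean_kappa = mean(fold_kappas)
print(fold_kappas, mean_kappa)
```

Unlike raw accuracy, κ discounts the agreement expected by chance, which matters when one class (here MSS) dominates the dataset.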
Ethical Considerations
Ethics approval (1808190‐12) was obtained from the Ethics Committee of Fudan University Shanghai Cancer Center, and informed consent was obtained from all participants. Neither the patients nor the public were involved in this study (ie, only database tissue samples and data from patient records and postoperative pathology reports were used). All patient data collected for this study were deidentified prior to analysis.
Results
Higher TIL Infiltration in the Stromal Region Than in the Tumor Region
TILs were analyzed using the mIHC method. Significant differences were found between the stromal region and the tumor region (). The stromal region showed a higher prevalence of CD3+ T cells (P<.001), CD8+ T cells (P<.001), memory T cells (P<.001), CD8+ memory T cells (P<.001), CD3+ PD-1+ T cells (P<.001), CD4+ T cells (P=.048), regulatory T cells (Tregs; P<.001), macrophages (P=.001), M1 macrophages (P=.003), M2 macrophages (P=.001), and PD-L1+ cells (P=.007) than the tumor region. However, no significant difference was observed for CD8+ PD-1+ T cells and PD-L1+ macrophages between the stromal region and tumor region.

Prognostic Impact of Clinical Characteristics, MSI Status, and Immune Cell Infiltration
Variables () commonly collected in clinics and related to prognosis or with a P value <.10 in univariate analysis (Tables S4 and S5 in ) were analyzed using the Cox proportional hazards regression model (Table S6 in ). Microsatellite status was not linked to overall survival or disease-free survival in multivariate analysis. Significant overall survival predictors included age, CEA, M1 macrophage (CD68+ CD163–) infiltration in stromal region, and PD-1+ T cell (CD3+ PD-1+) infiltration in tumor region. Significant disease-free survival predictors included age, CEA, tumor differentiation, and CD8+ T cell (CD8+) infiltration in tumor region (Table S6 in ).
| Characteristic | Patients, n | MSS tumor (n=164), n (%) | MSI-H tumor (n=24), n (%) | P value |
| --- | --- | --- | --- | --- |
| Sex | | | | .82 |
| Male | 106 | 93 (57) | 13 (54) | |
| Female | 82 | 71 (43) | 11 (46) | |
| Age (y) | | | | .71 |
| <65 | 111 | 96 (59) | 15 (63) | |
| ≥65 | 77 | 68 (41) | 9 (38) | |
| Mucinous | | | | .006 |
| No | 152 | 138 (84) | 14 (58) | |
| Yes | 36 | 26 (16) | 10 (42) | |
| Differentiation | | | | .002 |
| Poor | 40 | 29 (18) | 11 (50) | |
| Moderate to well | 140 | 129 (82) | 11 (50) | |
| T stage | | | | .59 |
| T3 | 88 | 78 (48) | 10 (42) | |
| T4 | 100 | 86 (52) | 14 (58) | |
| Tumor site | | | | <.001 |
| Right | 52 | 36 (22) | 16 (67) | |
| Left | 52 | 49 (30) | 3 (13) | |
| Rectum | 83 | 78 (48) | 5 (21) | |
| Lymphovascular invasion | | | | .43 |
| No | 149 | 128 (78) | 21 (88) | |
| Yes | 39 | 36 (22) | 3 (13) | |
| Perineural invasion | | | | .86 |
| No | 136 | 119 (73) | 17 (71) | |
| Yes | 52 | 45 (27) | 7 (29) | |
| CEA (ng/mL) | | | | .94 |
| <5 | 124 | 108 (66) | 16 (67) | |
| ≥5 | 64 | 56 (34) | 8 (33) | |
| Chemotherapy | | | | .45 |
| No | 76 | 68 (41) | 8 (33) | |
| Yes | 112 | 96 (59) | 16 (67) | |
| Radiotherapy | | | | >.99 |
| No | 163 | 142 (92) | 21 (91) | |
| Yes | 15 | 13 (8) | 2 (9) | |
CEA: carcinoembryonic antigen.
TIL Distribution in MSI-H CRCs, All MSS CRCs, MSI-H–Like MSS CRCs, and Other MSS CRCs
The mIHC experiment was performed to examine the TILs in all CRCs (). MSI-H CRCs exhibited significantly higher infiltration of TILs compared to MSS CRCs in both the tumor region and stromal region. Specifically, MSI-H CRCs had a more abundant presence of PD-L1+ M2 macrophages (P=.001), CD163+ cells (P=.001), PD-L1+ macrophages (P=.01), M2 macrophages (P=.001), and macrophages (P=.03) in the stromal region, as well as M2 macrophages in the tumor region (P=.02).

MSI-H–like MSS CRCs exhibited TIL infiltration patterns akin to MSI-H CRCs but distinct from other MSS CRCs. No significant difference was observed between MSI-H–like MSS and genuine MSI-H CRCs (the fourth column in ). Compared to other MSS CRCs, MSI-H–like MSS CRCs showed higher infiltration of CD163+ cells in the stromal region (P=.001; the box plot and the third column in ) and potentially increased levels of PD-L1+ M2 macrophages (P=.13), FOXP3+ cells (P=.09), Tregs (P=.09), PD-L1+ macrophages (P=.16), M2 macrophages (P=.09), and macrophages (P=.16) in the stromal region, as well as M2 macrophages (P=.13) in the tumor region. These infiltration patterns suggest that a heightened presence of macrophages and Tregs may be a key factor in distinguishing MSI-H–like MSS CRCs from other MSS CRCs.
Macrophages were also found to be significantly more abundant in genuine MSI-H CRCs than in other MSS CRCs (the second column in ). Specifically, in the stromal region, PD-L1+ M2 macrophages (P<.001), CD163+ cells (P<.001), PD-L1+ macrophages (P=.004), M2 macrophages (P<.001), M1 macrophages (P=.045), CD4+ T cells (P=.045), and macrophages (P=.01) were significantly more abundant in MSI-H CRCs than in other MSS CRCs. In the tumor region, M2 macrophages (P=.007), M1 macrophages (P=.045), and macrophages (P=.045) were found to be significantly increased in MSI-H CRCs compared with other MSS CRCs.
The TIL distribution shows that the model performed well. The scorer, which was trained and validated on only 3 types of lymphocytes, classified MSI-H–like MSS CRC samples with similar TIL distributions as MSI-H CRC samples (the fourth column in ) rather than MSS CRC samples (the third column in ), despite most features of TILs (15 other lymphocytes) being unknown to the model. In addition, as anticipated, the heat map in (second and third columns) revealed that other MSS CRC samples exhibited a slightly closer TIL distribution to MSI-H–like MSS CRC samples than to MSI-H CRC samples.
The feature importance in a large predictive model is described in . TIL features from the whole or stromal region were more predictive than the tumor region alone for the MSI-H status. Top features for MSI-H score predictor included macrophage subtypes, mutational landscape, and immune cell distributions.
MSI-H Score Predictor Generalization Ability Affected by Feature Number and Type
Increasing the number of features generally enhanced κ, indicating better generalization performance (). However, models incorporating PD-L1 mIHC staining tended to exhibit lower κ compared to those without, likely due to noise in PD-L1 measurements, as evidenced by the high SD of κ for models 2 to 4. This noise effect was mitigated by increasing model complexity; for example, model 44 had a greater κ than model 41, despite including PD-L1. Feature importance analysis () revealed that while PD-L1 remained relevant, its importance diminished as models became more complex, suggesting that sophisticated models learned to filter out noise and extract useful information from PD-L1. The variable Spearman correlation matrix heat map is shown in . In addition, an MSI-H scorer web interface is freely accessible [].


Discussion
Principal Findings
Immunotherapy has been successful for treating MSI-H CRCs but is not as effective in MSS CRCs, which comprise the majority of CRCs. Thus, we developed a machine learning–based MSI-H predictor to generate a robust and reliable score that can capture the complexity and heterogeneity of CRC and better target patients with MSS CRC who may benefit from immunotherapy. Our study also provides insights into the immune landscape of CRC and the role of immune cell distributions, clinical features, and gene mutations in influencing MSI status. This CRC prognostic study mostly agrees with our previous research [] and with findings from other authors []. For example, according to our results, TIL infiltration, primarily by macrophages or CD163+ cells, was significantly higher in MSI-H CRCs than in MSS CRCs (), consistent with previous studies [].
We observed a higher abundance of TIL subsets in the stromal region than in the tumor region, indicating a more active immune response in the stromal region (). Comparing the MSI predictive performance of models 42, 43, and 44 () also highlights the importance of stromal TILs. These regional disparities underscore the importance of analyzing the complete tumor region for comprehensive insights. Our scorer successfully identified MSI-H–like MSS samples with TIL distributions similar to genuine MSI-H CRCs (). In addition, the balance between proinflammatory and anti-inflammatory signals is an important immunological characteristic. Macrophages can be classified into 2 main subtypes: M1 macrophages with proinflammatory and antitumor functions and M2 macrophages with anti-inflammatory and protumor functions. The ratio of M1/M2 macrophages may influence immunotherapy outcomes, reflecting the balance between proinflammatory and anti-inflammatory signals in the tumor microenvironment []. Tregs are generally immunosuppressive and can predict both the host immune response and chemotherapeutic response []. Both macrophages and Tregs are important in the regulation of immunoactivity. As shown in our results, the distribution of macrophages and Tregs appears to be important in differentiating MSI-H–like MSS CRCs from other MSS CRCs based on TIL infiltration patterns (). By comparing variations within model sets and between specific model set pairs (), including models 5/6/7 versus 13/16/19, 27/28/29 versus 35/38/41, 2/3/4 versus 12/15/18, 24/25/26 versus 34/37/40, 11/14/17 versus 20/21/22, and 33/36/39 versus 42/43/44, we observed that the CD163 mIHC result increased the predictive value for whole-tumor MSI scores but reduced it for tumor region scores. To better understand the differences in M2 macrophages and Tregs between MSS CRCs and MSI-H CRCs, further research on their function in CRCs is necessary.
Our analysis revealed that PD-L1+ M2 macrophages in the total region, mutational landscape, CD163+ cells in the stromal region, PD-L1+ M2 macrophages in the stromal region, and tumor site were the most important features for predicting MSI-H status (), aligning with other research results [,]. Macrophages can express PD-L1 and interact with PD-1+ T cells, which may affect the response to immunotherapy []. However, PD-L1+ macrophages potentially indicate M1-like polarization profiles []. The stroma is important because it can influence the extracellular matrix formation, angiogenesis, immune response, and therapeutic resistance of tumors []. The importance of the mutational landscape for prediction is widely known []. We did not find an obvious immunological explanation for why tumor site would impact the similarity between MSI-H and MSS. Further study is needed to clarify the underlying mechanisms. Moreover, we observed that feature number and type influenced the generalization ability of the MSI-H score prediction models ( and ). This suggests that different patterns of missing variables call for dedicated models, which our machine learning scorer accommodates through models trained on different feature combinations.
On the basis of our results, we proposed a hypothesis regarding the changes that occur in MSI-H–like MSS CRCs compared to other MSS CRCs. MSI-H–like MSS CRCs foster an immunosuppressive microenvironment with M2 macrophages, Tregs, and PD-L1 that inhibits T cell responses []. However, there are enough T cells present that can be reactivated upon PD-1/PD-L1 blockade, leading to the sensitivity of MSI-H–like MSS CRCs to ICIs. The abundance of macrophages suggests that there may be some M1-like populations that, when disinhibited, promote antitumor immunity. Detailed differences in immune cell populations and their functions in MSI-H–like MSS CRCs and other MSS CRCs should be further investigated to understand the mechanisms underlying the differential response to immunotherapy. Furthermore, future clinical trials could be conducted to evaluate ICI treatment between patients with MSI-H–like MSS CRC and other patients with MSS CRC with low MSI-H scores.
Limitations
Limitations of our study include the lack of internal or external validation of the MSI-H score in patients with MSS CRC receiving immunotherapy and the absence of investigation into the underlying molecular mechanisms. Further research and clinical trials are needed to validate our MSI-H score and elucidate the associated mechanisms.
Conclusions
In conclusion, our study revealed significant variations in TIL distribution across tumor regions and MSI status. Integrating clinical, TIL, and mutational data, we developed a robust MSI-H scorer that captures CRC’s complexity and heterogeneity. Macrophages, gene mutations, and tumor site emerged as key predictors. MSI-H–like MSS CRCs exhibited TIL infiltration patterns with high immunoactivity similar to MSI-H CRCs, distinctly different from other MSS CRCs. Our privacy-protected MSI-H score predictor is freely available on the web, enabling clinical and research applications.
Acknowledgments
This study was supported by grants from the National Natural Science Foundation of China (82473500 to JP and 82372974 to YL) and the Natural Science Foundation of Shandong Province (ZR2023QC282 to LJ). The sponsors had no role in the study design; data collection, analysis, or interpretation; writing the report; or the decision to submit the paper for publication.
Data Availability
The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.
Authors' Contributions
JP and DH contributed to the study conception and design. HY, YL, and LJ developed the methodology. LJ, HY, WS, SM, and FW performed the analysis and interpretation. YL, HY, and LJ drafted and revised the manuscript. JP and DH supervised the study. All authors read and approved the final manuscript.
Conflicts of Interest
FW is employed by Weifang Ten Nanometer Biotechnology Co, Ltd. All other authors declare no other conflicts of interest.
Detailed information on the experimental protocols, genetic panels, model specifications, and statistical analyses performed in this study, as provided by 6 supplementary tables.
DOC File, 249 KB
The flowchart of feature engineering, model training, deployment, and feature and model evaluation.
PNG File, 203 KB
An example of one of the extreme gradient boosting tree models in pseudocode.
DOC File, 92 KB
Feature importance in a large predictive model.
PDF File, 894 KB
Heat map of Spearman correlation coefficients between all pairs of variables (features and targets).
PDF File, 534 KB
References
- Siegel RL, Wagle NS, Cercek A, Smith RA, Jemal A. Colorectal cancer statistics, 2023. CA Cancer J Clin. 2023;73(3):233-254. [CrossRef] [Medline]
- Li K, Luo H, Huang L, Luo H, Zhu X. Microsatellite instability: a review of what the oncologist should know. Cancer Cell Int. 2020;20(1):16. [CrossRef] [Medline]
- Ganesh K, Stadler ZK, Cercek A, et al. Immunotherapy in colorectal cancer: rationale, challenges and potential. Nat Rev Gastroenterol Hepatol. Jun 2019;16(6):361-375. [CrossRef] [Medline]
- Wu X, Gu Z, Chen Y, et al. Application of PD-1 blockade in cancer immunotherapy. Comput Struct Biotechnol J. 2019;17:661-674. [CrossRef] [Medline]
- Golshani G, Zhang Y. Advances in immunotherapy for colorectal cancer: a review. Therap Adv Gastroenterol. 2020;13:1756284820917527. [CrossRef] [Medline]
- Casak SJ, Marcus L, Fashoyin-Aje L, et al. FDA approval summary: pembrolizumab for the first-line treatment of patients with MSI-H/dMMR advanced unresectable or metastatic colorectal carcinoma. Clin Cancer Res. Sep 1, 2021;27(17):4680-4684. [CrossRef] [Medline]
- Trullas A, Delgado J, Genazzani A, et al. The EMA assessment of pembrolizumab as monotherapy for the first-line treatment of adult patients with metastatic microsatellite instability-high or mismatch repair deficient colorectal cancer. ESMO Open. Jun 2021;6(3):100145. [CrossRef] [Medline]
- Wu H, Deng M, Xue D, et al. PD-1/PD-L1 inhibitors for early and middle stage microsatellite high-instability and stable colorectal cancer: a review. Int J Colorectal Dis. May 29, 2024;39(1):83. [CrossRef] [Medline]
- Guven DC, Kavgaci G, Erul E, et al. The efficacy of immune checkpoint inhibitors in microsatellite stable colorectal cancer: a systematic review. Oncologist. May 3, 2024;29(5):e580-e600. [CrossRef] [Medline]
- Zhang H, Huang J, Xu H, et al. Neoadjuvant immunotherapy for DNA mismatch repair proficient/microsatellite stable non-metastatic rectal cancer: a systematic review and meta-analysis. Front Immunol. 2025;16:1523455. [CrossRef]
- Motta R, Cabezas-Camarero S, Torres-Mattos C, et al. Immunotherapy in microsatellite instability metastatic colorectal cancer: current status and future perspectives. J Clin Transl Res. Aug 26, 2021;7(4):511-522. [Medline]
- Mantovani A, Allavena P, Sica A, Balkwill F. Cancer-related inflammation. Nature New Biol. Jul 24, 2008;454(7203):436-444. [CrossRef] [Medline]
- Brummel K, Eerkens AL, de Bruyn M, Nijman HW. Tumour-infiltrating lymphocytes: from prognosis to treatment selection. Br J Cancer. Feb 2023;128(3):451-458. [CrossRef] [Medline]
- Jung M, Lee JA, Yoo SY, Bae JM, Kang GH, Kim JH. Intratumoral spatial heterogeneity of tumor-infiltrating lymphocytes is a significant factor for precisely stratifying prognostic immune subgroups of microsatellite instability-high colorectal carcinomas. Mod Pathol. Dec 2022;35(12):2011-2022. [CrossRef] [Medline]
- Weiser MR. AJCC 8th edition: colorectal cancer. Ann Surg Oncol. Jun 2018;25(6):1454-1455. [CrossRef] [Medline]
- Gorris MAJ, Halilovic A, Rabold K, et al. Eight-color multiplex immunohistochemistry for simultaneous detection of multiple immune checkpoint molecules within the tumor microenvironment. J Immunol. Jan 1, 2018;200(1):347-354. [CrossRef] [Medline]
- Zhu L, Huang Y, Fang X, et al. A novel and reliable method to detect microsatellite instability in colorectal cancer by next-generation sequencing. J Mol Diagn. Mar 2018;20(2):225-231. [CrossRef] [Medline]
- Huang DW, Sherman BT, Tan Q, et al. The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):R183. [CrossRef] [Medline]
- Serebriiskii IG, Connelly C, Frampton G, et al. Comprehensive characterization of RAS mutations in colon and rectal cancers in old and young patients. Nat Commun. Aug 19, 2019;10(1):3722. [CrossRef] [Medline]
- Mondaca S, Walch H, Nandakumar S, Chatila WK, Schultz N, Yaeger R. Specific mutations in APC, but not alterations in DNA damage response, associate with outcomes of patients with metastatic colorectal cancer. Gastroenterology. Nov 2020;159(5):1975-1978. [CrossRef] [Medline]
- Chatila WK, Kim JK, Walch H, et al. Genomic and transcriptomic determinants of response to neoadjuvant therapy in rectal cancer. Nat Med. Aug 2022;28(8):1646-1655. [CrossRef] [Medline]
- Cerami E, Gao J, Dogrusoz U, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. May 2012;2(5):401-404. [CrossRef] [Medline]
- Gao J, Aksoy BA, Dogrusoz U, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. Apr 2, 2013;6(269):pl1. [CrossRef] [Medline]
- Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Presented at: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Aug 13, 2016 to Aug 17, 2016:785-794; San Francisco, CA.
- Lundberg SM, Lee SI. A unified approach to interpreting model predictions. arXiv. Preprint posted online on May 22, 2017. [CrossRef]
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. Jan 1, 1995;57(1):289-300. [CrossRef]
- McQuitty LL. Hierarchical linkage analysis for the isolation of types. Educ Psychol Meas. Apr 1960;20(1):55-67. [CrossRef]
- MSI-H score predictor. URL: https://www.jiangbioinfo.com/msi-score/ [Accessed 2025-09-30]
- Li Y, Liang L, Dai W, et al. Prognostic impact of programed cell death-1 (PD-1) and PD-ligand 1 (PD-L1) expression in cancer cells and tumor infiltrating lymphocytes in colorectal cancer. Mol Cancer. Aug 24, 2016;15(1):55. [CrossRef] [Medline]
- Idos GE, Kwok J, Bonthala N, Kysh L, Gruber SB, Qu C. The prognostic implications of tumor infiltrating lymphocytes in colorectal cancer: a systematic review and meta-analysis. Sci Rep. Feb 25, 2020;10(1):3360. [CrossRef] [Medline]
- Millen R, Hendry S, Narasimhan V, et al. CD8+ tumor-infiltrating lymphocytes within the primary tumor of patients with synchronous de novo metastatic colorectal carcinoma do not track with survival. Clin Transl Immunology. 2020;9(7):e1155. [CrossRef] [Medline]
- Edin S, Wikberg ML, Dahlin AM, et al. The distribution of macrophages with a M1 or M2 phenotype in relation to prognosis and the molecular characteristics of colorectal cancer. PLoS One. 2012;7(10):e47045. [CrossRef] [Medline]
- Oshi M, Sarkar J, Wu R, et al. Intratumoral density of regulatory T cells is a predictor of host immune response and chemotherapy response in colorectal cancer. Am J Cancer Res. 2022;12(2):490-503. [Medline]
- Wang H, Tian T, Zhang J. Tumor-associated macrophages (TAMs) in colorectal cancer (CRC): from mechanism to therapy and prognosis. Int J Mol Sci. Aug 6, 2021;22(16):8470. [CrossRef]
- Elomaa H, Ahtiainen M, Väyrynen SA, et al. Spatially resolved multimarker evaluation of CD274 (PD-L1)/PDCD1 (PD-1) immune checkpoint expression and macrophage polarisation in colorectal cancer. Br J Cancer. Jun 2023;128(11):2104-2115. [CrossRef] [Medline]
- Li E, Hu Y, Han W, et al. The mutational landscape of MSI-H and MSS colorectal cancer. J Clin Oncol. May 26, 2019;37(15_suppl):e15122. [CrossRef]
- Zhang Y, Rajput A, Jin N, Wang J. Mechanisms of immunosuppression in colorectal cancer. Cancers (Basel). Dec 20, 2020;12(12):3850. [CrossRef] [Medline]
Abbreviations
| CEA: carcinoembryonic antigen |
| CRC: colorectal cancer |
| DAPI: 4′,6-diamidino-2-phenylindole |
| DAVID: Database for Annotation, Visualization, and Integrated Discovery |
| ICI: immune checkpoint inhibitor |
| mIHC: multiplex immunohistochemistry |
| MSI-H: microsatellite instability-high |
| MSS: microsatellite stability |
| MWT: microwave treatment |
| PD-1: programmed cell death-1 |
| PD-L1: programmed cell death ligand 1 |
| TIL: tumor-infiltrating lymphocyte |
| Treg: regulatory T cell |
| TSA: tyramide signal amplification |
| XGBoost: extreme gradient boosting |
Edited by Javad Sarvestan; submitted 02.Oct.2024; peer-reviewed by Jiaying Lai, Weijie Ma; final revised version received 23.Aug.2025; accepted 26.Aug.2025; published 16.Oct.2025.
Copyright © Hongkai Yan, Li Jiang, Yaqi Li, Fengchong Wang, Shaobo Mo, Weiqi Sheng, Dan Huang, Junjie Peng. Originally published in JMIR Formative Research (https://formative.jmir.org), 16.Oct.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.

