Second primary malignancy (SPM) is common in breast cancer (BC). The present study aimed to profile the characteristics of BC with SPM and to identify patients at high risk of SPM. Clinical and outcome data of BC cases were retrieved from the SEER database. Principal component analysis and a random forest model were utilized to create a model for predicting the occurrence of SPMs. Of the 286,047 BC cases analyzed, 9.32% developed SPMs. Approximately 70% of BCs that developed SPMs were ductal carcinoma and 71% of BCs that developed SPMs were human epidermal growth factor receptor 2 (HER2)-/hormone receptor (HR)+. The overall survival (OS) of the SPM cohort was significantly worse (hazard ratio: 1.49; 95% CI: 1.44-1.53; log-rank P<0.001). After adjusting for metastasis status, SPM was still a poor prognostic factor (hazard ratio: 1.71; 95% CI: 1.70-1.82; log-rank P<0.001). Of note, 50.5% of the SPMs occurred in the breast and the OS of the breast SPM group was significantly better than that of the other single-organ SPM group (hazard ratio: 0.46; 95% CI: 0.45-0.49; log-rank P<0.001) and the multiple-organ SPM group (hazard ratio: 0.44; 95% CI: 0.39-0.50; log-rank P<0.001). A random forest model created from clinical features predicted SPM with a positive predictive value of 32.3% and negative predictive value of 90.7% in the testing set. Thus, SPM occurs in nearly 1/10 of BC survivors and its existence and occurrence site significantly influence OS. SPM may be partly predicted from clinical features. In addition, it was indicated that postmenopausal elderly patients with a HER2-/HR+ molecular subtype should be more watchful and undergo screenings for SPMs.
Breast cancer (BC) is the most prevalent cancer type among females worldwide (
An SPM is a second, unrelated cancer in a person who has previously had another cancer. The exact incidence of SPMs is uncertain, though studies have provided some insight. One study evaluated over 2 million people who developed the 10 most common types of cancer from 1992 to 2008, and >10% of them developed an SPM (
An SPM may occur in the same tissue or organ as the first cancer or in another region of the body (
The present study aimed to comprehensively profile the characteristics of patients with BC harboring an SPM and to further identify patients at high risk of developing SPMs using a large, population-based cohort. First, the demographic, clinical and histological differences between patients with BC with only one primary malignancy (OOPM) and with SPMs were retrieved. The influence of SPM on prognosis was then investigated. Finally, the intrinsic factors associated with the development of SPM were evaluated and a machine learning model was established to identify BC survivors who were at high risk of developing SPMs.
Data were extracted from the Surveillance, Epidemiology and End Results (SEER) research database. The SEER program is an authoritative source of cancer statistics in the US, collecting demographic, tumor characteristic and survival data. The November 2019 version of the SEER data (
Patients with BC diagnosed after 2010 were included because molecular subtypes were available from then on. Only cases with complete data on important covariates (age, ethnicity, tumor site, grade, size) were eligible. Cases that were reported from a death certificate or autopsy were excluded and a 2-month latency exclusion was set to further distinguish SPMs from simultaneous cancers. The identified patients with BC were then categorized into two cohorts: the OOPM cohort and the SPM cohort. The study design and workflow are presented in
According to the SEER rules for classifying multiple primary cancers, the definition was dependent on the cancer site of origin, date of diagnosis, histology, tumor behavior (i.e.,
An unsupervised machine learning method called factor analysis of mixed data (FAMD), which is generally used to analyze datasets containing both quantitative and qualitative variables, was used to transform data. In brief, FAMD may be regarded as a mix between principal component analysis (PCA) and multiple correspondence analysis. It acts as PCA for quantitative variables and as multiple correspondence analysis for qualitative variables. This was achieved by an R package FactoMineR (
The total dataset was randomly split into a training set (75%) and a testing set (25%), stratified by the presence of SPMs. Random forest, a widely used supervised machine learning method, was applied to the training set to predict the likelihood of developing SPMs. The performance of the random forest classifier was evaluated in the testing set over 50 repetitions to reduce the influence of randomization. The outcome was visualized using the receiver operating characteristic (ROC) curve. Feature importance was calculated from each feature's contribution to the predictive ability of the model. The above steps were implemented in Python using the scikit-learn package.
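The stratified 75/25 split described above can be illustrated with a minimal standard-library sketch. This is not the authors' code, which used scikit-learn; the function name `stratified_split`, the seed and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices into train/test so each class keeps the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels: 1 = SPM, 0 = OOPM (10% positives).
labels = [1] * 40 + [0] * 360
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=42)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Repeating the split with different seeds, as in the 50 repetitions reported here, gives a distribution of testing-set metrics rather than a single estimate.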
To compare distributions between variables, the χ² test was generally applied for discrete variables, Student's t-test for continuous variables satisfying a normal distribution and the Mann-Whitney U-test for continuous variables otherwise. For survival analysis, both the non-parametric Kaplan-Meier model and the semi-parametric Cox proportional hazard model were used to evaluate the influence of variables on overall survival (OS); a result was regarded as significant only when both methods produced a significant P-value. P<0.05 was considered to indicate statistical significance.
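As an illustration of the non-parametric Kaplan-Meier estimate mentioned above, the following minimal sketch computes the product-limit survival curve from follow-up times and event indicators. The function name `kaplan_meier` and the toy data are hypothetical and are not taken from the SEER cohort.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one death.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical follow-up in months; 1 = death, 0 = censored.
times = [6, 12, 12, 20, 30, 42]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored patients contribute to the number at risk until their last follow-up but never lower the curve themselves.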
A total of 286,047 patients with BC were identified from the SEER database. Of them, 26,657 (9.32%) developed SPMs within a maximum follow-up of ~7 years. The characteristics of the patients with OOPM and SPM are compared in
The cancer types of the SPMs are profiled in
The histological type distribution in the SPM cohort was compared with that in the OOPM cohort (
Molecular status had been determined by a combination of immunohistochemistry, fluorescence
Since it is at times difficult to distinguish between metastasis and SPM, the SPM frequency was compared between stage M0 and stage M1, stratified by histological and molecular subtype. Although the SPM frequency was slightly higher in stage M0 than in stage M1 across numerous histological types, no significant difference in SPM frequency was detected in any histological type (
The OS of the SPM cohort was significantly worse than that in the OOPM cohort (hazard ratio: 1.49; 95% CI: 1.44-1.53; log-rank P<0.001), indicating the role of SPMs in accelerating patient death (
Since distant metastasis is a key factor influencing OS, this was validated in the present dataset (
To detect whether the SPM and OOPM cohorts may be distinguished by certain features, unsupervised transformation was performed using FAMD, which is an extension of PCA. The general purpose of PCA is to find transformed features, each a combination of the original variables, that may separate the patients into two or more clusters. In the present study, there were 23 original variables. After transformation, the top five features were extracted. Only slightly >10% of the variance of the data could be explained by the top five transformed features (
The patient population was randomly split into a training set (75%) and a testing set (25%), each stratified by the presence of SPMs. Hyperparameters, including maximum depth and class weight, were tuned on the training set and the values that generated the highest positive predictive value (PPV) in the out-of-bag mode were adopted to create the random forest model. The model generated an overall area under the curve (AUC) of 0.95 in the training set (
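For reference, the PPV used for model selection and the negative predictive value (NPV) reported alongside it can be computed from the four confusion-matrix counts. The sketch below is a minimal illustration; the function name `predictive_values` and the toy labels are hypothetical and are not data from the present study.

```python
def predictive_values(y_true, y_pred):
    """Return (PPV, NPV) for binary labels where 1 = SPM and 0 = OOPM."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    # PPV: fraction of predicted positives that are true positives.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    # NPV: fraction of predicted negatives that are true negatives.
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical toy example: 3 true SPM cases among 10 patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
ppv, npv = predictive_values(y_true, y_pred)
print(round(ppv, 3), round(npv, 3))
```

Both values depend on how common the outcome is in the evaluated population, so they should be read together with the SPM frequency of the cohort.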
A representative ROC curve with its AUC in the testing set is illustrated in
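The AUC summarizing an ROC curve may be interpreted as the probability that a randomly chosen patient with SPM receives a higher predicted score than a randomly chosen patient with OOPM. The following minimal sketch computes it by direct pairwise comparison; the function name `pairwise_auc` and the scores are hypothetical, and the quadratic loop is intended only for small illustrative inputs.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical predicted probabilities of SPM for five patients.
scores = [0.9, 0.4, 0.8, 0.3, 0.4]
labels = [1, 1, 0, 0, 0]
auc = pairwise_auc(scores, labels)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two cohorts.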
BC has the highest incidence of all cancers among females worldwide, its incidence having surpassed that of lung cancer (
Carcinogenesis is a multistep, long-term process. As life expectancy increases, the likelihood of being diagnosed with cancer also increases. Therefore, the risk of developing SPMs gradually increases with age (
The present study suggested that approximately half of SPMs in patients with BC occurred in the breast, while the rest appeared to occur randomly in other organs. This finding suggests that the primary malignancy of BC may change the mammary gland microenvironment and contribute to the occurrence of SPMs (
Although a small number of studies have looked into whether menopausal women are more likely to develop SPMs, the present study found that patients with SPMs were mostly HER2-/HR+ menopausal patients with a median age of 63 years, consistent with the finding of Xiao
The OS of the SPM cohort was significantly lower than that of the OOPM cohort, indicating that the occurrence of SPM had a certain role in accelerating disease progression and deterioration. Compared with the patients with SPMs in non-breast organs, the patients with SPMs in the breast had significantly better OS. Compared with patients with SPMs in non-breast organs and patients with multiple SPMs, the patients with SPMs in the breast had 54 and 56% lower risks of death, respectively. In addition, OS was not significantly different between patients with SPMs in non-breast organs and patients with multiple SPMs, which indicates that the organs bearing SPMs had a significantly greater impact on prognosis than other factors, such as the number of SPMs.
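The 54 and 56% figures above follow directly from the reported hazard ratios of 0.46 and 0.44, since a hazard ratio below 1 corresponds to a relative reduction of (1 - hazard ratio) x 100%. The short sketch below reproduces this arithmetic; the function name `percent_lower_hazard` is an illustrative assumption, and the result is strictly a reduction in the hazard of death, which the text reports as a lower risk of death.

```python
def percent_lower_hazard(hazard_ratio):
    """Convert a hazard ratio below 1 into a percentage reduction in hazard."""
    return (1 - hazard_ratio) * 100


# Hazard ratios reported for breast SPM vs. other single-organ SPM (0.46)
# and vs. multiple-organ SPM (0.44).
vs_other_single_organ = round(percent_lower_hazard(0.46))
vs_multiple_organ = round(percent_lower_hazard(0.44))
print(vs_other_single_organ, vs_multiple_organ)
```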
To predict the occurrence of SPM at the time when the primary BC was diagnosed, a supervised machine learning model was created based on clinical characteristics and features of the primary tumor, such as age at diagnosis, marital status and tumor location. The model of the present study had a PPV of 32% and an NPV of 91%. This performance is modest, but it was the best result achieved with the above features after comparing various models. Compared to the unsupervised machine learning model, which was not able to clearly distinguish SPMs from OOPMs, the model of the present study achieved an acceptable PPV and a high NPV. Of all the features used to create the model, age at diagnosis and tumor size were the two most important features predicting SPM, which is reasonable and consistent with previous reports (
The present study has several limitations. First, it is retrospective and the data originated from different centers; therefore, it has the limitations inherent to such studies, such as heterogeneity in data recording and patient management. Furthermore, the differential diagnosis between SPMs and metastatic lesions remains difficult, so some diagnostic confusion between the two is inevitable. Finally, there is a lack of information regarding the treatment given after surgery or diagnosis, which is an important prognostic variable. Nevertheless, given the large population base, the present study provides valuable information.
In conclusion, the present study describes the clinical, histological and molecular characteristics of patients with BC with SPMs based on the SEER dataset. The results suggested that the OS of the SPM cohort was significantly worse than that of the OOPM cohort, and the OS of the patients with SPMs in the breast was significantly better than that of the patients with SPMs in other organs. Furthermore, the negative effect of SPM on OS was independent of the metastasis status. A supervised machine learning model was created that had a 32% PPV and 91% NPV using certain clinical characteristics and characteristics of the primary malignancy. In addition, postmenopausal elderly patients with a HER2-/HR+ molecular subtype should be more watchful for SPMs. The present results suggest that SPMs in the breast should be considered a prognostic factor; the association between BC and SPMs should not be ignored only because of metastasis. In addition, adequate diagnosis and long-term regular follow-up are of great significance to patients with malignancies. Therefore, attention should be paid to SPM monitoring to avoid misdiagnoses or missed diagnoses and to achieve early detection, early diagnosis and early treatment in these patients.
Not applicable.
All data used in this study are available from the SEER research database.
Conception and design: QL and HL. Collection and collation of data: QL and FZ. Data analysis and interpretation: FZ. Manuscript writing: All authors. All authors have read and approved the final manuscript. Data authentication is not applicable.
Not applicable.
Not applicable.
The authors declare that they have no competing interests.
Study design and workflow of the present study. OS, overall survival; SPM, second primary malignancy; SEER, Surveillance, Epidemiology and End Results; FAMD, factor analysis of mixed data.
Characteristics of patients with SPM. (A) Distribution of cancer types among SPMs. (B) Distribution of cancer types in the OOPM and SPM cohorts. The enrichment of SPMs in lobular/ductal carcinoma is evident. (C) Distribution of molecular subtypes in the OOPM and SPM cohorts. The enrichment of SPMs among the HER2-/HR+ molecular subtype is evident. (D) Incidence rates of SPMs in each histological type in the nonmetastatic cohort and the metastatic cohort. There was no significant difference in the incidence rate of SPMs between the metastatic and nonmetastatic cohorts. (E) Incidence rates of SPMs in each molecular subtype in the nonmetastatic cohort and the metastatic cohort. The incidence rate of SPMs in the HER2-/HR+ molecular subtype in the nonmetastatic cohort was significantly higher than that in the metastatic cohort. *P<0.05; ***P<0.001; ns, no significance. SPM, second primary malignancy; OOPM, only one primary malignancy; HR, hormone receptor; HER2, human epidermal growth factor receptor 2; Ca, carcinoma; Nos, not otherwise specified.
Survival analysis of SPMs against OS. (A) Impact of SPM on OS. The OS of the SPM cohort was significantly worse than that of the OOPM cohort. (B) Effect of SPM location on OS. The OS of patients with SPMs in the breast was significantly better than that of patients with SPMs in other organs. (C) Effect of metastasis status on OS. The OS of patients in stage M1 was significantly inferior to that of patients in stage M0. SPM, second primary malignancy; OOPM, only one primary malignancy; OS, overall survival.
Performance of the model in predicting the occurrence of SPMs. (A) ROC curve illustrating the diagnostic accuracy of the model in the training set. (B) AUC, PPV and NPV of the model in the testing set after 50 repeats. The mean AUC, NPV and PPV in the testing set were 0.57, 0.91 and 0.32, respectively. (C) A representative ROC curve in the testing set. (D) The top 10 features that contributed to the performance of the model in the testing set. SPM, second primary malignancy; ROC, receiver operating characteristic; AUC, area under the curve; NPV, negative predictive value; PPV, positive predictive value.
Characteristics of patients with BC with OOPM and SPM.
Parameter | OOPM (n=259,390) | SPM (n=26,657) |
---|---|---|
Sex | ||
Female | 257,517 (99.28) | 26,432 (99.16) |
Male | 1,873 (0.72) | 225 (0.84) |
Age at diagnosis, years [median (range)] | 60.0 (2.0-117.0) | 63.0 (21.0-103.0) |
Marital status | ||
Married | 142,757 (55.04) | 14,109 (52.93) |
Single | 39,001 (15.04) | 3,937 (14.77) |
Widowed | 32,868 (12.67) | 3,964 (14.87) |
Divorced | 27,488 (10.60) | 2,910 (10.92) |
Separated | 2,822 (1.09) | 257 (0.96) |
Unmarried or domestic partner | 753 (0.29) | 93 (0.35) |
Laterality | ||
Left | 131,723 (50.78) | 13,197 (49.51) |
Right | 127,288 (49.07) | 13,444 (50.43) |
Left or right^a | 49 (0.02) | 2 (0.01) |
Bilateral | 42 (0.02) | 0 (0.00) |
Ethnicity | ||
White | 202,991 (78.26) | 21,556 (80.86) |
Black | 29,453 (11.35) | 2,765 (10.37) |
Asian or Pacific Islander | 23,403 (9.02) | 2,063 (7.74) |
American Indian/Alaska Native | 1,570 (0.61) | 159 (0.60) |
Grade | ||
I: Well differentiated | 55,797 (21.51) | 6,373 (23.91) |
II: Moderately differentiated | 108,397 (41.79) | 11,788 (44.22) |
III: Poorly differentiated | 82,637 (31.86) | 7,227 (27.11) |
IV: Undifferentiated | 900 (0.35) | 74 (0.28) |
BC subtype | ||
HER2+/HR+ | 26,825 (10.34) | 2,199 (8.25) |
HER2+/HR- | 11,292 (4.35) | 876 (3.29) |
HER2-/HR+ | 177,409 (68.39) | 19,487 (73.10) |
Triple negative | 28,180 (10.86) | 2,402 (9.01) |
T stage | ||
T1 | 146,231 (56.37) | 14,865 (55.76) |
T2 | 76,558 (29.51) | 7,865 (29.50) |
T3 | 14,686 (5.66) | 1,721 (6.46) |
T4 | 6,170 (2.38) | 697 (2.61) |
N stage | ||
N0 | 170,761 (65.83) | 17,547 (65.83) |
N1 | 60,316 (23.25) | 6,092 (22.85) |
N2 | 13,472 (5.19) | 1,408 (5.28) |
N3 | 10,488 (4.04) | 1,177 (4.42) |
M stage | ||
M0 | 245,406 (94.61) | 25,316 (94.97) |
M1 | 11,590 (4.47) | 1,089 (4.09) |
Stage | ||
I | 120,561 (46.48) | 12,247 (45.94) |
II | 94,164 (36.30) | 9,664 (36.25) |
III | 29,340 (11.31) | 3,268 (12.26) |
IV | 11,590 (4.47) | 1,089 (4.09) |
^a Only one side involved, right or left but unspecified. Values are expressed as n (%) unless otherwise specified. SPM, second primary malignancy; OOPM, only one primary malignancy; BC, breast cancer; HR, hormone receptor; HER2, human epidermal growth factor receptor 2.
Univariate and multivariate Cox regression analysis of SPM group and metastasis status for overall survival.
Factor | Univariate hazard ratio (95% CI) | Univariate P-value | Multivariate hazard ratio (95% CI) | Multivariate P-value |
---|---|---|---|---|
SPM (vs. OOPM) | 1.49 (1.44-1.53) | <0.001 | 1.71 (1.70-1.82) | <0.001 |
Metastasis status (M1 vs. M0) | 11.37 (11.09-11.67) | <0.001 | 12.64 (12.38-13.05) | <0.001 |
Interaction term^a | | | 0.40 (0.36-0.43) | <0.001 |
^a Interaction between SPM group and metastasis status. SPM, second primary malignancy; OOPM, only one primary malignancy.