<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en" article-type="research-article">
<?release-delay 0|0?>
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">MI</journal-id>
<journal-title-group>
<journal-title>Medicine International</journal-title>
</journal-title-group>
<issn pub-type="ppub">2754-3242</issn>
<issn pub-type="epub">2754-1304</issn>
<publisher>
<publisher-name>D.A. Spandidos</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">MI-4-6-00192</article-id>
<article-id pub-id-type="doi">10.3892/mi.2024.192</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Articles</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Multi‑label classification of biomedical data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Diakou</surname><given-names>Io</given-names></name>
<xref rid="af1-MI-4-6-00192" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Iliopoulos</surname><given-names>Eddie</given-names></name>
<xref rid="af1-MI-4-6-00192" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Papakonstantinou</surname><given-names>Eleni</given-names></name>
<xref rid="af1-MI-4-6-00192" ref-type="aff">1</xref>
<xref rid="af2-MI-4-6-00192" ref-type="aff">2</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Dragoumani</surname><given-names>Konstantina</given-names></name>
<xref rid="af1-MI-4-6-00192" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Yapijakis</surname><given-names>Christos</given-names></name>
<xref rid="af2-MI-4-6-00192" ref-type="aff">2</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Iliopoulos</surname><given-names>Costas</given-names></name>
<xref rid="af3-MI-4-6-00192" ref-type="aff">3</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Spandidos</surname><given-names>Demetrios A.</given-names></name>
<xref rid="af4-MI-4-6-00192" ref-type="aff">4</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Chrousos</surname><given-names>George P.</given-names></name>
<xref rid="af2-MI-4-6-00192" ref-type="aff">2</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Eliopoulos</surname><given-names>Elias</given-names></name>
<xref rid="af1-MI-4-6-00192" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Vlachakis</surname><given-names>Dimitrios</given-names></name>
<xref rid="af1-MI-4-6-00192" ref-type="aff">1</xref>
<xref rid="af2-MI-4-6-00192" ref-type="aff">2</xref>
<xref rid="af3-MI-4-6-00192" ref-type="aff">3</xref>
<xref rid="c1-MI-4-6-00192" ref-type="corresp"/>
</contrib>
</contrib-group>
<aff id="af1-MI-4-6-00192"><label>1</label>Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, 11855 Athens, Greece</aff>
<aff id="af2-MI-4-6-00192"><label>2</label>University Research Institute of Maternal and Child Health and Precision Medicine, National and Kapodistrian University of Athens, ‘Aghia Sophia’ Children's Hospital, 11527 Athens, Greece</aff>
<aff id="af3-MI-4-6-00192"><label>3</label>School of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, London WC2R 2LS, UK</aff>
<aff id="af4-MI-4-6-00192"><label>4</label>Laboratory of Clinical Virology, School of Medicine, University of Crete, 71003 Heraklion, Greece</aff>
<author-notes>
<corresp id="c1-MI-4-6-00192"><italic>Correspondence to:</italic> Professor Dimitrios Vlachakis, Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, 75 Iera Odos, 11855 Athens, Greece <email>dimitris@aua.gr</email></corresp>
</author-notes>
<pub-date pub-type="collection">
<season>Nov-Dec</season>
<year>2024</year></pub-date>
<pub-date pub-type="epub">
<day>09</day>
<month>09</month>
<year>2024</year></pub-date>
<volume>4</volume>
<issue>6</issue>
<elocation-id>68</elocation-id>
<history>
<date date-type="received">
<day>15</day>
<month>03</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>08</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright: © 2024 Diakou et al.</copyright-statement>
<copyright-year>2024</copyright-year>
<license license-type="open-access">
<license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source and either DOI or URL of the article must be cited.</license-p></license>
</permissions>
<abstract>
<p>Biomedical datasets constitute a rich source of information, containing multivariate data collected during medical practice. In spite of inherent challenges, such as missing or imbalanced data, these types of datasets are increasingly utilized as a basis for the construction of predictive machine-learning models. The prediction of disease outcomes and complications could inform the process of decision-making in the hospital setting and ensure the best possible patient management according to the patient's features. Multi-label classification algorithms, which are trained to assign a set of labels to input samples, can efficiently tackle outcome prediction tasks. Myocardial infarction (MI) represents a widespread health risk, accounting for a significant portion of heart disease-related mortality. Moreover, the danger of potential complications occurring in patients with MI during their period of hospitalization underlines the need for systems to efficiently assess the risks of patients with MI. In order to demonstrate the critical role of applying machine-learning methods in medical challenges, in the present study, a set of multi-label classifiers was evaluated on a public dataset of MI-related complications to predict the outcomes of hospitalized patients with MI, based on a set of input patient features. Such methods can be scaled through the use of larger datasets of patient records, along with fine-tuning for specific patient sub-groups or patient populations in specific regions, to increase the performance of these approaches. Overall, a prediction system based on classifiers trained on patient records may assist healthcare professionals in providing personalized care and efficient monitoring of high-risk patient subgroups.</p>
</abstract>
<kwd-group>
<kwd>myocardial infarction</kwd>
<kwd>multi-label classification</kwd>
<kwd>biomedical datasets</kwd>
<kwd>label graph</kwd>
<kwd>precision medicine</kwd>
<kwd>complication prediction</kwd>
</kwd-group>
<funding-group>
<funding-statement><bold>Funding:</bold> The authors would like to acknowledge funding from the following: i) AdjustEBOVGP-Dx (RIA2018EF-2081): Biochemical Adjustments of native EBOV Glycoprotein in Patient Sample to Unmask target Epitopes for Rapid Diagnostic Testing. A European and Developing Countries Clinical Trials Partnership (EDCTP2) under the Horizon 2020 ‘Research and Innovation Actions’ DESCA; and ii) ‘MilkSafe: A novel pipeline to enrich formula milk using omics technologies’, a research co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH-CREATE-INNOVATE (project code: T2EDK-02222).</funding-statement>
</funding-group>
</article-meta>
</front>
<body>
<sec sec-type="intro">
<title>Introduction</title>
<p>Machine learning is a subset of artificial intelligence, aimed at developing ‘intelligent’ algorithms that harness data to execute tasks with optimal performance (<xref rid="b1-MI-4-6-00192" ref-type="bibr">1</xref>). Machine learning algorithms can be broadly split into four categories: Supervised, semi-supervised, unsupervised and reinforcement learning (<xref rid="b2-MI-4-6-00192" ref-type="bibr">2</xref>). In supervised learning, the algorithm is given a dataset known as ‘training data’, where each training sample corresponds to one or several inputs and the desired output (<xref rid="b3-MI-4-6-00192" ref-type="bibr">3</xref>). Through an iterative process, the algorithm determines a function which can correctly predict the desired output from a set of new, previously unseen inputs. Supervised machine-learning tasks include classification, regression and forecasting (<xref rid="b4-MI-4-6-00192" ref-type="bibr">4</xref>). Unsupervised learning, conversely, is carried out on unlabeled datasets with the aim of extracting patterns and information without external supervision (<xref rid="b5-MI-4-6-00192" ref-type="bibr">5</xref>). Semi-supervised learning, as indicated by the name, falls between supervised and unsupervised techniques, as only a portion of the training data is labeled (<xref rid="b6-MI-4-6-00192" ref-type="bibr">6</xref>). Lastly, during reinforcement learning tasks, an intelligent agent takes actions in a set environment (<xref rid="b7-MI-4-6-00192" ref-type="bibr">7</xref>). The actions return a reward, while also influencing the environment and the state of the agent. The goal of the agent is to ‘learn’ the policy which maximizes the reward function, or more generally, maximizes the reinforcement signal that is generated by the rewards (<xref rid="b8-MI-4-6-00192" ref-type="bibr">8</xref>).</p>
<p>Machine learning approaches, as a whole, have become increasingly relevant in the current era of ‘big data’. Data technologies are rapidly evolving, with data storage sizes entering petabytes, cloud services enabling high data transfer speed and computational systems shifting towards high performance cluster computing (<xref rid="b9-MI-4-6-00192 b10-MI-4-6-00192 b11-MI-4-6-00192" ref-type="bibr">9-11</xref>). This has allowed the implementation of machine learning in a range of diverse fields, including healthcare. A trove of biomedical data is generated daily throughout the process of medical practice and patient care. Examples of such data include imaging results (ultrasounds, magnetic resonance imaging, computed tomography scans), laboratory test results (cell cultures, biological material analyses and sequencing), patient medical history, drug effects and interactions and patient health outcomes (<xref rid="b12-MI-4-6-00192" ref-type="bibr">12</xref>). These can be regarded as attractive targets for the implementation of machine learning algorithms, wherein the desired objective is tailored to the respective challenge.</p>
<p>The prediction of disease complications constitutes a key aspect of patient treatment and management (<xref rid="b13-MI-4-6-00192" ref-type="bibr">13</xref>). A study published in 2022 by Ghosheh <italic>et al</italic> (<xref rid="b14-MI-4-6-00192" ref-type="bibr">14</xref>) investigated a system for predicting the risk of developing complications in patients diagnosed with coronavirus disease 2019 (COVID-19), trained on data from &gt;3,000 patients in the United Arab Emirates. The prediction of disease complications, overall, can be interpreted as a predictive classification problem, thus opening the door to the application of supervised machine learning algorithms. The current status of the patient can be analyzed by diagnostic classification models, while potential outcomes can be predicted by prognostic classification models (<xref rid="b15-MI-4-6-00192" ref-type="bibr">15</xref>). Predictive classification algorithms have been implemented in the context of various diseases within the past years. During the peak of the recent severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, classification models were designed to tackle various aspects of the disease, such as the detection of viral infection through X-ray imaging and CT scans, or the prediction of outcomes of patients with COVID-19 using their recorded characteristics as input (<xref rid="b16-MI-4-6-00192 b17-MI-4-6-00192 b18-MI-4-6-00192" ref-type="bibr">16-18</xref>). This is even more important under the scope of personalized medicine, as the individual patient profile, which includes features such as comorbidities, age and ethnic background, may affect disease progression and clinical manifestations (<xref rid="b19-MI-4-6-00192" ref-type="bibr">19</xref>). 
In a number of cases, several conditions, or labels, may be assigned to a single patient; for example, an individual may be suffering from COVID-19, while also exhibiting cardiovascular issues and high cholesterol. In such a case, the challenge of building predictive models can be regarded as a multi-label classification problem.</p>
<p>This type of classification task falls under the supervised machine learning ‘umbrella’ and constitutes a modeling problem where the class, also known as target or label, is predicted for a data point, otherwise known as input sample (<xref rid="b20-MI-4-6-00192" ref-type="bibr">20</xref>). In single-label classification, from a collection of discrete <italic>L</italic> (<italic>L</italic>&gt;1) labels, a single label <italic>‘l’</italic> is assigned to each input sample (<xref rid="b21-MI-4-6-00192" ref-type="bibr">21</xref>). If the number of labels <italic>L</italic> is 2, the task is considered a binary classification problem, whereas if the number of labels <italic>L</italic>&gt;2, it is considered a multiclass classification problem (<xref rid="b22-MI-4-6-00192" ref-type="bibr">22</xref>). In the case where each input sample is associated with a set of non-exclusive class labels, the task is called multi-label classification (<xref rid="b23-MI-4-6-00192" ref-type="bibr">23</xref>).</p>
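<p>As a minimal illustration of the settings described above, the targets can be sketched as arrays, with the multi-label case represented by a binary indicator matrix (the values below are toy data, not drawn from the MI dataset):</p>

```python
import numpy as np

# Hypothetical targets for 4 samples, one array per classification setting.
y_binary = np.array([0, 1, 1, 0])        # L = 2: one label per sample
y_multiclass = np.array([0, 2, 1, 2])    # L > 2: still one label per sample

# Multi-label: each row holds a set of non-exclusive labels.
y_multilabel = np.array([
    [1, 0, 1],   # sample 0 carries labels 0 and 2 simultaneously
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 0],   # a sample may also carry no label at all
])

# Number of labels assigned to each sample.
labels_per_sample = y_multilabel.sum(axis=1)
print(labels_per_sample)  # [2 1 3 0]
```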
<p>One of the common challenges in classification, and even more so in the case of multi-label biomedical datasets, is the presence of imbalanced data (<xref rid="b24-MI-4-6-00192" ref-type="bibr">24</xref>). In the multi-label setting, imbalance can be traced to three distinct levels: Imbalance within labels, imbalance between labels and imbalance within label sets (<xref rid="b25-MI-4-6-00192" ref-type="bibr">25</xref>). In intra-label imbalance, a label column contains a disproportionate ratio of negative vs. positive samples, effectively obscuring the signal of that particular label (<xref rid="b26-MI-4-6-00192" ref-type="bibr">26</xref>). In inter-label imbalance, there is a difference in frequency between labels, with frequency most often defined as the number of positive instances (<xref rid="b27-MI-4-6-00192" ref-type="bibr">27</xref>). Labels with an abundance of positive samples are considered in majority, while the rest are minority labels (<xref rid="b28-MI-4-6-00192" ref-type="bibr">28</xref>). In multi-label classification, a sample may simultaneously contain a label in minority and another label in majority. Labels in majority are termed head labels and labels in minority, tail labels (<xref rid="b29-MI-4-6-00192" ref-type="bibr">29</xref>). The last source of multi-label imbalance is the existence of label sets, where some sets may be more frequent in the dataset than others (<xref rid="b30-MI-4-6-00192 b31-MI-4-6-00192 b32-MI-4-6-00192" ref-type="bibr">30-32</xref>).</p>
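<p>The three levels of imbalance can be made concrete with a short sketch; the label matrix below is hypothetical, and the computed quantities mirror the definitions above:</p>

```python
import numpy as np

# Toy indicator matrix: rows = samples, columns = labels.
y = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
])

# Imbalance within a label: positive vs. negative ratio per column.
pos_rate = y.mean(axis=0)                      # label 0 is mostly positive

# Imbalance between labels: frequency = number of positive instances.
freq = y.sum(axis=0)                           # [5, 1, 1]
head_label = int(freq.argmax())                # majority ("head") label
tail_labels = np.where(freq == freq.min())[0]  # minority ("tail") labels

# Imbalance within label sets: how often each full row (label set) occurs.
label_sets, counts = np.unique(y, axis=0, return_counts=True)
print(freq, head_label, counts)
```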
<p>Myocardial infarction (MI), more commonly known as a ‘heart attack’, is caused by a decrease or total interruption in blood flow to part of the myocardium (<xref rid="b33-MI-4-6-00192" ref-type="bibr">33</xref>). As per the World Health Organization, more than four out of five cardiovascular disease-related deaths are due to heart attacks and strokes, while a third of these deaths concern individuals aged &lt;70 years (<xref rid="b34-MI-4-6-00192" ref-type="bibr">34</xref>). Thrombosis has been established as the most prominent driver of acute MI, itself stemming from events of atherosclerosis and inflammation (<xref rid="b35-MI-4-6-00192" ref-type="bibr">35</xref>). Atherosclerotic lesions emerge as thickening of the inner walls of the coronary arteries or as fatty streaks, ultimately leading to thrombus formation (<xref rid="b36-MI-4-6-00192" ref-type="bibr">36</xref>). During their hospitalization, patients with acute MI often face severe complications, such as shock, stroke, atrioventricular block, respiratory failure and cardiopulmonary arrest (<xref rid="b37-MI-4-6-00192 b38-MI-4-6-00192 b39-MI-4-6-00192" ref-type="bibr">37-39</xref>). Shock and stroke, in particular, constitute prevalent causes of mortality in patients with acute MI (<xref rid="b40-MI-4-6-00192" ref-type="bibr">40</xref>). Establishing a robust plan of action is inextricably linked to the successful management of MI-related complications in the hospital setting. Therefore, the application of a classification model to predict potential MI complications may prove useful in informing the decision-making of medical professionals.</p>
<p>In the present study, to demonstrate the potential of leveraging biomedical patient data in the context of predictive classification, a multi-label classification model was trained on a recently released, public dataset of MI patient data.</p>
</sec>
<sec sec-type="Data|methods">
<title>Data and methods</title>
<sec>
<title/>
<sec>
<title>Data collection and preprocessing</title>
<p>Machine-learning tasks require, first and foremost, a solid data foundation for the implementation and validation of the framework and its constituent algorithms. To explore the multi-label classification task, a public dataset regarding the outcomes of patients with MI was retrieved (<xref rid="b41-MI-4-6-00192" ref-type="bibr">41</xref>). The original dataset consists of 114 descriptive metrics collected for a total of 1,700 patients with MI, along with information regarding patient outcomes and complications, split into 12 candidate categories. Descriptive analytics of the dataset and information regarding the convention of column names can be found in <xref rid="SD1-MI-4-6-00192" ref-type="supplementary-material">Table SI</xref>. The recorded patient metrics are of mixed nature, falling into one of the following types: Binary, real and ordinal. The binary data stem from two attribute-handling approaches during the dataset's original creation. First, categorical variables were dummy coded into 0 and 1; for example, the column ‘SEX’, where the values ‘female’ and ‘male’ were arbitrarily encoded to 0 and 1, respectively. Secondly, binary encoding was used to indicate the presence or absence of attributes, such as in the ‘symptomatic hypertension’ (SIM_GIPERT) column, where presence or absence was encoded to 1 and 0, respectively. The real, or numerical, data within the dataset mainly concern patient measurements taken during assessment by the medical professionals and throughout hospitalization, such as ‘systolic blood pressure according to Emergency Cardiology Team’ (S_AD_KBRIG). Lastly, the ordinal feature data contained values with intrinsic order; for example, the column ‘presence of essential hypertension’ (GB), which took values 0, 1, 2 or 3, corresponding to absence, stage 1, stage 2 and stage 3 essential hypertension. An overview of the steps of the pipeline is presented in <xref rid="f1-MI-4-6-00192" ref-type="fig">Fig. 1</xref>.</p>
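<p>A minimal pandas sketch of the three attribute types, using the column names mentioned above but hypothetical raw values (the published dataset is already encoded):</p>

```python
import pandas as pd

# Hypothetical raw records mimicking the dataset's attribute types.
df = pd.DataFrame({
    "SEX": ["female", "male", "male"],               # categorical
    "SIM_GIPERT": ["absent", "present", "absent"],   # presence/absence
    "GB": [0, 2, 3],                                 # ordinal: hypertension stage
    "S_AD_KBRIG": [120.0, 135.0, 150.0],             # real-valued measurement
})

# Dummy coding of the categorical column (female -> 0, male -> 1).
df["SEX"] = df["SEX"].map({"female": 0, "male": 1})

# Binary encoding of presence/absence (absent -> 0, present -> 1).
df["SIM_GIPERT"] = df["SIM_GIPERT"].map({"absent": 0, "present": 1})

# Ordinal and real columns keep their intrinsic numeric form.
print(df.dtypes)
```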
<p>As per the dataset's curators, there are four candidate time settings for the construction of the prediction challenge: The time of admission to hospital, and 24, 48 and 72 h following admission. The selection of the time setting determines the columns which can be used as input data during model fitting. The end of the first day (24 h post-hospital admission) was selected as the time setting for the present implementation of the classification algorithm, leading to the exclusion of six input columns, plus the patient ID column, which serves no predictive purpose.</p>
<p>Missing data constitutes a common issue in biomedical datasets, as the process of data collection is error-prone, particularly in a hospital setting where such processes are rarely automated and are most often handled by the doctors and nurses themselves. Discarding every single sample/patient record that is missing a portion of the input features could harm the potential of the classifier, as it would markedly lessen the amount of information available to train the classifier on. A common strategy to remedy this problem is data imputation, where the missing data are imputed by various methods. In the present pipeline, input columns which contained missing values above a threshold of 85% of the total number of rows were removed, and a multivariate imputer was used to estimate the missing values in the rest of the dataset. Lastly, a baseline for the type of target data was set. As aforementioned, the output (target) data span columns 113-124 of the original dataset. All columns but one contained data in binary form, denoting presence or absence of a complication/outcome. To maintain the binary type uniform across the output data, the singular column ‘lethal outcome’ (LET_IS), which contained non-ordinal, numerical data, was one-hot encoded. To one-hot encode a variable with <italic>n</italic> possible, non-ordinal values, <italic>n</italic> separate columns are generated and the presence or absence of the value in a sample is denoted by 1 and 0, respectively. In the case at hand, numbers 0-7 had been assigned to the outcomes ‘alive’, ‘cardiogenic shock’, ‘pulmonary edema’, ‘myocardial rupture’, ‘progress of congestive heart failure’, ‘thromboembolism’, ‘asystole’, ‘ventricular fibrillation’, and were subsequently split into eight separate binary data columns. The final dataset can be found in <xref rid="SD2-MI-4-6-00192" ref-type="supplementary-material">Table SII</xref>. 
Additionally, a detailed description of the original MI database, descriptive statistics and a list of the column abbreviations can be found via the following DOI: 10.25392/leicester.data.12045261.v3.</p>
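<p>The imputation and one-hot encoding steps can be sketched as follows; the toy matrix is hypothetical, and scikit-learn's IterativeImputer stands in for the multivariate imputer described above:</p>

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrix with missing entries (hypothetical values).
X = np.array([
    [1.0, 120.0, np.nan],
    [0.0, np.nan, 2.0],
    [1.0, 135.0, 3.0],
    [0.0, 150.0, 1.0],
])

# Multivariate imputation: each column with missing values is modelled
# as a function of the remaining columns, in an iterative fashion.
X_imputed = IterativeImputer(random_state=0).fit_transform(X)

# One-hot encoding of a non-ordinal outcome column such as LET_IS (values 0-7):
# n = 8 possible values yield 8 separate binary columns.
let_is = np.array([0, 2, 7, 0])
onehot = np.eye(8, dtype=int)[let_is]
print(X_imputed.shape, onehot.shape)  # (4, 3) (4, 8)
```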
</sec>
<sec>
<title>Label relations exploration</title>
<p>In classification tasks, it is useful to explore the label space and the associations between the labels reported within the dataset. Graphs are increasingly used in the study of complex systems, such as protein or traffic networks, enabling the implementation of embedding algorithms (<xref rid="b42-MI-4-6-00192" ref-type="bibr">42</xref>). In the multi-label setting, graphs represent multiple levels of information, as graph edges could reflect a range of relationships between labels, from simpler to more complex ones (<xref rid="b43-MI-4-6-00192" ref-type="bibr">43</xref>). Furthermore, the study of clustering and interactions between labels could elucidate more obscure factors underlying the network's structure (<xref rid="b44-MI-4-6-00192" ref-type="bibr">44</xref>). For example, a network of comorbidities represented as a multi-label graph could provide insight into subtle interplays between pathological conditions.</p>
<p>To explore the graph space, the LabelCooccurrenceGraphBuilder class was imported from the skmultilearn.cluster module as the graph builder base (<xref rid="b45-MI-4-6-00192" ref-type="bibr">45</xref>). The NetworkXLabelGraphClusterer class from the same module was used to study the community and clustering trends across the label instances by use of the Louvain method (<xref rid="b46-MI-4-6-00192" ref-type="bibr">46</xref>), and NetworkX was used to visualize the graph (<xref rid="b47-MI-4-6-00192" ref-type="bibr">47</xref>).</p>
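<p>The weighted structure that such a graph builder derives can be sketched in plain NumPy (the actual pipeline uses the skmultilearn classes named above; the label matrix here is a toy example):</p>

```python
import numpy as np

# Toy label indicator matrix: rows = patients, columns = complications.
y = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])

# Weighted co-occurrence: entry (i, j) counts the samples in which
# labels i and j appear together; self-edges are discarded.
co = y.T @ y
np.fill_diagonal(co, 0)

# Edges of the label graph, weighted by co-occurrence frequency;
# a community detection method (e.g. Louvain) can then be run on this graph.
edges = [(i, j, int(co[i, j]))
         for i in range(co.shape[0])
         for j in range(i + 1, co.shape[0])
         if co[i, j] > 0]
print(edges)
```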
</sec>
<sec>
<title>Data-driven model selection and hyperparameter tuning</title>
<p>The selection of a classification algorithm is entirely dependent on the nature of the multioutput problem concerned, and there is no established guideline for choosing a model. Two factors to be considered during the selection process are performance and efficiency. Efficiency is inherently tied to model aspects, such as its scalability, the type of label combinations within the dataset, and so on. A classification algorithm that exhibits high performance may suffer from low efficiency; for example, choosing a model that trains a single classifier per label would be unsuitable or too slow for a task with a large label space. Performance may be viewed as the model's generalization capability, and several evaluation metrics can be employed to assess it. Precision measures the model's reliability in classifying a sample as positive, accuracy measures how well the model performs across all the classes of the dataset (<xref rid="b48-MI-4-6-00192" ref-type="bibr">48</xref>) and recall represents the ratio of how many of the actual labels were correctly predicted (<xref rid="b49-MI-4-6-00192" ref-type="bibr">49</xref>). While the aforementioned metrics hold up well in cases of multi-class classification, in multi-label classification, where the model's predictions can range from fully or partially correct to fully incorrect, the adjustment of evaluation measures is required to reflect these subtleties (<xref rid="b50-MI-4-6-00192" ref-type="bibr">50</xref>). Hamming loss (HL) measures the Hamming distance between the true and the predicted label sets, penalizing each incorrectly predicted label in a predicted label set individually; therefore, the metric is capable of reflecting the notion of partially correct model predictions (<xref rid="b23-MI-4-6-00192" ref-type="bibr">23</xref>). The metric ranges from 0 to 1 and the lower the HL, the better the performance exhibited by the multi-label classification model.</p>
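<p>A brief sketch of the HL computation on toy label sets, showing how a partially correct prediction is penalized per individual label rather than per sample:</p>

```python
import numpy as np

# True and predicted label sets for 3 samples and 4 labels (toy data).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],   # one label missed -> partially correct
                   [0, 1, 0, 0],   # fully correct
                   [1, 0, 0, 1]])  # one label missed

# HL = fraction of incorrectly predicted label slots over samples x labels.
hl = np.not_equal(y_true, y_pred).mean()
print(hl)  # 2 wrong slots out of 12
```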
<p>To carry out data-driven model selection, a set of algorithms was trained on the dataset and their performance was evaluated by the HL metric. Parameters which control the model's architecture are termed ‘hyperparameters’ and the process of exploring possible choices towards an optimal model architecture is termed ‘hyperparameter tuning’ (<xref rid="b51-MI-4-6-00192" ref-type="bibr">51</xref>). To carry out this step, a cross-validated grid search was employed. The method, which is available through the GridSearchCV class of the scikit-learn module, entails an exhaustive search over a parameter grid, in order to yield optimal model parameters (<xref rid="b52-MI-4-6-00192" ref-type="bibr">52</xref>). Inputting an integer for the ‘cv’ parameter of the class enables a stratified k-fold split, where the dataset is divided into k partitions and for each split, a search through a user-set range of the hyperparameter spaces is executed, fitting and scoring each combination to select the hyperparameters which lead to the best model performance (<xref rid="b53-MI-4-6-00192" ref-type="bibr">53</xref>). The pool of candidate classification algorithms contained the following: Classifier Chains (CC) with Random Forest as the base classifier, Classifier Chains with XGBoost as the base classifier, Binary Relevance k-Nearest Neighbors (BRkNN) Classifier, Random Forest (RF) Classifier, Multi-label k-Nearest Neighbors Classifier (MLkNN) and OneVsRest with XGBoost, all of which are available through the scikit-learn and XGBoost libraries (<xref rid="b54-MI-4-6-00192" ref-type="bibr">54</xref>,<xref rid="b55-MI-4-6-00192" ref-type="bibr">55</xref>).</p>
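<p>The grid search step can be sketched as below on a synthetic stand-in dataset; the grid values are illustrative and not the ones tuned in the present study:</p>

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic multi-label problem standing in for the MI dataset.
X, y = make_multilabel_classification(n_samples=120, n_features=10,
                                      n_classes=4, random_state=0)

# Illustrative hyperparameter grid for the base classifier.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# cv=3 -> 3-fold split; every grid combination is fitted and scored,
# and the best-performing combination is retained.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```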
<p>Extreme Gradient Boosting (XGBoost) is an advanced implementation of the Gradient Boosting decision tree algorithm (<xref rid="b55-MI-4-6-00192" ref-type="bibr">55</xref>). Gradient Boosting builds the first learner, a decision tree, on the training dataset to perform the prediction of the target samples, then calculates the loss, which is the difference between the true value (or true label) and the predicted value that has been generated from the first learner (<xref rid="b56-MI-4-6-00192" ref-type="bibr">56</xref>). The residual of the loss function is calculated using the Gradient Descent Method and is used as the target variable for the next iteration, where an improved learner is built (<xref rid="b57-MI-4-6-00192" ref-type="bibr">57</xref>). In brief, numerous models are trained sequentially, and the algorithm aims to boost them into a strong learner which best predicts the target. XGBoost implements parallel processing, increasing the algorithm's speed ten-fold compared to standard Gradient Boosting (<xref rid="b55-MI-4-6-00192" ref-type="bibr">55</xref>). Furthermore, the algorithm is flexible, allowing the user to select custom optimization objectives and fine-tune booster and task parameters. XGBoost does not natively support multi-output classification; hence, to implement extreme gradient boosting in the multi-label classification problem, the XGBoost model was wrapped inside the MultiOutputClassifier class from the scikit-learn module (<xref rid="b54-MI-4-6-00192" ref-type="bibr">54</xref>).</p>
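<p>The residual-fitting loop that underlies Gradient Boosting can be sketched from scratch with decision stumps under a squared-error loss (a simplified illustration, not the XGBoost implementation itself):</p>

```python
import numpy as np

def fit_stump(X, r):
    """Best single-feature threshold split minimizing squared error on r."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, r[left].mean(), r[~left].mean())
    return best[1:]

def gradient_boost(X, y, n_rounds=20, lr=0.3):
    """Sequentially fit stumps to residuals, boosting a strong learner."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        residual = y - pred              # negative gradient of squared loss
        j, t, left_val, right_val = fit_stump(X, residual)
        pred = pred + lr * np.where(X[:, j] <= t, left_val, right_val)
    return pred

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = (X[:, 0] > 0).astype(float)          # simple learnable target
pred = gradient_boost(X, y)
print(((y - pred) ** 2).mean())          # training error shrinks per round
```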
</sec>
<sec>
<title>Training and K-fold cross validation</title>
<p>In traditional machine-learning model development, the model is trained on a partition of the data, called the training set, and a set of data unseen by the model during training is used as the test set, to evaluate the performance of the algorithm on new data (<xref rid="b58-MI-4-6-00192" ref-type="bibr">58</xref>,<xref rid="b59-MI-4-6-00192" ref-type="bibr">59</xref>). K-fold cross validation entails dividing the dataset into k non-overlapping groups of rows, then training the machine-learning model on all the groups save for a hold-out fold, which is then used as the test set (<xref rid="b60-MI-4-6-00192" ref-type="bibr">60</xref>). The process is repeated across all folds, until each fold has been used as the hold-out test set, and model performance is averaged across the folds (<xref rid="b61-MI-4-6-00192" ref-type="bibr">61</xref>). To carry out k-fold cross validation and account for the imbalanced multi-label dataset, high-order iterative stratification was implemented via the scikit-learn-compatible ‘iterative_stratification’ module (<xref rid="b62-MI-4-6-00192" ref-type="bibr">62</xref>,<xref rid="b63-MI-4-6-00192" ref-type="bibr">63</xref>). In brief, dataset splits are created while maintaining the most balanced representation of labels possible within each fold. To evaluate model performance on the imbalanced dataset, HL was selected as the evaluation metric. Building, training and testing the classifiers was carried out in a Jupyter environment, using a 4-core CPU and 2-core GPU-accelerated system.</p>
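<p>The hold-out rotation of k-fold cross validation can be sketched as below; note that this plain random split does not perform the iterative stratification used in the present pipeline, which additionally balances label representation across the folds:</p>

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Split shuffled sample indices into k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

folds = kfold_indices(20, k=5)
for i, hold_out in enumerate(folds):
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Fit on `train`, evaluate HL on `hold_out`; performance is then
    # averaged over the k rotations.
    assert not set(train) & set(hold_out)            # folds never overlap
    assert set(train) | set(hold_out) == set(range(20))
print([len(f) for f in folds])  # [4, 4, 4, 4, 4]
```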
</sec>
</sec>
</sec>
<sec sec-type="Results">
<title>Results</title>
<p>At the end of the data preprocessing steps, the dataset contained 1,700 rows corresponding to 1,700 patient samples, 100 columns of patient measurements/features serving as the input (X) data and 18 columns of patient outcomes, serving as the target label (y) data. As visible in the histogram plot illustrated in <xref rid="f2-MI-4-6-00192" ref-type="fig">Fig. 2A</xref>, there is notable imbalance across the label instances. The number of labels per sample is also varied, with most samples assigned one or two labels and very few samples exhibiting more than three labels (<xref rid="f2-MI-4-6-00192" ref-type="fig">Fig. 2B</xref>). This can be traced back to the challenge of data collection that was touched upon in the Introduction; the limited number of patients limits the number of labels (outcomes) that happened to be present among them and were thus recorded.</p>
<p>The output label portion of the dataset was used as input to generate information regarding label interactions and relationships. In the case at hand, the potential outcomes and complications of hospitalized patients with MI are regarded as labels. The results of the exploration of the label space are summarized in the circular graph of <xref rid="f3-MI-4-6-00192" ref-type="fig">Fig. 3</xref>. Each label, i.e., each complication, is represented as a node in the graph, with an edge existing when there is co-occurrence between the labels, weighted by the frequency of co-occurrence. By using the Louvain algorithm, a popular community detection method, three clusters were detected, denoted by purple, yellow and light blue colours.</p>
<p>The <italic>atrial fibrillation (FIBR_PREDS)</italic> complication (non-lethal) exhibits strong relations with <italic>progress of congestive heart failure</italic> and <italic>asystole</italic>, both tagged within the dataset as lethal complications. Notable relations are also reported between <italic>atrial fibrillation</italic> and <italic>relapse of the myocardial infarction</italic>, and between <italic>atrial fibrillation</italic> and <italic>pulmonary edema (OTEK_LANC)</italic>. It is also noteworthy that, as regards lethal outcomes, <italic>cardiogenic shock</italic>, <italic>asystole</italic> and <italic>ventricular fibrillation</italic> have been clustered together, as have <italic>pulmonary edema</italic>, <italic>progress of congestive heart failure</italic> and <italic>thromboembolism</italic>. The lethal complication of <italic>myocardial rupture</italic>, on the other hand, has been clustered with, and exhibits relations with, the <italic>myocardial rupture (RAZRIV)</italic> complication label, potentially indicating that patients with myocardial rupture were assigned both labels up to the lethal outcome.</p>
<p>The candidate algorithms were evaluated according to their performance on the dataset, and the results are summarized in <xref rid="f4-MI-4-6-00192" ref-type="fig">Fig. 4</xref>. The highest (worst) HL was exhibited by the multi-label k-nearest neighbor (MLkNN) classifier, whereas the lowest (best) HL (0.053) was exhibited by the OneVsRest classifier with XGBoost. This intuitive classifier strategy, also known as one-vs.-all, trains a binary classifier independently for each label and can be applied to both multi-class and multi-label problems.</p>
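<p>A minimal sketch of the one-vs.-rest strategy on a synthetic two-label problem, assuming scikit-learn's OneVsRestClassifier and hamming_loss, with LogisticRegression standing in for the XGBoost base classifier used in the present study:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
# Two synthetic binary labels, each driven by a different input feature
Y = np.column_stack([(X[:, 0] > 0).astype(int), (X[:, 1] > 0).astype(int)])

# One independent binary classifier is fit per label; the study used XGBoost
# as the base estimator, swapped here for LogisticRegression for brevity
clf = OneVsRestClassifier(LogisticRegression()).fit(X[:150], Y[:150])

# Hamming loss on the hold-out rows: fraction of mispredicted label bits
hl = hamming_loss(Y[150:], clf.predict(X[150:]))
print(hl)
```

Because each label gets its own binary classifier, the strategy scales linearly with the number of labels and accepts any base estimator that supports binary classification.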
</sec>
<sec sec-type="Discussion">
<title>Discussion</title>
<p>Predictive classification algorithms have been implemented in the context of various diseases over the past years. During the peak of the recent SARS-CoV-2 pandemic, classification models were designed to tackle various aspects of the disease, such as the detection of viral infection through X-ray imaging and CT scans, or the prediction of outcomes for patients with COVID-19 using their recorded characteristics as input (<xref rid="b16-MI-4-6-00192 b17-MI-4-6-00192 b18-MI-4-6-00192" ref-type="bibr">16-18</xref>). This is even more critical from the perspective of personalized medicine, as the individual patient profile, which includes features such as comorbidities, age and ethnic background, may affect disease progression and clinical manifestations (<xref rid="b19-MI-4-6-00192" ref-type="bibr">19</xref>). In a number of cases, several conditions, or labels, may be assigned to a single patient; for example, an individual may be suffering from COVID-19 while also exhibiting cardiovascular issues and high cholesterol levels. In such a case, the challenge of building predictive models can be regarded as a multi-label classification problem, as explored herein.</p>
<p>The development and testing of classifiers is a complex task, particularly in the case of multi-label classification. The comparative evaluation of various algorithms on the myocardial infarction dataset identified the OneVsRest strategy as the best-performing one. Furthermore, the use of XGBoost as the base classifier enabled fine-tuning of the model and accelerated the learning process. XGBoost was also identified as the best-performing algorithm in a study published in 2022 on a similar predictive task. That study aimed to develop and validate a machine learning-based model to predict regional lymph node metastasis in osteosarcoma using data from 1,201 patients, identifying T and M stage, surgery and chemotherapy as significant risk factors and XGBoost as the best-performing predictive algorithm for that task (<xref rid="b64-MI-4-6-00192" ref-type="bibr">64</xref>).</p>
<p>The use of disease datasets is also employed by other frameworks; for example, Tang <italic>et al</italic> (<xref rid="b65-MI-4-6-00192" ref-type="bibr">65</xref>) described a Gaussian randomizer-based system for early fundus screening with privacy preserving and domain adaptation, employing a multi-disease dataset.</p>
<p>It would be of interest, as an extension of the proposed framework, to evaluate a set of different base classifiers within the OneVsRest strategy and to observe the effect that their substitution has on classification performance. The present study focused on a subset of the available algorithms and strategies; other machine-learning components and techniques thus remain to be evaluated on this task. In terms of the dataset itself, graph exploration highlighted shared label instances that potentially contain information relevant to myocardial infarction pathophysiology. The label imbalance that marks the dataset constitutes an interesting point in terms of handling a multi-label classification problem. Methods to address imbalanced datasets in multi-label classification have been reviewed elsewhere and include, but are not limited to, random oversampling and undersampling, heuristic oversampling, cost-sensitive learning and ensemble approaches (<xref rid="b26-MI-4-6-00192" ref-type="bibr">26</xref>,<xref rid="b31-MI-4-6-00192" ref-type="bibr">31</xref>). Oversampling increases the rate of minority class instances within an imbalanced dataset to compensate for the prevalence of common classes (<xref rid="b66-MI-4-6-00192" ref-type="bibr">66</xref>). Modern and widely used techniques, such as the Synthetic Minority Over-sampling Technique (SMOTE), create synthetic data points using the feature space of the minority class and its k-nearest neighbors; however, applying the k-nearest neighbor approach to binary input data, such as the dataset at hand, would serve no purpose (<xref rid="b67-MI-4-6-00192" ref-type="bibr">67</xref>). Furthermore, the binary nature of the input data excludes the use of SMOTENC, the SMOTE extension for numerical and categorical features (<xref rid="b67-MI-4-6-00192" ref-type="bibr">67</xref>). Therefore, if an added oversampling step were to be applied, a custom oversampling function would need to be created to increase tail label samples based on the calculated oversampling ratio of the labels.</p>
<p>Lastly, the concept of errors constitutes an important facet of developing accurate and reliable biomedical classification models. A classifier is subject to two main types of errors: false positives, also known as type I errors, and false negatives, also known as type II errors (<xref rid="b59-MI-4-6-00192" ref-type="bibr">59</xref>). In the case of false positives, the classifier predicts a label that is not present in the test set, whereas in the case of false negatives, a label that should have been predicted is missing. Similarly, true positives are results in which the classifier has correctly predicted the presence of a label, and true negatives are results in which the classifier has correctly predicted the absence of a label, i.e., the absence of the positive instance. In the case of disease complication prediction, limiting the rate of false negatives, where the classifier fails to predict a label (a complication) that in truth exists, is of particular importance. One could argue that, in the hospital setting, it would be less damaging to monitor a patient in anticipation of a complication that turns out to be a false positive than to fail to catch a complication that may be lethal. Therefore, the selection of performance metrics and the penalties enforced on the errors of the model are greatly affected by the nature of the disease that is to be interpreted as a classification task.</p>
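<p>Such a custom tail-label oversampling step could, under simple assumptions, be sketched as follows; the function name oversample_tail and its ratio parameter are hypothetical, illustrative choices, with rows carrying rare labels simply duplicated rather than synthesized:</p>

```python
import numpy as np

def oversample_tail(X, Y, ratio=0.5):
    """Duplicate samples carrying tail labels (a hypothetical helper).

    A label seen in fewer than `ratio` * (count of the most frequent
    label) samples is treated as a tail label, and every row carrying
    at least one tail label is doubled.
    """
    counts = Y.sum(axis=0)                       # per-label frequencies
    tail = counts < ratio * counts.max()         # boolean mask of tail labels
    extra = np.where(Y[:, tail].any(axis=1))[0]  # rows with any tail label
    idx = np.concatenate([np.arange(len(X)), extra])
    return X[idx], Y[idx]

X = np.arange(8).reshape(4, 2)
Y = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])   # second label is rare
X2, Y2 = oversample_tail(X, Y)
print(len(X2))  # the single tail-label row was duplicated: 4 -> 5
```

More elaborate variants would repeat the duplication until each tail label reaches its target ratio, rather than doubling its rows once.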
<p>In conclusion, MI is a highly frequent event in the subset of the population suffering from cardiovascular disease. The development of accurate and scalable systems to support the decision-making process of medical professionals in the hospital can ease the burden of patient management and may potentially increase the odds of survival for patients with myocardial infarction. The use of predictive systems for disease-related challenges has been garnering attention in recent years, with the increase in computational power and the development of novel algorithms.</p>
<p>The data-driven approach presented herein and the obtained results underline the potential of machine-learning applications in risk prediction, particularly for the challenges associated with MI. As demonstrated through the evaluation, high-performance algorithms, such as the Extreme Gradient Boosting algorithm, can be employed as base classifiers in the context of machine-learning model development, while disciplines such as graph theory can shed light on the elaborate networks underlying myocardial infarction progression. Public dataset repositories can provide the large-scale quantities of biomedical and patient data that are required to build efficient and reliable predictive classification models. This data-driven approach can be further scaled and enhanced; there is promise in the use of ensemble models made up of different classifiers, each with a different aptitude for predicting specific labels. Overall, the classifier-based pipeline holds the potential to support the decision-making process of healthcare professionals and to aid a proactive approach to patient care.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="SD1-MI-4-6-00192" content-type="local-data">
<caption>
<title>Descriptive analytics of the dataset and information regarding the convention of column names.</title>
</caption>
<media mimetype="application" mime-subtype="xls" xlink:href="Supplementary_Data1.xlsx"/>
</supplementary-material>
<supplementary-material id="SD2-MI-4-6-00192" content-type="local-data">
<caption>
<title>Details of the final dataset.</title>
</caption>
<media mimetype="application" mime-subtype="xls" xlink:href="Supplementary_Data2.xlsx"/>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>Not applicable.</p>
</ack>
<sec sec-type="data-availability">
<title>Availability of data and materials</title>
<p>The code samples and raw data analyzed during the present study can be found at: <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="https://github.com/IoDiakou/MLC-on-biomedical-data.git">https://github.com/IoDiakou/MLC-on-biomedical-data.git</ext-link> and <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="http://darkdna.gr">http://darkdna.gr</ext-link>.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>All authors (ID, EI, EP, KD, CY, CI, DAS, GPC, EE and DV) contributed to the conceptualization, design, writing, drafting, revising, editing and reviewing of the manuscript. All authors confirm the authenticity of all the raw data. All authors have read and approved the final manuscript.</p>
</sec>
<sec>
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Patient consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="b1-MI-4-6-00192"><label>1</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Anderson</surname><given-names>JR</given-names></name></person-group><comment>Machine learning: an artificial intelligence approach. Elsevier Science, 1983.</comment></element-citation></ref>
<ref id="b2-MI-4-6-00192"><label>2</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Russell</surname><given-names>S</given-names></name><name><surname>Norvig</surname><given-names>P</given-names></name></person-group><comment>Artificial intelligence: A modern approach. 3rd edition. Prentice-Hall, Upper Saddle River, 2010.</comment></element-citation></ref>
<ref id="b3-MI-4-6-00192"><label>3</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Somani</surname><given-names>P</given-names></name><name><surname>Kaur</surname><given-names>G</given-names></name></person-group><article-title>A review on supervised learning algorithms</article-title><source>Int J Adv Sci Technol</source><volume>29</volume><fpage>2551</fpage><lpage>2559</lpage><year>2020</year><pub-id pub-id-type="pmid">32146356</pub-id><pub-id pub-id-type="doi">10.1016/j.neunet.2020.02.011</pub-id></element-citation></ref>
<ref id="b4-MI-4-6-00192"><label>4</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Singh</surname><given-names>P</given-names></name></person-group><comment>Supervised machine learning. In: Learn PySpark: Build Python-based Machine Learning and Deep Learning Models. Singh P (ed). Apress, Berkeley, CA, pp117-159, 2019.</comment></element-citation></ref>
<ref id="b5-MI-4-6-00192"><label>5</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gentleman</surname><given-names>R</given-names></name><name><surname>Carey</surname><given-names>VJ</given-names></name></person-group><comment>Unsupervised machine learning. In: Bioconductor Case Studies. Hahne F, Huber W, Gentleman R and Falcon S (eds). Springer New York, New York, NY, pp137-157, 2008.</comment></element-citation></ref>
<ref id="b6-MI-4-6-00192"><label>6</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hady</surname><given-names>MFA</given-names></name><name><surname>Schwenker</surname><given-names>F</given-names></name></person-group><comment>Semi-supervised Learning. In: Handbook on Neural Information Processing. Bianchini M, Maggini M and Jain LC (eds). Intelligent Systems Reference Library. Vol. 49. Springer, Berlin, Heidelberg, pp215-239, 2013.</comment></element-citation></ref>
<ref id="b7-MI-4-6-00192"><label>7</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sutton</surname><given-names>RS</given-names></name><name><surname>Barto</surname><given-names>AG</given-names></name></person-group><comment>Reinforcement learning: An introduction. MIT Press, 2018.</comment></element-citation></ref>
<ref id="b8-MI-4-6-00192"><label>8</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname><given-names>D</given-names></name><name><surname>Seo</surname><given-names>H</given-names></name><name><surname>Jung</surname><given-names>MW</given-names></name></person-group><article-title>Neural basis of reinforcement learning and decision making</article-title><source>Annu Rev Neurosci</source><volume>35</volume><fpage>287</fpage><lpage>308</lpage><year>2012</year><pub-id pub-id-type="pmid">22462543</pub-id><pub-id pub-id-type="doi">10.1146/annurev-neuro-062111-150512</pub-id></element-citation></ref>
<ref id="b9-MI-4-6-00192"><label>9</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Czarnul</surname><given-names>P</given-names></name><name><surname>Proficz</surname><given-names>J</given-names></name><name><surname>Krzywaniak</surname><given-names>A</given-names></name></person-group><article-title>Energy-aware high-performance computing: Survey of state-of-the-art tools, techniques, and environments</article-title><source>Sci Program</source><volume>2019</volume><issue>8348791</issue><year>2019</year></element-citation></ref>
<ref id="b10-MI-4-6-00192"><label>10</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Mascetti</surname><given-names>L</given-names></name><name><surname>Arsuaga Rios</surname><given-names>M</given-names></name><name><surname>Bocchi</surname><given-names>E</given-names></name><name><surname>Vicente</surname><given-names>JC</given-names></name><name><surname>Cheong</surname><given-names>BCK</given-names></name><name><surname>Castro</surname><given-names>D</given-names></name><name><surname>Collet</surname><given-names>J</given-names></name><name><surname>Contescu</surname><given-names>C</given-names></name><name><surname>Labrador</surname><given-names>HG</given-names></name><name><surname>Iven</surname><given-names>J</given-names></name><etal/></person-group><article-title>CERN disk storage services: Report from last data taking, evolution and future outlook towards Exabyte-scale storage</article-title><source>EPJ Web Conf</source><volume>245</volume><issue>04038</issue><year>2020</year></element-citation></ref>
<ref id="b11-MI-4-6-00192"><label>11</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Amin</surname><given-names>R</given-names></name><name><surname>Vadlamudi</surname><given-names>S</given-names></name><name><surname>Rahaman</surname><given-names>MM</given-names></name></person-group><article-title>Opportunities and challenges of data migration in cloud</article-title><source>Eng Int</source><volume>9</volume><fpage>41</fpage><lpage>50</lpage><year>2021</year></element-citation></ref>
<ref id="b12-MI-4-6-00192"><label>12</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Dash</surname><given-names>S</given-names></name><name><surname>Shakyawar</surname><given-names>SK</given-names></name><name><surname>Sharma</surname><given-names>M</given-names></name><name><surname>Kaushik</surname><given-names>S</given-names></name></person-group><article-title>Big data in healthcare: Management, analysis and future prospects</article-title><source>J Big Data</source><volume>6</volume><issue>54</issue><year>2019</year></element-citation></ref>
<ref id="b13-MI-4-6-00192"><label>13</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wachter</surname><given-names>RM</given-names></name></person-group><comment>Chapter 11. Other complications of healthcare. In: Understanding Patient Safety, 2e. The McGraw-Hill Companies, New York, NY, 2012.</comment></element-citation></ref>
<ref id="b14-MI-4-6-00192"><label>14</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ghosheh</surname><given-names>GO</given-names></name><name><surname>Alamad</surname><given-names>B</given-names></name><name><surname>Yang</surname><given-names>KW</given-names></name><name><surname>Syed</surname><given-names>F</given-names></name><name><surname>Hayat</surname><given-names>N</given-names></name><name><surname>Iqbal</surname><given-names>I</given-names></name><name><surname>Al Kindi</surname><given-names>F</given-names></name><name><surname>Al Junaibi</surname><given-names>S</given-names></name><name><surname>Al Safi</surname><given-names>M</given-names></name><name><surname>Ali</surname><given-names>R</given-names></name><etal/></person-group><article-title>Clinical prediction system of complications among patients with COVID-19: A development and validation retrospective multicentre study during first wave of the pandemic</article-title><source>Intell Based Med</source><volume>6</volume><issue>100065</issue><year>2022</year><pub-id pub-id-type="pmid">35721825</pub-id><pub-id pub-id-type="doi">10.1016/j.ibmed.2022.100065</pub-id></element-citation></ref>
<ref id="b15-MI-4-6-00192"><label>15</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>van Smeden</surname><given-names>M</given-names></name><name><surname>Reitsma</surname><given-names>JB</given-names></name><name><surname>Riley</surname><given-names>RD</given-names></name><name><surname>Collins</surname><given-names>GS</given-names></name><name><surname>Moons</surname><given-names>KG</given-names></name></person-group><article-title>Clinical prediction models: Diagnosis versus prognosis</article-title><source>J Clin Epidemiol</source><volume>132</volume><fpage>142</fpage><lpage>145</lpage><year>2021</year><pub-id pub-id-type="pmid">33775387</pub-id><pub-id pub-id-type="doi">10.1016/j.jclinepi.2021.01.009</pub-id></element-citation></ref>
<ref id="b16-MI-4-6-00192"><label>16</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>de Souza</surname><given-names>FSH</given-names></name><name><surname>Hojo-Souza</surname><given-names>NS</given-names></name><name><surname>dos Santos</surname><given-names>EB</given-names></name><name><surname>da Silva</surname><given-names>CM</given-names></name><name><surname>Guidoni</surname><given-names>DL</given-names></name></person-group><comment>Predicting the disease outcome in COVID-19 positive patients through machine learning: A retrospective cohort study with Brazilian data. medRxiv: 2020.2006.2026.20140764, 2020.</comment></element-citation></ref>
<ref id="b17-MI-4-6-00192"><label>17</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ezzoddin</surname><given-names>M</given-names></name><name><surname>Nasiri</surname><given-names>H</given-names></name><name><surname>Dorrigiv</surname><given-names>M</given-names></name></person-group><comment>Diagnosis of COVID-19 cases from chest X-ray images using deep neural network and LightGBM. IEEE, 2022.</comment></element-citation></ref>
<ref id="b18-MI-4-6-00192"><label>18</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pathak</surname><given-names>Y</given-names></name><name><surname>Shukla</surname><given-names>PK</given-names></name><name><surname>Tiwari</surname><given-names>A</given-names></name><name><surname>Stalin</surname><given-names>S</given-names></name><name><surname>Singh</surname><given-names>S</given-names></name></person-group><article-title>Deep transfer learning-based classification model for COVID-19 disease</article-title><source>IRBM</source><volume>43</volume><fpage>87</fpage><lpage>92</lpage><year>2022</year><pub-id pub-id-type="pmid">32837678</pub-id><pub-id pub-id-type="doi">10.1016/j.irbm.2020.05.003</pub-id></element-citation></ref>
<ref id="b19-MI-4-6-00192"><label>19</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yuan</surname><given-names>B</given-names></name></person-group><article-title>Towards a clinical efficacy evaluation system adapted for personalized medicine</article-title><source>Pharmgenomics Pers Med</source><volume>14</volume><fpage>487</fpage><lpage>496</lpage><year>2021</year><pub-id pub-id-type="pmid">33953600</pub-id><pub-id pub-id-type="doi">10.2147/PGPM.S304420</pub-id></element-citation></ref>
<ref id="b20-MI-4-6-00192"><label>20</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kotsiantis</surname><given-names>SB</given-names></name><name><surname>Zaharakis</surname><given-names>ID</given-names></name><name><surname>Pintelas</surname><given-names>PE</given-names></name></person-group><article-title>Machine learning: A review of classification and combining techniques</article-title><source>Artif Intell Rev</source><volume>26</volume><fpage>159</fpage><lpage>190</lpage><year>2006</year></element-citation></ref>
<ref id="b21-MI-4-6-00192"><label>21</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname><given-names>Y</given-names></name><name><surname>Xia</surname><given-names>W</given-names></name><name><surname>Huang</surname><given-names>J</given-names></name><name><surname>Ni</surname><given-names>B</given-names></name><name><surname>dong</surname><given-names>J</given-names></name><name><surname>Zhao</surname><given-names>Y</given-names></name><name><surname>Yan</surname><given-names>S</given-names></name></person-group><comment>CNN: Single-label to multi-label. ArXiv: abs/1406.5726, 2014.</comment></element-citation></ref>
<ref id="b22-MI-4-6-00192"><label>22</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Soofi</surname><given-names>AA</given-names></name><name><surname>Awan</surname><given-names>A</given-names></name></person-group><article-title>Classification techniques in machine learning: Applications and issues</article-title><source>J Basic Appl Sci</source><volume>13</volume><fpage>459</fpage><lpage>465</lpage><year>2017</year></element-citation></ref>
<ref id="b23-MI-4-6-00192"><label>23</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tsoumakas</surname><given-names>G</given-names></name><name><surname>Katakis</surname><given-names>I</given-names></name></person-group><article-title>Multi-label classification: An overview</article-title><source>Int J Data Warehous Min</source><volume>3</volume><fpage>1</fpage><lpage>13</lpage><year>2009</year></element-citation></ref>
<ref id="b24-MI-4-6-00192"><label>24</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Herrera</surname><given-names>F</given-names></name><name><surname>Charte</surname><given-names>F</given-names></name><name><surname>Rivera</surname><given-names>AJ</given-names></name><name><surname>del Jesus</surname><given-names>MJ</given-names></name></person-group><comment>Multilabel classification. In: Multilabel Classification: Problem Analysis, Metrics and Techniques. Herrera F, Charte F, Rivera AJ and del Jesus MJ (eds). Springer International Publishing, Cham, pp17-31, 2016.</comment></element-citation></ref>
<ref id="b25-MI-4-6-00192"><label>25</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname><given-names>Y</given-names></name><name><surname>Wong</surname><given-names>AKC</given-names></name><name><surname>Kamel</surname><given-names>MS</given-names></name></person-group><article-title>Classification of imbalanced data: A review</article-title><source>Int J Pattern Recognit Artif Intell</source><volume>23</volume><fpage>687</fpage><lpage>719</lpage><year>2009</year></element-citation></ref>
<ref id="b26-MI-4-6-00192"><label>26</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tarekegn</surname><given-names>AN</given-names></name><name><surname>Giacobini</surname><given-names>M</given-names></name><name><surname>Michalak</surname><given-names>K</given-names></name></person-group><article-title>A review of methods for imbalanced multi-label classification</article-title><source>Pattern Recognit</source><volume>118</volume><issue>107965</issue><year>2021</year></element-citation></ref>
<ref id="b27-MI-4-6-00192"><label>27</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Charte</surname><given-names>F</given-names></name><name><surname>Rivera</surname><given-names>AJ</given-names></name><name><surname>del Jesus</surname><given-names>MJ</given-names></name><name><surname>Herrera</surname><given-names>F</given-names></name></person-group><article-title>Dealing with difficult minority labels in imbalanced mutilabel data sets</article-title><source>Neurocomputing</source><volume>326-327</volume><fpage>39</fpage><lpage>53</lpage><year>2019</year></element-citation></ref>
<ref id="b28-MI-4-6-00192"><label>28</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Charte</surname><given-names>F</given-names></name><name><surname>Rivera</surname><given-names>A</given-names></name><name><surname>del Jesus</surname><given-names>MJ</given-names></name><name><surname>Herrera</surname><given-names>F</given-names></name></person-group><comment>A first approach to deal with imbalance in multi-label datasets. In: Pan JS, Polycarpou MM, Woźniak M, de Carvalho ACPLF, Quintián H and Corchado E (eds). Hybrid Artificial Intelligent Systems. HAIS 2013. Lecture Notes in Computer Science. Vol. 8073. Springer, Berlin, Heidelberg, pp150-160, 2013.</comment></element-citation></ref>
<ref id="b29-MI-4-6-00192"><label>29</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname><given-names>Y</given-names></name><name><surname>Giledereli</surname><given-names>B</given-names></name><name><surname>Köksal</surname><given-names>A</given-names></name><name><surname>Ozgur</surname><given-names>A</given-names></name><name><surname>Ozkirimli</surname><given-names>E</given-names></name></person-group><comment>Balancing methods for multi-label text classification with long-tailed class distribution. arXiv: 2109.04712, 2021.</comment></element-citation></ref>
<ref id="b30-MI-4-6-00192"><label>30</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Giraldo Forero</surname><given-names>AF</given-names></name><name><surname>Jaramillo-Garzón</surname><given-names>J</given-names></name><name><surname>Ruiz-Muñoz</surname><given-names>J</given-names></name><name><surname>Castellanos-Dominguez</surname><given-names>G</given-names></name></person-group><comment>Managing Imbalanced Data Sets in Multi-label Problems: A Case Study with the SMOTE Algorithm. In: Ruiz-Shulcloper J, Sanniti di Baja G (eds). Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2013. Lecture Notes in Computer Science. Vol. 8258. Springer, Berlin, Heidelberg, pp334-342, 2013.</comment></element-citation></ref>
<ref id="b31-MI-4-6-00192"><label>31</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tahir</surname><given-names>MA</given-names></name><name><surname>Kittler</surname><given-names>J</given-names></name><name><surname>Bouridane</surname><given-names>A</given-names></name></person-group><article-title>Multilabel classification using heterogeneous ensemble of multi-label classifiers</article-title><source>Pattern Recognit Lett</source><volume>33</volume><fpage>513</fpage><lpage>523</lpage><year>2012</year></element-citation></ref>
<ref id="b32-MI-4-6-00192"><label>32</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cao</surname><given-names>P</given-names></name><name><surname>Liu</surname><given-names>X</given-names></name><name><surname>Zhao</surname><given-names>D</given-names></name><name><surname>Zaiane</surname><given-names>O</given-names></name></person-group><comment>Cost sensitive ranking support vector machine for multi-label data learning. In: Abraham A, Haqiq A, Alimi A, Mezzour G, Rokbani N and Muda A (eds). Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016). HIS 2016. Advances in Intelligent Systems and Computing. Vol. 552. Springer, Cham, pp244-255, 2017.</comment></element-citation></ref>
<ref id="b33-MI-4-6-00192"><label>33</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Saleh</surname><given-names>M</given-names></name><name><surname>Ambrose</surname><given-names>JA</given-names></name></person-group><article-title>Understanding myocardial infarction</article-title><source>F1000Res</source><volume>7</volume><issue>1378</issue><year>2018</year><pub-id pub-id-type="pmid">30228871</pub-id><pub-id pub-id-type="doi">10.12688/f1000research.15096.1</pub-id></element-citation></ref>
<ref id="b34-MI-4-6-00192"><label>34</label><element-citation publication-type="journal"><comment>World Health Organization: Cardiovascular diseases, 2022.</comment></element-citation></ref>
<ref id="b35-MI-4-6-00192"><label>35</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Badimon</surname><given-names>L</given-names></name><name><surname>Vilahur</surname><given-names>G</given-names></name></person-group><article-title>Thrombosis formation on atherosclerotic lesions and plaque rupture</article-title><source>J Intern Med</source><volume>276</volume><fpage>618</fpage><lpage>632</lpage><year>2014</year><pub-id pub-id-type="pmid">25156650</pub-id><pub-id pub-id-type="doi">10.1111/joim.12296</pub-id></element-citation></ref>
<ref id="b36-MI-4-6-00192"><label>36</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Asada</surname><given-names>Y</given-names></name><name><surname>Yamashita</surname><given-names>A</given-names></name><name><surname>Sato</surname><given-names>Y</given-names></name><name><surname>Hatakeyama</surname><given-names>K</given-names></name></person-group><article-title>Thrombus formation and propagation in the onset of cardiovascular events</article-title><source>J Atheroscler Thromb</source><volume>25</volume><fpage>653</fpage><lpage>664</lpage><year>2018</year><pub-id pub-id-type="pmid">29887539</pub-id><pub-id pub-id-type="doi">10.5551/jat.RV17022</pub-id></element-citation></ref>
<ref id="b37-MI-4-6-00192"><label>37</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Shavadia</surname><given-names>JS</given-names></name><name><surname>Chen</surname><given-names>AY</given-names></name><name><surname>Fanaroff</surname><given-names>AC</given-names></name><name><surname>de Lemos</surname><given-names>JA</given-names></name><name><surname>Kontos</surname><given-names>MC</given-names></name><name><surname>Wang</surname><given-names>TY</given-names></name></person-group><article-title>Intensive care utilization in stable patients with ST-segment elevation myocardial infarction treated with rapid reperfusion</article-title><source>JACC Cardiovasc Interv</source><volume>12</volume><fpage>709</fpage><lpage>717</lpage><year>2019</year><pub-id pub-id-type="pmid">31000008</pub-id><pub-id pub-id-type="doi">10.1016/j.jcin.2019.01.230</pub-id></element-citation></ref>
<ref id="b38-MI-4-6-00192"><label>38</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Abrignani</surname><given-names>MG</given-names></name><name><surname>Dominguez</surname><given-names>LJ</given-names></name><name><surname>Biondo</surname><given-names>G</given-names></name><name><surname>Di Girolamo</surname><given-names>A</given-names></name><name><surname>Novo</surname><given-names>G</given-names></name><name><surname>Barbagallo</surname><given-names>M</given-names></name><name><surname>Braschi</surname><given-names>A</given-names></name><name><surname>Braschi</surname><given-names>G</given-names></name><name><surname>Novo</surname><given-names>S</given-names></name></person-group><article-title>In-hospital complications of acute myocardial infarction in hypertensive subjects</article-title><source>Am J Hypertens</source><volume>18</volume><fpage>165</fpage><lpage>170</lpage><year>2005</year><pub-id pub-id-type="pmid">15752942</pub-id><pub-id pub-id-type="doi">10.1016/j.amjhyper.2004.09.018</pub-id></element-citation></ref>
<ref id="b39-MI-4-6-00192"><label>39</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Malla</surname><given-names>RR</given-names></name><name><surname>Sayami</surname><given-names>A</given-names></name></person-group><article-title>In hospital complications and mortality of patients of inferior wall myocardial infarction with right ventricular infarction</article-title><source>JNMA J Nepal Med Assoc</source><volume>46</volume><fpage>99</fpage><lpage>102</lpage><year>2007</year><pub-id pub-id-type="pmid">18274563</pub-id></element-citation></ref>
<ref id="b40-MI-4-6-00192"><label>40</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Babaev</surname><given-names>A</given-names></name><name><surname>Frederick</surname><given-names>PD</given-names></name><name><surname>Pasta</surname><given-names>DJ</given-names></name><name><surname>Every</surname><given-names>N</given-names></name><name><surname>Sichrovsky</surname><given-names>T</given-names></name><name><surname>Hochman</surname><given-names>JS</given-names></name></person-group><comment>NRMI Investigators</comment><article-title>Trends in management and outcomes of patients with acute myocardial infarction complicated by cardiogenic shock</article-title><source>JAMA</source><volume>294</volume><fpage>448</fpage><lpage>454</lpage><year>2005</year><pub-id pub-id-type="pmid">16046651</pub-id><pub-id pub-id-type="doi">10.1001/jama.294.4.448</pub-id></element-citation></ref>
<ref id="b41-MI-4-6-00192"><label>41</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Golovenkin</surname><given-names>SE</given-names></name><name><surname>Gorban</surname><given-names>A</given-names></name><name><surname>Mirkes</surname><given-names>E</given-names></name><name><surname>Shulman</surname><given-names>VA</given-names></name><name><surname>Rossiev</surname><given-names>DA</given-names></name><name><surname>Shesternya</surname><given-names>PA</given-names></name><name><surname>Nikulina</surname><given-names>SY</given-names></name><name><surname>Orlova</surname><given-names>YV</given-names></name><name><surname>Dorrer</surname><given-names>MG</given-names></name></person-group><comment>Myocardial infarction complications Database, 2020.</comment></element-citation></ref>
<ref id="b42-MI-4-6-00192"><label>42</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname><given-names>J</given-names></name><name><surname>Leskovec</surname><given-names>J</given-names></name></person-group><article-title>Defining and evaluating network communities based on ground-truth</article-title><source>Knowl Inf Syst</source><volume>42</volume><fpage>181</fpage><lpage>213</lpage><year>2015</year></element-citation></ref>
<ref id="b43-MI-4-6-00192"><label>43</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname><given-names>SJ</given-names></name><name><surname>Zhou</surname><given-names>ZH</given-names></name></person-group><article-title>Multi-label learning by exploiting label correlations locally</article-title><source>Proc AAAI Conf Artif Intell</source><volume>26</volume><fpage>949</fpage><lpage>955</lpage><year>2021</year></element-citation></ref>
<ref id="b44-MI-4-6-00192"><label>44</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chakravarty</surname><given-names>A</given-names></name><name><surname>Sarkar</surname><given-names>T</given-names></name><name><surname>Ghosh</surname><given-names>N</given-names></name><name><surname>Sethuraman</surname><given-names>R</given-names></name><name><surname>Sheet</surname><given-names>D</given-names></name></person-group><article-title>Learning decision ensemble using a graph neural network for comorbidity aware chest radiograph screening</article-title><source>Annu Int Conf IEEE Eng Med Biol Soc</source><volume>2020</volume><fpage>1234</fpage><lpage>1237</lpage><year>2020</year><pub-id pub-id-type="pmid">33018210</pub-id><pub-id pub-id-type="doi">10.1109/EMBC44109.2020.9176693</pub-id></element-citation></ref>
<ref id="b45-MI-4-6-00192"><label>45</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Szymański</surname><given-names>P</given-names></name><name><surname>Kajdanowicz</surname><given-names>T</given-names></name><name><surname>Kersting</surname><given-names>K</given-names></name></person-group><article-title>How is a data-driven approach better than random choice in label space division for multi-label classification?</article-title><source>Entropy</source><volume>18</volume><issue>282</issue><year>2016</year></element-citation></ref>
<ref id="b46-MI-4-6-00192"><label>46</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Blondel</surname><given-names>VD</given-names></name><name><surname>Guillaume</surname><given-names>JL</given-names></name><name><surname>Lambiotte</surname><given-names>R</given-names></name><name><surname>Lefebvre</surname><given-names>E</given-names></name></person-group><article-title>Fast unfolding of communities in large networks</article-title><source>J Stat Mech</source><volume>2008</volume><issue>P10008</issue><year>2008</year></element-citation></ref>
<ref id="b47-MI-4-6-00192"><label>47</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hagberg</surname><given-names>A</given-names></name><name><surname>Schult</surname><given-names>DA</given-names></name><name><surname>Swart</surname><given-names>PJ</given-names></name></person-group><comment>Exploring network structure, dynamics, and function using NetworkX. In: Proceedings of the 7th Python in Science Conference (SciPy 2008), pp11-15, 2008.</comment></element-citation></ref>
<ref id="b48-MI-4-6-00192"><label>48</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Goutte</surname><given-names>C</given-names></name><name><surname>Gaussier</surname><given-names>E</given-names></name></person-group><comment>A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Losada DE, Fernández-Luna JM (eds). Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science. Vol. 3408. Springer, Berlin, Heidelberg, pp345-359, 2005.</comment></element-citation></ref>
<ref id="b49-MI-4-6-00192"><label>49</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname><given-names>T</given-names></name></person-group><comment>Machine learning basics. In: Dual Learning. Qin T (ed). Springer Singapore, Singapore, pp11-23, 2020.</comment></element-citation></ref>
<ref id="b50-MI-4-6-00192"><label>50</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sorower</surname><given-names>MS</given-names></name></person-group><comment>A literature survey on algorithms for multi-label learning. Oregon State University, Corvallis, 2010.</comment></element-citation></ref>
<ref id="b51-MI-4-6-00192"><label>51</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname><given-names>J</given-names></name><name><surname>Chen</surname><given-names>XY</given-names></name><name><surname>Zhang</surname><given-names>H</given-names></name><name><surname>Xiong</surname><given-names>LD</given-names></name><name><surname>Lei</surname><given-names>H</given-names></name><name><surname>Deng</surname><given-names>SH</given-names></name></person-group><article-title>Hyperparameter optimization for machine learning models based on Bayesian optimization</article-title><source>J Electron Sci Technol</source><volume>17</volume><fpage>26</fpage><lpage>40</lpage><year>2019</year></element-citation></ref>
<ref id="b52-MI-4-6-00192"><label>52</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Liashchynskyi</surname><given-names>P</given-names></name><name><surname>Liashchynskyi</surname><given-names>P</given-names></name></person-group><comment>Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv: 1912.06059, 2019.</comment></element-citation></ref>
<ref id="b53-MI-4-6-00192"><label>53</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Feurer</surname><given-names>M</given-names></name><name><surname>Hutter</surname><given-names>F</given-names></name></person-group><comment>Hyperparameter optimization. In: Automated Machine Learning: Methods, Systems, Challenges. Hutter F, Kotthoff L and Vanschoren J (eds). Springer International Publishing, Cham, pp3-33, 2019.</comment></element-citation></ref>
<ref id="b54-MI-4-6-00192"><label>54</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pedregosa</surname><given-names>F</given-names></name><name><surname>Varoquaux</surname><given-names>G</given-names></name><name><surname>Gramfort</surname><given-names>A</given-names></name><name><surname>Michel</surname><given-names>V</given-names></name><name><surname>Thirion</surname><given-names>B</given-names></name></person-group><article-title>Scikit-learn: Machine learning in Python</article-title><source>J Mach Learn Res</source><volume>12</volume><fpage>2825</fpage><lpage>2830</lpage><year>2011</year></element-citation></ref>
<ref id="b55-MI-4-6-00192"><label>55</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>T</given-names></name><name><surname>Guestrin</surname><given-names>C</given-names></name></person-group><comment>XGBoost: A scalable tree boosting system. KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp785-794, 2016.</comment></element-citation></ref>
<ref id="b56-MI-4-6-00192"><label>56</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Mason</surname><given-names>L</given-names></name><name><surname>Baxter</surname><given-names>J</given-names></name><name><surname>Bartlett</surname><given-names>P</given-names></name><name><surname>Frean</surname><given-names>M</given-names></name></person-group><article-title>Boosting algorithms as gradient descent</article-title><source>Adv Neural Inf Process Syst</source><volume>12</volume><year>1999</year></element-citation></ref>
<ref id="b57-MI-4-6-00192"><label>57</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Boehmke</surname><given-names>B</given-names></name><name><surname>Greenwell</surname><given-names>B</given-names></name></person-group><comment>Hands-on Machine Learning with R. Chapman and Hall/CRC, New York, NY, pp221-246, 2019.</comment></element-citation></ref>
<ref id="b58-MI-4-6-00192"><label>58</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Medar</surname><given-names>R</given-names></name><name><surname>Rajpurohit</surname><given-names>VS</given-names></name><name><surname>Rashmi</surname><given-names>B</given-names></name></person-group><comment>Impact of training and testing data splits on accuracy of time series forecasting in machine learning. In: 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA). IEEE, pp1-6. 2017.</comment></element-citation></ref>
<ref id="b59-MI-4-6-00192"><label>59</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sarker</surname><given-names>IH</given-names></name></person-group><article-title>Machine learning: Algorithms, real-world applications and research directions</article-title><source>SN Comput Sci</source><volume>2</volume><issue>160</issue><year>2021</year><pub-id pub-id-type="pmid">33778771</pub-id><pub-id pub-id-type="doi">10.1007/s42979-021-00592-x</pub-id></element-citation></ref>
<ref id="b60-MI-4-6-00192"><label>60</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Nti</surname><given-names>I</given-names></name><name><surname>Nyarko-Boateng</surname><given-names>O</given-names></name><name><surname>Aning</surname><given-names>J</given-names></name></person-group><article-title>Performance of machine learning algorithms with different K values in K-fold cross-validation</article-title><source>Int J Inf Technol Comput Sci</source><volume>6</volume><fpage>61</fpage><lpage>71</lpage><year>2021</year></element-citation></ref>
<ref id="b61-MI-4-6-00192"><label>61</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Refaeilzadeh</surname><given-names>P</given-names></name><name><surname>Tang</surname><given-names>L</given-names></name><name><surname>Liu</surname><given-names>H</given-names></name></person-group><comment>Cross-validation. In: Encyclopedia of Database Systems. Liu L and Özsu MT (eds). Springer US, Boston, MA, pp532-538, 2009.</comment></element-citation></ref>
<ref id="b62-MI-4-6-00192"><label>62</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sechidis</surname><given-names>K</given-names></name><name><surname>Tsoumakas</surname><given-names>G</given-names></name><name><surname>Vlahavas</surname><given-names>I</given-names></name></person-group><comment>On the stratification of multi-label data. In: Gunopulos D, Hofmann T, Malerba D and Vazirgiannis M (eds). Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science. Vol. 6913. Springer, Berlin, Heidelberg, pp145-158, 2011.</comment></element-citation></ref>
<ref id="b63-MI-4-6-00192"><label>63</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Szymański</surname><given-names>P</given-names></name><name><surname>Kajdanowicz</surname><given-names>T</given-names></name></person-group><article-title>A network perspective on stratification of multi-label data</article-title><source>Proc Mach Learn Res</source><volume>74</volume><fpage>22</fpage><lpage>35</lpage><year>2017</year></element-citation></ref>
<ref id="b64-MI-4-6-00192"><label>64</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>W</given-names></name><name><surname>Liu</surname><given-names>Y</given-names></name><name><surname>Liu</surname><given-names>W</given-names></name><name><surname>Tang</surname><given-names>ZR</given-names></name><name><surname>Dong</surname><given-names>S</given-names></name><name><surname>Li</surname><given-names>W</given-names></name><name><surname>Zhang</surname><given-names>K</given-names></name><name><surname>Xu</surname><given-names>C</given-names></name><name><surname>Hu</surname><given-names>Z</given-names></name><name><surname>Wang</surname><given-names>H</given-names></name><etal/></person-group><article-title>Machine learning-based prediction of lymph node metastasis among osteosarcoma patients</article-title><source>Front Oncol</source><volume>12</volume><issue>797103</issue><year>2022</year><pub-id pub-id-type="pmid">35515104</pub-id><pub-id pub-id-type="doi">10.3389/fonc.2022.797103</pub-id></element-citation></ref>
<ref id="b65-MI-4-6-00192"><label>65</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname><given-names>Z</given-names></name><name><surname>Wong</surname><given-names>HS</given-names></name><name><surname>Yu</surname><given-names>Z</given-names></name></person-group><article-title>Privacy-preserving federated learning with domain adaptation for multi-disease ocular disease recognition</article-title><source>IEEE J Biomed Health Inform</source><volume>28</volume><fpage>3219</fpage><lpage>3227</lpage><year>2024</year><pub-id pub-id-type="pmid">37590112</pub-id><pub-id pub-id-type="doi">10.1109/JBHI.2023.3305685</pub-id></element-citation></ref>
<ref id="b66-MI-4-6-00192"><label>66</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chawla</surname><given-names>NV</given-names></name></person-group><comment>Data mining for imbalanced datasets: An overview. In: Maimon O, Rokach L (eds). Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA, pp853-867, 2005.</comment></element-citation></ref>
<ref id="b67-MI-4-6-00192"><label>67</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chawla</surname><given-names>NV</given-names></name><name><surname>Bowyer</surname><given-names>KW</given-names></name><name><surname>Hall</surname><given-names>LO</given-names></name><name><surname>Kegelmeyer</surname><given-names>WP</given-names></name></person-group><article-title>SMOTE: Synthetic minority over-sampling technique</article-title><source>J Artif Intell Res</source><volume>16</volume><fpage>321</fpage><lpage>357</lpage><year>2002</year></element-citation></ref>
</ref-list>
</back>
<floats-group>
<fig id="f1-MI-4-6-00192" position="float">
<label>Figure 1</label>
<caption><p>Overview of the steps of the pipeline used herein.</p></caption>
<graphic xlink:href="mi-04-06-00192-g00.tif"/>
</fig>
<fig id="f2-MI-4-6-00192" position="float">
<label>Figure 2</label>
<caption><p>(A) Bar histogram depicting the distribution of label instances across the dataset. (B) Bar histogram depicting the distribution of the total number of labels assigned to each sample (a patient). Note that the number of labels exhibited by a patient is normally smaller than the total number of possible outcomes, as a single patient will not exhibit every possible outcome.</p></caption>
<graphic xlink:href="mi-04-06-00192-g01.tif"/>
</fig>
<fig id="f3-MI-4-6-00192" position="float">
<label>Figure 3</label>
<caption><p>Label graph with the labels represented as nodes; the thickness of each edge corresponds to the co-occurrence rate. Using the Louvain algorithm, three clusters were identified, denoted by purple, yellow and light blue colour. Graph nodes sharing a colour belong to the same cluster.</p></caption>
<graphic xlink:href="mi-04-06-00192-g02.tif"/>
</fig>
<fig id="f4-MI-4-6-00192" position="float">
<label>Figure 4</label>
<caption><p>Chart depicting the candidate classification algorithms and their respective performance as measured by the Hamming loss metric. CC, Classifier Chains; BRkNN, Binary Relevance k-Nearest Neighbor; RF, Random Forest; MLkNN, Multi-label k-Nearest Neighbor.</p></caption>
<graphic xlink:href="mi-04-06-00192-g03.tif"/>
</fig>
</floats-group>
</article>
