Class Based Variable Importance for Medical Decision Making
Journal Title: Biomedical Journal of Scientific & Technical Research (BJSTR) - Year 2017, Vol 1, Issue 5
Abstract
In this paper we explore variable importance within tree-based modeling, discussing its strengths and weaknesses with regard to medical inference and actionability. While variable importance is useful in understanding how strongly a variable influences a tree, it does not convey how variables relate to different classes of the target variable. Given that both prediction and inference are important for successful machine learning in the medical setting, a new measure capturing variable importance with regard to classes is essential. A measure calculated from the paths of training instances through the tree is defined, and its initial performance on benchmark datasets is explored.

Tree-based methods are commonly used with medical datasets, the goal being to create a predictive model of one variable from several input variables. The basic algorithm consists of a single tree: the input starts at the root node and follows a path down the tree, choosing a branch according to a splitting decision at each interior node [1]. The prediction is made by the leaf node in which the path ends, either the majority class or the average of the node, depending on whether the problem is classification or regression. Several implementations exist, such as ID3 [1,2], C4.5 [1,3] and CART (Classification and Regression Trees) [2], with CART being the implementation in Python's scikit-learn machine learning library used in this analysis. More sophisticated algorithms build on the single tree by forming an ensemble of thousands of trees and pooling their predictions into a single final prediction. Prominent among these are Random Forests [3], Extra Trees [4-9], and Gradient Boosted Trees [6].

Tree-based modeling is popular because it is easy to use, readily supports multi-class prediction, and is better equipped to deal with small n, large p problems, where the number of observations is much smaller than the number of variables. The small n, large p issue is especially relevant in certain medical domains, such as genetic data [5], where hundreds or thousands of measurements can be taken on a handful of patients in a single study. Traditional modeling in this setting, while possible, will likely find a multiplicity of models with comparable error estimates [4].

One major drawback of tree-based learning is the lack of interpretability of model behavior. Machine learning can be used for two purposes: prediction and inference. Trees are excellent for prediction; for inference, however, they fall short. Building a single tree, we can examine the set of branching rules to gather insight, but a single tree is typically a poor predictor. Prediction can be improved by aggregating over hundreds of trees, but in doing so the ability to infer disappears. Regression models, while more rigid in predictive power given that only a single model is made, are straightforward for inference and thus easy to convey to decision makers. The coefficient from such a model can be explained as the strength of the effect of the given variable on the target variable: a positive coefficient represents a positive effect, and a negative coefficient represents a negative effect. When trying to determine a course of treatment designed to change an outcome, such as treating a patient given a poor prognosis by a model, inference can be argued to be just as important for the medical practitioner.
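As a concrete illustration of the workflow described above, the following minimal sketch fits a single CART tree and a Random Forest with scikit-learn and reads the standard, class-agnostic variable importance. The dataset, parameter values, and printed slices are illustrative assumptions, not taken from the paper.

    # Minimal sketch: a single CART tree vs. an ensemble, plus the standard
    # impurity-based variable importance (no notion of class or direction).
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)   # illustrative benchmark dataset

    # Single tree: interpretable branching rules, but usually a weak predictor.
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Ensemble: stronger prediction pooled over many trees, harder to inspect.
    forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

    # Overall importance per input variable, with no link to target classes.
    print(tree.feature_importances_[:5])
    print(forest.feature_importances_[:5])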
In this context, a model should not only be able to detect a disease, but also provide insight as to why it detected the disease in order to treat it. This issue of inference has been overlooked in the quest for more accurate prediction. The main measure used, variable importance, provides some insight into how variables affect the overall model, but it does not reveal how variables interact with the target. Some work using variable importance moves in this direction, such as understanding the effects of correlated input variables [10-15], adjusting for imbalanced class sizes [10], measuring variable interactions [11], and variable selection [1,8], but it still does not fully answer the question of how the features affect a given outcome. In classification problems, this question is essential for improving the usability of trees in the medical setting. What we desire is a new measure that conveys how a variable is important with regard to the target variable. In this paper, we raise this question for consideration and offer an initial approach for bridging the gap between prediction and inference. A sketch of what such a path-based, class-wise measure could look like is given after the outline below.

The paper is structured as follows: First, we outline the general approach for building a decision tree. Next, we explore the standard ways of interpreting a tree, both for a single tree and for an ensemble model. We then define a new measure, Class Variable Importance, to capture the strength of the effect of a variable with regard to different classes. Next, we explore the calculation of this new measure on several benchmark datasets. The final section concludes and proposes further areas for research.
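Since the abstract only states that the measure is calculated from the paths of training instances through the tree, the sketch below is an illustrative approximation of that idea, not necessarily the paper's exact definition: each training instance walks its decision path through a fitted scikit-learn tree, and every split feature on that path is credited to the instance's class. The dataset and the final normalization are assumptions made for the example.

    # Hedged sketch of a path-based, class-wise importance score,
    # assuming a fitted scikit-learn CART tree. Illustrative only.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)   # illustrative benchmark dataset
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)

    label_to_idx = {label: i for i, label in enumerate(clf.classes_)}
    scores = np.zeros((len(clf.classes_), X.shape[1]))

    node_indicator = clf.decision_path(X)        # sparse matrix: sample x node
    split_feature = clf.tree_.feature            # feature per node, -2 at leaves

    for i in range(X.shape[0]):
        path = node_indicator.indices[
            node_indicator.indptr[i]:node_indicator.indptr[i + 1]
        ]
        for node in path:
            f = split_feature[node]
            if f >= 0:                           # interior split node, not a leaf
                scores[label_to_idx[y[i]], f] += 1.0

    scores /= scores.sum(axis=1, keepdims=True)  # normalize within each class
    print(scores[:, :5])                         # per-class importance, first 5 features

Unlike the class-agnostic feature_importances_ shown earlier, this tabulation yields one importance profile per class, which is the kind of class-level insight the paper argues is needed for medical inference.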
Authors and Affiliations
Danielle Baghernejad