Decision Trees
Ashley Williams
Dr. Bowers
ASPIRE Program Research
Final Paper
This summer our project focused on data mining, which tries to find useful patterns and information within large databases. Our goal was to predict the outcome of bank marketing for a term deposit by using a data mining tool known as classification. This paper outlines the steps and information needed to perform the project and includes an introduction to trees, our methodology, the database and problem at hand, the results, and the conclusion. In short, the project found that the average yearly balance in euros was the most important factor in predicting the marketing for a term deposit.
To begin with, it is important to understand a basic tree. A tree is a connected undirected graph with no simple circuits. More specifically, a rooted tree is a tree in which one vertex has been designated as the root and every edge is directed away from the root. Basic terminology is used to describe the different parts of a tree. For a binary tree, a tree that splits into only two nodes each time it branches, the most widely used terms are root node, parent node, left child, right child, and leaf node. The root node is the topmost node in the tree. It is a parent node, as is any other node that has a child or descendants. Since a binary tree splits into only two nodes, the children are known as the left child and the right child. A leaf node is a node that has no child nodes. In data mining, decisions are made based upon leaf nodes. This basic knowledge of a tree is very useful for explaining the decision tree method of classification.
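The tree terminology above can be made concrete with a short Python sketch. This is only an illustration and was not part of the project: the `Node` class, the `leaves` helper, and the node labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One vertex of a binary tree (hypothetical illustration class)."""
    label: str
    left: Optional["Node"] = None    # left child, if any
    right: Optional["Node"] = None   # right child, if any

    def is_leaf(self) -> bool:
        # A leaf node has no child nodes.
        return self.left is None and self.right is None

    def is_parent(self) -> bool:
        # Any node with at least one child is a parent node.
        return not self.is_leaf()


def leaves(node: Optional[Node]) -> List[str]:
    """Collect the labels of all leaf nodes, from left to right."""
    if node is None:
        return []
    if node.is_leaf():
        return [node.label]
    return leaves(node.left) + leaves(node.right)


# The root is the topmost node; every other node is reached by moving away from it.
root = Node(
    "root",
    left=Node("A", left=Node("C"), right=Node("D")),
    right=Node("B"),
)

print(root.is_parent())  # True: the root has children
print(leaves(root))      # ['C', 'D', 'B']
```

Here the root and node A are parent nodes, while C, D, and B are leaf nodes, which is where a decision tree would place its decisions.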
As stated above, we chose the tool of classification to carry out our data mining project. Data mining, however, offers many different techniques, among them clustering, genetic algorithms, neural networks, and decision trees. We considered decision trees the most appropriate method for the project at hand. The decision tree is a supervised learning tool for use in classification, which means that the researcher knows the target within the data being modeled. In contrast, clustering is an unsupervised learning tool, in which the researcher does not have a specific target.
In mathematics, trees are studied in discrete mathematics, particularly graph theory. A decision tree is a collection of decision nodes connected by branches that extend downward from the root node until they terminate in leaf nodes. Each leaf node classifies the data that reach it, which allows the researcher to state a decision rule. Decision trees for classification have some of their origins in machine learning, an area of computer science; statisticians later began contributing to the data mining field as well.
For this project we used XL Miner, a data mining add-in tool for Excel. The classification algorithm in XL Miner is built on the theory of Classification and Regression Trees (CART), which was developed by Breiman, Friedman, Olshen, and Stone in 1984. An important attribute of CART is that only binary splits of the data occur, and XL Miner shares this feature.
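To illustrate what a single binary split looks like, the sketch below searches one numeric attribute for the threshold that best separates two classes. It is a minimal sketch under stated assumptions, not XL Miner's implementation: it scores splits with the Gini impurity, a criterion commonly associated with CART, and the function names and the six data points are made up for the example and are not taken from the project's database.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values: List[float], labels: List[str]) -> Tuple[float, float]:
    """Return (threshold, weighted impurity) for the best rule 'value <= threshold'.

    If all values are identical there is nothing to split on, and the
    returned impurity is infinity.
    """
    n = len(values)
    distinct = sorted(set(values))
    best_threshold, best_score = distinct[0], float("inf")
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each side's impurity by the share of records it receives.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up toy data (assumption): a balance in euros and a yes/no target.
balances = [100.0, 250.0, 400.0, 1200.0, 1500.0, 3000.0]
subscribed = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_binary_split(balances, subscribed)
print(threshold, impurity)  # 800.0 0.0
```

On this toy data the split "balance <= 800" sends every "no" record to the left child and every "yes" record to the right child, so both children are pure leaf nodes and each yields a decision rule. A full tree would repeat the same search inside each child until a stopping condition is met.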