In the last post, we went over the basics of supervised machine learning. Next, we'll go over some of the algorithms that you can use for classification, including: Logistic Regression, KNN, SVM, Decision Trees, and Random Forest.
First of all, logistic regression isn't named super well. When you think of regression, you'd probably assume it involves doing some type of regression, right? Wrong. Logistic regression is a classification algorithm, like all of the algorithms below: it predicts what class or category a new data point belongs to using past data points. The main thing to remember for logistic regression is that it has to do with probability. When we talk about probability, we're talking about an estimate in the range of 0 to 1, where 1 is a success and 0 is a failure. Logistic regression predicts the probability of something being a success, and from that you can get the odds, meaning the probability of success divided by the probability of failure. You then pick a classification threshold: say yours is 0.95, then if the model gives a new data point a probability of 0.96, you'd count it as a success, and as a failure if it came out below 0.95.
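To make the probability/odds/threshold distinction concrete, here's a tiny sketch. The 0.96 prediction and 0.95 threshold are the hypothetical numbers from the paragraph above:

```python
# Sketch: probability vs. odds, and applying a classification threshold.
p = 0.96                    # hypothetical predicted probability of success
odds = p / (1 - p)          # odds = P(success) / P(failure)
threshold = 0.95            # hypothetical cutoff from the example above
label = "success" if p >= threshold else "failure"
print(odds, label)
```

Note that a probability of 0.96 corresponds to odds of 24 to 1, which is why odds and probability are easy to mix up but very different numbers.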
You use logistic regression when your Y, aka dependent variable, is binary (or dichotomous if you want to get all fancy), meaning the outcome is one of two options. For example: whether a medical treatment succeeded or failed, or whether someone voted for or against a new law. You get the picture.
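Here's what that looks like in scikit-learn with a made-up binary outcome (the feature and labels below are invented for illustration, not from a real study):

```python
# A minimal sketch: logistic regression on a binary (0/1) outcome.
# Hypothetical data: weeks of treatment vs. whether it succeeded.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # weeks of treatment
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                  # 1 = success, 0 = failure

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(failure), P(success)] for each point
prob_success = model.predict_proba([[6.5]])[0][1]
predicted_class = model.predict([[6.5]])[0]
```

`predict_proba` gives you the probability we talked about above, while `predict` applies the default 0.5 threshold for you.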
Let's go back and use the mango and apple example to explain how KNN works. Imagine a scatter plot with a bunch of data points, each categorized as either mango or apple. Now imagine our new mystery fruit appears at a random point in the scatter plot. How would we know if it's a mango or an apple? Well, we could use KNN, of course! KNN stands for K-Nearest Neighbors. It works by looking at the k nearest points to an unlabeled point and seeing which class occurs most among them in order to classify it. Back to our fruit example: let's say k = 3, so we look at the three closest data points to our mystery fruit. If 2 of those data points are mangos and 1 is an apple, then since mangos appear more frequently among those 3 points, we'd classify the mystery fruit as a mango! Logically, you might wonder, "That seems nice and all, but how do you know what k to choose?" Luckily, the Python module scikit-learn helps a lot with this. You can simply make a for loop that changes the value of k and keeps track of the accuracy of each, then choose the k that performs best (sidenote: it's good to keep k odd so there aren't any "ties" when you're looking at the closest data points).
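The k-selection loop described above could be sketched like this. The fruit measurements (weight in grams, width in cm) are made up for illustration:

```python
# Sketch: try several odd k values on a toy mango/apple data set
# and keep the one with the best accuracy on held-out points.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = [[150, 7], [160, 8], [155, 7], [145, 7],   # mangos (hypothetical)
     [120, 6], [110, 6], [115, 5], [125, 6]]   # apples (hypothetical)
y = ["mango"] * 4 + ["apple"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_k, best_score = None, -1.0
for k in [1, 3, 5]:  # odd values of k, to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = knn.score(X_test, y_test)  # accuracy on the held-out points
    if score > best_score:
        best_k, best_score = k, score
```

With a real data set you'd want more data and cross-validation rather than a single split, but the idea is the same: loop over candidate k values and keep the winner.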
Full disclaimer: stole all of these images from here as it's wonderful and simple and great.
Let's continue using our beloved mango and apple data set. Let's pretend it looks something like this: