In the last post, we went over the basics of supervised machine learning. Next, we'll go over some of the algorithms that you can use for classification, including: Logistic Regression, KNN, SVM, Decision Trees, and Random Forest.
First of all, logistic regression isn't named super well. When you hear "regression", you would probably assume it predicts a continuous number, right? Wrong. Logistic Regression is used as a classification algorithm, like all of the algorithms below: it predicts what class or category a new data point belongs to using past data points. The main thing to remember for logistic regression is that it has to do with probability. When we talk about probability, we're talking about an estimate between 0 and 1, where 1 means a certain success and 0 means a certain failure. Logistic regression models the odds of something being a success, and when I say odds I mean the probability of success divided by the probability of failure. To turn that into a class, you pick a cutoff (a decision threshold). So say your cutoff is 95%: if the new data point's predicted probability comes out to 96%, you would count it as a success, and if it came out below 95%, you would count it as a failure.
You use logistic regression when your Y, aka dependent variable, is binary (or dichotomous if you want to get all fancy), meaning the outcome is one of two options. For example, yes or no to the success of a medical treatment, voting for or against a new law to pass, you get the picture.
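To make the probability, odds, and cutoff ideas concrete, here's a minimal sketch in plain Python. The weight and bias are hand-picked, hypothetical numbers (a real model would learn them from your past data points), and the helper names are mine, not from any library.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weight, bias):
    """Probability of success for a single feature value x."""
    return sigmoid(weight * x + bias)

def odds(p):
    """Probability of success divided by probability of failure."""
    return p / (1.0 - p)

def classify(p, threshold=0.95):
    """Return 1 (success) if p reaches the cutoff, otherwise 0 (failure)."""
    return 1 if p >= threshold else 0

# Assumption: hand-picked parameters purely for illustration.
WEIGHT, BIAS = 1.5, -3.0

p_high = predict_probability(4.0, WEIGHT, BIAS)  # sigmoid(3.0), roughly 0.953
p_low = predict_probability(3.0, WEIGHT, BIAS)   # sigmoid(1.5), roughly 0.818

print(classify(p_high), classify(p_low))  # 1 0
```

The first point clears the 95% cutoff and counts as a success, the second doesn't. In practice a cutoff of 50% is a more common default; 95% is just the number used in the example above.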
Let's go back and use the mango and apple example to explain how KNN works. Imagine a scatter plot with a bunch of data points, where each point is categorized as either mango or apple. Now imagine our new mystery fruit appears at a random point in the scatter plot. How would we know if it's a mango or an apple? Well, we could use KNN, of course! KNN stands for K-Nearest Neighbors. It works by looking at the k nearest points to the unlabeled point and then seeing which label occurs the most among them in order to classify the unlabeled point. Back to our fruit example: let's say k = 3, so we would look at the three closest data points to our mystery fruit. Let's say that 2 of those data points are mangos and 1 is an apple. Since mangos appear more frequently within those 3 data points, we'd classify the mystery fruit as a mango! Logically, you might wonder, "That seems nice and all, but how do you know what k to choose?" Luckily, the Python library scikit-learn helps a lot with this. You can simply make a for loop that changes the value of k and keeps track of the accuracy of each, then choose the k that performs best (sidenote: with two classes, it's good to keep k odd so there aren't any "ties" when you're looking at the closest data points).
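Here's a rough sketch of that idea: a tiny hand-rolled KNN plus the "try a few values of k" loop. I wrote it in plain Python so it runs without installing anything; the fruit coordinates are made up, and `knn_predict` and `accuracy` are my own helper names. With scikit-learn you'd typically reach for `KNeighborsClassifier(n_neighbors=k)` instead.

```python
import math
from collections import Counter

def knn_predict(train, new_point, k):
    """Label new_point by majority vote among its k nearest training points.

    train is a list of ((x, y), label) pairs.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(knn_predict(train, point, k) == label for point, label in test)
    return correct / len(test)

# Assumption: made-up fruit measurements, just for illustration.
train = [
    ((1.0, 1.0), "apple"), ((2.0, 1.5), "apple"), ((3.0, 3.0), "apple"),
    ((4.5, 4.0), "mango"), ((5.0, 5.0), "mango"), ((6.0, 5.5), "mango"),
]
test = [((1.5, 1.2), "apple"), ((5.5, 5.0), "mango"), ((2.8, 2.5), "apple")]

# The 3 nearest neighbors here are 2 mangos and 1 apple, so the vote says mango.
mystery_fruit = (3.8, 3.6)
print(knn_predict(train, mystery_fruit, k=3))  # mango

# Try several odd values of k and keep the one with the best accuracy.
scores = {k: accuracy(train, test, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On a data set this small every k does about equally well, so the loop is only there to show the shape of the search. On real data you'd score each k on points the model hasn't seen.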
Full disclaimer: stole all of these images from here as it's wonderful and simple and great.
Let's continue using our beloved mango and apple data set. Let's pretend it looks something like this:
Where the blue dots are mangos and the red dots are apples. We want to classify these into different groups. We could start by just trying to fit a line between the two groups like so:
But, as you can see, there are a number of different lines we could draw here that could technically make this work. Here's how SVM (support vector machines) comes to the rescue:
SVM works by putting the line where the gap on each side is the largest, like above. However, what happens when our points aren't nicely separated like so?
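I'm glad you asked, because SVM can handle that as well. There's this pretty neat trick that SVM uses called kernels. A kernel lets SVM treat your data as if it had been mapped into a higher-dimensional space, where it can become linearly separable, aka separable by a straight line (or its higher-dimensional equivalent) like above. This works because points that can't be split by a line in their original dimensions can often be split cleanly once you add dimensions. Don't freak out. It looks like this: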
That green thing is called a hyperplane: it's the higher-dimensional version of the dividing line. Now your data is nicely classified and separated. Voila, y'all:
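If the "higher dimensions" part still feels hand-wavy, here's a small sketch of the idea in plain Python. It uses made-up 1-D data and an explicit mapping, x to (x, x squared), that I picked for illustration. A real SVM kernel gets the same effect without ever building the new coordinates, so treat this as a picture of the intuition, not how SVM libraries actually do it.

```python
def lift(x):
    """Map a 1-D point into 2-D by adding x squared as a second coordinate."""
    return (x, x * x)

def separable_by_cutoff(lows, highs):
    """True if one cutoff puts every value in lows below every value in highs."""
    return max(lows) < min(highs)

# Assumption: made-up 1-D data where apples sit in the middle
# and mangos sit on both sides, so no single cutoff splits them.
apples = [-1.0, -0.5, 0.0, 0.5, 1.0]
mangos = [-3.0, -2.5, 2.5, 3.0]

in_1d = separable_by_cutoff(apples, mangos) or separable_by_cutoff(mangos, apples)

# After lifting, compare the new second coordinate (the "height").
apple_heights = [lift(x)[1] for x in apples]
mango_heights = [lift(x)[1] for x in mangos]
in_2d = separable_by_cutoff(apple_heights, mango_heights)

# Put the dividing line in the middle of the gap, in the spirit of SVM's widest margin.
cutoff = (max(apple_heights) + min(mango_heights)) / 2

print(in_1d, in_2d, cutoff)  # False True 3.625
```

In one dimension the fruits can't be split, but once each point gets a height of x squared, a flat line at 3.625 separates every apple from every mango.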
This is much longer than I anticipated, so I'm just going to make a part 3.