Over the last two weeks we've spent a lot of time going over a few different classification algorithms in machine learning, specifically within supervised learning. So far we've covered: Logistic Regression, Decision Trees, KNN, SVM, and Random Forest. I know all of these may sound really intimidating if you've never studied statistics or data science before, but I'm going to do my best to explain each of them as clearly as I can.
I figure it might be helpful to start from the beginning with a brief intro to machine learning. So let's imagine you have two groups of fruits: mangos and apples. (Aren't mangos just the best?) Imagine one day you go to the grocery store and you see a fruit that you've never seen before. How would you know if it's a mango or an apple? (For the sake of this example, assume it has to be one of the two.) To start, you'd probably notice a few things about this new mystery fruit, for example: its color, size, texture, and shape. After noticing these things, you'd be able to say that this mystery fruit is an apple! Easy, right?
This basic example is the idea behind supervised machine learning: we're training a computer to make these distinctions. It's called supervised because we give the computer a big set of data that is already labeled correctly. For example, 1,000 measurements of mangos and apples with the various features (color, size, texture, and shape), so that when our new mystery fruit appears, the computer can predict it's an apple based on the dataset it has already seen. Unsupervised learning, on the other hand, would mean handing the computer 1,000 mystery fruits and telling it to sort them into the different fruit groups automatically. See the difference? In the next post I'll go over some of the different classification algorithms that the computer uses to classify new items in order to make a prediction.
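To make the idea a little more concrete, here's a tiny sketch of supervised classification in Python. The fruit measurements are made-up numbers (just two features standing in for size and shape), and the "algorithm" is the simplest one imaginable: predict the label of whichever labeled fruit is closest to the mystery fruit. This is basically a one-neighbor version of KNN, which we'll get to properly later.

```python
import math

# Toy labeled dataset: each fruit is (size_cm, roundness) plus a label.
# These numbers are invented for illustration -- a real dataset would
# hold actual measurements of color, size, texture, and shape.
labeled_fruits = [
    ((10.0, 0.60), "mango"),
    ((11.0, 0.55), "mango"),
    ((9.5, 0.65), "mango"),
    ((7.0, 0.95), "apple"),
    ((7.5, 0.90), "apple"),
    ((8.0, 0.92), "apple"),
]

def classify(mystery, dataset):
    """Predict a label by finding the nearest labeled example."""
    # Euclidean distance between the mystery fruit's features and
    # each labeled fruit's features; the closest one wins.
    nearest = min(dataset, key=lambda item: math.dist(mystery, item[0]))
    return nearest[1]

# A new fruit that measures a lot like the apples above:
print(classify((7.2, 0.93), labeled_fruits))  # prints "apple"
```

That's the whole "supervised" loop in miniature: labeled examples go in, and the computer uses them to label something it has never seen before.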