Machine Learning Algorithms: Which One to Choose for Your Problem
- by 7wData
When I was beginning my way in data science, I often faced the problem of choosing the most appropriate algorithm for my specific problem. If you’re like me, when you open some article about Machine Learning algorithms, you see dozens of detailed descriptions. The paradox is that they don’t ease the choice.
In this article, I will try to explain basic concepts and give some intuition of using different kinds of Machine Learning algorithms in different tasks. At the end of the article, you’ll find the structured overview of the main features of described algorithms.
First of all, you should distinguish 4 types of Machine Learning tasks:
Supervised learning is the task of inferring a function from labeled training data. By fitting to the labeled training set, we want to find the most optimal model parameters to predict unknown labels on other objects (test set). If the label is a real number, we call the task regression. If the label is from the limited number of values, where these values are unordered, then it’s classification.
In unsupervised learning we have less information about objects, in particular, the train set is unlabeled. What is our goal now? It’s possible to observe some similarities between groups of objects and include them in appropriate clusters. Some objects can differ hugely from all clusters, in this way we assume these objects to be anomalies.
Semi-supervised learning tasks include both problems we described earlier: they use labeled and unlabeled data. That is a great opportunity for those who can’t afford labeling their data. The method allows us to significantly improve accuracy, because we can use unlabeled data in the train set with a small amount of labeled data.
Reinforcement learning is not like any of our previous tasks because we don’t have labeled or unlabeled datasets here. RL is an area of machine learning concerned with how software agents ought to take actions in some environment to maximize some notion of cumulative reward.
Imagine, you’re a robot in some strange place, you can perform the activities and get rewards from the environment for them. After each action your behavior is getting more complex and clever, so you are training to behave the most effective way on each step. In biology, this is called adaptation to natural environment.
Now that we have some intuition about types of machine learning tasks, let’s explore the most popular algorithms with their applications in real life.
These are probably the simplest algorithms in machine learning. You have features x1,…xn of objects (matrix A) and labels (vector b). Your goal is to find the most optimal weights w1,…wn and bias for these features according to some loss function, for example, MSE or MAE for a regression problem. In the case of MSE there is a mathematical equation from the least squares method:
In practice, it’s easier to optimize it with gradient descent, that is much more computationally efficient. Despite the simplicity of this algorithm, it works pretty well when you have thousands of features, for example, bag of words or n-gramms in text analysis. More complex algorithms suffer from overfitting many features and not huge datasets, while linear regression provides decent quality.
To prevent overfitting we often use regularization techniques like lasso and ridge. The idea is to add the sum of modules of weights and the sum of squares of weights, respectively, to our loss function. Read the great tutorial on these algorithms at the end of the article.
Don’t confuse these classification algorithms with regression methods for using “regression” in its title. Logistic regression performs binary classification, so the label outputs are binary. Let’s define P(y=1|x) as the conditional probability that the output y is 1 under the condition that there is given the input feature vector x. The coefficients w are the weights that the model wants to learn.
[Social9_Share class=”s9-widget-wrapper”]
Upcoming Events
From Text to Value: Pairing Text Analytics and Generative AI
21 May 2024
5 PM CET – 6 PM CET
Read More