"A Few Useful Things to Know about Machine Learning" by Pedro Domingos
The article summarizes 12 key lessons that machine learning researchers and practitioners have learned.
I have listed those 12 lessons below, each followed by my own understanding of it.
1. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
The key point to note here is that when we have to decide which algorithm to use for an application, dividing algorithms into the above three components helps in the process. Most learning algorithms are simply different combinations of these three components, e.g. representation (neural network) + evaluation (accuracy) + optimization (gradient descent).
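As a hedged illustration of the decomposition (my own sketch, not from the article), the snippet below uses a logistic-regression representation, accuracy as the evaluation, and plain gradient descent as the optimization; the toy data and hyperparameters are assumptions.

# Minimal sketch: representation = logistic regression, evaluation = accuracy,
# optimization = gradient descent. Data and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy labels

w, b, lr = np.zeros(2), 0.0, 0.1           # representation: parameters of a linear model

def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid of a linear score

for _ in range(500):                       # optimization: gradient descent on log loss
    p = predict_proba(X)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

accuracy = np.mean((predict_proba(X) >= 0.5) == y)   # evaluation: accuracy
print("training accuracy:", accuracy)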
2. IT'S GENERALIZATION THAT COUNTS
If an algorithm does well on the training set but does not give high accuracy on the test set, we can say the algorithm does not generalize well. It is critical that the algorithm performs well on test data that was not exposed to it during training; that is the true indicator of its performance.
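A minimal sketch of this (my own illustration, assuming scikit-learn and one of its bundled datasets): hold out a test set the model never sees during training and report accuracy on that.

# Sketch: score the model only on data it never saw during training.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy :", model.score(X_test, y_test))   # the number that actually counts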
3. DATA ALONE IS NOT ENOUGH
The point here is to leverage prior knowledge and apply it in the application, since sample data alone will not be enough to cover most of the cases found in the unseen population. It is always good to apply domain knowledge along with the data in machine learning, wherever possible.
4. OVERFITTING HAS MANY FACES.
As we know, a high-variance model may give high accuracy on the training data; however, due to overfitting, it learns random effects in the training data and carries that information over, making incorrect predictions on the test set. Hence, it is recommended to aim for low variance and low bias to predict more accurately on test data. As is often the case, bias and variance are two sides of a balancing act, and the idea is to choose the trade-off between them that is optimal for the problem at hand.
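As a small sketch of this trade-off (my own illustration on an assumed dataset): an unconstrained decision tree memorizes the training set, while limiting its depth adds a little bias and often generalizes better.

# Sketch: an unconstrained tree (high variance) memorizes the training set;
# a depth-limited tree (a little more bias) often does better on the test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")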
5. INTUITION FAILS IN HIGH DIMENSIONS.
As we go into higher dimensions, i.e. add more features, the net effect is that the value (contribution to the prediction) of each feature diminishes, given a fixed number of examples. Additionally, the fraction of the input space (the entire population over the selected features) covered by the available training examples shrinks considerably, giving rise to high uncertainty in the prediction.
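A tiny numerical sketch of why intuition fails (my own illustration; the dimensions and sample size are arbitrary assumptions): as dimensionality grows, the nearest and farthest random points become almost equally far away, so similarity-based reasoning breaks down.

# Sketch: in high dimensions, distances between random points concentrate,
# so the "nearest" neighbour is barely nearer than the farthest one.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(500, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from one point to all others
    print(f"d={d:4d}: nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")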
6. THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
An example here is that the theoretically derived number of examples required for good generalization may not always hold in practice. For instance, adding features can increase the number of examples needed exponentially in practice, even though a theoretical guarantee may suggest the requirement grows only logarithmically.
7. FEATURE ENGINEERING IS THE KEY.
It is often noticed that the success of machine learning projects depends heavily on which features are selected and how they are integrated into the model. Selected features often have complex relationships with each other, which increases the difficulty of identifying and keeping the most appropriate ones. Because of this, feature engineering is mostly domain specific.
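As a hedged sketch of domain-specific feature engineering (the column names and derived features below are hypothetical, purely for illustration):

# Sketch: deriving domain-inspired features from raw columns.
# The column names and derived features are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "signup_date": pd.to_datetime(["2021-01-05", "2021-03-20"]),
    "last_login":  pd.to_datetime(["2021-06-01", "2021-03-25"]),
    "purchases":   [12, 1],
    "visits":      [40, 9],
})

features = pd.DataFrame({
    "days_active":         (raw["last_login"] - raw["signup_date"]).dt.days,
    "purchases_per_visit": raw["purchases"] / raw["visits"],
    "signup_month":        raw["signup_date"].dt.month,   # seasonality often matters
})
print(features)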
8. MORE DATA BEATS CLEVERER ALGORITHM.
Machine learning is all about the data doing the heavy lifting. Given a choice between refining the algorithm and getting more data, whether row-wise (more examples) or column-wise (more features, which brings in the curse of dimensionality), it is recommended to go for more data, since far more data is available today than before. However, with more data the learning time grows, and time is now the major constraint. Hence it is also important to have a cleverer algorithm that can capture the information in less time.
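A quick sketch of the point (my own illustration on an assumed dataset): the same simple model, trained on growing slices of the data, usually gains more from extra examples than from extra cleverness.

# Sketch: the same model trained on growing slices of the data;
# more examples usually buy more accuracy than a cleverer algorithm would.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in [100, 300, 1000, len(X_train)]:
    model = LogisticRegression(max_iter=5000).fit(X_train[:n], y_train[:n])
    print(f"{n:5d} training examples -> test accuracy {model.score(X_test, y_test):.3f}")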
9. LEARN MANY MODELS, NOT JUST ONE.
It has been found that a combination of learners gives better results than using one learner alone. This is also known as an ensemble of learners/classifiers. There are methods already in use such as bagging, boosting and stacking (a short sketch of all three follows after the list). More details on them as follows:
Bagging: resample from the training set, build a classifier on each resample, and average their outputs; this reduces variance while only slightly increasing bias.
Boosting: here training examples carry weights, and each new classifier is built to focus on the examples the previous ones predicted wrongly, by giving those examples more weight.
Stacking: here the outputs of the individual learners become the inputs of a higher-level learner that learns how to combine the different classifiers.
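As a hedged sketch of the three schemes (my own illustration using scikit-learn's off-the-shelf ensemble classes on an assumed dataset):

# Sketch: bagging, boosting and stacking from off-the-shelf scikit-learn pieces.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensembles = {
    "bagging":  BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000)),    # the higher-level combiner
}

for name, model in ensembles.items():
    model.fit(X_train, y_train)
    print(f"{name:8s}: test accuracy {model.score(X_test, y_test):.3f}")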
Sometimes people confuse Bayesian model averaging (BMA) with the above ensemble methods; BMA is simply an average of the predicted outputs, weighted by how well each classifier explains the training examples and how much we believe in it a priori.
10. SIMPLICITY DOES NOT IMPLY ACCURACY
The usual assumption is that if two classifiers give the same training error, the simpler one will give better predictions; in reality, this may not be true.
The reason for choosing simplicity is that it is a virtue in its own right, not because of a hypothetical connection with accuracy.
11. REPRESENTABLE DOES NOT IMPLY LEARNABLE
In machine learning, it is critical how much the algorithm can actually learn from the data. Sometimes a favourite representation is used for most problems, with the assumption that it will capture all the functions underlying different sets of data. However, the fact is that some representations are better at identifying the inherent functions than others, so it is vital to focus on the ability to learn over the ease of using a particular representation.
12. CORRELATION DOES NOT IMPLY CAUSATION
Knowing the effect of an action, rather than just the correlation between two variables, is what makes causation more important than correlation. However, it is still helpful to know the correlation, since other important relationships can be derived from it.
CONCLUSION:
As in any discipline, folk wisdom, which I relate to "common sense", is crucial for the success of machine learning as well.
******************************Thank You*********************************