When I am looking at a dataset, the first thing I ask myself is “What is the question I am trying to get an answer to?” The answer may very well be in the data, but the question is of utmost importance.
The next question I ask myself is, “Is this data ready for me to start asking it the question?” Most likely it is not, because *real* data is dirty data, and you need to clean it up before you can ask it questions.
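As a rough illustration of that cleaning step, here is a minimal sketch using only the standard library. The records, field names, and cleaning rules are all hypothetical; real cleaning is driven by whatever is actually dirty in your data.

```python
def clean_rows(rows):
    """Toy cleaning pass: drop rows with missing values, strip and
    normalize strings, and coerce numeric strings to numbers --
    the kind of scrubbing *real* data usually needs."""
    cleaned = []
    for row in rows:
        if any(v is None or v == "" for v in row.values()):
            continue  # drop incomplete records
        cleaned.append({
            "name": row["name"].strip().lower(),
            "age": int(row["age"]),
        })
    return cleaned

# Hypothetical raw records with whitespace and a missing value
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": ""},       # missing value -> dropped
    {"name": "Carol", "age": "29"},
]
print(clean_rows(raw))  # [{'name': 'alice', 'age': 34}, {'name': 'carol', 'age': 29}]
```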
After I have clean data, the next question is “What class of question am I asking this data?”
- If I have a set of labeled examples, then I can train a supervised learning algorithm on those features.
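The supervised setup can be as small as this sketch: a one-nearest-neighbor classifier, here just as a stand-in for "features in, label out" (the data points and labels are made up).

```python
def predict_1nn(train, x):
    """Classify x with the label of its nearest training example (1-NN):
    a minimal supervised learner mapping features to a label."""
    def dist(a, b):
        # squared Euclidean distance between feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda row: dist(row[0], x))[1]

# Hypothetical labeled examples: (features, label)
train = [((1.0, 1.0), "small"), ((8.0, 9.0), "large"), ((9.0, 8.0), "large")]
print(predict_1nn(train, (2.0, 1.5)))  # "small"
```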
“Do I have combination-type features?”
- If so, a Naive Bayesian classifier most likely won't work, because Naive Bayes assumes the features are independent; when the classification depends on *combinations* of features, those combinations will confuse my classifier.
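The classic worked example of this failure is XOR, where the label depends entirely on the combination of two features. A hand-rolled Naive Bayes (sketch below, toy data) scores both classes identically, so it cannot separate them at all:

```python
def nb_scores(data, x):
    """Naive Bayes score P(c) * prod_i P(x_i | c) for each class,
    estimated by simple counting over the training data."""
    classes = {c for _, c in data}
    scores = {}
    for c in classes:
        rows = [f for f, lab in data if lab == c]
        prior = len(rows) / len(data)
        likelihood = 1.0
        for i, xi in enumerate(x):
            likelihood *= sum(1 for f in rows if f[i] == xi) / len(rows)
        scores[c] = prior * likelihood
    return scores

# XOR: the label is determined only by the feature *combination*.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Every conditional probability is 0.5, so both classes always tie:
print(nb_scores(xor, (0, 1)))  # {0: 0.125, 1: 0.125}
```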
“Maybe I can use a decision tree?”
- Maybe, but if I need to create a model that requires incremental training, a decision tree won't work well. I will need to retrain and rebuild the tree for every new feature/combination.
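To make that cost concrete, here is a sketch using a hypothetical one-level tree (a decision stump): because the best split depends on all the data seen so far, every new example in the stream forces a full rebuild rather than a cheap update.

```python
def train_stump(rows):
    """One-level decision tree: pick the (feature, value) split whose
    majority-vote predictions best fit the training rows."""
    n_features = len(rows[0][0])
    best_acc, best = -1.0, None
    for i in range(n_features):
        for val in {f[i] for f, _ in rows}:
            # collect labels on each side of the split
            side = {True: [], False: []}
            for f, lab in rows:
                side[f[i] == val].append(lab)
            # predict the majority label on each side
            pred = {k: max(set(v), key=v.count) if v else 0
                    for k, v in side.items()}
            acc = sum(pred[f[i] == val] == lab for f, lab in rows) / len(rows)
            if acc > best_acc:
                best_acc, best = acc, (i, val, pred)
    return best

# Incremental setting: each new example triggers a from-scratch rebuild.
stream = [((0, 0), 0), ((0, 1), 1), ((1, 1), 1), ((1, 0), 0)]
seen, rebuilds = [], 0
for example in stream:
    seen.append(example)
    model = train_stump(seen)  # no way to cheaply patch the old tree
    rebuilds += 1
print(rebuilds)  # one full rebuild per new example
```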
“What about a Neural Network?”
- Maybe, but I will need to run a LOT of experiments to get the parameters right.
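Just enumerating a coarse hyperparameter grid shows why: even a handful of knobs with a few values each multiplies into a lot of training runs (the parameter names and ranges below are illustrative, not recommendations).

```python
from itertools import product

# A hypothetical, deliberately coarse grid for a small network.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [16, 64, 256],
    "layers": [1, 2, 3],
    "batch_size": [32, 128],
    "activation": ["relu", "tanh"],
}
configs = list(product(*grid.values()))
print(len(configs))  # 3 * 3 * 3 * 2 * 2 = 108 training runs
```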
“What about a Support Vector Machine?”
- Maybe, but if we are dealing with a high number of dimensions I might not understand how the SVM is even *doing* the classification. That may not matter, but it will if I need to mentally walk through the decision.
So on and so on.
The point being, as you get more familiar with the algorithms and applications, you get a better sense of which ones to use either alone or in combination.
1. What's the question?
2. Is the data ready?
3. Experiment, experiment, experiment. Test, test, test.