What steps do you take when working on a machine learning problem?

When I am looking at a dataset, the first thing I ask myself is “What is the question I am trying to get an answer to?”  The answer may very well be in the data, but the question is of utmost importance.

The next question I ask myself is, “Is this data ready for me to start asking it the question?”  Most likely it is not, because *real* data is dirty data, and you need to clean it up before you can ask it questions.
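To make "clean it up" a bit more concrete, here is a minimal sketch of the kind of pass I mean. The function name `clean_rows`, the `required` parameter, and the sample rows are all made up for illustration; real cleaning depends entirely on the dataset.

```python
def clean_rows(rows, required):
    """Return tidied copies of rows, skipping incomplete and duplicate ones."""
    seen = set()
    cleaned = []
    for row in rows:
        tidy = {}
        for key, value in row.items():
            if isinstance(value, str):
                value = value.strip()      # trim stray whitespace
                if value == "":
                    value = None           # treat empty strings as missing
            tidy[key] = value
        # Skip rows that are missing any required field.
        if any(tidy.get(field) is None for field in required):
            continue
        # Skip exact duplicates (after tidying).
        fingerprint = tuple(sorted(tidy.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(tidy)
    return cleaned


# Hypothetical dirty input.
raw = [
    {"age": " 34", "city": "Boston "},
    {"age": "34", "city": "Boston"},   # duplicate once whitespace is trimmed
    {"age": "", "city": "Denver"},     # missing age
    {"age": "51", "city": None},       # missing city
]
print(clean_rows(raw, required=["age", "city"]))
# → [{'age': '34', 'city': 'Boston'}]
```

Three of the four rows disappear, which is fairly typical of how much a first pass can change what you thought you had.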

After I have clean data, the next question is “What class of question am I asking this data?”

-If I have a set of labeled examples, then I can model the relationship between their features and their labels with a supervised learning algorithm.

“Do I have combination-type features?”

-If so, a Naive Bayes classifier most likely won't work, because it treats every feature as independent of the others, so a class that depends on a combination of features will confuse it.
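A quick way to see this is the XOR pattern, where the class depends only on the combination of two features. The sketch below is a tiny hand-rolled Naive Bayes (the function names `train_naive_bayes` and `posterior` are my own, and the four-row dataset is a toy), not any particular library's implementation.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count classes and per-feature values. examples: list of (features, label)."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    for features, label in examples:
        class_counts[label] += 1
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def posterior(model, features, n_values=2):
    """Return P(label | features), assuming each feature takes n_values values."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # class prior
        for i, value in enumerate(features):
            # Laplace-smoothed P(feature i = value | label), one feature at a time.
            score *= (value_counts[(label, i)][value] + 1) / (count + n_values)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# XOR: the label is 1 only when exactly one feature is 1.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
model = train_naive_bayes(xor)
for features, label in xor:
    print(features, "true label:", label, "posterior:", posterior(model, features))
```

Every input comes out at 0.5 for both classes. Looked at one at a time, each feature carries no information about the label, so the classifier can do no better than a coin flip.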

“Maybe I can use a decision tree?”

-Maybe, but if I need a model that supports incremental training, a decision tree won't work well.  I will need to retrain and rebuild the tree for every new feature or combination.
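For contrast, here is a rough sketch of what incremental training looks like in a model that does support it. I am using a nearest-centroid classifier purely as an illustration (the class name `RunningCentroids` and its methods are hypothetical): each new example only nudges a running sum, and nothing has to be rebuilt.

```python
class RunningCentroids:
    """Nearest-centroid classifier that learns one example at a time."""

    def __init__(self):
        self.sums = {}    # label -> per-feature running sums
        self.counts = {}  # label -> number of examples seen

    def update(self, x, label):
        """Fold a single new example into the model."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(x)
            self.counts[label] = 0
        self.sums[label] = [s + v for s, v in zip(self.sums[label], x)]
        self.counts[label] += 1

    def predict(self, x):
        """Return the label whose centroid (mean point) is closest to x."""
        def squared_distance(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(x, self.sums[label]))
        return min(self.counts, key=squared_distance)


model_inc = RunningCentroids()
model_inc.update([0, 0], "a")
model_inc.update([10, 10], "b")
print(model_inc.predict([1, 1]))   # closest to the "a" centroid

# A new example arrives later: just update, no retraining from scratch.
model_inc.update([2, 2], "a")
print(model_inc.predict([9, 8]))   # closest to the "b" centroid
```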

“What about a Neural Network?”
- Maybe, but I will need to run a LOT of experiments to get the parameters right.

“What about a Support Vector Machine?”
- Maybe, but if we are dealing with a high number of dimensions I might not understand how the SVM is even *doing* the classification.  That may not be an issue, but it may be an issue if I need to mentally walk through it.

And so on.

The point is that as you get more familiar with the algorithms and their applications, you get a better sense of which ones to use, either alone or in combination.

1. What's the question?
2. Is the data ready?
3. Experiment, experiment, experiment.  Test, test, test.
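For step 3, one common way to keep experiments honest is to score every candidate on data it was not trained on. Below is a minimal sketch of k-fold cross-validation comparing a majority-class baseline against a 1-nearest-neighbor model. The helper names and the toy dataset are invented for illustration, and the two models are just stand-ins for whatever algorithms you are actually weighing.

```python
def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds (no shuffling, so runs repeat)."""
    return [list(range(i, n, k)) for i in range(k)]


def majority_baseline(train):
    """Always predict the most common training label."""
    labels = [label for _, label in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner


def one_nearest_neighbor(train):
    """Predict the label of the closest training point."""
    def predict(x):
        nearest = min(
            train,
            key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
        )
        return nearest[1]
    return predict


def cross_validate(data, fit, k=5):
    """Return mean accuracy of fit over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [ex for i, ex in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Toy data: two well-separated clusters along a diagonal.
data = [((i, i), "low") for i in range(10)]
data += [((i + 20, i + 20), "high") for i in range(10)]

print("baseline:", cross_validate(data, majority_baseline))
print("1-NN:    ", cross_validate(data, one_nearest_neighbor))
```

On this toy data the baseline scores 0.5 and 1-NN scores 1.0. Always running a dumb baseline alongside the real candidates is a cheap way to check that a model has learned something at all.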
