What methods/code are available for studying a corpus consisting of e-mails?

Getting names and dates from the email headers is relatively straightforward, because that is structured information.  Getting geographical locations from the email content itself is not.  Here is how I would approach it.
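For the structured part, Python's standard library already does most of the work. A quick sketch using the built-in email module (the sample message here is made up):

import email

raw = ("From: Jane Doe <jane@example.com>\n"
       "Date: Mon, 1 Jan 2001 10:00:00 -0000\n"
       "\n"
       "See you in Houston next week.")
msg = email.message_from_string(raw)
print(msg["From"])  # Jane Doe <jane@example.com>
print(msg["Date"])  # Mon, 1 Jan 2001 10:00:00 -0000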

First, some background. What we are talking about here is ‘Information Extraction’ and ‘Entity Detection’. Since we know we are looking for geographical locations, we can attempt to convert the unstructured data in the body of an email into structured data. Basically, we parse it into a database.

On to the show:  The basic architecture of such a system could look like this:
Raw Text → [Sentence Segmentation] → Sentences → [Tokenization] → Tokenized Sentences → [Part-of-Speech Tagging] → POS-Tagged Sentences → [Entity Detection] → Chunked Sentences → [Relation Detection] → Relations (a list of tuples)

NLTK would be my tool of choice. To do the first three steps, I would use the default sentence segmenter, word tokenizer, and POS tagger.

import nltk

def steps_one_to_three(document):
    sentences = nltk.sent_tokenize(document)                      # sentence segmentation
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # word tokenization
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # POS tagging
    return sentences

This will give you POS tagged sentences.
The next step is to segment and label the entities that might have interesting relationships with each other, such as a proper name (Houston, Texas). Then we can search for specific patterns between pairs of entities that occur near each other in the text, and use that to build the tuples that document the relationship between the entities.

Enter chunking! Chunking is a basic technique for entity detection. It segments and labels multi-token sequences, turning something like “the smart engineer” into what is called a noun phrase. Thank heavens, because geographical information just so happens to be a noun phrase (Houston, Texas). Sometimes a human can look at a noun phrase and say, “Hey, that is a single noun phrase, but my chunker broke it up into two noun phrases.” That's OK. The algorithm tries to make our noun phrases small enough that they don't contain any other ‘nested’ noun phrases, so it errs on the side of caution. Make sense? Here is an example:

“The National Association for the Advancement of Colored People” is clearly a single noun phrase, but it contains nested noun phrases, so it will be captured as a series of noun phrase chunks.

Enough about chunking. What's next? In order to use our chunker effectively, we need to define a chunk grammar. This is the ruleset that dictates how sentences should be chunked, and it is pretty straightforward to express in the form of a regex:

grammar = "NP: {<DT>?<JJ>*<NN>}" <- “Tirrell, what the fu’schnick is that?” Lemme break it down. The rule in our grammar says, “Find me the noun phrases (NP) that consist of an optional determiner (DT), followed by any number of adjectives (JJ), and then a noun (NN).” So this grammar chunker would find something like “The little wayward boy”, and the grammar is defined as a “tag pattern”.
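To make this concrete, here is a minimal sketch of that grammar in action with NLTK's RegexpParser (the example sentence is my own):

import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize("The little wayward boy smiled"))
print(chunk_parser.parse(tagged))
# (S (NP The/DT little/JJ wayward/JJ boy/NN) smiled/VBD)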

Are you still with me? Good. Now I can circle back to the original question of finding geographies in unstructured data. Earlier, I talked about named entities, which are what we want to extract. The goal of a named entity recognition system is to identify all textual mentions of the named entities; this amounts to identifying the boundaries of each named entity, then identifying its type.

Examples:
ORGANIZATION: Quora, Inc.
PERSON: Tirrell Payton
LOCATION: Houston, TX

The cool thing about named entity recognition is that it can do other things too, like help answer questions. Say you did a Google search for “Who was the first black man killed in the revolutionary war?”

Google returns a Wikipedia entry that looks something like this: “Crispus Attucks (c. 1723 – March 5, 1770) was a dockworker of Wampanoag and African descent. He was the first person shot to death by British redcoats during the Boston Massacre, in Boston, Massachusetts. He has been called the first martyr of the revolution.” If we were to run this passage through our entity recognition system, it could return something like “Crispus Attucks was the first black man killed in the revolutionary war”. Cool, huh?

OK, I'm getting off track again; let's get back to geographies. In the case of locations, we could use a gazetteer (a geographical dictionary), but dictionary lookups can be pretty dumb because they do blind pattern matching. A naive lookup thinks the passage “Reading is fundamental” is about Reading, UK. Well, this is about reading as the gerund form of the verb “to read”, not the PLACE “Reading, UK”. AHA. Now you see why all the razzle dazzle on POS tagging will be important!

NLTK is cool because it has a classifier that can already recognize named entities. <- “Damn Tirrell, why didn't you just say that before?” Because I had to build suspense and an appreciation of the magnitude and messiness of the problem space! NLTK tags named entities according to category labels. As a side note, the new version of NLTK includes an interface to the Stanford Named Entity Recognizer: http://nltk.github.com/api/nltk….

Stanford people are pretty smart, so this may work better than the default. BUT for the sake of this discussion, let's assume the vanilla case: we are using the default nltk.ne_chunk, and if you use ne_chunk, you can get a location out of your unstructured data.
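Tying it all together, here is a minimal sketch of what that could look like (this assumes the relevant NLTK models have been downloaded; ne_chunk labels locations as GPE or LOCATION depending on the entity):

import nltk

def extract_locations(document):
    locations = []
    for sent in nltk.sent_tokenize(document):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        for subtree in tree.subtrees():
            if subtree.label() in ('GPE', 'LOCATION'):
                locations.append(' '.join(word for word, tag in subtree.leaves()))
    return locations

print(extract_locations("I just flew in from Houston, and boy are my arms tired."))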

So now, you know how to get locations out of unstructured data such as email.  If you wanted to take it a step further, you could extract relationships from the named location entities to answer questions such as “Where was this person when this email was sent?” or “Where is this company located?”

Going into that is beyond the scope of your question, and I write too much anyway.  I hope this was valuable.

What steps do you take when working on a machine learning problem?

When I am looking at a dataset, the first thing I ask myself is “What is the question I am trying to get an answer to?”  The answer may very well be in the data, but the question is of utmost importance.

The next question I ask myself is, “Is this data ready for me to start asking it the question?” Most likely it is not, because *real* data is dirty data, and you need to clean it up before you can ask it questions.

After I have clean data, the next question is, “What class of question am I asking this data?”

- If I have a set of labeled examples, then I can model according to their features with a supervised learning algorithm.

“Do I have combination-type features?”

- If so, a Naive Bayes classifier most likely won't work, because the combinations of features and classifications will confuse my classifier.

“Maybe I can use a decision tree?”

- Maybe, but if I need to create a model that requires incremental training, a decision tree won't work well. I would need to retrain and rebuild the tree for every new feature/combination.

“What about a Neural Network?”
- Maybe, but I will need to run a LOT of experiments to get the parameters right.

“What about a Support Vector Machine?”
- Maybe, but if we are dealing with a high number of dimensions I might not understand how the SVM is even *doing* the classification.  That may not be an issue, but it may be an issue if I need to mentally walk through it.

So on and so on.

The point being, as you get more familiar with the algorithms and applications, you get a better sense of which ones to use either alone or in combination.

1. What's the question?
2. Is the data ready?
3. Experiment Experiment Experiment.  Test Test Test.
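As a toy illustration of step 3, here is a minimal sketch (assuming scikit-learn, with the iris dataset standing in for your own cleaned data) of how I might run several of the algorithms above against each other:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in for your own cleaned data

for name, model in [("Naive Bayes", GaussianNB()),
                    ("Decision Tree", DecisionTreeClassifier()),
                    ("SVM", SVC())]:
    scores = cross_val_score(model, X, y, cv=5)  # test, test, test
    print(name, round(scores.mean(), 3))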

How do you learn with negative examples?

This question was asked on Quora:

Assume you build a classifier to classify “objects” of class A. Consider two cases:
1. A sample set in which some instances are classified as class A, and the others are not known to belong to class A. You build a classifier for this task.
2. A training set in which some instances are classified as objects of class A, some are unknown (the majority of all the objects I have), and some are classified as objects which are known NOT to belong to class A.
How is 2 different from 1? How can the additional knowledge of NOT be used in 2? Must different methods of classification be used for 1 and 2?

Let's say we have 3 types of entities: A, !A, and U.
A are entities known to be of type A.
!A are entities known to NOT be of type A.
U are entities that are unknown. Could be A, could be !A, could be some other entity class (B), but for now they are U.

How is !A different from U?
Logically, !A has a more defined feature than U: we know that it is NOT an instance of type A. Based on the definition above, we don't know anything about U; it has no defined features for us to classify against. Could be A, could be !A, could be some other entity class (B), but for now it is U.

Let me move away from symbolism for a moment and talk about this in terms of a metaphor.  We are classifying animals.
A is an antelope.
!A could be anything BUT an antelope, such as a tiger or a dog.
U is unknown. Could be an antelope, could NOT be an antelope. We don't know. For now we call it U. As the algorithm gets better at classification, this could turn out to be a different species of antelope (A), or a different animal altogether (!A).

How can the additional knowledge of NOT be used in 2 (the training set in which some instances are classified as objects of class A, some are unknown, and some are classified as objects which are known NOT to belong to class A)?

The additional knowledge of NOT can be used to set up a different class, so instead of A and !A, you can now have A, !A, and U.

Must different methods of classification be used for 1 and 2?
Like everything else, it depends:
- Are you looking to classify binary A vs. !A? If so, and you know U = !A, then the U entities and the !A entities can be given the same binary classification (0), and the entities that are known to be A can be given the other (1).

- Are you looking for a richer classification set with more numerous classifications? If so, U can be given its own classification, with A and !A given their own classifications. Potentially, U can be decomposed into other classifications of its own. Given the definition in the question, I would call U unknown… which doesn't mean anything in terms of features. It may turn out to be A, it may turn out to be !A, and it may turn out to be something totally different.

How do you learn with negative examples?
Make U = !A and train the model with that assumption.
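Here is a toy sketch of that setup (assuming scikit-learn; the feature values are made up), with U folded into the negative class:

from sklearn.linear_model import LogisticRegression

# made-up one-dimensional features for a handful of entities
X = [[5.0], [4.8], [4.9], [1.2], [0.9], [2.0]]
labels = ["A", "A", "A", "!A", "!A", "U"]
y = [1 if label == "A" else 0 for label in labels]  # U treated as !A

model = LogisticRegression().fit(X, y)
print(model.predict([[4.5], [1.0]]))  # [1 0] -> A, not A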

Is it possible to use a machine learning algorithm to determine whether or not a voice comes from person 1 or person 2?

Someone asked this question on Quora, so I tried to answer it:  http://www.quora.com/Is-it-possible-to-use-a-machine-learning-algorithm-to-determine-whether-or-not-a-voice-comes-from-person-1-or-person-2

Yes.

When you talk about voice audio and machine learning, the first step would be feature extraction.

What is a feature? A feature, in machine learning/statistics speak, is a “relevant parameter” for building a model. Feature importance is in the eye of the beholder and really points back to ‘What do you want to know about this data?’. If we look at the example of a car, a relevant feature for me might be leather interior, and I don't care about the gas mileage. On the other hand, a relevant feature for you might be ‘2 doors’.

What is a relevant feature for audio? An example of a relevant feature for audio is the Mel Frequency Cepstrum Coefficients (MFCCs). WTF is an MFCC? The MFCCs are basically frequencies of sound (like a wav file) that have been warped to more closely represent how humans hear sound. Doing this can get you a better representation of sound…like in audio compression.

What do features have to do with it? Features are the inputs to your machine learning algorithm. Ok so get on with it, how do I figure out who is talking? We would take the MFCCs of 2 different samples and plug them into our algorithm and compare the results.

Here is a hypothetical algorithm (assuming Python):

Step 1: Use something like scipy.io.wavfile to read a wav file. After that, you will have the wav data in a more usable form.

Step 2: Take multiple windows (snippets) of the audio data (10-30 ms each) using something like scipy.signal.hanning.

Step 3: Run a Fast Fourier Transform on each window of audio data. The output of the FFT decomposes each window's sequence of values into its components at different frequencies.

Step 4: Map the powers of the spectrum onto the mel scale, using triangular overlapping windows.

Step 5: Take the discrete cosine transform of the list of mel log powers.
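Putting those five steps together, here is a rough sketch (assuming scipy and numpy, and mono audio; the filterbank below is linearly spaced for brevity, where a real implementation would use mel-spaced triangular filters):

import numpy as np
from scipy.io import wavfile
from scipy.signal import get_window
from scipy.fftpack import dct

def mfcc_sketch(path, win_len=512, n_filters=26, n_coeffs=13):
    rate, data = wavfile.read(path)                   # Step 1: read the wav file
    window = get_window("hann", win_len)
    coeffs = []
    for start in range(0, len(data) - win_len, win_len // 2):
        frame = data[start:start + win_len] * window  # Step 2: windowing
        powers = np.abs(np.fft.rfft(frame)) ** 2      # Step 3: FFT -> power spectrum
        # Step 4: pool the powers into a crude (linear, not mel) filterbank
        edges = np.linspace(0, len(powers), n_filters + 1).astype(int)
        fbank = [powers[edges[i]:edges[i + 1]].sum() + 1e-10 for i in range(n_filters)]
        # Step 5: DCT of the log filterbank energies; keep the first few
        coeffs.append(dct(np.log(fbank))[:n_coeffs])
    return np.array(coeffs)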

Voila, the MFCCs are the amplitudes of the resulting spectrum! Bam, what's next?! What's next is that we have to plug the MFCCs into our machine learning algorithm. Um… OK. Which one should I use? Like everything in life, it depends.

Way 1: Known Number of Speakers

Assuming you know how many speakers you have, you can use a K-means clustering algorithm to cluster the audio of similar speakers. (Strictly speaking, K-means is an unsupervised clustering method; the knowledge you bring to it is K, the number of speakers.) OK, what is K-means clustering? K-means clustering is a way to divide up N observations into K clusters in which each observation belongs to the cluster with the nearest mean.

So in this case, our MFCCs are our observations (features) and K represents the number of speakers we intend to identify.
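A minimal sketch of that, assuming scikit-learn and the mfcc_sketch helper from above (the wav filename is hypothetical):

from sklearn.cluster import KMeans

features = mfcc_sketch("two_speakers.wav")   # hypothetical recording
kmeans = KMeans(n_clusters=2)                # K = number of speakers
frame_labels = kmeans.fit_predict(features)  # assign each audio frame to a speaker cluster
print(frame_labels[:20])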

Way 2: Unknown Number of Speakers

If I didn't know how many speakers I needed to identify, I would plug my inputs into a random forest algorithm, run in its unsupervised mode. What is a random forest algorithm? A random forest is an “ensemble classifier” that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. Remember, the mode is the value that occurs most frequently in a data set or probability distribution. So in this case, our random forest could output some clusters, with some type of dissimilarity index between them, that HOPEFULLY correlate to the different speakers we are trying to identify.
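One hedged way to sketch that idea with scikit-learn is RandomTreesEmbedding, a tree ensemble that needs no labels, followed by a clustering step that does not need K up front (the filename and the mfcc_sketch helper are again hypothetical):

from sklearn.ensemble import RandomTreesEmbedding
from sklearn.cluster import DBSCAN

features = mfcc_sketch("unknown_speakers.wav")            # hypothetical recording
embedding = RandomTreesEmbedding(n_estimators=100).fit_transform(features)
frame_labels = DBSCAN().fit_predict(embedding.toarray())  # no K required
print(set(frame_labels))  # hopefully one label per speaker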

-Tirrell
-Special Thanks to Wikipedia for the entry on MFCC, K means clustering, and Random Forests

Your (version 1.0) baby is ugly

Your baby is ugly. Yes, conventional wisdom says that all babies are beautiful, but in reality, when babies first come out, they are blue-gray, ugly, blood-and-fluid-covered screaming aliens with misshapen heads. Software is the same way. Your version 1.0 is an ugly, blue-gray, bloody, misshapen mess, and much like the newborn, it will require lots of care and feeding to live and be happy and healthy.

I had a conversation with my cofounder about this today. He was appalled that I would say that v1.0 would suck. It's the truth. It won't always suck, but products, like babies, require feedback from the environment in order to grow and develop.

Happy New Year

Happy New Year.

Right now, I am in San Francisco. The thing that gets me about this city is the proximity in which the financial, creative, and technical classes operate. It's a petri dish with all kinds of ideas combining, splitting, and recombining into even newer ideas. On one hand, it is easy to see how the echo chamber develops. On the other hand, it's very easy to see how innovation springs from a place like this.

VCs are like Record Labels

Having had a previous life in independent music, and now a current life in tech startups, I see a lot of similarities between the two.

VCs are like Record Labels.  They can sign you on, give you money, and pour gas on your fire.  In exchange, they own (part of) you.  Luckily there is a lot more transparency in the tech industry than there is in the music industry, so there are lots of ways to avoid getting a bad deal (standard seed docs, thefunded, angellist, etc).

Not every startup will be interested in signing on with a VC, just like not every musician is interested in getting a record deal. But for the uninitiated, the typical route is to try to get funded.

Another similarity is that young, inexperienced musicians follow the hype cycle and assume that if they send a demo tape, or rap in front of an exec, they can get ‘discovered’ and land a recording contract. Just like young, inexperienced entrepreneurs think that if they have a great idea and talk to a VC about it, they can get funded. In reality, both record labels and VCs like to bet on horses that are already winning in the minor leagues. That is to say, unless you are already a known entity, you won't get funded on an idea alone. However, if you have managed to put that idea into action, creating a product and getting users to use it (and even better, to pay for it), then you can get their attention.

Record labels and VCs don't invest in ideas and demos; they invest in people who know how to build fires. You get the fuel, you get the spark, you get a small flame going, and when the flame appears to be getting bigger, they want to come along and throw gas on it.

On the other hand, if you just have an idea of how to build a fire, you shouldn't expect much.

The Role of the “Business Guy”

What does the business guy do before the big launch?

I have seen this question come up from time to time in tech startup circles.  Typically, tech startups are very engineering focused.  Many of them don’t have a dedicated ‘business guy’, and the culture of tech startups (at least in the Valley) is engineering first.  In other words, grow engineers into business guys, not the other way around.  Nonetheless, many startups start with a business guy on board and there are sometimes questions about what this person should be doing if they don’t know how to code.  

Typically the answer is ‘Learn how to code.’ There are lots of resources around to help non-technical people learn how to code, but here I will cover some ways that non-technical people can add value outside of the code:

1.  Customer Discovery

2.  Customer Validation

3.  Market Research

4.  Creative (Or Creative Management)

5.  Competitive Advantage Assertion

Fail Fast to Succeed

Many companies, especially larger corporations, wait until a product is completely perfect (in their eyes) before taking it to market.  This, however, doesn’t always net the best results.  Think of the many products that get recalled, along with their own healthy dose of negative PR.


A better alternative is to fail fast. Launch a product, even if it’s half-baked, and see what happens. 

In the book Little Bets by Peter Sims, the author quotes Pixar director Andrew Stanton as saying: “My strategy has always been: be wrong as fast as we can.”

There's some truth to this. It's easier to make little tweaks as issues are discovered than to try to be perfect from the start. Small tweaks now are better than large ones that take years to discover.

And so, I say, don’t fear failure. We only have to look at the world’s billionaires to prove the point that we all fail before we fly. Your product is no different. You may think it’s the bee’s knees, but someone somewhere will find fault with it. Better to find it out today than next year.


Understanding Minimum Viable Product’s Role

When it comes to product development, there are different approaches. Larger corporations, with time and money to waste (er, spend), tend to spend a lot of time on the planning and development phases. But what about startups, who have neither to waste? The months it can take to launch a product may be more than they've got before a competitor swoops in and makes all that work null.


For this group of businesses, the minimum viable product is the better route to go. With this method, you put together what’s pretty much a prototype, a bare bones version of your product, and see what the reaction is in the marketplace. Let your early users shape where the product goes from there.

Early adopters are thought to be more honest and open in their feedback if they understand this is an MVP. They know the product can and probably will have more features, and they can help shape what it becomes. It’s like handing half-formed clay over to a group of artists, who then shape it into their own version of what they think it should be. You compile all their ideas into what the final product becomes.
