Getting names and dates from the email headers is relatively straightforward, because that is structured information. Getting geographical locations from the email content itself is not. Here is how I would approach it.
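To show why the header part is the easy half, here is a minimal sketch using only Python's standard library. The sample message, the addresses, and the helper name header_names_and_date are all made up for illustration; a real mailbox will have messier headers than this.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw message, invented for this example.
RAW_EMAIL = """\
From: Tirrell Payton <tirrell@example.com>
To: Jane Doe <jane@example.com>, bob@example.com
Date: Mon, 05 Mar 2012 09:30:00 -0600
Subject: Trip to Houston

See you in Houston, Texas next week.
"""


def header_names_and_date(raw):
    """Return ([(name, address), ...], sent datetime) from the From/To/Date headers."""
    msg = message_from_string(raw)
    # getaddresses splits comma-separated address lists into (name, address) pairs.
    people = getaddresses(msg.get_all("From", []) + msg.get_all("To", []))
    # parsedate_to_datetime turns an RFC 2822 date string into a datetime.
    sent = parsedate_to_datetime(msg["Date"])
    return people, sent


people, sent = header_names_and_date(RAW_EMAIL)
print(people)
print(sent.isoformat())
```

Notice that the location ("Houston, Texas") only shows up in the subject and body, which is exactly the part the header parser cannot help with.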
First, some background. What we are talking about here is ‘Information Extraction’ and ‘Entity Detection’. Since we know we are looking for geographical locations, we can attempt to convert the unstructured data in the body of an email into structured data. Basically, we parse the text into a database.
On to the show: The basic architecture of such a system could look like this:
Raw Text->:Sentence Segmentation:->Sentences-
NLTK would be my tool of choice. To do the first three steps, I would use the default sentence segmenter, word tokenizer, and POS tagger.
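import nltk

def steps_one_to_three(document):
    # Split into sentences, tokenize each sentence, then POS-tag each token list.
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences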
This will give you POS tagged sentences.
The next step would be to segment and label the entities that might have interesting relationships with each other, such as a proper name (Houston, Texas). Then we can search for specific patterns between pairs of entities that occur near each other in the text, and use those patterns to build tuples that document the relationships between the entities.
Enter Chunking! Chunking is a basic technique for entity detection. It segments and labels multi-token sequences, such as “the smart engineer”, as something called a noun phrase. Thank heavens, because geographical information just so happens to be a noun phrase (Houston, Texas). Sometimes a human can look at a noun phrase and say “hey, that is a single noun phrase but my chunker broke it up into 2 noun phrases”. That’s ok. The algorithm tries to make our noun phrases small enough that they don’t contain any other ‘nested’ noun phrase, so it errs on the side of caution. Make sense? Here is an example:
“The National Association for the Advancement of Colored People” is clearly a single noun phrase, but it contains nested noun phrases. As a result, it will be captured as a series of smaller noun phrase chunks rather than as one chunk.
Enough about chunking. What’s next? In order to use our chunker effectively, we need to define a chunk grammar. This is the ruleset that dictates how sentences should be chunked. It is pretty straightforward, and it is written in a regex-like form:
grammar = "NP: {<DT>?<JJ>*<NN>}" <- “Tirrell, what the fu’schnick is that?” Lemme break it down. The rule in our grammar says, “Find me the noun phrases (NP) that have an optional determiner (DT), followed by any number of adjectives (JJ), and then a noun (NN)”. So a chunker built from this grammar would find something like “The little wayward boy”. A rule written this way is called a “tag pattern”.
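To make the tag pattern idea concrete, here is a toy chunker in plain Python. It is not NLTK's implementation, just a sketch of the same idea: turn the tags into one string, then run a regex over that string. The function name chunk_noun_phrases and the hand-assigned tags are my own assumptions for the example. In NLTK itself you would hand the grammar string to nltk.RegexpParser and call its parse method on a tagged sentence.

```python
import re

# Same rule as the grammar: optional determiner, any number of adjectives,
# then a noun. Each tag sits in its own group so ? and * apply to whole tags.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def chunk_noun_phrases(tagged):
    """Return the text of each NP chunk found in a list of (word, tag) pairs."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Counting "<" characters converts string offsets back to token indices.
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks


# Hand-tagged sentence; the tags are assumed, not produced by a real tagger.
tagged = [("The", "DT"), ("little", "JJ"), ("wayward", "JJ"), ("boy", "NN"),
          ("crossed", "VBD"), ("the", "DT"), ("street", "NN")]
print(chunk_noun_phrases(tagged))
```

The verb “crossed” matches nothing in the pattern, so it falls outside every chunk and naturally splits the sentence into two noun phrases.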
Are you still with me? Good. Now I can circle back to the original question of finding a geography in unstructured data. Earlier, I talked about Named Entities, which is what we want to extract. The goal of a “named entity recognition” system is to identify all textual mentions of the named entities, and this amounts to identifying the boundaries of the named entity, then identifying its type.
Examples:
ORGANIZATION: Quora, Inc
PERSON: TIRRELL PAYTON
LOCATION: HOUSTON, TX
The cool thing about named entity recognition is that it can do other things too, like help answer questions. Say you did a Google search for “Who was the first black man killed in the Revolutionary War?”.
Google returns a Wikipedia entry that looks something like this: “Crispus Attucks (c. 1723 – March 5, 1770) was a dockworker of Wampanoag and African
Ok, I’m getting off track again, let’s get back to geographies. In the case of locations, we could use a gazetteer (a geographical dictionary), but dictionary lookups can be pretty dumb because they do blind pattern matches. Take “Reading is fundamental”: a blind lookup thinks this passage is about Reading, UK. Well, this is about reading… as the gerund form of the verb “to read”, not the PLACE “Reading, UK”. AHA. Now you see why all the razzle dazzle on POS tagging is important!
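Here is a small sketch of the difference between a blind gazetteer lookup and one that checks the POS tag first. The three-entry gazetteer, the function names, and the tags are all hand-made assumptions for illustration. A real tagger can still mis-tag a capitalized word at the start of a sentence, so treat this as the idea rather than a guarantee.

```python
# Tiny hypothetical gazetteer; a real one would hold many thousands of names.
GAZETTEER = {"Reading", "Houston", "Austin"}

# Penn Treebank tags for singular and plural proper nouns.
PROPER_NOUN_TAGS = ("NNP", "NNPS")


def blind_lookup(tagged):
    """Flag every token that appears in the gazetteer, ignoring its tag."""
    return [word for word, _ in tagged if word in GAZETTEER]


def pos_aware_lookup(tagged):
    """Flag gazetteer hits only when the token is tagged as a proper noun."""
    return [word for word, tag in tagged
            if word in GAZETTEER and tag in PROPER_NOUN_TAGS]


# Hand-assigned tags: "Reading" as a gerund versus "Reading" as a place.
verb_use = [("Reading", "VBG"), ("is", "VBZ"), ("fundamental", "JJ")]
place_use = [("She", "PRP"), ("moved", "VBD"), ("to", "TO"), ("Reading", "NNP")]

print(blind_lookup(verb_use))       # the false positive
print(pos_aware_lookup(verb_use))   # filtered out by the tag check
print(pos_aware_lookup(place_use))  # the genuine place name survives
```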
NLTK is cool because it has a classifier that can already recognize named entities <- “Damn Tirrell, why didn’t you just say that before?” Because I had to build suspense and an appreciation of the magnitude and messiness of the problem space! NLTK tags named entities according to category labels. As a side note, the new version of NLTK includes an interface to the Stanford Named Entity Recognizer: http://nltk.github.com/a
Stanford people are pretty smart, so this may work better than the default. BUT for the sake of this discussion, let’s assume the vanilla case: we are using the default nltk.ne_chunk. If you run ne_chunk on a POS tagged sentence, you can get a location out of your unstructured data.
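A rough sketch of how that last step could look. The helper names extract_locations and locations_in_text are mine, not NLTK's, and I am assuming that places come back labelled GPE or LOCATION, which is what the default chunker normally uses. The second function needs NLTK plus its tokenizer, tagger, and chunker models installed (the resource names for nltk.download vary between NLTK versions, so I have left them out).

```python
def extract_locations(tree, labels=("GPE", "LOCATION")):
    """Collect the text of every top-level chunk whose label is in `labels`."""
    found = []
    for node in tree:
        # Named-entity chunks are subtrees with a label(); ordinary tokens
        # are plain (word, tag) tuples and are skipped.
        if hasattr(node, "label") and node.label() in labels:
            found.append(" ".join(word for word, _ in node.leaves()))
    return found


def locations_in_text(text):
    """Run sentence splitting, tokenizing, tagging, and ne_chunk on raw text."""
    import nltk  # imported lazily so extract_locations works without NLTK

    results = []
    for sent in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        results.extend(extract_locations(nltk.ne_chunk(tagged)))
    return results
```

You would call locations_in_text on the body of the email. What comes back depends on the models, so expect some misses and some false hits on real mail.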
So now, you know how to get locations out of unstructured data such as email. If you wanted to take it a step further, you could extract relationships from the named location entities to answer questions such as “Where was this person when this email was sent?” or “Where is this company located?”
Going into that is beyond the scope of your question, and I write too much anyway. I hope this was valuable.