Our 21 students are working in labs from NC (Duke) to MA (Harvard and MIT), and on topics from computer languages to tissue formation. Join us here to read weekly updates from their time in the lab!

Visit the EXP page on the Peddie website: peddie.org/EXP.

Wednesday, July 3, 2013

NLP week 2, 3: evaluating results and more coding

My name is Jiehan Zheng, and I work at CCLS at Columbia University on a natural language processing project about extracting social networks from text, together with my mentor Apoorv, his colleague Anup, and Dr. Rambow.  I am now into my third week here, so I am going to recap what I did in my second week and the first half of this week.  (In the first week, I worked on visualizing the generated social network.)

First, I worked on postprocessing and evaluating the results from the NER (named entity recognition) system.  A named entity recognizer takes raw text as input and outputs the locations of grouped entity mentions (spans of character offsets counted from the very beginning of the text; by "grouped" I mean that entity mentions of the same entity are grouped together under one node in an XML structure) and the types of the entities (organizations, people, etc.).  Our team did not write the NER system ourselves, because NER is not Apoorv's focus; his thesis is on social network extraction.  So we have to know how well the NER is performing and try to "improve" its results without digging into the NER itself.
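To make this a bit more concrete, here is a made-up illustration (written as Python, not the actual XML format we receive) of what one grouped entity could look like; the offsets and mention texts are invented:

```python
# Hypothetical sketch of one grouped entity from an NER output -- the real data
# is an XML structure, and these offsets and mention texts are invented.
entity = {
    "id": 1,
    "type": "PERSON",
    "mentions": [
        # each mention is a span of character offsets from the start of the text
        {"text": "alice",      "start": 1034, "end": 1039},
        {"text": "poor alice", "start": 2517, "end": 2527},
        {"text": "she",        "start": 2601, "end": 2604},
    ],
}
```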

There were two problems.  First, the NER sometimes mistakenly splits what is meant to be a single entity into multiple entities.  This messed up the generated social network, because then you end up with several vertices for the same person, which distracts the viewer.  For instance, for Alice in Alice in Wonderland, entity #1 (the first entity that the NER gave us) had 67 entity mentions of "alice" and 3 of "poor alice", among many other entity mentions like "she", "her", etc.  Entity #81, meanwhile, had 38 mentions of "alice", 3 mentions of "alice hastily", and so on.  We need to merge these.  To a human it is clear that #1 and #81 both refer to the main character Alice in the novel, but how can we get computers to make similar decisions?

Our solution was to find all the different entity mentions in the output and create feature vectors from them.  For the sake of simplicity, let's say that in an NER output, if we ignore words like "she" and "he", only "alice", "poor alice", "a little girl", "queen", "alice hastily", "her sister" and "king" were mentioned.  We then create a feature vector of (# of occurrences of "alice", # of occurrences of "poor alice", ..., # of "king") for each entity in the output.  Maybe E1 = {"alice"x67, "poor alice"x3}; then E1 will be given a feature vector of (67, 3, 0, 0, 0, 0, 0).  Similarly, E81 will have (38, 0, 0, 0, 3, 0, 0).  If we then think of them as vectors in 7-dimensional space (which I can't really picture) and calculate their cosine similarity (something I just learned last week from Apoorv), they turn out to be surprisingly similar (> 0.99).  The implementation of entity merging was to generate a mapping from the IDs of one or more entities to the ID of a single entity (this is actually the part of the code shown in the screenshot).  Say #1, #13 and #81 are all actually Alice; then we have a map that sends 13 to 1 and 81 to 1.  When we present the result to users, I check whether each entity is in this duplication mapping.
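Here is a minimal sketch of that idea in Python, using made-up mention counts and an illustrative similarity threshold; our actual merging code works on the full NER output, not hand-typed dictionaries:

```python
from math import sqrt

# Toy mention counts for two entities (pronouns like "she"/"he" already removed).
entities = {
    1:  {"alice": 67, "poor alice": 3},
    81: {"alice": 38, "alice hastily": 3},
}

# Shared vocabulary of mention strings, then one count vector per entity.
vocab = sorted({m for counts in entities.values() for m in counts})
vectors = {eid: [counts.get(m, 0) for m in vocab] for eid, counts in entities.items()}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Duplication mapping: point the higher ID at the lower one whenever two
# entities look similar enough (the 0.9 threshold is just for illustration).
merge_map = {}
ids = sorted(entities)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        if cosine(vectors[a], vectors[b]) > 0.9:
            merge_map[b] = merge_map.get(a, a)

print(round(cosine(vectors[1], vectors[81]), 3))  # 0.996 -- surprisingly high
print(merge_map)                                  # {81: 1}
```

Even with these toy numbers, E1 and E81 come out with a similarity above 0.99, so 81 ends up mapped onto 1 in the duplication mapping.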

Running the code to merge entities and guess names

The second problem was that the NER gives us no information about a person's real name or best name, so I wrote some code to address this.  For instance, for entity #6 we have (after removing words like "she" and "he"): {"a white rabbit"=1, "the rabbit"=16, "the white rabbit"=11, "the white rabbit, who was peeping"=1, "the white rabbit, who said,"=1}.  Clearly this is talking about "the rabbit".  In this case, "the rabbit" is the most frequently used entity mention, so my program picks "the rabbit" as the best name.  The reason we remove the common pronouns is that otherwise we would see a lot of "she" and "he" being picked as best names, which wouldn't make sense; no one wants to see a social network graph with vertices called "she" and "he" all over the place, interacting with each other.  When the mention counts are tied, we choose the earliest entity mention, because most of the time the name is clear when a character is formally introduced.
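A rough sketch of this best-name heuristic, assuming a simplified pronoun list and using mention order as a stand-in for position in the text:

```python
PRONOUNS = {"she", "he", "her", "him", "it", "they", "i", "you"}  # simplified list

def best_name(mentions):
    """Pick a best name from a list of mention strings in text order."""
    counts = {}
    for m in mentions:
        if m.lower() not in PRONOUNS:
            counts[m] = counts.get(m, 0) + 1
    if not counts:
        return None
    top = max(counts.values())
    # On a tie, prefer the mention that appears earliest in the text, since the
    # name tends to be clearest when a character is formally introduced.
    for m in mentions:
        if counts.get(m) == top:
            return m

mentions = (["a white rabbit"] + ["the rabbit"] * 16 + ["she"] * 5
            + ["the white rabbit"] * 11 + ["the white rabbit, who was peeping"])
print(best_name(mentions))  # "the rabbit"
```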

After this, I wrote a simple script in Python to crawl a website to obtain test text data for later use.

Then I wrote a program to evaluate the NER by comparing its output against a gold standard produced by paid human annotators.  I used a simple exact span match, and it didn't work very well.  For instance, if there is a span 10000-10002 corresponding to "cat" and another span 9996-10002 corresponding to "the cat", my current program gives a score of zero; yet this "cat" vs. "the cat" mistake is not a serious one and shouldn't be punished so badly.  Because I was pulled away to write some other programs, I haven't implemented a more flexible span matching method yet, but I will.  After this, we also map the entities into a multidimensional space and calculate the similarity between each entity from the NER output and each entity from the gold standard to see how similar they are.
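For the curious, here is roughly what exact span matching looks like, together with one possible lenient variant I have in mind (just an idea, not something we have settled on); the spans and threshold are only for illustration:

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) character spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def exact_match_score(predicted, gold):
    """Fraction of gold spans reproduced exactly by the NER output."""
    predicted = set(predicted)
    hits = sum(1 for g in gold if g in predicted)
    return hits / len(gold) if gold else 0.0

def lenient_match_score(predicted, gold, min_frac=0.5):
    """Count a gold span as found if some predicted span covers at least
    min_frac of the shorter of the two spans."""
    hits = 0
    for g in gold:
        if any(overlap(p, g) >= min_frac * min(p[1] - p[0], g[1] - g[0])
               for p in predicted):
            hits += 1
    return hits / len(gold) if gold else 0.0

gold      = [(9996, 10002)]   # "the cat"
predicted = [(10000, 10002)]  # "cat"
print(exact_match_score(predicted, gold))    # 0.0 -- punished too harshly
print(lenient_match_score(predicted, gold))  # 1.0 -- close enough to count
```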

This Monday I started another small project using Java, HTML and JavaScript to help Apoorv, Anup and Dr. Rambow analyze experiment results for a paper that is due this Friday (I know, it's so close to the deadline now...)!  Basically, the program takes machine learning example output from Java, displays it in a webpage, and dynamically inserts columns from experiments provided as JSON that maps example IDs to scores.  It also colors results green if they agree with the gold and red otherwise.  The user can type commands in the web console to filter rows (the command in the screenshot means that I want to see only the examples that model 1 got right but on which models 2, 3, 4 and 5 all failed).  It makes comparing machine learning models much easier.
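The real filtering runs as JavaScript in the browser, but the idea behind that console command can be sketched in a few lines of Python with made-up example IDs and model outputs:

```python
# Toy illustration of the filtering idea; the gold labels and model outputs
# here are invented.
gold = {"ex1": 1, "ex2": 0, "ex3": 1}
model_outputs = {
    "model1": {"ex1": 1, "ex2": 0, "ex3": 1},
    "model2": {"ex1": 0, "ex2": 0, "ex3": 0},
    "model3": {"ex1": 0, "ex2": 1, "ex3": 0},
    "model4": {"ex1": 0, "ex2": 1, "ex3": 0},
    "model5": {"ex1": 0, "ex2": 1, "ex3": 0},
}

def correct(model, ex):
    # An example is "green" for a model when its output agrees with the gold.
    return model_outputs[model][ex] == gold[ex]

# "Show only the examples that model 1 got right but models 2-5 all got wrong."
interesting = [ex for ex in gold
               if correct("model1", ex)
               and all(not correct(m, ex)
                       for m in ("model2", "model3", "model4", "model5"))]
print(interesting)  # ['ex1', 'ex3']
```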

I expected to start on the sentiment analysis in week 2, but obviously that didn't happen... Still, working on postprocessing and building all kinds of utilities is fun, too.  Hopefully I will start the coolest part of the research soon!

By the way, we still go to work on July 4th!
