Our 21 students are working in labs from NC (Duke) to MA (Harvard and MIT), and on topics from computer languages to tissue formation. Join us here to read weekly updates from their time in the lab!

Visit the EXP page on Peddie website: peddie.org/EXP.


Saturday, August 17, 2013

NLP week 7: finishing up pipeline wrapper and web interface

Hi again, my name is Jiehan Zheng.  I work on NLP and some machine learning at Columbia University.

I skipped writing about week 6 because we were working on something secret!  We will publish that work during the upcoming fall term if things go smoothly.  So this post is about my 7th week.  I was too busy working on the project to post updates to this blog...

Since week 7 was my last week physically at CCLS at Columbia University this summer, we chose to finish the things that require face-to-face collaboration first, so that we wouldn't have to wait on each other afterwards.  The pipeline wrapper and the web interface were what we had to finish together before I left--so in the last week I mainly worked on those two.

Apoorv's work is the pipeline that takes in sentences, gold dependency parse trees, semantic parse trees, and entity annotations, and spits out a file in graph modeling language containing the interactions between entities.  To make the pipeline work on any unprocessed text and return a social network, it has to be wrapped in some glue code--I named that part of the code the "pipeline wrapper," and I feel like that's a smart name, isn't it?

So the pipeline wrapper has to take in raw text, split it into sentences, call the various parsers, and process their results into the format that the pipeline expects.  There was existing code for this, but it no longer worked--and even when it did, it was poorly written and inefficient.  I rewrote the wrapper in a more organized way.  For instance, the old wrapper had to call NYU Jet's main method twice to get named entities and sentence splits separately; I read Jet's source code and managed to call Jet once and get both, making it faster.  I also prevented Jet from performing time-consuming operations we don't need, like relation extraction.
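The overall shape of the wrapper can be sketched in Python like this (the function names and the naive regex splitter are my own illustration--the real wrapper gets sentence boundaries from Jet in the same pass as the named entities, not from a regex):

```python
import re

def split_sentences(text):
    # Naive split on sentence-final punctuation followed by whitespace;
    # a stand-in for the sentence boundaries Jet actually provides.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def run_wrapper(raw_text, parsers):
    """Split raw text once and fan the sentences out to each parser.

    `parsers` maps a name to a callable taking the sentence list --
    stand-ins for Jet, the Stanford parser, and so on.  Each parser
    is called exactly once, unlike the old wrapper's double call.
    """
    sentences = split_sentences(raw_text)
    return {name: parse(sentences) for name, parse in parsers.items()}
```

The results dictionary is then what gets converted into the input files the pipeline expects.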

The pipeline also gets dependency parses from the Stanford parser.  My refactoring enables us to run multiple tasks in parallel.  For instance, we are going to run CMU's SEMAFOR semantic parser as well in the future, and SEMAFOR takes a long time.  Had we added SEMAFOR to the old wrapper, it would have had to wait until the Stanford parser finished its job.  With the new structure, SEMAFOR and the Stanford parser run in different processes, so they can take advantage of multiple CPU cores and run at the same time, cutting the running time by at least 50%.  SEMAFOR integration is a bit harder than the other parsers, so I decided to work on it after I go back to China.
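The parallel structure looks roughly like this (an illustration, not our actual code: I use a Python thread pool as a stand-in, since each "parser" in the real wrapper is a separate external process that can occupy its own CPU core while the launching thread just waits):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parsers_in_parallel(sentences, parsers):
    # Submit every parser at once; each callable stands in for launching
    # an external process (Stanford parser, SEMAFOR, ...).  The slowest
    # parser now bounds the total time, instead of the sum of all of them.
    with ThreadPoolExecutor(max_workers=len(parsers)) as pool:
        futures = {name: pool.submit(parse, sentences)
                   for name, parse in parsers.items()}
        return {name: f.result() for name, f in futures.items()}
```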

After we have all the parses and other files, the wrapper calls the pipeline with them and waits for the pipeline to finish processing.  Once it gets the interactions as text, the wrapper calls the postprocessor I wrote during week 2, which merges duplicate entities, finds the best name for each entity, analyzes the interactions, and finally organizes all of this into a social network file.

The web interface is pure programming effort and nowhere near as interesting as the pipeline wrapper or the machine learning work.  Still, my work on the pipeline wrapper, postprocessor and web interface has been included in a demo paper that is going to be presented at IJCNLP 2013 this October in Japan, and I've been made a co-author on that paper--I am very excited about that!

Apoorv and I have arranged with Mr. Corica that I will continue our work on that "secret project" as an independent project at Peddie during the fall term.  This is a very precious opportunity for me to learn more machine learning--from implementing tools and extracting features, to running experiments and tuning SVM parameters and our features, to finally evaluating the results.

As for the rest of my summer: I did figure out a way to integrate SEMAFOR, so I will spend some time enhancing the web interface and pipeline wrapper by adding SEMAFOR integration.  I will describe more in my next blog post!

Saturday, July 20, 2013

NLP week 4, 5: a crazy (but lazy) workaround and some machine learning

Hi!  My name is Jiehan Zheng and I work at CCLS at Columbia University on natural language processing and machine learning.  I did training data collection, evaluation and postprocessing work in previous weeks, and finally in weeks 4 and 5 I got to do some real machine learning!  It's been a busy but interesting two weeks.  I did so many things that I'm not sure I can recall all of them...but let me try--

After building the model comparison framework in HTML and JavaScript, Apoorv asked for a new feature--calculating p-values from the Χ^2 statistic of McNemar's test, to indicate whether any two models perform significantly differently on the dataset.  I had no clue how to do this, even after looking at Wikipedia and several papers, so I asked Apoorv how he calculated the p-value before he had my framework.  He sent me a MATLAB function file that he used to use at IBM.

I tried MATLAB on a Columbia server and verified that the function file works.  Then another piece of software, Octave, immediately came to mind.  Octave is open source, and although it doesn't advertise itself as such, it is known as an open-source "implementation" of MATLAB's features that everyone can use freely.  So I ran the function in Octave as well, and it worked too.

I then looked into the source code of that MATLAB function (although I had never used MATLAB before...) and found that although most of the calculation steps are fairly simple and straightforward, it calls a MATLAB function named chi2cdf().  From the MATLAB documentation I learned that chi2cdf(), as expected, computes a definite integral.  So I basically ran into the same problem: I couldn't calculate a definite integral in JavaScript.

Then, I don't know why, a crazy (but lazy) idea came to mind...  If it is hard to calculate a definite integral in JavaScript, why bother doing it in JavaScript?!  I could simply install Octave on my server and set up an API that passes the Χ^2 value to Octave and asks Octave to calculate the p-value for us!  I quickly installed Octave and wrote a short Node.js program that listens for requests, spawns a new Octave process for each one, passes the Χ^2 value to Octave's chi2cdf() function, collects its output, and returns it to the browser.  Now whenever Apoorv enters the p-value command in the browser, my code sends a request to my server, waits for the response, and displays the answer.  Luckily the idea actually worked, and I implemented it in a few hours.
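For the record, McNemar's statistic has one degree of freedom, and for one degree of freedom the chi-squared tail probability has a closed form via the complementary error function--so the p-value can also be computed without numerical integration at all.  A rough Python sketch (just an illustration of the math, not the Octave setup we actually used):

```python
import math

def mcnemar_chi2(b, c):
    # b and c are the two discordant counts (model A right / model B
    # wrong, and vice versa), with the usual continuity correction.
    return (abs(b - c) - 1) ** 2 / (b + c)

def chi2_pvalue_df1(x):
    # For one degree of freedom, P(X >= x) = erfc(sqrt(x / 2)),
    # so no definite integral needs to be evaluated numerically.
    return math.erfc(math.sqrt(x / 2.0))
```

As a sanity check, a statistic of about 3.841 should give a p-value of about 0.05, the usual significance threshold.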

Well, that's all I can say for now...which was the work I did on the afternoon of July 4.  For the rest of the time I wrote code to extract features and make training examples (without any annotation from humans!) in preparation for a machine learning task (sequence labeling), and I collected training data from several websites and stored it in a MongoDB database.  I also posted an answer on Stack Overflow for the first time, after running into a PyMongo error and reading some of MongoDB's documentation!  Unfortunately I can't share more details on the machine learning task for now, but I will in the future!

Oh, and on July 4th Apoorv asked me if I'd like to watch the fireworks from Dr. Rambow's apartment building.  I went and it was amazing!  I also changed my plane tickets to extend my internship by a few days, so now I will officially work at CCLS with Apoorv for almost 7 weeks!

Thanks for reading!  See you next week!

Wednesday, July 3, 2013

NLP week 2, 3: evaluating results and more coding

My name is Jiehan Zheng and I work at CCLS at Columbia University with my mentor Apoorv, his colleague Anup, and Dr. Rambow on a natural language processing project: extracting social networks from text.  I am now into my third week here, so I am going to recap what I did in my second week and the first half of this week.  (In the first week, I worked on visualizing the generated social network.)

First I worked on postprocessing and evaluating the results from the NER (named entity recognition) system.  A named entity recognizer takes raw text as input and outputs the locations of grouped entity mentions (spans of character offsets counted from the very beginning of the text; by "grouped" I mean that entity mentions of the same entity are collected under one node in an XML structure) and the types of entities (organizations, people, etc.).  Our team did not write the NER system ourselves, because NER is not Apoorv's focus--his thesis is on social network extraction.  So we have to know how well the NER system is performing, and try to "improve" its results without digging into the NER system itself.

There were two problems.  First, the NER system sometimes mistakenly splits what should be a single entity into multiple entities.  This messes up the generated social network, because you end up with multiple vertices for the same person, which distracts the viewer.  For instance, for Alice in Alice in Wonderland, entity #1 (the first entity the NER gave us) had 67 entity mentions of "alice" and 3 of "poor alice", among many other mentions like "she", "her", etc.  Entity #81, meanwhile, had 38 mentions of "alice", 3 mentions of "alice hastily", etc.  We need to merge these.  To us humans it is clear that #1 and #81 both refer to the novel's main character Alice--but how can we get computers to make similar decisions?

Our solution was to find all the different entity mentions in the output and create feature vectors from them.  For the sake of simplicity, say that in an NER output, once we ignore words like "she" and "he", only "alice", "poor alice", "a little girl", "queen", "alice hastily", "her sister" and "king" were mentioned.  We create a feature vector of (# of occurrences of "alice", # of occurrences of "poor alice", ..., # of "king") for each entity in the output.  If E1 = {"alice"x67, "poor alice"x3}, then E1 gets the feature vector (67, 3, 0, 0, 0, 0, 0).  Similarly, E81 gets (38, 0, 0, 0, 3, 0, 0).  If we then think of these as vectors in 7-dimensional space (about which I have no idea) and calculate their cosine similarity (which I just learned last week from Apoorv), they turn out to be surprisingly similar (> 0.99).  The implementation of entity merging generates a mapping from the IDs of one or more entities to the ID of a single entity (this part of the code is in the screenshot).  Say entities 1, 13 and 81 are all actually Alice; then we have a map that sends 13 to 1 and 81 to 1.  When we present the result to users, I check whether each entity is in this duplication mapping.
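Here is a quick Python sketch of that cosine similarity calculation on the two example vectors (just an illustration of the math, not our actual code):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot product
    # divided by the product of their lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Feature vectors over the mention vocabulary ("alice", "poor alice",
# "a little girl", "queen", "alice hastily", "her sister", "king"):
e1 = (67, 3, 0, 0, 0, 0, 0)
e81 = (38, 0, 0, 0, 3, 0, 0)
```

Running `cosine_similarity(e1, e81)` gives a value above 0.99, because both vectors point almost entirely along the "alice" axis.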

Running the code to merge entities and guess names

The second problem was that the NER system gives us no information about a person's real or best name, so I wrote some code to address this.  For instance, for entity #6 we have (after removing words like "she" and "he"): {"a white rabbit"=1, "the rabbit"=16, "the white rabbit"=11, "the white rabbit, who was peeping"=1, "the white rabbit, who said,"=1}.  Clearly this entity is the rabbit.  Here "the rabbit" is the most frequently used entity mention, so my program picks "the rabbit" as the best name.  The reason we remove the common pronouns is that otherwise "she" and "he" would often be picked as the best name, which wouldn't make sense--no one wants to see a social network graph with vertices called "she" and "he" all over the place, interacting with each other.  When the entity mention counts are tied, we choose the first entity mention, because most of the time the name is clear when a character is formally introduced.
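A minimal Python sketch of that best-name rule (the tiny pronoun list is just a stand-in for whatever stopword filtering we actually do):

```python
# Hypothetical pronoun list; the real filter is larger.
PRONOUNS = {"she", "he", "her", "him", "it", "they", "them"}

def best_name(mention_counts):
    """Pick the most frequent non-pronoun mention; on a tie, keep
    the mention that appeared first in the text.

    `mention_counts` is a list of (mention, count) pairs ordered by
    first appearance, so max() naturally breaks ties toward the
    earlier mention (it only replaces on a strictly greater count).
    """
    candidates = [(m, c) for m, c in mention_counts if m not in PRONOUNS]
    return max(candidates, key=lambda mc: mc[1])[0]
```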

After this, I wrote a simple script in Python to crawl a website to obtain test text data for later use.

Then I wrote a program to evaluate the NER system by comparing its output against a gold standard produced by paid human annotators.  I used simple exact span matching, and it didn't work very well.  For instance, if there is a span 10000-10002 corresponding to "cat" and another span 9996-10002 corresponding to "the cat", my current program gives a score of zero--yet this "cat" vs. "the cat" mistake is not a serious one and shouldn't be punished so badly.  Because I was pulled away to other programming tasks, I haven't implemented a more flexible span matching method yet, but I will.  After this, we also map the entities into a multidimensional space and calculate the similarity between each output entity from the NER system and each entity from the gold standard, to see how similar they are.
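One possible flexible matcher would score the overlap between spans instead of demanding exact equality.  A Python sketch of the idea (overlap length over union length, on inclusive character offsets as in the example above--an illustration, not something I have implemented yet):

```python
def span_overlap_score(gold, predicted):
    """Score two inclusive character-offset spans: 1.0 for an exact
    match, 0.0 for disjoint spans, and partial credit for near misses
    like "cat" (10000-10002) vs. "the cat" (9996-10002).
    """
    (gs, ge), (ps, pe) = gold, predicted
    overlap = min(ge, pe) - max(gs, ps) + 1
    if overlap <= 0:
        return 0.0
    union = max(ge, pe) - min(gs, ps) + 1
    return overlap / union
```

Under this scheme the "cat" vs. "the cat" pair scores 3/7 instead of zero, which seems much fairer.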

This Monday I started another small project, using Java, HTML and JavaScript, to help Apoorv, Anup and Dr. Rambow analyze experiment results for a paper that is due this Friday (I know, the deadline is so close now...)!  Basically, the program takes machine learning example output from Java, displays it in a webpage, and dynamically inserts columns from experiments provided in JSON format that map example IDs to scores.  It colors results green when they agree with the gold, and red otherwise.  The user can type commands in the web console to filter rows (the one in the screenshot means that I want to see only the examples that model 1 got right but on which models 2, 3, 4 and 5 all failed).  It makes comparing machine learning models much easier.
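The filtering logic amounts to something like this (a Python sketch of the idea; the real console command runs as JavaScript in the browser):

```python
def filter_examples(results, right, wrong):
    """Return the example IDs where every model in `right` agreed
    with the gold and every model in `wrong` did not.

    `results` maps example ID -> {model ID: True if that model's
    output matched the gold label}.
    """
    return sorted(
        ex for ex, by_model in results.items()
        if all(by_model[m] for m in right)
        and all(not by_model[m] for m in wrong)
    )
```

For example, `filter_examples(results, right=[1], wrong=[2, 3, 4, 5])` would reproduce the query from the screenshot.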

I expected to start on sentiment analysis in week 2, but obviously that didn't happen...still, working on postprocessing and building all kinds of utilities is fun, too.  Hopefully I will start the coolest part of the research soon!

By the way, we still go to work on July 4th!

Thursday, June 20, 2013

NLP week 1: visualizing social networks

This summer I am working at the Center for Computational Learning Systems at Columbia University.  It's great to work at a computer "lab," because we basically come and go whenever we want.  Why?  Because computer science people are typically pretty motivated and self-disciplined--they probably work even more at home at night than at the office during the day.

The focus of my mentor, Apoorv, is to have computers extract a social network from text--that's right, it means computers will have to "read and understand" English text!

My job this first week is simply to build a web interface to the system: it accepts arbitrary text input from anyone on the Internet, passes it on to the program that Apoorv and his team have built, collects the program's output, parses it, and visualizes it in the user's browser.  Sounds too abstract?  See the following example.


In the screenshot above, the circles (vertices) are "occurrences"--each time a name appears in the text, that is an occurrence.  The arrows (arcs) between vertices denote observations (as in "I see you in the restaurant," where "I" am aware of your existence but "you" are not aware of mine).  Another type of connection is called an interaction, where both parties are aware of each other.  If you are interested in this notation, read Apoorv's paper.

In the next step, I will postprocess the result from Apoorv's system and merge occurrences--as you can see, Charlie appears three times in the generated graph above; in the next version he will be merged into a single vertex.

In other words, I am just building a demo system that lets more users try our system.  You may have noticed that the arrows point in completely wrong directions, and that some of the connections should have arrows in both directions.  Yes--it's a known bug, and we are still trying to figure out why.

Let me show you the stack.  From the bottom up: Apoorv's Java program; then a TCP socket server I wrote in Java that listens locally for requests, parses the results in .net format, and generates JSON.  On top of my socket server sits a Node.js program I wrote with the Express framework, which simply serves the webpage above, passes English sentences down to the Java program, and transfers JSON results back to the browser.  Visualization happens mostly in your browser, where I used the D3.js library to calculate positions for the circles and lines according to physical laws, and SVG to actually render them.
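To give a flavor of the socket-server layer (the real one is written in Java and parses .net graph files; this is just a minimal Python sketch of the same shape--listen locally, read a request, reply with JSON):

```python
import json
import socketserver

class GraphRequestHandler(socketserver.StreamRequestHandler):
    # Reads one line of input text and replies with a JSON document.
    # In the real stack, the reply would carry the parsed graph
    # (vertices and arcs) instead of empty placeholder lists.
    def handle(self):
        text = self.rfile.readline().decode("utf-8").strip()
        reply = {"input": text, "vertices": [], "edges": []}
        self.wfile.write((json.dumps(reply) + "\n").encode("utf-8"))

def serve(host="127.0.0.1", port=4444):
    # The port number here is hypothetical.
    with socketserver.TCPServer((host, port), GraphRequestHandler) as server:
        server.serve_forever()
```

The Node.js layer then only has to open a local TCP connection, write a line, and read back JSON.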

Bonus: a graph for you to play around with (a modern browser that supports SVG is required).   You can drag anyone to move them around!
Hope you liked it!

The real research hasn't started yet.  Hopefully I will get to the natural language processing part as early as next Monday, after I finish this web interface.  So, see you next week!