Hi again, my name is Jiehan Zheng. I worked on NLP and some machine learning at Columbia University.
I skipped writing about week 6 because we were working on something secret! We will publish our work on that during the upcoming fall term if things go smoothly. So I am writing about my work during my 7th week. I was too busy working on the project so I didn't have time to post updates to this blog...
Since week 7 was the last week I would physically work at CCLS at Columbia University this summer, we chose to finish the things that require face-to-face collaboration first, so that we wouldn't have to wait on each other later. The web interface and the pipeline wrapper were what we had to finish together before I left, so those are what I mainly worked on during the last week.
Apoorv's work is on the pipeline that takes in sentences, gold dependency parse trees, semantic parse trees, and entity annotations. It spits out a file in graph modeling language containing the interactions between entities. To make the pipeline work on any unprocessed text and return a social network, it has to be wrapped in some extra code. I named that part of the code the "pipeline wrapper," and I feel like that's a smart name, isn't it?
So the pipeline wrapper has to take in raw text, split it into sentences, call various parsers, and convert the parsers' results into the format that the pipeline expects. There was existing code for this, but it no longer worked, and even when it did work, it was poorly written and inefficient. I rewrote the wrapper in a more organized way. For instance, the old wrapper had to call NYU Jet's main method twice, once to get named entities and once to split sentences. I read Jet's source code and managed to call Jet once and get both pieces of information, which makes it faster. I also prevented Jet from performing operations that we don't need and that take time, like relation extraction.
Then the pipeline gets dependency parses from the Stanford parser. My refactoring effort also enables us to run multiple tasks in parallel. For instance, we are going to run CMU's SEMAFOR semantic parser as well in the future, and running SEMAFOR takes a long time. Had we added SEMAFOR to the old wrapper, it would have had to wait until the Stanford parser finished its job. With the new structure, SEMAFOR and the Stanford parser run in different processes, so they can take advantage of multiple CPU cores and run at the same time, cutting the running time by at least 50%. SEMAFOR integration is a bit harder than for the other parsers, so I decided to work on that after I go back to China.
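To make the parallel part more concrete, here is a minimal sketch in Python of starting two parsers as separate processes and waiting for both. It is not the actual wrapper code: the two parser commands are hypothetical stand-ins (tiny Python one-liners that echo their input with a prefix), and the names `run_in_parallel` and `parse_text` are made up for this illustration. The real wrapper would launch the Stanford parser and SEMAFOR with their own command lines.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the real parser commands. Each one reads the
# sentences from stdin and writes a fake "parse" to stdout.
FAKE_DEPENDENCY_PARSER = [
    sys.executable, "-c",
    "import sys; print('dep:' + sys.stdin.read().strip())",
]
FAKE_SEMANTIC_PARSER = [
    sys.executable, "-c",
    "import sys; print('sem:' + sys.stdin.read().strip())",
]


def run_in_parallel(commands, input_path, work_dir):
    """Start every command first, then wait for all of them to finish."""
    running = []
    for name, command in commands.items():
        out_path = Path(work_dir) / (name + ".out")
        stdin = open(input_path)
        stdout = open(out_path, "w")
        # Popen returns immediately, so the next parser starts right away
        # instead of waiting for this one to finish.
        proc = subprocess.Popen(command, stdin=stdin, stdout=stdout)
        running.append((name, proc, stdin, stdout, out_path))

    results = {}
    for name, proc, stdin, stdout, out_path in running:
        returncode = proc.wait()
        stdin.close()
        stdout.close()
        if returncode != 0:
            raise RuntimeError(name + " exited with status " + str(returncode))
        results[name] = out_path.read_text()
    return results


def parse_text(text):
    """Write the text to a temporary file and run both parsers on it."""
    with tempfile.TemporaryDirectory() as work_dir:
        input_path = Path(work_dir) / "sentences.txt"
        input_path.write_text(text)
        commands = {
            "dependency": FAKE_DEPENDENCY_PARSER,
            "semantic": FAKE_SEMANTIC_PARSER,
        }
        return run_in_parallel(commands, input_path, work_dir)


if __name__ == "__main__":
    for name, output in parse_text("Alice met Bob.").items():
        print(name, "->", output.strip())
```

Because both processes are started before either is waited on, the total time is roughly that of the slower parser rather than the sum of the two.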
After we have all the parses and other files, the wrapper calls the pipeline with the files and waits for the pipeline to finish processing them. Once it gets the interactions in the text, the wrapper calls the postprocessor that I made during week 2, which merges duplicate entities, finds the best name for each entity, analyzes the interactions, and finally organizes all this information and outputs a social network file.
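The postprocessing steps can be illustrated with a small Python sketch. This is only an illustration under loud assumptions, not the real postprocessor: here two names count as the same entity when the words of one are contained in the other, the "best" name is simply the longest one, and the output is a GML-style graph. The function names `merge_entities`, `build_network` and `to_gml` are hypothetical.

```python
from collections import Counter


def merge_entities(names):
    """Map each name to a canonical name.

    Assumed heuristic: a name is a duplicate of a longer name when all of
    its words appear in the longer name ("Alice" vs. "Alice Liddell").
    """
    canonical = {}
    groups = []  # list of (set of lowercase words, canonical name)
    # Visit longer names first so that they become the canonical forms.
    ordered = sorted(set(names), key=lambda n: (-len(n.split()), n))
    for name in ordered:
        words = set(name.lower().split())
        for group_words, group_name in groups:
            if words <= group_words:
                canonical[name] = group_name
                break
        else:
            groups.append((words, name))
            canonical[name] = name
    return canonical


def build_network(interactions):
    """Count interactions between merged entities.

    `interactions` is a list of (name_a, name_b) pairs.
    """
    names = [name for pair in interactions for name in pair]
    canonical = merge_entities(names)
    edges = Counter()
    for a, b in interactions:
        a, b = canonical[a], canonical[b]
        if a != b:  # ignore an entity "interacting" with itself
            edges[tuple(sorted((a, b)))] += 1
    return edges


def to_gml(edges):
    """Render the weighted edges as a GML-style graph (names are not escaped)."""
    nodes = sorted({name for edge in edges for name in edge})
    ids = {name: i for i, name in enumerate(nodes)}
    lines = ["graph [", "  directed 0"]
    for name in nodes:
        lines.append('  node [ id %d label "%s" ]' % (ids[name], name))
    for (a, b), weight in sorted(edges.items()):
        lines.append("  edge [ source %d target %d weight %d ]"
                     % (ids[a], ids[b], weight))
    lines.append("]")
    return "\n".join(lines)


if __name__ == "__main__":
    interactions = [
        ("Alice", "Bob"),
        ("Alice Liddell", "Bob"),
        ("Bob", "Queen"),
    ]
    print(to_gml(build_network(interactions)))
```

In this toy input, "Alice" and "Alice Liddell" end up as one node, so the two separate interactions with Bob become a single edge with weight 2.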
The web interface is just pure programming effort and is nowhere near as interesting as working on the pipeline wrapper and other machine learning aspects. My work on the pipeline wrapper, postprocessor and web interface has been included in a demo paper that is going to be presented at IJCNLP 2013 this October in Japan, and I've been made a co-author on that paper--I am very excited about that!
Apoorv and I have made an arrangement with Mr. Corica that I will continue our work on that "secret project" as an independent project at Peddie during my fall term. This is indeed a very precious opportunity for me to learn more machine learning--from implementing tools, extracting features, running experiments and tuning SVM parameters and our features, to finally evaluating the results.
As for the rest of my summer, I did figure out a way to integrate SEMAFOR, so I will spend some time enhancing the web interface and pipeline wrapper by adding SEMAFOR integration. I will describe more in my next blog post!