Our 21 students are working in labs from NC (Duke) to MA (Harvard and MIT), and on topics from computer languages to tissue formation. Join us here to read weekly updates from their time in the lab!

Visit the EXP page on the Peddie website: peddie.org/EXP.

Saturday, July 20, 2013

NLP weeks 4 and 5: a crazy (but lazy) workaround and some machine learning

Hi!  My name is Jiehan Zheng and I work at CCLS at Columbia University on natural language processing and machine learning.  In previous weeks I did training data collection, evaluation and postprocessing work, and finally in weeks 4 and 5 I got to do some real machine learning!  It's been a busy but interesting two weeks.  I did so many things that I am not sure I can recall all of them...but let me try:

After I built the model comparison framework in HTML and JavaScript, Apoorv asked for a new feature: calculating p-values from the χ^2 statistic of McNemar's test, to indicate whether any two models perform significantly differently on the dataset.  I had no clue how to do this, even after looking at Wikipedia and several papers, so I asked Apoorv how he calculated the p-value before he had my framework.  He sent me a MATLAB function file that he used to use at IBM.
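
For reference, the χ^2 statistic itself is the easy part.  Below is a minimal sketch in JavaScript, assuming the uncorrected form of McNemar's statistic, (b - c)^2 / (b + c), where b and c count the examples on which exactly one of the two models is right.  The function name and the three parallel arrays (gold, predA, predB) are made up for illustration; this is not the code from the framework or from the MATLAB file, which may well apply a continuity correction.

```javascript
// Sketch: uncorrected McNemar chi-squared statistic for two models that
// were evaluated on the same examples.
// gold, predA and predB are hypothetical parallel arrays of labels.
function mcnemarChiSquared(gold, predA, predB) {
  if (gold.length !== predA.length || gold.length !== predB.length) {
    throw new Error("gold, predA and predB must have the same length");
  }
  let b = 0; // model A correct, model B wrong
  let c = 0; // model A wrong, model B correct
  for (let i = 0; i < gold.length; i++) {
    const aRight = predA[i] === gold[i];
    const bRight = predB[i] === gold[i];
    if (aRight && !bRight) b++;
    else if (!aRight && bRight) c++;
  }
  // With no discordant pairs the models are indistinguishable on this data.
  const chi2 = b + c === 0 ? 0 : ((b - c) ** 2) / (b + c);
  return { b, c, chi2 };
}
```

Only the discordant pairs matter here: examples that both models get right, or both get wrong, do not change the statistic.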

I tried MATLAB on a Columbia server and verified that this function file works.  Then another piece of software, Octave, immediately came to mind.  Octave is open-source, and although it doesn't advertise itself as such, it is known as an open-source "implementation" of MATLAB's features that everyone can use freely.  So I ran the function in Octave as well, and it worked there too.

I then looked into the source code of that MATLAB function (although I've never used MATLAB before...), and found that although most of the calculation steps are fairly simple and straightforward, it calls a MATLAB function named chi2cdf().  From the MATLAB documentation I learned that chi2cdf(), as expected, involves a definite integral.  So I basically ran into the same problem again: I did not know how to calculate a definite integral in JavaScript.
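
For the curious, here is one way that integral could be approximated directly in JavaScript.  It is only a sketch and not what the project ended up using.  It assumes one degree of freedom (the case McNemar's test needs), where the p-value 1 - chi2cdf(x, 1) equals 1 - 2 * (integral of the standard normal density from 0 to sqrt(x)), and it evaluates that integral with composite Simpson's rule.  All function names are made up for illustration.

```javascript
// Density of the standard normal distribution.
function standardNormalPdf(z) {
  return Math.exp(-0.5 * z * z) / Math.sqrt(2 * Math.PI);
}

// Composite Simpson's rule for the integral of f over [a, b],
// using n sub-intervals (n is rounded up to an even number).
function simpson(f, a, b, n = 1000) {
  if (n % 2 !== 0) n += 1;
  const h = (b - a) / n;
  let sum = f(a) + f(b);
  for (let i = 1; i < n; i++) {
    sum += f(a + i * h) * (i % 2 === 0 ? 2 : 4);
  }
  return (sum * h) / 3;
}

// Upper-tail p-value of a chi-squared statistic with ONE degree of freedom.
// Assumption: 1 df only; other degrees of freedom need a different integrand.
function chi2PValueOneDf(chi2) {
  if (!(chi2 >= 0)) {
    throw new RangeError("chi2 must be a non-negative number");
  }
  const z = Math.sqrt(chi2);
  const p = 1 - 2 * simpson(standardNormalPdf, 0, z);
  // Rounding can push p slightly outside [0, 1]; very small p-values
  // lose precision with this approach.
  return Math.min(1, Math.max(0, p));
}
```

Usage: chi2PValueOneDf(chi2) takes a χ^2 value and returns a number between 0 and 1.  A statistic of about 3.84 corresponds to a p-value of about 0.05.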

Then, I don't know why, a crazy (but lazy) idea came to my mind...  If it is hard to calculate a definite integral in JavaScript, then why bother doing it in JavaScript?!?!  I can simply install Octave on my server and set up an API there that passes the χ^2 value to Octave and asks Octave to calculate the p-value for us!  I quickly installed Octave and wrote a short Node.js program that listens for requests, spawns a new Octave process whenever it receives one, passes the χ^2 value to Octave's chi2cdf() function, collects the output and returns it to the browser.  So whenever Apoorv enters the command to calculate a p-value in the browser, my code sends a request to my server, waits for the server's response and displays the answer.  Luckily the idea actually worked, and I was able to implement it in a few hours.
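
The setup described above can be sketched roughly as follows.  This is a reconstruction under assumptions, not the original program: the query parameter name (chi2), the JSON response shape, the port and both function names are made up, and it assumes an octave binary on the PATH that accepts --quiet and --eval.  It also assumes chi2cdf() is available without extra setup; on newer Octave versions it lives in the statistics package and needs "pkg load statistics" first.

```javascript
const http = require("http");
const { execFile } = require("child_process");

// Validate the statistic and build the argument list for Octave.
// Only the parsed number is interpolated, never the raw request string,
// so a request cannot inject arbitrary Octave code.
function buildOctaveArgs(rawChi2) {
  if (typeof rawChi2 !== "string" || rawChi2.trim() === "") {
    throw new RangeError("chi2 is required");
  }
  const chi2 = Number(rawChi2);
  if (!Number.isFinite(chi2) || chi2 < 0) {
    throw new RangeError("chi2 must be a non-negative number");
  }
  // One degree of freedom, as used by McNemar's test.
  return ["--quiet", "--eval", `printf("%.10g", 1 - chi2cdf(${chi2}, 1))`];
}

// Create (but do not start) an HTTP server that answers
// GET /?chi2=<value> with {"p": <p-value>} computed by Octave.
function createPValueServer(octaveBinary = "octave") {
  return http.createServer((req, res) => {
    const url = new URL(req.url, "http://localhost");
    let args;
    try {
      args = buildOctaveArgs(url.searchParams.get("chi2"));
    } catch (err) {
      res.writeHead(400, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ error: err.message }));
      return;
    }
    // Spawn a fresh Octave process per request and collect its stdout.
    execFile(octaveBinary, args, { timeout: 10000 }, (err, stdout) => {
      if (err) {
        res.writeHead(500, { "Content-Type": "application/json" });
        res.end(JSON.stringify({ error: "Octave call failed" }));
        return;
      }
      res.writeHead(200, {
        "Content-Type": "application/json",
        // Lets a page served from another origin read the response.
        "Access-Control-Allow-Origin": "*",
      });
      res.end(JSON.stringify({ p: Number(stdout.trim()) }));
    });
  });
}

// To run it: createPValueServer().listen(3000);
```

Starting a new Octave process for every request is slow, but for a tool that one or two people use interactively it keeps the server code very short.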

Well, that's all I can say for now...and all of that was just the work I did on the afternoon of July 4.  For the rest of the time I wrote code to extract features and make training examples (without any human annotation!) in preparation for a machine learning task (sequence labeling), and I collected training data from several websites and stored it in a MongoDB database.  I also posted an answer on Stack Overflow for the first time, after looking for solutions to a PyMongo error and reading some of MongoDB's documentation!  Unfortunately I can't share more details on the machine learning task for now, but I will in the future!

Oh, and on July 4th Apoorv asked me if I'd like to see the fireworks from Dr. Rambow's apartment building.  I went and it was amazing!  I also changed my plane tickets so that I can extend my internship by a few days.  So now, officially, I will work at CCLS with Apoorv for almost 7 weeks!

Thanks for reading!  See you next week!
