update README. (also fix sample.ini for out-of-the-box SR parsing)

brendano · brendano · commit a9df4919b4a2 · 2014-07-29T17:32:16.000-04:00
diff --git a/README.md b/README.md
@@ -1,65 +1,60 @@
-**EXPERIMENTAL DO NOT USE WITHOUT LOTS OF TESTING**
+This is a Python wrapper for the [Stanford CoreNLP][1] library, allowing
+sentence splitting, POS/NER, and parse annotations.  (Coreference is not
+supported currently.)  It runs the Java software as a subprocess and
+communicates over sockets.  This library handles timeouts and some process
+management (restarting the server if it seems to have crashed).
 
-Java files copied from [github.com/brendano/myutil](github.com/brendano/myutil).
+Alternatives you may want to consider:
 
-License GPL version 2 or later
+  * https://bitbucket.org/torotoki/corenlp-python
+  * https://github.com/dasmith/stanford-corenlp-python
+  * http://py4j.sourceforge.net/
 
-see also
-* https://bitbucket.org/torotoki/corenlp-python
-* https://github.com/dasmith/stanford-corenlp-python
+This wrapper's defaults assume CoreNLP 3.4.  It uses whatever the CoreNLP
+default settings are, but they can be overridden with a configuration file.  The
+included `sample.ini` configuration file, for example, runs with the
+[shift-reduce][2] parser (and requires the appropriate model file to be
+downloaded into `corenlp_libdir`).
 
-This wrapper assumes the use of CoreNLP 3.4 and the new
-[shift-reduce](http://nlp.stanford.edu/software/srparser.shtml). The `sample.ini` config file is
-setup to use these options.
+[1]: http://nlp.stanford.edu/software/corenlp.shtml
+[2]: http://nlp.stanford.edu/software/srparser.shtml
 
-##Install
+## Install
 
 You can install the program using something like:
 
 ```
 git clone https://github.com/brendano/stanford-corepywrapper
 pip install stanford-corepywrapper
 ```
+## Usage
 
-##Usage
-
-The return values are JSON-safe data structures (in fact, the python<->java
-communication is a JSON protocol).
-
-See javasrc/corenlp/Parse.java for the allowable pipeline types.
-
-TODO List:
-* Downgrade most of the messages to DEBUG not INFO once we think this code works ok
-
-Note some of these messages are stderr from the CoreNLP subprocess. Everything
-starting with `INFO:` is from the Python logging system and is from the parent
-process.
+The basic arguments to open a server are (1) the pipeline type (see
+`javasrc/corenlp/Parse.java` for the list of possible ones), and (2) the
+directory that contains the CoreNLP jar files.  Here we assume it's been
+unzipped in the current directory.
 
 ```
 >>> import sockwrap
->>> p=sockwrap.SockWrap("pos")
-
-INFO:StanfordSocketWrap:Starting pipe subprocess, and waiting for signal it's ready, with command:  exec java -Xmx4g -cp /Users/brendano/sw/nlp/stanford-pywrapper/lib/piperunner.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/guava-13.0.1.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/jackson-all-1.9.11.jar:/users/brendano/sw/nlp/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1.jar:/users/brendano/sw/nlp/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar     corenlp.PipeCommandRunner --server 12340 pos
+>>> p=sockwrap.SockWrap("pos", corenlp_libdir="stanford-corenlp-full-2014-06-16")
 
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
+INFO:StanfordSocketWrap:Starting pipe subprocess, and waiting for signal it's ready, with command:  exec java -Xmx4g -cp /Users/brendano/sw/nlp/stanford-pywrapper/lib/piperunner.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/guava-13.0.1.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/jackson-all-1.9.11.jar:stanford-corenlp-full-2014-06-16/stanford-corenlp-3.4.jar:stanford-corenlp-full-2014-06-16/stanford-corenlp-3.4-models.jar:stanford-corenlp-full-2014-06-16/stanford-srparser-2014-07-01-models.jar     corenlp.PipeCommandRunner --server 12340  --mode pos
+[Server] Using mode type: pos
 Adding annotator tokenize
 Adding annotator ssplit
 Adding annotator pos
-Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... 
-
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-done [1.5 sec].
+Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
+done [1.6 sec].
 Adding annotator lemma
 [Server] Started socket server on port 12340
-INFO:StanfordSocketWrap:Socket timeout happened, returning None: <class 'socket.timeout'> timed out
 INFO:StanfordSocketWrap:Successful ping. The server has started.
 INFO:StanfordSocketWrap:Subprocess is ready.
+```
+
+The return values are JSON-safe data structures (in fact, the python<->java
+communication is a JSON protocol).
 
+```
 >>> p.parse_doc("hello world. how are you?")
 {u'sentences': 
     [
@@ -75,7 +70,11 @@ INFO:StanfordSocketWrap:Subprocess is ready.
         }
     ]
 }
+```
+
+Here is how to specify a configuration file:
 
+```
 >>> p=sockwrap.SockWrap("justparse", configfile='sample.ini')
 >>> p.parse_doc("hello world. how are you?")
 {u'sentences':
@@ -93,3 +92,50 @@ INFO:StanfordSocketWrap:Subprocess is ready.
     ]
 }
 ```
+
+## Notes
+
+* A pipeline type is a notion only in our server, not in CoreNLP itself.
+  (TODO: we should get rid of these eventually?)  Our server uses the pipeline
+  type in order to know what annotators to set, and what annotations to return.
+  (It appears that the annotators can be overridden with a configuration file.)
+
+* The configuration files are Java properties files, which I think are the .ini
+  format, but am not sure.  It used to be that CoreNLP came with some sample
+  versions of these, but I can't find any at the moment.  (TOCONSIDER: maybe we
+  should abolish this in the interface and use a Python dict instead?)
+
+* Some of the output messages are stderr from the CoreNLP subprocess.
+  Everything starting with `INFO:` or `WARNING:` is from the Python logging
+  system, in the parent process.  Messages starting with `[Server]` are from the
+  Java subprocess, in our server code (but not from Stanford CoreNLP).
+
+* To use a different CoreNLP version, make sure the `corenlp_libdir` and
+  `corenlp_jars` parameters are correct.  If a future CoreNLP breaks binary
+  (Java API) compatibility, you'll have to edit the Java server code and
+  re-compile `piperunner.jar` via `./build.sh`.
+
+* If you want to run multiple instances on the same machine, make sure each
+  SockWrap instance has a unique port number.  (TOCONSIDER: use a different
+  mechanism that doesn't require port numbers.)
+
+* An important to-do is to test this code's robustness in a variety of
+  situations.  Bugs will probably occur when processing larger and larger
+  datasets, and I don't know the right policies to have for timeouts, when to
+  give up and restart after a timeout, and whether to re-try analyzing a
+  document or give up and move on (because state dependence and "killer
+  documents" screw all this up in different ways).  Thanks to John Beieler for
+  testing on the PETRARCH news analysis pipeline.
+
+## Testing
+
+There are some pytest-style tests, though they're incomplete. Run:
+
+    py.test -v sockwrap.py
+
+## License etc.
+
+Copyright Brendan O'Connor (http://brenocon.com).  
+License GPL version 2 or later.
+
+Some Java files were copied from [github.com/brendano/myutil](https://github.com/brendano/myutil).
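
Review note on the "multiple instances" bullet in the new Notes section: one possible way to get a unique port per instance is to ask the OS for free ports instead of hard-coding them. The sketch below uses only the Python standard library. `find_free_ports` is a hypothetical helper, not part of this wrapper, and since this diff does not show how a port number is passed to `SockWrap`, the constructor call is deliberately left out. There is also an unavoidable race: another process could claim a port between the probe socket closing and the Java server starting, so treat the result as a good guess rather than a guarantee.

```python
import socket


def find_free_ports(count, host="127.0.0.1"):
    """Return `count` distinct, currently unused TCP ports (hypothetical helper).

    All probe sockets are held open until every port has been chosen,
    so the OS cannot hand out the same port twice within one call.
    """
    socks = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            socks.append(s)  # track first, so it is closed even if bind fails
            # Binding to port 0 asks the OS to pick any free port.
            s.bind((host, 0))
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()


ports = find_free_ports(3)
```

Each entry of `ports` could then be given to a separate wrapper instance, assuming the constructor accepts a port setting.
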
diff --git a/sample.ini b/sample.ini
@@ -1,3 +1,2 @@
 annotators = tokenize,ssplit,pos,lemma,parse
-tokenize.whitespace = true
 parse.model = edu/stanford/nlp/models/srparser/englishSR.ser.gz
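
Review note on this file and on the README's TOCONSIDER about using a Python dict: a minimal sketch of generating the same two settings from a dict. `format_properties` is a hypothetical helper, not something in this repository. It only emits plain `key = value` lines and assumes keys and values need no escaping, so it does not cover the full Java properties syntax.

```python
def format_properties(settings):
    """Render a dict as simple `key = value` lines (hypothetical helper).

    Assumes keys and values are plain strings with no newlines or
    backslashes, and no '=' or ':' in keys, so no escaping is done.
    """
    lines = []
    for key in sorted(settings):
        value = settings[key]
        # Allow list values for comma-separated options such as annotators.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        lines.append("%s = %s" % (key, value))
    return "\n".join(lines) + "\n"


# The same settings as sample.ini after this commit.
sample = {
    "annotators": ["tokenize", "ssplit", "pos", "lemma", "parse"],
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
}
text = format_properties(sample)
```

Writing `text` to a file and passing its path as `configfile=`, as the README does with `sample.ini`, should then behave the same as the checked-in file.
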
