Commit a9df491 (parent 731b0a9)

update README. (also fix sample.ini for out-of-the-box SR parsing)

2 files changed: +83, -38 lines


README.md (+83, -37)
@@ -1,65 +1,60 @@
-**EXPERIMENTAL DO NOT USE WITHOUT LOTS OF TESTING**
+This is a Python wrapper for the [Stanford CoreNLP][1] library, allowing
+sentence splitting, POS/NER, and parse annotations. (Coreference is not
+currently supported.) It runs the Java software as a subprocess and
+communicates over sockets. This library handles timeouts and some process
+management (restarting the server if it seems to have crashed).
 
-Java files copied from [github.com/brendano/myutil](github.com/brendano/myutil).
+Alternatives you may want to consider:
 
-License GPL version 2 or later
+* https://bitbucket.org/torotoki/corenlp-python
+* https://github.com/dasmith/stanford-corenlp-python
+* http://py4j.sourceforge.net/
 
-see also
-* https://bitbucket.org/torotoki/corenlp-python
-* https://github.com/dasmith/stanford-corenlp-python
+This wrapper's defaults assume CoreNLP 3.4. It uses whatever the CoreNLP
+default settings are, but they can be overridden with a configuration file. The
+included `sample.ini` configuration file, for example, runs with the
+[shift-reduce][2] parser (and requires the appropriate model file to be
+downloaded into `corenlp_libdir`).
 
-This wrapper assumes the use of CoreNLP 3.4 and the new
-[shift-reduce](http://nlp.stanford.edu/software/srparser.shtml). The `sample.ini` config file is
-setup to use these options.
+[1]: http://nlp.stanford.edu/software/corenlp.shtml
+[2]: http://nlp.stanford.edu/software/srparser.shtml
 
-##Install
+## Install
 
 You can install the program using something like:
 
 ```
 git clone https://github.com/brendano/stanford-corepywrapper
 pip install stanford-corepywrapper
 ```
+## Usage
 
-##Usage
-
-The return values are JSON-safe data structures (in fact, the python<->java
-communication is a JSON protocol).
-
-See javasrc/corenlp/Parse.java for the allowable pipeline types.
-
-TODO List:
-* Downgrade most of the messages to DEBUG not INFO once we think this code works ok
-
-Note some of these messages are stderr from the CoreNLP subprocess. Everything
-starting with `INFO:` is from the Python logging system and is from the parent
-process.
+The basic arguments to open a server are (1) the pipeline type (see
+`javasrc/corenlp/Parse.java` for the list of possible ones), and (2) the
+directory that contains the CoreNLP jar files. Here we assume it's been
+unzipped in the current directory.
 
 ```
 >>> import sockwrap
->>> p=sockwrap.SockWrap("pos")
-
-INFO:StanfordSocketWrap:Starting pipe subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -cp /Users/brendano/sw/nlp/stanford-pywrapper/lib/piperunner.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/guava-13.0.1.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/jackson-all-1.9.11.jar:/users/brendano/sw/nlp/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1.jar:/users/brendano/sw/nlp/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar corenlp.PipeCommandRunner --server 12340 pos
+>>> p=sockwrap.SockWrap("pos", corenlp_libdir="stanford-corenlp-full-2014-06-16")
 
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
+INFO:StanfordSocketWrap:Starting pipe subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -cp /Users/brendano/sw/nlp/stanford-pywrapper/lib/piperunner.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/guava-13.0.1.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/jackson-all-1.9.11.jar:stanford-corenlp-full-2014-06-16/stanford-corenlp-3.4.jar:stanford-corenlp-full-2014-06-16/stanford-corenlp-3.4-models.jar:stanford-corenlp-full-2014-06-16/stanford-srparser-2014-07-01-models.jar corenlp.PipeCommandRunner --server 12340 --mode pos
+[Server] Using mode type: pos
 Adding annotator tokenize
 Adding annotator ssplit
 Adding annotator pos
-Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ...
-
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
-done [1.5 sec].
+Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
+done [1.6 sec].
 Adding annotator lemma
 [Server] Started socket server on port 12340
-INFO:StanfordSocketWrap:Socket timeout happened, returning None: <class 'socket.timeout'> timed out
 INFO:StanfordSocketWrap:Successful ping. The server has started.
 INFO:StanfordSocketWrap:Subprocess is ready.
+```
+
+The return values are JSON-safe data structures (in fact, the python<->java
+communication is a JSON protocol).
 
+```
 >>> p.parse_doc("hello world. how are you?")
 {u'sentences':
 [
@@ -75,7 +70,11 @@ INFO:StanfordSocketWrap:Subprocess is ready.
 }
 ]
 }
+```
+
+Here is how to specify a configuration file:
 
+```
 >>> p=sockwrap.SockWrap("justparse", configfile='sample.ini')
 >>> p.parse_doc("hello world. how are you?")
 {u'sentences':
@@ -93,3 +92,50 @@ INFO:StanfordSocketWrap:Subprocess is ready.
 ]
 }
 ```
+
+## Notes
+
+* A pipeline type is a notion only in our server, not in CoreNLP itself.
+(TODO: we should get rid of these eventually?) Our server uses the pipeline
+type in order to know what annotators to set, and what annotations to return.
+(It appears that the annotators can be overridden with a configuration file.)
+
+* The configuration files are Java properties files, which I think are the .ini
+format, but am not sure. It used to be that CoreNLP came with some sample
+versions of these, but I can't find any at the moment. (TOCONSIDER: maybe we
+should abolish this in the interface and use a Python dict instead?)
+
+* Some of the output messages are stderr from the CoreNLP subprocess.
+Everything starting with `INFO:` or `WARNING:` is from the Python logging
+system, in the parent process. Messages starting with `[Server]` are from the
+Java subprocess, in our server code (but not from Stanford CoreNLP).
+
+* To use a different CoreNLP version, make sure the `corenlp_libdir` and
+`corenlp_jars` parameters are correct. If a future CoreNLP release breaks binary
+(Java API) compatibility, you'll have to edit the Java server code and
+re-compile `piperunner.jar` via `./build.sh`.
+
+* If you want to run multiple instances on the same machine, make sure each
+SockWrap instance has a unique port number. (TOCONSIDER: use a different
+mechanism that doesn't require port numbers.)
+
+* An important to-do is to test this code's robustness in a variety of
+situations. Bugs will probably occur when processing larger and larger
+datasets, and I don't know the right policies to have for timeouts, when to
+give up and restart after a timeout, and whether to re-try analyzing a
+document or give up and move on (because state dependence and "killer
+documents" screw all this up in different ways). Thanks to John Beieler for
+testing on the PETRARCH news analysis pipeline.
+
+## Testing
+
+There are some pytest-style tests, though they're incomplete. Run:
+
+    py.test -v sockwrap.py
+
+## License etc.
+
+Copyright Brendan O'Connor (http://brenocon.com).
+License GPL version 2 or later.
+
+Some Java files were copied from [github.com/brendano/myutil](https://github.com/brendano/myutil).
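The `parse_doc` output above is elided by the diff's hunk boundaries; only the outer `{u'sentences': [...]}` shape is visible. As a minimal sketch of what "JSON-safe" means in practice, the snippet below walks a result of that shape; the inner `tokens` and `pos` keys are hypothetical stand-ins for illustration, not the wrapper's documented schema:

```python
# Minimal sketch of consuming a parse_doc()-style result.  Only the outer
# {'sentences': [...]} shape appears in the README output above; the inner
# 'tokens' and 'pos' keys here are assumed for illustration.
import json

doc = {
    'sentences': [
        {'tokens': ['hello', 'world', '.'], 'pos': ['UH', 'NN', '.']},
        {'tokens': ['how', 'are', 'you', '?'], 'pos': ['WRB', 'VBP', 'PRP', '.']},
    ],
}

def tagged_tokens(doc):
    """Flatten the per-sentence annotations into (token, tag) pairs."""
    for sent in doc['sentences']:
        for tok, tag in zip(sent['tokens'], sent['pos']):
            yield (tok, tag)

pairs = list(tagged_tokens(doc))
print(pairs[:3])  # [('hello', 'UH'), ('world', 'NN'), ('.', '.')]

# "JSON-safe" means the whole structure round-trips through json unchanged:
assert json.loads(json.dumps(doc)) == doc
```

Because everything is plain lists, dicts, and strings, results can be written straight to disk with `json.dump` with no custom serialization.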

sample.ini (-1)

@@ -1,3 +1,2 @@
 annotators = tokenize,ssplit,pos,lemma,parse
-tokenize.whitespace = true
 parse.model = edu/stanford/nlp/models/srparser/englishSR.ser.gz
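The README's Notes raise the idea of passing a Python dict instead of a properties file. A rough sketch of that direction, serializing a flat dict into the simple `key = value` format used by `sample.ini` (the `to_properties` helper is illustrative, not part of the wrapper):

```python
# Illustrative helper (not part of the wrapper): render a flat dict of
# settings into the "key = value" properties format that sample.ini uses.
def to_properties(settings):
    return ''.join('%s = %s\n' % (k, v) for k, v in sorted(settings.items()))

sample = {
    'annotators': 'tokenize,ssplit,pos,lemma,parse',
    'parse.model': 'edu/stanford/nlp/models/srparser/englishSR.ser.gz',
}

print(to_properties(sample))
```

This covers only the flat `key = value` subset actually used here; full Java properties files also allow `:` separators, escapes, and line continuations.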

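One of the README's Notes says each SockWrap instance on a machine needs its own port (the logs above show 12340 as the default). The wrapper's port parameter is not named anywhere in this diff, so the sketch below only shows the generic OS-level trick for reserving distinct free ports, which a caller could then hand to however SockWrap accepts a port:

```python
import socket

def free_ports(n):
    """Reserve n distinct free TCP ports by holding them all open while
    choosing (binding to port 0 makes the OS pick an unused port)."""
    socks = [socket.socket(socket.AF_INET, socket.SOCK_STREAM) for _ in range(n)]
    try:
        for s in socks:
            s.bind(('127.0.0.1', 0))
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()

ports = free_ports(2)  # two distinct port numbers, one per server instance
```

There is a small race between closing a socket and the Java server re-binding that port, so this is a sketch rather than a guarantee; for occasional multi-instance use it is usually good enough.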