- **EXPERIMENTAL DO NOT USE WITHOUT LOTS OF TESTING**
+ This is a Python wrapper for the [Stanford CoreNLP][1] library, allowing
+ sentence splitting, POS/NER, and parse annotations. (Coreference is not
+ supported currently.) It runs the Java software as a subprocess and
+ communicates over sockets. This library handles timeouts and some process
+ management (restarting the server if it seems to have crashed).

- Java files copied from [github.com/brendano/myutil](github.com/brendano/myutil).
+ Alternatives you may want to consider:

- License GPL version 2 or later
+ * https://bitbucket.org/torotoki/corenlp-python
+ * https://github.com/dasmith/stanford-corenlp-python
+ * http://py4j.sourceforge.net/

- see also
- * https://bitbucket.org/torotoki/corenlp-python
- * https://github.com/dasmith/stanford-corenlp-python
+ This wrapper's defaults assume CoreNLP 3.4. It uses whatever the CoreNLP
+ default settings are, but they can be overridden with a configuration file. The
+ included `sample.ini` configuration file, for example, runs with the
+ [shift-reduce][2] parser (and requires the appropriate model file to be
+ downloaded into `corenlp_libdir`.)

- This wrapper assumes the use of CoreNLP 3.4 and the new
- [shift-reduce](http://nlp.stanford.edu/software/srparser.shtml). The `sample.ini` config file is
- setup to use these options.
+ [1]: http://nlp.stanford.edu/software/corenlp.shtml
+ [2]: http://nlp.stanford.edu/software/srparser.shtml

- ##Install
+ ## Install

You can install the program using something like:

```
git clone https://github.com/brendano/stanford-corepywrapper
# Install from the local checkout (a bare name would be looked up on PyPI)
pip install ./stanford-corepywrapper
```
+ ## Usage

- ##Usage
-
- The return values are JSON-safe data structures (in fact, the python<->java
- communication is a JSON protocol).
-
- See javasrc/corenlp/Parse.java for the allowable pipeline types.
-
- TODO List:
- * Downgrade most of the messages to DEBUG not INFO once we think this code works ok
-
- Note some of these messages are stderr from the CoreNLP subprocess. Everything
- starting with `INFO:` is from the Python logging system and is from the parent
- process.
+ The basic arguments to open a server are (1) the pipeline type (see
+ `javasrc/corenlp/Parse.java` for the list of possible ones), and (2) the
+ directory that contains the CoreNLP jar files. Here we assume it's been
+ unzipped in the current directory.

```
>>> import sockwrap
- >>> p=sockwrap.SockWrap("pos")
-
- INFO:StanfordSocketWrap:Starting pipe subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -cp /Users/brendano/sw/nlp/stanford-pywrapper/lib/piperunner.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/guava-13.0.1.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/jackson-all-1.9.11.jar:/users/brendano/sw/nlp/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1.jar:/users/brendano/sw/nlp/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar corenlp.PipeCommandRunner --server 12340 pos
+ >>> p=sockwrap.SockWrap("pos", corenlp_libdir="stanford-corenlp-full-2014-06-16")

- INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
- INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
+ INFO:StanfordSocketWrap:Starting pipe subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -cp /Users/brendano/sw/nlp/stanford-pywrapper/lib/piperunner.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/guava-13.0.1.jar:/Users/brendano/sw/nlp/stanford-pywrapper/lib/jackson-all-1.9.11.jar:stanford-corenlp-full-2014-06-16/stanford-corenlp-3.4.jar:stanford-corenlp-full-2014-06-16/stanford-corenlp-3.4-models.jar:stanford-corenlp-full-2014-06-16/stanford-srparser-2014-07-01-models.jar corenlp.PipeCommandRunner --server 12340 --mode pos
+ [Server] Using mode type: pos
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
- Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ...
-
- INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
- INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
- INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
- INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
- INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
- done [1.5 sec].
+ Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 61] Connection refused
+ done [1.6 sec].
Adding annotator lemma
[Server] Started socket server on port 12340
- INFO:StanfordSocketWrap:Socket timeout happened, returning None: <class 'socket.timeout'> timed out
INFO:StanfordSocketWrap:Successful ping. The server has started.
INFO:StanfordSocketWrap:Subprocess is ready.
+ ```
+
+ The return values are JSON-safe data structures (in fact, the python<->java
+ communication is a JSON protocol).

+ ```
>>> p.parse_doc("hello world. how are you?")
{u'sentences':
[
@@ -75,7 +70,11 @@ INFO:StanfordSocketWrap:Subprocess is ready.
}
]
}
+ ```
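Because the return values are JSON-safe, they can be written out and read back with the standard `json` module. The snippet below is a minimal sketch of that round trip. The `result` dictionary is a hypothetical stand-in for a `parse_doc` return value: only the top-level `sentences` key comes from the output above, and the `tokens` field is an assumption made for illustration.

```python
import json

# Hypothetical stand-in for a parse_doc() result. Only the top-level
# "sentences" key is taken from the README; "tokens" is an assumed field.
result = {"sentences": [{"tokens": ["hello", "world", "."]}]}


def roundtrip(obj):
    """Serialize obj to JSON text, then parse the text back."""
    return json.loads(json.dumps(obj))


# A JSON-safe structure survives the round trip unchanged.
restored = roundtrip(result)
```

This is mainly useful for caching parses on disk, one JSON document per line, so that a crashed run does not have to re-parse everything.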
+
+ Here is how to specify a configuration file:

+ ```
>>> p=sockwrap.SockWrap("justparse", configfile='sample.ini')
>>> p.parse_doc("hello world. how are you?")
{u'sentences':
@@ -93,3 +92,50 @@ INFO:StanfordSocketWrap:Subprocess is ready.
]
}
```
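To inspect what a configuration file sets before handing it to the server, the file can be read into a Python dict. Java properties files are flat `key = value` lines with no `[section]` headers, so the sketch below parses them by hand. It handles only `#` and `!` comments and the `=` and `:` separators, not line continuations or escapes, and the example settings are illustrative, not the contents of `sample.ini`.

```python
def parse_properties(text):
    """Parse a simple subset of the Java properties format into a dict.

    Supports '#' and '!' comment lines and '=' or ':' separators only.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        # Split on whichever separator appears first in the line.
        for i, ch in enumerate(line):
            if ch in "=:":
                key, value = line[:i], line[i + 1:]
                break
        else:
            key, value = line, ""  # a key with no value
        props[key.strip()] = value.strip()
    return props


# Illustrative settings only; this is NOT the contents of sample.ini.
example = """
# hypothetical settings
annotators = tokenize, ssplit, pos
parse.maxlen: 80
"""
settings = parse_properties(example)
```

All values come back as strings, which matches how properties files are read on the Java side.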
+
+ ## Notes
+
+ * A pipeline type is a notion only in our server, not in CoreNLP itself.
+ (TODO: we should get rid of these eventually?) Our server uses the pipeline
+ type in order to know what annotators to set, and what annotations to return.
+ (It appears that the annotators can be overridden with a configuration file.)
+
+ * The configuration files are Java properties files, which I think are the .ini
+ format, but am not sure. It used to be that CoreNLP came with some sample
+ versions of these, but I can't find any at the moment. (TOCONSIDER: maybe we
+ should abolish this in the interface and use a Python dict instead?)
+
+ * Some of the output messages are stderr from the CoreNLP subprocess.
+ Everything starting with `INFO:` or `WARNING:` is from the Python logging
+ system, in the parent process. Messages starting with `[Server]` are from the
+ Java subprocess, in our server code (but not from Stanford CoreNLP).
+
+ * To use a different CoreNLP version, make sure the `corenlp_libdir` and
+ `corenlp_jars` parameters are correct. If a future CoreNLP breaks binary
+ (Java API) compatibility, you'll have to edit the Java server code and
+ re-compile `piperunner.jar` via `./build.sh`.
+
+ * If you want to run multiple instances on the same machine, make sure each
+ SockWrap instance has a unique port number. (TOCONSIDER: use a different
+ mechanism that doesn't require port numbers.)
+
+ * An important to-do is to test this code's robustness in a variety of
+ situations. Bugs will probably occur when processing larger and larger
+ datasets, and I don't know the right policies to have for timeouts, when to
+ give up and restart after a timeout, and whether to re-try analyzing a
+ document or give up and move on (because state dependence and "killer
+ documents" screw all this up in different ways). Thanks to John Beieler for
+ testing on the PETRARCH news analysis pipeline.
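For the note on running multiple instances, one way to get a unique port per instance is to ask the operating system for a free one. The helper below is a sketch using only the standard `socket` module; `find_free_port` is a hypothetical name, not part of this wrapper, and this README does not show which `SockWrap` parameter takes the port, so check `sockwrap.py` for that.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port that is currently unused on host.

    Hypothetical helper, not part of sockwrap. Binding to port 0 lets
    the OS choose; the port is released again when the socket closes.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


port = find_free_port()
```

There is a small race here: another process could claim the port between this call and the server startup, so a caller should still be prepared for the server to fail to bind.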
+
+ ## Testing
+
+ There are some pytest-style tests, though they're incomplete. Run:
+
+     py.test -v sockwrap.py
+
+ ## License etc.
+
+ Copyright Brendan O'Connor (http://brenocon.com).
+ License GPL version 2 or later.
+
+ Some Java files were copied from [github.com/brendano/myutil](github.com/brendano/myutil).