|
1 |
| -# sockpuppet |
| 1 | +# SockPuppet |
| 2 | +#### Having fun with WebSockets, Python, Golang and nytimes.com <br> |
| 3 | +<img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> <img src ="http://upload.wikimedia.org/wikipedia/commons/a/a7/Sock-puppet.jpg" height="50px"> |
| 4 | + |
| 5 | + |
| 6 | +<br> |
| 7 | +### What's this all about? |
| 8 | +Did you ever wonder how **nytimes.com** pushes breaking news articles to the front page while you have it open in your browser? Well, I used my browser's developer tools to look at what's going on, and it turns out they don't periodically reload JSON data but use websockets to push new events directly to your browser ([see here](https://developer.mozilla.org/en-US/docs/WebSockets) for more information about websockets).<br> |
| 9 | +It's a system called `nyt-fabrik`. A few talks and presentations give some insight into the architecture: [search google for "nytimes fabrik websockets"](https://www.google.com/search?q=nytimes+fabrik+websockets). |
| 10 | + |
| 11 | +There is example code, see [here for the Python code](blob/master/sockpuppet.py) and [here for the Golang example](blob/master/sockpuppet.go). |
| 12 | + |
| 13 | +<br> |
| 14 | +### Cool, so how does it work? |
| 15 | + |
| 16 | +When you go to **nytimes.com**, your browser will establish a websocket connection with the NYT fabrik server and, after a little login dance, will start listening for news events. |
| 17 | +Your browser opens a websocket TCP connection to e.g. `ws://blablabla.fabrik.nytimes.com./123/abcde123/websocket` and the server sends a one-character frame `o` which is a request to provide some sort of login identification.<br> |
| 18 | +The client (your browser) responds with `["{\"action\":\"login\",\"client_app\":\"hermes.push\",\"cookies\":{\"nyt-s\":\"SOME_COOKIE_VALUE_HERE\"}}"]` and, next thing you know, you receive either an `h` every 20-30 seconds, which is some sort of keep-alive, or a frame that starts with `a` and has all sorts of data encoded as JSON. |
| 19 | + |
| 20 | +If we receive a message starting with `a`, we can strip the first character and JSON decode the rest. |
| 21 | + |
| 22 | +```json |
| 23 | +{ |
| 24 | + "body": "{\"status\":\"updated\",\"version\":1,\"links\":[{\"url\":\"http://www.nytimes.com/2015/05/26/us/cleveland-police.html\",\"count\":0,\"content_id\":\"100000003702598\",\"content_type\":\"article\",\"offset\":0}],\"title\":\"Cleveland Is Said to Settle Justice Department Lawsuit Over Policing\",\"start_time\":1432581057,\"display_duration\":null,\"label\":\"Breaking News\",\"last_modified\":1432581057,\"display_type_id\":1,\"end_time\":1432581057,\"id\":34131339,\"sub_type\":\"BreakingNews\"}", |
| 25 | + "timestamp": "2015-05-21T11:21:11.123456Z", |
| 26 | + "hash_key": "34131339", |
| 27 | + "uuid": "1234", |
| 28 | + ... |
| 29 | + "account": "nyt1", |
| 30 | + "type": "feeds_item" |
| 31 | +} |
| 32 | +``` |
| 33 | + |
| 34 | +If the decoded message has a `body` field, that field is itself a JSON-encoded string, so we can decode it as well. For a breaking news item it looks something like this: |
| 35 | + |
| 36 | +```json |
| 37 | +{"status": "updated", "sub_type": "BreakingNews", |
| 38 | +"links": [{"url": "http://www.nytimes.com/2015/05/26/us/cleveland-police.html", "count": 0, "content_id": "100000003702598", "content_type": "article", "offset": 0}], |
| 39 | +"title": "Cleveland Is Said to Settle Justice Department Lawsuit Over Policing", |
| 40 | +"start_time": 1432581057, "display_duration": null, "label": "Breaking News", |
| 41 | +"version": 1, "display_type_id": 1, "end_time": 1432581057, |
| 42 | +"last_modified": 1432581057, "id": 34131339} |
| 43 | +``` |
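
The framing described above can be sketched in a few lines of plain Python 3. This is a minimal sketch under stated assumptions, not the code from `sockpuppet.py`: the helper name `decode_frame` is hypothetical, and it assumes the part after the leading `a` decodes either to a single JSON object or to a list of entries (objects or JSON-encoded strings, similar to how the login message is wrapped).

```python
import json


def decode_frame(payload):
    """Classify one raw frame and decode any JSON it carries.

    Returns a (kind, messages) tuple where kind is "open", "heartbeat",
    "data" or "unknown". Hypothetical helper, for illustration only.
    """
    if not payload:
        return "unknown", []
    if payload[0] == "o":
        return "open", []
    if payload[0] == "h":
        return "heartbeat", []
    if payload[0] != "a":
        return "unknown", []

    # Strip the leading "a" and JSON-decode the rest.
    decoded = json.loads(payload[1:])
    # Assumption: the server may send one object or a list of entries.
    entries = decoded if isinstance(decoded, list) else [decoded]

    messages = []
    for entry in entries:
        # Assumption: an entry may itself be a JSON-encoded string.
        if isinstance(entry, str):
            entry = json.loads(entry)
        # "body" is a JSON string of its own, so decode it a second time.
        if isinstance(entry, dict) and isinstance(entry.get("body"), str):
            entry = dict(entry, body=json.loads(entry["body"]))
        messages.append(entry)
    return "data", messages
```

Calling `decode_frame(payload)` on each incoming message would then give back the frame kind plus a list of decoded entries, so the caller only has to branch on `kind`.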
| 44 | +<br> |
| 45 | +### Neat but how do I access the feed programmatically? |
| 46 | + |
| 47 | +Good question. We need about 3-4 things to get this to work, and none of them is hard. For the Python example, I'll be using the [Tornado websocket framework](http://tornado.readthedocs.org/en/latest/websocket.html) and for the Golang example I'll be using the [Golang.org websocket package](https://godoc.org/golang.org/x/net/websocket). |
| 48 | + |
| 49 | +#### Connect to the websocket |
| 50 | + |
| 51 | +In Python, this is easy: |
| 52 | + |
| 53 | +```python |
| 54 | +url = "ws://blablabla.fabrik.nytimes.com./123/abcdef123/websocket" |
| 55 | +try: |
| 56 | + w = yield tornado.websocket.websocket_connect(url, connect_timeout=5) |
| 57 | + logging.info("Connected to %s", url) |
| 58 | +except Exception as ex: |
| 59 | + logging.error("couldn't connect, err: %s", ex) |
| 60 | +``` |
| 61 | + |
| 62 | +In Golang, it looks about the same: |
| 63 | + |
| 64 | +```go |
| 65 | +addr := "ws://blablabla.fabrik.nytimes.com./123/abcdef123/websocket" |
| 66 | +ws, err := websocket.Dial(addr, "", "http://www.nytimes.com/") |
| 67 | +if err != nil { |
| 68 | + log.Fatal(err) |
| 69 | +} |
| 70 | +log.Printf("Connected to %s", addr) |
| 71 | +``` |
| 72 | +That was easy, wasn't it? |
| 73 | + |
| 74 | +#### Listen for incoming messages |
| 75 | +Good, we now are connected and have a websocket object/struct we can work with, let's listen for incoming messages.<br> |
| 76 | + |
| 77 | +Python: |
| 78 | + |
| 79 | +```python |
| 80 | +while True: |
| 81 | + payload = yield w.read_message() |
| 82 | + if payload is None: |
| 83 | + logging.error("uh oh, we got disconnected") |
| 84 | + return |
| 85 | +``` |
| 86 | +and in Golang: |
| 87 | + |
| 88 | +```go |
| 89 | +var msgBuf = make([]byte, 4096) |
| 90 | +for { |
| 91 | + bufLen, err := ws.Read(msgBuf) |
| 92 | + if err != nil { |
| 93 | + log.Printf("read err: %s", err) |
| 94 | + return |
| 95 | + } |
| 96 | +``` |
| 97 | +One caveat here: the Golang version reads into a fixed 4k buffer, so a message longer than that arrives in 4k chunks rather than in one piece. For our purposes that's not an issue. |
| 98 | +
|
| 99 | +#### Send the login message |
| 100 | +
|
| 101 | +If we receive `o` we need to send the login message. We need a cookie value so let's make one up: |
| 102 | +
|
| 103 | +```python |
| 104 | +if payload[0] == "o": |
| 105 | + cookie = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(32)) |
| 106 | + msg = json.dumps(['{"action":"login", "client_app":"hermes.push", "cookies":{"nyt-s":"%s"}}' % cookie]) |
| 107 | + w.write_message(msg.encode('utf8')) |
| 108 | + logging.info("sent cookie: %s", cookie) |
| 109 | +``` |
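
Hand-writing the inner JSON with `%` formatting works for a random alphanumeric cookie, but it would produce invalid JSON if the value ever contained a quote or a backslash. As a hedged alternative, the sketch below builds the same shape by calling `json.dumps` twice. The names `make_cookie` and `build_login_message` are hypothetical, the field names are taken from the login frame shown earlier, and it assumes the server doesn't care about whitespace between JSON tokens.

```python
import json
import random
import string


def make_cookie(length=32):
    """Return a made-up alphanumeric cookie value."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def build_login_message(cookie):
    """Build the login frame: a JSON array holding one JSON-encoded string."""
    inner = json.dumps({
        "action": "login",
        "client_app": "hermes.push",
        "cookies": {"nyt-s": cookie},
    })
    # The outer dumps() escapes the quotes of the inner document for us.
    return json.dumps([inner])
```

With these helpers, the send step would shrink to something like `w.write_message(build_login_message(make_cookie()))`.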
| 110 | +
|
| 111 | +In Golang this is a bit more verbose: |
| 112 | +
|
| 113 | +```go |
| 114 | +if msgBuf[0] == 'o' { |
| 115 | + // reply to the login request |
| 116 | + cookie := randCookie() |
| 117 | + msg := fmt.Sprintf(`["{\"action\":\"login\", \"client_app\":\"hermes.push\", \"cookies\":{\"nyt-s\":\"%s\"}}"]`, cookie) |
| 118 | + _, err := ws.Write([]byte(msg)) |
| 119 | + if err != nil { |
| 120 | + log.Fatal(err) |
| 121 | + } |
| 122 | + log.Printf("Sent cookie: %s\n", cookie) |
| 123 | +} |
| 124 | +``` |
| 125 | +and `randCookie()` looks like this: |
| 126 | +
|
| 127 | +```go |
| 128 | +func randCookie() string { |
| 129 | + letters := []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890") |
| 130 | + b := make([]rune, 30) |
| 131 | + for i := range b { |
| 132 | + b[i] = letters[rand.Intn(len(letters))] |
| 133 | + } |
| 134 | + return string(b) |
| 135 | +} |
| 136 | +``` |
| 137 | +
|
| 138 | +#### Patiently wait; and (mostly) ignore the `h` messages |
| 139 | +Nothing much to do here: whenever we get an `h` message we can simply write `ping` to the console. |
| 140 | +
|
| 141 | +```python |
| 142 | +elif payload[0] == 'h': |
| 143 | + logging.info('ping') |
| 144 | +``` |
| 145 | +and |
| 146 | +
|
| 147 | +```go |
| 148 | +if payload[0] == 'h' { |
| 149 | + log.Println("ping") |
| 150 | +} |
| 151 | +``` |
| 152 | + |
| 153 | +
|
| 154 | +#### Decode the news alert message when we receive one |
| 155 | +
|
| 156 | +Messages from the server that start with `a` contain JSON encoded data that we can decode. |
| 157 | +Python first: |
| 158 | +
|
| 159 | +```python |
| 160 | +elif payload[0] == 'a': |
| 161 | + frame = json.loads(payload[1:]) |
| 162 | + if 'body' in frame: |
| 163 | + body = json.loads(frame['body']) |
| 164 | +``` |
| 165 | +Now you can check `if body['sub_type'] == "BreakingNews"` or do whatever else you plan on doing with this. |
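
As one hedged example of what to do with a decoded `body`, the sketch below pulls out the headline and article URLs of a breaking news item. The function name `summarize_alert` is hypothetical, the field names come from the sample message shown earlier, and it assumes that any other message type should simply be skipped.

```python
def summarize_alert(body):
    """Return "title: url, url" for a breaking news body, else None."""
    if body.get("sub_type") != "BreakingNews":
        return None
    # Collect every link that actually carries a URL.
    urls = [link["url"] for link in body.get("links", []) if "url" in link]
    title = body.get("title", "(no title)")
    if not urls:
        return title
    return "%s: %s" % (title, ", ".join(urls))
```

The returned string could be logged as is, or handed to whatever notification channel you prefer.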
| 166 | +
|
| 167 | +In Golang everything is a bit more verbose but roughly works the same (inlined and shortened for brevity). |
| 168 | +
|
| 169 | +```go |
| 170 | +if payload[0] == 'a' { |
| 171 | + |
| 172 | + frame := []struct { |
| 173 | + UUID string `json:"uuid"` |
| 174 | + Product string `json:"product"` |
| 175 | + Project string `json:"project"` |
| 176 | + ... |
| 177 | + Body string `json:"body,omitempty"` |
| 178 | + }{} |
| 179 | + |
| 180 | + // [1:] as we want to skip the leading character `a` |
| 181 | + err = json.Unmarshal(payload[1:], &frame) |
| 182 | + if err != nil { |
| 183 | + return |
| 184 | + } |
| 185 | + if len(frame.Body) > 1 { |
| 186 | + // here we should try to JSON unmarshal frame.Body |
| 187 | + } |
| 188 | +} |
| 189 | + |
| 190 | +``` |
| 191 | +`frame.Body` can now be unmarshaled in the same way as `payload[1:]` earlier. |
| 192 | +The resulting struct for it looks something like this: |
| 193 | +
|
| 194 | +```go |
| 195 | +type MessageBody struct { |
| 196 | + ID int `json:"id"` |
| 197 | + Title string `json:"title"` |
| 198 | + Status string `json:"status"` |
| 199 | + Version int `json:"version"` |
| 200 | + SubType string `json:"sub_type"` |
| 201 | + Label string `json:"label"` |
| 202 | + StartTime int `json:"start_time"` |
| 203 | + EndTime int `json:"end_time"` |
| 204 | + LastModified int `json:"last_modified"` |
| 205 | + Links []struct { |
| 206 | + URL string `json:"url"` |
| 207 | + ContentID string `json:"content_id"` |
| 208 | + } `json:"links"` |
| 209 | +} |
| 210 | + |
| 211 | +``` |
| 212 | +
|
| 213 | +<br> |
| 214 | +### Sweet but what do I do with this? |
| 215 | +
|
| 216 | +Totally up to you. Send yourself an email or txt msg using Twilio or Plivo every time something happens. For example, I wrote a little app using the Plivo API to send breaking news txts; you can subscribe by texting `news` to <a href="tel:+17185771913">+1-718-577-1913</a> if you want to give it a try (but no guarantees for how long I'll keep the service up).<br> |
| 217 | +
|
| 218 | +
|
| 219 | +### Cool, how do I run the examples? |
| 220 | +
|
| 221 | +Python |
| 222 | +
|
| 223 | +``` |
| 224 | +python sockpuppet.py --ws_addr="ws://<<ADDRESS HERE>>" |
| 225 | +``` |
| 226 | +
|
| 227 | +Go |
| 228 | +
|
| 229 | +``` |
| 230 | +go run sockpuppet.go --ws_addr="ws://<<ADDRESS HERE>>" |
| 231 | +``` |
| 232 | +
|
| 233 | +You can find a valid websocket host by opening the developer console of your favorite browser, visiting [nytimes.com](http://www.nytimes.com/), and looking for websocket connections in the network tab. |
| 234 | +
|
| 235 | +
|
| 236 | +
|
| 237 | +
|
| 238 | +
|