---
title: Reverse Engineering the Vive Facial Tracker
---

# Reverse Engineering the Vive Facial Tracker

My name is Hiatus, and I am one of the developers on the Project Babble Team.

For the past two years (at the time of writing), a team and I have been working on the eponymous Project Babble, a cross-platform and hardware-agnostic VR lower-face expression tracking solution for [VRChat](/docs/software/integrations/vrc), [ChilloutVR](/docs/software/integrations/chilloutVR), and [Resonite](/docs/software/integrations/resonite).

This blog post was inspired by a member of our community, [DragonLord](https://github.com/LordOfDragons). His work is responsible for this feature, and this post largely paraphrases his findings. Credit where credit is due: you can check out his findings, as well as his [repo](https://github.com/LordOfDragons/vivefacialtracker), on GitHub.

This is the story of how the Vive Facial Tracker, another VR face tracking accessory, was reverse engineered to work with the Babble App.

Buckle in.

# The Vive Facial Tracker

17 |
| -The Vive Facial Tracker is a VR accessory released in March 24th 2021. It is worn underneath a VR headset, and captures camera images an AI, SRanipal ("Super Reality Animation Pal") converts into expressions other programs can understand*. |
18 |
| - |
19 |
| -Sidenote here, it's really hard to describe the *impact* this had on the entirety of Social VR, at least in my experience. |
20 |
| - |
21 |
| -Unfortunately, the VFT has been discontinued. [You can't even see it on their own store anymore.](https://www.vive.com/us/accessory/). Even worse, it's being scalped on eBay in excess of $1,000. Remember, this accesory cost ~$150 when it came out!! |
22 |
| - |
23 |
| -# The rising action: A curious conversation |
24 |
| - |
25 |
| -I was in voice chat with some people from the Babble Discord, when someone said they knew someone else in the Project Babble discord had gotten their VFT working with a fork of the Babble App. On Linux. *What* |
| 19 | +Some context. The Vive Facial Tracker (VFT) was a VR accessory released on March 24th, 2021. Worn underneath a VR headset, it captures camera images of a user's lower face, and using an in-house AI ([SRanipal](https://docs.vrcft.io/docs/hardware/VIVE/sranipal)) converts expressions into other programs can understand. |
26 | 20 |
The VFT currently has integrations for VRChat (via [VRCFaceTracking](https://github.com/benaclejames/VRCFaceTracking)), Resonite, and ChilloutVR (natively). Here is a video of it in use:

:::note
Sidenote here: it's really hard to describe the *impact* VR face tracking has had on the entirety of Social VR, at least in my experience. It's a completely different degree of immersion; you have to see it in person.
:::

Unfortunately, the VFT has been discontinued. [You can't even see it on Vive's own store anymore](https://www.vive.com/us/accessory/). Even worse, it's being scalped on eBay for upwards of $1,000. Remember, this accessory cost ~$150 when it came out!

# Understanding the VFT's hardware

Before we can begin to extract camera images from the VFT, we need to understand its underlying hardware.

## Camera

The VFT consists of two [OV6211 image sensors](https://www.ovt.com/products/ov6211/) and an [OV00580-B21G-1C](https://www.ovt.com/products/ov580/) image signal processor from OmniVision. The cameras record at 400px\*400px at 60Hz; their images are then shrunk to 200px\*400px and placed side by side, for a combined final image of 400px\*400px.

:::note
In SRanipal, these separate images are used to compute [disparity](https://en.wikipedia.org/wiki/Binocular_disparity), i.e. how close an object is to the camera. This is useful for expressions in which parts of the face are closer to the camera, such as `Jaw_Forward`, `Tongue_LongStep1` and `Tongue_LongStep2`.
:::

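Since the two camera views arrive packed side by side in a single frame, recovering an individual view is just a matter of slicing each row in half. Here's a minimal sketch of the idea (the function name is illustrative, not taken from the Babble App's code):

```python
def split_stereo_frame(gray: bytes, width: int, height: int) -> tuple[bytes, bytes]:
    """Split a side-by-side grayscale frame into its left/right halves.

    The VFT packs both camera views into one width x height frame, so each
    row holds the left view's pixels followed by the right view's.
    """
    half = width // 2
    left, right = bytearray(), bytearray()
    for row in range(height):
        start = row * width
        left += gray[start:start + half]
        right += gray[start + half:start + width]
    return bytes(left), bytes(right)

# Toy 4x2 frame: each row is [L L R R].
frame = bytes([1, 2, 9, 9,
               3, 4, 8, 8])
left, right = split_stereo_frame(frame, 4, 2)
print(list(left), list(right))  # → [1, 2, 3, 4] [9, 9, 8, 8]
```

Since Babble doesn't compute disparity itself, either half on its own is enough for expression tracking.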
An IR light source illuminates the face of the user. The cameras do not record color information per se, but rather the luminance of the IR light. This has a direct influence on the image format.

Moreover, this is the output of Video4Linux's `v4l2-ctl --list-devices`:

```
...
HTC Multimedia Camera: HTC Mult (usb-0000:0f:00.0-4.1.2.1):
    /dev/video2
    /dev/video3
    /dev/media1
...
```

As well as `v4l2-ctl -d /dev/video2 --list-formats-ext`:

```
...
ioctl: VIDIOC_ENUM_FMT
    Type: Video Capture

    [0]: 'YUYV' (YUYV 4:2:2)
        Size: Discrete 400x400
            Interval: Discrete 0.017s (60.000 fps)
...
```

From the above, we can see the VFT provides YUV422 (YUYV) images.

Funnily enough, though, the VFT *does not actually output a proper YUV image*. Instead, the VFT stores the grayscale image in all 3 image channels. This breaks trying to convert YUV into RGB; the resulting image can be seen below:

To work around this, we can extract the "Y" channel and use it as a grayscale image. This is pretty fast, as we only need to decode the "4" part of the YUV422 image and ignore the "22" part.

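Concretely, in a packed YUYV buffer every other byte is a luma sample, so the extraction boils down to a single stride-2 slice. A small illustrative sketch (not the Babble App's actual code):

```python
def extract_y_channel(yuyv: bytes, width: int, height: int) -> bytes:
    """Keep only the luma (Y) bytes of a packed YUYV 4:2:2 frame.

    YUYV lays pixels out as Y0 U Y1 V, so every even-indexed byte is a
    luminance sample; the chroma bytes (U, V) are simply skipped.
    """
    if len(yuyv) != width * height * 2:
        raise ValueError("unexpected frame size for YUYV data")
    return yuyv[0::2]

# Toy 2x2 frame: luma [10, 20, 30, 40] interleaved with dummy chroma (99).
frame = bytes([10, 99, 20, 99, 30, 99, 40, 99])
print(list(extract_y_channel(frame, 2, 2)))  # → [10, 20, 30, 40]
```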
## Lights

The VFT does not provide any of the usual camera controls for brightness or exposure. Instead, exposure and gain parameters can be set directly using device registers.

Once again, we can probe the VFT, this time using `lsusb`:

```
Bus 004 Device 076: ID 0bb4:0321 HTC (High Tech Computer Corp.) HTC Multimedia Camera
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               3.00
  bDeviceClass          239 Miscellaneous Device
  bDeviceSubClass         2 [unknown]
  bDeviceProtocol         1 Interface Association
  bMaxPacketSize0         9
  idVendor           0x0bb4 HTC (High Tech Computer Corp.)
  idProduct          0x0321 HTC Multimedia Camera
  bcdDevice            1.00
  iManufacturer           1 HTC Multimedia Camera
  iProduct                2 HTC Multimedia Camera
  iSerial                 0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x00de
    bNumInterfaces          2
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              512mA
    Interface Association:
      bLength                 8
      bDescriptorType        11
      bFirstInterface         0
      bInterfaceCount         2
      bFunctionClass         14 Video
      bFunctionSubClass       3 Video Interface Collection
      bFunctionProtocol       0
      iFunction               2 HTC Multimedia Camera
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass        14 Video
      bInterfaceSubClass      1 Video Control
      bInterfaceProtocol      0
      iInterface              2 HTC Multimedia Camera
      VideoControl Interface Descriptor:
        bLength                13
        bDescriptorType        36
        bDescriptorSubtype      1 (HEADER)
        bcdUVC               1.10
        wTotalLength       0x004f
        dwClockFrequency  150.000000MHz
        bInCollection           1
        baInterfaceNr( 0)       1
      VideoControl Interface Descriptor:
        bLength                18
        bDescriptorType        36
        bDescriptorSubtype      2 (INPUT_TERMINAL)
        bTerminalID             1
        wTerminalType      0x0201 Camera Sensor
        bAssocTerminal          0
        iTerminal               0
        wObjectiveFocalLengthMin      0
        wObjectiveFocalLengthMax      0
        wOcularFocalLength            0
        bControlSize                  3
        bmControls           0x00000000
      VideoControl Interface Descriptor:
        bLength                 9
        bDescriptorType        36
        bDescriptorSubtype      3 (OUTPUT_TERMINAL)
        bTerminalID             2
        wTerminalType      0x0101 USB Streaming
        bAssocTerminal          0
        bSourceID               4
        iTerminal               0
      VideoControl Interface Descriptor:
        bLength                13
        bDescriptorType        36
        bDescriptorSubtype      5 (PROCESSING_UNIT)
        bUnitID                 3
        bSourceID               1
        wMaxMultiplier          0
        bControlSize            3
        bmControls     0x00000000
        iProcessing             2 HTC Multimedia Camera
        bmVideoStandards     0x00
      VideoControl Interface Descriptor:
        bLength                26
        bDescriptorType        36
        bDescriptorSubtype      6 (EXTENSION_UNIT)
        bUnitID                 4
        guidExtensionCode         {2ccb0bda-6331-4fdb-850e-79054dbd5671}
        bNumControls            2
        bNrInPins               1
        baSourceID( 0)          3
        bControlSize            1
        bmControls( 0)       0x03
        iExtension              2 HTC Multimedia Camera
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x86  EP 6 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               9
        bMaxBurst               0
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        1
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass        14 Video
      bInterfaceSubClass      2 Video Streaming
      bInterfaceProtocol      0
      iInterface              0
      VideoStreaming Interface Descriptor:
        bLength                            14
        bDescriptorType                    36
        bDescriptorSubtype                  1 (INPUT_HEADER)
        bNumFormats                         1
        wTotalLength                   0x004d
        bEndpointAddress                 0x81  EP 1 IN
        bmInfo                              0
        bTerminalLink                       2
        bStillCaptureMethod                 0
        bTriggerSupport                     0
        bTriggerUsage                       0
        bControlSize                        1
        bmaControls( 0)                     0
      VideoStreaming Interface Descriptor:
        bLength                            27
        bDescriptorType                    36
        bDescriptorSubtype                  4 (FORMAT_UNCOMPRESSED)
        bFormatIndex                        1
        bNumFrameDescriptors                1
        guidFormat                            {32595559-0000-0010-8000-00aa00389b71}
        bBitsPerPixel                      16
        bDefaultFrameIndex                  1
        bAspectRatioX                       0
        bAspectRatioY                       0
        bmInterlaceFlags                 0x00
          Interlaced stream or variable: No
          Fields per frame: 2 fields
          Field 1 first: No
          Field pattern: Field 1 only
        bCopyProtect                        0
      VideoStreaming Interface Descriptor:
        bLength                            30
        bDescriptorType                    36
        bDescriptorSubtype                  5 (FRAME_UNCOMPRESSED)
        bFrameIndex                         1
        bmCapabilities                   0x00
          Still image unsupported
        wWidth                            400
        wHeight                           400
        dwMinBitRate                153600000
        dwMaxBitRate                153600000
        dwMaxVideoFrameBufferSize      320000
        dwDefaultFrameInterval         166666
        bFrameIntervalType                  1
        dwFrameInterval( 0)            166666
      VideoStreaming Interface Descriptor:
        bLength                             6
        bDescriptorType                    36
        bDescriptorSubtype                 13 (COLORFORMAT)
        bColorPrimaries                     1 (BT.709,sRGB)
        bTransferCharacteristics            1 (BT.709)
        bMatrixCoefficients                 4 (SMPTE 170M (BT.601))
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst              15
Binary Object Store Descriptor:
  bLength                 5
  bDescriptorType        15
  wTotalLength       0x0016
  bNumDeviceCaps          2
  USB 2.0 Extension Device Capability:
    bLength                 7
    bDescriptorType        16
    bDevCapabilityType      2
    bmAttributes   0x00000006
      BESL Link Power Management (LPM) Supported
  SuperSpeed USB Device Capability:
    bLength                10
    bDescriptorType        16
    bDevCapabilityType      3
    bmAttributes         0x00
    wSpeedsSupported   0x000c
      Device can operate at High Speed (480Mbps)
      Device can operate at SuperSpeed (5Gbps)
    bFunctionalitySupport   2
      Lowest fully-functional device speed is High Speed (480Mbps)
    bU1DevExitLat          10 micro seconds
    bU2DevExitLat          32 micro seconds
Device Status:     0x0000
  (Bus Powered)
```

Interestingly, the "VideoControl Interface Descriptor" `guidExtensionCode` value `{2ccb0bda-6331-4fdb-850e-79054dbd5671}` matches the log output of a "ZED2i" camera found online. This means the (open-source!) code of Stereolabs's ZED cameras and the VIVE Facial Tracker share a lot in common, at least USB-wise:

- https://github.com/stereolabs/zed-open-capture/blob/5cf66ff777175776451b9b59ecc6231d730fa202/src/videocapture.cpp

### USB Protocol / Extension Unit

The VIVE Facial Tracker is a video-type USB device and behaves like one, with one exception: the data stream is not activated through the regular means, but has to be activated using the "Extension Unit". Basically, the VIVE Facial Tracker is controlled by sending commands to this "Extension Unit".

In general, you have to use `SET_CUR` commands to set camera parameters and to enable the camera stream. The device uses a fixed-size scratch buffer of 384 bytes for all sending and receiving. Only the relevant command bytes are actually consumed; the rest is disregarded.

Camera parameters are set using the `0xab` request id. Analyzing the protocol shows 11 registers touched by the original SRanipal software. The ZED2i lists in particular 6 parameters to control exposure and gain:
- ADDR_EXP_H
- ADDR_EXP_M
- ADDR_EXP_L
- ADDR_GAIN_H
- ADDR_GAIN_M
- ADDR_GAIN_L

Through some testing, they most probably map to the VIVE Facial Tracker like this:

| Register | Value | Description |
|----------|---------|-------------|
| `0x00` | `0x40` | |
| `0x08` | `0x01` | |
| `0x70` | `0x00` | |
| `0x02` | `0xff` | exposure high |
| `0x03` | `0xff` | exposure med |
| `0x04` | `0xff` | exposure low |
| `0x0e` | `0xff` | |
| `0x05` | `0xb2` | gain high |
| `0x06` | `0xb2` | gain med |
| `0x07` | `0xb2` | gain low |
| `0x0f` | `0x03` | |

The values on the left side are the registers, and the values on the right side are what SRanipal sets them to. Testing different values produced worse results, so the values used by SRanipal seem to be the best choice. What the other parameters do is unknown.

The `0x14` request is the one enabling and disabling the data stream. Hence, the camera parameters have to be set first, then the stream enabled.

Once the data stream is enabled, the camera streams data in the YUV422 format using regular USB video device streaming.

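To make the scratch-buffer scheme concrete, here is a rough sketch of how the register writes above could be assembled before being sent to the Extension Unit. The payload layout (request id, then register, then value) is an assumption for illustration only; see DragonLord's repo for the actual packing:

```python
def build_register_command(register: int, value: int) -> bytes:
    """Assemble one 384-byte Extension Unit scratch buffer.

    NOTE: the exact layout here is assumed for illustration. Only the
    leading bytes are consumed by the device; the rest is padding.
    """
    buf = bytearray(384)  # fixed-size scratch buffer used for all transfers
    buf[0] = 0xAB         # "write camera register" request id
    buf[1] = register
    buf[2] = value
    return bytes(buf)

# The register/value pairs SRanipal writes, per the table above.
SRANIPAL_REGISTERS = {
    0x00: 0x40, 0x08: 0x01, 0x70: 0x00,
    0x02: 0xFF, 0x03: 0xFF, 0x04: 0xFF,  # exposure high/med/low
    0x0E: 0xFF,
    0x05: 0xB2, 0x06: 0xB2, 0x07: 0xB2,  # gain high/med/low
    0x0F: 0x03,
}

commands = [build_register_command(reg, val) for reg, val in SRANIPAL_REGISTERS.items()]
```

Each buffer would then be delivered as a `SET_CUR` request, followed by the stream-enable request once all parameters are set.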
### Windows

One small caveat: Windows does not offer the simple access to USB devices that Linux does. Thankfully, instead of using `v4l2`, we can use [`pygrabber`](https://github.com/andreaschiavinato/python_grabber) when needs be.

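As a rough sketch of how that device lookup might go on Windows (the matching logic here is an assumption; the device name comes from the `lsusb` output above):

```python
# On Windows, pygrabber can enumerate DirectShow capture devices:
#   from pygrabber.dshow_graph import FilterGraph
#   names = FilterGraph().get_input_devices()

def find_vft_index(names):
    """Return the index of the first device that looks like a VFT, else None.

    The VFT enumerates as "HTC Multimedia Camera" (see the lsusb dump above).
    """
    for i, name in enumerate(names):
        if "HTC Multimedia Camera" in name:
            return i
    return None

# Example with a fake device list:
print(find_vft_index(["Integrated Webcam", "HTC Multimedia Camera"]))  # → 1
```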
## Action

Now that we have a camera image from the VFT, we just need to pass it to the Babble App. That's all it takes! Here's a video of the VFT in use with the Babble App. If you'd like to mess around with it yourself, feel free to check out the branch for it here.

# Conclusions, Reflections

I want to give a shoutout to DragonLord for providing the code for the VFT, as well as making it available for the Babble App. I would also like to thank my teammates Summer and Rames, as well as Aero for QA'ing this post.

If you're interested in a Babble Tracker, we're looking to restock sometime later this March, maybe April if things go slowly. We'll make an announcement when we re-open sales; you can follow us on Twitter or join our Discord to stay up to date on all things Babble!

Until next time,

\- Hiatus

The Project Babble Team