Commit 261f845

(tweak) More copy
1 parent cea3edc commit 261f845

1 file changed

+316
-44
lines changed

blog/reverse-engineering-the-vive-facial-tracker.mdx

@@ -2,76 +2,348 @@
22
title: Reverse Engineering the Vive Facial Tracker
33
---
44

5-
# The cool intro
5+
# Reverse Engineering the Vive Facial Tracker
66

77
My name is Hiatus, and I am one of the developers on the Project Babble Team.
88

9-
For the past 2 years (at the time of writing this), me and a team have been working on the eponymous Project Babble, a VR lower-face expression tracking solution for VRChat, Resonite and ChilloutVR namely.
9+
For the past 2 years (at the time of writing), my team and I have been working on the eponymous Project Babble, a cross-platform, hardware-agnostic VR lower-face expression tracking solution for [VRChat](/docs/software/integrations/vrc), [ChilloutVR](/docs/software/integrations/chilloutVR), and [Resonite](/docs/software/integrations/resonite).
1010

11-
This is the story of how the Vive Facial Tracker, another (abandoned) VR face tracking accessory, was reverse engineered to work with the Babble App.
11+
This blog post was inspired by a member of our community, [DragonLord](https://github.com/LordOfDragons). His work is responsible for this feature, and this blog post largely paraphrases his findings. Credit where credit is due: check out his GitHub, where his findings are written up in his [repo](https://github.com/LordOfDragons/vivefacialtracker).
12+
13+
This is the story of how the Vive Facial Tracker, another VR face tracking accessory, was reverse engineered to work with the Babble App.
1214

1315
Buckle in.
1416

1517
# The Vive Facial Tracker
1618

17-
The Vive Facial Tracker is a VR accessory released in March 24th 2021. It is worn underneath a VR headset, and captures camera images an AI, SRanipal ("Super Reality Animation Pal") converts into expressions other programs can understand*.
18-
19-
Sidenote here, it's really hard to describe the *impact* this had on the entirety of Social VR, at least in my experience.
20-
21-
Unfortunately, the VFT has been discontinued. [You can't even see it on their own store anymore.](https://www.vive.com/us/accessory/). Even worse, it's being scalped on eBay in excess of $1,000. Remember, this accesory cost ~$150 when it came out!!
22-
23-
# The rising action: A curious conversation
24-
25-
I was in voice chat with some people from the Babble Discord, when someone said they knew someone else in the Project Babble discord had gotten their VFT working with a fork of the Babble App. On Linux. *What*
19+
Some context: the Vive Facial Tracker (VFT) was a VR accessory released on March 24th, 2021. Worn underneath a VR headset, it captures camera images of the user's lower face, and an in-house AI ([SRanipal](https://docs.vrcft.io/docs/hardware/VIVE/sranipal)) converts them into expressions other programs can understand.
2620

27-
Curious, I asked for more details so they linked me a conversation that had happened earlier that week. I reached out to the person who made the post to understand what they did, and what followed was a pleasent conversation.
28-
29-
# The Babble App
30-
31-
Before we go on, we need to briefly cover how the Babble App works. In short, it runs an ONNX model that accept a (256x256) grayscale image, fed in from one of two video sources:
32-
33-
1) Via OpenCV. Think webcameras, ip-cameras, etc. This handles 80% of all things cameras.
34-
2) Via Serial. Presently, our Babble Boards send image data via a wired USB connection a computer for processing.
35-
*If the Babble Board is running in wireless mode, it just spins up an IP Camera. Plain and simple stuff.
21+
The VFT currently has integrations for VRChat (via [VRCFaceTracking](https://github.com/benaclejames/VRCFaceTracking)), Resonite and ChilloutVR (natively). Here is a video of it in use:
3622

3723
:::note
38-
I do want to create an article about how are training/trained our ONNX models too!
24+
Sidenote here: it's really hard to describe the *impact* VR face tracking has had on the entirety of Social VR, at least in my experience. It's a completely different degree of immersion; you have to see it in person.
3925
:::
4026

41-
Got all that?
42-
27+
Unfortunately, the VFT has been discontinued. [You can't even see it on Vive's own store anymore](https://www.vive.com/us/accessory/). Even worse, it's being scalped on eBay in excess of $1,000. Remember, this accessory cost ~$150 when it came out!!
4328

4429
# Understanding the VFT's hardware
4530

46-
The VFT consists of 2 OVXXX infrared cameras and and IR Led. See here https://archive.is/NFlaO. This produces a combined 800x400 image, at 400x400 pixels per camera, encoded in YUV2.
31+
Before we can begin to extract camera images from the VFT, we need to understand its underlying hardware.
4732

48-
:::note
49-
In SRanipal, these would be used to compute stereo disparity IE provide how close objects are to the camera. This is useful for expressions in which parts of the face are closer to the camera, such as `JawForward` and `TongueOut`.
50-
:::
33+
## Camera
5134

52-
Babble doesn't care about 3D, so we have 2 options:
53-
1) Just pick either the left or right frame.
54-
2) Do something fancier that requires more work.
35+
The VFT consists of two [OV6211 image sensors](https://www.ovt.com/products/ov6211/) and an [OV00580-B21G-1C](https://www.ovt.com/products/ov580/) image signal processor from OmniVision. Each camera records at 400px\*400px at 60Hz. The two images are then shrunk to 200px\*400px each and placed side by side, for a combined final image of 400px\*400px.
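
The side-by-side layout described above can be undone with a simple array slice. The snippet below is a minimal sketch, assuming the frame has already been decoded into a NumPy array of shape 400\*400; the function name `split_stereo_frame` is made up for illustration, and which half belongs to which physical camera is not something this post establishes.

```python
import numpy as np

FRAME_WIDTH = 400   # combined frame width, per the description above
FRAME_HEIGHT = 400  # combined frame height


def split_stereo_frame(frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a combined side-by-side frame into its two 200-wide halves.

    Returns (first_half, second_half). Which half maps to which physical
    camera is an assumption left to the caller.
    """
    height, width = frame.shape[:2]
    if (height, width) != (FRAME_HEIGHT, FRAME_WIDTH):
        raise ValueError(f"expected a 400x400 frame, got {width}x{height}")
    half = width // 2
    # Slicing creates views, so no pixel data is copied here.
    return frame[:, :half], frame[:, half:]
```

Since Babble does not need depth information, either half (or the whole combined frame) could be passed on for inference.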
5536

56-
Guess which one we did?
37+
:::note
38+
In SRanipal, these separate images are used to compute [disparity](https://en.wikipedia.org/wiki/Binocular_disparity), i.e. an estimate of how close an object is to the camera. This is useful for expressions in which parts of the face move closer to the camera, such as `Jaw_Forward`, `Tongue_LongStep1` and `Tongue_LongStep2`.
39+
:::
5740

58-
With that in mind, we needed to
59-
1) Open the camera(s) (done!)
60-
2) Turn on the LEDs and other components
61-
3) Process/Convert the VFT's camera into something Babble can understand
41+
An IR light source is used to illuminate the face of the user. The cameras do not record color information per se but rather the luminance of the IR light. This has a direct influence on the image format.
42+
43+
44+
Moreover, this is the output of Video4Linux's `v4l2-ctl --list-devices`:
45+
46+
```
47+
...
48+
HTC Multimedia Camera: HTC Mult (usb-0000:0f:00.0-4.1.2.1):
49+
/dev/video2
50+
/dev/video3
51+
/dev/media1
52+
...
53+
```
54+
55+
As well as `v4l2-ctl -d /dev/video2 --list-formats-ext`:
56+
```
57+
...
58+
ioctl: VIDIOC_ENUM_FMT
59+
Type: Video Capture
60+
61+
[0]: 'YUYV' (YUYV 4:2:2)
62+
Size: Discrete 400x400
63+
Interval: Discrete 0.017s (60.000 fps)
64+
...
65+
```
66+
67+
From above, we can see the VFT provides YUV422 images.
68+
69+
Funnily enough though, the VFT *does not actually output a proper YUV image*. Instead, the VFT stores the grayscale image in all 3 image channels. This breaks any attempt to convert YUV into RGB; the resulting image can be seen below:
70+
71+
To work around this, we can extract the "Y" channel and use it as a grayscale image. This is pretty fast, as we only need to decode the "4" part of the YUV422 image and can ignore the "22" part.
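
As a rough illustration of that step, here is a minimal sketch using NumPy. It assumes the raw buffer uses the standard packed YUYV byte order (Y0 U Y1 V), so every even byte is a luminance sample. The function name `yuyv_to_gray` is hypothetical and not taken from the Babble App.

```python
import numpy as np

WIDTH, HEIGHT = 400, 400
BYTES_PER_PIXEL = 2  # YUYV 4:2:2 packs 16 bits per pixel


def yuyv_to_gray(raw: bytes) -> np.ndarray:
    """Return the Y (luminance) plane of one packed YUYV frame as uint8."""
    expected = WIDTH * HEIGHT * BYTES_PER_PIXEL  # 320000 bytes per frame
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    # Each pixel occupies two bytes: the first is Y, the second is U or V.
    packed = np.frombuffer(raw, dtype=np.uint8).reshape(HEIGHT, WIDTH, 2)
    # Keep only the Y byte of every pixel; copy so the result owns its data.
    return packed[:, :, 0].copy()
```

The expected buffer size of 320000 bytes lines up with the `dwMaxVideoFrameBufferSize` value reported by `lsusb` further down.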
72+
73+
## Lights
74+
75+
The VFT does not expose any standard controls for brightness or exposure. Instead, exposure and gain parameters can be set directly through device registers.
76+
77+
Once again, we can probe the VFT using `lsusb`:
78+
79+
```
80+
Bus 004 Device 076: ID 0bb4:0321 HTC (High Tech Computer Corp.) HTC Multimedia Camera
81+
Device Descriptor:
82+
bLength 18
83+
bDescriptorType 1
84+
bcdUSB 3.00
85+
bDeviceClass 239 Miscellaneous Device
86+
bDeviceSubClass 2 [unknown]
87+
bDeviceProtocol 1 Interface Association
88+
bMaxPacketSize0 9
89+
idVendor 0x0bb4 HTC (High Tech Computer Corp.)
90+
idProduct 0x0321 HTC Multimedia Camera
91+
bcdDevice 1.00
92+
iManufacturer 1 HTC Multimedia Camera
93+
iProduct 2 HTC Multimedia Camera
94+
iSerial 0
95+
bNumConfigurations 1
96+
Configuration Descriptor:
97+
bLength 9
98+
bDescriptorType 2
99+
wTotalLength 0x00de
100+
bNumInterfaces 2
101+
bConfigurationValue 1
102+
iConfiguration 0
103+
bmAttributes 0x80
104+
(Bus Powered)
105+
MaxPower 512mA
106+
Interface Association:
107+
bLength 8
108+
bDescriptorType 11
109+
bFirstInterface 0
110+
bInterfaceCount 2
111+
bFunctionClass 14 Video
112+
bFunctionSubClass 3 Video Interface Collection
113+
bFunctionProtocol 0
114+
iFunction 2 HTC Multimedia Camera
115+
Interface Descriptor:
116+
bLength 9
117+
bDescriptorType 4
118+
bInterfaceNumber 0
119+
bAlternateSetting 0
120+
bNumEndpoints 1
121+
bInterfaceClass 14 Video
122+
bInterfaceSubClass 1 Video Control
123+
bInterfaceProtocol 0
124+
iInterface 2 HTC Multimedia Camera
125+
VideoControl Interface Descriptor:
126+
bLength 13
127+
bDescriptorType 36
128+
bDescriptorSubtype 1 (HEADER)
129+
bcdUVC 1.10
130+
wTotalLength 0x004f
131+
dwClockFrequency 150.000000MHz
132+
bInCollection 1
133+
baInterfaceNr( 0) 1
134+
VideoControl Interface Descriptor:
135+
bLength 18
136+
bDescriptorType 36
137+
bDescriptorSubtype 2 (INPUT_TERMINAL)
138+
bTerminalID 1
139+
wTerminalType 0x0201 Camera Sensor
140+
bAssocTerminal 0
141+
iTerminal 0
142+
wObjectiveFocalLengthMin 0
143+
wObjectiveFocalLengthMax 0
144+
wOcularFocalLength 0
145+
bControlSize 3
146+
bmControls 0x00000000
147+
VideoControl Interface Descriptor:
148+
bLength 9
149+
bDescriptorType 36
150+
bDescriptorSubtype 3 (OUTPUT_TERMINAL)
151+
bTerminalID 2
152+
wTerminalType 0x0101 USB Streaming
153+
bAssocTerminal 0
154+
bSourceID 4
155+
iTerminal 0
156+
VideoControl Interface Descriptor:
157+
bLength 13
158+
bDescriptorType 36
159+
bDescriptorSubtype 5 (PROCESSING_UNIT)
160+
bUnitID 3
161+
bSourceID 1
162+
wMaxMultiplier 0
163+
bControlSize 3
164+
bmControls 0x00000000
165+
iProcessing 2 HTC Multimedia Camera
166+
bmVideoStandards 0x00
167+
VideoControl Interface Descriptor:
168+
bLength 26
169+
bDescriptorType 36
170+
bDescriptorSubtype 6 (EXTENSION_UNIT)
171+
bUnitID 4
172+
guidExtensionCode {2ccb0bda-6331-4fdb-850e-79054dbd5671}
173+
bNumControls 2
174+
bNrInPins 1
175+
baSourceID( 0) 3
176+
bControlSize 1
177+
bmControls( 0) 0x03
178+
iExtension 2 HTC Multimedia Camera
179+
Endpoint Descriptor:
180+
bLength 7
181+
bDescriptorType 5
182+
bEndpointAddress 0x86 EP 6 IN
183+
bmAttributes 3
184+
Transfer Type Interrupt
185+
Synch Type None
186+
Usage Type Data
187+
wMaxPacketSize 0x0040 1x 64 bytes
188+
bInterval 9
189+
bMaxBurst 0
190+
Interface Descriptor:
191+
bLength 9
192+
bDescriptorType 4
193+
bInterfaceNumber 1
194+
bAlternateSetting 0
195+
bNumEndpoints 1
196+
bInterfaceClass 14 Video
197+
bInterfaceSubClass 2 Video Streaming
198+
bInterfaceProtocol 0
199+
iInterface 0
200+
VideoStreaming Interface Descriptor:
201+
bLength 14
202+
bDescriptorType 36
203+
bDescriptorSubtype 1 (INPUT_HEADER)
204+
bNumFormats 1
205+
wTotalLength 0x004d
206+
bEndpointAddress 0x81 EP 1 IN
207+
bmInfo 0
208+
bTerminalLink 2
209+
bStillCaptureMethod 0
210+
bTriggerSupport 0
211+
bTriggerUsage 0
212+
bControlSize 1
213+
bmaControls( 0) 0
214+
VideoStreaming Interface Descriptor:
215+
bLength 27
216+
bDescriptorType 36
217+
bDescriptorSubtype 4 (FORMAT_UNCOMPRESSED)
218+
bFormatIndex 1
219+
bNumFrameDescriptors 1
220+
guidFormat {32595559-0000-0010-8000-00aa00389b71}
221+
bBitsPerPixel 16
222+
bDefaultFrameIndex 1
223+
bAspectRatioX 0
224+
bAspectRatioY 0
225+
bmInterlaceFlags 0x00
226+
Interlaced stream or variable: No
227+
Fields per frame: 2 fields
228+
Field 1 first: No
229+
Field pattern: Field 1 only
230+
bCopyProtect 0
231+
VideoStreaming Interface Descriptor:
232+
bLength 30
233+
bDescriptorType 36
234+
bDescriptorSubtype 5 (FRAME_UNCOMPRESSED)
235+
bFrameIndex 1
236+
bmCapabilities 0x00
237+
Still image unsupported
238+
wWidth 400
239+
wHeight 400
240+
dwMinBitRate 153600000
241+
dwMaxBitRate 153600000
242+
dwMaxVideoFrameBufferSize 320000
243+
dwDefaultFrameInterval 166666
244+
bFrameIntervalType 1
245+
dwFrameInterval( 0) 166666
246+
VideoStreaming Interface Descriptor:
247+
bLength 6
248+
bDescriptorType 36
249+
bDescriptorSubtype 13 (COLORFORMAT)
250+
bColorPrimaries 1 (BT.709,sRGB)
251+
bTransferCharacteristics 1 (BT.709)
252+
bMatrixCoefficients 4 (SMPTE 170M (BT.601))
253+
Endpoint Descriptor:
254+
bLength 7
255+
bDescriptorType 5
256+
bEndpointAddress 0x81 EP 1 IN
257+
bmAttributes 2
258+
Transfer Type Bulk
259+
Synch Type None
260+
Usage Type Data
261+
wMaxPacketSize 0x0400 1x 1024 bytes
262+
bInterval 0
263+
bMaxBurst 15
264+
Binary Object Store Descriptor:
265+
bLength 5
266+
bDescriptorType 15
267+
wTotalLength 0x0016
268+
bNumDeviceCaps 2
269+
USB 2.0 Extension Device Capability:
270+
bLength 7
271+
bDescriptorType 16
272+
bDevCapabilityType 2
273+
bmAttributes 0x00000006
274+
BESL Link Power Management (LPM) Supported
275+
SuperSpeed USB Device Capability:
276+
bLength 10
277+
bDescriptorType 16
278+
bDevCapabilityType 3
279+
bmAttributes 0x00
280+
wSpeedsSupported 0x000c
281+
Device can operate at High Speed (480Mbps)
282+
Device can operate at SuperSpeed (5Gbps)
283+
bFunctionalitySupport 2
284+
Lowest fully-functional device speed is High Speed (480Mbps)
285+
bU1DevExitLat 10 micro seconds
286+
bU2DevExitLat 32 micro seconds
287+
Device Status: 0x0000
288+
(Bus Powered)
289+
```
290+
291+
Interestingly, the `guidExtensionCode` value `{2ccb0bda-6331-4fdb-850e-79054dbd5671}` in the "VideoControl Interface Descriptor" matches the log output of a "ZED2i" camera found online. This suggests the (open-source!) code for Stereolabs's ZED cameras and the VIVE Facial Tracker have a lot in common, at least USB-wise:
292+
293+
- https://github.com/stereolabs/zed-open-capture/blob/5cf66ff777175776451b9b59ecc6231d730fa202/src/videocapture.cpp
294+
295+
### USB Protocol / Extension Unit
296+
297+
The VIVE Facial Tracker is a video-class USB device and behaves like one, with one exception: the data stream is not activated by the regular means but has to be activated through the "Extension Unit". In short, the VIVE Facial Tracker is controlled by sending commands to this "Extension Unit".
298+
299+
In general you have to use SET_CUR commands to set camera parameters and to enable the camera stream. The device uses a fixed size scratch buffer of 384 bytes for all sending and receiving. Only the relevant command bytes are actually consumed while the rest is disregarded.
300+
301+
Camera parameters are set using the `0xab` request id. Analyzing the protocol shows 11 registers touched by the original SRanipal software. The ZED2i code lists 6 parameters in particular that control exposure and gain:
302+
- ADDR_EXP_H
303+
- ADDR_EXP_M
304+
- ADDR_EXP_L
305+
- ADDR_GAIN_H
306+
- ADDR_GAIN_M
307+
- ADDR_GAIN_L
308+
309+
Based on some testing, they most probably map to the VIVE Facial Tracker like this:
310+
311+
| Register | Value | Description |
312+
|----------|---------|-------------|
313+
| `x00` | `x40` | |
314+
| `x08` | `x01` | |
315+
| `x70` | `x00` | |
316+
| `x02` | `xff` | exposure high |
317+
| `x03` | `xff` | exposure med |
318+
| `x04` | `xff` | exposure low |
319+
| `x0e` | `xff` | |
320+
| `x05` | `xb2` | gain high |
321+
| `x06` | `xb2` | gain med |
322+
| `x07` | `xb2` | gain low |
323+
| `x0f` | `x03` | |
324+
325+
The Register column lists the registers and the Value column lists the values set by SRanipal. Testing different values produced worse results, so the values used by SRanipal seem to be the best choice. What the other parameters do is unknown.
326+
327+
The `0x14` request is the one that enables and disables the data stream. Hence, the camera parameters have to be set first, and then the stream has to be enabled.
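
The ordering described above can be sketched as follows. This is only an illustration under loud assumptions: the post does not document how bytes are laid out inside the 384-byte scratch buffer, so the layout used here (request id, then register, then value) is a placeholder, as is the zero padding. The `send` callable stands in for whatever actually issues the SET_CUR transfer to the Extension Unit. Only the request ids, the buffer size and the register values come from the text and table above.

```python
from typing import Callable

SCRATCH_SIZE = 384       # fixed scratch buffer size used by the device
REQ_SET_REGISTER = 0xAB  # request id used to set camera parameters
REQ_STREAM = 0x14        # request id that enables/disables the stream

# (register, value) pairs as set by SRanipal, in the order of the table above.
SRANIPAL_REGISTERS = [
    (0x00, 0x40), (0x08, 0x01), (0x70, 0x00),
    (0x02, 0xFF), (0x03, 0xFF), (0x04, 0xFF),  # exposure high / med / low
    (0x0E, 0xFF),
    (0x05, 0xB2), (0x06, 0xB2), (0x07, 0xB2),  # gain high / med / low
    (0x0F, 0x03),
]


def pad_command(payload: bytes) -> bytes:
    """Pad a command to the fixed scratch size (zero fill is an assumption)."""
    if len(payload) > SCRATCH_SIZE:
        raise ValueError("payload does not fit in the scratch buffer")
    return payload.ljust(SCRATCH_SIZE, b"\x00")


def start_stream(send: Callable[[bytes], None]) -> None:
    """Set all camera registers first, then enable the data stream."""
    for register, value in SRANIPAL_REGISTERS:
        # Placeholder byte layout: NOT the documented wire format.
        send(pad_command(bytes([REQ_SET_REGISTER, register, value])))
    # Placeholder payload for "enable"; the real encoding is not given here.
    send(pad_command(bytes([REQ_STREAM, 0x01])))
```

Passing `send` in as a parameter keeps the sequencing separate from the platform-specific USB access, which differs between Linux and Windows. For the real wire format, refer to DragonLord's repo.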
328+
329+
Once the data stream is enabled the camera streams data in the YUV422 format using regular USB video device streaming.
330+
331+
### Windows
332+
333+
One small caveat: Windows does not offer access to USB devices as simple as Linux does. Thankfully, instead of using `v4l2`, we can use [`pygrabber`](https://github.com/andreaschiavinato/python_grabber) when need be.
334+
335+
## Action
336+
337+
Now that we have a camera image from the VFT, we just need to pass it to the Babble App. That's all it takes! Here's a video of the VFT in use with the Babble App. If you'd like to mess around with it yourself, feel free to check out the branch for it here.
62338

63339
# Conclusions, Reflections
64340

65-
At the end of all of this, I couldn't help but wonder
341+
I want to give a shoutout to DragonLord for providing the code for the VFT as well as making it available for the Babble App. I would also like to thank my teammates Summer and Rames, as well as Aero, for QA'ing this.
66342

67-
Becuase it's fun!
68-
69-
Also, if you're interested in a Babble Tracker we're looking to restock sometime later this March, maybe April if things go slowly.
343+
If you're interested in a Babble Tracker, we're looking to restock sometime later this March, maybe April if things go slowly. We'll make an announcement when we re-open sales. You can follow us on Twitter or join our Discord to stay up to date on all things Babble!
70344

71345
Until next time,
72346

73-
- Hiatus
74-
The Project Babble Team
75-
347+
\- Hiatus
76348

77-
### Credits
349+
The Project Babble Team
