Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't successfully connect BLE devices on Linux #12

Open
MrShinyAndNew opened this issue Oct 17, 2024 · 49 comments
Open

Can't successfully connect BLE devices on Linux #12

MrShinyAndNew opened this issue Oct 17, 2024 · 49 comments

Comments

@MrShinyAndNew
Copy link

I'm trying to connect a Lego Powered Up remote using node-poweredup. I've overridden its dependency to use @stoprocent/noble, but the behaviour is very poor and inconsistent. I think there are still unresolved issues, but I'm not sure how to debug this any further.

The sample program on the noble screen, or the basic sample node-poweredup program both show the same symptoms - they fail to successfully connect to the BLE device 19 times out of 20. Most of the time the program does nothing; sometimes it gives an error "already connected".

I have tried using strace to inspect the system calls and one thing I noticed is that the connect() system call sometimes takes 40s and failes with "Connection refused" - of course my BLE device has given up attempting to connect long before 40s is up.

Example:
connect(24, {sa_family=AF_BLUETOOTH, l2_psm=htobs(0 /* L2CAP_PSM_??? */), l2_bdaddr=07:d0:99:f1:34:a4, l2_cid=htobs(L2CAP_CID_ATT), l2_bdaddr_type=BDADDR_LE_PUBLIC}, 14) = -1 ECONNREFUSED (Connection refused)

@stoprocent
Copy link
Owner

@MrShinyAndNew It's very little details to work with. I can try something but I only have duplo ble train that is anything close to this. One thing I can tell you is that I took over noble because I'm using it in production environments at work and it has been super stable for us. We are using it all over the place. What is the BLE hci you are using? Do you have any other BLE dongle you can use and see if its the same? Im making thousands request everyday in CI/CD and I do get connection issues but I would say it's like 1 per 10k

@stoprocent
Copy link
Owner

stoprocent commented Oct 18, 2024

So I did a little digging and I think it could be related to BLE you have either built in or dongle. I have 2 dongles one is BLE 5.3 and the other one is 4.1. With the 5.3 one I do have some inconsistency but 4.1 is like this:

IMG_0033.MOV

By this I mean its connecting every single time. No issues

@MrShinyAndNew
Copy link
Author

I'm using a Dell XPS notebook from 2017. I'm not sure how to determine what kind of bluetooth it has. lspci lists a Qualcomm Atheros QCA6174 network controller and lsusb lists a Qualcomm Atheros wireless device but doesn't indicate if it's the Bluetooth controller or not.

I didn't mean to imply that the noble library doesn't work at all :) but before the kernel changes, everything worked perfectly on this notebook. I don't mind diving in to debug or otherwise dig in to this problem but I don't know where to look - I'm not really an expert in bluetooth. I can try with a different notebook, but it will take some time to get back to you.

@MrShinyAndNew
Copy link
Author

MrShinyAndNew commented Oct 21, 2024

I've tested on my other notebook with the same results. lspci shows an Intel Dual Band Wireless-AC 3168NGW and lsusb shows Intel Corp. Wireless-AC 3168 Bluetooth. Both of these devices worked before and they're both fairly old so there's nothing new and exotic in them.

I realize this bug report doesn't have much actionable detail - if there's any test I can run I'd love to help get to the bottom of this.

@stoprocent
Copy link
Owner

If you can export DEBUG=* and run connection attempt and paste the log here

@MrShinyAndNew
Copy link
Author

Okay, here are 4 tests I ran, listed here in the order I ran them. The first two use the node-poweredup library with an override (I've verified it's pulling in the expected noble version). The second two just use plain noble. The address for the remote is "a4:34:f1:99:d0:07".

This one, the Lego remote appeared to connect (solid indicator light) right away, but the test program didn't emit the "discovered" event.
connected_but_not_discovered.txt

This one emitted the "discovered" event but then never connected
discovered_but_not_connected.txt

This one instantly failed with "already connected"
noble_already_connected.txt

This one didn't connect.
noble_not_connected.txt

@MrShinyAndNew
Copy link
Author

I tried on my other notebook computer, a Lenovo with an intel chipset. The behaviour here is different. I get the poweredOn event but nothing appears to arrive when I active the BLE device.
output.txt

@stoprocent
Copy link
Owner

Hi,

I’ve gone through your logs and have some ideas. Apologies for the delayed response—I’ve been busy with other tasks at work and haven’t been focusing on native HCI as my primary target. I’m currently using HCI over UART converters instead.

With Christmas coming up, I’ll finally have some time off and plan to look into the issues posted here. There’s also one more issue on Windows that I promised to address.

Stay tuned!

@stoprocent
Copy link
Owner

Quick update as I’m currently looking into this issue. Could you try the following?

const noble = require('@stoprocent/noble/with-custom-binding')({ extended: true });

This week, I tested a few different BLE dongles I bought on eBay, and I noticed that BLE 5.3 dongles don’t work too well unless the extended option is explicitly set to true. I’m planning to dive deeper into the HCI commands to determine the dongle profile automatically, so this adjustment might eventually be integrated into noble.

Please test this setup when you get a chance and let me know how it goes.

Thanks!

@stoprocent
Copy link
Owner

Great news! There's a command to read the supported features, which allows me to identify the capabilities of the dongle. If you could test it and let me know whether it helps, that would be fantastic. Your feedback would be invaluable!

@MrShinyAndNew
Copy link
Author

Hi, I tried using the latest release (1.15.2) with and without extended bindings and there's no improvement that I can see. I don't know what you mean with identifying the capabilities of the dongle.
Just a note that the main test machine I'm using here is a Dell from 2017, so it's definitely not Bluetooth 5.x. It used to work before the Linux kernel changes :(

noble_1.15.2_extended_params.txt
poweredup_with_1.15.2.txt

I've attached two logs showing different tests. The noble test uses extended bindings, the pu uses whatever node-poweredup uses. From what I can see, the noble test case doesn't even notice the PU Remote attempt to connect. The poweredup test case does notice it, but can't connect.

Both test cases are from the dell notebook.

@stoprocent
Copy link
Owner

I found another issue that I think could be related. Can you add one line to the code after power On event:

noble.on('stateChange', async (state) => {
    console.log('stateChange ', state);
  if (state === 'poweredOn') {
    console.log('Powered on');
    noble.reset(); // THIS LINE
    await noble.startScanningAsync(['180f'], false);
  }
});

@stoprocent
Copy link
Owner

stoprocent commented Dec 19, 2024

Looking at your log I'm not sure that you are using my fork and not @abandonware one. On you log when you CTRL+C there is a reset which is removed from my code. LEts try this ...

  1. Download zip attached here noble.zip
  2. cd into downloaded unzipped dir
  3. npm install inside
  4. export DEBUG=hci
  5. node examples/peripheral-explorer-async.js 11:22:33:44:55:66
  6. Send me a LOG here again

@MrShinyAndNew
Copy link
Author

Hi,
I've checked and the files downloaded are indeed the stoprocent noble files (the package.json installs stoprocent/noble under abandonware/noble but it's the stoprocent files, and it pulls in stoprocent/bluetooth-hci). But I tried npm installing the zip file and it fails with a git checkout error:

mark@:~/code/test3/noble$ npm install npm error code 1 npm error The git reference could not be found npm error command git --no-replace-objects checkout fix-docker npm error error: pathspec 'fix-docker' did not match any file(s) known to git npm error A complete log of this run can be found in: /home/mark/.npm/_logs/2024-12-19T20_06_42_684Z-debug-0.log mark@:~/code/test3/noble$

@stoprocent
Copy link
Owner

Yes sorry another you have to change one thing in package.json change:
"prebuildify-cross": "thegecko/prebuildify-cross#fix-docker",
into
"prebuildify-cross": "^5.1.1",
and try my sequence from above

@MrShinyAndNew
Copy link
Author

@stoprocent
Copy link
Owner

as a last thing can you change:
const noble = require('../with-custom-binding')({ extended: true });
to
const noble = require('../with-custom-binding')({ extended: false });
in this file:
examples/peripheral-explorer-async.js

@MrShinyAndNew
Copy link
Author

MrShinyAndNew commented Dec 19, 2024

noble_peripheral_scan_test_extended-false.txt
(Edited to add: I'll be unavailable shortly but will check in again later if there's something else to try)

@stoprocent
Copy link
Owner

Ok last one did work. It picked up all scans. Lets do final test and it will be good I think to release:
Try this: noble.zip
And call the peripheral explorer async with your lego device mac address. Ive seen in other logs that for example this mac address poped up: eb:34:f6:0c:7e:58

@stoprocent
Copy link
Owner

Last log you sent me shows scanning working correctly

@MrShinyAndNew
Copy link
Author

It scans correctly but still doesn't seem to connect.
noble_scan_detected.txt

@stoprocent
Copy link
Owner

OK one last chance :) Because I think I know the issue now :)
Try this script:

const noble = require('../with-custom-binding')({ extended: true });

const peripheralIdOrAddress = process.argv[2].toLowerCase();

noble.on('stateChange', async (state) => {
  if (state === 'poweredOn') {
    await noble.connectAsync("a4:34:f1:99:d0:07".replace(/:/g, ''))
  }
});

@MrShinyAndNew
Copy link
Author

This script (if I understood the assignment) didn't do much - it seems to try to connect instantly even before the remote is ready. But anyway I tried starting the remote first and it doesn't seem to help.
connectasync_test.txt

@stoprocent
Copy link
Owner

I’ve identified the cause of the issue. While I don’t have a fix yet, it’s related to the type of address these devices are using. There are two types of addresses: public and random. I tested one of my dongles by advertising the same public address type as your LEGO device, and I encountered the same connection issue.

I’ll need to dive deeper into the connect HCI command to see if there’s something I might have missed. Rest assured, I’m confident this will get fixed.

@stoprocent
Copy link
Owner

Can you try the latest release and run the following example:

node examples/peripheral-explorer-async.js 1 a4:34:f1:99:d0:07 

and

node examples/peripheral-explorer-async.js 0 a4:34:f1:99:d0:07 

Please note that this script was modified and the first argument 0|1 indicates a direct connection without scan.
I've tried Linux and Mac and Windows with all combinations with the devices with the same mac address as you and it worked. If it doesn't then I'm hopeless :)

@MrShinyAndNew
Copy link
Author

So, there is some improvement!
2024_12_27_with_noble_1.17_asyc_0.txt
2024_12_27_with_noble_1.17_asyc_1.txt

Attached are the tests you requested. The transcript shows that it's pulling in the latest stoprocent noble. The connection can be established, however there are two issues:

  1. The async 0 test has that connection timeout problem where the connection is attempted but it fails. You can see in the log that it takes 44s on the "create le conn" line - the connection is not created. However if I re-attempt the connection after that, it can succeed. I was able to connect one remote using node-poweredup by attempting the connection twice, however:
  2. I wasn't able to connect more than one device, and when I disconnected the remote, nothing would connect again afterwards. I'm afraid my poweredup app log is very verbose, so please let me know if you'd like me to capture it.

@stoprocent
Copy link
Owner

It seems like you might be misinterpreting the logs. In the second log you posted, the connection was successful and completed quickly. Similarly, in the first log, the connection was also successful, but it took significantly longer.

I’ve observed this behavior myself with certain BLE adapters—occasionally, the connection process slows down significantly. However, after unplugging and reconnecting the USB BLE adapter, the process returns to normal speed.

From the logs, it’s clear that this issue isn’t related to @stoprocent/noble. The HCI command is sent, but the system idles for over 40 seconds before receiving a response from the HCI. Initially, I thought this might be due to something introduced during the refactor of the C++ HCI driver, but I observed the same behavior with the abandonware HCI driver.

I’m not sure what causes this issue, but as mentioned, unplugging and reconnecting the BLE adapter seems to resolve it.

@stoprocent
Copy link
Owner

After further investigation, I found (as explained in this article: [Bluetooth Address Privacy in BLE](https://novelbits.io/bluetooth-address-privacy-ble/)) that when establishing a direct connection, you need to specify the address type as either public or random. This is because the address type is part of the advertising information. A peripheral device may advertise an address that appears public but is actually marked as random. In such cases, the HCI dongle on Linux won’t connect to the device unless the address and the address type match the filter settings.

To address this, I’ve updated the connect function to accept an additional parameter for addressType. I’ve also modified the example script, peripheral-explorer-async.js, to accept three parameters. The third parameter specifies the address type (public or random) and defaults to random.

That said, based on the logs you provided, the attempt to scan and then connect took about 4 seconds, and everything appears to have worked correctly. The connection was successful, and reading the GATT attributes completed without issues.

@stoprocent
Copy link
Owner

I've tried on couple of BLE USB dongles and now it seems to work perfectly ...
Screenshot 2024-12-29 at 14 08 16

@MrShinyAndNew
Copy link
Author

Yes, the timeout I'm referring to is in the asyc 0 log file, on this line:
hci onSocketData: 040f0400010d20 +44s
You can see that the socket data response is > 40s after the connection is attempted. The peripheral (a Lego PoweredUp remote) only tries to connect for 10s. In that file, the connection is not established until after the timout, when I re-start the peripheral.

@MrShinyAndNew
Copy link
Author

2024_12_28_with_noble_1.17.3_asyc_0.txt
Here is the file of that same test as before, with the latest code, note the hang-up while attempting the connection the first time. Using public or random as the argument didn't help.

Peripheral with ID a434f199d007 found
  Local Name        = Green Grille  
  Manufacturer Data = 970300420a276100
  Service Data      = []
  Service UUIDs     = 000016231212efde1623785feabcd123

Services and characteristics:
  hci create le conn - writing: 010d201960003000000007d099f134a4000600120000002a0004000600 +6ms
  hci onSocketData: 040f0400010d20 +42s
  hci   event type = 4 +0ms
  hci   sub event type = 15 +0ms
  hci           status = 0 +1ms
  hci           cmd = 8205 +0ms
--------------------------------------------------pressing button on remote again----------------------
  hci onSocketData: 043e1301000100000007d099f134a4120000002a0000 +22s```

@MrShinyAndNew
Copy link
Author

@MrShinyAndNew
Copy link
Author

Okay I did another test and I noticed something odd: If I am testing a single PoweredUp device, the behaviour is thus:

  1. First attempt to connect fails - the node program hangs for 40s and the peripheral times out after 10s
  2. Second attempt to connect after the program resumes succeeds
  3. Disconnecting and reconnecting works fine after this
  4. Connecting a second PoweredUp device fails, same behaviour as before.
  5. After the 40s timeout, the second device can connect properly.
  6. If the first device was disconnected at the time, it will also hang for 40s the first time you try to reconnect after the 2nd device.
  7. If I connect one device, fail for 40s, connect successfully, then connect a second device, fail for 40s, then connect successfully, it appears that the two devices will connect to the same program successfully.

Log attached
2024_12_28_putest_noble_1.17.3.txt

@Apollon77
Copy link

@stoprocent The 40s thing is also mainly happening in the cases I get ... where then this auto-reconnect comes in and such (at least sounds like it ... )

@stoprocent
Copy link
Owner

@MrShinyAndNew I'm a bit lost tbh. I found a PowerUp Train Remote in my son's Lego and tried it and it worked like a charm. There is one issue in noble that I will release in a moment but other than that it works and connects every single time:

Here is the Video ... https://marek-serafin.s3.amazonaws.com/test.mov

@stoprocent
Copy link
Owner

Sorry, I’ve discovered another issue, specifically with older dongles, which also affects the PoweredUp npm package. The problem is that the stopScan command isn’t sent when establishing a connection, and I believe this might be what you’re encountering. While recording a video on a different dongle, I noticed some odd behavior. After adding a stopScan during the connection process, the issue was resolved. Please test it and let me know.

@stoprocent
Copy link
Owner

Ok, I have no more ideas on how to test it further. I have forked PoweredUp npm and replaced all abandonware with the latest @stoprocent/noble and I have tested it on Linux HCI, Linux UART HCI, Mac Native, Mac UART HCI, Window Native, Windows UART HCI and it works exactly the same on all platforms. The main thing I found is that PoweredUp lib is restarting the scan every second and not stopping it even after the connect function is called and on older BLE dongles which is causing issues. I have now added stop scan logic to the connection functions and proper queueing, eliminating the problem.

@MrShinyAndNew
Copy link
Author

With the latest noble I'm still getting a 40s hangup when making the connection the first time.
2024_12_30_with_noble_1.18.2_async_0_random.txt

@MrShinyAndNew
Copy link
Author

I tested the newest noble on my lenovo laptop (it's much newer than the Dell I normally use) - it doesn't seem to connect at all. Logs attached.

2024-12-30-peripheral-scanner-asyc-0-random.txt
2024-12-30-peripheral-scanner-asyc-1-random.txt

@stoprocent
Copy link
Owner

There are couple of things here:.

I tested the newest noble on my lenovo laptop (it's much newer than the Dell I normally use) - it doesn't seem to connect at all. Logs attached.

2024-12-30-peripheral-scanner-asyc-0-random.txt

  • In this one there are scans that are picked up but none of the are your device if you look in the log there are these devices found but not the one you looking for. This isnt noble
1a:82:00:37:21:be
66:82:fb:ac:09:18
2a:bf:86:7e:2b:93
07:60:19:86:db:77
5d:42:07:78:77:54
6f:40:99:b2:a9:42
4c:5b:b3:d1:81:85
18:70:c6:4e:40:46
fe:d7:5c:fe:26:3f
64:aa:fa:5b:c8:79
90:84:2b:00:aa:cc
7c:e2:69:7c:1a:09
71:71:71:94:4a:b1
e9:80:40:60:97:23
6b:47:04:c3:b2:a9

2024-12-30-peripheral-scanner-asyc-1-random.txt

  • In this log device is found but you set the address to random and the address is public so this is what is expected. You set the wrong Address Type. In the previous one it wouldn't matter as the Address Type is picked by the scanner, but it just didn't find the device.

This one is a bit weird and I think you need to rerun it and post some more logs:

In your log it says:

  noble scanStop +16s <<<<<<<<<<<<<<<<<< That scan stop took 16 second ?
Peripheral with ID a434f199d007 found
  Local Name        = Green Grille  
  Manufacturer Data = 970300420a276100
  Service Data      = []
  Service UUIDs     = 000016231212efde1623785feabcd123

Services and characteristics:
  hci create le conn - writing: 010d201960003000000007d099f134a4000600120000002a0004000600 +10ms
  hci onSocketData: 040f0400010d20 +42s
  hci   event type = 4 +0ms
  hci   sub event type = 15 +0ms
  hci           status = 0 +1ms
  hci           cmd = 8205 +0ms

2024_12_30_with_noble_1.18.2_async_0_random.txt

@MrShinyAndNew
Copy link
Author

The scanStop time appears to be however long it's been since the start of the program.

I re-did some tests and found that with the arguments 1 and public it appears to connect to the remote on the first try; with 0 it does not.

How do I make node-poweredup use these arguments?

2024_12_31_dell_async-0-public.txt
2024_12_31_dell_async-1-public.txt

@stoprocent
Copy link
Owner

In both cases/logs you have connected successfully. You don't need these params in PoweredUp just replace @abandonware/noble to @stoprocent/noble in PoweredUp and then try it will work. Logs are looking good just give it a go.

The scanStop time appears to be however long it's been since the start of the program.

I re-did some tests and found that with the arguments 1 and public it appears to connect to the remote on the first try; with 0 it does not.

How do I make node-poweredup use these arguments?

2024_12_31_dell_async-0-public.txt 2024_12_31_dell_async-1-public.txt

@MrShinyAndNew
Copy link
Author

MrShinyAndNew commented Dec 31, 2024 via email

@stoprocent
Copy link
Owner

I think you are not using @stoprocent/noble in the internals maybe this swap you doing isn't working because when I did run the same command my hci logs are different ...

  hci onSocketData: 043e270201030176964b5555381b02011a17ff4c0009081356c0a801881b58160800f7887bc0c0c56ed6 +7ms
  hci 	event type = 4 +0ms
  hci 	sub event type = 62 +0ms
  hci 		LE meta event type = 2 +0ms
  hci 		LE meta event data length = 39 +0ms
  hci 		LE meta event num reports = 1 +0ms
  hci 		LE meta event data = 030176964b5555381b02011a17ff4c0009081356c0a801881b58160800f7887bc0c0c56ed6 +0ms
  hci 			type = 3 +0ms
  hci 			address = 38:55:55:4b:96:76 +0ms
  hci 			address type = random +0ms
  hci 			eir = 02011a17ff4c0009081356c0a801881b58160800f7887bc0c0c56e +0ms
  hci 			rssi = -42 +0ms
  hci onSocketData: 040e04010c2000 +2ms
  hci 	event type = 4 +0ms
  hci 	sub event type = 14 +0ms
  hci 		cmd = 8204 +0ms
  hci 		status = 0 +0ms
  hci 		result =  +0ms
Peripheral with ID d02eabcf8df6 found
  Local Name        = Handset
  Manufacturer Data = 970300420af86100
  Service Data      = []
  Service UUIDs     = 000016231212efde1623785feabcd123

Services and characteristics:
  hci create le conn - writing: 010d2019600030000000f68dcfab2ed0000600120000002a0004000600 +2ms
  hci onSocketData: 0111200700f68dcfab2ed0 +105ms
  hci 	event type = 1 +0ms
  hci 		cmd = 8209 +0ms
  hci 		data len = 7 +0ms
  hci onSocketData: 040e0401112000 +0ms
  hci 	event type = 4 +0ms
  hci 	sub event type = 14 +0ms
  hci 		cmd = 8209 +0ms
  hci 		status = 0 +0ms
  hci 		result =  +0ms
  hci onSocketData: 010b200700600060000001 +0ms
  hci 	event type = 1 +0ms
  hci 		cmd = 8203 +0ms
  hci 		data len = 7 +0ms
  hci onSocketData: 040e04010b2000 +0ms
  hci 	event type = 4 +0ms
  hci 	sub event type = 14 +0ms
  hci 		cmd = 8203 +0ms
  hci 		status = 0 +0ms

@MrShinyAndNew
Copy link
Author

It's definitely the stoprocent noble, you can see it in some of the older log files where I grepped the version number from the package-lock. The last log file was generated with DEBUG_COLORS=no in order to get timestamps on the log file, but is otherwise the same program.

@MrShinyAndNew
Copy link
Author

MrShinyAndNew commented Jan 3, 2025

For your device this looks like how I'd expect it to be: connecting to the BLE device should take a few ms

hci create le conn - writing: 010d2019600030000000f68dcfab2ed0000600120000002a0004000600 +2ms
 hci onSocketData: 0111200700f68dcfab2ed0 +105ms

on my system the first attempt to create le conn always fails to actually reach the poweredup handset and the handset gives up and shuts down while on the linux side the system call eventually times out.

@Apollon77
Copy link

maybe see also my other issue ... I also have such 40s issues still ... but in the meantime it works afterwards mostly ... so it is "just" delays

@stoprocent
Copy link
Owner

@MrShinyAndNew Can you try one thing please ...
After this line: https://github.com/stoprocent/noble/blob/ecb43339054cd3ebc2f5ff3da2600eb67c8e3fe9/examples/peripheral-explorer-async.js#L28C7-L29C5

await noble.stopScanningAsync();

can you add following code

await noble.stopScanningAsync();
await new Promise(r => setTimeout(4, 2000));

@MrShinyAndNew
Copy link
Author

Hi,
The setTimeout code you sent doesn't seem to work for me, it expects a function where you have "4". I tried with an empty function with these results:

2025_01_09_dell_async_0_public.txt
2025_01_09_dell_async_1_public.txt

It seems like whatever is different in mode 1 is doing something right. I'm not sure why mode 1 is creating a connection before I start the remote though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants