Track unique shortids seen #1459

Open
zealsham wants to merge 1 commit into payjoin:master from zealsham:unique-shortid-metrics

Conversation

@zealsham
Collaborator

@zealsham zealsham commented Mar 31, 2026

This PR addresses #1284. It counts cardinality for shortids over hourly, daily, and weekly time periods.

This is still a work in progress.
The current bottleneck is that the gauges are fixed snapshots, which don't let you query the number of unique IDs seen over the last X time. The internal data to answer those queries exists (168 hourly sketches and 90 daily sketches are stored in memory), but it's not exposed yet.
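A minimal sketch of the in-memory layout described above, with a std `HashSet` standing in for the HLL sketch (the actual PR uses the hyperloglogplus crate); the struct and method names here are illustrative, not the PR's code:

```rust
use std::collections::{BTreeMap, HashSet};

const HOURLY_RETENTION_HOURS: u64 = 168; // 7 days
const DAILY_RETENTION_DAYS: u64 = 90;

/// Per-interval cardinality tracker. `HashSet<u64>` stands in for the
/// HLL sketch used by the PR; only the map layout is illustrated here.
#[derive(Default)]
pub struct ShortIdMetrics {
    hourly: BTreeMap<u64, HashSet<u64>>, // hour-since-epoch -> sketch
    daily: BTreeMap<u64, HashSet<u64>>,  // day-since-epoch -> sketch
}

impl ShortIdMetrics {
    pub fn record(&mut self, secs_since_epoch: u64, id: u64) {
        let hour = secs_since_epoch / 3600;
        let day = secs_since_epoch / 86400;
        self.hourly.entry(hour).or_default().insert(id);
        self.daily.entry(day).or_default().insert(id);
        // Drop intervals older than the retention window.
        self.hourly.retain(|&k, _| hour.saturating_sub(HOURLY_RETENTION_HOURS) <= k);
        self.daily.retain(|&k, _| day.saturating_sub(DAILY_RETENTION_DAYS) <= k);
    }

    pub fn unique_today(&self, secs_since_epoch: u64) -> usize {
        self.daily.get(&(secs_since_epoch / 86400)).map_or(0, |s| s.len())
    }
}

fn main() {
    let mut m = ShortIdMetrics::default();
    m.record(1_000_000, 1);
    m.record(1_000_001, 2);
    m.record(1_000_002, 1); // duplicate, not double counted
    assert_eq!(m.unique_today(1_000_000), 2);
    println!("unique today: {}", m.unique_today(1_000_000));
}
```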

Pull Request Checklist

Please confirm the following before requesting review:

@zealsham zealsham marked this pull request as draft March 31, 2026 19:28
@coveralls
Collaborator

coveralls commented Mar 31, 2026

Coverage Report for CI Build 25455589339

Coverage increased (+0.1%) to 85.294%

Details

  • Coverage increased (+0.1%) from the base build.
  • Patch coverage: 5 uncovered changes across 1 file (145 of 150 lines covered, 96.67%).
  • No coverage regressions found.

Uncovered Changes

File                             Changed  Covered  %
payjoin-mailroom/src/metrics.rs  102      97       95.1%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 13668
Covered Lines: 11658
Line Coverage: 85.29%
Coverage Strength: 395.85 hits per line

💛 - Coveralls

let hour = secs / 3600;
let day = secs / 86400;

self.hourly.entry(hour).or_insert_with(new_sketch).insert(&id.0);
Contributor

tl;dr - hashing &id before insertion here would eliminate a theoretical leak

since we don't know what hashing HyperLogLogPlus does internally, i think there's a chance that with access to the metrics and access to a list of candidate short IDs, an adversary may be able to confirm if a subset of the short IDs was included, if they are lucky (i.e. those are the extremal values selected).

to prevent this kind of leak the short IDs could be hashed (keyed hash, where the key is secret) before insertion into the HLL sketch, the resulting keys would be a 1:1 mapping (well, almost.... some hash functions are bijective for u64, but i don't think that's really valuable since the chance of collisions on non-duplicate entries is basically 0 for the cardinality we expect)

so basically, by hashing before insertion with some keyed hash (or even a keyed cryptographic hash), we can ensure that even the complete HLL sketch, if published, will not leak information about specific short IDs that were used during the relevant time period.

the overhead of even a cryptographic hash here should be negligible, so perhaps the safest approach is to compute a salted SHA256 of the short ID or something like that, but i think a non-cryptographic keyed hash is sufficient since collision resistance against adversarial inputs subsumes not being able to reverse the hash at least if the salt isn't leaked...

if HyperLogLogPlus's hashing is sufficient, perhaps we can control it and ensure that it is adequately keyed, then pinning it as a dependency (to ensure there's no regression) is arguably sufficient, but i think it's easier to reason that the cost of hashing one more time is basically 0 and worth the peace of mind here
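A sketch of the suggested keyed pre-hash, using the stdlib's RandomState (which seeds SipHash with secret per-process randomness); the `keyed_hash` function name is hypothetical, not part of the PR:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Keyed hash of a short ID before it enters the sketch. RandomState seeds
/// SipHash with secret per-process randomness, so the mapping from short ID
/// to hashed value cannot be reproduced without that key.
fn keyed_hash(key: &RandomState, short_id: u64) -> u64 {
    let mut h = key.build_hasher();
    short_id.hash(&mut h);
    h.finish()
}

fn main() {
    let key = RandomState::new(); // generated once, kept secret
    let a = keyed_hash(&key, 42);
    let b = keyed_hash(&key, 42);
    assert_eq!(a, b); // deterministic under one key
    let other_key = RandomState::new();
    // Overwhelmingly likely to differ under an independent key.
    println!("{} vs {}", a, keyed_hash(&other_key, 42));
}
```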

Collaborator Author

HyperLogLogPlus uses the SipHash-1-3 hash https://docs.rs/crate/hyperloglogplus/latest/source/src/hyperloglogplus.rs, it's a keyed hash function, so i'm thinking it's sufficient

Contributor

since the sketches only reside in process memory it doesn't even need that, but a comment on the initialization would help: it should note that if sketches are ever exported, not just counts, then it will become important for privacy to continue to key the hash function with secret randomness (i.e. keep using RandomState::new(); the stdlib hashing makes strong enough guarantees) and to not leak this key. this still applies if prometheus eventually supports cardinality estimation natively: all IDs should be hashed with a secret key before being passed to whatever cardinality estimation API may be used in the future.

@nothingmuch
Contributor

This PR addresses #1284. It counts cardinality for shortids over hourly, daily, and weekly time periods.

Concept ACK. Code looks good from a cursory read, I will review more thoroughly when it's out of draft

This is still a work in progress. The current bottleneck is that the gauges are fixed snapshots, which don't let you query the number of unique IDs seen over the last X time

in principle HLL sketches can be merged, which means we can take high frequency sketches and aggregate them together in response to certain queries

if cardinality estimation is added to prometheus directly then that would be possible in theory (prometheus/prometheus#12233) without any additional considerations, but until then i think it's perfectly fine to just output u64s at enough temporal resolutions to be useful and not allow estimating count distinct over longer durations. my only suggestion here is to also add a monthly aggregation interval.

note that my comment about hashing short IDs before cardinality estimation only really makes a difference if native cardinality estimation is supported in prometheus directly, so i would also be satisfied with just a comment that says if cardinality estimation is later supported, IDs should be hashed before giving them to prometheus. for now, if we only publish the cardinality estimates resulting from the sketches, and the sketches themselves only reside in the mailroom's RAM, there's no need for that additional hashing

@zealsham zealsham force-pushed the unique-shortid-metrics branch from 718757a to f8e868f on April 30, 2026 12:49
@zealsham zealsham marked this pull request as ready for review April 30, 2026 13:22
@zealsham
Collaborator Author

@nothingmuch added the comments about hashing in case these metrics are ever exported

also added monthly cardinality

Contributor

@nothingmuch nothingmuch left a comment

please use BTreeMap, the keys are inherently ordered

Comment thread payjoin-mailroom/src/metrics.rs Outdated
Comment on lines +57 to +58
self.hourly.retain(|&k, _| hour.saturating_sub(HOURLY_RETENTION_HOURS) <= k);
self.daily.retain(|&k, _| day.saturating_sub(DAILY_RETENTION_DAYS) <= k);
Contributor

retain here seems a bit crude, it scans through the whole collection

if it was a btreemap instead of a hashmap then you could do a while first_key_value <= k { remove(k) } loop instead, which will typically terminate after 0 iterations instead of scanning through the whole collection. that means the retention counts can be pretty big without taking a performance hit
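A sketch of the front-pruning loop suggested above (illustrative only; function name and value type are assumptions, the real map holds sketches):

```rust
use std::collections::BTreeMap;

/// Drop all entries with key strictly below `cutoff`. Because BTreeMap keys
/// are ordered, this pops from the front and typically does no work at all,
/// instead of scanning the whole map the way `retain` does.
fn prune_before(map: &mut BTreeMap<u64, u64>, cutoff: u64) {
    while map.first_key_value().map_or(false, |(&k, _)| k < cutoff) {
        map.pop_first();
    }
}

fn main() {
    let mut m: BTreeMap<u64, u64> = (0..10).map(|k| (k, k * 100)).collect();
    prune_before(&mut m, 7);
    assert_eq!(m.keys().copied().collect::<Vec<_>>(), vec![7, 8, 9]);
    println!("remaining keys: {:?}", m.keys().collect::<Vec<_>>());
}
```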

Collaborator Author

Updated to use a BTreeMap, thank you for the review

Comment thread payjoin-mailroom/src/metrics.rs Outdated
Comment on lines +47 to +52
let secs = SystemTime::now()
.duration_since(UNIX_EPOCH)
.expect("system clock before UNIX epoch")
.as_secs();
let hour = secs / 3600;
let day = secs / 86400;
Contributor

getting the number of some duration since epoch is boilerplate that repeats several times. i think an extension trait on SystemTime could introduce e.g. SystemTime::now().intervals_since_epoch(duration), with hours_since_epoch and days_since_epoch as additional convenience methods that feed it the right constants

Suggested change
-let secs = SystemTime::now()
-    .duration_since(UNIX_EPOCH)
-    .expect("system clock before UNIX epoch")
-    .as_secs();
-let hour = secs / 3600;
-let day = secs / 86400;
+let now = SystemTime::now();
+let hour = now.hours_since_epoch();
+let day = now.days_since_epoch();
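One possible shape for the extension trait described above; the method names follow the review comment, the implementation details are assumptions:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Whole intervals elapsed since the UNIX epoch, with hourly/daily
/// convenience methods as suggested in the review comment.
trait IntervalsSinceEpoch {
    fn intervals_since_epoch(&self, interval: Duration) -> u64;
    fn hours_since_epoch(&self) -> u64 {
        self.intervals_since_epoch(Duration::from_secs(3600))
    }
    fn days_since_epoch(&self) -> u64 {
        self.intervals_since_epoch(Duration::from_secs(86400))
    }
}

impl IntervalsSinceEpoch for SystemTime {
    fn intervals_since_epoch(&self, interval: Duration) -> u64 {
        self.duration_since(UNIX_EPOCH)
            .expect("system clock before UNIX epoch")
            .as_secs()
            / interval.as_secs()
    }
}

fn main() {
    let t = UNIX_EPOCH + Duration::from_secs(90_000); // 25h after epoch
    assert_eq!(t.hours_since_epoch(), 25);
    assert_eq!(t.days_since_epoch(), 1);
    let now = SystemTime::now();
    println!("hour={} day={}", now.hours_since_epoch(), now.days_since_epoch());
}
```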

Collaborator Author

Thanks, this is clean

Comment thread payjoin-mailroom/src/metrics.rs Outdated
/ 86400;
let mut union = new_sketch();
for offset in 0..30 {
if let Some(sketch) = self.daily.get(&(today - offset)) {
Contributor

BTreeMap also has range for iterating over sub-range
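An illustrative use of BTreeMap::range for the monthly union, again with `HashSet` standing in for the HLL sketch and a hypothetical function name; only the keys inside the window are visited:

```rust
use std::collections::{BTreeMap, HashSet};

/// Union the last `window` daily sketches using BTreeMap::range, so only the
/// relevant sub-range is visited. HashSet<u64> stands in for the HLL sketch.
fn union_last_days(
    daily: &BTreeMap<u64, HashSet<u64>>,
    today: u64,
    window: u64,
) -> HashSet<u64> {
    // `today` plus the (window - 1) preceding days.
    let start = today.saturating_sub(window.saturating_sub(1));
    let mut union = HashSet::new();
    for (_, sketch) in daily.range(start..=today) {
        union.extend(sketch.iter().copied());
    }
    union
}

fn main() {
    let mut daily: BTreeMap<u64, HashSet<u64>> = BTreeMap::new();
    daily.entry(100).or_default().extend([1u64, 2]);
    daily.entry(101).or_default().extend([2u64, 3]);
    daily.entry(50).or_default().extend([9u64]); // outside the window
    let u = union_last_days(&daily, 101, 30);
    assert_eq!(u.len(), 3); // {1, 2, 3}
    println!("{:?}", u);
}
```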


const HLL_PRECISION: u8 = 14;
const HOURLY_RETENTION_HOURS: u64 = 168; // 7 days
const DAILY_RETENTION_DAYS: u64 = 90;
Contributor

only the last 30 are used in monthly count

Collaborator Author

updated to 31, but since we keep 90 days of data, would it be better to take the average of all 90? although that can heavily affect accuracy

Contributor

it's not an average, the sum of hll sketches of two sets A and B is the sketch of the union of A and B, there is no reduction in accuracy other than that which is inherent in hll sketches
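The union property can be demonstrated with a toy register array; this is not the hyperloglogplus crate's API, just an illustration of how HLL merging works (register-wise max), and the precision is deliberately tiny:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const P: u32 = 4; // precision: 2^4 = 16 registers (toy scale)

/// Minimal HLL-style register array: each item is hashed, the first P bits
/// pick a register, and the register keeps the max rank (leading-zero count
/// plus one) of the remaining bits seen so far.
fn sketch(items: &[u64]) -> Vec<u8> {
    let mut regs = vec![0u8; 1 << P];
    for &x in items {
        let mut h = DefaultHasher::new();
        x.hash(&mut h);
        let v = h.finish();
        let idx = (v >> (64 - P)) as usize;
        let rank = (v << P).leading_zeros() as u8 + 1;
        regs[idx] = regs[idx].max(rank);
    }
    regs
}

/// Register-wise max: merging sketches of A and B yields exactly the
/// sketch of A ∪ B, with no extra loss of accuracy.
fn merge(a: &[u8], b: &[u8]) -> Vec<u8> {
    a.iter().zip(b).map(|(x, y)| *x.max(y)).collect()
}

fn main() {
    let a: Vec<u64> = (0..500).collect();
    let b: Vec<u64> = (250..750).collect(); // overlaps a
    let union: Vec<u64> = (0..750).collect();
    assert_eq!(merge(&sketch(&a), &sketch(&b)), sketch(&union));
    println!("merged sketch equals sketch of the union");
}
```

Since each register holds a max over the per-item ranks, taking the max of two registers is the same as computing the max over the combined item stream, which is why merged sketches lose nothing beyond the inherent HLL estimation error.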

This pr addresses payjoin#1284. it counts cardinality for shortids over
hourly, daily and weekly time period.
@zealsham zealsham force-pushed the unique-shortid-metrics branch from f8e868f to 03f4dba on May 6, 2026 19:09
@zealsham zealsham requested a review from nothingmuch May 6, 2026 19:13
Contributor

@nothingmuch nothingmuch left a comment

utACK.

re 90 vs. 30 days of retention, i think that's ultimately a project management question. this information should help operators gauge interest, but i think it's primarily of interest to us, if operators are willing to share it, as a proxy for adoption rates. it's probably best to figure out how this should be used and by whom

the main privacy risk is that, especially with low adoption rates, if an adversary continually polls real time monitoring data and sees a change, that is evidence that temporally adjacent transactions are more likely to be payjoins. so i think the real decision is whether we even want to keep hourly information or daily is enough, and how we warn operators about publishing the states somewhere readable, because the gauge for the currently active sketch can provide high temporal resolution (basically prometheus's poll rate)
