Use the one flatbuffer to store all lists #489

Closed
wants to merge 21 commits into from

Collaborator

@atuchin-m atuchin-m commented Jun 20, 2025

This PR moves from one flatbuffer per NetworkList to a single flatbuffer per Engine.
It doesn't affect the performance metrics, but it opens the possibility of putting cosmetic filters into the same flatbuffer.
It also simplifies the serialization and deserialization code.

Notes:

  • It uses the original algorithm and structures without further optimizations. Those are left for a follow-up, since the diff is big enough already.
  • insert_dup was dropped. It currently had no effect with flatbuffers (we compared the flatbuffer offsets, not the unique ids). It may make sense to restore it in a future version.
  • NetworkFilterLists::new is left as-is for benches and tests to avoid excessive diff.
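
To illustrate the idea of a single per-Engine container that holds every list, here is a minimal in-memory sketch. It does not use the flatbuffers crate; `EngineBuilder`, `EngineData`, and the `ListId` variants are hypothetical names chosen for this example, and only the `add_filter(filter, list_id)` / `finish()` shape is borrowed from the diff.

```rust
// Hypothetical list ids; the real enum in the PR may have other variants.
#[derive(Clone, Copy)]
pub enum ListId {
    Filters = 0,
    Exceptions = 1,
    RemoveParam = 2,
}

// Hypothetical builder: collects every filter into one container,
// bucketed by list id, instead of building one buffer per list.
#[derive(Default)]
pub struct EngineBuilder {
    lists: Vec<Vec<String>>,
}

impl EngineBuilder {
    pub fn add_filter(&mut self, filter: &str, list_id: u32) {
        let idx = list_id as usize;
        // Grow the outer vector on demand so any list id is accepted.
        if self.lists.len() <= idx {
            self.lists.resize_with(idx + 1, Vec::new);
        }
        self.lists[idx].push(filter.to_string());
    }

    pub fn finish(self) -> EngineData {
        EngineData { lists: self.lists }
    }
}

// Hypothetical stand-in for the single serialized buffer.
pub struct EngineData {
    lists: Vec<Vec<String>>,
}

impl EngineData {
    // Look up one list inside the shared container; missing lists are empty.
    pub fn list(&self, id: ListId) -> &[String] {
        self.lists.get(id as usize).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

With this layout, additional kinds of data could in principle be added as further fields of the same container, which is the property the description refers to.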

@atuchin-m atuchin-m self-assigned this Jun 20, 2025
pub(crate) fn filter_list(&self) -> fb::NetworkFilterList<'_> {
unsafe { fb::root_as_network_filter_list_unchecked(self.data()) }
pub(crate) fn root(&self) -> fb::Engine<'_> {
unsafe { fb::root_as_engine_unchecked(self.data()) }

reported by reviewdog 🐶
[semgrep] Detected 'unsafe' usage, please audit for secure usage

Source: https://semgrep.dev/r/rust.lang.security.unsafe-usage.unsafe-usage


Cc @thypon @kdenhartog

flatbuffers::Vector<'a, flatbuffers::ForwardsUOffset<NetworkFilterList>>,
>>(Engine::VT_LISTS, None)
.unwrap()
}

reported by reviewdog 🐶
[semgrep] Detected 'unsafe' usage, please audit for secure usage

Source: https://semgrep.dev/r/rust.lang.security.unsafe-usage.unsafe-usage


Cc @thypon @kdenhartog

None,
)
.unwrap()
}

reported by reviewdog 🐶
[semgrep] Detected 'unsafe' usage, please audit for secure usage

Source: https://semgrep.dev/r/rust.lang.security.unsafe-usage.unsafe-usage


Cc @thypon @kdenhartog

@github-actions github-actions bot left a comment

Rust Benchmark

| Benchmark suite | Current: 0bb327f | Previous: 4738d3f | Ratio |
| --- | --- | --- | --- |
| rule-match-browserlike/brave-list | 2268568240 ns/iter (± 20502033) | 2252126465 ns/iter (± 13369021) | 1.01 |
| rule-match-first-request/brave-list | 1006882 ns/iter (± 6061) | 1000284 ns/iter (± 8364) | 1.01 |
| blocker_new/brave-list | 150003954 ns/iter (± 1107279) | 150674514 ns/iter (± 1975624) | 1.00 |
| blocker_new/brave-list-deserialize | 63987710 ns/iter (± 1362897) | 62360204 ns/iter (± 1865060) | 1.03 |
| memory-usage/brave-list-initial | 16282069 ns/iter (± 3) | 16225933 ns/iter (± 3) | 1.00 |
| memory-usage/brave-list-initial/max | 64817658 ns/iter (± 3) | 64817658 ns/iter (± 3) | 1 |
| memory-usage/brave-list-initial/alloc-count | 1514486 ns/iter (± 3) | 1514650 ns/iter (± 3) | 1.00 |
| memory-usage/brave-list-1000-requests | 2516487 ns/iter (± 3) | 2505592 ns/iter (± 3) | 1.00 |
| memory-usage/brave-list-1000-requests/alloc-count | 66572 ns/iter (± 3) | 66070 ns/iter (± 3) | 1.01 |

This comment was automatically generated by workflow using github-action-benchmark.

@atuchin-m atuchin-m force-pushed the the-one-fb branch 2 times, most recently from f8e5204 to 27de614 Compare June 20, 2025 20:47
Member

@kdenhartog kdenhartog left a comment

I checked over the unsafe usage and this matches what it was previously. We changed the return type here, but it's not a concern and the implementation matches what it previously was. I'm removing those alerts
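
For readers unfamiliar with why an unchecked accessor can be acceptable, the usual pattern is to validate a buffer once at construction and skip the check on later reads. The sketch below is only an analogy built on std's UTF-8 validation, not the flatbuffers API; `VerifiedUtf8` is a hypothetical type, and whether the PR's memory wrapper verifies in the same way is an assumption.

```rust
/// Hypothetical stand-in for a verified buffer wrapper: the bytes are
/// validated once at construction, so later reads can skip the check.
pub struct VerifiedUtf8 {
    raw: Vec<u8>,
}

impl VerifiedUtf8 {
    /// Validate once; reject anything that is not well-formed UTF-8.
    pub fn from_raw(raw: Vec<u8>) -> Result<Self, std::str::Utf8Error> {
        std::str::from_utf8(&raw)?;
        Ok(Self { raw })
    }

    /// Cheap accessor intended for hot paths.
    pub fn as_str(&self) -> &str {
        // SAFETY: `raw` was validated in `from_raw`, is private, and is
        // never mutated afterwards, so it is still valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(&self.raw) }
    }
}
```

The soundness argument rests on the field being private and immutable after construction, which is the kind of invariant an audit of such `unsafe` blocks would look for.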

Comment on lines 71 to 75
// TODO: do we need another feature for this?
#[cfg(feature = "unsync-regex-caching")]
pub(crate) type SharedStateRef = std::rc::Rc<SharedState>;
#[cfg(not(feature = "unsync-regex-caching"))]
pub(crate) type SharedStateRef = std::sync::Arc<SharedState>;
Collaborator

Same feature for both is fine, but it would be nice to rename it since it'd no longer be strictly about regex caching

Collaborator

On that note though, I don't think we actually need any refcounted pointers for this? We could pass the &SharedState in as an argument to whatever functions require it, or give Blocker a <'a> lifetime so it can own a shared_state: &'a SharedState field

Collaborator Author

That sounds logical, but the issue is that Engine (the primary owner of shared_state) also owns a Blocker instance (which also stores shared_state).

A struct field cannot borrow from another field of the same struct without something like https://docs.rs/ouroboros/latest/ouroboros/attr.self_referencing.html
Without Rc, Blocker would become Blocker<'a> with a limited lifetime and couldn't be stored as a member of Engine, which would require major changes in the codebase.
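
A minimal sketch of the refcounted arrangement described here, assuming simplified stand-ins for the real types (`SharedState` holding a list of rule strings, and a substring match in `check`, are hypothetical and only serve the example):

```rust
use std::rc::Rc;

// Hypothetical stand-in for the shared filter data.
pub struct SharedState {
    pub rules: Vec<String>,
}

// Blocker keeps its own handle, so it needs no lifetime parameter.
pub struct Blocker {
    shared_state: Rc<SharedState>,
}

impl Blocker {
    pub fn check(&self, url: &str) -> bool {
        self.shared_state.rules.iter().any(|r| url.contains(r.as_str()))
    }
}

// Engine owns both the shared state and a Blocker that points at it.
pub struct Engine {
    shared_state: Rc<SharedState>,
    blocker: Blocker,
}

impl Engine {
    pub fn new(rules: Vec<String>) -> Self {
        let shared_state = Rc::new(SharedState { rules });
        // Only the refcount is bumped; the data itself is not copied.
        let blocker = Blocker { shared_state: Rc::clone(&shared_state) };
        Self { shared_state, blocker }
    }

    pub fn check_network_request(&self, url: &str) -> bool {
        self.blocker.check(url)
    }

    // Number of live handles to the shared state.
    pub fn owners(&self) -> usize {
        Rc::strong_count(&self.shared_state)
    }
}
```

Because `Blocker` holds an `Rc` rather than a `&'a SharedState`, it can live as a plain field of `Engine` next to the state it refers to.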

Collaborator Author

The feature has been renamed to single-thread.

Collaborator

My thinking is it could be organized like this minimal example:

type FilterDataContextRef = [u8; 32];

struct Engine {
    filter_data_ref: FilterDataContextRef,
}

impl Engine {
    fn new() -> Self {
        let filter_data_ref = [0u8; 32];
        Self {
            filter_data_ref,
        }
    }

    fn blocker<'a>(&'a self) -> Blocker<'a> {
        Blocker {
            filter_data_ref: &self.filter_data_ref,
        }
    }

    fn check_network_request(&self) -> bool {
        self.blocker().check()
    }
}

struct Blocker<'a> {
    filter_data_ref: &'a FilterDataContextRef,
}

impl<'a> Blocker<'a> {
    fn check(&self) -> bool {
        true
    }
}

Key point is Engine owns all the data and has the blocker() method to produce a scoped convenience struct that can be used for the current network blocking methods without moving too much code around.

The only other fields on Blocker are:

  • regex_manager, which also makes sense to keep on the top-level Engine so that it may be used for cosmetic filtering as needed (some of uBO's newer syntax features could benefit from it)
  • tags_enabled, which I don't have strong preferences about since I'd like to get rid of it anyways

builder.add_filter(filter, list_id as u32);
}

builder.finish(if optimize {
Collaborator

optimize && id != FilterId::RemoveParam as u32 ?

Collaborator Author

That was my initial attempt, but we can't use lambdas with a capture as an fn pointer (like fn(u32) -> bool).

Collaborator

see my other comment, but for future reference: https://doc.rust-lang.org/std/keyword.move.html
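
The limitation discussed in this thread comes from how Rust types closures: only closures that capture nothing coerce to a plain fn pointer such as fn(u32) -> bool, while a generic Fn bound accepts capturing closures as well. A minimal sketch, in which the `REMOVE_PARAM` constant and all function names are hypothetical and not taken from the PR:

```rust
// Hypothetical id, used only for this sketch.
pub const REMOVE_PARAM: u32 = 7;

// Takes a plain fn pointer: only non-capturing closures coerce to this type.
pub fn count_with_fn_ptr(ids: &[u32], keep: fn(u32) -> bool) -> usize {
    ids.iter().filter(|&&id| keep(id)).count()
}

// Takes any `Fn(u32) -> bool`, so capturing closures are accepted too.
pub fn count_with_generic(ids: &[u32], keep: impl Fn(u32) -> bool) -> usize {
    ids.iter().filter(|&&id| keep(id)).count()
}

pub fn optimizable(ids: &[u32], optimize: bool) -> usize {
    // This closure captures `optimize`, so it would be rejected by
    // `count_with_fn_ptr` but is fine for the generic version.
    count_with_generic(ids, move |id| optimize && id != REMOVE_PARAM)
}
```

The generic version is monomorphized per closure type; a `&dyn Fn(u32) -> bool` parameter would be the dynamic-dispatch alternative if that matters for code size.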

@atuchin-m atuchin-m changed the title [DRAFT] Use the one flatbuffer to store all lists Use the one flatbuffer to store all lists Jun 25, 2025
@atuchin-m atuchin-m marked this pull request as ready for review June 25, 2025 11:44
@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rust Benchmark'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.10.

| Benchmark suite | Current: 0bb327f | Previous: 4738d3f | Ratio |
| --- | --- | --- | --- |
| blocker_new/brave-list-deserialize | 69734003 ns/iter (± 1109450) | 62360204 ns/iter (± 1865060) | 1.12 |

This comment was automatically generated by workflow using github-action-benchmark.

@atuchin-m atuchin-m requested review from boocmp and antonok-edm June 25, 2025 12:44
@@ -0,0 +1,337 @@
//! Builder for creating flatbuffer with serialized engine.
Collaborator

Maybe rename it to fb_builder.rs?

Collaborator Author

we can. @antonok-edm WDYT?

Collaborator

@boocmp boocmp Jun 26, 2025

I'm asking because we have fb_network.rs and flat_filter_map.rs, and I suppose fb stands for flatbuffers and flat for flat containers.

Collaborator

yes, sounds good to me

Collaborator

@antonok-edm antonok-edm left a comment

one more general thing - could we do these in a 0.11.x branch rather than in master? That way it's a little bit easier to maintain the history of breaking changes vs patch releases

Comment on lines 330 to 335
builder.finish(if optimize {
// Don't optimize removeparam, since it can fuse filters without respecting distinct
|id: u32| id != FilterId::RemoveParam as u32
} else {
|_| false
})
Collaborator

what's the reason to pass this as a lambda rather than moving the check inside finish()?

Collaborator Author

Heh, the initial idea was to have two layers: one for the storage and one for the logic (the how and what).
But in fact those parts of the code are strongly connected to each other.
Changed to just a bool.

Comment on lines +95 to +104
// Reconstruct the unique_domains_hashes_map from the flatbuffer data
let root = memory.root();
let mut unique_domains_hashes_map: HashMap<crate::utils::Hash, u32> = HashMap::new();
for (index, hash) in root.unique_domains_hashes().iter().enumerate() {
unique_domains_hashes_map.insert(hash, index as u32);
}
FilterDataContextRef::new(Self {
memory,
unique_domains_hashes_map,
})
Collaborator

not a concern for this particular PR, but does it make sense to hold data that needs to be "reconstructed" in a separate buffer? Then we don't necessarily need to hold both copies in memory.

Collaborator Author

The problem is that the stored unique_domains_hashes is basically index => ShortHash, but we need ShortHash => index.
Maybe it makes sense to store an additional mapping (even though it will use a small amount of memory).
Well, I believe we need to dedup and index not only domains but also other strings, so let's postpone this for now.
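
The inversion described in this reply (a stored index => hash table turned into a hash => index lookup) can be sketched as follows. `Hash` is aliased to `u64` here as an assumption; the real `crate::utils::Hash` type may differ, and `build_reverse_index` is a hypothetical helper name.

```rust
use std::collections::HashMap;

// Assumption for this sketch; the crate's actual hash type may differ.
pub type Hash = u64;

/// Invert an `index => hash` table into a `hash => index` lookup map.
pub fn build_reverse_index(stored: &[Hash]) -> HashMap<Hash, u32> {
    stored
        .iter()
        .enumerate()
        .map(|(index, &hash)| (hash, index as u32))
        .collect()
}
```

The map roughly duplicates the stored table in memory, which is the cost the reviewer's question is about.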


Self {
blocker: Blocker::new(network_filters, &blocker_options),
blocker: Blocker::from_context(filter_data_context.clone()),
Collaborator

Convention is to explicitly use Rc::clone(x) or Arc::clone(x) rather than x.clone() to make it clear that it's just a refcount increase.

https://doc.rust-lang.org/book/ch15-04-rc.html#using-rct-to-share-data

In these instances it depends on the feature config, so I'd suggest FilterDataContextRef::clone(x) instead
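
A small sketch of that suggestion, assuming the alias resolves to `Arc` (with the single-thread feature it would be `Rc` instead); `FilterDataContext` is reduced to a hypothetical one-field struct and `share` is a hypothetical helper:

```rust
use std::sync::Arc;

// Hypothetical stand-in for the shared filter data.
pub struct FilterDataContext {
    pub lists: usize,
}

// Assumption: the alias is `Arc` in this configuration.
pub type FilterDataContextRef = Arc<FilterDataContext>;

pub fn share(ctx: &FilterDataContextRef) -> FilterDataContextRef {
    // Calling `clone` through the alias makes it visible at the call site
    // that only the refcount is incremented, whichever pointer type is active.
    FilterDataContextRef::clone(ctx)
}
```

Both handles point at the same allocation, so `share` never copies the underlying filter data.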

Collaborator Author

Done, thanks.

Collaborator Author

atuchin-m commented Jun 26, 2025

one more general thing - could we do these in a 0.11.x branch rather than in master? That way it's a little bit easier to maintain the history of breaking changes vs patch releases

We can in theory, but the workflow isn't clear to me.
Create 0.11.x, make a PR here, merge it, then merge 0.11.x into master?
Merging everything into master first is clearer to me, also in terms of tracking perf (a linear history).

@atuchin-m atuchin-m requested a review from antonok-edm June 26, 2025 22:10
[puLL-Merge] - brave/adblock-rust@489

Description

This PR refactors the internal storage system for network filters in the adblock-rust library. The main changes include:

  1. Feature rename: Changes the unsync-regex-caching feature to single-thread with clearer documentation
  2. FlatBuffer consolidation: Replaces the old system where each filter list was stored separately with a unified FlatBuffer containing all filter lists and supporting data
  3. Architecture restructuring: Introduces FilterDataContext as a shared container for FlatBuffer data, eliminating the need to store multiple NetworkFilterList instances in Blocker
  4. Builder pattern: Adds FlatBufferBuilder to construct the consolidated FlatBuffer format
  5. API simplification: Streamlines the Engine API by removing the Engine::new() constructor in favor of Engine::default()

Possible Issues

  • Breaking changes: The feature rename from unsync-regex-caching to single-thread will break existing configurations
  • Serialization compatibility: The changes to the FlatBuffer schema will break compatibility with previously serialized engines
  • Memory usage: The new unified FlatBuffer approach may have different memory characteristics than the previous approach
  • Performance implications: Moving from separate filter lists to a single consolidated structure may impact lookup performance

Changes

  • Cargo.toml: Renames feature from unsync-regex-caching to single-thread
  • README.md: Updates documentation to reflect the feature rename
  • src/blocker.rs: Major refactor removing individual filter list storage, replacing with methods that access filter lists through FilterDataContext
  • src/engine.rs: Removes Engine::new(), updates serialization/deserialization to use the new FlatBuffer format
  • src/filters/fb_builder.rs: New file implementing FlatBufferBuilder for creating consolidated FlatBuffers
  • src/filters/fb_network.rs: Adds FilterDataContext and updates network filter handling
  • src/filters/unsafe_tools.rs: Renames and updates VerifiedFlatFilterListMemory to VerifiedFlatbufferMemory
  • src/flatbuffers/: Updates FlatBuffer schema to use Engine as root type instead of NetworkFilterList
  • src/network_filter_list.rs: Simplifies to work with the new consolidated format
  • tests/ and benches/: Updates to accommodate API changes and new expected hash values for serialization tests
sequenceDiagram
    participant Client
    participant Engine
    participant FilterDataContext
    participant FlatBufferBuilder
    participant Blocker
    
    Client->>Engine: from_filter_set(rules, optimize)
    Engine->>FlatBufferBuilder: make_flatbuffer(network_filters, optimize)
    FlatBufferBuilder->>FlatBufferBuilder: categorize filters into lists
    FlatBufferBuilder->>FlatBufferBuilder: optimize if requested
    FlatBufferBuilder->>FilterDataContext: new(memory)
    FilterDataContext-->>Engine: FilterDataContextRef
    Engine->>Blocker: from_context(filter_data_context)
    Blocker-->>Engine: Blocker instance
    Engine-->>Client: Engine instance
    
    Client->>Engine: check_network_request(request)
    Engine->>Blocker: check(request, resources)
    Blocker->>Blocker: get_list(NetworkFilterListId)
    Blocker->>FilterDataContext: access filter data
    FilterDataContext-->>Blocker: NetworkFilterList
    Blocker-->>Engine: BlockerResult
    Engine-->>Client: result

@atuchin-m atuchin-m closed this Jun 28, 2025
5 participants