Full fledged processing of protection profiles #466

adamjanovsky
Collaborator

Closes #72

@adamjanovsky adamjanovsky self-assigned this Jan 23, 2025
@adamjanovsky
Collaborator Author

@J08nY the first batch of commits, which refactors the Dataset classes, is in. Unfortunately, I had to merge fresh main into this branch, so the diff contains a lot of unrelated changes; it is better to look only at commit 2165a66 to assess the design.

Some of my early design notes:

  • Dataset class will have an aux_handlers attribute that accepts a list of instances implementing the AuxiliaryDatasetHandler interface (in the form of an ABC base class; these days I’d opt for Protocol, but to stay coherent with the old implementation, let’s stick with inheritance).
  • The AuxiliaryDatasetHandler interface defines a process_dataset method.
  • ProtectionProfile dataset can thus inherit from Dataset class and implement no handlers.
  • Each auxiliary dataset will come with its own handler. This enables code re-use between FIPSDataset and CCDataset classes. Any subclass of Dataset class will simply populate its handlers with the required logic.
  • Computation of individual heuristics is outsourced into functions (not part of any class).
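
A minimal sketch of the design in these notes, for illustration only. The names `AuxiliaryDatasetHandler`, `aux_handlers`, and `process_dataset` come from the notes above; `DummyCPEHandler`, the constructor signatures, and the loop in `process_auxiliary_datasets` are assumptions and may not match the code in the PR.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class AuxiliaryDatasetHandler(ABC):
    """Interface that every auxiliary dataset handler implements."""

    @abstractmethod
    def process_dataset(self) -> None:
        """Download or build the auxiliary dataset."""


class DummyCPEHandler(AuxiliaryDatasetHandler):
    """Hypothetical handler; stands in for any real auxiliary dataset."""

    def __init__(self) -> None:
        self.processed = False

    def process_dataset(self) -> None:
        self.processed = True


class Dataset:
    """Base class: subclasses only decide which handlers they need."""

    def __init__(self, aux_handlers: list[AuxiliaryDatasetHandler] | None = None) -> None:
        self.aux_handlers = aux_handlers or []

    def process_auxiliary_datasets(self) -> None:
        # Each handler carries its own logic, so it can be shared by several dataset classes.
        for handler in self.aux_handlers:
            handler.process_dataset()


class ProtectionProfileDataset(Dataset):
    """Needs no auxiliary datasets, so it registers no handlers."""


class CCDataset(Dataset):
    def __init__(self) -> None:
        super().__init__(aux_handlers=[DummyCPEHandler()])
```

With this shape, a FIPS dataset class could reuse the same handler instance type without inheriting anything from the CC one.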

Member

@J08nY J08nY left a comment

Looks OK. But still has conflicts with main. Is the merge commit a real merge commit?

@adamjanovsky
Collaborator Author

Looks OK. But still has conflicts with main. Is the merge commit a real merge commit?

Meh, something was left out, should be fixed by now.

codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 79.76081% with 220 lines in your changes missing coverage. Please review.

Project coverage is 69.36%. Comparing base (7407773) to head (ed26a25).
Report is 2 commits behind head on main.

Files with missing lines                               Patch %  Lines
src/sec_certs/utils/label_studio_utils.py              0.00%    56 Missing ⚠️
src/sec_certs/sample/protection_profile.py             78.82%   43 Missing ⚠️
...rc/sec_certs/dataset/auxiliary_dataset_handling.py  84.62%   30 Missing ⚠️
src/sec_certs/utils/cc_html_parsing.py                 36.67%   19 Missing ⚠️
src/sec_certs/dataset/protection_profile.py            92.39%   15 Missing ⚠️
src/sec_certs/dataset/dataset.py                       62.17%   14 Missing ⚠️
src/sec_certs/dataset/cc.py                            76.00%   12 Missing ⚠️
src/sec_certs/heuristics/common.py                     85.34%   11 Missing ⚠️
src/sec_certs/heuristics/cc.py                         90.81%   8 Missing ⚠️
src/sec_certs/cli.py                                   0.00%    4 Missing ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #466      +/-   ##
==========================================
+ Coverage   68.55%   69.36%   +0.82%     
==========================================
  Files          62       69       +7     
  Lines        7934     8341     +407     
==========================================
+ Hits         5438     5785     +347     
- Misses       2496     2556      +60     

@adamjanovsky
Collaborator Author

adamjanovsky commented Jan 25, 2025

Hey, the initial draft of the functionality is implemented. Some notes below.

Sample usage

Create and fully process PP dataset

pp_dset = ProtectionProfileDataset(root_dir="/path/to/pp/directory")
pp_dset.get_certs_from_web()
pp_dset.process_auxiliary_datasets()
pp_dset.download_all_artifacts()
pp_dset.convert_all_pdfs()
pp_dset.analyze_certificates()

Access to PP Dataset from CC Dataset

  • A path to an already-processed PP dataset can be provided in
cc_dset.process_auxiliary_datasets(processed_pp_dataset_root_dir="/path/to/pp/directory")

In that case, the PPDataset instance is not fully processed, just copied into cc_dset.auxiliary_datasets_dir. The instance can be accessed via cc_dset.aux_handlers[ProtectionProfileDatasetHandler].dset once cc_dset.process_auxiliary_datasets() completes.

Alternatively, a CCDataset instance can invoke full processing of the PPDataset (as it already does for maintenance updates) when process_auxiliary_datasets() is called without any argument.

When CCDataset processing is complete, the ProtectionProfile instances linked to specific CCCertificates are listed as digests in cert.heuristics.protection_profiles.
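
The two modes described above can be sketched with toy classes. This is only an illustration under stated assumptions: the class, attribute, and parameter names mirror the comment, but the `pp_dataset` subdirectory, the `auxiliary_datasets` directory name, the dict standing in for a real dataset object, and all method bodies are hypothetical.

```python
from __future__ import annotations

import shutil
from pathlib import Path


class ProtectionProfileDatasetHandler:
    """Toy handler: copies a processed PP dataset or processes a fresh one."""

    def __init__(self, aux_datasets_dir: Path) -> None:
        self.dset_path = aux_datasets_dir / "pp_dataset"  # hypothetical layout
        self.dset: dict | None = None  # stand-in for a real dataset object

    def process_dataset(self, processed_pp_dataset_root_dir: Path | None = None) -> None:
        if processed_pp_dataset_root_dir is not None:
            # Cheap path: reuse an already-processed dataset by copying it.
            shutil.copytree(processed_pp_dataset_root_dir, self.dset_path, dirs_exist_ok=True)
            self.dset = {"source": "copied", "root_dir": self.dset_path}
        else:
            # Expensive path: placeholder for running the full PP pipeline.
            self.dset_path.mkdir(parents=True, exist_ok=True)
            self.dset = {"source": "processed", "root_dir": self.dset_path}


class CCDataset:
    def __init__(self, root_dir: Path) -> None:
        self.auxiliary_datasets_dir = root_dir / "auxiliary_datasets"
        # Handlers are looked up by their class, as in the comment above.
        self.aux_handlers = {
            ProtectionProfileDatasetHandler: ProtectionProfileDatasetHandler(self.auxiliary_datasets_dir)
        }

    def process_auxiliary_datasets(self, processed_pp_dataset_root_dir: Path | None = None) -> None:
        for handler in self.aux_handlers.values():
            handler.process_dataset(processed_pp_dataset_root_dir)
```

Keying the handlers by class keeps the lookup explicit at the call site, at the cost of allowing only one handler of each type per dataset.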

Notes on PP processing

  • The primary key of a ProtectionProfile instance (and the input to its dgst property) is a three-tuple: (category, name, version).
  • Linking from CC to PP relies purely on the PP link being identical in the CCCertificate and ProtectionProfile objects.
  • Some collaborative PPs have multiple certification reports. I parse only a single link.
  • Some identical PPs are certified under multiple schemes. This is ignored for now; only a single scheme is taken into account.
  • CSV files are not parsed. The only data entry missing as a result is the expected archival date of active PPs.
  • PP ID is not computed.
  • Collaborative PPs with SDs pending review for compliance with the CC/CEM are not processed.
  • Maintenance updates of PPs are left unprocessed.
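
To make the first two notes concrete, here is a hedged sketch of a digest derived from the (category, name, version) key and of linking by exact PP link. The hashing scheme (truncated SHA-256 over a "|"-joined key), the `pp_link` field name, and the `link_cert_to_pps` helper are assumptions for illustration, not the PR's implementation.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionProfile:
    category: str
    name: str
    version: str
    pp_link: str  # hypothetical field holding the URL of the PP PDF

    @property
    def dgst(self) -> str:
        # Assumption: digest = truncated SHA-256 over the (category, name, version) key.
        key = "|".join((self.category, self.name, self.version))
        return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def link_cert_to_pps(cert_pp_links: set[str], profiles: list[ProtectionProfile]) -> set[str]:
    """Return digests of profiles whose PP link exactly matches one on the certificate."""
    return {pp.dgst for pp in profiles if pp.pp_link in cert_pp_links}
```

Because the match is on the exact link string, two URLs that differ only in case or a trailing slash would not link, which is one way stricter matching can reduce the number of linked certificates.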

Next steps

  • @adamjanovsky: Write tests for PP Processing (soonish)
  • @J08nY: Test the current design with web, propose changes or start integrating.
  • @adamjanovsky and @J08nY: Allow for downloading a processed PP snapshot from sec-certs.org
  • @adamjanovsky: Regression tests: check how many PPs are linked. Check if we miss any previously linked certs.

Comment on lines 57 to 60
pp_latest_full_archive: AnyHttpUrl = Field(
    "https://sec-certs.org/cc/pp.tar.gz",
    description="URL from where to fetch the latest full archive of fully processed PP dataset.",
)
Member

The pp_latest_snapshot config also needs to change. It will no longer live in the /static/ subdir, but will have the same layout as the CC and FIPS datasets. Could you make the change pls?

Collaborator Author

Sure, I wanted to discuss this first before changing this.

@J08nY
Member

J08nY commented Jan 25, 2025

Could you please add some test(s) that run the PP pipeline at least once? I.e. improve the coverage in the dataset/protection_profile.py and sample/protection_profile.py files here.

@J08nY J08nY added the enhancement (New feature or request) and cc (Related to CC certification) labels Jan 25, 2025
@adamjanovsky
Collaborator Author

Could you please add some test(s) that run the PP pipeline at least once? I.e. improve the coverage in the dataset/protection_profile.py and sample/protection_profile.py files here.

Sure 🙃 , see

@adamjanovsky: Write tests for PP Processing (soonish)

@adamjanovsky
Collaborator Author

adamjanovsky commented Jan 27, 2025

Regression tests

OLD APPROACH
	- After `get_certs_from_web()`
--------------------
	- # PP-rich certs: 3288
	- # PP links: 4291
	- # Unique PP links: 269
--------------------
	- # PP-rich certs: 3288
	- # PP links: 4291
	- # Unique PP links: 269
NEW APPROACH
	- After `get_certs_from_web()`
--------------------
	- # PP-rich certs: 3259
	- # PP links: 4292
	- # Unique PP links: 266
	- After processing ProtectionProfileDataset
--------------------
	- # PP-rich certs: 3212
	- # PP links: 4232
	- # Unique PP links: 264

The decreased number of PPs in the new approach is explained by:

  • After get_certs_from_web() phase: we no longer retrieve PPs without a link from the CSV sources.
  • After processing PPDataset:
    1. A consequence of the fewer matches in the first phase.
    2. Also, stronger requirements on linking from CCs to PPs. We require an exact match of the PP PDF link atm.
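
For reference, the three counters in the report can be computed along these lines. This is a sketch, assuming each certificate is represented only by its digest and a list of PP links; the function names and input shape are hypothetical and not taken from the PR.

```python
from __future__ import annotations


def pp_link_stats(cert_pp_links: dict[str, list[str]]) -> dict[str, int]:
    """Count PP-rich certificates, PP links, and unique PP links."""
    rich = {dgst: links for dgst, links in cert_pp_links.items() if links}
    all_links = [link for links in rich.values() for link in links]
    return {
        "pp_rich_certs": len(rich),
        "pp_links": len(all_links),
        "unique_pp_links": len(set(all_links)),
    }


def lost_certs(old: dict[str, list[str]], new: dict[str, list[str]]) -> set[str]:
    """Certificates that had at least one PP link before but have none now."""
    return {dgst for dgst, links in old.items() if links and not new.get(dgst)}
```

Comparing `lost_certs(old, new)` between the two approaches would list exactly which previously linked certificates are missed, rather than only the aggregate drop.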

@adamjanovsky
Collaborator Author

This also implements #462, #463, #465

@J08nY J08nY changed the title from "Issue/72 new api allow for full fledged processing of protection profiles" to "Full fledged processing of protection profiles" Jan 27, 2025
@J08nY J08nY added the pp Related to Protection Profiles label Jan 27, 2025
@adamjanovsky
Collaborator Author

adamjanovsky commented Jan 28, 2025

@J08nY the requested changes should now be incorporated. I will further check that notebooks work. I will also flag the NVDDatasetBuilder tests with xfail, as they often fail due to NIST server problems.

You can start working on the integration. I suggest merging this only once we're confident we can deploy this to sec-certs.org.

@J08nY
Member

J08nY commented Jan 28, 2025

You can start working on the integration. I suggest merging this only once we're confident we can deploy this to sec-certs.org.

I guess we could do that, but my workflow kind of requires that the web is developed against the main branch here, but I will see. I mean real issues will only be visible after deploy anyway.

@J08nY
Member

J08nY commented Jan 28, 2025

Could you make the pp URLs have /pp/ in the url? To align with the other blueprints on the web: /cc/ and /fips/.

@adamjanovsky
Collaborator Author

You can start working on the integration. I suggest merging this only once we're confident we can deploy this to sec-certs.org.

I guess we could do that, but my workflow kind of requires that the web is developed against the main branch here, but I will see. I mean real issues will only be visible after deploy anyway.

If we merge to main asynchronously, the API will be broken for a while. I would appreciate it if you tried deploying from this branch.

Once it all works, we can merge to main and create a new release.

@J08nY
Member

J08nY commented Jan 29, 2025

What do you mean by broken API? What exactly breaks?

@adamjanovsky
Collaborator Author

adamjanovsky commented Jan 29, 2025

What do you mean by broken API? What exactly breaks?

Well, ProtectionProfileDataset.from_web() will not work until you deploy. Also, the other classes will have incompatible data until we do a fresh processing run on the web.

@adamjanovsky
Collaborator Author

Hi, final batch of updates is here.

  • I merged the two separate methods from_web_latest() and from_web() in various classes.
    • from_web() without additional arguments now behaves like the former from_web_latest()
  • I added an example PP notebook
  • I added docs for PPs
  • I added support for PP processing in the CLI
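
The merged from_web() behaviour can be pictured with a toy class. This is a sketch only: the `archive_url` parameter, the `LATEST_ARCHIVE_URL` placeholder, and the `source_url` attribute are hypothetical, and nothing is downloaded here; the actual signature in the PR may differ.

```python
from __future__ import annotations

# Placeholder, not a real endpoint; the real default would come from the configuration.
LATEST_ARCHIVE_URL = "https://example.org/pp/latest.tar.gz"


class ProtectionProfileDataset:
    def __init__(self, source_url: str) -> None:
        self.source_url = source_url

    @classmethod
    def from_web(cls, archive_url: str | None = None) -> ProtectionProfileDataset:
        # No argument: fall back to the latest snapshot, which is what a
        # separate from_web_latest() method used to do.
        return cls(archive_url or LATEST_ARCHIVE_URL)
```

Folding the default into one method leaves callers a single entry point, while an explicit URL still overrides the latest snapshot.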

@J08nY I consider this done from my side. I'll be just assisting with integration and implementation of review requests.

@J08nY
Member

J08nY commented Feb 1, 2025

I am going to create a new PR based on a branch on which I rebased all of this on top of the current main, to get rid of the messed-up merge commit and have a nicer history.

Labels
cc Related to CC certification enhancement New feature or request pp Related to Protection Profiles
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New API: Allow for full-fledged processing of protection profiles
2 participants