Add an option to prefer ScanCode's file- over line-level findings #8152

sschuberth · 2024-01-22T12:32:19Z

Please have a look at the individual commit messages for the details.

plugins/scanners/scancode/src/main/kotlin/ScanCodeConfig.kt

plugins/scanners/scancode/src/main/kotlin/ScanCode.kt

plugins/scanners/scancode/src/test/kotlin/ScanCodeResultParserTest.kt

sschuberth · 2024-01-22T13:50:59Z

plugins/scanners/scancode/src/main/kotlin/ScanCodeResultModelMapper.kt

+        if (preferFileLicense && file is FileEntry.Version3 && file.detectedLicenseExpressionSpdx != null) {
+            licenseFindings += LicenseFinding(
+                license = file.detectedLicenseExpressionSpdx,
                location = TextLocation(
                    path = file.path,
-                    startLine = license.startLine,
-                    endLine = license.endLine
+                    startLine = licenses.minOf { it.startLine },
+                    endLine = licenses.maxOf { it.endLine }
                ),
-                score = license.score
+                score = licenses.map { it.score }.average().toFloat()
            )


FYI @Etsija, @lamppu, @mmurto: This should resemble DOS's logic.

codecov · 2024-01-22T18:33:01Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (47da430) 67.16% compared to head (ead5b94) 67.16%.
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #8152   +/-   ##
=========================================
  Coverage     67.16%   67.16%           
  Complexity     2049     2049           
=========================================
  Files           357      357           
  Lines         17158    17158           
  Branches       2461     2461           
=========================================
  Hits          11525    11525           
  Misses         4611     4611           
  Partials       1022     1022

Flag	Coverage Δ
funTest-docker	`66.25% <ø> (ø)`
funTest-non-docker	`33.96% <ø> (ø)`
test	`36.98% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

plugins/scanners/scancode/src/main/kotlin/ScanCode.kt

fviernau · 2024-01-23T21:36:22Z

plugins/scanners/scancode/src/main/kotlin/ScanCodeResultModelMapper.kt

                ),
-                score = license.score
+                score = licenses.map { it.score }.average().toFloat()


Would you know the rational DOS use when coming up with an avg score?
Also, I wonder in what way the CLI option --license-score influences the per file license? (and whether this should become a command line config option).

Would you know the rational DOS use when coming up with an avg score?

@Etsija and @lamppu probably know better, but the average is pretty much the only metric that somewhat makes sense and is reasonably easy to calculate. Min. does not make sense, max. does not make sense.

Of course, you could do something "crazy" like only taking the scores of those line findings into account whose licenses are also contained in the file finding, and maybe even weight the score according to the number of lines matched, but that's much more work for questionable benefit.

I wonder in what way the CLI option --license-score influences the per file license?

Let's ask the author's: @pombredanne, does ScanCode's --license-score option have an (implicit) impact on which licenses will be part of the per-file detected_license_expression(_spdx) fields?

Of course, you could do something "crazy" like only taking the scores of those line findings into account whose licenses are also contained in the file finding, and maybe even weight the score according to the number of lines matched, but that's much more work for questionable benefit.

Another alternative would be to omit lines and score entirely, with the rational that they stop being meaningful.
The reason I'm picking this up is, because (1) I suspect these values are a bit more likely to exhibit some bit of flakyness at least across scancode version (2) the result of the heuristic gets backed into the stored scan results and is thus very expensive to change.

Another alternative would be to omit lines and score entirely, with the rational that they stop being meaningful.

Making lines optional would be a bigger change in the data model, and it would prevent us from matching copyright to license findings, and probably have more implications.

However, I forgot that the scoreactually is optional already. Would you prefer to omit it instead of calculating the average?

I suspect these values are a bit more likely to exhibit some bit of flakyness at least across scancode version

More likely to be flaky than what, than the individual per-line findings? As the lines and score are calculated from the per-line findings, they can only be flaky if the per-line findings themselves are already flaky.

the result of the heuristic gets backed into the stored scan results and is thus very expensive to change.

Correct, but that's the way forward how ScanCode will operate, I guess. As this discussion shows, there seem to be no endeavors to enhance the quality of the per-line findings themselves, but to achieve better quality by post-processing these findings into the per-file finding.

Correct, but that's the way forward how ScanCode will operate, I guess. As this discussion shows, there seem to be no endeavors to enhance the quality of the per-line findings themselves, but to achieve better quality by post-processing these findings into the per-file finding.

I don't much about the processing inside of scancode, so I don't want to move this discussion off-topic,
but it's already interesting that this is actually a post-processing step of per-line findings.
If the only input of this post processing step is the a per-line findings, then it could
also / in the long run make more sense to have that postprocessing step applied by ORT (using ScanCodes logic), not by ScanCode (at least not in a single step) in order to make use of the per-line finding curations. The process could be:

Compute per line findings with ScanCode

Apply per line finding curations aka license finding curations

Take result from 2. as input for the post processing to compute the per-file findings using ScanCode's heuristic
feeding in the curated license findings.

I believe this would exhibit several advantages over the current solution.

Also, IMO it'd be nice that per line findings would no more be mutually exclusive from per file findings.
Curations would be possible on both levels of granularity.

@pombredanne - what do you think about this idea?

make more sense to have that postprocessing step applied by ORT (using ScanCodes logic)

I agree, if that's possible, and ScanCode does not require internal data for this that's not written to the result file.

However, I forgot that the scoreactually is optional already. Would you prefer to omit it instead of calculating the average?

@fviernau I still need you input here. I believe that's the last thing to clarify before we could merge?

I'd have slight tendency to omit it, but I also believe my knowledge is too limited in that area to make the call.
The reason I believe why its better to omit it, is because I believe the way the score would be defined would even harder to grasp and hardly be actionable. E.g. if an SPDX expression would consist of several 100% findings and a few 80% findings, the final score would have a relatively low score even though some parts of the expression apply a 100%. So,e.g. filtering out on score would not be feasible anymore, because one would have two different ways the score could have been computed.

So, all in all I prefer to omit it, but would be fine if you kept it. Up to you.

@mnonnenmacher what do you think about #8152 (comment) ?

fviernau · 2024-01-23T21:38:39Z

plugins/scanners/scancode/src/main/kotlin/ScanCode.kt

@@ -90,7 +97,14 @@ class ScanCode internal constructor(
            }
        }

-    override val configuration by lazy { config.commandLine.joinToString(" ") }
+    override val configuration by lazy {


Adding the per file findings as entries into the text location based license findings which correspond to a line range an a specific file, also implies that the existing license finding curation mechanism becomes applicable to these per file entries.

So, from the perspective of creating the license finding curations, I wonder about the implication such as would we accept (e.g. in ort-config) also license finding curations which match such per file license findings?

would we accept (e.g. in ort-config) also license finding curations which match such per file license findings?

I think we should, mostly for two reasons:

there's hardly any negative impact compared to not doing so,

we should also accept curations for other supported scanners than ScanCode, and these other scanners could theoretically find exactly the file-level finding that ScanCode finds. If we'd accept such a curation for another scanner, we should accept it for ScanCode, too. (This somewhat reminds me of @mnonnenmacher's remark that our license finding curations should also include the scanner name and version.)

fviernau · 2024-01-23T21:41:30Z

Can you help me understand whether this PR changes the "cache key" of ORT scan results? So that stored scan results would not match anymore with the new ORT version? (I have in mind changes like moving the strip-root option)

sschuberth · 2024-01-24T07:38:45Z

Can you help me understand whether this PR changes the "cache key" of ORT scan results?

First of all, this PR does not change any behavior of ORT when merged, as the option is disabled by default (String?.toBoolean() only returns true if the string is "true", and returns false in all other cases, including null).

Secondly, as the "fake" --prefer-file-license option becomes part of the scanner configuration if ort.scanner.config.ScanCode.options.preferFileLicense is true, then yes, the "cache key" changes, and results will not be "mixed up" compared to not using that options. Phrased differently, enabling that option for the first time causes rescans.

sschuberth · 2024-01-24T07:40:28Z

I have in mind changes like moving the strip-root option

Ah, right, that would trigger rescans then, too. So I guess we should drop it, even if basically it's the right thing to do?

fviernau · 2024-01-24T08:57:34Z

Ah, right, that would trigger rescans then, too. So I guess we should drop it, even if basically it's the right thing to do?

I'm not sure which option we'd choose. An alternative could be to migrate the scan result storage entries somehow.

sschuberth · 2024-01-24T10:32:42Z

Ah, right, that would trigger rescans then, too. So I guess we should drop it, even if basically it's the right thing to do?

I'm not sure which option we'd choose. An alternative could be to migrate the scan result storage entries somehow.

I've anyway dropped the last fix(scancode): Move --strip-root to non-config options commit now to not make it delay the merge of this PR further.

Signed-off-by: Sebastian Schuberth <[email protected]>

Move the knowledge about the default configuration from `ScanCode` to `ScanCodeConfig`. This allows to make properties of the latter non-nullable and more strongly typed (lists of strings instead of unparsed strings). Give some related variables more appropriate names in that context. Signed-off-by: Sebastian Schuberth <[email protected]>

The chosen output does not impact the scanner results, so it should not be considered as part of scanner configuration. Signed-off-by: Sebastian Schuberth <[email protected]>

Make the output format option directly follow other command line options and add a commend that it needs to precede the result file path. To emphasize even more that the latter two belong together, put them into the same line of code. Signed-off-by: Sebastian Schuberth <[email protected]>

The output format was hard-coded, so the derived option name was constant. Simplify the logic by inlining the whole output format option. Signed-off-by: Sebastian Schuberth <[email protected]>

This should not be part of the public API. Signed-off-by: Sebastian Schuberth <[email protected]>

Make the property go first and group command line related code. Signed-off-by: Sebastian Schuberth <[email protected]>

This is a preparation for highlighting the different number of findings compared to when enabling a new scanner option introduced in the next commit. Signed-off-by: Sebastian Schuberth <[email protected]>

See [1] for discussions about the `detected_license_expression_spdx`, in particular that it "is not merely the accumulation of the underlying matches". Optionally making use of this file-level license aligns ORT's behavior with that of the Double Open Scanner (DOS), see [2], which is useful for comparison of results. [1]: aboutcode-org/scancode-toolkit#3458 [2]: https://github.com/doubleopen-project/dos/blob/616c582/apps/api/src/helpers/db_operations.ts#L55-L78 Signed-off-by: Sebastian Schuberth <[email protected]>

sschuberth · 2024-01-25T13:09:18Z

@oss-review-toolkit/kotlin-devs as there was no feedback on the topic of settings the score to null or keeping it at the average, I prefer to merge this as-is with the average to resemble what the Double Open Server (DOS) currently does.

sschuberth requested a review from a team as a code owner January 22, 2024 12:32

fviernau requested changes Jan 22, 2024

View reviewed changes

sschuberth force-pushed the scancode-per-file-findings branch from f787740 to 3eacafd Compare January 22, 2024 13:40

sschuberth requested a review from fviernau January 22, 2024 13:40

sschuberth commented Jan 22, 2024

View reviewed changes

sschuberth requested a review from mnonnenmacher January 22, 2024 18:18

sschuberth force-pushed the scancode-per-file-findings branch from 3eacafd to c348dfa Compare January 22, 2024 18:19

sschuberth force-pushed the scancode-per-file-findings branch from c348dfa to 716e715 Compare January 23, 2024 09:53

fviernau reviewed Jan 23, 2024

View reviewed changes

sschuberth force-pushed the scancode-per-file-findings branch 3 times, most recently from 0d397ff to 42d974d Compare January 24, 2024 07:34

sschuberth requested a review from fviernau January 24, 2024 07:41

sschuberth force-pushed the scancode-per-file-findings branch 6 times, most recently from 5162e4b to bfbda6c Compare January 24, 2024 18:04

sschuberth added 5 commits January 25, 2024 14:07

chore(scancode): Specify the timeout as a duration for convenience

9a35bd6

Signed-off-by: Sebastian Schuberth <[email protected]>

refactor(scancode): Disregard the output format in scanner configuration

ae20031

The chosen output does not impact the scanner results, so it should not be considered as part of scanner configuration. Signed-off-by: Sebastian Schuberth <[email protected]>

refactor(scancode): Inline the output format option

e6deb43

The output format was hard-coded, so the derived option name was constant. Simplify the logic by inlining the whole output format option. Signed-off-by: Sebastian Schuberth <[email protected]>

sschuberth added 4 commits January 25, 2024 14:07

refactor(scanner)!: Make commandLineOptions private

0046a2e

This should not be part of the public API. Signed-off-by: Sebastian Schuberth <[email protected]>

chore(scancode): Reorder functions for a better overview

b405929

Make the property go first and group command line related code. Signed-off-by: Sebastian Schuberth <[email protected]>

test(scancode): Also assert the number of license findings in a test

24e7163

This is a preparation for highlighting the different number of findings compared to when enabling a new scanner option introduced in the next commit. Signed-off-by: Sebastian Schuberth <[email protected]>

sschuberth force-pushed the scancode-per-file-findings branch from bfbda6c to ead5b94 Compare January 25, 2024 13:07

sschuberth requested a review from a team January 29, 2024 10:19

fviernau approved these changes Jan 29, 2024

View reviewed changes

sschuberth merged commit 4af6360 into main Jan 29, 2024
22 checks passed

sschuberth deleted the scancode-per-file-findings branch January 29, 2024 12:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add an option to prefer ScanCode's file- over line-level findings #8152

Add an option to prefer ScanCode's file- over line-level findings #8152

sschuberth commented Jan 22, 2024

sschuberth Jan 22, 2024

codecov bot commented Jan 22, 2024 •

edited

Loading

fviernau Jan 23, 2024

sschuberth Jan 24, 2024

fviernau Jan 24, 2024

sschuberth Jan 24, 2024

fviernau Jan 24, 2024 •

edited

Loading

sschuberth Jan 24, 2024

sschuberth Jan 24, 2024

fviernau Jan 24, 2024

fviernau Jan 24, 2024

fviernau Jan 23, 2024

sschuberth Jan 24, 2024

fviernau commented Jan 23, 2024

sschuberth commented Jan 24, 2024

sschuberth commented Jan 24, 2024

fviernau commented Jan 24, 2024

sschuberth commented Jan 24, 2024

sschuberth commented Jan 25, 2024

Add an option to prefer ScanCode's file- over line-level findings #8152

Add an option to prefer ScanCode's file- over line-level findings #8152

Conversation

sschuberth commented Jan 22, 2024

Choose a reason for hiding this comment

codecov bot commented Jan 22, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fviernau Jan 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fviernau commented Jan 23, 2024

sschuberth commented Jan 24, 2024

sschuberth commented Jan 24, 2024

fviernau commented Jan 24, 2024

sschuberth commented Jan 24, 2024

sschuberth commented Jan 25, 2024

codecov bot commented Jan 22, 2024 •

edited

Loading

fviernau Jan 24, 2024 •

edited

Loading