-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an option to prefer ScanCode's file- over line-level findings #8152
Conversation
f787740
to
3eacafd
Compare
if (preferFileLicense && file is FileEntry.Version3 && file.detectedLicenseExpressionSpdx != null) { | ||
licenseFindings += LicenseFinding( | ||
license = file.detectedLicenseExpressionSpdx, | ||
location = TextLocation( | ||
path = file.path, | ||
startLine = license.startLine, | ||
endLine = license.endLine | ||
startLine = licenses.minOf { it.startLine }, | ||
endLine = licenses.maxOf { it.endLine } | ||
), | ||
score = license.score | ||
score = licenses.map { it.score }.average().toFloat() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3eacafd
to
c348dfa
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #8152 +/- ##
=========================================
Coverage 67.16% 67.16%
Complexity 2049 2049
=========================================
Files 357 357
Lines 17158 17158
Branches 2461 2461
=========================================
Hits 11525 11525
Misses 4611 4611
Partials 1022 1022
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
c348dfa
to
716e715
Compare
), | ||
score = license.score | ||
score = licenses.map { it.score }.average().toFloat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you know the rational DOS use when coming up with an avg
score?
Also, I wonder in what way the CLI option --license-score
influences the per file license? (and whether this should become a command line config option).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you know the rational DOS use when coming up with an
avg
score?
@Etsija and @lamppu probably know better, but the average is pretty much the only metric that somewhat makes sense and is reasonably easy to calculate. Min. does not make sense, max. does not make sense.
Of course, you could do something "crazy" like only taking the scores of those line findings into account whose licenses are also contained in the file finding, and maybe even weight the score according to the number of lines matched, but that's much more work for questionable benefit.
I wonder in what way the CLI option --license-score influences the per file license?
Let's ask the author's: @pombredanne, does ScanCode's --license-score
option have an (implicit) impact on which licenses will be part of the per-file detected_license_expression(_spdx)
fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, you could do something "crazy" like only taking the scores of those line findings into account whose licenses are also contained in the file finding, and maybe even weight the score according to the number of lines matched, but that's much more work for questionable benefit.
Another alternative would be to omit lines
and score
entirely, with the rational that they stop being meaningful.
The reason I'm picking this up is, because (1) I suspect these values are a bit more likely to exhibit some bit of flakyness at least across scancode version (2) the result of the heuristic gets backed into the stored scan results and is thus very expensive to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative would be to omit
lines
andscore
entirely, with the rational that they stop being meaningful.
Making lines optional would be a bigger change in the data model, and it would prevent us from matching copyright to license findings, and probably have more implications.
However, I forgot that the score
actually is optional already. Would you prefer to omit it instead of calculating the average?
I suspect these values are a bit more likely to exhibit some bit of flakyness at least across scancode version
More likely to be flaky than what, than the individual per-line findings? As the lines and score are calculated from the per-line findings, they can only be flaky if the per-line findings themselves are already flaky.
the result of the heuristic gets backed into the stored scan results and is thus very expensive to change.
Correct, but that's the way forward how ScanCode will operate, I guess. As this discussion shows, there seem to be no endeavors to enhance the quality of the per-line findings themselves, but to achieve better quality by post-processing these findings into the per-file finding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, but that's the way forward how ScanCode will operate, I guess. As this discussion shows, there seem to be no endeavors to enhance the quality of the per-line findings themselves, but to achieve better quality by post-processing these findings into the per-file finding.
I don't much about the processing inside of scancode, so I don't want to move this discussion off-topic,
but it's already interesting that this is actually a post-processing step of per-line findings.
If the only input of this post processing step is the a per-line findings, then it could
also / in the long run make more sense to have that postprocessing step applied by ORT (using ScanCodes logic), not by ScanCode (at least not in a single step) in order to make use of the per-line finding curations. The process could be:
- Compute per line findings with ScanCode
- Apply per line finding curations aka license finding curations
- Take result from 2. as input for the post processing to compute the per-file findings using ScanCode's heuristic
feeding in the curated license findings.
I believe this would exhibit several advantages over the current solution.
Also, IMO it'd be nice that per line findings would no more be mutually exclusive from per file findings.
Curations would be possible on both levels of granularity.
@pombredanne - what do you think about this idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make more sense to have that postprocessing step applied by ORT (using ScanCodes logic)
I agree, if that's possible, and ScanCode does not require internal data for this that's not written to the result file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I forgot that the
score
actually is optional already. Would you prefer to omit it instead of calculating the average?
@fviernau I still need you input here. I believe that's the last thing to clarify before we could merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have slight tendency to omit it, but I also believe my knowledge is too limited in that area to make the call.
The reason I believe why its better to omit it, is because I believe the way the score would be defined would even harder to grasp and hardly be actionable. E.g. if an SPDX expression would consist of several 100% findings and a few 80% findings, the final score would have a relatively low score even though some parts of the expression apply a 100%. So,e.g. filtering out on score would not be feasible anymore, because one would have two different ways the score could have been computed.
So, all in all I prefer to omit it, but would be fine if you kept it. Up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mnonnenmacher what do you think about #8152 (comment) ?
@@ -90,7 +97,14 @@ class ScanCode internal constructor( | |||
} | |||
} | |||
|
|||
override val configuration by lazy { config.commandLine.joinToString(" ") } | |||
override val configuration by lazy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the per file findings as entries into the text location based license findings which correspond to a line range an a specific file, also implies that the existing license finding curation mechanism becomes applicable to these per file entries.
So, from the perspective of creating the license finding curations, I wonder about the implication such as would we accept (e.g. in ort-config) also license finding curations which match such per file license findings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we accept (e.g. in ort-config) also license finding curations which match such per file license findings?
I think we should, mostly for two reasons:
- there's hardly any negative impact compared to not doing so,
- we should also accept curations for other supported scanners than ScanCode, and these other scanners could theoretically find exactly the file-level finding that ScanCode finds. If we'd accept such a curation for another scanner, we should accept it for ScanCode, too. (This somewhat reminds me of @mnonnenmacher's remark that our license finding curations should also include the scanner name and version.)
Can you help me understand whether this PR changes the "cache key" of ORT scan results? So that stored scan results would not match anymore with the new ORT version? (I have in mind changes like moving the |
0d397ff
to
42d974d
Compare
First of all, this PR does not change any behavior of ORT when merged, as the option is disabled by default ( Secondly, as the "fake" |
Ah, right, that would trigger rescans then, too. So I guess we should drop it, even if basically it's the right thing to do? |
I'm not sure which option we'd choose. An alternative could be to migrate the scan result storage entries somehow. |
I've anyway dropped the last |
5162e4b
to
bfbda6c
Compare
Signed-off-by: Sebastian Schuberth <[email protected]>
Move the knowledge about the default configuration from `ScanCode` to `ScanCodeConfig`. This allows to make properties of the latter non-nullable and more strongly typed (lists of strings instead of unparsed strings). Give some related variables more appropriate names in that context. Signed-off-by: Sebastian Schuberth <[email protected]>
The chosen output does not impact the scanner results, so it should not be considered as part of scanner configuration. Signed-off-by: Sebastian Schuberth <[email protected]>
Make the output format option directly follow other command line options and add a commend that it needs to precede the result file path. To emphasize even more that the latter two belong together, put them into the same line of code. Signed-off-by: Sebastian Schuberth <[email protected]>
The output format was hard-coded, so the derived option name was constant. Simplify the logic by inlining the whole output format option. Signed-off-by: Sebastian Schuberth <[email protected]>
This should not be part of the public API. Signed-off-by: Sebastian Schuberth <[email protected]>
Make the property go first and group command line related code. Signed-off-by: Sebastian Schuberth <[email protected]>
This is a preparation for highlighting the different number of findings compared to when enabling a new scanner option introduced in the next commit. Signed-off-by: Sebastian Schuberth <[email protected]>
See [1] for discussions about the `detected_license_expression_spdx`, in particular that it "is not merely the accumulation of the underlying matches". Optionally making use of this file-level license aligns ORT's behavior with that of the Double Open Scanner (DOS), see [2], which is useful for comparison of results. [1]: aboutcode-org/scancode-toolkit#3458 [2]: https://github.com/doubleopen-project/dos/blob/616c582/apps/api/src/helpers/db_operations.ts#L55-L78 Signed-off-by: Sebastian Schuberth <[email protected]>
bfbda6c
to
ead5b94
Compare
@oss-review-toolkit/kotlin-devs as there was no feedback on the topic of settings the |
Please have a look at the individual commit messages for the details.