[RFC] Scans of License Texts, "is_license_text" plugin related

While working on [this issue](https://github.com/nexB/scancode-toolkit/issues/863), I scraped and collected license texts and ran the ScanCode license scan on them. I found that some of these files, even though they consist entirely of license text, do not have the "is_license_text" value (set by the plugin) as true.

The plugin works as follows, quoting from its docstring:
```
    Set the "is_license_text" flag to true for at the file level for text files
    that contain mostly (as 90% of their size) license texts or notices.
    Has no effect unless --license, --license-text and --info scan data
    are available.
```
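
To make the 90% rule concrete, here is a minimal sketch of the check as I understand it from the docstring. It is not the plugin's actual code: the function names, the `threshold` parameter, and the idea of comparing the summed length of matched texts against the file size are my assumptions for illustration.

```python
# Hypothetical sketch of the "mostly license text" check described in the
# docstring. Names and the exact ratio computation are assumptions, not the
# real ScanCode implementation.

def license_text_ratio(file_size, matched_texts):
    """Return the fraction of the file covered by matched license text."""
    if file_size <= 0:
        return 0.0
    matched_length = sum(len(text) for text in matched_texts)
    return matched_length / file_size


def looks_like_license_text(file_size, matched_texts, threshold=0.9):
    """Return True if matched license text covers at least `threshold` of the file."""
    return license_text_ratio(file_size, matched_texts) >= threshold


# Made-up ASCII sample data, so character count equals byte size.
license_body = "Permission is hereby granted, free of charge. " * 20
preamble = "This page was downloaded from an example site. " * 5
full_text = preamble + license_body

# A file that is only the license body passes the check.
print(looks_like_license_text(len(license_body), [license_body]))

# The same license with extra unmatched text in front falls below 90%.
print(looks_like_license_text(len(full_text), [license_body]))
```

Under this reading, any unmatched text (a preamble, page furniture, or an untranslated section) lowers the ratio, and a license that is only partially matched would fall under the threshold even if the file contains nothing else.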

The affected files are:

[Free Art License 1.3.txt](https://github.com/nexB/scancode-toolkit/files/5083222/Free.Art.License.1.3.txt)
[GNU Lesser General Public License 3.0.txt](https://github.com/nexB/scancode-toolkit/files/5083223/GNU.Lesser.General.Public.License.3.0.txt)
[Lawrence Berkeley National Labs BSD Variant License (BSD-3-Clause-LBNL).txt](https://github.com/nexB/scancode-toolkit/files/5083224/Lawrence.Berkeley.National.Labs.BSD.Variant.License.BSD-3-Clause-LBNL.txt)
[Open Government Licence 1.0 (United Kingdom).txt](https://github.com/nexB/scancode-toolkit/files/5083225/Open.Government.Licence.1.0.United.Kingdom.txt)
[Open Government Licence 2.0 (United Kingdom).txt](https://github.com/nexB/scancode-toolkit/files/5083226/Open.Government.Licence.2.0.United.Kingdom.txt)
[Open Government Licence 3.0 (United Kingdom).txt](https://github.com/nexB/scancode-toolkit/files/5083227/Open.Government.Licence.3.0.United.Kingdom.txt)
[Open License 2.0 France.txt](https://github.com/nexB/scancode-toolkit/files/5083228/Open.License.2.0.France.txt)
[Quebec Free License - Permissive (LiLiQ-P) version 1.1.txt](https://github.com/nexB/scancode-toolkit/files/5083229/Quebec.Free.License.-.Permissive.LiLiQ-P.version.1.1.txt)
[University of Illinois - NCSA Open Source License.txt](https://github.com/nexB/scancode-toolkit/files/5083230/University.of.Illinois.-.NCSA.Open.Source.License.txt)
[X.Net License.txt](https://github.com/nexB/scancode-toolkit/files/5083231/X.Net.License.txt)

The scan results are in this file:

[false_is_lic_text.json.txt](https://github.com/nexB/scancode-toolkit/files/5083330/false_is_lic_text.json.txt)
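
For anyone who wants to reproduce the list, a small sketch of how the unflagged files can be pulled out of a scan result. It assumes the JSON has a top-level `files` list whose entries carry `path`, `type` and `is_license_text` keys; the helper name and the sample data are made up for illustration.

```python
# Hypothetical helper: list files in a ScanCode-style JSON result that were
# not flagged as license text. The key names are assumed, the data is made up.

def files_not_flagged(scan):
    """Return paths of file entries whose is_license_text flag is falsy."""
    return [
        entry["path"]
        for entry in scan.get("files", [])
        if entry.get("type") == "file" and not entry.get("is_license_text", False)
    ]


# Made-up sample standing in for a parsed scan result.
sample_scan = {
    "files": [
        {"path": "licenses", "type": "directory", "is_license_text": False},
        {"path": "licenses/a.txt", "type": "file", "is_license_text": True},
        {"path": "licenses/b.txt", "type": "file", "is_license_text": False},
    ]
}

for path in files_not_flagged(sample_scan):
    print(path)
```

With a real result, the dictionary would come from `json.load()` on the scan output file instead of the inline sample.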

Assuming this is a valid case, we would need to handle these files differently, since they are not detected easily.

Questions:
1. Could this be because there is some extra text alongside the license texts?
2. Even so, shouldn't they at least be detected as license files, since more than 90% of their content is license words?
3. Does this have anything to do with legalese words? Also, how often and in which cases do you update the legalese words, and what does that process look like?
