How to file ICLAs #125
The advantage of the file hash is that it automatically detects exact duplicates, and collisions are vanishingly rare with an appropriate choice of hash function. It also does not expose any PII, though it would be best not to publish the hash more widely than necessary. I think it would be worth exploring how to proceed on that basis, so here goes:

The ICLA should be filed as hash.pdf, with hash.asc alongside if necessary. There needs to be an index to relate the hash to the person. This index needs to contain at least Full Name, Public Name, email, asfId and project. There may need to be a separate extract containing the Public fields plus the hash for use by appropriately authorized groups. The index also needs a means of linking related ICLAs together. There would be no need to update the hash associated with an availid, as the chain could be followed easily if necessary (it won't be long).

On the face of it, this appears more complicated than the current filename-based solution, but that solution quickly becomes complicated (and not easily automated) when name duplicates occur.
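To make the proposal concrete, here is a minimal sketch of hash-based filing with an index. It is only an illustration under stated assumptions: SHA-256 stands in for the "appropriate" hash function, and the names `icla_hash`, `IndexEntry`, `file_icla` and `public_extract` are hypothetical, as is the choice of which fields count as public.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


def icla_hash(pdf_bytes: bytes) -> str:
    """Return the hex digest used as the filename stem (assumption: SHA-256)."""
    return hashlib.sha256(pdf_bytes).hexdigest()


@dataclass
class IndexEntry:
    # Fields named in the proposal; the Python names are illustrative only.
    full_name: str
    public_name: str
    email: str
    asf_id: Optional[str]
    project: Optional[str]
    # Hash of an earlier, related ICLA from the same person, if any.
    previous: Optional[str] = None


def file_icla(index: Dict[str, IndexEntry], pdf_bytes: bytes,
              entry: IndexEntry) -> Tuple[str, bool]:
    """Add an ICLA to the index; return (hash, True) if new, (hash, False) if duplicate."""
    h = icla_hash(pdf_bytes)
    if h in index:
        # Exact duplicate: same bytes, same hash, nothing to file.
        return h, False
    index[h] = entry
    return h, True


def public_extract(index: Dict[str, IndexEntry]) -> Dict[str, Dict[str, Optional[str]]]:
    """Build the separate extract: hash plus the fields assumed here to be public."""
    return {
        h: {"public_name": e.public_name, "asf_id": e.asf_id, "project": e.project}
        for h, e in index.items()
    }
```

In this sketch the file would be stored as `<hash>.pdf`, and the extract deliberately omits the full name and email.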
There are some issues to be sorted out:
The hash naming convention can co-exist if necessary with the current files.
On Aug 12, 2021, at 6:40 AM, sebbASF <***@***.***> wrote:
> There are some issues to be sorted out:
> how to handle ICLAs consisting of multiple parts (e.g. JPEGs)

Current practice is to convert multiple bitmap formats into a single pdf before filing. Some historic entries can be converted now.

> how to handle ICLAs which need to be rotated: hash before or after rotation (or even store both?)

What's important is the content, which is a pdf representation of the document. I don't think it matters what the original format was, e.g. multiple pages of a single document, or a single-page or multi-page jpg or gif. Whatever the format as received in email, the pdf version (rotated to be human readable if necessary) is what we file, and we don't need to keep a historic record of the original in the iclas/ directory.

Craig

> The hash naming convention can co-exist if necessary with the current files.
> Existing filename stems all contain at least one hyphen, and are very unlikely to have the same format as the hash.
> However I would expect all the files to be migrated eventually to the new convention.
Craig L Russell
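The quoted remark that existing filename stems all contain at least one hyphen suggests the two conventions could be told apart mechanically during a migration. A rough sketch, assuming the hash stem is a lowercase SHA-256 hex digest (64 hex characters); `classify_stem` is a hypothetical helper, not an existing tool:

```python
import re

# Assumption: hash stems are lowercase SHA-256 hex digests (64 hex characters).
HASH_STEM = re.compile(r"[0-9a-f]{64}")


def classify_stem(stem: str) -> str:
    """Classify a filename stem as 'hash', 'legacy' (name-based), or 'unknown'."""
    if HASH_STEM.fullmatch(stem):
        return "hash"
    if "-" in stem:
        # Existing name-based stems all contain at least one hyphen.
        return "legacy"
    return "unknown"
```

A hex digest never contains a hyphen, so under these assumptions the two categories cannot overlap.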
The reason for wanting to keep the original hash is to detect exact duplicates. When the content is adjusted (e.g. rotated or converted to pdf), it will generally not get the same hash. We don't have to keep the original file itself (it should be in the mail archive).
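One way to picture this is to record the hash of the attachment as received alongside the hash of the filed pdf, so that a resend of the same attachment is recognised even though the filed copy was adjusted. This is only a sketch: SHA-256 and the names `sha256_hex` and `IclaHashes` are assumptions made for illustration.

```python
import hashlib
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    """Hex digest of the given bytes (assumption: SHA-256 is the chosen hash)."""
    return hashlib.sha256(data).hexdigest()


class IclaHashes:
    """Maps the hash of each attachment as received to the hash of the filed pdf."""

    def __init__(self) -> None:
        self._filed_by_original: Dict[str, str] = {}

    def register(self, received: bytes, filed: bytes) -> str:
        """Record a newly filed ICLA and return the hash it is filed under."""
        filed_hash = sha256_hex(filed)
        self._filed_by_original[sha256_hex(received)] = filed_hash
        return filed_hash

    def find_duplicate(self, received: bytes) -> Optional[str]:
        """Return the filed hash if this exact attachment was seen before, else None."""
        return self._filed_by_original.get(sha256_hex(received))
```

Only the original hash is kept in this sketch, not the original bytes, in line with relying on the mail archive for the original.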
This is an interesting topic and approach. One question to consider is whether there should be a way to submit an ICLA other than a contributor emailing secretary@. Perhaps there could be a web form somewhere that handles the submission of the ICLA. This could be followed by the secretary handling any linkage / "deduplication" through assignment of an Apache ID. The web form could have the same information as the pdf/odt form, only it could actually validate proper project names and Apache IDs. There would be three cases:
A public web form would need some type of rate limiting / queueing to prevent abuse. I agree that naming the files with a hash protects PII. There would need to be some type of index building/validation which would include the metadata for existing/older ICLA files.
@dave2wave That is an entirely separate issue from how ICLAs are filed. Please start a separate issue for alternative ways to submit ICLAs. As to metadata for the existing ICLA files, I was thinking of an index file with primary key of the hash, but equally it could contain older data using the filename stem as the key.
It would probably be worth recording the following email meta-data as well:
This would make it easier to link back to the original email. Possibly even consider treating the mail archive as the storage, i.e. not committing the ICLA to SVN.
ICLAs are currently filed using a filename stem based on the full name.
This approach increasingly suffers from collisions; it also has the potential to expose PII.
We need to find a different identifier where the likelihood of collisions is very small.
Unfortunately humans don't have a unique immutable identifier, at least not one which is likely to be accessible to us.
So some other ID needs to be found.
Some possibilities to consider:
Assuming two people don't share the same email address, all the above should be collision-free for distinct people.
Any others?
Note that whilst the above IDs will uniquely identify an ICLA, additional ICLAs from the same person will generally have different IDs (email may or may not be the same). This is also true of the full name used in ICLAs: apart from ICLAs which are sent to record a change of name, we sometimes get ICLAs with a different spelling of a name, or with changes to the given names.
The current approach for replacement ICLAs is to create a directory and store all the ICLAs in the same directory.
If email address is used, something similar would be needed.
For the other IDs, the list of files in a directory would need to be replaced with a list of IDs in the index.
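As an illustration of replacing a directory of files with linked IDs in the index, the sketch below follows "previous ICLA" links from the most recent ID back to the first. The mapping layout and the name `icla_chain` are hypothetical; the index format itself is still to be decided.

```python
from typing import Dict, List, Optional


def icla_chain(previous: Dict[str, Optional[str]], latest: str) -> List[str]:
    """Return the IDs of all related ICLAs, newest first.

    `previous` maps each ICLA ID to the ID of the ICLA it replaces
    (None for a person's first ICLA).
    """
    chain: List[str] = []
    seen = set()
    current: Optional[str] = latest
    # Stop at the first ICLA, or if a malformed index contains a loop.
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = previous.get(current)
    return chain
```

The list returned here plays the role that the directory listing plays in the current approach.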