Checksum Support (MD5, CRC32, SHA1, etc.) #7
I'd like to exclude it from the core protocol for now, but I'll leave this ticket open. If a lot of people think it's important, we should consider adding it. That being said: Verification can certainly be done on top of the protocol by passing custom headers along with the requests.
#8 had some initial ideas on using a
Something like an optional 'Accept-Checksum: md5,crc32' request header, either in the initial POST or in each PUT, and a corresponding Checksum header in the response to each data chunk. The checksum algorithm should be negotiated, but something like md5 should probably be recommended. The client should be in charge of deciding whether a resend is required.
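As a rough sketch only (the `Accept-Checksum` and `Checksum` header names come from the proposal above and are not part of any spec), the exchange might look like:

```
PUT /files/example HTTP/1.1
Accept-Checksum: md5,crc32
Offset: 0
Content-Length: 1048576

[1 MB of data]

HTTP/1.1 200 OK
Checksum: md5 5e543256c480ac577d30f76f9120eb74
```

The client would compare the returned checksum against its own md5 of the bytes it just sent and decide whether to resend the chunk.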
You should absolutely make this a recommended, core part of the protocol. Of course, all the provided servers/clients should use it. File corruption is insidious and sneaks in everywhere. I wish we had an end-to-end default checksum algorithm we could use, but as it is... if speed is a concern, just specify crc64. (EDIT: As per sandfox, Content-MD5 already exists for this purpose.) A checksum tree is an interesting potential enhancement, but would only be worthwhile for the very largest uploads. It'd be great for those, though.
For reference purposes, a gist of the IRC conversation about this (mostly involving @Baughn)
If you really want to avoid a 2.0 version, you really should add some sort of optional support. There are some ideas in the IRC log, and I second that it would help the protocol to get accepted.
@Baughn do you have a good link for the checksum trees you mentioned? @noptic it's happening! While doing some thinking on this subject, I also tried to find out what guarantees are given when plain TCP/IP with no additional checksums is used. It's a complex subject, but for anybody who is interested:
The tl;dr is: TCP checksums are quite weak by modern standards, but layer 2 technologies such as Ethernet/WiFi/PPP have their own CRC checksums and will drop bad frames, which TCP will detect and retransmit. So in order for a TCP frame with corrupted data to reach our BSD sockets, both the weak and the stronger mechanism have to fail at the same time - which should be quite rare. (Unfortunately I don't have time to quantify "quite rare" further; it would make a nice little math side project.) Cheers,
Working on a "contractual exchange of high value assets" system (~$50 * 10^12/year) we did the maths and figured that on worst-case noisy lines you might see a 2+-bit error (all single-bit errors will be caught by the TCP checksum) once in 10^12 bytes (1 terabyte transferred) on a raw socket, so we added an extra 32-bit checksum per binary message (~100-1000 bytes, multiple messages per TCP packet, and they can cross packet boundaries). We figured this pushed the error rate out to something close to 10^40, which is my unofficial "ain't never gonna happen" figure (cf. age of the universe in seconds ~ 10^17). Oh, and then we added IPSec (ESP+AH) to the parts of the network we controlled and SSL to the entire end-to-end comms. As you note, the checksums from other layers work too and they will each tend to catch bad data, but with big money changing hands we really didn't want to take chances.
@schmerg ❤️ - thank you so much for sharing this information! Most media files will probably not require the same care as the data in your system, but checksums should also prove useful for catching bugs in the software, so we'll certainly add support for them one way or another!
Your Stone and Partridge link has no href, but the paper is available here
@schmerg Thank you for sharing the analysis, you win github today! |
@felixge http://en.wikipedia.org/wiki/Hash_tree @schmerg Errors are rare, but they do happen. The problem with relying on link-layer checksums, however strong, is that errors can creep in between nodes; for instance, I once read about a router that would set one bit in the Nth byte of every packet, thereby causing SIP connections to fail in a weird manner. It was an epic piece of troubleshooting. The IP checksum failed to catch it, because it is recomputed at every node: the packet was received, its checksum verified, the data corrupted, and a new checksum set. That's why we really want end-to-end checksums.
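For illustration, a minimal hash-tree sketch in Go (the chunk contents and the choice of SHA-256 are arbitrary for the example, not anything proposed in this thread): each chunk is hashed on its own, and parent hashes are built from the concatenation of their children, so a single corrupted chunk can be located without re-hashing the whole file.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// merkleRoot builds a hash tree bottom-up from per-chunk hashes and
// returns the root hash. Keeping the leaf hashes around would let a
// client pinpoint exactly which chunk went bad instead of re-sending
// the whole file. Assumes at least one chunk.
func merkleRoot(chunks [][]byte) [32]byte {
	level := make([][32]byte, len(chunks))
	for i, c := range chunks {
		level[i] = sha256.Sum256(c) // leaf = hash of the chunk itself
	}
	for len(level) > 1 {
		var next [][32]byte
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i]) // odd node carried up unchanged
				continue
			}
			pair := make([]byte, 0, 64)
			pair = append(pair, level[i][:]...)
			pair = append(pair, level[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		level = next
	}
	return level[0]
}

func main() {
	chunks := [][]byte{[]byte("chunk-0"), []byte("chunk-1"), []byte("chunk-2")}
	fmt.Printf("root: %x\n", merkleRoot(chunks))
}
```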
I must agree with others and say that adding a CRC check is necessary, at least on the client side, so that it can be implemented server-side by those who wish to do so. One situation I can see it fixing is someone resuming an upload of a file that has changed. There are a few ways that could be resolved, but without checksums one would never know the data had changed, and the file could end up corrupted as the upload resumes. Another situation: what if two people resume uploading different files under the same name? Without checksums I don't see that being resolvable.
@andrewfenn For your last case, I think that should be handled by authentication, which is outside the scope of tus. The test server requires no authentication, true, but it's not meant to be fully representative. |
@andrewfenn In my proposal, with an optional (SHOULD) accept-checksum header from the client and an optional (SHOULD) checksum header from the server, the client is given the choice to handle mismatches as it wishes. And the server is allowed (though discouraged) not to implement checksumming.
Authentication wouldn't help if I upload two files called "cat.jpg" under the same account at the same time, then stop and resume them, or resume a changed file, etc. It's probably true that the server could handle those situations in other ways, by looking at the data coming in and comparing it to the partial files available; however, it would be much easier if I had a checksum submitted by the client to begin with, because then I could very quickly know which file is being referred to without having to do any server magic. A potential problem with checksums, however, is that in some cases you can't get a full checksum of a file you're uploading due to the file size, and in some browsers where I've implemented things similar to tus it can be quite slow.
@andrewfenn I'm not sure the server should resume a file based on the filename alone. If the client hasn't got the resource location, the recovery should at least be based on file length and some checksum of the file, shouldn't it? |
Personally, I'm inclined to think the upload <-> file-on-filesystem mapping
I'm just going by what I'm reading on http://www.tus.io/protocols/resumable-upload.html, so apologies if I've glossed over some details on how file resuming is currently specced. However, assume we have two files that:
Given this situation, which could pretty easily happen, I don't see how you could possibly resume an upload without a file checksum, or there would be some file corruption. Is there anywhere I can read in more detail about how file resuming is supposed to work?
If the files have the same name, i.e. they're being uploaded to the same
Checksums should be per-resource, not per-upload; whichever diverges from
Ignore the part about two files; I made the wrong assumption that you can resume a file upload without knowing the file's resource on the server. On closer inspection I can see that's not what the spec is saying; however, what I said can still be a potential issue. If you know the file resource then yep, it shouldn't be a problem, except in the one case where the file has changed upon resume, which is common but forgivable as user error, so I understand if you don't want to build that use case into the spec. What I'd think is ideal is checksum support for letting the server make smarter decisions on where to tell the client to resume uploading. Otherwise you cannot guarantee that the file you're resuming hasn't changed. For that, though, the server needs some ability to tell the client that it wants checksum data for a given byte range, so that when I resume a file upload the client can send a checksum for that range and the server can give a verified response indicating that it has that chunk, or perhaps just a response telling the client what to do next, such as supply another checksum or resume uploading. This would allow a client and server to quickly double-check that the file being resumed is in fact the exact same file, or at least is the same up to the point the server received it.
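To make that concrete, a purely hypothetical handshake (none of these headers exist in tus; the `Want-Range-Checksum` and `Range-Checksum` names are invented for illustration) could run:

```
# client asks where to resume
HEAD /files/example HTTP/1.1

# server answers with the offset and asks for proof of the first 1 MB
HTTP/1.1 200 OK
Offset: 1048576
Want-Range-Checksum: md5 0-1048575

# client hashes bytes 0-1048575 of its local copy and resumes
PATCH /files/example HTTP/1.1
Offset: 1048576
Range-Checksum: md5 0-1048575 900150983cd24fb0d6963f7d28e17f72
```

If the range checksum doesn't match what the server computed over its partial file, it can reject the resume instead of silently corrupting the upload.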
Right. Hash trees.
@Baughn Agreed about the end-to-end - we had direct experience of that issue with bridges/routers that would corrupt the odd packet but put a new valid Ethernet checksum on the packet (and that 2nd link I posted, noahdavids.org, made the same point). Incidentally, at that previous job we originally worked out and planned for TCP errors every 10^9 bytes, but no one in the external review teams would believe such a figure, so rather than have to re-explain it every time we watered down the claim to 10^12, which was enough that they wouldn't argue about it occurring (in ~2001) but would agree that our extra end-to-end message checksums were a good idea. As for checksums on the whole file - there are checksums that can be computed on streams as you go (i.e. where you can resume the checksum operation as more data becomes available, typically used for checksumming streams where you don't know how much data you're going to get and don't want to buffer it up). I don't really know the project, I just came across a thread where I might be able to offer something, but I'd suggest you could do partial checksums every 4 KB/256 KB/1 MB depending on checksum size/cost, and so recognise when a file "went bad" and just resume from that point, while still having a full "did it make it to disk properly" checksum if that's what you really needed (we're getting close to re-implementing sliding-window protocols like ZMODEM over TCP at this stage). But I'm off into areas where I only have vague knowledge now, so I'll bow out before spreading misinformation.
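A small Go sketch of that per-block idea (the 4 KiB block size and CRC-32 are arbitrary choices for the example): checksum the stream in fixed-size blocks as it arrives, so a mismatch points at the first bad block and the transfer can resume from there.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/crc32"
	"io"
)

// blockChecksums reads a stream in fixed-size blocks and returns one
// CRC-32 per block. Comparing these against the sender's list tells
// you the first block that "went bad", which is where to resume from.
func blockChecksums(r io.Reader, blockSize int) ([]uint32, error) {
	var sums []uint32
	buf := make([]byte, blockSize)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			sums = append(sums, crc32.ChecksumIEEE(buf[:n]))
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return sums, nil // a short final block is fine
		}
		if err != nil {
			return sums, err
		}
	}
}

func main() {
	data := bytes.Repeat([]byte("0123456789abcdef"), 1024) // 16 KiB of sample data
	sums, _ := blockChecksums(bytes.NewReader(data), 4096) // 4 KiB blocks
	fmt.Println("blocks:", len(sums), "first crc32:", sums[0])
}
```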
Note: Not that it couldn't be resurrected... |
@kevinswiber I think that if the server and client negotiate a digest, and we strongly recommend implementing at least some md5 digest, the interop issue should be no problem. We could for instance require the server to provide md5 checksumming, but just strongly recommend the client to use it. Also, in the scheme I propose, the client can simply not ask for a digest, or ignore the checksum returned, if complexity is an issue. That's the client's loss.
Just to add my 2 cents here: I was working on a file upload thingy which handled files up to 50 GB, and the server had corrupt RAM which set one bit every 32 bytes or so. We uploaded data in blocks of 1 MB, with a checksum for each block, and a checksum for the entire upload. And I am glad we checksummed each block, because then we knew the uploaded files were corrupt and could not be used. (Took a while to figure out the RAM issue, but that is another story.) So I would like to see checksum support for both individual "blocks" and for a completely uploaded file become part of the protocol, in the sense that: (1)
and (2)
@ChristianUlbrich Great idea, I am looking forward to your feedback. |
After thinking a bit more about this, the extension should address the following points:
The only thing missing right now is the appropriate status code.
I don't think there's anything wrong with using the same status code. Clients should handle all 4xx in a similar manner, probably just displaying a different error. As far as I know, there's nothing wrong in returning different status strings, is there? I mean
@qsorix Although it is totally possible to send custom status texts, I'm not sure if that's allowed by the HTTP spec. Anyway, I would consider sending a status text other than the official one bad behaviour. The best unused status code may be
What's the problem with sending
I've googled a little and found http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1.1 which says "The reason phrases listed here are only recommendations -- they MAY be replaced by local equivalents without affecting the protocol." and "The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason- Phrase." So the status text is just a debugging aid. The question then is: do we want clients to be able to react differently to the two types of errors? "HTTP status codes are extensible. HTTP applications are not required to understand the meaning of all registered status codes, though such understanding is obviously desirable. However, applications MUST understand the class of any status code, as indicated by the first digit, and treat any unrecognized response as being equivalent to the x00 status code of that class" To me this means: we're free to pick our own codes if we want to. It will be within the intended use. We can specify them so tus clients can react differently to different errors. Or they can display a generic error message, deciding they just got something from the 4xx class. Of course, if there is already a code that matches our need, we should obviously reuse it. In this case, it feels like there isn't.
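In client code, that class-based fallback might look like the following Go sketch (the actions and the 409 mapping are hypothetical placeholders, not from any tus client):

```go
package main

import "fmt"

// handleStatus maps a response code to a client action. Unrecognized
// codes fall back to the x00 behaviour of their class, as the quoted
// RFC text requires.
func handleStatus(code int) string {
	switch code {
	case 409:
		return "resync offset and retry" // known, specified behaviour
	default:
		switch code / 100 {
		case 4:
			return "client error: report and stop"
		case 5:
			return "server error: retry later"
		}
	}
	return "unexpected status"
}

func main() {
	fmt.Println(handleStatus(409), "/", handleStatus(460))
}
```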
First of all, I want to thank you for this research, it's really appreciated. 👍
Yes, but it's too generic. It's like, "oh, you got one out of tens of things wrong but I won't tell you which" :).
That's great news but as I said support for this is too weak as far as I know.
+1, I will have a look at which 4xx codes are used rarely (you can't find one which has never been used) and then come back.
@Acconut, once you find the numbers...
I think the proper way to use the text will be to specify the number we return in case of an error, and encourage implementations to (if possible) add a text that explains what tus means by returning this error. As an example: an implementation MUST use 409 to indicate a mismatched offset, and SHOULD return a matching reason phrase. Having known numbers lets clients automate recovery, e.g. offer an option to automatically retry an upload on a failed checksum. Reason phrases will only serve to assure developers that they're doing the right thing, and help them memorize which code means what.
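For instance (409 Conflict is a real code; the second line shows a hypothetical tus-specific code and phrase, purely to illustrate the idea):

```
HTTP/1.1 409 Conflict            <- mismatched offset, automatable recovery
HTTP/1.1 460 Checksum Mismatch   <- hypothetical: the number is not yet chosen
```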
That's the point of reusing 409. As suggested in the retry extension, a client MAY retry on 4xx. After some quick research, I could think of using
Citing the current HTTP/2 draft:
Since HTTP/2 is the future, I would vote against a custom reason phrase and instead rely on a custom status code.
@Acconut good catch! |
@qsorix What's your opinion about "forcing" one algorithm vs. letting the client decide? |
@Acconut, I think the rationale you gave for sticking to md5 is good, in the sense that we try to protect against Murphy, not Machiavelli. I'm fine with md5 and you can stop reading here. If all we care about is byte integrity, CRC32 uses less CPU. But who cares? I can't tell from experience, but I guess you would need a really massive network link to notice a difference in CPU usage when switching between two different checksum algorithms. All major languages offer libraries or come with built-in functions to calculate CRC, md5, sha1, sha256 and more. So if in practice there's no CPU cost, why not pick a checksum that is not yet known to be vulnerable to collision attacks, like SHA-2 or SHA-3? Embedded devices? If we care about those, better pick CRC or md5: fewer bytes of memory needed for the algorithm's code. So... because I think people who are really serious about security will have their own opinion and won't be afraid to break the rules of the protocol, extend it somehow for their needs, or just use HTTPS, I don't consider that an issue. Picking between CRC and md5, I just have a gut feeling that md5 may be more widely supported by the standard libraries of various languages. So yeah, I think md5 is just fine.
Great! :) Here's what I came up with: https://gist.github.com/Acconut/bbe3838bb04d90845ba8#file-checksum-md One problem we may run into with using trailers: Any opinion on this? |
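For anyone following along, a chunked PATCH with the checksum delivered as a trailer would look roughly like this (the `Checksum` trailer name and its value format are illustrative, not necessarily what the gist specifies):

```
PATCH /files/example HTTP/1.1
Transfer-Encoding: chunked
Trailer: Checksum
Offset: 0

400
[1024 bytes of data]
0
Checksum: md5 5e543256c480ac577d30f76f9120eb74

```

The appeal of a trailer is that the checksum can be computed while the body streams out; the risk, as noted above, is that not every server stack or proxy preserves trailers.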
@qsorix Thanks for your feedback. Sadly GitHub doesn't notify you when your gists are commented on, so I only saw it just now. In the future I would be pleased to see your feedback in the corresponding issues. Anyway, I updated the draft following your recommendations and a discussion we had internally. It now mentions the
@Acconut, thanks for the tip. I didn't know about notifications and gists. Comments on the latest revision. The text mentions I don't think we can use
Yup, good point. I agree it makes sense for the support to be explicitly announced. You've proposed that Yeah, actually chunked transfer need not be announced. If the server doesn't accept it, it will simply reject it. We can just say that if a server supports chunked transfers, it MUST support trailers. What do you think?
Thanks, that was a mistake.
I meant to use
I disagree. Firstly, nearly all HTTP clients and servers support chunked encoding since it's required by HTTP 1.1 (see RFC 2616 section 3.6.1):
And secondly, it's out of tus' scope to deal with detecting chunked transfer coding since the HTTP spec already deals with that. In addition, if you send a chunked HTTP 1.1 request to an HTTP 1.0 compatible server, it must respond with
I would not say that. An HTTP server framework may support chunked transfers but not trailers (e.g. I'm not sure if PHP supports trailing headers, but I may be wrong). Another example would be proxies which (accidentally) remove trailers.
Ah, OK. In that case, I agree with the way this header is used.
Why counterproductive? The checksum of the entire file is exactly what I personally need in my application. I want some means of verifying that, after all the resumes that happened, I got the content right. The server will receive the entire file anyway, so what's the problem with needing the entire file? The server doesn't need to keep the whole content all the time: MD5 can be calculated over a stream, so the server just needs to keep MD5's calculation state in the meantime.
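A minimal Go sketch of that streaming idea (nothing tus-specific here; the `uploadState` type is invented for the example): the server keeps one hash.Hash per upload and feeds every received chunk into it, so the whole-file digest falls out at the end without buffering the content.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"hash"
)

// uploadState holds the running MD5 for one upload. As long as the
// process stays alive, chunks can arrive over time and the final
// digest is available once the last byte has been written.
type uploadState struct {
	sum hash.Hash
}

func newUploadState() *uploadState {
	return &uploadState{sum: md5.New()}
}

func (u *uploadState) Write(chunk []byte) {
	u.sum.Write(chunk) // hash.Hash.Write never returns an error
}

func (u *uploadState) Digest() string {
	return fmt.Sprintf("%x", u.sum.Sum(nil))
}

func main() {
	u := newUploadState()
	u.Write([]byte("first chunk of the file, "))
	u.Write([]byte("second chunk of the file"))
	fmt.Println("md5:", u.Digest())
}
```

Note that this running state lives only in process memory, which is exactly the persistence problem raised a few comments below.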
I will try to explain it using the following example:
Now the client is not able to re-upload the second 100 bytes (see #51) and the only solution is to abort the current upload and start a new one from scratch. Now imagine uploading 10GB again because of one byte.
In a lot of languages/frameworks, storing the state is not easily achievable since you have to store it persistently. An upload may be resumed after days, leading to a memory leak if you keep the state in memory.
Nice example, thanks for the explanation. I totally agree. I was not clear in my last comments, so let me clarify: I am not against using a checksum for each single message. However, in addition to that, I need a checksum of the entire file. I got confused before because I thought you were looking for a name for the latter checksum, and that wasn't the case.
However, now that I understand your intentions better, there's one thing I didn't get. If a client attempts to send the entire file with a single PATCH request, that PATCH request will include
Well... there's already a lot of data you need to store persistently anyway to handle the upload: URLs, offsets, expected size, optionally metadata, etc. So storing a few more values is not a game-changer.
If you provide a checksum for every chunk you upload, why do you need an additional one for the total file?
Everything you said in that paragraph is correct. Checksums slow the upload process down by design. I personally prefer to split a file into multiple chunks, allowing me to upload them in parallel and to re-upload a single one. The spec advises acting differently, but it was written before the principles of checksums and parallel uploads were introduced to tus. We may change this sentence in 1.0.
Yes, it's not about the ability to store but about obtaining the underlying data (in the case of MD5, the four buffers). As far as I can tell from my experience, you are not able to store an MD5 (or any other hashing function's) state and resume it later in Golang and Node.js (speaking about the standard library; third-party packages may introduce this functionality, but we should not rely on that):
To catch bugs. In my environment there are many teams implementing clients. The lesson learned is that, on the server side, you should never assume that the client works, because given so many chances, someone eventually implements something wrong. Having an extra checksum to verify that the combined chunks are really what they were intended to be is valuable. It's actually quite easy to make an off-by-one mistake when calculating offsets, etc. Yes, you can argue that an individual checksum for each piece, accompanied by the offsets of all pieces, is enough to ensure the transferred data was not corrupted. Agreed. However, having a single number is simpler and more practical in day-to-day life at an office, where it is good to be able to quickly spot and isolate problems. I could start coming up with crazy scenarios, e.g. disk corruption, that would also require complete checksums, but in my case it is much more likely I'll have a problem due to software bugs.
OK. I cannot stop you here ;-). TUS, IMHO, needs to be focused on big bandwidths and huge files, because that's the Internet today, and in such an environment chunked and/or parallel uploads make sense and come at a small cost. My deployment consists of embedded devices with limited flash memory and slow network connections. Small files. I'm interested in TUS because it allows resumable uploads in the simplest way I could imagine, which means only a few kilobytes are needed to store the algorithm on a device. Things like parallel uploads are of no interest to me because I cannot afford them anyway. My comments are biased towards keeping TUS simple, because otherwise it may turn out that, in the end, this is not the protocol I want to use. ;)
The case you explained makes sense for your project, so feel free to extend tus with specific extensions and corner cases to match your application. I try to make the protocol flexible and powerful while not being too specific and restrictive, which is a hard catch. Anyway, before I position myself on one side about a single checksum for the entire file, I have a few questions:
While, in fact, you can't stop me, I honour every contributor's opinion. It's not about building the protocol which suits my needs best, but about making something which works for nearly everyone :)
I'll answer the questions as I see fit for my case, but I understand that "it depends" is often the right answer ;)
The file creation request was my first idea (#7 (comment)), because it keeps all metadata together and therefore seems right. Later I got convinced that it doesn't matter that much in my case, while having it on the PATCH makes streamed uploads easier.
If it is a checksum for the entire file, it mustn't change.
In my case, I don't want to process a corrupted file. It needs to be re-uploaded. |
Hmm... I am very unsure about this. |
Yeah, I'm cool with using metadata for my purposes. Thanks for the discussion @Acconut . |
Great to hear that. I have to thank you for the ongoing feedback, too. :) |
See #53. |
Merged. |
Just a vague idea I'm throwing out here:
Should there be some mechanism for verifying the upload contents somewhere in the process? E.g. supplying an md5 hash of the file when responding to a HEAD request? Or is this outside the scope of the protocol and something that should be left as an option for implementations to handle if they want to?