
Checksum Support (MD5,CRC32,SHA1,etc.) #7

Closed
sandfox opened this issue Apr 16, 2013 · 75 comments

@sandfox

sandfox commented Apr 16, 2013

Just a vague idea I'm throwing out here:

Should there be some mechanism for verifying the upload contents somewhere in the process? E.g. supplying an MD5 hash of the file when responding to a HEAD request?
Or is this outside the scope of the protocol, and something that should be left as an option for implementations to handle if they want to?

@felixge
Contributor

felixge commented Apr 17, 2013

I'd like to exclude it from the core protocol for now, but I'll leave this ticket open. If a lot of people think it's important, we should consider adding it.

That being said: Verification can certainly be done on top of the protocol by passing custom headers along with the requests.

@felixge
Contributor

felixge commented Apr 17, 2013

#8 had some initial ideas on using an Accept-Checksum header for this.

@cybic

cybic commented Apr 17, 2013

Something like an optional 'Accept-Checksum: md5,crc32' request header, either in the initial POST or in each PUT. And a following Checksum header in the response to each data chunk. The checksum algo should be negotiated, but something like md5 should probably be recommended.

The client should be in charge of deciding whether a resend is required.
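To make the proposed negotiation concrete, here is a minimal Go client sketch. The Accept-Checksum and Checksum header names come from the proposal above; the bare-hex value format is my own assumption, not anything specified:

```go
package main

import (
	"bytes"
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"net/http"
)

// putChunk uploads one chunk and compares the server-reported checksum
// against a locally computed one; resending is the client's decision.
func putChunk(url string, chunk []byte) error {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(chunk))
	if err != nil {
		return err
	}
	// Proposed request header: ask the server to checksum what it receives.
	req.Header.Set("Accept-Checksum", "md5,crc32")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	sum := md5.Sum(chunk)
	want := hex.EncodeToString(sum[:])
	// Proposed response header; the bare-hex value format is assumed here.
	if got := resp.Header.Get("Checksum"); got != "" && got != want {
		return fmt.Errorf("checksum mismatch (got %s, want %s): resend chunk", got, want)
	}
	return nil
}

func main() {
	if err := putChunk("http://localhost:1080/files/24e533e", []byte("example data")); err != nil {
		fmt.Println(err)
	}
}
```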

@Baughn

Baughn commented Apr 17, 2013

You should absolutely make this a recommended, core part of the protocol. Of course, all the provided servers/clients should use it.

File corruption is insidious, and sneaks in everywhere. I wish we had an end-to-end default checksum algorithm we could use, but as it is... if speed is a concern, just specify crc64. (EDIT: As per sandfox, Content-MD5 already exists for this purpose.)

A checksum tree is an interesting potential enhancement, but would only be worthwhile for the very largest uploads. It'd be great for those, though.

@sandfox
Author

sandfox commented Apr 17, 2013

For reference purposes: a gist of the IRC conversation about this (mostly involving @Baughn).

@noptic

noptic commented Apr 17, 2013

If you really want to avoid a 2.0 version, you should add some sort of optional support. There are some ideas in the IRC log, and I second that it would help the protocol get accepted.

@felixge
Contributor

felixge commented Apr 18, 2013

@Baughn do you have a good link for the checksum trees you mentioned?

@noptic it's happening!

While doing some thinking on this subject, I also tried to find out what guarantees are given when plain TCP/IP with no additional checksums is used. It's a complex subject, but for anybody who is interested:

The tl;dr is: TCP checksums are quite weak by modern standards, but layer 2 technologies such as Ethernet/Wifi/PPP have their own CRC checksums and will drop bad frames, which TCP will detect and retransmit. So for a TCP segment with corrupted data to reach our BSD sockets, both the weak and the stronger mechanism would have to fail at the same time, which should be quite rare. (Unfortunately I don't have time to quantify "quite rare" further; it would make for quite a little math side project.)

Cheers,
Felix

@schmerg

schmerg commented Apr 18, 2013

Working on a "contractual exchange of high value assets" system (~ $50 * 1012/year) we did the maths and figured worst case noisy lines you might see a 2+-bit error (all single bit errors will be caught by the TCP checksum) once in 1012 bytes (1 terabyte transferred) on a raw socket, so we added an extra 32-bit checksum per binary message (~100-1000 bytes, multiple messages per TCP packet, and they can cross packet boundaries).

We figured this pushed the error rate out to something close to 1040 which is my unofficial "ain't never gonna happen" figure (c.f. age of the universe in seconds ~ 1017).

Oh and then we added IPSec (ESP+AH) to the parts of the network we controlled and SSL to the entire end-to-end comms, and as you note, the checksums from other layers work too and they will each tend to catch bad data, but with big money changing hands we really didn't want to take chances.

@felixge
Contributor

felixge commented Apr 18, 2013

@schmerg ❤️ - thank you so much for sharing this information! Most media files will probably not require the same care as the data in your system, but checksums should also prove useful for catching bugs in the software, so we'll certainly add support for them one way or another!

@schmerg

schmerg commented Apr 18, 2013

Your Stone and Partridge link has no href, but is available here
http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf
and this is a handy summary of some of the thinking/conclusions
http://noahdavids.org/self_published/CRC_and_checksum.html

@seidtgeist

@schmerg Thank you for sharing the analysis, you win github today!

@Baughn

Baughn commented Apr 18, 2013

@felixge http://en.wikipedia.org/wiki/Hash_tree

@schmerg Errors are rare, but they do happen. The problem with relying on strong link-layer checksums is that errors can creep in between nodes; for instance, I once read about a router that would set one bit in the Nth byte of every packet, thereby causing SIP connections to fail in a weird manner. It was an epic piece of troubleshooting. The IP checksum didn't catch it, because it is recomputed at every node: the packet was received, the checksum verified, the data corrupted, and a new checksum set.

That's why we really want end-to-end checksums.

@andrewfenn

I must agree with others and say that adding a CRC check is necessary, at least on the client side, so that it can be implemented server-side by those that wish to.

One situation I can see it fixing is someone resuming an upload of a file that has changed. There are a few ways that could be resolved; however, without checksums one would never know the data had changed, and the file could end up corrupted as the upload attempts to resume.

Another situation: what if two people resume uploading different files under the same name? Without checksums I don't see that being resolvable.

@Baughn

Baughn commented Apr 18, 2013

@andrewfenn For your last case, I think that should be handled by authentication, which is outside the scope of tus. The test server requires no authentication, true, but it's not meant to be fully representative.

@cybic

cybic commented Apr 18, 2013

@andrewfenn In my proposal, with an optional (SHOULD) accept-checksum header from the client and an optional (SHOULD) checksum header from the server, the client is given the choice to handle mismatches as it wishes. And the server is allowed (though discouraged) not to implement checksumming.

@andrewfenn

Authentication wouldn't help if I upload two files called "cat.jpg" under the same account at the same time, then stop and resume them, or resume a changed file, etc.

It's probably true that the server could handle those situations in other ways, by looking at the data coming in and comparing it to the partial files available. However, it would be much easier with a checksum for the file submitted by the client to begin with, because then the server could very quickly know which file is being referred to without any server magic.

A potential problem with checksums, however, is that in some cases you can't get a full checksum of the file you're uploading due to its size; in some browsers where I've implemented things similar to tus, it can be quite slow.

@cybic

cybic commented Apr 18, 2013

@andrewfenn I'm not sure the server should resume a file based on the filename alone. If the client hasn't got the resource location, the recovery should at least be based on file length and some checksum of the file, shouldn't it?

@Baughn

Baughn commented Apr 18, 2013

Personally, I'm inclined to think the upload <-> file-on-filesystem mapping is outside the scope of the project entirely. I want to run tus with Bigtable as a backing datastore, which certainly doesn't have files; the approach I'm intending there is to buffer the file in memory, then store it under a row named after its checksum.


@andrewfenn

I'm just going by what I'm reading on http://www.tus.io/protocols/resumable-upload.html

So apologies if I've glossed over some details on how file resuming is currently specced; however, assume we have two files that:

  • have the same name
  • have the same length of bytes
  • start off with the same data bytes
  • data diverges within the middle of the file or towards the end

Given this situation, which could pretty easily happen, I don't see how you could possibly resume an upload without a file checksum; there would be some file corruption.

Is there anywhere I can read in more detail about how file resuming is supposed to work?

@Baughn

Baughn commented Apr 18, 2013

If the files have the same name, i.e. they're being uploaded to the same resource, I'm not sure what it might mean to "do this right" without rejecting one of the uploads.

Checksums should be per-resource, not per-upload; whichever upload diverges from the checksum set in the initial POST will be rejected. Not immediately, of course. Hash trees could tell us exactly where the damage is, but that's sounding like a Byzantine client, and probably not our problem.


@andrewfenn

Ignore the part about two files; I made the wrong assumption that you can resume a file upload without knowing the file's resource on the server. On closer inspection I can see that's not what the spec is saying; however, I believe what I said can still be a potential issue. If you know the file resource then yep, it shouldn't be a problem, except in the case where the file has changed upon resume, which is common but forgivable as user error, so I understand if you don't want to build that use case into the spec.

What I'd think is ideal is checksum support that lets the server make smarter decisions about where to tell the client to resume uploading. Otherwise you cannot guarantee that the file you're resuming hasn't changed.

For that, though, the protocol needs some way for the server to tell the client that it wants checksum data for a given range, so that if I resume a file upload the client can send a checksum for that range and the server can give a verified response indicating that it has that chunk, or just a response telling the client what to do next (give another checksum, or resume uploading). This would allow a client and server to quickly double-check that the file being resumed is in fact the exact same file, at least up to the point the server has received.
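A minimal Go sketch of the server side of that range check; the function, its parameters, and the hex format are hypothetical, just to make the flow concrete:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// rangeMatches hashes the stored bytes in [offset, offset+length) and
// compares them against the checksum the client claims for that range.
func rangeMatches(path string, offset, length int64, clientSum string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return false, err
	}
	h := md5.New()
	if _, err := io.CopyN(h, f, length); err != nil {
		return false, err // includes io.EOF if the stored file is shorter
	}
	return hex.EncodeToString(h.Sum(nil)) == clientSum, nil
}

func main() {
	ok, err := rangeMatches("partial-upload.bin", 0, 1024, "d41d8cd98f00b204e9800998ecf8427e")
	fmt.Println(ok, err)
}
```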

@Baughn

Baughn commented Apr 18, 2013

Right.

Hash trees.

@schmerg

schmerg commented Apr 18, 2013

@Baughn Agreed about the end-to-end checksums - we had direct experience of that issue with bridges/routers that would corrupt the odd packet but put a new valid Ethernet checksum on it (and that 2nd link I posted, noahdavids.org, made the same point).

Incidentally, at that prev job we originally worked out and planned for TCP errors every 10^9 bytes, but no one in the external review teams would believe such a figure, so rather than re-explain it every time we watered down the claim to 10^12, which was enough that they wouldn't argue about it occurring (in ~2001), but would agree that our extra end-to-end message checksums were a good idea.

As for checksums on the whole file: there are checksums that can be computed on streams as you go, i.e. where you can resume the checksum operation as more data becomes available (typically used for checksumming streams where you don't know how much data you're going to get and don't want to buffer it all up). I don't really know this project, I just came across a thread where I might be able to offer something, but I'd suggest you could do partial checksums every 4kB/256kB/1MB depending on checksum size/cost, and so recognise when a file "went bad" and resume from just that point, while still keeping a full "did it make it to disk properly" checksum if that's what you really need (though we're getting close to re-implementing sliding-window protocols like zmodem over TCP at this stage). A sketch of this idea follows below.

But I'm off into areas that I only have vague knowledge about now... so I'll bow out before spreading misinformation.
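To make the partial-checksum idea concrete, a minimal Go sketch: per-block CRC32s that let you locate where a transfer "went bad", plus one rolling MD5 over the whole file. The block size and function names are illustrative only, not part of any tus spec:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"hash/crc32"
	"io"
	"os"
)

const blockSize = 256 * 1024 // one partial checksum per 256 KiB

// checksums returns a CRC32 per block (to locate a bad block and resume
// from there) plus a rolling MD5 over the whole stream.
func checksums(r io.Reader) (blocks []uint32, whole string, err error) {
	full := md5.New()
	buf := make([]byte, blockSize)
	for {
		n, rerr := io.ReadFull(r, buf)
		if n > 0 {
			blocks = append(blocks, crc32.ChecksumIEEE(buf[:n]))
			full.Write(buf[:n]) // hash.Hash.Write never returns an error
		}
		if rerr == io.EOF || rerr == io.ErrUnexpectedEOF {
			break // end of stream, possibly with a short final block
		}
		if rerr != nil {
			return nil, "", rerr
		}
	}
	return blocks, hex.EncodeToString(full.Sum(nil)), nil
}

func main() {
	f, err := os.Open("upload.bin")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	blocks, whole, err := checksums(f)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d blocks, full MD5 %s\n", len(blocks), whole)
}
```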

@felixge
Contributor

felixge commented Apr 18, 2013

Just a quick update: I think we need to nail #14 and #2 before we can work out how checksums will fit into things.

Other than that I really think we need to specify them in the protocol, but probably as a SHOULD rather than MUST.

@felixge felixge mentioned this issue Apr 18, 2013
@kevinswiber

Note: Content-MD5 is a suitable header for returning an MD5 checksum. However, there was a really old IETF draft that supported other algorithms under a Content-Digest header. Here: http://tools.ietf.org/html/draft-demailly-cd-header-00 . Alas, it was shit-canned; the reasons seemed to float around "harmful to interoperability" and "increased client complexity."

Not that it couldn't be resurrected...

@cybic

cybic commented Apr 18, 2013

@kevinswiber I think that if the server and client negotiate a digest, and we strongly recommend implementing at least an MD5 digest, the interop issue should be no problem. We could for instance require the server to provide md5 checksumming, but just strongly recommend the client to use it.

Also, in the scheme I propose, the client can simply not ask for a digest, or ignore the checksum returned if complexity is an issue. That's the client's loss.

@radiospiel

Just to add my 2 cents here: I was working on a file upload thingy which handled files up to 50 GB, and the server had corrupt RAM which flipped one bit every 32 bytes or so. We uploaded data in blocks of 1 MB, with a checksum for each block, and a checksum for the entire upload. And I am glad we checksummed each block, because then we knew the uploaded files were corrupt and could not be used. (Took a while to figure out the RAM issue, but that is another story.)

So I would like to see checksum support for both individual "blocks" and for a completely uploaded file become part of the protocol, in the sense that:

(1)

  • the client MAY request checksumming for each PUT/POST/PATCH by setting appropriate header(s), and
  • if the client requested checksumming, the server MUST verify the transferred checksum and respond with a 4xx status code on mismatch (a server-side sketch follows below).

and

(2)

  • the client MAY request a checksum for each successful upload
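A minimal, hypothetical server-side sketch of rule (1) in Go; the Upload-Checksum name and the "md5 <hex>" value format are made up for illustration, not taken from the spec:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"io"
	"net/http"
)

// patchHandler verifies a client-supplied chunk checksum before accepting
// the data, and rejects the chunk with a 4xx on mismatch.
func patchHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "read error", http.StatusInternalServerError)
		return
	}
	if c := r.Header.Get("Upload-Checksum"); c != "" {
		sum := md5.Sum(body)
		if c != "md5 "+hex.EncodeToString(sum[:]) {
			// Per rule (1): the server MUST reject the chunk on mismatch.
			http.Error(w, "checksum mismatch", http.StatusBadRequest)
			return
		}
	}
	// ... append body to the upload at the current offset ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/files/", patchHandler)
	http.ListenAndServe(":1080", nil)
}
```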

@qsorix qsorix mentioned this issue Jun 20, 2014
@Acconut
Member

Acconut commented Dec 23, 2014

@ChristianUlbrich Great idea, I am looking forward to your feedback.

@Acconut
Member

Acconut commented Jan 12, 2015

After thinking a bit more about this, the extension should address the following points:

  • Allow only one algorithm. This reduces flexibility but also complexity, and ensures that clients and servers understand the checksum. MD5 or SHA1 are good candidates since they are very popular.
  • This extension is not about security (use HTTPS for that) but about data integrity. Although you mustn't use MD5 for hashing passwords, it's good enough for our case.
  • Checksums are sent by the client and verified by the server. If the checksums don't match, the file won't be updated and an error will be returned.
  • The checksum may be sent in a header (TUS-Checksum or TUS-MD5-Checksum) but also as a trailer. This is especially handy when dealing with streaming uploads where the chunk is not known in advance.

The only thing missing right now is the appropriate status code. 409 Conflict would be a good match but will be used for illegal offsets (see #51).

@qsorix

qsorix commented Jan 12, 2015

The only thing missing right now is the appropriate status code. 409 Conflict would be a good match but will be used for illegal offsets (see #51).

I don't think there's anything wrong with using the same status code. Clients should handle all 4xx in a similar manner, probably just displaying a different error. As far as I know, there's nothing wrong with returning different status strings, is there? I mean 409 Illegal Offset and 409 Wrong Checksum.

@Acconut
Member

Acconut commented Jan 12, 2015

@qsorix Although it is totally possible to send custom status texts, I'm not sure that's allowed by the HTTP spec. Anyway, I would declare sending a status text other than the official one bad behaviour.
But the point I'm most concerned about is that some (if not most) HTTP client and server implementations don't support reading/writing custom status texts. So -1 from me for this approach.

The best unused status code may be 406 Not Acceptable, although this is meant for errors in conjunction with the Accept header (maybe TUS-Checksum is one of them?).

@qsorix

qsorix commented Jan 12, 2015

The best unused status code may be 406 Not Acceptable, although this is meant for errors in conjunction with the Accept header (maybe TUS-Checksum is one of them?).

What's the problem with sending 409 back in both cases? I think it is better than picking some other free number whose established meaning is further away from what we want. Essentially, the 4xx codes all indicate the same type of problem - the client's fault: "If you're going to retry, better change something in what you did the first time". If there isn't a code that matches our needs, we can just send back 400. We can also consider using some currently-unused number, if we really want a different code for a different error.

Although it is totally possible to send custom status texts, I'm not sure that's allowed by the HTTP spec

I've googled a little and found http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1.1 which says "The reason phrases listed here are only recommendations -- they MAY be replaced by local equivalents without affecting the protocol." and "The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase." So the status text is just a debugging aid.

The question then is: do we want clients to be able to react differently to the two types of errors?

"HTTP status codes are extensible. HTTP applications are not required to understand the meaning of all registered status codes, though such understanding is obviously desirable. However, applications MUST understand the class of any status code, as indicated by the first digit, and treat any unrecognized response as being equivalent to the x00 status code of that class"

To me this means we're free to pick our own codes if we want to. It will be within the intended use. We can specify them so TUS clients can react differently to different errors. Or they can display a generic error message, deciding they just got some 4xx. Of course, if there is already a code that matches our need, we should obviously reuse it. In this case, it feels like there isn't.

@Acconut
Member

Acconut commented Jan 13, 2015

First of all, I want to thank you for this research, it's really appreciated. 👍

If there isn't a code that matches our needs, we can just send back 400.

Yes, but it's too generic. It's like, "oh, you got one out of tens of things wrong but I won't tell you which" :).

"The reason phrases listed here are only recommendations -- they MAY be replaced by local equivalents without affecting the protocol."

That's great news but as I said support for this is too weak as far as I know.

Of course, if there is already a code that matches our need, we should obviously reuse it. In this case, it feels like there isn't.

+1, I will have a look at which 4xx codes are used rarely (you can't find one which has never been used) and then come back.

@qsorix

qsorix commented Jan 13, 2015

@Acconut , once you find the numbers...

That's great news but as I said support for this is too weak as far as I know.

I think the proper way to use the text is to specify the number we return in case of an error, and encourage implementations to (if possible) add the text that explains what TUS means by returning this error. As an example: an implementation MUST use 409 to indicate a mismatched offset, and SHOULD return 409 Illegal Offset. And it is fine if it returns 409 Conflict.

Having known numbers lets clients automate recovery, e.g. offer an option to automatically retry an upload on a failed checksum. Reason phrases only serve to reassure developers that they're doing the right thing, and help them memorize which code means what.

@Acconut
Member

Acconut commented Jan 14, 2015

@qsorix

Having known numbers lets clients automate recovery, e.g. offer an option to automatically retry an upload on a failed checksum.

That's the point for reusing 409. As suggested in the retry extension a client MAY retry on 4xx.

After some quick research, I could imagine using 460 Checksum Mismatch as our custom status code. It's not used by any "official" HTTP specification, and I wasn't able to find popular software which uses this code. I'm only concerned that proxies may have problems with custom status codes, but I will investigate.

@Acconut
Member

Acconut commented Jan 14, 2015

@qsorix

Citing the current HTTP/2 draft:

HTTP/2 does not define a way to carry the version or reason phrase that is included in an HTTP/1.1 status line.

Since HTTP/2 is the future I would vote against a custom reason phrase and instead rely on a custom status code.

@qsorix

qsorix commented Jan 14, 2015

@Acconut good catch!

@Acconut
Member

Acconut commented Jan 14, 2015

@qsorix What's your opinion about "forcing" one algorithm vs. letting the client decide?

@qsorix

qsorix commented Jan 14, 2015

@Acconut , I think the rationale you gave for sticking to md5 is good, in the sense that we try to protect against Murphy, not Machiavelli. I'm fine with md5 and you can stop reading here.

If all we care about is byte integrity, CRC32 uses less CPU. But who cares? I can't tell from experience, but I guess you would need a really massive network link to notice a difference in CPU usage when switching between two checksum algorithms.

All major languages offer libraries or come with built-in functions to calculate CRC, md5, sha1, sha256 and more.

So if in practice there's no CPU cost, why not pick a checksum that is not yet known to be vulnerable to collision attacks, like sha-2 or sha-3?

Embedded devices? If we care about those, better pick CRC or md5. Fewer bytes of memory needed for the algorithm's code.

So...

Because I think people really serious about security will have their own opinion and won't be afraid to break the rules of the protocol or extend it somehow for their needs, or just use HTTPS, I don't consider that an issue. Picking between CRC and MD5, I just have a gut feeling that md5 may be more widely supported by the standard libraries of various languages.

So yeah, I think md5 is just fine.
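If anyone wants to sanity-check the CPU-cost claim locally, here is a throwaway Go benchmark sketch (run with `go test -bench .`; numbers will of course vary by machine):

```go
package checksums

import (
	"crypto/md5"
	"crypto/sha256"
	"hash/crc32"
	"testing"
)

var buf = make([]byte, 1<<20) // 1 MiB; zeros are fine for throughput numbers

func BenchmarkCRC32(b *testing.B) {
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		crc32.ChecksumIEEE(buf)
	}
}

func BenchmarkMD5(b *testing.B) {
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		md5.Sum(buf)
	}
}

func BenchmarkSHA256(b *testing.B) {
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		sha256.Sum256(buf)
	}
}
```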

@Acconut
Member

Acconut commented Jan 15, 2015

Great! :)

Here's what I came up with: https://gist.github.com/Acconut/bbe3838bb04d90845ba8#file-checksum-md
I used the Content-MD5 header instead of a custom TUS-MD5-Checksum because it serves the exact same purpose.
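For illustration, a minimal Go sketch of a client setting Content-MD5 on a PATCH. Note that per RFC 1864 the header value is the base64 of the raw 16-byte digest, not a hex string (the helper name is mine):

```go
package main

import (
	"bytes"
	"crypto/md5"
	"encoding/base64"
	"fmt"
	"net/http"
)

// newPatch builds a PATCH request for one chunk with its Content-MD5 set.
// Per RFC 1864 the value is the base64 of the raw 16-byte digest.
func newPatch(url string, chunk []byte) (*http.Request, error) {
	req, err := http.NewRequest("PATCH", url, bytes.NewReader(chunk))
	if err != nil {
		return nil, err
	}
	sum := md5.Sum(chunk)
	req.Header.Set("Content-MD5", base64.StdEncoding.EncodeToString(sum[:]))
	return req, nil
}

func main() {
	req, err := newPatch("http://localhost:1080/files/24e533e", []byte("chunk data"))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Content-MD5"))
}
```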

One problem we may run into with using trailers:
Imagine a server reports that it supports the checksum header, but only as a header and not as a trailer.
Then the client sends the checksum as a trailer while the server thinks there is no checksum at all and doesn't verify it.

Any opinion on this?

@Acconut
Member

Acconut commented Jan 19, 2015

@qsorix Thanks for your feedback. Sadly GitHub doesn't notify you when your gists are commented on, so I only just saw it. In the future I would be pleased to see your feedback in the corresponding issues. Anyway, I updated the draft following your recommendations and a discussion we had internally. It now mentions the TUS-Extension header and includes a short description of what trailers are (see https://gist.github.com/Acconut/bbe3838bb04d90845ba8/d7778f8e7143109e7d9787a42785099acaa88a7e).

@qsorix

qsorix commented Jan 20, 2015

@Acconut, thanks for the tip. I didn't know about notifications and gists.

Comments on the latest revision: the text mentions Content-MD5, but the example uses TUS-MD5-Checksum.

I don't think we can use Content-MD5. Isn't it supposed to be a checksum of the message's body (i.e. checksum of the uploaded range, not of the whole file)? The introduction to https://www.ietf.org/rfc/rfc3230.txt confirms my worries... That RFC also, imho, solves the problem. Maybe just use that? Frankly, I haven't read it carefully nor completely, but it looks like we should read it :)

One problem we may run into with using trailers:
Imagine a server reports that it supports the checksum header, but only as a header and not as a trailer.
Then the client sends the checksum as a trailer while the server thinks there is no checksum at all and doesn't verify it.

Yup, good point. I agree it makes sense for the support to be explicitly announced.

You've proposed that checksum-trailer "subextension"; that's OK. I'm also thinking about instead announcing that a server supports chunked transfers. This way it would be orthogonal to the headers provided in the trailer. I'm assuming that support for chunked transfers implies support for trailers.

Yeah, actually chunked transfer need not be announced. If the server doesn't accept it, it will simply reject it. We can just say that if a server supports chunked transfers, it MUST support trailers. What do you think?

@Acconut
Member

Acconut commented Jan 20, 2015

The text mentions Content-MD5, but the example uses TUS-MD5-Checksum.

Thanks, that was a mistake.

Isn't it supposed to be a checksum of the message's body (i.e. checksum of the uploaded range, not of the whole file)

I meant to use Content-MD5 according to the HTTP spec. It's the hash of the chunk which is uploaded in the current request, not the entire file! The latter usage would be counterproductive, since the server would first need the entire file in order to calculate and validate the checksum (this was discussed in this issue already).

I'm also thinking about instead announcing that a server supports chunked transfers. This way it would be orthogonal to the headers provided in the trailer.

I disagree. Firstly, nearly all HTTP clients and servers support chunked encoding since it's mandated by HTTP 1.1 (see RFC 2616 section 3.6.1):

All HTTP/1.1 applications MUST be able to receive and decode the "chunked" transfer-coding, and MUST ignore chunk-extension extensions they do not understand.

And secondly, it's out of tus' scope to deal with detecting chunked transfer coding since the HTTP spec already deals with that. In addition, if you send a chunked HTTP/1.1 request to an HTTP/1.0 server, it must respond with 505 HTTP Version Not Supported. So there's no need for us to take care of this.

I'm assuming that support for chunked transfers implies support for trailers.

I would not say that. An HTTP server framework may support chunked transfers but not trailers (e.g. I'm not sure if PHP supports trailing headers, but I may be wrong). Another example would be proxies which (accidentally) remove trailers.

@qsorix

qsorix commented Jan 20, 2015

I meant to use Content-MD5 according to the HTTP spec. It's the hash of the chunk which is uploaded in the current request, not the entire file!

Ah, OK. In that case, I agree with the way this header is used.

The latter usage would be counterproductive, since the server would first need the entire file in order to calculate and validate the checksum (this was discussed in this issue already).

Why counterproductive? The checksum of the entire file is exactly what I personally need in my application. I want some means of verifying that after all resumes that happened, I got the content right.

The server will receive the entire file anyway, so what's the problem with needing the entire file? The server doesn't need to keep the whole content all the time: MD5 can be calculated over a stream. The server just needs to keep MD5's calculation state in the meantime.

@Acconut
Member

Acconut commented Jan 20, 2015

Why counterproductive? The checksum of the entire file is exactly what I personally need in my application. I want some means of verifying that after all resumes that happened, I got the content right.

I will try to explain it using the following example:

  1. A new file with a length of 300 bytes is created.
  2. The first 100 bytes are uploaded, along with the checksum of the entire file.
  3. The second 100 bytes are uploaded, but because of a networking error (or something else) one byte changes. This error is not detected because the checksum cannot be verified yet.
  4. The last 100 bytes are successfully uploaded, but the checksum validation fails because of the accident in step 3.

Now the client is not able to re-upload the second 100 bytes (see #51) and the only solution is to abort the current upload and start a new one from scratch. And now imagine uploading 10GB again because of one byte.

The server just needs to keep MD5's calculation state in the meantime.

In a lot of languages/frameworks storing that state is not easily achievable, since you have to store it persistently. An upload may be resumed after days, leading to a memory leak if you keep the state in memory.

@qsorix

qsorix commented Jan 20, 2015

I will try to explain it using the following example (...) Now the client is not able to re-upload the second 100 bytes (see #51) and the only solution is to abort the current upload and start a new one from scratch.

Nice example, thanks for the explanation. I totally agree. I was not clear in my last comments, so let me clarify. I am not against using a checksum for each single message. However, in addition to that, I need a checksum of the entire file. I got confused before because I thought you were looking for a name for the latter checksum, which wasn't the case.

Content-MD5 will be the checksum for the content of a single message. We need a name for the checksum of a complete file.

However, now that I understand your intentions better, there's one thing I didn't get. If a client attempts to send the entire file with a single PATCH request, that PATCH request will include Content-MD5 (which will accidentally happen to be the checksum of the entire file as well, but that's irrelevant now). When this upload is interrupted, the server is unable to verify that Content-MD5. So the server should now drop the data it received, right? It cannot keep it and let the client continue, because the data could be malformed. This would mean that using Content-MD5 forces you to upload in parts or you won't be able to resume. Am I confused again?

In a lot of languages/frameworks storing that state is not easily achievable, since you have to store it persistently. An upload may be resumed after days, leading to a memory leak if you keep the state in memory.

Well... There's already a lot of data you need to store persistently anyway to handle the upload. Like URLs, offsets, expected size, optionally metadata, etc. So storing a few more values is not a game-changer.

@Acconut
Member

Acconut commented Jan 20, 2015

However, in addition to that, I need a checksum of the entire file.

If you provide a checksum for every chunk you upload, why do you need an additional one for the total file?

This would mean that using Content-MD5 forces you to upload in parts or you won't be able to resume. Am I confused again?

Everything you said in that paragraph is correct. Checksums slow the upload process down by design. I personally prefer to split a file into multiple chunks, allowing me to upload them in parallel and to re-upload a single one. The spec advises acting differently, but it was written before the principles of checksums and parallel uploads were introduced to tus. We may change this sentence in 1.0.

Well... There's already a lot of data you need to store persistently anyway to handle the upload. Like URLs, offsets, expected size, optionally metadata, etc. So storing a few more values is not a game-changer.

Yes, it's not about the ability to store but about obtaining the underlying data (in the case of MD5, the four buffers). As far as I can tell from my experience, you are not able to persist an MD5 (or any other hash function) state and resume it later in Golang or Node.js (speaking about the standard library; packages by third parties may introduce this functionality but we should not rely on that).
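For what it's worth, later Go releases (after this discussion) did add encoding.BinaryMarshaler / encoding.BinaryUnmarshaler support to the standard library hashes, so persisting the state has since become possible there. A minimal sketch, assuming such a Go version:

```go
package main

import (
	"crypto/md5"
	"encoding"
	"fmt"
)

func main() {
	h := md5.New()
	h.Write([]byte("first 100 bytes..."))

	// Persist the hash's internal state, e.g. next to the upload's offset.
	// (Assumes a Go version whose md5 digest implements BinaryMarshaler.)
	state, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		panic(err)
	}

	// Days later, possibly in a fresh process: restore and keep hashing.
	h2 := md5.New()
	if err := h2.(encoding.BinaryUnmarshaler).UnmarshalBinary(state); err != nil {
		panic(err)
	}
	h2.Write([]byte("...rest of the file"))
	fmt.Printf("%x\n", h2.Sum(nil))
}
```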

@qsorix

qsorix commented Jan 20, 2015

However, in addition to that, I need a checksum of the entire file.

If you provide a checksum for every chunks you upload why do you need an additional one for the total file?

To catch bugs. In my environment there are many teams implementing clients. The lesson learned is that on the server side you should never assume that the client works, because given so many chances, someone eventually implements something wrong. Having an extra checksum to verify that the combined chunks are really what they are intended to be is valuable. It's actually quite easy to make an "off by 1" mistake when calculating offsets, etc.

Yes, you can argue that an individual checksum for each piece, accompanied by the offsets of all pieces, is enough to ensure the transferred data were not corrupted. Agreed. However, having a single number is simpler and more practical in day-to-day life at an office, where it is good to be able to quickly spot and isolate problems.

I could start coming up with crazy scenarios, e.g. disk corruption, that would also require complete checksums, but in my case, it is much more likely I'll have a problem due to software bugs.

I personally prefer to split a file into multiple chunks allowing me to upload them in parallel and re-uploading a single one. The spec advises to act different but it was written before the principle of checksums and parallel uploads were introduced to tus. We may change this sentence in 1.0.

OK. I cannot stop you here ;-). TUS, IMHO, needs to be focused on big bandwidths and huge files, because that's the Internet today. And in such an environment chunked and/or parallel uploads make sense and come at a small cost.

My deployment consists of embedded devices with limited flash memory and slow network connections. Small files. I'm interested in TUS because it allows resumable uploads in the simplest way I could imagine. This means not many kilobytes are needed to store the algorithm on a device. Things like parallel uploads are of no interest to me because I cannot afford them anyway. My comments are biased towards keeping TUS simple, because otherwise it may turn out that, in the end, this is not the protocol I want to use. ;)

@Acconut
Member

Acconut commented Jan 20, 2015

The case you explained makes sense for your project, so feel free to extend tus with specific extensions and corner cases to match your application. I try to make the protocol flexible and powerful while not being too specific and restrictive, which is a hard balance to strike.

Anyway, before I position myself on one side about a single checksum for the entire file, I have a few questions:

  • When should the checksum be transferred? In the initial file creation request?
  • May the checksum be changed in the future?
  • And most importantly: what happens when the validation fails? Must the entire file be re-uploaded, or is it just a warning to the client and the file will be processed anyway?

OK. I cannot stop you here ;-)

While, in fact, you can't stop me, I honour every contributor's opinion. It's not about building the protocol which suits my needs best; it's about making something which works for nearly everyone :)

@qsorix

qsorix commented Jan 20, 2015

I'll answer the questions as I see fit for my case, but I understand that "it depends" is often the right answer ;)

When should the checksum be transferred? In the initial file creation request?

File creation request was my first idea (#7 (comment)), because it keeps all metadata together and therefore seems right. Later I got convinced that it doesn't matter that much in my case, while having it on the PATCH makes streamed uploads easier.

May the checksum be changed in the future?

If it is a checksum for the entire file, it mustn't change.

And most importantly: what happens when the validation fails? Must the entire file be re-uploaded, or is it just a warning to the client and the file will be processed anyway?

In my case, I don't want to process a corrupted file. It needs to be re-uploaded.

@Acconut
Member

Acconut commented Jan 20, 2015

Hmm... I am very unsure about this.
I think the best way to go is not to put this into the protocol. I think I clarified my arguments for this before (if not, ask :)). Hopefully you'll understand this.
This doesn't mean you won't be able to supply your checksum in the initial request since the metadata extension allows this.

@qsorix

qsorix commented Jan 20, 2015

Yeah, I'm cool with using metadata for my purposes. Thanks for the discussion @Acconut .

@Acconut
Member

Acconut commented Jan 20, 2015

Great to hear that. I have to thank you for the ongoing feedback, too. :)

@Acconut
Member

Acconut commented Jan 21, 2015

See #53.

@Acconut
Member

Acconut commented Jan 26, 2015

Merged.

@Acconut Acconut closed this as completed Jan 26, 2015