browser - Why do some downloading files not know their own size?

Thursday, 29 March 2018

Occasionally, when downloading a file in a web browser, the download progress doesn't "know" the total size of the file, or how far along in the download it is -- it just shows the speed at which it's downloading, with a total as "Unknown".

Why wouldn't the browser know the final size of some files? Where does it get this information in the first place?

Answer

To request documents from web servers, browsers use HTTP. You may know that name from your address bar (it may be hidden now, but if you click the address bar, copy the URL and paste it into a text editor, you'll see http:// at the beginning). HTTP is a simple text-based protocol. It works like this:

First, your browser connects to the website's server and sends the URL of the document it wants to download (web pages are documents, too), along with some details about the browser itself (User-Agent, etc.). For example, to load the main page on the SuperUser site, http://superuser.com/, my browser sends a request that looks like this:

GET / HTTP/1.1
Host: superuser.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.0 Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4
Cookie: [removed for security]
DNT: 1
If-Modified-Since: Tue, 09 Jul 2013 07:14:17 GMT

The first line specifies which document the server should return. The other lines are called headers; they look like this:

Header name: Header value

These lines send additional information that helps the server decide what to do.
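As a rough illustration of the format described above, the sketch below builds a request as plain text and turns header lines back into name/value pairs. The names build_request and parse_headers are made up for this example and don't belong to any library; a real browser sends many more headers.

```python
def build_request(host, path="/", extra_headers=None):
    """Return the raw text of a minimal HTTP/1.1 GET request (illustrative helper)."""
    headers = {"Host": host, "Connection": "close"}
    headers.update(extra_headers or {})
    # The first line names the document; every following line is a header.
    lines = [f"GET {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Lines end with CRLF, and an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_headers(header_lines):
    """Turn 'Header name: Header value' lines into a dict keyed by lowercase name."""
    headers = {}
    for line in header_lines:
        # Split at the first colon only, since values may contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


request = build_request("superuser.com", extra_headers={"DNT": "1"})
print(request)
```

Header names are lowercased here because HTTP treats them as case-insensitive, which makes later lookups simpler.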

If all is well, the server responds by sending the requested document. The response starts with a status line, followed by some headers (with details about the document) and finally the document's content. This is what the SuperUser server's reply to my request looks like:

HTTP/1.1 200 OK
Cache-Control: public, max-age=60
Content-Type: text/html; charset=utf-8
Expires: Tue, 09 Jul 2013 07:27:20 GMT
Last-Modified: Tue, 09 Jul 2013 07:26:20 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Date: Tue, 09 Jul 2013 07:26:19 GMT
Content-Length: 139672



    [...snip...]

After the last line, SuperUser's server closes the connection.

The first line (HTTP/1.1 200 OK) contains the response code, in this case 200 OK. It means that the server has decided it can return a document, as requested, and promises that the contents that follow will be such a document. Otherwise the code will be something else, indicating why the server is not simply returning the document: for instance, if it cannot find the requested document, it is supposed to return 404 Not Found, and if you are not allowed to access the content in question, it is supposed to return 403 Forbidden.
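To make the status line a bit more concrete, here is a small sketch that splits it into its three parts and looks up the standard phrases for the codes mentioned above. parse_status_line is a hypothetical helper written for this example; HTTPStatus comes from Python's standard http module.

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split e.g. 'HTTP/1.1 404 Not Found' into (version, code, reason)."""
    # Split on the first two spaces only, because the reason may contain spaces.
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


print(parse_status_line("HTTP/1.1 200 OK"))

# The standard library knows the usual phrase for each code.
for code in (200, 403, 404):
    print(code, HTTPStatus(code).phrase)
```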

After the status line, the response headers follow; they provide more information about the content being returned, such as its Content-Type.

Next is an empty line. It signals that no more response headers will follow. Everything past that line is the content of the requested document. In the example above, that content (abbreviated as [...snip...]) is the SuperUser home page, an HTML document. If I were requesting a file to download, it would probably look like gibberish, because most file formats are unreadable without prior processing.
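The role of the empty line can be sketched in a few lines of code. The response bytes below are a shortened, made-up stand-in for a real reply, and split_response is a hypothetical helper, not something a browser or library exposes under that name.

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (status_line, header_lines, body)."""
    # The first empty line (two consecutive CRLFs) separates headers from content.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    return lines[0], lines[1:], body


# A tiny invented response; the body is just a placeholder first line of HTML.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 15\r\n"
    b"\r\n"
    b"<!DOCTYPE html>"
)

status_line, header_lines, body = split_response(raw)
print(status_line)
print(header_lines)
print(len(body), "bytes of content")
```

The body is kept as bytes on purpose: for a downloaded file it is usually binary data rather than readable text.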

Back to headers. The most interesting one for us is the last one, Content-Length. It tells the browser how many bytes of data to expect after the empty line, so basically it's the document size expressed in bytes. This header isn't mandatory and may be omitted by the server. Sometimes the document size can't be predicted (for example, when the document is generated on the fly). Sometimes lazy programmers don't include it (quite common on driver download sites). And sometimes websites are created by newbies who don't know such a header exists.

Anyway, whatever the reason, the header can be missing. In that case the browser doesn't know how much data the server is going to send, so it displays the document size as unknown and keeps receiving until the server signals the end of the data, for example by closing the connection. And that's the reason for unknown document sizes.
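The difference this makes to a download dialog can be sketched roughly as follows. describe_progress is a hypothetical function, and it assumes the response headers have already been collected into a dict with lowercase names; actual browsers do considerably more than this.

```python
def describe_progress(headers, bytes_received):
    """Return the text a download dialog might show (illustrative only)."""
    length = headers.get("content-length")
    if length is None or not length.isdigit():
        # No usable size: only the amount received so far can be reported.
        return f"{bytes_received} bytes of unknown total"
    total = int(length)
    # Guard against division by zero for an empty document.
    percent = 100 * bytes_received / total if total else 100.0
    return f"{bytes_received} of {total} bytes ({percent:.0f}%)"


# With Content-Length, a percentage can be computed.
print(describe_progress({"content-length": "139672"}, 69836))
# Without it, the total stays unknown.
print(describe_progress({"content-type": "application/zip"}, 69836))
```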
