As background to this question: I have been building my own imageboard that prevents (for example) duplicate images from being downloaded again and again on behalf of the client. The way I do this is to keep every file in a database, keyed by a hash of the file. The client sees the hash and first checks its own database to see whether the file has already been downloaded before actually making a request. Similarly on my server, I prevent duplicate uploads by having the client send me the hash first.
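As a rough sketch of that lookup (assuming SHA-256 as the hash, with a QSet standing in for the real database index):

    #include <QCryptographicHash>
    #include <QFile>
    #include <QSet>

    // Sketch only: SHA-256 over the whole file; "knownHashes"
    // stands in for the real database index.
    QByteArray fileHash(const QString &path)
    {
        QFile f(path);
        if (!f.open(QIODevice::ReadOnly))
            return QByteArray();
        QCryptographicHash h(QCryptographicHash::Sha256);
        h.addData(&f);                  // streams the file into the hash
        return h.result().toHex();
    }

    bool alreadyStored(const QSet<QByteArray> &knownHashes, const QString &path)
    {
        return knownHashes.contains(fileHash(path));
    }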
I am now expanding this into a more general-purpose networking library for downloading files from the web, and to my dismay I discovered that not all servers will supply me with any sort of hash (an ETag, Content-MD5, or similar).
In an effort to de-duplicate downloads, and to resume partial downloads whose URL has changed, is there a way to reliably fingerprint a file from its headers and URL alone?
Taking an example here of a plain HEAD request, with the headers read back via e.g.

    QVariant contentLength = reply->header( QNetworkRequest::ContentLengthHeader );

the URL decomposes as:
    scheme()   : https
    userName() : NULL
    password() : NULL
    host()     : i.imgur.com
    port()     : -1
    path()     : /oEdf6Rl.png
    fragment() : NULL
    query()    : NULL
and the response carries a couple of date headers plus imgur's Server string:

    Sun, 21 Feb 2021 15:14:36 GMT
    Fri, 26 Feb 2021 04:14:22 GMT
    cat factory 1.0
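For context, a self-contained sketch of issuing the HEAD request and reading those headers back (standard QNetworkAccessManager usage; the URL is just the example above):

    #include <QCoreApplication>
    #include <QDebug>
    #include <QNetworkAccessManager>
    #include <QNetworkReply>
    #include <QNetworkRequest>
    #include <QUrl>

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);
        QNetworkAccessManager nam;

        // HEAD fetches headers only; no body comes down the wire.
        QNetworkReply *reply = nam.head(
            QNetworkRequest(QUrl("https://i.imgur.com/oEdf6Rl.png")));

        QObject::connect(reply, &QNetworkReply::finished, &app, [reply, &app]() {
            qDebug() << "Content-Type:  " << reply->header(QNetworkRequest::ContentTypeHeader);
            qDebug() << "Content-Length:" << reply->header(QNetworkRequest::ContentLengthHeader);
            qDebug() << "Last-Modified: " << reply->header(QNetworkRequest::LastModifiedHeader);
            qDebug() << "Server:        " << reply->rawHeader("Server");
            reply->deleteLater();
            app.quit();
        });
        return app.exec();
    }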
The only things that seem static here are the MIME type and the file size. One thing I would be willing to do is a ranged download of certain byte spans, since I have found that most servers advertise Accept-Ranges: bytes; from there I would hash the resulting byte array and use that as the fingerprint, as sketched below.
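Something like this is what I have in mind (a sketch; the choice of range and SHA-256 are arbitrary on my part):

    #include <QCryptographicHash>
    #include <QNetworkAccessManager>
    #include <QNetworkReply>
    #include <QNetworkRequest>
    #include <QUrl>

    // Request a byte span of the resource; the server should answer
    // 206 Partial Content if it honours Range, 200 (full body) if not.
    QNetworkReply *fetchRange(QNetworkAccessManager &nam, const QUrl &url,
                              qint64 from, qint64 to)
    {
        QNetworkRequest req(url);
        req.setRawHeader("Range", "bytes=" + QByteArray::number(from)
                                      + "-" + QByteArray::number(to));
        return nam.get(req);
    }

    // In the finished() handler:
    //   if (reply->attribute(QNetworkRequest::HttpStatusCodeAttribute).toInt() == 206) {
    //       QByteArray fingerprint = QCryptographicHash::hash(
    //           reply->readAll(), QCryptographicHash::Sha256).toHex();
    //   }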
However, I am skeptical that this would work reliably. I am especially concerned about something like two image frames that are nearly identical but are in fact distinct files: if the bytes I happen to sample fall in the region where they agree, the fingerprints would collide.
Am I pursuing a lost cause here? Or is there a reasonable way to fingerprint a file hosted on the web, without having to fully download it?
I’d like to do this with any file larger than 1 MB, because I have an exceptionally slow connection at times. Thanks.