
Discuss the use of fingerprints (crc32, fnv64) instead of inode or paths in the sincedb #114

Open

Description

@guyboertje

From discussions on Slack and in a filewatch PR, I am presenting some conclusions here for community discourse. Filewatch PR 79, and how these discussions are implemented in it, should close issue 111.


from @guyboertje
If we introduce a fingerprint instead of the path as the sincedb key, we rely on two inherent properties of a file: its address (inode) and a hash of some of its content - neither of which is user adjustable.
So in comparing two files we have this matrix:

   |   3   |   4   |
x  | A   B |   C   |
y  |   D   |   E   |

3,4 = inode
x,y = fingerprint
A,B,C,D,E = files

same:                  A <-> B (cell - both same)
different:             A <-> E, C <-> D (diagonals)
different:             A <-> D, C <-> E (columns - inode same, fp different)
same:                  A <-> C, D <-> E (rows - fp same, inode different)

different (collision)?: A <-> C, D <-> E (rows - fp same, inode different - these pairs could actually be different files if the hash collides)
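
To make the matrix concrete, here is a minimal Ruby sketch of the resulting decision rule; the method and the [inode, fingerprint] tuples are purely illustrative, and it assumes fingerprint collisions are negligible (the question raised above):

    # each file is an [inode, fingerprint] pair, e.g. A = [3, :x]
    def same_file?(a, b)
      _inode_a, fp_a = a
      _inode_b, fp_b = b
      # rows (fp same) => same file, regardless of inode;
      # columns and diagonals (fp different) => different file
      fp_a == fp_b
    end

    same_file?([3, :x], [4, :x])  #=> true  (A <-> C, barring a collision)
    same_file?([3, :x], [3, :y])  #=> false (A <-> D)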

@colinsurprenant said (paraphrased) "if we have a sufficiently robust hashing strategy that reduces collisions significantly we can assert that two files with the same fingerprint on different inodes are the same file."

Therefore the next discussion is what the robust hashing strategy should be:

  • SHA256 hash of (up to) X bytes.
  • Fnv.fnv1a_64 hash of X bytes.

where X is 255, 512, one disk block [agreement needed], etc.
We need to think of the case where a tailed file is discovered with very little content - will the chance of collisions increase?
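
For illustration, here is a sketch of the two candidate fingerprints over the first X bytes: SHA256 from Ruby's standard Digest library, and FNV-1a 64 written out inline (a real implementation could use a gem such as the Fnv one named above). X = 255 is just one of the values under discussion:

    require 'digest'

    FNV1A_OFFSET = 0xcbf29ce484222325
    FNV1A_PRIME  = 0x100000001b3

    def fnv1a_64(data)
      data.each_byte.reduce(FNV1A_OFFSET) do |hash, byte|
        ((hash ^ byte) * FNV1A_PRIME) & 0xffffffffffffffff
      end
    end

    def fingerprints(path, x = 255)  # 255, 512, one disk block [agreement needed]
      data = File.open(path, 'rb') { |io| io.read(x) } || ''
      { sha256:   Digest::SHA256.hexdigest(data),
        fnv1a_64: fnv1a_64(data),
        length:   data.bytesize }  # a short file hashes fewer bytes - more collision risk
    end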


from @jordansissel
This is a hard problem. I wish I had better answers at this time :\

That said, I wonder if we can get away with just using a content fingerprint and ignoring addresses. If we assume that log files include timestamps -- such that, even for the same "log message", the timestamp content will differ between two different log files -- maybe we don't even need addresses? Given @colinsurprenant's hypothetical of a 'sufficiently robust hashing strategy', I have a feeling that we could use the content hash and ignore addresses (file path, inode, device).

Thoughts? It would eliminate the column portion of the matrix and just have rows (content fingerprint).


from @guyboertje
👍 on fingerprints.

x  | A   B  |
y  | C   C' |
z  |   D    |

x,y,z = fingerprint
A-D = files

A <-> B  same
C <-> C' same
B <-> D  different

In Slack, @jsvd, @jordansissel and I concluded the following:

  • a checksum is taken of the content at the beginning of a 32K block (32K being the sysread size), over X bytes (255, 512?? [agreement needed]).
  • having a second checksum of the last block read [agreement needed].
  • any empty files, or files with only a header (CSV) and no content, would have the same checksum initially but would get new ones as more content is read.
  • the checksums stored on disk (in the sincedb file) would have an offset and length, so the fingerprint would be [checksum1, offset1, length1, checksum2, offset2, length2] (sketched below).
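
A sketch of that record layout in Ruby, with the second checksum optional and Zlib.crc32 standing in for whichever hash is eventually chosen; the struct and method names are illustrative, not filewatch's actual API:

    require 'zlib'

    BLOCK_SIZE = 32 * 1024  # the sysread size
    FP_LENGTH  = 255        # X bytes [agreement needed]

    # one checksum triple; the six-element fingerprint above is two of these
    SincedbKey = Struct.new(:checksum, :offset, :length) do
      def to_s
        "#{checksum},#{offset},#{length}"
      end
    end

    # what is stored against a key: read position, optional second key, expiry
    SincedbValue = Struct.new(:position, :second_key, :expires_at)

    # on-disk line, matching the examples below: "c1,o1,l1 position [c2,o2,l2]"
    # (the expiry timestamp is omitted here for brevity)
    def serialize(key, value)
      line = "#{key} #{value.position}"
      line << " #{value.second_key}" if value.second_key
      line
    end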

The upgrader will try to match inodes from an old sincedb with discovered files. We need to leave unmatched sincedb records in the new sincedb file, as they may be re-discovered later.
Each sincedb record, old and new, will have an expiry timestamp added, of say 14 days [agreement needed]. Expired entries are removed. Before records are written to disk, a new expiry timestamp is given to watched_files that have had more data read this session than the position stored in the hash value; otherwise the old expiry timestamp is written.
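
A sketch of that expiry pass, assuming the 14-day value and the SincedbValue shape from the sketch above; wf.bytes_read, wf.sincedb_key and the plain-Hash sincedb are hypothetical:

    EXPIRY_SECONDS = 14 * 24 * 3600  # [agreement needed]

    def prune_and_refresh(sincedb, watched_files, now = Time.now)
      # expired entries are removed
      sincedb.reject! { |_key, value| value.expires_at < now }
      # refresh only entries that had more data read this session than the stored position
      watched_files.each do |wf|
        value = sincedb[wf.sincedb_key]
        next unless value
        value.expires_at = now + EXPIRY_SECONDS if wf.bytes_read > value.position
      end
    end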

Example 1: imagine we have previously read a file of 42 bytes; on disk we have one sincedb entry, 245697582,0,42 42. Then we discover the same file, but now it has more bytes and a checksum of 282385235,0,255. There is no direct match on checksums, so we have to find one that might match. We extract a list of possible sincedb keys (checksums) that are less than 255 in length and have not been matched before, and we create and test checksums from the discovered file for the lengths in the list of possibles. If we have a match we add 282385235,0,255 => 42 to the sincedb hash and remove 245697582,0,42 42, meaning that we will start reading the discovered file from byte 43.

Example 2: we discover a small file of less than 255 bytes, e.g. 532356864,0,120. If there is no direct match we only need to extract possible keys that are less than 120 in length. We repeat the same test as above.
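
A sketch of the fallback matching used in Examples 1 and 2, building on the earlier sketches (Zlib.crc32 again stands in for the chosen hash, and matched? is a hypothetical flag on the stored value):

    # returns the old key whose shorter-length checksum matches the
    # discovered file, or nil if the file is genuinely new
    def find_shorter_match(sincedb, new_key, path)
      possibles = sincedb.keys.select do |key|
        key.length < new_key.length && !sincedb[key].matched?
      end
      possibles.find do |key|
        data = File.open(path, 'rb') { |io| io.read(key.length) } || ''
        Zlib.crc32(data) == key.checksum  # re-test at the old, shorter length
      end
    end

    # Example 1, on a hit: re-key the entry and resume from the stored position
    # sincedb[new_key] = sincedb.delete(old_key)  # adds 282385235,0,255 => 42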

When we discover a file bigger than 32K we can pre-compute two checksums, one at 0,255 and the other at 32768,255 (or less if the file size is less than 32K + 255).

When we discover a file smaller than 32K we pre-compute only one checksum, at 0,255 (or less if the file size is less than 255).
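
In code, the discovery-time pre-compute might look like this, reusing BLOCK_SIZE, FP_LENGTH and SincedbKey from the earlier sketch:

    def checksum_key(path, offset, length)
      data = File.open(path, 'rb') { |io| io.seek(offset); io.read(length) } || ''
      SincedbKey.new(Zlib.crc32(data), offset, data.bytesize)
    end

    def precompute_keys(path)
      size = File.size(path)
      keys = [checksum_key(path, 0, [FP_LENGTH, size].min)]
      if size > BLOCK_SIZE
        # the second checksum starts where the second 32K block begins
        keys << checksum_key(path, BLOCK_SIZE, [FP_LENGTH, size - BLOCK_SIZE].min)
      end
      keys
    end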

Example 3: we discover a very large file with a checksum of 287775235,0,255. We have seen it before, having read 44444 bytes from it, and have this on disk: 287775235,0,255 44444 457775898,32768,255. The first checksum matches, so we double-check whether the second part matches too; if so, we know it is the same file and read from byte 44445. If there is no match we treat it as a new file. When there is a first match but not a second match, and the sincedb second checksum length was less than 255, we compute a checksum for that length on the discovered file.
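
A sketch of Example 3's two-stage check, again with hypothetical names:

    # returns the position to resume from, or :new_file
    def resolve(sincedb, first_key, path)
      value = sincedb[first_key]
      return :new_file unless value
      stored = value.second_key
      return value.position unless stored  # nothing more to verify
      # recompute at the stored offset/length - it may have been shorter than 255
      candidate = checksum_key(path, stored.offset, stored.length)
      candidate.checksum == stored.checksum ? value.position : :new_file
    end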

Example 4: we discover a file that is 32768 + 10 bytes. Its checksums are at 0,255 and 32768,10.

At file discovery we will cache either the bytes at 0,255 or 32768,255 to make recomputing the checksums quicker.
