Class SimilarityIndex


  • public class SimilarityIndex
    extends java.lang.Object
    Index structure of lines/blocks in one file.

    This structure can be used to compute an approximation of the similarity between two files. The index is used by SimilarityRenameDetector to compute scores between files.

    To save memory, this index uses a space-efficient encoding that will not exceed 1 MiB per instance. The index starts out at a smaller size (closer to 2 KiB), but may grow as more distinct blocks are discovered within the scanned file.

    Since:
    4.0
    • Field Detail

      • KEY_SHIFT

        private static final int KEY_SHIFT
        Shift to apply before storing a key.

        Within the 64-bit table record space, we leave the highest bit unset so all values are positive. The lower 32 bits are used to count bytes.

        See Also:
        Constant Field Values
      • MAX_COUNT

        private static final long MAX_COUNT
        Maximum value of the count field, also mask to extract the count.
        See Also:
        Constant Field Values
      • hashedCnt

        private long hashedCnt
        Total number of bytes hashed into the structure, including \n. This is usually the size of the file minus the number of CRLF sequences encountered.
      • idSize

        private int idSize
        Number of non-zero entries in idHash.
      • idGrowAt

        private int idGrowAt
        Value of idSize at which idHash doubles in size.
      • idHash

        private long[] idHash
        Pairings of content keys and counters.

        Each slot in the table is two ints packed into a single long. The upper 32 bits store the content key, and the lower 32 bits store the number of bytes associated with that key. Empty slots are denoted by 0; a populated slot can never be 0 because its count is never 0. Values are always positive, which is enforced during key addition.

      • idHashBits

        private int idHashBits
        idHash.length == 1 << idHashBits.
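
The slot layout described for idHash, KEY_SHIFT and MAX_COUNT can be illustrated with a minimal sketch. It is not the class's actual code: the constant values (KEY_SHIFT = 32, MAX_COUNT = 2^32 - 1) are assumptions inferred from the field descriptions, and the class SlotPackingSketch and its methods pair, keyOf and countOf are hypothetical names introduced only for this example.

```java
public class SlotPackingSketch {
    // Assumed values; this page describes the constants but does not list them.
    static final int KEY_SHIFT = 32;
    static final long MAX_COUNT = (1L << KEY_SHIFT) - 1;

    /** Packs a non-negative key and a count in [1, MAX_COUNT] into one slot. */
    static long pair(int key, long count) {
        if (key < 0) {
            // A negative key would set the highest bit and make the slot negative.
            throw new IllegalArgumentException("key must be non-negative");
        }
        if (count < 1 || count > MAX_COUNT) {
            // A zero count could produce 0, which is reserved for empty slots.
            throw new IllegalArgumentException("count out of range");
        }
        return (((long) key) << KEY_SHIFT) | count;
    }

    /** Extracts the content key from the upper bits of a slot. */
    static int keyOf(long slot) {
        return (int) (slot >>> KEY_SHIFT);
    }

    /** Extracts the byte count from the lower bits of a slot. */
    static long countOf(long slot) {
        return slot & MAX_COUNT;
    }

    public static void main(String[] args) {
        int idHashBits = 8;                         // hypothetical table size
        long[] idHash = new long[1 << idHashBits];  // idHash.length == 1 << idHashBits

        long slot = pair(0x1234, 57);
        idHash[0] = slot;

        System.out.println(keyOf(slot));      // 4660
        System.out.println(countOf(slot));    // 57
        System.out.println(slot > 0);         // true
        System.out.println(idHash[1] == 0);   // true: untouched slots are empty
    }
}
```

Because the key is kept non-negative and the count is at least 1, every populated slot is a strictly positive long, which is what lets 0 serve as the empty-slot marker.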
    • Constructor Detail

      • SimilarityIndex

        SimilarityIndex()