On the relationship between MinHash, Minimizers, and CNNs: an exploration through hash functions
Date: February 25th, 2022 @ 3pm-4pm
Location: Center for Biotechnology and Life Sciences. Room 100
Host: Noah Daniels
Talk Description
The selection of subsampled k-mer features in strings is one of the primitive tasks enabling efficient genomics algorithms. These subsampled k-mers have been employed for everything from accelerated read mapping, approximate pairwise distances, and taxonomic classification. One of the classical means of selecting k-mers is to apply some hash function to the k-mers and keep only the minimum hashed values. When the hash function has the property of min-wise independence, the resulting algorithm is known as MinHash, which is a probabilistic sketch for computing the Jaccard distance between sets. When the hash function is applied on a sliding window of k-mers on a string, we get a canonical set of “minimizers” that can be used as a smaller set of anchors for matching similar strings. However, although related, MinHash and minimizers do not require the same properties from the hash functions used.
In this talk, we discuss three loosely-related topics. First, we prove that we can use a floating-point encoding to reduce the space complexity of storing MinHash values from O(log n) to O(log log n), using the Flajolet-Martin trick of LogLog counters. Second, we show that selecting minimizer-like anchors for string matching in genomics may in fact not be best achieved through uniform random hashing; rather instead, we may want the k-mer selection method to have other properties than min-wise independence. Third and finally, we prove that convolution with a spherical Gaussian random variable results in a hash family that preferentially selects more distinct k-mers as extrema; this is not useful for designing classical hash functions because the output is non-uniform and continuous, but this fact allows us to interpret the initial layers of a convolutional neural network—the convolutional filters followed by max-pooling—as a minimizer-equivalent feature-selection operation.
No knowledge of genomics will be assumed.
Joint work with Jim Shaw and Griffin Weber.