Heuristics For Specific Substring Problem
Heuristics for the Specific Substring Problem with Hamming Distance
An important problem in Computational Biology is to determine genetic markers, substrings of a set of sequences that do not occur on sequences of other sets. Applications for this problem include finding small specific regions for primer design and to find specific organisms or sequences in metagenomes. Genetic markers can be addressed by the Specific Substring Problem - SSP which consists of finding all minimal substrings in a given set of sequences with at least k differences among all the substrings in another sequence set. Since this problem spend quadratic time when Hamming distance is considered and we have, in general, a large volume of data to be processed, this solution becomes impractical. With this in mind, the main focus of this work is to propose and investigate the use of heuristic and parallel approaches for the SSP whose effectiveness were verified with artificial and real data experiments.