Authors: Li, Gang; Chan, Tak-Ming; Prof. LEUNG Kwong Sak; Lee, Kin-Hong
Record date: 2023-03-23
Publication year: 2010
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010, Vol. 7(4), pp. 654-668, Article number 4785455
ISSN: 1545-5963
URI: http://hdl.handle.net/20.500.11861/7542
Abstract: Finding Transcription Factor Binding Sites, i.e., motif discovery, is crucial for understanding the gene regulatory relationship. Motifs are weakly conserved and motif discovery is an NP-hard problem. We propose a new approach called Cluster Refinement Algorithm for Motif Discovery (CRMD). CRMD employs a flexible statistical motif model allowing a variable number of motifs and motif instances. CRMD first uses a novel entropy-based clustering to find complete and good starting candidate motifs from the DNA sequences. CRMD then employs an effective greedy refinement to search for optimal motifs from the candidate motifs. The refinement is fast, and it changes the number of motif instances based on the adaptive thresholds. The performance of CRMD is further enhanced if the problem has one occurrence of motif instance per sequence. Using an appropriate similarity test of motifs, CRMD is also able to find multiple motifs. CRMD has been tested extensively on synthetic and real data sets. The experimental results verify that CRMD usually outperforms four other state-of-the-art algorithms in terms of the qualities of the solutions with competitive computing time. It finds a good balance between finding true motif instances and screening false motif instances, and is robust on problems of various levels of difficulty.
© 2006 IEEE.
Language: en
Keywords: Motif Discovery; Transcription Factors; Binding Sites; DNA Sequencing; Statistical Models; Computation Time; Transcription Factor Binding Sites; Multiple Motifs; Motif Instances; Posterior Probability; Local Optimum; Kullback-Leibler; Real Problems; Position Weight Matrices; Nucleotide Level; Number Of Nucleotides; Multinomial Distribution; Dataset Selection; Nucleotide Frequencies; Width Range; Subsequent Set; Single Motif; Motif Width; Highest Average; Set Of Motifs; Consensus-Based Approach; Cis-Regulatory Regions; Types of Nucleotides; Conserved Motifs; Performance Indicators
Title: A cluster refinement algorithm for motif discovery
Type: Peer Reviewed Journal Article
DOI: 10.1109/TCBB.2009.25