Applying Agrep to r-NSA to solve multiple sequences approximate matching

Ni, Bing; Wong, Man-Hon; Lam, Chi-Fai David; Prof. LEUNG Kwong Sak

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.11861/7500

DC Field	Value	Language
dc.contributor.author	Ni, Bing	en_US
dc.contributor.author	Wong, Man-Hon	en_US
dc.contributor.author	Lam, Chi-Fai David	en_US
dc.contributor.author	Prof. LEUNG Kwong Sak	en_US
dc.date.accessioned	2023-03-16T03:39:31Z	-
dc.date.available	2023-03-16T03:39:31Z	-
dc.date.issued	2014	-
dc.identifier.citation	International Journal of Data Mining and Bioinformatics, 2014, Vol. 9 (4), pp 358-385	en_US
dc.identifier.issn	1748-5673	-
dc.identifier.issn	1748-5681	-
dc.identifier.uri	http://hdl.handle.net/20.500.11861/7500	-
dc.description.abstract	This paper addresses the approximate matching problem in a database consisting of multiple DNA sequences. The proposed approach applies Agrep to a new truncated suffix array, r-NSA. The construction time of the structure is linear in the database size, and indexing a substring in the structure takes constant time. The number of characters processed in applying Agrep is analysed theoretically, and the theoretical upper bound closely approximates the empirical number of characters, which is obtained by enumerating the characters in the structure actually built. Experiments are carried out using (synthetic) random DNA sequences, as well as (real) genome sequences including Hepatitis-B Virus and X-chromosome. Experimental results show that, compared to the straightforward approach that applies Agrep to multiple sequences individually, the proposed approach solves the matching problem in much shorter time. The speed-up of our approach depends on the sequence patterns, and for highly similar homologous genome sequences, which are the common cases in real-life genomes, it can be up to several orders of magnitude.	en_US
dc.language.iso	en	en_US
dc.relation.ispartof	International Journal of Data Mining and Bioinformatics	en_US
dc.title	Applying Agrep to r-NSA to solve multiple sequences approximate matching	en_US
dc.type	Peer Reviewed Journal Article	en_US
dc.identifier.doi	10.1504/IJDMB.2014.062145	-
item.fulltext	No Fulltext	-
crisitem.author.dept	Department of Applied Data Science	-
Appears in Collections:	Applied Data Science - Publication
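
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of bit-parallel approximate matching in the style of Agrep (the Wu-Manber shift-and scheme), applied to each sequence individually. It is not the authors' implementation and does not model r-NSA, whose details are not given in this record. The function names `agrep_search` and `search_each_sequence`, and the choice to report 0-based end positions, are assumptions made for illustration.

```python
def agrep_search(pattern, text, k):
    """Return 0-based end positions in `text` where `pattern` matches a
    substring with at most `k` edits (insertions, deletions, substitutions).

    Hypothetical helper: a shift-and bit-parallel sketch, not the paper's code.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    full = (1 << m) - 1      # keep only the m pattern bits
    accept = 1 << (m - 1)    # bit set when the whole pattern has matched

    # state[d]: bit i set when pattern[0..i] matches a suffix of the text
    # read so far with at most d errors. Initially d pattern characters
    # can be "deleted", so the low d bits are set.
    state = [((1 << d) - 1) & full for d in range(k + 1)]

    hits = []
    for pos, ch in enumerate(text):
        b = masks.get(ch, 0)
        prev_old = state[0]
        state[0] = ((state[0] << 1) | 1) & b
        for d in range(1, k + 1):
            cur_old = state[d]
            state[d] = (
                (((cur_old << 1) | 1) & b)  # exact character match
                | prev_old                  # extra character in the text
                | (prev_old << 1)           # substituted character
                | (state[d - 1] << 1)       # missing character in the text
                | 1
            ) & full
            prev_old = cur_old
        if state[k] & accept:
            hits.append(pos)
    return hits


def search_each_sequence(pattern, sequences, k):
    """Baseline: scan every sequence separately and collect hits per index."""
    return {i: agrep_search(pattern, seq, k) for i, seq in enumerate(sequences)}
```

Because this baseline rescans every sequence in full, its cost grows with the total database size even when the sequences are nearly identical, which is the redundancy the abstract says the shared index is meant to avoid.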

SCOPUS citations: 1 (checked on May 18, 2025)
Page views: 46; last week: 0 (checked on May 19, 2025)
