Abstract
Biology researchers have a pressing need for data management technologies that make the storage and retrieval of DNA and protein sequence data accurate and efficient. The volume of data generated by DNA sequencing is already massive and will continue to grow rapidly. Even if current sequence databases are adequate today, they will almost certainly become inadequate as far more sequence data is determined. Future research in sequence databases therefore needs to focus on the organization of information, so that the volume of data that must be searched does not grow linearly with the volume of sequence data being discovered.
We propose to develop PROXIMAL, an index structure and retrieval system for biological sequence databases that promises to be both efficient and general. This organization of the databases will complement other current efforts at sequence comparison and analysis by providing an infrastructure in which other methods can be used to locate desired sequences efficiently. Our method relies on reference strings to partition the database of sequences. It is efficient because using multiple reference strings for a given distance measure greatly reduces the number of sequences that must be examined, allowing us to quickly locate sequences based on a precomputed metric. It is general because multiple distance measures can be used. These include, at a minimum, differing gap and mismatch weights for the basic edit distance calculation, as well as entirely different models of mutation. The only requirement is that the distance measure has a metric structure; in particular, it must satisfy the triangle inequality. This is a weak requirement that is satisfied by many interesting measures, including those currently in wide use for sequence comparison.
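To illustrate why the triangle inequality is enough to prune a search, the following is a minimal Python sketch of reference-string filtering. It is not the PROXIMAL implementation, whose details the abstract does not give; the names `edit_distance`, `ReferenceIndex`, and `range_query`, the choice of reference strings, and the flat table layout are all assumptions made for illustration. For a query q, a reference string r, and a stored sequence s, the triangle inequality gives |d(q, r) - d(s, r)| <= d(q, s), so any s whose precomputed distance to r differs from d(q, r) by more than the search radius can be skipped without computing d(q, s).

```python
def edit_distance(a, b, gap=1, mismatch=1):
    """Weighted edit distance; with gap = mismatch = 1 this is Levenshtein.

    Assumes positive, symmetric weights so the result is a metric.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + gap,                               # delete ca
                cur[j - 1] + gap,                            # insert cb
                prev[j - 1] + (0 if ca == cb else mismatch)  # match or mismatch
            ))
        prev = cur
    return prev[-1]


class ReferenceIndex:
    """Hypothetical index: stores each sequence's distance to every reference."""

    def __init__(self, sequences, references, dist=edit_distance):
        self.sequences = list(sequences)
        self.references = list(references)
        self.dist = dist
        # Precomputed once, at build time: one row of distances per sequence.
        self.table = [[dist(s, r) for r in self.references]
                      for s in self.sequences]

    def range_query(self, query, radius):
        """Return (hits, examined): sequences within radius of query, and
        how many sequences needed a full distance computation."""
        q = [self.dist(query, r) for r in self.references]
        hits, examined = [], 0
        for s, row in zip(self.sequences, self.table):
            # Triangle inequality: |d(q, r) - d(s, r)| is a lower bound on d(q, s).
            if any(abs(qd - sd) > radius for qd, sd in zip(q, row)):
                continue  # safely pruned by at least one reference string
            examined += 1
            if self.dist(query, s) <= radius:
                hits.append(s)
        return hits, examined


if __name__ == "__main__":
    db = ["ACGT", "ACGA", "TTTT", "GGGGGGGG", "ACGTT"]
    index = ReferenceIndex(db, references=["AAAA", "ACGTACGT"])
    hits, examined = index.range_query("ACGT", radius=1)
    print(hits, "found after examining", examined, "of", len(db))
```

Because the filter only discards sequences that provably lie outside the radius, the result matches an exhaustive scan; adding more reference strings tightens the lower bound at the cost of a larger precomputed table. Swapping in a different metric only requires passing another function as `dist`.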