International Journal of Scientific & Engineering Research, Volume 3, Issue 7, July-2012 1

ISSN 2229-5518

Search Query Performance Improvement on

Medica Data Bases

Thayyaba Khatoon Mohammed, Gayatri.M, G. Swathi, Sukerthi.S.

AbstractSearch queries on biomedical databases, such as PubMed, often return a large number of results, only a small subset of which is relevant to the user. Ranking and categorization, which can also be combined, have been proposed to alleviate this information overload problem. Results cate- gorization for biomedical databases is the focus of this work. A natural way to organizebiomedical citations is according to their MeSH annotations. MeSH is a comprehensive concept hierarchy used by PubMed. In thi spaper, we present the BioNav system, a novel search interface that enables the user to navigate large number of query results by organizing them using the MeSH concept hierarchy. First, the query results are organized into a navi- gation tree. At each node expansion step, BioNav reveals only a small subset of the concept nodes, selected such that the expected user navigation cost is minimized. In contrast, previous works expand the hierarchy in a predefined static manner, without navigation cost modeling. W e show that the problem of selecting the best concepts to reveal at each node expansion is NP-complete and propose an efficient heuristic as well as a feasible optimal algorithm for relatively small trees. W e show experimentally that BioNav outperforms state-of-the-art categorization systems by up to an order of magni- tude, with respect to the user navigation cost.

.

Index TermsInteractive data exploration and discovery, search process, graphical user interfaces, interaction styles.

—————————— ——————————

1 INTRODUCTION

THE last decade has been marked by unprecedented growth in both the production of biomedical data and the amount of published litera- ture discussing it. The MEDLINE database, on which the PubMed search engine operates, contains over 18 million citations and is cur- rently growing at the rate of 500,000 new citations each year [20]. Other biological sources, such as Entrez Gene [18] and OMIM [21], witness similar growth. As claimed in previous work [26], the ability to rapidly survey this literature constitutes a necessary step toward both the design and the interpretation of any large-scale experiment. Biologists, chemists, medical and health scientists are used to search- ing their domain literature—such as PubMed—using a keyword search interface. Currently, in an exploratory scenario where the user tries to find citations relevant to her line of research and hence not known a priori, she submits an initially broad keyword-based query that typically returns a large number of results. Subsequently, the user iteratively refines the query, if she has an idea of how to, by adding more keywords, and resubmits it, until a relatively

small number of results are returned. This refinement process is problematic because after a number of iterations, the user is not aware if she has overspecified the query, in which case relevant cita- tions might be excluded from the final query result.

Thayyaba khatoon completed her masters degree program in computer science and engineering in JNTUH University, India, PH-

+919885696137. E-mail:thayyaba.khatoon16@gmail.com

Gayatri M completed her masters degree program in computer science and

engineering in Acharya Nagarjuna University, India, PH-

+919177842642. E-mail:gayatrimrec@gmail.com
Swathi Gadde is persuing her masters degree program in software engi- neering in JNTUH University, India, PH-+917396061007. E-mail: ga- deswathi@gmail.com

Sukerthi S completed her masters degree program in computer scienceg in

JNTUH University, India, PH-+917306954541. E-mail: su- kee.sutraya@gmail.com
As an example, a query on PubMed for “cancer” returns more than two million citations. A more specific query,“breast cancer treat- ment,” returns 111,433 citations. Our running example query for “prothymosin,” a nucleoprotein gaining attention for its putative role in cancer development, returns 313 citations. The size of the query result makes it difficult for the user to find the citations that she is most interested in, and a large amount of effort is expended search- ing for these results. Many solutions have been proposed to address this problem—commonly referred to as information overload [1], [2], [3] ,[9], [16]. These approaches can be broadly classified into two classes: ranking and categorization—which can also be com- bined. Ranking presents the user with a list of results ordered by some metric of relevance [9] or by content similarity to a result or
a set of results [16]. In categorization [1], [2], [3], query results are
grouped based on hierarchies, keywords, tags, or attribute values. User studies have demonstrated the usefulness of categorization in finding relevant results of exploratory queries [12]. While ranked results are useful when the ranking function is aligned with user pre- ferences or the result list is small in size, categorization is generally employed by users when ranking fails or the query is too
“broad” [12].
BioNav belongs primarily to the categorization class, which is espe- cially suitable for this domain given the rich concept hierarchies (e.g., MeSH [19]) available for biomedical data. We augment our categorization techniques with simple ranking techniques. BioNav organizes the query results into a dynamic hierarchy, the navigation tree. Each concept (node) of the hierarchy has a descriptive label. The user then navigates this tree structure, in a top-down fashion, exploring the concepts of interest while ignoring the rest.

IJSER © 2012

http://www.ijser.org

The research paper published by IJSER journal is about Search Query Performance Improvement on Medica Data Bases 2

ISSN 2229-5518

Fig. 1. Static navigation on the MeSH concept hierarchy.1

An intuitive way to categorize the results of a query on PubMed is by using the MeSH static concept hierarchy [19], thus, utilizing the initiative of the US National Library of Medicine (NLM) to build and maintain such a comprehensive structure. Each citation in MEDLINE is associated with several MeSH concepts in two ways:
1) by being explicitly annotated with them, and 2) by mentioning
those in their text (see Section 7 for details). Since these associations are provided by PubMed, a relatively straightforward interface to navigate the query result would first attach the citations to the cor- responding MeSH concept nodes and then let the user navigate the navigation tree. Fig. 1 displays a snapshot of such an interface where shown next to each node label is the count of distinct citations in the subtree rooted at that node. A typical navigation starts by revealing the children of the root ranked by their citation count, and is contin- ued by the user expanding on or more of them, revealing their ranked children and so on, until she clicks on a concept and inspects the citations attached to it. A similar interface and navigation method is used by e-commerce sites, such as Amazon and eBay. For this ex- ample interaction, we assume that some of the citations the user is interested in are available on the three indicated concepts corres- ponding to three independent lines of research related to prothymo- sin, and therefore the user is interested in navigating to these con- cepts. These include, “Histones,” which play a role in gene regula- tion and are essential for virus replication and tumor growth, “Cell Growth Processes” and “Transcription, Genetic,” a key process for synthesis and replication of RNA and thus plays an important role in the duplication of cancer cells.
Note that the user is not aware that the relevant results are availa- ble specifically on these nodes—she is only interested in narrowing down the results, using a familiar concept hierarchy, instead of ex- amining all the results.
The above static—same for every query result—navigation method is problematic when the MeSH hierarchy (or one with similar prop- erties) is used for categorization for the following reasons:

The massive size of the MeSH hierarchy (over 48,000 con- cept nodes) makes it challenging for the users to effectively

navigate to the desired concepts and browse the associated
records. Even if we remove from the MeSH concept nodes with no citations attached to them, the 313 distinct query results for “prothymosin” are attached to 3,940 nodes, which is the actual size of the navigation tree in Fig.
1.Combined with the fact that the MeSH hierarchy is quite bushy on the upper levels, this means that the user has to inspect, for example, a total of 152 concept nodes before she reaches the indicated concept “Histones”; a number comparable to the distinct citation count in the query result. A common practice [28] for hierarchy navigation is to show only a subset of a node’s children, which would be appropriate if only few nodes contain many results. Unfor- tunately, this is not the case for the MeSH navigation tree; most of the 98 children of the root in Fig. 1 have many re- sults (the first three shown have 310, 217 and 193).
A substantial number of duplicate citations are introduced in the navigation tree of Fig. 1, since each one of the 313
distinct citations is associated with several concepts. Spe-
cifically, the total count of citations in Fig. 1 is 30,895. Na- turally, the user would like to know which concepts frag- ment the query result into subsets of citations with as few duplicate citations as possible across them. Currently, the only way to figure this out using the interface in Fig. 1 is to click on different concept nodes and inspect the attached ci- tations. As an example, the query results for “prothymosin” are related to three independent lines of research, represented by the three indicated concepts in Fig. 1, which are hard to locate. Among the total 139 citations attached to the three indicated concept nodes, only 20 of them are dup- licates.
BioNav introduces a dynamic navigation method that depends on the particular query result at hand and is demonstrated in Fig. 2. The query results are attached to the corresponding MeSH concept nodes as in Fig. 1, but then the navigation proceeds differently. The key action on the interface is the expansion of a node that selectively reveals a ranked list of descendant (not necessarily children) con- cepts, instead of simply showing all its children. Fig. 2a, for exam- ple, shows the initial expansion of the root node where only eight (highlighted) descendants are revealed compared to 98 children shown in Fig. 1. The concepts are ranked by their relevance to the user query and the number of them revealed depends on the charac- teristics of the query results. Next, assuming the user is interested in the “Amino Acids ...” node and judging that the 310 attached cita- tions is still a big number, she expands it by clicking on the “>> >” hyperlink next to it in Fig. 2b. The user inspects the six concepts revealed and decides that she is not interested in any of them. Hence, she expands the “Amino Acids ...” node one more time in Fig. 2c, revealing four additional concepts. Note that “Nucleoproteins” is an example of a descendant node being revealed, since its parent node “Proteins” is not

IJSER © 2012

http://www.ijser.org

The research paper published by IJSER journal is about Search Query Performance Improvement on Medica Data Bases 3

ISSN 2229-5518


node expansions, compared to 152 concepts, also after four expan- sions, with the static navigation method of Fig. 1. For each expan- sion, the displayed descendant concepts are chosen in a way that the expected navigation cost is minimized, based on an intuitive naviga- tion cost model we present in Section 3. The cost model estimates the exploration probability for a node based on its selectivity, that is, the ratio of attached citations before and after the query. The naviga- tion cost for a concept node is also proportional to the density of the navigation subtree rooted at this node in terms of citation count. In- tuitively, the selection is done such that every expansion reduces

maximally the expected remaining navigation cost. For example, the reason that “Proteins” is not displayed in Fig. 2 is that it is too gener- al given the query results and the original distribution of citations in the PubMed database (details in Sections 3 and 4), and hence dis- playing it would lead to an expected increase in the user navigation cost, based on the user navigation cost model. In addition to the stat- ic hierarchy navigation works mentioned above, there are works on dynamic categorization of query results (e.g., the Clusty search en- gine [29], or [2], [3]), which create unsupervised query-dependent results clusters, but do not study how the clusters should be navi- gated. BioNav is distinct since it offers dynamic navigation on a predefined hierarchy, as is the MeSH concept hierarchy. Another difference is that BioNav uses a navigation cost model to minimize the navigation cost.
We make the following contributions:
1. A comprehensive framework for navigating large query re- sults from PubMed using MeSH, an extensive concept hierarchy used for indexing citations in MEDLINE (Sec- tion 2).
2. A formal cost model for measuring the navigation cost in- curred by the user (Sections 3 and 4).
3. A complexity result proving that expanding the tree in a
way that minimizes the user’s navigation cost is an NP- complete problem (Section 5).
4. An efficient heuristic and a feasible optimal algorithm for minimizing the navigation cost (Section 5).
5. Experimental results validating the effectiveness of the BioNav system when compared to state-of-theart categori- zation systems (Section 8).
6. An online version of the BioNav system is available at
http://db