    Information retrieval (IR) is the science of searching for information in documents, for documents themselves, for metadata that describes documents, or within databases (whether stand-alone relational databases or hypertext-networked databases such as the Internet or intranets) for text, sound, images or data. Data retrieval, document retrieval, information retrieval and text retrieval are commonly confused, yet each has its own body of literature, theory, praxis and technologies. Like most nascent fields, IR is interdisciplinary: it draws on computer science, library science, information science, cognitive psychology, linguistics, and statistics.
    Automated IR systems are used to reduce information overload. Many universities and public libraries use IR systems to provide access to books, journals, and other documents. IR systems are usually described in terms of two notions, the object and the query. A query is a formal statement of an information need that the user puts to an IR system. An object is an entity that stores information in a database. User queries are matched against the documents stored in a database; a document is therefore a data object. Often the documents themselves are not stored directly in the IR system but are instead represented in it by document surrogates.
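
    The idea of matching a query against document surrogates rather than against the documents themselves can be sketched roughly as follows. Everything here is an illustrative assumption: the Surrogate class, the make_surrogate and match helpers, the sample catalogue, and the choice of plain conjunctive keyword matching are not taken from any particular IR system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surrogate:
    """Hypothetical stand-in for a document: an id, a title and a bag of index terms."""
    doc_id: str
    title: str
    terms: frozenset


def make_surrogate(doc_id, title, text):
    # Assumption: index terms are simply the lower-cased words of title and text.
    words = (title + " " + text).lower().split()
    return Surrogate(doc_id, title, frozenset(words))


def match(query, surrogates):
    """Return ids of surrogates containing every query term (conjunctive matching)."""
    wanted = set(query.lower().split())
    return [s.doc_id for s in surrogates if wanted <= s.terms]


# Hypothetical two-item catalogue; the full documents live elsewhere.
catalogue = [
    make_surrogate("b1", "Modern Information Retrieval", "indexing ranking evaluation"),
    make_surrogate("b2", "Library Cataloguing", "indexing classification shelving"),
]

print(match("indexing retrieval", catalogue))
print(match("indexing", catalogue))
```

    Note that in this sketch an empty query matches every surrogate, because the empty set is a subset of any term set; a real system would likely treat that case separately.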

    In 1992 the US Department of Defense, together with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. Its aim was to support the information retrieval community by supplying the infrastructure needed for large-scale evaluation of text retrieval methodologies.

    Web search engines such as Google and Lycos are the most visible IR applications.


        Information retrieval
            Performance measures
                Precision
                Recall
                Fall-Out
                F-measure
                Mean average precision
            Model types
                First dimension: mathematical basis
                Second dimension: properties of the model
            Open source information retrieval systems
            Other retrieval tools
            Major Information retrieval research groups
            Major figures in information retrieval
            Other figures associated to information retrieval
            ACM SIGIR Gerard Salton Award
            See also

    Performance measures

    There are various ways to measure how well the retrieved information matches the intended information.
    The formulas for precision, recall and fall-out are translated from the German Wikipedia article "Recall und Precision".

    Precision

    The proportion of retrieved documents that are relevant:

    \[ \mbox{precision} = \frac{|\{\mbox{relevant documents}\} \cap \{\mbox{retrieved documents}\}|}{|\{\mbox{retrieved documents}\}|} \]


    In binary classification, precision is analogous to positive predictive value.
    Precision can also be evaluated at a given cut-off rank, denoted P@n, instead of all retrieved documents.

    Note that the meaning and usage of "precision" in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and technology.
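
    As a minimal sketch of the definition above, precision and P@n could be computed like this. The function names, the document ids and the relevance judgments are hypothetical, and two conventions are assumed: precision is reported as 0 when nothing is retrieved, and P@n divides by n even when fewer than n documents were returned.

```python
def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # assumption: report 0 when nothing was retrieved
    return len(retrieved & relevant) / len(retrieved)


def precision_at(ranking, relevant, n):
    """P@n: precision computed over the first n entries of a ranked list."""
    if n <= 0:
        raise ValueError("n must be positive")
    relevant = set(relevant)
    hits = sum(1 for doc in ranking[:n] if doc in relevant)
    return hits / n  # assumption: divide by n even if fewer than n were retrieved


ranking = ["d3", "d1", "d7", "d2", "d9"]   # hypothetical ranked result list
relevant = {"d1", "d2", "d5"}              # hypothetical relevance judgments

print(precision(ranking, relevant))        # 2 of the 5 retrieved are relevant
print(precision_at(ranking, relevant, 2))  # 1 of the top 2 is relevant
```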

    Recall

    The proportion of relevant documents that are retrieved, out of all relevant documents available:

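    \[ \mbox{recall} = \frac{|\{\mbox{relevant documents}\} \cap \{\mbox{retrieved documents}\}|}{|\{\mbox{relevant documents}\}|} \]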


    In binary classification, recall is called sensitivity.
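
    A small sketch of recall under the same set-based reading might look as follows. The recall function name and the sample data are hypothetical, and returning 0 when there are no relevant documents is an assumed convention.

```python
def recall(retrieved, relevant):
    """Share of all relevant documents that were actually retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # assumption: report 0 when no document is relevant
    return len(retrieved & relevant) / len(relevant)


retrieved = ["d3", "d1", "d7", "d2", "d9"]  # hypothetical result set
relevant = {"d1", "d2", "d5"}               # hypothetical relevance judgments

print(recall(retrieved, relevant))          # 2 of the 3 relevant documents were found
```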

    Fall-Out
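    The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

    \[ \mbox{fall-out} = \frac{|\{\mbox{non-relevant documents}\} \cap \{\mbox{retrieved documents}\}|}{|\{\mbox{non-relevant documents}\}|} \]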
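
    Fall-out needs the whole collection, since the non-relevant documents are everything that is not relevant. The sketch below assumes the usual reading of fall-out as the share of non-relevant documents that end up in the retrieved set; the fall_out function, the ten-document collection and the judgments are hypothetical.

```python
def fall_out(retrieved, relevant, collection):
    """Share of the non-relevant documents in the collection that were retrieved."""
    non_relevant = set(collection) - set(relevant)
    if not non_relevant:
        return 0.0  # assumption: report 0 when every document is relevant
    return len(set(retrieved) & non_relevant) / len(non_relevant)


collection = [f"d{i}" for i in range(1, 11)]  # hypothetical collection d1..d10
retrieved = ["d3", "d1", "d7", "d2", "d9"]    # hypothetical result set
relevant = {"d1", "d2", "d5"}                 # hypothetical relevance judgments

print(fall_out(retrieved, relevant, collection))  # 3 of the 7 non-relevant were retrieved
```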


    F-measure

    The traditional F-measure, or balanced F-score, is the harmonic mean of precision and recall:

    \[ F = 2 \times \mathrm{precision} \times \mathrm{recall} / (\mathrm{precision} + \mathrm{recall}) \]


    This is also known as the F_1 measure, because recall and precision are evenly weighted.

    The general formula for non-negative real α is:
    \[ F_\alpha = (1 + \alpha) \times \mathrm{precision} \times \mathrm{recall} / (\alpha \times \mathrm{precision} + \mathrm{recall}) \]


    Two other commonly used F measures are the F_{0.5} measure, which weights precision twice as much as recall, and the F_2 measure, which weights recall twice as much as precision.
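
    The general formula can be tried out with a short sketch. It implements the α form exactly as written in this section; the f_measure name, the sample precision and recall values, and the choice to return 0 when the denominator vanishes are assumptions made for illustration.

```python
def f_measure(precision, recall, alpha=1.0):
    """F_alpha = (1 + alpha) * P * R / (alpha * P + R); alpha = 1 gives the balanced F-score."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    denominator = alpha * precision + recall
    if denominator == 0:
        return 0.0  # assumption: report 0 when the formula is undefined
    return (1 + alpha) * precision * recall / denominator


p, r = 0.5, 0.25                  # hypothetical precision and recall values

print(f_measure(p, r))            # balanced score: 2 * P * R / (P + R)
print(f_measure(p, r, alpha=0))   # alpha = 0 reduces to precision
```

    With this form, α = 0 yields precision itself, and the score moves towards recall as α grows.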

    Mean average precision

    Over a set of queries, find the mean of the average precisions, where Average Precision is the average of the precision after each relevant document is retrieved.

    Where r is the rank, N the number retrieved, rel() a binary function on the relevance of a given rank, and P() precision at a given cut-off rank:

    \[ \operatorname{AveP} = \frac{\sum_{r=1}^{N} (P(r) \times \mathrm{rel}(r))}{\mbox{number of relevant documents}} \]


    This method emphasizes returning more relevant documents earlier.
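
    The computation can be sketched as below, with P(r) taken as the precision over the first r ranks and rel(r) as 1 when the document at rank r is relevant. The function names and the two sample queries are hypothetical, and each ranking is assumed to contain no duplicate documents.

```python
def average_precision(ranking, relevant):
    """Sum of P(r) * rel(r) over the ranks, divided by the number of relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumption: report 0 when no document is relevant
    hits = 0
    total = 0.0
    for r, doc in enumerate(ranking, start=1):
        if doc in relevant:      # rel(r) = 1
            hits += 1
            total += hits / r    # P(r): precision over the first r ranks
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precisions over (ranking, relevant) pairs, one pair per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)


# Two hypothetical queries with their rankings and relevance judgments.
runs = [
    (["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d5"}),
    (["a", "b"], {"a"}),
]

print(average_precision(*runs[0]))   # (1/2 + 2/4) / 3
print(mean_average_precision(runs))
```

    Swapping "d3" and "d1" in the first ranking raises its average precision, which illustrates why the measure rewards relevant documents that appear early.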

    Model types
    [Figure: categorization of IR models (translated from the German entry, http://de.wikipedia.org/wiki/Informationsrückgewinnung#Klassifikation_von_Modellen_zur_Repr.C3.A4sentation_nat.C3.BCrlichsprachlicher_Dokumente; original source: Dominik Kuropka, http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id=)]
    For successful IR, it is necessary to represent the documents in some way. There are a number of models for this purpose. They can be categorized along two dimensions, as shown in the figure: the mathematical basis and the properties of the model.

    First dimension: mathematical basis
      Set-theoretic Models represent documents by sets. Similarities are usually derived from set-theoretic operations on those sets.

      Algebraic Models usually represent documents and queries as vectors, matrices or tuples. A finite number of algebraic operations transforms these into a one-dimensional similarity measure.
        Topic-based vector space model
        Enhanced topic-based vector space model

      Probabilistic Models treat the process of document retrieval as a multistage random experiment. Similarities are thus represented as probabilities. Probabilistic theorems such as Bayes' theorem are often used in these models.
        Uncertain inference
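
    To make the first two categories concrete, the sketch below scores one document against one query in two ways: with a set operation (the Jaccard coefficient) and with an algebraic operation on term-count vectors (cosine similarity). Both measures are chosen here only as familiar examples of each category and are not named in this article; the whitespace tokenizer, the raw term counts used as vector weights and the sample strings are likewise assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lower-cased whitespace tokens stand in for index terms.
    return text.lower().split()


def jaccard(doc, query):
    """Set-theoretic view: size of the intersection over size of the union of term sets."""
    a, b = set(tokenize(doc)), set(tokenize(query))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(doc, query):
    """Algebraic view: cosine of the angle between two term-count vectors."""
    a, b = Counter(tokenize(doc)), Counter(tokenize(query))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc = "information retrieval systems retrieve information"   # hypothetical document
query = "information retrieval"                              # hypothetical query

print(jaccard(doc, query))   # 2 shared terms out of 4 distinct terms
print(cosine(doc, query))    # 3 / sqrt(7 * 2)
```

    Both functions reduce a document and a query to a single number, which is the one-dimensional similarity measure the text refers to; only the underlying representation differs.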

    Second dimension: properties of the model
      Models without term interdependencies treat different terms/words as independent. In vector space models this is usually expressed by the orthogonality assumption for term vectors, and in probabilistic models by an independence assumption for term variables.

      Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of interdependency between two terms is defined by the model itself. It is usually derived directly or indirectly (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.

      Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not specify how the interdependency between two terms is defined. They rely on an external source, for example a human or a sophisticated algorithm, for the degree of interdependency between two terms.

    Open source information retrieval systems

      ht://dig Open source web crawling software
      Egothor high-performance, full-featured text search engine written entirely in Java
      Lemur Language Modelling IR Toolkit
      Terrier Information Retrieval Platform
      Wumpus multi-user information retrieval system
      Xapian Open source IR platform based on Muscat
      Zebra GPL structured text/XML/MARC boolean search IR engine supporting Z39.50 and Web Services
      Zettair Compact and fast search engine written in C, able to handle large amounts of text

    Other retrieval tools
      iHOP Information retrieval system for the biomedical domain
      EBIMed Information retrieval (and extraction) system over Medline
      GalaTex XQuery Full-Text Search (XML query text search)
      Sphinx Free open-source SQL full-text search engine

    Major Information retrieval research groups

    Major figures in information retrieval


    Other figures associated to information retrieval

    Awards in this field: Tony Kent Strix award.

    ACM SIGIR Gerard Salton Award
    1983 - Gerard Salton, Cornell University
    "About the future of automatic information retrieval"

    1988 - Karen Sparck Jones, University of Cambridge
    "A look back and a look forward"

    1991 - Cyril Cleverdon, Cranfield Institute of Technology
    "The significance of the Cranfield tests on index languages"

    1994 - William S. Cooper, University of California, Berkeley
    "The formalism of probability theory in IR: a foundation or an encumbrance?"

    1997 - Tefko Saracevic, Rutgers University
    "Users lost: reflections on the past, future, and limits of information science"

    2000 - Stephen E. Robertson, City University, London
    "On theoretical argument in information retrieval"

    2003 - W. Bruce Croft, University of Massachusetts, Amherst
    "Information retrieval and computer science: an evolving relationship"

    2006 - C. J. van Rijsbergen, University of Glasgow, UK
    "Quantum haystacks"


    See also
     
    This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Information retrieval".