Navigation
  • Home
  • Recent
  • Most Active
  • Popular
  • Blog
  • Credits
  • RSS
  •   Interaction
  • Register
  • Statistics
  •   Help
  • Suggestions
  • Contact Us
  • How to Edit
  • Help



  • [Edit]



    A stemmer is a computer program or algorithm which determines a stem form of a given inflected (or, sometimes, derived) word form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.

    A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem".

    English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language becomes more complex. For example, an Italian stemmer is more complex than an English one (because of more possible verb inflections), a Russian one is more complex (more possible noun declensions), a Hebrew one is even more complex (due to non-catenative morphology and a writing system without vowels), and so on. On the other hand, stemmers for true isolating languages such as Vietnamese can be even simpler than those for English.

    Stemmers are common elements in query systems such as Web search engines, since a user who runs a query on "daffodils" would probably also be interested in documents that contain the word "daffodil" (without the s). The effectiveness of stemming for English query systems is questionable, however. An alternative approach, based on searching for n-grams rather than stems, may be used instead.

    A more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word, and applying different normalisation rules for each part of speech.

    The first ever published stemmer was written by Julie Beth Lovins: Lovins JB (1968) Development of a stemming algorithm, Mechanical Translation and Computational Linguistics, 11: 22-31. This paper was remarkable for its early date, and had great influence on later work in this area.

    A later stemmer was written by Martin Porter, and published in the July 1980 issue of the journal Program. This stemmer became very widely used, and became the de-facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.

    Many implementations of this algorithm were written and freely distributed. Unfortunately, many of these implementations contained subtle flaws, and as a result systems using these stemmers performed less well than they ought. To eliminate this source of error, around the year 2000 Martin Porter released an official free-software implementation of the algorithm. Over the next few years, he extended this work by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.




        Stemmer
            Further reading
            See also

    top

    Further reading

      Dawson, J.L. (1974) Suffix Removal for Word Conflation, Bulletin of the Association for Literary and Linguistic Computing, 2(3): 33-46
      Frakes, W.B. (1984) Term Conflation for Information Retrieval, Cambridge University Press
      Frakes, W.B. & Fox, C.J. (2003) Strength and Similarity of Affix Removal Stemming Algorithms, SIGIR Forum, 37: 26-30
      Hafer, M.A. & Weiss, S.F., (1974) Word segmentation by letter successor varieties, Information Processing & Management 10 (11/12), 371-386.
      Harman, D., (1991) How effective is suffixing? Journal of the American Society for Information Science 42 (1), 7-15.
      Hull, D.A. (1996) Stemming Algorithms – A Case Study for Detailed Evaluation, JASIS, 47(1): 70-84
      Hull, D.A. & Grefenstette, G. (1996) A Detailed Analysis of English Stemming Algorithms, Xerox Technical Report
      Kraaij, W. & Pohlmann, R., 1996: Viewing stemming as recall enhancement, in H-P. Frei, D. Harman, P. Schauble & R. Wilkinson (eds.), Proceedings of the 17th ACM SIGIR conference held at Zurich, August 18-22, pp.40-48.
      Krovetz, R. (1993) Viewing Morphology as an Inference Process, In Proceedings of ACM-SIGIR93, pp191-203
      Lennon, M., Pierce, D.S., Tarry, B.D. & Willett, P. (1981) An Evaluation of some Conflation Algorithms for Information Retrieval, Journal of Information Science, 3: 177-183
      Lovins, J. (1968) Development of a Stemming Algorithm, Mechanical Translation and Computational Linguistics, 11: 22-31
      Lovins, J. (1971) Error Evaluation for Stemming Algorithms as Clustering Algorithms, JASIS, 22: 28-40
      Marie-Claire, J. and Smith, D. (2005) Conservative stemming for search and indexing

      Paice, C.D. (1990) Another Stemmer, SIGIR Forum, 24: 56-61
      Paice, C.D. (1996) Method for Evaluation of Stemming Algorithms based on Error Counting, JASIS, 47(8): 632-649
      Popovic, M. and Willett, P., (1992) The effectiveness of stemmng for natural language access to Slovene textual data, Journal of the American Society for Information Science, 43(5), 384-390.
      Porter, M.F. (1980) An Algorithm for Suffix Stripping, Program, 14(3): 130-137
      Savoy, J., (1993) Stemming of French words based on grammatical categories Journal of the American Society for Information Science, 44(1), 1-9.
      Ulmschneider, J.E. & Doszkocs, (1983) A practical stemming algorithm for online search assistance, Online Review, 7(4), 301-318.
      Xu, J. & Croft, W.B., (1998) Corpus-based stemming using coocurrence of word variants, ACM Transactions on Information Systems, 16(1), 61-81.

    top

    See also
     
    Search more:
     

       
    Source Privacy License Download Contact Us Atlas
    Scientus.org Dictionary (Yet Another Wiki) RC : 1.39
    This article is licensed under the GNU Free Documentation License [copyleft]. It uses material from the Wikipedia article "Stemmer". link