    Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It is a fairly recent topic in computer science but applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition.

        Data mining
            Example
            Use of the term
            Related terms
            Data dredging
            Privacy concerns
            Combinatorial game data mining
            Notable uses of data mining
            See also
                Structured Data Mining
                Unstructured Data Mining
                    Supervised learning
                    Unsupervised learning
                Dimensionality reduction
                Application areas
                Software

    Example

    A simple example of data mining is its use in retail sales, where it is often called market basket analysis. If a clothing store records the purchases of its customers, a data mining system could identify those customers who favour silk shirts over cotton ones.

    Another example is that of a supermarket chain that, through analysis of transactions over a long period of time, found that beer and diapers were often bought together. Although explaining this relationship may be difficult, taking advantage of it is easier, for example by placing the high-profit diapers close to the high-profit beers in the store. (This example is questioned in "Beer and Nappies -- A Data Mining Urban Legend".)
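
    As a rough illustration of how such a co-purchase pattern could be quantified, the sketch below computes the support and confidence of an association rule. The transaction list and the function names (support, confidence, frequent_pairs) are hypothetical and chosen only for this example; they do not come from any particular data mining product.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
TRANSACTIONS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "bread"},
    {"diapers", "milk", "beer"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    both = count(set(antecedent) | set(consequent), transactions)
    return both / base


def frequent_pairs(transactions, min_support):
    """Map each item pair with support >= min_support to its support."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


if __name__ == "__main__":
    print(support({"beer", "diapers"}, TRANSACTIONS))
    print(confidence({"beer"}, {"diapers"}, TRANSACTIONS))
    print(frequent_pairs(TRANSACTIONS, 0.5))
```

    In this toy data, three of the four baskets containing beer also contain diapers, so the rule "beer implies diapers" has a confidence of 0.75. Counting every pair this way does not scale to real catalogues; practical systems prune the search, which this sketch does not attempt.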

    Use of the term

    Data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases".

    It involves sorting through large amounts of data and picking out relevant information.

    It is usually used by businesses and other organizations, but is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimentation.

    Metadata, or data about a given set of data, are often expressed in a condensed format that facilitates data mining. Common examples include executive summaries and scientific abstracts.

    Although data mining is a relatively new term, the technology is not. Companies have long used powerful computers to sift through volumes of data, such as supermarket scanner data, and produce market research reports. Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of analysis.

    Related terms
    Although the term "data mining" is usually used in relation to the analysis of data, it is, like artificial intelligence, an umbrella term with varied meanings in a wide range of contexts. Unlike data analysis, data mining is not based or focused on an existing model that is to be tested or whose parameters are to be optimized.

    In statistical analyses where there is no underlying theoretical model, data mining is often approximated via stepwise regression methods, in which the space of 2^k possible relationships between a single outcome variable and k potential explanatory variables is searched heuristically. With the advent of parallel computing, it became possible (when k is less than approximately 40) to examine all 2^k models. This procedure is called all-subsets or exhaustive regression. Some of the first applications of exhaustive regression involved the study of plant data.
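
    To make the size of the model space concrete, the sketch below enumerates every subset of k explanatory variables and picks the one with the lowest score. It covers only the enumeration step: the variable names and toy_score are hypothetical stand-ins, and a real exhaustive regression would fit a model for each subset and score it with a fit criterion instead.

```python
from itertools import combinations


def all_subsets(variables):
    """Yield every subset of the explanatory variables, from empty to full."""
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


def best_subset(variables, score):
    """Score all 2^k subsets and return the one with the lowest score."""
    return min(all_subsets(variables), key=score)


# Hypothetical variables and a toy scoring rule, for illustration only.
VARIABLES = ["rainfall", "soil_ph", "sunlight", "altitude"]
RELEVANT = {"rainfall", "sunlight"}  # assumed "true" predictors


def toy_score(subset):
    """Stand-in for a fit criterion: penalize missed predictors and size."""
    missed = len(RELEVANT - set(subset))
    return missed + 0.1 * len(subset)


if __name__ == "__main__":
    print(len(list(all_subsets(VARIABLES))))  # 2^4 candidate models
    print(best_subset(VARIABLES, toy_score))
```

    Because the number of candidates doubles with each added variable, exhaustive search becomes impractical quickly, which is why stepwise methods search only part of the space.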

    Data dredging

    Data dredging and data fishing are terms used to criticize someone's data mining efforts when the patterns or causal relationships discovered are felt to be unfounded.

    Data dredging is the practice of scanning the data for any relationship and then, when one is found, coming up with an interesting explanation for it. The conclusions may be suspect because data sets with large numbers of variables will contain some "interesting" relationships purely by chance. Fred Schwed said:
    "There have always been a considerable number of people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it."


    Nevertheless, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has been shown to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels.

    Some exploratory data work is always required in any applied statistical analysis to get a feel for the data, so sometimes the line between good statistical practice and data dredging is less than clear.

    Most data mining efforts are focused on developing highly detailed models of some large data set. Other researchers have described an alternate method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent relevant data.

    When a data set contains a large number of variables, the significance threshold should be tightened in proportion to the number of patterns tested. For example, if we test 100 random patterns at the 0.01 significance level, we expect about one of them to appear "interesting" purely by chance.
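
    The arithmetic behind this warning can be sketched as follows. The function names are hypothetical, the probability calculation assumes the tests are independent, and the Bonferroni correction shown is one common way of tightening the threshold rather than a method named in this article.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of chance 'discoveries' when no real pattern exists."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, family_alpha):
    """Per-test threshold that keeps the overall error rate near family_alpha."""
    return family_alpha / num_tests


if __name__ == "__main__":
    print(expected_false_positives(100, 0.01))
    print(prob_at_least_one_false_positive(100, 0.01))
    print(bonferroni_threshold(100, 0.01))
```

    With 100 tests at the 0.01 level, the expected number of spurious hits is about one, and the chance of seeing at least one is roughly 63 percent under the independence assumption.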

    Cross validation is a common approach to evaluating the fitness of a model generated via data mining, where the data is divided into a training subset and a test subset to respectively build and then test the model. Common cross validation techniques include the holdout method, k-fold cross validation, and the leave-one-out method.
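
    A minimal sketch of k-fold splitting is shown below, assuming the samples are addressed by index and are already in random order. The function name k_fold_splits is hypothetical; setting k equal to the number of samples gives the leave-one-out method.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Folds are contiguous and differ in size by at most one sample.
    No shuffling is done here; callers are assumed to shuffle beforehand.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    remainder = n_samples % k
    fold_sizes = [n_samples // k + (1 if i < remainder else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for train, test in k_fold_splits(10, 3):
        # A model would be built on `train` and evaluated on `test` here.
        print(len(train), test)
```

    Each sample lands in exactly one test fold, so every observation is used for testing once and for training k - 1 times.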

    Privacy concerns
    There are also privacy concerns associated with data mining, specifically regarding the source of the data analyzed. For example, if an employer has access to medical records, it may screen out people who have diabetes or have had a heart attack. Screening out such employees will cut insurance costs, but it creates ethical and legal problems.

    Data mining government or commercial data sets for national security or law enforcement purposes has also raised privacy concerns.

    There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs exhibiting harmful interactions. Since any particular combination may occur in only 1 out of 1000 people, a great deal of data would need to be examined to discover such an interaction. A project involving pharmacies could reduce the number of drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.

    Essentially, data mining gives information that would not be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.

    Combinatorial game data mining

    Since the early 1990s, oracles (also called tablebases) have become available for certain combinatorial games: for example, 3x3-chess from any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex. This has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. The pattern recognition involved is at too high a level of abstraction for known statistical pattern recognition algorithms, or any other algorithmic approach, to be applied; at least, no one knows how to do it yet (as of January 2005). The method used instead is the scientific method in full: extensive experimentation with the tablebases, combined with intensive study of tablebase answers to well-designed problems and with knowledge of prior art (that is, pre-tablebase knowledge), leading to flashes of insight. Berlekamp in dots-and-boxes and other games, and John Nunn in chess endgames, are notable examples of people doing this work, though they were not and are not involved in tablebase generation.

    Notable uses of data mining
      Data mining has been cited as the method by which the U.S. Army unit Able Danger supposedly identified the 9/11 attack leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an al Qaeda cell operating in the U.S. more than a year before the attack.
      It has been suggested that both the CIA and its Canadian counterpart, CSIS, have also put this method of interpreting data to work, although they have not said how.

    See also

    Structured Data Mining

    Unstructured Data Mining


    Supervised learning

    Unsupervised learning

    Dimensionality reduction

    Application areas

    Software
      Essbase has data mining capabilities, including PMML support and a Data Mining Wizard.
      Point Horizon is an integrated data exploration, analysis, visualization and forecasting application with an emphasis on dynamical methods.
      ROOT, a package originally created for physics data analysis, can also be used for data mining.
      Talend_Open_Studio (www.talend.com) is an ETL tool that uses an Eclipse Rich Client Platform (RCP) as its GUI. The GUI is used to create graphical transformations and mappings, which ultimately generate underlying Perl code. The platform is distributed under the terms of the GPL V2.
      Teradata contains data mining tools for data exploration, data preprocessing, analytic modelling, scoring and deployment within a database.
      Weka is freely available open-source data mining software written in Java, featuring numerous clustering, classification, regression, and meta-learning operators.
      YALE is an integrated, freely available open-source software environment for data exploration, data preprocessing, intelligent data analysis, knowledge discovery, data mining, machine learning, prediction, visualization, etc. It is written in Java, has more than 350 data mining operators, fully integrates Weka, and features a graphical user interface as well as an XML-based scripting language for data mining.
      XmlMiner is a class library, toolkit and free web service specialising in data, text and structure mining of XML data sources, and in handling semi-structured data. Both the scripting language and the model representation language, Metarule, are XML based; Metarule represents knowledge as a collection of fuzzy logic production rules.
    Scientus.org Dictionary (Yet Another Wiki) RC : 1.39
    This article is licensed under the GNU Free Documentation License [copyleft]. It uses material from the Wikipedia article "Data mining".