|
A data warehouse is a computer system designed for archiving and analyzing an organization's historical data, such as sales, salaries, or other information from day-to-day operations. Normally, an organization summarizes and copies information from its operational systems (such as sales and human resources) to the data warehouse on a regular schedule, such as every night or every weekend; after that, management can perform complex queries and analysis (such as data mining) on the information without slowing down the operational systems. Definition A data warehouse is the main repository of the organization's historical data, or its corporate memory: to find out what day of the week they sold the most widgets in May 1992, or how employee sick leave the week before Christmas differed between California and Quebec from 2001-2005, an organization would use the information that's stored in its data warehouse. In other words, the data warehouse contains the raw material for management's decision support system. While operational systems are optimized for simplicity and speed of modification (online transaction processing, or OLTP) through heavy use of database normalization and an entity-relationship model, the data warehouse is optimized for simplicity and speed of queries (online analytical processing, or OLAP) through heavy use of denormalization and a dimension-based model. The data warehouse also normally stores information at a coarser grain than the operational systems: for example, if the operational systems contain a record for every sale, the data warehouse might simply contain the total number of sales for each product at each store. More formally, Bill Inmon (one of the earliest and most influential practitioners) defined a data warehouse as follows: A data mart is exactly like a data warehouse technically, but it serves a different business purpose: it either holds information for only part of a company (such as a division), or it holds a small selection of information for the entire company (to support extra analysis without slowing down the main system). In either case, however, it is not the organization's official repository, the way a data warehouse is. History of data warehousing Data Warehouses became a distinct type of computer database during the late 1980's and early 1990's. They developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons: As a result, separate computer databases began to be built that were specifically designed to support management information and analysis purposes. These data warehouses were able to bring in data from a range of different data sources, such as mainframe computers, minicomputers, as well as personal computers and office automation software such as spreadsheet, and integrate this information in a single place. This capability, coupled with user-friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer system. As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages: Components of a data warehouse
Data Sources Data sources refers to any electronic repository of information that contains data of interest for management use or analytics. This definition covers mainframe databases (e.g. IBM DB2, ISAM, Adabas, Teradata, etc.), client-server databases (e.g. Teradata, IBM DB2, Oracle database, Informix, Microsoft SQL Server, etc.), PC databases (e.g. Microsoft Access, Alpha Five), spreadsheets (e.g. Microsoft Excel) and any other electronic store of data. Data needs to be passed from these systems to the data warehouse either on a transaction-by-transaction basis for real-time data warehouses or on a regular cycle (e.g. daily or weekly) for offline data warehouses. Data Transformation The Data Transformation layer receives data from the data sources, cleans and standardises it, and loads it into the data repository. This is often called "staging" data as data often passes through a temporary database while it is being transformed. This activity of transforming data can be performed either by manually created code or a specific type of software could be used called an ETL tool. Regardless of the nature of the software used, the following types of activities occur during data transformation: Data Warehouse The data warehouse need not be a relational database, as it must be organised to hold information in a structure that best supports not only query and reporting, but also advanced analysis techniques, like data mining. Most data warehouses hold information for at least 1 year and sometimes can reach half century, depending on the business/operations data retention requirement. As a result these databases can become very large. Reporting The data in the data warehouse must be available to the organisation's staff if the data warehouse is to be useful. There are a very large number of software applications that perform this function, or reporting can be custom-developed. Examples of types of reporting tools include:
Metadata Metadata, or "data about data", is used not only to inform operators and users of the data warehouse about its status and the information held within the data warehouse, but also as a means of integration of incoming data and a tool to update and refine the underlying DW model. Examples of data warehouse metadata include table and column names, their detailed descriptions, their connection to business meaningful names, the most recent data load date, the business meaning of a data item and the number of users that are logged in currently. Operations Data warehouse operations comprises of the processes of loading, manipulating and extracting data from the data warehouse. Operations also cover user management, security, capacity management and related functions Optional Components In addition, the following components also exist in some data warehouses: Different methods of storing data in a data warehouse All data warehouses store their data grouped together by subject areas that reflect the general usage of the data (Customer, Product, Finance etc.). The general principle used in the majority of data warehouses is that data is stored at its most elemental level for use in reporting and information analysis. Within this generic intent, there are two primary approaches to organising the data in a data warehouse. The first is using a "dimensional" approach. In this style, information is stored as "facts" which are numeric or text data that capture specific data about a single transaction or event, and "dimensions" which contain reference information that allows each transaction or event to be classified in various ways. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and sales person. The main advantages of a dimensional approach is that the Data Warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-processed into the dimensional form, the Data Warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business. The second approach uses database normalisation. In this style, the data in the data warehouse is stored in third normal form. The main advantage of this approach is that it is quite straightforward to add new information into the database, whilst the primary disadvantage of this approach is that it can be quite slow to produce information and reports. Advantages of using data warehouse There are many advantages to using a data warehouse, some of them are: Concerns in using data warehouse See also | |||||||||
|
| ||||||||||
![]() |
|
| |