An Analysis of Data Quality Dimensions
Vimukthi Jayawardene, Marta Indulska, Shazia Sadiq
The University of Queensland
Data quality (DQ) has been studied in significant depth over the last two decades and has received attention from both the academic and practitioner communities. Over that period, a large number of data quality dimensions have been identified in the course of research and practice. While it is important to embrace the diversity of views of data quality, it is equally important for the data quality research and practitioner community to be united in the consistent interpretation of this foundational concept. In this paper, we provide a step towards this consistent interpretation. Through a systematic review of research and practitioner literature, we identify previously published data quality dimensions and embark on the analysis and consolidation of the overlapping and inconsistent definitions. We stipulate that the shared understanding facilitated by this consolidation is a necessary prelude to generic and declarative forms of requirements modeling for data quality.
Dimension | Characteristic | Description | References |
Completeness | Completeness of mandatory attributes | The attributes which are necessary for a complete representation of a real world entity must contain values and cannot be null | [1-5] |
Completeness | Completeness of optional attributes | Optional attributes should not contain invalid null values | [2, 6] |
Completeness | Completeness of records | Every real world entity instance that is relevant for the organization can be found in the data. | [1, 7-10] |
Completeness | Data volume | The volume of data is neither deficient nor overwhelming to perform an intended task | [11, 12] |
Availability & Accessibility | Continuity of Data Access | The technology infrastructure should not prohibit the speed and continuity of access to the data for the users. | [9, 11, 13] |
Availability & Accessibility | Data maintainability | Data should be accessible to perform necessary updates and maintenance operations in its entire lifecycle. | [11, 12] |
Availability & Accessibility | Data awareness | Data users should be aware of all available data and its location. | [8] |
Availability & Accessibility | Ease of data access | Data should be easily accessible in a form that is suitable for its intended use. | [14-16] |
Availability & Accessibility | Data Punctuality | Data should be available at the time of its intended use. | [1, 4, 11, 16] |
Availability & Accessibility | Data access control | The access to the data should be controlled to ensure it is secure against damage or unauthorised access. | [9, 11, 14, 15] |
Currency | Data timeliness | Data which refers to time should be available for use within an acceptable time relative to its time of creation. | [1, 3, 5, 8, 9, 12, 13, 15] |
Currency | Data Freshness | Data which is subjected to changes over the time should be fresh and up-to-date with respect to its intended use. | [2, 4, 6, 7, 11, 12] |
Accuracy | Accuracy to reference source | Data should agree with an identified source. | [1, 2, 4, 6, 8, 12-14] |
Accuracy | Accuracy to reality | Data should truly reflect the real world. | [1, 3, 5, 9-11, 15] |
Accuracy | Precision | Attribute values should be accurate as per linguistics and granularity. | [1, 2, 6, 7, 11, 15] |
Validity | Business rules compliance | Calculations on data must comply with business rules. | [1, 3] |
Validity | Meta-data compliance | Data should comply with its metadata | [1-6, 9] |
Validity | Standards and Regulatory compliance | All data processing activities should comply with the policies, procedures, standards, industry benchmark practices and all regulatory requirements that the organization is bound by. | [1, 5, 8, 12] |
Validity | Statistical validity | Computed data must be statistically valid. | [8, 16] |
Reliability | Source Quality | Data used is from trusted and credible sources. | [1, 2, 13-15] |
Reliability | Objectivity | Data are unbiased and impartial. | [1, 11, 13, 14] |
Reliability | Traceability | The lineage of the data is verifiable. | [10, 11, 15] |
Consistency | Uniqueness | The data is uniquely identifiable. | [4, 5, 9] |
Consistency | Non-redundancy | The data is recorded in exactly one place. | [1, 3, 12] |
Consistency | Semantic consistency | Data is semantically consistent. | [1, 7, 15] |
Consistency | Value consistency | Data values are consistent and do not provide conflicting or heterogeneous instances. | [1-3, 5, 13] |
Consistency | Format consistency | Data formats are consistently used. | [12, 15] |
Consistency | Referential integrity | Data relationships are represented through referential integrity rules. | [1, 4, 9] |
Usability and Interpretability | Usefulness and relevance | The data is useful and relevant for the task at hand. | [1, 8, 9, 11, 14-16] |
Usability and Interpretability | Understandability | The data is understandable. | [1, 2, 5, 6, 8, 13-15] |
Usability and Interpretability | Appropriate Presentation | The data presentation is aligned with its use. | [1, 2, 6, 9, 12] |
Usability and Interpretability | Interpretability | Data should be interpretable. | [6-8, 16] |
Usability and Interpretability | Information value | The value that is delivered by quality information should be effectively evaluated and continuously monitored in the organizational context. | [2, 12-14] |
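As an illustration only, a few of the characteristics in the table could be expressed as executable checks. The Python sketch below is not part of the original classification: the record layouts, field names, and function names are hypothetical, and each function is just one possible reading of the corresponding description (completeness of mandatory attributes, uniqueness, referential integrity, and data freshness).

```python
from datetime import datetime, timedelta

# Hypothetical sample records; all field names are illustrative assumptions.
customers = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": None, "email": "bob@example.com"},
    {"id": 2, "name": "Cy", "email": None},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]


def missing_mandatory(records, mandatory):
    """Completeness of mandatory attributes: (index, field) pairs that are null."""
    return [
        (i, field)
        for i, rec in enumerate(records)
        for field in mandatory
        if rec.get(field) is None
    ]


def duplicate_keys(records, key):
    """Uniqueness: key values that identify more than one record."""
    seen, dupes = set(), set()
    for rec in records:
        value = rec[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def dangling_references(children, fk, parents, pk):
    """Referential integrity: child rows whose foreign key matches no parent."""
    parent_keys = {p[pk] for p in parents}
    return [c for c in children if c[fk] not in parent_keys]


def is_fresh(last_updated, now, max_age):
    """Data freshness: True if the record was updated within max_age of now."""
    return now - last_updated <= max_age
```

With the sample data, `missing_mandatory(customers, ["id", "name"])` flags the second customer's null name, `duplicate_keys(customers, "id")` reports the repeated identifier 2, and `dangling_references(orders, "customer_id", customers, "id")` returns the order that points at a non-existent customer. What counts as "mandatory" or "fresh enough" depends on the intended use of the data, so those thresholds are passed in as parameters rather than fixed.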
Sources
[1] English, L.P., Information quality applied: Best practices for improving business information, processes and systems. 2009: Wiley Publishing.
[2] Loshin, D., Enterprise knowledge management: The data quality approach. 2001: Morgan Kaufmann Pub.
[3] Gatling G., Champlin C.B. R., Stefani H., Weigel G., Enterprise Information Management with SAP. 2007, Boston: Galileo Press Inc.
[4] Loshin, D., Monitoring Data Quality Performance Using Data Quality Metrics. Informatica Corporation, 2006.
[5] Byrne, J.K., D. McCarty, G. Sauter, H. Smith, P. Worcester, The information perspective of SOA design Part 6: The value of applying the data quality analysis pattern in SOA. 2008: IBM Corporation.
[6] Redman, T.C., Data quality for the information age. 1997: Artech House, Inc.
[7] Kimball, R. and J. Caserta, The data warehouse ETL toolkit: Practical techniques for extracting, cleaning, conforming, and delivering data. 2004.
[8] HIQA, International Review of Data Quality. Health Information and Quality Authority (HIQA), Ireland, 2011. http://www.hiqa.ie/press-release/2011-04-28-international-review-data-quality
[9] Price, R.J. and G. Shanks, Empirical refinement of a semiotic information quality framework. In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05). 2005: IEEE.
[10] ISO, ISO 8000-2 Data Quality-Part 2-Vocabulary. 2012, ISO.
[11] Eppler, M.J., Managing information quality: Increasing the value of information in knowledge-intensive products and processes. 2006: Springer.
[12] McGilvray, D., Executing data quality projects: Ten steps to quality data and trusted information. 2008: Morgan Kaufmann.
[13] Scannapieco, M. and T. Catarci, Data quality under a computer science perspective. Archivi & Computer, 2002. 2: p. 1-15.
[14] Wang, R.Y. and D.M. Strong, Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 1996: p. 5-33.
[15] Stvilia, B., et al., A framework for information quality assessment. Journal of the American Society for Information Science and Technology, 2007. 58(12): p. 1720-1733.
[16] Lyon, M., Assessing Data Quality. Monetary and Financial Statistics, Bank of England, 2008. http://www.bankofengland.co.uk/statistics/Documents/ms/articles/art1mar08.pdf
More details about the above classification and related work can be found in the following publications.
- Jayawardene, Vimukthi, Sadiq, Shazia and Indulska, Marta (2013). The curse of dimensionality in data quality. In: Proceedings of the 24th Australasian Conference on Information Systems (ACIS 2013). ACIS 2013: 24th Australasian Conference on Information Systems, Melbourne, VIC, Australia, (1-11). 4-6 December, 2013.
- Jayawardene, Vimukthi, Sadiq, Shazia and Indulska, Marta (2013). An analysis of data quality dimensions. ITEE Technical Report 2013-01, School of Information Technology and Electrical Engineering, The University of Queensland.
- Zhang, Ruojing, Jayawardene, Vimukthi, Indulska, Marta, Sadiq, Shazia and Zhou, Xiaofang (2014). A data driven approach for discovering data quality requirements. In: Proceedings of the Thirty Fifth International Conference on Information Systems: ICIS 2014. ICIS 2014: 35th International Conference on Information Systems, Auckland, New Zealand. 14-17 December 2014.
- Jayawardene, Vimukthi, Sadiq, Shazia and Indulska, Marta (2012). Practical significance of key data quality research areas. In: PACIS 2012 Proceedings. 16th Pacific Asia Conference on Information Systems (PACIS 2012), Ho Chi Minh City, Vietnam. 11-15 July 2012.
- Sadiq, Shazia, Indulska, Marta and Jayawardene, Vimukthi (2011). Research and industry synergies in data quality management. In: Proceedings of the 16th International Conference on Information Quality. 16th International Conference on Information Quality (ICIQ2011), Adelaide, Australia, (314-326). 18-20 November 2011.
- Shazia Sadiq, Naiem Khodabandehloo Yeganeh and Marta Indulska. 20 years of data quality research: Themes, trends and synergies. In: Heng Tao Shen and Yanchun Zhang, Conferences in Research and Practice in Information Technology. Proceedings of: The 22nd Australasian Database Conference (ADC 2011). Australasian Database Conference [ADC], Perth, WA, Australia, (1-10). 17-20 January 2011.
- Shazia Sadiq, Naiem Khodabandehloo Yeganeh and Marta Indulska. An Analysis of Cross-Disciplinary Collaborations in Data Quality Research. European Conference on Information Systems (ECIS2011), Helsinki, Finland, 9-11 June 2011.