In the world of Data Governance (DG) and Master Data Management (MDM), Data and Information Management professionals spend a great deal of time creating stable and reusable representations of data that can be consistently shared and used by different systems and people, both within and across enterprises. However, not all Reference Data is the same. For example, some Reference Data is far more stable and reusable than others, while some should only be used in specific contexts and not in others. Understanding Reference Data, along with its permutations and its purposes, is important for all MDM and DG programs.
Defining Reference Data
Reference Data is a broad term used by Data and Information Management professionals to describe data that is or can be considered a common and stable baseline representation of a concept or entity. For example, given multiple valid representations of a person named “Jane Doe” (e.g. “Doe, J.”, “Jane Doe”, “Doe, Jane”, “J. Doe”), which representation should be the one that is sanctioned for common use? Or, given that they are all correct, which one should be used in which specific context?
Synonyms for Reference Data
The term Reference Data is also synonymous with the terms Mastered Data and Master Data because, once mastered, all Mastered Data is Reference Data. In fact, when it comes to MDM, we master data to intentionally make it a stable reference for use by one or more downstream systems and/or people.
Automation is the primary benefit of Reference Data Management
The primary value of creating and maintaining Reference Data is to reduce the different types of waste that are associated with converting data for consumption and use by multiple systems. In other words, we create Reference Data to streamline automation and to minimize the waste and costs associated with excessive data conversions and translations as data moves from one system to another. While it is often quick and easy for the human mind to recognize that strings like “U.K.” and “United Kingdom” mean the same thing, and to know which representation to use in a specific circumstance or context, a computer cannot reliably make such correlations or determinations unless they are clearly spelled out. So, while humans can also benefit from such documented clarity, we typically create Reference Data to spell out such rules and representations for computers, in a manner that minimizes human coding.
In essence, by clearly establishing Reference Data, we are establishing the highest levels of trust that we can for that data. Such levels of trust eliminate unnecessary work, especially for computers, in trying to make sense of that data.
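The “U.K.” versus “United Kingdom” example can be sketched as a small look-up in Python. This is only an illustration of spelling out the correlations for a computer, not a prescribed implementation: the table contents and the names `COUNTRIES`, `ALIASES`, `resolve_country`, and `represent` are all hypothetical.

```python
from typing import Optional

# Hypothetical reference records: one sanctioned entry per country,
# with one representation per usage context.
COUNTRIES = {
    "GB": {"display": "United Kingdom", "abbreviation": "U.K.", "iso_alpha3": "GBR"},
}

# Hypothetical alias table: every known spelling points at one sanctioned key.
ALIASES = {
    "u.k.": "GB",
    "uk": "GB",
    "united kingdom": "GB",
}


def normalize(raw: str) -> str:
    """Lower-case the input and collapse runs of whitespace."""
    return " ".join(raw.strip().lower().split())


def resolve_country(raw: str) -> Optional[str]:
    """Return the sanctioned key for a raw string, or None if it is unknown."""
    return ALIASES.get(normalize(raw))


def represent(raw: str, context: str) -> Optional[str]:
    """Return the sanctioned representation of a raw string for a given context."""
    key = resolve_country(raw)
    if key is None:
        return None
    return COUNTRIES[key].get(context)
```

With these tables, `represent("UK", "display")` and `represent("U.K.", "display")` resolve to the same sanctioned string, while an unrecognized input returns `None` instead of a guess.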
Reference Data Primary Categories
Reference Data can be categorized along two primary dimensions:
- Generality or Commonality (implying how often it is used and under what specific contexts), and
- Frequency of Change (implying how often its formal representation or its containing set may change).
Reference Data denoted by Generality or Commonality
- General Reference Data (GRD), also referred to as Common Reference Data (CRD), represents the type of Reference Data that exists within and can be used across many different domain spaces. Very notable examples of GRD include things like mastered representations for lists of Continents, Countries, Regions/States, and Postal Codes.
- Domain-Specific Reference Data (DSRD), also referred to as Context-Specific Reference Data (CSRD), represents the type of Reference Data that only has relevance for a specific domain of operations. Examples of DSRD include the Periodic Table of Elements for the chemistry domain or Medical Codes (e.g. Diagnosis Codes and Treatment Codes) for the medical domain.
Reference Data denoted by Frequency of Change
- Static Reference Data (SRD) is considered to be the type of Reference Data that rarely changes. For example, the lists of Continents, Countries, Regions/States, and Postal Codes are all considered to be SRD because they rarely change, or because their rate of change is so low that it is inconsequential to most systems and/or people that use them.
- Dynamic Reference Data (DRD) is considered to be the type of Reference Data that changes often. Examples include an enterprise’s list of Employees or Consultants and its list of Organizations. A more complex example is the constantly changing list of Prices for Securities on financial exchanges.
Reference Data may come in different structures
Not all Reference Data is represented in the same structures. Some examples of different structural representations include but are not limited to:
- Simple Lists
- Complex Lists (e.g. Complex Tables, Look-Up Tables, or Maps)
The first three examples tend to be the most commonly used.
Determining Complexity of Reference Data
When performing Data Governance (DG) or Master Data Management (MDM), it is often important to determine and represent the complexity of Reference Data in order to help with the prioritization and scheduling of data-related work. The two categorizations above (Generality and Frequency of Change) help us create a Reference Data Complexity Grid that consists of four quadrants. As we identify different categories of data for our enterprise(s), we can place each one in what we believe to be its appropriate quadrant (some may span multiple quadrants). Doing so will help us understand the complexity of any one category of data relative to the others.
It is important to understand that certain data may have different levels of complexity for different enterprises. For example, Customer Data might have a moderate level of complexity in an enterprise that has only two separate customer-related systems that drive Customer Data mastering. However, it might have a much higher level of complexity in an enterprise that has more than ten customer-related systems that drive Customer Data mastering.
There are clearly other factors that can be used to refine complexity assessments, such as quantity of the data, quality of the source, frequency of availability, etc. However, the dimensions described for the Reference Data Complexity Grid, outlined above, usually help quickly and accurately establish such complexity ratings with minimal effort.
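The grid described above can be sketched in Python as a grouping of data types by their two dimensions. This is a minimal illustration, not a formal model: the class and function names are hypothetical, and where the article names only one dimension for an example (the Periodic Table, Security Prices), the other dimension is an assumption flagged in the comments.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Generality(Enum):
    GENERAL = "general"                  # GRD / CRD
    DOMAIN_SPECIFIC = "domain-specific"  # DSRD / CSRD


class ChangeFrequency(Enum):
    STATIC = "static"    # SRD
    DYNAMIC = "dynamic"  # DRD


@dataclass(frozen=True)
class ReferenceDataType:
    name: str
    generality: Generality
    change_frequency: ChangeFrequency


def build_grid(items):
    """Group data type names by their (generality, change frequency) quadrant."""
    grid = defaultdict(list)
    for item in items:
        grid[(item.generality, item.change_frequency)].append(item.name)
    return dict(grid)


EXAMPLES = [
    # Both dimensions are stated in the text for these two.
    ReferenceDataType("Countries", Generality.GENERAL, ChangeFrequency.STATIC),
    ReferenceDataType("Postal Codes", Generality.GENERAL, ChangeFrequency.STATIC),
    # Assumption: the Periodic Table is treated as static.
    ReferenceDataType("Periodic Table", Generality.DOMAIN_SPECIFIC, ChangeFrequency.STATIC),
    # Assumption: Security Prices are treated as specific to the financial domain.
    ReferenceDataType("Security Prices", Generality.DOMAIN_SPECIFIC, ChangeFrequency.DYNAMIC),
]

for quadrant, names in build_grid(EXAMPLES).items():
    print(quadrant[0].value, "/", quadrant[1].value, "->", names)
```

A data type that spans multiple quadrants could simply be listed once per quadrant. Additional factors such as data quantity or source quality would need extra fields, which this sketch leaves out.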
Common Reference Data Management mistakes
Mistake: Treating Reference Data Management (RDM) as if it is different from Master Data Management (MDM). They are, in fact, the same thing, and the general advice is to treat them as such at all times. Doing so will help your enterprise eliminate confusion and avoid wasted time and funds.
Mistake: A very common mistake is to not assign clear owners for specific Reference Data Types. As with all data, it is considered a best practice to ensure that each and every Reference Data Type has a clearly named owner (usually an organization with a primary contact).
Mistake: The biggest mistake most enterprises make when pursuing the mastering of data (i.e. the generation of Reference Data) is to not go after low-complexity data types first. In other words, Data and Information Management professionals either fail to list all data types and determine their complexities, or do so poorly, when the explicit intent should be to pursue mastering based on priorities and Return on Investment (ROI).
The general advice to correct this, barring exceptional higher priorities, is to go after low-hanging fruit as quickly and as often as you can. This will allow you and your enterprise to establish far more Reference Data sets, faster and with lower financial investment. It will also help you identify critical enterprise patterns such as the most used formats, consumption frequencies, and domains.
Mistake: Mastering simple data types with very complex, expensive, and time-consuming Master Data Management (MDM) tools and technologies. Given the Reference Data Complexity Grid, we can now see that data which sits within the lower left-hand quadrant is usually the easiest and quickest to deal with for the lowest amount of investment. Even so, many inexperienced Data and Information Management professionals attempt to master the data types in this quadrant with heavyweight MDM tooling. In other words, they treat simple Reference Data the same way they treat complex Reference Data. This usually occurs because the enterprise’s stakeholders made a significant investment in a complex and expensive MDM tool and, because of inexperience, were advised that all Reference Data should be located in the same system and treated the same way.
The general advice to correct this is to master and publish low-complexity Reference Data, such as Static Reference Data, with the simplest technologies you can. For example, Data Compilers can be used to feed in spreadsheets that can be easily converted to consistently available CSV, TXT, JSON, and similar representations, all from one common access API.
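The spreadsheet-to-multiple-formats idea above can be sketched with nothing more than the Python standard library. This is a minimal illustration under stated assumptions, not a description of any particular Data Compiler product: it assumes the spreadsheet has been exported as well-formed CSV text with a header row, and the `publish` function, the `SOURCE` sample, and the continent codes in it are hypothetical.

```python
import csv
import io
import json


def load_rows(csv_text):
    """Parse CSV text with a header row into dicts, trimming stray whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: (value or "").strip() for key, value in row.items()} for row in reader]


def to_json(rows):
    return json.dumps(rows, indent=2, sort_keys=True)


def to_csv(rows):
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_txt(rows, column):
    """One value per line, taken from a single column."""
    return "\n".join(row[column] for row in rows)


def publish(csv_text, fmt, column="name"):
    """Single access point: one source, several consistently derived formats."""
    rows = load_rows(csv_text)
    if fmt == "json":
        return to_json(rows)
    if fmt == "csv":
        return to_csv(rows)
    if fmt == "txt":
        return to_txt(rows, column)
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical spreadsheet export; the two-letter codes are illustrative only.
SOURCE = "code,name\nAF, Africa\nAN,Antarctica\n"

print(publish(SOURCE, "json"))
print(publish(SOURCE, "txt"))
```

Because every format is derived from the same parsed rows, consumers see consistent values regardless of which representation they request. In practice the same function could sit behind a small HTTP endpoint, which is left out here.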