The Roles of a Data Catalog
The difficulties of data management have intensified at a steady pace over the past several years. The management complexities of big data, cloud hosting, self-service analytics, and tightening regulations can’t be ignored. Effective data management has become a top priority for most organizations, but getting there is challenging. Data catalogs fill essential roles in overcoming these challenges.
Data catalogs were introduced to help data analysts find and understand data. Before data catalogs, most data analysts worked blind, without visibility into existing data sets or their contents, or the quality and usefulness of each. As a result, they spent much of their time finding data, understanding data, and recreating data sets that already existed. Data catalogs were designed to address these issues.
From modest beginnings as a means to manage data inventory and expose data sets to analysts, the data catalog has grown in functionality, popularity, and importance. Modern data catalogs still meet the needs of data analysts, but have expanded their reach. They are now central to data stewardship, data curation, and data governance. Chief data officers (CDOs) and chief analytics officers (CAOs) view the catalog as strategic not just for data inventory, but also for managing data assets and improving analytic quality and productivity.
The Variety of Data Catalog Tools
The selection of data catalog tools has grown rapidly in recent years. New tools continue to emerge, and catalog functions are regularly added to existing tools. Data catalog tools exist today in several forms, as listed in the table below.
|Catalog Type|Catalog Characteristics|
|---|---|
|Integrated with Data Preparation| |
|Integrated with Data Analysis| |
|Fully Integrated Solution| |
The value of a seamless user experience throughout the analytics lifecycle is evident, so the trend in catalog evolution is toward convergence. Most tools will mature to become fully integrated solutions supporting all three capabilities—cataloging, preparation, and analysis. Convergence, however, does not eliminate the need for interoperability, as self-service analysts often want to make their own choices of preparation and analysis tools.
Evaluating Data Catalog Tools
Data catalog stakeholders span a continuum from business and data analysts to C-level executives, and catalog impacts range from day-to-day tactical activities to long-term strategic position. Choosing a catalog that meets all of the needs, addresses all of the interests, and fits your environment and culture is a big job. Usability is a paramount consideration, given the variety of users and their broad spectrum of data and technical skills. An intuitive user interface and ease of use are essential to widespread catalog adoption. The twenty criteria listed here are designed to help you work systematically through the evaluation process and find the catalog best suited for your organization.
- Cataloging Data Sets. A data catalog should support automated discovery of data sets, both for initial catalog build and ongoing discovery of new data sets. Use of machine learning for metadata collection, semantic inference, and automated tagging is important to get maximum value from automation and to minimize the manual effort of data cataloging.
- Cataloging Data Operations. A data catalog should include cataloging of data preparation operations and associate them with data sets. Operations include processes to improve, enrich, format, and blend data. Look for the ability to catalog data preparation workflows to prescribe sequences for a set of operations. Also consider the ability to designate mandatory operations such as masking or obfuscation of personally identifying information (PII).
- Searching. Searching for data sets is a fundamental requirement of catalog users. Robust search capabilities include search by facets, keywords, and business terms. Natural language search capabilities are especially valuable for non-technical users. Ranking search results by relevance and by frequency of use is a particularly useful feature.
- Recommendations. When searching for data sets, a recommendations engine is an especially valuable feature. Leveraging usage history metadata and machine learning to develop recommendations based on past user experiences accelerates the search process, improves quality of match between search results and user needs, and makes strong connections between data sets and data preparation operations and workflows.
- Data Set Evaluation. Finding data sets is only the beginning for the data analyst. Choosing the right data sets depends on ability to evaluate their suitability for an analysis use case without first needing to download or acquire the data. Important evaluation features include capabilities to preview a data set, view data profiles, see user ratings, read user reviews and curator annotations, and view data quality information.
- Data Access. On completion of data set evaluation, desired data sets should be accessible directly from the catalog, providing a seamless user experience from search through data acquisition. Consider the variety of data set types to which the catalog can connect—RDBMS, flat files, tagged files, document stores, graph databases, geospatial data, text documents, and more. Data access should include data protection for security, privacy, and compliance of sensitive data.
- Usage Metadata. Collection of usage metadata enables other important features including data set evaluation and intelligent recommendations. Look for ability to collect information about each data set including: Who has used the data set? For what use cases has it been used? How frequently is it used? With what other data sets is it typically used or combined?
- Data Valuation. As the catalog becomes a data strategy component of interest to CDOs and CAOs, data valuation is a consideration. How will the catalog help you quantify value of data assets? Knowledge about frequency of use and analytic use cases provides a starting point. Don’t expect the data catalog to calculate a dollar value for each data set, but it should provide usage information that contributes to value estimation.
- Metadata Catalog. Consider the richness of metadata that is collected. What data is collected about data sets? What data is collected about processes, and does it support full data lineage traceability? What metadata supports searching? Does it include data about curators, data stewards, and subject matter experts (SMEs)? How comprehensive is usage metadata?
- Security. Data security is the first of four essential data governance capabilities. Check that the catalog can work with your existing security infrastructure and processes for user authentication and authorization. User security should at minimum distinguish between administrative users, such as curators, and analytic users. Also consider the levels at which security constraints can be imposed: data set level, record/row level, column/field level, and security by value.
- Lineage. Data lineage is a core data governance consideration. The ability to trace data from the original source, through intermediate preparation and processing steps, to final analysis and reporting is a key component of trusted data. It is also valuable for change management, impact analysis, troubleshooting, and problem-solving.
- Compliance. Regulations focused on data protection are increasingly common, and complying with them is a significant governance responsibility. GDPR is an immediate concern, but a multitude of other data-related regulations exist, many of them industry-specific, such as HIPAA and Dodd-Frank. Look closely at how the catalog handles PII, protects data privacy, and supports compliance with regulations.
- Quality. Data quality is a fourth major governance concern—one that has become more complex with the adoption of big data and data lakes. The catalog won’t cleanse data or improve data quality, but it does have an important role in data quality management. Smart algorithms may expose data conflicts and identify data quality deficiencies. Displaying automated and human judgments of data quality helps analysts evaluate and select data sets and decide how to work with less than perfect data.
- Data Curation. Data curators interact frequently with the data catalog and fill a critical role in making it useful and valuable. Evaluate the richness of curation capabilities including the ability to add data sets, hide or remove data sets, add annotations, create metadata, add search terms and tags, identify stewards and SMEs, tag security- and compliance-sensitive data, share tips and techniques, and encourage crowdsourcing of metadata.
- Socialization. As cataloging drives cultural shifts to collaborative data management and to community curation, socialization becomes an important element. Evaluate social capabilities such as crowdsourcing of metadata, collaboration features, posting of user ratings and reviews, and capture of user feedback. Look beyond the capabilities to consider usability and motivation. Unused social features provide little value. What does the catalog offer that makes it quick, easy, and desirable to participate in social aspects of data cataloging?
- Integration and Interoperability. The catalog can’t operate in isolation. It needs to work seamlessly throughout the analytics lifecycle from problem framing to data visualization, and to be seamless regardless of data preparation and analysis tool choices. How well will the catalog work with your data preparation tools? How well will it work with your data analysis and visualization tools? Will it integrate gracefully with your security and access controls?
- Deployment. You’ll certainly need to consider how the catalog fits into your current and future technical infrastructure. Does it offer options for on-premises, cloud, and hybrid deployments? Can it support both server-based and Web-based implementation? How well will it support your mobile and geographically dispersed users?
- Services. The nuances and details of catalog implementation can sometimes be challenging, and consulting services may prove valuable, especially when working with non-traditional data types. Data catalog users are likely to need some introductory training, and data curators may require more depth of training. Be sure to ask what kinds of training and consulting services are available. Also look for user groups or online forums as sources of knowledge and problem-solving.
- Pricing. Budget and cost are always considerations when acquiring new technology. Ask about the vendors’ pricing models. Will you pay by user seats? By volume of data? By the number of data sets? Or by other criteria? What should you expect as initial costs and ongoing costs? How can you estimate TCO?
- Vendor Roadmap. What plans does the vendor have for future features and functions? Will they expand integration with data preparation and data visualization tools? Will they increase interoperability with various preparation and analysis tools? Do they plan to offer connectors to various challenging data sources? How many data sources can they currently connect to? Are they adding advanced collaboration and socialization features or moving toward enterprise data marketplace capabilities?
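To make a few of the criteria above more concrete (searching ranked by frequency of use, usage metadata, and lineage), the following is a minimal Python sketch of what a catalog record and two catalog functions might look like. It is not modeled on any particular product: the `CatalogEntry` class, its field names, and the `search` and `trace_lineage` functions are all hypothetical, and a real catalog would add relevance ranking, access control, and persistent storage.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; every field name here is illustrative."""
    name: str
    tags: set = field(default_factory=set)
    use_count: int = 0                            # usage metadata: how often the data set was used
    upstream: list = field(default_factory=list)  # lineage: names of direct source data sets


def search(entries, keyword):
    """Return entries whose name or tags contain the keyword, most-used first."""
    kw = keyword.lower()
    hits = [
        e for e in entries
        if kw in e.name.lower() or any(kw in t.lower() for t in e.tags)
    ]
    return sorted(hits, key=lambda e: e.use_count, reverse=True)


def trace_lineage(entries, name):
    """Walk upstream links breadth-first and return all ancestors, nearest first."""
    by_name = {e.name: e for e in entries}
    seen, queue = [], list(by_name[name].upstream)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        if current in by_name:
            queue.extend(by_name[current].upstream)
    return seen


# Made-up sample data, for illustration only.
catalog = [
    CatalogEntry("crm_raw", {"customer"}, use_count=3),
    CatalogEntry("customer_clean", {"customer", "curated"}, use_count=42,
                 upstream=["crm_raw"]),
    CatalogEntry("churn_report", {"customer", "report"}, use_count=17,
                 upstream=["customer_clean"]),
]

print([e.name for e in search(catalog, "customer")])
print(trace_lineage(catalog, "churn_report"))
```

With this sample data, the search lists `customer_clean` first because it has the highest usage count, and the lineage trace for `churn_report` leads back through `customer_clean` to `crm_raw`. When evaluating a tool, it is worth asking which of these kinds of metadata it collects automatically and which must be maintained by curators.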
Using the Evaluation Criteria
Not all criteria are equally important. If practical, prioritize the criteria and assign weighting factors to align them with your organization’s needs and goals. If you’re uncertain about fully prioritizing, then divide the criteria into three categories: must have, nice to have, and not important. Use the highest-priority or must-have criteria to qualify tools for your short list. Then use the next level of criteria to evaluate and compare tools on the short list. Carefully and systematically evaluating data catalog tools is a good investment of time. The data catalog will be with you for a long time, will affect many stakeholders, and will shape the maturity of your data management practices.
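The two-stage process described above (qualify on must-have criteria, then compare on weighted criteria) can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the tool names, the choice of must-have criteria, the weights, and the 0 to 5 ratings are invented, and your own evaluation would substitute its own values.

```python
# Hypothetical inputs: criteria names, weights, and ratings are illustrative only.
MUST_HAVE = ["security", "searching"]                      # stage 1: qualification
WEIGHTS = {"lineage": 3, "data_curation": 2, "socialization": 1}  # stage 2: comparison

TOOLS = {
    "Catalog A": {"meets": {"security", "searching"},
                  "ratings": {"lineage": 4, "data_curation": 3, "socialization": 5}},
    "Catalog B": {"meets": {"security", "searching"},
                  "ratings": {"lineage": 5, "data_curation": 4, "socialization": 2}},
    "Catalog C": {"meets": {"security"},  # lacks a must-have, so it is never scored
                  "ratings": {"lineage": 5, "data_curation": 5, "socialization": 5}},
}


def shortlist(tools, must_have):
    """Keep only the tools that satisfy every must-have criterion."""
    return [name for name, t in tools.items() if set(must_have) <= t["meets"]]


def weighted_score(ratings, weights):
    """Weighted average of ratings; a criterion with no rating counts as 0."""
    total = sum(weights.values())
    return sum(w * ratings.get(c, 0) for c, w in weights.items()) / total


def rank(tools, must_have, weights):
    """Score the short-listed tools and sort them from best to worst."""
    scored = [(name, weighted_score(tools[name]["ratings"], weights))
              for name in shortlist(tools, must_have)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in rank(TOOLS, MUST_HAVE, WEIGHTS):
    print(f"{name}: {score:.2f}")
```

In this made-up example, Catalog C has the highest ratings but is excluded because it misses a must-have criterion, and Catalog B (4.17) ranks ahead of Catalog A (3.83) because lineage carries the most weight. Treating must-haves as a filter rather than as heavily weighted scores keeps a tool from compensating for a missing essential with strength elsewhere.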
This article was originally published on Eckerson Group and re-published to TOPBOTS with permission from the author.