Best practices for data and AI governance

This article covers best practices for data and AI governance, organized by the architectural principles listed in the following sections.

1. Unify data and AI management

Establish a data and AI governance process

Data and AI governance is the management of the availability, usability, integrity, and security of an organization’s data and AI assets. By strengthening data and AI governance, organizations can ensure the quality of the assets that are critical for accurate analytics and decision-making, help to identify new opportunities, improve customer satisfaction, and ultimately increase revenue. It helps organizations comply with data and AI privacy regulations and improve security measures, reducing the risk of data breaches and penalties. Effective data governance also eliminates redundancies and streamlines data management, resulting in cost savings and increased operational efficiency.

An organization should choose the governance model that suits it best:

  • In the centralized governance model, your governance administrators are owners of the metastore and can take ownership of any object and grant and revoke permissions.

  • In a distributed governance model, the catalog or a set of catalogs is the data domain. The owner of that catalog can create and own all assets and manage governance within that domain. The owners of any given domain can operate independently of the owners of other domains.
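
For example, in the distributed model a business unit can be given its own catalog to govern independently. The following is a minimal sketch, assuming a hypothetical sales catalog and account groups named sales-admins and sales-analysts:

```sql
-- Create a catalog that serves as the data domain for the sales business unit
CREATE CATALOG IF NOT EXISTS sales;

-- Hand ownership of the domain to the sales governance team, which can then
-- own all objects and manage grants inside the catalog
ALTER CATALOG sales OWNER TO `sales-admins`;

-- The domain owner grants access within the domain, independently of other domains
GRANT USE CATALOG ON CATALOG sales TO `sales-analysts`;
GRANT USE SCHEMA, SELECT ON CATALOG sales TO `sales-analysts`;
```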

Unity Catalog, the data and AI governance solution integrated into the Databricks Data Intelligence Platform, supports both governance models. It helps you seamlessly manage structured and unstructured data, ML models, notebooks, dashboards, and files on any cloud or platform. See the Unity Catalog best practices for guidance on implementing data and AI governance.

Manage metadata for all data and AI assets in one place

The benefits of managing metadata for all assets in one place are similar to the benefits of maintaining a single source of truth for all your data. These include reduced data redundancy, increased data integrity, and the elimination of misunderstandings due to different definitions or taxonomies. It’s also easier to implement global policies, standards, and rules with a single source.

As a best practice, run the lakehouse in a single account with Unity Catalog. Unity Catalog can manage data and volumes (arbitrary files), as well as AI assets such as features and AI models. The top-level container of objects in Unity Catalog is a metastore. It stores data assets (such as tables and views) and the permissions that govern access to them. Use a single metastore per cloud region, and do not access metastores across regions, to avoid latency issues.

The metastore provides a three-level namespace (catalog.schema.object, where objects include tables, views, volumes, functions, and models) to structure data, volumes, and AI assets.

Databricks recommends using catalogs to provide segregation across your organization’s information architecture. Often this means that catalogs correspond to a software development environment scope, a team, or a business unit.
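
For example, the following sketch uses hypothetical dev and prod catalogs, a sales schema, and an orders table to illustrate the three-level namespace:

```sql
-- One catalog per environment (could also be per team or business unit)
CREATE CATALOG IF NOT EXISTS dev;
CREATE CATALOG IF NOT EXISTS prod;

-- Schemas group related objects within a catalog
CREATE SCHEMA IF NOT EXISTS prod.sales;

-- Every object is addressed by the three-level name catalog.schema.object
CREATE TABLE IF NOT EXISTS prod.sales.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_ts    TIMESTAMP
);

SELECT * FROM prod.sales.orders LIMIT 10;
```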

Track data and AI lineage to drive visibility of the data

Data lineage is a powerful tool that helps data leaders gain greater visibility and understanding of the data in their organizations. Data lineage describes the transformation and refinement of data from source to insight. It includes the capture of all relevant metadata and events associated with the data throughout its lifecycle, including the source of the data set, what other data sets were used to create it, who created it and when, what transformations were performed, what other data sets use it, and many other events and attributes.

In addition, when you train a model on a table in Unity Catalog, you can track the model’s lineage to the upstream dataset(s) on which it was trained and evaluated.

Lineage can be used for many data-related use cases:

  • Compliance and audit readiness: Data lineage helps organizations trace the source of tables and fields. This is important for meeting the requirements of many compliance regulations, such as General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), Basel Committee on Banking Supervision (BCBS) 239, and Sarbanes-Oxley Act (SOX).

  • Impact analysis/change management: Data undergoes multiple transformations from the source to the final business-ready table. Understanding the potential impact of data changes on downstream users becomes important from a risk management perspective. This impact can be easily determined using the data lineage captured by the Unity Catalog.

  • Data quality assurance: Understanding where a data set came from and what transformations have been applied provides much better context for data scientists and analysts, enabling them to gain better and more accurate insights.

  • Debugging and diagnostics: In the event of an unexpected result, data lineage helps data teams perform root cause analysis by tracing the error back to its source. This dramatically reduces troubleshooting time.

Unity Catalog captures runtime data lineage across queries run on Databricks, as well as model lineage. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, jobs, and dashboards related to the query. Lineage can be visualized in near real time in Catalog Explorer and accessed using the Databricks Data Lineage REST API.
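
As an illustration of programmatic access, table lineage can also be queried from the lineage system tables, assuming system tables are enabled in your account (the system.access.table_lineage table and the target table name below are assumptions to adapt):

```sql
-- Lineage records where a given (hypothetical) table is the target
SELECT *
FROM system.access.table_lineage
WHERE target_table_full_name = 'prod.sales.orders_gold'
LIMIT 100;
```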

Add consistent descriptions to your metadata

Descriptions provide essential context for data. They help users understand the purpose and content of data tables and columns. This clarity allows them to more easily discover, identify, and filter the data they need, which is critical for effective data analysis and decision making. Descriptions can include data sensitivity and compliance information. This helps organizations meet legal and regulatory requirements for data privacy and security. Descriptions should also include information about the source, accuracy, and relevance of data. This helps ensure data integrity and promotes better collaboration across teams.

Two main features in Unity Catalog support describing tables and columns. Unity Catalog allows you to:

  • add comments to tables and columns.

    You can also add an AI-generated comment for any table or table column managed by Unity Catalog to speed up the process. However, AI models are not always accurate, so Databricks strongly recommends reviewing AI-generated comments for inaccuracies before saving them.

  • add tags to any securable in Unity Catalog. Tags are attributes with keys and optional values that you can apply to different securable objects in Unity Catalog. Tagging is useful for organizing and categorizing different securable objects within a metastore. Using tags also makes it easier to search and discover your data assets.
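
A minimal sketch of both features, assuming a hypothetical prod.sales.customers table with an email column:

```sql
-- Add a description to a table and to one of its columns
COMMENT ON TABLE prod.sales.customers IS 'One row per customer; sourced from the CRM system, refreshed daily.';
ALTER TABLE prod.sales.customers ALTER COLUMN email COMMENT 'Primary contact email (PII).';

-- Tag the table and a column to support classification and discovery
ALTER TABLE prod.sales.customers SET TAGS ('data_domain' = 'sales', 'sensitivity' = 'confidential');
ALTER TABLE prod.sales.customers ALTER COLUMN email SET TAGS ('pii' = 'true');
```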

Allow easy data discovery for data consumers

Easy data discovery enables data scientists, data analysts, and data engineers to quickly discover and reference relevant data and accelerate time to value.

Databricks Catalog Explorer provides a user interface for exploring and managing data, schemas (databases), tables, permissions, data owners, external locations, and credentials. In addition, you can use the Insights tab in Catalog Explorer to view the most frequent recent queries and users of any table registered in Unity Catalog.
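
Data consumers can also discover assets programmatically through the information schema. A minimal sketch, assuming the hypothetical prod catalog from the earlier examples and that tables carry descriptions:

```sql
-- Find tables in the prod catalog whose description mentions customers
SELECT table_catalog, table_schema, table_name, comment
FROM system.information_schema.tables
WHERE table_catalog = 'prod'
  AND comment ILIKE '%customer%';
```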

Govern AI assets together with data

The relationship between data governance and artificial intelligence (AI) has become critical to success. How organizations manage, secure, and use data directly impacts the outcomes and considerations of AI implementations: you can’t have AI without quality data, and you can’t have quality data without data governance.

Governing data and AI together improves AI performance by ensuring seamless access to high-quality, up-to-date data, leading to improved accuracy and better decision-making. Breaking down silos increases efficiency by enabling better collaboration and streamlining workflows, resulting in increased productivity and reduced costs.

Improved data security is another benefit, as a unified governance approach establishes consistent data handling practices, reducing vulnerabilities and improving an organization’s ability to protect sensitive information. Compliance with data privacy regulations is easier to maintain when data and AI governance are integrated, as data handling and AI processes are aligned with regulatory requirements.

Overall, a unified governance approach fosters trust among stakeholders and ensures transparency in AI decision-making processes by establishing clear policies and procedures for both data and AI.

In the Databricks Data Intelligence Platform, the Unity Catalog is the central component for governing both data and AI assets:

  • Features in Unity Catalog

    In Unity Catalog enabled workspaces, data scientists can create feature tables in Unity Catalog. These feature tables are Delta tables or Delta Live Tables managed by Unity Catalog.

  • Models in Unity Catalog

    Models in Unity Catalog extends the benefits of Unity Catalog to ML models, including centralized access control, auditing, lineage, and model discovery across workspaces. Key features of models in Unity Catalog include governance for models, chronological model lineage, model versioning, and model deployment via aliases.
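
A minimal sketch of governing ML assets alongside data, assuming a hypothetical ml catalog and an ml-engineers account group (verify the exact privileges against your workspace):

```sql
-- A dedicated schema for governed ML assets
CREATE CATALOG IF NOT EXISTS ml;
CREATE SCHEMA IF NOT EXISTS ml.prod_models;

-- Allow the ML engineering group to register feature tables and models there
GRANT USE CATALOG ON CATALOG ml TO `ml-engineers`;
GRANT USE SCHEMA, CREATE TABLE, CREATE MODEL ON SCHEMA ml.prod_models TO `ml-engineers`;
```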

2. Unify data and AI security

Centralize access control for all data and AI assets

Centralizing access control for all data assets is important because it simplifies the security and governance of your data and AI assets by providing a central place to administer and audit access to these assets. This approach helps in managing data and AI object access more efficiently, ensuring that operational requirements around segregation of duty are enforced, which is crucial for regulatory compliance and risk avoidance.

The Databricks Data Intelligence Platform provides data access control methods that describe which groups or individuals can access which data. These policy statements can be extremely granular and specific, down to defining the records each individual can access, or very broad, such as allowing all financial users to see all financial data.

The Unity Catalog centralizes access controls for all supported securable objects such as tables, files, models, and many more. Every securable object in Unity Catalog has an owner. The owner of an object has all privileges on the object, as well as the ability to grant privileges on the securable object to other principals. The Unity Catalog allows you to manage privileges, and to configure access control by using SQL DDL statements.
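
A minimal sketch of centrally managed grants, assuming the hypothetical prod.sales schema and an analysts account group:

```sql
-- Grant read access at the schema level; the grant applies to all tables in the schema
GRANT USE CATALOG ON CATALOG prod TO `analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.sales TO `analysts`;

-- Revoke access centrally when it is no longer needed
REVOKE SELECT ON SCHEMA prod.sales FROM `analysts`;

-- Review the current grants on an object
SHOW GRANTS ON SCHEMA prod.sales;
```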

The Unity Catalog uses row filters and column masks for fine-grained access control. Row filters allow you to apply a filter to a table so that subsequent queries return only rows for which the filter predicate evaluates to true. Column masks allow you to apply a masking function to a table column. The masking function is evaluated at query runtime, substituting each reference to the target column with the result of the masking function.
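
A minimal sketch of both mechanisms, assuming a hypothetical prod.sales.orders table with region and credit_card columns:

```sql
-- Row filter: non-admins only see rows for the US region (simplified predicate)
CREATE OR REPLACE FUNCTION prod.sales.us_only(region STRING)
RETURN IS_ACCOUNT_GROUP_MEMBER('admins') OR region = 'US';

ALTER TABLE prod.sales.orders SET ROW FILTER prod.sales.us_only ON (region);

-- Column mask: hide the card number from everyone outside the finance group
CREATE OR REPLACE FUNCTION prod.sales.mask_card(card STRING)
RETURN CASE WHEN IS_ACCOUNT_GROUP_MEMBER('finance') THEN card ELSE '****' END;

ALTER TABLE prod.sales.orders ALTER COLUMN credit_card SET MASK prod.sales.mask_card;
```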

For further information see Security, compliance & privacy - Manage identity and access using least privilege.

Configure audit logging

Audit logging is important because it provides a detailed account of system activities (user actions, changes to settings, and so on) that could affect the integrity of the system. While standard system logs are designed to help developers troubleshoot problems, audit logs provide a historical record of activity for compliance and other business policy enforcement purposes. Maintaining robust audit logs can help identify and ensure preparedness in the face of threats, breaches, fraud, and other system issues.

Databricks provides access to audit logs of activities performed by Databricks users, allowing your organization to monitor detailed Databricks usage patterns. There are two types of logs: workspace-level audit logs with workspace-level events, and account-level audit logs with account-level events.

You can also enable verbose audit logs, which are additional audit logs recorded whenever a query or command is run in your workspace.
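
Where audit log system tables are enabled, these logs can be queried directly with SQL. A minimal sketch, assuming the system.access.audit system table is available in your account and using a hypothetical user email:

```sql
-- Audit events for one user over the last 7 days
SELECT event_time, service_name, action_name, request_params
FROM system.access.audit
WHERE user_identity.email = 'jane.doe@example.com'
  AND event_date >= date_sub(current_date(), 7)
ORDER BY event_time DESC
LIMIT 100;
```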

Audit data platform events

Audit logging is important because it provides a detailed account of system activities. The Data Intelligence Platform has audit logs for metadata access (and hence data access) and for data sharing:

  • Unity Catalog captures an audit log of actions performed against the metastore. This enables admins to access fine-grained details about who accessed a given dataset and what actions they performed.

  • For secure sharing with Delta Sharing, Databricks provides audit logs to monitor Delta Sharing events, including:

    • When someone creates, modifies, updates, or deletes a share or a recipient.

    • When a recipient accesses an activation link and downloads the credential.

    • When a recipient accesses shares or data in shared tables.

    • When a recipient’s credential is rotated or expires.
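
A minimal sketch of querying the audit log for Unity Catalog and Delta Sharing activity; the service and action name filters shown are assumptions to verify against your own logs:

```sql
-- Unity Catalog activity over the last 30 days, narrowed to Delta Sharing actions
SELECT event_time, action_name, user_identity.email AS actor, request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name LIKE 'deltaSharing%'  -- assumed prefix for Delta Sharing events
  AND event_date >= date_sub(current_date(), 30)
ORDER BY event_time DESC;
```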

3. Establish data quality standards

The Databricks Data Intelligence Platform provides robust data quality management with built-in quality controls, testing, monitoring, and enforcement to ensure accurate and useful data is available for downstream BI, analytics, and machine learning workloads.

For implementation details, see Reliability - Manage data quality.

Define clear data quality standards

Defining clear and actionable data quality standards is crucial, because it helps ensure that data used for analysis, reporting, and decision-making is reliable and trustworthy. Documenting these standards helps ensure that they are upheld. Data quality standards should be based on the specific needs of the business and should address dimensions of data quality such as accuracy, completeness, consistency, timeliness, and reliability:

  • Accuracy: Ensure data accurately reflects real-world values.

  • Completeness: All necessary data should be captured and no critical data should be missing.

  • Consistency: Data across all systems should be consistent and not contradict other data.

  • Timeliness: Data should be updated and available in a timely manner.

  • Reliability: Data should be sourced and processed in a way that ensures its dependability.

Use data quality tools for profiling, cleansing, validating, and monitoring data

Leverage data quality tools for profiling, cleansing, validating, and monitoring data. These tools help automate the detection and correction of data quality issues, which is vital for scaling data quality initiatives across the large datasets typical of data lakes.

If you use DLT (Delta Live Tables), you can use expectations to define data quality constraints on the contents of a dataset. Expectations allow you to guarantee that data arriving in tables meets data quality requirements and provide insights into data quality for each pipeline update.
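
A minimal sketch of a DLT dataset with expectations in SQL, assuming a hypothetical upstream raw_orders dataset defined in the same pipeline:

```sql
-- Drop rows with a missing key; fail the whole update if amounts go negative
CREATE OR REFRESH MATERIALIZED VIEW orders_clean (
  CONSTRAINT valid_order_id      EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT non_negative_amount EXPECT (amount >= 0)          ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.raw_orders;
```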

Implement and enforce standardized data formats and definitions

Standardized data formats and definitions help achieve a consistent representation of data across all systems, which facilitates data integration and analysis, reduces costs, and improves decision making by enhancing communication and collaboration across teams and departments. They also help provide a structure for creating and maintaining data quality.

Develop and enforce a standard data dictionary that includes definitions, formats, and acceptable values for all data elements used across the organization.

Use consistent naming conventions, date formats, and measurement units across all databases and applications to prevent discrepancies and confusion.
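
One way to enforce parts of such a standard at the table level is with Delta table constraints. A minimal sketch, assuming the hypothetical tables from the earlier examples and illustrative acceptable values:

```sql
-- Enforce a two-letter, upper-case country code
ALTER TABLE prod.sales.customers ADD CONSTRAINT valid_country_code
  CHECK (country_code RLIKE '^[A-Z]{2}$');

-- Reject order dates outside the plausible range for this system
ALTER TABLE prod.sales.orders ADD CONSTRAINT valid_order_date
  CHECK (order_date >= DATE '2000-01-01');
```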