Operational excellence for the data lakehouse
The architectural principles of the operational excellence pillar cover all operational processes that keep the lakehouse running. Operational excellence is the ability to operate the lakehouse efficiently: how to run, manage, and monitor it so that it delivers business value.
Principles of operational excellence
Optimize build and release processes
Use software engineering best practices across your entire lakehouse environment. Build and release using continuous integration and continuous delivery pipelines for both DevOps and MLOps.
Automate deployments and workloads
Automating deployments and workloads for the lakehouse helps standardize these processes, eliminate human error, improve productivity, and provide greater repeatability. This includes using “configuration as code” to avoid configuration drift, and “infrastructure as code” to automate the provisioning of all required lakehouse and cloud services.
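As a rough illustration of how "configuration as code" helps detect drift, the sketch below compares a configuration declared in version control with the configuration read back from a deployed environment. The function name `find_drift` and the setting names (`num_workers`, `autotermination_minutes`, `tags`) are hypothetical and stand in for whatever your deployment tooling exposes.

```python
from typing import Any


def find_drift(declared: dict[str, Any], actual: dict[str, Any], prefix: str = "") -> list[str]:
    """Return readable differences between declared and deployed settings."""
    drift: list[str] = []
    # Walk the union of keys so that additions and removals are both reported.
    for key in sorted(set(declared) | set(actual)):
        path = f"{prefix}{key}"
        if key not in actual:
            drift.append(f"{path}: missing in deployed config")
        elif key not in declared:
            drift.append(f"{path}: not declared in code")
        elif isinstance(declared[key], dict) and isinstance(actual[key], dict):
            # Recurse into nested sections and extend the dotted path.
            drift.extend(find_drift(declared[key], actual[key], prefix=f"{path}."))
        elif declared[key] != actual[key]:
            drift.append(f"{path}: declared {declared[key]!r}, deployed {actual[key]!r}")
    return drift


# Hypothetical example: someone resized the cluster and added a flag by hand.
declared = {
    "cluster": {"num_workers": 4, "autotermination_minutes": 30},
    "tags": {"team": "data"},
}
deployed = {
    "cluster": {"num_workers": 8, "autotermination_minutes": 30},
    "tags": {"team": "data"},
    "debug": True,
}
drift_report = find_drift(declared, deployed)
for line in drift_report:
    print(line)
```

Running a check like this on a schedule, and failing the run when the report is non-empty, is one way to surface manual changes before they accumulate.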
For ML specifically, processes should drive automation. Not every step of a process can or should be automated: people still determine the business questions, and some models will always need human oversight before deployment. The development process therefore comes first, and each module in it should be automated only as needed, which allows automation and customization to be built out incrementally.
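One way to picture a process in which some modules are automated and others still wait for a person is a pipeline with explicit approval gates. The following is a minimal sketch under that assumption; `Step`, `run_pipeline`, and the step names are hypothetical and not part of any Databricks API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    needs_approval: bool = False  # True marks a human oversight gate


def run_pipeline(steps: list[Step], approved: set[str]) -> list[str]:
    """Run steps in order and stop before the first gate that lacks sign-off."""
    completed: list[str] = []
    for step in steps:
        if step.needs_approval and step.name not in approved:
            break  # leave the remaining steps for a later, approved run
        step.run()
        completed.append(step.name)
    return completed


log: list[str] = []
steps = [
    Step("train", lambda: log.append("trained")),
    Step("evaluate", lambda: log.append("evaluated")),
    Step("deploy", lambda: log.append("deployed"), needs_approval=True),
]

without_signoff = run_pipeline(steps, approved=set())
with_signoff = run_pipeline(steps, approved={"deploy"})
print(without_signoff)
print(with_signoff)
```

Because each step is a separate unit, a gate can be removed later by clearing `needs_approval` once the team trusts that step to run unattended.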
Set up monitoring, alerting, and logging
Workloads in the lakehouse typically integrate Databricks platform services and external cloud services, for example as data sources or targets. Successful execution can only occur if each service in the execution chain is functioning properly. When this is not the case, monitoring, alerting, and logging are important to detect and track problems and understand system behavior.
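The idea that a workload only succeeds when every service in its execution chain is healthy can be sketched with a small probe runner built on Python's standard `logging` module. The service names and probe functions below are hypothetical placeholders; real probes would call the relevant platform or cloud APIs.

```python
import logging
from typing import Callable

logger = logging.getLogger("lakehouse.monitoring")


def check_chain(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every probe, log the outcome, and return the names of failing services."""
    failed: list[str] = []
    for name, probe in checks.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that raises is treated as unhealthy; the traceback is logged.
            logger.exception("health probe for %s raised", name)
            healthy = False
        if healthy:
            logger.info("%s is healthy", name)
        else:
            logger.error("%s is unhealthy", name)
            failed.append(name)
    return failed


def broken_probe() -> bool:
    # Simulates a dependency that cannot be reached.
    raise ConnectionError("simulated timeout")


failed_services = check_chain({
    "source_storage": lambda: True,
    "job_compute": broken_probe,
    "target_warehouse": lambda: False,
})
if failed_services:
    # In practice this is where an alert would be sent.
    logger.warning("alert: %d service(s) failing: %s", len(failed_services), failed_services)
```

All probes run even after one fails, so the log shows the state of the whole chain rather than only the first problem.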
Manage capacity and quotas
For any service launched in a cloud, take its limits into account, for example access rate limits, number of instances, number of users, and memory requirements. Understand these limits before designing a solution.
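A simple way to make limits part of the design step is to record each one next to the planned usage and flag anything close to its ceiling. The sketch below assumes hypothetical quota names and numbers, and an 80% warning threshold chosen purely for illustration; actual limits come from your cloud provider and platform documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    name: str
    limit: float
    planned: float

    def __post_init__(self) -> None:
        if self.limit <= 0:
            raise ValueError(f"limit for {self.name} must be positive")

    @property
    def utilization(self) -> float:
        """Fraction of the limit that the planned usage would consume."""
        return self.planned / self.limit


def over_threshold(quotas: list[Quota], threshold: float = 0.8) -> list[str]:
    """Return the names of quotas whose planned usage reaches the threshold."""
    return [q.name for q in quotas if q.utilization >= threshold]


# Hypothetical figures for illustration only.
quotas = [
    Quota("api_requests_per_second", limit=100, planned=45),
    Quota("compute_instances", limit=50, planned=48),
    Quota("workspace_users", limit=1000, planned=900),
]
at_risk = over_threshold(quotas)
print(at_risk)
```

Quotas that appear in `at_risk` are candidates for a limit increase request or a design change before the solution is built.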