Hail is a library built on Apache Spark for analyzing large genomic datasets.
- From Hail 0.2.65 onwards, use Apache Spark 3.1.1 (Databricks Runtime 8.x or above).
- Install Hail on Databricks Runtime, not Databricks Runtime for Genomics (deprecated).
Install Hail via Docker with Databricks Container Services.
You can find containers for setting up a Hail environment on the ProjectGlow Docker Hub page.
Use projectglow/databricks-hail:<hail_version>, replacing the tag with an available Hail version.
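As a rough illustration of how the image reference might appear in a cluster definition, the sketch below builds a request body in the shape used by the Databricks Clusters API, where a custom container is given under `docker_image.url`. The helper name `hail_cluster_spec` is hypothetical, and the runtime string, node type, and worker count are inputs you would take from your own workspace. Check the field names against the API version your workspace uses.

```python
def hail_cluster_spec(hail_version, spark_version, node_type_id, num_workers):
    """Return a cluster definition that pulls the Hail container image.

    hail_cluster_spec is a hypothetical helper, not part of Hail or Databricks.
    All four arguments are workspace-specific and must be supplied by the caller.
    """
    return {
        "cluster_name": f"hail-{hail_version}",
        # Databricks Runtime version string; 8.x or above for Hail 0.2.65 onwards.
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Databricks Container Services: image tag is the chosen Hail version.
        "docker_image": {
            "url": f"projectglow/databricks-hail:{hail_version}",
        },
    }
```

The returned dictionary can be serialized with `json.dumps` and sent to the cluster creation endpoint, or used as a reference when filling in the Docker image URL in the cluster UI.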
For the most part, Hail 0.2 code runs in Databricks exactly as described in the Hail documentation. However, the Databricks environment requires a few modifications.
When initializing Hail, pass in the pre-created
SparkContext and mark the initialization as idempotent. This setting
enables multiple Databricks notebooks to use the same Hail context.
Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is
supported only in Hail 0.2.39 and above.
    import hail as hl
    hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
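Because skip_logging_configuration is supported only from Hail 0.2.39, a notebook that runs on clusters with different Hail versions might build the initialization arguments conditionally. The following is a minimal sketch, assuming the version string starts with `major.minor.patch` (as the strings returned by `hl.version()` do); `hail_init_kwargs` is a hypothetical helper, not part of Hail.

```python
import re

# First Hail release that accepts skip_logging_configuration.
MIN_SKIP_LOGGING = (0, 2, 39)


def hail_init_kwargs(hail_version: str) -> dict:
    """Build keyword arguments for hl.init() from a Hail version string.

    hail_init_kwargs is a hypothetical helper, not part of Hail.
    """
    # Take the leading major.minor.patch; ignore any build suffix after it.
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", hail_version)
    if match is None:
        raise ValueError(f"Unrecognized Hail version: {hail_version!r}")
    version = tuple(int(part) for part in match.groups())

    # Settings that apply regardless of version.
    kwargs = {"idempotent": True, "quiet": True}
    # Add the logging flag only where Hail supports it.
    if version >= MIN_SKIP_LOGGING:
        kwargs["skip_logging_configuration"] = True
    return kwargs
```

In a notebook this could be used as `hl.init(sc, **hail_init_kwargs(hl.version()))`, where `sc` is the SparkContext that Databricks creates for you.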
Hail uses the Bokeh library to create plots. The
show function built into Bokeh does not work
in Databricks. To display a Bokeh plot generated by Hail, you can run a command like:
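    from bokeh.embed import file_html
    from bokeh.resources import CDN

    # mt is an existing Hail MatrixTable with a DP (read depth) entry field
    plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
    # Render the plot as a standalone HTML page that loads BokehJS from the CDN
    html = file_html(plot, CDN, "Chart")
    # displayHTML is the Databricks notebook function for rendering HTML
    displayHTML(html)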
See the Bokeh documentation for more information.