Hail 0.2

Hail is a library built on Apache Spark for analyzing large genomic datasets.

Important

  • From Hail 0.2.65 onwards, use Apache Spark version 3.1.1 (Databricks Runtime 8.x or above).
  • Install Hail on Databricks Runtime, not Databricks Runtime for Genomics (deprecated).

Create a Hail cluster

Install Hail via Docker with Databricks Container Services.

You can find containers to set up a Hail environment on the ProjectGlow Docker Hub page. Use projectglow/databricks-hail:<hail_version>, replacing <hail_version> with an available Hail version.
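
If you create clusters programmatically, the image reference can be assembled as in the minimal sketch below. It assumes a Clusters API style payload with a docker_image.url field. The helper name hail_cluster_spec, the cluster name, and every value passed to it are illustrative placeholders, not values taken from this page; check Docker Hub for the tags that actually exist.

```python
import json


def hail_cluster_spec(hail_version, spark_version, node_type_id, num_workers=2):
    """Build a cluster spec dict that points at the Hail container image.

    Hypothetical helper: the field names follow the Databricks Clusters API
    as an assumption; verify them against your workspace before use.
    """
    return {
        "cluster_name": f"hail-{hail_version}",
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Image name pattern from this page: projectglow/databricks-hail:<hail_version>
        "docker_image": {"url": f"projectglow/databricks-hail:{hail_version}"},
    }


# Placeholder arguments: substitute a real image tag, runtime version, and node type.
spec = hail_cluster_spec("0.2.65", "<runtime_version>", "<node_type_id>")
print(json.dumps(spec, indent=2))
```

Keeping the tag as a function argument makes it easy to move every cluster to a new Hail version in one place.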

Use Hail in a notebook

For the most part, Hail 0.2 code runs on Databricks as described in the Hail documentation. However, the Databricks environment requires a few modifications.

Initialize Hail

When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This setting enables multiple Databricks notebooks to use the same Hail context.

Note

Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is supported only in Hail 0.2.39 and above.

import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
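
Because skip_logging_configuration is supported only in Hail 0.2.39 and above, a notebook shared across clusters with different Hail versions may want to add that argument conditionally. The following is a minimal sketch of one way to do so. The helper hail_init_kwargs is hypothetical, and it assumes the version string looks like "0.2.65", optionally followed by a hyphen and a build suffix.

```python
def hail_init_kwargs(hail_version):
    """Return keyword arguments for hl.init based on the Hail version string.

    Hypothetical helper; assumes a "major.minor.patch" version that may be
    followed by "-<build suffix>".
    """
    base = hail_version.split("-")[0]  # drop any build suffix
    parts = tuple(int(p) for p in base.split(".")[:3])

    kwargs = {"idempotent": True, "quiet": True}
    # skip_logging_configuration is supported only in Hail 0.2.39 and above.
    if parts >= (0, 2, 39):
        kwargs["skip_logging_configuration"] = True
    return kwargs


print(hail_init_kwargs("0.2.38"))
print(hail_init_kwargs("0.2.65"))

# In a notebook, assuming hl.version() is available in your Hail release:
#   import hail as hl
#   hl.init(sc, **hail_init_kwargs(hl.version()))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "0.2.100" sorting before "0.2.39".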

Display Bokeh plots

Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Databricks. To display a Bokeh plot generated by Hail, you can run a command like:

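from bokeh.embed import file_html
from bokeh.resources import CDN

# mt is an existing Hail MatrixTable with a DP (read depth) entry field.
plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
# Render the plot as standalone HTML and display it in the notebook.
html = file_html(plot, CDN, "Chart")
displayHTML(html)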

See the Bokeh documentation for more information.