Large Data

hvPlot and HoloViews support even high-dimensional datasets easily, and the standard mechanisms discussed already work well as long as you select a small enough subset of the data to display at any one time. However, some datasets are just inherently large, even for a single frame of data, and cannot safely be transferred for display in any standard web browser. Luckily, HoloViews makes it simple for you to use the separate Datashader library together with any of the plotting extension libraries it supports, including Bokeh and Matplotlib. Datashader is designed to complement standard plotting libraries by providing faithful visualizations for very large datasets, focusing on revealing the overall distribution, not just individual data points.

Datashader's computations are accelerated with Numba, making it fast to work with datasets of millions or billions of datapoints stored in Dask dataframes. Dask dataframes provide an API that mirrors much of the Pandas API, but allow working with data out of core and scaling out to many processors across compute clusters. Here we will use Dask to load and visualize the entire earthquake dataset.

How does datashader work?

  • Tools like Bokeh map Data (left) directly into an HTML/JavaScript Plot (right)
  • Datashader instead renders Data into a plot-sized Aggregate array, from which an Image can be constructed and then embedded into a Bokeh Plot
  • Only the fixed-sized Image needs to be sent to the browser, allowing millions or billions of datapoints to be used
  • Every step automatically adjusts to the data, but can be customized
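
The steps above can be illustrated with a rough NumPy-only sketch. This is not Datashader's actual implementation or API: the helper names `aggregate_counts` and `to_image` are hypothetical, the data is synthetic, and the log-scaled grey levels are an assumed simplification of the shading step. The point is only that the aggregate and the image have the size of the plot, not the size of the data.

```python
import numpy as np

def aggregate_counts(x, y, width, height, x_range, y_range):
    """Bin points into a (height, width) grid of counts, one cell per pixel.

    Hypothetical helper for illustration; points outside the ranges are dropped.
    """
    counts, _, _ = np.histogram2d(
        y, x,                       # rows follow y, columns follow x
        bins=(height, width),
        range=(y_range, x_range),
    )
    return counts.astype(np.uint32)

def to_image(agg):
    """Map counts to 8-bit grey levels on a log scale (assumed shading rule)."""
    scaled = np.log1p(agg.astype(np.float64))
    peak = scaled.max()
    if peak == 0:
        return np.zeros(agg.shape, dtype=np.uint8)
    return np.round(255 * scaled / peak).astype(np.uint8)

# Synthetic data standing in for a large dataset
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(0.0, 1.0, n)
y = rng.normal(0.0, 1.0, n)

agg = aggregate_counts(x, y, width=400, height=300,
                       x_range=(-4, 4), y_range=(-4, 4))
img = to_image(agg)

# Both arrays are 300 x 400 regardless of how many points went in
print(agg.shape, img.shape, img.dtype)
```

Increasing `n` changes how long the aggregation takes but not the shape of `img`, which is why only a fixed amount of data has to reach the browser.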

When not to use datashader

  • When plotting fewer than about 1e5 or 1e6 data points
  • When every datapoint must be resolvable individually; standard Bokeh will render all of them
  • When you need full interactivity (hover tools) with every datapoint

When to use datashader

  • Actual big data; when Bokeh/Matplotlib have trouble
  • When the distribution matters more than individual points
  • When you find yourself sampling, decimating, or binning to better understand the distribution
In [1]:
import holoviews as hv
import dask.dataframe as dd
import datashader as ds, datashader.geo

from holoviews import opts
from holoviews.operation.datashader import datashade, rasterize

hv.extension('bokeh')