Plotting#

When trying to make sense of data, there are many representations to choose from, including data tables, textual summaries and so on. We’ll mostly focus on plotting data to get an intuitive visual representation, using a simple but powerful plotting API.

If you have tried to visualize a pandas.DataFrame before, then you have likely encountered the Pandas .plot() API. These plotting commands use Matplotlib to render static PNGs or SVGs in a Jupyter notebook using the inline backend, or interactive figures via %matplotlib widget, with a command that can be as simple as df.plot() for a DataFrame with one or two columns.
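As a minimal illustration of that API (using a small synthetic DataFrame here, since the column names are invented for the example):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; in a notebook, %matplotlib inline handles this
import pandas as pd

# A tiny synthetic DataFrame; 'x' and 'y' are made-up columns for illustration.
demo_df = pd.DataFrame({'x': range(5), 'y': [v ** 2 for v in range(5)]})
ax = demo_df.plot(x='x', y='y')  # returns a Matplotlib Axes with one line drawn
```

The call returns a Matplotlib `Axes`, so anything Matplotlib can do (titles, annotations, saving to file) is available afterwards.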

The Pandas .plot() API has emerged as a de-facto standard for high-level plotting APIs in Python, and is now supported by many different libraries that use various underlying plotting engines to provide additional power and flexibility. Learning this API allows you to access capabilities provided by a wide variety of underlying tools, with relatively little additional effort. The libraries currently supporting this API include:

  • Pandas – Matplotlib-based API included with Pandas. Static or interactive output in Jupyter notebooks.

  • xarray – Matplotlib-based API included with xarray, based on pandas .plot API. Static or interactive output in Jupyter notebooks.

  • hvPlot – Bokeh/Matplotlib/Plotly-based HoloViews plots for Pandas, GeoPandas, xarray, Dask, Intake, and Streamz data.

  • Pandas Bokeh – Bokeh-based interactive plots, for Pandas, GeoPandas, and PySpark data.

  • Cufflinks – Plotly-based interactive plots for Pandas data.

  • Plotly Express – Plotly-Express-based interactive plots for Pandas data; only partial support for the .plot API keywords.

  • PdVega – Vega-lite-based, JSON-encoded interactive plots for Pandas data.

In this notebook we’ll explore what is possible with the default .plot API and demonstrate the additional capabilities provided by .hvplot, which include seamless interactivity in notebooks and deployed dashboards, server-side rendering of even the largest datasets, automatic small multiples and widget selectors for exploring complex data, and easy composition and linking of plots after they are generated.

To show these features, we’ll use a tabular dataset of earthquakes and other seismological events queried from the USGS Earthquake Catalog using its API. Of course, this particular dataset is just an example; the same approach can be used with just about any tabular dataset, and similar approaches can be used with gridded (multidimensional array) datasets.

Read in the data#

Here we will focus on Pandas, but a similar approach will work for any supported DataFrame type, including Dask for distributed computing or RAPIDS cuDF for GPU computing. This dataset is relatively large (2.1 million rows), but should still fit into memory on any recent machine, and thus won’t need special out-of-core or distributed approaches like Dask provides.
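A quick way to sanity-check the "fits into memory" claim for any frame is `memory_usage`; this sketch uses a synthetic frame standing in for the earthquake data:

```python
import numpy as np
import pandas as pd

# Rough in-memory footprint check before reaching for out-of-core tools
# like Dask; the frame here is synthetic, standing in for the real data.
frame = pd.DataFrame(np.zeros((100_000, 25)))
mb = frame.memory_usage(deep=True).sum() / 1e6
print(f'{mb:.0f} MB')  # ~20 MB of float64 data, plus the index
```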

import pathlib
import pandas as pd
%%time
df = pd.read_parquet(pathlib.Path('../data/earthquakes-projected.parq'))
df = df.set_index(df.time)
CPU times: user 3.08 s, sys: 426 ms, total: 3.51 s
Wall time: 1.92 s
print(df.shape)
df.head()
(2116537, 25)
index depth depthError dmin gap horizontalError id latitude locationSource longitude ... net nst place rms status time type updated easting northing
time
2000-01-31 23:52:00.619000+00:00 0 7.800 1.400 0.09500 245.14 NaN nn00001936 37.1623 nn -116.6037 ... nn 5.0 Nevada 0.0519 reviewed 2000-01-31 23:52:00.619000+00:00 earthquake 2018-04-24T22:22:44.135Z -1.298026e+07 4.461754e+06
2000-01-31 23:44:54.060000+00:00 1 4.516 0.479 0.05131 52.50 NaN ci9137218 34.3610 ci -116.1440 ... ci 0.0 26km NNW of Twentynine Palms, California 0.1300 reviewed 2000-01-31 23:44:54.060000+00:00 earthquake 2016-02-17T11:53:52.643Z -1.292909e+07 4.077379e+06
2000-01-31 23:28:38.420000+00:00 2 33.000 NaN NaN NaN NaN usp0009mwt 10.6930 trn -61.1620 ... us NaN Trinidad, Trinidad and Tobago NaN reviewed 2000-01-31 23:28:38.420000+00:00 earthquake 2014-11-07T01:09:23.016Z -6.808523e+06 1.197310e+06
2000-01-31 23:05:22.010000+00:00 3 33.000 NaN NaN NaN NaN usp0009mws -1.2030 us -80.7160 ... us NaN near the coast of Ecuador 0.6000 reviewed 2000-01-31 23:05:22.010000+00:00 earthquake 2014-11-07T01:09:23.014Z -8.985264e+06 -1.339272e+05
2000-01-31 22:56:50.996000+00:00 4 7.200 0.900 0.11100 202.61 NaN nn00001935 38.7860 nn -119.6409 ... nn 5.0 Nevada 0.0715 reviewed 2000-01-31 22:56:50.996000+00:00 earthquake 2018-04-24T22:22:44.054Z -1.331836e+07 4.691064e+06

5 rows × 25 columns

To compare HoloViz approaches with other tools, we’ll also construct a subsample of the dataset that’s tractable with any plotting or analysis library, but has only 1% of the data:

small_df = df.sample(frac=.01)
print(small_df.shape)
small_df.head()
(21165, 25)
index depth depthError dmin gap horizontalError id latitude locationSource longitude ... net nst place rms status time type updated easting northing
time
2004-01-22 17:08:32.620000+00:00 2522 0.000 NaN NaN 146.6 NaN usp000cjf1 37.202000 mdd -2.268000 ... us 5.0 Spain NaN reviewed 2004-01-22 17:08:32.620000+00:00 earthquake 2014-11-07T01:21:06.750Z -2.524726e+05 4.467301e+06
2002-06-25 04:38:27.640000+00:00 1404 4.253 18.44 0.12980 124.0 0.76 ci9794809 36.188167 ci -117.989667 ... ci 17.0 10km S of Olancha, California 0.18 reviewed 2002-06-25 04:38:27.640000+00:00 earthquake 2016-02-17T00:58:38.391Z -1.313455e+07 4.326544e+06
2008-12-29 13:09:22.010000+00:00 892 522.600 13.10 NaN 130.7 NaN usp000grsm -24.130000 us 179.869000 ... us 14.0 south of the Fiji Islands 1.05 reviewed 2008-12-29 13:09:22.010000+00:00 earthquake 2014-11-07T01:38:04.194Z 2.002293e+07 -2.769257e+06
2013-05-21 06:23:07.720000+00:00 3604 1.064 31.61 0.01645 269.0 9.16 ci15346833 33.207000 ci -115.551000 ... ci 9.0 5km SW of Niland, CA 0.15 reviewed 2013-05-21 06:23:07.720000+00:00 earthquake 2016-03-11T01:57:17.748Z -1.286308e+07 3.922812e+06
2018-09-06 15:49:18.710000+00:00 9576 670.810 2.80 1.43100 12.0 9.50 us2000h9e2 -18.474300 us 179.350200 ... us NaN 102km ESE of Suva, Fiji 1.07 reviewed 2018-09-06 15:49:18.710000+00:00 earthquake 2019-04-23T04:45:27.592Z 1.996517e+07 -2.093140e+06

5 rows × 25 columns

We’ll switch back and forth between small_df and df depending on whether the technique we are showing works well only for small datasets, or whether it can be used for any dataset.
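Note that `df.sample` draws a different random subset on every run; passing `random_state` makes the subsample reproducible. A quick sketch on a synthetic frame:

```python
import numpy as np
import pandas as pd

# frac-based sampling on a synthetic frame; random_state pins which rows
# are chosen, whereas small_df above will differ from run to run.
full = pd.DataFrame({'value': np.arange(1000)})
subset = full.sample(frac=0.01, random_state=42)
print(len(subset))  # 1% of 1000 rows -> 10
```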

Using Pandas .plot()#

The first thing that we’d like to do with this data is visualize the locations of every earthquake. So we would like to make a scatter or points plot where x is longitude and y is latitude.

We can do that for the smaller dataframe using the pandas.plot API and Matplotlib:

%matplotlib inline
small_df.plot.scatter(x='longitude', y='latitude');
[Matplotlib scatter plot of earthquake longitude vs. latitude]

Exercise#

Try changing inline to widget and see what interactivity is available from Matplotlib. In some cases you may have to reload the page and restart this notebook to get it to display properly.

Using .hvplot#

As you can see above, the Pandas API gives you a usable plot very easily, where you can start to see the structure of the edges of the tectonic plates, which in many cases correspond with the visual edges of continents (e.g. the westward side of Africa, in the center). You can make a very similar plot with the same arguments using hvPlot, after importing hvplot.pandas to register hvPlot support on Pandas objects:

import hvplot.pandas # noqa: adds hvplot method to pandas objects
small_df.hvplot.scatter(x='longitude', y='latitude')

Unlike the Pandas .plot() output, the plot displayed here is a Bokeh plot with a default hover tool that shows the location values for each datapoint, and you can always pan and zoom to focus on any particular region of interest. (Zoom and pan also work if you use the widget Matplotlib backend.)

You might have noticed that many of the dots in the scatter that we’ve just created lie on top of one another. This is called “overplotting” and can be avoided in a variety of ways, such as by making the dots slightly transparent, or binning the data.
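The "binning" alternative can be sketched with plain NumPy: aggregate points into a 2D grid of counts instead of drawing every point. The coordinates below are synthetic, for illustration only:

```python
import numpy as np

# Bin synthetic lon/lat points into a coarse 2D grid of counts; plotting
# the counts (e.g. as an image or hexbin) sidesteps overplotting entirely.
rng = np.random.default_rng(0)
lon = rng.uniform(-180, 180, 10_000)
lat = rng.uniform(-90, 90, 10_000)
counts, xedges, yedges = np.histogram2d(lon, lat, bins=(36, 18))
print(counts.shape, int(counts.sum()))  # every point lands in exactly one bin
```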

Exercise#

Try changing the alpha value (e.g. to 0.1) on the plot above to see the effect of this approach:

small_df.hvplot.scatter(x='longitude', y='latitude', alpha=0.1)

Try creating a hexbin plot.

small_df.hvplot.hexbin(x='longitude', y='latitude')

Getting help with hvplot options#

You may be wondering how you could have found out about the alpha keyword in the first exercise, or how you can learn about all the options available with hvPlot. For this you can use tab completion in the Jupyter notebook or the hvplot.help function, both of which are documented in the user guide.

For tab completion, press Tab after the opening parenthesis in an obj.hvplot.&lt;kind&gt;( call. For instance, try pressing Tab after the partial expression small_df.hvplot.scatter(&lt;TAB&gt;.

Alternatively, you can call hvplot.help(<kind>) to see a documentation pane pop up in the notebook. Try uncommenting the following line and executing it:

# hvplot.help('scatter')

You will see there are a lot of options! You can control which section of the documentation you view with the generic, docstring, and style boolean switches, also documented in the user guide. If you run the following cell, you will see that alpha is listed under ‘Style options’.

# hvplot.help('scatter', style=True, generic=False)

These style options refer to options that are part of the Bokeh API. This means that the alpha keyword is passed directly to Bokeh just like all the other style options. As these are Bokeh-level options, you can find out more by using the search functionality in the Bokeh docs.

Datashader#

As you saw above, there are often arbitrary choices that you are faced with making even before you understand the properties of the dataset, such as selecting an alpha value or a bin size for aggregations. Making such assumptions can accidentally bias you towards certain aspects of the data, and of course having to throw away 99% of the data can cover up patterns you might have otherwise seen. For an initial exploration of a new dataset, it’s much safer if you can just see the data, before you impose any assumptions about its form or structure, and without having to subsample it.

To avoid some of the problems of traditional scatter/point plots, we can use hvPlot’s Datashader support. Datashader aggregates the data into each pixel without any arbitrary parameter settings, making your data visible immediately, before you know what to expect of it. In hvPlot we can activate this capability by setting rasterize=True to invoke Datashader before rendering, and cnorm='eq_hist' (“histogram equalization”) to specify that the colormapping should adapt to whatever distribution the data has:

small_df.hvplot.scatter(x='longitude', y='latitude', rasterize=True, cnorm='eq_hist')
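To build intuition for what cnorm='eq_hist' does, here is a rough rank-based sketch of histogram equalization (Datashader’s actual implementation differs in detail): each pixel’s count is mapped through the empirical distribution of all counts, so every colormap level covers about the same number of pixels.

```python
import numpy as np

# Rank-based sketch of histogram equalization: the huge outlier count (100)
# no longer compresses the low end of the colormap into a single shade.
counts = np.array([0, 1, 1, 2, 5, 100])   # aggregated counts per pixel
ranks = counts.argsort().argsort()        # each pixel's rank among all pixels
normalized = ranks / (len(counts) - 1)    # ranks spread evenly over [0, 1]
print(normalized)
```

With a linear norm, the pixel with count 100 would get nearly the whole colormap range to itself; after equalization, the low-count pixels remain visually distinguishable.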