Previous sections have focused on putting various simple types of data together in notebooks and deployed servers, but most people will want to include plots as well. In this section, we'll focus on one of the simplest (but still powerful) ways to get a plot.
If you have tried to visualize a pandas.DataFrame before, then you have likely encountered the Pandas .plot() API. This basic plotting interface uses Matplotlib to render static PNGs in a Jupyter notebook or for exporting from Python, with a command that can be as simple as df.plot() for a DataFrame with one or two columns.
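As a minimal sketch of that one-line call, the snippet below builds a small synthetic DataFrame and plots it with the default Matplotlib-backed .plot(). The column names `a` and `b` and the random values are illustrative assumptions, not part of the earthquake dataset, and the `Agg` backend is selected only so that the example also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs outside Jupyter

import numpy as np
import pandas as pd

# Synthetic two-column DataFrame (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "a": rng.normal(size=100).cumsum(),
    "b": rng.normal(size=100).cumsum(),
})

# The default .plot() draws one line per numeric column and returns a Matplotlib Axes
ax = demo_df.plot()
```

In a notebook, the returned Axes is rendered inline as a static image.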
The Pandas .plot() API has emerged as a de facto standard for high-level plotting APIs in Python, and is now supported by many different libraries that use other underlying plotting engines to provide additional power and flexibility. Thus, learning this API gives you access to capabilities provided by a wide variety of underlying tools with relatively little additional effort. The libraries currently supporting this API include:
- Pandas -- Matplotlib-based API included with Pandas. Static PNG output in Jupyter notebooks.
- xarray -- Matplotlib-based API included with xarray, based on the Pandas .plot API. Static PNG output in Jupyter notebooks.
- hvPlot -- HoloViews and Bokeh-based interactive plots for Pandas, GeoPandas, xarray, Dask, Intake, and Streamz data.
- Pandas Bokeh -- Bokeh-based interactive plots, for Pandas, GeoPandas, and PySpark data.
- Cufflinks -- Plotly-based interactive plots for Pandas data.
- PdVega -- Vega-lite-based, JSON-encoded interactive plots for Pandas data.
In this notebook we'll explore what is possible with the default .plot API and demonstrate the additional capabilities of .hvplot, using the same tabular dataset of earthquakes and other seismological events queried from the USGS Earthquake Catalog using its API as in previous sections. Of course, this particular dataset is just an example; the same approach can be used with just about any tabular dataset.
Read in the data
Here we'll read in the data using Dask, which works well with a relatively large dataset like this (2.1 million rows). We'll use .persist() to bring the whole dataset into main memory (which should be feasible on any recent machine) for higher performance:
import dask.dataframe as dd
df = dd.read_parquet('../data/earthquakes.parq').persist()
df.head()
| 1 | 4.516 | 0.479 | 0.05131 | 52.50 | NaN | ci9137218 | 34.3610 | ci | -116.1440 | 1.72 | ... | ci | mc | ci | 0.0 | 26km NNW of Twentynine Palms, California | 0.1300 | reviewed | 2000-01-31 23:44:54.060000+00:00 | earthquake | 2016-02-17T11:53:52.643Z |
| 2 | 33.000 | NaN | NaN | NaN | NaN | usp0009mwt | 10.6930 | trn | -61.1620 | 2.10 | ... | trn | md | us | NaN | Trinidad, Trinidad and Tobago | NaN | reviewed | 2000-01-31 23:28:38.420000+00:00 | earthquake | 2014-11-07T01:09:23.016Z |
| 3 | 33.000 | NaN | NaN | NaN | NaN | usp0009mws | -1.2030 | us | -80.7160 | 4.50 | ... | us | mb | us | NaN | near the coast of Ecuador | 0.6000 | reviewed | 2000-01-31 23:05:22.010000+00:00 | earthquake | 2014-11-07T01:09:23.014Z |
5 rows × 22 columns
The first thing that we'd like to do with this data is visualize the location of every earthquake. So we would like to make a scatter or points plot where x is longitude and y is latitude.
If you are familiar with the Pandas .plot API, you might expect to execute df.plot.scatter(x='longitude', y='latitude'). Feel free to try this out in a new cell, but it will throw an error: AttributeError: 'DataFrame' object has no attribute 'plot'. Since we have a Dask dataframe rather than a Pandas dataframe, we first need to convert it to Pandas to use .plot. To make the data more manageable for now, we'll briefly use just a fraction (1%) of it and call that small_df:
small_df = df.sample(frac=.01).compute()
small_df.shape
Now we have a smaller dataset with just 21k earthquakes. We can use that to test out our visualizations before ramping back up to the full dataset.
small_df.plot.scatter(x='longitude', y='latitude')
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b782ac780>
As you can see above, the Pandas API gives you a usable plot very easily, where you can start to see the structure of the edges of the plates (which in some cases correspond with the edges of the continents and in others are between two continents). You can make a very similar plot with the same arguments using hvplot.