{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Matplotlib 101" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Assoc. Prof. Dr. Piyabute Fuangkhon\n", "# Department of Digital Business Management\n", "# Martin de Tours School of Management and Economics\n", "# Assumption University\n", "# Update: 22/05/2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Matplotlib 101 (Visualizing COVID-19 Data)\n", "\n", "In this notebook, we will explore the basics of Matplotlib while visualizing data from the OWID COVID-19 dataset. We will cover various types of plots including line plots, bar plots, scatter plots, histograms, and more." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Global file location\n", "file_location = 'owid-covid-data.csv'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " iso_code continent location date total_cases new_cases \\\n", "0 AFG Asia Afghanistan 2020-01-05 NaN 0.0 \n", "1 AFG Asia Afghanistan 2020-01-06 NaN 0.0 \n", "2 AFG Asia Afghanistan 2020-01-07 NaN 0.0 \n", "3 AFG Asia Afghanistan 2020-01-08 NaN 0.0 \n", "4 AFG Asia Afghanistan 2020-01-09 NaN 0.0 \n", "\n", " new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... \\\n", "0 NaN NaN 0.0 NaN ... \n", "1 NaN NaN 0.0 NaN ... \n", "2 NaN NaN 0.0 NaN ... \n", "3 NaN NaN 0.0 NaN ... \n", "4 NaN NaN 0.0 NaN ... \n", "\n", " male_smokers handwashing_facilities hospital_beds_per_thousand \\\n", "0 NaN 37.746 0.5 \n", "1 NaN 37.746 0.5 \n", "2 NaN 37.746 0.5 \n", "3 NaN 37.746 0.5 \n", "4 NaN 37.746 0.5 \n", "\n", " life_expectancy human_development_index population \\\n", "0 64.83 0.511 41128772.0 \n", "1 64.83 0.511 41128772.0 \n", "2 64.83 0.511 41128772.0 \n", "3 64.83 0.511 41128772.0 \n", "4 64.83 0.511 41128772.0 \n", "\n", " excess_mortality_cumulative_absolute excess_mortality_cumulative \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", " excess_mortality excess_mortality_cumulative_per_million \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", "[5 rows x 67 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "df['date'] = pd.to_datetime(df['date'])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Line Plot\n", "Let's start with a simple line plot to visualize the daily new COVID-19 cases in the United States." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset and parse dates so the x-axis is a true time axis\n", "df = pd.read_csv(file_location)\n", "df['date'] = pd.to_datetime(df['date'])\n", "\n", "# Filter data for the United States\n", "us_data = df[df['location'] == 'United States']\n", "\n", "# Plot daily new cases\n", "plt.figure(figsize=(10, 5))\n", "plt.plot(us_data['date'], us_data['new_cases'], label='New Cases', color='blue')\n", "plt.xlabel('Date')\n", "plt.ylabel('New Cases')\n", "plt.title('Daily New COVID-19 Cases in the United States')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bar Plot\n", "Next, let's create a bar plot to compare the total number of COVID-19 cases for each continent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "\n", "# total_cases is cumulative: take each country's maximum, then sum within each continent\n", "# (a plain max per continent would only return the largest single country)\n", "country_cases = df.groupby(['continent', 'location'])['total_cases'].max()\n", "continent_cases = country_cases.groupby(level='continent').sum()\n", "\n", "# Plot the bar chart\n", "plt.figure(figsize=(10, 5))\n", "continent_cases.plot(kind='bar', color='orange')\n", "plt.xlabel('Continent')\n", "plt.ylabel('Total Cases')\n", "plt.title('Total COVID-19 Cases by Continent')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scatter Plot\n", "Now, let's generate a scatter plot to show the relationship between total COVID-19 cases and total deaths for each country."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "\n", "# Keep country rows only (aggregate rows such as 'World' have an empty continent)\n", "countries = df.dropna(subset=['continent'])\n", "\n", "# Totals are cumulative, so each country's maximum is its latest reported value\n", "latest_data = countries.groupby('location')[['total_cases', 'total_deaths']].max().dropna()\n", "\n", "# Plot the scatter plot\n", "plt.figure(figsize=(10, 5))\n", "plt.scatter(latest_data['total_cases'], latest_data['total_deaths'], alpha=0.5)\n", "plt.xlabel('Total Cases')\n", "plt.ylabel('Total Deaths')\n", "plt.title('Total COVID-19 Cases vs. Total Deaths by Country')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histogram\n", "Let's create a histogram to visualize the distribution of daily new cases globally." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "\n", "# Plot the histogram\n", "plt.figure(figsize=(10, 5))\n", "plt.hist(df['new_cases'].dropna(), bins=50, color='green', alpha=0.7)\n", "plt.xlabel('New Cases')\n", "plt.ylabel('Frequency')\n", "plt.title('Distribution of Daily New COVID-19 Cases Globally')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Box Plot\n", "Let's compare the distribution of daily new cases per million people across continents using a box plot."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns  # sns.boxplot below requires seaborn\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "\n", "# Prepare data for box plot: normalize new cases by population\n", "df['new_cases_per_million'] = df['new_cases'] / df['population'] * 1e6\n", "plt.figure(figsize=(12, 6))\n", "sns.boxplot(x='continent', y='new_cases_per_million', data=df)\n", "plt.xlabel('Continent')\n", "plt.ylabel('New Cases per Million')\n", "plt.title('Distribution of Daily New COVID-19 Cases per Continent')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pie Chart\n", "Finally, let's create a pie chart showing the proportion of total COVID-19 deaths by continent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "\n", "# total_deaths is cumulative: take each country's maximum, then sum within each continent\n", "country_deaths = df.groupby(['continent', 'location'])['total_deaths'].max()\n", "continent_deaths = country_deaths.groupby(level='continent').sum()\n", "\n", "# Plot the pie chart\n", "plt.figure(figsize=(8, 8))\n", "continent_deaths.plot(kind='pie', autopct='%1.1f%%', startangle=140)\n", "plt.ylabel('')\n", "plt.title('Proportion of Total COVID-19 Deaths by Continent')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Line Plot with Moving Averages and Annotations\n", "This example plots daily new cases for multiple countries with moving averages and annotations for significant business events."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "df['date'] = pd.to_datetime(df['date'])\n", "\n", "# Select countries for comparison\n", "countries = ['United States', 'India', 'Brazil']\n", "filtered_data = df[df['location'].isin(countries)]\n", "\n", "# Plot the data\n", "plt.figure(figsize=(12, 6))\n", "for country in countries:\n", "    country_data = filtered_data[filtered_data['location'] == country]\n", "    plt.plot(country_data['date'], country_data['new_cases'].rolling(window=7).mean(), label=f'{country} (7-day MA)')\n", "\n", "# Add annotations for business events\n", "business_events = {\n", "    '2020-03-11': 'WHO Declares Pandemic',\n", "    '2020-12-14': 'First US Vaccination'\n", "}\n", "for date, event in business_events.items():\n", "    plt.axvline(pd.to_datetime(date), color='gray', linestyle='--')\n", "    plt.text(pd.to_datetime(date), plt.ylim()[1] * 0.9, event, rotation=90, verticalalignment='top')\n", "\n", "plt.xlabel('Date')\n", "plt.ylabel('New Cases (7-day MA)')\n", "plt.title('Daily New COVID-19 Cases with Business Events Annotations')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouped Bar Plot for Total Cases by Continent and Quarter\n", "This example creates a grouped bar plot showing the total number of COVID-19 cases by continent for each quarter, useful for business trend analysis."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "df['date'] = pd.to_datetime(df['date'])\n", "df['quarter'] = df['date'].dt.to_period('Q')\n", "\n", "# Aggregate data by continent and quarter.\n", "# Sum new_cases, not total_cases: total_cases is cumulative, so summing it\n", "# over every day of a quarter would count the same cases many times.\n", "continent_quarterly = df.groupby(['continent', 'quarter'])['new_cases'].sum().unstack().fillna(0)\n", "\n", "# Plot the grouped bar plot\n", "continent_quarterly.T.plot(kind='bar', stacked=False, figsize=(15, 8))\n", "plt.xlabel('Quarter')\n", "plt.ylabel('Cases Reported in Quarter')\n", "plt.title('Total COVID-19 Cases by Continent and Quarter')\n", "plt.legend(title='Continent')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scatter Plot with Bubble Sizes Representing Population\n", "This example generates a scatter plot showing the relationship between total cases and total deaths with bubble sizes representing population size." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "\n", "# Keep country rows only (aggregate rows such as 'World' have an empty continent),\n", "# then take each country's maximum, which for cumulative totals is the latest value\n", "countries = df.dropna(subset=['continent'])\n", "latest_data = countries.groupby('location')[['total_cases', 'total_deaths', 'population']].max().dropna()\n", "\n", "# Plot the scatter plot with bubble sizes\n", "plt.figure(figsize=(10, 6))\n", "sns.scatterplot(data=latest_data, x='total_cases', y='total_deaths', size='population', sizes=(20, 200), alpha=0.5)\n", "plt.xlabel('Total Cases')\n", "plt.ylabel('Total Deaths')\n", "plt.title('Total COVID-19 Cases vs. Total Deaths by Country (Bubble Size: Population)')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histogram with Log Scale\n", "This example creates a histogram for daily new cases globally and applies a log scale to better visualize the distribution."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "\n", "# A log scale is undefined at zero, so keep only days with at least one new case\n", "positive_cases = df.loc[df['new_cases'] > 0, 'new_cases']\n", "\n", "# Plot the histogram with log scale\n", "plt.figure(figsize=(10, 6))\n", "sns.histplot(positive_cases, bins=50, color='purple', log_scale=True)\n", "plt.xlabel('New Cases (Log Scale)')\n", "plt.ylabel('Frequency')\n", "plt.title('Distribution of Daily New COVID-19 Cases Globally (Log Scale)')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Box Plot with Points Overlaid by Continent\n", "This example creates a box plot to compare the distribution of daily new cases per million people across continents and overlays the individual data points." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "df['new_cases_per_million'] = df['new_cases'] / df['population'] * 1e6\n", "\n", "# Plot the box plot with points overlaid\n", "plt.figure(figsize=(12, 6))\n", "sns.boxplot(x='continent', y='new_cases_per_million', data=df)\n", "sns.stripplot(x='continent', y='new_cases_per_million', data=df, color='black', alpha=0.3, jitter=True)\n", "plt.xlabel('Continent')\n", "plt.ylabel('New Cases per Million')\n", "plt.title('Distribution of Daily New COVID-19 Cases per Continent with Points Overlaid')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Donut Pie Chart for Vaccine Distribution\n", "This example creates a donut pie chart showing the proportion of total vaccinations by continent."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the dataset\n", "df = pd.read_csv(file_location)\n", "\n", "# total_vaccinations is cumulative: take each country's maximum, then sum within each continent\n", "country_vaccinations = df.groupby(['continent', 'location'])['total_vaccinations'].max()\n", "continent_vaccinations = country_vaccinations.groupby(level='continent').sum()\n", "\n", "# Plot the donut pie chart\n", "plt.figure(figsize=(8, 8))\n", "plt.pie(continent_vaccinations, labels=continent_vaccinations.index, autopct='%1.1f%%', startangle=140, wedgeprops=dict(width=0.3))\n", "plt.title('Proportion of Total COVID-19 Vaccinations by Continent')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }