{ "cells": [ { "cell_type": "markdown", "id": "5d6ba50c", "metadata": {}, "source": [ "# Pandas 101" ] }, { "cell_type": "code", "execution_count": 1, "id": "804c4139", "metadata": {}, "outputs": [], "source": [ "# Assoc. Prof. Dr. Piyabute Fuangkhon\n", "# Department of Digital Business Management\n", "# Martin de Tours School of Management and Economics\n", "# Assumption University\n", "# Update: 22/12/2024" ] }, { "cell_type": "markdown", "id": "e8723b64", "metadata": {}, "source": [ "# Introduction to Pandas for Business Students\n", "\n", "This notebook is designed to introduce you to the basics of Pandas, a powerful library for data manipulation and analysis. By the end of this tutorial, you should be comfortable with the fundamental operations and methods in Pandas that are essential for data analysis tasks." ] }, { "cell_type": "code", "execution_count": 1, "id": "f744369d-dc86-4eea-9e07-f2f5245c39d1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pandas in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (2.3.2)\n", "Requirement already satisfied: numpy>=1.26.0 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from pandas) (2.3.3)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from pandas) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from pandas) (2025.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from pandas) (2025.2)\n", "Requirement already satisfied: six>=1.5 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n" ] } ], "source": [ "!source /home/piyabute/jupyter/venv/bin/activate && pip install pandas" ] }, { "cell_type": "markdown", "id": "42614a03-e779-48ea-bff4-0b0b6b9c3185", "metadata": 
{}, "source": [ "## Importing the Pandas Library\n", "\n", "First, we need to import the Pandas library. Pandas is usually imported under the alias `pd`." ] }, { "cell_type": "code", "execution_count": 2, "id": "d0d85e8d", "metadata": {}, "outputs": [], "source": [ "# Importing Pandas Library\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "be106f7e", "metadata": {}, "source": [ "## Basic Data Structures: Series\n", "\n", "A Series is a one-dimensional, labeled array capable of holding data of any type (integers, strings, floating-point numbers, etc.). It is akin to a single column in a spreadsheet or a database table. Each element in a Series has an associated label, known as its index." ] }, { "cell_type": "code", "execution_count": 3, "id": "f23d26e2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Series:\n", "0 1\n", "1 2\n", "2 3\n", "3 4\n", "4 5\n", "dtype: int64\n" ] } ], "source": [ "# Creating a Series\n", "data = [1, 2, 3, 4, 5] # Sample data\n", "series = pd.Series(data) # Creating a Series\n", "print(\"Series:\\n\", series, sep=\"\") # Printing the Series" ] }, { "cell_type": "markdown", "id": "0b93506d", "metadata": {}, "source": [ "## Basic Data Structures: DataFrame\n", "\n", "A DataFrame in Pandas is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a table in a database or a spreadsheet in Excel."
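, "\n", "\n", "Once a DataFrame exists, its overall structure can be checked quickly. A minimal sketch on stand-in data (not the notebook's `df`):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})\n", "print(df.shape)   # (rows, columns)\n", "print(df.dtypes)  # data type of each column\n", "df.info()         # index, columns, non-null counts, memory usage\n", "```"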
] }, { "cell_type": "code", "execution_count": 4, "id": "463ab1ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame:\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Creating a DataFrame\n", "data = { # Sample data\n", " 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],\n", " 'Age': [24, 27, 22, 22, 29],\n", " 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],\n", " 'Salary': [50000, 54000, 50000, 58000, 60000]\n", "}\n", "df = pd.DataFrame(data) # Creating a DataFrame\n", "print(\"DataFrame:\\n\", df, sep=\"\") # Printing the DataFrame" ] }, { "cell_type": "markdown", "id": "66bd0aea", "metadata": {}, "source": [ "## Viewing the First Few Rows\n", "\n", "The `head()` method in Pandas is used to view the first few rows of a DataFrame. By default, it returns the first five rows, but you can specify a different number of rows to display by passing an integer argument to the method." 
] }, { "cell_type": "code", "execution_count": 5, "id": "4ac9e393", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First few rows:\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n", "First row:\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n" ] } ], "source": [ "# Viewing the first few rows of the DataFrame\n", "print(\"First few rows:\\n\", df.head(), sep=\"\") # Displaying the first few rows\n", "\n", "# Viewing the first row of the DataFrame\n", "print(\"First row:\\n\", df.head(1), sep=\"\") # Displaying the first row" ] }, { "cell_type": "markdown", "id": "d604ee12", "metadata": {}, "source": [ "## Viewing the Last Few Rows\n", "\n", "The `tail()` method in Pandas is used to view the last few rows of a DataFrame. By default, it returns the last five rows, but you can specify a different number of rows to display by passing an integer argument to the method." 
] }, { "cell_type": "code", "execution_count": 6, "id": "b0ba4cb8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last few rows:\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n", "Last row:\n", " Name Age City Salary\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Viewing the last few rows of the DataFrame\n", "print(\"Last few rows:\\n\", df.tail(), sep=\"\") # Displaying the last few rows\n", "\n", "# Viewing the last row of the DataFrame\n", "print(\"Last row:\\n\", df.tail(1), sep=\"\") # Displaying the last row" ] }, { "cell_type": "markdown", "id": "4a152b58", "metadata": {}, "source": [ "## Summary Statistics\n", "\n", "The `describe()` method in Pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values." ] }, { "cell_type": "code", "execution_count": 7, "id": "44876876", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Summary statistics:\n", " Age Salary\n", "count 5.000000 5.0000\n", "mean 24.800000 54400.0000\n", "std 3.114482 4560.7017\n", "min 22.000000 50000.0000\n", "25% 22.000000 50000.0000\n", "50% 24.000000 54000.0000\n", "75% 27.000000 58000.0000\n", "max 29.000000 60000.0000\n" ] } ], "source": [ "# Getting summary statistics\n", "print(\"Summary statistics:\\n\", df.describe(), sep=\"\") # Displaying summary statistics" ] }, { "cell_type": "markdown", "id": "5045872a", "metadata": {}, "source": [ "## Selecting a Single Column\n", "\n", "To select a single column from a DataFrame, you can use the column name as a string inside the brackets. This returns a Pandas Series representing the column." 
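, "\n", "\n", "One detail worth noting (a small sketch on stand-in data, not the notebook's `df`): single brackets return a Series, while a one-element list of column names returns a one-column DataFrame:\n", "\n", "```python\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})\n", "s = df['Name']    # single brackets -> Series\n", "d = df[['Name']]  # list of names  -> one-column DataFrame\n", "```"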
] }, { "cell_type": "code", "execution_count": 8, "id": "849db8a0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selecting the 'Name' column:\n", "0 Alice\n", "1 Bob\n", "2 Charlie\n", "3 David\n", "4 Eva\n", "Name: Name, dtype: object\n" ] } ], "source": [ "# Selecting a single column\n", "print(\"Selecting the 'Name' column:\\n\", df['Name'], sep=\"\") # Selecting the 'Name' column" ] }, { "cell_type": "markdown", "id": "b70f7c8a", "metadata": {}, "source": [ "## Selecting Multiple Columns\n", "\n", "To select multiple columns from a DataFrame, you can pass a list of column names inside the brackets. This returns a new DataFrame containing only the specified columns." ] }, { "cell_type": "code", "execution_count": 9, "id": "dc2b00e5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selecting 'Name' and 'Salary' columns:\n", " Name Salary\n", "0 Alice 50000\n", "1 Bob 54000\n", "2 Charlie 50000\n", "3 David 58000\n", "4 Eva 60000\n" ] } ], "source": [ "# Selecting multiple columns\n", "print(\"Selecting 'Name' and 'Salary' columns:\\n\", df[['Name', 'Salary']], sep=\"\") # Selecting 'Name' and 'Salary' columns" ] }, { "cell_type": "markdown", "id": "9f20a22d", "metadata": {}, "source": [ "## Filtering Rows Based on a Condition\n", "\n", "To filter rows in a DataFrame based on a condition, you can use boolean indexing within the brackets. This involves specifying a condition that returns a boolean Series, which is then used to index the DataFrame and return only the rows where the condition is True." 
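, "\n", "\n", "Conditions can also be combined. A minimal sketch on stand-in data: wrap each condition in parentheses and join them with `&` (and) or `|` (or):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({'Age': [24, 27, 22, 29], 'Salary': [50000, 54000, 50000, 60000]})\n", "subset = df[(df['Age'] > 23) & (df['Salary'] >= 54000)]  # both conditions must hold\n", "```"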
] }, { "cell_type": "code", "execution_count": 10, "id": "33cc3787", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Filtering rows where Age > 25:\n", " Name Age City Salary\n", "1 Bob 27 Los Angeles 54000\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Filtering rows based on a condition\n", "print(\"Filtering rows where Age > 25:\\n\", df[df['Age'] > 25], sep=\"\") # Filtering rows where Age > 25" ] }, { "cell_type": "markdown", "id": "93e8a7f4-3750-45e4-ab73-1f923d97dcc0", "metadata": {}, "source": [ "## Sorting Rows\n", "\n", "Sorting rows in a DataFrame can be done using the `sort_values()` method. This method allows you to sort the data based on one or more columns, in either ascending or descending order. By specifying the column names and the desired order, you can easily organize your data to facilitate analysis. This is useful for tasks such as ranking, ordering, and prioritizing data within a DataFrame." ] }, { "cell_type": "code", "execution_count": 11, "id": "15a0cd62-8ef6-4613-aafc-d3f21c63fff8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame sorted by Age:\n", " Name Age City Salary\n", "3 David 22 Houston 58000\n", "2 Charlie 22 Chicago 50000\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Sorting by a single column\n", "sorted_by_age = df.sort_values(by='Age') # Sorting the DataFrame by the 'Age' column in ascending order\n", "print(\"DataFrame sorted by Age:\\n\", sorted_by_age, sep=\"\") # Displaying the DataFrame sorted by 'Age'" ] }, { "cell_type": "code", "execution_count": 12, "id": "3006a91b-2ef1-4909-b60c-bd82875c497d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame sorted by Age (ascending) and Salary (descending):\n", " Name Age City Salary\n", "3 David 22 Houston 58000\n", "2 Charlie 22 Chicago 50000\n", "0 Alice 24 New York 50000\n",
"1 Bob 27 Los Angeles 54000\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Sorting by multiple columns\n", "sorted_by_age_salary = df.sort_values(by=['Age', 'Salary'], ascending=[True, False]) # Sorting by 'Age' (ascending), then by 'Salary' (descending)\n", "print(\"DataFrame sorted by Age (ascending) and Salary (descending):\\n\", sorted_by_age_salary, sep=\"\") # Displaying the DataFrame sorted by 'Age' and 'Salary'" ] }, { "cell_type": "markdown", "id": "410150ca-4bc8-45af-8583-feb2eee65989", "metadata": {}, "source": [ "## Adding a New Row\n", "\n", "To add a new row to a DataFrame, you can assign a row of values at a new index with `loc`, or concatenate a one-row DataFrame with `concat()`." ] }, { "cell_type": "code", "execution_count": 13, "id": "be28b5b2-7cf2-4af0-92f5-32a50dbe9260", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame after adding new data (using loc):\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n", "5 Louis 32 Minnesota 70000\n" ] } ], "source": [ "# Adding a new row using 'loc'\n", "\n", "# New data to be added\n", "new_data = {'Name': 'Louis', 'Age': 32, 'City': 'Minnesota', 'Salary': 70000} # New data to be added\n", "\n", "# Adding the new data using loc\n", "df.loc[len(df)] = new_data # Adding the new data to the DataFrame using loc\n", "print(\"DataFrame after adding new data (using loc):\\n\", df, sep=\"\") # Printing the DataFrame after adding new data" ] }, { "cell_type": "code", "execution_count": 14, "id": "8c72dd72-fedd-4777-918a-da99876d3a10", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame after adding new data (using concat):\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n", "5
Louis 32 Minnesota 70000\n", "6 Frank 28 San Francisco 62000\n" ] } ], "source": [ "# Adding a new row using 'concat'\n", "\n", "# New data to be added\n", "new_data = pd.DataFrame([{'Name': 'Frank', 'Age': 28, 'City': 'San Francisco', 'Salary': 62000}]) # New data to be added\n", "\n", "# Adding the new data to the DataFrame using concat\n", "df = pd.concat([df, new_data], ignore_index=True) # Concatenating the new data to the DataFrame and resetting the index\n", "print(\"DataFrame after adding new data (using concat):\\n\", df, sep=\"\") # Printing the DataFrame after adding new data" ] }, { "cell_type": "markdown", "id": "e3d74527", "metadata": {}, "source": [ "## Adding a New Column\n", "\n", "To add a new column to a DataFrame, you assign a Series or list of values to a new column name within the DataFrame." ] }, { "cell_type": "code", "execution_count": 15, "id": "66789203", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame with new 'Department' column:\n", " Name Age City Salary Department\n", "0 Alice 24 New York 50000 HR\n", "1 Bob 27 Los Angeles 54000 Engineering\n", "2 Charlie 22 Chicago 50000 Marketing\n", "3 David 22 Houston 58000 Finance\n", "4 Eva 29 Phoenix 60000 IT\n", "5 Louis 32 Minnesota 70000 Data Science\n", "6 Frank 28 San Francisco 62000 Digital Commerce\n" ] } ], "source": [ "# Adding a new column\n", "df['Department'] = ['HR', 'Engineering', 'Marketing', 'Finance', 'IT', 'Data Science', 'Digital Commerce'] # Adding a new 'Department' column\n", "print(\"DataFrame with new 'Department' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "5bc80e49", "metadata": {}, "source": [ "## Modifying an Existing Column\n", "\n", "To modify an existing column in a DataFrame, you can assign new values to the column." 
] }, { "cell_type": "code", "execution_count": 16, "id": "cb0a36e6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame with updated 'Salary' column:\n", " Name Age City Salary Department\n", "0 Alice 24 New York 55000 HR\n", "1 Bob 27 Los Angeles 59000 Engineering\n", "2 Charlie 22 Chicago 55000 Marketing\n", "3 David 22 Houston 63000 Finance\n", "4 Eva 29 Phoenix 65000 IT\n", "5 Louis 32 Minnesota 75000 Data Science\n", "6 Frank 28 San Francisco 67000 Digital Commerce\n" ] } ], "source": [ "# Modifying an existing column\n", "df['Salary'] = df['Salary'] + 5000 # Increasing 'Salary' by 5000\n", "print(\"DataFrame with updated 'Salary' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "c2dc49ca", "metadata": {}, "source": [ "## Deleting a Column\n", "\n", "To delete a column in a DataFrame, you can use the `drop()` method." ] }, { "cell_type": "code", "execution_count": 17, "id": "9e9e9777", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame after dropping 'Department' column:\n", " Name Age City Salary\n", "0 Alice 24 New York 55000\n", "1 Bob 27 Los Angeles 59000\n", "2 Charlie 22 Chicago 55000\n", "3 David 22 Houston 63000\n", "4 Eva 29 Phoenix 65000\n", "5 Louis 32 Minnesota 75000\n", "6 Frank 28 San Francisco 67000\n" ] } ], "source": [ "# Deleting a column\n", "df = df.drop(columns=['Department']) # Dropping the 'Department' column\n", "print(\"DataFrame after dropping 'Department' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "3e9f5cb7", "metadata": {}, "source": [ "## Handling Missing Data\n", "\n", "Handling missing data is an essential part of data cleaning and preprocessing in Pandas. When creating a DataFrame, missing values can be introduced using `None`, `numpy.nan`, or `pandas.NaT` (for datetime types)." 
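, "\n", "\n", "Before filling or dropping anything, it is usually worth counting the gaps. A minimal sketch on stand-in data: `isna()` marks missing entries and `sum()` counts them per column:\n", "\n", "```python\n", "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.DataFrame({'Age': [24, np.nan, 22], 'Salary': [50000, 54000, np.nan]})\n", "missing_counts = df.isna().sum()  # number of missing values in each column\n", "```"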
] }, { "cell_type": "markdown", "id": "0aee1590", "metadata": {}, "source": [ "## Creating DataFrame with Missing Values\n", "\n", "To create a DataFrame with missing values, you can define a dictionary or a list of lists, including `None` or `numpy.nan` to represent missing entries. Pandas will interpret these as missing data." ] }, { "cell_type": "code", "execution_count": 18, "id": "f51cf14b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame with missing values:\n", " Name Age Salary\n", "0 Alice 24.0 50000.0\n", "1 Bob NaN 54000.0\n", "2 Charlie 22.0 NaN\n", "3 David 32.0 58000.0\n", "4 Eva 29.0 60000.0\n" ] } ], "source": [ "# Creating a DataFrame with missing values\n", "import numpy as np\n", "data = { # Sample data with missing values\n", " 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],\n", " 'Age': [24, np.nan, 22, 32, 29],\n", " 'Salary': [50000, 54000, np.nan, 58000, 60000]\n", "}\n", "df_missing = pd.DataFrame(data) # Creating a DataFrame with missing values\n", "print(\"DataFrame with missing values:\\n\", df_missing, sep=\"\") # Displaying the DataFrame with missing values" ] }, { "cell_type": "markdown", "id": "eda9980e", "metadata": {}, "source": [ "## Filling Missing Values\n", "\n", "Filling missing values in a DataFrame involves replacing NaN or None values with specified values. This process, known as imputation, can be done using the `fillna()` method." 
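, "\n", "\n", "Besides filling with a constant or a mean, values can be carried forward from the previous observation. A small sketch on a stand-in Series:\n", "\n", "```python\n", "import pandas as pd\n", "import numpy as np\n", "\n", "s = pd.Series([1.0, np.nan, 3.0])\n", "filled_zero = s.fillna(0)   # replace NaN with a constant\n", "filled_ffill = s.ffill()    # propagate the previous value forward\n", "```"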
] }, { "cell_type": "code", "execution_count": 19, "id": "8ec265dc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame with filled missing values:\n", " Name Age Salary\n", "0 Alice 24.00 50000.0\n", "1 Bob 26.75 54000.0\n", "2 Charlie 22.00 55500.0\n", "3 David 32.00 58000.0\n", "4 Eva 29.00 60000.0\n" ] } ], "source": [ "# Filling missing values\n", "df_missing_filled = df_missing.fillna({'Age': df_missing['Age'].mean(), 'Salary': df_missing['Salary'].mean()}) # Filling missing values with mean\n", "print(\"DataFrame with filled missing values:\\n\", df_missing_filled, sep=\"\") # Displaying the DataFrame with filled missing values" ] }, { "cell_type": "markdown", "id": "00f9752c", "metadata": {}, "source": [ "## Dropping Rows with Missing Values\n", "\n", "To drop rows with missing values, you can use the `dropna()` method. This will remove any rows that contain NaN values." ] }, { "cell_type": "code", "execution_count": 20, "id": "ea72e062", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame after dropping rows with missing values:\n", " Name Age Salary\n", "0 Alice 24.0 50000.0\n", "3 David 32.0 58000.0\n", "4 Eva 29.0 60000.0\n" ] } ], "source": [ "# Dropping rows with missing values\n", "df_missing_dropped = df_missing.dropna() # Dropping rows with missing values\n", "print(\"DataFrame after dropping rows with missing values:\\n\", df_missing_dropped, sep=\"\") # Displaying the DataFrame after dropping rows with missing values" ] }, { "cell_type": "markdown", "id": "51b0481d", "metadata": {}, "source": [ "## Grouping and Aggregating Data\n", "\n", "Grouping and aggregating data in Pandas involves combining data into groups based on specific criteria and then performing aggregate functions on these groups. The `groupby()` method is used to split the data into groups, and aggregation functions like `sum()`, `mean()`, `count()`, and others are applied to these groups." 
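, "\n", "\n", "Several aggregations can also be computed at once with `agg()`. A minimal sketch on stand-in data:\n", "\n", "```python\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({'Dept': ['HR', 'HR', 'IT'], 'Salary': [50000, 54000, 60000]})\n", "stats = df.groupby('Dept')['Salary'].agg(['mean', 'count'])  # mean salary and group size per department\n", "```"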
] }, { "cell_type": "code", "execution_count": 21, "id": "3a17d800", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average Salary by Age:\n", " Age Average Salary\n", "0 22 59000.0\n", "1 24 55000.0\n", "2 27 59000.0\n", "3 28 67000.0\n", "4 29 65000.0\n", "5 32 75000.0\n" ] } ], "source": [ "# Grouping data by 'Age' and calculating the average 'Salary'\n", "grouped_df = df.groupby('Age')['Salary'].mean().reset_index() # Grouping by 'Age' and calculating mean 'Salary'\n", "grouped_df.columns = ['Age', 'Average Salary'] # Setting the column name for the second column after grouping data by 'Age'\n", "print(\"Average Salary by Age:\\n\", grouped_df, sep=\"\") # Displaying the average Salary by Age" ] }, { "cell_type": "markdown", "id": "06db0961", "metadata": {}, "source": [ "## Saving DataFrame to a CSV File\n", "\n", "Saving a DataFrame to a CSV file in Pandas is a straightforward process that involves using the `to_csv()` method. This method allows you to export the contents of a DataFrame into a CSV (Comma-Separated Values) file." ] }, { "cell_type": "code", "execution_count": 22, "id": "4279d51b", "metadata": {}, "outputs": [], "source": [ "# Saving DataFrame to a CSV file\n", "df.to_csv('data.csv', index=False) # Saving the DataFrame to a CSV file" ] }, { "cell_type": "markdown", "id": "51f66737", "metadata": {}, "source": [ "## Loading DataFrame from a CSV File\n", "\n", "Loading a DataFrame from a CSV file in Pandas involves using the `read_csv()` function. This function reads a CSV file into a DataFrame, allowing you to work with the data in a structured, tabular format." 
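, "\n", "\n", "`read_csv()` also accepts parameters for reading only part of a file, such as `usecols` and `nrows`. A sketch that writes a small file first so it is self-contained (`demo.csv` is a throwaway name):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27], 'City': ['NY', 'LA']}).to_csv('demo.csv', index=False)\n", "subset = pd.read_csv('demo.csv', usecols=['Name', 'Age'], nrows=1)  # selected columns, first row only\n", "```"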
] }, { "cell_type": "code", "execution_count": 23, "id": "dc40764b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded DataFrame from CSV file:\n", " Name Age City Salary\n", "0 Alice 24 New York 55000\n", "1 Bob 27 Los Angeles 59000\n", "2 Charlie 22 Chicago 55000\n", "3 David 22 Houston 63000\n", "4 Eva 29 Phoenix 65000\n", "5 Louis 32 Minnesota 75000\n", "6 Frank 28 San Francisco 67000\n" ] } ], "source": [ "# Loading DataFrame from a CSV file\n", "loaded_df = pd.read_csv('data.csv') # Loading the DataFrame from a CSV file\n", "print(\"Loaded DataFrame from CSV file:\\n\", loaded_df, sep=\"\") # Displaying the loaded DataFrame" ] }, { "cell_type": "markdown", "id": "cf69cd89", "metadata": {}, "source": [ "## Practice\n", "\n", "The code block below reads sales data from a URL and stores the data in a DataFrame. The first row in the dataset defines the attribute names. Your task is to find and display useful information (aggregated data) from this data using the Pandas library."
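, "\n", "\n", "As a starting point, summary views and a grouped total are typical first steps. A sketch on stand-in data (the real column names depend on the downloaded file; `Region` and `Amount` here are illustrative only):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "sales_data = pd.DataFrame({'Region': ['North', 'South', 'North'], 'Amount': [100, 200, 300]})  # stand-in for the real file\n", "print(sales_data.describe())                           # numeric summaries\n", "totals = sales_data.groupby('Region')['Amount'].sum()  # total Amount per Region\n", "```"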
] }, { "cell_type": "code", "execution_count": 24, "id": "eee530d1", "metadata": {}, "outputs": [ { "ename": "HTTPError", "evalue": "HTTP Error 400: Bad Request", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mHTTPError\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[24]\u001b[39m\u001b[32m, line 8\u001b[39m\n\u001b[32m 5\u001b[39m url = \u001b[33m\"\u001b[39m\u001b[33mhttps://piyabute.s3.ap-southeast-1.amazonaws.com/notebook/sales_data_1000.csv\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 7\u001b[39m \u001b[38;5;66;03m# Step 3: Read the CSV file from the URL into a pandas DataFrame\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m8\u001b[39m sales_data = \u001b[43mpd\u001b[49m\u001b[43m.\u001b[49m\u001b[43mread_csv\u001b[49m\u001b[43m(\u001b[49m\u001b[43murl\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 10\u001b[39m \u001b[38;5;66;03m# Step 4: Print the first 5 rows to verify the data\u001b[39;00m\n\u001b[32m 11\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mFirst 5 rows of data:\u001b[39m\u001b[33m\"\u001b[39m)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/jupyter/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1026\u001b[39m, in \u001b[36mread_csv\u001b[39m\u001b[34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, 
dtype_backend)\u001b[39m\n\u001b[32m 1013\u001b[39m kwds_defaults = _refine_defaults_read(\n\u001b[32m 1014\u001b[39m dialect,\n\u001b[32m 1015\u001b[39m delimiter,\n\u001b[32m (...)\u001b[39m\u001b[32m 1022\u001b[39m dtype_backend=dtype_backend,\n\u001b[32m 1023\u001b[39m )\n\u001b[32m 1024\u001b[39m kwds.update(kwds_defaults)\n\u001b[32m-> \u001b[39m\u001b[32m1026\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_read\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/jupyter/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:620\u001b[39m, in \u001b[36m_read\u001b[39m\u001b[34m(filepath_or_buffer, kwds)\u001b[39m\n\u001b[32m 617\u001b[39m _validate_names(kwds.get(\u001b[33m\"\u001b[39m\u001b[33mnames\u001b[39m\u001b[33m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m))\n\u001b[32m 619\u001b[39m \u001b[38;5;66;03m# Create the parser.\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m620\u001b[39m parser = \u001b[43mTextFileReader\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 622\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m chunksize \u001b[38;5;129;01mor\u001b[39;00m iterator:\n\u001b[32m 623\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n", "\u001b[36mFile \u001b[39m\u001b[32m~/jupyter/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1620\u001b[39m, in \u001b[36mTextFileReader.__init__\u001b[39m\u001b[34m(self, f, engine, **kwds)\u001b[39m\n\u001b[32m 1617\u001b[39m \u001b[38;5;28mself\u001b[39m.options[\u001b[33m\"\u001b[39m\u001b[33mhas_index_names\u001b[39m\u001b[33m\"\u001b[39m] = kwds[\u001b[33m\"\u001b[39m\u001b[33mhas_index_names\u001b[39m\u001b[33m\"\u001b[39m]\n\u001b[32m 1619\u001b[39m 
\u001b[38;5;28mself\u001b[39m.handles: IOHandles | \u001b[38;5;28;01mNone\u001b[39;00m = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1620\u001b[39m \u001b[38;5;28mself\u001b[39m._engine = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_make_engine\u001b[49m\u001b[43m(\u001b[49m\u001b[43mf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mengine\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/jupyter/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1880\u001b[39m, in \u001b[36mTextFileReader._make_engine\u001b[39m\u001b[34m(self, f, engine)\u001b[39m\n\u001b[32m 1878\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[33m\"\u001b[39m\u001b[33mb\u001b[39m\u001b[33m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m mode:\n\u001b[32m 1879\u001b[39m mode += \u001b[33m\"\u001b[39m\u001b[33mb\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m-> \u001b[39m\u001b[32m1880\u001b[39m \u001b[38;5;28mself\u001b[39m.handles = \u001b[43mget_handle\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1881\u001b[39m \u001b[43m \u001b[49m\u001b[43mf\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1882\u001b[39m \u001b[43m \u001b[49m\u001b[43mmode\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1883\u001b[39m \u001b[43m \u001b[49m\u001b[43mencoding\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43moptions\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mencoding\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1884\u001b[39m \u001b[43m 
...(remaining traceback frames truncated)...\n", "\u001b[31mHTTPError\u001b[39m: HTTP Error 400: Bad Request" ] } ], "source": [ "# Step 1: Import the pandas library\n", "import pandas as pd\n", "\n", "# Step 2: Specify the URL of the CSV file to be read\n", "url = \"https://piyabute.s3.ap-southeast-1.amazonaws.com/notebook/sales_data_1000.csv\"\n", "\n", "# Step 3: Read the CSV file from the URL into a pandas DataFrame\n", "sales_data = pd.read_csv(url)\n", "\n", "# Step 4: Print the first 5 rows to verify the data\n", "print(\"First 5 rows of data:\")\n", "print(sales_data.head(5))" ] }, { "cell_type": "code", "execution_count": null, "id": "2daee743", "metadata": {}, "outputs": [], "source": [ "# Question 1: Find the number of units sold for each product\n", "units_sold_per_product = sales_data.groupby('Product Name')['Quantity 
Sold'].sum().reset_index()\n", "units_sold_per_product" ] }, { "cell_type": "code", "execution_count": null, "id": "52e8b9b4-14c7-4992-b895-97ec62d43dbc", "metadata": {}, "outputs": [], "source": [ "# Question 2: Find the total revenue for each province\n", "total_revenue_per_province = sales_data.groupby('Province')['Total Sales'].sum().reset_index()\n", "total_revenue_per_province" ] }, { "cell_type": "code", "execution_count": null, "id": "173c5d4e-1e21-43b9-96b4-41802ab4e5f0", "metadata": {}, "outputs": [], "source": [ "# Question 3: Find the number of transactions for each payment method\n", "transactions_per_payment_method = sales_data.groupby('Payment Method')['Payment Method'].count()\n", "transactions_per_payment_method" ] }, { "cell_type": "code", "execution_count": null, "id": "51807441-26f0-445e-a0b0-3f0867492af6", "metadata": {}, "outputs": [], "source": [ "# Question 4: Find the average unit price for each product\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fb7a6f53-c26e-44aa-8690-2a79175dd353", "metadata": {}, "outputs": [], "source": [ "# Question 5: Find the top 5 customers by total sales\n" ] }, { "cell_type": "code", "execution_count": null, "id": "96e421ac-99d0-4d6c-93fc-ba36706fc3e0", "metadata": {}, "outputs": [], "source": [ "# Question 6: Find the total discount given for each product\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3fe7ead5-e2d6-4fbe-b941-3cd01f75a59d", "metadata": {}, "outputs": [], "source": [ "# Question 7: Find the total tax collected for each province\n" ] }, { "cell_type": "code", "execution_count": null, "id": "afea5dd6-0b10-4b8e-ae30-bc82fe464bd4", "metadata": {}, "outputs": [], "source": [ "# Question 8: Find the average net sales for each channel (Online vs In-store)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4dbdc892-085c-4dc6-8e08-06724fdf4b0a", "metadata": {}, "outputs": [], "source": [ "# Question 9: Find the total quantity sold by each salesperson\n" ] }, { 
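"cell_type": "markdown", "id": "c7e1a9f2-hint-md", "metadata": {}, "source": [ "**Hint: the group-then-rank pattern.** Several of the questions above ask for a \"top N\" result. A common recipe is to group, aggregate, and then rank with `nlargest`. The sketch below reuses only columns that already appeared in this notebook (`Province` and `Total Sales`); adapt the grouping column and the aggregation to each question." ] }, { "cell_type": "code", "execution_count": null, "id": "c7e1a9f2-hint-code", "metadata": {}, "outputs": [], "source": [ "# A sketch of the group-then-rank pattern, reusing columns from earlier cells\n", "top_provinces = (\n", "    sales_data.groupby('Province')['Total Sales']  # group rows by province\n", "    .sum()                                         # total sales per province\n", "    .nlargest(3)                                   # keep the 3 largest totals\n", ")\n", "top_provinces" ] }, { 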
"cell_type": "code", "execution_count": null, "id": "0b56459d-3c60-4e62-9641-5cde75273a72", "metadata": {}, "outputs": [], "source": [ "# Question 10: Find the top 3 months with the highest total sales\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }