{ "cells": [ { "cell_type": "markdown", "id": "5d6ba50c", "metadata": {}, "source": [ "# Pandas 101" ] }, { "cell_type": "code", "execution_count": 1, "id": "804c4139", "metadata": {}, "outputs": [], "source": [ "# Assoc. Prof. Dr. Piyabute Fuangkhon\n", "# Department of Digital Business Management\n", "# Martin de Tours School of Management and Economics\n", "# Assumption University\n", "# Update: 22/12/2024" ] }, { "cell_type": "markdown", "id": "e8723b64", "metadata": {}, "source": [ "# Introduction to Pandas for Business Students\n", "\n", "This notebook is designed to introduce you to the basics of Pandas, a powerful library for data manipulation and analysis. By the end of this tutorial, you should be comfortable with the fundamental operations and methods in Pandas that are essential for data analysis tasks." ] }, { "cell_type": "code", "execution_count": 1, "id": "f744369d-dc86-4eea-9e07-f2f5245c39d1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pandas in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (2.3.2)\n", "Requirement already satisfied: numpy>=1.26.0 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from pandas) (2.3.3)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from pandas) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from pandas) (2025.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from pandas) (2025.2)\n", "Requirement already satisfied: six>=1.5 in /home/piyabute/jupyter/venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n" ] } ], "source": [ "!source /home/piyabute/jupyter/venv/bin/activate && pip install pandas" ] }, { "cell_type": "markdown", "id": "42614a03-e779-48ea-bff4-0b0b6b9c3185", "metadata": 
{}, "source": [ "## Importing the Pandas Library\n", "\n", "First, we need to import the Pandas library. Pandas is usually imported under the alias `pd`." ] }, { "cell_type": "code", "execution_count": 2, "id": "d0d85e8d", "metadata": {}, "outputs": [], "source": [ "# Importing Pandas Library\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "be106f7e", "metadata": {}, "source": [ "## Basic Data Structures: Series\n", "\n", "A Series is a one-dimensional, labeled array capable of holding data of any type (integers, strings, floating-point numbers, etc.). It is akin to a single column in a spreadsheet or a database table. Each element in a Series has an associated label, known as its index." ] }, { "cell_type": "code", "execution_count": 3, "id": "f23d26e2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Series:\n", "0 1\n", "1 2\n", "2 3\n", "3 4\n", "4 5\n", "dtype: int64\n" ] } ], "source": [ "# Creating a Series\n", "data = [1, 2, 3, 4, 5] # Sample data\n", "series = pd.Series(data) # Creating a Series\n", "print(\"Series:\\n\", series, sep=\"\") # Printing the Series" ] }, { "cell_type": "markdown", "id": "0b93506d", "metadata": {}, "source": [ "## Basic Data Structures: DataFrame\n", "\n", "A DataFrame in Pandas is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a table in a database or a spreadsheet in Excel."
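, "\n", "\n", "Once a DataFrame exists, its overall structure can be checked quickly. A minimal sketch on stand-in data (not the notebook's `df`):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})\n", "print(df.shape)   # (rows, columns)\n", "print(df.dtypes)  # data type of each column\n", "df.info()         # index, columns, non-null counts, memory usage\n", "```"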
] }, { "cell_type": "code", "execution_count": 4, "id": "463ab1ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame:\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Creating a DataFrame\n", "data = { # Sample data\n", " 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],\n", " 'Age': [24, 27, 22, 22, 29],\n", " 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],\n", " 'Salary': [50000, 54000, 50000, 58000, 60000]\n", "}\n", "df = pd.DataFrame(data) # Creating a DataFrame\n", "print(\"DataFrame:\\n\", df, sep=\"\") # Printing the DataFrame" ] }, { "cell_type": "markdown", "id": "66bd0aea", "metadata": {}, "source": [ "## Viewing the First Few Rows\n", "\n", "The `head()` method in Pandas is used to view the first few rows of a DataFrame. By default, it returns the first five rows, but you can specify a different number of rows to display by passing an integer argument to the method." 
] }, { "cell_type": "code", "execution_count": 5, "id": "4ac9e393", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First few rows:\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n", "First row:\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n" ] } ], "source": [ "# Viewing the first few rows of the DataFrame\n", "print(\"First few rows:\\n\", df.head(), sep=\"\") # Displaying the first few rows\n", "\n", "# Viewing the first row of the DataFrame\n", "print(\"First row:\\n\", df.head(1), sep=\"\") # Displaying the first row" ] }, { "cell_type": "markdown", "id": "d604ee12", "metadata": {}, "source": [ "## Viewing the Last Few Rows\n", "\n", "The `tail()` method in Pandas is used to view the last few rows of a DataFrame. By default, it returns the last five rows, but you can specify a different number of rows to display by passing an integer argument to the method." 
] }, { "cell_type": "code", "execution_count": 6, "id": "b0ba4cb8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last few rows:\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n", "Last row:\n", " Name Age City Salary\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Viewing the last few rows of the DataFrame\n", "print(\"Last few rows:\\n\", df.tail(), sep=\"\") # Displaying the last few rows\n", "\n", "# Viewing the last row of the DataFrame\n", "print(\"Last row:\\n\", df.tail(1), sep=\"\") # Displaying the last row" ] }, { "cell_type": "markdown", "id": "4a152b58", "metadata": {}, "source": [ "## Summary Statistics\n", "\n", "The `describe()` method in Pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values." ] }, { "cell_type": "code", "execution_count": 7, "id": "44876876", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Summary statistics:\n", " Age Salary\n", "count 5.000000 5.0000\n", "mean 24.800000 54400.0000\n", "std 3.114482 4560.7017\n", "min 22.000000 50000.0000\n", "25% 22.000000 50000.0000\n", "50% 24.000000 54000.0000\n", "75% 27.000000 58000.0000\n", "max 29.000000 60000.0000\n" ] } ], "source": [ "# Getting summary statistics\n", "print(\"Summary statistics:\\n\", df.describe(), sep=\"\") # Displaying summary statistics" ] }, { "cell_type": "markdown", "id": "5045872a", "metadata": {}, "source": [ "## Selecting a Single Column\n", "\n", "To select a single column from a DataFrame, you can use the column name as a string inside the brackets. This returns a Pandas Series representing the column." 
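, "\n", "\n", "One detail worth noting (a small sketch on stand-in data, not the notebook's `df`): single brackets return a Series, while a one-element list of column names returns a one-column DataFrame:\n", "\n", "```python\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})\n", "s = df['Name']    # single brackets -> Series\n", "d = df[['Name']]  # list of names  -> one-column DataFrame\n", "```"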
] }, { "cell_type": "code", "execution_count": 8, "id": "849db8a0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selecting the 'Name' column:\n", "0 Alice\n", "1 Bob\n", "2 Charlie\n", "3 David\n", "4 Eva\n", "Name: Name, dtype: object\n" ] } ], "source": [ "# Selecting a single column\n", "print(\"Selecting the 'Name' column:\\n\", df['Name'], sep=\"\") # Selecting the 'Name' column" ] }, { "cell_type": "markdown", "id": "b70f7c8a", "metadata": {}, "source": [ "## Selecting Multiple Columns\n", "\n", "To select multiple columns from a DataFrame, you can pass a list of column names inside the brackets. This returns a new DataFrame containing only the specified columns." ] }, { "cell_type": "code", "execution_count": 9, "id": "dc2b00e5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selecting 'Name' and 'Salary' columns:\n", " Name Salary\n", "0 Alice 50000\n", "1 Bob 54000\n", "2 Charlie 50000\n", "3 David 58000\n", "4 Eva 60000\n" ] } ], "source": [ "# Selecting multiple columns\n", "print(\"Selecting 'Name' and 'Salary' columns:\\n\", df[['Name', 'Salary']], sep=\"\") # Selecting 'Name' and 'Salary' columns" ] }, { "cell_type": "markdown", "id": "9f20a22d", "metadata": {}, "source": [ "## Filtering Rows Based on a Condition\n", "\n", "To filter rows in a DataFrame based on a condition, you can use boolean indexing within the brackets. This involves specifying a condition that returns a boolean Series, which is then used to index the DataFrame and return only the rows where the condition is True." 
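, "\n", "\n", "Conditions can also be combined. A minimal sketch on stand-in data: wrap each condition in parentheses and join them with `&` (and) or `|` (or):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({'Age': [24, 27, 22, 29], 'Salary': [50000, 54000, 50000, 60000]})\n", "subset = df[(df['Age'] > 23) & (df['Salary'] >= 54000)]  # both conditions must hold\n", "```"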
] }, { "cell_type": "code", "execution_count": 10, "id": "33cc3787", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Filtering rows where Age > 25:\n", " Name Age City Salary\n", "1 Bob 27 Los Angeles 54000\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Filtering rows based on a condition\n", "print(\"Filtering rows where Age > 25:\\n\", df[df['Age'] > 25], sep=\"\") # Filtering rows where Age > 25" ] }, { "cell_type": "markdown", "id": "93e8a7f4-3750-45e4-ab73-1f923d97dcc0", "metadata": {}, "source": [ "## Sorting Rows\n", "\n", "Sorting rows in a DataFrame can be done using the `sort_values()` method. This method allows you to sort the data based on one or more columns, in either ascending or descending order. By specifying the column names and the desired order, you can easily organize your data to facilitate analysis. This is useful for tasks such as ranking, ordering, and prioritizing data within a DataFrame." ] }, { "cell_type": "code", "execution_count": 11, "id": "15a0cd62-8ef6-4613-aafc-d3f21c63fff8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame sorted by Age:\n", " Name Age City Salary\n", "3 David 22 Houston 58000\n", "2 Charlie 22 Chicago 50000\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Sorting by a single column\n", "sorted_by_age = df.sort_values(by='Age') # Sorting the DataFrame by the 'Age' column in ascending order\n", "print(\"DataFrame sorted by Age:\\n\", sorted_by_age, sep=\"\") # Displaying the DataFrame sorted by 'Age'" ] }, { "cell_type": "code", "execution_count": 12, "id": "3006a91b-2ef1-4909-b60c-bd82875c497d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame sorted by Age (ascending) and Salary (descending):\n", " Name Age City Salary\n", "3 David 22 Houston 58000\n", "2 Charlie 22 Chicago 50000\n", "0 Alice 24 New York 50000\n",
"1 Bob 27 Los Angeles 54000\n", "4 Eva 29 Phoenix 60000\n" ] } ], "source": [ "# Sorting by multiple columns\n", "sorted_by_age_salary = df.sort_values(by=['Age', 'Salary'], ascending=[True, False]) # Sorting by 'Age' (ascending), then by 'Salary' (descending)\n", "print(\"DataFrame sorted by Age (ascending) and Salary (descending):\\n\", sorted_by_age_salary, sep=\"\") # Displaying the DataFrame sorted by 'Age' and 'Salary'" ] }, { "cell_type": "markdown", "id": "410150ca-4bc8-45af-8583-feb2eee65989", "metadata": {}, "source": [ "## Adding a New Row\n", "\n", "To add a new row to a DataFrame, you can assign a row of values at a new index with `loc`, or concatenate a one-row DataFrame with `concat()`." ] }, { "cell_type": "code", "execution_count": 13, "id": "be28b5b2-7cf2-4af0-92f5-32a50dbe9260", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame after adding new data (using loc):\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n", "5 Louis 32 Minnesota 70000\n" ] } ], "source": [ "# Adding a new row using 'loc'\n", "\n", "# New data to be added\n", "new_data = {'Name': 'Louis', 'Age': 32, 'City': 'Minnesota', 'Salary': 70000} # New data to be added\n", "\n", "# Adding the new data using loc\n", "df.loc[len(df)] = new_data # Adding the new data to the DataFrame using loc\n", "print(\"DataFrame after adding new data (using loc):\\n\", df, sep=\"\") # Printing the DataFrame after adding new data" ] }, { "cell_type": "code", "execution_count": 14, "id": "8c72dd72-fedd-4777-918a-da99876d3a10", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame after adding new data (using concat):\n", " Name Age City Salary\n", "0 Alice 24 New York 50000\n", "1 Bob 27 Los Angeles 54000\n", "2 Charlie 22 Chicago 50000\n", "3 David 22 Houston 58000\n", "4 Eva 29 Phoenix 60000\n", "5
Louis 32 Minnesota 70000\n", "6 Frank 28 San Francisco 62000\n" ] } ], "source": [ "# Adding a new row using 'concat'\n", "\n", "# New data to be added\n", "new_data = pd.DataFrame([{'Name': 'Frank', 'Age': 28, 'City': 'San Francisco', 'Salary': 62000}]) # New data to be added\n", "\n", "# Adding the new data to the DataFrame using concat\n", "df = pd.concat([df, new_data], ignore_index=True) # Concatenating the new data to the DataFrame and resetting the index\n", "print(\"DataFrame after adding new data (using concat):\\n\", df, sep=\"\") # Printing the DataFrame after adding new data" ] }, { "cell_type": "markdown", "id": "e3d74527", "metadata": {}, "source": [ "## Adding a New Column\n", "\n", "To add a new column to a DataFrame, you assign a Series or list of values to a new column name within the DataFrame." ] }, { "cell_type": "code", "execution_count": 15, "id": "66789203", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame with new 'Department' column:\n", " Name Age City Salary Department\n", "0 Alice 24 New York 50000 HR\n", "1 Bob 27 Los Angeles 54000 Engineering\n", "2 Charlie 22 Chicago 50000 Marketing\n", "3 David 22 Houston 58000 Finance\n", "4 Eva 29 Phoenix 60000 IT\n", "5 Louis 32 Minnesota 70000 Data Science\n", "6 Frank 28 San Francisco 62000 Digital Commerce\n" ] } ], "source": [ "# Adding a new column\n", "df['Department'] = ['HR', 'Engineering', 'Marketing', 'Finance', 'IT', 'Data Science', 'Digital Commerce'] # Adding a new 'Department' column\n", "print(\"DataFrame with new 'Department' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "5bc80e49", "metadata": {}, "source": [ "## Modifying an Existing Column\n", "\n", "To modify an existing column in a DataFrame, you can assign new values to the column." 
] }, { "cell_type": "code", "execution_count": 16, "id": "cb0a36e6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame with updated 'Salary' column:\n", " Name Age City Salary Department\n", "0 Alice 24 New York 55000 HR\n", "1 Bob 27 Los Angeles 59000 Engineering\n", "2 Charlie 22 Chicago 55000 Marketing\n", "3 David 22 Houston 63000 Finance\n", "4 Eva 29 Phoenix 65000 IT\n", "5 Louis 32 Minnesota 75000 Data Science\n", "6 Frank 28 San Francisco 67000 Digital Commerce\n" ] } ], "source": [ "# Modifying an existing column\n", "df['Salary'] = df['Salary'] + 5000 # Increasing 'Salary' by 5000\n", "print(\"DataFrame with updated 'Salary' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "c2dc49ca", "metadata": {}, "source": [ "## Deleting a Column\n", "\n", "To delete a column in a DataFrame, you can use the `drop()` method." ] }, { "cell_type": "code", "execution_count": 17, "id": "9e9e9777", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame after dropping 'Department' column:\n", " Name Age City Salary\n", "0 Alice 24 New York 55000\n", "1 Bob 27 Los Angeles 59000\n", "2 Charlie 22 Chicago 55000\n", "3 David 22 Houston 63000\n", "4 Eva 29 Phoenix 65000\n", "5 Louis 32 Minnesota 75000\n", "6 Frank 28 San Francisco 67000\n" ] } ], "source": [ "# Deleting a column\n", "df = df.drop(columns=['Department']) # Dropping the 'Department' column\n", "print(\"DataFrame after dropping 'Department' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "3e9f5cb7", "metadata": {}, "source": [ "## Handling Missing Data\n", "\n", "Handling missing data is an essential part of data cleaning and preprocessing in Pandas. When creating a DataFrame, missing values can be introduced using `None`, `numpy.nan`, or `pandas.NaT` (for datetime types)." 
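, "\n", "\n", "Before filling or dropping anything, it is usually worth counting the gaps. A minimal sketch on stand-in data: `isna()` marks missing entries and `sum()` counts them per column:\n", "\n", "```python\n", "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.DataFrame({'Age': [24, np.nan, 22], 'Salary': [50000, 54000, np.nan]})\n", "missing_counts = df.isna().sum()  # number of missing values in each column\n", "```"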
] }, { "cell_type": "markdown", "id": "0aee1590", "metadata": {}, "source": [ "## Creating DataFrame with Missing Values\n", "\n", "To create a DataFrame with missing values, you can define a dictionary or a list of lists, including `None` or `numpy.nan` to represent missing entries. Pandas will interpret these as missing data." ] }, { "cell_type": "code", "execution_count": 18, "id": "f51cf14b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame with missing values:\n", " Name Age Salary\n", "0 Alice 24.0 50000.0\n", "1 Bob NaN 54000.0\n", "2 Charlie 22.0 NaN\n", "3 David 32.0 58000.0\n", "4 Eva 29.0 60000.0\n" ] } ], "source": [ "# Creating a DataFrame with missing values\n", "import numpy as np\n", "data = { # Sample data with missing values\n", " 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],\n", " 'Age': [24, np.nan, 22, 32, 29],\n", " 'Salary': [50000, 54000, np.nan, 58000, 60000]\n", "}\n", "df_missing = pd.DataFrame(data) # Creating a DataFrame with missing values\n", "print(\"DataFrame with missing values:\\n\", df_missing, sep=\"\") # Displaying the DataFrame with missing values" ] }, { "cell_type": "markdown", "id": "eda9980e", "metadata": {}, "source": [ "## Filling Missing Values\n", "\n", "Filling missing values in a DataFrame involves replacing NaN or None values with specified values. This process, known as imputation, can be done using the `fillna()` method." 
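, "\n", "\n", "Besides filling with a constant or a mean, values can be carried forward from the previous observation. A small sketch on a stand-in Series:\n", "\n", "```python\n", "import pandas as pd\n", "import numpy as np\n", "\n", "s = pd.Series([1.0, np.nan, 3.0])\n", "filled_zero = s.fillna(0)   # replace NaN with a constant\n", "filled_ffill = s.ffill()    # propagate the previous value forward\n", "```"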
] }, { "cell_type": "code", "execution_count": 19, "id": "8ec265dc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame with filled missing values:\n", " Name Age Salary\n", "0 Alice 24.00 50000.0\n", "1 Bob 26.75 54000.0\n", "2 Charlie 22.00 55500.0\n", "3 David 32.00 58000.0\n", "4 Eva 29.00 60000.0\n" ] } ], "source": [ "# Filling missing values\n", "df_missing_filled = df_missing.fillna({'Age': df_missing['Age'].mean(), 'Salary': df_missing['Salary'].mean()}) # Filling missing values with mean\n", "print(\"DataFrame with filled missing values:\\n\", df_missing_filled, sep=\"\") # Displaying the DataFrame with filled missing values" ] }, { "cell_type": "markdown", "id": "00f9752c", "metadata": {}, "source": [ "## Dropping Rows with Missing Values\n", "\n", "To drop rows with missing values, you can use the `dropna()` method. This will remove any rows that contain NaN values." ] }, { "cell_type": "code", "execution_count": 20, "id": "ea72e062", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame after dropping rows with missing values:\n", " Name Age Salary\n", "0 Alice 24.0 50000.0\n", "3 David 32.0 58000.0\n", "4 Eva 29.0 60000.0\n" ] } ], "source": [ "# Dropping rows with missing values\n", "df_missing_dropped = df_missing.dropna() # Dropping rows with missing values\n", "print(\"DataFrame after dropping rows with missing values:\\n\", df_missing_dropped, sep=\"\") # Displaying the DataFrame after dropping rows with missing values" ] }, { "cell_type": "markdown", "id": "51b0481d", "metadata": {}, "source": [ "## Grouping and Aggregating Data\n", "\n", "Grouping and aggregating data in Pandas involves combining data into groups based on specific criteria and then performing aggregate functions on these groups. The `groupby()` method is used to split the data into groups, and aggregation functions like `sum()`, `mean()`, `count()`, and others are applied to these groups." 
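, "\n", "\n", "Several aggregations can also be computed at once with `agg()`. A minimal sketch on stand-in data:\n", "\n", "```python\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({'Dept': ['HR', 'HR', 'IT'], 'Salary': [50000, 54000, 60000]})\n", "stats = df.groupby('Dept')['Salary'].agg(['mean', 'count'])  # mean salary and group size per department\n", "```"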
] }, { "cell_type": "code", "execution_count": 21, "id": "3a17d800", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average Salary by Age:\n", " Age Average Salary\n", "0 22 59000.0\n", "1 24 55000.0\n", "2 27 59000.0\n", "3 28 67000.0\n", "4 29 65000.0\n", "5 32 75000.0\n" ] } ], "source": [ "# Grouping data by 'Age' and calculating the average 'Salary'\n", "grouped_df = df.groupby('Age')['Salary'].mean().reset_index() # Grouping by 'Age' and calculating mean 'Salary'\n", "grouped_df.columns = ['Age', 'Average Salary'] # Setting the column name for the second column after grouping data by 'Age'\n", "print(\"Average Salary by Age:\\n\", grouped_df, sep=\"\") # Displaying the average Salary by Age" ] }, { "cell_type": "markdown", "id": "06db0961", "metadata": {}, "source": [ "## Saving DataFrame to a CSV File\n", "\n", "Saving a DataFrame to a CSV file in Pandas is a straightforward process that involves using the `to_csv()` method. This method allows you to export the contents of a DataFrame into a CSV (Comma-Separated Values) file." ] }, { "cell_type": "code", "execution_count": 22, "id": "4279d51b", "metadata": {}, "outputs": [], "source": [ "# Saving DataFrame to a CSV file\n", "df.to_csv('data.csv', index=False) # Saving the DataFrame to a CSV file" ] }, { "cell_type": "markdown", "id": "51f66737", "metadata": {}, "source": [ "## Loading DataFrame from a CSV File\n", "\n", "Loading a DataFrame from a CSV file in Pandas involves using the `read_csv()` function. This function reads a CSV file into a DataFrame, allowing you to work with the data in a structured, tabular format." 
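, "\n", "\n", "`read_csv()` also accepts parameters for reading only part of a file, such as `usecols` and `nrows`. A sketch that writes a small file first so it is self-contained (`demo.csv` is a throwaway name):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27], 'City': ['NY', 'LA']}).to_csv('demo.csv', index=False)\n", "subset = pd.read_csv('demo.csv', usecols=['Name', 'Age'], nrows=1)  # selected columns, first row only\n", "```"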
] }, { "cell_type": "code", "execution_count": 23, "id": "dc40764b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded DataFrame from CSV file:\n", " Name Age City Salary\n", "0 Alice 24 New York 55000\n", "1 Bob 27 Los Angeles 59000\n", "2 Charlie 22 Chicago 55000\n", "3 David 22 Houston 63000\n", "4 Eva 29 Phoenix 65000\n", "5 Louis 32 Minnesota 75000\n", "6 Frank 28 San Francisco 67000\n" ] } ], "source": [ "# Loading DataFrame from a CSV file\n", "loaded_df = pd.read_csv('data.csv') # Loading the DataFrame from a CSV file\n", "print(\"Loaded DataFrame from CSV file:\\n\", loaded_df, sep=\"\") # Displaying the loaded DataFrame" ] }, { "cell_type": "markdown", "id": "cf69cd89", "metadata": {}, "source": [ "## Practice\n", "\n", "The code block below reads sales data from a URL and stores the data in a DataFrame. The first row in the dataset defines the attribute names. Your task is to find and display useful information (aggregated data) from this data using the Pandas library."
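, "\n", "\n", "As a starting point, summary views and a grouped total are typical first steps. A sketch on stand-in data (the real column names depend on the downloaded file; `Region` and `Amount` here are illustrative only):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "sales_data = pd.DataFrame({'Region': ['North', 'South', 'North'], 'Amount': [100, 200, 300]})  # stand-in for the real file\n", "print(sales_data.describe())                           # numeric summaries\n", "totals = sales_data.groupby('Region')['Amount'].sum()  # total Amount per Region\n", "```"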
] }, { "cell_type": "code", "execution_count": 24, "id": "eee530d1", "metadata": {}, "outputs": [ { "ename": "HTTPError", "evalue": "HTTP Error 400: Bad Request", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mHTTPError\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[24]\u001b[39m\u001b[32m, line 8\u001b[39m\n\u001b[32m 5\u001b[39m url = \u001b[33m\"\u001b[39m\u001b[33mhttps://piyabute.s3.ap-southeast-1.amazonaws.com/notebook/sales_data_1000.csv\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 7\u001b[39m \u001b[38;5;66;03m# Step 3: Read the CSV file from the URL into a pandas DataFrame\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m8\u001b[39m sales_data = \u001b[43mpd\u001b[49m\u001b[43m.\u001b[49m\u001b[43mread_csv\u001b[49m\u001b[43m(\u001b[49m\u001b[43murl\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 10\u001b[39m \u001b[38;5;66;03m# Step 4: Print the first 5 rows to verify the data\u001b[39;00m\n\u001b[32m 11\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mFirst 5 rows of data:\u001b[39m\u001b[33m\"\u001b[39m)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/jupyter/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1026\u001b[39m, in \u001b[36mread_csv\u001b[39m\u001b[34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, 
dtype_backend)\u001b[39m\n\u001b[32m 1013\u001b[39m kwds_defaults = _refine_defaults_read(\n\u001b[32m 1014\u001b[39m dialect,\n\u001b[32m 1015\u001b[39m delimiter,\n\u001b[32m (...)\u001b[39m\u001b[32m 1022\u001b[39m dtype_backend=dtype_backend,\n\u001b[32m 1023\u001b[39m )\n\u001b[32m 1024\u001b[39m kwds.update(kwds_defaults)\n\u001b[32m-> \u001b[39m\u001b[32m1026\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_read\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/jupyter/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:620\u001b[39m, in \u001b[36m_read\u001b[39m\u001b[34m(filepath_or_buffer, kwds)\u001b[39m\n\u001b[32m 617\u001b[39m _validate_names(kwds.get(\u001b[33m\"\u001b[39m\u001b[33mnames\u001b[39m\u001b[33m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m))\n\u001b[32m 619\u001b[39m \u001b[38;5;66;03m# Create the parser.\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m620\u001b[39m parser = \u001b[43mTextFileReader\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 622\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m chunksize \u001b[38;5;129;01mor\u001b[39;00m iterator:\n\u001b[32m 623\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n", "\u001b[36mFile \u001b[39m\u001b[32m~/jupyter/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1620\u001b[39m, in \u001b[36mTextFileReader.__init__\u001b[39m\u001b[34m(self, f, engine, **kwds)\u001b[39m\n\u001b[32m 1617\u001b[39m \u001b[38;5;28mself\u001b[39m.options[\u001b[33m\"\u001b[39m\u001b[33mhas_index_names\u001b[39m\u001b[33m\"\u001b[39m] = kwds[\u001b[33m\"\u001b[39m\u001b[33mhas_index_names\u001b[39m\u001b[33m\"\u001b[39m]\n\u001b[32m 1619\u001b[39m 
\u001b[38;5;28mself\u001b[39m.handles: IOHandles | \u001b[38;5;28;01mNone\u001b[39;00m = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1620\u001b[39m \u001b[38;5;28mself\u001b[39m._engine = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_make_engine\u001b[49m\u001b[43m(\u001b[49m\u001b[43mf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mengine\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/jupyter/venv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1880\u001b[39m, in \u001b[36mTextFileReader._make_engine\u001b[39m\u001b[34m(self, f, engine)\u001b[39m\n\u001b[32m 1878\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[33m\"\u001b[39m\u001b[33mb\u001b[39m\u001b[33m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m mode:\n\u001b[32m 1879\u001b[39m mode += \u001b[33m\"\u001b[39m\u001b[33mb\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m-> \u001b[39m\u001b[32m1880\u001b[39m \u001b[38;5;28mself\u001b[39m.handles = \u001b[43mget_handle\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1881\u001b[39m \u001b[43m \u001b[49m\u001b[43mf\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1882\u001b[39m \u001b[43m \u001b[49m\u001b[43mmode\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1883\u001b[39m \u001b[43m \u001b[49m\u001b[43mencoding\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43moptions\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mencoding\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1884\u001b[39m \u001b[43m 
...(remaining traceback frames truncated)...\n", "\u001b[31mHTTPError\u001b[39m: HTTP Error 400: Bad Request" ] } ], "source": [ "# Step 1: Import the pandas library\n", "import pandas as pd\n", "\n", "# Step 2: Specify the URL of the CSV file to be read\n", "url = \"https://piyabute.s3.ap-southeast-1.amazonaws.com/notebook/sales_data_1000.csv\"\n", "\n", "# Step 3: Read the CSV file from the URL into a pandas DataFrame\n", "sales_data = pd.read_csv(url)\n", "\n", "# Step 4: Print the first 5 rows to verify the data\n", "print(\"First 5 rows of data:\")\n", "print(sales_data.head(5))" ] }, { "cell_type": "code", "execution_count": null, "id": "2daee743", "metadata": {}, "outputs": [], "source": [ "# Question 1: Find the number of units sold for each product\n", "units_sold_per_product = sales_data.groupby('Product Name')['Quantity 
Sold'].sum().reset_index()\n", "units_sold_per_product" ] }, { "cell_type": "code", "execution_count": null, "id": "52e8b9b4-14c7-4992-b895-97ec62d43dbc", "metadata": {}, "outputs": [], "source": [ "# Question 2: Find the total revenue for each province\n", "total_revenue_per_province = sales_data.groupby('Province')['Total Sales'].sum().reset_index()\n", "total_revenue_per_province" ] }, { "cell_type": "code", "execution_count": null, "id": "173c5d4e-1e21-43b9-96b4-41802ab4e5f0", "metadata": {}, "outputs": [], "source": [ "# Question 3: Find the number of transactions for each payment method\n", "transactions_per_payment_method = sales_data.groupby('Payment Method')['Payment Method'].count()\n", "transactions_per_payment_method" ] }, { "cell_type": "code", "execution_count": null, "id": "51807441-26f0-445e-a0b0-3f0867492af6", "metadata": {}, "outputs": [], "source": [ "# Question 4: Find the average unit price for each product\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fb7a6f53-c26e-44aa-8690-2a79175dd353", "metadata": {}, "outputs": [], "source": [ "# Question 5: Find the top 5 customers by total sales\n" ] }, { "cell_type": "code", "execution_count": null, "id": "96e421ac-99d0-4d6c-93fc-ba36706fc3e0", "metadata": {}, "outputs": [], "source": [ "# Question 6: Find the total discount given for each product\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3fe7ead5-e2d6-4fbe-b941-3cd01f75a59d", "metadata": {}, "outputs": [], "source": [ "# Question 7: Find the total tax collected for each province\n" ] }, { "cell_type": "code", "execution_count": null, "id": "afea5dd6-0b10-4b8e-ae30-bc82fe464bd4", "metadata": {}, "outputs": [], "source": [ "# Question 8: Find the average net sales for each channel (Online vs In-store)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4dbdc892-085c-4dc6-8e08-06724fdf4b0a", "metadata": {}, "outputs": [], "source": [ "# Question 9: Find the total quantity sold by each salesperson\n" ] }, { 
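"cell_type": "markdown", "id": "c7e1a9f2-hint-md", "metadata": {}, "source": [ "**Hint: the group-then-rank pattern.** Several of the questions above ask for a \"top N\" result. A common recipe is to group, aggregate, and then rank with `nlargest`. The sketch below reuses only columns that already appeared in this notebook (`Province` and `Total Sales`); adapt the grouping column and the aggregation to each question." ] }, { "cell_type": "code", "execution_count": null, "id": "c7e1a9f2-hint-code", "metadata": {}, "outputs": [], "source": [ "# A sketch of the group-then-rank pattern, reusing columns from earlier cells\n", "top_provinces = (\n", "    sales_data.groupby('Province')['Total Sales']  # group rows by province\n", "    .sum()                                         # total sales per province\n", "    .nlargest(3)                                   # keep the 3 largest totals\n", ")\n", "top_provinces" ] }, { 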
"cell_type": "code", "execution_count": null, "id": "0b56459d-3c60-4e62-9641-5cde75273a72", "metadata": {}, "outputs": [], "source": [ "# Question 10: Find the top 3 months with the highest total sales\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }