{ "cells": [ { "cell_type": "markdown", "id": "5d6ba50c", "metadata": {}, "source": [ "# Pandas 101" ] }, { "cell_type": "code", "execution_count": null, "id": "804c4139", "metadata": {}, "outputs": [], "source": [ "# Assoc. Prof. Dr. Piyabute Fuangkhon\n", "# Department of Digital Business Management\n", "# Martin de Tours School of Management and Economics\n", "# Assumption University\n", "# Update: 22/12/2024" ] }, { "cell_type": "markdown", "id": "e8723b64", "metadata": {}, "source": [ "# Introduction to Pandas for Business Students\n", "\n", "This notebook is designed to introduce you to the basics of Pandas, a powerful library for data manipulation and analysis. By the end of this tutorial, you should be comfortable with the fundamental operations and methods in Pandas that are essential for data analysis tasks." ] }, { "cell_type": "markdown", "id": "1fe67c1c", "metadata": {}, "source": [ "## Importing the Pandas Library\n", "\n", "First, we need to import the Pandas library. Pandas is usually imported under the alias `pd`." ] }, { "cell_type": "code", "execution_count": null, "id": "d0d85e8d", "metadata": {}, "outputs": [], "source": [ "# Importing Pandas Library\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "be106f7e", "metadata": {}, "source": [ "## Basic Data Structures: Series\n", "\n", "A Series is a one-dimensional, labeled array capable of holding data of any type (integers, strings, floating-point numbers, etc.). It is akin to a single column in a spreadsheet or a database table. Each element in a Series has an associated label, known as its index." ] }, { "cell_type": "code", "execution_count": null, "id": "f23d26e2", "metadata": {}, "outputs": [], "source": [ "# Creating a Series\n", "data = [1, 2, 3, 4, 5] # Sample data\n", "series = pd.Series(data) # Creating a Series\n", "print(\"Series:\\n\", series, sep=\"\") # Printing the Series" ] }, { "cell_type": "markdown", "id": "0b93506d", "metadata": {}, "source": [ "## Basic Data Structures: DataFrame\n", "\n", "A DataFrame in Pandas is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a table in a database or a data frame in Excel." ] }, { "cell_type": "code", "execution_count": null, "id": "463ab1ef", "metadata": {}, "outputs": [], "source": [ "# Creating a DataFrame\n", "data = { # Sample data\n", " 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],\n", " 'Age': [24, 27, 22, 22, 29],\n", " 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],\n", " 'Salary': [50000, 54000, 50000, 58000, 60000]\n", "}\n", "df = pd.DataFrame(data) # Creating a DataFrame\n", "print(\"DataFrame:\\n\", df, sep=\"\") # Printing the DataFrame" ] }, { "cell_type": "markdown", "id": "66bd0aea", "metadata": {}, "source": [ "## Viewing the First Few Rows\n", "\n", "The `head()` method in Pandas is used to view the first few rows of a DataFrame. By default, it returns the first five rows, but you can specify a different number of rows to display by passing an integer argument to the method." ] }, { "cell_type": "code", "execution_count": null, "id": "4ac9e393", "metadata": {}, "outputs": [], "source": [ "# Viewing the first few rows of the DataFrame\n", "print(\"First few rows:\\n\", df.head(), sep=\"\") # Displaying the first few rows\n", "\n", "# Viewing the first row of the DataFrame\n", "print(\"First row:\\n\", df.head(1), sep=\"\") # Displaying the first row" ] }, { "cell_type": "markdown", "id": "d604ee12", "metadata": {}, "source": [ "## Viewing the Last Few Rows\n", "\n", "The `tail()` method in Pandas is used to view the last few rows of a DataFrame. By default, it returns the last five rows, but you can specify a different number of rows to display by passing an integer argument to the method." ] }, { "cell_type": "code", "execution_count": null, "id": "b0ba4cb8", "metadata": {}, "outputs": [], "source": [ "# Viewing the last few rows of the DataFrame\n", "print(\"Last few rows:\\n\", df.tail(), sep=\"\") # Displaying the last few rows\n", "\n", "# Viewing the last row of the DataFrame\n", "print(\"Last row:\\n\", df.tail(1), sep=\"\") # Displaying the last row" ] }, { "cell_type": "markdown", "id": "4a152b58", "metadata": {}, "source": [ "## Summary Statistics\n", "\n", "The `describe()` method in Pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values." ] }, { "cell_type": "code", "execution_count": null, "id": "44876876", "metadata": {}, "outputs": [], "source": [ "# Getting summary statistics\n", "print(\"Summary statistics:\\n\", df.describe(), sep=\"\") # Displaying summary statistics" ] }, { "cell_type": "markdown", "id": "5045872a", "metadata": {}, "source": [ "## Selecting a Single Column\n", "\n", "To select a single column from a DataFrame, you can use the column name as a string inside the brackets. This returns a Pandas Series representing the column." ] }, { "cell_type": "code", "execution_count": null, "id": "849db8a0", "metadata": {}, "outputs": [], "source": [ "# Selecting a single column\n", "print(\"Selecting the 'Name' column:\\n\", df['Name'], sep=\"\") # Selecting the 'Name' column" ] }, { "cell_type": "markdown", "id": "b70f7c8a", "metadata": {}, "source": [ "## Selecting Multiple Columns\n", "\n", "To select multiple columns from a DataFrame, you can pass a list of column names inside the brackets. This returns a new DataFrame containing only the specified columns." ] }, { "cell_type": "code", "execution_count": null, "id": "dc2b00e5", "metadata": {}, "outputs": [], "source": [ "# Selecting multiple columns\n", "print(\"Selecting 'Name' and 'Salary' columns:\\n\", df[['Name', 'Salary']], sep=\"\") # Selecting 'Name' and 'Salary' columns" ] }, { "cell_type": "markdown", "id": "9f20a22d", "metadata": {}, "source": [ "## Filtering Rows Based on a Condition\n", "\n", "To filter rows in a DataFrame based on a condition, you can use boolean indexing within the brackets. This involves specifying a condition that returns a boolean Series, which is then used to index the DataFrame and return only the rows where the condition is True." ] }, { "cell_type": "code", "execution_count": null, "id": "33cc3787", "metadata": {}, "outputs": [], "source": [ "# Filtering rows based on a condition\n", "print(\"Filtering rows where Age > 25:\\n\", df[df['Age'] > 25], sep=\"\") # Filtering rows where Age > 25" ] }, { "cell_type": "markdown", "id": "93e8a7f4-3750-45e4-ab73-1f923d97dcc0", "metadata": {}, "source": [ "## Sorting Rows\n", "\n", "Sorting rows in a DataFrame can be done using the sort_values() method. This method allows you to sort the data based on one or more columns, in either ascending or descending order. By specifying the column names and the desired order, you can easily organize your data to facilitate analysis. This is useful for tasks such as ranking, ordering, and prioritizing data within a DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "15a0cd62-8ef6-4613-aafc-d3f21c63fff8", "metadata": {}, "outputs": [], "source": [ "# Sorting by a single column\n", "sorted_by_age = df.sort_values(by='Age') # Sorting the DataFrame by the 'Age' column in ascending order\n", "print(\"DataFrame sorted by Age:\\n\", sorted_by_age, sep=\"\") # Displaying the DataFrame sorted by 'Age'" ] }, { "cell_type": "code", "execution_count": null, "id": "3006a91b-2ef1-4909-b60c-bd82875c497d", "metadata": {}, "outputs": [], "source": [ "# Sorting by multiple columns\n", "sorted_by_age_salary = df.sort_values(by=['Age', 'Salary'], ascending=[True, False]) # Sorting the DataFrame by the 'Age' and 'Salary' columns in ascending order\n", "print(\"DataFrame sorted by Age (ascending) and Salary (descending):\\n\", sorted_by_age_salary, sep=\"\") # Displaying the DataFrame sorted by 'Age' and 'Salary'" ] }, { "cell_type": "markdown", "id": "410150ca-4bc8-45af-8583-feb2eee65989", "metadata": {}, "source": [ "## Adding a New Row\n", "\n", "To add a new row to a DataFrame, you concatenate a Series or list of values to a new row within the DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "be28b5b2-7cf2-4af0-92f5-32a50dbe9260", "metadata": {}, "outputs": [], "source": [ "# Adding a new row using 'loc'\n", "\n", "# New data to be added\n", "new_data = {'Name': 'Louis', 'Age': 32, 'City': 'Minnesota', 'Salary': 70000} # New data to be added\n", "\n", "# Adding the new data using loc\n", "df.loc[len(df)] = new_data # Adding the new data to the DataFrame using loc\n", "print(\"DataFrame after adding new data (using loc):\\n\", df, sep=\"\") # Printing the DataFrame after adding new data" ] }, { "cell_type": "code", "execution_count": null, "id": "8c72dd72-fedd-4777-918a-da99876d3a10", "metadata": {}, "outputs": [], "source": [ "# Adding a new row using 'concat'\n", "\n", "# New data to be added\n", "new_data = pd.DataFrame([{'Name': 'Frank', 'Age': 28, 'City': 'San Francisco', 'Salary': 62000}]) # New data to be added\n", "\n", "# Adding the new data to the DataFrame using concat\n", "df = pd.concat([df, new_data], ignore_index=True) # Concatenating the new data to the DataFrame and resetting the index\n", "print(\"DataFrame after adding new data (using concat):\\n\", df, sep=\"\") # Printing the DataFrame after adding new data" ] }, { "cell_type": "markdown", "id": "e3d74527", "metadata": {}, "source": [ "## Adding a New Column\n", "\n", "To add a new column to a DataFrame, you assign a Series or list of values to a new column name within the DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "66789203", "metadata": {}, "outputs": [], "source": [ "# Adding a new column\n", "df['Department'] = ['HR', 'Engineering', 'Marketing', 'Finance', 'IT', 'Data Science', 'Digital Commerce'] # Adding a new 'Department' column\n", "print(\"DataFrame with new 'Department' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "5bc80e49", "metadata": {}, "source": [ "## Modifying an Existing Column\n", "\n", "To modify an existing column in a DataFrame, you can assign new values to the column." ] }, { "cell_type": "code", "execution_count": null, "id": "cb0a36e6", "metadata": {}, "outputs": [], "source": [ "# Modifying an existing column\n", "df['Salary'] = df['Salary'] + 5000 # Increasing 'Salary' by 5000\n", "print(\"DataFrame with updated 'Salary' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "c2dc49ca", "metadata": {}, "source": [ "## Deleting a Column\n", "\n", "To delete a column in a DataFrame, you can use the `drop()` method." ] }, { "cell_type": "code", "execution_count": null, "id": "9e9e9777", "metadata": {}, "outputs": [], "source": [ "# Deleting a column\n", "df = df.drop(columns=['Department']) # Dropping the 'Department' column\n", "print(\"DataFrame after dropping 'Department' column:\\n\", df, sep=\"\") # Displaying the updated DataFrame" ] }, { "cell_type": "markdown", "id": "3e9f5cb7", "metadata": {}, "source": [ "## Handling Missing Data\n", "\n", "Handling missing data is an essential part of data cleaning and preprocessing in Pandas. When creating a DataFrame, missing values can be introduced using `None`, `numpy.nan`, or `pandas.NaT` (for datetime types)." ] }, { "cell_type": "markdown", "id": "0aee1590", "metadata": {}, "source": [ "## Creating DataFrame with Missing Values\n", "\n", "To create a DataFrame with missing values, you can define a dictionary or a list of lists, including `None` or `numpy.nan` to represent missing entries. Pandas will interpret these as missing data." ] }, { "cell_type": "code", "execution_count": null, "id": "f51cf14b", "metadata": {}, "outputs": [], "source": [ "# Creating a DataFrame with missing values\n", "import numpy as np\n", "data = { # Sample data with missing values\n", " 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],\n", " 'Age': [24, np.nan, 22, 32, 29],\n", " 'Salary': [50000, 54000, np.nan, 58000, 60000]\n", "}\n", "df_missing = pd.DataFrame(data) # Creating a DataFrame with missing values\n", "print(\"DataFrame with missing values:\\n\", df_missing, sep=\"\") # Displaying the DataFrame with missing values" ] }, { "cell_type": "markdown", "id": "eda9980e", "metadata": {}, "source": [ "## Filling Missing Values\n", "\n", "Filling missing values in a DataFrame involves replacing NaN or None values with specified values. This process, known as imputation, can be done using the `fillna()` method." ] }, { "cell_type": "code", "execution_count": null, "id": "8ec265dc", "metadata": {}, "outputs": [], "source": [ "# Filling missing values\n", "df_missing_filled = df_missing.fillna({'Age': df_missing['Age'].mean(), 'Salary': df_missing['Salary'].mean()}) # Filling missing values with mean\n", "print(\"DataFrame with filled missing values:\\n\", df_missing_filled, sep=\"\") # Displaying the DataFrame with filled missing values" ] }, { "cell_type": "markdown", "id": "00f9752c", "metadata": {}, "source": [ "## Dropping Rows with Missing Values\n", "\n", "To drop rows with missing values, you can use the `dropna()` method. This will remove any rows that contain NaN values." ] }, { "cell_type": "code", "execution_count": null, "id": "ea72e062", "metadata": {}, "outputs": [], "source": [ "# Dropping rows with missing values\n", "df_missing_dropped = df_missing.dropna() # Dropping rows with missing values\n", "print(\"DataFrame after dropping rows with missing values:\\n\", df_missing_dropped, sep=\"\") # Displaying the DataFrame after dropping rows with missing values" ] }, { "cell_type": "markdown", "id": "51b0481d", "metadata": {}, "source": [ "## Grouping and Aggregating Data\n", "\n", "Grouping and aggregating data in Pandas involves combining data into groups based on specific criteria and then performing aggregate functions on these groups. The `groupby()` method is used to split the data into groups, and aggregation functions like `sum()`, `mean()`, `count()`, and others are applied to these groups." ] }, { "cell_type": "code", "execution_count": null, "id": "3a17d800", "metadata": {}, "outputs": [], "source": [ "# Grouping data by 'Age' and calculating the average 'Salary'\n", "grouped_df = df.groupby('Age')['Salary'].mean().reset_index() # Grouping by 'Age' and calculating mean 'Salary'\n", "grouped_df.columns = ['Age', 'Average Salary'] # Setting the column name for the second column after grouping data by 'Age'\n", "print(\"Average Salary by Age:\\n\", grouped_df, sep=\"\") # Displaying the average Salary by Age" ] }, { "cell_type": "markdown", "id": "06db0961", "metadata": {}, "source": [ "## Saving DataFrame to a CSV File\n", "\n", "Saving a DataFrame to a CSV file in Pandas is a straightforward process that involves using the `to_csv()` method. This method allows you to export the contents of a DataFrame into a CSV (Comma-Separated Values) file." ] }, { "cell_type": "code", "execution_count": null, "id": "4279d51b", "metadata": {}, "outputs": [], "source": [ "# Saving DataFrame to a CSV file\n", "df.to_csv('data.csv', index=False) # Saving the DataFrame to a CSV file" ] }, { "cell_type": "markdown", "id": "51f66737", "metadata": {}, "source": [ "## Loading DataFrame from a CSV File\n", "\n", "Loading a DataFrame from a CSV file in Pandas involves using the `read_csv()` function. This function reads a CSV file into a DataFrame, allowing you to work with the data in a structured, tabular format." ] }, { "cell_type": "code", "execution_count": null, "id": "dc40764b", "metadata": {}, "outputs": [], "source": [ "# Loading DataFrame from a CSV file\n", "loaded_df = pd.read_csv('data.csv') # Loading the DataFrame from a CSV file\n", "print(\"Loaded DataFrame from CSV file:\\n\", loaded_df, sep=\"\") # Displaying the loaded DataFrame" ] }, { "cell_type": "markdown", "id": "cf69cd89", "metadata": {}, "source": [ "## Practice\n", "\n", "The code block below reads sales data from a URL and stores the data in an array. The first row in the dataset defines the attribute names. Your task is to find and display useful information (aggregated data) from this data using Pandas library." ] }, { "cell_type": "code", "execution_count": null, "id": "eee530d1", "metadata": {}, "outputs": [], "source": [ "# Step 1: Import the 'urllib' library\n", "import pandas as pd\n", "\n", "# Step 2: Specify the URL of the file to be opened\n", "url = \"https://piyabute.s3.ap-southeast-1.amazonaws.com/notebook/sales_data_1000.csv\"\n", "\n", "# Step 3: Read the CSV file from the URL into a pandas DataFrame\n", "sales_data = pd.read_csv(url)\n", "\n", "# Step 4: Print the first 5 rows to verify the data\n", "print(\"First 5 rows of data:\")\n", "print(sales_data.head(5))" ] }, { "cell_type": "code", "execution_count": null, "id": "2daee743", "metadata": {}, "outputs": [], "source": [ "# Question 1: Find the number of units sold for each product\n", "units_sold_per_product = sales_data.groupby('Product Name')['Quantity Sold'].sum().reset_index()\n", "units_sold_per_product" ] }, { "cell_type": "code", "execution_count": null, "id": "52e8b9b4-14c7-4992-b895-97ec62d43dbc", "metadata": {}, "outputs": [], "source": [ "# Question 2: Find the total revenue from each province\n", "total_revenue_per_province = sales_data.groupby('Province')['Total Sales'].sum().reset_index()\n", "total_revenue_per_province" ] }, { "cell_type": "code", "execution_count": null, "id": "173c5d4e-1e21-43b9-96b4-41802ab4e5f0", "metadata": {}, "outputs": [], "source": [ "# Question 3: Find the number of transaction from each payment type \n", "transactions_per_payment_method = sales_data.groupby('Payment Method')['Payment Method'].count()\n", "transactions_per_payment_method" ] }, { "cell_type": "code", "execution_count": null, "id": "51807441-26f0-445e-a0b0-3f0867492af6", "metadata": {}, "outputs": [], "source": [ "# Question 4: Find the average unit price for each product\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fb7a6f53-c26e-44aa-8690-2a79175dd353", "metadata": {}, "outputs": [], "source": [ "# Question 5: Find the top 5 customers by total sales\n" ] }, { "cell_type": "code", "execution_count": null, "id": "96e421ac-99d0-4d6c-93fc-ba36706fc3e0", "metadata": {}, "outputs": [], "source": [ "# Question 6: Find the total discount given for each product\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3fe7ead5-e2d6-4fbe-b941-3cd01f75a59d", "metadata": {}, "outputs": [], "source": [ "# Question 7: Find the total tax collected for each province\n" ] }, { "cell_type": "code", "execution_count": null, "id": "afea5dd6-0b10-4b8e-ae30-bc82fe464bd4", "metadata": {}, "outputs": [], "source": [ "# Question 8: Find the average net sales for each channel (Online vs In-store)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4dbdc892-085c-4dc6-8e08-06724fdf4b0a", "metadata": {}, "outputs": [], "source": [ "# Question 9: Find the total quantity sold by each salesperson\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0b56459d-3c60-4e62-9641-5cde75273a72", "metadata": {}, "outputs": [], "source": [ "# Question 10: Find the top 3 months with the highest total sales\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }