{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python 101 - A Data Mining Perspective ⛏️\n", "\n", "This notebook is a foundational guide to Python programming, tailored for a **Data Mining** course. We'll explore fundamental concepts with a constant focus on how they apply to handling, processing, and analyzing data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. The Basics: Hello World! 👋\n", "\n", "The \"Hello, World!\" program is a classic starting point. It's not just a tradition; it's the first step to confirming your environment is correctly set up. It introduces the fundamental `print()` function, our primary tool for outputting information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# In Python, the '#' symbol indicates a comment.\n", "# Comments are ignored by the interpreter and are used to explain code.\n", "\n", "# The `print()` function outputs text to the console.\n", "# Strings, which are sequences of characters, must be enclosed in single ('') or double (\"\") quotes.\n", "print('Hello, World!')\n", "\n", "# In data mining work, `print()` is useful for inspecting variables, debugging code, and displaying results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data Types 📚\n", "\n", "Understanding data types is the first step toward effective data manipulation. In data mining, we work with different kinds of data, and Python provides specialized data structures to handle them efficiently.\n", "\n", "### 2.1. Basic Arithmetic 🧮\n", "\n", "Data mining often starts with numerical data. Basic arithmetic operations are essential for calculations, data transformations, and feature engineering." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We assign numerical values to variables.\n", "# Variables act as named containers for data.\n", "price = 100\n", "quantity = 9\n", "\n", "# Let's perform some basic arithmetic operations.\n", "# These operators are fundamental for any quantitative analysis.\n", "addition = price + quantity\n", "subtraction = price - quantity\n", "multiplication = price * quantity\n", "division = price / quantity # Standard division returns a float.\n", "integer_division = price // quantity # Floor division returns an integer.\n", "remainder = price % quantity # Modulus returns the remainder.\n", "exponentiation = price ** quantity # Exponentiation (price to the power of quantity).\n", "\n", "# Using f-strings (formatted string literals) for clear output.\n", "# F-strings are a modern and highly readable way to embed expressions inside strings.\n", "print(f\"Variables: price={price}, quantity={quantity}\")\n", "print(\"---\")\n", "print(f\"Addition (price + quantity) = {addition}\")\n", "print(f\"Subtraction (price - quantity) = {subtraction}\")\n", "print(f\"Multiplication (price * quantity) = {multiplication}\")\n", "print(f\"Division (price / quantity) = {division}\")\n", "print(f\"Integer Division (price // quantity) = {integer_division}\")\n", "print(f\"Remainder (price % quantity) = {remainder}\")\n", "print(f\"Exponentiation (price ** quantity) = {exponentiation}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. String Operations 📝\n", "\n", "Text data is a primary source in many data mining applications (e.g., sentiment analysis, text classification). Python provides a rich set of built-in string methods for processing this data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a string variable.\n", "full_text = 'Data Science is a field of study focused on extracting knowledge and insights from data.'\n", "\n", "# String methods are crucial for preprocessing text data.\n", "# 1. Normalization: Converting text to a consistent case.\n", "print(f\"Original text: {full_text}\")\n", "print(f\"Lowercase: {full_text.lower()}\")\n", "print(f\"Uppercase: {full_text.upper()}\")\n", "\n", "# 2. Tokenization: Breaking text into individual words or \"tokens\".\n", "# The `split()` method, by default, splits a string by whitespace.\n", "tokens = full_text.split()\n", "print(f\"Tokens (a list of words): {tokens}\")\n", "\n", "# 3. Cleaning: Removing unwanted characters or spaces.\n", "# The `strip()` method removes leading and trailing whitespace.\n", "dirty_string = ' this text needs cleaning '\n", "cleaned_string = dirty_string.strip()\n", "print(f\"\\nOriginal (dirty) string: '{dirty_string}'\")\n", "print(f\"Cleaned string: '{cleaned_string}'\")\n", "\n", "# 4. Searching and Counting: Finding patterns in text.\n", "# `find()` returns the starting index of the first occurrence.\n", "print(f\"The word 'Science' is found at index: {full_text.find('Science')}\")\n", "# `count()` returns the number of occurrences.\n", "print(f\"The word 'data' appears {full_text.lower().count('data')} times.\") # using .lower() for case-insensitive count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Lists: The Primary Data Container 📦\n", "\n", "Lists are the most flexible data structure in Python. In data mining, they are commonly used to store datasets, feature vectors, or a collection of results. Their mutability allows for dynamic data collection and modification." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a list of numerical data points.\n", "data_points = [101, 102, 103, 104, 105, 105]\n", "\n", "print(f\"Original list: {data_points}\")\n", "print(f\"Data type: {type(data_points)}\")\n", "\n", "# Accessing elements: Indexing starts at 0.\n", "print(f\"\\nFirst element: {data_points[0]}\")\n", "# Negative indexing accesses elements from the end.\n", "print(f\"Last element: {data_points[-1]}\")\n", "\n", "# Slicing: Extracting a subset of a list is vital for sampling and splitting data.\n", "# `[start:stop]` creates a new list from `start` up to (but not including) `stop`.\n", "subset = data_points[1:4]\n", "print(f\"Subset from index 1 to 3: {subset}\")\n", "\n", "# Modifying a list: Lists are mutable.\n", "data_points.append(106) # Adds an element to the end.\n", "data_points[2] = 999 # Replaces an element at a specific index.\n", "print(f\"Modified list: {data_points}\")\n", "\n", "# Useful methods for data analysis.\n", "print(f\"\\nNumber of elements: {len(data_points)}\")\n", "print(f\"Count of '105': {data_points.count(105)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4. Tuples: For Fixed Records 🔖\n", "\n", "Tuples are similar to lists but are immutable. This makes them ideal for representing data records where the values should not change, such as a coordinate pair or a database row. Their immutability also makes them slightly more memory-efficient and faster to process." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A tuple representing a fixed data record (e.g., an RGB color value).\n", "color_record = ('red', 255, 0, 0)\n", "\n", "print(f\"Original tuple: {color_record}\")\n", "print(f\"Data type: {type(color_record)}\")\n", "\n", "# Accessing elements is the same as with lists.\n", "print(f\"\\nColor name: {color_record[0]}\")\n", "\n", "# Since tuples are immutable, you cannot change their elements.\n", "# The following line would cause a TypeError:\n", "# color_record[1] = 120\n", "\n", "# To modify a tuple, you must create a new one, often by converting it to a list first.\n", "temp_list = list(color_record)\n", "temp_list[1] = 120\n", "modified_tuple = tuple(temp_list)\n", "\n", "print(f\"Modified tuple (created from a list): {modified_tuple}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5. Dictionaries: For Structured Data 🗃️\n", "\n", "Dictionaries are key-value pairs, providing a powerful way to store structured data where each piece of information has a name. In data mining, a dictionary can represent a single observation or a row in a dataset, with keys being the feature names (e.g., 'age', 'gender') and values being the corresponding data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A dictionary representing a student record.\n", "student_record = {\n", " 'sid': 6010001,\n", " 'fname': 'Piyabute',\n", " 'lname': 'Fuangkhon',\n", " 'active': True\n", "}\n", "\n", "print(f\"Original dictionary: {student_record}\")\n", "print(f\"Data type: {type(student_record)}\")\n", "\n", "# Accessing values using their keys.\n", "print(f\"\\nStudent first name: {student_record['fname']}\")\n", "\n", "# Modifying and adding elements is straightforward.\n", "student_record['active'] = False\n", "student_record['major'] = 'Digital Business Management'\n", "print(f\"Updated dictionary: {student_record}\")\n", "\n", "# Iterating over a dictionary is a common pattern for data processing.\n", "print(\"\\nIterating through key-value pairs:\")\n", "for key, value in student_record.items():\n", " print(f\" - {key}: {value}\")\n", "\n", "# Sorting a dictionary's items by key or value makes its contents easier to inspect.\n", "# Note: `sorted()` returns a list of (key, value) tuples, not a dictionary.\n", "sorted_by_key = sorted(student_record.items(), key=lambda x: x[0])\n", "print(f\"\\nItems sorted by key (a list of tuples): {sorted_by_key}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6. Sets: For Unique Values 💠\n", "\n", "Sets are unordered collections of unique elements. They make membership checks (`in` operator) fast, remove duplicates from a dataset (a common preprocessing step), and support mathematical set operations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a set. 
Note that duplicate values are automatically ignored.\n", "my_set = {'red', 'green', 'blue', 'blue'}\n", "print(f\"Original set (duplicates removed): {my_set}\")\n", "print(f\"Data type: {type(my_set)}\")\n", "\n", "# Sets are great for performing mathematical operations.\n", "# Let's define sets of students based on their skills.\n", "python_students = {'Alice', 'Bob', 'Charlie', 'David'}\n", "sql_students = {'Charlie', 'David', 'Eve', 'Frank'}\n", "\n", "print(f\"\\nPython students: {python_students}\")\n", "print(f\"SQL students: {sql_students}\")\n", "\n", "# Union: find all unique students from both sets.\n", "all_students = python_students.union(sql_students)\n", "print(f\"\\nUnion (all students): {all_students}\")\n", "\n", "# Intersection: find students who are in both sets.\n", "both_skills = python_students.intersection(sql_students)\n", "print(f\"Intersection (students with both skills): {both_skills}\")\n", "\n", "# Difference: find students who are in Python but not in SQL.\n", "python_only = python_students.difference(sql_students)\n", "print(f\"Difference (Python-only students): {python_only}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Operators and Control Structures 🚦\n", "\n", "Operators and control structures are the building blocks of program logic. They allow us to write code that can make decisions and perform tasks repeatedly, which is the essence of data processing.\n", "\n", "### 3.1. Conditional Logic (`if-elif-else`) 🤖\n", "\n", "Conditional statements allow your program to execute different code blocks based on whether a condition is true or false. This is a fundamental pattern for filtering data, handling missing values, or classifying records." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A simple example of conditional logic: assigning a discount tier based on the sales amount.\n", "sales_amount = 150.0\n", "\n", "if sales_amount > 200:\n", " discount_rate = 0.25 # 25% discount\n", "elif sales_amount > 100:\n", " discount_rate = 0.10 # 10% discount\n", "else:\n", " discount_rate = 0.0 # No discount\n", "\n", "discount_value = sales_amount * discount_rate\n", "final_price = sales_amount - discount_value\n", "\n", "print(f\"For a sales amount of ${sales_amount:.2f}:\")\n", "print(f\" - Discount rate: {discount_rate * 100}%\")\n", "print(f\" - Final price: ${final_price:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2. Loops (`for` and `while`) 🔄\n", "\n", "Loops are essential for iterating over collections of data. You'll use them constantly to process each row of a dataset, perform calculations on every element in a list, or read a file line by line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# `for` loops are ideal for iterating over a known sequence.\n", "# Let's iterate over a list of daily temperatures.\n", "daily_temps = [25, 27, 28, 26, 29]\n", "\n", "print(\"Daily Temperatures:\")\n", "for temp in daily_temps:\n", " print(f\" - {temp}°C\")\n", "\n", "# A `for` loop combined with `range()` is useful for iterating a specific number of times.\n", "# `range(stop)` generates numbers from 0 up to `stop - 1`; `range(start, stop)` begins at `start` instead.\n", "print(\"\\nIterating with `range()`:\")\n", "for i in range(3):\n", " print(f\" - Iteration number: {i}\")\n", "\n", "# `while` loops are used when the number of iterations isn't known beforehand.\n", "# Let's simulate processing records until a condition is met.\n", "data_to_process = ['A', 'B', 'C', 'STOP', 'D']\n", "index = 0\n", "\n", "print(\"\\nProcessing data until 'STOP' is found:\")\n", "# The bounds check prevents an IndexError if 'STOP' never appears in the list.\n", "while index < len(data_to_process) and data_to_process[index] != 'STOP':\n", " print(f\" - Processing 
record: {data_to_process[index]}\")\n", " index += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Functions and Modularity ⚙️\n", "\n", "Functions are reusable blocks of code that perform a specific task. In data mining, you'll use functions to encapsulate common preprocessing steps, statistical calculations, or data visualization logic, making your code modular, readable, and easy to maintain.\n", "\n", "### 4.1. Defining a Function (`def`) 🛠️\n", "\n", "A function is defined with the `def` keyword, followed by the function name, a parenthesized list of parameters, and a colon. The indented body that follows can use a `return` statement to send a value back to the caller." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A simple function to calculate the square of a number.\n", "def square(x):\n", " \"\"\"\n", " This is a docstring, which explains what the function does.\n", " It's good practice to write a docstring for every function.\n", " \"\"\"\n", " return x * x\n", "\n", "# Call the function and store the result in a variable.\n", "result = square(10)\n", "print(f\"The square of 10 is: {result}\")\n", "\n", "# Functions can also accept a collection (here, a list) and return several values.\n", "def calculate_metrics(data_list):\n", " \"\"\"Calculates the sum, average, min, and max of a non-empty list.\"\"\"\n", " total_sum = sum(data_list)\n", " average = total_sum / len(data_list)\n", " min_value = min(data_list)\n", " max_value = max(data_list)\n", " # Functions can return multiple values, often as a tuple.\n", " return total_sum, average, min_value, max_value\n", "\n", "# Use a list of sales data.\n", "sales_data = [100, 150, 120, 200, 180]\n", "total, avg, min_sale, max_sale = calculate_metrics(sales_data)\n", "\n", "print(f\"\\nSales data: {sales_data}\")\n", "print(f\" - Total sales: {total}\")\n", "print(f\" - Average sales: {avg}\")\n", "print(f\" - Minimum sale: {min_sale}\")\n", "print(f\" - Maximum sale: {max_sale}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2. 
Anonymous Functions (`lambda`) 💡\n", "\n", "Lambda functions are small, anonymous, single-expression functions. They're often used as a concise way to pass a function as an argument to another function, such as `sorted()`, `map()`, or `filter()`. This is a common pattern in Python for data manipulation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A lambda function to add 10 to a number.\n", "add_ten = lambda x: x + 10\n", "print(f\"Using a lambda function: 5 + 10 = {add_ten(5)}\")\n", "\n", "# A more practical use case: sorting a list of dictionaries by a specific key.\n", "data = [\n", " {'name': 'Robert', 'score': 85},\n", " {'name': 'Piyabute', 'score': 92},\n", " {'name': 'Elon', 'score': 78}\n", "]\n", "print(f\"\\nOriginal data: {data}\")\n", "\n", "# `sorted()` uses a `key` argument to determine the sorting order.\n", "# The lambda function `lambda item: item['score']` tells `sorted()` to use the 'score' value from each dictionary for comparison.\n", "sorted_data = sorted(data, key=lambda item: item['score'], reverse=True)\n", "print(f\"Data sorted by score (descending): {sorted_data}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Modules (Libraries) 📦\n", "\n", "Python's standard library and a vast ecosystem of third-party libraries are what make it so powerful for data mining. We can import these modules to access specialized functions for mathematics, statistics, data manipulation, and more.\n", "\n", "### 5.1. File and URL Handling 🌐💾\n", "\n", "The first step in any data mining project is obtaining data. The standard library's `urllib.request` module can fetch data from the internet, and the built-in `open()` function handles local files. The `with` statement ensures files are properly closed, even if errors occur." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "\n", "# Fetch data from a URL. 
This is a common pattern for accessing remote datasets.\n", "url = \"https://piyabute.com/data/research/iris.data.csv\"\n", "try:\n", " with urllib.request.urlopen(url) as file:\n", " print(\"--- First 5 lines from URL ---\")\n", " for i, line in enumerate(file):\n", " if i >= 5:\n", " break\n", " # We must decode the binary data from the file into a readable string.\n", " decoded_line = line.decode('utf-8')\n", " print(decoded_line.strip())\n", "except Exception as e:\n", " print(f\"Error accessing URL: {e}\")\n", "\n", "# Read and process a local file.\n", "# Assuming 'iris.data.csv' is in the same directory.\n", "try:\n", " with open('iris.data.csv', 'r') as file:\n", " print(\"\\n--- First 5 lines from local file ---\")\n", " # `readlines()` returns a list of all lines in the file.\n", " lines = file.readlines()\n", " for line in lines[:5]:\n", " print(line.strip())\n", "except FileNotFoundError:\n", " print(\"\\nError: The local file 'iris.data.csv' was not found.\")\n", " print(\"Please ensure the file is in the same directory as this notebook.\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }