CSE 30124 - Introduction to Artificial Intelligence: Lab 01 (5 pts.)¶

  • NETID:

This assignment covers the following topics:

  • Data cleaning and splitting messy pandas datasets
  • Creating "features" from raw data (feature engineering)
  • Expert Systems / Rule-based Classification
  • An introduction to sklearn via kNNs

It will consist of 5 tasks:

Task ID Description Points
00 Load and Split Dataset 1
    00-1     - Load Dataset
    00-2     - Explore the Data
    00-3     - Split into Training and Testing DataFrames
01 Creating Features 2
    01-1     - Create Weapon Keyword Feature
    01-2     - Create Average Lying Probability Feature
    01-3     - Prepare Feature Data
02 Expert System / Rule-Based Classification 0
    02-1     - Test Expert System
03 Creating and Using kNN 1
    03-1     - Create kNN
    03-2     - Use kNN
04 Retrieve Suspect Statements 1
05 Generate Police Report 0

Please complete all sections. Some questions will require written answers, while others will involve coding. Be sure to run your code cells to verify your solutions.

Story Progression¶

As you finish your drink, you feel a tap on your shoulder. You respond with:

"You've got the wrong guy pal"

but as you look back, you realize it's a police officer. His name badge says Officer Gaff. He tells you to come with him, so you hop in his car and he drives you back to the precinct, where you're introduced to Director Bryant. It seems your job isn't over quite yet.

Director Bryant tells you that they've collected a list of 20 suspects and they've had them take a polygraph test to see if they can narrow it down at all.

Evidence 1: One of your TAs, a prime suspect, taking a polygraph test.

For each of the 20 suspects, the police asked them 10 questions and recorded the polygraph results for each question, resulting in the following data:

  • name, statement_1 through statement_10, lying_prob_1 through lying_prob_10

In addition to the 20 suspects they've had take a polygraph test, they have 60 labeled examples of suspects from prior cases, but unfortunately the data for these examples is in an entirely different format:

  • name, weapon_mentions, total_lying_prob, good_suspect

Director Bryant tells you that the intern they hired to do the data processing just offloaded it to Gemini, and all of the data got mixed together into one evidence file called suspect_data.csv, which has been giving them fits. He heard about your success with the travel itinerary last week and is hoping you can help!

Task 00: Load and Split Dataset (1 pt.)¶

Task 00-1: Load Dataset (0 pts.)¶

Loading the Dataset¶

Run the cell below to download the evidence, load suspect_data.csv into a dataframe and print out the first five rows.

Task 00-1: Code (0 pts.)¶

In [ ]:
import os
import pandas as pd

try:
    import google.colab
    REPO_URL = "https://github.com/wtheisen/nd-cse-30124-homeworks.git"

    REPO_PATH = "/content/nd-cse-30124-homeworks"
    L_PATH = "nd-cse-30124-homeworks/evidence/lab01"

    %cd /content/
    !rm -rf {REPO_PATH}

    # Clone repo
    if not os.path.exists(REPO_PATH):
        !git clone {REPO_URL}

        # cd into the data folder
        %cd {L_PATH}
        !pwd

except ImportError:
    print("Unable to download repo, either:")
    print("\tA.) You're not on colab")
    print("\tB.) It has already been cloned")

# TODO: Load the dataset via pandas

# TODO: Print the first 5 rows of the dataframe

Task 00-1: Expected Output (0 pts.)¶

suspect_name	weapon_mentions	total_lying_prob	good_suspect	statement_1	statement_2	statement_3	statement_4	statement_5	statement_6	...	lying_prob_1	lying_prob_2	lying_prob_3	lying_prob_4	lying_prob_5	lying_prob_6	lying_prob_7	lying_prob_8	lying_prob_9	lying_prob_10
0	Olivia Zino	NaN	NaN	NaN	I arrived late to the party	Someone was carrying a bag upstairs	I heard arguing in the study	The buffet in the kitchen looked incomplete	I stayed near the entrance	Someone mentioned a knife was missing	...	0.50	0.52	0.49	0.51	0.50	0.52	0.49	0.51	0.50	0.51
1	Claudia Huck	3.0	0.77	1.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Everton Albuquerque de Oliveira	4.0	0.85	1.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Cameron Rohlfsen	NaN	NaN	NaN	I arrived alone	I mingled with guests	The host mentioned a knife collection	I stayed in common areas	I left at midnight	Nothing caught my eye	...	0.54	0.53	0.55	0.54	0.53	0.55	0.54	0.53	0.54	0.55
4	Samuel Gisiner	3.0	0.86	1.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Note: Your editor will hopefully format the dataframe rows nicely for you

Task 00-2: Description (0 pts.)¶

Splitting the Dataset¶

Using pandas, separate the single dataframe into two new dataframes: one containing the 60 suspect examples from prior cases (those where good_suspect is set to either 0 or 1), and one containing the interview results for the 20 suspects in the current case (those where good_suspect is NaN). Below are some useful pandas functions that may help you!

Checking for missing values:

# Count NaN values in each column
df.isna().sum()

# Check if a specific column has NaN values
df['column_name'].isna()  # Returns True/False for each row

# Filter to rows where a column is NOT NaN
df[df['column_name'].notna()]

Understanding the data types:

df.info()  # Shows data type and non-null count for each column
df.dtypes  # Just shows data types

Filtering Dataframes on Conditionals

# Basic filtering
adults = df[df['age'] >= 18]           # Rows where age is 18+
students = df[df['role'] == 'student'] # Rows where role is 'student'

# Filtering by NaN values
has_email = df[df['email'].notna()]    # Rows where email is NOT NaN
no_email = df[df['email'].isna()]      # Rows where email IS NaN

# Multiple conditions (use & for AND, | for OR)
young_students = df[(df['age'] < 25) & (df['role'] == 'student')]

Important: When using multiple conditions, wrap each condition in parentheses!

Task 00-2: Code (0 pts.)¶

In [ ]:
# TODO: Split the data based on whether good_suspect is filled in

print(f"Training samples: {len(training_df)}")
print(f"Testing samples: {len(testing_df)}")
print(f"Total: {len(training_df) + len(testing_df)}")

Task 00-2: Expected Output¶

Training samples: 60
Testing samples: 20
Total: 80

Story Progression¶

You can't believe the Police had the training and testing data mixed together in the same file; Professor Theisen would have killed you. I guess that's what you get when some intern blindly uses AI. Fortunately, you figured out how to separate them and now you can see that:

  • Training data (60 records): Has weapon_mentions, total_lying_prob, and good_suspect (the label)
  • Testing data (20 records): Has raw statement_1 through statement_10 and lying_prob_1 through lying_prob_10

Unfortunately, the testing data format doesn't match the training data format. You'll need to do some feature engineering to transform the raw testing data into the same format as the training data!

Feature engineering is the process of transforming raw data into meaningful numerical values ("features") that machine learning algorithms can use. It's often said that feature engineering is the most important part of machine learning - good features can make a simple algorithm work great, while bad features will make even sophisticated algorithms fail.

We'll need to create two features from the raw polygraph data in order to match our examples:

  1. weapon_mentions (count feature): How many times weapon-related keywords appear in the suspect's statements
  2. total_lying_prob (aggregate feature): The average lying probability across all statements
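As a toy illustration of both features (using made-up statements and probabilities, not the real lab data), here is how a count feature and an aggregate feature can be computed from raw lists:

```python
import numpy as np

# Hypothetical raw polygraph data for one suspect (NOT from suspect_data.csv)
statements = ["I saw a knife on the table", "I left early", "Someone had a bag"]
lying_probs = [0.62, 0.55, 0.71]

weapons = ['gas', 'knife', 'poison', 'rope', 'bag', 'gun']

# Count feature: weapon keyword occurrences across all statements
combined = ' '.join(statements).lower()
weapon_mentions = sum(combined.count(w) for w in weapons)

# Aggregate feature: average lying probability across all answers
total_lying_prob = np.mean(lying_probs)

print(weapon_mentions)             # 2 ("knife" and "bag")
print(round(total_lying_prob, 2))  # 0.63
```

The same two steps, applied row by row, are exactly what Tasks 01-1 and 01-2 ask you to do on the real testing dataframe.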

Task 01: Create Features from Testing Data to Match Training Data (2 pts.)¶

Task 01-1: Description (0 pts.)¶

Creating the Weapon Mention Feature¶

Our first feature counts how many times weapon-related keywords appear in a suspect's answers to the polygraph. The weapons we're looking for are

weapons = ['gas', 'knife', 'poison', 'rope', 'bag', 'gun']

which were the weapons identified by the Police at the scene of the crime in homework01.

Evidence 2: Mapping polygraph answers to weapon mentions.

Useful String Operations in Pandas¶

# Convert to lowercase (important for matching!)
text = "I saw a KNIFE"
text.lower()  # Returns: "i saw a knife"

# Count occurrences of a substring
text = "the knife was a sharp knife"
text.count('knife')  # Returns: 2

# Join multiple strings together
statements = ["I was home", "I saw nothing", "I left early"]
combined = ' '.join(statements)  # "I was home I saw nothing I left early"

Using df.apply() to Process Each Row¶

When you need to apply a custom function to each row of a DataFrame, use df.apply() with axis=1:

def my_function(row):
    # row is a pandas Series representing one row
    # You can access columns like: row['column_name']
    return some_value

# Apply to each row and create a new column
df['new_column'] = df.apply(my_function, axis=1)

# Or use a lambda (anonymous function) for simple operations
df['new_column'] = df.apply(lambda row: row['a'] + row['b'], axis=1)

In the cell below, finish the count_weapon_mentions function to get the weapon mention counts from a suspect's polygraph answers.

Task 01-1: Code (1 pt.)¶

In [ ]:
weapons = ['gas', 'knife', 'poison', 'rope', 'bag', 'gun']

def count_weapon_mentions(row, weapons):
    """
    Count how many weapon keywords appear across all statements for a suspect.

    Args:
        row (pandas.Series): A pandas Series representing one suspect's data
        weapons (list): List of weapon keywords to search for

    Returns:
        weapon_count (int): Total count of weapon mentions
    """
    count = 0

    # TODO: Get all statement columns (statement_1, statement_2, ..., statement_10)

    # TODO: Combine all statements into one text block
    
    # TODO: Count each weapon keyword
    
    return count

# TODO: Apply to testing data using df.apply() with axis=1 (apply to each row)

print("Weapon mentions per suspect:")
print(testing_df[['suspect_name', 'weapon_mentions']].head(10))

Task 01-1: Expected Output (1 pt.)¶

Weapon mentions per suspect:
        suspect_name  weapon_mentions
0        Olivia Zino                2
3   Cameron Rohlfsen                1
9       Liam Sagucio                1
14    Olivia Pierret                0
16        Ryan Putka                0
18   Logan St Pierre                0
22     Jack Mangione                6
26        Tom Lohman                6
32        Conor Zech                0
37     Madelyn Perez                1

Task 01-2: Description (0 pts.)¶

Creating the Average Lying Probability Feature¶

The other feature we were missing is the total_lying_prob. It seems reasonable that if someone has a high lying probability on multiple answers during the polygraph, they're more suspicious overall. We want a single number that captures their overall "suspiciousness."


Aggregation Methods¶

There are several ways to combine multiple values into one:

| Method | Function | When to use |
| --- | --- | --- |
| Mean (average) | `np.mean()` | When all values contribute equally |
| Sum | `np.sum()` | When you want the total amount |
| Max | `np.max()` | When the highest value matters most |
| Min | `np.min()` | When the lowest value matters most |
| Median | `np.median()` | When you want to ignore outliers |

For our case, the mean fits best: every question should contribute equally to a suspect's overall suspiciousness.


Example Calculation¶

import numpy as np

# A suspect's lying probabilities for 10 statements
probs = [0.75, 0.82, 0.68, 0.79, 0.85, 0.71, 0.88, 0.73, 0.81, 0.77]

# Calculate average
average = np.mean(probs)  # Returns: 0.779

We'll use the mean (average) of all lying probabilities. This gives us a single value between 0 and 1 that represents how suspicious the suspect's polygraph answers were on average. In the cell below, finish the avg_lying_probs function to compute the average lying probability for a suspect's polygraph answers.

Evidence 3: Averaging polygraph lying probabilities.

Task 01-2: Code (1 pt.)¶

In [ ]:
import numpy as np

def avg_lying_probs(row):
    """
    Extract all lying probability values from a row and then calculate the average.

    Args:
        row (pandas.Series): A pandas Series representing one suspect's data

    Returns:
        avg (float): Average lying probability (between 0 and 1)
    """

    # TODO: Get all lying probability columns, we can use col.startswith to get them

    # TODO: Calculate the average of all lying probabilities

# TODO: Apply to testing data

print("Total lying probability (average) per suspect:")
print(testing_df[['suspect_name', 'total_lying_prob']].head(10))

Task 01-2: Expected Output (1 pt.)¶

Total lying probability (average) per suspect:
        suspect_name  total_lying_prob
0        Olivia Zino             0.505
3   Cameron Rohlfsen             0.540
9       Liam Sagucio             0.540
14    Olivia Pierret             0.175
16        Ryan Putka             0.175
18   Logan St Pierre             0.175
22     Jack Mangione             0.784
26        Tom Lohman             0.783
32        Conor Zech             0.177
37     Madelyn Perez             0.537

Task 01-3: Description (0 pts.)¶

Prepare Feature Matrix for Machine Learning¶

Now that we've created our features, we need to convert the data into the format that our classification models can use.


sklearn's Data Format¶

All sklearn algorithms expect data in a specific format:

| Variable | Shape | Description | Example |
| --- | --- | --- | --- |
| X | (n_samples, n_features) | Feature matrix | [[4, 0.85], [0, 0.12], [3, 0.78]] |
| y | (n_samples,) | Target labels | [1, 0, 1] |

  • X is a 2D array where each row is one sample (suspect) and each column is one feature
  • y is a 1D array of labels (what we're trying to predict)
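Using the example values from the table above, you can check these shapes directly:

```python
import numpy as np

# The example feature matrix and labels from the table above
X = np.array([[4, 0.85], [0, 0.12], [3, 0.78]])
y = np.array([1, 0, 1])

print(X.shape)  # (3, 2) -- 3 samples, 2 features
print(y.shape)  # (3,)   -- one label per sample
```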

Converting DataFrame to NumPy Array¶

# Select specific columns and convert to numpy array
X = df[['feature1', 'feature2']].values  # .values converts to numpy array

# Get labels
y = df['label_column'].values

Task 01-3: Code (0 pts.)¶

In [ ]:
# TODO: Prepare testing features to match training format

# TODO: Training data already has features and labels

Story Progression¶

Great, now we have features that match! The testing data now has the same format as the training data:

  • weapon_mentions: Count of weapon keywords in statements
  • total_lying_prob: Average lying probability across all statements

Task 02: Expert System / Rule-Based Classification (0 pts.)¶

Task 02-1: Description (0 pts.)¶

What is an Expert System?¶

An expert system is a computer program that encodes human expert knowledge as explicit rules. It's one of the oldest approaches to AI, dating back to the 1970s.


How Expert Systems Work¶

Expert systems use if-then rules based on domain knowledge:

# Medical diagnosis expert system (simplified)
if fever > 101 and cough and fatigue:
    diagnosis = "flu"
elif fever > 101 and rash:
    diagnosis = "measles"
elif headache and stiff_neck and fever:
    diagnosis = "meningitis"
else:
    diagnosis = "unknown"

Expert Systems vs Machine Learning¶

| Aspect | Expert Systems | Machine Learning |
| --- | --- | --- |
| Knowledge source | Human experts write rules | Algorithm learns from data |
| Interpretability | High - you can trace exactly why | Often low - "black box" |
| Training data needed | None | Usually lots |
| Handles edge cases | Poorly - only handles coded rules | Better - generalizes patterns |
| Maintenance | Hard - rules become complex | Easy - just retrain with new data |

Expert Police Suspect System Rules¶

  1. High lying probability (> 60%) → Suspicious
  2. Multiple weapons (≥ 3) AND moderate lying (> 50%) → Suspicious
  3. Many weapons (≥ 5) → Suspicious (regardless of lying probability)
  4. Very low lying probability (< 25%) → Probably innocent
Evidence 4: The old police suspect flowchart; crazy that this is what they used to do.

Task 02-1: Code (0 pts.)¶

In [ ]:
def expert_classify_suspect(weapon_mentions, lying_prob):
    """
    Rule-based expert system for classifying suspects.
    Rules derived from "domain expert" knowledge about polygraph analysis.

    Args:
        weapon_mentions: Number of weapon keywords in statements
        lying_prob: Average lying probability across all statements (0-1)

    Returns:
        1 if suspect is worth investigating ("good suspect")
        0 otherwise
    """
    # Rule 1: High average lying probability is very suspicious
    if lying_prob > 0.6:
        return 1  # Good suspect

    # Rule 2: Multiple weapon mentions combined with moderate lying
    if weapon_mentions >= 3 and lying_prob > 0.5:
        return 1  # Good suspect

    # Rule 3: Many weapon mentions is suspicious regardless
    if weapon_mentions >= 5:
        return 1  # Good suspect

    # Rule 4: Very low lying probability - probably innocent
    if lying_prob < 0.25:
        return 0  # Not a good suspect

    # Default: not enough evidence
    return 0

# TODO: Apply expert system to testing data

# See who the expert system flags
expert_suspects = testing_df[testing_df['expert_prediction'] == 1]
print(f"Expert system identified {len(expert_suspects)} suspects:")
print(expert_suspects[['suspect_name', 'weapon_mentions', 'total_lying_prob']].to_string())

Task 02-1: Expected Output (0 pts.)¶

Expert system identified 4 suspects:
     suspect_name  weapon_mentions  total_lying_prob
21  Jack Mangione                6             0.784
25     Tom Lohman                6             0.783
65  Sophia Noonan                4             0.575
76   Claire Green                4             0.779

Story Progression¶

Well that doesn't seem right, the Police had pretty explicitly identified 6 suspects at the scene of the crime: Mr. Green, Professor Plum, Colonel Mustard, Mrs. White, Mrs. Peacock, and Miss Scarlet.

This boomer, out-of-date expert system only found four! It may be time to try something on your own. As you're sneaking a quaff from your flask, it hits you: you could try using a k-Nearest Neighbors model to see if that gives you anything better than this dumb flowchart system. Luckily for you, most common ML algorithms can be accessed through a library called sklearn.


What is sklearn?¶

Scikit-learn (sklearn) is Python's most popular machine learning library. It provides:

  • Dozens of ML algorithms (kNN, Decision Trees, SVM, etc.)
  • Tools for preprocessing, evaluation, and model selection
  • A consistent API across all algorithms

sklearn's Consistent API¶

One of sklearn's best features is its consistent API. All classifiers follow the same pattern:

from sklearn.neighbors import KNeighborsClassifier

# Step 1: Create the model
model = KNeighborsClassifier(n_neighbors=N)

# Step 2: Train on data
model.fit(X_train, y_train)

# Step 3: Make predictions
predictions = model.predict(X_test)

Note: If one were to be asked to use a Decision Tree at some point (cough, cough), it's possible that it would look very similar.

Task 03: Use kNN to Find the Suspects (1 pt.)¶

Task 03-1: Description (0 pts.)¶

What is k-Nearest Neighbors (kNN)?¶

k-Nearest Neighbors is one of the simplest classification algorithms: to classify a new sample, it finds the k training samples closest to it in feature space (typically by Euclidean distance) and predicts whichever class holds the majority among those neighbors. There's no real "training" step; the model simply memorizes the training data and does all of its work at prediction time.

In the cell below, create and fit a kNN classifier from sklearn.
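To make the idea concrete, here is a from-scratch sketch of a single kNN prediction on tiny made-up data (the lab itself uses sklearn; this only illustrates the mechanics):

```python
import numpy as np
from collections import Counter

# Tiny made-up training set: [weapon_mentions, total_lying_prob] -> good_suspect
X_train = np.array([[4, 0.85], [0, 0.12], [3, 0.78], [1, 0.20]])
y_train = np.array([1, 0, 1, 0])

query = np.array([3, 0.80])  # new suspect to classify
k = 3

# Euclidean distance from the query to every training sample
dists = np.linalg.norm(X_train - query, axis=1)

# Indices of the k closest training samples
nearest = np.argsort(dists)[:k]

# Majority vote among the neighbors' labels
prediction = Counter(y_train[nearest]).most_common(1)[0][0]
print(prediction)  # 1
```

sklearn's `KNeighborsClassifier` does exactly this (plus efficient neighbor search) behind its `fit`/`predict` API.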

Task 03-1: Code (0 pts.)¶

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# TODO: Create kNN classifier for k = 3

# TODO: Fit on training data

print(f"Number of training samples: {len(X_train)}")
print(f"Number of neighbors (k): {knn.n_neighbors}")

Task 03-1: Expected Output (0 pts.)¶

Number of training samples: 60
Number of neighbors (k): 3

Task 03-2: Description (0 pts.)¶

Using kNN to Make Predictions¶

Now let's use our trained kNN to classify our suspects.

Key methods:

  • predict(X): Returns the predicted class (0 or 1) for each sample

Task 03-2: Code (0 pts.)¶

In [ ]:
# TODO: Predict which suspects are "good suspects"

# TODO: Add predictions back to dataframe

# TODO: Filter to good suspects, those with knn_prediction == 1

print(f"kNN (k=3) identified {len(knn_suspects)} suspects to investigate further:")
print(knn_suspects[['suspect_name', 'weapon_mentions', 'total_lying_prob', 'knn_prediction']].to_string())

Task 03-2: Expected Output (0 pts.)¶

kNN (k=3) identified 6 suspects to investigate further:
        suspect_name  weapon_mentions  total_lying_prob  knn_prediction
3   Cameron Rohlfsen                1             0.540             1.0
9       Liam Sagucio                1             0.540             1.0
22     Jack Mangione                6             0.784             1.0
26        Tom Lohman                6             0.783             1.0
37     Madelyn Perez                1             0.537             1.0
76      Claire Green                4             0.779             1.0

Story Progression¶

Hmmmmmmm, that's odd: you could have sworn that several of these suspects had alibis for the night of the murder. You know for sure at least that Cameron was rocking the cheer squad at a basketball game that very night. Something must have gone wrong. You wonder if a different value for k might affect the results. It may be worth trying some different hyperparameters. A hyperparameter is a value you choose before training that affects how the model behaves. Unlike regular parameters (which are learned from data), hyperparameters are set by you.


Examples of Hyperparameters¶

| Algorithm | Hyperparameter | What it controls |
| --- | --- | --- |
| kNN | `n_neighbors` (k) | How many neighbors to consider |
| Decision Tree | `max_depth` | How deep the tree can grow |
| Neural Network | `learning_rate` | How fast to update weights |

The Effect of k in kNN¶

  • Small k (e.g., 1 or 3):

    • More sensitive to local patterns
    • Can be influenced by noise/outliers
    • May overfit
  • Large k (e.g., 10 or 20):

    • More robust to noise
    • Smoother decision boundaries
    • May underfit (miss local patterns)

Task 03-3: Description (0 pts.)¶

Hyperparameter Tuning¶

In the cell below try the following values of k:

[1, 3, 5, 7, 9]

to see if you can match the expected output of:

6 suspects - ['Olivia Zino', 'Jack Mangione', 'Tom Lohman', 'Cesar Cervera', 'Sophia Noonan', 'Claire Green']

Remember that we typically use odd numbers for k so we don't end up with a tie during the vote!
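A quick illustration of why an even k is risky, using hypothetical neighbor labels (not the lab data):

```python
from collections import Counter

# Hypothetical labels of the nearest neighbors for one suspect
even_votes = Counter([1, 0, 1, 0])  # k=4: two votes each -- a tie
odd_votes = Counter([1, 0, 1])      # k=3: clear majority for class 1

print(even_votes[0] == even_votes[1])  # True -- the vote is tied
print(odd_votes.most_common(1)[0][0])  # 1
```

With an even k, the classifier has to fall back on an arbitrary tie-breaking rule; an odd k (for binary classification) guarantees a majority.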

Evidence 5: Using kNN to find better suspects.

Task 03-3: Code (0 pts.)¶

In [ ]:
for k in [1, 3, 5, 7, 9]:
    # TODO: Try the different values of k listed above

    print(f"{len(knn_suspects)} suspects - {knn_suspects['suspect_name'].tolist()}")

Task 03-3: Expected Output (1 pt.)¶

6 suspects - ['Cameron Rohlfsen', 'Liam Sagucio', 'Jack Mangione', 'Tom Lohman', 'Madelyn Perez', 'Claire Green']
6 suspects - ['Cameron Rohlfsen', 'Liam Sagucio', 'Jack Mangione', 'Tom Lohman', 'Madelyn Perez', 'Claire Green']
6 suspects - ['Olivia Zino', 'Jack Mangione', 'Tom Lohman', 'Cesar Cervera', 'Sophia Noonan', 'Claire Green']
6 suspects - ['Olivia Zino', 'Jack Mangione', 'Tom Lohman', 'Cesar Cervera', 'Sophia Noonan', 'Claire Green']
6 suspects - ['Olivia Zino', 'Jack Mangione', 'Tom Lohman', 'Cesar Cervera', 'Sophia Noonan', 'Claire Green']

Story Progression¶

Wow, depending on what you choose for k you get different suspects! You hope the police don't bring in anyone innocent. Just to be sure, you decide to leave the names that changed for the police to sort out, but you also notice that a few names were there the entire time:

  • Claire Green
  • Tom Lohman
  • Jack Mangione

These three must be really suspicious. It may be worth paying a couple of them a visit on your own time...

Task 04: Collecting Suspect Statements (1 pt.)¶

Task 04: Description (0 pts.)¶

Visit the following two suspects to collect their statements about the night of the murder. You'll need this data to complete homework02. Only one team member needs to visit each suspect, so feel free to divide up the work. The available times to pick up the suspects' (TAs') statements are listed below:

  • Jack Mangione: 5:00 - 7:00 on Monday and Wednesday in the CSE Commons
  • Claire Green: 6:00 - 7:30 Tuesday and 6:00 - 8:00 Thursday at Hagerty Cafe

Note: You'll need this data to complete homework02, though most of you won't actually read this part, will leave it until an hour before homework02 is due, and will be mad at me. (The TAs are super cool and flexible, so if you genuinely cannot find time in the next month to make it to one of their OH times, please just reach out to them!)

Task 05: Generate Police Report¶

Task 05: Description (0 pts.)¶

Run the code cell below to generate a report for the Police and submit it on Canvas!

Task 05: Code (0 pts.)¶

In [ ]:
import os, json

ASS_PATH = "nd-cse-30124-homeworks/labs"
ASS = "lab01"

try:
    from google.colab import _message, files

    # where you WANT it to live (repo folder)
    repo_ipynb_path = f"/content/{ASS_PATH}/{ASS}/{ASS}.ipynb"

    # grab current notebook contents from the UI
    nb = _message.blocking_request("get_ipynb", timeout_sec=1)["ipynb"]

    # write it into the repo folder as a real file
    os.makedirs(os.path.dirname(repo_ipynb_path), exist_ok=True)
    with open(repo_ipynb_path, "w", encoding="utf-8") as f:
        json.dump(nb, f)

    # convert + download html
    !jupyter nbconvert --to html "{repo_ipynb_path}"
    files.download(repo_ipynb_path.replace(".ipynb", ".html"))
except Exception:
    import subprocess

    nb_fp = os.getcwd() + f'/{ASS}.ipynb'
    print(os.getcwd())

    try:
        subprocess.run(["jupyter", "nbconvert", "--to", "html", nb_fp], check=True)
    except Exception:
        print('[WARNING]: Unable to export notebook as .html')