CSE 30332 - HW1
Programming Paradigms
In this assignment you will use functional programming tools in Python, such as map, list comprehensions, and lambda functions, to write a command line tool that scrapes Reddit.
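If map, lambda functions, or list comprehensions are rusty, here is a quick refresher on toy data (illustrative only, not part of the assignment):

nums = [1, 2, 3, 4]
doubled = list(map(lambda n: n * 2, nums))  # map + lambda: [2, 4, 6, 8]
evens = [n for n in nums if n % 2 == 0]     # list comprehension: [2, 4]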
Reddit Review
Peter likes to sit in the back of the class. It has its perks:
- He can beat the rush out the door when class ends.
- He can see everyone browsing Facebook, playing video games, watching YouTube, or doing homework.
- He feels safe from being called upon by the instructor... except when he does that strange thing where he goes around the class and tries to talk to people. Totally weird.
That said, sitting in the back has its downsides:
- He can never see what the instructor is writing because he has terrible handwriting and always writes too small.
- He is prone to falling asleep because the instructor is really boring and the class is not as interesting as his other computer science courses.
To combat his boredom, Peter typically just browses Reddit. His favorite
subreddits are AdviceAnimals, aww, todayilearned, and of course
UnixPorn. Lately, however, Peter has grown paranoid that his web browser
is leaking information about him, and so he wants to be able to get the
latest links from Reddit directly in his terminal.
Requests Module
Fortunately for Peter, Reddit provides a JSON feed for every subreddit. You simply need to append .json to the end of the subreddit's URL. For instance, the JSON feed for todayilearned can be found here:
https://www.reddit.com/r/todayilearned/.json
To fetch that data, Peter uses the Requests package in Python to access
the JSON data:
import requests

r = requests.get('https://www.reddit.com/r/todayilearned/.json')
print(r.json())
429 Too Many Requests
Reddit tries to prevent bots from accessing its website too often. To work
around any 429: Too Many Requests errors, we can trick Reddit by
specifying our own user agent:
headers = {'user-agent': 'reddit-{}'.format(os.environ.get('USER', 'cse-30332-sp23'))}
response = requests.get(url, headers=headers)
This should allow you to make requests without getting the dreaded 429
error.
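Putting the pieces together, a minimal fetch helper might look like this (a sketch; the name fetch_json is our own, and raise_for_status turns HTTP errors such as 429 into exceptions):

import os
import requests

def fetch_json(url: str) -> dict:
    '''Fetch URL with a custom user agent and return the parsed JSON.'''
    headers = {'user-agent': 'reddit-{}'.format(os.environ.get('USER', 'cse-30332-sp23'))}
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # raise requests.HTTPError on any 4xx/5xx status
    return response.json()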
Printing the JSON response yields something like the following:
{"kind": "Listing", "data": {"modhash": "g8n3uwtdj363d5abd2cbdf61ed1aef6e2825c29dae8c9fa113", "children": [{"kind": "t3", "data": ...
Looking through that stream of text, Peter sees that the JSON data is a collection of structured, hierarchical dictionaries and lists. This looks a bit complex to him, so he wants you to help him complete the reddit_scrape.py script, which fetches the JSON data for a specified subreddit or URL and allows the user to sort the articles by various fields, restrict the number of items displayed, and even shorten the URLs of each article.
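For orientation, the posts in a listing sit under the data -> children keys, and each child's own data dictionary carries fields such as title, score, and url. A minimal sketch of pulling the titles out (assuming response holds the result of the requests.get call above):

posts = response.json()['data']['children']         # the list of posts in the listing
titles = [post['data']['title'] for post in posts]  # each post's fields sit under 'data'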
Command Line Arguments
The reddit_scrape.py script takes the following arguments:
--subreddits SUB1,SUB2,SUB3,...SUBN The list of subreddits to scrape, delimited by commas
--num LIMIT Number of articles to display per subreddit (default: 5)
--regex REGEX A regex to use to filter posts
--attr ATTR Field to sort articles by (default: score)
--reverse Include this flag to reverse the output
The --subreddits flag specifies the list of subreddits to scrape; it is comma delimited and can be of variable length.
The --num flag specifies the number of titles to display per subreddit; the default is 5.
The --regex flag specifies the regex used to filter the titles.
The --attr flag specifies the field to sort the titles by.
The --reverse flag specifies whether the output should be printed in reverse order.
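The remaining flags might be wired up with argparse along these lines (a sketch; --subreddits is already added in the scaffold below, --num should be parsed as an int, and --reverse is a boolean switch, so store_true fits):

parser.add_argument('--num', type=int, default=5, help='Number of articles to display per subreddit')
parser.add_argument('--regex', help='A regex to use to filter posts')
parser.add_argument('--attr', default='score', help='Field to sort articles by')
parser.add_argument('--reverse', action='store_true', help='Reverse the output')
args = parser.parse_args()  # the parsed values are then available as args.num, args.regex, etc.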
Code Overview and Scaffold
import requests
import os
import re
import argparse
from functools import partial
from typing import Generator


def scraper(sub: str) -> list:
    '''Use the Reddit API to get the JSON for a single subreddit.

    Args:
        sub (str): a subreddit name in the form of a string

    Returns:
        list: A list of dicts containing the posts on the subreddit
    '''


def searcher(num: int, regex: str, post_list: list) -> list:  # 1 line
    '''Use the supplied regex to filter the titles of posts in the subreddit.

    Args:
        num (int): the number of posts to return out of the filtered set
        regex (str): the regex with which to filter the post titles
        post_list (list): a list of dicts for each post in the subreddit

    Returns:
        list: A list of NUM dicts for posts on the subreddit with titles
        matching REGEX
    '''


def sorter(attr: str, dir: bool, post_list: list) -> list:  # 1 line
    '''Sort the filtered posts based on ATTR.

    Args:
        attr (str): The dictionary key on which to sort the posts
        dir (bool): A boolean for the direction in which to sort (asc/desc)
        post_list (list): A filtered list of posts in the subreddit

    Returns:
        list: A list of dicts containing the posts on the subreddit
        sorted by ATTR
    '''


def formatter(post_list: list) -> Generator[str, None, None]:  # 1 line
    '''Return a nicely formatted string for each remaining post in your list.

    Args:
        post_list (list): The list of posts that have been filtered and sorted

    Returns:
        Generator: A generator yielding the strings
    '''


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--subreddits", help="A comma separated list of subreddits to scrape")
    # Parse the rest of your arguments here
    # Use partial to create closures for two of your functions
    # Use nested maps to call your functions
    # Print out your formatted posts
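As a refresher on the two tools the scaffold calls for, here is a toy example of functools.partial and map on plain numbers (illustrative only; not the solution to the assignment):

from functools import partial

def scale(factor, x):
    return factor * x

double = partial(scale, 2)           # a closure with factor fixed to 2
print(list(map(double, [1, 2, 3])))  # [2, 4, 6]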
Example Output
Here is an example of reddit_scrape.py in action:
# Show posts matching 'cat' across several subreddits
hw1 ➜ python3 reddit_scrape.py --subreddits ultimate,houseplants,cats --num 5 --regex '.*cat.*'
Show me your most adorable pictures of your cat/cats -- 0.96
Hi! This is stray cat I've made friends with this summer, now it's colder so I let him stay at home. He often sleeps like this, face down, is this normal? Looks both depressing but also kinda cute. -- 0.98
I'm sick and was taking a nap. Woke up to this. I don't have a cat. -- 0.98
Here are some great resources for answering common questions about feline aggression. And remember, it's always best to talk with your veterinarian about specific issues regarding your cat(s)!🐱❤️ -- 0.99
Stinko's Tumor Battle [UPDATE] I made a video of his journey for the reddit cat community! -- 1.0
We'll do it live
Note: since we are pulling data from a live website, the articles may change between runs.
Submission Instructions
This assignment is due by 11:59 PM on Monday, February 6th (02/06).
To submit, please create a folder named HW1 in your dropbox. Then put your Python file, named reddit_scrape.py, into this folder. Assignments are programmatically collected at the due date.
Grading Rubric
Component | Points
--- | ---
Scraper function follows guidelines (Requests module) | 5
Searcher function follows guidelines (list comprehension, regex, 1 LOC) | 5
Sorter function follows guidelines (lambda function, 1 LOC) | 5
Formatter function follows guidelines (generator, 1 LOC) | 5
Main function follows guidelines (argument parsing, map calls) | 5
Code runs without errors using different arguments and inputs | 25
Code output correct given reasonable inputs | 20
Code style | 5
Total | 75