CSE 30332 - HW1

Programming Paradigms

In this assignment you will use functional programming tools in Python, such as map, list comprehensions, and lambda functions, to write a command-line tool that scrapes Reddit.

Reddit Review

Peter likes to sit in the back of the class. It has its perks:

That said, sitting in the back has its downsides:

To combat his boredom, Peter typically just browses Reddit. His favorite subreddits are AdviceAnimals, aww, todayilearned, and of course UnixPorn. Lately, however, Peter has grown paranoid that his web browser is leaking information about him, and so he wants to be able to get the latest links from Reddit directly in his terminal.

Requests Module

Fortunately for Peter, Reddit provides a JSON feed for every subreddit. You simply need to append .json to the end of each subreddit. For instance, the JSON feed for todayilearned can be found here:

To fetch that data, Peter uses the Requests package in Python to access the JSON data:

r = requests.get('')

429 Too Many Requests

Reddit tries to prevent bots from accessing its website too often. To work around any 429: Too Many Requests errors, we can trick Reddit by specifying our own user agent:

headers  = {'user-agent': 'reddit-{}'.format(os.environ.get('USER', 'cse-30332-sp23'))}
response = requests.get(url, headers=headers)

This should allow you to make requests without getting the dreaded 429 error.

Printing the body of the response (for example, response.text) would output something like the following:

{"kind": "Listing", "data": {"modhash": "g8n3uwtdj363d5abd2cbdf61ed1aef6e2825c29dae8c9fa113", "children": [{"kind": "t3", "data": ...

Looking through that stream of text, Peter sees that the JSON data is a collection of structured or hierarchical dictionaries and lists. This looks a bit complex to him, so he wants you to help him complete the script which fetches the JSON data for a specified subreddit or URL and allows the user to sort the articles by various fields, restrict the number of items displayed, and even shorten the URLs of each article.
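
As a rough illustration of that nesting, the sketch below parses a small hand-written sample shaped like the feed above and pulls out the per-post dictionaries. The sample data and the helper name extract_posts are hypothetical, and the sketch assumes each post carries title and score keys.

```python
import json

# Hand-written sample mimicking the shape of the feed shown above (not real data).
SAMPLE = '''
{"kind": "Listing",
 "data": {"children": [
     {"kind": "t3", "data": {"title": "First post", "score": 10}},
     {"kind": "t3", "data": {"title": "Second post", "score": 25}}
 ]}}
'''

def extract_posts(listing: dict) -> list:
    """Return the inner 'data' dict of every child in a Listing (hypothetical helper)."""
    return [child["data"] for child in listing["data"]["children"]]

posts = extract_posts(json.loads(SAMPLE))
for post in posts:
    print(post["title"], post["score"])
```

With the Requests package, response.json() returns the same kind of nested dictionary that json.loads produces here.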

Command Line Arguments

The script takes the following arguments:

    --subreddits SUB1,SUB2,SUB3,...SUBN   The list of subreddits to scrape, delimited by commas
    --num        LIMIT                    Number of articles to display per subreddit (default: 5)
    --regex      REGEX                    A regex to use to filter posts
    --attr       ATTR                     Field to sort articles by (default: score)
    --reverse                             Include this flag to reverse the output

The --subreddits flag specifies the list of subreddits to scrape, is comma delimited, and can be of variable length.

The --num flag specifies the number of titles to display per subreddit; the default is 5.

The --regex flag specifies the regex used to filter the titles.

The --attr flag specifies the field to sort the titles by.

The --reverse flag specifies whether the output should be printed in reverse direction or not.
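
One possible way to declare these flags with argparse is sketched below. The helper name build_parser, the fallback values for --subreddits and --regex, and the explicit argument list are assumptions made for illustration; only the defaults for --num and --attr come from the table above.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--subreddits", default="",
                        help="A comma separated list of subreddits to scrape")
    parser.add_argument("--num", type=int, default=5,
                        help="Number of articles to display per subreddit")
    # Assumption: match every title when no regex is given.
    parser.add_argument("--regex", default=".*",
                        help="A regex to use to filter posts")
    parser.add_argument("--attr", default="score",
                        help="Field to sort articles by")
    parser.add_argument("--reverse", action="store_true",
                        help="Include this flag to reverse the output")
    return parser

# Parse an explicit list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--subreddits", "aww,todayilearned", "--num", "3", "--reverse"]
)
subreddits = args.subreddits.split(",")
print(subreddits, args.num, args.regex, args.attr, args.reverse)
```

Using action="store_true" makes --reverse a plain on/off switch, so it is False unless the flag is present.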

Code Overview and Scaffold

                            import requests
                            import os
                            import re
                            import argparse
                            from functools import partial
                            from typing import Generator

                            def scraper(sub: str) -> list:
                                '''Use the Reddit API to get the JSON for a single subreddit.

                                Args:
                                    sub (str): a subreddit name in the form of a string

                                Returns:
                                    list: A list of dicts containing the posts on the subreddit
                                '''

                            def searcher(num: int, regex: str, post_list: list) -> list: # 1 line
                                '''Use the supplied regex to filter the titles of posts
                                in the subreddit.

                                Args:
                                    num (int): the number of posts to return out of the filtered set
                                    regex (str): the regex with which to filter the post titles
                                    post_list (list): a list of dicts for each post in the subreddit

                                Returns:
                                    list: A list of NUM dicts for posts on the subreddit with titles
                                          matching REGEX
                                '''

                            def sorter(attr: str, dir: bool, post_list: list) -> list: # 1 line
                                '''Sort the filtered posts based on ATTR.

                                Args:
                                    attr (str): The dictionary key on which to sort the posts
                                    dir (bool): A boolean for the direction in which to sort (asc/dsc)
                                    post_list (list): A filtered list of posts in the subreddit

                                Returns:
                                    list: A list of dicts containing the posts on the subreddit
                                          sorted by ATTR
                                '''

                            def formatter(post_list: list) -> Generator[str, None, None]: # 1 line
                                '''Return a nicely formatted string for each remaining post in your list.

                                Args:
                                    post_list (list): The list of posts that have been filtered and sorted

                                Returns:
                                    Generator: A generator yielding the strings
                                '''

                            if __name__ == '__main__':
                                parser = argparse.ArgumentParser()
                                parser.add_argument("--subreddits", help="A comma separated list of subreddits to scrape")
                                #Parse the rest of your arguments here

                                #Use partial to create closures for two of your functions

                                #Use nested maps to call your functions

                                #Print out your formatted posts
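
The last three comments in the scaffold mention partial and nested maps. The sketch below shows that pattern on toy data only; it is not the expected solution. The FAKE_FEEDS dictionary and the function names keep_matching, order_by, and to_lines are hypothetical stand-ins, and the output format is an assumption.

```python
import re
from functools import partial

# Hypothetical in-memory posts standing in for data fetched from Reddit.
FAKE_FEEDS = {
    "cats": [
        {"title": "My cat naps all day", "score": 40},
        {"title": "Dog visits the park", "score": 90},
        {"title": "Two cats, one box", "score": 15},
    ],
}

def keep_matching(num, regex, posts):
    """Keep at most num posts whose title matches regex (list comprehension)."""
    return [p for p in posts if re.search(regex, p["title"])][:num]

def order_by(attr, reverse, posts):
    """Sort posts on the attr key; the lambda extracts the sort value."""
    return sorted(posts, key=lambda p: p[attr], reverse=reverse)

def to_lines(posts):
    """Return a generator that yields one display string per post."""
    return (f"{p['title']} -- {p['score']}" for p in posts)

# partial pre-fills the leading arguments, leaving a one-argument callable for map.
search = partial(keep_matching, 5, "cat")
order = partial(order_by, "score", False)

# Nested maps: look up each feed, then filter, sort, and format it in turn.
formatted = map(to_lines, map(order, map(search, map(FAKE_FEEDS.get, ["cats"]))))
lines = [line for feed in formatted for line in feed]
print("\n".join(lines))
```

Because map is lazy, nothing is filtered or sorted until the final list comprehension consumes the generators.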

Example Output

Here are some examples of the script in action:

# Show posts matching 'cat' from the ultimate, houseplants, and cats subreddits
hw1 ➜ python3 --subreddits ultimate,houseplants,cats --num 5 --regex '.*cat.*'

Show me your most adorable pictures of your cat/cats -- 0.96
Hi! This is stray cat I've made friends with this summer, now it's colder so I let him stay at home. He often sleeps like this, face down, is this normal? Looks both depressing but also kinda cute. -- 0.98
I'm sick and was taking a nap. Woke up to this. I don't have a cat. -- 0.98
Here are some great resources for answering common questions about feline aggression. And remember, it's always best to talk with your veterinarian about specific issues regarding your cat(s)!🐱❤️ -- 0.99
Stinko's Tumor Battle [UPDATE] I made a video of his journey for the reddit cat community! -- 1.0

We'll do it live

Note: because we are pulling data from a live website, the articles may change between runs.

Submission Instructions

This assignment is due by 11:59 PM on Monday, February 6th (02/06). To submit, please create a folder named HW1 in your dropbox. Then put your Python file, named, into this folder. Assignments are collected programmatically at the due date.

Grading Rubric

Component                                                        Points
Scraper function follows guidelines:
    - Requests module
Searcher function follows guidelines:
    - List comprehension
    - Regex
    - 1 LOC
Sorter function follows guidelines:
    - Lambda function
    - 1 LOC
Formatter function follows guidelines:
    - Generator
    - 1 LOC
Main function follows guidelines:
    - Argument parsing
    - Map calls
Code runs without errors using different arguments and inputs    25
Code output correct given reasonable inputs                      20
Code style                                                       5
Total                                                            75