CSE 30332 - HW1

Programming Paradigms


In this assignment, you will use Python's functional programming tools, such as map, list comprehensions, and lambda functions, to write a command-line tool that scrapes Reddit.

Reddit Review


Peter likes to sit in the back of the class. It has its perks:

That said, sitting in the back has its downsides:

To combat his boredom, Peter typically just browses Reddit. His favorite subreddits are AdviceAnimals, aww, todayilearned, and of course UnixPorn. Lately, however, Peter has grown paranoid that his web browser is leaking information about him, and so he wants to be able to get the latest links from Reddit directly in his terminal.

Requests Module


Fortunately for Peter, Reddit provides a JSON feed for every subreddit. You simply need to append .json to the end of the subreddit's URL. For instance, the JSON feed for todayilearned can be found here:

https://www.reddit.com/r/todayilearned/.json

To fetch that data, Peter uses the Requests package in Python to access the JSON data:

import requests

r = requests.get('https://www.reddit.com/r/todayilearned/.json')
print(r.json())

429 Too Many Requests

Reddit tries to prevent bots from accessing its website too often. To work around any 429: Too Many Requests errors, we can trick Reddit by specifying our own user agent:

headers  = {'user-agent': 'reddit-{}'.format(os.environ.get('USER', 'cse-30332-sp23'))}
response = requests.get(url, headers=headers)

This should allow you to make requests without getting the dreaded 429 error.

The requests.get call shown earlier would output something like the following:

{"kind": "Listing", "data": {"modhash": "g8n3uwtdj363d5abd2cbdf61ed1aef6e2825c29dae8c9fa113", "children": [{"kind": "t3", "data": ...

Looking through that stream of text, Peter sees that the JSON data is a hierarchy of nested dictionaries and lists. This looks a bit complex to him, so he wants you to help him complete the reddit_scrape.py script, which fetches the JSON data for a specified subreddit or URL and allows the user to sort the articles by various fields, restrict the number of items displayed, and even shorten the URLs of each article.
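To make the nesting concrete, here is a minimal sketch of walking that structure. The sample dictionary is hand-written and hypothetical; it only mirrors the shape of the truncated output above (a Listing whose data holds a children list, where each child has its own data dict). The title and score keys and the extract_posts helper are assumptions used for illustration, not part of the scaffold.

```python
# Hypothetical, hand-written sample that mirrors the shape of the feed:
# Listing -> data -> children -> [ {kind, data}, ... ]
sample = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "score": 10}},
            {"kind": "t3", "data": {"title": "Second post", "score": 25}},
        ]
    },
}


def extract_posts(listing: dict) -> list:
    """Return the inner 'data' dict of every child in a Listing (illustrative helper)."""
    return [child["data"] for child in listing["data"]["children"]]


posts = extract_posts(sample)
print([post["title"] for post in posts])
```

Each element of posts is then a flat dictionary, which is the form the filtering and sorting steps work with.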

Command Line Arguments


The reddit_scrape.py script takes the following arguments:


    --subreddits SUB1,SUB2,SUB3,...SUBN   The list of subreddits to scrape, delimited by commas
    --num        LIMIT                    Number of articles to display per subreddit (default: 5)
    --regex      REGEX                    A regex to use to filter posts
    --attr       ATTR                     Field to sort articles by (default: score)
    --reverse                             Include this flag to reverse the output
                        

The --subreddits flag specifies the list of subreddits to scrape, is comma delimited, and can be of variable length.

The --num flag specifies the number of titles to display per subreddit; the default is 5.

The --regex flag specifies the regex used to filter the titles.

The --attr flag specifies the field to sort the titles by.

The --reverse flag specifies whether the output should be printed in reverse order.
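One way these flags could be declared with argparse is sketched below. The build_parser helper is a hypothetical name, and the '.*' default for --regex is an assumption (the assignment does not state a default); the defaults for --num and --attr come from the usage text above. The argument list is passed explicitly so the example runs without a real command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags described above (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Scrape Reddit from the terminal")
    # Split the comma-delimited value into a list of subreddit names.
    parser.add_argument("--subreddits", type=lambda s: s.split(","), default=[],
                        help="A comma separated list of subreddits to scrape")
    parser.add_argument("--num", type=int, default=5,
                        help="Number of articles to display per subreddit")
    # Assumption: match every title when no regex is supplied.
    parser.add_argument("--regex", default=".*",
                        help="A regex to use to filter posts")
    parser.add_argument("--attr", default="score",
                        help="Field to sort articles by")
    # store_true makes --reverse a boolean switch that defaults to False.
    parser.add_argument("--reverse", action="store_true",
                        help="Reverse the output")
    return parser


args = build_parser().parse_args(["--subreddits", "cats,aww", "--num", "3", "--reverse"])
print(args.subreddits, args.num, args.regex, args.attr, args.reverse)
```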

Code Overview and Scaffold


                        
import requests
import os
import re
import argparse
from functools import partial
from typing import Generator

def scraper(sub: str) -> list:
    '''Use the Reddit API to get the JSON for a single
    subreddit.

    Args:
        sub (str): a subreddit name in the form of a string

    Returns:
        list: A list of dicts containing the posts on the subreddit
    '''

def searcher(num: int, regex: str, post_list: list) -> list: #1 line
    '''Use the supplied regex to filter the titles of posts
    in the subreddit.

    Args:
        num (int): the number of posts to return out of the filtered set
        regex (str): the regex with which to filter the post titles
        post_list (list): a list of dicts for each post in the subreddit

    Returns:
        list: A list of NUM dicts for posts on the subreddit with titles
              matching REGEX
    '''

def sorter(attr: str, dir: bool, post_list: list) -> list: #1 line
    '''Sort the filtered posts based on ATTR.

    Args:
        attr (str): The dictionary key on which to sort the posts
        dir (bool): A boolean for the direction in which to sort (asc/dsc)
        post_list (list): A filtered list of posts in the subreddit

    Returns:
        list: A list of dicts containing the posts on the subreddit
              sorted by ATTR
    '''

def formatter(post_list: list) -> Generator[str, None, None]: #1 line
    '''Return a nicely formatted string for each remaining post in your list

    Args:
        post_list (list): The list of posts that have been filtered and sorted

    Returns:
        Generator: A generator yielding the strings
    '''

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--subreddits", help="A comma separated list of subreddits to scrape")
    # Parse the rest of your arguments here

    # Use partial to create closures for two of your functions

    # Use nested maps to call your functions

    # Print out your formatted posts
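The scaffold's hints about partial and nested maps can be illustrated on toy data. This is only a sketch under stated assumptions, not the reference solution: FAKE_FEEDS and fake_scraper are hypothetical stand-ins for the network call, the post dicts are invented, the output format (title followed by the score) is a guess, and the sorter parameter is named descending here rather than dir to avoid shadowing the builtin.

```python
import re
from functools import partial
from typing import Generator

# Hypothetical canned posts, used instead of a real request to Reddit.
FAKE_FEEDS = {
    "cats": [
        {"title": "My cat naps all day", "score": 40},
        {"title": "Dog meets kitten", "score": 90},
        {"title": "A cat in a box", "score": 15},
    ],
    "aww": [
        {"title": "Sleepy cat", "score": 7},
        {"title": "Baby otter", "score": 99},
    ],
}


def fake_scraper(sub: str) -> list:
    """Stand-in for scraper(): look up canned posts for a subreddit."""
    return FAKE_FEEDS[sub]


def searcher(num: int, regex: str, post_list: list) -> list:
    # List comprehension filters titles by regex, then the slice keeps NUM posts.
    return [p for p in post_list if re.search(regex, p["title"])][:num]


def sorter(attr: str, descending: bool, post_list: list) -> list:
    # The lambda pulls out the sort key named by ATTR.
    return sorted(post_list, key=lambda p: p[attr], reverse=descending)


def formatter(post_list: list) -> Generator[str, None, None]:
    # A generator expression yields one formatted string per post.
    return (f"{p['title']} -- {p['score']}" for p in post_list)


# partial fixes the leading arguments, leaving one-argument callables for map.
search = partial(searcher, 2, "cat")
sort = partial(sorter, "score", False)

# Nested maps: scrape -> search -> sort -> format, once per subreddit.
per_sub = map(formatter, map(sort, map(search, map(fake_scraper, ["cats", "aww"]))))
lines = [line for sub_lines in per_sub for line in sub_lines]
print("\n".join(lines))
```

Because map is lazy, nothing runs until the final list comprehension consumes the pipeline, which is why partial is needed: each stage handed to map must accept exactly one argument, the post list produced by the previous stage.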
                        
                    

Example Output


Here is an example of reddit_scrape.py in action:

# Show posts whose titles mention "cat" from several subreddits
hw1 ➜ python3 reddit_scrape.py --subreddits ultimate,houseplants,cats --num 5 --regex '.*cat.*'

Show me your most adorable pictures of your cat/cats -- 0.96
Hi! This is stray cat I've made friends with this summer, now it's colder so I let him stay at home. He often sleeps like this, face down, is this normal? Looks both depressing but also kinda cute. -- 0.98
I'm sick and was taking a nap. Woke up to this. I don't have a cat. -- 0.98
Here are some great resources for answering common questions about feline aggression. And remember, it's always best to talk with your veterinarian about specific issues regarding your cat(s)!🐱❤️ -- 0.99
Stinko's Tumor Battle [UPDATE] I made a video of his journey for the reddit cat community! -- 1.0

We'll do it live

Note: because we are pulling data from an active website, the articles may change between runs.

Submission Instructions


This assignment is due by 11:59 PM on Monday, February 6th (02/06). To submit, please create a folder named HW1 in your dropbox. Then put your Python file, named reddit_scrape.py, into this folder. Assignments are programmatically collected at the due date.

Grading Rubric


Component                                                       Points
Scraper function follows guidelines:                                 5
    - Requests module
Searcher function follows guidelines:                                5
    - List comprehension
    - Regex
    - 1 LOC
Sorter function follows guidelines:                                  5
    - Lambda function
    - 1 LOC
Formatter function follows guidelines:                               5
    - Generator
    - 1 LOC
Main function follows guidelines:                                    5
    - Argument parsing
    - Map calls
Code runs without errors using different arguments and inputs       25
Code output correct given reasonable inputs                         20
Code style                                                           5
Total                                                               75