Hash Table Word Count

Problem:

Write a program which takes as its input a String of natural language text and outputs a HashMap<String,Integer> whose keys are the unique words in the input and whose values are the number of times those words occur. The algorithm should be case-insensitive (e.g. “Program” and “program” would count as the same word) and ignore punctuation and whitespace.

Example:

Given the input "To be or not to be, that is the question", the outputted HashMap would contain 8 entries, with two words having a count of 2 and six words having a count of 1:

"to"        → 2
"be"        → 2
"or"        → 1
"not"       → 1
"that"      → 1
"is"        → 1
"the"       → 1
"question"  → 1

1. Understand

# Write a program that takes a String as input
# And outputs a dictionary with String:Int key:value pairs
# The keys are unique words (case insensitive 
# and ignoring whitespace and punctuation)
# Values are frequency counts of that word in the string

## Example input: "To be or not to be, that is the question"
## Should output:
# "to"        → 2
# "be"        → 2
# "or"        → 1
# "not"       → 1
# "that"      → 1
# "is"        → 1
# "the"       → 1
# "question"  → 1

2. Match

# This is a string processing problem
# Can use Python's split() function to get individual words
# in a list and use .lower() to convert each word to lowercase
# and strip() to strip whitespace and punctuation

3. Plan / Pseudocode

# Start with empty dictionary and input string
# First, convert string to a list with split()
# Second, iterate through each word in the list and:
# * convert to lowercase lower()
# * strip whitespace and punct strip()
# * check if its in the dictionary
#   * if it's not, add it with a value of 1
#   * if it is, increment the value associated with the key
# return the dictionary

4. Implement

5. Review

6. Evaluate

Currently all non-alphabetical characters are stripped from the string by using regex.

The algorithm runs at:

O(N) Time and Space

Sara Kazemi

< k a z e m i c o d e / >