Problem:
Write a program which takes as its input a String
of natural language text and outputs a HashMap<String,Integer>
whose keys are the unique words in the input and whose values are the number of times those words occur. The algorithm should be case-insensitive (e.g. “Program” and “program” would count as the same word) and ignore punctuation and whitespace.
Example:
Given the input "To be or not to be, that is the question"
, the outputted HashMap
would contain 8 entries, with two words having a count of 2 and six words having a count of 1:
"to" → 2
"be" → 2
"or" → 1
"not" → 1
"that" → 1
"is" → 1
"the" → 1
"question" → 1
1. Understand
# Write a program that takes a String as input
# And outputs a dictionary with String:Int key:value pairs
# The keys are unique words (case insensitive
# and ignoring whitespace and punctuation)
# Values are frequency counts of that word in the string
## Example input: "To be or not to be, that is the question"
## Should output:
# "to" → 2
# "be" → 2
# "or" → 1
# "not" → 1
# "that" → 1
# "is" → 1
# "the" → 1
# "question" → 1
2. Match
# This is a string processing problem
# Can use Python's split() function to get individual words
# in a list and use .lower() to convert each word to lowercase
# and strip() to strip whitespace and punctuation
3. Plan / Pseudocode
# Start with empty dictionary and input string
# First, convert string to a list with split()
# Second, iterate through each word in the list and:
# * convert to lowercase lower()
# * strip whitespace and punct strip()
# * check if its in the dictionary
# * if it's not, add it with a value of 1
# * if it is, increment the value associated with the key
# return the dictionary
4. Implement
5. Review
6. Evaluate
Currently all non-alphabetical characters are stripped from the string by using regex.
The algorithm runs at:
O(N) Time and Space