The following are the solutions I came up with. I tried to be as verbose as possible. If you have any suggestions, feel free to contact me.
I recommend using ExplainShell to understand commands / flags you’ve never seen before.
Also, my solutions use awk, sed & grep extensively. If you don't understand an expression, try Regular Expressions 101; it'll make your life easier.
Furthermore, if you've never heard of regular expressions or don't understand them yet, learn them. They are extremely powerful. You can start by reading my blog post about them.
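To get a feel for the notation, here's the single most common pattern in these solutions; a minimal sketch (the sample sentence is invented):

```bash
# \w+ matches a run of "word" characters (letters, digits, underscore);
# -E enables extended regexes, -o prints each match on its own line
echo "Alice was beginning to get very tired" | grep -oE "\w+"
# Alice
# was
# beginning
# ... one word per line
```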
Task #1
Splits the book into chapters, with one dedicated file per chapter under the 'chapters' directory. There should be 12 chapters when you're done.
```bash
# book_path and chapters_dir are set earlier in the script (not shown)
awk '{
    if ($1 == "CHAPTER") {
        # extract the chapter number: the text between "CHAPTER " and the first "."
        chapter = substr($0, 9, index($0, ".") - 9)
        l = sprintf("%s.txt", chapter)
    }
    # once the first heading has been seen, append every line to the current chapter file
    if (length(chapter) > 0) {
        print $0 >> l
    }
}' "$book_path"

# iterate over all the chapter files and strip trailing blank lines - this is NOT mandatory
for f in *; do
    echo "$(<"$f")" > "$f"
done

echo "$chapters_dir"
```
Task #2
Generates statistics for the whole book and for each chapter:
- The most frequently used word
- The least frequently used word
- The longest word
```bash
#!/usr/bin/env bash
book_path=${1:-""}

# print most frequent word, least frequent word & longest word
# 1. extract every word onto its own line
# 2. count each word case-insensitively in a histogram
# 3. scan the histogram for the min, the max and the longest entry
grep -oE "\w+" "$book_path" | \
awk '{
    for (i = 1; i <= NF; i++) {
        hist[tolower($i)]++
    }
}
END {
    for (word in hist) {
        if (length(word) > length(longest))
            longest = word
        times = hist[word]
        if (times >= max) {
            max = times
            maxword = word
        }
        # min is unset (0) at first, so the first word scanned claims it
        if (min == 0 || times <= min) {
            min = times
            minword = word
        }
    }
    printf "Most frequent word is \"%s\" (appeared %d times)\n", maxword, max
    printf "Least frequent word is \"%s\" (appeared %d times)\n", minword, min
    printf "Longest word is \"%s\" (%d characters)\n", longest, length(longest)
}'
```
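A quick sanity check with a tiny file where every count is distinct, so the output is deterministic; `stats.sh` and `sample.txt` are hypothetical names:

```bash
printf 'a bb bb ccc ccc ccc\n' > sample.txt
./stats.sh sample.txt
# Most frequent word is "ccc" (appeared 3 times)
# Least frequent word is "a" (appeared 1 times)
# Longest word is "ccc" (3 characters)
```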
Task #3
Find the number of words whose length is greater than the average word length in the entire book.
```bash
#!/usr/bin/env bash
book_path=${1:-""}

# 1. extract every word onto its own line
# 2. collect the set of unique (lowercased) words and sum all word lengths
# 3. divide by the total number of words to get the average length
# 4. count the unique words that are at least as long as the average
grep -oE "\w+" "$book_path" | \
awk '{
    words[tolower($0)]
    sum += length($0)
}
END {
    average_word_length = sum / NR
    words_longer_than_average = 0
    for (word in words)
        # note: ">=" also counts words exactly at the average length
        if (length(word) >= average_word_length)
            words_longer_than_average++
    printf "There are %d words that are longer than the average word length (%0.3f)\n", \
        words_longer_than_average, average_word_length
}'
```
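To see the `sum/NR` average at work, here's a toy run of the same awk logic (the input is invented for the example):

```bash
# four words with lengths 1, 2, 3 and 6 -> average 3;
# "ccc" and "dddddd" are >= 3 characters, so the count is 2
printf 'a\nbb\nccc\ndddddd\n' | awk '
{
    words[tolower($0)]
    sum += length($0)
}
END {
    avg = sum / NR
    for (w in words)
        if (length(w) >= avg)
            n++
    print n, avg
}'
# 2 3
```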
Task #4
Find the “closest” word to “Alice” - the most frequently used word that appeared right before or after the word “Alice”.
```bash
#!/usr/bin/env bash
book_path=${1:-""}

# 1. grab every "word Alice word" window (either neighbor may be missing)
# 2. remove 'Alice' itself from the matches
# 3. replace all whitespace with newlines
# 4. lowercase everything
# 5. remove all empty lines
# 6. sort
# 7. collapse to unique lines, prefixed with their counts
# 8. sort by count, descending
# 9. keep only the top line
grep -Po '(\w+\s+)?Alice(\s+\w+)?' "$book_path" | \
sed 's/Alice//g' | \
tr '[:space:]' '\n' | \
tr '[:upper:]' '[:lower:]' | \
sed '/^\s*$/d' | \
sort | \
uniq -c | \
sort -rn | \
head -n 1 | \
awk '{
    printf "The most frequent word near \"Alice\" is \"%s\" (appeared %s times)\n", $2, $1
}'
```
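It's worth seeing what the `grep -Po` stage emits before the rest of the pipeline processes it. The sentence below is made up, and `-P` (Perl-compatible regexes) needs a GNU grep built with PCRE support:

```bash
echo 'so Alice went on, and Alice said' | grep -Po '(\w+\s+)?Alice(\s+\w+)?'
# so Alice went
# and Alice said
```

Each match captures at most one word on either side of "Alice". Further down the pipeline, `uniq -c` prefixes each unique line with its count, which is why the final awk reads the count from `$1` and the word from `$2`.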