How to get the word frequency in a text
In my previous post (http://forum.ivorde.ro/tr-how-to-convert-a-text-into-a-list-of-words-one-per-line-t17.html) I explained how to convert a text into a list of words, with one word per line.
This is how to get the word frequency in a text:
Code:
# cat test.file
FreeBSD 7.2-RELEASE is now available for the amd64, i386, ia64, pc98, powerpc, and sparc64 architectures.
FreeBSD 7.2 can be installed from bootable ISO images or over the network; the required files can be downloaded via FTP or BitTorrent as described in the sections below. While some of the smaller FTP mirrors may not carry all architectures, they will all generally contain the more common ones, such as i386 and amd64.
# cat test.file | tr -d '[:punct:]' | tr ' ' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn
6 the
2 or
2 i386
2 ftp
2 freebsd
2 can
2 be
2 as
2 architectures
2 and
2 amd64
2 all
1 will
1 while
1 via
1 they
1 such
1 sparc64
1 some
1 smaller
1 sections
1 required
1 powerpc
1 pc98
1 over
1 ones
1 of
1 now
1 not
1 network
1 more
1 mirrors
1 may
1 iso
1 is
1 installed
1 in
1 images
1 ia64
1 generally
1 from
1 for
1 files
1 downloaded
1 described
1 contain
1 common
1 carry
1 bootable
1 bittorrent
1 below
1 available
1 72release
1 72
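What the pipeline does: the first two tr commands strip the punctuation and put one word per line (this is why 7.2-RELEASE shows up as 72release in the output). The third tr converts all uppercase letters to lowercase, so that the same word written in different case is not counted as separate words. Then I sorted the words and sent the output to uniq -c (from the man page: "Precede each output line with the count of the number of times the line occurred in the input, followed by a single space."). The sort has to come first because uniq only collapses identical lines that are adjacent. Finally, I sorted the counted output numerically in descending order (sort -rn), so the most frequent words come first.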
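The same steps can be wrapped in a small shell function. This is only a sketch: the name word_freq and its optional second argument are made up for illustration and are not part of the commands above. Compared to the one-liner, it squeezes any run of spaces, tabs and newlines into a single newline (so double spaces do not produce empty "words") and it keeps only the top N lines with head. It assumes a tr that pads the second set with its last character, which is what GNU and BSD tr do.

```shell
# word_freq FILE [N] -- hypothetical helper, not from the original post.
# Prints the N most frequent words in FILE (default: 10), most frequent first.
#
# Steps, in pipeline order:
#   1. delete punctuation (same as the original one-liner)
#   2. turn every run of whitespace into one newline -> one word per line
#   3. fold uppercase to lowercase
#   4. drop an empty first line (happens if the file starts with whitespace)
#   5. sort so identical words are adjacent, count them with uniq -c
#   6. sort by count, descending, and keep the first N lines
word_freq() {
    tr -d '[:punct:]' < "$1" |
        tr -s '[:space:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        sort |
        uniq -c |
        sort -rn |
        head -n "${2:-10}"
}
```

For example, word_freq test.file 3 prints three lines for the sample file above, with "the" (6 occurrences) first. Because punctuation is still deleted rather than replaced, 7.2-RELEASE is counted as 72release, exactly as in the original output.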