Linux, FreeBSD, Juniper, Cisco / Network security articles and troubleshooting guides
https://forum.ivorde.com/

How to get the word frequency in a text
https://forum.ivorde.com/how-to-get-the-word-frequency-in-a-text-t16.html

Author:  LaR3 [ Wed Aug 05, 2009 7:01 am ]
Post subject:  How to get the word frequency in a text

In my previous post http://forum.ivorde.ro/tr-how-to-convert-a-text-into-a-list-of-words-one-per-line-t17.html I explained how to convert a text into a list of words, with one word per line.

This is how to get the word frequency in a text:
Code:
# cat test.file
FreeBSD 7.2-RELEASE is now available for the amd64, i386, ia64, pc98, powerpc, and sparc64 architectures.

FreeBSD 7.2 can be installed from bootable ISO images or over the network; the required files can be downloaded via FTP or BitTorrent as described in the sections below. While some of the smaller FTP mirrors may not carry all architectures, they will all generally contain the more common ones, such as i386 and amd64.

# cat test.file | tr -d '[:punct:]' | tr ' ' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn
   6 the
   2 or
   2 i386
   2 ftp
   2 freebsd
   2 can
   2 be
   2 as
   2 architectures
   2 and
   2 amd64
   2 all
   1 will
   1 while
   1 via
   1 they
   1 such
   1 sparc64
   1 some
   1 smaller
   1 sections
   1 required
   1 powerpc
   1 pc98
   1 over
   1 ones
   1 of
   1 now
   1 not
   1 network
   1 more
   1 mirrors
   1 may
   1 iso
   1 is
   1 installed
   1 in
   1 images
   1 ia64
   1 generally
   1 from
   1 for
   1 files
   1 downloaded
   1 described
   1 contain
   1 common
   1 carry
   1 bootable
   1 bittorrent
   1 below
   1 available
   1 72release
   1 72


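What I did was strip the punctuation, put each word on its own line, and convert all uppercase letters to lowercase (so that the same word written in a different case is not counted separately). Then I sorted the words and sent the output to uniq -c, which precedes each output line with the count of the number of times the line occurred in the input, followed by a single space. The sort has to come first because uniq only collapses adjacent identical lines. Finally, I sorted the output numerically in descending order. Note that deleting punctuation also joins the parts of a token, which is why "7.2-RELEASE" shows up as "72release" and "7.2" as "72".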
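If tokens like "72release" are a problem, one possible variant is to split on every run of non-alphanumeric characters instead of deleting punctuation. This is only a sketch: the function name wordfreq is made up for illustration, and it counts "7.2-RELEASE" as three separate words ("7", "2" and "release"), so the counts will not match the output above exactly.

```shell
# wordfreq: hypothetical helper, reads text on stdin and prints "count word" lines
wordfreq() {
    tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" count as one word
    tr -cs '[:alnum:]' '\n' |      # turn each run of non-alphanumerics into one newline
    sed '/^$/d' |                  # drop the empty line left when input starts with punctuation
    sort |                         # group identical words so uniq can count them
    uniq -c |                      # prefix each distinct word with its count
    sort -rn                       # highest count first
}
```

It would be used as wordfreq < test.file. Because the text is split on anything that is not a letter or digit, blank lines and tabs in the input need no special handling.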
