This post is my practice at finding the most frequent words in a text with Python. Here is the code:
#import the necessary packages
from collections import Counter
import pandas as pd
#open the text file. In this case, it is named "text.txt"
with open('text.txt') as fin:
    #read the whole file, split it on whitespace and count each word
    counter = Counter(fin.read().strip().split())
numbers = sorted(counter.most_common(), key=lambda student: student[1], reverse=True)
top15 = numbers[0:15]
counter.most_common() is the method that returns every word together with its count, as a list of (word, count) tuples ordered from most to least frequent. If you put a number, say 10, in the parentheses, you get only the 10 most frequent words. Here is what counter.most_common(10) returns:
[('to', 27), ('the', 26), ('of', 20), ('and', 19), ('in', 16), ('a', 15), ('is', 12), ('for', 11), ('it', 9), ('new', 9)]
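To make the behaviour concrete, here is a small sketch on a made-up sentence (the sample string is my own, not the text file used in this post). It shows the call with and without an argument.

```python
from collections import Counter

# A made-up sample string, used only for illustration
sample = "the cat and the dog and the bird"
counter = Counter(sample.split())

# No argument: every word with its count, most frequent first
print(counter.most_common())

# With an argument: only the n most frequent words
print(counter.most_common(2))  # [('the', 3), ('and', 2)]
```

Each element is a (word, count) tuple, so the word is at index 0 and the count at index 1.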
numbers = sorted(counter.most_common(), key=lambda student: student[1], reverse=True)
top15 = numbers[0:15]
The code above takes all the (word, count) pairs and sorts them in descending order of frequency. top15 then keeps the first 15 elements of the sorted list. Here is what top15 looks like:
[('to', 27), ('the', 26), ('of', 20), ('and', 19), ('in', 16), ('a', 15), ('is', 12), ('for', 11), ('it', 9), ('new', 9), ('are', 9), ('their', 8), ('video', 7), ('has', 7), ('by', 7)]
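Because most_common() already returns the pairs ordered from most to least frequent, the explicit sort and slice can probably be replaced by a single call. A minimal sketch, assuming a short made-up string in place of text.txt and a top 2 instead of a top 15:

```python
from collections import Counter

# Made-up data standing in for the contents of text.txt
sample = "a b a c b a d"
counter = Counter(sample.split())

# The approach used in this post: sort everything, then slice
numbers = sorted(counter.most_common(), key=lambda pair: pair[1], reverse=True)
top2_sorted = numbers[0:2]

# The shorter form: ask most_common() for the top n directly
top2_direct = counter.most_common(2)

print(top2_sorted == top2_direct)  # True
```

With the real file, the equivalent call would be counter.most_common(15).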
Once we have the top 15, we can put them into a data frame to make further processing easier. Here is how:
text = [] #a list for the words
number = [] #a list for the frequencies
for i in top15: #iterate through the top 15 (word, count) pairs
    text.append(i[0])
    number.append(i[1])
#create the data frame
rawdata = {'words': text, 'frequency': number}
df = pd.DataFrame(rawdata, columns = ['words', 'frequency'])
This is the final data frame:
    words  frequency
0      to         27
1     the         26
2      of         20
3     and         19
4      in         16
5       a         15
6      is         12
7     for         11
8      it          9
9     new          9
10    are          9
11  their          8
12  video          7
13    has          7
14     by          7
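The two helper lists are not strictly needed: pd.DataFrame also accepts a list of tuples, with each tuple becoming one row. A minimal sketch, using only the first three pairs from the result above to keep it short:

```python
import pandas as pd

# The first three (word, count) pairs from the top 15 shown earlier
top = [('to', 27), ('the', 26), ('of', 20)]

# Each tuple becomes one row; columns names the two fields
df = pd.DataFrame(top, columns=['words', 'frequency'])
print(df)
```

Passing top15 instead of the shortened list should give the same data frame as the loop-based version.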