Minh Quang Duong

Python: Most frequent words in a string

MQDuong

In this post I practice finding the most frequent words in a string in Python. Here is the code:

#import the necessary packages
from collections import Counter
import pandas as pd

#open the text file (named "text.txt" in this case), split it on whitespace, and count each word
with open('text.txt') as fin:
    counter = Counter(fin.read().strip().split())

numbers = sorted(counter.most_common(), key=lambda student: student[1], reverse=True)
top15 = numbers[0:15]

counter.most_common() is the Counter method that returns every word with its count as a list of (word, count) tuples, ordered from the most frequent to the least frequent. If you put a number, let's say 10, in the brackets, you get only the 10 most frequent words. Here is how counter.most_common(10) looks:

[('to', 27), ('the', 26), ('of', 20), ('and', 19), ('in', 16), ('a', 15), ('is', 12), ('for', 11), ('it', 9), ('new', 9)]
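To see most_common() on something smaller than a whole file, here is a minimal sketch. The sample string is made up for illustration and stands in for the contents of text.txt.

```python
from collections import Counter

# Hypothetical sample text, used here in place of text.txt
sample = "the cat and the dog and the bird"

# split() breaks the string on whitespace; Counter tallies each word
counter = Counter(sample.split())

# most_common(n) returns the n most frequent (word, count) pairs,
# ordered from most to least frequent
top2 = counter.most_common(2)
print(top2)  # [('the', 3), ('and', 2)]
```

Note that split() only separates on whitespace, so "The" and "the" (or "dog" and "dog.") would be counted as different words.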

numbers = sorted(counter.most_common(), key=lambda student: student[1], reverse=True)
top15 = numbers[0:15]

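The code above takes every (word, count) pair and sorts the pairs in descending order of frequency. Because most_common() already returns its results in that order, the explicit sorted() call leaves the order unchanged and acts only as a safeguard. top15 then takes the first 15 elements of the sorted list. Here is how top15 looks: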

[('to', 27), ('the', 26), ('of', 20), ('and', 19), ('in', 16), ('a', 15), ('is', 12), ('for', 11), ('it', 9), ('new', 9), ('are', 9), ('their', 8), ('video', 7), ('has', 7), ('by', 7)]
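As a quick check of how the sorting step relates to most_common(), the sketch below compares the two on a small made-up string (the sample text and variable names are my own, not from the original file). Python's sorted() is stable, so words with equal counts keep the order most_common() gave them.

```python
from collections import Counter

# Hypothetical sample: 'b' appears 3 times, 'a' and 'c' twice, 'd' once
sample = "b a b c a b d c"
counter = Counter(sample.split())

# The approach from the post: sort all pairs by count, highest first
by_sort = sorted(counter.most_common(), key=lambda pair: pair[1], reverse=True)

# most_common() on its own already returns the pairs in that order
by_method = counter.most_common()

print(by_sort == by_method)                   # True
print(by_sort[:3] == counter.most_common(3))  # True
```

So for the top 15 words, counter.most_common(15) should give the same list as sorting everything and slicing.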

After we get the top 15, we put them into a data frame so that further data processing is easier. Here is how:

text = []    #a list for the words
number = []  #a list for the frequencies
for i in top15:  #iterate through the top 15 (word, count) pairs
    text.append(i[0])    #the word
    number.append(i[1])  #its count

#create the data frame
rawdata = {'words': text, 'frequency': number}
df = pd.DataFrame(rawdata, columns = ['words', 'frequency'])

This is the final data frame:

    words  frequency
0      to         27
1     the         26
2      of         20
3     and         19
4      in         16
5       a         15
6      is         12
7     for         11
8      it          9
9     new          9
10    are          9
11  their          8
12  video          7
13    has          7
14     by          7
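The loop that fills the two lists can also be written with zip(). This is only a sketch of an alternative, using the first three pairs of the output above as sample data; it assumes the list of pairs is not empty.

```python
# First three (word, count) pairs from the top15 output, used as sample data
top3 = [('to', 27), ('the', 26), ('of', 20)]

# zip(*pairs) regroups the pairs into one tuple of words and one tuple of counts
# (this unpacking fails on an empty list, so it assumes at least one pair)
words, counts = zip(*top3)

# Same dictionary layout as rawdata in the post
rawdata = {'words': list(words), 'frequency': list(counts)}
print(rawdata)  # {'words': ['to', 'the', 'of'], 'frequency': [27, 26, 20]}
```

Passing this dictionary to pd.DataFrame as before should give the same table. pandas can also build the frame straight from the list of pairs, for example pd.DataFrame(top15, columns=['words', 'frequency']).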

Tags:

Coding, Counter package, create a data frame from a loop, Most frequent words in python, Practice, Python, sort an array, Text analysis

Date:

January 23, 2019
