Getting Started

This example tutorial finds the topics in the BBC dataset using the best combination of pre-processing options and the Community Topic algorithm.

Step 1: Import the necessary classes from the library

from communitytopic import CommunityTopic
from communitytopic import PreProcessing

Step 2: Load the raw corpus as the dataset. Here we use the BBC dataset.

with open("<Path-To-train-Dataset>/bbc_train.txt", "r", encoding='utf-8') as f:
    bbc_train = f.read()

with open("<Path-To-Test-Dataset>/bbc_test.txt", "r", encoding='utf-8') as f:
    bbc_test = f.read()
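
If a path is wrong, the failure may only show up later, during pre-processing. The sketch below wraps the loading step in a small helper that fails early with a clear message. The name load_corpus is hypothetical and is not part of communitytopic; it only uses the Python standard library.

```python
from pathlib import Path


def load_corpus(path):
    """Read a whole corpus file into one string, failing early on bad input.

    Hypothetical helper for this tutorial; not part of communitytopic.
    """
    corpus_path = Path(path)
    if not corpus_path.is_file():
        raise FileNotFoundError(f"Corpus file not found: {corpus_path}")
    text = corpus_path.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError(f"Corpus file is empty: {corpus_path}")
    return text
```

With this helper, the two loads above would become bbc_train = load_corpus("<Path-To-train-Dataset>/bbc_train.txt") and the equivalent call for the test file.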

Step 3: Perform pre-processing on the training and testing corpora

tokenized_bbc_train_sents, tokenized_bbc_train_docs, tokenized_bbc_test_docs, dictionary = PreProcessing.do_preprocessing(
        train=bbc_train,
        test=bbc_test,
        ner=1,
        pos_filter=3,
        phrases="npmi",
        phrase_threshold=0.35,
        language="en")
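
Judging by the parameter names, phrases="npmi" selects normalized pointwise mutual information (NPMI) as the score for detecting multi-word phrases, and phrase_threshold=0.35 is the minimum score a word pair needs to be merged. That reading is an assumption, and the library's internal scoring may differ in detail. The sketch below only illustrates the standard NPMI formula on raw counts; the function and its arguments are hypothetical.

```python
import math


def npmi(count_ab, count_a, count_b, total):
    """Standard NPMI of a word pair, ranging from -1 to 1.

    count_ab: number of times the pair occurs together
    count_a, count_b: number of times each word occurs
    total: number of observations the counts were taken from
    Illustrative only; not the communitytopic implementation.
    """
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    if p_ab == 0:
        return -1.0  # the pair never co-occurs
    if p_ab == 1:
        return 1.0  # avoids dividing by -log(1) == 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)
```

A pair whose words always appear together scores 1, and a pair that co-occurs only as often as chance predicts scores about 0, so under this reading a threshold of 0.35 keeps pairs that co-occur well above chance.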

Step 4: Apply the Community Topic algorithm to the pre-processed data

community_topic = CommunityTopic(train_corpus=tokenized_bbc_train_sents, dictionary=dictionary)
community_topic.fit()
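
To give some intuition for why the algorithm is trained on tokenized sentences, the sketch below shows one common way to turn sentences into a term co-occurrence network, the kind of graph on which community detection can group related words. This is a simplified illustration under the assumption that each sentence is a list of tokens. It is not the library's implementation, and cooccurrence_edges and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tokenized_sents):
    """Count how often each unordered pair of distinct terms shares a sentence."""
    edges = Counter()
    for sent in tokenized_sents:
        # sorted(set(...)) gives each pair a single canonical order
        for a, b in combinations(sorted(set(sent)), 2):
            edges[(a, b)] += 1
    return edges


# Toy stand-in for tokenized sentences (hypothetical, not BBC data).
toy_sents = [
    ["bank", "interest", "rate"],
    ["bank", "rate", "rise"],
    ["match", "goal", "striker"],
    ["goal", "striker", "score"],
]
edges = cooccurrence_edges(toy_sents)
```

In the toy data, the finance terms and the sports terms are linked within their own groups and share no edge between groups, which is the kind of structure a community detection step would separate into two topics.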

Step 5: Get the topic words found by the above algorithm

topic_words = community_topic.get_topics_words_topn(10)
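
To inspect the result, the topics can be printed one per line. The sketch below assumes topic_words is a list in which each topic is a list of word strings; that return shape is an assumption, so check the actual value before relying on it. The helper format_topics and the sample data are hypothetical.

```python
def format_topics(topics):
    """Return one readable line per topic, numbering topics from 1."""
    return [
        f"Topic {i}: {', '.join(words)}"
        for i, words in enumerate(topics, start=1)
    ]


# Stand-in for topic_words (hypothetical values, not real model output).
sample_topics = [
    ["film", "actor", "award"],
    ["match", "goal", "team"],
]
for line in format_topics(sample_topics):
    print(line)
```

Passing topic_words in place of sample_topics would print the topics found in Step 4, provided the assumed shape holds.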