API Usage

There are two important classes in the library:
- PreProcessing
- CommunityTopic

The API of each class is described below.

Pre-Processing Module

Method: `do_preprocessing`

The do_preprocessing method performs pre-processing on a given training and testing corpus to convert it into a format suitable for CommunityTopic.

Parameters

do_preprocessing(train=None, test=None, ner=1, pos_filter=0, phrases='npmi', phrase_threshold=0.35, language='en')

train : str
- Input training corpus
test : str
- Input testing corpus
ner : int
- Named Entity Recognition flag
- Possible values = [0, 1]
- 0 - to not use NER
- 1 - to use NER
pos_filter : int
- Part-of-Speech (POS) filter: words are tagged with their part of speech and only the selected categories are kept as features
- Possible values = [0, 1, 2, 3]
- 0 - No POS filtering
- 1 - Keep only adjectives, adverbs, nouns, proper nouns, and verbs
- 2 - Keep only adjectives, nouns, proper nouns
- 3 - Keep only nouns, proper nouns
phrases : str
- Phrase detection method
- Currently only 'npmi' is used
phrase_threshold : float
- Phrase detection threshold
- Default is 0.35
language : str
- Language of the training and testing corpus
- Possible values = ['en', 'it', 'fr', 'de', 'es']
- 'en' - English
- 'it' - Italian
- 'fr' - French
- 'de' - German
- 'es' - Spanish
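
The `pos_filter` levels can be pictured as nested sets of allowed tags. The sketch below is only an illustration, not the library's implementation: it assumes tokens arrive as `(word, tag)` pairs carrying Universal POS tags, and the names `POS_FILTERS` and `apply_pos_filter` are hypothetical.

```python
# Hypothetical mapping from pos_filter value to the tags that are kept.
# Assumes Universal POS tag names; None means "keep everything".
POS_FILTERS = {
    0: None,
    1: {"ADJ", "ADV", "NOUN", "PROPN", "VERB"},
    2: {"ADJ", "NOUN", "PROPN"},
    3: {"NOUN", "PROPN"},
}


def apply_pos_filter(tagged_tokens, pos_filter=0):
    """Return the words whose tag is allowed by the given pos_filter level."""
    allowed = POS_FILTERS[pos_filter]
    if allowed is None:
        return [word for word, _ in tagged_tokens]
    return [word for word, tag in tagged_tokens if tag in allowed]


tagged = [
    ("the", "DET"),
    ("big", "ADJ"),
    ("bank", "NOUN"),
    ("quickly", "ADV"),
    ("lends", "VERB"),
    ("London", "PROPN"),
]
nouns_only = apply_pos_filter(tagged, pos_filter=3)
```

Each higher level keeps a subset of the level before it, so level 3 yields the most compact vocabulary.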

Returns

tokenized_train_sents
tokenized_train_docs
tokenized_test_docs
dictionary

tokenized_train_sents : list of list
- Pre-processed training corpus as sentences (each sentence is a list of words)
tokenized_train_docs : list of list
- Pre-processed training corpus as documents (each document is a list of words)
tokenized_test_docs : list of list
- Pre-processed testing corpus as documents (each document is a list of words)
dictionary : dict
- Gensim dictionary object that tracks frequencies and can filter vocab
- Keys are ids for words
- Values are words

Community Topic Module

Class constructor: `__init__`

__init__(self, train_corpus=None, dictionary=None, edge_weight='count',
         weight_threshold=0.0, cd_algorithm='leiden', resolution_parameter=1.0,
         network_window='sentence')

Parameters

train_corpus : list of list (of str)
- Pre-processed, tokenized sentences of the training corpus
- Each sentence is a list of words
dictionary : dict
- Gensim dictionary object that tracks frequencies and can filter vocab
- Keys are ids for words
- Values are words
edge_weight : str
- Weighting scheme for edges, derived from how often two terms co-occur
- Possible values: ["count", "npmi"]
- "count": Raw co-occurrence count as the edge weight
- "npmi": Weighting scheme that uses Normalized Pointwise Mutual Information (NPMI) between terms
weight_threshold : float
- The edges can be thresholded, i.e. those edges whose weights fall below a certain threshold are removed from the network.
cd_algorithm : str
- Community detection algorithm
- Possible values: ["leiden", "walktrap"]
resolution_parameter : float
- The resolution parameter used by the leiden community detection algorithm
- A higher resolution_parameter leads to more, smaller communities, while a lower resolution_parameter leads to fewer, larger communities
network_window : str
- The network constructed from the corpus has terms as vertices. This parameter decides the window within which two terms count as co-occurring.
- Possible values: ["sentence", "5", "10"]
- "sentence": two terms co-occur if they both occur in the same sentence.
- "5" or "10": two terms co-occur if they both occur within a fixed-size sliding window over a document.
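
To make `network_window`, `edge_weight` and `weight_threshold` concrete, here is a minimal standard-library sketch of how such a co-occurrence network could be weighted. It is not the library's code: all function names are hypothetical, the sliding window is interpreted as "at most `window - 1` positions apart", and the NPMI probabilities are estimated from sentence frequencies, which is an assumption.

```python
import math
from collections import Counter
from itertools import combinations


def sentence_cooccurrence(sentences):
    """Count sentences per term and per unordered term pair."""
    term_counts, pair_counts = Counter(), Counter()
    for sent in sentences:
        terms = sorted(set(sent))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    return term_counts, pair_counts


def window_cooccurrence(doc, window):
    """Count pairs of distinct terms at most window - 1 positions apart."""
    pair_counts = Counter()
    for i, left in enumerate(doc):
        for right in doc[i + 1 : i + window]:
            if left != right:
                pair_counts[tuple(sorted((left, right)))] += 1
    return pair_counts


def npmi_weights(term_counts, pair_counts, n_sentences):
    """NPMI(a, b) = log(p(a, b) / (p(a) p(b))) / -log(p(a, b))."""
    weights = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_sentences
        p_a = term_counts[a] / n_sentences
        p_b = term_counts[b] / n_sentences
        if p_ab == 1.0:
            weights[(a, b)] = 1.0  # both terms occur in every sentence
        else:
            weights[(a, b)] = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
    return weights


def threshold_edges(weights, weight_threshold=0.0):
    """Drop edges whose weight falls below the threshold."""
    return {pair: w for pair, w in weights.items() if w >= weight_threshold}


sentences = [
    ["bank", "money"],
    ["bank", "money", "loan"],
    ["river", "bank"],
    ["river", "water"],
]
term_counts, pair_counts = sentence_cooccurrence(sentences)
weights = npmi_weights(term_counts, pair_counts, len(sentences))
edges = threshold_edges(weights, weight_threshold=0.0)
```

In this toy corpus the pair ("bank", "river") co-occurs less often than chance predicts, so its NPMI is negative and a threshold of 0.0 removes that edge, while raw counts would have kept it.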

Method: `fit`

fit()

This method finds flat (non-hierarchical) topics.

Method: `fit_hierarchical`

fit_hierarchical(n_level=2)

This method finds hierarchical topics.

Parameter

n_level : int
- Number of levels in the topic hierarchy

Method: `get_topics_words`

get_topics_words()

Get the topic words of the flat topic model

Returns

topics : list of list
- Returns flat topics as topic words

Method: `get_topics_words_topn`

get_topics_words_topn(n=10)

Get the top n words of each flat topic

Parameter

n : int
- Number of top words to return per topic

Returns

topics : list of list
- Returns the flat topics, each as its top n topic words

Method: `get_topics`

get_topics()

Get flat topics as dictionary ids

Returns

topics : list of list
- Returns flat topics as lists of dictionary ids
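
Topics returned as dictionary ids can be mapped back to words through the dictionary. The sketch below uses a plain `dict` as a stand-in for the Gensim dictionary (which also supports lookup by id); the helper name `ids_to_words` is hypothetical, and the truncation assumes the ids in each topic are already ordered by importance.

```python
def ids_to_words(topics, id2word, n=None):
    """Map each topic's ids to words, optionally keeping only the first n."""
    return [[id2word[token_id] for token_id in topic[:n]] for topic in topics]


# Stand-in for the dictionary returned by do_preprocessing (id -> word).
id2word = {0: "firm", 1: "company", 2: "economy", 3: "country", 4: "china"}
topics = [[0, 1, 2], [3, 4]]

all_words = ids_to_words(topics, id2word)
top_two = ids_to_words(topics, id2word, n=2)
```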

Method: `get_topic_words_hierarchical`

get_topic_words_hierarchical()

Get hierarchical topics as topic words

Returns

hierarchical_topics_words : dict of dict
- In the following format (each level, and each topic in that level):

    {
      1: {"0": ['firm', 'company', 'economy', ...], "1": ['country', 'china', 'bank', ...], ...},
      2: {"0": ['orders', 'spring', 'allies', ...], "1": ['lawyer', 'individuals', 'failure', ...], ...},
      ...
    }

Method: `get_topics_hierarchical`

get_topics_hierarchical()

Get hierarchical topics as dictionary ids, together with the ig_graph object of each topic

Parameter

n : int
- top n topic words

Returns

hierarchical_topics : dict of dict
- Returns hierarchical topics as dictionary ids, each with its ig_graph object
- In the following format (each level, and each topic in that level):

    {
      1: {
        "0": {'dict_num': [2, 147, 6, 1180, 327, ...], 'ig_graph': object of ig_graph},
        "1": {'dict_num': [2, 147, 6, 1180, 327, ...], 'ig_graph': object of ig_graph},
        ...
      },
      2: {
        "0_0": {'dict_num': [2, 147, 6, 1180, 327, ...], 'ig_graph': object of ig_graph},
        "0_1": {'dict_num': [2, 147, 6, 1180, 327, ...], 'ig_graph': object of ig_graph},
        ...
        "1_0": {'dict_num': [2, 147, 6, 1180, 327, ...], 'ig_graph': object of ig_graph},
        "1_1": {'dict_num': [2, 147, 6, 1180, 327, ...], 'ig_graph': object of ig_graph},
        ...
      },
      3: {
        "0_0_0": {'dict_num': [2, 147, 6, 1180, 327, ...], 'ig_graph': object of ig_graph},
        "0_0_1": {'dict_num': [2, 147, 6, 1180, 327, ...], 'ig_graph': object of ig_graph},
        ...
      },
      ...
    }

Note that in the format above, each level uses topic names as dictionary keys:
- Level 1 keys are a single number that identifies a topic in that level.
- Level 2 keys are two numbers separated by an underscore: the first is the super topic and the second is the child topic.
- Similarly, level 3 keys are three numbers: the parent topics followed by the current child topic.
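
Because a topic key encodes its ancestry, parents and children can be found with plain string handling. This is a minimal sketch based only on the key naming described above; the helper names `parent_key` and `child_keys` are hypothetical and not part of the library.

```python
def parent_key(topic_key):
    """Return the super-topic key, or None for a level-1 topic."""
    head, sep, _ = topic_key.rpartition("_")
    return head if sep else None


def child_keys(hierarchy, level, topic_key):
    """List the keys at the next level whose parent is topic_key."""
    next_level = hierarchy.get(level + 1, {})
    return [key for key in next_level if parent_key(key) == topic_key]


# Toy hierarchy in the documented layout (word lists shortened).
hierarchy = {
    1: {"0": ["firm", "company"], "1": ["country", "china"]},
    2: {"0_0": ["firm"], "0_1": ["company"], "1_0": ["china"]},
}
children_of_first_topic = child_keys(hierarchy, 1, "0")
```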

Method: `get_n_level_topic_words_hierarchical`

get_n_level_topic_words_hierarchical(n_level=2)

Get the first n levels of the hierarchy

Parameter

n_level : int
- Number of top levels to return

Returns

topics : dict of dict
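
Conceptually, taking the first n levels amounts to filtering the level keys of the hierarchical dictionary. The sketch below illustrates that idea on a toy dictionary; it assumes levels are integer keys starting at 1, as in the format shown earlier, and `first_n_levels` is a hypothetical helper rather than the library's code.

```python
def first_n_levels(hierarchy, n_level=2):
    """Keep only the levels numbered 1..n_level."""
    return {
        level: topics
        for level, topics in hierarchy.items()
        if level <= n_level
    }


toy_hierarchy = {
    1: {"0": ["firm", "company"]},
    2: {"0_0": ["firm"], "0_1": ["company"]},
    3: {"0_0_0": ["firm"]},
}
top_two_levels = first_n_levels(toy_hierarchy, n_level=2)
```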

Method: `get_hierarchy_tree`

get_hierarchy_tree()

This method supports visualisation of the hierarchical topics.

Returns

tree : dict
- A tree-like structure in dictionary format
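
The exact layout of the returned tree is not specified here. As an illustration of what a tree-like dictionary could look like, the sketch below nests hierarchical topic words by their keys; the `name`/`words`/`children` fields and the `build_tree` helper are assumptions made for this example, not the structure `get_hierarchy_tree` is known to return.

```python
def build_tree(hierarchy):
    """Nest topics under their parents using the underscore-separated keys."""
    nodes = {}
    roots = []
    for level in sorted(hierarchy):
        for key, words in hierarchy[level].items():
            node = {"name": key, "words": words, "children": []}
            nodes[key] = node
            parent, sep, _ = key.rpartition("_")
            if sep and parent in nodes:
                nodes[parent]["children"].append(node)
            else:
                roots.append(node)  # level-1 topic (or orphaned key)
    return {"name": "root", "children": roots}


topic_words = {
    1: {"0": ["firm", "company"], "1": ["country", "china"]},
    2: {"0_0": ["firm"], "0_1": ["company"], "1_0": ["china"]},
}
tree = build_tree(topic_words)
```

A nested dictionary of this shape can be passed to most tree-plotting tools or serialised with `json.dumps` for a front-end visualisation.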
