/*******************************************************************************************************************
-- Title : [Py3.5] TF-IDF w/ Scikit-Learn
-- Reference : clarkgrubb.com/nlp#tf-idf
-- Keywords : tf-idf tfidf sklearn scikit-learn tfidf vectorizer tfidfvectorizer fit_transform get_feature_names
              word vectorizing word vectorizer word frequency word vector vectorizing
              word dictionary term dictionary
*******************************************************************************************************************/
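For reference, with its default settings (smooth_idf=True, norm='l2') scikit-learn's TfidfVectorizer weights a term t in document d as

    tfidf(t, d) = tf(t, d) * idf(t),   where idf(t) = ln((1 + n) / (1 + df(t))) + 1,

with n the number of documents and df(t) the number of documents containing t; each document row is then scaled to unit L2 norm, which is why the dense array printed at the end of the script holds values between 0 and 1.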
■ Scripts
from sklearn.feature_extraction.text import TfidfVectorizer  # Ref : dbrang.tistory.com/1197

# =======================================
# -- Sourcing
# =======================================
train_set = ['In Xanadu did Kubla Khan',
             'A stately pleasure-dome decree:',
             'Where Alph, the sacred river, ran',
             'Through caverns measureless to man',
             'Down to a sunless sea.']

print(train_set)
print("... source", "." * 100, "\n")

# =======================================
# -- Word Vectorizer & TF-IDF
# =======================================
# Ref : dbrang.tistory.com/1197
# Uses the default word n-gram analyzer.
# All input is lowercased by default before tokenizing.
# min_df sets the minimum document frequency; only terms at or above it are kept.
# The default preprocessor lowercases; a default or a custom tokenizer can be plugged in here.

# --
# -- Declare the vectorizer
# --
vectorizer = TfidfVectorizer()
print(vectorizer)
print(",,, tf-idf vectorizer", "," * 100, "\n")

# --
# -- Make the feature matrix
# --
data = vectorizer.fit_transform(train_set)
print(data)
print("::: feature matrix from fit_transform", ":" * 100, "\n")

# --
# -- Get the feature names (vocabulary)
# --
word_token = vectorizer.get_feature_names()  # renamed to get_feature_names_out() in scikit-learn >= 1.0
for index in range(len(word_token)):
    print(index, word_token[index])
print("::: word_token", ":" * 100, "\n")

# --
# -- TF-IDF weights as a dense array
# --
print(data.toarray())
print("::: toarray with vector", ":" * 100, "\n")
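The listing prints the sparse matrix and the dense array, but it does not show which weight belongs to which term. The sketch below is not from the original post: it pairs each vocabulary term with its TF-IDF weight in the first document, and demonstrates the min_df option mentioned in the comments above. Names such as terms, first_doc, and strict are illustrative, and it assumes the same pre-1.0 scikit-learn API as the listing.

from sklearn.feature_extraction.text import TfidfVectorizer

train_set = ['In Xanadu did Kubla Khan',
             'A stately pleasure-dome decree:',
             'Where Alph, the sacred river, ran',
             'Through caverns measureless to man',
             'Down to a sunless sea.']

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(train_set)

# Pair every vocabulary term with its weight in the first document.
# get_feature_names() matches the old scikit-learn used here;
# on scikit-learn >= 1.0 call get_feature_names_out() instead.
terms = vectorizer.get_feature_names()
first_doc = matrix.toarray()[0]
for term, weight in zip(terms, first_doc):
    if weight > 0:
        print(term, round(weight, 3))

# min_df=2 keeps only terms that appear in at least two documents;
# in this five-line corpus that should leave just 'to'.
strict = TfidfVectorizer(min_df=2)
strict.fit_transform(train_set)
print(strict.get_feature_names())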