LDA topic modeling with BoW¶
In [1]:
import re
import os
import glob
import numpy as np
import pandas as pd
import nltk
import gensim
import gensim.corpora as corpora
from timeit import default_timer as timer
from datetime import timedelta
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
load and concat dataset¶
- Open-source BBC news dataset from http://mlg.ucd.ie/datasets/bbc.html
- 2,225 short news articles from 2004-2005
- Top-level categories: 'business', 'entertainment', 'politics', 'sport', 'tech'
- Concatenated into a single document list, ignoring the category labels
In [2]:
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sanghee\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[2]:
True
In [3]:
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sanghee\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[3]:
True
In [4]:
os.getcwd(), os.listdir()
Out[4]:
('C:\\Users\\sanghee\\jupyter', ['.ipynb_checkpoints', '0321-crawling.ipynb', '0322-BoW.ipynb', '0322-LDA.ipynb', '0322-tfidf.ipynb', '0323-tfidf2.ipynb', 'bbc-fulltext', 'bbc-fulltext.zip'])
In [5]:
os.chdir('./bbc-fulltext/bbc')
In [6]:
data = []
print("**START")
for i, theme in enumerate(os.listdir()):
    file_path = glob.glob(os.path.join(os.getcwd(), theme, "*.txt"))
    # reading text files from each directory
    print("-------------------------------")
    print("Collecting bbc {} news dataset".format(theme))
    start = timer()
    for files in file_path:
        try:
            with open(files, "r", encoding="utf-8") as f:
                data.append(f.read())
        except UnicodeDecodeError as e:
            print(e)
    end = timer()
    print("execution time: {} ".format(timedelta(seconds=end-start)))
    print("-------------------------------")
    print()
print("**END")
**START
-------------------------------
Collecting bbc business news dataset
execution time: 0:00:01.872588 
-------------------------------

-------------------------------
Collecting bbc entertainment news dataset
execution time: 0:00:01.052010 
-------------------------------

-------------------------------
Collecting bbc politics news dataset
execution time: 0:00:01.177353 
-------------------------------

-------------------------------
Collecting bbc sport news dataset
'utf-8' codec can't decode byte 0xa3 in position 257: invalid start byte
execution time: 0:00:01.360794 
-------------------------------

-------------------------------
Collecting bbc tech news dataset
execution time: 0:00:02.078201 
-------------------------------

**END
In [7]:
len(data), data[0]  # 2224 of the 2225 articles: the sport file that hit the UnicodeDecodeError was skipped
Out[7]:
(2224, 'Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL\'s underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL\'s existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.\n\nTime Warner\'s fourth quarter profits were slightly better than analysts\' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. "Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility," chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.\n\nTimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann\'s purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.\n')
In [8]:
# save as a csv file so it's easy to reuse later
df = pd.DataFrame(data, columns = ['contents'])
print(df)
                                               contents
0     Ad sales boost Time Warner profit\n\nQuarterly...
1     Dollar gains on Greenspan speech\n\nThe dollar...
2     Yukos unit buyer faces loan claim\n\nThe owner...
3     High fuel prices hit BA's profits\n\nBritish A...
4     Pernod takeover talk lifts Domecq\n\nShares in...
...                                                 ...
2219  BT program to beat dialler scams\n\nBT is intr...
2220  Spam e-mails tempt net shoppers\n\nComputer us...
2221  Be careful how you code\n\nA new European dire...
2222  US cyber security chief resigns\n\nThe man mak...
2223  Losing yourself in online gaming\n\nOnline rol...

[2224 rows x 1 columns]
In [9]:
df.to_csv(os.path.join(os.getcwd(), 'news_total.csv'), index=True)
preprocessing¶
- Strip everything except letters with a regex (digits, punctuation, etc.)
- Use nltk's stopwords list, extended with meaningless words spotted over a few trial runs
In [10]:
data = [re.sub(r'[^a-zA-Z_]', ' ', doc) for doc in data]
data = [re.sub(r'\s+', ' ', doc) for doc in data]
In [11]:
tokenized_document = [word_tokenize(d) for d in data]
print(tokenized_document[0][:30])
['Ad', 'sales', 'boost', 'Time', 'Warner', 'profit', 'Quarterly', 'profits', 'at', 'US', 'media', 'giant', 'TimeWarner', 'jumped', 'to', 'bn', 'm', 'for', 'the', 'three', 'months', 'to', 'December', 'from', 'm', 'year', 'earlier', 'The', 'firm', 'which']
In [12]:
stop_words = stopwords.words('english')
stop_words.extend(['said', 'says', 'year', 'also', 'would', 'mr', 'bn', 'could', 'first', 'second', 'one', 'two',
'use', 'used', 'last', 'time', 'make', 'new'])
stop_words = set(stop_words)
In [13]:
def cleansing(document):
    corpus = []
    for d in document:
        doc = []
        for word in d:
            low_word = word.lower()
            # compare lowercased against stopwords and drop single characters,
            # but keep the original casing in the output
            if (low_word not in stop_words) and (len(low_word) != 1):
                doc.append(word)
        corpus.append(doc)
    return corpus
In [14]:
cleaned_document = cleansing(tokenized_document)
print(cleaned_document[0][:30])
['Ad', 'sales', 'boost', 'Warner', 'profit', 'Quarterly', 'profits', 'US', 'media', 'giant', 'TimeWarner', 'jumped', 'three', 'months', 'December', 'earlier', 'firm', 'biggest', 'investors', 'Google', 'benefited', 'sales', 'high', 'speed', 'internet', 'connections', 'higher', 'advert', 'sales', 'TimeWarner']
In [15]:
dict_ = corpora.Dictionary(cleaned_document)
print(dict_)
Dictionary(31716 unique tokens: ['AOL', 'Ad', 'Alexander', 'Bertelsmann', 'Bros']...)
In [16]:
doc_term_matrix = [dict_.doc2bow(i) for i in cleaned_document]
print(doc_term_matrix[0])
[(0, 7), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 2), (20, 1), (21, 7), (22, 3), (23, 4), (24, 2), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 2), (49, 1), (50, 1), (51, 1), (52, 2), (53, 2), (54, 1), (55, 1), (56, 2), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 2), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 3), (77, 1), (78, 2), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 2), (86, 2), (87, 1), (88, 1), (89, 1), (90, 1), (91, 4), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1), (111, 1), (112, 1), (113, 1), (114, 1), (115, 1), (116, 1), (117, 1), (118, 2), (119, 1), (120, 1), (121, 1), (122, 1), (123, 5), (124, 5), (125, 1), (126, 1), (127, 1), (128, 3), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 2), (135, 2), (136, 2), (137, 2), (138, 1), (139, 2), (140, 1), (141, 4), (142, 1), (143, 1), (144, 1), (145, 2), (146, 1), (147, 1), (148, 1), (149, 1), (150, 1), (151, 2), (152, 3), (153, 1), (154, 1), (155, 2), (156, 1), (157, 2), (158, 1), (159, 1), (160, 1), (161, 1), (162, 1), (163, 1), (164, 1), (165, 1)]
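Each pair is (token id, count within the document). To sanity-check the matrix, the ids can be mapped back to tokens through the dictionary (indexing a gensim Dictionary with an id does the reverse lookup); a quick look at the first ten entries of the first document:
print([(dict_[idx], count) for idx, count in doc_term_matrix[0][:10]])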
LDA model training & testing¶
- K=5 (since the source data originally had 5 categories)
- The remaining parameters were set by hand
In [17]:
Lda = gensim.models.ldamodel.LdaModel
K = 5
passes = 30
iterations = 600
ldamodel = Lda(doc_term_matrix, num_topics=K, id2word=dict_,
               passes=passes, iterations=iterations, random_state=123)
In [18]:
ldamodel.print_topics()
Out[18]:
[(0,
  '0.005*"government" + 0.005*"US" + 0.004*"people" + 0.003*"Labour" + 0.003*"told" + 0.003*"election" + 0.003*"UK" + 0.002*"Blair" + 0.002*"China" + 0.002*"years"'),
 (1,
  '0.007*"Apple" + 0.007*"software" + 0.006*"search" + 0.006*"music" + 0.005*"technology" + 0.005*"Mac" + 0.005*"DVD" + 0.004*"people" + 0.004*"spam" + 0.004*"file"'),
 (2,
  '0.011*"people" + 0.007*"mobile" + 0.006*"technology" + 0.005*"phone" + 0.005*"TV" + 0.004*"net" + 0.004*"digital" + 0.004*"music" + 0.004*"users" + 0.004*"video"'),
 (3,
  '0.006*"game" + 0.005*"best" + 0.004*"film" + 0.004*"games" + 0.003*"Sony" + 0.003*"Nintendo" + 0.003*"world" + 0.003*"top" + 0.003*"play" + 0.003*"win"'),
 (4,
  '0.005*"software" + 0.004*"European" + 0.004*"computer" + 0.004*"US" + 0.003*"law" + 0.003*"Spanish" + 0.003*"patent" + 0.003*"domain" + 0.003*"browser" + 0.003*"world"')]
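To see how the trained model labels an individual article, gensim's get_document_topics returns the per-document topic distribution; for example, applied to the Time Warner article loaded earlier:
# per-document topic mixture for the first (business) article
print(ldamodel.get_document_topics(doc_term_matrix[0]))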
<<Points to think about>>¶
Feeding raw documents straight into the model gives very poor results.
- Beyond stopwords, how do we filter out words that appear frequently across the whole corpus but carry little meaning? (doing it one by one by hand doesn't scale; see the sketch below)
- A related Stack Overflow post: https://stackoverflow.com/questions/45822801/how-to-improve-word-assignement-in-different-topics-in-lda/45855850#45855850
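One low-effort option, not used above, is gensim's built-in frequency filtering: Dictionary.filter_extremes prunes tokens that are too rare or appear in too large a fraction of documents before the BoW corpus is built. A minimal sketch; the no_below/no_above thresholds are illustrative guesses, not tuned values:
# prune the vocabulary before building the BoW corpus
dict_ = corpora.Dictionary(cleaned_document)
dict_.filter_extremes(no_below=5,    # keep tokens appearing in at least 5 documents
                      no_above=0.5)  # drop tokens appearing in over 50% of documents
doc_term_matrix = [dict_.doc2bow(doc) for doc in cleaned_document]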
Choosing the number of topics
- (K was just set arbitrarily here, but) could the optimal K be assigned automatically...
- Consider optimizing via a grid search over model perplexity & coherence (see the sketch below)
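A minimal sketch of such a search, assuming gensim's CoherenceModel with the 'c_v' measure; the candidate range for K and the reduced passes/iterations are arbitrary choices to keep the loop fast:
from gensim.models import CoherenceModel

# score a few candidate K values by coherence and the per-word log perplexity bound
for k in range(3, 9):
    model = Lda(doc_term_matrix, num_topics=k, id2word=dict_,
                passes=10, iterations=100, random_state=123)
    cm = CoherenceModel(model=model, texts=cleaned_document,
                        dictionary=dict_, coherence='c_v')
    print("K={}: coherence={:.4f}, log perplexity bound={:.4f}".format(
        k, cm.get_coherence(), model.log_perplexity(doc_term_matrix)))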