TY - JOUR
T1 - Breaking news
T2 - Unveiling a new dataset for Portuguese news classification and comparative analysis of approaches
AU - Garcia, Klaifer
AU - Shiguihara, Pedro
AU - Berton, Lilian
N1 - Publisher Copyright: © 2024 Public Library of Science. All rights reserved.
PY - 2024/1
Y1 - 2024/1
N2 - Every day, thousands of news articles are published on the web, and filtering tools can be used to extract knowledge on specific topics. The categorization of news into a predefined set of topics is widely studied in the literature; however, most works are restricted to documents in English. In this work, we make two contributions. First, we introduce a Portuguese news dataset collected from WikiNews, an open-source media outlet that provides news from different sources. Since there is a lack of datasets for Portuguese, and an existing one is from a single news channel, we aim to introduce a dataset from different news channels. The availability of comprehensive datasets plays a key role in advancing research. Second, we compare different architectures for Portuguese news classification, exploring different text representations (BoW, TF-IDF, Embedding) and classification techniques (SVM, CNN, DJINN, BERT), covering classical methods and current technologies. We show the trade-off between accuracy and training time for this application. We aim to show the capabilities of available algorithms and the challenges faced in the area.
UR - http://www.scopus.com/inward/record.url?scp=85183591152&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0296929
DO - 10.1371/journal.pone.0296929
M3 - Article
C2 - 38277376
AN - SCOPUS:85183591152
SN - 1932-6203
VL - 19
JO - PLoS ONE
JF - PLoS ONE
IS - 1 January
M1 - e0296929
ER -