Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça

Silva, Priscilla de Souza

Use this identifier to cite or link to this item: http://repositorio.ufla.br/jspui/handle/1/56065

Title:	Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça
Alternative title(s):	Data labeling for the task of named entity recognition in the domain of cachaça beverage
Authors:	Pereira, Denilson Alves; Merschmann, Luiz Henrique de Campos; Brito, Mozar José de; Dalip, Daniel Hasan
Keywords:	Named Entity Recognition (NER); Cachaça; Machine learning; Natural Language Processing (NLP)
Document date:	27-Feb-2023
Publisher:	Universidade Federal de Lavras
Citation:	SILVA, P. de S. Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça. 2022. 111 p. Dissertação (Mestrado em Ciência da Computação)–Universidade Federal de Lavras, Lavras, 2022.
Abstract:	Named Entity Recognition (NER) is the task of identifying tokens in free text and classifying them into a set of predefined categories, such as person name, organization, and location. Datasets labeled for this task are essential for training supervised machine learning models. However, although many labeled datasets exist for English text, they are still scarce for Portuguese. This work therefore contributes the creation and evaluation of a manually labeled dataset for the NER task, with texts written in Brazilian Portuguese, in the specific domain of the distilled beverage cachaça, a popular beverage in Brazil and one of great economic importance. The dataset proposed in this work is the first in Portuguese in the field of beverages and may be useful for other types of beverages whose entity categories are similar to those of cachaça, such as wine and beer. This work describes the collection and extraction of textual data, the creation and labeling of the NER dataset, and its experimental evaluation. The resulting dataset, called cachacaNER, contains more than 180,000 tokens labeled in 17 categories of named entities, comprising categories specific to the cachaça context and generic categories. According to the Fleiss' Kappa metric, the agreement obtained among the labelers (0.857) was almost perfect, which supports the reliability of the manual labeling. The size of the dataset and the results of its experimental evaluation are comparable to those of other Portuguese datasets, although cachacaNER has a greater number of named-entity categories. In addition to manual labeling, an automatic entity labeling technique was evaluated on the cachacaNER data, with the aim of enabling faster labeling with less manual work. The NER model trained with automatically labeled data performed well (F1 of 0.808) relative to the same model trained with manually labeled data (F1 of 0.899).
URI:	http://repositorio.ufla.br/jspui/handle/1/56065
Appears in collections:	Ciência da Computação - Mestrado (Dissertações)
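
The abstract reports inter-annotator agreement with Fleiss' Kappa (0.857), a value that falls in the 0.81 to 1.00 band conventionally read as "almost perfect" on the Landis and Koch scale. As an illustration only, the metric can be computed per token as in the minimal sketch below. The function name, the label names, and the toy annotations are hypothetical and are not taken from the dissertation or from cachacaNER; the sketch also assumes every token was labeled by the same number of annotators.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of items; each item is the list of labels given by the raters.
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who gave category j to item i.
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over items of the per-item agreement P_i.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(categories))
    )

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 tokens, 3 annotators, 3 made-up labels.
toy_categories = ["BRAND", "LOCATION", "O"]
toy_ratings = [
    ["BRAND", "BRAND", "BRAND"],
    ["O", "O", "O"],
    ["LOCATION", "LOCATION", "O"],  # one annotator disagrees here
    ["O", "O", "O"],
]

print(round(fleiss_kappa(toy_ratings, toy_categories), 3))  # prints 0.707
```

A single disagreement on four tokens already pulls the toy score down to about 0.71, which gives a sense of how much consistency a value of 0.857 over more than 180,000 tokens implies.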

Files associated with this item:

File	Description	Size	Format
DISSERTAÇÃO_Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça.pdf		5.58 MB	Adobe PDF
