Ukraine War: An Online Corpus to analyze the impact of the War in Ukraine


Extractor for newspaper Jornal de Negócios

To gather data from the selected articles, an extractor mirroring the one used for "Público" articles was developed, together with a comment extractor modeled on the one designed for "Público". Only articles published between 24 February and 31 July were considered.
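The date restriction can be illustrated with a minimal sketch. The function name `in_collection_window` is hypothetical, the year is left as a parameter because the text does not state it, and treating both end dates as inclusive is an assumption.

```python
from datetime import date


def in_collection_window(published: date, year: int) -> bool:
    """Return True if `published` lies between 24 February and 31 July of `year`.

    Assumption: both boundary dates are included in the window.
    """
    start = date(year, 2, 24)
    end = date(year, 7, 31)
    return start <= published <= end
```

An extractor could apply such a check to each article's publication date before downloading its comments, so that out-of-range articles are skipped early.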

The information extracted from the comments differs in this case, however, because comments in "Jornal de Negócios" have a different structure. The following fields are saved for each comment: the comment's identifier (ID); the comment's title, which indicates whether or not it is the most voted comment (Comment's Title); the commentator's name (Name); the comment's date (Comment Date); the number of likes and dislikes (Number of Likes and Number of dislikes); the text of the comment (Comment); and the replies to the comment (Responses). The Responses field holds a list of JSON objects, each representing one response to the comment.
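The shape of one saved comment might look like the following sketch. The key names follow the field labels given in parentheses above, but their exact spelling in the stored files is an assumption, as are the fields of each response object, which the text does not specify. The helper `make_comment` and all values are hypothetical placeholders, not data from the corpus.

```python
def make_comment(comment_id, title, name, comment_date,
                 likes, dislikes, text, responses=None):
    """Build one comment record as a plain dict ready for JSON serialization."""
    return {
        "ID": comment_id,
        "Comment's Title": title,
        "Name": name,
        "Comment Date": comment_date,
        "Number of Likes": likes,
        "Number of dislikes": dislikes,
        "Comment": text,
        # One JSON object per response; an empty list when there are none.
        "Responses": list(responses or []),
    }


# Placeholder values for illustration only.
example = make_comment(
    comment_id="c-001",
    title="<comment title>",
    name="<commentator>",
    comment_date="<date as shown on the page>",
    likes=0,
    dislikes=0,
    text="<comment text>",
    responses=[
        # Assumed response fields; the source does not list them.
        {"Name": "<responder>", "Comment Date": "<date>", "Comment": "<reply text>"},
    ],
)
```

Nesting the responses inside the parent comment keeps each discussion thread in a single object, at the cost of having to walk the list when responses are analyzed on their own.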

At this point, the extracted data is saved in JSON format. No NLP techniques have been applied yet, so the text has not been cleaned and the information is stored unformatted.
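Storing the raw, uncleaned records could be done along the following lines. This is a sketch only: the function names and the file layout (one JSON array per file) are assumptions. `ensure_ascii=False` is used so that Portuguese characters are written as-is rather than as escape sequences.

```python
import json
from pathlib import Path


def save_raw_comments(comments, path):
    """Write the comment records to `path` as UTF-8 JSON, without any cleaning."""
    payload = json.dumps(comments, ensure_ascii=False, indent=2)
    Path(path).write_text(payload, encoding="utf-8")


def load_raw_comments(path):
    """Read the records back exactly as they were stored."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Because nothing is normalized at this stage, whitespace, casing, and punctuation in the comment text survive the round trip unchanged, which leaves every cleaning decision to the later NLP step.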