GUCO

Ukraine War: An Online Corpus to analyze the impact of the War in Ukraine

Extractor for newspaper Público

To gather information from the Portuguese online newspaper "Público", a Python extractor was created that scrapes the newspaper's website and extracts the desired data. The extractor receives the URL of the chosen news article as an argument and returns a JSON file with the extracted data. It relies on several Python libraries, of which the two most important are Selenium and Beautiful Soup.

Selenium, an open-source browser automation tool, was used to drive the web browser. Since Firefox was the preferred browser for this project, the Selenium driver used to open a browser instance and run the script was geckodriver, the Firefox driver. Beautiful Soup, a library for pulling data out of HTML and XML files, was used to extract the data about the news publication: its title, subtitle, owner, number of shares, publication date, extraction date, the language in which the article is written, the platform name, the article's URL, body text, the name of the newspaper, and the number of comments.
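The division of work between the two libraries can be sketched as follows. This is a minimal illustration, not the project's actual extractor: the function names fetch_html and parse_article are hypothetical, the "h1" and "p" selectors are placeholders rather than Público's real markup, and geckodriver is assumed to be available to Selenium.

```python
def fetch_html(url: str) -> str:
    """Load a page in headless Firefox and return the rendered HTML."""
    # Imported here so the parsing helper below can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)  # assumes geckodriver is available
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser instance


def parse_article(html: str) -> dict:
    """Pull two fields out of the HTML. The selectors are placeholders."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return {
        "title": heading.get_text(strip=True) if heading else None,
        "postText": "\n".join(paragraphs),
    }
```

Keeping the browser step separate from the parsing step means the parsing can be tried on a saved HTML string without starting Firefox.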

To extract the comments of a news publication, a second Python extractor, very similar to the first, was developed. It receives two arguments: the link to the article, from which the article date used in the generated JSON filename is retrieved, and the link to the article's comments, which is the article link followed by the text "#comments". The same Python libraries were used.
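The two inputs of the comment extractor can be illustrated with a short sketch. Only the "#comments" suffix comes from the description above. The helper names and the filename pattern are assumptions made for illustration, since the exact naming scheme of the generated files is not stated here.

```python
from datetime import date


def comments_url(article_url: str) -> str:
    """Return the comments link: the article link plus "#comments"."""
    return article_url + "#comments"


def comments_filename(date_posted: date) -> str:
    """Build a JSON filename from the article date.

    The "comments_<YYYY-MM-DD>.json" pattern is hypothetical.
    """
    return f"comments_{date_posted.isoformat()}.json"
```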

Only articles dated between 24th February and 31st July were considered.
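A date filter of this kind could look like the sketch below. The function name is hypothetical, the year is left as a parameter because it is not stated above, and treating both end dates as inclusive is an assumption.

```python
from datetime import date


def in_collection_window(posted: date, year: int) -> bool:
    """True if the date falls between 24 February and 31 July of the given year.

    Both bounds are treated as inclusive, which is an assumption.
    """
    return date(year, 2, 24) <= posted <= date(year, 7, 31)
```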

As mentioned earlier, the information is saved in JSON format. At this point, no natural language processing techniques have been applied, so the text has not yet been cleaned and the extracted information is stored unformatted.

Information extracted from an article is stored in a JSON object that holds the article's title (title) and subtitle (subtitle), its owner (owner), the total number of shares (shares), the date of the article's publication (datePosted) and of its extraction (dateExtraction), the language in which it is written (language), the platform where it is available (plataform), the article's URL (url), the text of the article (postText), the name of the newspaper to which the article belongs (nameNewspaper), and the number of comments (comments).
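As an illustration, an article record with these twelve keys might be built and serialized as in the sketch below. The key names are the ones listed above, including the spelling "plataform". All values are invented placeholders, and the value types (integers for counts, ISO strings for dates) are assumptions.

```python
import json
from datetime import date

# Placeholder values only; the key names follow the description above.
article = {
    "title": "Example headline",
    "subtitle": "Example subtitle",
    "owner": "Example Author",
    "shares": 0,
    "datePosted": "YYYY-MM-DD",  # date format is an assumption
    "dateExtraction": date.today().isoformat(),
    "language": "pt",
    "plataform": "Website",
    "url": "https://example.org/article",
    "postText": "Texto do artigo...",
    "nameNewspaper": "Público",
    "comments": 0,
}

# ensure_ascii=False keeps accented characters such as "ú" readable in the file.
serialized = json.dumps(article, ensure_ascii=False, indent=2)
```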

Details of the comments extracted from a newspaper article, on the other hand, are saved in a list of JSON objects, where each object holds key/value pairs describing one comment. The following information is saved: the id of the comment (ID), the commentator's profile link (Profile Link), the commentator's ranking (Ranking), the commentator's name (Name), the text of the comment (Comment), and the replies to the comment (Responses). Note that the Responses field is itself a list of JSON objects, each representing one response to the comment.
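The nested structure can be made concrete with a small sketch. The key names are the ones listed above. The sample values are invented, and the sketch assumes that each entry in Responses has the same shape as a top-level comment. The helper count_comments is hypothetical and simply walks the nesting.

```python
def count_comments(comments: list) -> int:
    """Count comments and all nested responses recursively."""
    total = 0
    for comment in comments:
        total += 1 + count_comments(comment.get("Responses", []))
    return total


# Invented sample: one comment with one response, plus one comment without.
sample = [
    {
        "ID": "1",
        "Profile Link": "https://example.org/user/a",
        "Ranking": "Example rank",
        "Name": "Reader A",
        "Comment": "First comment.",
        "Responses": [
            {
                "ID": "2",
                "Profile Link": "https://example.org/user/b",
                "Ranking": "Example rank",
                "Name": "Reader B",
                "Comment": "A reply.",
                "Responses": [],
            }
        ],
    },
    {
        "ID": "3",
        "Profile Link": "https://example.org/user/c",
        "Ranking": "Example rank",
        "Name": "Reader C",
        "Comment": "Second comment.",
        "Responses": [],
    },
]

total = count_comments(sample)
```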