GUCO

Ukraine War: An Online Corpus to analyze the impact of the War in Ukraine

Extractor for Reddit

To extract posts and their comments from Reddit, two extractors were developed in Python. Both use PRAW (Python Reddit API Wrapper), a library that provides a Python interface to the Reddit API.

First, a Reddit account was created and a Reddit application was registered at https://reddit.com/prefs/apps. The scraper instance was then created with the Reddit class from the PRAW library. Both extractors target the same subreddit, "r/portugal". To collect posts related to the conflict between Ukraine and Russia, each extractor searches the subreddit with one keyword: "Ucrânia" in the first extractor and "Russia" in the second. Only results dated between 24th February and 31st July were kept.
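The setup and search steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual extractor: the helper names (make_client, in_window, search_posts), the placeholder credentials, the user agent string, and the search options (sort="new", time_filter="all", limit=None) are all assumptions. The text does not state the year of the date window, so the window is passed in as a parameter, and the year 2022 in the usage comment is only an assumption.

```python
from datetime import datetime, timezone

SUBREDDIT = "portugal"  # subreddit named in the text ("r/portugal")


def make_client(client_id, client_secret, user_agent):
    """Create a read-only PRAW client from the Reddit app credentials."""
    import praw  # third-party package: pip install praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def in_window(created_utc, start, end):
    """Return True if a Unix timestamp lies within [start, end] (UTC datetimes)."""
    created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return start <= created <= end


def search_posts(reddit, keyword, start, end):
    """Yield submissions from the subreddit that match keyword and fall in the window."""
    results = reddit.subreddit(SUBREDDIT).search(
        keyword, sort="new", time_filter="all", limit=None
    )
    for submission in results:
        if in_window(submission.created_utc, start, end):
            yield submission


# Example usage (credentials are placeholders; the year 2022 is an assumption):
# reddit = make_client("CLIENT_ID", "CLIENT_SECRET", "corpus-extractor/0.1")
# start = datetime(2022, 2, 24, tzinfo=timezone.utc)
# end = datetime(2022, 7, 31, 23, 59, 59, tzinfo=timezone.utc)
# for post in search_posts(reddit, "Ucrânia", start, end):
#     print(post.title)
```

Filtering on the client side keeps the sketch simple: the search call returns whatever Reddit provides for the keyword, and the date check is applied to each result afterwards.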

The extracted data is currently saved in JSON format. Since no NLP techniques have been applied yet, the text has not been cleaned, so the extracted content is stored in raw, unformatted form.

For each post, a JSON object stores the post's title (title), score (score), id (id), URL (url), number of comments (comms\_num), timestamp (created), and description (body).
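As an illustration of the post record described above, the sketch below maps a PRAW submission to a dictionary with the same field names and writes a list of such records to a JSON file. The function names (post_to_record, save_records) are hypothetical, and the choice of submission.created_utc for the created field and submission.selftext for the body field is an assumption about how the fields were filled.

```python
import json


def post_to_record(submission):
    """Build the post record from a PRAW submission (field mapping is assumed)."""
    return {
        "title": submission.title,
        "score": submission.score,
        "id": submission.id,
        "url": submission.url,
        "comms_num": submission.num_comments,
        "created": submission.created_utc,  # Unix timestamp in UTC (assumption)
        "body": submission.selftext,        # empty string for link posts
    }


def save_records(records, path):
    """Write the records to a JSON file, keeping accented characters readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

Passing ensure_ascii=False keeps Portuguese characters such as "â" as-is in the output file instead of escaping them.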

The comments of a post are saved as a list of JSON objects, one per comment. Each object stores the comment id (comment\_id), the id of the parent comment (comment\_parent\_id), the content of the comment (comment\_body), and the replies to the comment (comment\_responses). The comment\_responses field is itself a list of JSON objects, one per reply. At this moment, the comments are not yet organized correctly.
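One possible way to build the nested comment structure described above is a small recursive function over PRAW's comment tree. This is only a sketch under assumptions: the function names (comment_to_record, comments_to_records) are hypothetical, and calling replace_more(limit=0) to discard the "load more comments" placeholders is a design choice made here, not something stated in the text.

```python
def comment_to_record(comment):
    """Convert a comment and, recursively, its replies into nested dictionaries."""
    return {
        "comment_id": comment.id,
        "comment_parent_id": comment.parent_id,
        "comment_body": comment.body,
        "comment_responses": [comment_to_record(reply) for reply in comment.replies],
    }


def comments_to_records(submission):
    """Return one record per top-level comment of a PRAW submission."""
    # Drop the "load more comments" placeholders so only real comments remain.
    submission.comments.replace_more(limit=0)
    return [comment_to_record(comment) for comment in submission.comments]
```

With this layout, each reply appears only inside the comment\_responses list of its parent, so the thread hierarchy can be read directly from the nesting.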