Digital Youth in East Asia Fall Methods Workshop: How to Analyse Large Volumes of Online Text

Introduction and Objectives

From 11 to 13 October 2017, the East Asian Studies research unit at the Université Libre de Bruxelles (ULB) hosted a workshop entitled ‘Digital Youth in East Asia: Theoretical, Methodological, and Technical Issues’. The range of talks spanned three countries (China, Japan, and South Korea) and a variety of themes related to youth and identity online, including ethics and religion, citizenship, nationalism, cosplay, and cyberliterature. The participants were from diverse disciplines across the social sciences and humanities, and ranged from doctoral students to professors.

The final day of the workshop was a methods training session co-sponsored by CLARIN ERIC, which taught participants basic natural language processing (NLP) techniques for large volumes of online text. English was the predominant language analysed, in order to demonstrate the tools and methods via a lingua franca, although resources for Chinese, Korean, and Japanese were also introduced. A special dataset relevant to East Asian youth was collected for the event: 1,586,671 comments on 202 YouTube videos from the four most popular Korean pop (K-pop) boy and girl groups: Bangtan Boys (BTS), EXO, BLACKPINK, and TWICE. Needless to say, there was more than a bit of culture shock!

Online text is qualitatively different from offline text, and many traditional corpus methods do not directly translate to online material. Moreover, the concept of ‘big data’ is closely connected with the internet, so issues of scale make quantitative approaches more necessary. How can we find patterns in online text? What are the opportunities, and what are the main challenges and constraints? These patterns can relate to sentiment, topic, or simple frequencies, all of which the workshop covered using the Python programming language.

Given the time constraints and the limited technical backgrounds of the participants, the aim of the training day was not to teach Python, but rather to illuminate its potential applications for large-scale text analysis. As such, all of the code was pre-written within Jupyter Notebooks, and the sessions were designed around the simple execution of code blocks, the modification of their parameters, and interpretation of the output. (Jupyter Notebook was an ideal platform because it allows for the addition of narrative in Markdown alongside code blocks, and is easily accessible via a web browser.) The objective was to inspire the early stage researchers to learn more on their own, and the later stage researchers to engage in more collaborations with computational scientists.

Challenges and Opportunities Presented by (Big) Digital Data

The day began with two plenary lectures, chaired by François Forêt (ULB-CEVIPOL). The first focused on the challenges and opportunities presented by (big) digital data, and was delivered by Yin Yin Lu (Oxford Internet Institute, University of Oxford). After introducing the workshop trainers, schedule, and overall objectives, she focused the rest of her talk on two main areas: points of access to online data and the ethics of online research. There are four main points of access, two of which require knowledge of programming, and two of which do not. Application programming interfaces (APIs) and web scraping fall into the former category; collection tools (some available as downloadable software) and preexisting public datasets fall into the latter. Yin described the Twitter and Facebook APIs in more detail, highlighting their main constraints: social media companies impose severe rate limits, and the data collected are often not representative. Moreover, the largest East Asian messaging app, WeChat, does not appear to have a public API. There are fewer technical constraints when it comes to scraping data from webpages, and the Beautiful Soup Python library is a widely used tool for this.

For those who do not have or wish to acquire programming experience, there is a wealth of collection software designed for web data that not only obtains the data, but also analyses and visualises it; many are free to download. Twitter and YouTube are the most popular platforms for these collection tools, which are not exempt from API rate limits for non-enterprise users. The final option for accessing online data is to use a preexisting dataset, and Yin presented four categories: academic data repositories, open data portals, industry-prepared datasets, and CLARIN corpora. Many of these datasets are very large and free to use; those prepared by industry are particularly interesting, as they represent potential collaboration opportunities between researchers and internet companies.

Yin focused the rest of her talk on ethics, which has always been a particularly problematic issue for social media data, given its highly personal nature. There is a tension between Terms of Service agreements and user expectations. These agreements legally authorise social media companies to make user content available to other companies, organisations, and individuals (including academics). Although users have to accept the Terms of Service to create an account, many are uneasy about their posts being analysed and published by a third party (in, e.g., a journal article). Anonymisation is often not enough, as content can easily be linked to users through search engines. Thus, determining what is best ethical practice for the analysis of social media data is not straightforward. Current standards are being refined and negotiated; the Digital Wildfire project at the University of Oxford has played a significant role in developing the standards by creating a risk matrix to identify who is most at risk of harm according to type of user profile and type of message. Their key conclusion is that informed consent must be obtained from ordinary users (as opposed to public figures or institutions), especially if the content of their posts is sensitive or provocative.

Introduction to CLARIN Tools and Resources for Digital Data and East Asian Languages

Martin Wynne (Bodleian Libraries, University of Oxford; National Coordinator for CLARIN UK) delivered the second morning lecture, which contextualised the focus of the training day, provided an overview of CLARIN tools and services for digital data, and introduced NLP tools specifically designed for East Asian languages. Martin pointed out that researchers from the humanities, social sciences, and computational sciences are likely to have different approaches and goals when it comes to digital research, which also results in different terminology. But the good news is that, with some sensitivity to this potential for misunderstanding, many of the same tools, methods, and datasets can be shared.

Martin subsequently drew attention to other traditions that can inform social media analysis. Research on intellectual history in the eighteenth century, for example, involves tracing networks of correspondence, tracking epistolary interchanges and references to concepts, events, people, and places. Text reuse, multilingualism, and censorship are all important issues. Many of these aspects of research are strongly echoed in our explorations of social media today.

Martin also highlighted one of the most important methodological lessons to be learned from the humanities in this context: the importance of ‘close reading’, which involves the careful examination of the meaning of texts in context, and a thorough understanding of the provenance and social significance of these texts. In the era of big data and ‘distant reading’, it is particularly important not to forget this, and to ensure that the tools and methods that we develop not only allow for the exploration and visualisation of patterns in millions of texts, but also for researchers to drill down and see the source texts in context. Close reading and distant reading need to be connected by tools that enable ‘scalable reading’.

CLARIN is a key European initiative that aims to support digital research in these areas. The results of CLARIN investigations into datasets and tools for research in social media can be found on the new ‘CLARIN for Researchers’ webpages, and the ‘work-in-progress’ survey of NLP software for East Asian languages is in a shared Google document. Suggestions, corrections, and contributions to these ongoing registries are welcome.

Discussion and Debugging

The morning talks were followed by a Q&A session, in which interesting questions were raised regarding the accessibility of data, as well as the legal and ethical issues relating to their use. Some of these problems are particularly acute with the platforms used in China (namely WeChat). It was noted that the relatively easy accessibility of Twitter data has led to the overrepresentation of Twitter in academic research. This, in turn, raises questions about the representativeness of the platform itself. How well do Twitter users reflect segments of the population? To what extent are tweets representative of computer-mediated communication?

After the Q&A and a coffee break, the practical and hands-on aspect of the training day began. First there was a troubleshooting session to address any technical issues with software that participants may have encountered during installation (they were asked to download the Anaconda distribution and SentiStrength prior to the training day; Anaconda includes the latest version of Python and Jupyter Notebook, as well as many popular libraries for data analysis). Given the diversity of operating systems in the room (various versions of Windows, Mac, and Linux were all represented), there was indeed plenty of debugging to do. Next there was a quick review of basic Python concepts, followed by an extremely well-timed lunch break.

Chico Camargo Explains Strings in Python

Introduction to Python for Digital Text Analysis (Pandas and NLTK)

Led by Chico Camargo (Oxford Internet Institute and Department of Physics, University of Oxford), the first afternoon session provided an overview of how Python could be used to descriptively summarise the K-pop dataset. It focused on the pandas library, which is indispensable for data analysis tasks. The objective of the session was to obtain a high-level statistical overview of the dataset after loading it into a dataframe. This versatile structure, roughly akin to an Excel spreadsheet, allows for the comments to be sorted and filtered in various ways (e.g., by number of likes, date published, author). Unlike Excel, it can process thousands upon thousands of rows in milliseconds. The session concluded with some basic, yet powerful, visualisations: comments over time, number of comments by like count, number of comments and likes per author, and number of comments per group in histogram form.
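To give a flavour of the dataframe operations described above, here is a minimal sketch rather than the notebook code used on the day. The tiny table and its column names (author, group, likes, published) are assumptions made for illustration; the real K-pop dataset may be organised differently.

```python
import pandas as pd

# Hypothetical miniature of the comment table; column names are assumptions.
comments = pd.DataFrame(
    {
        "author": ["ana", "ben", "ana", "chloe", "ben", "ana"],
        "group": ["BTS", "EXO", "BTS", "TWICE", "BLACKPINK", "EXO"],
        "likes": [12, 0, 3, 45, 1, 0],
        "published": pd.to_datetime(
            ["2017-09-01", "2017-09-01", "2017-09-02",
             "2017-09-03", "2017-09-03", "2017-09-03"]
        ),
    }
)

# Sort by like count, most-liked comment first.
top_liked = comments.sort_values("likes", ascending=False)

# Filter: keep only comments that received at least one like.
liked = comments[comments["likes"] > 0]

# Number of comments and total likes per author.
per_author = comments.groupby("author").agg(
    n_comments=("likes", "size"),
    total_likes=("likes", "sum"),
)

# Comments per day and per group: the inputs for a time series and a bar chart.
per_day = comments.groupby(comments["published"].dt.date).size()
per_group = comments["group"].value_counts()

print(per_author)
print(per_group)
```

Each of the resulting objects can be passed to its .plot() method (which relies on matplotlib) to produce charts of the kind listed above.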

Comments Over Time

Number of Comments by Like Count


Log-log Plot of Number of Comments per Author

Log-log Plot of Number of Likes per Author

Scatterplot of Comments per Author vs Likes per Author

Histogram of Comments per K-pop Group

After this high-level statistical overview, Yin led the second introductory session on Python’s Natural Language Toolkit (NLTK) library. This is an excellent tool for analysing linguistic data, as it has built-in corpora and text processing functionalities for everything from tokenisation to semantic reasoning. Yin demonstrated the application of some basic NLTK functions to four YouTube comment files in the K-pop dataset (a popular song from each group). They were examined individually as well as comparatively, after the implementation of a special tweet tokeniser (although designed for tweets, it is applicable to other forms of social media data). Tokenisation also allowed for the calculation of lexical diversity, frequency distributions of specific keywords (e.g., singer names), words that only appeared once, the most popular verbs (after the part-of-speech tagger was used), n-grams, and collocations. This facilitated a more fine-grained linguistic analysis of what was being said about each video.
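The measures mentioned above are simple enough to sketch with the Python standard library alone. The snippet below is an illustration, not the workshop notebook: the regular-expression tokeniser is a crude stand-in for NLTK's tweet tokeniser, and the three comments are invented.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase and split into word tokens (a crude stand-in for a tweet tokeniser)."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Hypothetical comments, invented for this example.
comments = [
    "I love this song so much",
    "this song is so good",
    "jimin is the best",
]

tokens = [tok for comment in comments for tok in tokenise(comment)]

# Lexical diversity: number of distinct word types divided by number of tokens.
lexical_diversity = len(set(tokens)) / len(tokens)

# Frequency distribution, and hapax legomena (words that occur exactly once).
freq = Counter(tokens)
hapaxes = sorted(word for word, count in freq.items() if count == 1)

# Bigrams are counted within each comment so that pairs never span two comments.
bigrams = Counter()
for comment in comments:
    toks = tokenise(comment)
    bigrams.update(zip(toks, toks[1:]))

print(round(lexical_diversity, 3))
print(hapaxes)
print(bigrams.most_common(3))
```

NLTK offers ready-made equivalents for these steps, along with the part-of-speech tagging and collocation measures that a hand-rolled version like this does not cover.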

Lexical Diversity of Four Comment Files

Hapax Legomena in One Comment File

20 Most Frequent Bigrams and Trigrams in One Comment File


Topic Modelling—An Empirical Approach to Theme Detection

The second half of the workshops focused on two extremely popular text analysis techniques: topic modelling and sentiment analysis. Folgert Karsdorp (Meertens Institute) led the topic modelling session, which extracted themes from the K-pop corpus. A ‘theme’ is essentially a collection of words; topic models assign themes to documents based upon the co-occurrences of words in the documents. They operate under a very naïve assumption: a document is defined by the distribution of its vocabulary across various themes; syntax (and thereby context) is not taken into consideration. That being said, this naïve model can generate some powerful insights about a corpus of text that instigate further qualitative analyses.

There are many different types of topic models, and latent Dirichlet allocation (LDA) was chosen for the workshop, given its popularity and effectiveness with digital text. Folgert explained its implementation with collapsed Gibbs sampling using Allen Riddell’s library, which produced more meaningful results than either Rehurek’s gensim library or Pedregosa et al.’s scikit-learn library. He noted that the quality of the results was directly correlated with the amount of time taken to run the model (the Gibbs sampler is computationally intensive). Training a model for the K-pop corpus with 1,500 iterations and 25 topics took a little over an hour on Folgert’s 2015 MacBook Pro. As that is equivalent to the length of the topic modelling session, the results had to be cached for the workshop.
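For readers curious about what a collapsed Gibbs sampler for LDA actually does, the following is a from-scratch sketch in pure Python. It is not Allen Riddell's library and makes no attempt at efficiency; the function name lda_gibbs, the hyperparameter defaults (alpha, eta), and the toy documents are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, eta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs are lists of word strings."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens in doc d with topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v has topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + eta) / (topic_total[t] + n_words * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of document-topic and topic-word distributions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + eta) / (topic_total[k] + n_words * eta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return vocab, theta, phi


# Toy documents, invented for this example.
docs = [
    "love love song voice".split(),
    "song voice love amazing".split(),
    "dance choreography stage dance".split(),
    "stage dance choreography outfit".split(),
]
vocab, theta, phi = lda_gibbs(docs, n_topics=2, n_iter=100)
for k, row in enumerate(phi):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(k, [word for _, word in top])
```

Every token is revisited on every iteration, which is why running time grows with both the corpus size and the number of iterations, and why the workshop results had to be cached.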

Ten Most Likely Words in Topic 19 (Allen Riddell’s LDA Library)

The next step was to visualise the results. This is extremely important, as it impacts the ease of analysis; moreover, many of the participants had limited experience with programming, and the primary objective of the training day was to excite them about the potential of data science techniques for the humanities and social sciences. First, Folgert created a grouped bar chart of the mean topic distributions per K-pop group, which illuminates which topics feature prominently for each group, and allows for the comparison of groups within each topic. For some topics, there was not much difference among the distributions; for other topics there were extreme differences. The former topics tended to be extremely general (e.g., expressions of love), and the latter focused on specific groups (or even specific members of specific groups).

Grouped Bar Chart of Average Topic Distributions for K-pop Groups

Folgert then demonstrated a more advanced interactive visualisation using the third-party library pyLDAvis, which is a Python port of the R package LDAvis created by Carson Sievert and Kenny Shirley. This visualisation facilitates the exploration of topics and their relationships, and consists of two parts. On the left-hand side, the topics are plotted as circles in a two-dimensional intertopic distance map, created via multidimensional scaling (principal components analysis by default); the closer two topics are in the plot, the more similar they are. Moreover, the size of the circle is proportional to how prominent the topic is in the entire corpus. On the right-hand side of the visualisation, when a topic (circle) is selected in the plot, the top 30 most relevant terms are displayed in a horizontal stacked bar chart, ranked by relevance. The bar chart is stacked because it also displays the overall frequency of the term, which provides a sense of how unique the term is to the topic. Needless to say, the participants were all tremendously impressed by this visualisation and the thematic analyses that it facilitated.
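The relevance ranking can be illustrated with a few lines of Python. As far as we understand the definition given by Sievert and Shirley, the relevance of a term to a topic is a weighted sum of the log probability of the term within the topic and the log of its 'lift' (that probability divided by the term's overall probability in the corpus), with a weight lambda between 0 and 1. The probabilities and the value of lambda below are invented for illustration, and the snippet does not call pyLDAvis itself.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical probabilities for three terms in one topic and in the whole corpus.
topic_probs = {"love": 0.20, "choreography": 0.10, "the": 0.25}
corpus_probs = {"love": 0.05, "choreography": 0.002, "the": 0.20}


def rank_terms(lam):
    """Return the terms ordered from most to least relevant for a given lambda."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )


print(rank_terms(1.0))  # ranks purely by probability within the topic
print(rank_terms(0.0))  # ranks purely by lift, favouring topic-specific terms
```

With lambda at 1, a common word such as 'the' rises to the top; lowering lambda pushes up terms that are rare elsewhere in the corpus, which is what makes the ranking useful for interpreting a topic.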

Folgert Karsdorp Explains pyLDAvis

pyLDAvis Interactive Visualisation

Given that the ‘documents’ in the dataset are YouTube comments, in order to fully understand them it is important to watch the videos with which they are associated. Thus, the topic modelling session ended with a simple function that retrieved the ID of the most relevant video for a given topic, and embedded the video within the Jupyter Notebook. This musical interlude was a welcome respite from the technical intensity of the session!

Sentiment Analysis—The Emotionality of Discourse

The last methods workshop of the day introduced SentiStrength, sentiment analysis software designed specifically for social web text that is available to academics for free. It was led by Mike Thelwall (University of Wolverhampton), the developer of the software (as well as many other collection and analysis tools). Mike began the session with an overview of how SentiStrength works: it assigns two scores to each text, one for positive and one for negative sentiment. Positive sentiment ranges from 1 (no positive sentiment) to 5 (extremely positive); negative sentiment ranges from -1 (no negative sentiment) to -5 (extremely negative). This represents a non-binary, multidimensional approach to emotion, as it recognises that messages are capable of containing positive and negative sentiment at the same time.

SentiStrength has a dictionary of 2,489 words that are pre-classified for sentiment strength (e.g., ‘love’ is +3; ‘hate’ is -4), and applies these scores to words when found in a text. The highest positive and highest negative scores are applied to the entire text. Its algorithm uses rules to cope with sentiment expressed or modified in other ways, such as negation (e.g., ‘not happy’), boosting (e.g., ‘very nice!’), emoticons (e.g., ‘:)’), and sentiment spelling (e.g., ‘Yaaaaaay!!!’). This bag-of-words approach might seem somewhat crude, but results from the software have been demonstrated to agree with human coders as much as human coders agree with each other. It must be noted, however, that SentiStrength is very weak at detecting sarcasm, irony, and figurative language in general.
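The dual-score idea can be sketched in a few lines of Python. This is a toy illustration and not SentiStrength's actual algorithm or word list: only the scores for 'love' (+3) and 'hate' (-4) come from the description above, while the other lexicon entries, the booster values, and the way negation is handled are simplifying assumptions.

```python
import re

# Toy lexicon: 'love' and 'hate' are taken from the text; the rest is hypothetical.
LEXICON = {"love": 3, "hate": -4, "happy": 3, "nice": 2, "awful": -4}
NEGATORS = {"not", "never"}         # hypothetical negation words
BOOSTERS = {"very": 1, "really": 1}  # hypothetical boosters and their strength


def score(text):
    """Return (positive, negative): positive in 1..5, negative in -1..-5."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive, negative = 1, -1  # 1 and -1 mean 'no sentiment of that kind'
    for i, tok in enumerate(tokens):
        strength = LEXICON.get(tok)
        if strength is None:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        if prev in NEGATORS:
            # Simplification: a directly negated term contributes nothing.
            continue
        if prev in BOOSTERS:
            # A booster pushes the strength further away from zero.
            strength += BOOSTERS[prev] if strength > 0 else -BOOSTERS[prev]
        # The text as a whole keeps its strongest positive and negative scores.
        positive = max(positive, min(strength, 5))
        negative = min(negative, max(strength, -5))
    return positive, negative


print(score("I love the song but hate the video"))
print(score("very nice"))
print(score("not happy"))
```

Because the two scores are tracked separately, a single comment can be both clearly positive and clearly negative, which a single polarity score would average away.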

Mike subsequently demonstrated the application of SentiStrength using one of the comment files in the K-pop dataset, and the participants collectively assessed the results, focusing specifically on the very strongly positive and negative comments. They were, for the most part, quite accurate—illuminating that a simple model can capture a great deal of the emotion contained in social media text, despite the complexities introduced by idiosyncratic diction, context, and brevity of expression.

Mike Thelwall Introduces SentiStrength

Closing Reflections

The day concluded with a well-deserved drinks reception, followed by dinner for the organisers at a traditional French restaurant. It was an incredibly intense day, filled with technical hurdles and coding frustration, but the overall outcome was well worth the toil. As mentioned above, the intended objective for the practical sessions was not for the participants to learn how to write Python scripts from scratch, but rather to run pre-written code snippets and understand how they work and how they can be modified for other datasets. This was successful on the whole for the majority of participants—indeed, quite a few were inspired to either learn more on their own or initiate collaborations with computational scientists. However, there was of course scope for improvement. This could involve more focus on demonstrating early in a session what the goals were, and what could be achieved with a particular approach, rather than waiting for the dramatic reveal of the results at the end. Also, there could be more points in the exercises where the participants could examine the actual text to better understand the results of a computational procedure, facilitating a type of ‘scalable reading’.

All materials—presentation slides, Jupyter Notebooks, and software installation instructions—are on the GitHub page for the training day. We hope that there will be an opportunity in future to develop them further for a more advanced, and perhaps also more focused, NLP workshop for online text. We are tremendously grateful to both CLARIN ERIC and ULB for making this event possible!



Videos of the workshop can be found on the CLARIN Videolectures channel.

