When Corpus Linguistics Meets Data Science: An Interview with Professor Łukasz Grabowski

Professor Łukasz Grabowski’s seminar delves into the growing intersection between corpus linguistics and data science, revealing how linguists increasingly adopt computational and statistical perspectives. His research highlights how data cleaning, visualization, and predictive modelling reshape the way we examine language and translation. In this interview, he reflects on emerging methodological shifts, insights from his current project, and the skills future linguists need to thrive in a data-driven academic landscape.

Your seminar explores the overlap between corpus linguistics and data science. In what ways do you think corpus linguists are beginning to think like data scientists, and how does this shift influence the kinds of questions we ask about language?

I think that when corpus linguists like me do linguistic research, the primary goal is to learn or find out something interesting about language itself, be it a language system or language in use. I am particularly interested in studying how language is used by translators, interpreters, foreign language learners, social or professional groups etc., and what research methods to use for that purpose. Data scientists, on the other hand, conduct text analysis, often by extracting meaningful insights from unstructured textual data, to solve practical problems which are typically not linguistic ones. For example, they may want to identify trends or sentiment, that is, emotional tone, in the data to obtain a better picture of customer opinions, market insights etc. But to obtain reliable findings we need high-quality data. That is why I think that corpus linguists have started to better understand how important it is to carefully prepare the data for statistical analyses; they have become more attentive to detail when cleaning, filtering or annotating linguistic data. Also, traditional corpus linguistic research has been largely descriptive, but nowadays we can see that corpus linguists, similarly to data scientists, also conduct explanatory research or predictive modelling to better understand or forecast linguistic behaviour. I have also noticed that data visualization techniques are getting more and more popular. That is very useful because sometimes a single boxplot or scatterplot may reveal a lot more about an aspect of language use that we are interested in than a longish description or a spreadsheet. It seems that the popular adage “A picture is worth a thousand words” is particularly relevant here.
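To give a flavour of the kind of quick visual check described above, here is a minimal sketch in Python; the data and labels are invented for illustration, and the point is simply that a single boxplot can make a distributional difference visible at a glance.

```python
# A minimal, illustrative sketch: comparing sentence lengths (in tokens)
# across two hypothetical subcorpora with a single boxplot.
import matplotlib.pyplot as plt

# Invented toy data standing in for real corpus measurements.
source_lengths = [12, 18, 25, 9, 31, 22, 17, 14, 27, 20]
translation_lengths = [10, 15, 21, 8, 26, 19, 13, 12, 23, 16]

fig, ax = plt.subplots()
ax.boxplot([source_lengths, translation_lengths])
ax.set_xticks([1, 2])
ax.set_xticklabels(["source texts", "translations"])
ax.set_ylabel("sentence length (tokens)")
ax.set_title("Sentence length by subcorpus")
plt.show()
```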

How can data visualization and statistical modeling help us better understand the creative and cognitive processes that shape literary translation?

That is a really broad question; we have really opened a Pandora’s box here, because multifactorial research methods, accompanied by data visualization techniques, have recently been used by researchers dealing with authorship attribution, translatorial attribution and (computational) stylistics, to name but a few areas. [NB! Nowadays such analyses are also enhanced by machine learning or deep learning techniques; a well-known example is the study conducted in 2019 by Petr Plecháč, from the Czech Academy of Sciences in Prague, on the authorship of Shakespeare’s “Henry VIII”.] In my current project, where apart from corpus-based approaches we also employ multifactorial statistics, such a combination of research methods offers an opportunity to model or even predict the translator’s decisions, and in this way contribute to the body of research on translator’s style or the translator’s idiolect.
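For readers unfamiliar with how such attribution studies work in practice, a very rough sketch of the basic idea is a classifier trained on frequencies of common function words; the snippet below is a toy Python illustration with invented placeholder texts and hypothetical authors, not the method used in the 2019 study mentioned above.

```python
# A toy stylometric sketch: predicting authorship from function-word frequencies.
# Drastically simplified illustration on invented placeholder passages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented passages labelled with hypothetical authors A and B.
passages = [
    "and the which of that in to he it was",
    "of and in the to that with his for as",
    "the of to in and a that is was he",
    "in the and of to is a for it on",
]
authors = ["A", "A", "B", "B"]

# Restricting features to a fixed list of function words is a classic stylometric move.
function_words = ["the", "and", "of", "to", "in", "that", "he", "was", "is", "a"]
classifier = make_pipeline(
    CountVectorizer(vocabulary=function_words),
    LogisticRegression(max_iter=1000),
)
classifier.fit(passages, authors)
print(classifier.predict(["the of and in to he was that it a"]))  # predicted author
```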

Your ongoing project, developed with colleagues from Vilnius University, is now well underway. How has the research progressed so far — and have any unexpected patterns or findings emerged along the way?

First of all, I am really honoured that, thanks to the support of the Polish National Agency for Academic Exchange (NAWA) and Vilnius University, I have an opportunity to conduct this project with such a professional, devoted and knowledgeable research team here at the Faculty of Philology at Vilnius University, Dr Anna Ruskan and Dr Audronė Šolienė. Vilnius is such a nice city for a long research stay, especially now when the Christmas atmosphere fills every nook and cranny of the picturesque streets in the old town. Of course, I am also grateful to my home university in Opole, Poland, for letting me do my research here in Vilnius this semester.

Now, speaking about the project, I reckon that the most time-consuming part is almost over. We finished the theoretical overviews, refined our methodology and completed a lengthy and labour-intensive process of linguistic data annotation. We also completed the first large-scale analysis of two novels, that is, 1984 by G. Orwell and Brave New World by A. Huxley, and their Lithuanian translations. Our goal was not only to describe translation tendencies, but also to explain which factors, out of the six taken into consideration, impacted the avoidance or preservation of repetitions of English-original reporting verbs signalling direct speech of literary characters in Lithuanian translations. It is difficult to summarize all the detailed results in a short interview; a research paper is definitely better for that, and we hope it will come out in print next year. But our innovative methodology [involving a logistic regression with mixed effects] helped us reveal that repetition avoidance is not a random or purely stylistically motivated tendency, but one conditioned by specific linguistic factors. For example, in the case of Brave New World we showed that when the translator intervenes more substantially, by shifting stance or reworking syntax, or when source-text reporting verbs are accompanied by adverbial modifiers, there is a strong tendency to break source-text repetition patterns and increase lexical variety in translation. It seems that, overall, the translators use reporting verbs as a flexible space in which they can subtly express their own stance, a pragmatic fingerprint, by strengthening or softening character voices, possibly in ways that reflect stylistic conventions characteristic of Lithuanian literary discourse.
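To make the modelling part a little more concrete, the sketch below shows, in Python and on invented toy data, what a simple fixed-effects version of such a model might look like. The actual study uses a mixed-effects logistic regression, with random effects (for example per reporting verb or per novel) of the kind typically fitted with lme4 in R, so this is only an illustration of the general logic, with hypothetical predictor names.

```python
# Illustrative only: modelling whether a source-text repetition of a reporting
# verb is broken in translation, as a function of two binary predictors.
# The real study uses mixed-effects logistic regression; here only fixed
# effects are shown, on invented data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "repetition_broken":  [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0],  # 1 = repetition avoided
    "adverbial_modifier": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1],  # verb modified by an adverbial
    "stance_shift":       [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],  # translator shifts stance
})

model = smf.logit("repetition_broken ~ adverbial_modifier + stance_shift", data=df).fit()
print(model.summary())  # positive coefficients raise the odds of breaking the repetition
```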

Finally, your talk also touches on the skills needed for corpus-based research in the 2020s. What balance between linguistic knowledge and data literacy do you see as most essential for aspiring researchers in this evolving field?

That is a very important question. I think that linguistic knowledge is very important because it may provide a critical outlook on current developments in corpus linguistics and natural language processing, as well as on the development and use of large language models. Linguists know a lot about how complex communication and human interaction are, and they know that there are aspects of them that are difficult to formalize in an algorithmic way. Also, the question whether we should treat AI-generated texts in the same way as texts written by humans is far from trivial, let alone in practical or everyday contexts of language use. As for doing research, there are a number of additional challenges. Contemporary science has become more interdisciplinary than ever before, so even though corpus linguists are fundamentally linguists in the first place, they should, in my opinion, also possess sufficient collaboration skills, including the soft skills so important nowadays, to work together with specialists from other disciplines or stakeholders from different sectors of society, including business, culture, technology etc. Since corpus linguistics has become increasingly embedded in statistics and data science, some technical skills would not do any harm either, although it is difficult to expect that a graduate of philology would master them at the level expected on the job market. Nevertheless, some background in statistics and programming, in particular in Python or R environments, may help one better understand, at least conceptually, the specificity of linguistic data, for example types of variables, make sense of metrics found in descriptive and inferential statistics, estimate the quality of a statistical model or interpret visualizations of linguistic data, to name but a few. In short, we are linguists in the first place, but we should be able to talk and collaborate with specialists from other areas, and the right balance will emerge from the specificity of a given project or task.

It is common knowledge that we are living in a world where information or data deluge is part and parcel of everyday life, and technological developments, including the wider adoption of AI, will contribute even further to the growth of the datasphere in the near future. That is why I think that critical thinking instruction at universities should also include a module or training on how to critically approach and interpret various types of data, including linguistic data. Not only may it help fight fake news, misinformation or disinformation, among other things, but it may also allow students to obtain transferable skills so important on the fast-changing job market.
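As a small example of the kind of statistical literacy mentioned above, consider a routine question in corpus work: is a word significantly more frequent in corpus A than in corpus B? A minimal Python sketch, with invented counts, might look as follows.

```python
# Illustrative sketch: chi-square test on a 2x2 contingency table of
# word vs. non-word token counts in two corpora (all counts are invented).
from scipy.stats import chi2_contingency

word_in_A, total_A = 150, 100_000
word_in_B, total_B = 90, 120_000

table = [
    [word_in_A, total_A - word_in_A],
    [word_in_B, total_B - word_in_B],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")  # a small p-value suggests a genuine frequency difference
```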