In recent years, Twitter has become an increasingly valuable source of data for research purposes. As a micro-blogging service, it lets users send short messages of no more than 280 characters. The result is condensed messages containing information about current events, cultural trends, or user preferences. Combined with expertise in statistics and machine learning, Twitter is a goldmine for researchers and businesses alike, allowing them to track market trends and make predictions based on real-time data from across the world. While much Twitter data has been collected and analyzed for countries such as the U.S. and Germany, Swiss tweets have not received much attention. Yet Switzerland offers a highly multicultural, multilingual setting for social media, in which posts are made in various languages, including English, German, French and Italian. This offers an interesting insight into a unique environment from a linguistic, cultural and economic perspective. It also raises an important question: how does one determine the “Swissness” of a tweet? To answer this question, SpinningBytes decided to build the Swiss Twitter Corpus, a continuously collected compilation of tweets related to Switzerland.
Twitter Corpus Architecture
SpinningBytes started compiling tweets in January 2018. Using the Twitter API and Elasticsearch, our engineers implemented a Twitter crawler that continuously collects tweets which are relevant to Switzerland and capture what’s going on in and around the country.
Each retrieved tweet contains not only the tweet text but also a large amount of meta information. Delivered as a JSON object, it includes information on the user, such as the language used, the geolocation at the time of the tweet and the registered location of the user account, as well as information on the tweet itself, such as whether it was a reply or a retweet.
Example of Tweet as JSON Object
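To make the structure of such a JSON object more concrete, here is a minimal Python sketch that pulls a handful of the fields mentioned above out of a tweet. The field names follow the Twitter API v1.1 tweet object; the sample tweet, its values and the `summarize_tweet` helper are invented for illustration and are not taken from our crawler.

```python
import json

# Hypothetical, heavily trimmed tweet. Field names follow the Twitter API v1.1
# tweet object; all values are made up for this example.
RAW_TWEET = """
{
  "id_str": "1",
  "text": "Greetings from Zermatt!",
  "lang": "de",
  "in_reply_to_status_id_str": null,
  "user": {"screen_name": "example_user", "location": "Zürich, Schweiz"},
  "place": {"country_code": "CH", "full_name": "Zermatt, Schweiz"}
}
"""


def summarize_tweet(tweet: dict) -> dict:
    """Extract a few metadata fields; missing fields become None or False."""
    user = tweet.get("user") or {}
    place = tweet.get("place") or {}  # "place" is null for most tweets
    return {
        "text": tweet.get("text", ""),
        "language": tweet.get("lang"),
        "screen_name": user.get("screen_name"),
        "user_location": user.get("location"),
        "country_code": place.get("country_code"),
        # A reply carries the id of the tweet it answers.
        "is_reply": tweet.get("in_reply_to_status_id_str") is not None,
        # A retweet embeds the original tweet under "retweeted_status".
        "is_retweet": "retweeted_status" in tweet,
    }


summary = summarize_tweet(json.loads(RAW_TWEET))
print(summary)
```

Reading every nested object defensively matters in practice, because optional parts such as the place are absent from many tweets.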
To filter relevant Swiss tweets, we first needed to work out what indicates “Swissness”. This is not a trivial task: is a tweet about Credit Suisse or UBS Swiss? Is a tweet about Donald Trump Swiss if it was sent from Switzerland? Are tweets written in a foreign language at the airport while passing through Switzerland relevant? We quickly noticed that opinions within our team differed, so we decided to create an initial filtering system that generously includes as many tweets as possible, as long as each one matches at least one of our filtering constraints for “Swissness”. For this purpose, we added custom attributes to Twitter’s already lengthy list of 150 attributes and assess the following criteria for each tweet: whether the country code is Switzerland’s, whether the language is a Swiss language (including Swiss German), and whether the username is on a custom-made list of “Swiss” usernames, including politicians and celebrities.
Example of Tweet with Swiss user location
We also check for specific words, using a custom list we collected. It contains keywords such as the names of Swiss tourist destinations (e.g. Zermatt), companies (e.g. Migros), politicians (e.g. Eveline Widmer-Schlumpf), food (e.g. Rösti), locations and landmarks (e.g. Matterhorn), and common Swiss expressions (e.g. Samichlaus). In total, we initially consider three attributes: whether the geolocation is in Switzerland, whether the tweet is linked to a public Swiss account, and whether a keyword from our list was found in the tweet.
Swiss Twitter Corpus Architecture
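The filtering criteria described above can be sketched in a few lines of Python. This is an illustration under assumptions, not our production code: the language codes, the username list, the shortened keyword list and the `matched_criteria` and `is_candidate` helpers are all hypothetical, and the tweet fields again follow the Twitter API v1.1 naming.

```python
import re

# Assumption: national languages as reported in the tweet's "lang" field.
# Swiss German has no language code of its own there, so detecting it would
# need a separate language identifier, which is not shown here.
SWISS_LANGUAGES = {"de", "fr", "it"}

# Hypothetical stand-ins for the custom lists described in the text.
SWISS_USERNAMES = {"example_swiss_politician"}
SWISS_KEYWORDS = {"zermatt", "migros", "rösti", "matterhorn", "samichlaus"}

WORD_RE = re.compile(r"\w+")


def matched_criteria(tweet: dict) -> set:
    """Return the names of all "Swissness" criteria that a tweet satisfies."""
    matched = set()
    place = tweet.get("place") or {}
    user = tweet.get("user") or {}

    if place.get("country_code") == "CH":
        matched.add("country_code")
    if tweet.get("lang") in SWISS_LANGUAGES:
        matched.add("language")
    if (user.get("screen_name") or "").lower() in SWISS_USERNAMES:
        matched.add("username")

    # Single-word matching only; multi-word keywords such as full names
    # would need phrase matching instead.
    words = {w.lower() for w in WORD_RE.findall(tweet.get("text", ""))}
    if words & SWISS_KEYWORDS:
        matched.add("keyword")
    return matched


def is_candidate(tweet: dict) -> bool:
    """Generous filter: keep a tweet as soon as one criterion matches."""
    return bool(matched_criteria(tweet))


example = {"text": "Hiking near the Matterhorn today", "lang": "en",
           "user": {"screen_name": "someone"}, "place": None}
print(matched_criteria(example))
```

Returning the set of matched criteria, rather than a plain yes or no, keeps the information needed to weigh the criteria against each other later.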
Each day, we collect 20,000 tweets on average and use further keyword filtering to give each tweet a “Swissness” score, depending on the occurrence and combination of the attributes found. Together, these tweets make up the Swiss Twitter Corpus. To perform searches, we use Kibana, an open-source visualization tool for Elasticsearch, which allows us to visualize the data, run queries and navigate through the corpus. With it, we can find all tweets within a certain time frame, restrict a search to certain attributes and easily create statistics.
Graph showing distribution of tweets collected within one month
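As a rough illustration of the scoring and search steps, the sketch below assigns a score from the matched attributes and builds an Elasticsearch query that restricts tweets to a time frame and a minimum score, then counts them per day. The weights, the index field names `created_at` and `swissness_score`, and both helper functions are assumptions made for this example; the actual scoring rules are not spelled out here. The `calendar_interval` parameter is the name used by Elasticsearch 7.2 and later; older versions use `interval`.

```python
import json
from datetime import date

# Hypothetical weights per matched attribute; the real scoring may differ.
WEIGHTS = {"country_code": 3, "username": 2, "keyword": 2, "language": 1}


def swissness_score(matched: set) -> int:
    """Sum the weights of all matched attributes (unknown names count as 0)."""
    return sum(WEIGHTS.get(name, 0) for name in matched)


def build_query(start: date, end: date, min_score: int) -> dict:
    """Build a search body: time frame + minimum score, with daily counts."""
    return {
        "size": 0,  # only the aggregation is needed, not the tweets themselves
        "query": {
            "bool": {
                "filter": [
                    {"range": {"created_at": {"gte": start.isoformat(),
                                              "lte": end.isoformat()}}},
                    {"range": {"swissness_score": {"gte": min_score}}},
                ]
            }
        },
        "aggs": {
            "tweets_per_day": {
                "date_histogram": {"field": "created_at",
                                   "calendar_interval": "day"}
            }
        },
    }


score = swissness_score({"country_code", "keyword"})
query = build_query(date(2018, 5, 1), date(2018, 5, 31), min_score=3)
print(score)
print(json.dumps(query, indent=2))
```

A body like this could be sent to the search endpoint of the index or pasted into the Kibana console; the daily buckets it returns are the kind of data a distribution graph is built from.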
Further Development and Applications
As of today, our corpus consists of approximately 3 million tweets and is continuously growing (you can find a live feed of the crawler here). We are currently continuing to explore the question “What makes a national tweet?” by evaluating the data and attributes used to filter and label the tweets, thereby gaining insights into the Swiss Twitter community. By running sentiment analysis or other text analysis tools over the data, we are able to learn about Swiss preferences and opinions, finding answers to questions such as “What is the happiest canton?” and “Where in Switzerland do people tweet most … and about what?”.
Popular Topics in Switzerland in May 2018
The Swiss Twitter Corpus is available for academic and commercial use. For example, you could perform a sociocultural analysis of tweets by canton, search for the “word of the year” in Switzerland, track attitudes towards the national soccer team live, or find out how your business is represented in the Swiss Twitter community. In fact, there are endless possible use cases! Are you interested in finding out more about our corpus? Visit the website, or contact us if you would like to gain access.