SB-10k: German Sentiment Corpus
SB-10k is a publicly available corpus that contains 9738 German tweets, each labeled by 3 annotators with “positive”, “negative”, “neutral”, “mixed”, or “unknown”. It was created by SpinningBytes in collaboration with the Zurich University of Applied Sciences (ZHAW).
The corpus is provided under the Creative Commons license CC BY 4.0.
This means that it is free to use and distribute, even commercially, as long as appropriate credit is given to the reference below.
Human-readable format: Link
Licence Contract: Link
If you use the corpus, please make sure to reference the following publication:
- A Twitter Corpus and Benchmark Resources for German Sentiment Analysis, by Mark Cieliebak, Jan Deriu, Fatih Uzdilli, and Dominic Egger. In “Proceedings of the 4th International Workshop on Natural Language Processing for Social Media (SocialNLP 2017)”, Valencia, Spain, 2017.
A detailed description of the corpus and how it was constructed can be found in the reference above.
To use the corpus, download the annotations below. Since Twitter does not allow the content of tweets to be distributed, the dataset contains only tweet IDs (first column) and the corresponding annotations (second column). A Python script to download the tweet content for the IDs can be found here*.
*On Windows, you might have to comment out the “signal.alarm(…)” calls in download_tweets_api.py to get the script to work.
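As a rough illustration of the two-column layout described above (tweet ID, then annotation), the following sketch parses such rows into a dictionary and counts the labels. It is not part of the official tooling: the tab delimiter, the function name `load_annotations`, and the sample IDs are assumptions made for this example, so adjust them to match the file you actually downloaded.

```python
import csv
import io
from collections import Counter


def load_annotations(lines, delimiter="\t"):
    """Parse rows of (tweet ID, label) into a dict keyed by tweet ID.

    The tab delimiter is an assumption; pass another one if the
    downloaded file uses a different separator.
    """
    annotations = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        # Keep IDs as strings so they can be passed on unchanged.
        tweet_id, label = row[0].strip(), row[1].strip()
        annotations[tweet_id] = label
    return annotations


# Hypothetical sample rows standing in for the real annotation file.
sample = io.StringIO(
    "1001\tpositive\n"
    "1002\tneutral\n"
    "1003\tnegative\n"
    "1004\tneutral\n"
)

annotations = load_annotations(sample)
label_counts = Counter(annotations.values())
print(label_counts.most_common())
```

To read a real file, an open file handle can be passed in place of the `io.StringIO` sample, for example `with open(path, newline="", encoding="utf-8") as f: load_annotations(f)`, where `path` points to the downloaded annotations.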