Word embeddings are used in Natural Language Processing (NLP) to map words to vector representations. They are used, for instance, in deep learning algorithms for named entity extraction, sentiment analysis or chatbots.
All word embeddings are provided under Creative Commons License CC BY 4.0.
This means that they are free to use and distribute, even commercially, as long as appropriate credit is given to one of the references below.
Human-readable format: Link
License Contract: Link
If you use any of the word embeddings, please make sure to reference at least one of the following publications:
- A Twitter Corpus and Benchmark Resources for German Sentiment Analysis, by Mark Cieliebak, Jan Deriu, Fatih Uzdilli, and Dominic Egger. In “Proceedings of the 4th International Workshop on Natural Language Processing for Social Media (SocialNLP 2017)”, Valencia, Spain, 2017
- Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification, by Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. In “Proceedings of the 26th International World Wide Web Conference (WWW-2017)”, Perth, Australia, 2017
We provide word embeddings for various languages. The following table gives an overview of the available embeddings.
We trained our word embeddings on different text types, such as Tweets and Wikipedia. The text type influences how the embeddings perform for the NLP task at hand. For instance, in the case of sentiment analysis, word embeddings trained on News or Tweets tend to achieve better results than those trained on Wikipedia. For a detailed analysis of how to select proper word embeddings, see the following research article:
Potential and Limitations of Cross-Domain Sentiment Classification, by Dirk von Grünigen, Martin Weilenmann, Jan Deriu, and Mark Cieliebak (SocialNLP-2017).
We provide pre-trained word embeddings with different vector lengths (e.g. 52 and 200 dimensions). Typically, higher-dimensional embeddings yield better quality, but they require substantially more memory and disk space.
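The space trade-off can be made concrete with a quick back-of-the-envelope calculation. The sketch below assumes float32 storage (4 bytes per value); the vocabulary size of 1 million words is a hypothetical figure for illustration:

```python
# Approximate memory footprint of an embedding matrix:
# vocabulary_size x dimensions x bytes per value (float32 assumed).
def embedding_size_mb(vocab_size, dims, bytes_per_value=4):
    return vocab_size * dims * bytes_per_value / (1024 ** 2)

# Hypothetical vocabulary of 1 million words:
print(embedding_size_mb(1_000_000, 52))   # ~198 MB
print(embedding_size_mb(1_000_000, 200))  # ~763 MB
```

For the same vocabulary, the 200-dimensional embeddings take roughly four times the space of the 52-dimensional ones.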
The embeddings are stored either in a folder or as a standalone file. The folder structure consists of:
- bigram: finds bi-grams in a sentence
- trigram: applied to a sentence in which bi-grams have already been found, this transformer finds tri-grams
- config.json: lists the hyperparameters used to create the word embeddings
- embedding_file: main file with the corresponding vector for each word
- embedding_matrix.npy: NumPy matrix that encodes the embeddings; each row is one vector
- vocabulary.pickle: index that maps each word to a unique id (the id is the row at which the word's vector is stored in embedding_matrix.npy)
If the download consists of a single file, that file corresponds to the embedding_file described above.
Unless mentioned otherwise, our word embeddings are trained with Word2Vec.
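A standalone embedding_file can be read line by line. The sketch below assumes the common word2vec text format, where each line holds a word followed by its vector components, optionally preceded by a header line with vocabulary size and dimensionality; the exact format of the provided files may differ:

```python
import io

import numpy as np

def load_text_embeddings(lines):
    """Parse word2vec-style text lines of the form 'word v1 v2 ... vn'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) == 2:  # optional header line: "<vocab_size> <dims>"
            continue
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy input in the assumed format (2-dimensional vectors):
sample = io.StringIO("3 2\nthe 0.1 0.2\ncat 0.3 0.4\nsat 0.5 0.6\n")
emb = load_text_embeddings(sample)
print(cosine(emb["cat"], emb["sat"]))
```

Cosine similarity between vectors is a typical way to measure how semantically related two words are under such embeddings.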
Word Embeddings from Tweets
Download Word Embeddings trained with Word2Vec on 200 million English Tweets using 200 dimensions.
Download Word Embeddings trained with Word2Vec on 590 million English Tweets using 52 dimensions.
Download Word Embeddings trained with fastText on 50 million German Tweets using 100 dimensions.
Download Word Embeddings trained with fastText on 50 million German Tweets using 300 dimensions.
Download Word Embeddings trained with Word2Vec on 300 million French Tweets using 52 dimensions.
Download Word Embeddings trained with Word2Vec on 200 million German Tweets using 200 dimensions.
Download Word Embeddings trained with Word2Vec on 300 million German Tweets using 52 dimensions.
Download Word Embeddings trained with Word2Vec on 200 million Spanish Tweets using 200 dimensions.
Download Word Embeddings trained with Word2Vec on 300 million multilingual Tweets using 52 dimensions.
Download Word Embeddings trained with Word2Vec on 800 million multilingual Tweets using 200 dimensions.
Word Embeddings from Wikipedia articles
Download Word Embeddings trained with Word2Vec on 4.5 million English Wikipedia articles using 200 dimensions.
Download Word Embeddings trained with fastText on 2 million German Wikipedia articles using 300 dimensions.
Download Word Embeddings trained with Word2Vec on 1.3 million Italian Wikipedia articles using 200 dimensions.
Download Word Embeddings trained with Word2Vec on 1.3 million Italian Wikipedia articles using 52 dimensions.
We also offer a free corpus of German tweets annotated for sentiment, available for Download. It contains tweet IDs and their sentiment labels.
Please refer to the README.md as well as to the Annotator_Instructions.pdf.