acWaC-EU is a 90-million-word corpus of English-language web pages crawled from the websites of European universities. The sampled universities are based both in countries where English is a native or official language and in countries where it is not, so this monolingual comparable corpus supports comparisons between native and lingua franca (ELF) varieties of English in the institutional academic domain.
Each text in the corpus (corresponding to a single web page) is annotated with rich metadata, including:
This page provides information on acWaC-EU and the scripts that were used for its construction.
In the near future the corpus will be made available as a set of n-grams, along the lines of, e.g., the Google Books N-gram dataset or the Rovereto Twitter N-Gram Corpus. Stay tuned!
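To give a rough idea of what an n-gram release contains, here is a minimal sketch of n-gram counting. It assumes lowercased, whitespace-split tokens and an invented example sentence; the function name `ngram_counts` is hypothetical, and the tokenization actually used for acWaC-EU is not described on this page.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    # Slide a window of size n over the token list; an input shorter
    # than n yields an empty Counter.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented example text (not taken from the corpus).
text = "the university offers courses and the university offers degrees"
tokens = text.lower().split()  # assumption: naive whitespace tokenization

bigrams = ngram_counts(tokens, 2)
for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq)
```

An n-gram dataset is essentially a table of such counts, typically broken down by metadata fields (here, for instance, country or native/ELF status).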
When referring to the acWaC-EU corpus please cite:
Ferraresi, A. and Bernardini, S. (forthcoming). The academic Web-as-Corpus. In Evert, S., Stemle, E., and Rayson, P. (eds). Proceedings of the 8th Web as Corpus workshop (WAC-8), Lancaster, UK.
The table below presents detailed information on the contents of the acWaC-EU corpus.
| Language family | Country | Status | N. of texts | N. of tokens |
|---|---|---|---|---|
| Baltic | Latvia | ELF | 827 | 432,851 |
| Baltic | Lithuania | ELF | 1,225 | 709,070 |
| Caucasian | Georgia | ELF | 747 | 361,303 |
| Germanic | Austria | ELF | 1,388 | 564,892 |
| Germanic | Denmark | ELF | 2,496 | 1,203,820 |
| Germanic | Faroe Islands | ELF | 37 | 14,803 |
| Germanic | Germany | ELF | 5,243 | 2,376,142 |
| Germanic | Greenland | ELF | 10 | 1,676 |
| Germanic | Iceland | ELF | 353 | 168,478 |
| Germanic | Netherlands | ELF | 3,676 | 1,633,428 |
| Germanic | Norway | ELF | 2,441 | 1,262,458 |
| Germanic | Sweden | ELF | 5,486 | 2,683,997 |
| Germanic-Romance | Belgium | ELF | 1,938 | 1,208,444 |
| Germanic-Romance | Luxembourg | ELF | 71 | 27,030 |
| Hellenic | Greece | ELF | 1,439 | 874,852 |
| Hellenic-Turkic | Cyprus | ELF | 487 | 272,728 |
| Illyric | Albania | ELF | 250 | 143,699 |
| Romance | France | ELF | 6,528 | 3,624,400 |
| Romance | Italy | ELF | 3,275 | 1,806,499 |
| Romance | Moldova | ELF | 191 | 113,117 |
| Romance | Monaco | ELF | 45 | 78,521 |
| Romance | Portugal | ELF | 586 | 290,798 |
| Romance | Romania | ELF | 775 | 467,699 |
| Romance | San Marino | ELF | 9 | 4,234 |
| Romance | Spain | ELF | 5,438 | 4,230,012 |
| Romance-Germanic | Switzerland | ELF | 2,124 | 1,069,884 |
| Semitic | Israel | ELF | 1,251 | 669,595 |
| Slavic | Belarus | ELF | 882 | 676,909 |
| Slavic | Bosnia and Herzegovina | ELF | 653 | 356,565 |
| Slavic | Bulgaria | ELF | 776 | 517,524 |
| Slavic | Croatia | ELF | 249 | 142,719 |
| Slavic | Czech Republic | ELF | 1,930 | 1,220,770 |
| Slavic | Macedonia | ELF | 570 | 391,458 |
| Slavic | Montenegro | ELF | 50 | 31,529 |
| Slavic | Poland | ELF | 2,460 | 1,421,953 |
| Slavic | Russia | ELF | 5,805 | 3,557,076 |
| Slavic | Serbia | ELF | 376 | 169,094 |
| Slavic | Slovakia | ELF | 570 | 267,372 |
| Slavic | Slovenia | ELF | 570 | 345,730 |
| Slavic | Ukraine | ELF | 2,134 | 1,564,804 |
| Thracian | Armenia | ELF | 404 | 211,865 |
| Turkic | Azerbaijan | ELF | 511 | 318,519 |
| Turkic | Turkey | ELF | 2,100 | 1,073,657 |
| Uralic | Estonia | ELF | 853 | 540,089 |
| Uralic | Finland | ELF | 3,072 | 1,981,841 |
| Uralic | Hungary | ELF | 995 | 612,406 |
| Germanic | United Kingdom | Native | 61,465 | 41,911,277 |
| Germanic-Celtic | Ireland | Native | 6,236 | 3,945,831 |
| Germanic-Celtic | Isle of Man | Native | 50 | 28,475 |
| Germanic-Semitic | Malta | Native | 260 | 286,846 |
The plot below shows the number of tokens (in %) for the “main” language families featured in acWaC-EU (i.e. those with 500K+ tokens).
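The percentages behind such a plot can be derived from the table above. The following is a minimal sketch, not the script used for the actual plot. To stay short it includes only a handful of rows copied from the table, so the printed shares are relative to that subset rather than to the full corpus. It also assumes that families are grouped by their label alone and that shares are taken over all rows supplied, including families below the threshold. The function name `family_token_shares` is hypothetical.

```python
from collections import defaultdict

# (language family, country, tokens): a SUBSET of rows copied from the table.
ROWS = [
    ("Baltic", "Latvia", 432_851),
    ("Baltic", "Lithuania", 709_070),
    ("Caucasian", "Georgia", 361_303),
    ("Hellenic", "Greece", 874_852),
    ("Thracian", "Armenia", 211_865),
    ("Uralic", "Estonia", 540_089),
    ("Uralic", "Finland", 1_981_841),
    ("Uralic", "Hungary", 612_406),
]

MAIN_THRESHOLD = 500_000  # "main" families are those with 500K+ tokens


def family_token_shares(rows, threshold=MAIN_THRESHOLD):
    """Return {family: % of all tokens} for families at or above the threshold."""
    totals = defaultdict(int)
    for family, _country, tokens in rows:
        totals[family] += tokens
    # Assumption: the denominator is the total over every row supplied,
    # not only over the families that pass the threshold.
    grand_total = sum(totals.values())
    return {
        family: 100 * tokens / grand_total
        for family, tokens in totals.items()
        if tokens >= threshold
    }


shares = family_token_shares(ROWS)
for family, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{family}: {pct:.1f}%")
```

Feeding in all rows of the table instead of the subset gives corpus-wide shares under the same assumptions.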
Here you can download the main scripts that were used to build acWaC-EU — more detailed documentation will follow (hopefully soon):