acWaC-EU is a 90-million-word corpus of English-language web pages crawled from the websites of European universities. The universities sampled are based both in countries where English is a native or official language and in countries where it is not. This monolingual comparable corpus therefore affords comparisons of native and lingua franca (ELF) varieties of English in the institutional academic domain.

Each text in the corpus (corresponding to a single web page) is annotated with rich metadata.

This page provides information on acWaC-EU and the scripts that were used for its construction.

In the near future the corpus will be made available as a set of n-grams, along the lines of, e.g., the Google Books N-gram dataset or the Rovereto Twitter N-Gram Corpus. Stay tuned!
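For readers unfamiliar with n-gram datasets, the following is a minimal sketch of how an n-gram frequency list can be derived from tokenised text. It is not the procedure that will be used for acWaC-EU: the function name `ngram_counts`, the whitespace tokenisation, the lower-casing and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy input, NOT taken from acWaC-EU. Lower-casing and splitting on
# whitespace are simplifying assumptions; a real pipeline would use a
# proper tokeniser.
text = "the university offers courses in english and the university offers support"
tokens = text.lower().split()

bigrams = ngram_counts(tokens, 2)

# Print the most frequent bigrams as tab-separated "n-gram<TAB>frequency" lines.
for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq, sep="\t")
```

An n-gram release of this kind exposes frequencies of short word sequences without redistributing the full running text of the crawled pages.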

 

When referring to the acWaC-EU corpus please cite:

Ferraresi, A. and Bernardini, S. (forthcoming). The academic Web-as-Corpus. In Evert, S., Stemle, E., and Rayson, P. (eds). Proceedings of the 8th Web as Corpus workshop (WAC-8), Lancaster, UK.

Corpus composition

The table below presents detailed information on the contents of the acWaC-EU corpus.

 

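| Language family | Country | Status | N. of texts | N. of tokens |
|---|---|---|---|---|
| Baltic | Latvia | ELF | 827 | 432,851 |
| Baltic | Lithuania | ELF | 1,225 | 709,070 |
| Caucasian | Georgia | ELF | 747 | 361,303 |
| Germanic | Austria | ELF | 1,388 | 564,892 |
| Germanic | Denmark | ELF | 2,496 | 1,203,820 |
| Germanic | Faroe Islands | ELF | 37 | 14,803 |
| Germanic | Germany | ELF | 5,243 | 2,376,142 |
| Germanic | Greenland | ELF | 10 | 1,676 |
| Germanic | Iceland | ELF | 353 | 168,478 |
| Germanic | Netherlands | ELF | 3,676 | 1,633,428 |
| Germanic | Norway | ELF | 2,441 | 1,262,458 |
| Germanic | Sweden | ELF | 5,486 | 2,683,997 |
| Germanic-Romance | Belgium | ELF | 1,938 | 1,208,444 |
| Germanic-Romance | Luxembourg | ELF | 71 | 27,030 |
| Hellenic | Greece | ELF | 1,439 | 874,852 |
| Hellenic-Turkic | Cyprus | ELF | 487 | 272,728 |
| Illyric | Albania | ELF | 250 | 143,699 |
| Romance | France | ELF | 6,528 | 3,624,400 |
| Romance | Italy | ELF | 3,275 | 1,806,499 |
| Romance | Moldova | ELF | 191 | 113,117 |
| Romance | Monaco | ELF | 45 | 78,521 |
| Romance | Portugal | ELF | 586 | 290,798 |
| Romance | Romania | ELF | 775 | 467,699 |
| Romance | San Marino | ELF | 9 | 4,234 |
| Romance | Spain | ELF | 5,438 | 4,230,012 |
| Romance-Germanic | Switzerland | ELF | 2,124 | 1,069,884 |
| Semitic | Israel | ELF | 1,251 | 669,595 |
| Slavic | Belarus | ELF | 882 | 676,909 |
| Slavic | Bosnia and Herzegovina | ELF | 653 | 356,565 |
| Slavic | Bulgaria | ELF | 776 | 517,524 |
| Slavic | Croatia | ELF | 249 | 142,719 |
| Slavic | Czech Republic | ELF | 1,930 | 1,220,770 |
| Slavic | Macedonia | ELF | 570 | 391,458 |
| Slavic | Montenegro | ELF | 50 | 31,529 |
| Slavic | Poland | ELF | 2,460 | 1,421,953 |
| Slavic | Russia | ELF | 5,805 | 3,557,076 |
| Slavic | Serbia | ELF | 376 | 169,094 |
| Slavic | Slovakia | ELF | 570 | 267,372 |
| Slavic | Slovenia | ELF | 570 | 345,730 |
| Slavic | Ukraine | ELF | 2,134 | 1,564,804 |
| Thracian | Armenia | ELF | 404 | 211,865 |
| Turkic | Azerbaijan | ELF | 511 | 318,519 |
| Turkic | Turkey | ELF | 2,100 | 1,073,657 |
| Uralic | Estonia | ELF | 853 | 540,089 |
| Uralic | Finland | ELF | 3,072 | 1,981,841 |
| Uralic | Hungary | ELF | 995 | 612,406 |
| Germanic | United Kingdom | Native | 61,465 | 41,911,277 |
| Germanic-Celtic | Ireland | Native | 6,236 | 3,945,831 |
| Germanic-Celtic | Isle of Man | Native | 50 | 28,475 |
| Germanic-Semitic | Malta | Native | 260 | 286,846 |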

 

The plot below shows the percentage of tokens accounted for by each of the “main” language families featured in acWaC-EU (i.e. those with 500K+ tokens).
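The percentages can be recomputed from the token counts in the composition table. The sketch below is one way to do so and is not the script used to produce the plot. It rests on two assumptions: that the two "Germanic" groups in the table (the ELF countries and the United Kingdom) count as a single family, and that each share is taken relative to the whole corpus rather than to the main families only. The names `TOKENS_BY_FAMILY`, `MAIN_FAMILY_THRESHOLD` and `family_shares` are introduced here for illustration.

```python
# Token counts per country, copied from the composition table and grouped by
# its "Language family" column (countries in table order).
# Assumption: the ELF "Germanic" rows and the United Kingdom are merged.
TOKENS_BY_FAMILY = {
    "Baltic": [432_851, 709_070],
    "Caucasian": [361_303],
    "Germanic": [564_892, 1_203_820, 14_803, 2_376_142, 1_676, 168_478,
                 1_633_428, 1_262_458, 2_683_997, 41_911_277],
    "Germanic-Romance": [1_208_444, 27_030],
    "Hellenic": [874_852],
    "Hellenic-Turkic": [272_728],
    "Illyric": [143_699],
    "Romance": [3_624_400, 1_806_499, 113_117, 78_521, 290_798, 467_699,
                4_234, 4_230_012],
    "Romance-Germanic": [1_069_884],
    "Semitic": [669_595],
    "Slavic": [676_909, 356_565, 517_524, 142_719, 1_220_770, 391_458,
               31_529, 1_421_953, 3_557_076, 169_094, 267_372, 345_730,
               1_564_804],
    "Thracian": [211_865],
    "Turkic": [318_519, 1_073_657],
    "Uralic": [540_089, 1_981_841, 612_406],
    "Germanic-Celtic": [3_945_831, 28_475],
    "Germanic-Semitic": [286_846],
}

# "Main" families are those with at least 500K tokens.
MAIN_FAMILY_THRESHOLD = 500_000


def family_shares(tokens_by_family, threshold=MAIN_FAMILY_THRESHOLD):
    """Return {family: % of all corpus tokens} for families at or above threshold,
    ordered from the largest family to the smallest."""
    totals = {family: sum(counts) for family, counts in tokens_by_family.items()}
    corpus_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {
        family: 100 * total / corpus_total
        for family, total in ranked
        if total >= threshold
    }


shares = family_shares(TOKENS_BY_FAMILY)
for family, pct in shares.items():
    print(f"{family:<18}{pct:6.2f}%")
```

Because the shares are relative to the whole corpus, they do not sum to 100: the families below the threshold make up the remainder.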

 

The scripts

Here you can download the main scripts that were used to build acWaC-EU — more detailed documentation will follow (hopefully soon):