Statistics

The Corpus consists of 806,791 XML files in NewsML format. They are distributed as 365 zip files, one per day, across 2 CDs. Approximately 3.7 GB is required to store the uncompressed XML files.

Because of seasonal variation the number of stories per day is not constant, but there are on average 2,880 stories per day on weekdays and 480 per day at weekends.
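As a rough consistency check, the two daily averages can be multiplied out and compared with the total file count. The sketch below assumes that the 365 days split into 261 weekdays and 104 weekend days (52 full weeks plus one extra weekday); that split is an assumption, not a figure stated on this page.

```python
# Figures quoted in the text above.
TOTAL_FILES = 806_791
WEEKDAY_AVG = 2_880
WEEKEND_AVG = 480

# Assumption: 365 days = 52 full weeks plus one extra day,
# and that extra day is taken to be a weekday.
WEEKDAYS = 52 * 5 + 1      # 261
WEEKEND_DAYS = 52 * 2      # 104

estimated_total = WEEKDAYS * WEEKDAY_AVG + WEEKEND_DAYS * WEEKEND_AVG
relative_gap = abs(TOTAL_FILES - estimated_total) / TOTAL_FILES

print(estimated_total)          # 801600
print(f"{relative_gap:.2%}")    # 0.64%
```

Under that assumption the rounded averages reproduce the total to within about one percent.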

Stories, Words and Paragraphs

[Charts: Number of Stories on Weekdays; Number of Stories over the Weekend]

These charts show the number of stories per day over the period of the Corpus. Because the weekend has a dramatically different number of stories, it is shown on a separate chart.

[Chart: Distribution of Stories]

The difference between weekdays and the weekend can be seen more clearly in the graph above, which shows the number of days on which a particular number of stories was produced. The distribution of stories is clearly bimodal, with a much higher median for weekdays than for weekends. This is a typical pattern in the news industry.
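A distribution of this kind can be tabulated by binning the daily story totals and counting how many days fall into each bin. The following is a minimal sketch; the function name, the bin width and the sample totals are illustrative only and are not taken from the Corpus.

```python
from collections import Counter

def days_per_story_count(daily_counts, bin_width=100):
    """Map each bin's lower edge to the number of days whose total falls in that bin."""
    histogram = Counter()
    for count in daily_counts:
        histogram[(count // bin_width) * bin_width] += 1
    return dict(sorted(histogram.items()))

# Made-up daily totals (five weekday-like values, two weekend-like values).
sample = [2850, 2910, 2795, 3020, 2880, 470, 510]
print(days_per_story_count(sample, bin_width=500))
# {0: 1, 500: 1, 2500: 4, 3000: 1}
```

With real data, the two clusters of occupied bins would correspond to the weekend and weekday modes.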

[Chart: Number of Stories per Day of Week]

In addition to the difference between the weekdays and weekend there is a more subtle difference between the weekdays themselves. This pattern can be seen in the chart above.
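One way to look for such a pattern is to group the daily totals by day of the week and average each group. This sketch assumes the totals are already available as a mapping from calendar date to story count; the mapping, the function name and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def mean_stories_by_weekday(stories_per_date):
    """Average the daily story totals separately for each day of the week."""
    grouped = defaultdict(list)
    for day, count in stories_per_date.items():
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        grouped[DAY_NAMES[day.weekday()]].append(count)
    return {name: mean(grouped[name]) for name in DAY_NAMES if name in grouped}

# Made-up totals for two Mondays and one Saturday.
sample = {
    date(1997, 3, 3): 2000,
    date(1997, 3, 10): 2200,
    date(1997, 3, 8): 400,
}
print(mean_stories_by_weekday(sample))
```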

[Chart: Stories per Week and Weekday]

Information about the number of stories per day is summarised in the chart above. In addition to the previous comments, it is clear that the weeks beginning 21st December 1996 and 30th December 1996 had a particularly low number of stories, especially on Christmas Day and New Year's Day (both falling on a Wednesday).

[Charts: Number of Words per Document; Number of Paragraphs per Document]

These charts show the number of stories that have a particular word or paragraph count. Most stories are quite short, with around 6-7 paragraphs and 1000 words.
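Per-document counts of this kind could be gathered with a small XML pass. The sketch below assumes that each story keeps its body in `<p>` elements and that words are whitespace-separated tokens; the element names and the sample document are assumptions for illustration and may not match the Corpus files or the counting method actually used.

```python
import xml.etree.ElementTree as ET

def count_paragraphs_and_words(xml_text):
    """Return (paragraph count, word count) for one story.

    Assumes body paragraphs are <p> elements (an assumption about the format)
    and counts words by splitting paragraph text on whitespace.
    """
    root = ET.fromstring(xml_text)
    paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
    words = sum(len(paragraph.split()) for paragraph in paragraphs)
    return len(paragraphs), words

# Hypothetical miniature story, not taken from the Corpus.
SAMPLE = (
    "<newsitem><title>Example</title><text>"
    "<p>First short paragraph.</p><p>Second one here too.</p>"
    "</text></newsitem>"
)
print(count_paragraphs_and_words(SAMPLE))   # (2, 7)
```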


BIP Codes

[Charts: Top Topic Codes; Top Industry Codes; Top Country Codes]


Each story is tagged with zero or more topics, zero or more countries and zero or more industries. The charts above show the frequency of occurrence of these codes. If a document has more than one code in a given category, each code is counted separately.
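The counting rule described here (every code on a document contributes one count) can be sketched as follows. The document representation, the category key and the code names are placeholders, not real Corpus codes.

```python
from collections import Counter

def code_frequencies(documents, category):
    """Count how often each code of one category occurs across all documents.

    A document carrying several codes of the category adds one count per code.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.get(category, []))
    return counts

# Placeholder documents; "TOPIC_A" and "TOPIC_B" are invented names.
docs = [
    {"topics": ["TOPIC_A", "TOPIC_B"]},
    {"topics": ["TOPIC_A"]},
    {"topics": []},
]
print(code_frequencies(docs, "topics").most_common())
# [('TOPIC_A', 2), ('TOPIC_B', 1)]
```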

[Charts: Topic Code Count; Industry Code Count; Country Code Count]


As mentioned above, each story can have zero or more topic, industry or country codes. The charts above show the number of documents with a particular number of codes, for each of the code types.
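The per-document tally behind such charts can be sketched by counting how many codes of a category each document carries and then counting documents per tally. As before, the document representation and code names are placeholders.

```python
from collections import Counter

def documents_per_code_count(documents, category):
    """Map 'number of codes on a document' to 'number of documents with that many'."""
    return Counter(len(doc.get(category, [])) for doc in documents)

# Placeholder documents with two, one and zero topic codes.
docs = [
    {"topics": ["TOPIC_A", "TOPIC_B"]},
    {"topics": ["TOPIC_A"]},
    {"topics": []},
]
print(sorted(documents_per_code_count(docs, "topics").items()))
# [(0, 1), (1, 1), (2, 1)]
```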


Part of Speech Tags

[Chart: POS Tag Distribution (Reuters and WSJ Corpora)]

This chart shows the part-of-speech tag distributions over the Penn Treebank tag set for both the Wall Street Journal and the Reuters news corpora. The POS tagging was carried out using Brill's tagger, freely available from Eric Brill's homepage. As one would expect, the distributions are similar.
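To put a number on "similar", the tag sequences of two corpora can be turned into relative frequencies and compared. The sketch below uses total variation distance as one possible measure; that choice, the function names and the tiny tag lists are illustrative assumptions, not the method used to produce the chart.

```python
from collections import Counter

def tag_distribution(tags):
    """Return each tag's share of all tokens as a fraction between 0 and 1."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}

def total_variation_distance(dist_a, dist_b):
    """Half the sum of absolute differences: 0 means identical, 1 means disjoint."""
    all_tags = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(t, 0.0) - dist_b.get(t, 0.0)) for t in all_tags)

# Made-up tag sequences using Penn Treebank tag names.
corpus_a = tag_distribution(["NN", "NN", "DT", "VBD"])
corpus_b = tag_distribution(["NN", "DT", "DT", "VBD"])
print(total_variation_distance(corpus_a, corpus_b))   # 0.25
```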