The Corpus consists of 806,791 XML files in NewsML format. They are distributed as 365 zip files, one per day, over 2 CDs. Approximately 3.7 GB is required to store the uncompressed XML files. Due to seasonal variations the number of stories per day is not constant, but on average there are 2,880 stories per day on weekdays and 480 per day on weekends.
Stories, Words and Paragraphs
These charts show the number of stories per day over the period of the Corpus. Because the weekend has a dramatically different number of stories, it is shown on a separate chart.
The difference between weekdays and weekends can be seen more clearly in the graph above, which shows the number of days on which a particular number of stories was produced. The distribution of stories is clearly bimodal, with a much higher median for weekdays than for weekends. This is a typical pattern in the news industry.
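As a rough illustration of how the per-day figures behind these charts could be reproduced, the following Python sketch counts the XML entries in one daily zip file and splits per-day counts into weekday and weekend groups. The function names, the `.xml` suffix check, and the bin width are assumptions made for this example; they are not taken from the Corpus documentation.

```python
import datetime
import io
import statistics
import zipfile
from collections import Counter


def count_xml_members(zip_source):
    """Count the .xml entries in one daily zip archive (path or file-like object)."""
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: every story is stored as a member whose name ends in ".xml".
        return sum(1 for name in archive.namelist() if name.lower().endswith(".xml"))


def split_by_day_type(stories_per_day):
    """Split a {date: story_count} mapping into weekday and weekend count lists."""
    weekdays, weekends = [], []
    for day, count in sorted(stories_per_day.items()):
        # date.weekday() is 0-4 for Monday-Friday and 5-6 for Saturday-Sunday.
        if day.weekday() >= 5:
            weekends.append(count)
        else:
            weekdays.append(count)
    return weekdays, weekends


def days_per_story_count(counts, bin_width=100):
    """Histogram: number of days whose story count falls into each bin."""
    return Counter((count // bin_width) * bin_width for count in counts)
```

Given a mapping from `datetime.date` to story count, `split_by_day_type` returns the two groups, and `statistics.median` applied to each group should show the gap between the two modes. Mapping each zip file to its date is left to the caller, since it depends on the file naming scheme.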
In addition to the difference between the weekdays and weekend there is a more subtle difference between the weekdays themselves. This pattern can be seen in the chart above.
Information about the number of stories per day is summarised in the chart above. In addition to the previous comments, it is clear that the weeks beginning 21st December 1996 and 30th December 1996 had a particularly low number of stories, especially on New Year's Day and Christmas Day (both falling on a Wednesday).
These charts show the number of stories that have a particular word or paragraph count. It can be seen that most stories are quite short, with around 6-7 paragraphs and 1000 words.
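Word and paragraph counts of this kind could be obtained along the following lines. This is a minimal sketch, assuming that each paragraph of a story is held in a `p` element and that a word is any whitespace-separated token; both are assumptions made for illustration and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET


def paragraph_and_word_counts(xml_text, paragraph_tag="p"):
    """Return (paragraph_count, word_count) for one story given as an XML string."""
    root = ET.fromstring(xml_text)
    # Collect the text of every paragraph element, including nested inline markup.
    paragraphs = ["".join(p.itertext()) for p in root.iter(paragraph_tag)]
    # Ignore paragraphs that contain only whitespace.
    paragraphs = [text for text in paragraphs if text.strip()]
    # Assumption: a "word" is any whitespace-separated token.
    words = sum(len(text.split()) for text in paragraphs)
    return len(paragraphs), words
```

If the files declare an XML namespace, the tag passed as `paragraph_tag` would need the `{namespace}` prefix that `ElementTree` uses for qualified names.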
BIP Codes
Part of Speech Tags
This chart shows the Part of Speech tag distributions over the Penn Treebank tag set for both the Wall Street Journal and the Reuters news corpus. The POS tagging was carried out using Brill's tagger, freely available from Eric Brill's Homepage. As one would expect, the distributions are similar.
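One way to put a number on "similar" is to compute the relative frequency of each tag in both corpora and measure the distance between the two distributions. The sketch below assumes the tagged text is available as whitespace-separated `word/TAG` tokens, and it uses total variation distance as the comparison measure. Both the input format and the choice of measure are assumptions for this example, not a description of how the chart was produced.

```python
from collections import Counter


def tag_distribution(tagged_text):
    """Relative frequency of each tag in text made of word/TAG tokens."""
    tags = Counter()
    for token in tagged_text.split():
        # Split on the last slash so that words containing "/" are kept whole.
        word, sep, tag = token.rpartition("/")
        if sep and word and tag:
            tags[tag] += 1
    total = sum(tags.values())
    return {tag: n / total for tag, n in tags.items()} if total else {}


def total_variation_distance(p, q):
    """Half the sum of absolute differences: 0 for identical, 1 for disjoint."""
    all_tags = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in all_tags)
```

A value close to 0 would support the observation that the two tag distributions are similar, while a value close to 1 would indicate that the corpora share few tags.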