The World Wide Web constitutes the largest existing source of texts
written in a great variety of languages. A feasible and sound way
of exploiting this data for linguistic research is to compile a
static corpus for a given language. This approach has several
advantages: (i) working with such corpora obviates the problems
encountered when using Internet search engines in quantitative
linguistic research (such as non-transparent ranking algorithms);
(ii) creating a corpus from web data is virtually free; (iii) the
size of corpora compiled from the WWW may exceed by several orders
of magnitude the size of language resources offered elsewhere;
(iv) the data is locally available to the user and can be
linguistically post-processed and queried with the user's preferred
tools. This book addresses the main practical tasks in the creation
of web corpora up to giga-token size. Among these tasks are the
sampling process (i.e., web crawling) and the usual cleanups,
including boilerplate removal and the removal of duplicated
content. Linguistic processing, and the problems it faces due to
the various kinds of noise in web corpora, are also covered.
Finally, the authors show how web corpora can be evaluated and
compared to other corpora (such as traditionally compiled corpora).
For additional material please visit the companion website:
sites.morganclaypool.com/wcc

Table of Contents: Preface / Acknowledgments / Web Corpora / Data
Collection / Post-Processing / Linguistic Processing / Corpus
Evaluation and Comparison / Bibliography / Authors' Biographies
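The "removal of duplicated content" mentioned above can be sketched in its simplest form as exact deduplication by content hash. This is an illustrative toy, not the book's method: real web-corpus pipelines also handle near-duplicates (e.g., via shingling or MinHash), and the function name `dedupe` is a hypothetical helper.

```python
import hashlib

def dedupe(documents):
    """Keep the first occurrence of each distinct document text.

    Illustrative exact-duplicate removal only; production pipelines
    additionally detect near-duplicates that differ in small details.
    """
    seen = set()
    unique = []
    for doc in documents:
        # Hash the normalized byte content so identical pages collapse.
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["A crawled page.", "Another page.", "A crawled page."]
print(len(dedupe(docs)))  # 2
```

Hashing rather than storing full texts keeps the memory footprint per document constant, which matters at the giga-token scale the book targets.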