Data mining is a mature technology. The prediction problem, looking
for predictive patterns in data, has been widely studied. Strong
methods are available to the practitioner. These methods process
structured numerical information, where uniform measurements are
taken over a sample of data. Text is often described as
unstructured information. So, it would seem, text and numerical
data are different, requiring different methods. Or are they? In
our view, a prediction problem can be solved by the same methods,
whether the data are structured numerical measurements or
unstructured text. Text and documents can be transformed into
measured values, such as the presence or absence of words, and the
same methods that have proven successful for predictive data mining
can be applied to text. Yet, there are key differences. Evaluation
techniques must be adapted to the chronological order of
publication and to alternative measures of error. Because the data
are documents, more specialized analytical methods may be preferred
for text. Moreover, the methods must be modified to accommodate very
high dimensions: tens of thousands of words and documents. Still,
the central themes are similar.
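
As a rough illustration of how documents can be turned into measured values, the sketch below encodes each document as a row of 0/1 features marking the presence or absence of words. It is a minimal example under stated assumptions, not the book's own procedure: the function names (tokenize, build_vocabulary, to_binary_vector), the simple alphabetic tokenizer, and the two sample documents are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct word in the collection to a column index."""
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}


def to_binary_vector(text, vocabulary):
    """Return a 0/1 list with a 1 wherever a vocabulary word occurs."""
    vector = [0] * len(vocabulary)
    for w in tokenize(text):
        if w in vocabulary:
            vector[vocabulary[w]] = 1
    return vector


# Hypothetical sample documents, invented for this illustration.
documents = [
    "Stock prices rise",
    "Prices fall as stock markets close",
]

vocabulary = build_vocabulary(documents)
matrix = [to_binary_vector(d, vocabulary) for d in documents]

for row in matrix:
    print(row)
```

Each row has the same uniform layout as a structured numerical record, so it can be passed to a standard prediction method. With tens of thousands of words, most entries would be zero, so a practical version would likely store only the nonzero positions.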