Test shows big data text analysis inconsistent, inaccurate

Accuracy of 90 percent with 80 percent consistency sounds good, but the scores are “actually very poor, since they are for an exceedingly easy case,” Amaral said in an announcement from Northwestern about the study.

Applied to messy, inconsistently scrubbed data from many sources in many formats – the kind of data that big data is often praised for its ability to manage – the results would be far less accurate and far less reproducible, according to the paper.

via Test shows big data text analysis inconsistent, inaccurate | Computerworld.

Here’s an interesting explanation of how LDA (Latent Dirichlet Allocation) works.  From: What is a good explanation of Latent Dirichlet Allocation?

From a 3,000-foot level, as I understand the explanation, LDA is a mechanism for scoring words in order to categorize sets of words, such as paragraphs or entire papers.  It is an interesting exercise, but a human must build the data model first.  Any time a program has to estimate or guess like this there will be error; the only question is how much error is acceptable before the results of this kind of analysis stop being usable.
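To make the "scoring words" idea a little more concrete, here is a toy sketch of LDA fitted with collapsed Gibbs sampling, written in plain Python. This is my own illustration, not the method used in the Northwestern study or in the linked explanation. The function name lda_gibbs, the sample documents, and the values chosen for alpha, beta, and the iteration count are all assumptions made up for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only).

    docs is a list of token lists. Returns (doc_mix, topic_word), where
    doc_mix[d][k] is the estimated share of topic k in document d and
    topic_word[k] counts how often each word was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]          # words per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic
    assign = []                                           # topic of every word occurrence

    # Start by guessing a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    # Repeatedly re-guess each word's topic given all the other guesses.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Score each topic: how much this document uses the topic,
                # times how much the topic uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    doc_mix = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return doc_mix, topic_word


# Made-up sample documents, already split into words.
docs = [
    "cat dog pet vet cat dog".split(),
    "dog pet cat leash vet".split(),
    "stock bond market fund stock".split(),
    "market fund bond stock rate".split(),
]

for seed in (1, 2, 3):
    mix, _ = lda_gibbs(docs, seed=seed)
    print(seed, [[round(p, 2) for p in m] for m in mix])
```

Two things stand out even in a toy like this. A human still has to choose the number of topics and the smoothing values up front, and the procedure is driven by random draws, so different seeds can label or split the topics differently. That is the same flavor of consistency problem the study describes, only at a much smaller scale.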