
Michael Hastings

Started by Cain, June 20, 2013, 12:36:08 AM


Mesozoic Mister Nigel

You negative nellies just need to turn those frowns upside down and look at the bright side! Things aren't so bad, it's all about your attitude.
"I'm guessing it was January 2007, a meeting in Bethesda, we got a bag of bees and just started smashing them on the desk," Charles Wick said. "It was very complicated."


The Good Reverend Roger

Quote from: FOCUS GROUP RAGEMONKEY OF HATE HATE HATE on August 23, 2013, 11:03:56 PM
You negative nellies just need to turn those frowns upside down and look at the bright side! Things aren't so bad, it's all about your attitude.

I'm in danger of having a thought, here.
" It's just that Depeche Mode were a bunch of optimistic loveburgers."
- TGRR, shaming himself forever, 7/8/2017

"Billy, when I say that ethics is our number one priority and safety is also our number one priority, you should take that to mean exactly what I said. Also quality. That's our number one priority as well. Don't look at me that way, you're in the corporate world now and this is how it works."
- TGRR, raising the bar at work.

Mesozoic Mister Nigel

Quote from: The Good Reverend Roger on August 23, 2013, 11:39:44 PM
Quote from: FOCUS GROUP RAGEMONKEY OF HATE HATE HATE on August 23, 2013, 11:03:56 PM
You negative nellies just need to turn those frowns upside down and look at the bright side! Things aren't so bad, it's all about your attitude.

I'm in danger of having a thought, here.

Hey hey hey now big guy, slow down! That kind of stuff is for policy-makers, not regular folks like you.
"I'm guessing it was January 2007, a meeting in Bethesda, we got a bag of bees and just started smashing them on the desk," Charles Wick said. "It was very complicated."


The Good Reverend Roger

Quote from: FOCUS GROUP RAGEMONKEY OF HATE HATE HATE on August 23, 2013, 11:47:07 PM
Quote from: The Good Reverend Roger on August 23, 2013, 11:39:44 PM
Quote from: FOCUS GROUP RAGEMONKEY OF HATE HATE HATE on August 23, 2013, 11:03:56 PM
You negative nellies just need to turn those frowns upside down and look at the bright side! Things aren't so bad, it's all about your attitude.

I'm in danger of having a thought, here.

Hey hey hey now big guy, slow down! That kind of stuff is for policy-makers, not regular folks like you.

:lulz:
" It's just that Depeche Mode were a bunch of optimistic loveburgers."
- TGRR, shaming himself forever, 7/8/2017

"Billy, when I say that ethics is our number one priority and safety is also our number one priority, you should take that to mean exactly what I said. Also quality. That's our number one priority as well. Don't look at me that way, you're in the corporate world now and this is how it works."
- TGRR, raising the bar at work.

The Johnny

Quote from: Triple Zero on August 12, 2013, 06:47:14 PM
BTW (somewhat related to my braindump in the Surveillance thread), "Latent Semantic Indexing" is a Natural Language Processing / Machine Learning algorithm that can do "fuzzy" text matching according to semantic content. Meaning it doesn't require sets of specific keywords to group texts with similar topics, or to calculate a "semantic distance" between two texts.

There's no real parsing or linguistic "understanding" involved, it's mainly a statistical technique that correlates groups of words and phrases used in similar contexts between different texts. But neither the words nor the contexts need to be identical in a strict word-for-word sense in order to get a (partial) match.

Quote from: en.wikipedia.org/wiki/Latent_semantic_indexing
Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.[1]

LSI is also an application of correspondence analysis, a multivariate statistical technique developed by Jean-Paul Benzécri[2] in the early 1970s, to a contingency table built from word counts in documents.

Called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text, it was first applied to text at Bell Laboratories in the late 1980s. The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don't share a specific word or words with the search criteria.

(...) LSI is also used to perform automated document categorization.

(...) Dynamic clustering based on the conceptual content of documents can also be accomplished using LSI. Clustering is a way to group documents based on their conceptual similarity to each other without using example documents (this is called "unsupervised learning", btw - 000) to establish the conceptual basis for each cluster. This is very useful when dealing with an unknown collection of unstructured text.

Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri. LSI can also perform cross-linguistic concept searching and example-based categorization. For example, queries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.

(...) LSI automatically adapts to new and changing terminology, and has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).[9] This is especially important for applications using text derived from Optical Character Recognition (OCR) and speech-to-text conversion. LSI also deals effectively with sparse, ambiguous, and contradictory data.

Text does not need to be in sentence form for LSI to be effective. It can work with lists, free-form notes, email, Web-based content, etc. As long as a collection of text contains multiple terms, LSI can be used to identify patterns in the relationships between the important terms and concepts contained in the text.

(full WP article)

It's a really cool (elegant / relatively simple) algorithm, btw.
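A minimal sketch of the idea in Python, assuming scikit-learn (just one convenient way to build an LSI pipeline; the toy documents and the query are made up for illustration): build a TF-IDF term-document matrix, truncate it with SVD, and do the "concept search" as cosine similarity in the reduced space.

# LSI sketch: TF-IDF term-document matrix -> truncated SVD -> cosine
# similarity in the reduced "concept" space.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The agency collects phone metadata and call records in bulk.",
    "Intelligence agencies gather phone records for surveillance programs.",
    "My cat knocked the coffee cup off the kitchen table.",
    "Bulk surveillance programs intercept call metadata.",
]

# Rows = documents, columns = weighted term counts.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Truncated SVD keeps only the strongest latent "concepts" (2 here), which is
# what lets texts match even when they share few or no literal terms.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = svd.fit_transform(tfidf)

# Concept search: project the query into the same latent space and rank
# documents by cosine similarity there.
query = "phone records surveillance"
query_concepts = svd.transform(vectorizer.transform([query]))
scores = cosine_similarity(query_concepts, doc_concepts)[0]

for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:+.3f}  {doc}")

On a toy corpus like this the effect is small, but the documents about surveillance tend to land near each other (and near the query) in the reduced space, while the unrelated one doesn't; that's the "fuzzy" matching described above.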

So yeah you can probably guess how this sort of technology would be very useful to an organisation that has the need for automatic classification and relevance filtering / selection of huge amounts of textual data. I couldn't say whether it would be feasible to apply it to all data, or whether its computational complexity restricts it to use only on certain groups of targets and/or people on certain "lists".

Additionally, new developments in a different technique, Restricted Boltzmann Machines / Deep Learning Networks, are said to yield even better results for unsupervised learning and Semantic Indexing of Big Data. Geoffrey Hinton is the big name in this field; he works for Google now. His talks are quite enjoyable to watch, IMHO. What I further understand about RBMs is that, because of their simple structure, they can be implemented in FPGAs and specialized computation hardware to increase performance. On the other hand, research in this field has only produced big results in the last few years, so governments are probably not using it just yet.
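For a sense of what the RBM building block looks like in code, here's a minimal single-layer sketch, assuming scikit-learn's BernoulliRBM on binary bag-of-words vectors. This is not the stacked, deep setup Hinton talks about, just the basic unsupervised layer; the toy documents and parameters are invented for illustration.

# Single-layer RBM sketch: binary word-presence vectors in, latent codes out.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import BernoulliRBM

docs = [
    "leaked documents describe a bulk collection program",
    "the court order covers phone metadata collection",
    "grilled cheese sandwich recipe with tomato soup",
    "slow cooker recipe for tomato and basil soup",
]

# BernoulliRBM expects features in [0, 1], so use binary word presence.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs).toarray()

# Contrastive-divergence training; the hidden units model which words tend
# to co-occur across documents.
rbm = BernoulliRBM(n_components=4, learning_rate=0.05, n_iter=200, random_state=0)
rbm.fit(X)

# transform() gives the hidden-unit activation probabilities: an unsupervised
# latent code for each document.
for doc, code in zip(docs, rbm.transform(X)):
    print([round(float(p), 2) for p in code], doc)

A deep network in the sense Hinton describes stacks several of these layers, each one trained on the codes produced by the layer below it.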

That's the type of heuristics I was talking about in regards to qualitative analysis... it's so efficient that it looks for synonyms or even themes to build the analytical categories, and arranges them accordingly, so no amount of noise is going to block that out; it simply means that in your given case there will be a greater number of categories.
<<My image in some places, is of a monster of some kind who wants to pull a string and manipulate people. Nothing could be further from the truth. People are manipulated; I just want them to be manipulated more effectively.>>

-B.F. Skinner