cause the original thread turned into a toolbox:
Quote from: Professor Cramulus on May 29, 2008, 01:49:12 AM
Quote from: triple zero on May 26, 2008, 04:29:42 PM
Further, Neural Network algorithms have been widely replaced by newer and better Pattern Classification algorithms, such as Support Vector Machines (seems to be the current "market leader") and Learning Vector Quantization (the focus of my research group). These algorithms do not have anything to do with brains and/or neurons anymore, but they have other advantages (you can't really see *how* a neural network or an SVM has learned what it does, but LVQ allows this, for instance), and are more accurate.
very interesting, 000. Can you elaborate a bit on what these algorithms are used for?
a primer on machine learning, ok here we go.
say you got a whole bunch of fish. like a big pile of 1,000 salmon and herrings, randomly, about 500 each. then you get this poor Biology PhD to sift them out for you using his expert knowledge on fish and put them into a machine that measures their length (in millimeters) and average brightness (as a floating point number between 0.0 = black and 1.0 = white).
finally you get an excel sheet emailed from this biology dude, with three columns: one listing all the lengths, one the brightnesses, and the third saying "salmon" or "herring". first thing you do is plot every row as a dot on a 2-dimensional graph, say length on the X-axis and brightness on the Y-axis.
since salmon and herring have distinct characteristics in brightness and size, what you should see is two kind of separate clouds of points, one is the salmons (which we all know are brighter and larger) and the other is the herrings (which are well known for being darker and shorter than salmon).
perhaps the clouds overlap a little bit, could be. shouldn't be too much though, otherwise machine learning isn't going to perform very well.
what you can do now is draw a (straight) line between these two clouds that sort of separates them. one idea, for example, is to take the average (middle) point of the one cloud and the middle point of the other cloud, and draw the line in between (where it's equidistant from both).
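to make the "middle point" idea concrete, here's a minimal sketch in Python/numpy; the fish measurements are made up, just so there's something to run:

import numpy as np

# made-up example measurements, one row per fish: [length_mm, brightness]
salmon  = np.array([[720, 0.71], [690, 0.65], [750, 0.80], [705, 0.74]])
herring = np.array([[280, 0.30], [310, 0.25], [295, 0.35], [260, 0.28]])

# the "middle point" of each cloud is just the mean of its points
salmon_mean  = salmon.mean(axis=0)
herring_mean = herring.mean(axis=0)

def classify(fish):
    # assign the fish to whichever middle point it is closest to,
    # which is the same as checking which side of the equidistant line it lands on
    d_salmon  = np.linalg.norm(fish - salmon_mean)
    d_herring = np.linalg.norm(fish - herring_mean)
    return "salmon" if d_salmon < d_herring else "herring"

print(classify(np.array([710.0, 0.68])))   # -> salmon

(in practice you'd rescale the features first, so the length in millimeters doesn't completely drown out the 0.0-1.0 brightness in the distance.)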
anyway, one day you get a mysterious-smelling package in the mail. inside the box is a fish with a note saying
Quote
Esteemed colleague Professor Cramulus,
can you please identify this fish for me? I'm not sure
whether it's a herring or a salmon.
Regards, Professor Zippletits
so you measure the length and brightness of this fish, pin the point on the graph, and check whether it falls on one side of the line or the other. and now, you know the answer!
now, LVQ is an algorithm that, given a whole load of example data, tries to determine the optimal position of these "middle points", called prototypes, in such a way that any unseen data point will be classified depending on which prototype it is closest to.
the nice thing is that these prototypes represent the "average" salmon and the "average" herring, so you can sort of examine what criteria the classification is based on. you can have as many prototypes as you want btw, also multiple per class; in that case they will sort of space themselves out inside the cloud to still generate the optimal decision boundary based on closeness to the nearest prototype.
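for the curious, the basic LVQ1 update rule looks roughly like this (a textbook-style sketch, not the exact variant my group works on; all names and numbers here are mine):

import numpy as np

def train_lvq1(data, labels, prototypes, proto_labels,
               learning_rate=0.05, epochs=50):
    # basic LVQ1: pull the nearest prototype towards a sample if the class
    # matches, push it away if it doesn't
    protos = np.array(prototypes, dtype=float)
    for _ in range(epochs):
        for x, y in zip(data, labels):
            distances = np.linalg.norm(protos - x, axis=1)
            w = np.argmin(distances)                           # nearest prototype
            if proto_labels[w] == y:
                protos[w] += learning_rate * (x - protos[w])   # attract
            else:
                protos[w] -= learning_rate * (x - protos[w])   # repel
    return protos

def classify(x, protos, proto_labels):
    # unseen data gets the label of the closest prototype
    return proto_labels[np.argmin(np.linalg.norm(protos - x, axis=1))]

prototypes are usually initialized at the class means or at randomly picked training points; one prototype per class gives you exactly the fish picture from before.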
SVM and Neural Nets work slightly differently, but the basic process is the same: in go some numbers (length and brightness in this case) and out comes a classification ("salmon" or "herring", in this case).
now usually, instead of just two numbers, a machine learning problem can have a vector* of 50 or even as many as 6000 numbers as input. you can imagine that at this point it becomes kinda hard to make a graph with a point cloud from this and draw those lines by yourself. fortunately, the computer has no problem with this at all (ok, you do run into some problems as your dimensionality increases, but i don't want to get into the details too much now).
(* the word "vector" means nothing more than "list of numbers" for purposes of this explanation. for example a 5 dimensional vector would look like (-2.3, 1.7, 0.555, 73.001, 0.0) or something)
for example, my bachelors thesis was about classifying images of boar spermatozoid heads into the classes "healthy" or "damaged" after they had been frozen. yeah, it's a bio-industry application. i did consider the moral issues of doing this research, but the original research had already been done by a PhD from a Spanish institute; we just obtained this dataset from her in order to test and compare our own algorithms on real world data, not to develop a biotech pig artificial insemination application.
all the images were 35x19 pixels with greyscale values. your average spermatozoid head image looks kinda like a grey oval on a dark background with some light and dark spots in it.
anyway, 35x19 = 665. so the input vector is 665 dimensional.
now the cool thing is, if you train the LVQ algorithm on the example data, you'll end up with some prototypes for the "good" and the "bad" spermatozoid head images. these prototypes are of course also 665-dimensional vectors, which means you can plot each of them as a 35x19 grayscale image! which is very nice because that way you can see what the algorithm "thinks" the prototypical "good" and the prototypical "bad" spermatozoid heads look like.
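in code, the "plot a prototype as an image" trick is nothing more than a reshape; a sketch (the random arrays below are just placeholders standing in for trained prototypes):

import numpy as np
import matplotlib.pyplot as plt

# placeholders: pretend these came out of LVQ training,
# one 665-dimensional vector of greyscale values per class
good_prototype = np.random.rand(665)
bad_prototype  = np.random.rand(665)

fig, axes = plt.subplots(1, 2)
for ax, proto, title in zip(axes, (good_prototype, bad_prototype),
                            ("prototypical good head", "prototypical bad head")):
    ax.imshow(proto.reshape(35, 19), cmap="gray")   # back to 35x19 pixels
    ax.set_title(title)
    ax.axis("off")
plt.show()

(the reshape has to use the same pixel order the images were flattened with, of course.)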
other applications of machine learning:
DNA data is a lot of numbers (ok actually ACTG letters, but you can convert them to numbers), and you wanna train an algorithm that classifies parts of DNA into whether they code for any (known) proteins ( = exons), or not ( = introns) (or the other way around, i forgot).
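one common way of converting the letters to numbers is one-hot encoding, roughly like this (just a sketch of the encoding step, not of the classifier itself):

import numpy as np

# each base becomes a 4-dimensional indicator vector
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def encode(sequence):
    # turn a DNA string into one long numeric vector, 4 numbers per base
    # (so a 100-base window becomes a 400-dimensional input vector)
    return np.array([bit for base in sequence for bit in ONE_HOT[base]])

print(encode("ACGT"))   # -> 16 numbers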
or you have an MRI scan of some tissue, in like 6 different varieties, and for every little region you get a load of numbers (say, several gigabytes for your average MRI scan), and you wanna classify whether each bit is cancerous or healthy.
it's basically datamining. an interesting application could be to somehow encode a forum post into a list of numbers (word frequency counts and such), and then try to classify it by who wrote it, based on writing style.
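the encoding step for that last idea could be as simple as counting words from a fixed vocabulary; a crude sketch (the vocabulary here is made up, in practice you'd take the most frequent words over all posts):

from collections import Counter
import re

def post_to_vector(post, vocabulary):
    # count how often each vocabulary word appears in a post;
    # the resulting list of numbers is the input vector for the classifier
    words = re.findall(r"[a-z']+", post.lower())
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

vocab = ["spag", "fnord", "pineal", "the", "of"]
print(post_to_vector("the fnord of the fnord", vocab))   # -> [0, 2, 0, 2, 1]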
wow, that's crazy interesting. Especially the bits on possible applications. Thanks for the primer, zippletits!
and thanks for all the fish
yeah, so if you ever happen onto a lot of numbers, preferably from a gaussian-like source, and you need to classify them, who you gonna call ;-)
some things i forgot to mention:
machine learning algorithms are algorithms that get better at a certain classification task the more examples you give them.
"better" approaches some limit of course, for example if the two clouds of fish overlap for 20%, you cannot get accuracy above 80%, but neither could a human, given the same data. this is why, if this happens, it helps to find a third data source that is as much uncorrelated with the other sources as possible. say, umm, fin-length as a percentage of the total length (i dunno, just an idea).
another important issue is overtraining. if you give the algorithm too much room, for example by having it train 500 prototypes for each class, it will simply memorize all the data points. the problem is that it will start to draw the wrong conclusions: it will draw a very jagged line between those clouds, but the jags are defined by the individual training points, not by any sort of general information it has learned about these fish species.
the usual result of this is that while performance on the training set will invariably go up, the generalization accuracy (for unseen data) will worsen.
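you can watch this happen by comparing accuracy on the training data with accuracy on held-out data. a sketch, using scikit-learn's 1-nearest-neighbour classifier as a stand-in for an over-flexible model (it memorizes the training set completely, much like one prototype per training point), on made-up overlapping clouds:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# two overlapping point clouds, like fish clouds that touch a bit
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(1.5, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# memorizing every training point looks perfect on the training set...
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("training accuracy:   ", model.score(X_train, y_train))   # 1.0
print("unseen-data accuracy:", model.score(X_test, y_test))     # noticeably lower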
to keep an eye on how much this is happening, a technique called "ten-fold cross-validation" is often used. you take the training set and divide it into 10 equal parts. you train the classifier with 9 of these parts, and when that's done, you test it with the remaining 1/10th. this way you make sure you are testing the classifier on unseen data, which is what you want it to be good at; you couldn't care less about the training data itself, cause the biologist PhD already did that work.
noting down the error rates, accuracy and whatnot, you repeat this procedure 10 times, leaving a different part out every time. now you've got 10 different estimates of the generalization accuracy of the classifier you made; if you average them, it is believed that this gives a good estimate of the generalization accuracy on unseen data when you train the classifier with the whole 10/10ths.
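written out by hand, the whole procedure is just this (scikit-learn's nearest-neighbour classifier is a placeholder here; any classifier with fit/score would do, and sklearn also ships this as cross_val_score):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def ten_fold_cv(X, y, n_folds=10):
    # X, y: numpy arrays. split into 10 parts, train on 9, test on the
    # held-out 1, rotate, and average the 10 accuracy estimates.
    indices = np.random.permutation(len(X))
    folds = np.array_split(indices, n_folds)
    scores = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = KNeighborsClassifier(n_neighbors=5)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores)   # estimate of the generalization accuracy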
why do you do this? well, you can tweak all sorts of parameters on your classifier, like learning speed, representation of the input data, and a whole bunch of other stuff. it all affects the accuracy of your final classifier. but with 10-fold cross-validation, you can tweak these parameters until you get a reasonable generalization accuracy on unseen data.
i could go on and on and on about all sorts of machine learning tricks and such, but i will stop here :)
i've been taking a break from all this stuff recently, but in a few weeks i might pick up my research again.