VARIATIONS FOR LARGE SYSTEMS AND MANY INPUTS
I wrote this incredibly long comment to shollen’s article when I woke up this morning but decided that instead I would publish it as its own entity. I like where this particular discussion is going, and if I could group them all together I would. Regardless, before you read this article you definitely need to read shollen’s below.
Why systems like Pandora suck
The problem with things like Pandora is that it relies on a keying system. Those of you that have gmail might make the same type of correlation to “labels” in their interface. This system is completely arbitrary in complexity and in inputs. There is an incredible amount of data out there but each individual person (input) is responsible for organizing their own sets, and no real intelligent analysis of the problem happens past that point, other than basic, basic statistics. So here’s a question. Is the key for ‘genre’ actually a significant one? How significant is it? What keys are most significant? What keys are least significant? Does my assignment of genre X and theme Y to one particular song actually differentiate it at all from any other items? How do I quantify that? This keying system is a supervised approach which absolutely sucks for this type of data.
Factor components
Factor analysis or, in the simple case principal component analysis (PCA), is generally capable of making unsupervised decisions about the significance of arbitrary factors, and creating a series of column vectors describing the most significant ones. I encourage everyone to look PCA and factor analysis up if interested.
Initial application to music classification
Screw manually organizing music into genres and ratings. I want my computer to tell me what I’m going to like without having to do anything other than analyze my current collection. If I had the processing power, I would probably assign some sort of 1D value to every 1/10th of a second second of a song based on some calculation. A perfectly reasonable example might be a frequency/amplitude analysis in the simple case, or perhaps something very fancy like a Fourier/superposition-type deconstruction. Regardless, I would end up with a column vector of values for each song. Then I’d normalize then length such that each column vector for each such is exactly the same, such that a rectangular matrix could be constructed, representing the entirety of music known to my computer. Then I’d probably do a 2D factor decomposition (although this can be done in any dimension, and often is for good reason). The eigenvectors/eigenvalues I get out of the analysis represent the most, and second most variance within the data set. This is generally most easily done with a singular value decomposition (SVD). I would bet almost anyone reading this site that most of the music you actually listen to falls within one quadrant of Factor 1 vs. Factor 2 on Cartesian plane.
Even if your musical taste is extremely eclectic, this would easily give you great differentiation of genre (probably represented in Factor 1, the factor describing the most variance in the data set) without having to actually manually key or classify anything. To those of you who know what I’m working on currently, the similarities between this and my work are, I’m sure, not lost on you.
Other systems
Now you can ask questions: Is genre a separable quantity given the vectors I have constructed from the raw data? Is any key used in the Pandora or Google model actually useful at all? If yes, great … now I can arbitrarily understand a new mp3 or email based on factors that actually have some real significance without having to do any manual classification or introduce any personal bias into the system. I can, of course, write very clever code to do this all for me. This is a general technique applicable to really any system in which you can construct column vectors describing one data point, and in which you want to differentiate between types (for instance, mass spectra). Here’s an example of a 2D plot for such data. Note that the data I used to construct this (and ran my code on) was completely arbitrary and random and therefore no separation was observed.
Google labels suck bad
So let’s extrapolate to a system that everyone is extremely familiar with: Gmail. Even though I get a lot of crap from friends for this, I really like Google. 99% percent of the time they produce products and services that are just downright impressive and well done. However, one of my least favorite concepts is this manual labeling of emails and of RSS feeds, etc. It’s a terrible system. Really terrible. So instead of manually labeling stuff, let’s analyze all your email and figure out everything we need without having to do a damn thing.
- Take emails and normalize them.
- Build column vectors describing something about each set of characters in the email.
- Do the factor analysis.
- Ask a question: are these N labels significant?
- If so, apply.
- Build a new set of vectors describing something different about each character set.
- Repeat the analysis.
- Ask another question: do we do better with this vector type?
- You get the point …
For those of you who love your calculus I could even imagine some very fancy multidimensional function/model describing the way in which we build column vectors to describe character sets in each email, a minimization of that function giving us the best way to construct these column vectors from the data.
Now, when a new email comes in, I know everything about it based on its relationship to the vectors I’ve calculated! Sweet! There’s obviously a little more to this to get it to work right, and this does require fairly large data sets. This would be of basically no use to someone with only a few emails in their account, although there could be a global set calculated from averages of each Gmail user from which initial vectors are calculated for new/small system users.
I will kill spam with this technique
The stuff I’ve described above deals with fairly complex systems. One type of system that is absolutely not complex is spam. Spam filters work in nearly the same way as any keying mechanism in the other systems.
- if (email_contains_word_viagra) total_score += significance_of_contains_word_viagra;
- if (email_contains_unresolved_link) total_score += significance_of_contains_unresolved_link;
- …
- if (total_score >= spam_score_limit) mark_as_spam(email);
You can see this in action in any email just by looking at the X-Spam: headers (of course, there are other mechanisms that don’t work this way). Go ahead, give it a try. One of the problems with this system is that it assumes everyone gets the same type of spam. They may be right: word N might be the most significant globally, but if I don’t get any spam of that type, what good is it to me? Therefore I think it might be reasonable to address spam as a non-global issue. So, here you go Google, here’s another idea.
- if (i_mark_something_as_spam) label_email(email, “spam”);
- factor_analysis_for_spam(factor_results, set_of_real_email, all_current_emails_marked_as_spam);
- int any_good = can_i_differentiate(factor_results, “spam”);
- if (any_good) develop_filter_from_factor_analysis(factor_results, “spam”);
- if (any_good) store_new_filter(factor_results, filter_structure);
- remove_obsolete_filters(factor_results, filter_structure);
Without having to do anything at all, I have organically developed a new filter for a specific type of spam that I get. These aren’t run in order as their listed but they are just concepts necessary to build such a spam-vector data set.
There are a lot of good answers to good questions about this type of methodology. For instance, applications to visual pattern and facial recognition. Feel free to ask them, but for the brevity of this post, I’m leaving a lot out. It’s already far too long as it is. I’d also like to point out that I think this is the first time I’ve ever used six category “keys” to describe a post. Fitting, isn’t it?
Cheers.

by the way i’m applying a lot of this stuff in my work right now to techniques in biological mass spectrometry … thats why i’m so interested in it. and, of course, let’s not forget the netflix problem.
So I think you might be confused between Last.fm and Pandora. Pandora takes the descriptive tags (i.e. heavy bass line, light vamping, soft vocals, etc.) and then bases recommendations based on similarity of tags to the current song.
Last.fm is a social method basing recommendations on similarity actual playlists listened to. If I listen to Band A often and the users with my most similar “tastes” also listen to Band A and Band B but I don’t listen to Band B then B is recommended.
Is there an actual functional difference between the two methods or do they both break down to and arbitrary tagging system?
Also I feel that this type of analysis does not necessarily invalidate my approach of the “constellation” of factors for recommendation. As long as there is a significant ability to differentiate then it doesn’t matter what the factor used is whether it be a stylistic tag or a consumer purchasing choice.
I don’t understand most of the math behind it but your argument seems to be primarily that most tags are not descriptively significant in terms of differentiating between what to recommend or not (whether that be spam to be deleted or music to be listened to).
yeah you’re right. i meant pandora — whoops. this really isnt at all applicable to last.fm.
secondly: tags are useless when you can create arbitrary sets that represent the MAXIMUM amount of variance between groups.
also the notion of recommendation just got me started on this whole rant … the majority of the post isnt at all amount music or music rec technology.
and of course … regardless of last.fm … the whole music analysis by factors is still cool. i REALLY wanna see how what i listen to performs in this analysis. i really think we’d all be surprised at the lack of variation in our musical choices.
also i’d like to point out that this is really a topical treatment of this problem. for instance, the spam problem and the google label problem —
the MUCH more complex problem is figuring out how to intelligently build those input vectors for each message. that is without question a billion times more difficult than the stuff i wrote about here. and i dont have any idea how to do it.
It kind of depends about what kind of variance you are talking about. Or at least how you define variance.
I definitely believe that I have low variance in my music library on a kind of global scale. My classical music is limited, I have very little jazz, and I don’t have much in the way of modern top of the pops. However within certain genres I have a great degree of variance.
It is not really an argument along the same lines. It is more of a “depth” vs. “breadth” distinction. Then of course there is a whole other system of value there.
well sure. if you take breadth to be defined as genre, then yes, there is not a lot of variance there. i think i’ve gotten people confused a little by using the word variance in too many places.
this analysis on music would only tell you about what you listen to vs. what you have. in a global sense you would need a global data set. so this would conclusively answer the question: is my music library representative of what i actually like?
for instance, i think right now i’d be doing a lot better than say, back in the day when i stored and organized everything under the sun.
and remember this is all as a function of HOW you build your inputs to the analysis.
Now if only somebody at Macroshaft would read this and realise that the reason why “live mail” sucks is that they dont even try and filter spam in any sensible way at all …