## Vancouver Machine Learning and Data Science Meetup – NLP to Find User Archetypes for Search & Matching

I will attend the following seminar by Thomas Levi in the next R/Machine Learning/Data Science Meetup in Vancouver on Wednesday, June 25.  If you will also attend this event, please come up and say “Hello”!  I would be glad to meet you!

To register, sign up for an account on Meetup, and RSVP in the R Users Group, the Machine Learning group or the Data Science group.

Title: NLP to Find User Archetypes for Search & Matching

Speaker: Thomas Levi, Plenty of Fish

Location: HootSuite, 5 East 8th Avenue, Vancouver, BC

Time and Date: 6-8 pm, Wednesday, June 25, 2014

Abstract

As the world’s largest free dating site, Plenty Of Fish would like to be able to match with and allow users to search for people with similar interests. However, we allow our users to enter their interests as free text on their profiles. This presents a difficult problem in clustering, search and machine learning if we want to move beyond simple ‘exact match’ solutions to a deeper archetypal user profiling and thematic search system. Some of the common issues that arise are misspellings, synonyms (e.g. biking, cycling and bicycling) and similar interests (e.g. snowboarding and skiing) on a several million user scale. In this talk I will demonstrate how we built a system utilizing topic modelling with Latent Dirichlet Allocation (LDA) on a several hundred thousand word vocabulary over ten million+ North American users and explore its applications at POF.

Bio

Thomas Levi started out with a doctorate in Theoretical Physics and String Theory from the University of Pennsylvania in 2006. His post-doctoral studies in cosmology and string theory, where he wrote 19 papers garnering 650+ citations, then took him to NYU and finally UBC.  In 2012, he decided to move into industry, and took on the role of Senior Data Scientist at POF. Thomas has been involved in diverse projects such as behaviour analysis, social network analysis, scam detection, Bot detection, matching algorithms, topic modelling and semantic analysis.

Schedule
• 6:00PM Doors are open, feel free to mingle
• 6:30 Presentations start
• 8:00 Off to a nearby watering hole (Mr. Brownstone?) for a pint, food, and/or breakout discussions

## Machine Learning Lesson of the Day – Babies and Non-Statisticians Practice Unsupervised Learning All the Time!

My recent lesson on unsupervised learning may make it seem like a rather esoteric field, with attempts to categorize it using words like “clustering“, “density estimation“, or “dimensionality reduction“.  However, unsupervised learning is actually how we as human beings often learn about the world that we live in – whether you are a baby learning what to eat or someone reading this blog.

• Babies use their mouths and their sense of taste to explore the world, and they can probably determine what satisfies their hunger and what doesn’t pretty quickly.  As they expose themselves to different objects – a formula bottle, a pacifier, a mother’s breast, their own fingers – their taste and digestive system are recognizing these inputs and detecting patterns of what satisfies their hunger and what doesn’t.  This all happens before they even fully understand what “food” or “hunger” means.  This will probably happen before someone says “This is food” to them and they have the language capacity to know what those 3 words mean.
• When a baby finall realizes what hunger feels like and develops the initiative to find something to eat, then that becomes a supervised learning problem: What attributes about an object will help me to determine if it’s food or not?
• I recent wrote a page called “About this Blog” to categorize the different types of posts that I have written on this blog so far.  I did not aim to predict anything about any blog post; I simply wanted to organize the 50-plus blog posts into a few categories and make it easier for you to find them.  I ultimately clustered my blog posts into 4 mutually exclusive categories (now with some overlaps).  You can think of each blog post as a vector-valued input, and I chose 2 elements – the length and the topic – of each vector to find a way to group them into classes that are very similar in length and topic within each class and very different in length and topic between the classes.  (I used those 2 elements – or features – to maximize the similarities within each category and minimized the dissimilarities between the 4 categories.)  There were other features that I could have used – whether it had an image (binary feature), the number of colours of the fonts (integer-valued feature), the time of publication of the post (continuous feature) – but length and topic were sufficient for me to arrive at the 4 categories of “Tutorials”, “Lessons”, “Advice”, and “Notifications about Presentations and Appearances at Upcoming Events”.

## Machine Learning Lesson of the Day: Clustering, Density Estimation and Dimensionality Reduction

I struggle to categorize unsupervised learning.  It is not an easily defined field, and it is also hard to find generalizations of techniques that are exhaustive and mutually exclusive.

Nonetheless, here are some categories of unsupervised learning that cover many of its commonly used techniques.  I learned this categorization from Mathematical Monk, who posted a great set of videos on machine learning on Youtube.

• Clustering: Categorize the observed variables $X_1, X_2, ..., X_p$ into groups that maximize some similarity criterion, or, equivalently, minimize some dissimilarity criterion.
• Density Estimation: Use statistical models to find an underlying probability distribution that gives rise to the observed variables.
• Dimensionality Reduction: Find a smaller set of variables that captures the essential variations or patterns of the observed variables.  This smaller set of variables may be just a subset of the observed variables, or it may be a set of new variables that better capture the underlying variation of the observed variables.

Are there any other categories that you can think of?  How would you categorize hidden Markov models?  Your input is welcomed and appreciated in the comments!

## Presentation Slides – Finding Patterns in Data with K-Means Clustering in JMP and SAS

My slides on K-means clustering at the Toronto Area SAS Society (TASS) meeting on December 14, 2012, can be found here.

This image is slightly enhanced from an image created by Weston.pace from Wikimedia Commons.

#### My Presentation on K-Means Clustering

I was very pleasured to be invited for the second time by the Toronto Area SAS Society (TASS) to deliver a presentation on machine learning.  (I previously presented on partial least squares regression.)  At its recent meeting on December 14, 2012, I introduced an unsupervised learning technique called K-means clustering.

I first defined clustering as a set of techniques for identifying groups of objects by maximizing a similarity criterion or, equivalently, minimizing a dissimilarity criterion.  I then defined K-means clustering specifically as a clustering technique that uses Euclidean proximity to a group mean as its similarity criterion.  I illustrated how this technique works with a simple 2-dimensional example; you can follow along this example in the slides by watching the sequence of images of the clusters toward convergence.  As with many other machine learning techniques, some arbitrary decisions need to be made to initiate the algorithm for K-means clustering:

1. How many clusters should there be?
2. What is the mean of each cluster?

I provided some guidelines on how to make these decisions in these slides.