Using decision trees to predict paying users in Ruby


Over at Ilya Grigorik’s blog, there’s a great post on using decision trees to figure out which attributes matter in a decision.

I had some hiccups trying to implement this in a project, so this is a quick post on how you might actually use his decisiontree gem in your app.

Sample use

Suppose you have an app with a bunch of users, and several payment options. You also offer a free service, which you hope will lead people to buy the paid service. You want to know what kind of people buy the paid service, so that you can target your app towards them.

Setup

To get started you’ll need Ilya’s decisiontree gem.

sudo gem install decisiontree

You’ll also need to install the graphr Ruby classes. You can pick up the graphr stuff here.

If you’re on Leopard, you’ll also need to use an older version of Graphviz, as the newest version gives errors like:

dyld: lazy symbol binding failed: Symbol not found: _pixman_image_create_bits

When I ran into this problem myself, I finally found this post on Graphviz with Leopard, which suggests downloading an earlier release here. This is what worked for me.

Those were the only technical problems that I had.

Finally, since you’ll probably be using this in a Rails app, let’s mock out the situation you’re likely to have.

Here’s a quick script that stands in for an ActiveRecord model named User, as most apps have. Save it as users.rb so the training script can load it.

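# users.rb: a stand-in for an ActiveRecord User model.
class User
  attr_accessor :age, :salary, :paid

  def initialize(atts = {})
    atts.each { |att, value| send("#{att}=", value) }
  end

  # Bucket the salary into a discrete label; nil if the salary is unknown.
  def salary_range
    return nil if salary.nil?
    return 'Less than 10k' if salary < 10_000
    return '10k - 50k' if salary < 50_000
    return '50k - 100k' if salary < 100_000
    '100k+'
  end

  # Bucket the age into a discrete label; nil if the age is unknown.
  def age_range
    return nil if age.nil?
    return 'less than 20' if age < 20
    return '20 - 30' if age < 30
    return '30 - 40' if age < 40
    '40+'
  end

  # The decision we want to explain, encoded as 1 (paid) or 0 (did not pay).
  # Note that 0 is truthy in Ruby, so don't use this in an `if`.
  def paid?
    paid ? 1 : 0
  end
end

USERS = [
  User.new(:age => 25, :salary => 40000, :paid => true),
  User.new(:age => 40, :salary => 10000, :paid => false),
  # more
]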
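
Before training anything, it can be worth checking that the bucketing behaves as you expect at the boundaries, since the tree only ever sees the labels. Here is a small standalone sketch of the same idea, written table-driven. The `bucket` and `age_range_for` helpers and the `AGE_BUCKETS` constant are names I made up for illustration; the thresholds are copied from the age_range method above.

```ruby
# Hypothetical table of [upper limit, label] pairs, copied from age_range.
AGE_BUCKETS = [[20, 'less than 20'], [30, '20 - 30'], [40, '30 - 40']].freeze

# Returns the label of the first bucket whose upper limit is greater than
# the value, the fallback label if no bucket matches, or nil for nil input.
def bucket(value, buckets, fallback)
  return nil if value.nil?
  match = buckets.find { |limit, _label| value < limit }
  match ? match.last : fallback
end

def age_range_for(age)
  bucket(age, AGE_BUCKETS, '40+')
end

# Print the label for values on either side of each boundary.
[19, 20, 39, 40].each do |age|
  puts "#{age} => #{age_range_for(age)}"
end
```

Note that each upper limit is exclusive, so a 40-year-old (like the second sample user) lands in '40+', not '30 - 40'.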

Figuring out why users paid

Next, we can just copy Ilya’s script to train the tree and print a graph.

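require 'rubygems'
require 'decisiontree'
# Assumes users.rb (the script above) sits next to this file.
require_relative 'users'

def find_reasons
  # Labels for the columns in each training row, in order.
  attributes = ['Age range', 'Salary range']

  # One row per user; the last element is the decision (1 = paid, 0 = not).
  training = USERS.map do |user|
    [user.age_range, user.salary_range, user.paid?]
  end

  # The third argument (1) is the default decision the tree falls back on.
  dec_tree = DecisionTree::ID3Tree.new(attributes, training, 1, :discrete)
  dec_tree.train

  # Draw the trained tree (this is the step that needs graphr and Graphviz).
  dec_tree.graph("reasons_for_paying")
end

find_reasons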

The key thing to note is that we first define the attributes that we think might play into the decision, then build an array of arrays holding the data: one inner array per user, with the values in the same order as the attributes. The last element of each inner array should be the decision. So, if the decision I was analyzing were ‘did users eat cake’, 0 would mean that they didn’t and 1 would mean that they did.
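
To make "which attributes matter" a bit more concrete: ID3, the algorithm the ID3Tree class is named after, chooses which attribute to split on by information gain, that is, by how much knowing the attribute reduces the uncertainty (entropy) about the decision. The following is a minimal standalone sketch of that calculation on rows shaped like the training array above. It is not the gem's code; the `entropy` and `information_gain` helpers are names I made up, and the four rows are toy data purely for illustration.

```ruby
# Shannon entropy (in bits) of a list of decisions such as [1, 0, 1, 1].
def entropy(decisions)
  total = decisions.size.to_f
  decisions.group_by { |d| d }.values.sum do |group|
    prob = group.size / total
    -prob * Math.log2(prob)
  end
end

# Information gain from splitting rows on the attribute in column `index`.
# Each row looks like [attribute values..., decision].
def information_gain(rows, index)
  before = entropy(rows.map(&:last))
  after = rows.group_by { |row| row[index] }.values.sum do |group|
    (group.size.to_f / rows.size) * entropy(group.map(&:last))
  end
  before - after
end

# Toy rows: [age range, salary range, paid (1 or 0)]
rows = [
  ['20 - 30', '10k - 50k',  1],
  ['20 - 30', '50k - 100k', 1],
  ['40+',     '10k - 50k',  0],
  ['40+',     '50k - 100k', 0]
]

attributes = ['Age range', 'Salary range']
attributes.each_with_index do |name, i|
  puts "#{name}: #{information_gain(rows, i).round(3)}"
end
```

On these toy rows, age range separates payers from non-payers perfectly while salary range tells you nothing, so a tree built this way would split on age first.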

So, we end up with a nice visual of the tree.

Caveats

There are obviously a few caveats here. One huge one is that you must make sure you have not omitted a key factor in the decision.

For example, suppose you were trying to figure out who will pay for cake, and you have an attribute like ‘likes cake’ but only included ‘salary’ and ‘age’ in your analysis. You might conclude that age is the most important factor in whether or not people will pay for cake, while in fact ‘likes cake’ may be the more important factor.
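
Here is a tiny sketch of how that can happen. The `Person` struct, the `pay_rate_by` helper, and the six rows are all made up for illustration; nothing here comes from real data or from the gem.

```ruby
# Made-up rows for illustration only.
Person = Struct.new(:age_range, :likes_cake, :paid)

PEOPLE = [
  Person.new('less than 20', true,  true),
  Person.new('less than 20', true,  true),
  Person.new('less than 20', false, false),
  Person.new('40+',          true,  true),
  Person.new('40+',          false, false),
  Person.new('40+',          false, false)
].freeze

# Fraction of people who paid, for each value of the given attribute.
def pay_rate_by(people, attribute)
  people.group_by { |person| person.public_send(attribute) }
        .transform_values { |group| group.count(&:paid).fdiv(group.size) }
end

puts pay_rate_by(PEOPLE, :age_range).inspect
puts pay_rate_by(PEOPLE, :likes_cake).inspect
```

Looking at age alone, the under-20s pay twice as often as the over-40s, so age looks like the deciding factor. Once ‘likes cake’ is in the picture, it accounts for every payment in this toy set, and age only mattered because more of the younger people happened to like cake.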

Furthermore, it’s important to realize that many of these values may be affected by whether or not users pay. If you are including ‘cakes ordered’ in your analysis of whether users sign up for a monthly cake subscription, you will want to be sure to only include cakes that were ordered before the user started the monthly subscription. Otherwise, your conclusions will likely be affected by the fact that paying users are more likely to be ordering cakes. That is, paying users are the ones most likely to be using your service, so concluding that usage leads to payment is probably too simple.
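
As a rough sketch of that filtering, assuming you store a timestamp on each order and on the start of the subscription: the `Order` and `Customer` structs and the `cakes_ordered_before_subscribing` helper below are hypothetical stand-ins for whatever your own models look like.

```ruby
# Hypothetical stand-ins for your own models; only the timestamps matter.
Order = Struct.new(:ordered_at)
Customer = Struct.new(:orders, :subscribed_at)

# Count only the cakes ordered before the subscription started. For users
# who never subscribed (subscribed_at is nil), every order counts.
def cakes_ordered_before_subscribing(customer)
  cutoff = customer.subscribed_at
  return customer.orders.size if cutoff.nil?
  customer.orders.count { |order| order.ordered_at < cutoff }
end

subscriber = Customer.new(
  [Order.new(Time.utc(2009, 1, 10)), Order.new(Time.utc(2009, 3, 5))],
  Time.utc(2009, 2, 1)
)

# Only the January order predates the February subscription.
puts cakes_ordered_before_subscribing(subscriber)
```

You would then bucket this count, the same way as age and salary, and use it as the ‘cakes ordered’ attribute in the training rows.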

Let me know if you have any other interesting ways to use this.
