Article Reviews

Model Thinking: Notes from Coursera Course by Scott Page

I haven’t been actively posting for some time, but I do have a number of draft posts that are ~90% there. Rather than spending a lot more time to do that last 10% which may not have much value-add in terms of content, I’ll try to get some of these out now with minimal commentary. This post is one.

I did the Model Thinking Coursera course, taught by Scott E. Page from the University of Michigan, some time ago. I took some notes, as follows.


Why You Find Clusters of People Who Are Similar

  • When you find clusters of people looking the same, acting the same, behaving the same, believing in the same thing, etc., there are two possible causes
    • Sorting: People move to cluster with people like them (e.g. social groups in school)
    • Peer effects: People act like the people around them (e.g. different regions in the U.S. use pop, coke, soda, to mean the same thing)

Schelling’s Tipping Model – Model of Racial Segregation

  • Phenomena
    • In terms of where we live, there is both racial and income segregation.
  • Model
    • Each person has a threshold as to their requirement to stay in a place, e.g. need 30% of neighbours to be the same race, need 40% of neighbours to be around the same income level. If their threshold is not met, they move.
  • Implications
    • If micro-level threshold required is 30% similar, you end up with 72% similar at the macro level.
    • If micro-level threshold required is 40% similar, you end up with 80% similar at the macro level.
    • If micro-level threshold required is 52% similar, you end up with 94% similar at the macro level.
    • If micro-level threshold required is 80% similar, there is no equilibrium.
    • Micromotives are not aligned with macrobehavior
  • Why it’s called the tipping model
    • Exodus tip: People moving out cause people living there to move out
    • Genesis tip: People moving in cause people living there to move out
  • Segregation index
    • b = # blue in block, B = # blue total
    • y = # yellow in block, Y = # yellow total
    • Tentative measure = abs( b/B – y/Y )
    • To get the measure for a district consisting of many blocks, just sum up the measure for each block.
    • If a district has blocks that are all equally mixed according to the proportion of B to Y, then tentative measure = 0. If the district is perfectly segregated, then tentative measure = 2. So by dividing the tentative measure by 2, you get a score that is on the scale of 0 to 1, 0 = perfectly mixed, 1 = perfectly segregated.
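
The segregation index above can be sketched in a few lines of Python (my own illustration; the block counts are made up):

```python
# Segregation index: per-block measure = |b/B - y/Y|, summed over
# blocks, divided by 2 to scale onto [0, 1].

def segregation_index(blocks, B, Y):
    """blocks: list of (b, y) counts per block; B, Y: city-wide totals."""
    return sum(abs(b / B - y / Y) for b, y in blocks) / 2

# Perfectly mixed: every block has the same blue:yellow proportion -> 0
assert segregation_index([(2, 2), (3, 3)], B=5, Y=5) == 0.0
# Perfectly segregated: each block holds only one colour -> 1
assert segregation_index([(5, 0), (0, 5)], B=5, Y=5) == 1.0
```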

Granovetter’s Model – Model of People’s Willingness to Participate in a Collective Behavior

  • Model
    • N individuals
    • Each person has a threshold (T_j) for person j
    • Join the behavior if T_j others join
  • Examples
    • Thresholds: [1, 1, 1, 2, 2] = nobody takes part
    • Thresholds: [0, 1, 2, 3, 4] = everyone takes part
  • Implications
    • The tail wags the dog.
    • Collective action is more likely if there are
      • Lower thresholds
      • More variation in thresholds (in order to cause the cascading effect)
    • Even though the average threshold is lower in the first example compared to the second example, nobody takes part. You need to know both the average and the distribution of the thresholds.
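
Granovetter's cascade can be simulated directly; a minimal sketch (my own code) reproduces the two threshold examples above:

```python
def participants(thresholds):
    """Iterate to a fixed point: person j joins once at least
    T_j others have already joined."""
    joined = set()
    changed = True
    while changed:
        changed = False
        for j, t in enumerate(thresholds):
            if j not in joined and len(joined) >= t:
                joined.add(j)
                changed = True
    return len(joined)

assert participants([1, 1, 1, 2, 2]) == 0   # nobody starts the cascade
assert participants([0, 1, 2, 3, 4]) == 5   # full cascade, one at a time
```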

Standing Ovation Model – Model of Peer Effects

  • Signal S = Q (quality of show) + E (error)
    • Q is a measure of the average score that people give to the show
    • E is a measure of the diversity in the interpretation of the show (e.g. audience is more diverse, performance is complex and multidimensional)
  • Threshold T to stand
    • If S > T, stand; and
    • Stand if more than X% of people stand
  • Types of audience
    • Celebrities – audience sitting in front who cannot see anybody but everybody sees them
    • Academics – audience sitting at the back who sees everybody but people cannot see them
    • The celebrities really don’t know what the other people think, but everybody else is using them as a cue for what to do.
  • Groups
    • If people attend in groups, then when one person in a group stands up, the whole group stands up
  • Implications
    • Higher Q, more stand
    • Lower T, more stand
    • Lower X, more stand
    • If Q < T, the more variation in E, more stand
    • If people sitting in the front stand, more stand
    • Create big groups of people, more stand


Central Limit Theorem

  • When we sum up a bunch of random variables that are independent and have finite variance, the sum is a normal distribution
    • 68% of the time, the sum is within +/- 1 standard deviation
    • 95% of the time, the sum is within +/- 2 standard deviations
    • 99.7% of the time, the sum is within +/- 1 standard deviations
    • 3.4 out of 1 million times, the sum falls outside +/- 6 standard deviations (the Six Sigma convention, which allows for a 1.5 standard deviation shift in the mean)
  • Binomial distribution
    • Distribution of the sum of N binary outcomes, each occurring with probability p
    • Mean = p*N
    • Standard Deviation = Sqrt ( p*(1-p)*N )
  • Example application
    • 380 seats on the plane, airline sells 400 tickets.
    • 90% of the people who buy a plane ticket show up, so for each person (i.e. a random variable), p = 0.9.
    • Mean = 400*0.9 = 360, Standard deviation = Sqrt ( 0.9*0.1*400 ) = 6
    • Within 1 s.d., total number of people showing up is 354 to 366
    • Within 2 s.d., total number of people showing up is 348 to 372
    • Within 3 s.d., total number of people showing up is 342 to 378
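
The airline example is just the binomial mean and standard deviation formulas above, e.g.:

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial sum of n trials."""
    return n * p, math.sqrt(n * p * (1 - p))

mean, sd = binomial_mean_sd(400, 0.9)
assert math.isclose(mean, 360) and math.isclose(sd, 6)
# So 95% of the time, between 348 and 372 passengers show up;
# overfilling a 380-seat plane is a 3+ standard deviation event.
```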

Six Sigma

  • Know the range of ‘measurements’ that your process needs to produce, e.g. metal thickness 500-560mm.
  • Mean of the process is 530mm.
  • To get the required metal thickness within 6 sigmas, standard deviation needs to be (560-530) / 6 = 5mm.

Game of Life

  • Setup
    • 2-D grid of cells. Each cell has 8 neighbors.
    • Each cell has 2 possible states: on or off
  • Rules
    • If off, you turn on if 3 neighbors are on
    • If on, stay on if 2 or 3 neighbors are on, else turn off
  • Implications
    • Different phenomena can be produced depending on the initial condition:
      • Class I: Fixed pattern (i.e. no movement)
      • Class II: Alternating patterns
      • Class III: Random patterns
      • Class IV: Complex (i.e. nice) patterns
    • Simple rules produce incredible phenomena
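
The two rules above are easy to implement; a compact sketch (my own code, using a set of live cells so the grid is unbounded):

```python
from collections import Counter

def life_step(live):
    """One Game of Life update; `live` is a set of (x, y) cells that are on."""
    counts = Counter((x + dx, y + dy)
                     for (x, y) in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # turn on with exactly 3 live neighbors; stay on with 2 or 3
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# A vertical "blinker" alternates with its horizontal form (a Class II pattern)
blinker = {(0, -1), (0, 0), (0, 1)}
assert life_step(blinker) == {(-1, 0), (0, 0), (1, 0)}
assert life_step(life_step(blinker)) == blinker
```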

Cellular Automata

  • Setup
    • 1-D set of cells
    • Rules depend on the states of 3 cells: cell on the left, cell itself, and the cell on the right. Hence one complete rule set needs to describe 2^3 = 8 rules.
    • There are a total of 2^8 = 256 complete rule sets.
    • Each rule set is numbered by reading its 8 output bits as an 8-bit binary number, giving rule numbers 0 to 255.
  • Langton’s Lambda
    • For a rule set, lambda = the number of the 8 output bits that are on
    • All rule sets that produce Class III patterns have lambdas between 2 and 6
    • All rule sets that produce Class IV patterns have lambdas between 3 and 5
  • Conclusion
    • Chaos (Class III) and complexity (Class IV) are caused by intermediate levels of interdependence (i.e. whether I am on or off depends a lot on the pattern of the left, myself, and right cells).
    • With lambdas close to 0 or close to 8, most of the rules lead to the same outcome, so there are lower levels of interdependence.
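
The rule-number encoding above can be sketched as follows (my own code; bit k of the rule number gives the new state for the neighbourhood whose 3 bits encode the value k):

```python
def ca_step(cells, rule):
    """One update of a 1-D elementary cellular automaton
    (wrap-around boundary); `rule` is the rule number 0-255."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# Rule 254 (binary 11111110) turns a cell on whenever any of the 3 cells is on:
row = [0, 0, 1, 0, 0]
assert ca_step(row, 254) == [0, 1, 1, 1, 0]
```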

Aggregation of Preferences

  • Illustrative example
    • A = Apple, B = Banana, C = Coconut, > = prefers left item over right item
    • Person 1: A > B > C
    • Person 2: B > C > A
    • Person 3: C > A > B
    • Implied collective preference
      • C > A > B > C (non-transitive, i.e. not rational)
      • C > A because 2 out of the 3 have C > A
      • A > B because 2 out of the 3 have A > B
      • B > C because 2 out of the 3 have B > C
  • Condorcet Paradox
    • Even if each person has rational preferences (i.e. transitive: if A > B and B > C, then A > C), the aggregated preferences may not be transitive.
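
The cycle in the illustrative example can be checked mechanically (my own sketch):

```python
# Each ranking lists items from most to least preferred
rankings = [list("ABC"), list("BCA"), list("CAB")]

def majority_prefers(x, y):
    """True if a majority of voters ranks x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

# Pairwise majorities form a cycle: A > B, B > C, C > A
assert majority_prefers("A", "B")
assert majority_prefers("B", "C")
assert majority_prefers("C", "A")
```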


Multi-Criteria Decision Making

  • Comparatively
    • Using a set of common parameters, compare two options with each other.
    • For each parameter, decide which option is better. +1 to the option’s score if it is better, 0 if it is worse.
    • For each option, add up the score across all parameters.
    • Pick the option with the higher score.
  • Quantitatively
    • Assign weights to each parameter
    • For each parameter, decide which option is better.
    • Calculate the total weighted score for each option

Spatial Choice Models

  • Each person has an ideal point, and will pick the choice closest to his/her ideal point.
  • For each parameter, write down the value for the ideal point.
  • Then, for each parameter of each choice, calculate the deviation from the ideal point.
  • Then calculate which is the choice that is closest overall to the ideal point (e.g. 6 parameters is like a point in 6-dimensional space)
  • A parameter where a person has a spatial preference is one where a person would likely have a subjective preference on the ideal value (e.g. size). For parameters where more is better (e.g. speed) or less is better (e.g. price), those are not appropriate to have spatial preferences.

Value of Information

  1. Calculate value without the information
  2. Calculate value with the information (since you need to decide before you get the information, you still need to put probabilities to the different answers that the information might reveal)
  3. Value of information = Difference between 1 and 2 above
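
A worked example of the three steps (the numbers here are my own illustration, not from the course):

```python
# Hypothetical decision: launch a product that pays 100 if demand is
# high (probability 0.5) and -60 if demand is low.
p_high = 0.5

# 1. Without information: launch only if the expected value is positive.
ev_without = max(p_high * 100 + (1 - p_high) * -60, 0)   # launch: EV = 20

# 2. With perfect information: launch only when told demand is high.
ev_with = p_high * 100 + (1 - p_high) * 0                # EV = 50

# 3. Value of information = difference between the two.
value_of_information = ev_with - ev_without
assert value_of_information == 30.0
```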


Types of Models

  • Rational Actor
    • Assume that there is a mathematical objective function that someone is trying to optimize.
    • Applies when there are large stakes, repeated situations, group decisions, and when it is straightforward (e.g. $20 vs $10)
  • Behavioral
    • People are not rational in systematic ways.
  • Rule Based
    • Assume that people follow certain rules, e.g. Schelling model

Behavioral Biases

  • Prospect Theory
    • People are risk adverse when looking at gains (i.e. choose certain gain vs. taking a chance with a higher expected value), and risk-loving when looking at losses (i.e. takes a chance with a lower expected loss vs. choosing certain loss).
  • Hyperbolic Discounting
    • People discount the near future a lot more than we discount that same amount of time in the far future. Immediate gratification matters a lot more.
  • Status Quo Bias
    • Tendency to stick to what we’re currently doing and not make changes.
  • Base Rate Bias
    • People are influenced by what they are currently thinking. E.g. if you were asked what year you think an object was made, and then the price, the two numbers you come up with would be close.

Types of Rules for Rule-Based Models

  • Rules
    • Fixed vs. Adaptive – Adaptive rules mean you change your rules depending on what is happening
  • Contexts
    • Decision vs. Game – Game is a strategic context where your payoff depends on what other people do.


Categorical Models

  • Categorizing data helps to explain the variation in the data.
  • Example
    • You have the calorie data for a bunch of fruits and desserts. If you compute the total variation (sum of squared difference between the data point and the mean), you get a certain large figure.
    • When you split the objects into fruits and desserts, and calculate the variation of each object to the mean of the class of objects, then total variation drops down a lot.
    • If Total Variation = 53,200, Fruit Variation = 200, Dessert Variation = 5000, then categorizing helped to explain (53,200 – 5,200) / 53,200 = 48,000 / 53,200 = 90.2% of the variation (a.k.a R-Squared)
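
The R-Squared calculation above in code form (my own sketch, using the figures from the example):

```python
def r_squared(total_variation, within_variations):
    """Share of total variation explained by the categorisation."""
    within = sum(within_variations)
    return (total_variation - within) / total_variation

r2 = r_squared(53_200, [200, 5_000])
assert abs(r2 - 48_000 / 53_200) < 1e-12   # about 90.2%
```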

Linear Models – Interpreting Output

  • R-Squared
    • The % amount of total variation (calculated in the same way above using the mean of all data points) explained by the line (variation after the line is calculated by sum of squared distances between the data points and the line).
  • Standard error of regression
    • How much variation was there in the data to begin with
  • Standard error of coefficient
    • 68% confidence that coefficient value is +/- 1 standard error from the estimated coefficient.
    • 95% confidence that coefficient value is +/- 2 standard error from the estimated coefficient.
  • P-value of a coefficient
    • Probability that the sign of the coefficient is wrong.
    • So p-value = 0 and coefficient is positive means that we are absolutely sure that the coefficient is positive. If p-value = 1.4% and coefficient is positive means that there is a 1.4% chance that the coefficient could be negative.

Linear Models – Incorporating Non-Linearity

  • Two ways to incorporate non-linearity
    • Fit straight lines to different segments of the data
    • Use non-linear terms in the linear regression

Linear Models – Pitfalls To Be Aware Of

  • Correlation is not causation
  • Extrapolation is affected by feedback: other variables changing as a result of the change in your ‘independent variable’
  • Big coefficient thinking (i.e. only focusing on the variables with big coefficients) can make people blind to new innovative approaches to addressing issues.


Percolation Model

  • You have a 2-D grid. Let P be the probability that you fill in a square. You can move from one filled in square to another.
  • For a grid to percolate, you must be able to move from one side to the other side of the grid.
  • When P < 59.2746%, the grid doesn’t percolate. Once P goes above that tipping point, it percolates.

Diffusion Model

  • At time t, W_t people out of total N people have the disease.
  • tau is the transmission rate.
    • When two randomly chosen people meet, the probability that the disease is transmitted = (W_t / N) * ( (N – W_t) / N) * tau (one person must be infected, the other susceptible)
  • c is the contact rate
    • N * c = # of meetings out of N people.
    • N * c * (W_t / N) * ( (N – W_t) / N) * tau = # of new people contracting the disease
  • Diffusion
    • W_(t+1) = W_t + N * c * (W_t / N) * ( (N – W_t) / N) * tau
    • Process starts out slow, it accelerates, then it decelerates. There is no tipping point.
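
Iterating the recurrence shows the S-curve; a quick sketch (parameter values are my own illustration):

```python
def diffusion(N, c, tau, W0, steps):
    """Iterate W_{t+1} = W_t + N*c*(W_t/N)*((N-W_t)/N)*tau."""
    W = [W0]
    for _ in range(steps):
        Wt = W[-1]
        W.append(Wt + N * c * (Wt / N) * ((N - Wt) / N) * tau)
    return W

W = diffusion(N=1000, c=0.5, tau=0.4, W0=1, steps=100)
growth = [b - a for a, b in zip(W, W[1:])]
# S-curve: growth accelerates early, decelerates late, W stays below N
assert growth[1] > growth[0] and growth[-1] < max(growth)
assert W[-1] < 1000
```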

SIS Model (Susceptible, Infected, and then Susceptible)

  • a is the rate of infected people getting better
    • W_(t+1)
    • = # of infected in previous period + # of newly infected – # cured
    • = W_t + N * c * (W_t / N) * ( (N – W_t) / N) * tau – a * W_t
    • = W_t + W_t * ( c * tau * (N – W_t) / N – a)
  • Basic reproduction number
    • If W_t is small, then (N – W_t) / N ~= 1
    • The disease spreads if c * tau > a, i.e. c*tau / a > 1.
    • c*tau / a = basic reproduction number
    • Tipping point for spreading is when c*tau / a > 1.
  • Example usage
    • When you vaccinate V% of people, the reproduction number drops by that % (I’m thinking the impact is on tau)
    • Old reproduction number * (1 – V) = New reproduction number
    • To get the new reproduction number <= 1
      • 1 – 1 / old reproduction number <= V
      • So we can calculate the % V that of the population required to be vaccinated to prevent a disease from spreading, using the reproduction number of a disease.
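
The vaccination threshold calculation is a one-liner:

```python
def vaccination_threshold(R0):
    """Fraction V that must be vaccinated so the effective
    reproduction number R0 * (1 - V) drops to 1."""
    return 1 - 1 / R0

assert vaccination_threshold(2) == 0.5              # R0 = 2 -> vaccinate 50%
assert abs(vaccination_threshold(5) - 0.8) < 1e-12  # R0 = 5 -> vaccinate 80%
```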

SIR Model (Susceptible, Infected, and then Recovery)

  • Unlike the SIS model, some diseases after you recover from them you don’t get it again. In such situations, use the SIR Model.

Direct vs. Contextual Tips

  • Direct tip: small action or event that has a large effect on end state.
  • Contextual tip: change in the environment by a tiny bit has a huge effect on the end state (e.g. percolation model)

Measuring Tippyness

  • Diversity Index
    • Tells you approximately how many different types of things there are.
    • Diversity index = 1 / Sum (probability of each type ^ 2)
    • E.g. when you have 4 possible outcomes each with 1/4 probability, the diversity index = 4.
  • Entropy
    • Entropy = – Sum ( probability of each type * log_2 (probability of each type) )
    • Entropy tells us the number of pieces of information we have to know in order to identify the outcome (i.e. type).
    • E.g. when you have 4 possible outcomes each with 1/4 probability, entropy = 2. Two yes/no questions identify the outcome: (i) is it one of the first two outcomes? (ii) of the remaining two, is it the first one?
  • Tipping point
    • When something goes over a tipping point, the diversity index or entropy goes up or down, e.g. initially you can go left or right, after the tipping point you can only go right.
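
Both measures in code (my own sketch), checked against the 4-outcome example above:

```python
from math import log2

def diversity_index(probs):
    """Approximate number of distinct types: 1 / sum of squared probabilities."""
    return 1 / sum(p ** 2 for p in probs)

def entropy(probs):
    """Expected number of yes/no questions needed to identify the outcome."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform4 = [0.25] * 4
assert diversity_index(uniform4) == 4.0
assert entropy(uniform4) == 2.0
```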


Basic Growth Model

  • Assumption 1: Output is increasing and concave in labor and machines
    • O_t = Sqrt (L_t) * Sqrt (M_t)
  • Assumption 2: Output is consumed or invested
    • O_t = E_t + I_t
    • I_t = s * O_t, where s is the savings rate
  • Assumption 3: Machines can be built but they depreciate
    • M_(t+1) = M_t + I_t – d*M_t, where d is the depreciation rate
  • Long run equilibrium occurs when Investment = Depreciation
    • I_t = d * M_t
    • s * O_t = d * M_t
    • s * Sqrt (L_t) * Sqrt (M_t) = d * M_t
    • Solve for M_t
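
Solving gives M* = (s/d)^2 * L; a sketch that checks the equilibrium condition (parameter values are my own illustration):

```python
from math import sqrt

def equilibrium_machines(s, d, L):
    """Solve s * sqrt(L) * sqrt(M) = d * M  ->  M* = (s/d)^2 * L."""
    return (s / d) ** 2 * L

s, d, L = 0.2, 0.1, 100
M = equilibrium_machines(s, d, L)
assert M == 400.0
# At M*, investment equals depreciation:
assert abs(s * sqrt(L) * sqrt(M) - d * M) < 1e-9
```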

Solow Growth Model

  • Adds an additional Technology parameter
    • Output O_t = A_t * K_t ^ beta * L_t ^ (1-beta)
    • A_t = Technology at time t
    • K_t = Capital at time t
    • L_t = Labor at time t
  • Innovation multiplier
    • If A_t = 2, equilibrium output is 4x the basic growth model.
    • If A_t = 3, equilibrium output is 9x the basic growth model.
    • As labor and capital become more productive, there are more incentives to invest in more capital.


Perspectives, Heuristics, and Teams

  • A perspective is a representation of the set of all possible solutions (i.e. a way of looking at the problem and the solutions). Better perspectives have fewer local optima.
  • Different heuristics approach the problem solving in different ways.
  • A team can only get stuck on a solution that’s a local optimum for every member of the team. With diverse perspective, and diverse heuristics, the diversity will give us different local optima, and those different local optima will mean that when we take intersections across all members, we end up with better points.
  • Communication errors, and errors in evaluating solutions, can hurt teams.

Markov Models

  • Build a matrix of transition probabilities from one state to another. A vector can be used to represent the current state. Multiplying the matrix of transition probabilities with the vector will give you another vector representing the next state. An equilibrium state can then be calculated by solving for the vector that stays the same after multiplication.
  • Example
    • Vector = [No. of alert people, No. of bored people]
    • Matrix = [{Prob of alert people staying alert, Prob of bored people turning alert}, {Prob of alert people turning bored, Prob of bored people staying bored}]
    • Matrix * Vector = Vector giving the number of alert and bored people at the next state
  • Markov Convergence Theorem – Given the following 4 conditions, a Markov process converges to an equilibrium distribution which is unique
    • Finite states
    • Fixed transition probabilities
    • Can eventually get from any one state to any other
    • Not a simple cycle
  • Interventions that change the number of elements in a state do not change the eventual equilibrium point. However, changing transition probabilities makes permanent changes to the equilibrium point.
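
The alert/bored example, iterated to its unique equilibrium (the transition probabilities here are my own illustration, not from the course):

```python
def markov_step(matrix, vec):
    """matrix[i][j] = P(state j -> state i); returns the next state vector."""
    return [sum(matrix[i][j] * vec[j] for j in range(len(vec)))
            for i in range(len(matrix))]

# Hypothetical probabilities: alert stays alert 0.8, bored turns alert 0.25
P = [[0.8, 0.25],
     [0.2, 0.75]]
v = [100, 0]                 # start with 100 alert, 0 bored people
for _ in range(200):
    v = markov_step(P, v)
# Equilibrium ratio alert:bored = 0.25:0.2, i.e. 5/9 and 4/9 of 100
assert abs(v[0] - 500 / 9) < 1e-6 and abs(v[1] - 400 / 9) < 1e-6
```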


Lyapunov Functions

  • Maximum version
    • F(x) is a Lyapunov function if
      • It has a maximum value; and
      • There is a k > 0 such that if x_(t+1) != x_t, F( x_(t+1) ) > F( x_t ) + k
  • Minimum version
    • F(x) is a Lyapunov function if
      • It has a minimum value; and
      • There is a k > 0 such that if x_(t+1) != x_t, F( x_(t+1) ) < F( x_t ) – k
  • Key points
    • If a Lyapunov function exists, then at some point x_(t+1) will be equal to x_t, i.e. the system goes to equilibrium.
    • An externality is an action by one party that materially affects the happiness of someone who is not directly a party to the action. Without externalities or with positive externalities, it is easy to construct a Lyapunov function. If there are negative externalities (e.g. making myself happier will make other people less happy), then the system could continue to churn.
    • The equilibrium point can be at a point that is not the maximum or minimum value of the function.
  • Compared with Markov processes
    • Markov process has a unique equilibrium point / distribution, Lyapunov functions can have different equilibrium points depending on the starting condition.
    • With Lyapunov functions, the system stops at equilibrium; with Markov processes, the system can keep churning while the distribution stays the same.
  • Examples
    • Organization in cities
      • People switch between two locations so as to avoid crowds.
      • Lyapunov function – Total number of people that all people meet each week.

Pure Coordination Game

  • 2 players, if both choose the same option, both gets the same reward, if different options were chosen, both gets nothing.
  • With the pure coordination game, people will change their behavior to match those around them.

Axelrod’s Culture Model

  • Setup
    • Features (e.g. coordination games): {1, 2, …, N}
    • Traits (what action you take on that feature): a_i in {1, 2, 3, 4, 5, 6, 7} if there are 7 options for a particular feature.
    • Person (vector of traits on features) : (a_1, a_2, …, a_i, …, a_N)
    • Each person is placed in a 2-D grid. Each cell has 1 person.
    • Look at the 4 neighbors, and interact with a probability that is equal to the similarity with your neighbor (i.e. % of traits that agree), i.e. if the neighbor is like us, we tend to interact with them, if not, we tend not to. Upon interaction, pick a feature, and match their traits.
  • Results
    • People near each other will either be exactly the same or differ by a lot.
    • Boundaries can be self-reinforcing. People don’t interact across the boundaries, and the cultures remain disparate.

Bednar et al Model

  • Consistency rule
    • Pick two attributes, set the value of the second equal to the value of the first.
    • Assume that in addition to people trying to coordinate, they are also trying to be consistent
  • Introduce innovation / errors
    • There is a chance of a trait being changed regardless of consistency or coordination.
    • Innovation / errors can propagate in two directions, across different traits in the same person, or across persons in the same feature.
    • Small errors lead to substantial population level heterogeneity.
  • Conclusions
    • Culture = multiple coordination games where we are trying to be consistent.
    • Ways in which people can coordinate differently:
      • Idiosyncratically coordinate on the wrong things.
      • Payoffs on a particular trait can change over time (e.g. shaking vs. bowing)
      • A way of coordination may be suboptimal but it makes us consistent.


Urn Models

  • Basic model: There are blue and red balls in an urn.
  • Bernoulli model: Pick a ball and return. Outcomes are independent.
  • Polya process
    • Initial Urn = {1 Blue, 1 Red}
    • Pick a ball, see its color, and add in a new ball that is the same color as the ball selected.
    • Result 1: Any probability of red balls is an equilibrium and equally likely, i.e. for a sequence of N balls picked out, P(# of R balls = 0) = P(# of R balls = 1) = P(# of R balls = 2) = …. = P(# of R balls = N).
    • Result 2: Any history (i.e. sequence) of B blue and R red balls is equally likely, i.e. the probability of getting a sequence is not affected by the order of getting the B and R balls, just by the # of B and R balls in the sequence.
  • Balancing process
    • Initial Urn = {1 Blue, 1 Red}
    • Inverse of Polya. Add a new ball that is the opposite color as the ball selected.
    • Result: The balancing process converges to equal percentages of the two colors of balls.
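
Both Polya results can be verified exactly with fractions (my own sketch):

```python
from fractions import Fraction
from math import comb

def sequence_prob(seq):
    """Exact probability of a given colour sequence ('B'/'R') in the
    Polya process starting from {1 Blue, 1 Red}."""
    blue, red = 1, 1
    p = Fraction(1)
    for colour in seq:
        total = blue + red
        if colour == "B":
            p *= Fraction(blue, total)
            blue += 1
        else:
            p *= Fraction(red, total)
            red += 1
    return p

# Result 2: order doesn't matter, only the counts do
assert sequence_prob("RRB") == sequence_prob("BRR") == sequence_prob("RBR")
# Result 1: over 3 draws, each count of red (0..3) is equally likely (1/4)
for r in range(4):
    assert comb(3, r) * sequence_prob("R" * r + "B" * (3 - r)) == Fraction(1, 4)
```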

Path Dependent Outcomes and Equilibria

  • Path dependent outcomes = color of ball in a given period depends on the path
  • Path dependent equilibrium = % of red balls in long run depends on the path
  • Polya process has both path dependent equilibria and path dependent outcomes.
  • Balancing process has only path dependent outcomes.

Path Dependence and Phat Dependence

  • Path dependent = outcome probabilities depend upon the sequence of past outcomes
  • Phat dependent = outcome probabilities depend upon past outcomes but not their order
  • Polya process is Phat. E.g. getting RRB balls result in the same outcome probabilities as getting BRR balls.
  • Path dependence is typically not a deterministic process, because what happens along the way has an impact on the outcome.
  • Independent = Outcome doesn’t depend on starting point or what happens along the way, it is probabilistic, not deterministic.

Sway Process – Getting Full Path Dependence

  • Initial Urn = {1 Blue, 1 Red}
  • Process
    • In period t, add a ball of the same color as the selected ball, and add 2^(t-s) – 2^(t-s-1) balls of the color chosen in each earlier period s < t
  • If the past takes on more weight over time then you can get full path dependence (i.e. every single draw from the urn matters).

Markov Process Are Not Path Dependent

  • Markov processes are not path dependent because they have fixed transition probabilities. In the Polya process, the transition probabilities change.

Path Dependence and Chaos

  • Chaos = Extreme Sensitivity to Initial Conditions (ESTIC)
  • Example: Tent Map
    • X in (0, 1)
    • F(X) = 2X if X < 0.5, 2-2X if X >= 0.5
    • Outcome is deterministic, it all depends on the initial point. It is not path dependent.
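
The tent map in code (my own sketch); the gap between two nearby starting points roughly doubles each step, which is the extreme sensitivity to initial conditions:

```python
def tent(x):
    """Tent map on (0, 1): deterministic, but chaotic."""
    return 2 * x if x < 0.5 else 2 - 2 * x

# A few exact evaluations:
assert tent(0.25) == 0.5
assert tent(0.5) == 1.0
assert tent(0.75) == 0.5

# Two nearby starting points diverge quickly:
a, b = 0.2, 0.2001
for _ in range(5):
    a, b = tent(a), tent(b)
assert abs(a - b) > 0.001   # the initial gap of 0.0001 has grown 2^5-fold
```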

Path Dependence and Increasing Returns

  • You can get increasing returns without path dependence.
    • Gas/electric urn model. Initial U = {5 B, 1 R}.
    • If pick R, add 1 B and 1 R. If pick B, add 10 B.
    • There is increasing returns in both R and B balls, but the process is not path dependent.
  • You can get path dependence with increasing returns
    • Symbiots urn model. Initial U = {1 B, 1 R, 1 G, 1 Y}.
    • If pick R, add 1 G. If pick G, add 1 R.
    • If pick B, add 1 Y. If pick Y, add 1 B.
    • It doesn’t have increasing returns for each color, but it is path dependent because picking either R or G increases the probability of getting R or G in future, similarly for B and Y.
  • Large public projects are likely to bump into each other, creating externalities. Since one project can affect other projects, these externalities can create path dependence.

Path Dependence and Tipping Points


Structure of Networks

  • Degree
    • Degree (node): # of edges attached to a node
    • Degree (network): Average degree of all nodes = 2*Edges / Nodes
    • Average degree of neighbors of nodes will be at least as large as the average degree of the network.
    • Applications: Measure of density of connections, social capital in the network (e.g. productive capacity of the group), and speed of diffusion.
  • Path Length
    • Path length A to B: minimal # of edges that must be traversed to go from node A to node B.
    • Average path length: Average path length between all pairs of nodes in a network.
    • Applications: # flights needed, social distance, likelihood of information spreading
  • Connectedness
    • A graph is connected if you can get from any node to any other.
    • Applications: Markov process, terrorist group capabilities, internet/power failure, information isolation
  • Clustering coefficient
    • % of triples of nodes that have edges between all three nodes. The maximum number of triples is given by N choose 3, where N is the number of nodes.
    • Applications: Redundancy/robustness, social capital, innovation adoption (triangles)
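
The degree and clustering measures above, computed for a small example graph of my own:

```python
from itertools import combinations

edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}   # 4 nodes, 4 edges
nodes = {n for e in edges for n in e}

def connected(x, y):
    return (x, y) in edges or (y, x) in edges

# Network degree = 2 * Edges / Nodes
assert 2 * len(edges) / len(nodes) == 2.0

# Clustering coefficient = fraction of node triples with all 3 edges present
triples = list(combinations(sorted(nodes), 3))
closed = sum(all(connected(x, y) for x, y in combinations(t, 2)) for t in triples)
assert (closed, len(triples)) == (1, 4)    # only {A, B, C} forms a triangle
```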

Logic of Network Formation

  • Random attachment
    • N nodes, p = probability of two nodes being connected.
    • There is a contextual tipping point: For large N, the network almost always becomes connected when p > 1/(N-1)
  • Small worlds
    • People have some percentage of “local” or “clique” friends and some percentage of random friends.
    • When you have more random friends, we get less clustering and we also get a shorter average path length.
  • Preferential attachment
    • Nodes arrive to the graph, and the probability it connects to an existing node is proportional to the node’s degree, e.g. a 4-degree node is 4x more likely to be connected to, compared to a 1-degree node.
    • This has a path dependent outcome but not a path dependent equilibrium in terms of degree distribution.

Network Function

  • Random clique network – explaining 6 degrees of separation
    • Each person has C clique friends, and R random friends (small worlds model)
    • K-neighbor = set of all nodes that are of path length K (exactly) away
      • 1-neighbors = R + C
      • 2-neighbors = CR + RR + RC (e.g. CR = neighbors that are 2 steps away via a clique friend followed by the clique friend’s random friend). Note that CC = C.
      • 3-neighbors = RRR + RRC + RCR + CRR + CRC. Note that RCC = RC, CCR = CR, CCC = C.
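
The neighbour counts above, with the collapsing rules (CC = C, RCC = RC, etc.) already applied, can be computed directly; example numbers are my own:

```python
def neighbour_counts(C, R):
    """Approximate 1-, 2-, and 3-neighbour counts in the random clique
    network, per the formulas above."""
    n1 = C + R
    n2 = C * R + R * R + R * C
    n3 = R * R * R + R * R * C + R * C * R + C * R * R + C * R * C
    return n1, n2, n3

# e.g. 10 clique friends and 2 random friends per person:
assert neighbour_counts(10, 2) == (12, 44, 328)
```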


Sources of Randomness

  • Noise (measurement)
  • Errors
  • Uncertainty
  • Complexity
  • Capriciousness

Skill and Luck

  • Outcome = a * Luck + (1-a) * Skill, a in (0, 1)
    • If you see consistent outcomes, then a is probably small. If you keep seeing large changes in outcome, then a is probably big.
  • Paradox of skill
    • When you have the very best competing, the differences in their skill levels may be close. So the winner will be determined by luck.

Random Walk

  • Process
    • X = 0
    • Each period, flip a fair coin. Heads, +1 to X. Tails, -1 to X.
  • Result 1: After N flips (N even), the expected position is 0.
  • Result 2: For any number K, a random walk will pass both -K and +K an infinite number of times.
  • Result 3: For any number K, a random walk will have a streak of K heads (and K tails) an infinite number of times.

Normal Random Walk and Efficient Market Hypothesis

  • Process
    • X = 0
    • Each period, draw a number from a normal distribution with mean 0 and standard deviation 1 and add to X.
  • Efficient Market Hypothesis (EMH)
    • Prices reflect all relevant information, so it’s impossible to beat the market.
  • Critiques of EMH
    • There is way too much fluctuation in market prices.
    • There are consistent winners.

Finite Memory Random Walks

  • E.g. Value of V at time T, V_T = X_T + X_(T-1) + X_(T-2) + X_(T-3) + X_(T-4)
  • Can be used to predict aggregate statistics
    • E.g. 28 teams, each team follows the process V_T above where X can represent players. The champion team is the one with the highest V_T. If we run this process for 28 years, the aggregate statistics are close to what we have for the NBA, NFL, MLB, etc.


Colonel Blotto Game

  • Setup
    • 2 players each with T troops
    • N fronts (T >> N)
    • Actions: allocation of troops across the N fronts. For each front, the player with more troops wins the front.
    • Payoffs: # fronts won
  • This is a zero-sum game. Any strategy has a counter-strategy that beats it.
  • There is going to be an equilibrium where we choose strategies randomly, therefore the winner is going to be random. There is skill involved if you are able to strategically figure out where your opponent will be placing his/her troops.
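
A minimal payoff function for the game (my own sketch), illustrating that every strategy has a counter:

```python
def blotto_payoff(a, b):
    """Fronts won by each player; more troops on a front wins it, ties win nothing."""
    wins_a = sum(x > y for x, y in zip(a, b))
    wins_b = sum(y > x for x, y in zip(a, b))
    return wins_a, wins_b

# 5 troops across 3 fronts:
assert blotto_payoff([2, 2, 1], [3, 1, 1]) == (1, 1)   # a draw
assert blotto_payoff([2, 2, 1], [0, 3, 2]) == (1, 2)   # the counter wins
```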

Colonel Blotto – Troop Advantages

  • As the number of fronts increases, a country needs a larger relative resource advantage to guarantee victory (i.e. the advantage of having more troops decreases as we have more fronts).
  • If you’re the weaker player, you want to add dimensions.

Multi-Player Blotto

  • With multiple players, you can get cycles. E.g. player 1 beats 3, 3 beats 2, 2 beats 1.

Models Applied to Competition

  • Random model implies
    • Equal wins
    • No time dependency
  • Skill + Luck implies
    • Unequal wins
    • Semi-consistent rankings
    • No time dependency
  • Finite memory random walk implies
    • Unequal wins
    • Semi-consistent rankings
    • Time dependency
    • Movement from top to bottom (a lot more regression to the mean)
  • Colonel Blotto (equal troops)
    • Outcomes same as random
    • Lots of maneuvering
  • Colonel Blotto (unequal troops)
    • Outcomes same as skill + luck
    • Lots of maneuvering
  • Colonel Blotto (unequal troops, limited movement)
    • Outcomes same as finite walk.
    • Lots of maneuvering
    • Cycles

Determining Whether a Game is More Blotto or Skill-Luck

  • Dimensionality: If players are making high dimensional strategic decisions, then it’s more Blotto-like.
  • Zero sum: If actions are only good relative to other actions, then it may be more Blotto-like.
  • In a skill-luck game, you can have players all getting better.


Prisoner’s Dilemma

  • Setup
    • Payoff when both players cooperate: (T, T)
    • Payoff when only player 1 defects: (F, 0)
    • Payoff when only player 2 defects: (0, F)
    • Payoff when both players defect: (R, R)
    • T > R > 0
    • F > T
    • 2T > F (if F were too large relative to T, it would make sense for the players to alternate between (F, 0) and (0, F), because the average payoff F/2 would be greater than T).
  • Pareto efficient
    • A state is Pareto efficient if there is no other state that makes someone better off without making anyone worse off.
    • Note that (R, R) is the only state that is not Pareto efficient (both players prefer (T, T), since T > R).
  • Nash equilibrium
    • A Nash equilibrium is a state in which no player can improve his/her payoff by unilaterally changing actions.
    • The Nash equilibrium here is the (R, R) state: from any other state, at least one player gains by switching to defect, so play ends up at (R, R).
  • Self-interest game
    • In a self-interest game, R > F > T.
    • The only pareto efficient outcome is (R, R), and Nash equilibrium is also (R, R).
  • Examples of Prisoner’s Dilemma
    • Arms race, price competition, technology adoption, political campaigns, food sharing, hedonic treadmills
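The setup above can be checked directly. A small sketch, with illustrative payoff numbers of my choosing that satisfy F > T > R > 0 and 2T > F, confirming that mutual defection is the unique Nash equilibrium:

```python
# Illustrative payoffs satisfying F > T > R > 0 and 2T > F.
T, R, F = 3, 1, 4

# payoffs[(a1, a2)] = (payoff to player 1, payoff to player 2),
# where action 0 = cooperate, 1 = defect.
payoffs = {
    (0, 0): (T, T),
    (1, 0): (F, 0),
    (0, 1): (0, F),
    (1, 1): (R, R),
}

def is_nash(a1, a2):
    """No player can gain by unilaterally switching actions."""
    p1, p2 = payoffs[(a1, a2)]
    return payoffs[(1 - a1, a2)][0] <= p1 and payoffs[(a1, 1 - a2)][1] <= p2

equilibria = [cell for cell in payoffs if is_nash(*cell)]
print(equilibria)  # [(1, 1)] — only mutual defection is a Nash equilibrium
```

Note that (0, 0), the Pareto-superior mutual cooperation, fails the check because either player gains F - T by defecting.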


Seven Ways to Cooperation

  • Setup
    • Cost of cooperation: c
    • Benefit to other(s): b
    • b > c
    • Individually, you incur positive cost c if you cooperate. Socially, others would prefer you cooperate.
  • Direct reciprocity
    • p = probability we meet again
    • Payoff if we deviate: 0
    • Payoff if we cooperate: -c + p*b
    • Cooperation happens if (-c + p*b) > 0, i.e. p > c/b
  • Indirect reciprocity
    • q = probability of reputation spreading to others
    • Payoff if we deviate: 0
    • Payoff if we cooperate: -c + q*b
    • Cooperation happens if (-c + q*b) > 0, i.e. q > c/b
  • Network reciprocity
    • In a regular graph, each node has the same number of neighbors.
    • If each node has k neighbors, if k < b/c, we are likely to get cooperation.
    • Look at a node sitting at the boundary between cooperators and defectors
      • If you are surrounded by k cooperators and you are also cooperating, payoff = k*(b – c)
      • If you are surrounded by (k-1) cooperators and connected to 1 defector, and you are defecting, payoff = (k-1)*b
      • To cooperate, k*(b-c) > (k-1)*b, i.e. b/c > k
  • Whether you prefer a dense or sparse network depends on the mechanism used to get cooperation
    • For direct / reputation reciprocity, the denser the network, the more cooperation.
    • For network reciprocity, the sparser the network, the more cooperation.
  • Group selection
    • Within a group where majority cooperates, defectors do better.
    • However groups that have more cooperators win more wars, so if there are frequent enough competitions between groups, you get a force towards cooperation.
  • Kin selection
    • Players are related and you care about other people based on their relatedness, r. E.g. r = 0.5 for your child.
    • You will cooperate with your kin if r*b > c
  • Laws
    • Pass laws that force people to act in society’s interests.
  • Incentives
    • Induce people to take the cooperative action using incentives / disincentives.
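The reciprocity thresholds above share the same structure: cooperation pays when some multiplier on the benefit exceeds the cost. A small sketch encoding them (function names are mine; the inequalities are the ones derived above):

```python
def direct_reciprocity_cooperates(b, c, p):
    """Cooperate if expected future benefit outweighs the cost: p > c/b."""
    return p > c / b

def network_reciprocity_cooperates(b, c, k):
    """Cooperation can hold on a regular graph of degree k when b/c > k."""
    return b / c > k

def kin_selection_cooperates(b, c, r):
    """Help kin when relatedness-weighted benefit exceeds cost: r*b > c."""
    return r * b > c

# With b = 5, c = 1: cooperation needs p > 0.2 (direct reciprocity),
# fewer than 5 neighbors (network), or relatedness above 0.2 (kin).
assert direct_reciprocity_cooperates(5, 1, 0.5)
assert not direct_reciprocity_cooperates(5, 1, 0.1)
assert network_reciprocity_cooperates(5, 1, 4)
assert not network_reciprocity_cooperates(5, 1, 6)
assert kin_selection_cooperates(5, 1, 0.5)   # e.g. r = 0.5 for your child
```

Indirect reciprocity has the same form as direct reciprocity with q (probability the reputation spreads) in place of p.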

Collective Action

  • Setup
    • Let x_j be the action of person j, where x_j is in range [0, 1]
    • Payoff to j = -x_j + beta*sum(x_i), where i = 1 to N. beta is in range (0, 1)
  • Common pool resource problem
    • x_j = amount consumed by j
    • X = total consumed
    • Amount available next period C_(t+1) = ( C_t – X )^2
  • In solving collective action problems, particulars matter
    • Cattle grazing
      • Tag cattle, and rotation scheme to prevent overgrazing of the grass.
    • Lobster
      • Need mechanisms to monitor the total population of lobsters in order to control the fishing, unlike the cattle case where the amount of resource can be clearly measured.
    • Drawing water from a stream
      • Actions of people upstream matter more because they affect downstream people.
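The common pool dynamics C_(t+1) = (C_t - X)^2 can be iterated to show why particulars matter. A minimal sketch (the function name, the zero floor, and the example parameters are my illustrative choices):

```python
def resource_path(c0, consumption, periods=10):
    """Iterate the common pool dynamics C_{t+1} = (C_t - X)^2,
    flooring the remaining stock at 0 once it is exhausted."""
    path = [c0]
    for _ in range(periods):
        remaining = max(path[-1] - consumption, 0)
        path.append(remaining ** 2)
    return path

# Leave exactly sqrt(C) behind and the stock holds steady: 4 - 2 = 2, 2^2 = 4.
sustainable = resource_path(c0=4, consumption=2, periods=5)
# Consume a bit too much and the stock collapses: 4 - 3.5 = 0.5, 0.5^2 = 0.25, then 0.
collapse = resource_path(c0=4, consumption=3.5, periods=5)
```

The squared growth means anything left above 1 compounds, while anything left below 1 shrinks, so small differences in consumption separate a sustainable commons from a collapsed one.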


Hidden Action

  • Moral hazard – setup
    • Action: effort = 0, 1 (not observed)
    • Outcome = {Good, Bad}
    • Prob(Good | effort = 1) = 1
    • Prob(Good | effort = 0) = p
    • Cost effort = c
  • Contract: Pay M if Good, 0 Bad
    • Incentive compatible: makes sense to put in effort
    • Effort 1: payoff = M – c
    • Effort 0: payoff = p*M
    • M – c >= p*M
    • M >= c / (1-p)
    • M is increasing in both c and p
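The derivation above pins down the smallest incentive-compatible payment. A one-function sketch (the function name is mine; the formula M = c / (1 - p) is the one derived above):

```python
def min_incentive_pay(c, p):
    """Smallest M satisfying M - c >= p*M, i.e. M = c / (1 - p):
    the payment needed to make effort worthwhile."""
    assert 0 <= p < 1, "p = 1 means effort is never needed"
    return c / (1 - p)

# Higher effort cost c, or a higher chance p of a good outcome without
# effort, both push the required payment up.
assert min_incentive_pay(c=1, p=0.5) == 2.0
assert min_incentive_pay(c=2, p=0.5) == 4.0
assert min_incentive_pay(c=1, p=0.75) == 4.0
```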

Hidden Information

  • Setup
    • Ability of workers: High (H), Low (L)
    • For high ability workers, cost per hour of effort = c
    • For low ability workers, cost per hour of effort = C > c
  • Contract: Pay M but first you must work K hours to test work ability
    • Incentive compatible:
    • For high ability workers: M > K*c
    • For low ability workers: M < K*C
    • Choose K so that M/C < K < M/c: low ability workers won't take the job while high ability workers will.
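The screening interval for K follows directly from the two incentive constraints. A small sketch (the function name and example numbers are illustrative):

```python
def screening_hours(M, c, C):
    """Open interval of test hours K with M/C < K < M/c, assuming C > c > 0.
    Any K strictly inside this range is accepted only by high-ability
    workers (M > K*c) and rejected by low-ability workers (M < K*C)."""
    assert C > c > 0
    return M / C, M / c

low, high = screening_hours(M=100, c=10, C=25)
# Requiring between 4 and 10 unpaid test hours screens out
# low-ability workers while still attracting high-ability ones.
```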


Auctions

  • Ascending bid
    • Individuals call out bids until no one bids a higher price.
    • Rational: Only bid up to your value.
    • Psychological: Can lead to higher prices because of the thrill of winning.
    • Rule following: Rules govern how you change your bids, but you won't go above what you deem to be the value.
    • Outcome: Highest value bidder gets it at the second highest value.
  • Second price
    • Each person submits a bid. Highest bidder gets it at the second highest bid.
    • Rational: Bid your true value.
    • Outcome: Highest value bidder gets it at the second highest value.
  • Sealed bid
    • Each person submits a bid. Highest bidder gets it at the highest bid.
    • Consider a two bidder model, value of other bidder is uniform in [$0, $1]
    • Scenario 1: Other bidder bids true value
      • Probability (Other bidder’s bid < B) = B
      • V = Value, B = Your Bid, Surplus = V – B, Probability of winning = B
      • Expected winnings = B*(V – B).
      • Take derivative w.r.t. B to get optimal B = V / 2.
    • Scenario 2: Other bidder also bids half her true value
      • Probability (Other bidder’s bid < B) = 2B
      • Expected winnings = 2B*(V – B)
      • Take derivative w.r.t. B to get optimal B = V / 2.
    • Outcome: Highest value bidder gets it at half her value, which is also the expected value of the second highest bidder.
  • Revenue Equivalence Theorem
    • With rational bidders, a wide class of auction mechanisms including sealed bid, second price, ascending price, produce identical expected outcomes (Roger Myerson)
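The equivalence can be checked by simulation for the two-bidder uniform case worked out above: in the sealed (first-price) auction each rational bidder bids V/2, while in the second-price auction each bids her true value. A Monte Carlo sketch (sample size and seed are arbitrary choices of mine):

```python
import random

def expected_revenues(n=200_000, seed=1):
    """Estimate seller revenue with two bidders whose values are
    uniform on [0, 1]: first-price (each bids V/2) vs second-price
    (each bids her true value, winner pays the second-highest bid)."""
    rng = random.Random(seed)
    first = second = 0.0
    for _ in range(n):
        v1, v2 = rng.random(), rng.random()
        first += max(v1, v2) / 2   # winner pays half her own value
        second += min(v1, v2)      # winner pays the second-highest value
    return first / n, second / n

fp, sp = expected_revenues()
# Both estimates should be close to the theoretical expected revenue of 1/3.
```

Half the expected highest of two uniforms is (2/3)/2 = 1/3, and the expected lowest of two uniforms is also 1/3, so the two mechanisms raise the same revenue on average.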

Public Projects

  • Clarke-Groves-Vickrey Pivot Mechanism
    • Pay the minimal amount you’d have to contribute for the project to be viable.
    • Each person claims value: V1, V2, V3
    • If V1 + V2 + V3 > Cost, do the project.
    • Each person pays Max {Cost – sum of other people’s values, 0}
    • Incentive compatible: Each person has an incentive to reveal her true value.
    • However, this mechanism runs into the problem where sometimes V1 + V2 + V3 > Cost, but the total amount people pay < Cost, so the project cannot be funded.
  • It is impossible to find a mechanism that is simultaneously efficient (the project gets done whenever total value > cost), voluntary (people are not coerced into joining), incentive compatible (people always report their true values), and budget balanced (payments cover the cost).
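The pivot mechanism's payment rule, and its budget shortfall, can be shown with a tiny sketch (the function name and the example values are illustrative):

```python
def pivot_payments(values, cost):
    """Clarke-Groves-Vickrey pivot mechanism: build the project iff total
    reported value covers the cost; each person then pays
    max(cost - sum of everyone else's values, 0)."""
    if sum(values) < cost:
        return None  # project not built
    return [max(cost - (sum(values) - v), 0) for v in values]

payments = pivot_payments([60, 50, 40], cost=100)
# Person 1 pays max(100 - 90, 0) = 10; the others pay 0 because the
# remaining reported values already cover the cost without them.
print(payments, sum(payments))
```

Here total value (150) exceeds the cost (100), so the project should be built, yet only 10 is collected: the mechanism is efficient and incentive compatible but not budget balanced, which is exactly the impossibility noted above.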


Replicator Dynamics

  • Set of types = {1, 2, 3, …, N}
  • Payoff for each type = pi(i)
  • Proportion of each type = Pr(i)
  • Weight of each type = pi(i)*Pr(i)
  • Proportion of each type in period T+1, Pr_(T+1)(x) = (pi_T(x) * Pr_T(x)) / Sum_i(pi_T(i) * Pr_T(i))
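The update rule above translates directly into code. A minimal sketch of one replicator step, iterated on a two-type example of my choosing:

```python
def replicator_step(proportions, payoffs):
    """One replicator-dynamics update: the new share of each type is its
    payoff-weighted share pi(i)*Pr(i), normalized by the population total."""
    weights = [pr * pi for pr, pi in zip(proportions, payoffs)]
    total = sum(weights)
    return [w / total for w in weights]

# Two types with payoffs 2 and 1, starting from equal shares.
shares = [0.5, 0.5]
for _ in range(5):
    shares = replicator_step(shares, [2.0, 1.0])
# Each step doubles the odds of the higher-payoff type, so after 5 steps
# its share is 32/33, approaching 1.
```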

Fisher’s Fundamental Theorem

  • Fundamental components
    • Variation exists (i.e. there is variation within populations / species)
    • Rugged landscapes exist (i.e. there may be local optima which make finding the global optima difficult)
    • Replicator dynamics (evolution can occur through weighing (1) the rational observation of the actual payoff to being a certain type, and (2) the number of people of that type)
  • Conclusion
    • Higher variance increases the rate of adaptation
    • The change in average fitness due to selection will be proportional to the variance (of the fitness). [Fitness in this case is the payoff in replicator dynamics]

Reconciliation with Six Sigma

  • Six Sigma prefers low variance but Fisher’s theorem advocates higher variance.
  • If the landscape stays fixed, i.e. the target stays fixed, you want to use Six Sigma.
  • If the landscape changes, you want more variation.

Diversity Prediction Theorem

  • Mathematical relationship
    • Crowd’s error = Average error – Diversity
    • Crowd’s error = (Actual answer – mean prediction)^2
    • Average error = Average squared error (compared to the actual answer) of the individuals’ predictions
    • Diversity = Average squared error (compared to the mean prediction) of the individuals’ predictions
  • Conclusion
    • Wisdom of crowds comes from reasonably smart people who are diverse
    • Madness of crowds comes from like-minded people who are all wrong
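The theorem is an exact algebraic identity on squared errors, which a short sketch makes easy to verify (the function name and example predictions are mine):

```python
def diversity_prediction(actual, predictions):
    """Return (crowd's error, average error, diversity), where
    crowd's error = average error - diversity holds exactly."""
    n = len(predictions)
    mean = sum(predictions) / n
    crowd_error = (actual - mean) ** 2
    avg_error = sum((actual - p) ** 2 for p in predictions) / n
    diversity = sum((mean - p) ** 2 for p in predictions) / n
    return crowd_error, avg_error, diversity

# Four guesses at a true value of 10; the crowd's mean guess is 11.
crowd, avg, div = diversity_prediction(10, [6, 9, 12, 17])
assert abs(crowd - (avg - div)) < 1e-9  # the identity holds exactly
```

Here the individuals' average squared error is 17.5, but their diversity of 16.5 cancels most of it, leaving the crowd's squared error at just 1.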



