step 1: simplify the problem,, forget about the millions of users

If we want to find the connection between two people, I would start with one person and do a simple BFS.

Step 2: handle millions of users

When we deal with a service the size of Linkedin or Facebook, we cannot possibly keep all of our data on one machine. Our friends may not live on the same machine as we do. Instead , we can replace out friends list with a list of their IDs, and traverse as follows:

  • For each friend ID: in machine_index = getMachineIDForUser(personId);
  • Go to machine #machine_index
  • On that machine, do: Person friend = getPersonWithID(person_id);

We have defined a class Server, which holds a list of all the machines, and class machine which represents a single machine.

Optimization 1: Reduce Machine Jumps

Jumping from one machine to another is expensive, instead of randomly jumping from machine to machine with each friend, try to batch these jumps, e.g, if five of my friends live on one machine, I should look up them all at once.

Optimization 2: Smart Division of People and Machines

People are much likely to be friends with people who live in the same country as they d. rather than dividing people up across machines, try to divide them up by country, city, state. this will reduce the number of jumps.

Question: BFS usually requires "marking" a node as visited. How do you do that in this case

It's bad to modify the data, because there could be multiple searches going on at the same time. so we could use a hash table to to save the node id which has been visited.

Follow up:

  • In the real world servers fail, how does this affect you?
  • How could you take advantage of caching?
  • do you search until the end of the graph? How do you decide when to give up?
  • 在真实世界中,你的朋友里,有一些人的朋友会更多。 那么通过他是否更有可能让你找到与特定某个人的联系。 你怎么利用这个数据来选择遍历的顺序。