Our team's goals for Project 2 are to learn about database implementation and data structures. Using the IMDb database, we are replicating the Oracle of Bacon with Redis and the actors from IMDb's Top 250 Movies. With Redis, we want to build our own graphs, data structures, and functions on top of the database. After using Redis, we want to go one step further and learn how to implement a database system like Redis ourselves.
First, we had to learn how to collect IMDb data. We decided to focus on the filmographies of actors who have a movie in the IMDb Top 250 list. This is the process we went through:
1. We stored the list of top 250 movies from the IMDb database in a dictionary, using the movie ids as keys and the movie titles as values.
2. We then used the movie ids to query the IMDb database for the full information about each movie, which gave us the cast of each movie. Many movies have very large casts, so we only used the first 10 cast members in the list.
3. We queried the database for the filmography of each actor or actress, i.e. the list of all the movies they have been in.
4. We then built a Redis database, setting each actor or actress's id as the key and a tuple of their name and filmography as the value. We chose ids as keys so that we could be confident we were using a unique identifier even if two actors or actresses had the same name.
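The keying scheme above can be sketched in Python. The ids and casts below are illustrative samples (the real code pulled full data from IMDb), and a plain dict stands in for the Redis database:

```python
# Sketch of the keying scheme: actor id -> (name, filmography).
# A plain dict stands in for Redis; ids are illustrative samples.

# Top 250 movies: movie id -> title
top_movies = {
    "tt0111161": "The Shawshank Redemption",
    "tt0068646": "The Godfather",
}

# Cast lists (first 10 members only); each entry is (actor_id, actor_name)
casts = {
    "tt0111161": [("nm0000209", "Tim Robbins"), ("nm0000151", "Morgan Freeman")],
    "tt0068646": [("nm0000008", "Marlon Brando"), ("nm0000199", "Al Pacino")],
}

fake_redis = {}
for movie_id in top_movies:
    for actor_id, actor_name in casts[movie_id][:10]:
        name, filmography = fake_redis.get(actor_id, (actor_name, []))
        filmography.append(movie_id)
        fake_redis[actor_id] = (name, filmography)

# Actor ids stay unique even if two actors share a name
print(fake_redis["nm0000151"])  # ('Morgan Freeman', ['tt0111161'])
```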
The code for getting each actor and their entire filmography into redis can be found here: Sprint 1 Version of Data Wrangling
Coming into sprint 2, we had to re-evaluate our project goals and steps in order to learn what we wanted more efficiently. While planning out the algorithms and data structures we could use for building and searching the graph, we realized that full filmographies were more data than we needed. As a result, we scrapped our original idea of using each actor's entire filmography and instead looked only at actors and their films that are in the Top 250.
Code for making the adjacency lists and putting them into Redis: Making Adjacency Lists in Redis
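The shared-movie rule behind those adjacency lists can be sketched as follows. Placeholder names stand in for IMDb ids, and a dict of sets stands in for the per-actor Redis lists (which the search code later reads back with LRANGE):

```python
# Sketch of adjacency-list construction: two actors are neighbors
# if they appear together in a Top 250 movie. Names are placeholders.
from collections import defaultdict

movie_casts = {
    "movie_a": ["actor_1", "actor_2", "actor_3"],
    "movie_b": ["actor_2", "actor_4"],
}

adjacency = defaultdict(set)
for cast in movie_casts.values():
    for actor in cast:
        # every other cast member of this movie is a neighbor
        adjacency[actor].update(a for a in cast if a != actor)

print(sorted(adjacency["actor_2"]))  # ['actor_1', 'actor_3', 'actor_4']
```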
We also started looking at how to make our own Redis commands.
We spent much of this sprint not on storing more data, but on understanding data structures and search algorithms. We wrote our own working breadth-first search algorithm and made sure it worked with our data in Redis.
```python
def bfs(graph, start, end):
    # maintain a queue of paths
    queue = []
    # push the first path into the queue
    queue.append([start])
    # track visited nodes so cycles don't loop forever
    visited = set()
    while queue:
        # get the first path from the queue
        path = queue.pop(0)
        # get the last node from the path
        node = path[-1]
        # path found
        if node == end:
            return path
        # skip nodes we have already expanded
        if node in visited:
            continue
        visited.add(node)
        # enumerate all adjacent nodes, construct a new path
        # and push it into the queue
        for adjacent in graph.get(node, []):
            new_path = list(path)
            new_path.append(adjacent)
            queue.append(new_path)
```
For this sprint, we had solidified the algorithm and the steps to reach the desired result. We first planned to implement breadth-first search in C, using linked lists and hashtables, and then, if time permitted, to implement bidirectional breadth-first search. We also wanted to use real actor names, instead of ids, in the interface portion of the Redis command. This sprint really cemented our learning of C, in terms of data structures, and our ability to use Redis commands through hiredis.
To implement breadth-first search in C, we had to create a hashtable data structure and a linked list data structure so we could implement the algorithm in a manner similar to the Python code above. The linked list was used as a queue, and the hashtable served two purposes: keeping track of the visited actors during the breadth-first search, and making up the child:parent hashtable used later to rebuild the path.
For the linked list, we had all practiced implementing this data structure in an earlier sprint exercise, so we felt comfortable reusing it. We made some modifications to make it act like a FIFO queue: the push method now appends nodes to the end of the linked list rather than the beginning, while the pop method removes a node from the beginning. This linked list is imported from a file we made called singlelinkedlist.h.
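That queue behavior (push at the tail, pop at the head) can be sketched in Python; this roughly mirrors the interface of our C linked list, though the version actually used lives in singlelinkedlist.h:

```python
# Sketch of a singly linked list used as a FIFO queue:
# push appends at the tail, pop removes from the head.
class Node:
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

class Queue:
    def __init__(self):
        self.head = self.tail = None

    def push(self, val):
        # enqueue at the end of the list
        node = Node(val)
        if self.tail:
            self.tail.next = node
        else:
            self.head = node
        self.tail = node

    def pop(self):
        # dequeue from the beginning (assumes a non-empty queue,
        # as the C version does)
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None
        return node.val

q = Queue()
q.push("a")
q.push("b")
print(q.pop(), q.pop())  # a b
```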
For the hashtable, we had likewise all implemented one before, but we needed one whose keys and values are strings (char *), so we found an implementation online to use. It is stored in the file hashs.h.
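The interface we rely on can be sketched as a simple chained hashtable in Python. This is an illustrative stand-in, not the hashs.h implementation; note that it returns the string "NULL" for missing keys, which is how our search code checks membership:

```python
# Simplified sketch of a string->string chained hashtable with the
# same interface shape as ht_create / ht_set / ht_get in the C code.
class HashTable:
    def __init__(self, size=65536):
        self.size = size
        self.bins = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % self.size

    def set(self, key, value):
        # update in place if the key exists, else append to the chain
        chain = self.bins[self._index(key)]
        for pair in chain:
            if pair[0] == key:
                pair[1] = value
                return
        chain.append([key, value])

    def get(self, key):
        for k, v in self.bins[self._index(key)]:
            if k == key:
                return v
        # missing keys read back as the string "NULL",
        # matching what the C search code checks for
        return "NULL"

ht = HashTable()
ht.set("nm0000102", "visited")
print(ht.get("nm0000102"), ht.get("nm0000151"))  # visited NULL
```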
The traversal of the graph using breadth-first search to create the child:parent hashtable is commented and annotated below. Start with a current node, name. Query for all of the current node's neighbors. A queue of actors stores each neighbor of the current actor node being visited. We then add each neighbor and its parent (the current node) to the child:parent hashtable. Finally, mark the current node as visited.
```c
/* BREADTH FIRST SEARCH IMPLEMENTATION */

//Switch back to DB0 for breadth-first search
redisReply *switch_back;
switch_back = redisCommand(c, "SELECT 0");

//initialize LLs and hashtables
Node *actors;   //used in BFS as the Queue
Node *path;     //used later for building the path

//visited hashtable
hashtable_t *hashtable = ht_create( 65536 );
//hashtable of parents to build path later
hashtable_t *hashtable_parents = ht_create( 65536 );

//initialize the start node for the actors Queue
actors = make_node(start, NULL);
char *name = start;

//while the start and end are not equal
while (strcmp(name, end) != 0) {
    name = peek(&actors);
    //check-- has name been visited?
    char *check = ht_get(hashtable, name);
    //if name is unvisited
    if (strcmp(check, "NULL") == 0) {
        //break out if you have reached the finishing condition
        if (strcmp(name, end) == 0) {
            break;
        }
        //continue BFS
        //set up the query for the list of actors
        //in the adjacency list (neighbors)
        char word_reply[100] = "";
        strcat(word_reply, "LRANGE ");
        strcat(word_reply, name);
        strcat(word_reply, " 0 -1");
        //store the query response
        reply = redisCommand(c, word_reply);
        if (reply->type == REDIS_REPLY_ARRAY) {
            for (k = 0; k < reply->elements; k++) {
                //push each neighbor onto the queue (as you
                //do in BFS)
                push(&actors, reply->element[k]->str);
                //check if each neighbor is in the hashtable
                //of parents
                char *check2 = ht_get(hashtable_parents, reply->element[k]->str);
                if (strcmp(check2, "NULL") == 0) {
                    //if not, then put that neighbor and its parent, name,
                    //in the hashtable, keeping the path traceable later
                    ht_set(hashtable_parents, reply->element[k]->str, name);
                }
            }
        }
        //once done, set name to visited
        //recall that name is just the current node we are visiting
        ht_set(hashtable, name, "visited");
    }
    //traverse the graph by popping from actors;
    //the new current node, name, is then set from the
    //front of actors at the top of the while loop
    pop(&actors);
}
```
Using the parents hashtable, we were able to build up the path from the source actor to the destination actor as follows:
```c
/* BUILDING UP THE PATH FROM THE PARENT HASHTABLE */

//Gets first parent to make linked list
//To trace back and print the path
//parent is an id in hashtable_parents
char *parent = ht_get(hashtable_parents, end);

//Use parent hashtable to build up path
//all the way up to but not including the final parent
while (strcmp(parent, start) != 0) {
    //Need to store in redis reply after looking
    //up the actual name from id stored in parent
    redisReply *actor_name;
    //Get the actor name
    actor_name = redisCommand(c, "GET %s", parent);
    char *parent_name = actor_name->str;
    //push parent name to build up final path
    push(&final_path, parent_name);
    //get new parent
    parent = ht_get(hashtable_parents, parent);
}

//Get the final parent, the last name that needs
//to get pushed onto the path
redisReply *actor_name;
actor_name = redisCommand(c, "GET %s", parent);
char *parent_name = actor_name->str;
push(&final_path, parent_name);

reverse(&final_path);
print_list(final_path);
```
Bidirectional breadth-first search is very similar to regular breadth-first search, with just a few modifications. The primary change we made was to create two separate queues to iterate through: one for nodes visited branching out from the starting actor, and one for nodes visited branching out from the ending actor. For both queues, the head node goes through the same process as in breadth-first search: checking that it has not been visited, checking and adding its children, and adding it to the hashtable of visited actors. Before these head nodes are popped from their queues, they are compared to see if they are equal (meaning they are the same person). If they are equal, a complete path between the start and end actors has been found, and the program moves on to the next step.
This is where the path is generated and printed out. We use the same method of storing the parent node of each node, now in two hashtables: one for the start queue and one for the end queue. To get the paths, we simply add each parent node to a linked list until the start or end actor is found. We are then left with two lists: the path from the start to the meeting actor, and the path from the end to the meeting actor. Finally, we reverse the first list in order to join the two.
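The two-queue expansion and path joining described above can be sketched in Python on a plain dict graph (Redis and the C data structures are left out, and the node names are placeholders):

```python
# Sketch of bidirectional search: expand from both ends with separate
# queues and parent maps; when the frontiers meet, join the two halves.
from collections import deque

def bidirectional_search(graph, start, end):
    if start == end:
        return [start]
    parents_s = {start: None}   # parent map for the start side
    parents_e = {end: None}     # parent map for the end side
    queue_s, queue_e = deque([start]), deque([end])

    def expand(queue, parents, other_parents):
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                if neighbor in other_parents:
                    return neighbor        # the frontiers meet here
                queue.append(neighbor)
        return None

    while queue_s and queue_e:
        meet = expand(queue_s, parents_s, parents_e) or \
               expand(queue_e, parents_e, parents_s)
        if meet:
            # walk back to the start, then reverse that half
            path, node = [], meet
            while node is not None:
                path.append(node)
                node = parents_s[node]
            path.reverse()
            # walk forward from the meeting point to the end
            node = parents_e[meet]
            while node is not None:
                path.append(node)
                node = parents_e[node]
            return path
    return None   # no path exists

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(bidirectional_search(graph, "a", "d"))  # ['a', 'b', 'c', 'd']
```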
Once we finished bidirectional search, we ran timing tests to compare our breadth-first search implementation with our bidirectional search code:
| Actors | BFS Time (sec) | Bidirectional Time (sec) |
| --- | --- | --- |
| Colin Firth - Bette Davis | 1.547 | 0.119 |
| Bette Davis - Bill Murray | 0.801 | 0.110 |
| Brad Pitt - Bette Davis | 1.234 | 0.155 |
| Natalie Portman - Bill Murray | 0.423 | 0.059 |
| Jamie Foxx - Bette Davis | 1.400 | 0.082 |
Once we had breadth-first search working, we wanted to display actual actor names instead of just actor ids. At the same time, we wanted to build this into the Redis command and have it done in C.
```c
//SETTING UP LOOKUP TABLE:
redisReply *actors_ids;
actors_ids = redisCommand(c, "SELECT 2");
redisReply *switch_back;
switch_back = redisCommand(c, "SELECT 0");
```

We have to do this each time we want to use a different database.
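The effect of that lookup table can be sketched with plain dicts standing in for the Redis databases; `db2_names` below is a hypothetical stand-in for the id-to-name mapping stored in DB 2:

```python
# Sketch of translating a path of actor ids into display names.
# db2_names is a hypothetical stand-in for the Redis lookup table
# (DB 2 in our setup); ids are illustrative.
db2_names = {"nm0000151": "Morgan Freeman", "nm0000195": "Bill Murray"}

def id_path_to_names(path):
    # fall back to the raw id if a name is missing
    return [db2_names.get(actor_id, actor_id) for actor_id in path]

print(id_path_to_names(["nm0000151", "nm0000195"]))
# ['Morgan Freeman', 'Bill Murray']
```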
When creating both the breadth-first search and bidirectional search algorithms, we encountered a few problems along the way. We had a difficult time retracing the path in both cases. Initially, we were going to store the path that it took to get to each node in a linked list on the node itself. Then we came up with the parent hashtable idea, which we liked better as a solution. We had a bug where the second to last node in the path was not being displayed because of the order we were pushing things to the path and looking them up in the parent hashtable. We fixed this through careful examination of the logic flow, and used the same type of debugging for the bugs we found in bidirectional search.
In bidirectional search, we encountered a problem with the while loop that handled popping nodes off of the queues. The loop checked whether the start and end nodes were equal and executed if they weren't, popping the current heads of the start and end queues at the end of each iteration. When it reached the two actors that were equal, the meet-in-the-middle point of bidirectional search, the loop body never ran, so the values of start and end were never set and we couldn't access them: we knew the paths met, but not at which actor. To fix this, we changed the loop to continue until the start and end queues were empty, and added a check at the end of each iteration: if the two actors were equal, break; otherwise pop them off their queues and continue.
Shivali Chandra (@skchandra), Cynthia Chen (@yunhsincynthiachen), Nitya Dhanushkodi (@ndhanushkodi) and Marena Richardson (@marenar)