Our team's goals for Project 2 are to learn about database implementation and data structures. Using the IMDb database, we are replicating the Oracle of Bacon with Redis and the actors from IMDb's Top 250 Movies. With Redis, we want to build our own graphs, data structures, and functions on top of the database. After using Redis, we want to go one step further and learn how to implement a database system like Redis ourselves.
First, we had to learn how to collect IMDb data. We decided to focus on the filmographies of actors who have a movie in the IMDb Top 250 list. This is the process we went through:
1. We stored the list of top 250 movies from the IMDb database in a dictionary, using the movie ids as keys and the movie titles as values.
2. We then used the movie ids to query the IMDb database for the full information about each movie, which gave us the cast of each movie. Many movies have very large casts, so we only used the first 10 cast members in the list.
3. We queried the database for the filmography of each actor or actress, i.e. the list of all the movies they have been in.
4. We then built a Redis database, setting each actor or actress's id as the key and a tuple of their name and filmography as the value. We chose ids as keys so that we could be confident we were using a unique identifier even if two actors or actresses had the same name.
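The keying scheme above can be sketched in Python. The ids and casts below are illustrative samples (the real code pulled full data from IMDb), and a plain dict stands in for the Redis database:

```python
# Sketch of the keying scheme: actor id -> (name, filmography).
# A plain dict stands in for Redis; ids are illustrative samples.

# Top 250 movies: movie id -> title
top_movies = {
    "tt0111161": "The Shawshank Redemption",
    "tt0068646": "The Godfather",
}

# Cast lists (first 10 members only); each entry is (actor_id, actor_name)
casts = {
    "tt0111161": [("nm0000209", "Tim Robbins"), ("nm0000151", "Morgan Freeman")],
    "tt0068646": [("nm0000008", "Marlon Brando"), ("nm0000199", "Al Pacino")],
}

fake_redis = {}
for movie_id in top_movies:
    for actor_id, actor_name in casts[movie_id][:10]:
        name, filmography = fake_redis.get(actor_id, (actor_name, []))
        filmography.append(movie_id)
        fake_redis[actor_id] = (name, filmography)

# Actor ids stay unique even if two actors share a name
print(fake_redis["nm0000151"])  # ('Morgan Freeman', ['tt0111161'])
```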
The code for getting each actor and their entire filmography into redis can be found here: Sprint 1 Version of Data Wrangling
Coming into sprint 2, we had to re-evaluate our project goals and steps in order to learn what we wanted more efficiently. While planning out the algorithms and data structures we could use for building and searching the graph, we realized that full filmographies were more data than we needed. As a result, we scrapped our original idea of using each actor's entire filmography and instead looked only at actors and their films that are in the Top 250.
Code for making the adjacency lists and putting them into Redis: Making Adjacency Lists in Redis
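The shared-movie rule behind those adjacency lists can be sketched as follows. Placeholder names stand in for IMDb ids, and a dict of sets stands in for the per-actor Redis lists (which the search code later reads back with LRANGE):

```python
# Sketch of adjacency-list construction: two actors are neighbors
# if they appear together in a Top 250 movie. Names are placeholders.
from collections import defaultdict

movie_casts = {
    "movie_a": ["actor_1", "actor_2", "actor_3"],
    "movie_b": ["actor_2", "actor_4"],
}

adjacency = defaultdict(set)
for cast in movie_casts.values():
    for actor in cast:
        # every other cast member of this movie is a neighbor
        adjacency[actor].update(a for a in cast if a != actor)

print(sorted(adjacency["actor_2"]))  # ['actor_1', 'actor_3', 'actor_4']
```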
We also started looking at how to make our own Redis commands.
We spent much of this sprint not on storing more data, but on understanding data structures and search algorithms. We wrote our own working breadth-first search algorithm and made sure it worked with our data in Redis.
```python
def bfs(graph, start, end):
    # maintain a queue of paths
    queue = []
    # push the first path into the queue
    queue.append([start])
    # track visited nodes so cycles don't loop forever
    visited = set()
    while queue:
        # get the first path from the queue
        path = queue.pop(0)
        # get the last node from the path
        node = path[-1]
        # path found
        if node == end:
            return path
        # skip nodes we have already expanded
        if node in visited:
            continue
        visited.add(node)
        # enumerate all adjacent nodes, construct a new path
        # and push it into the queue
        for adjacent in graph.get(node, []):
            new_path = list(path)
            new_path.append(adjacent)
            queue.append(new_path)
```
For this sprint, we had solidified the algorithm and the steps to reach the desired result. We first planned to implement breadth-first search in C, using linked lists and hashtables, and then, if time permitted, to implement bidirectional breadth-first search. We also wanted to use real actor names, instead of ids, in the interface portion of the Redis command. This sprint really cemented our learning of C, in terms of data structures, and our ability to use Redis commands through hiredis.
To implement breadth-first search in C, we had to create a hashtable data structure and a linked list data structure so we could implement the algorithm in a manner similar to the Python code above. The linked list was used as a queue, and the hashtable served two purposes: keeping track of the visited actors during the breadth-first search, and making up the child:parent hashtable used later to rebuild the path.
For the linked list, we had all practiced implementing this data structure in an earlier sprint exercise, so we felt comfortable reusing it. We made some modifications to make it act like a FIFO queue: the push method now appends nodes to the end of the linked list rather than the beginning, while the pop method removes a node from the beginning. This linked list is imported from a file we made called singlelinkedlist.h.
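That queue behavior (push at the tail, pop at the head) can be sketched in Python; this roughly mirrors the interface of our C linked list, though the version actually used lives in singlelinkedlist.h:

```python
# Sketch of a singly linked list used as a FIFO queue:
# push appends at the tail, pop removes from the head.
class Node:
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

class Queue:
    def __init__(self):
        self.head = self.tail = None

    def push(self, val):
        # enqueue at the end of the list
        node = Node(val)
        if self.tail:
            self.tail.next = node
        else:
            self.head = node
        self.tail = node

    def pop(self):
        # dequeue from the beginning (assumes a non-empty queue,
        # as the C version does)
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None
        return node.val

q = Queue()
q.push("a")
q.push("b")
print(q.pop(), q.pop())  # a b
```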
For the hashtable, we had likewise all implemented one before, but we needed one whose keys and values are strings (char *), so we found an implementation online to use. It is stored in the file hashs.h.
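The interface we rely on can be sketched as a simple chained hashtable in Python. This is an illustrative stand-in, not the hashs.h implementation; note that it returns the string "NULL" for missing keys, which is how our search code checks membership:

```python
# Simplified sketch of a string->string chained hashtable with the
# same interface shape as ht_create / ht_set / ht_get in the C code.
class HashTable:
    def __init__(self, size=65536):
        self.size = size
        self.bins = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % self.size

    def set(self, key, value):
        # update in place if the key exists, else append to the chain
        chain = self.bins[self._index(key)]
        for pair in chain:
            if pair[0] == key:
                pair[1] = value
                return
        chain.append([key, value])

    def get(self, key):
        for k, v in self.bins[self._index(key)]:
            if k == key:
                return v
        # missing keys read back as the string "NULL",
        # matching what the C search code checks for
        return "NULL"

ht = HashTable()
ht.set("nm0000102", "visited")
print(ht.get("nm0000102"), ht.get("nm0000151"))  # visited NULL
```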
The traversal of the graph using breadth-first search to create the child:parent hashtable is commented and annotated below. Start with a current node, name. Query for all of the current node's neighbors. A queue of actors stores each neighbor of the current actor node being visited. We then add each neighbor and its parent (the current node) to the child:parent hashtable. Finally, mark the current node as visited.
```c
/* BREADTH FIRST SEARCH IMPLEMENTATION */

//Switch back to DB0 for breadth-first search
redisReply *switch_back;
switch_back = redisCommand(c, "SELECT 0");

//initialize LLs and hashtables
Node *actors;   //used in BFS as the Queue
Node *path;     //used later for building the path

//visited hashtable
hashtable_t *hashtable = ht_create( 65536 );
//hashtable of parents to build path later
hashtable_t *hashtable_parents = ht_create( 65536 );

//initialize the start node for the actors Queue
actors = make_node(start, NULL);
char *name = start;

//while the start and end are not equal
while (strcmp(name, end) != 0) {
    name = peek(&actors);
    //check-- has name been visited?
    char *check = ht_get(hashtable, name);
    //if name is unvisited
    if (strcmp(check, "NULL") == 0) {
        //break out if you have reached the finishing condition
        if (strcmp(name, end) == 0) {
            break;
        }
        //continue BFS
        //set up the query for the list of actors
        //in the adjacency list (neighbors)
        char word_reply[100] = "";
        strcat(word_reply, "LRANGE ");
        strcat(word_reply, name);
        strcat(word_reply, " 0 -1");
        //store the query response
        reply = redisCommand(c, word_reply);
        if (reply->type == REDIS_REPLY_ARRAY) {
            for (k = 0; k < reply->elements; k++) {
                //push each neighbor onto the queue (as you
                //do in BFS)
                push(&actors, reply->element[k]->str);
                //check if each neighbor is in the hashtable
                //of parents
                char *check2 = ht_get(hashtable_parents, reply->element[k]->str);
                if (strcmp(check2, "NULL") == 0) {
                    //if not, then put that neighbor and its parent, name,
                    //in the hashtable, keeping the path traceable later
                    ht_set(hashtable_parents, reply->element[k]->str, name);
                }
            }
        }
        //once done, set name to visited
        //recall that name is just the current node we are visiting
        ht_set(hashtable, name, "visited");
    }
    //traverse the graph by popping from actors;
    //the new current node, name, is then set from the
    //front of actors at the top of the while loop
    pop(&actors);
}
```
Using the parents hashtable, we were able to build up the path from the source actor to the destination actor as follows:
```c
/* BUILDING UP THE PATH FROM THE PARENT HASHTABLE */

//Gets first parent to make linked list
//To trace back and print the path
//parent is an id in hashtable_parents
char *parent = ht_get(hashtable_parents, end);

//Use parent hashtable to build up path
//all the way up to but not including the final parent
while (strcmp(parent, start) != 0) {
    //Need to store in redis reply after looking
    //up the actual name from id stored in parent
    redisReply *actor_name;
    //Get the actor name
    actor_name = redisCommand(c, "GET %s", parent);
    char *parent_name = actor_name->str;
    //push parent name to build up final path
    push(&final_path, parent_name);
    //get new parent
    parent = ht_get(hashtable_parents, parent);
}

//Get the final parent, the last name that needs
//to get pushed onto the path
redisReply *actor_name;
actor_name = redisCommand(c, "GET %s", parent);
char *parent_name = actor_name->str;
push(&final_path, parent_name);

reverse(&final_path);
print_list(final_path);
```
Bidirectional breadth-first search is very similar to regular breadth-first search, with just a few modifications. The primary change we made was to create two separate queues to iterate through: one for nodes visited branching out from the starting actor, and one for nodes visited branching out from the ending actor. For both queues, the head node goes through the same process as in breadth-first search: checking that it has not been visited, checking and adding its children, and adding it to the hashtable of visited actors. Before these head nodes are popped from their queues, they are compared to see if they are equal (meaning they are the same person). If they are equal, a complete path between the start and end actors has been found, and the program moves on to the next step.
This is where the path is generated and printed out. We use the same method of storing the parent node of each node, now in two hashtables: one for the start queue and one for the end queue. To get the paths, we simply add each parent node to a linked list until the start or end actor is found. We are then left with two lists: the path from the start to the meeting actor, and the path from the end to the meeting actor. Finally, we reverse the first list in order to join the two.
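The two-queue expansion and path joining described above can be sketched in Python on a plain dict graph (Redis and the C data structures are left out, and the node names are placeholders):

```python
# Sketch of bidirectional search: expand from both ends with separate
# queues and parent maps; when the frontiers meet, join the two halves.
from collections import deque

def bidirectional_search(graph, start, end):
    if start == end:
        return [start]
    parents_s = {start: None}   # parent map for the start side
    parents_e = {end: None}     # parent map for the end side
    queue_s, queue_e = deque([start]), deque([end])

    def expand(queue, parents, other_parents):
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                if neighbor in other_parents:
                    return neighbor        # the frontiers meet here
                queue.append(neighbor)
        return None

    while queue_s and queue_e:
        meet = expand(queue_s, parents_s, parents_e) or \
               expand(queue_e, parents_e, parents_s)
        if meet:
            # walk back to the start, then reverse that half
            path, node = [], meet
            while node is not None:
                path.append(node)
                node = parents_s[node]
            path.reverse()
            # walk forward from the meeting point to the end
            node = parents_e[meet]
            while node is not None:
                path.append(node)
                node = parents_e[node]
            return path
    return None   # no path exists

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(bidirectional_search(graph, "a", "d"))  # ['a', 'b', 'c', 'd']
```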
Once we finished bidirectional search, we ran timing tests to compare our breadth-first search implementation with our bidirectional search code:
| Actors | BFS Time (sec) | Bidirectional Time (sec) |
| --- | --- | --- |
| Colin Firth - Bette Davis | 1.547 | 0.119 |
| Bette Davis - Bill Murray | 0.801 | 0.110 |
| Brad Pitt - Bette Davis | 1.234 | 0.155 |
| Natalie Portman - Bill Murray | 0.423 | 0.059 |
| Jamie Foxx - Bette Davis | 1.400 | 0.082 |
Once we had breadth-first search working, we wanted to display actual actor names instead of just actor ids. At the same time, we wanted to build this into the Redis command and have it done in C.
```c
//SETTING UP LOOKUP TABLE:
redisReply *actors_ids;
actors_ids = redisCommand(c, "SELECT 2");
redisReply *switch_back;
switch_back = redisCommand(c, "SELECT 0");
```

We have to do this each time we want to use a different database.
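The effect of that lookup table can be sketched with plain dicts standing in for the Redis databases; `db2_names` below is a hypothetical stand-in for the id-to-name mapping stored in DB 2:

```python
# Sketch of translating a path of actor ids into display names.
# db2_names is a hypothetical stand-in for the Redis lookup table
# (DB 2 in our setup); ids are illustrative.
db2_names = {"nm0000151": "Morgan Freeman", "nm0000195": "Bill Murray"}

def id_path_to_names(path):
    # fall back to the raw id if a name is missing
    return [db2_names.get(actor_id, actor_id) for actor_id in path]

print(id_path_to_names(["nm0000151", "nm0000195"]))
# ['Morgan Freeman', 'Bill Murray']
```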
When creating both the breadth-first search and bidirectional search algorithms, we encountered a few problems along the way. We had a difficult time retracing the path in both cases. Initially, we were going to store the path that it took to get to each node in a linked list on the node itself. Then we came up with the parent hashtable idea, which we liked better as a solution. We had a bug where the second to last node in the path was not being displayed because of the order we were pushing things to the path and looking them up in the parent hashtable. We fixed this through careful examination of the logic flow, and used the same type of debugging for the bugs we found in bidirectional search.
In bidirectional search, we encountered a problem with the while loop that handled popping nodes off of the queues. The loop checked whether the start and end nodes were equal and executed if they weren't, popping the current heads of the start and end queues at the end of each iteration. When it reached the two actors that were equal, the meet-in-the-middle point of bidirectional search, the loop body never ran, so the values of start and end were never set and we couldn't access them: we knew the paths met, but not at which actor. To fix this, we changed the loop to continue until the start and end queues were empty, and added a check at the end of each iteration: if the two actors were equal, break; otherwise pop them off their queues and continue.
Shivali Chandra (@skchandra), Cynthia Chen (@yunhsincynthiachen), Nitya Dhanushkodi (@ndhanushkodi) and Marena Richardson (@marenar)