Neo4j, Cypher, and Baconators

Graph databases are a new type of database that stores all the data in a big funky web, instead of boring old rows and tables.  Your “records” are stored in things called “nodes”, and nodes can be connected to each other by “edges”.  Once you have all your data and connections in place, you can start to pull NEW data out of it by examining the relationships between stuff.  Think about how Facebook recommends friends-of-friends, or Amazon recommends things-you-might-also-like.
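
To make the nodes-and-edges idea a bit more concrete, here is a rough Python sketch of a friend-of-friend recommendation. It has nothing to do with Neo4j itself: the graph is just a dictionary, and every name in it is made up for illustration.

```python
# Toy graph: each key is a node, each set holds the nodes it shares an edge with.
# All names here are hypothetical.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}


def friends_of_friends(graph, person):
    """Return people two hops away who are not already direct friends."""
    direct = graph[person]
    suggestions = set()
    for friend in direct:
        suggestions |= graph[friend]
    # Drop the person themselves and anyone they already know.
    return suggestions - direct - {person}


print(sorted(friends_of_friends(friends, "alice")))
```

The “new” data here is the suggestion list: nobody stored it anywhere, it falls out of following the edges.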

One of the most popular graph databases out there, and the one I’ve been playing with, is Neo4j.  I think it’s the most popular because it’s documented inside-and-out, making it really easy to work with.  One of the ways to examine data in Neo4j is using its Cypher query language, which is like SQL for graph databases.  It’s really powerful, and once you figure it out you can do some really cool stuff with it.

I’m gonna try to walk you through setting up Neo4j and doing some tricks with it.  One of the example datasets on the Neo4j site is an IMDB database.  So you’ve got this fancy new toy that can find connections between things, and a movie database. What’s the first thing you want to do?  Write a Kevin Baconator, right?  So that’s what we’re gonna do: set up Neo4j and find the links from any actor to Kevin Bacon.

First, Neo4j runs on Java, so you need to install that if you haven’t already.  I needed to install the JDK to get it to run, but there is probably an easier way to do it.  Next you need to download and install Neo4j.  As of right now you can get the enterprise edition for free, so you might as well get that one.  It has better error reporting for Cypher queries etc., and can set up master-slave replication for high availability.  (I’m not gonna walk you through that, but it’s a neat trick and is covered in a chapter in Seven Databases In Seven Weeks).

Also I had to do this trick to get Neo4j running: https://github.com/neo4j/neo4j/issues/391

Now that you’ve got Neo4j up and running, you can open a browser and go to http://localhost:7474 to pull up the dashboard.  It looks a bit empty though, so let’s fix that.  Download the cineasts_12k dataset and unzip it into the ..\neo4j\data\graph.db folder.  Now when you pull up the dashboard it should look like this:

Image

Go to the “Explore and Edit: Data Browser” tab.  First let’s do a quick query: type the following into the edit box and hit Ctrl+Enter:

START bacon=node(*)
WHERE (bacon.name! = "Kevin Bacon")
RETURN bacon

So what does that all mean?  First, the line

START bacon=node(*)

This starts the query, scans all the nodes and binds each one in turn to a variable called bacon.  The second line:

WHERE (bacon.name! = "Kevin Bacon")

narrows the selection down to nodes whose “name” property equals Kevin Bacon.  The ! after bacon.name means that nodes without a name property (the movies, for example) are quietly discarded instead of making the query fail.

(edit) Michael in the comments section pointed out that this database is indexed by name:

START bacon=node:Person(name="Kevin Bacon")
RETURN bacon

This improved query will check the “Person” index for any nodes with the name Kevin Bacon. Because it is a Lucene index lookup, it is much faster than scanning through every node in the database.
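
If the scan-versus-index difference feels abstract, here is a small Python sketch of the same idea. It is only an analogy, not how Neo4j or Lucene work internally: a plain dictionary stands in for the index, and apart from Kevin Bacon’s node 759 the data is made up.

```python
# Hypothetical mini dataset: node id -> properties. Movies have no "name".
nodes = {
    1: {"title": "Some Movie"},
    2: {"name": "Some Actor"},
    759: {"name": "Kevin Bacon"},
}


def scan_by_name(nodes, name):
    """Roughly what node(*) plus WHERE does: look at every single node."""
    # .get() returns None for nodes without a "name", so they never match,
    # which mirrors what the ! does in the Cypher query.
    return [node_id for node_id, props in nodes.items()
            if props.get("name") == name]


def build_name_index(nodes):
    """Build a name -> [node ids] lookup once, up front."""
    index = {}
    for node_id, props in nodes.items():
        if "name" in props:
            index.setdefault(props["name"], []).append(node_id)
    return index


person_index = build_name_index(nodes)

print(scan_by_name(nodes, "Kevin Bacon"))    # touches every node
print(person_index.get("Kevin Bacon", []))   # one dictionary lookup
```

Both calls find the same node; the second one just doesn’t have to visit everything else to get there, which matters a lot more with 12k nodes than with three.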

The RETURN line just returns the result of our query… which should be Node 759, Kevin Bacon’s node.  Click on it to view his info, or click on the little graph-looking button in the upper right to see it in graph view:

Image

Wait, that looks like crap, we want to see actor.name and movie.title, not a bunch of stupid numbers!  Let’s fix that… click the “Style” button and then click “New Profile”.  Edit the stuff until it looks like this:

Image

Now if a node has a “name” property it will display that, or if it has a “title” property it will display that:

Image

Ha!  That’s better 🙂  You can click around on the different nodes to expand/contract them, and see other actors and movies that are connected to Kevin Bacon.  Pretty neat, right?

Now let’s do some cool shit… enter the following Cypher query into the edit box:

START keanu=node:Person(name="Keanu Reeves"), bacon=node(759)
MATCH p = shortestPath( keanu-[*..16]-bacon )
RETURN extract(n in nodes(p) :
coalesce(n.title?,n.name?)) as `names and titles`,
(length(p) / 2) AS `Baconator Score`

Whoa, what’s all that stuff? The first line:

START keanu=node:Person(name="Keanu Reeves"), bacon=node(759)
This gets two nodes: one named keanu, looked up by name in the Person index, and another named bacon, which is node(759), the node we found in the last query.  You can also change this clause to search for actors other than Keanu Reeves.

MATCH p = shortestPath( keanu-[*..16]-bacon )
This creates a variable p which is set to the shortest path between the keanu and bacon nodes. The *..16 means it will find a path of any length, up to 16 links between the two nodes.

RETURN extract(n in nodes(p) :
This walks over all the nodes in the path ‘p’ we found earlier, assigning each one in turn to a variable ‘n’.

coalesce(n.title?,n.name?)) as `names and titles`,
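This takes the title or name property of each node n and returns them as a column ‘names and titles’. The ? makes a missing property come back as null instead of causing an error, and coalesce picks the first value that isn’t null, so movies show their title and people show their name.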

(length(p) / 2) AS `Baconator Score`
This will return a second column with the length of the ‘p’ path as the Baconator Score. Since the length will be a link from the actor to the movie, then a link from the movie to the next actor, we divide it by 2 to get the true baconator score.
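
Here is a rough Python sketch of what that whole query boils down to, in case it helps to see it outside of Cypher. This is my own illustration using a breadth-first search, not how Neo4j actually implements shortestPath. The graph is a tiny hand-built chain (Keanu Reeves, Dracula, Gary Oldman, JFK, Kevin Bacon), and every node id except 759 is made up.

```python
from collections import deque

# Toy data: node id -> properties. Ids other than 759 are hypothetical.
props = {
    1: {"name": "Keanu Reeves"},
    2: {"title": "Dracula"},
    3: {"name": "Gary Oldman"},
    4: {"title": "JFK"},
    759: {"name": "Kevin Bacon"},
}

# Undirected edges as an adjacency list (actor <-> movie).
edges = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 759], 759: [4]}


def shortest_path(edges, start, goal, max_hops=16):
    """Breadth-first search: the first path that reaches goal is a shortest one."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 >= max_hops:
            continue  # same idea as the *..16 limit: stop expanding this path
        for neighbour in edges[path[-1]]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def label(node_props):
    """Stand-in for coalesce(n.title?, n.name?): title if present, else name."""
    title = node_props.get("title")
    return title if title is not None else node_props.get("name")


path = shortest_path(edges, 1, 759)
names_and_titles = [label(props[n]) for n in path]
baconator_score = (len(path) - 1) // 2  # hops in the path, halved

print(names_and_titles, baconator_score)
```

The path has four links (actor, movie, actor, movie, actor), so halving it gives a Baconator Score of 2.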

Now to the fun part: hit Ctrl+Enter and Neo4j will find the shortest link possible between Keanu Reeves and Kevin Bacon:

baconated

Cool! But Forest Whitaker is kinda boring… Let’s find all the shortest paths between Keanu and Kevin Bacon and pick the funnest one:

START keanu=node:Person(name="Keanu Reeves"), bacon=node(759)
MATCH p = allShortestPaths( keanu-[*..16]-bacon )
RETURN extract(n in nodes(p) :
coalesce(n.title?,n.name?)) as `names and titles`,
(length(p) / 2) AS `Baconator Score`

Changing the match clause to “allShortestPaths” returns a bunch of results… let’s see, Keanu was in Dracula with Gary Oldman, and Kevin Bacon was in JFK with him, that’s a much better match 🙂
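
For the curious, here is a hedged Python sketch of the “all shortest paths” idea: a breadth-first search that remembers every parent reaching a node in the fewest hops, then walks back from the goal. Again, this is my own toy version and not Neo4j’s implementation. “Movie X” and “Movie Y” are placeholders, since I’m only illustrating that two equally short routes exist.

```python
from collections import deque

# Toy graph keyed by display name. "Movie X" and "Movie Y" are placeholders.
edges = {
    "Keanu Reeves": ["Dracula", "Movie X"],
    "Dracula": ["Keanu Reeves", "Gary Oldman"],
    "Gary Oldman": ["Dracula", "JFK"],
    "JFK": ["Gary Oldman", "Kevin Bacon"],
    "Movie X": ["Keanu Reeves", "Forest Whitaker"],
    "Forest Whitaker": ["Movie X", "Movie Y"],
    "Movie Y": ["Forest Whitaker", "Kevin Bacon"],
    "Kevin Bacon": ["JFK", "Movie Y"],
}


def all_shortest_paths(edges, start, goal):
    """Return every path from start to goal with the minimum number of hops."""
    dist = {start: 0}
    parents = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges[node]:
            if neighbour not in dist:
                # First time we see this node: record its distance and parent.
                dist[neighbour] = dist[node] + 1
                parents[neighbour] = [node]
                queue.append(neighbour)
            elif dist[neighbour] == dist[node] + 1:
                # Another equally short way to reach the same node.
                parents[neighbour].append(node)
    if goal not in dist:
        return []

    def walk_back(node):
        if node == start:
            return [[start]]
        return [p + [node] for parent in parents[node] for p in walk_back(parent)]

    return walk_back(goal)


for p in all_shortest_paths(edges, "Keanu Reeves", "Kevin Bacon"):
    print(p, (len(p) - 1) // 2)
```

Both routes come back with the same Baconator Score, and you get to pick whichever one is funnier.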

Anyway, I’d recommend reading through the Neo4j documentation, plus the Cypher quick-start guide and documentation. These are some pretty cool tools that can be used to take data you already have, and pull new data out of it. Hope this quick tutorial helped y’all get started, I’ll do another blog post with some more fun queries 🙂
Cheers!

2 thoughts on “Neo4j, Cypher, and Baconators”

  1. Great article ! Please don’t use
    START bacon=node(*)
    As this literally scans over all nodes.
    In this dataset people are indexed by name

    So use start n=node:Person(name="Kevin Bacon")

    Try it out and please update the post accordingly.

    You can see all indexes in the UI under “Indexes”
