|GCHQ Open Sources Gaffer|
|Written by Kay Ewbank|
|Wednesday, 16 December 2015|
A graph database developed by the UK communications agency and optimized for retrieving data on nodes of interest has been made available on GitHub.
Gaffer is a graph database that you can use to store large-scale graphs in which the nodes and edges have statistics such as counts, histograms and sketches. It's a graph database rather than a graph processing system, and differs from other graph storage systems because you can update properties within the store itself.
For example, if the edges in a graph have a count statistic, then when there is a new observation of an edge, the edge can simply be inserted into the graph with a count of 1. If the edge already exists in the graph, that count of 1 is added to the count already stored on it.
As the project's GitHub page explains:
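The insert-time merging described above can be illustrated with a small in-memory sketch. This is not Gaffer's API: the class `CountingEdgeStore` and its methods are hypothetical names chosen for illustration, and a plain dictionary stands in for the real store.

```python
from collections import defaultdict


class CountingEdgeStore:
    """Toy store (hypothetical): inserting an edge merges its count into any existing entry."""

    def __init__(self):
        # (source, destination) -> running count
        self._counts = defaultdict(int)

    def add_edge(self, source, destination, count=1):
        # The caller never queries first; the store merges the count on insert.
        self._counts[(source, destination)] += count

    def count(self, source, destination):
        # .get avoids creating an entry for edges that were never seen.
        return self._counts.get((source, destination), 0)


store = CountingEdgeStore()
store.add_edge("A", "B")  # first observation, stored with count 1
store.add_edge("A", "B")  # second observation, merged into the existing edge
store.add_edge("A", "C")
print(store.count("A", "B"))
```

The point of the design is that the writer only ever issues inserts; the merge happens inside the store rather than in a query-update-put round trip.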
"The ability to do these updates without the need for query-update-put is key to the ability to ingest large volumes of data."
The types of statistics you can use include maps, sets, histograms, hyperloglog sketches and bitmaps used to store timestamps.
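Each kind of statistic needs its own merge rule. As a rough sketch, assuming a hypothetical schema with a count, a set and a histogram (the property names and the `MERGERS` table are invented here, not taken from Gaffer), the merge might look like this:

```python
import operator
from collections import Counter

# Hypothetical schema: property name -> function that merges two values.
MERGERS = {
    "count": operator.add,           # integers are summed
    "ports": operator.or_,           # sets are unioned
    "hour_histogram": operator.add,  # Counter addition sums matching bins
}


def merge_properties(existing, incoming):
    """Return a new property dict combining two observations of the same edge."""
    merged = dict(existing)
    for name, value in incoming.items():
        if name in merged:
            merged[name] = MERGERS[name](merged[name], value)
        else:
            merged[name] = value
    return merged


first = {"count": 1, "ports": {80}, "hour_histogram": Counter({9: 1})}
second = {"count": 1, "ports": {443}, "hour_histogram": Counter({9: 1, 10: 1})}
merged = merge_properties(first, second)
print(merged)
```

A sketch such as HyperLogLog would slot in the same way, with its own union operation as the merge function.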
The way the graph stores properties is another interesting and clever technique. Properties are stored separately for different time windows, so the graph doesn't just keep growing in size and becoming unworkable.
You might choose to store daily summaries of properties, so that older properties are essentially archived off, as would be edges that have not been seen for a given period. This technique also means a user can specify a time period of interest at query-time, and Gaffer will aggregate the properties over that time window before returning the results to the user.
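To make the time-window idea concrete, here is a minimal sketch that keeps one count per edge per day, sums the days in a requested range at query time, and ages off old days. The class `WindowedCountStore` and its method names are assumptions made for illustration, not part of Gaffer.

```python
from collections import defaultdict
from datetime import date


class WindowedCountStore:
    """Toy store (hypothetical): counts are kept per edge per day."""

    def __init__(self):
        # (source, destination) -> {day: count}
        self._daily = defaultdict(lambda: defaultdict(int))

    def add_edge(self, source, destination, day, count=1):
        self._daily[(source, destination)][day] += count

    def count_between(self, source, destination, start, end):
        # Aggregate the daily summaries over the window chosen at query time.
        days = self._daily.get((source, destination), {})
        return sum(c for d, c in days.items() if start <= d <= end)

    def expire_before(self, cutoff):
        # Drop summaries older than the cutoff; edges with nothing left disappear.
        for edge in list(self._daily):
            days = self._daily[edge]
            for d in [d for d in days if d < cutoff]:
                del days[d]
            if not days:
                del self._daily[edge]


store = WindowedCountStore()
store.add_edge("A", "B", date(2015, 12, 1))
store.add_edge("A", "B", date(2015, 12, 1))
store.add_edge("A", "B", date(2015, 12, 2))
store.add_edge("A", "B", date(2015, 12, 10))

first_two_days = store.count_between("A", "B", date(2015, 12, 1), date(2015, 12, 2))
whole_month = store.count_between("A", "B", date(2015, 12, 1), date(2015, 12, 31))
store.expire_before(date(2015, 12, 5))
after_expiry = store.count_between("A", "B", date(2015, 12, 1), date(2015, 12, 31))
print(first_two_days, whole_month, after_expiry)
```

Because each day is stored separately, the same data can answer queries over any window, and expiring old days bounds how much is kept.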
You also get good control over the views on the data. The GitHub documentation explains:
"For example, we may have a graph containing a range of edge types, e.g. red, blue and green, and at query time we may choose to view only red edges within a certain time window."
You're not limited to edge analysis. You can have databases without any edges, meaning Gaffer can be used for machine learning applications where you need to keep the "feature vectors" describing a set of items up to date.
Gaffer uses Accumulo for storing data, though you can swap to other stores if you want. It makes use of Accumulo's iterator stack for efficient server-side merging of properties, and for query-time filtering of results.
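The combination of query-time filtering and merging can be sketched in plain Python. This does not use Accumulo's iterator API, and it runs in client memory rather than on the server; `EdgeRecord` and `view` are hypothetical names, and the red and blue edge types follow the example in the documentation quoted earlier.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EdgeRecord:
    """One stored daily summary for an edge (hypothetical layout)."""
    source: str
    destination: str
    edge_type: str
    day: date
    count: int


def view(records, edge_types, start, end):
    """Keep only the wanted edge types and days, then merge counts per edge."""
    totals = {}
    for r in records:
        if r.edge_type in edge_types and start <= r.day <= end:
            key = (r.source, r.destination, r.edge_type)
            totals[key] = totals.get(key, 0) + r.count
    return totals


records = [
    EdgeRecord("A", "B", "red", date(2015, 12, 1), 2),
    EdgeRecord("A", "B", "red", date(2015, 12, 2), 1),
    EdgeRecord("A", "B", "blue", date(2015, 12, 1), 5),
    EdgeRecord("A", "C", "red", date(2015, 11, 1), 4),
]

red_in_december = view(records, {"red"}, date(2015, 12, 1), date(2015, 12, 31))
print(red_in_december)
```

Doing the filter and merge next to the data, as Gaffer does with iterators, means only the already-aggregated result has to travel back to the client.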
Bearing in mind that this is being developed by GCHQ, the UK's signals intelligence analysis service, the requirements it was designed to meet make interesting reading:
Not only is this a very useful-looking graph database framework, it also shows that GCHQ is using modern data analysis techniques in innovative ways to keep track of potential terrorists.
|Last Updated ( Wednesday, 16 December 2015 )|