GCHQ Open Sources Gaffer
Written by Kay Ewbank   
Wednesday, 16 December 2015

A graph database developed by the UK communications agency and optimized for retrieving data on nodes of interest has been made available on GitHub.

Gaffer is a graph database that you can use to store large-scale graphs in which the nodes and edges have statistics such as counts, histograms and sketches. It's a graph database rather than a graph processing system, and differs from other graph storage systems because you can update properties within the store itself.

 


 

For example, if the edges in a graph have a count statistic, then when there is a new observation of an edge, the edge can simply be inserted into the graph with a count of 1. If the edge already exists in the graph, this count of 1 is added to the count stored on the existing edge.
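
The count example can be sketched in a few lines of Python. This is only an illustration of the idea, not Gaffer's API: the `CountingEdgeStore` class and its methods are hypothetical, and a plain dictionary stands in for the store, where the real merge would happen on the server side.

```python
from collections import defaultdict


class CountingEdgeStore:
    """Toy in-memory stand-in for a store that merges counts on insert."""

    def __init__(self):
        # (source, destination) -> running count
        self._counts = defaultdict(int)

    def add_edge(self, source, destination, count=1):
        # The caller never queries first; the store merges the new count
        # into whatever is already held for this edge.
        self._counts[(source, destination)] += count

    def count(self, source, destination):
        # .get avoids creating an entry for an edge that was never seen.
        return self._counts.get((source, destination), 0)


store = CountingEdgeStore()
for src, dst in [("A", "B"), ("A", "B"), ("B", "C")]:
    store.add_edge(src, dst)  # every observation is inserted with a count of 1

print(store.count("A", "B"))  # 2
```

The point of the design is that the ingest code is a blind write: it does not need to know whether the edge has been seen before.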

As the project's GitHub page explains:

"The ability to do these updates without the need for query-update-put is key to the ability to ingest large volumes of data."

The types of statistics you can use include maps, sets, histograms, hyperloglog sketches and bitmaps used to store timestamps.
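
Each of these statistic types needs its own rule for combining an old value with a new one. The sketch below assumes three hypothetical property names (`count`, `seen_ports`, `hourly`) and picks an obvious merge rule for each; it is not taken from the Gaffer code, and sketches such as HyperLogLog are left out because they need a dedicated library.

```python
from collections import Counter


def merge_count(a, b):
    return a + b  # counts are summed


def merge_set(a, b):
    return a | b  # sets are unioned


def merge_histogram(a, b):
    return Counter(a) + Counter(b)  # histograms are summed bucket by bucket


# Hypothetical property names mapped to their merge rule.
MERGERS = {
    "count": merge_count,
    "seen_ports": merge_set,
    "hourly": merge_histogram,
}


def merge_properties(existing, incoming):
    """Return a new property dict combining existing and incoming values."""
    merged = dict(existing)
    for name, value in incoming.items():
        if name in merged:
            merged[name] = MERGERS[name](merged[name], value)
        else:
            merged[name] = value
    return merged


old = {"count": 1, "seen_ports": {80}, "hourly": Counter({9: 1})}
new = {"count": 2, "seen_ports": {443}, "hourly": Counter({9: 2, 10: 1})}
merged = merge_properties(old, new)
```

After the merge, `merged["count"]` is 3, `merged["seen_ports"]` holds both ports, and the 9 o'clock bucket of the histogram has grown to 3.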

The way the graph stores properties is another interesting and clever technique. Properties are stored separately for different time windows, so the graph doesn't just keep growing in size and becoming unworkable.

You might choose to store daily summaries of properties, so that older properties are essentially archived off, as would be edges that have not been seen for a given period. This technique also means a user can specify a time period of interest at query-time, and Gaffer will aggregate the properties over that time window before returning the results to the user.
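
A minimal sketch of the time-window idea, assuming daily buckets and a simple count property. The `DailyEdgeStore` class is hypothetical and keeps everything in memory; it only shows how per-day summaries allow both query-time aggregation and age-off.

```python
from collections import defaultdict
from datetime import date


class DailyEdgeStore:
    """Toy store that keeps one count per edge per day."""

    def __init__(self):
        # (source, destination, day) -> count for that day
        self._daily = defaultdict(int)

    def add_edge(self, source, destination, day, count=1):
        self._daily[(source, destination, day)] += count

    def count_between(self, source, destination, start, end):
        """Aggregate the daily counts for start <= day <= end."""
        return sum(
            c
            for (s, d, day), c in self._daily.items()
            if s == source and d == destination and start <= day <= end
        )

    def age_off(self, cutoff):
        """Drop every daily summary older than the cutoff date."""
        for key in [k for k in self._daily if k[2] < cutoff]:
            del self._daily[key]


daily = DailyEdgeStore()
daily.add_edge("A", "B", date(2015, 12, 1), count=2)
daily.add_edge("A", "B", date(2015, 12, 2))
daily.add_edge("A", "B", date(2015, 12, 10), count=5)

first_two_days = daily.count_between("A", "B", date(2015, 12, 1), date(2015, 12, 2))
daily.age_off(date(2015, 12, 5))
whole_month = daily.count_between("A", "B", date(2015, 12, 1), date(2015, 12, 31))
```

Here `first_two_days` is 3, and once the early buckets have been aged off `whole_month` is 5, because only the summary for 10 December remains.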

You also get good control over the views on the data. The GitHub documentation explains:

"For example, we may have a graph containing a range of edge types, e.g. red, blue and green, and at query time we may choose to view only red edges within a certain time window."

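The kind of view described in the documentation can be pictured as a filter applied before aggregation. In this sketch the `Edge` record and the `view` function are hypothetical names, and the filtering runs over a Python list rather than inside the store.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Edge:
    source: str
    destination: str
    edge_type: str  # e.g. "red", "blue" or "green"
    day: date
    count: int


def view(edges, edge_types, start, end):
    """Keep edges of the wanted types inside the window and sum their counts."""
    totals = {}
    for e in edges:
        if e.edge_type in edge_types and start <= e.day <= end:
            key = (e.source, e.destination, e.edge_type)
            totals[key] = totals.get(key, 0) + e.count
    return totals


edges = [
    Edge("A", "B", "red", date(2015, 12, 1), 1),
    Edge("A", "B", "red", date(2015, 12, 2), 4),
    Edge("A", "B", "blue", date(2015, 12, 2), 7),
    Edge("A", "B", "red", date(2015, 11, 1), 9),
]

red_in_december = view(edges, {"red"}, date(2015, 12, 1), date(2015, 12, 31))
```

Only the two red edges from December survive the filter, so `red_in_december` maps `("A", "B", "red")` to 5.
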
You're not limited to edge analysis. You can have databases without any edges, meaning Gaffer can be used for machine learning applications where you need to keep the "feature vectors" describing a set of items up to date.

Gaffer uses Accumulo for storing data, though you can swap to other stores if you want. It makes use of Accumulo's iterator stack for efficient server-side merging of properties, and for query-time filtering of results.

 


 

Bearing in mind that this is being developed by GCHQ, the UK's signals intelligence service, the requirements it was designed to meet make interesting reading:

  • Allow the creation of graphs with summarised properties within Accumulo with a very minimal amount of coding.

  • Allow flexibility of statistics that describe the entities and edges.

  • Allow easy addition of new types of nodes and edges.

  • Allow quick retrieval of data on nodes of interest.

  • Deal with data of different security levels - all data has a visibility, and this is used to restrict who can see data based on their authorizations.

  • Support automatic age-off of data.
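
The visibility requirement can be illustrated with a deliberately simplified model. The sketch assumes each record carries a set of labels that must all be present in the user's authorizations; Accumulo's own visibility expressions are richer than this, and the function names and labels below are hypothetical.

```python
def is_visible(required_labels, authorizations):
    """A record is visible only if the user holds every label it requires."""
    return required_labels <= authorizations


def query(records, authorizations):
    """Return the values of the records this user is allowed to see."""
    return [
        r["value"]
        for r in records
        if is_visible(r["visibility"], authorizations)
    ]


records = [
    {"value": "edge A->B", "visibility": frozenset({"public"})},
    {"value": "edge B->C", "visibility": frozenset({"public", "restricted"})},
]

public_only = query(records, frozenset({"public"}))
cleared = query(records, frozenset({"public", "restricted"}))
```

A user holding only the `public` label gets back the first edge, while a user holding both labels sees both.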

Not only is this a very useful-looking graph database framework, it also shows that GCHQ is using modern data analysis techniques in innovative ways to keep track of potential terrorists.

 


More Information

Gaffer On GitHub

Apache Accumulo

Related Articles

UK Launches Games Portal To Discover Cyber Security Talent

GCHQ Builds A Raspberry Pi Cluster

Graph Databases (Book Review)

 


Last Updated ( Wednesday, 16 December 2015 )