GCHQ Open Sources Gaffer
Written by Kay Ewbank   
Wednesday, 16 December 2015

A graph database developed by the UK communications agency and optimized for retrieving data on nodes of interest has been made available on GitHub.

Gaffer is a graph database that you can use to store large-scale graphs in which the nodes and edges have statistics such as counts, histograms and sketches. It's a graph database rather than a graph processing system, and differs from other graph storage systems because you can update properties within the store itself.

 


 

For example, if the edges in a graph have a count statistic, then when there is a new observation of an edge, the edge can simply be inserted into the graph with a count of 1. If this edge already exists in the graph then this count of 1 will be added onto the existing edge.

As the project's GitHub page explains:

"The ability to do these updates without the need for query-update-put is key to the ability to ingest large volumes of data."
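
The merge-on-insert idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Gaffer's API: the class `CountingEdgeStore` and its methods are hypothetical names, and a plain in-memory dictionary stands in for the real store.

```python
from collections import defaultdict


class CountingEdgeStore:
    """Toy in-memory store (hypothetical, not Gaffer's API).

    Inserting an edge adds its count to whatever count is already stored.
    """

    def __init__(self):
        self._counts = defaultdict(int)

    def add_edge(self, source, destination, count=1):
        # The caller never queries the old value first: the store merges on insert.
        self._counts[(source, destination)] += count

    def count(self, source, destination):
        # .get avoids creating an entry for an edge that was never inserted.
        return self._counts.get((source, destination), 0)


store = CountingEdgeStore()
store.add_edge("A", "B")  # first observation of A -> B
store.add_edge("A", "B")  # second observation is merged into the existing edge
store.add_edge("A", "C")

print(store.count("A", "B"))
print(store.count("A", "C"))
```

The point of the design is that the writer only ever issues inserts, so ingest does not pay for a read before every write.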

The types of statistics you can use include maps, sets, histograms, HyperLogLog sketches and bitmaps used to store timestamps.

The way Gaffer stores properties is also clever. Properties are stored separately for different time windows, so the graph doesn't just keep growing in size until it becomes unworkable.

You might choose to store daily summaries of properties, so that older properties are essentially archived off, as would be edges that have not been seen for a given period. This technique also means a user can specify a time period of interest at query-time, and Gaffer will aggregate the properties over that time window before returning the results to the user.
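
As a rough sketch of how per-window storage supports query-time aggregation, the Python below keeps one count per edge per day and sums only the days that fall inside the requested window. The data layout and the function name `count_in_window` are assumptions made for illustration; Gaffer's own storage format is not described here.

```python
from datetime import date

# Hypothetical layout: one summary per (edge, day) instead of one ever-growing record.
daily_counts = {
    ("A", "B"): {
        date(2015, 12, 14): 3,
        date(2015, 12, 15): 5,
        date(2015, 12, 16): 2,
    },
}


def count_in_window(edge, start, end):
    """Sum the daily counts for `edge` over the inclusive window [start, end]."""
    buckets = daily_counts.get(edge, {})
    return sum(count for day, count in buckets.items() if start <= day <= end)


# Aggregate only the last two days of the stored data.
recent = count_in_window(("A", "B"), date(2015, 12, 15), date(2015, 12, 16))
print(recent)
```

Because each day is a separate bucket, old buckets can be dropped without touching recent ones.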

You also get good control over the views on the data. The GitHub documentation explains:

"For example, we may have a graph containing a range of edge types, e.g. red, blue and green, and at query time we may choose to view only red edges within a certain time window."
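
The quoted example of a query-time view might look something like the following in Python. The `Edge` record and the `view` function are hypothetical stand-ins, and the filter runs over an in-memory list rather than inside the database as Gaffer would do it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Edge:
    source: str
    destination: str
    edge_type: str
    day: date
    count: int


edges = [
    Edge("A", "B", "red", date(2015, 12, 14), 3),
    Edge("A", "B", "blue", date(2015, 12, 15), 1),
    Edge("B", "C", "red", date(2015, 12, 15), 2),
    Edge("C", "D", "green", date(2015, 12, 16), 4),
    Edge("A", "C", "red", date(2015, 12, 20), 1),
]


def view(all_edges, edge_types, start, end):
    """Keep only edges of the wanted types seen in the inclusive window [start, end]."""
    return [
        e for e in all_edges
        if e.edge_type in edge_types and start <= e.day <= end
    ]


# View only red edges seen between 14 and 16 December.
red_recent = view(edges, {"red"}, date(2015, 12, 14), date(2015, 12, 16))
for e in red_recent:
    print(e.source, e.destination, e.count)
```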

You're not limited to edge analysis. You can have databases without any edges, meaning Gaffer can be used for machine learning applications where you need to keep the "feature vectors" describing a set of items up to date.

Gaffer uses Accumulo for storing data, though you can swap to other stores if you want. It makes use of Accumulo's iterator stack for efficient server-side merging of properties, and for query-time filtering of results.

 


 

Bearing in mind that this is being developed by GCHQ, the UK's signals intelligence agency, the requirements it was designed to meet make interesting reading:

  • Allow the creation of graphs with summarised properties within Accumulo with a very minimal amount of coding.

  • Allow flexibility of statistics that describe the entities and edges.

  • Allow easy addition of new types of nodes and edges.

  • Allow quick retrieval of data on nodes of interest.

  • Deal with data of different security levels - all data has a visibility, and this is used to restrict who can see data based on their authorizations.

  • Support automatic age-off of data.
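
The last two requirements, visibility and age-off, can be illustrated with a simplified Python sketch. It assumes each record carries a single visibility label and a last-seen date; Accumulo's real visibility labels are boolean expressions over several tokens, which this deliberately does not model. The names `Record`, `visible_to` and `age_off` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Record:
    key: str
    visibility: str  # simplified: one label per record (an assumption)
    last_seen: date


def visible_to(records, authorizations):
    """Return only the records whose visibility label the user is authorized for."""
    return [r for r in records if r.visibility in authorizations]


def age_off(records, today, max_age_days):
    """Drop records not seen within the last `max_age_days` days."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if r.last_seen >= cutoff]


records = [
    Record("edge-1", "public", date(2015, 12, 15)),
    Record("edge-2", "secret", date(2015, 12, 16)),
    Record("edge-3", "public", date(2015, 11, 1)),
]

public_only = visible_to(records, {"public"})
still_fresh = age_off(records, today=date(2015, 12, 16), max_age_days=30)
print([r.key for r in public_only])
print([r.key for r in still_fresh])
```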

Not only is this a very useful-looking graph database framework, it also shows that GCHQ is using modern data analysis techniques in innovative ways to keep track of potential terrorists.

 


More Information

Gaffer on GitHub

Apache Accumulo


 


Last Updated ( Wednesday, 16 December 2015 )