Chapter 8. Indexing for better Performance

Titan supports two different kinds of indexing to speed up query processing: graph indexes and vertex-centric indexes. Most graph queries start the traversal from a list of vertices or edges that are identified by their properties. Graph indexes make these global retrieval operations efficient on large graphs. Vertex-centric indexes speed up the actual traversal through the graph, in particular when traversing through vertices with many incident edges.

8.1. Graph Index

Graph indexes are global index structures over the entire graph which allow efficient retrieval of vertices or edges by their properties for sufficiently selective conditions. For instance, consider the following queries

g.V().has('name', 'hercules')
g.E().has('reason', textContains('loves'))

The first query asks for all vertices with the name hercules. The second asks for all edges where the property reason contains the word loves. Without a graph index answering those queries would require a full scan over all vertices or edges in the graph to find those that match the given condition which is very inefficient and infeasible for huge graphs.

Titan distinguishes between two types of graph indexes: composite and mixed indexes. Composite indexes are very fast and efficient but limited to equality lookups for a particular, previously-defined combination of property keys. Mixed indexes can be used for lookups on any combination of indexed keys and support multiple condition predicates in addition to equality depending on the backing index store.

Both types of indexes are created through the Titan management system and the index builder returned by TitanManagement.buildIndex(String, Class) where the first argument defines the name of the index and the second argument specifies the type of element to be indexed (e.g. Vertex.class). The name of a graph index must be unique. Graph indexes built against newly defined property keys, i.e. property keys that are defined in the same management transaction as the index, are immediately available. Graph indexes built against property keys that are already in use require the execution of a reindex procedure to ensure that the index contains all previously added elements. Until the reindex procedure has completed, the index will not be available. It is encouraged to define graph indexes in the same transaction as the initial schema.

[Note]Note

In the absence of an index, Titan will default to a full graph scan in order to retrieve the desired list of vertices. While this produces the correct result set, the graph scan can be very inefficient and lead to poor overall system performance in a production environment. Enable the force-index configuration option in production deployments of Titan to prohibit graph scans.

8.1.1. Composite Index

Composite indexes retrieve vertices or edges by one or a (fixed) composition of multiple keys. Consider the following composite index definitions.

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'byNameComposite').call()
mgmt.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

First, two property keys name and age are already defined. Next, a simple composite index on just the name property key is built. Titan will use this index to answer the following query.

g.V().has('name', 'hercules')

The second composite graph index includes both keys. Titan will use this index to answer the following query.

g.V().has('age', 30).has('name', 'hercules')

Note, that all keys of a composite graph index must be found in the query’s equality conditions for this index to be used. For example, the following query cannot be answered with either of the indexes because it only contains a constraint on age but not name.

g.V().has('age', 30)

Also note, that composite graph indexes can only be used for equality constraints like those in the queries above. The following query would be answered with just the simple composite index defined on the name key because the age constraint is not an equality constraint.

g.V().has('name', 'hercules').has('age', inside(20, 50))

Composite indexes do not require configuration of an external indexing backend and are supported through the primary storage backend. Hence, composite index modifications are persisted through the same transaction as graph modifications which means that those changes are atomic and/or consistent if the underlying storage backend supports atomicity and/or consistency.

[Note]Note

A composite index may comprise just one or multiple keys. A composite index with just one key is sometimes referred to as a key-index.

8.1.1.1. Index Uniqueness

Composite indexes can also be used to enforce property uniqueness in the graph. If a composite graph index is defined as unique() there can be at most one vertex or edge for any given concatenation of property values associated with the keys of that index. For instance, to enforce that names are unique across the entire graph the following composite graph index would be defined.

graph.tx().rollback()  //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'byNameUnique').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameUnique"), SchemaAction.REINDEX).get()
mgmt.commit()
[Note]Note

To enforce uniqueness against an eventually consistent storage backend, the consistency of the index must be explicitly set to enabling locking.

8.1.2. Mixed Index

Mixed indexes retrieve vertices or edges by any combination of previously added property keys. Mixed indexes provide more flexibility than composite indexes and support additional condition predicates beyond equality. On the other hand, mixed indexes are slower for most equality queries than composite indexes.

Unlike composite indexes, mixed indexes require the configuration of an indexing backend and use that indexing backend to execute lookup operations. Titan can support multiple indexing backends in a single installation. Each indexing backend must be uniquely identified by name in the Titan configuration which is called the indexing backend name.

graph.tx().rollback()  //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('nameAndAge', Vertex.class).addKey(name).addKey(age).buildMixedIndex("search")
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'nameAndAge').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("nameAndAge"), SchemaAction.REINDEX).get()
mgmt.commit()

The example above defines a mixed index containing the property keys name and age. The definition refers to the indexing backend name search so that Titan knows which configured indexing backend it should use for this particular index. The search parameter specified in the buildMixedIndex call must match the second clause in the Titan configuration definition like this: index.search.backend If the index was named solrsearch then the configuration definition would appear like this: index.solrsearch.backend.

The mgmt.buildIndex example specified above uses text search as its default behavior. An index statement that explicity defines the index as a text index can be written as follows:

mgmt.buildIndex('nameAndAge',Vertex.class).addKey(name,Mapping.TEXT.getParameter()).addKey(age,Mapping.TEXT.getParameter()).buildMixedIndex("search")

See Chapter 20, Index Parameters and Full-Text Search for more information on text and string search options, and see the documentation section specific to the indexing backend in use for more details on how each backend handles text versus string searches.

While the index definition example looks similar to the composite index above, it provides greater query support and can answer any of the following queries.

g.V().has('name', textContains('hercules')).has('age', inside(20, 50))
g.V().has('name', textContains('hercules'))
g.V().has('age', lt(50))

Mixed indexes support full-text search, range search, geo search and others. Refer to Chapter 19, Search Predicates and Data Types for a list of predicates supported by a particular indexing backend.

[Note]Note

Unlike composite indexes, mixed indexes do not support uniqueness.

8.1.2.1. Adding Property Keys

Property keys can be added to an existing mixed index which allows subsequent queries to include this key in the query condition.

graph.tx().rollback()  //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
location = mgmt.makePropertyKey('location').dataType(Geoshape.class).make()
nameAndAge = mgmt.getGraphIndex('nameAndAge')
mgmt.addIndexKey(nameAndAge, location)
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'nameAndAge').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("nameAndAge"), SchemaAction.REINDEX).get()
mgmt.commit()

To add a newly defined key, we first retrieve the existing index from the management transaction by its name and then invoke the addIndexKey method to add the key to this index.

If the added key is defined in the same management transaction, it will be immediately available for querying. If the property key has already been in use, adding the key requires the execution of a reindex procedure to ensure that the index contains all previously added elements. Until the reindex procedure has completed, the key will not be available in the mixed index.

8.1.2.2. Mapping Parameters

When adding a property key to a mixed index - either through the index builder or the addIndexKey method - a list of parameters can be optionally specified to adjust how the property value is mapped into the indexing backend. Refer to the mapping parameters overview for a complete list of parameter types supported by each indexing backend.

8.1.3. Ordering

The order in which the results of a graph query are returned can be defined using the order().by() directive. The order().by() method expects two parameters:

  • The name of the property key by which to order the results. The results will be ordered by the value of the vertices or edges for this property key.
  • The sort order: either increasing incr or decreasing decr

For example, the query g.V().has('name', textContains('hercules')).order().by('age', decr).limit(10) retrieves the ten oldest individuals with hercules in their name.

When using order().by() it is important to note that:

  • Composite graph indexes do not natively support ordering search results. All results will be retrieved and then sorted in-memory. For large result sets, this can be very expensive.
  • Mixed indexes support ordering natively and efficiently. However, the property key used in the order().by() method must have been previously added to the mixed indexed for native result ordering support. This is important in cases where the the order().by() key is different from the query keys. If the property key is not part of the index, then sorting requires loading all results into memory.

8.1.4. Label Constraint

In many cases it is desirable to only index vertices or edges with a particular label. For instance, one may want to index only gods by their name and not every single vertex that has a name property. When defining an index it is possible to restrict the index to a particular vertex or edge label using the indexOnly method of the index builder. The following creates a composite index for the property key name that indexes only vertices labeled god.

graph.tx().rollback()  //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
god = mgmt.getVertexLabel('god')
mgmt.buildIndex('byNameAndLabel', Vertex.class).addKey(name).indexOnly(god).buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'byNameAndLabel').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndLabel"), SchemaAction.REINDEX).get()
mgmt.commit()

Label restrictions similarly apply to mixed indexes. When a composite index with label restriction is defined as unique, the uniqueness constraint only applies to properties on vertices or edges for the specified label.

8.1.5. Composite vs Mixed Index

  1. Use a composite index for exact match index retrievals. Composite indexes do not require configuring or operating an external index system and are often significantly faster than mixed indexes.

    1. As an exception, use a mixed index for exact matches when the number of distinct values for query constraint is relatively small or if one value is expected to be associated with many elements in the graph (i.e. in case of low selectivity).
  2. Use a mixed indexes for numeric range, full-text or geo-spatial indexing. Also, using a mixed index can speed up the order().by() queries.

8.2. Vertex-centric Index

Vertex-centric indexes are local index structures built individually per vertex. In large graphs vertices can have thousands of incident edges. Traversing through those vertices can be very slow because a large subset of the incident edges has to be retrieved and then filtered in memory to match the conditions of the traversal. Vertex-centric indexes can speed up such traversals by using localized index structures to retrieve only those edges that need to be traversed.

Suppose that Hercules battled hundreds of monsters in addition to the three captured in the introductory Graph of the Gods. Without a vertex-centric index, a query asking for those monsters battled between time point 10 and 20 would require retrieving all battled edges even though there are only a handful of matching edges.

h = g.V().has('name', 'hercules').next()
g.V(h).outE('battled').has('time', inside(10, 20)).inV()

Building a vertex-centric index by time speeds up such traversal queries.

graph.tx().rollback()  //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
time = mgmt.getPropertyKey('time')
battled = mgmt.getEdgeLabel('battled')
mgmt.buildEdgeIndex(battled, 'battlesByTime', Direction.BOTH, Order.decr, time)
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'battlesByTime').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("battlesByTime"), SchemaAction.REINDEX).get()
mgmt.commit()

This example builds a vertex-centric index which indexes battled edges in both direction by time in decreasing order. A vertex-centric index is built against a particular edge label which is the first argument to the index construction method TitanManagement.buildEdgeIndex(). The index only applies to edges of this label - battled in the example above. The second argument is a unique name for the index. The third argument is the edge direction in which the index is built. The index will only apply to traversals along edges in this direction. In this example, the vertex-centric index is built in both direction which means that time restricted traversals along battled edges can be served by this index in both the IN and OUT direction. Titan will maintain a vertex-centric index on both the in- and out-vertex of battled edges. Alternatively, one could define the index to apply to the OUT direction only which would speed up traversals from Hercules to the monsters but not in the reverse direction. This would only require maintaining one index and hence half the index maintenance and storage cost. The last two arguments are the sort order of the index and a list of property keys to index by. The sort order is optional and defaults to ascending order (i.e. Order.ASC). The list of property keys must be non-empty and defines the keys by which to index the edges of the given label. A vertex-centric index can be defined with multiple keys.

graph.tx().rollback()  //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
time = mgmt.getPropertyKey('time')
rating = mgmt.makePropertyKey('rating').dataType(Double.class).make()
battled = mgmt.getEdgeLabel('battled')
mgmt.buildEdgeIndex(battled, 'battlesByRatingAndTime', Direction.OUT, Order.decr, rating, time)
mgmt.commit()
//Wait for the index to become available
mgmt.awaitRelationIndexStatus(graph, 'battlesByRatingAndTime', 'battled').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getRelationIndex(battled, 'battlesByRatingAndTime'), SchemaAction.REINDEX).get()
mgmt.commit()

This example extends the schema by a rating property on battled edges and builds a vertex-centric index which indexes battled edges in the out-going direction by rating and time in decreasing order. Note, that the order in which the property keys are specified is important because vertex-centric indexes are prefix indexes. This means, that battled edges are indexed by rating first and time second.

h = g.V().has('name', 'hercules').next()
g.V(h).outE('battled').property('rating', 5.0) //Add some rating properties
g.V(h).outE('battled').has('rating', gt(3.0)).inV()
g.V(h).outE('battled').has('rating', 5.0).has('time', inside(10, 50)).inV()
g.V(h).outE('battled').has('time', inside(10, 50)).inV()

Hence, the battlesByLocationAndTime index can speed up the first two but not the third query.

Multiple vertex-centric indexes can be built for the same edge label in order to support different constraint traversals. Titan’s query optimizer attempts to pick the most efficient index for any given traversal. Vertex-centric indexes only support equality and range/interval constraints.

[Note]Note

The property keys used in a vertex-centric index must have an explicitly defined data type (i.e. not Object.class) which supports a native sort order. If the data type are floating point numbers, Titan’s custom Decimal or Precision data types must be used which have a fixed number of decimals.

If the vertex-centric index is built against an edge label that is defined in the same management transaction, the index will be immediately available for querying. If the edge label has already been in use, building a vertex-centric index against it requires the execution of a reindex procedure to ensure that the index contains all previously added edges. Until the reindex procedure has completed, the index will not be available.

[Note]Note

Titan automatically builds vertex-centric indexes per edge label and property key. That means, even with thousands of incident battled edges, queries like g.V(h).out('mother') or g.V(h).values('age') are efficiently answered by the local index.

Vertex-centric indexes cannot speed up unconstrained traversals which require traversing through all incident edges of a particular label. Those traversals will become slower as the number of incident edges increases. Often, such traversals can be rewritten as constrained traversals that can utilize a vertex-centric index to ensure acceptable performance at scale.

8.2.1. Ordered Traversals

The following queries specify an order in which the incident edges are to be traversed. Use the localLimit command to retrieve a subset of the edges (in a given order) for EACH vertex that is traversed.

h = g..V().has('name', 'hercules').next()
g.V(h).local(outE('battled').order().by('time', decr).limit(10)).inV().values('name')
g.V(h).local(outE('battled').has('rating', 5.0).order().by('time', decr).limit(10)).values('place')

The first query asks for the names of the 10 most recently battled monsters by Hercules. The second query asks for the places of the 10 most recent battles of Hercules that are rated 5 stars. In both cases, the query is constrained by an order on a property key with a limit on the number of elements to be returned.

Such queries can also be efficiently answered by vertex-centric indexes if the order key matches the key of the index and the requested order (i.e. increasing or decreasing) is the same as the one defined for the index. The battlesByTime index would be used to answer the first query and battlesByLocationAndTime applies to the second. Note, that the battlesByLocationAndTime index cannot be used to answer the first query because an equality constraint on rating must be present for the second key in the index to be effective.

[Note]Note

Ordered vertex queries are a Titan extension to Gremlin which causes the verbose syntax and requires the _() step to convert the Titan result back into a Gremlin pipeline.