This blog is based on the paper A Generalization of Transformer Networks to Graphs with Xavier Bresson at 2021 AAAI Workshop on Deep Learning on Graphs: Methods and Applications (DLG-AAAI’21).

We present Graph Transformer, a transformer neural network that can operate on arbitrary graphs.

### BLOG OUTLINE:

- Background
- Objective
- Key Design Aspects for Graph Transformer
- Proposed Graph Transformer Architecture
- Remarks From the Experiments

**If this in-depth educational content is useful for you, subscribe to our AI research mailing list to be alerted when we release new material. **

## 1. Background

Lets start with the two keywords, **Transformers **and **Graphs**, for a background.

### Transformers

Transformers [1] based neural networks are the most successful architectures for representation learning in Natural Language Processing (NLP) overcoming the bottlenecks of Recurrent Neural Networks (RNNs) caused by the sequential processing. As the core building block of Transformers, there exists the multi-head attention mechanism, represented by this formula:

Using multi-head attention, a word attends to each other word in a sentence and combines the received information to generate its abstract feature representations.

### Graphs

Graphs are ubiquitous data structures. There is a wide range of application domains where datasets can be represented as graphs. For example, molecular graphs in chemistry, interactions among particles in physics, drug protein interactions in medicine, user and their social and commercial connections in social media, problems in combinatorial optimization, etc.

For learning on graphs, graph neural networks (GNNs) have emerged as the most powerful tool in deep learning.

In short, GNNs consist of several parameterized layers, with each layer taking in a graph with node (and edge) features and builds abstract feature representations of nodes (and edges) by taking the available explicit connectivity structure (*i.e., *graph structure) into account. The so-generated features are then passed to downstream classification layers, usually MLPs, and the target property is predicted.

## 2. Objective

Now, the target is to generalize transformer neural networks to graphs so that it can learn on graphs and datasets with arbitrary structure rather than just the sequential (as can be interpreted to be done by the NLP Transformers).

To proceed with the objective, we focus on extending the key design principles of Transformers from NLP to graphs in general.

## 3. Key Design Aspects for Graph Transformer

We find that attention using graph sparsity and positional encodings are two key design aspects for the generalization of transformers to arbitrary graphs.

Now, we discuss these from the contexts of both NLP and graphs to make the proposed extensions clear for Graph Transformer.

### Graph Sparsity

In NLP, Transformers consider full attention while building feature representations for words. That is, a transformer treats a sentence as a fully connected graph of words. This choice of full attention can be justified for *two *reasons:

First, it is difficult to find meaningful sparse interactions or connections among the words in a sentence. For instance, the dependency of a word in a sentence on another word can vary with context, user’s perspective, and application at hand. There can be numerous plausible ground truth connections among words in a sentence and therefore, text datasets of sentences do not often have explicit word interactions available. It thereby makes sense to perform full attention and let the model decide how the words depend on others.

Second, the so-interpreted fully connected graph in NLP often has less than tens of hundreds of nodes (*i.e.,* sentences are often less than tens or hundreds of words). On this size of graphs, attention to every node is computationally feasible in memory and time.

For these two reasons, full attention can be performed in NLP transformers and subsequent works [2,3,4] have shown it to be fruitful in language modeling and several NLP tasks.

However, in the case of actual graph datasets, graphs have arbitrary connectivity structure available based on the application domain, and have node sizes in ranges of up to millions, or even billions. The available structure presents us with a rich source of information to exploit while learning in the neural network, whereas the node sizes practically makes it impossible to have a fully connected graph for such datasets.

On these accounts, it is practical (for feasibility) and advantageous (for utlizing sparse structure information) to have a Graph Transformer where a node attends to local node neighbors, similar to Graph Attention Networks (GATs) [5].

In fact, the local information aggregation is at the core of GNNs, indicative of the fact that sparsity is a good inductive bias for generalization.

### Positional Encodings

The attention mechanism in Transformer is invariant to the ordering of the nodes. It does not have any notion of where in the sequence (or the sentence) the word is located. This means that, the Transformer regards a multi-set of words rather than the sequence of words in NLP, as illustrated by the following comparison:

That would mean losing out some information about the ordering of the words, isn’t it?

To avoid this and making Transformer aware of the sequential information, some kind of positional encoding is necessary in the transformer. The original transformer by Vaswani et al. [1] uses sinusoidal positional encoding that is added to each word’s feature vector at the inputs. This helps encode the necessary prevalent (sequential) relationship among the words into the model.

We extend this critical design block of positional information encoding for Graph Transformer. In fact, a line of research in GNNs [6,7,8] has recently shown that positional information improves GNNs and overcomes the failure of GNNs for several fundamental tasks.

We therefore leverage the success of the recent works on positional information in GNNs and use Laplacian Positional Encodings [8] in Graph Transformer. We use precomputed Laplacian eigenvectors [9] to add into the node features before the first layer, similar to how positional encodings are added in the original Transformer [1].

Laplacian PEs are a natural generalization of the sinusoidal PEs used in the original transformer, as the sinusoidal PEs can be interpreted as eigenvectors of the line graph, which is the sentence in NLP.

Hence, sparse graph structure during attention and positional encodings at the inputs are the two important things we consider while generalizing transformers to arbitrary graphs.

## 4. Proposed Graph Transformer Architecture

We now present the proposed architecture — the Graph Transformer Layer and the Graph Transformer Layer with edge features. The schematic diagram of a layer shown as follows consist of the major components — the input with PEs, the multi-head attention mechanism with attention restricted to local neighbors, and the feed forward module.

Compared to the standard transformer of Vaswani et al. [1], the key differences (or the extensions) which generalizes transformer to graphs to result in a Graph Transformer are:

i) The attention mechanism is a function of the *neighborhood connectivity *for each node, which is shown by the formula:

ii) Positional Encoding is represented by *Laplacian PEs*. In particular, the eigenvectors of graph Laplacian are precomputed for every graph before training, and *k*-smallest non-trivial eigenvectors of a node are assigned as the node’s PE.

iii) The feed forward module employs *batch normalization *[10] instead of layer normalization [11] that was used in the original transformer [1]. This is supported by our empirical evidences that the use of batch normalization leads to better performance in place of layer normalization.

iv) Graph Transformer is extended to have *edge representation* (see the *Graph Transformer Layer with edge features* at right of the architecture diagram). This architecture can be critical to datasets with rich information along the edges, for instances — bond information along the edges in molecular graphs, or relationship types in knowledge graphs. There are two things to note in this edge-extended architecture: the edge features are fused to the corresponding pairwise implicit attention scores, and there is a designated edge feature pipeline at every layer, as shown by the layer update equations as follows:

This concludes the description of the proposed Graph Transformer. We refer to the paper for the complete layer update equations.

## 5. Remarks From the Experiments

We evaluate Graph Transformer on benchmark datasets and validate our design choices, alongside an attempt to answer some open research questions on: i) whether local attention or full attention for generalization of transformers to graphs, ii) how to encode sparse structure information?, iii) positional encoding candidates.

We *remark* the following results:

a) As already discussed, graph sparsity is critical and its use in attention mechanism always gives better performance as compared to use full attention. The result that *sparsity *is a good inductive bias for graph datasets has already been shown by the success of GNNs on several application areas.

b) Among several combination of design choices on attention, use of PEs, normalization candidate, etc., the architecture *using* i) attention to local neighbors, ii) Laplacian PEs, and iii) batch normalization layer in feed forward module, has the best performance across all datasets used for evaluation. This empirically validates the choice of using these components for the targeted generalization of transformers.

c) Since Laplacian PEs have desirable properties of having i) distance-aware information, ii) distinguishable node features, and iii) being a generalization of original transformer’s PE’s to general graphs, it fares as a suitable PE candidate to be used as Graph Transformer, *even* empirically, as compared to the PE candidates used in the literature for research on Graph Transformer (the related works are discussed in detail in the paper).

d) Overall, Graph Transformer achieves competitive performance among the GNNs compared with on the evaluated datasets. The proposed architecture performs significantly better than baseline GNNs (GCNs [12] and GATs [5]), and helps close the gap between the original transformer and transformer for graphs. Thus, Graph Transformer emerges as a fresh powerful attention based GNN baseline and we hope can easily be extended for future research, provided its simplicity and straightforward generalization from transformers.

The tables of the numerical experiments are in the paper, the code implementation is open-sourced on GitHub, and an accompanying video presentation is on YouTube, with the corresponding links as follows:

Paper: https://arxiv.org/abs/2012.09699

GitHub: https://github.com/graphdeeplearning/graphtransformer

Video: https://youtu.be/h-_HNeBmaaU?t=240

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need.

[2] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., (2018). Bert: Pre-training of deep bidirectional transformers for language understanding.

[3] Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., (2018). Improving language understanding by generative pre-training.

[4] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., (2020). Language models are few-shot learners.

[5] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P. and Bengio, Y. (2018). Graph attention networks.

[6] Srinivasan, B. and Ribeiro, B., (2019). On the equivalence between positional node embeddings and structural graph representations.

[7] You, J., Ying, R. and Leskovec, J., (2019). Position-aware graph neural networks.

[8] Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y., and Bresson, X. (2020). Benchmarking graph neural networks.

[9] Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation.

[10] Ioffe, S. and Szegedy, C., (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift.

[11] Ba, J.L., Kiros, J.R. and Hinton, G.E., (2016). Layer normalization.

[12] Kipf, T. N., and Welling, M. (2016). Semi-supervised classification with graph convolutional networks.

*This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.*

## Enjoy this article? Sign up for more AI updates.

We’ll let you know when we release more technical education.

## Leave a Reply