This article describes how to extend the simplest formulation of Graph Neural Networks (GNNs) to encode the structure of multi-relational data, such as Knowledge Graphs (KGs).

The article includes 4 main sections:

- an introduction to the key idea of multi-relational data, which describes the peculiarity of KGs;
- a summary of the standard components included in a GNN architecture;
- a description of the simplest formulation of GNNs, known as Graph Convolutional Networks (GCNs);
- a discussion on how to extend the GCN layer in the form of a Relational Graph Convolutional Network (R-GCN) to encode multi-relational data.

**If this in-depth educational content is useful for you, subscribe to our AI research mailing list to be alerted when we release new material. **

## Knowledge Graphs as Multi-Relational Data

A basic graph structure includes undirected, no-typed, and unique edges for connecting nodes. For instance, in the philosophical domain, we can define a link between two nodes represented by “Socrates” and “Plato” entities. In this specific case, we do not provide any information on the nature of the relationship between these philosophers.

On the other side, KGs include directed, typed, and *multiple* edges for connecting nodes. Considering our running example, the connection from “Socrates” to “Plato” can be labeled with “influenced”. In the opposite direction, another connection can be established from “Plato” to “Socrates”, which can be labeled with “influenced by”.

In other words, KGs are graph-based structures whose nodes represent real-world entities, while edges define multiple relations between these entities. I explained further details on KGs in the following article: Knowledge Graphs at a Glance.

## Graph Neural Networks (GNNs)

I summarized the main building blocks of a GNN architecture in the following article: Understanding the Building Blocks of Graph Neural Networks (Intro).

In particular, I presented an overview of the main components to set up a GNN, including (i) the input layer, (ii) the GNN layers(s), and (iii) the Multilayer Perceptron (MLP) prediction layer(s).

In this architecture, the GNN layer is the key component for encoding the local graph structure, which is employed to update node representation. Different GNN layers employ different types of aggregation of the local graph structure. To illustrate the GNN behavior using NumPy, we require 3 main ingredients:

- a matrix of one-hot vectors (no features) representing nodes;
- a weight matrix describing the hidden features of the nodes;
- an adjacency matrix defining undirected edges between nodes.

### One-hot vector representation of nodes (5,5): X = np.eye(5, 5) n = X.shape[0] np.random.shuffle(X) print(X)

----- [[0. 0. 1. 0. 0.] # Node 1 [0. 1. 0. 0. 0.] # Node 2 [0. 0. 0. 0. 1.] ... [1. 0. 0. 0. 0.] [0. 0. 0. 1. 0.]] # Node 5### Weight matrix (5,3) # Dimension of the hidden features h = 3 # Random initialization with Glorot and Bengio W = np.random.uniform(-np.sqrt(1./h), np.sqrt(1./h),(n,h)) print(W)-----[[-0.4294049 0.57624235 -0.3047382 ] [-0.11941829 -0.12942953 0.19600584] [ 0.5029172 0.3998854 -0.21561317] [ 0.02834577 -0.06529497 -0.31225734] [ 0.03973776 0.47800217 -0.04941563]]### Adjacency matrix of an undirect Graph (5,5) A = np.random.randint(2, size=(n, n)) # Include the self loop np.fill_diagonal(A, 1) # Symmetric adjacency matrix (undirected graph) A_und = (A + A.T) A_und[A_und > 1] = 1 print(A_und)-----[[1 1 1 0 1] # Connections to Node 1 [1 1 1 1 1] [1 1 1 1 0] [0 1 1 1 1] [1 1 0 1 1]]

Considering these ingredients, a “recursive neighborhood diffusion” (Dwivedi et al., 2020) is performed through the so-called “message passing framework” (Gilmer et al., 2017) for the update process. The neighbors’ features are passed to the target node as messages through the edges. Concretely, the required operations are the following (see the NumPy code for further details):

- a linear transformation (or projection) involving the initial representation of the nodes and the weight matrix including their hidden features.
- a neighborhood diffusion to update the representation of a node, aggregating the hidden features of its neighbors. This operation is computed in parallel for each node.

### Linear transformation L_0 = X.dot(W) print(L_0)-----[[0.5029172 0.3998854 -0.21561317] # Node 1 (3rd row of W) [-0.11941829 -0.12942953 0.19600584] # Node 2 (2nd row of W) [0.03973776 0.47800217 -0.04941563] # Node 3 (5th row of W) [-0.4294049 0.57624235 -0.3047382 ] [0.02834577 -0.06529497 -0.31225734]] # Node 5 (4th row of W)### GNN - Neighborhood diffusion ND_GNN = A_und.dot(L_0) print(ND_GNN)-----[[0.45158244 0.68316307 -0.3812803] # Updated Node 1 [ 0.02217754 1.25940542 -0.6860185 ] [-0.00616823 1.3247004 -0.37376116] [-0.48073966 0.85952002 -0.47040533] [-0.01756022 0.78140325 -0.63660287]]### Test on the aggregation assert(ND_GNN[0,0] == L_0[0,0] + L_0[1,0] + L_0[2,0] + L_0[4,0])

Observing the result of the neighborhood diffusion, you can notice that the updated representation of Node 1

[0.5029172 0.3998854 -0.21561317] # Node 1

is the sum of the vectors representing Node 1 (self-loop), Node 2, Node 3, and Node 4. Such nodes are connected to Node 1, according to the adjacency matrix defined previously.

[ 0.5029172 0.3998854 -0.21561317] # Node 1 [-0.11941829 -0.12942953 0.19600584] # Node 2 [ 0.03973776 0.47800217 -0.04941563] # Node 3 [ 0.02834577 -0.06529497 -0.31225734] # Node 5

The code described in this section can be mathematically formalized as follows:

## Graph Convolutional Networks (GCNs)

In the simplest formulation of GNNs, known as Vanilla Graph Convolutional Networks (GCNs), the node update is performed via an “isotropic averaging operation over the neighborhood features” (Dwivedi et al., 2020). In other words, neighbor nodes equally contribute to updating the central node’s representation. More precisely, in the specific case of Vanilla GCNs, an *isotropic average computation* is performed. This computation requires a new ingredient represented by each node’s degree, consisting of the number of its connected edges.

### Degree vector (degree for each node) D = A_und.sum(axis=1) print(D)-----[45 4 4 4] # Degree of Node 1### Reciprocal of the degree (diagonal matrix) D_rec = np.diag(np.reciprocal(D.astype(np.float32))) print(D_rec)-----[[0.250. 0. 0. 0. ] # Reciprocal value of Node 1 degree [0. 0.2 0. 0. 0. ] [0. 0. 0.25 0. 0. ] [0. 0. 0. 0.25 0. ] [0. 0. 0. 0. 0.25]]### GCN - Isotropic average computation ND_GCN = D_rec.dot(ND_GNN) print(ND_GCN)-----[[0.11289561 0.17079077 -0.09532007] # Updated Node 1 (with deg) [ 0.00443551 0.25188109 -0.1372037 ] [-0.00154206 0.3311751 -0.09344029] [-0.12018491 0.21488001 -0.11760133] [-0.00439005 0.19535081 -0.15915072]]### Test on the isotropic average computation: assert(ND_GCN[0,0] == ND_GNN[0,0] * D_rec[0,0])

Each element of the degree vector represents the degree value of the *i*-node. Indeed, the vector’s first element is equal to 4 because the adjacency matrix shows that 4 nodes are connected to Node 1. The degrees’ reciprocal values are then computed to enable the average contribution of the edges connected to the node. Finally, the isotropic average computation is performed according to the GCNs formulation. The updated representation of Node 1 performing the average computation using the Node 1 degree is equal to:

[ 0.11289561 0.17079077 -0.09532007]

This vector is obtained by multiplying the following vector (representing the aggregated representation of Node 1) with the reciprocal of its degree (0.25):

[ 0.45158244 0.68316307 -0.3812803 ]

The code described in this section can be mathematically formalized as follows:

## Relational Graph Convolutional Networks (R-GCNs)

The previous example describes the behavior of the GCNs on an undirected and no-typed graph. As mentioned before, the update process is based on the following steps (in the following explanation, the node degree is not considered for the sake of simplicity).

A *projection step* (or linear transformation) is achieved by multiplying (i) the one-hot feature matrix with (ii) the weight matrix.

- (i) 2D Matrix (n, n) defining the one-hot vectors to represent the nodes.
- (ii) 2D Matrix (n, h) defining the hidden features. The current matrix encodes only one type of relation.

An* aggregation step* is achieved by multiplying (i) the adjacency matrix with (ii) the matrix resulting from the projection step.

- (i) 2D Symmetric matrix (n, n) describing undirected and untyped edges.
- (ii) 2D Matrix (n, h) resulting from the projection step.

To extend the GCN layer to encode a KG structure, we need to represent our data as a directed and multi-typed graph. The update/aggregation process is similar to the previous one, but the ingredients are a bit more complex. Details on the steps to perform are available below.

A* projection step* is achieved by multiplying (i) the one-hot feature matrix with (ii) the weight *tensor*.

- (i) 2D Matrix (n, n) defining the initial features of the nodes.
- (ii)
*3D**Tensor (r, n, h)*describing the node hidden features. This tensor is able to encode different relations by stackingbatches of matrices with size*r*. Each of these batches encodes a single typed relation.*(n, h)*

The projection step will no longer be a simple multiplication of matrices, but it will be a *batch matrix multiplication*, in which (i) is multiplied with each batch of (ii).

An *aggregation step*, which is achieved by multiplying (i) the *(directed) adjacency tensor* with (ii) the *tensor* resulting from the projection step.

- (i)
*3D Tensor (r, n, n)*describing directed andedges. This tensor is composed of*r-typed*batches of adjacency matrices*r*. In detail, each of these matrices describes the edges between nodes, according to a specific type of relation. Moreover, compared to the adjacency matrix of an undirected graph, each of these adjacency matrices is not symmetric because it encodes a specific edge direction.*(n, n)* - (ii)
*3D Tensor (r, n, h)*resulting from the projection step described above.

As happened for the projection step, the aggregation phase consists of a *batch matrix multiplication*. Each batch of (i) is multiplied with each batch of (ii). This aggregation defines the GCN transformation for each batch. At the end of the process, the batches have to be added together (R-GCN) to obtain a node representation that incorporates the neighborhood aggregation according to different relations types.

The following code example shows an R-GCN layer’s behavior encoding a directed and multi-typed graph, or a KG, with 2 types of edges (or relations).

### Recall: One-hot vector representation of nodes (n,n) print(X)-----[[0. 0. 1. 0. 0.] # Node 1 [0. 1. 0. 0. 0.] ... [0. 0. 0. 0. 1.] [1. 0. 0. 0. 0.] [0. 0. 0. 1. 0.]]### Number of relation types (r) num_rels = 2 print(num_rels)-----2### Weight matrix of relation number 1 (n,n) ## Initialization according to Glorot and Bengio (2010)) W_rel1 = np.random.uniform(-np.sqrt(1./h),np.sqrt(1./h),(n,h)) print(W_rel1)-----[[-0.46378913 -0.09109707 0.52872529] [ 0.03829597 0.22156061 -0.2130242 ] [ 0.21535272 0.38639244 -0.55623279] [ 0.28884178 0.56448816 0.28655701] [-0.25352144 0.334031 -0.45815514]]### Weight matrix of relation number 2 (n,h) ## Random initialization with uniform distribution W_rel2 = np.random.uniform(1/100, 0.5, (n,h)) print(W_rel2)-----[[0.22946783 0.4552118 0.15387093] [0.15100992 0.073714 0.01948981] [0.34262941 0.11369778 0.14011786] [0.25087085 0.03614765 0.29131763] [0.081897 0.29875971 0.3528816 ]]### Tensor including both weight matrices (r,n,h) W_rels = np.concatenate((W_rel1, W_rel2)) W_rels = np.reshape(W_rels,(num_rels, n, h)) print(W_rels)-----[[[-0.46378913 -0.09109707 0.52872529] [ 0.03829597 0.22156061 -0.2130242 ] [ 0.21535272 0.38639244 -0.55623279] [ 0.28884178 0.56448816 0.28655701] [-0.25352144 0.334031 -0.45815514]] [[ 0.22946783 0.4552118 0.15387093] [ 0.15100992 0.073714 0.01948981] [ 0.34262941 0.11369778 0.14011786] [ 0.25087085 0.03614765 0.29131763] [ 0.081897 0.29875971 0.3528816 ]]]### Linear trasformationwith batch matrix multiplication (r,n,h) L_0_rels = np.matmul(X, W_rels) print(L_0_rels)-----[[[ 0.21535272 0.38639244 -0.55623279] # Node 1 (3rd row of W_rel1) [ 0.03829597 0.22156061 -0.2130242 ] [-0.25352144 0.334031 -0.45815514] [-0.46378913 -0.09109707 0.52872529] [ 0.28884178 0.56448816 0.28655701]] [[ 0.34262941 0.11369778 0.14011786] # Node 1 (3rd row of W_rel2) [ 0.15100992 0.073714 0.01948981] [ 0.081897 0.29875971 0.3528816 ] [ 0.22946783 0.4552118 0.15387093] [ 0.25087085 0.03614765 0.29131763]]]### Adjacency matrix of relation number 1 (n,n) A_rel1 = np.random.randint(2, size=(n, n)) np.fill_diagonal(A, 0) # No self_loop print(A_rel1)-----[[0 1 1 1 1] # Connections to Node 1 with Rel 1 [1 1 0 0 1] # Connections to Node 2 with Rel 1 [1 0 0 1 0] [0 0 1 1 1] [1 1 0 1 0]]### Adjacency matrix of relation number 2 (n,n) A_rel2 = np.random.randint(3,size=(n,n)) np.fill_diagonal(A_rel2, 0) # No self loop A_rel2[A_rel2>1] = 0-----[[0 0 0 1 0] # Connections to Node 1 with Rel 2 [1 0 0 0 0] # Connections to Node 2 with Rel 2 [1 0 0 1 1] [0 0 0 0 0] [0 1 0 0 0]]### Tensor including both adjacency matrices (r,n,n) A_rels = np.concatenate((A_rel1, A_rel2)) A_rels = np.reshape(A_rels, (num_rels, n, n)) print(A_rels)-----[[[0 1 1 1 1] # Connections to Node 1 with Rel 1 [1 1 0 0 1] [1 0 0 1 0] [0 0 1 1 1] [1 1 0 1 0]] [[0 0 0 1 0] # Connections to Node 2 with Rel 2 [1 0 0 0 0] [1 0 0 1 1] [0 0 0 0 0] [0 1 0 0 0]]]### (GCN) Neighborhood diffusion for each typed edge (r,n,h) ND_GCN = np.matmul(A_rels, L_0_rels) print(ND_GCN)-----[[[-0.39017282 1.0289827 0.14410296] # Updated Node 1 with Rel 1 [ 0.54249047 1.17244121 -0.48269997] [-0.24843641 0.29529538 -0.0275075 ] [-0.42846879 0.80742209 0.35712716] [-0.21014043 0.51685598 -0.2405317 ]] [[0.22946783 0.4552118 0.15387093] # Updated Node 1 with Rel 2 [ 0.34262941 0.11369778 0.14011786] [ 0.82296809 0.60505722 0.58530642] [ 0. 0. 0. ] [ 0.15100992 0.073714 0.01948981]]]### (R-GCN) Aggregation of GCN (n,h) RGCN = np.sum(ND_GCN, axis=0) print(RGCN)-----[[-0.16070499 1.48419449 0.29797389] Updated Node 1(Rel 1 + Rel 2) [ 0.88511988 1.28613899 -0.34258211] [ 0.57453168 0.9003526 0.55779892] [-0.42846879 0.80742209 0.35712716] [-0.05913052 0.59056998 -0.22104189]]### Test of the aggregation assert(RGCN[0,0] == L_0_rels[0,1,0] + L_0_rels[0,2,0] + L_0_rels[0,3,0] + L_0_rels[0,4,0] + L_0_rels[1,3,0])

As you can notice from this example, the result of the neighborhood diffusion (GCN) is a *3D tensor* of size *(r, n, h)* instead of a *2D matrix *of size *(n,h)*. The reason is that neighbor diffusion is performed in a separate way for each type of relation. The R-GCN layer performed an aggregation of the node representation achieved by the GCN for each type of relation. To clarify this aspect, consider the aggregated representation of Node 1

[-0.16070499 1.48419449 0.29797389]

This vector is obtained by summing the updated representation of Node 1 with Relation 1

[-0.39017282 1.0289827 0.14410296]

and the updated representation of Node 1 with Relation 2

[ 0.22946783 0.4552118 0.15387093]

The code described in this section can be mathematically formalized as follows:

R-GCNs represent a powerful graph neural architecture to encode multi-relational data, such as KGs. In a future article, I will show you how this encoding power can be exploited to perform specific tasks within KGs, including node classification and link prediction.

Much more details on the R-GCN architecture are available in the following research paper: Modeling Relational Data with Graph Convolutional Networks.

*This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.*

## Enjoy this article? Sign up for more AI updates.

We’ll let you know when we release more technical education.

kamir bouchareb st says

very good thanks