Efficient High-Order Data Processing
In this section, we’ll delve into the efficient high-order data processing capabilities provided by PyGHO, particularly focusing on the handling of high-order tensors, tuple feature precomputation, and data loading strategies for both sparse and masked tensor data structures.
Adding High-Order Features to a PyG Dataset
HOGNNs and MPNNs share common tasks, allowing us to leverage PyTorch Geometric’s (PyG) data processing routines. However, to cater to the unique requirements of HOGNNs, PyGHO significantly extends these routines while maintaining compatibility with PyG. This extension ensures convenient high-order feature precomputation and preservation.
Efficient High-Order Feature Precomputation
High-order feature precomputation can be efficiently conducted in parallel using the PyGHO library. Consider the following example:
# Ordinary PyG dataset
from torch_geometric.datasets import ZINC
trn_dataset = ZINC("dataset/ZINC", subset=True, split="train")
# High-order graph dataset
from functools import partial
from pygho.hodata import Sppretransform, ParallelPreprocessDataset
# `keys` is a list of precomputation keys; see the
# "Sparse-Sparse Matrix Multiplication Precomputation" section below.
trn_dataset = ParallelPreprocessDataset(
    "dataset/ZINC_trn", trn_dataset,
    pre_transform=Sppretransform(tuplesamplers=partial(KhopSampler, hop=3), keys=keys),
    num_workers=8)
The ParallelPreprocessDataset class takes an ordinary PyG dataset as input and transforms each graph in parallel (using 8 worker processes in this example). The tuplesamplers parameter takes functions that map a graph to a sparse tensor of tuple features. You can apply multiple samplers simultaneously, and the resulting outputs can be assigned specific names via the annotate parameter. In this example, we use partial(KhopSampler, hop=3), a sampler designed for NGNN, to sample a 3-hop ego-network rooted at each node; the shortest path distance to the root node serves as the tuple feature. The produced SparseTensor is then saved and can be used directly to initialize tuple representations.
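To make this contract concrete, here is a minimal sketch of a hand-written sampler in the same spirit: it runs a breadth-first search from every node and records, for every tuple within hop steps, the shortest-path distance as its feature. The name spd_khop_sampler is hypothetical, this is not the library's KhopSampler, and the top-level import of SparseTensor is an assumption; the constructor call mirrors the toy sampler shown later in this section:
import torch
from torch_geometric.data import Data as PygData
from pygho import SparseTensor  # assumed top-level export, matching the toy sampler below

def spd_khop_sampler(data: PygData, hop: int = 3) -> SparseTensor:
    # Illustrative only: BFS from every root node, keeping tuples (root, j)
    # whose shortest-path distance is at most `hop`; the distance is the feature.
    n = data.num_nodes
    neighbors = [[] for _ in range(n)]
    for u, v in data.edge_index.t().tolist():
        neighbors[u].append(v)
    rows, cols, feats = [], [], []
    for root in range(n):
        dist = {root: 0}
        frontier = [root]
        for d in range(1, hop + 1):
            nxt = []
            for u in frontier:
                for v in neighbors[u]:
                    if v not in dist:
                        dist[v] = d
                        nxt.append(v)
            frontier = nxt
        for j, d in sorted(dist.items()):
            rows.append(root)
            cols.append(j)
            feats.append(d)
    tupleid = torch.tensor([rows, cols], dtype=torch.long)
    tuplefeat = torch.tensor(feats)
    return SparseTensor(tupleid, tuplefeat, shape=(n, n))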
Since dataset preprocessing is closely tied to the underlying data structure, we provide two separate routines for sparse and dense tensors. These routines differ only in the pre_transform function. For dense tensors, we can simply use Mapretransform(None, tuplesamplers), where tuplesamplers is a function producing dense tuple features. In pygho.hodata.MaTupleSampler, we provide spdsampler and rdsampler to compute the shortest path distance and resistance distance between nodes, respectively. One example is
trn_dataset = ParallelPreprocessDataset("dataset/ZINC_trn",
trn_dataset,
pre_transform=Mapretransform(
partial(spdsampler,
hop=4)),
num_worker=0)
Defining Custom Tuple Samplers
In addition to the provided tuple samplers, you can define your own. For sparse data, a sampler is a function or callable object that takes a torch_geometric.data.Data object as input and produces a sparse tensor as output. Here's an example of a custom sparse tuple sampler that assigns 0 as the feature of each tuple (i, i) in the graph:
def SparseToySampler(data: PygData) -> SparseTensor:
"""
Sample k-hop subgraph on a given PyG graph.
Args:
- data (PygData): The input PyG data.
Returns:
- SparseTensor for the precomputed tuple features.
"""
n = data.num_nodes
tupleid = torch.stack((torch.arange(n), torch.arange(n)))
tuplefeat = torch.zeros((n,))
ret = SparseTensor(tupleid, tuplefeat, shape=(n, n))
return ret
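A custom sampler plugs into preprocessing exactly like the built-in ones; the following sketch reuses the ParallelPreprocessDataset call from above (keys is again the list of precomputation keys discussed below):
trn_dataset = ParallelPreprocessDataset(
    "dataset/ZINC_trn", trn_dataset,
    pre_transform=Sppretransform(tuplesamplers=SparseToySampler, keys=keys),
    num_workers=8)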
For dense data, a sampler is a function or callable object that takes a torch_geometric.data.Data object as input and produces a tensor along with the masked shape of the features. Here's a custom dense tuple sampler that assigns 1 as the feature of each tuple (i, i) in the graph:
def DenseToySampler(data: PygData) -> Tuple[Tensor, List[int]]:
"""
Sample k-hop subgraph on a given PyG graph.
Args:
- data (PygData): The input PyG data.
Returns:
- Tensor: The precomputed tuple features.
- List[int]: The masked shape of the features.
"""
n = data.num_nodes
val = torch.eye(n)
return val, [n, n]
Please note that for dense data, the function returns a tuple consisting of the value and the masked shape, rather than a MaskedTensor. This is because the mask can typically be inferred from the feature itself, making it unnecessary to include it explicitly in the returned data. In this case, the mask can be determined as val == 1.
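Analogously, here is a sketch of plugging the dense toy sampler into the dense pre-transform, mirroring the earlier Mapretransform example:
trn_dataset = ParallelPreprocessDataset(
    "dataset/ZINC_trn", trn_dataset,
    pre_transform=Mapretransform(DenseToySampler),
    num_worker=0)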
Using Multiple Tuple Samplers
You can use multiple tuple samplers simultaneously. For instance:
trn_dataset = ParallelPreprocessDataset(
    "dataset/ZINC_trn", trn_dataset,
    pre_transform=Sppretransform(
        tuplesamplers=[partial(KhopSampler, hop=1), partial(KhopSampler, hop=2)],
        annotate=["1hop", "2hop"], keys=keys),
    num_workers=8)
This code precomputes two tuple features simultaneously and assigns them different annotations, “1hop” and “2hop,” to distinguish between them.
For dense data, it works similarly:
trn_dataset = ParallelPreprocessDataset(
    "dataset/ZINC_trn", trn_dataset,
    pre_transform=Mapretransform(
        [partial(spdsampler, hop=1), partial(spdsampler, hop=2)],
        annotate=["1hop", "2hop"]),
    num_worker=0)
Sparse-Sparse Matrix Multiplication Precomputation
Efficient sparse-sparse matrix multiplication in our library is achieved through precomputation. The keys parameter of Sppretransform is a list of strings, where each string specifies one precomputation. For example, consider the key:
"X___A___1___X___0"
Here, the precomputation involves sparse matrix multiplication \(AX\), but only computes the output elements that exist in \(X\). These precomputation results can be shared among matrices with the same indices. The key elements signify the following:
- The first X refers to the target sparse matrix indices.
- A and X represent the matrices involved in the multiplication: the adjacency matrix A and the tuple feature X.
- 1 denotes that dimension 1 of A will be reduced.
- 0 signifies that dimension 0 of X will be reduced.
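Putting this together, the keys list passed to Sppretransform in the earlier examples could be defined as follows (a sketch; the keys your model actually needs depend on which sparse products it performs):
keys = ["X___A___1___X___0"]
trn_dataset = ParallelPreprocessDataset(
    "dataset/ZINC_trn", trn_dataset,
    pre_transform=Sppretransform(tuplesamplers=partial(KhopSampler, hop=3), keys=keys),
    num_workers=8)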
You don’t need to manually feed the precomputation results to the model. Converting the batch to a dictionary and using it as the datadict parameter is sufficient:
for batch in dataloader:
    datadict = batch.to_dict()
Dense data does not require precomputation currently.
If you use annotate in the transformation, for example,
Sppretransform(tuplesamplers=partial(KhopSampler, hop=1), annotate=["1hop"], keys=keys)
then the key becomes
"X1hop___A___1___X1hop___0"
More details are shown in the Multiple Tensor section.
Mini-batch and DataLoader
Enabling batch training in HOGNNs demands handling graphs of varying sizes, which presents a challenge. We employ different strategies for Sparse and Masked Tensor data structures.
Sparse Tensor Data
For Sparse Tensor data, we adopt a relatively straightforward solution: we concatenate the tensors of each graph along the diagonal of a larger tensor. For example, in a batch of \(B\) graphs with adjacency matrices \(A_i\in \mathbb{R}^{n_i\times n_i}\), node features \(x_i\in \mathbb{R}^{n_i\times d}\), and tuple features \(X_i\in \mathbb{R}^{n_i\times n_i\times d'}\) for \(i=1,2,\ldots,B\), the features for the entire batch are represented as \(A\in \mathbb{R}^{n\times n}\), \(x\in \mathbb{R}^{n\times d}\), and \(X\in \mathbb{R}^{n\times n\times d'}\), where \(n=\sum_{i=1}^B n_i\). This arrangement lets the batched tensors have the same number of dimensions as those of a single graph, facilitating the sharing of common operators.
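As a plain-PyTorch illustration of this diagonal concatenation (not the library's internal implementation), two small adjacency matrices can be combined like this:
import torch

A1 = torch.tensor([[0., 1.],
                   [1., 0.]])                    # graph 1: 2 nodes
A2 = torch.tensor([[0., 1., 0.],
                   [1., 0., 1.],
                   [0., 1., 0.]])                # graph 2: 3 nodes
A = torch.block_diag(A1, A2)                     # batched adjacency, shape (5, 5)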
We provide PyGHO’s own DataLoader to simplify this process:
from pygho.subgdata import SpDataloader
trn_dataloader = SpDataloader(trn_dataset, batch_size=32, shuffle=True, drop_last=True)
Masked Tensor Data
Since concatenation along the diagonal would create a large number of non-existent elements in a dense layout, Masked Tensor data uses a different, space-saving strategy: tensors are padded to the same shape and stacked along a new batch axis. For instance, in a batch of \(B\) graphs with adjacency matrices \(A_i\in \mathbb{R}^{n_i\times n_i}\), node features \(x_i\in \mathbb{R}^{n_i\times d}\), and tuple features \(X_i\in \mathbb{R}^{n_i\times n_i\times d'}\) for \(i=1,2,\ldots,B\), the features for the entire batch are represented as \(A\in \mathbb{R}^{B\times \tilde{n}\times \tilde{n}}\), \(x\in \mathbb{R}^{B\times \tilde{n}\times d}\), and \(X\in \mathbb{R}^{B\times \tilde{n}\times \tilde{n}\times d'}\), where \(\tilde{n}=\max\{n_i\mid i=1,2,\ldots,B\}\).
The zeros used for padding are masked out in the resulting MaskedTensor.
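Again as a plain-PyTorch sketch (not the library's implementation), padding two adjacency matrices to the same size, stacking them, and deriving the mask could look like this:
import torch

A1 = torch.rand(2, 2)            # graph 1: 2 nodes
A2 = torch.rand(3, 3)            # graph 2: 3 nodes
n_max = max(A1.shape[0], A2.shape[0])

def pad(A: torch.Tensor, n: int) -> torch.Tensor:
    # Zero-pad a square matrix to shape (n, n)
    out = torch.zeros(n, n)
    out[:A.shape[0], :A.shape[1]] = A
    return out

A = torch.stack([pad(A1, n_max), pad(A2, n_max)])             # shape (B, n_max, n_max)
mask = torch.stack([pad(torch.ones_like(A1), n_max),
                    pad(torch.ones_like(A2), n_max)]).bool()  # True where entries are real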
We also provide a DataLoader for this purpose:
from pygho.subgdata import MaDataloader
trn_dataloader = MaDataloader(trn_dataset, batch_size=256, device=device, shuffle=True, drop_last=True)
This padding and stacking strategy ensures consistent shapes across tensors, allowing for efficient processing of dense data.