TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs (2024)

Zhuofeng Li∗, Zixing Gou∗, Xiangnan Zhang, Zhongyuan Liu, Sirui Li, Yuntong Hu, Chen Ling, Zheng Zhang, Liang Zhao

Shanghai University · Shandong University · Johns Hopkins University · China University of Petroleum (East China) · Emory University

Abstract

Text-Attributed Graphs (TAGs) augment graph structures with natural language descriptions, facilitating detailed depictions of data and their interconnections across various real-world settings. However, existing TAG datasets predominantly feature textual information only at the nodes, with edges typically represented by mere binary or categorical attributes. This lack of rich textual edge annotations significantly limits the exploration of contextual relationships between entities, hindering deeper insights into graph-structured data. To address this gap, we introduce Textual-Edge Graphs Datasets and Benchmark (TEG-DB), a comprehensive and diverse collection of benchmark textual-edge datasets featuring rich textual descriptions on nodes and edges. The TEG-DB datasets are large-scale and encompass a wide range of domains, from citation networks to social networks. In addition, we conduct extensive benchmark experiments on TEG-DB to assess the extent to which current techniques, including pre-trained language models, graph neural networks, and their combinations, can utilize textual node and edge information. Our goal is to elicit advancements in textual-edge graph research, specifically in developing methodologies that exploit rich textual node and edge descriptions to enhance graph analysis and provide deeper insights into complex real-world networks. The entire TEG-DB project is publicly available as an open-source repository on GitHub at https://github.com/Zhuofeng-Li/TEG-Benchmark.

∗ Both authors contributed equally to this work.

1 Introduction

Text-attributed graphs (TAGs) are graph structures in which nodes are equipped with rich textual information, allowing for deeper analysis and interpretation of complex relationships [47, 18, 16]. TAGs are widely utilized in a variety of real-world applications, including social networks [31, 30], citation networks [25], and recommendation systems [40, 15]. Due to the universal representational capabilities of language, TAGs have emerged as a promising format for potentially unifying a wide range of existing graph datasets. This field has recently garnered rapidly growing interest, particularly in the development of foundational models for graph data [23, 16, 42].

[Figure 1: A scientific article network whose edges carry citation texts, illustrating the citation patterns of articles authored by Einstein and Planck in the field of quantum mechanics.]

Unfortunately, a central obstacle in designing TAG foundation models is the lack of comprehensive datasets with rich textual information on both nodes and edges. Most traditional graph datasets provide only node attribute embeddings, without the original textual sentences, which entails a significant loss of context and limits the application of advanced techniques such as large language models (LLMs) [24]. Although some TAG datasets have appeared recently [42], they typically carry text only on nodes, while edges are represented as binary or categorical attributes. Yet the textual information on edges in TAGs is crucial for elucidating the meaning of individual documents and their semantic correlations. For instance, the scientific article network in Figure 1 illustrates the citation patterns of articles authored by Einstein and Planck in the field of quantum mechanics. Suppose we wish to conclude that 'Planck endorsed the probabilistic nature of quantum mechanics while Einstein opposed this view.' From a TAG perspective, focusing solely on the content of the papers authored by Einstein (Paper A) and Planck (Paper E), we could only conclude that both Einstein and Planck supported quantum mechanics. To further deduce that Einstein opposed studying quantum mechanics from a probabilistic perspective, it is necessary to adopt the Textual-Edge Graph (TEG) view, which attends not only to the paper contents but also to the citation text on the edge between Paper A and Book B and on the edge between Paper E and Book D. These citation edges provide essential context and reveal the relationships and influence between different scholarly works.

While compelling, TEGs face three significant challenges that leave them an open problem. (1) Comprehensive TEG datasets are absent. There is currently no comprehensive collection of TEG datasets that simultaneously incorporates textual information on both nodes and edges, spans multiple domains at varying scales, and covers the mainstream graph learning tasks. This deficiency hinders the evaluation of TEG-based methods across diverse applications and domains. (2) Existing experimental settings for TEGs are disorganized. Owing to the inherent variety and complexity of TEGs, coupled with the absence of a standardized data format, existing works adopt different datasets under different experimental settings [19, 18, 17, 50, 49, 22, 23], which makes model comparison in this field very difficult. (3) Comprehensive benchmarks and analyses for TEG-based methods are missing. While some techniques can accommodate edge features, they typically process binary or categorical data. It remains unclear whether these methods can effectively utilize rich textual information on edges, particularly in leveraging complex interactions between graph nodes.

Present work. Recognizing all the above challenges, our research proposes the Textual-Edge Graphs Datasets and Benchmark (TEG-DB). TEG-DB is a pioneering initiative offering a diverse collection of benchmark graph datasets with rich textual descriptions on both nodes and edges. To address the issue of inadequate TEG datasets, our TEG datasets as shown in Table 1 cover an extensive array of domains, including Book Recommendation, E-commerce, Academic, and Social networks. Ranging in size from small to large, each dataset contains abundant raw text data associated with both nodes and edges, facilitating comprehensive analysis and modeling across various fields. Moreover, to address the inconsistency in experimental settings and the lack of comprehensive analyses for TEG-based methods, we first represent the TEG dataset in a unified format, then conduct extensive benchmark experiments and perform a comprehensive analysis. These experiments are designed to evaluate the capabilities of current computational techniques, such as pre-trained language models and graph neural networks, as well as their integrations. Our contributions are summarized below:

  • To the best of our knowledge, TEG-DB is the first open dataset and benchmark specifically designed for textual-edge graphs. We provide 9 comprehensive TEG datasets encompassing 4 diverse domains as shown in Table 1. Each dataset, varying in size from small to large, contains abundant raw text data associated with both nodes and edges. Our TEG datasets aim to bridge the gap of TEG dataset scarcity and provide a rich resource for advancing research in the TEG domain.

  • We develop a standardized pipeline for TEG research, encompassing crucial stages such as data preprocessing, data loading, and model evaluation. With this framework, researchers can seamlessly replicate experiments, validate findings, and iterate on existing approaches with greater efficiency and confidence. Additionally, this standardized pipeline facilitates collaboration and knowledge sharing within the TEG community, fostering innovation and advancement in the field.

  • We conduct extensive benchmark experiments and perform a comprehensive analysis of TEG-based methods, delving deep into various aspects such as the impact of different models, the effect of embeddings generated by Pre-trained Language Models (PLMs) of various scales, and the influence of different domain datasets. By addressing key challenges and highlighting promising opportunities, our research stimulates and guides future directions for TEG exploration and development.

2 Related Works

In this section, we will begin by providing a brief introduction to three commonly used learning paradigms for TAGs. Following this, we will delve into the comparisons between the current graph learning benchmarks and our proposed benchmark.

PLM-based methods. PLM-based methods leverage the power of PLMs, pre-trained on vast corpora, to enhance the text modeling within each node. Early works on modeling textual attributes relied on shallow networks, e.g., Skip-Gram [28] and GloVe [32]. In recent years, Large Language Models (LLMs) have become trending tools. Models like Llama [36], PaLM [3], and GPT [1] show strong comprehension and inference abilities across natural-language tasks such as code generation [4], legal consulting [7], and creative art generation [21], as well as understanding and learning from graphs [6]. One key application of pre-trained language models is text representation, in which low-dimensional embeddings capture the underlying semantics of texts. On TAGs, PLMs use the local textual information of each node to learn a good representation for the downstream task.

GNN-based methods. The rapid advancements in graph representation learning within machine learning have led to numerous studies addressing various tasks, such as node classification [20] and link prediction [48]. Graph neural networks (GNNs) are acknowledged as robust tools for modeling graph data. These methods, including GCN [20], GAT [37], GraphSAGE [11], GIN [41], and RevGAT [29], develop effective message-passing mechanisms that facilitate information aggregation between nodes, thereby enhancing graph representations. GNNs typically utilize the "cascade architecture" advocated by GraphSAGE for textual graph representation, wherein node features are initially encoded independently using text modeling tools (e.g., PLMs) and then aggregated by GNNs to generate the final representation.
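To make the cascade architecture concrete, the following is a minimal sketch, not the benchmark's exact implementation: node texts are first encoded independently by a frozen PLM, and the resulting embeddings are then aggregated by a two-layer GraphSAGE. The model name and layer dimensions are illustrative choices.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from torch_geometric.nn import SAGEConv

# Stage 1: encode each node's text independently with a frozen PLM.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
plm = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_texts(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = plm(**batch).last_hidden_state
    return hidden[:, 0]  # [CLS] embedding per node, shape [num_nodes, 768]

# Stage 2: aggregate PLM embeddings along edges ("cascade architecture").
class CascadeSAGE(torch.nn.Module):
    def __init__(self, in_dim=768, hidden=256, num_classes=10):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, num_classes)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)
```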

LLM as Predictor. Several recent studies [43, 5, 10] have explored the potential of Large Language Models (LLMs) for analyzing graph-structured data. However, there is a lack of comprehensive research on the ability of LLMs to identify and utilize key topological structures across various prompt scenarios, task complexities, and datasets. Chen et al. [5] and Guo et al. [10] applied LLMs to graph data but focused primarily on node classification within specific citation network datasets, limiting the exploration of LLM performance across tasks and datasets. Furthermore, Ye et al. [43] fine-tuned LLMs on a specific dataset to outperform GNNs, pursuing a different research goal that emphasizes LLMs' inherent ability to understand and leverage graph structures.

Benchmarks for text-attributed graphs. We conducted an extensive survey of various text-attributed graph datasets previously utilized in the literature. Our observations reveal that the majority of traditional node-level datasets are essentially text-attributed graphs. However, these datasets have shortcomings in representation learning on TAGs due to a lack of raw textual information. Recently, certain TAG datasets [42, 19] have emerged to address these limitations. However, they still have shortcomings when exploring representation learning on TEGs. Firstly, these datasets typically only include text information on nodes, while edges are often represented as binary or categorical, limiting a comprehensive understanding of node relationships. Secondly, they lack coverage across diverse domains and tasks, potentially hindering the discovery of robust and generalizable models. Lastly, their representation formats lack uniformity, introducing inconsistencies and complexities in analysis and modeling techniques across datasets.

3 Preliminaries

A Textual-Edge Graph (TEG) is a graph-structured data format in which both nodes and edges have free-form text descriptions. These textual annotations provide rich contextual information about the complex relationships between entities, enabling a more detailed and comprehensive representation of data relations than traditional graphs.

Definition 1 (Textual-edge Graphs). Formally, a TEG can be represented as $\mathcal{G}=(\mathcal{V},\mathcal{E})$, which consists of a set of nodes $\mathcal{V}$ and a set of edges $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$. Each node $v_i\in\mathcal{V}$ carries a textual description $d_i$, and each edge $e_{ij}\in\mathcal{E}$ is likewise associated with a text description $d_{ij}$ describing the relation between $v_i$ and $v_j$.
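Concretely, a TEG can be stored as a standard graph object that additionally carries the raw node and edge strings. Below is a minimal sketch using PyTorch Geometric; the attribute names `text` and `edge_text` are an illustrative convention of this sketch, not a PyG standard.

```python
import torch
from torch_geometric.data import Data

# Toy TEG: two documents connected by a single citation edge.
edge_index = torch.tensor([[0], [1]])  # one directed edge v0 -> v1
g = Data(edge_index=edge_index, num_nodes=2)
g.text = [                      # d_i: free-form node descriptions
    "Paper A: on the quantum theory of radiation.",
    "Book B: lectures on the probabilistic view of quantum mechanics.",
]
g.edge_text = [                 # d_ij: free-form edge descriptions
    "Paper A cites Book B, disputing its probabilistic interpretation.",
]
```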

Challenges. Current research on TEGs faces three significant challenges: (1) The scarcity of large-scale, diverse TEG datasets; (2) Inconsistent experimental setups and methodologies in previous TEG research; and (3) The absence of standardized benchmarks and comprehensive analyses for evaluating TEG-based methods. These limitations impede the development of more effective and efficient approaches in this emerging field.

Table 1: Statistics of previous TAG datasets and our proposed TEG datasets. Our datasets provide raw text on both nodes and edges and support both node classification and link prediction.

| Group | Dataset | #Nodes | #Edges | #Node Classes | Graph Domain | Size |
|---|---|---|---|---|---|---|
| Previous | Twitch Social Network [33] | 7,126 | 88,617 | 2 | Social Networks | Small |
| Previous | Facebook Page-Page Network [34] | 22,470 | 171,002 | 4 | Social Networks | Small |
| Previous | ogbn-arxiv [13] | 169,343 | 1,166,243 | 40 | Academic | Medium |
| Previous | Citeseer [35] | 3,327 | 4,732 | 6 | Academic | Small |
| Previous | Pubmed [35] | 19,717 | 44,338 | 3 | Academic | Small |
| Previous | Cora [27] | 2,708 | 5,429 | 7 | Academic | Small |
| Previous | CitationV8 [42] | 1,106,759 | 6,120,897 | - | Academic | Large |
| Previous | GoodReads [42] | 676,084 | 8,582,324 | 11 | Book Recommendation | Large |
| Previous | Sports-Fitness [42] | 173,055 | 1,773,500 | 13 | E-commerce | Medium |
| Previous | Ele-Photo [42] | 48,362 | 500,928 | 12 | E-commerce | Small |
| Previous | Books-History [42] | 41,551 | 358,574 | 12 | E-commerce | Small |
| Previous | Books-Children [42] | 76,875 | 1,554,578 | 24 | E-commerce | Small |
| Previous | ogbn-arxiv-TA [42] | 169,343 | 1,166,243 | 40 | Academic | Medium |
| Ours | Goodreads-History | 540,807 | 2,368,539 | 11 | Book Recommendation | Large |
| Ours | Goodreads-Crime | 422,653 | 2,068,223 | 11 | Book Recommendation | Large |
| Ours | Goodreads-Children | 216,624 | 858,586 | 11 | Book Recommendation | Large |
| Ours | Goodreads-Comics | 148,669 | 631,649 | 11 | Book Recommendation | Medium |
| Ours | Amazon-Movie | 137,411 | 2,724,028 | 399 | E-commerce | Medium |
| Ours | Amazon-Apps | 31,949 | 62,036 | 62 | E-commerce | Small |
| Ours | Reddit | 478,022 | 676,684 | 3 | Social Networks | Large |
| Ours | Twitter | 18,761 | 23,764 | 503 | Social Networks | Small |
| Ours | Citation | 4,972,456 | 5,970,965 | 24 | Academic | Large |

4 A Comprehensive Dataset and Benchmark of Textual-Edge Graphs

We begin with a brief overview of TEG-DB in Section 4.1. We then describe the TEG datasets in Section 4.2, detailing their composition and the preprocessing steps used to represent them in a unified format. Finally, in Section 4.3 we discuss three main paradigms for handling TEGs: PLM-based methods, edge-aware GNN-based methods, and LLM-as-Predictor methods.

4.1 Overview of TEG-DB

In order to overcome the constraints intrinsic to preceding studies, we propose the establishment of the Textual-Edge Graphs Datasets and Benchmark, referred to as TEG-DB. This framework functions as a standardized evaluation methodology for examining the effectiveness of representation learning approaches in the context of TEGs. To ensure the comprehensiveness and scalability of TEG datasets, TEG-DB collects and constructs a novel set of datasets covering diverse domains like book recommendation, e-commerce, academia, and social networks, varying in size from small to large. These datasets are suitable for various mainstream graph learning tasks such as node classification and link prediction. Table 1 compares previous datasets with our TEG datasets. To enhance usability, we unify the TEG data format and propose a modular pipeline with three main methods for handling TEGs. To further foster TEG model design, we extensively benchmark TEG-based methods and conduct a thorough analysis. Overall, TEG-DB provides a scalable, unified, modular, and regularly updated evaluation framework for assessing representation learning methods on textual graphs.

4.2 Data Preparation and Construction

To construct datasets that simultaneously offer rich textual information on both nodes and edges, we select nine datasets from diverse domains and at different scales. Specifically, we collect four user-book review networks from the Goodreads datasets [38] in the book recommendation domain, two shopping networks from the Amazon datasets [12] in the e-commerce domain, one citation network from the Open Research Corpus [2] in the academic domain, and two social networks from Reddit [11] and Twitter [46]. The statistics of the datasets are shown in Table 1.

The creation of textual-edge graph datasets involves three main steps. First, we preprocess the textual attributes of the original data, which includes handling missing values, filtering out non-English statements, removing anomalous symbols, and truncating excessively long text. Second, we construct the graph itself: connectivity between nodes is derived from the inherent relationships provided in the dataset, such as citation relationships between papers in citation networks; self-edges and isolated nodes are removed during construction. Finally, we refine the constructed graph. Our datasets cover the major tasks in graph representation learning, namely node classification and link prediction. A minimal sketch of the text-preprocessing step follows; the specifics of each dataset are given after it.
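The routine below illustrates the first step under stated assumptions: the regular expressions, the ASCII heuristic for filtering non-English text, and the truncation limit are illustrative choices, not the exact rules used to build TEG-DB.

```python
import re

MAX_CHARS = 2000  # illustrative truncation limit

def clean_text(raw):
    """Preprocess one textual attribute: drop missing values, filter
    non-English-looking strings, strip anomalous symbols, truncate length."""
    if raw is None or not raw.strip():
        return None                       # handle missing values
    if not raw.isascii():
        return None                       # crude non-English filter (illustrative)
    text = re.sub(r"[^\w\s.,;:!?'\"()-]", " ", raw)  # remove anomalous symbols
    text = re.sub(r"\s+", " ", text).strip()
    return text[:MAX_CHARS]               # truncate excessive length
```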

User-Book Review Networks. Four datasets in the realm of user-book review networks, labeled Goodreads-History, Goodreads-Crime, Goodreads-Children, and Goodreads-Comics, are constructed from the Goodreads datasets [38]. Nodes represent books and reviewers, while edges indicate book reviews. Node labels are assigned based on the book category. Book descriptions are used as node textual information, and user reviews are used as edge textual information. The corresponding tasks are to predict the categories of the books, formulated as a multi-label classification problem, and to predict whether connections exist between users and books. Unlike existing datasets that often lack interaction texts, these comprehensive data help infer user preferences and identify similar tastes, enhancing online book recommendation. A sketch of this construction follows.
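As an illustration of the graph-construction step, the sketch below assembles a user-book review graph from raw records. The record layout (book descriptions keyed by id, reviews as user/book/text triples) and the helper name are assumptions for illustration, not the exact schema of the raw Goodreads JSON.

```python
import torch
from torch_geometric.data import Data

def build_review_graph(books, reviews):
    """books: {book_id: description}; reviews: iterable of (user_id, book_id, text)."""
    book_ids = {b: i for i, b in enumerate(books)}   # books get indices 0..B-1
    user_ids = {}
    src, dst, edge_text = [], [], []
    for user, book, text in reviews:
        if book not in book_ids or not text:
            continue                                  # skip unknown books / empty reviews
        u = user_ids.setdefault(user, len(book_ids) + len(user_ids))
        src.append(u)
        dst.append(book_ids[book])
        edge_text.append(text)                        # the review becomes edge text
    g = Data(edge_index=torch.tensor([src, dst]),
             num_nodes=len(book_ids) + len(user_ids))
    g.text = list(books.values()) + [""] * len(user_ids)  # node text: book descriptions
    g.edge_text = edge_text
    return g
```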

Shopping Networks. Two datasets, Amazon-Apps and Amazon-Movie, are classified under Shopping Networks. The Amazon item dataset [12] is the primary source, encompassing item reviews and descriptions. Nodes represent different types of items and reviewers, while edges indicate item reviews. The descriptions of items are used as node textual information, and reviews of users are used as edge textual information. The corresponding tasks are to predict the categories of the items, formulated as a multi-label classification problem, and to predict whether there are connections between users and items. These datasets have the potential to significantly enhance recommendation systems, providing richer data for more accurate suggestions and a personalized shopping experience.

Citation Networks. The raw data for the citation network is sourced from the Open Research Corpus, derived from the complete Semantic Scholar corpus [2]. Nodes represent papers and edges represent citation relationships between papers. The descriptions of papers are used as node textual information, and citation information, such as the context and paragraphs in which papers are cited, is used as edge textual information. The corresponding tasks are to predict the domain to which a paper belongs, formulated as a multi-class classification problem, and to predict whether a citation relationship exists between two papers. This dataset enhances academic network expressiveness, particularly benefiting tasks like node classification and link prediction in graph machine learning.

Social Networks. The Reddit dataset, sourced from Reddit [11], and the Twitter dataset, derived from Twitter [46], represent two prominent social media platforms. Nodes represent users and topics. Edges indicate two types of reviews: those between users (user-user links) and those between users and topics (user-topic links). Topic descriptions are used as node textual information, and reviews are used as edge textual information. The corresponding tasks are to predict the categories of the topics, formulated as a multi-class classification problem, and to predict whether connections exist between users and topics. These datasets improve recommendation algorithm performance, enabling more personalized and relevant suggestions, while also offering valuable insights into user interests and preferences for social network research and business decision-making.

4.3 Adapting Existing Methods to Solve Problems in TEGs

PLM-based Paradigm. PLMs are trained on massive amounts of text data, allowing them to learn the semantic relationships between words, phrases, and sentences. This enables them to understand the meaning behind text not just superficially, but also in terms of context and intent. PLM-based methods therefore leverage the power of PLMs to enhance the text modeling within each node. The formulation of these methods is as follows:

$$\boldsymbol{h}_u^{(k+1)} = \mathrm{MLP}_{\boldsymbol{\psi}}^{(k)}\left(\boldsymbol{h}_u^{(k)}\right) \qquad (1)$$

where $\boldsymbol{h}_u^{(k)}$ denotes the representation of node $u$ at layer $k$; the initial feature vector $\boldsymbol{h}_u^{(0)}$ of node $u$ is obtained by encoding the text on node $u$ with the PLM, and $\boldsymbol{\psi}$ denotes the learnable parameters of the $\mathrm{MLP}$.
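A minimal sketch of this paradigm, in which $\boldsymbol{h}_u^{(0)}$ comes from a PLM and an MLP refines it; the layer sizes are illustrative (the quarter-size hidden layers mirror the setting described in Section 5.1):

```python
import torch.nn as nn

class PLMBaselineMLP(nn.Module):
    """Eq. (1): an MLP stacked over PLM node embeddings; no graph structure used."""
    def __init__(self, in_dim=768, num_classes=10):
        super().__init__()
        # each hidden layer has one-fourth the units of the previous layer
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim // 4), nn.ReLU(),
            nn.Linear(in_dim // 4, num_classes),
        )

    def forward(self, h0):   # h0: [num_nodes, in_dim] PLM text embeddings
        return self.net(h0)
```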

Although PLMs have considerably improved the representation of node text attributes, these models do not account for topological structures. This limitation hinders their ability to fully capture the complete topological information present in TEGs.

Edge-aware GNN-based Paradigm. GNNs are employed to propagate information across the graph, allowing for the extraction of meaningful representations via message passing, which are formally defined as follows:

$$\boldsymbol{h}_u^{(k+1)} = \operatorname{UPDATE}_{\boldsymbol{\omega}}^{(k)}\Big(\boldsymbol{h}_u^{(k)},\ \operatorname{AGGREGATE}_{\boldsymbol{\omega}}^{(k)}\big(\big\{\boldsymbol{h}_v^{(k)},\, \boldsymbol{e}_{v,u},\, v\in\mathcal{N}(u)\big\}\big)\Big) \qquad (2)$$

where $\boldsymbol{h}_u^{(k)}$ denotes the representation of node $u$ at layer $k$, and the initial node feature vector $\boldsymbol{h}_u^{(0)}$ is pre-learned by a PLM or another shallow text encoder (e.g., Skip-Gram). $\boldsymbol{e}_{v,u}$ represents the edge features from node $v$ to node $u$, $k$ indexes the GNN layers, $\mathcal{N}(u)$ denotes the set of neighbors of the target node $u$, and $\boldsymbol{\omega}$ denotes the learnable parameters of the GNN.
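In PyTorch Geometric terms, Eq. (2) corresponds to convolutions that accept an `edge_attr` argument. Below is a minimal sketch with `GINEConv` (GINE is one of the benchmarked baselines); the dimensions are illustrative, and node and edge text embeddings are assumed to share a dimension.

```python
import torch
from torch_geometric.nn import GINEConv

class EdgeAwareGNN(torch.nn.Module):
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        # GINEConv injects edge features e_{v,u} into each message h_v + e_{v,u};
        # edge_dim projects edge embeddings to the node dimension when they differ.
        self.conv1 = GINEConv(torch.nn.Linear(dim, hidden), edge_dim=dim)
        self.conv2 = GINEConv(torch.nn.Linear(hidden, hidden), edge_dim=dim)

    def forward(self, x, edge_index, edge_attr):
        # x: PLM embeddings of node text; edge_attr: PLM embeddings of edge text
        x = self.conv1(x, edge_index, edge_attr).relu()
        return self.conv2(x, edge_index, edge_attr)
```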

However, this approach presents two primary issues: (1) In TEGs, the text on edges is highly informative, yet most GNN methods neglect edge features. This omission can limit the model’s capacity to capture more complex relationships within the graph, thereby reducing its effectiveness in accurately representing the intricate interdependencies and interactions between nodes. (2) GNN-based methods are unable to fully capture the contextualized text semantics of edges. This is because the textual information associated with edges can be highly variable and context-dependent, making it challenging to represent and incorporate effectively within the GNN framework.

LLM as Predictor. Leveraging the robust text understanding capabilities of LLMs, we can directly feed raw text to an LLM as a textual prompt to address graph-level task questions. Specifically, we adopt a text template for each dataset that includes the corresponding node and edge texts to answer a given question, e.g., node classification or link prediction. This can be formally defined as follows:

$$A = f(\mathcal{G}, Q) \qquad (3)$$

where $f$ is a prompt providing the graph information, $\mathcal{G}$ represents a TEG, and $Q$ is a question.
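A sketch of such a textual prompt template is shown below; the wording and function name are illustrative, not the exact template used in the benchmark.

```python
def teg_prompt(node1_text, node2_text, edge_texts, question):
    """Serialize a TEG neighborhood into a textual prompt, i.e., Eq. (3): A = f(G, Q)."""
    edges = "\n".join(f"- {t}" for t in edge_texts)
    return (
        "You are given part of a textual-edge graph.\n"
        f"Node 1: {node1_text}\n"
        f"Node 2: {node2_text}\n"
        f"Texts of nearby edges:\n{edges}\n"
        f"Question: {question}\n"
        "Answer with a label and a probability."
    )

# Example usage for link prediction:
# prompt = teg_prompt(paper_a, book_b, citation_texts,
#                     "Is there likely an edge between Node 1 and Node 2?")
```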

5 Experiments

In this section, we first introduce the detailed experimental settings in Section 5.1. We then present comprehensive benchmarks and analyses for link prediction and node classification in Sections 5.2 and 5.3, respectively.

Table 2: Link prediction results (AUC / F1) of PLM-based and GNN-based methods. Column groups denote the PLM used to embed node and edge text; "w/o edge text" omits edge text embeddings.

Goodreads-Children:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.8952 / 0.8198 | 0.8948 / 0.8193 | 0.8947 / 0.8192 | 0.8929 / 0.8181 |
| GraphSAGE | 0.9520 / 0.8866 | 0.9493 / 0.8821 | 0.9503 / 0.8848 | 0.9400 / 0.8736 |
| GeneralConv | 0.9519 / 0.8907 | 0.9521 / 0.8921 | 0.9540 / 0.8953 | 0.9356 / 0.8735 |
| GINE | 0.9518 / 0.8939 | 0.9463 / 0.8878 | 0.9491 / 0.8914 | 0.9389 / 0.8748 |
| EdgeConv | 0.9487 / 0.8851 | 0.9488 / 0.8884 | 0.9504 / 0.8891 | 0.9352 / 0.8765 |
| GraphTransformer | 0.9487 / 0.8751 | 0.9441 / 0.8742 | 0.9431 / 0.8763 | 0.9241 / 0.8333 |

Goodreads-Crime:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.8911 / 0.8144 | 0.8909 / 0.8145 | 0.8920 / 0.8153 | 0.8913 / 0.8149 |
| GraphSAGE | 0.9241 / 0.8541 | 0.9537 / 0.8887 | 0.9529 / 0.8868 | 0.9053 / 0.8320 |
| GeneralConv | 0.9325 / 0.8625 | 0.9568 / 0.8957 | 0.9257 / 0.8526 | 0.9117 / 0.8426 |
| GINE | 0.9125 / 0.8429 | 0.9517 / 0.8878 | 0.9538 / 0.8928 | 0.9132 / 0.8448 |
| EdgeConv | 0.9104 / 0.8410 | 0.9545 / 0.8914 | 0.9535 / 0.8897 | 0.9036 / 0.8345 |
| GraphTransformer | 0.9078 / 0.8309 | 0.9465 / 0.8769 | 0.9479 / 0.8817 | 0.8985 / 0.8256 |

Amazon-Apps:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.8642 / 0.7752 | 0.8639 / 0.7698 | 0.8634 / 0.7698 | 0.8655 / 0.7738 |
| GraphSAGE | 0.8662 / 0.7853 | 0.8813 / 0.7971 | 0.8783 / 0.8015 | 0.8634 / 0.7366 |
| GeneralConv | 0.8810 / 0.8178 | 0.8768 / 0.8131 | 0.8757 / 0.8090 | 0.8680 / 0.8129 |
| GINE | 0.8559 / 0.8099 | 0.8680 / 0.8092 | 0.8555 / 0.8123 | 0.8671 / 0.8065 |
| EdgeConv | 0.8720 / 0.8180 | 0.8813 / 0.8153 | 0.8804 / 0.8184 | 0.8520 / 0.8043 |
| GraphTransformer | 0.8395 / 0.7647 | 0.8748 / 0.7926 | 0.8736 / 0.7846 | 0.8469 / 0.7329 |

Amazon-Movie:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.8227 / 0.7269 | 0.8213 / 0.7553 | 0.8349 / 0.7555 | 0.8205 / 0.7317 |
| GraphSAGE | 0.8500 / 0.7665 | 0.9067 / 0.8298 | 0.9178 / 0.8426 | 0.8507 / 0.7591 |
| GeneralConv | 0.8659 / 0.7928 | 0.9206 / 0.8485 | 0.8937 / 0.8483 | 0.8617 / 0.7918 |
| GINE | 0.8603 / 0.7911 | 0.9187 / 0.8454 | 0.9165 / 0.8456 | 0.8591 / 0.7879 |
| EdgeConv | 0.8565 / 0.7842 | 0.9171 / 0.8436 | 0.9181 / 0.8468 | 0.8552 / 0.7837 |
| GraphTransformer | 0.8339 / 0.7453 | 0.9035 / 0.8196 | 0.9044 / 0.8185 | 0.8393 / 0.7550 |

Citation:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.9170 / 0.8598 | 0.9173 / 0.8561 | 0.8935 / 0.8613 | 0.8857 / 0.8015 |
| GraphSAGE | 0.9369 / 0.8758 | 0.9457 / 0.8832 | 0.9780 / 0.9300 | 0.8925 / 0.8345 |
| GeneralConv | 0.9258 / 0.8739 | 0.9281 / 0.8637 | 0.9327 / 0.8757 | 0.8984 / 0.8397 |
| GINE | 0.9482 / 0.8939 | 0.9443 / 0.8825 | 0.9736 / 0.9272 | 0.8744 / 0.8145 |
| EdgeConv | 0.7136 / 0.5393 | 0.7132 / 0.5352 | 0.7401 / 0.6526 | 0.6965 / 0.5449 |
| GraphTransformer | 0.9350 / 0.8697 | 0.9439 / 0.8713 | 0.9789 / 0.9320 | 0.9172 / 0.8441 |

Twitter:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.6991 / 0.5430 | 0.8115 / 0.7898 | 0.8136 / 0.7148 | 0.7007 / 0.5430 |
| GraphSAGE | 0.6779 / 0.6193 | 0.8609 / 0.8177 | 0.8359 / 0.7964 | 0.5668 / 0.5940 |
| GeneralConv | 0.7888 / 0.7094 | 0.8531 / 0.7756 | 0.8062 / 0.6552 | 0.7017 / 0.6163 |
| GINE | 0.6696 / 0.6135 | 0.8306 / 0.7719 | 0.8738 / 0.7880 | 0.7213 / 0.6161 |
| EdgeConv | 0.6854 / 0.6123 | 0.8290 / 0.6614 | 0.7513 / 0.6745 | 0.6124 / 0.5664 |
| GraphTransformer | 0.6859 / 0.6764 | 0.8967 / 0.8223 | 0.8768 / 0.8165 | 0.5908 / 0.5423 |

Table 3: Link prediction results (AUC / F1) of LLM-as-Predictor methods.

| Method | Goodreads-Children | Goodreads-Crime | Amazon-Apps | Amazon-Movie | Citation | Twitter |
|---|---|---|---|---|---|---|
| GPT-3.5-TURBO | 0.4770 / 0.1413 | 0.4507 / 0.1104 | 0.5000 / 0.5200 | 0.4843 / 0.1342 | 0.8860 / 0.3514 | 0.4800 / 0.3312 |
| GPT-4 | 0.8780 / 0.6090 | 0.8890 / 0.6040 | 0.6212 / 0.1413 | 0.5000 / 0.3000 | 0.4735 / 0.3184 | 0.4300 / 0.6144 |

5.1 Experimental Settings

Baselines. (1) For the PLM-based paradigm, we use PLMs of three different scales to encode node texts and generate initial node embeddings: GPT-3.5-TURBO as the large-scale model, BERT-Large [8] as the medium-scale model, and BERT-Base [8] as the small-scale model. (2) For GNN-based methods, we select five popular GNN models: GraphSAGE [11], GeneralConv [44], GINE [14], EdgeConv [39], and GraphTransformer [45]. We use the same three PLM scales as in the PLM-based paradigm to encode the text on nodes and edges; these text embeddings then serve as the initial node and edge features. (3) For LLM-as-Predictor methods, we choose the state-of-the-art models GPT-3.5-TURBO and GPT-4, accessed via API, balancing performance and cost considerations.

Implementation details. We conduct experiments on 3 PLM-based, 18 GNN-based, and 2 LLM-based methods. For PLM-based methods, the node embedding dimensions are 3072, 1024, and 768 as generated by GPT-3.5-TURBO, BERT-Large, and BERT-Base, respectively. We use an MLP with 2 hidden layers, where each hidden layer has one-fourth as many units as the previous layer. For GNN-based methods, we adhere to the settings outlined in the respective papers. All GNN models share the node and edge embedding dimensions (3072, 1024, or 768, matching the generating PLM), with 2 layers and 256 hidden units. We train and optimize all the above models with cross-entropy loss and the Adam optimizer, using a batch size of 1024; each experiment is repeated three times. GNNs are mainly derived from the implementations in the PyG library [9]. For the node classification task, the categorical node categories in the original data are converted into numerical node labels within the graph. For link prediction, we randomly sample node pairs that do not exist in the graph as negative samples, along with existing edges as positive samples (see the sketch after this paragraph). For LLM-based predictor methods, we focus on node classification and link prediction. For node classification, inspired by a recent LLM-based classification algorithm [26], we use GPT-4 and GPT-3.5-TURBO to predict the class of text nodes, asking for the probability of each class; we randomly select 1,000 text nodes along with all classification labels for this task. For link prediction, we also apply GPT-4 and GPT-3.5-TURBO to determine whether two nodes are linked, with an answer and corresponding probability; for this task, we randomly select 1,000 positive edge pairs from the graph and an equal number of negative pairs.
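The link prediction sampling described above can be realized with PyG's built-in negative sampler. The sketch below assumes the 1:1 positive-to-negative ratio stated above; the helper name is ours.

```python
import torch
from torch_geometric.utils import negative_sampling

def make_link_prediction_labels(edge_index, num_nodes):
    """Positives: observed edges. Negatives: an equal number of uniformly
    sampled node pairs absent from the graph."""
    neg_edge_index = negative_sampling(
        edge_index=edge_index,
        num_nodes=num_nodes,
        num_neg_samples=edge_index.size(1),  # 1:1 ratio
    )
    pairs = torch.cat([edge_index, neg_edge_index], dim=1)
    labels = torch.cat([
        torch.ones(edge_index.size(1)),      # positive pairs
        torch.zeros(neg_edge_index.size(1)), # negative pairs
    ])
    return pairs, labels
```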

Evaluation metrics. We investigate the performance of different baselines through two tasks: link prediction and node classification. For link prediction, we use the Area Under the ROC Curve (AUC) and the F1 score. For node classification, the choice of metrics depends on the nature of the classification task. The Goodreads datasets (e.g., Goodreads-Children, Goodreads-Crime, and Goodreads-Comics) and the Amazon datasets (Amazon-Apps and Amazon-Movie) involve multi-label node classification, so AUC-micro and F1-micro are used. Conversely, the citation and social network datasets involve multi-class node classification, so accuracy (ACC) and F1 are used.
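These metrics can be computed with scikit-learn; the sketch below covers the multi-label case, with the 0.5 decision threshold as an illustrative choice.

```python
from sklearn.metrics import roc_auc_score, f1_score

def multilabel_metrics(y_true, y_score, threshold=0.5):
    """y_true: [n_samples, n_classes] binary matrix; y_score: predicted probabilities."""
    auc_micro = roc_auc_score(y_true, y_score, average="micro")
    y_pred = (y_score >= threshold).astype(int)   # threshold probabilities to labels
    f1_micro = f1_score(y_true, y_pred, average="micro")
    return auc_micro, f1_micro
```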

Table 4: Node classification results of PLM-based and GNN-based methods (AUC-micro / F1-micro for the Goodreads and Amazon datasets; ACC / F1 for Citation and Twitter).

Goodreads-Children:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.8505 / 0.5663 | 0.8593 / 0.5810 | 0.8597 / 0.5749 | 0.8452 / 0.5811 |
| GraphSAGE | 0.9342 / 0.7871 | 0.9162 / 0.7497 | 0.9152 / 0.7440 | 0.8713 / 0.6227 |
| GeneralConv | 0.9352 / 0.7846 | 0.9161 / 0.7502 | 0.9152 / 0.7451 | 0.8681 / 0.6162 |
| GINE | 0.9324 / 0.7777 | 0.9154 / 0.7466 | 0.9137 / 0.7552 | 0.8523 / 0.6558 |
| EdgeConv | 0.9338 / 0.7808 | 0.9128 / 0.7463 | 0.9121 / 0.7452 | 0.8583 / 0.6466 |
| GraphTransformer | 0.9340 / 0.7823 | 0.9137 / 0.7497 | 0.9150 / 0.7491 | 0.8517 / 0.6565 |

Goodreads-Crime:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.9149 / 0.6615 | 0.9150 / 0.6619 | 0.9151 / 0.6602 | 0.9154 / 0.6624 |
| GraphSAGE | 0.9549 / 0.8189 | 0.9445 / 0.7832 | 0.9463 / 0.7848 | 0.9221 / 0.7048 |
| GeneralConv | 0.9546 / 0.8200 | 0.9446 / 0.7854 | 0.9456 / 0.7888 | 0.9225 / 0.7262 |
| GINE | 0.9504 / 0.8073 | 0.9410 / 0.7766 | 0.9429 / 0.7852 | 0.9155 / 0.7117 |
| EdgeConv | 0.9490 / 0.8052 | 0.9400 / 0.7657 | 0.9405 / 0.7726 | 0.9187 / 0.6830 |
| GraphTransformer | 0.9505 / 0.8151 | 0.9452 / 0.7795 | 0.9464 / 0.7834 | 0.9220 / 0.6944 |

Amazon-Apps:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.7520 / 0.3204 | 0.8935 / 0.4169 | 0.8970 / 0.3107 | 0.7352 / 0.3067 |
| GraphSAGE | 0.9274 / 0.3899 | 0.9226 / 0.3794 | 0.9229 / 0.3929 | 0.9161 / 0.3348 |
| GeneralConv | 0.8947 / 0.3604 | 0.9171 / 0.3817 | 0.9223 / 0.3803 | 0.9151 / 0.3932 |
| GINE | 0.9170 / 0.3588 | 0.9170 / 0.2623 | 0.9185 / 0.3592 | 0.9028 / 0.3507 |
| EdgeConv | 0.8764 / 0.3477 | 0.8639 / 0.2739 | 0.8800 / 0.3063 | 0.8568 / 0.2247 |
| GraphTransformer | 0.9195 / 0.3548 | 0.9217 / 0.3425 | 0.9225 / 0.3818 | 0.9155 / 0.3860 |

Amazon-Movie:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.9618 / 0.5279 | 0.9752 / 0.5331 | 0.9750 / 0.5173 | 0.9493 / 0.4625 |
| GraphSAGE | 0.9674 / 0.5165 | 0.9773 / 0.4919 | 0.9771 / 0.5185 | 0.9681 / 0.5096 |
| GeneralConv | 0.9775 / 0.5156 | 0.9768 / 0.4827 | 0.9768 / 0.5006 | 0.9757 / 0.5115 |
| GINE | 0.9507 / 0.4246 | 0.9758 / 0.4781 | 0.9759 / 0.5085 | 0.9168 / 0.4127 |
| EdgeConv | 0.9360 / 0.5060 | 0.9372 / 0.4672 | 0.9263 / 0.4743 | 0.9492 / 0.4853 |
| GraphTransformer | 0.9763 / 0.5175 | 0.9764 / 0.4856 | 0.9771 / 0.5124 | 0.9756 / 0.5126 |

Citation:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.7868 / 0.7859 | 0.7515 / 0.7471 | 0.8044 / 0.8032 | 0.7293 / 0.7271 |
| GraphSAGE | 0.7883 / 0.7874 | 0.7559 / 0.7525 | 0.8046 / 0.8060 | 0.7341 / 0.7308 |
| GeneralConv | 0.7906 / 0.7889 | 0.7546 / 0.7526 | 0.8057 / 0.8042 | 0.7361 / 0.7337 |
| GINE | 0.7934 / 0.7925 | 0.7599 / 0.7574 | 0.8106 / 0.8100 | 0.7316 / 0.7284 |
| EdgeConv | 0.4140 / 0.3845 | 0.4082 / 0.3763 | 0.4200 / 0.3906 | 0.3935 / 0.3541 |
| GraphTransformer | 0.7903 / 0.7885 | 0.7531 / 0.7517 | 0.8070 / 0.8056 | 0.7369 / 0.7351 |

Twitter:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | w/o edge text |
|---|---|---|---|---|
| MLP | 0.8115 / 0.7261 | 0.8361 / 0.8193 | 0.8533 / 0.8329 | 0.8196 / 0.7383 |
| GraphSAGE | 0.8411 / 0.7903 | 0.8446 / 0.8305 | 0.8384 / 0.8247 | 0.8286 / 0.7802 |
| GeneralConv | 0.8610 / 0.8397 | 0.8368 / 0.8131 | 0.8609 / 0.8513 | 0.8401 / 0.8089 |
| GINE | 0.8438 / 0.8186 | 0.8401 / 0.8255 | 0.8460 / 0.8328 | 0.8254 / 0.7907 |
| EdgeConv | 0.8551 / 0.8442 | 0.8649 / 0.8574 | 0.8694 / 0.8607 | 0.8529 / 0.8431 |
| GraphTransformer | 0.8563 / 0.8273 | 0.8342 / 0.8211 | 0.8402 / 0.8261 | 0.8197 / 0.7888 |

Table 5: Node classification results of LLM-as-Predictor methods (AUC-micro / F1-micro; ACC / F1 for Citation).

| Method | Goodreads-Children | Goodreads-Crime | Amazon-Apps | Amazon-Movie | Citation |
|---|---|---|---|---|---|
| GPT-3.5-TURBO | 0.5200 / 0.0300 | 0.5400 / 0.0700 | 0.5000 / 0.0100 | 0.5159 / 0.0017 | 0.7098 / 0.3402 |
| GPT-4 | 0.6700 / 0.1800 | 0.6100 / 0.1400 | 0.4995 / 0.0002 | 0.5175 / 0.0029 | 0.8432 / 0.8450 |

5.2 Effectiveness Analysis for Link Prediction

In this subsection, we analyze the link prediction results of the various models applied in the study. Tables 2 and 3 report link prediction performance on datasets from different domains; results on the remaining datasets can be found in Appendix B.1. Several observations can be drawn. First, among PLM-based and GNN-based methods, the best-performing methods on the Goodreads-Children and Goodreads-Crime datasets are GeneralConv and GINE, respectively; under the same embeddings, they outperform the worst method by approximately 5% and 7% in terms of AUC and F1 across these two datasets. For the Amazon-Apps and Amazon-Movie datasets, the best methods are EdgeConv and GeneralConv, which outperform the worst method by approximately 3% and 7% in terms of AUC and F1 on Amazon-Apps, and by 8% and 7% on Amazon-Movie. For the Citation and Twitter datasets, the best method is GraphTransformer, which outperforms the worst method by approximately 20% and 30% in terms of AUC and F1 on Citation, and by 12% and 9% on Twitter. Second, LLM-as-Predictor methods do not perform well at predicting links: the best of them trails the best PLM-based and GNN-based methods by approximately 10%-30% in AUC and F1 across all datasets. Third, using edge text provides at least an approximately 3% improvement in AUC and an approximately 8% improvement in F1 compared to not using edge text, across all datasets.

5.3 Effectiveness Analysis for Node Classification

In this subsection, we analyze the node classification results of the various models. Tables 4 and 5 report performance on datasets from different domains, with additional results in Appendix B.2. Several insights can be derived. First, among PLM-based and GNN-based methods, the best models for Goodreads-Children and Goodreads-Crime are GraphSAGE and GeneralConv, respectively, outperforming the worst method by approximately 8% and 20% in AUC-micro and F1-micro on Goodreads-Children, and by 4% and 15% on Goodreads-Crime. In the e-commerce domain, GraphSAGE is the top method for Amazon-Apps and Amazon-Movie, outperforming the worst method by about 10% and 6% in AUC-micro and F1-micro on Amazon-Apps, and by 1% and 10% on Amazon-Movie. GINE and EdgeConv also show superior performance, exceeding the worst method by approximately 35% and 40% in ACC and F1 on Citation, and by 5% and 12% on Twitter. Second, LLM-as-Predictor methods perform poorly at node classification, with the best method showing an AUC-micro gap of about 30% compared to the best PLM-based and GNN-based methods; their low F1-micro scores could be due to the large number of predicted categories. Third, incorporating edge text yields at least a 3% improvement in AUC-micro and a 6% improvement in F1-micro across all datasets, compared to not using edge text.

Observation. (1) The state-of-the-art model varies across different datasets. Data variability and complexity play significant roles in influencing model performance. (2) Edge text is crucial for TEG tasks. Including edge text enriches relationship information, enabling a more precise depiction of interactions and relationships between nodes, which enhances overall model performance. (3) The scale of PLMs significantly impacts the performance of TEG tasks. Larger model scales result in higher-quality text embeddings and better semantic understanding, leading to improved model performance. (4) When using LLMs as predictors, they struggle to fully comprehend graph topology information. LLMs are designed for linear sequence data and do not inherently capture the complex relationships and structures present in graph data, leading to lower performance on TEGs link prediction and node classification.

5.4 Parameter Sensitivity Analysis

We further analyze the impact of the text embeddings generated by PLMs. For the link prediction task, as shown in Table 2, using a small-scale PLM such as BERT-Base improves AUC and F1 by approximately 5% compared to not using text embeddings, while the medium-scale BERT-Large and the large-scale GPT-3.5-TURBO improve AUC and F1 by about 7% across all datasets. For node classification, as shown in Table 4, the improvement is slightly less pronounced: BERT-Base improves AUC-micro and F1-micro by approximately 3%, while BERT-Large and GPT-3.5-TURBO improve these scores by about 3.5% across all datasets.

6 Discussion

Textual-edge graphs have emerged as a prominent graph format with extensive applications in modeling real-world tasks. Our research focuses on comprehensively understanding the textual attributes of nodes and edges together with their topological connections. Furthermore, we believe that exploring strategies to enhance the efficiency of LLMs in processing TEGs is worthwhile. Despite the proven effectiveness of LLMs, their operational efficiency, especially in managing TEGs, poses a significant challenge. Notably, employing APIs such as GPT-4 for extensive graph tasks may incur considerable expense under current billing models. Additionally, deploying open-source large models such as LLaMA for parameter updates or inference in local environments demands substantial computational resources and storage capacity. Moreover, the constraints imposed by LLM context windows also limit their effectiveness in encoding node and edge text within TEGs.

7 Conclusion

We introduce the first TEG benchmark, TEG-DB, tailored to graph representation learning on TEGs. Unlike traditional TAGs, which carry text only on nodes, TEG-DB incorporates textual content on both nodes and edges. We gather and furnish nine comprehensive textual-edge datasets to foster collaboration between the NLP and GNN communities in exploring the data jointly. Our benchmark offers a thorough assessment of various learning approaches, establishing their efficacy and limitations. We plan to continue uncovering and building more research-oriented TEGs to further propel the robust growth of this domain.

References

  • [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [2] Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, et al. Construction of the literature graph in Semantic Scholar. arXiv preprint arXiv:1805.02262, 2018.
  • [3] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • [4] Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, et al. Low-code LLM: Visual programming over LLMs. arXiv preprint arXiv:2304.08103, 2023.
  • [5] Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. Exploring the potential of large language models (LLMs) in learning on graphs. arXiv preprint arXiv:2307.03393, 2023.
  • [6] Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. Exploring the potential of large language models (LLMs) in learning on graphs. ACM SIGKDD Explorations Newsletter, 25(2):42–61, 2024.
  • [7] Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. ChatLaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092, 2023.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [9] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • [10] Jiayan Guo, Lun Du, and Hengyu Liu. GPT4Graph: Can large language models understand graph structured data? An empirical evaluation and benchmarking. arXiv preprint arXiv:2305.15066, 2023.
  • [11] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 1024–1034, 2017.
  • [12] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517, 2016.
  • [13] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133, 2020.
  • [14] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020.
  • [15] Zan Huang, Wingyan Chung, and Hsinchun Chen. A graph model for e-commerce recommender systems. Journal of the American Society for Information Science and Technology, 55(3):259–274, 2004.
  • [16] Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, and Jiawei Han. Large language models on graphs: A comprehensive survey. arXiv preprint arXiv:2312.02783, 2023.
  • [17] Bowen Jin, Wentao Zhang, Yu Zhang, Yu Meng, Xinyang Zhang, Qi Zhu, and Jiawei Han. Patton: Language model pretraining on text-rich networks. arXiv preprint arXiv:2305.12268, 2023.
  • [18] Bowen Jin, Yu Zhang, Yu Meng, and Jiawei Han. Edgeformers: Graph-empowered transformers for representation learning on textual-edge networks. In International Conference on Learning Representations, 2023.
  • [19] Bowen Jin, Yu Zhang, Qi Zhu, and Jiawei Han. Heterformer: Transformer-based deep node representation learning on heterogeneous text-rich networks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1020–1031, 2023.
  • [20] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • [21] Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models, 2023.
  • [22] Haoyu Kuang, Jiarong Xu, Haozhe Zhang, Zuyu Zhao, Qi Zhang, Xuanjing Huang, and Zhongyu Wei. Unleashing the power of language models in text-attributed graph. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8429–8441, 2023.
  • [23] Chen Ling, Zhuofeng Li, Yuntong Hu, Zheng Zhang, Zhongyuan Liu, Shuang Zheng, and Liang Zhao. Link prediction on textual edge graphs. arXiv preprint arXiv:2405.16606, 2024.
  • [24] Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng, Can Zheng, Junxiang Wang, Tanmoy Chowdhury, Yun Li, Hejie Cui, Tianjiao Zhao, et al. Domain specialization as the key to make large language models disruptive: A comprehensive survey. arXiv preprint arXiv:2305.18703, 2023.
  • [25] Xiaozhong Liu, Jinsong Zhang, and Chun Guo. Full-text citation analysis: A new method to enhance scholarly networks. Journal of the American Society for Information Science and Technology, 64(9):1852–1863, 2013.
  • [26] Yifan Liu, Chenchen Kuai, Haoxuan Ma, Xishun Liao, Brian Yueshuai He, and Jiaqi Ma. Semantic trajectory data mining with LLM-informed POI classification, 2024.
  • [27] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. In Information Retrieval, pages 127–163. Springer, 2000.
  • [28] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR), 2013.
  • [29] Yixin Li and Ming Chen. Revisiting graph neural networks for link prediction. International Conference on Machine Learning (ICML), 2020.
  • [30] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network? The structure of the Twitter follow graph. In Proceedings of the 23rd International Conference on World Wide Web, pages 493–498, 2014.
  • [31] Dmitry Paranyushkin. InfraNodus: Generating insight using text network analysis. In The World Wide Web Conference, pages 3584–3589, 2019.
  • [32] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics, 2014.
  • [33] Benedek Rozemberczki, Ryan Davies, Rik Sarkar, and Charles Sutton. GEMSEC: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 65–72, 2019.
  • [34] Benedek Rozemberczki, Otilia Kiss, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 8(3):cnz037, 2020.
  • [35] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.
  • [36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [37] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
  • [38] Mengting Wan and Julian McAuley. Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 86–94, 2018.
  • [39] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2319–2328, 2019.
  • [40] Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. Graph neural networks in recommender systems: A survey. ACM Computing Surveys, 55(5):1–37, 2022.
  • [41] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
  • [42] Hao Yan, Chaozhuo Li, Ruosong Long, Chao Yan, Jianan Zhao, Wenwen Zhuang, Jun Yin, Peiyan Zhang, Weihao Han, Hao Sun, Weiwei Deng, Qi Zhang, Lichao Sun, Xing Xie, and Senzhang Wang. A comprehensive study on text-attributed graphs: Benchmarking and rethinking. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.
  • [43] Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang. Natural language is all a graph needs. arXiv preprint arXiv:2308.07134, 2023.
  • [44] Jiaxuan You, Zhitao Ying, and Jure Leskovec. Design space for graph neural networks. Advances in Neural Information Processing Systems, 33:17009–17021, 2020.
  • [45] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J. Kim. Graph transformer networks. In Advances in Neural Information Processing Systems, pages 11960–11970, 2019.
  • [46] Chao Zhang, Guangyu Zhou, Quan Yuan, Honglei Zhuang, Yu Zheng, Lance Kaplan, Shaowen Wang, and Jiawei Han. GeoBurst: Real-time local event detection in geo-tagged tweet streams. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 513–522, 2016.
  • [47] Delvin Ce Zhang, Menglin Yang, Rex Ying, and Hady W. Lauw. Text-attributed graph representation learning: Methods, applications, and challenges. In Companion Proceedings of the ACM on Web Conference 2024, pages 1298–1301, 2024.
  • [48] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, 2018.
  • [49] Jianan Zhao, Meng Qu, Chaozhuo Li, Hao Yan, Qian Liu, Rui Li, Xing Xie, and Jian Tang. Learning on large-scale text-attributed graphs via variational inference. arXiv preprint arXiv:2210.14709, 2022.
  • [50] Tao Zou, Le Yu, Yifei Huang, Leilei Sun, and Bowen Du. Pretraining language models with text-attributed heterogeneous graphs. arXiv preprint arXiv:2310.12580, 2023.

Appendix A Datasets

A.1 Dataset format

For each dataset, all unprocessed raw files are provided in .json format. After preprocessing, we store the graph data in .pt format using PyTorch, compatible with PyTorch Geometric (PyG) [9]. Specifically, we retain the raw text on nodes, the node labels, the raw text on edges, and the adjacency matrix. We uniformly store the text embeddings of node and edge text in .npy files and load them during data processing. A loading sketch follows.
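Given this layout, loading a processed dataset reduces to a few lines. The file names below are placeholders; see the repository for the exact paths.

```python
import numpy as np
import torch

# Graph with raw node/edge text, labels, and adjacency, stored as a PyG object.
# On newer PyTorch versions, torch.load may need weights_only=False for Data objects.
graph = torch.load("goodreads_children.pt")

# Precomputed PLM text embeddings for nodes and edges, stored separately in .npy files.
node_emb = torch.from_numpy(np.load("node_text_emb.npy"))
edge_emb = torch.from_numpy(np.load("edge_text_emb.npy"))

graph.x = node_emb           # use text embeddings as initial node features
graph.edge_attr = edge_emb   # and as initial edge features
```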

A.2 Datasets license

The datasets are subject to the MIT license. For precise license information, please refer to the corresponding GitHub repository.

Appendix B Experiment

B.1 Effectiveness Analysis for Link Prediction

In this subsection, we further analyze the link prediction results of the various models. Tables 6, 7, and 8 report link prediction performance on the remaining datasets. Several observations can be drawn. First, among PLM-based and GNN-based methods, the best methods on the Goodreads-Comics and Goodreads-History datasets are GeneralConv and GINE, respectively; under the same embeddings, they outperform the worst method by approximately 6% and 7% in terms of AUC and F1 across these two datasets. On the Reddit dataset, the best method is GeneralConv, which outperforms the worst method by approximately 3% and 5% in terms of AUC and F1, respectively. Second, LLM-as-Predictor methods do not perform well at predicting links: the best of them trails the best PLM-based and GNN-based methods by approximately 10%-30% in AUC and F1 across all datasets. Third, using edge text provides at least an approximately 3% improvement in AUC and an approximately 8% improvement in F1 compared to not using edge text, across all datasets.

B.2 Effectiveness Analysis for Node Classification

In this subsection, we further analyze the node classification results of the various models. Tables 9, 10, and 11 report performance on the remaining datasets. Several insights can be derived. First, among PLM-based and GNN-based methods, the best models for Goodreads-Comics and Goodreads-History are GeneralConv and GINE, respectively, outperforming the worst method by approximately 8% and 15% in AUC-micro and F1-micro on Goodreads-Comics, and by 6% and 9% on Goodreads-History. GraphTransformer outperforms the worst method by approximately 2% and 1% in ACC and F1 on Reddit. Second, LLM-as-Predictor methods perform poorly at node classification, with the best method showing an AUC-micro gap of about 20% compared to the best PLM-based and GNN-based methods; their low F1-micro scores could be due to the large number of predicted categories. Third, incorporating edge text yields at least a 3% improvement in AUC-micro and a 6% improvement in F1-micro across almost all datasets, compared to not using edge text.

Table 6: Link prediction results (AUC / F1) on Goodreads-Comics and Goodreads-History. "None" denotes no edge text embeddings.

Goodreads-Comics:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | None |
|---|---|---|---|---|
| MLP | 0.8902 / 0.8136 | 0.8900 / 0.8130 | 0.8900 / 0.8128 | 0.8928 / 0.8167 |
| GraphSAGE | 0.9406 / 0.8689 | 0.9511 / 0.8854 | 0.9537 / 0.8860 | 0.9403 / 0.8732 |
| GeneralConv | 0.9478 / 0.8843 | 0.9535 / 0.8930 | 0.9544 / 0.8942 | 0.9458 / 0.8825 |
| GINE | 0.9489 / 0.8870 | 0.9480 / 0.8857 | 0.9471 / 0.8833 | 0.9446 / 0.8819 |
| EdgeConv | 0.9448 / 0.8819 | 0.9495 / 0.8867 | 0.9477 / 0.8853 | 0.9444 / 0.8810 |
| GraphTransformer | 0.9380 / 0.8687 | 0.9433 / 0.8747 | 0.9466 / 0.8781 | 0.9362 / 0.8661 |

Goodreads-History:

| Method | BERT-Large | BERT-Base | None |
|---|---|---|---|
| MLP | 0.8922 / 0.8897 | 0.8923 / 0.8897 | 0.8913 / 0.8149 |
| GraphSAGE | 0.9587 / 0.8702 | 0.9591 / 0.8698 | 0.9053 / 0.8320 |
| GeneralConv | 0.9624 / 0.8900 | 0.9629 / 0.8897 | 0.9117 / 0.8426 |
| GINE | 0.9631 / 0.8669 | 0.9634 / 0.8937 | 0.9132 / 0.8448 |
| EdgeConv | 0.9457 / 0.8695 | 0.9456 / 0.8650 | 0.9036 / 0.8345 |
| GraphTransformer | 0.9589 / 0.8698 | 0.9590 / 0.8690 | 0.8985 / 0.8256 |

Table 7: Link prediction results (AUC / F1) on Reddit.

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | None |
|---|---|---|---|---|
| MLP | 0.9909 / 0.9651 | 0.9866 / 0.9576 | 0.8900 / 0.8128 | 0.8928 / 0.8167 |
| GraphSAGE | 0.9908 / 0.9810 | 0.9897 / 0.9800 | 0.9537 / 0.8860 | 0.9403 / 0.8732 |
| GeneralConv | 0.9964 / 0.9809 | 0.9956 / 0.9815 | 0.9544 / 0.8942 | 0.9458 / 0.8825 |
| GINE | 0.9962 / 0.9809 | 0.9958 / 0.9801 | 0.9471 / 0.8833 | 0.9446 / 0.8819 |
| EdgeConv | 0.9926 / 0.9818 | 0.9926 / 0.9803 | 0.9477 / 0.8853 | 0.9444 / 0.8810 |
| GraphTransformer | 0.9944 / 0.9810 | 0.9940 / 0.9803 | 0.9466 / 0.8781 | 0.9362 / 0.8661 |

Table 8: Link prediction results (AUC / F1) of LLM-as-Predictor methods on the remaining datasets.

| Method | Goodreads-Comics | Goodreads-History | Reddit |
|---|---|---|---|
| GPT-3.5-TURBO | 0.4565 / 0.3588 | 0.6031 / 0.5234 | 0.4980 / 0.3440 |
| GPT-4 | 0.5446 / 0.2461 | 0.8661 / 0.8685 | 0.6632 / 0.6478 |

Table 9: Node classification results (AUC-micro / F1-micro) on Goodreads-Comics and Goodreads-History. "None" denotes no edge text embeddings.

Goodreads-Comics:

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | None |
|---|---|---|---|---|
| MLP | 0.8361 / 0.5117 | 0.8360 / 0.5211 | 0.8370 / 0.5214 | 0.8373 / 0.5214 |
| GraphSAGE | 0.9068 / 0.7379 | 0.8965 / 0.7118 | 0.8965 / 0.7088 | 0.8689 / 0.6401 |
| GeneralConv | 0.9107 / 0.7455 | 0.8982 / 0.7134 | 0.8991 / 0.7116 | 0.8739 / 0.6541 |
| GINE | 0.9006 / 0.7187 | 0.8943 / 0.7084 | 0.8932 / 0.7140 | 0.8627 / 0.6457 |
| EdgeConv | 0.9015 / 0.7127 | 0.8923 / 0.7066 | 0.8931 / 0.7089 | 0.8648 / 0.6260 |
| GraphTransformer | 0.9027 / 0.7285 | 0.8940 / 0.7175 | 0.8966 / 0.7151 | 0.8704 / 0.6554 |

Goodreads-History:

| Method | BERT-Large | BERT-Base | None |
|---|---|---|---|
| MLP | 0.7831 / 0.8099 | 0.7825 / 0.8097 | 0.7824 / 0.8096 |
| GraphSAGE | 0.8543 / 0.8975 | 0.8538 / 0.8970 | 0.8044 / 0.8088 |
| GeneralConv | 0.8543 / 0.8986 | 0.8538 / 0.8981 | 0.8119 / 0.8126 |
| GINE | 0.8541 / 0.9015 | 0.8549 / 0.9022 | 0.8133 / 0.8226 |
| EdgeConv | 0.8520 / 0.8974 | 0.8515 / 0.8960 | 0.8059 / 0.8116 |
| GraphTransformer | 0.8555 / 0.9009 | 0.8647 / 0.8995 | 0.8101 / 0.8089 |

Table 10: Node classification results (ACC / F1) on Reddit.

| Method | GPT-3.5-TURBO | BERT-Large | BERT-Base | None |
|---|---|---|---|---|
| MLP | 0.9839 / 0.9817 | 0.9793 / 0.9774 | 0.9803 / 0.9784 | 0.9795 / 0.9779 |
| GraphSAGE | 0.9974 / 0.9962 | 0.9975 / 0.9964 | 0.9973 / 0.9963 | 0.9974 / 0.9965 |
| GeneralConv | 0.9975 / 0.9966 | 0.9974 / 0.9963 | 0.9973 / 0.9964 | 0.9973 / 0.9964 |
| GINE | 0.9973 / 0.9962 | 0.9973 / 0.9963 | 0.9974 / 0.9965 | 0.9974 / 0.9962 |
| EdgeConv | 0.9973 / 0.9960 | 0.9973 / 0.9960 | 0.9973 / 0.9960 | 0.9973 / 0.9959 |
| GraphTransformer | 0.9973 / 0.9963 | 0.9974 / 0.9965 | 0.9974 / 0.9966 | 0.9973 / 0.9964 |

Table 11: Node classification results of LLM-as-Predictor methods on the remaining datasets (AUC-micro / F1-micro for Goodreads; ACC / F1 for Reddit).

| Method | Goodreads-Comics | Goodreads-History | Reddit |
|---|---|---|---|
| GPT-3.5-TURBO | 0.4900 / 0.0400 | 0.6827 / 0.4147 | 0.8625 / 0.9262 |
| GPT-4 | 0.5600 / 0.0600 | 0.8202 / 0.7394 | 0.9767 / 0.9882 |
