TIFG: Text-Informed Feature Generation with Large Language Models (2024)

Xinhao Zhang (Portland State University, xinhaoz@pdx.edu)
Jinghan Zhang (Portland State University, jinghanz@pdx.edu)
Fengran Mo (University of Montreal, fengran.mo@umontreal.ca)
Yuzhong Chen (Visa Research, yuzchen@visa.com)
Kunpeng Liu (Portland State University, kunpeng@pdx.edu)
*Corresponding Authors.

Abstract

Textual information of data is of vital importance for data mining and feature engineering. However, existing methods focus on learning the data structures and overlook the textual information that accompanies the data. Consequently, they waste this valuable resource and miss the deeper data relationships embedded in the texts. In this paper, we introduce Text-Informed Feature Generation (TIFG), a novel LLM-based text-informed feature generation framework. TIFG utilizes the textual information to generate features by retrieving possibly relevant features from external knowledge with Retrieval Augmented Generation (RAG) technology. With this approach, TIFG can generate new, explainable features to enrich the feature space and further mine feature relationships. We design TIFG as an automated framework that continuously optimizes the feature generation process, adapts to new data inputs, and improves downstream task performance over iterations. Experiments across a broad range of downstream tasks show that our approach can generate high-quality and meaningful features and is significantly superior to existing methods.

1 Introduction

Domain knowledge and world information are usually stored in text form. In data science, textual information such as data descriptions, attributes, and labels is essential for models to learn data structures and feature relationships, since it clarifies the real-world meanings and connections of the data Xiang et al. (2021). However, feature engineering methods, especially feature generation Khurana et al. (2018); Wang et al. (2022), often focus solely on the data values and overlook the textual content. As a result, the features generated by these methods are usually difficult to explain or lack practical meaning Zhang et al. (2024c).

[Figure 1: Example of text-informed feature generation: Height and Weight are combined into a new BMI feature relevant to heart disease.]

To enrich data with textual information, it is important to understand the given text and mine its relationships with the features in other forms. Although Large Language Models (LLMs), with their powerful contextual capabilities, can integrate, extract, and comprehend textual information to some extent, misunderstandings of domain-specific text still occur. This is because the pre-training data of LLMs cannot include the latest or all relevant expert knowledge Zheng and Casari (2018); Wang et al. (2024). Fine-tuning LLMs with domain-specific data helps mitigate this issue but is costly, so a common practice is to retrieve expert knowledge to enable LLMs to utilize the textual information effectively.

An intuitive approach is to leverage Retrieval Augmented Generation (RAG) to retrieve relevant information from an external knowledge base or library with a query issued by the LLM. The LLM can then integrate the retrieved knowledge to enhance its outputs. This approach draws advantages from both the retrieval and generation components, equipping the LLM with state-of-the-art knowledge for specific domains without extra training or fine-tuning. As shown in Figure 1, where the goal is to mine the relationships among features such as Height, Weight, and BMI as heart attack factors, we adopt an LLM to learn about the labels and read the description of the given information. The LLM then forms a query to retrieve relevant knowledge about BMI in the context of heart disease. We expect the LLM to extract the vital information and generate the new feature BMI, enriching the data with explicit relationships between old and new features and improving performance on the domain-specific task. Existing studies on feature generation usually focus only on the data structure Guo et al. (2017); Song et al. (2019); Shi et al. (2018). Although efficient, these approaches lack transparency and are hard to explain.

Our Targets. Overall, we aim to address three main challenges in generating features with textual information: 1) how to extract and utilize textual information for feature generation, 2) how to generate reliable and explainable features at low resource cost, and 3) how to achieve an automated and continuously optimizing process.

Our Approach. To address the aforementioned challenges, we design TIFG, a novel LLM-based text-informed feature generation method that adopts an LLM to generate features dynamically and automatically by mining the textual information. Specifically, 1) we adopt an LLM to analyze and decode the structure and meaning of textual information, e.g., dataset descriptions and feature labels. The LLM infers, reasons, and proposes potential new features, along with their derivation methods and meanings, for the dataset and the associated task. With advanced reasoning capabilities, the LLM can provide a logical and cognitive justification for each new feature. 2) We integrate the LLM with Retrieval-Augmented Generation (RAG) technology to provide it with a reliable knowledge base without additional training in specialized domains. With the retrieved relevant textual information, we enrich the knowledge resources fed to the LLM, and the model can then select the pertinent information to generate a new feature. 3) We design an automated feature generation framework that continually refines its feature generation capabilities. Each new feature generated by the LLM is automatically integrated into the data table and evaluated for its effectiveness in downstream tasks. We then keep the successfully validated features, update the textual information of the data table, and start the next iteration cycle. The iteratively generated textual features can significantly enhance domain-specific interpretability to align with the needs of these fields.

In summary, our contributions include:

  1. We introduce a novel text-informed feature generation method, TIFG, with an LLM, which utilizes textual information and external knowledge to generate new features and enrich the feature space for downstream task performance improvement.

  2. We develop a novel paradigm by incorporating RAG into the feature generation process to access reliable external knowledge and enrich the feature space with extra information.

  3. We conduct a series of experiments to validate the effectiveness and robustness of our TIFG method across different tasks. Experimental results demonstrate that our method has clear advantages over existing methods.

2 Related Work

2.1 Automated Feature Generation

Feature Engineering. Feature engineering is an essential part of machine learning that includes selecting, modifying, or creating new features from raw data to improve task performance Severyn and Moschitti (2013). The goal of this process is to optimize the feature representation space: we tailor the best and most interpretable feature set for a specific machine learning task to achieve better model accuracy and generalization. Within feature engineering, feature generation is the approach of generating new features from existing ones in a dataset, usually through various mathematical or logical transformation operations Guo et al. (2017); Song et al. (2019). The goal of feature generation is to create complex latent feature spaces Zhong et al. (2016); Schölkopf et al. (2021). However, existing feature generation methods often lack transparency and require a large amount of manual operation.

Automated Feature Generation. With the advancement of deep learning technologies and LLMs, automated feature generation has seen significant development and application Pan et al. (2020); Wang et al. (2022). In this field, researchers have developed various automated feature engineering methods, such as ExploreKit Katz et al. (2016) and Cognito Khurana et al. (2016). These methods increase efficiency in processing large datasets and reduce the need for manual intervention. However, they often focus on the structural and numerical information of the data table and overlook its semantic and textual information. In addition, the “black box” nature of feature generation with deep learning methods makes the generated features challenging to explain and validate.

2.2 Retrieval Augmented Generation for Large Language Models

Retrieval Augmented Generation (RAG) Lewis et al. (2020) is a technique that integrates information retrieval capabilities with generative language models to enhance the final output. RAG effectively assists LLMs with tasks requiring extensive and specific domain knowledge Zhang et al. (2024a); Huang and Huang (2024). Given a domain-specific task, the LLM can access a large external library, which might be a set of documents or knowledge related to the task Hu and Lu (2024). The LLM then generates a query according to the task information and uses it for searching based on the similarity between the query and the candidate documents. This approach helps LLMs produce responses that are more accurate and domain-relevant and reduces hallucinations Yang et al. (2024). However, the vanilla results of RAG cannot be used directly for feature generation, which requires additional processing to align with the existing features Li et al. (2024). This is the main difference between the goal of general RAG and our proposed framework for feature generation.

3 Problem Statement

We formulate the task as searching for potential features to enrich and reconstruct an optimal and explainable feature representation space that advances certain downstream tasks, such as classification, regression, etc. Concretely, we denote the original tabular dataset as $\mathcal{D}_0 = \{\mathcal{F}_0; y\}$, which includes an original feature set $\mathcal{F}_0 = \{f_i\}_{i=1}^{I_0}$ and its target label $y$, with textual information $\mathcal{C}_0$ including the labels and the data description. Our optimization objective is to automatically search for new features $\{g_t\}, t = 1, 2, \ldots$ that can reconstruct an optimal feature set $\mathcal{F}^{*}$:

$\mathcal{F}^{*} = \underset{\hat{\mathcal{F}}}{\operatorname{argmax}}\; \mathcal{P}_{\mathcal{A}}(\hat{\mathcal{F}}, y),$   (1)

where $\mathcal{A}$ is a downstream ML task (e.g., classification, regression, ranking, detection), $\mathcal{P}_{\mathcal{A}}$ is the performance indicator of $\mathcal{A}$, and $\hat{\mathcal{F}} = \{\mathcal{F}_0, \{g_t\}\}$ is the optimized feature set reconstructed from $\mathcal{F}_0$.
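To make the objective concrete, the performance indicator $\mathcal{P}_{\mathcal{A}}(\hat{\mathcal{F}}, y)$ can be estimated, for example, by cross-validating a downstream model on a candidate feature set. The following is a minimal sketch; the choice of a random forest classifier, accuracy scoring, and 5-fold cross-validation is our assumption and is not prescribed by the paper.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def performance(features: pd.DataFrame, y: pd.Series) -> float:
    """Estimate P_A(F_hat, y): mean cross-validated accuracy of a
    downstream classifier trained on the candidate feature set.
    Assumes the features are already numeric or encoded."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(model, features, y, cv=5, scoring="accuracy").mean()
```

A candidate feature set $\hat{\mathcal{F}}$ would then be preferred over another whenever its score under this estimator is higher.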

4 Methodology

[Figure 2: Overview of the TIFG framework with its three stages: retrieval with texts and features, feature generation and adjustments, and LLM evaluation and data table update.]

In this section, we introduce our novel feature generation method, Text-Informed Feature Generation (TIFG), which generates features dynamically and automatically by mining the textual information of a dataset, including feature labels and data descriptions, with a large language model (LLM). Within this framework, we use the LLM as a text miner and a feature generator: it embeds and conducts retrieval based on the textual information of the dataset and the existing features, and then generates new features according to the retrieved external knowledge and the output of RAG. The model evaluates the quality and reliability of newly generated features to optimize and update the feature set. The overview of the TIFG framework is shown in Figure 2, which includes three stages: (1) Retrieval with Texts and Features; (2) Feature Generation and Adjustments; and (3) LLM Evaluation and Data Table Update.

4.1 Retrieval with Texts and Features

In this stage, our objective is to identify and extract potential new features and their generation methods that could enhance the performance of downstream tasks. First, we collect the textual information of the current feature set $\mathcal{F} = \mathcal{F}_{t-1} = \{f_i\}_{i=1}^{I}$, including the goals of the downstream tasks $\mathcal{A}'$ and the textual information $\mathcal{C} = \mathcal{C}_{t-1}$, to construct a query $\mathcal{Q}_t$ through a pre-trained LLM $p_\phi$:

$\mathcal{Q}_t = p_\phi(\mathcal{A}', \mathcal{C}_{t-1}),$   (2)

where $p_\phi$ denotes the LLM with parameters $\phi$, and $t = 1, 2, \ldots$ indexes the iterations. After integration, the LLM embeds the query:

$q_t = p_\phi(\operatorname{embed}(\mathcal{Q}_t)),$   (3)

where $\operatorname{embed}(\cdot)$ is the embedding process that transforms the input into a vector representation. We then search the whole library for the documents most relevant to the query $\mathcal{Q}_t$. We evaluate the relevance of each candidate document $\mathcal{R}_j$, $j = 1, 2, \ldots$, to the query $\mathcal{Q}_t$ by the cosine similarity of their embeddings and select the top-$k$ documents with the highest similarity:

$\operatorname{sim}(q_t, r_j) = \dfrac{q_t \cdot r_j}{\|q_t\| \, \|r_j\|},$   (4)

where $r_j = p_\phi(\operatorname{embed}(\mathcal{R}_j))$ and $\|\cdot\|$ denotes the vector norm.
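As a concrete illustration of this retrieval step, the sketch below embeds candidate documents and ranks them by the cosine similarity of Eq. (4). The `embed` argument is a placeholder for any text-to-vector embedding model; the paper does not prescribe a specific backend.

```python
import numpy as np

def cosine_sim(q: np.ndarray, r: np.ndarray) -> float:
    # Eq. (4): sim(q_t, r_j) = (q_t . r_j) / (||q_t|| * ||r_j||)
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

def retrieve_top_k(query_vec: np.ndarray, documents: list[str], embed, k: int = 5) -> list[str]:
    """Rank candidate documents by cosine similarity to the query embedding
    and return the k most relevant ones."""
    ranked = sorted(documents, key=lambda d: cosine_sim(query_vec, embed(d)), reverse=True)
    return ranked[:k]
```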

The retrieved set $\mathcal{R} = \{\mathcal{R}_k\}, k = 1, 2, \ldots$ consists of documents $\mathcal{R}_k$, each containing information about potential features $\{g_k\}, k = 1, 2, \ldots$ that can be derived through specific calculations, combinations, or judgments over the existing features in $\mathcal{F}_{t-1}$:

$g_k = \{p_\phi(o(\mathcal{F}_{t-1}) \mid \mathcal{R}_k)\},$   (5)

where $o$ denotes a calculation, combination, or judgment action on the features in $\mathcal{F}_{t-1}$. Specifically, we define the operations as follows (see the code sketch after this list):

  • Calculation involves arithmetic transformations, statistical measures, or algorithmic functions applied to derive $g_t$. In our case, we define a calculator for features as a function $o_c: \mathbb{R}^n \rightarrow \mathbb{R}$ applied to a vector of existing features, resulting in a scalar or vector output as the new feature.

  • Combination involves integrating two or more features into a single new feature through methods such as concatenation, averaging, or more complex fusion techniques. In our case, we define a calculator for combination as a function $o_f: \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^p$, where $n$, $m$, and $p$ are the dimensions of the input features and the resulting feature, respectively.

  • Judgment involves logical or rule-based decision-making processes that integrate domain knowledge or empirical rules to form a new feature. In our case, we define a calculator for judgment as a decision function $o_d: \mathcal{X} \rightarrow \{0, 1\}$, where $\mathcal{X}$ is the input space composed of one or more feature vectors, and the output is a new categorical feature based on predefined criteria.
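The following minimal sketch illustrates the three operation types on a pandas data table. The column names and the overweight threshold are illustrative assumptions, not values taken from the paper's datasets.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 160, 182],
    "weight_kg": [70, 55, 95],
    "city": ["Portland", "Montreal", "Austin"],
    "state": ["OR", "QC", "TX"],
})

# Calculation (o_c): arithmetic transformation of existing numeric features,
# here a BMI-style ratio derived from height and weight.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Combination (o_f): fuse two features into one new feature,
# here by string concatenation.
df["location"] = df["city"] + ", " + df["state"]

# Judgment (o_d): rule-based decision producing a categorical {0, 1} feature,
# here an overweight indicator based on a fixed BMI threshold.
df["overweight"] = (df["bmi"] >= 25).astype(int)
```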

4.2 Feature Generation and Adjustments

In this stage, our objective is to generate a new feature and align it with the existing features in the data table. Now we have introduced external knowledge about features through RAG. This knowledge includes the deeper structural details and relationships between features, which are not directly reflected in the original dataset. Here we adopt the LLM to analyze and extract feature structures and relationships, and further generate new features to expand the feature space.

From the $k$ retrieved documents with potential features $\{g_k\}$, the LLM first assesses these potential features and then ranks them based on their potential to enhance downstream task performance. To aid this evaluation, we employ a Chain-of-Thought Wei et al. (2022) strategy that encourages the LLM to formulate logical hypotheses and reason step by step about the possible outcomes of integrating each potential feature into the data table.

The LLM then selects the most promising document $\mathcal{R}_t \in \{\mathcal{R}_k\}$ from these candidates and generates a new feature $g_t$ according to $\mathcal{R}_t$. We integrate $g_t$ into the data table:

$g_t = \{p_\phi(o(\mathcal{F}_{t-1}) \mid \mathcal{R}_t)\}, \quad \mathcal{F}_t = \{\mathcal{F}_{t-1}, g_t\},$   (6)
$\mathcal{D}_t = \{\mathcal{F}_t; y\} = \{\{\mathcal{F}_{t-1}, g_t\}; y\}.$   (7)

Here, $g_t$ is generated by the corresponding calculation, combination, or judgment method $o_t$ described in document $\mathcal{R}_t$, applied to the existing features, and has the same length as each $f_i$.

Finally, we feed $\mathcal{D}_t$ into the downstream task and obtain the performance metric $\mathcal{P}_t$, which is then fed back to the LLM for performance evaluation.

4.3 LLM Evaluation and Data Table Update

In this stage, our objective is to test the effectiveness of the newly generated feature and decide whether to keep it. We decide based on the improvement $\mathcal{P}_t - \mathcal{P}_{t-1}$, where a higher value of $\mathcal{P}$ indicates better performance:

$\mathcal{F}_t = \begin{cases} \mathcal{F}_t & \text{if } \mathcal{P}_t > \mathcal{P}_{t-1}, \\ \mathcal{F}_{t-1} & \text{otherwise}. \end{cases}$   (8)

When adding the new feature $g_t$ improves the downstream task performance on the data table, we formally adopt $g_t$ as a new feature within the table. The LLM updates the dataset's textual information $\mathcal{C} = \mathcal{C}_t = p_\phi(\mathcal{C}_{t-1} \mid \mathcal{R}_t)$ to incorporate the information related to $g_t$. The feature generation process expands the dimensionality of the feature space. As the feature space expands, the model can understand and interpret the data more comprehensively, further improving the performance of downstream tasks and enhancing generalization.

After that, we move forward to generate new queries and search for more potential features with the updated data table $\mathcal{D}_t$ and data description $\mathcal{C}_t$. This iterative process continues until a predefined maximum number of iterations $T$ is reached, or until the optimal feature set is found upon achieving peak performance in the downstream task.

Algorithm 1: The TIFG framework.

1: Input: tabular dataset $\mathcal{D}_0 = \{\mathcal{F}_0; y\}$, textual information $\mathcal{C}_0$, goals of downstream tasks $\mathcal{A}'$, maximum number of iterations $T$, LLM $p_\phi$
2: $\mathcal{F} \leftarrow \mathcal{F}_0$, $\mathcal{C} \leftarrow \mathcal{C}_0$, $\mathcal{D} \leftarrow \mathcal{D}_0$
3: $\mathcal{P} \leftarrow \text{DownStreamTask}(\mathcal{D})$
4: for $t = 1$ to $T$ do
5:   $\mathcal{Q}_t \leftarrow p_\phi(\mathcal{A}', \mathcal{C})$
6:   $q_t \leftarrow p_\phi(\operatorname{embed}(\mathcal{Q}_t))$
7:   $\{\mathcal{R}_k\} \leftarrow \text{RetrieveFromLibrary}(q_t)$   ▷ Retrieve $\{\mathcal{R}_k\}$ with top-$k$ cosine similarity.
8:   $\mathcal{R}_t \leftarrow p_\phi(\{\mathcal{R}_k\})$   ▷ The LLM selects the most promising document among the candidates.
9:   $g_t \leftarrow \{p_\phi(o(\mathcal{F}) \mid \mathcal{R}_t)\}$
10:  $\mathcal{C}_t \leftarrow p_\phi(\mathcal{C}_{t-1} \mid \mathcal{R}_t)$
11:  $\mathcal{D}_t \leftarrow \{\{\mathcal{F}, g_t\}; y\}$
12:  $\mathcal{P}_t \leftarrow \text{DownStreamTask}(\mathcal{D}_t)$
13:  if $\mathcal{P}_t > \mathcal{P}$ then
14:    $\mathcal{P} \leftarrow \mathcal{P}_t$, $\mathcal{F} \leftarrow \{\mathcal{F}, g_t\}$, $\mathcal{C} \leftarrow \mathcal{C}_t$
15:  end if
16: end for
17: return $\mathcal{F}$
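The overall loop can also be summarized in the Python sketch below. The retrieval- and LLM-related callables (build_query, embed, retrieve_top_k, select_document, generate_feature, update_description) stand for the prompted-LLM steps described above, and downstream_score stands for the DownStreamTask evaluation; they are placeholders we introduce for illustration, not concrete APIs from the paper.

```python
import pandas as pd

def tifg(df: pd.DataFrame, y: pd.Series, description: str, task_goal: str, T: int,
         build_query, embed, retrieve_top_k, select_document,
         generate_feature, update_description, downstream_score) -> pd.DataFrame:
    """Sketch of Algorithm 1: iteratively generate candidate features and
    keep only those that improve downstream performance."""
    best_score = downstream_score(df, y)                       # line 3
    for t in range(1, T + 1):
        query = build_query(task_goal, description)            # line 5
        docs = retrieve_top_k(embed(query))                    # lines 6-7
        doc = select_document(docs)                            # line 8
        new_col, values = generate_feature(df, doc)            # line 9
        candidate = df.assign(**{new_col: values})             # line 11
        score = downstream_score(candidate, y)                 # line 12
        if score > best_score:                                 # lines 13-14
            best_score = score
            df = candidate
            description = update_description(description, doc)  # line 10
    return df
```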

5 Experiments

In this section, we present four experiments to demonstrate the effectiveness and impact of TIFG. First, we compare the performance of TIFG against several baseline methods on four downstream tasks. Second, we present the information gain during feature generation. Then, we showcase the relationship between the new features and the existing features. Finally, we discuss the reasons for the performance improvement.

5.1 Experiments Settings

Datasets. We evaluate TIFG on four real-world datasets from Kaggle: Insurance Claim Prediction (ICP) Gupta (2023), Animal Information Dataset (AID) Banerjee (2023), Global Country Information Dataset 2023 (GCI) cou (2023), and Diabetes Health Indicators Dataset (DIA) Teboul (2023). Detailed information is shown in Table 1. For each dataset, we randomly select 55% of the data for training.

Table 1: Statistics of the four datasets.

| Datasets | Samples | Features | Class | Target |
| ICP | 15000 | 12 | 4 | Claim |
| AID | 205 | 15 | 10 | Social Structure |
| GCI | 195 | 34 | 4 | Life Expectancy |
| DIA | 253680 | 21 | 2 | Diabetes Binary |

Metrics. We evaluate the model performance by the following metrics: Overall Accuracy (Acc) measures the proportion of true results (both true positives and true negatives) in the total dataset. Precision (Prec) reflects the ratio of true positive predictions to all positive predictions for each class. Recall (Rec), also known as sensitivity, reflects the ratio of true positive predictions to all actual positives for each class. F-Measure (F1) is the harmonic mean of precision and recall, providing a single score that balances both metrics.
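For reference, the four metrics can be computed with scikit-learn as sketched below. We assume macro averaging for the per-class precision, recall, and F1, since the paper does not state the averaging scheme explicitly.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Return the four metrics reported in the experiments."""
    return {
        "Acc":  accuracy_score(y_true, y_pred),
        "Prec": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "Rec":  recall_score(y_true, y_pred, average="macro", zero_division=0),
        "F1":   f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```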

Downstream Tasks and Baseline Models. We apply TIFG across a range of downstream classification models, including Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN), and Multilayer Perceptron (MLP). We compare the performance on these tasks both with and without our method.

We compare TIFG with the raw data (Raw), the Least Absolute Shrinkage and Selection Operator (Lasso) Zhang et al. (2024b), Feature Engineering for Predictive Modeling using Reinforcement Learning (RL) Khurana et al. (2018), and direct LLM generation OpenAI.

RAG Settings. We employ Wikipedia (en.wikipedia.org) as the external knowledge library for RAG. Further, we utilize the Google API (https://developers.google.com/custom-search/v1/introduction) to search across the web and enhance the search outcomes. However, although we can retrieve the relevant information we want, the results usually contain a substantial amount of irrelevant data, such as text from interactive web interfaces. Thus, before applying these search results to generate prompts, we adopt the LLM to clean and extract essential information from the raw results and filter out the irrelevant content. We then utilize the refined information to generate prompts and develop features.

5.2 Experimental Results

Overall Performance. Table 3 shows the overall results of TIFG and the baseline models. From the table, we can see that TIFG consistently surpasses the baseline methods across a variety of metrics and datasets. Table 2 shows the number of newly generated features for each dataset. On the ICP dataset, TIFG achieves the highest accuracy of 97.9%, a significant improvement of 4.9% over the raw data and 2.4% over the LLM method for the Random Forest (RF) model. For the GCI dataset, TIFG boosts accuracy to 81.6%, an impressive 15.1% increase compared to raw data on the same RF task and 4.3% over the LLM method, demonstrating its robustness in handling diverse data characteristics.

In terms of precision, TIFG also demonstrates superior performance across datasets. Notably, on the GCI dataset under the RF model, TIFG reaches a precision of 79.1%, surpassing the highest baseline precision of 77.8% achieved by the LLM method. This result highlights TIFG's capability to reduce misclassifications.

For the F1 score, TIFG on the GCI dataset improves performance significantly, from 63.6% on the raw data to 78.8% within three iterations, reaching an F1 score of 79.7%. As the F1 score is the harmonic mean of precision and recall, this marked improvement underscores TIFG's ability to balance reducing misclassifications and minimizing missed classifications. Such enhancements underscore TIFG's overall efficacy in enriching feature spaces and enhancing model performance across multiple metrics.

Table 2: Number of features before and after TIFG.

| Datasets | Original | TIFG |
| ICP | 12 | 16 (+4) |
| AID | 15 | 16 (+1) |
| GCI | 34 | 39 (+5) |
| DIA | 21 | 23 (+2) |

Table 3: Overall performance of TIFG and the baselines on four datasets (ICP, AID, GCI, DIA) with four downstream models (RF, DT, KNN, MLP).

Acc
| Method | RF ICP/AID/GCI/DIA | DT ICP/AID/GCI/DIA | KNN ICP/AID/GCI/DIA | MLP ICP/AID/GCI/DIA |
| Raw | 0.949 / 0.635 / 0.674 / 0.859 | 0.964 / 0.423 / 0.571 / 0.794 | 0.613 / 0.462 / 0.510 / 0.847 | 0.841 / 0.404 / 0.408 / 0.864 |
| Lasso | 0.975 / 0.654 / 0.735 / 0.860 | 0.970 / 0.442 / 0.592 / 0.796 | 0.663 / 0.519 / 0.571 / 0.849 | 0.853 / 0.442 / 0.510 / 0.863 |
| RL | 0.960 / 0.687 / 0.755 / 0.859 | 0.965 / 0.519 / 0.592 / 0.796 | 0.643 / 0.577 / 0.531 / 0.848 | 0.844 / 0.558 / 0.531 / 0.864 |
| LLM | 0.975 / 0.673 / 0.776 / 0.862 | 0.972 / 0.539 / 0.633 / 0.799 | 0.668 / 0.596 / 0.571 / 0.849 | 0.856 / 0.539 / 0.510 / 0.863 |
| TIFG | 0.979 / 0.712 / 0.816 / 0.864 | 0.973 / 0.596 / 0.674 / 0.815 | 0.712 / 0.635 / 0.653 / 0.859 | 0.895 / 0.596 / 0.612 / 0.867 |

Prec
| Method | RF ICP/AID/GCI/DIA | DT ICP/AID/GCI/DIA | KNN ICP/AID/GCI/DIA | MLP ICP/AID/GCI/DIA |
| Raw | 0.949 / 0.175 / 0.675 / 0.681 | 0.964 / 0.118 / 0.551 / 0.588 | 0.580 / 0.112 / 0.450 / 0.640 | 0.851 / 0.115 / 0.442 / 0.702 |
| Lasso | 0.975 / 0.188 / 0.690 / 0.684 | 0.970 / 0.150 / 0.607 / 0.590 | 0.648 / 0.137 / 0.562 / 0.647 | 0.859 / 0.132 / 0.582 / 0.703 |
| RL | 0.960 / 0.245 / 0.763 / 0.683 | 0.965 / 0.319 / 0.554 / 0.590 | 0.639 / 0.181 / 0.450 / 0.642 | 0.851 / 0.131 / 0.551 / 0.708 |
| LLM | 0.975 / 0.322 / 0.778 / 0.695 | 0.972 / 0.223 / 0.609 / 0.595 | 0.651 / 0.170 / 0.559 / 0.644 | 0.862 / 0.166 / 0.509 / 0.701 |
| TIFG | 0.979 / 0.270 / 0.791 / 0.718 | 0.973 / 0.294 / 0.669 / 0.605 | 0.707 / 0.179 / 0.658 / 0.676 | 0.896 / 0.220 / 0.572 / 0.729 |

Rec
| Method | RF ICP/AID/GCI/DIA | DT ICP/AID/GCI/DIA | KNN ICP/AID/GCI/DIA | MLP ICP/AID/GCI/DIA |
| Raw | 0.949 / 0.156 / 0.640 / 0.570 | 0.964 / 0.126 / 0.564 / 0.597 | 0.612 / 0.169 / 0.426 / 0.573 | 0.840 / 0.090 / 0.416 / 0.571 |
| Lasso | 0.974 / 0.194 / 0.720 / 0.575 | 0.970 / 0.141 / 0.594 / 0.599 | 0.661 / 0.119 / 0.544 / 0.580 | 0.854 / 0.137 / 0.528 / 0.591 |
| RL | 0.960 / 0.206 / 0.742 / 0.568 | 0.965 / 0.244 / 0.549 / 0.598 | 0.641 / 0.200 / 0.439 / 0.579 | 0.843 / 0.156 / 0.541 / 0.589 |
| LLM | 0.976 / 0.268 / 0.773 / 0.572 | 0.972 / 0.262 / 0.631 / 0.605 | 0.667 / 0.181 / 0.588 / 0.575 | 0.855 / 0.184 / 0.517 / 0.604 |
| TIFG | 0.979 / 0.208 / 0.791 / 0.571 | 0.973 / 0.398 / 0.703 / 0.596 | 0.709 / 0.214 / 0.654 / 0.543 | 0.895 / 0.215 / 0.556 / 0.568 |

F1
| Method | RF ICP/AID/GCI/DIA | DT ICP/AID/GCI/DIA | KNN ICP/AID/GCI/DIA | MLP ICP/AID/GCI/DIA |
| Raw | 0.949 / 0.146 / 0.636 / 0.587 | 0.964 / 0.122 / 0.556 / 0.592 | 0.578 / 0.115 / 0.432 / 0.587 | 0.840 / 0.101 / 0.425 / 0.589 |
| Lasso | 0.975 / 0.180 / 0.679 / 0.594 | 0.970 / 0.137 / 0.599 / 0.594 | 0.640 / 0.123 / 0.545 / 0.595 | 0.850 / 0.130 / 0.531 / 0.614 |
| RL | 0.960 / 0.209 / 0.741 / 0.584 | 0.965 / 0.268 / 0.549 / 0.594 | 0.622 / 0.190 / 0.429 / 0.594 | 0.843 / 0.142 / 0.537 / 0.611 |
| LLM | 0.975 / 0.264 / 0.772 / 0.590 | 0.972 / 0.236 / 0.617 / 0.600 | 0.645 / 0.164 / 0.536 / 0.590 | 0.855 / 0.172 / 0.439 / 0.628 |
| TIFG | 0.979 / 0.216 / 0.788 / 0.588 | 0.973 / 0.318 / 0.681 / 0.600 | 0.696 / 0.192 / 0.652 / 0.547 | 0.894 / 0.204 / 0.546 / 0.585 |

Information Gain. We then present the information gain during the feature generation process. Here we adopt information entropy to quantify the information in a dataset. Figure 3 shows the information entropy of each dataset before and after TIFG. We can see that the new features contribute differently to the information entropy. The fundamental reason TIFG enhances task performance is that it introduces new information that enriches the dataset's overall information content.

[Figure 3: Information entropy of each dataset before and after TIFG.]
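A minimal sketch of the entropy computation we have in mind is given below: each column is discretized into bins and the Shannon entropies are summed over columns, so the gain is the difference between the augmented and original tables. The binning scheme is an assumption; the paper does not specify how the entropy is estimated.

```python
import numpy as np
import pandas as pd

def column_entropy(col: pd.Series, bins: int = 10) -> float:
    """Shannon entropy of one feature, discretizing numeric columns into bins."""
    if pd.api.types.is_numeric_dtype(col):
        col = pd.cut(col, bins=bins)
    probs = col.value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

def entropy_gain(df_before: pd.DataFrame, df_after: pd.DataFrame) -> float:
    """Information entropy added by the newly generated features."""
    return (sum(column_entropy(df_after[c]) for c in df_after.columns)
            - sum(column_entropy(df_before[c]) for c in df_before.columns))
```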

Case Study. In this case study, we present how our approach adds new features to the Global Country Information Dataset. Here, TIFG generates and adds five new features to the dataset, together with their feature labels and data descriptions. TIFG also generates the corresponding calculation methods and explanations for the new features. We demonstrate the resulting changes in model performance and information gain below.

In Table 4, we show the feature counts and data description before and after TIFG, and in Table 5 we detail the newly generated features, in order of generation, along with their formulas and meanings.

We first compare the changes in downstream accuracy and information gain as new features are generated. Figure 4 shows that each new feature improves the downstream metrics and adds extra information to the dataset. TIFG increases the dimensionality of the dataset's feature space by incorporating external information; thus, the model in a downstream task can better capture the complex relationships between features and improve task performance.

Table 4: Feature count and data description of GCI before and after TIFG.

| Dataset | Number of Features | Data Description |
| Original | 35 | This comprehensive dataset provides a wealth of information about all countries worldwide, …, enabling in-depth analyses and cross-country comparisons. |
| New | 40 (+5) | This comprehensive dataset provides a wealth of information about all countries worldwide, …, enabling in-depth analyses and cross-country comparisons. Newly added to this dataset are five key variables designed to deepen insights into economic pressures, population density, resource utilization, educational investments, and environmental stress. |
Table 5: New features generated by TIFG for the GCI dataset.

| No. | Label | Calculation | Reasoning |
| 1 | Population Load Ratio | Population / Land Area (Km²) | This ratio shows the number of people per square kilometer, indicating the population load of a country. |
| 2 | Resource Utilization Rate | (Agricultural Land (%) + Forested Area (%)) / 100 | This ratio indicates the proportion of land used for agriculture and forestry, with higher rates suggesting greater utilization of resources. |
| 3 | Education Investment Effectiveness | (Gross Primary Education Enrollment (%) + Gross Tertiary Education Enrollment (%)) / 2 | This rate shows the average level of investment in primary and higher education in a country, where higher educational investment is usually associated with better economic development. |
| 4 | Environmental Stress Index | CO2 Emissions / (Forested Area (%) / 100 × Land Area (Km²)) | This index reflects the environmental stress of a country, specifically the amount of carbon dioxide emissions per unit of forest area, with higher values indicating greater environmental pressure. |
| 5 | GDP per Capita | GDP / Population | GDP per capita is an important measure of a country's economic level and the standard of living of its residents. |
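The five GCI features in Table 5 translate directly into column arithmetic. A sketch is given below, assuming the raw Kaggle columns have already been renamed to the identifiers used here (the actual dataset headers differ slightly, e.g. "Co2-Emissions").

```python
import pandas as pd

def add_tifg_gci_features(gci: pd.DataFrame) -> pd.DataFrame:
    """Append the five TIFG-generated GCI features from Table 5."""
    gci = gci.copy()
    gci["Population Load Ratio"] = gci["Population"] / gci["Land Area (Km2)"]
    gci["Resource Utilization Rate"] = (
        gci["Agricultural Land (%)"] + gci["Forested Area (%)"]) / 100
    gci["Education Investment Effectiveness"] = (
        gci["Gross primary education enrollment (%)"]
        + gci["Gross tertiary education enrollment (%)"]) / 2
    gci["Environmental Stress Index"] = gci["Co2-Emissions"] / (
        gci["Forested Area (%)"] / 100 * gci["Land Area (Km2)"])
    gci["GDP per Capita"] = gci["GDP"] / gci["Population"]
    return gci
```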
[Figure 4: Downstream accuracy and information gain as each new feature is added to GCI.]
[Figure 5: Correlation between the newly generated features and key existing features in GCI.]

Correlation. We further explore the correlation between the new features generated by TIFG and the existing features in the Global Country Information Dataset. Figure 5 shows a detailed correlation analysis between the newly generated features and several key features in the dataset. These correlations validate the effectiveness of our TIFG method and reveal deeper connections between the new features and critical socio-economic indicators, which enhances the model's ability to understand and capture the deeper structure of the data. For example, the high correlation of the Population Load Ratio with land area and urbanization rate highlights issues of population density distribution, and the connection between the Resource Utilization Rate and the Environmental Stress Index reflects the close relationship between resource management and environmental protection. These correlation results support the soundness and correctness of the new features.
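The correlation analysis itself is straightforward with pandas. A minimal sketch is below, assuming the augmented GCI table `gci` from the previous snippet; the selected original columns are our illustrative choice.

```python
new_feats = ["Population Load Ratio", "Resource Utilization Rate",
             "Education Investment Effectiveness",
             "Environmental Stress Index", "GDP per Capita"]
orig_feats = ["Land Area (Km2)", "Urban_population", "GDP",
              "Co2-Emissions", "Population"]

# Pearson correlation between each new feature and selected original ones.
corr = gci[new_feats + orig_feats].corr(method="pearson")
print(corr.loc[new_feats, orig_feats].round(2))
```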

6 Conclusion

In this paper, we introduce TIFG, a novel method that utilizes an LLM for text-informed feature generation. Our goal is to enrich data by effectively utilizing textual information to generate domain-specific features. TIFG leverages an LLM to extract and integrate the textual information and to retrieve relevant and reliable knowledge for generating new features. To ensure the correctness of generation, we adopt RAG with external knowledge to consistently produce reliable and precise features. These features significantly enhance machine learning models without high resource costs or domain-specific fine-tuning. The experimental results demonstrate that TIFG outperforms existing methods in generating meaningful features and enhancing performance across various domains. Moreover, TIFG's automated and adaptable framework continually evolves with new data and shows great potential for broader applicability in various fields. In future work, we plan to refine and extend the TIFG paradigm to more feature engineering methods.

Limitations and Ethics Statements

While TIFG shows significant advancements and wide adaptability, several limitations require further exploration, including its computational demands and limited scalability to complex tasks or large external libraries. In addition, the effectiveness of the retrieval and the final output relies heavily on the quality of the external library, which might affect model performance, especially with poorly organized prompts or limited external knowledge. Finally, adopting TIFG in domain-specific tasks with unique requirements may face inherent limitations.

The TIFG framework can enrich data with textual information and external knowledge. However, as the framework uses pre-trained GPT-3.5 Turbo and Wikipedia as the generation model and external knowledge base, respectively, it may inherit the ethical concerns associated with these resources, such as responding to harmful queries or exhibiting biased behaviors.

References

  • cou (2023). Countries of the World 2023. Kaggle. Available online: https://www.kaggle.com/datasets/nelgiriyewithana/countries-of-the-world-2023?select=world-data-2023.csv.
  • Sourav Banerjee. 2023. Animal Information Dataset. Kaggle dataset.
  • Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
  • Suresh Gupta. 2023. Health Insurance Data Set. Kaggle. Accessed: date-of-access.
  • Yucheng Hu and Yuxing Lu. 2024. RAG and RAU: A survey on retrieval-augmented language model in natural language processing. arXiv preprint arXiv:2404.19543.
  • Yizheng Huang and Jimmy Huang. 2024. A survey on retrieval-augmented text generation for large language models. arXiv preprint arXiv:2404.10981.
  • Gilad Katz, Eui Chul Richard Shin, and Dawn Song. 2016. ExploreKit: Automatic feature generation and selection. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 979–984.
  • Udayan Khurana, Horst Samulowitz, and Deepak Turaga. 2018. Feature engineering for predictive modeling using reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasrathy. 2016. Cognito: Automated feature engineering for supervised learning. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 1304–1307. IEEE.
  • Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  • Jiahao Li, Quan Wang, Licheng Zhang, Guoqing Jin, and Zhendong Mao. 2024. Feature-adaptive and data-scalable in-context learning. arXiv preprint arXiv:2405.10738.
  • OpenAI. GPT-3.5 Turbo fine-tuning and API updates. https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates/. Accessed: 2024-05-20.
  • Tongyang Pan, Jinglong Chen, Jingsong Xie, Zitong Zhou, and Shuilong He. 2020. Deep feature generating network: A new method for intelligent fault detection of mechanical systems under class imbalance. IEEE Transactions on Industrial Informatics, 17(9):6282–6293.
  • Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634.
  • Aliaksei Severyn and Alessandro Moschitti. 2013. Automatic feature engineering for answer selection and extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 458–467.
  • Hongtao Shi, Hongping Li, Dan Zhang, Chaqiu Cheng, and Xuanxuan Cao. 2018. An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification. Computer Networks, 132:81–98.
  • Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1161–1170.
  • Alex Teboul. 2023. Diabetes Health Indicators Dataset. Kaggle. Accessed: date-of-access.
  • Dongjie Wang, Yanjie Fu, Kunpeng Liu, Xiaolin Li, and Yan Solihin. 2022. Group-wise reinforcement feature generation for optimal and explainable representation space reconstruction. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1826–1834.
  • Zaitian Wang, Pengfei Wang, Kunpeng Liu, Pengyang Wang, Yanjie Fu, Chang-Tien Lu, Charu C. Aggarwal, Jian Pei, and Yuanchun Zhou. 2024. A comprehensive survey on data augmentation. arXiv preprint arXiv:2405.09591.
  • Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Ziyu Xiang, Mingzhou Fan, Guillermo Vázquez Tovar, William Trehern, Byung-Jun Yoon, Xiaofeng Qian, Raymundo Arroyave, and Xiaoning Qian. 2021. Physics-constrained automatic feature engineering for predictive modeling in materials science. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10414–10421.
  • Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J. Prenger, and Animashree Anandkumar. 2024. LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36.
  • Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E. Gonzalez. 2024a. RAFT: Adapting language model to domain specific RAG. arXiv preprint arXiv:2403.10131.
  • Xinhao Zhang, Zaitian Wang, Lu Jiang, Wanfu Gao, Pengfei Wang, and Kunpeng Liu. 2024b. TFWT: Tabular feature weighting with transformer.
  • Xinhao Zhang, Jinghan Zhang, Banafsheh Rekabdar, Yuanchun Zhou, Pengfei Wang, and Kunpeng Liu. 2024c. Dynamic and adaptive feature generation with LLM. arXiv preprint arXiv:2406.03505.
  • Alice Zheng and Amanda Casari. 2018. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.
  • Guoqiang Zhong, Li-Na Wang, Xiao Ling, and Junyu Dong. 2016. An overview on data representation learning: From traditional feature learning to recent deep learning. The Journal of Finance and Data Science, 2(4):265–278.

Appendix A Appendix

Here we present the detailed correlation heatmap of GCI in Figure 6 and the feature lists before and after TIFG in Table 6.

The heatmap shows the correlation between the original (left) and newly generated (right) features in the Global Country Information Dataset after applying TIFG. The chart includes a matrix whose rows and columns represent the original features with Pearson's correlation values; on the right side, the nodes represent the new features, and the lines indicate the strength of their correlation with the original ones.

We showcase the new features generated by TIFG for all four datasets. The new features are highly relevant to each dataset's target and original features, and they are explainable and reasonable in their specific domains.

[Figure 6: Correlation heatmap between original (left) and newly generated (right) features of GCI.]

Table 6: Original and generated features for each dataset.

| Dataset | Original Features | Generated Features |
| ICP | age, sex, weight, bmi, hereditary_diseases, no_of_dependents, smoker, city, bloodpressure, diabetes, regular_ex, job_title | Comorbidity Score, CholCheck, Healthcare Utilization Score |
| AID | Animal, Height (cm), Weight (kg), Color, Lifespan (years), Diet, Habitat, Predators, Average Speed (km/h), Countries Found, Conservation Status, Family, Gestation Period (days), Top Speed (km/h), Offspring per Birth | Reproductive Efficiency |
| GCI | Country, Density (P/Km2), Abbreviation, Agricultural Land (%), Land Area (Km2), Armed Forces size, Birth Rate, Calling Code, Capital/Major City, Co2-Emissions, CPI, CPI Change (%), Currency-Code, Fertility Rate, Forested Area (%), Gasoline Price, GDP, Gross primary education enrollment (%), Gross tertiary education enrollment (%), Infant mortality, Largest city, Maternal mortality ratio, Minimum wage, Official language, Out of pocket health expenditure, Physicians per thousand, Population, Population: Labor force participation (%), Tax revenue (%), Total tax rate, Unemployment rate, Urban_population, Latitude, Longitude | Population Load Ratio, Resource Utilization Rate, Education Investment Effectiveness, Environmental Stress Index, GDP per Capita |
| DIA | HighBP, HighChol, CholCheck, BMI, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, GenHlth, MentHlth, PhysHlth, DiffWalk, Sex, Age, Education, Income | LifestyleScore, HealthRiskScore |
