TIFG: Text-Informed Feature Generation with Large Language Models (2024)

Xinhao Zhang, Portland State University, xinhaoz@pdx.edu
Jinghan Zhang, Portland State University, jinghanz@pdx.edu
Fengran Mo, University of Montreal, fengran.mo@umontreal.ca
Yuzhong Chen, Visa Research, yuzchen@visa.com
Kunpeng Liu, Portland State University, kunpeng@pdx.edu
*Corresponding Authors.

Abstract

Textual information is of vital importance for data mining and feature engineering. However, existing methods focus on learning the data structures and overlook the textual information that accompanies the data. Consequently, they waste this valuable resource and miss the deeper data relationships embedded in the texts. In this paper, we introduce Text-Informed Feature Generation (TIFG), a novel LLM-based text-informed feature generation framework. TIFG utilizes textual information to generate features by retrieving potentially relevant features from external knowledge with Retrieval-Augmented Generation (RAG). With this approach, TIFG can generate new explainable features that enrich the feature space and further mine feature relationships. We design TIFG as an automated framework that continuously optimizes the feature generation process, adapts to new data inputs, and improves downstream task performance over iterations. A broad range of experiments on various downstream tasks shows that our approach generates high-quality and meaningful features and is significantly superior to existing methods.

1 Introduction

Domain knowledge and world information are usually stored in text form. In data science, textual information such as data descriptions, attributions, and labels is essential for models to learn data structures and feature relationships, as it clarifies the real-world meanings and connections of the data (Xiang et al., 2021). However, feature engineering methods, especially feature generation (Khurana et al., 2018; Wang et al., 2022), often focus solely on the data and overlook the textual content. As a result, the features generated by these methods are usually difficult to explain or lack practical meaning (Zhang et al., 2024c).

[Figure 1]

To enrich data with textual information, it is important to understand the given text and mine its relationships with the features in other forms. Although Large Language Models (LLMs) with powerful contextual capabilities can integrate, extract, and comprehend textual information to some extent, misunderstandings of domain-specific text still occur. This is because the pre-training data used for LLMs cannot include the latest or all relevant expert knowledge (Zheng and Casari, 2018; Wang et al., 2024). Fine-tuning LLMs with domain-specific data helps mitigate this issue but is costly, so a common practice is to retrieve expert knowledge to enable the LLMs to utilize the textual information effectively.

An intuitive approach is to leverage Retrieval-Augmented Generation (RAG) to retrieve relevant information from an external knowledge base or library with a query for the LLM. The LLM can then integrate the retrieved knowledge to enhance its outputs. This approach draws advantages from both the retrieval and generation components, equipping the LLM with state-of-the-art knowledge for specific domains without extra training or fine-tuning. Figure 1 shows an example where the goal is to mine the relationships of features such as Height and Weight with BMI as heart attack factors: we adopt an LLM to learn the labels and read the description of the given information. The LLM then forms a query to retrieve relevant knowledge about BMI and its relation to heart disease. We expect the LLM to extract the vital information, generate the new feature BMI, and enrich the data with explicit relationships between old and new features, which can enhance performance on a domain-specific task. Existing studies on feature generation usually focus only on the data structure (Guo et al., 2017; Song et al., 2019; Shi et al., 2018). Although efficient, these approaches lack transparency and are hard to explain.

Our Targets. We aim to address three main challenges in generating features with textual information: 1) how to extract and utilize the textual information for feature generation, 2) how to generate reliable and explainable features at a low resource cost, and 3) how to achieve an automated and continuous optimization process.

Our Approach. To address the aforementioned challenges, we design TIFG, a novel LLM-based text-informed feature generation method that adopts an LLM to generate features dynamically and automatically by mining the textual information. Specifically, 1) We adopt an LLM to analyze and decode the structures and meanings of textual information, e.g., dataset descriptions and feature labels. The LLM infers, reasons, and generates potential new features, together with their derivation methods and meanings, for the dataset and the associated task. With advanced reasoning capabilities, the LLM can provide a logical and cognitive process for each new feature. 2) We integrate the LLM with Retrieval-Augmented Generation (RAG) technology to provide it with a reliable knowledge base without additional training in specialized domains. With the retrieved relevant textual information, we enrich the knowledge resources fed to the LLM, and the model can then select specific information to generate a new feature. 3) We design an automated feature generation framework that continually refines its feature generation capabilities. Each new feature generated by the LLM is automatically integrated into the data table and evaluated for its effectiveness on downstream tasks. We then keep the successfully validated features, update the textual information of the data table, and start the next iteration cycle. The iteratively generated textual features can significantly enhance domain-specific interpretability to align with the needs of these fields.

In summary, our contributions include:

  1. We introduce a novel text-informed feature generation method, TIFG, which uses an LLM to exploit textual information and external knowledge to generate new features and enrich the feature space for downstream task performance improvement.

  2. We develop a novel paradigm by incorporating RAG into the feature generation process to access reliable external knowledge and enrich the feature space with extra information.

  3. We conduct a series of experiments to validate the effectiveness and robustness of our TIFG method across different tasks. Experimental results demonstrate that our method has clear advantages over existing methods.

2 Related Work

2.1 Automated Feature Generation

Feature Engineering. Feature engineering is an essential part of machine learning that involves selecting, modifying, or creating new features from raw data to improve task performance (Severyn and Moschitti, 2013). The target of this process is to optimize the feature representation space, tailoring the best and most interpretable feature set for a specific machine learning task to improve model accuracy and generalization. Within feature engineering, feature generation refers to approaches that generate new features from existing ones in a dataset, usually through various mathematical or logical transformation operations (Guo et al., 2017; Song et al., 2019). The goal of feature generation is to create complex latent feature spaces (Zhong et al., 2016; Schölkopf et al., 2021). However, existing feature generation methods often lack transparency and require a large amount of manual operation.

Automated Feature Generation. With the advancement of deep learning technologies and LLMs, automated feature generation has seen significant development and application (Pan et al., 2020; Wang et al., 2022). Researchers have developed various automated feature engineering methods, such as ExploreKit (Katz et al., 2016) and Cognito (Khurana et al., 2016). These methods increase efficiency in processing large datasets and reduce the need for manual intervention. However, they often focus on the structural and numerical information of the data table and overlook the semantic and textual information. Moreover, the "black box" nature of feature generation with deep learning methods makes the generated features challenging to explain and validate.

2.2 Retrieval Augmented Generation for Large Language Models

Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) is a technique that integrates information retrieval capabilities with generative language models to enhance the quality of the final output. RAG effectively assists LLMs with tasks requiring extensive and specific domain knowledge (Zhang et al., 2024a; Huang and Huang, 2024). Given a domain-specific task, the LLM can access a large external library, which might be a set of documents or knowledge related to the task (Hu and Lu, 2024). The LLM then generates a query according to the task information and uses it for searching based on the similarity between the query and the candidate documents. This approach helps LLMs produce responses that are more accurate and domain-relevant, and reduces hallucinations (Yang et al., 2024). However, the vanilla output of RAG cannot be used directly for feature generation, which requires additional processing to align with the existing features (Li et al., 2024). This is the main difference between the goal of general RAG and our proposed framework for feature generation.

3 Problem Statement

We formulate the task as searching for potential features to enrich and reconstruct an optimal and explainable feature representation space that advances certain downstream tasks, such as classification and regression. Concretely, we denote the original tabular dataset as $\mathcal{D}_0 = \{\mathcal{F}_0; y\}$, which includes an original feature set $\mathcal{F}_0 = \{f_i\}_{i=1}^{I_0}$ and its target label $y$, together with textual information $\mathcal{C}_0$ including the feature labels and the data description. Our optimization objective is to automatically search for new features $\{g_t\}, t = 1, 2, \ldots$ that can reconstruct an optimal feature set $\mathcal{F}^*$:

$$\mathcal{F}^* = \underset{\hat{\mathcal{F}}}{\operatorname{argmax}}\; \mathcal{P}_{\mathcal{A}}(\hat{\mathcal{F}}, y), \qquad (1)$$

where $\mathcal{A}$ is a downstream ML task (e.g., classification, regression, ranking, detection), $\mathcal{P}$ is the performance indicator of $\mathcal{A}$, and $\hat{\mathcal{F}} = \{\mathcal{F}_0, \{g_t\}\}$ is an optimized feature set reconstructed on $\mathcal{F}_0$.
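To make the objective concrete, the following is a minimal sketch of how the performance indicator $\mathcal{P}_{\mathcal{A}}$ of a candidate feature set could be scored; the use of scikit-learn, a random forest classifier, and cross-validated accuracy is our illustrative assumption, not the paper's prescribed setup.

```python
# Sketch: score a candidate feature set F_hat on a downstream task A
# using cross-validated accuracy as P_A(F_hat, y).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def performance(df: pd.DataFrame, feature_cols: list, target_col: str) -> float:
    """P_A(F_hat, y): mean cross-validated accuracy of the downstream model."""
    X, y = df[feature_cols], df[target_col]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Example: compare the original feature set with one augmented by a generated feature g_t.
# p_old = performance(df, original_features, "target")
# p_new = performance(df, original_features + ["g_t"], "target")
```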

4 Methodology

[Figure 2]

In this section, we introduce our novel feature generation method, Text-Informed Feature Generation (TIFG), which generates features dynamically and automatically by mining the textual information of a dataset, including feature labels and data descriptions, with a large language model (LLM). Within this framework, the LLM serves as a text miner and a feature generator: it embeds the textual information of the dataset and the existing features and conducts retrieval based on them. The LLM then generates new features according to the retrieved external knowledge, i.e., the output of RAG. The model evaluates the quality and reliability of the newly generated features to optimize and update the feature set. The overview of the TIFG framework is shown in Figure 2, which includes three stages: (1) Retrieval with Texts and Features; (2) Feature Generation and Adjustments; and (3) LLM Evaluation and Data Table Update.

4.1 Retrieval with Texts and Features

In this stage, our objective is to identify and extract potential new features and their generation methods that could enhance the performance of downstream tasks. First, we collect the textual information of the current feature set $\mathcal{F} = \mathcal{F}_{t-1} = \{f_i\}_{i=1}^{I}$, including the goals of the downstream tasks $\mathcal{A}'$ and the textual information $\mathcal{C} = \mathcal{C}_{t-1}$, to construct a query $\mathcal{Q}_t$ through a pre-trained LLM $p_\phi$:

$$\mathcal{Q}_t = p_\phi(\mathcal{A}', \mathcal{C}_{t-1}), \qquad (2)$$

where $p_\phi$ denotes the LLM with parameters $\phi$, and $t = 1, 2, \ldots$ is the iteration index. After integration, the LLM embeds the query:

$$q_t = p_\phi(\text{embed}(\mathcal{Q}_t)), \qquad (3)$$

where $\text{embed}(\cdot)$ is the embedding process that transforms the input into a vector representation. We then search the whole library for the top-$k$ documents most relevant to the query $\mathcal{Q}_t$. We evaluate the relevance of each document $\mathcal{R}_j \in \mathcal{R}_k$, $j = 1, 2, \ldots, k$ to the query $\mathcal{Q}_t$ by the cosine similarity of the embeddings and select the top-$k$ documents with the highest similarity:

$$\text{sim}(q_t, r_j) = \frac{q_t \cdot r_j}{\|q_t\| \, \|r_j\|}, \qquad (4)$$

where $r_j = p_\phi(\text{embed}(\mathcal{R}_j))$ and $\|\cdot\| \in \mathbb{R}$ is the norm function.
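The following is a minimal sketch of this retrieval step under our own assumptions: `embed` stands in for the LLM embedding call, and the library is a plain list of document strings.

```python
# Sketch of Eqs. (3)-(4): embed the query, score each library document by
# cosine similarity, and keep the top-k most similar documents.
import numpy as np

def top_k_documents(query, documents, embed, k=5):
    q = np.asarray(embed(query), dtype=float)                        # q_t
    sims = []
    for doc in documents:
        r = np.asarray(embed(doc), dtype=float)                      # r_j
        sims.append(float(q @ r / (np.linalg.norm(q) * np.linalg.norm(r))))
    ranked = sorted(range(len(documents)), key=lambda j: sims[j], reverse=True)
    return [(documents[j], sims[j]) for j in ranked[:k]]
```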

The set of retrieved documents $\mathcal{R} = \{\mathcal{R}_k\}, k = 1, 2, \ldots$ consists of documents $\mathcal{R}_k$, each containing information about potential features $\{g_k\}, k = 1, 2, \ldots$ that can be obtained through specific calculations, combinations, or judgments over the existing features in $\mathcal{F}_{t-1}$:

$$g_k = \{p_\phi(o(\mathcal{F}_{t-1}) \mid \mathcal{R}_k)\}, \qquad (5)$$

where $o$ denotes the calculation, combination, or judgment actions on features in $\mathcal{F}_{t-1}$. Specifically, we define the operations as follows (a minimal code sketch is given after this list):

  • Calculation involves arithmetic transformations, statistical measures, or algorithmic functions applied to derive $g_t$. In our case, we define a calculator for features as a function $o_c: \mathbb{R}^n \rightarrow \mathbb{R}$ applied to a vector of existing features, resulting in a scalar or vector output as the new feature.

  • Combination involves integrating two or more features into a single new feature through methods such as concatenation, averaging, or more complex fusion techniques. In our case, we define a calculator for combination as a function $o_f: \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^p$, where $n$, $m$, and $p$ are the dimensions of the input features and the resulting feature, respectively.

  • Judgment involves logical or rule-based decision-making processes that integrate domain knowledge or empirical rules to form a new feature. In our case, we define a calculator for judgment as a decision function $o_d: \mathcal{X} \rightarrow \{0, 1\}$, where $\mathcal{X}$ is the input space composed of one or more feature vectors, and the output is a new categorical feature based on predefined criteria.
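Below is a minimal sketch of the three operation types on a pandas data table; the column names (Height, Weight, Age, and the two rate columns) are illustrative assumptions, not columns from the evaluated datasets.

```python
# Sketch of the calculation, combination, and judgment operations.
import pandas as pd

def o_calculation(df: pd.DataFrame) -> pd.Series:
    # Calculation: arithmetic transformation of existing features, e.g. BMI.
    return df["Weight"] / (df["Height"] / 100) ** 2

def o_combination(df: pd.DataFrame) -> pd.Series:
    # Combination: fuse two or more features into one, e.g. averaging two rates.
    return (df["RateA"] + df["RateB"]) / 2

def o_judgment(df: pd.DataFrame) -> pd.Series:
    # Judgment: rule-based decision yielding a categorical (0/1) feature.
    return ((df["Age"] >= 45) & (df["BMI"] >= 30)).astype(int)
```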

4.2 Feature Generation and Adjustments

In this stage, our objective is to generate a new feature and align it with the existing features in the data table. Now we have introduced external knowledge about features through RAG. This knowledge includes the deeper structural details and relationships between features, which are not directly reflected in the original dataset. Here we adopt the LLM to analyze and extract feature structures and relationships, and further generate new features to expand the feature space.

From the $k$ retrieved documents with potential features $\{g_k\}$, the LLM first assesses these candidates and then ranks them by their potential to enhance downstream task performance. To support this evaluation, we employ a Chain-of-Thought strategy (Wei et al., 2022) that encourages the LLM to formulate logical hypotheses and reason step by step about the possible outcomes of integrating each potential feature into the data table.

The LLM then selects the most promising document $\mathcal{R}_t \in \{\mathcal{R}_k\}$ from these candidates and generates a new feature $g_t$ according to $\mathcal{R}_t$. We integrate $g_t$ into the data table:

$$g_t = \{p_\phi(o(\mathcal{F}_{t-1}) \mid \mathcal{R}_t)\}, \quad \mathcal{F}_t = \{\mathcal{F}_{t-1}, g_t\}, \qquad (6)$$
$$\mathcal{D}_t = \{\mathcal{F}_t; y\} = \{\{\mathcal{F}_{t-1}, g_t\}; y\}. \qquad (7)$$

Here, $g_t$ is generated by the corresponding calculation, combination, or judgment method $o_t$ described in document $\mathcal{R}_t$ over the existing features, and has the same length as each $f_i$.

Finally, we feed $\mathcal{D}_t$ into the downstream task and obtain the performance metric $\mathcal{P}_t$, which is then fed back to the LLM for performance evaluation.

4.3 LLM Evaluation and Data Table Update

In this stage, our objective is to test the effectiveness of the newly generated feature and decide whether to keep it. We base this decision on the improvement $\mathcal{P}_t - \mathcal{P}_{t-1}$, where a higher value of $\mathcal{P}$ indicates better performance:

$$\mathcal{F}_t = \begin{cases} \mathcal{F}_t & \text{if } \mathcal{P}_t > \mathcal{P}_{t-1}, \\ \mathcal{F}_{t-1} & \text{otherwise}. \end{cases} \qquad (8)$$

When adding the new feature $g_t$ improves the downstream task performance of the data table, we formally adopt $g_t$ as a new feature within the table. The LLM updates the dataset text information $\mathcal{C} = \mathcal{C}_t = p_\phi(\mathcal{C}_{t-1} \mid \mathcal{R}_t)$ to incorporate the information related to $g_t$. The feature generation process expands the dimensionality of the feature space. As the feature space expands, the model can understand and interpret the data more comprehensively, further improving downstream task performance and enhancing generalization capabilities.

After that, we move forward to generate new queries and search for more potential features with the updated data table $\mathcal{D}_t$ and the data description $\mathcal{C}_t$. This iterative process continues until a predefined maximum number of iterations $T$ is reached, or until the optimal feature set is found, i.e., the downstream task achieves peak performance.

Algorithm 1: TIFG
1: Input: tabular dataset $\mathcal{D}_0 = \{\mathcal{F}_0; y\}$, textual information $\mathcal{C}_0$, goals of downstream tasks $\mathcal{A}'$, iteration limit $T$, LLM $p_\phi$
2: $\mathcal{F} \leftarrow \mathcal{F}_0$, $\mathcal{C} \leftarrow \mathcal{C}_0$, $\mathcal{D} \leftarrow \mathcal{D}_0$
3: $\mathcal{P} \leftarrow \text{DownStreamTask}(\mathcal{D})$
4: for $t = 1$ to $T$ do
5:   $\mathcal{Q}_t \leftarrow p_\phi(\mathcal{A}', \mathcal{C})$
6:   $q_t \leftarrow p_\phi(\text{embed}(\mathcal{Q}_t))$
7:   $\{\mathcal{R}_k\} \leftarrow \text{RetrieveFromLibrary}(q_t)$   ▷ Retrieve $\{\mathcal{R}_k\}$ with top-$k$ cosine similarity.
8:   $\mathcal{R}_t \leftarrow p_\phi(\{\mathcal{R}_k\})$   ▷ The LLM selects the most promising document from these candidates.
9:   $g_t \leftarrow \{p_\phi(o(\mathcal{F}) \mid \mathcal{R}_t)\}$
10:  $\mathcal{C}_t \leftarrow p_\phi(\mathcal{C}_{t-1} \mid \mathcal{R}_t)$
11:  $\mathcal{D}_t \leftarrow \{\{\mathcal{F}, g_t\}; y\}$
12:  $\mathcal{P}_t \leftarrow \text{DownStreamTask}(\mathcal{D}_t)$
13:  if $\mathcal{P}_t > \mathcal{P}$ then
14:    $\mathcal{P} \leftarrow \mathcal{P}_t$, $\mathcal{F} \leftarrow \{\mathcal{F}, g_t\}$, $\mathcal{C} \leftarrow \mathcal{C}_t$
15:  end if
16: end for
17: return $\mathcal{F}$
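The following is a minimal Python sketch of Algorithm 1 under our own assumptions: the `llm_*` callables, `retrieve`, and `downstream_task` are hypothetical stand-ins for the LLM prompts, the RAG library lookup, and the downstream evaluation.

```python
# Sketch of the TIFG iteration loop (Algorithm 1). df is a pandas DataFrame.
def tifg(df, description, task_goal, T,
         llm_build_query, retrieve, llm_select_and_generate,
         llm_update_description, downstream_task):
    best_df, best_desc = df, description
    best_p = downstream_task(best_df)                               # P on the original table
    for t in range(1, T + 1):
        query = llm_build_query(task_goal, best_desc)               # Q_t  (Eq. 2)
        docs = retrieve(query, k=5)                                 # {R_k}, top-k by cosine similarity
        name, values, doc = llm_select_and_generate(docs, best_df)  # R_t -> g_t  (Eqs. 5-6)
        cand_df = best_df.assign(**{name: values})                  # D_t = {{F, g_t}; y}  (Eq. 7)
        p_t = downstream_task(cand_df)                              # P_t
        if p_t > best_p:                                            # keep g_t only if it helps (Eq. 8)
            best_df, best_p = cand_df, p_t
            best_desc = llm_update_description(best_desc, doc)      # C_t
    return best_df, best_desc
```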

5 Experiments

In this section, we present four experiments to demonstrate the effectiveness and impact of TIFG. First, we compare the performance of TIFG against several baseline methods on four downstream tasks. Second, we present the information gain during feature generation. Then, we showcase the relationships between new features and existing features. Finally, we discuss the reasons for the performance improvement.

5.1 Experiments Settings

Datasets. We evaluate the TIFG method on four real-world datasets from Kaggle: Insurance Claim Prediction (ICP) (Gupta, 2023), Animal Information Dataset (AID) (Banerjee, 2023), Global Country Information Dataset 2023 (GCI) (cou, 2023), and Diabetes Health Indicators Dataset (DIA) (Teboul, 2023). The detailed information is shown in Table 1. For each dataset, we randomly select 55% of the data for training.

Table 1: Dataset statistics.
Datasets | Samples | Features | Classes | Target
ICP      | 15000   | 12       | 4       | Claim
AID      | 205     | 15       | 10      | Social Structure
GCI      | 195     | 34       | 4       | Life Expectancy
DIA      | 253680  | 21       | 2       | Diabetes Binary

Metrics. We evaluate the model performance by the following metrics: Overall Accuracy (Acc) measures the proportion of true results (both true positives and true negatives) in the total dataset. Precision (Prec) reflects the ratio of true positive predictions to all positive predictions for each class. Recall (Rec), also known as sensitivity, reflects the ratio of true positive predictions to all actual positives for each class. F-Measure (F1) is the harmonic mean of precision and recall, providing a single score that balances both metrics.
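As a reference, this is a minimal sketch of how these four metrics could be computed with scikit-learn; macro-averaging over classes is our assumption, since the averaging scheme is not stated here.

```python
# Sketch: compute Acc, Prec, Rec, and F1 from true and predicted labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "Acc":  accuracy_score(y_true, y_pred),
        "Prec": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "Rec":  recall_score(y_true, y_pred, average="macro", zero_division=0),
        "F1":   f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```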

Downstream Tasks and Baseline Models. We apply TIFG across a range of classification models, including Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN), and Multilayer Perceptron (MLP). We compare the performance on these tasks both with and without our method.

We compare the TIFG method with the raw data (Raw), the Least Absolute Shrinkage and Selection Operator (Lasso) (Zhang et al., 2024b), Feature Engineering for Predictive Modeling using Reinforcement Learning (RL) (Khurana et al., 2018), and direct LLM generation (OpenAI).

RAG Settings. We employ Wikipedia (en.wikipedia.org) as the external knowledge library for RAG. Further, we utilize the Google Custom Search API (https://developers.google.com/custom-search/v1/introduction) to search across the web and enhance the search outcomes. Although we can retrieve the relevant information we want, the results usually contain a substantial amount of irrelevant data, such as text from interactive web interfaces. Thus, before applying these search results to generate prompts, we adopt the LLM to clean and extract essential information from the raw data and filter out the irrelevant content. We then use the refined information to generate prompts and develop features.
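The following is a minimal sketch of this retrieve-and-clean step; `search_web` (e.g., a wrapper around the Google Custom Search API) and `llm` are hypothetical stand-ins, and the cleaning prompt is illustrative.

```python
# Sketch: retrieve web pages for a query and let the LLM strip interface residue.
def retrieve_clean_knowledge(query, search_web, llm, k=5):
    cleaned = []
    for page_text in search_web(query)[:k]:
        prompt = (
            "Extract only the factual content relevant to the query below; "
            "drop navigation text, menus, and other interface residue.\n"
            f"Query: {query}\nPage text: {page_text}"
        )
        cleaned.append(llm(prompt))
    return cleaned
```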

5.2 Experimental Results

Overall Performance. Table 3 shows the overall results of TIFG and the baseline models. From the table, we can see that the TIFG method consistently surpasses the baseline methods across a variety of metrics and datasets. Table 2 shows the number of newly generated features for each dataset. On the ICP dataset, the TIFG model achieves the highest accuracy of 97.9%, marking a significant improvement of 4.9% over the raw data and 2.4% over the LLM method for the Random Forest (RF) model. For the GCI dataset, TIFG boosts accuracy to 81.6%, an impressive 15.1% increase compared to the raw data on the same RF task and 4.3% over the LLM method, demonstrating its robustness in handling diverse data characteristics.

In terms of precision, TIFG also demonstrates superior performance across datasets. Notably, on the GCI dataset under the RF model, TIFG reaches a precision of 79.1%, which surpasses the highest precision of 77.8% achieved by the LLM method. This result highlights TIFG's capability in reducing misclassifications.

For the F1 score on the GCI dataset, TIFG improves performance significantly, from 63.6% on the raw data to 78.8% within three iterations, reaching an F1 score of 79.7%. As the F1 score is the harmonic mean of precision and recall, this marked improvement underscores TIFG's ability to balance reducing misclassifications and minimizing missed classifications. Such enhancements underscore TIFG's overall efficacy in enriching feature spaces and improving model performance across multiple metrics.

Table 2: Number of features before and after TIFG.
Datasets | Original | TIFG
ICP      | 12       | 16 (+4)
AID      | 15       | 16 (+1)
GCI      | 34       | 39 (+5)
DIA      | 21       | 23 (+2)

Table 3: Overall performance of TIFG and the baselines. Under each downstream model (RF, DT, KNN, MLP), the four columns are the datasets ICP, AID, GCI, and DIA.

Metric | Method | RF: ICP  AID   GCI   DIA  | DT: ICP  AID   GCI   DIA  | KNN: ICP AID   GCI   DIA  | MLP: ICP AID   GCI   DIA
Acc    | Raw    | 0.949 0.635 0.674 0.859 | 0.964 0.423 0.571 0.794 | 0.613 0.462 0.510 0.847 | 0.841 0.404 0.408 0.864
       | Lasso  | 0.975 0.654 0.735 0.860 | 0.970 0.442 0.592 0.796 | 0.663 0.519 0.571 0.849 | 0.853 0.442 0.510 0.863
       | RL     | 0.960 0.687 0.755 0.859 | 0.965 0.519 0.592 0.796 | 0.643 0.577 0.531 0.848 | 0.844 0.558 0.531 0.864
       | LLM    | 0.975 0.673 0.776 0.862 | 0.972 0.539 0.633 0.799 | 0.668 0.596 0.571 0.849 | 0.856 0.539 0.510 0.863
       | TIFG   | 0.979 0.712 0.816 0.864 | 0.973 0.596 0.674 0.815 | 0.712 0.635 0.653 0.859 | 0.895 0.596 0.612 0.867
Prec   | Raw    | 0.949 0.175 0.675 0.681 | 0.964 0.118 0.551 0.588 | 0.580 0.112 0.450 0.640 | 0.851 0.115 0.442 0.702
       | Lasso  | 0.975 0.188 0.690 0.684 | 0.970 0.150 0.607 0.590 | 0.648 0.137 0.562 0.647 | 0.859 0.132 0.582 0.703
       | RL     | 0.960 0.245 0.763 0.683 | 0.965 0.319 0.554 0.590 | 0.639 0.181 0.450 0.642 | 0.851 0.131 0.551 0.708
       | LLM    | 0.975 0.322 0.778 0.695 | 0.972 0.223 0.609 0.595 | 0.651 0.170 0.559 0.644 | 0.862 0.166 0.509 0.701
       | TIFG   | 0.979 0.270 0.791 0.718 | 0.973 0.294 0.669 0.605 | 0.707 0.179 0.658 0.676 | 0.896 0.220 0.572 0.729
Rec    | Raw    | 0.949 0.156 0.640 0.570 | 0.964 0.126 0.564 0.597 | 0.612 0.169 0.426 0.573 | 0.840 0.090 0.416 0.571
       | Lasso  | 0.974 0.194 0.720 0.575 | 0.970 0.141 0.594 0.599 | 0.661 0.119 0.544 0.580 | 0.854 0.137 0.528 0.591
       | RL     | 0.960 0.206 0.742 0.568 | 0.965 0.244 0.549 0.598 | 0.641 0.200 0.439 0.579 | 0.843 0.156 0.541 0.589
       | LLM    | 0.976 0.268 0.773 0.572 | 0.972 0.262 0.631 0.605 | 0.667 0.181 0.588 0.575 | 0.855 0.184 0.517 0.604
       | TIFG   | 0.979 0.208 0.791 0.571 | 0.973 0.398 0.703 0.596 | 0.709 0.214 0.654 0.543 | 0.895 0.215 0.556 0.568
F1     | Raw    | 0.949 0.146 0.636 0.587 | 0.964 0.122 0.556 0.592 | 0.578 0.115 0.432 0.587 | 0.840 0.101 0.425 0.589
       | Lasso  | 0.975 0.180 0.679 0.594 | 0.970 0.137 0.599 0.594 | 0.640 0.123 0.545 0.595 | 0.850 0.130 0.531 0.614
       | RL     | 0.960 0.209 0.741 0.584 | 0.965 0.268 0.549 0.594 | 0.622 0.190 0.429 0.594 | 0.843 0.142 0.537 0.611
       | LLM    | 0.975 0.264 0.772 0.590 | 0.972 0.236 0.617 0.600 | 0.645 0.164 0.536 0.590 | 0.855 0.172 0.439 0.628
       | TIFG   | 0.979 0.216 0.788 0.588 | 0.973 0.318 0.681 0.600 | 0.696 0.192 0.652 0.547 | 0.894 0.204 0.546 0.585

Information Gain. We then present the information gain during the feature generation process. Here we adopt information entropy to quantify the information in the dataset. Figure 3 shows the information entropy gain for each dataset before and after TIFG. We can see that the new features contribute differently to the information entropy. The fundamental reason that TIFG enhances task performance is that it introduces new information that enriches the dataset's overall information content.

[Figure 3]
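As a reference, here is a minimal sketch of one way such an entropy measurement could be implemented; treating the dataset entropy as the sum of per-column Shannon entropies and binning continuous features into 10 bins are our assumptions.

```python
# Sketch: sum of Shannon entropies of the (discretized) feature columns,
# computed before and after feature generation to obtain the entropy gain.
import numpy as np
import pandas as pd

def dataset_entropy(df: pd.DataFrame, bins: int = 10) -> float:
    total = 0.0
    for col in df.columns:
        x = df[col]
        if pd.api.types.is_numeric_dtype(x):
            x = pd.cut(x, bins=bins)          # discretize continuous features
        p = x.value_counts(normalize=True).to_numpy()
        total += float(-(p * np.log2(p + 1e-12)).sum())
    return total

# gain = dataset_entropy(df_after_tifg) - dataset_entropy(df_before_tifg)
```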

Case Study. In this case study, we present how our approach adds new features to the Global Country Information Dataset. TIFG generates and adds five new features to the dataset based on its feature labels and data description, along with the corresponding calculation methods and explanations for the new features. We demonstrate the changes in model performance and information gain resulting from these additions below.

In Table 4, we show the feature information and data description before and after TIFG, and in Table 5 we present the details of the newly generated features, in the order of generation, along with their formulas and meanings.

We first compare the changes in accuracy for downstream tasks and the information gain when generating new features. Figure 4 shows that each new feature improves the metrics for downstream tasks and adds extra information to the dataset. TIFG increases the dimensionality of the dataset's feature space by incorporating external information, so the model in a downstream task can better capture the complex relationships between features and improve task performance.

Table 4: Number of features and data description before and after TIFG (GCI dataset).
Dataset  | Number of Features | Data Description
Original | 35                 | This comprehensive dataset provides a wealth of information about all countries worldwide, …, enabling in-depth analyses and cross-country comparisons.
New      | 40 (+5)            | This comprehensive dataset provides a wealth of information about all countries worldwide, …, enabling in-depth analyses and cross-country comparisons. Newly added to this dataset are five key variables designed to deepen insights into economic pressures, population density, resource utilization, educational investments, and environmental stress.
Table 5: New features generated for the GCI dataset.
No. | Label | Calculation | Reasoning
1 | Population Load Ratio | Population / Land Area (Km²) | This ratio shows the number of people per square kilometer, indicating the population load of a country.
2 | Resource Utilization Rate | (Agricultural Land (%) + Forested Area (%)) / 100 | This ratio indicates the proportion of land used for agriculture and forestry, with higher rates suggesting greater utilization of resources.
3 | Education Investment Effectiveness | (Gross Primary Education Enrollment (%) + Gross Tertiary Education Enrollment (%)) / 2 | This rate shows the average level of investment in primary and higher education in a country, where higher educational investment is usually associated with better economic development.
4 | Environmental Stress Index | CO2 Emissions / (Forested Area (%) / 100 × Land Area (Km²)) | This index reflects the environmental stress of a country, specifically the amount of carbon dioxide emissions per unit of forest area, with higher values indicating greater environmental pressure.
5 | GDP per Capita | GDP / Population | GDP per capita is an important measure of a country's economic level and the standard of living of its residents.
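For illustration, here is a minimal pandas sketch of the five calculations in Table 5; the column names are simplified stand-ins for the dataset's actual headers.

```python
# Sketch: derive the five generated GCI features from the original columns.
import pandas as pd

def add_gci_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["Population Load Ratio"] = df["Population"] / df["Land Area (Km2)"]
    df["Resource Utilization Rate"] = (df["Agricultural Land (%)"] + df["Forested Area (%)"]) / 100
    df["Education Investment Effectiveness"] = (
        df["Gross Primary Education Enrollment (%)"]
        + df["Gross Tertiary Education Enrollment (%)"]
    ) / 2
    df["Environmental Stress Index"] = df["CO2 Emissions"] / (
        df["Forested Area (%)"] / 100 * df["Land Area (Km2)"]
    )
    df["GDP per Capita"] = df["GDP"] / df["Population"]
    return df
```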
[Figure 4]
[Figure 5]

Correlation. We further explore the correlation between the new features generated by TIFG and the existing features in the Global Country Information Dataset. Figure 5 shows a detailed correlation analysis between these newly generated features and several key features in the dataset. The correlations validate the effectiveness of our TIFG method and reveal deeper connections between the new features and critical socio-economic indicators, which enhances the model's ability to understand and capture the deeper structure of the data. For example, the high correlation of the Population Load Ratio with land area and urbanization rate highlights issues of population density distribution, and the connection between the Resource Utilization Rate and the Environmental Stress Index reflects the close relationship between resource management and environmental protection. These correlation results validate the soundness and correctness of the new features.
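A minimal sketch of this analysis is given below; computing Pearson correlations over the numeric columns with pandas is our assumption about the implementation.

```python
# Sketch: Pearson correlations between generated features and original numeric features.
import pandas as pd

def new_vs_original_correlation(df: pd.DataFrame, new_features: list) -> pd.DataFrame:
    numeric = df.select_dtypes("number")
    original = [c for c in numeric.columns if c not in new_features]
    # Rows: new features; columns: original features; values: Pearson r.
    return numeric.corr(method="pearson").loc[new_features, original]
```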

6 Conclusion

In this paper, we introduce TIFG, a novel method that utilizes an LLM for text-informed feature generation. Our goal is to enrich data by effectively utilizing textual information to generate domain-specific features. TIFG leverages the LLM to extract and integrate the textual information and to retrieve relevant and reliable knowledge for generating new features. To ensure the correctness of generation, we adopt RAG with external knowledge to consistently produce reliable and precise features. These features significantly enhance machine learning models without high resource costs or domain-specific fine-tuning. The experimental results demonstrate that TIFG outperforms existing methods in generating meaningful features and enhancing performance across various domains. Moreover, TIFG's automated and adaptable framework continually evolves with new data and shows great potential for broader adoption in various fields. In future work, we plan to refine and extend the TIFG paradigm to more feature engineering methods.

Limitations and Ethics Statements

While TIFG shows significant advancements and wide adaptability, several limitations require further exploration, including its computational demands and limited scalability with complex tasks or large external libraries. Besides, the effectiveness of the retrieval and the final output heavily relies on the quality of the external library, which might affect the performance of the model, especially under scenarios with poorly organized prompts or limited external knowledge. Finally, adopting TIFG in domain-specific tasks with unique requirements may face inherent limitations.

The TIFG framework can enrich data with textual information and external knowledge. However, as the framework uses pre-trained GPT-3.5 Turbo and Wikipedia as the generation model and external knowledge base, respectively, it may inherit the ethical concerns associated with these resources, such as responding to harmful queries or exhibiting biased behaviors.

References

  • cou (2023). Countries of the world 2023. Kaggle. Available online: https://www.kaggle.com/datasets/nelgiriyewithana/countries-of-the-world-2023?select=world-data-2023.csv.
  • Banerjee (2023). Sourav Banerjee. 2023. Animal information dataset. Kaggle dataset.
  • Guo et al. (2017). Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
  • Gupta (2023). Suresh Gupta. 2023. Health insurance data set. Kaggle. Accessed: date-of-access.
  • Hu and Lu (2024). Yucheng Hu and Yuxing Lu. 2024. RAG and RAU: A survey on retrieval-augmented language model in natural language processing. arXiv preprint arXiv:2404.19543.
  • Huang and Huang (2024). Yizheng Huang and Jimmy Huang. 2024. A survey on retrieval-augmented text generation for large language models. arXiv preprint arXiv:2404.10981.
  • Katz et al. (2016). Gilad Katz, Eui Chul Richard Shin, and Dawn Song. 2016. ExploreKit: Automatic feature generation and selection. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 979–984.
  • Khurana et al. (2018). Udayan Khurana, Horst Samulowitz, and Deepak Turaga. 2018. Feature engineering for predictive modeling using reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Khurana et al. (2016). Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasrathy. 2016. Cognito: Automated feature engineering for supervised learning. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 1304–1307. IEEE.
  • Lewis et al. (2020). Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  • Li et al. (2024). Jiahao Li, Quan Wang, Licheng Zhang, Guoqing Jin, and Zhendong Mao. 2024. Feature-adaptive and data-scalable in-context learning. arXiv preprint arXiv:2405.10738.
  • OpenAI. GPT-3.5 Turbo fine-tuning and API updates. https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates/. Accessed: 2024-05-20.
  • Pan et al. (2020). Tongyang Pan, Jinglong Chen, Jingsong Xie, Zitong Zhou, and Shuilong He. 2020. Deep feature generating network: A new method for intelligent fault detection of mechanical systems under class imbalance. IEEE Transactions on Industrial Informatics, 17(9):6282–6293.
  • Schölkopf et al. (2021). Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634.
  • Severyn and Moschitti (2013). Aliaksei Severyn and Alessandro Moschitti. 2013. Automatic feature engineering for answer selection and extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 458–467.
  • Shi et al. (2018). Hongtao Shi, Hongping Li, Dan Zhang, Chaqiu Cheng, and Xuanxuan Cao. 2018. An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification. Computer Networks, 132:81–98.
  • Song et al. (2019). Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1161–1170.
  • Teboul (2023). Alex Teboul. 2023. Diabetes health indicators dataset. Kaggle. Accessed: date-of-access.
  • Wang et al. (2022). Dongjie Wang, Yanjie Fu, Kunpeng Liu, Xiaolin Li, and Yan Solihin. 2022. Group-wise reinforcement feature generation for optimal and explainable representation space reconstruction. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1826–1834.
  • Wang et al. (2024). Zaitian Wang, Pengfei Wang, Kunpeng Liu, Pengyang Wang, Yanjie Fu, Chang-Tien Lu, Charu C. Aggarwal, Jian Pei, and Yuanchun Zhou. 2024. A comprehensive survey on data augmentation. arXiv preprint arXiv:2405.09591.
  • Wei et al. (2022). Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Xiang et al. (2021). Ziyu Xiang, Mingzhou Fan, Guillermo Vázquez Tovar, William Trehern, Byung-Jun Yoon, Xiaofeng Qian, Raymundo Arroyave, and Xiaoning Qian. 2021. Physics-constrained automatic feature engineering for predictive modeling in materials science. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10414–10421.
  • Yang et al. (2024). Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J. Prenger, and Animashree Anandkumar. 2024. LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36.
  • Zhang et al. (2024a). Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E. Gonzalez. 2024a. RAFT: Adapting language model to domain specific RAG. arXiv preprint arXiv:2403.10131.
  • Zhang et al. (2024b). Xinhao Zhang, Zaitian Wang, Lu Jiang, Wanfu Gao, Pengfei Wang, and Kunpeng Liu. 2024b. TFWT: Tabular feature weighting with transformer.
  • Zhang et al. (2024c). Xinhao Zhang, Jinghan Zhang, Banafsheh Rekabdar, Yuanchun Zhou, Pengfei Wang, and Kunpeng Liu. 2024c. Dynamic and adaptive feature generation with LLM. arXiv preprint arXiv:2406.03505.
  • Zheng and Casari (2018). Alice Zheng and Amanda Casari. 2018. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.
  • Zhong et al. (2016). Guoqiang Zhong, Li-Na Wang, Xiao Ling, and Junyu Dong. 2016. An overview on data representation learning: From traditional feature learning to recent deep learning. The Journal of Finance and Data Science, 2(4):265–278.

Appendix A Appendix

Here we present the detailed correlation heatmap of GCI in Figure 6 and the feature lists in Table 6 before and after TIFG.

The heatmap demonstrates the correlation between the original (left) and newly generated (right) features in the Global Country Information Dataset after applying TIFG. The chart includes a matrix whose rows and columns represent the original features with Pearson's correlation values. On the right side, the nodes represent the new features, and the lines indicate the strength of their correlation with the original features.

We showcase the new features generated by TIFG for all four datasets. The new features are highly relevant to their dataset's target and original features, and they are explainable and reasonable in their specific domains.

[Figure 6]

Table 6: Original and generated features for each dataset.
Dataset | Original Features | Generated Features
ICP | age, sex, weight, bmi, hereditary_diseases, no_of_dependents, smoker, city, bloodpressure, diabetes, regular_ex, job_title | Comorbidity Score, CholCheck, Healthcare Utilization Score
AID | Animal, Height (cm), Weight (kg), Color, Lifespan (years), Diet, Habitat, Predators, Average Speed (km/h), Countries Found, Conservation Status, Family, Gestation Period (days), Top Speed (km/h), Offspring per Birth | Reproductive Efficiency
GCI | Country, Density (P/Km2), Abbreviation, Agricultural Land (%), Land Area (Km2), Armed Forces size, Birth Rate, Calling Code, Capital/Major City, Co2-Emissions, CPI, CPI Change (%), Currency-Code, Fertility Rate, Forested Area (%), Gasoline Price, GDP, Gross primary education enrollment (%), Gross tertiary education enrollment (%), Infant mortality, Largest city, Maternal mortality ratio, Minimum wage, Official language, Out of pocket health expenditure, Physicians per thousand, Population, Population: Labor force participation (%), Tax revenue (%), Total tax rate, Unemployment rate, Urban_population, Latitude, Longitude | Population Load Ratio, Resource Utilization Rate, Education Investment Effectiveness, Environmental Stress Index, GDP per Capita
DIA | HighBP, HighChol, CholCheck, BMI, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, GenHlth, MentHlth, PhysHlth, DiffWalk, Sex, Age, Education, Income | LifestyleScore, HealthRiskScore
