Analyzing Data using ChatGPT and LLMs
You might have heard of ChatGPT cannot do math or solve simple problem, and you might have heard that ChatGPT can even replace data analyst. Well both is true to an extent. ChatGPT’s GPT-4 has an Advanced Data Analysis plugin that is available to premium (paid) accounts that allows users to upload datasets to ChatGPT and run code directly on the dataset, allowing for accurate data analysis. But did you know that you don’t always need such plugins to analyze datasets well with LLMs? Let’s first understand the limitations of ChatGPT and LLMs to analyze datasets.
Types of Data Problems where ChatGPT and LLMs fall short
ChatGPT, while advanced in many aspects, can encounter difficulties in accurately solving certain types of math problems. Here are some scenarios where it might fail to provide the correct answer:
- Complex Calculations: ChatGPT and LLMs can struggle with math problems that involve intricate or multi-step calculations, particularly those that require high precision or are beyond typical arithmetic.
- Abstract or Advanced Mathematical Concepts: Problems involving advanced topics like higher-level calculus, abstract algebra, or complex number theory might pose a challenge, especially if they require deep understanding or specialized methods.
- Interpreting Ambiguous or Poorly Structured Problems: If a math problem is not clearly stated or has multiple interpretations, LLMs might misinterpret the question or make incorrect assumptions, leading to erroneous answers.
- Limited Context or Information: For problems that require additional contextual information or specific knowledge not provided in the question, ChatGPT may not be able to deduce the correct solution.
- Real-Time Data or Problems Requiring External Resources: ChatGPT and most LLMs cannot access or incorporate real-time data or external databases, which can be a limitation for problems that depend on the most current data or specific datasets.
- Extremely Large Numbers or Computations: Handling very large numbers or problems that require extensive computational resources might be problematic, as ChatGPT’s capabilities are optimized for general-purpose tasks rather than high-performance computing.
- Visual or Spatial Problems: Problems that heavily rely on visual interpretation, such as certain geometry or topology problems, may not be accurately solvable by ChatGPT, as it primarily processes text-based information.
- Non-standard Notations or Unconventional Problem Formats: If a math problem uses non-standard notations or is presented in an unconventional format, ChatGPT might not recognize or correctly interpret the symbols and structure.
- Interactive or Adaptive Problems: Problems that require interaction or adapt based on previous answers (like some advanced logic or puzzle problems) might not be effectively solvable in a single query-response format.
While LLMs and ChatGPT is capable of handling a wide range of mathematical problems, its performance can be limited by the complexity of the calculations, the level of mathematical theory involved, the clarity and structure of the problem, and the format in which the problem is presented. Given the types of data problems ChatGPT might not be good at, let’s exam the reasons why LLMs fall short.
LLM Limitations that Hinder Data Analysis
- Training Data Limitations: Most LLM training involves a vast but finite dataset of text, which may not include extensive or in-depth coverage of higher-level or specialized mathematical topics. This can lead to gaps in its understanding of advanced mathematical concepts, which are often less commonly discussed in general text sources akin to asking a professional basketball player to cook a world class meal.
- Abstract Reasoning: Advanced mathematics often requires a level of abstract reasoning and conceptual understanding that goes beyond pattern recognition and information retrieval. LLM models like ChatGPT are primarily designed for processing and generating text based on patterns in data, rather than conceptual understanding or abstract thinking.
- Symbolic Manipulation: Higher mathematics frequently involves complex symbolic manipulation, which can be challenging for language models. ChatGPT is optimized for natural language processing and may not adeptly handle the symbolic language and notation used in advanced mathematics.
- Contextual Limitations: Mathematical concepts at an advanced level often require a deep understanding of context and the ability to integrate multiple concepts simultaneously. AI models may not fully grasp the intricate context and interrelations necessary for solving complex mathematical problems.
- No Dynamic Learning or Real-time Feedback: ChatGPT does not learn dynamically from new information or interactions in real-time. It cannot update its knowledge base with new mathematical theories or corrections after its last training update, which can be a limitation for staying current with the latest developments in mathematics.
- Linear Processing: Complex mathematical problems often require nonlinear thinking and the ability to backtrack, revise, and explore multiple solution paths simultaneously. LLMs often use linear processing of information which might not be optimal for such multidimensional problem-solving.
- Limited Computational Capabilities: Some advanced mathematical problems require significant computational resources, which might exceed ChatGPT’s capabilities, especially for tasks that demand high computational precision or solving large-scale mathematical models.
LLMs struggles with abstract or advanced mathematical concepts arise because as stated on the proverbial box label, ChatGPT and LLMs are large language models that excels in pattern recognition and text generation, but as a result, lacks the deep conceptual understanding, abstract reasoning, and symbolic manipulation skills that are often essential in advanced mathematics. Which leads up to the types of datasets that pure LLMs are not great at.
Types of dataset analysis that pure LLMs are not great at
As you probably already know, LLMs are limited in their ability to perform accurate mathematical calculations, making them unsuitable for tasks requiring precise quantitative analysis on datasets, such as:
- Descriptive Statistics: Summarizing numerical columns quantitatively, through measures like the mean or variance.
- Correlation Analysis: Obtaining the precise correlation coefficient between columns.
- Statistical Analysis: Such as hypothesis testing to determine if there are statistically significant differences between groups of data points.
- Machine Learning: Performing predictive modeling on a dataset such as using linear regressions, gradient boosted trees, or neural networks.
However if you are performing such quantitative tasks on datasets, it is better to use OpenAI’s Advanced Data Analysis plugin. The plugin allows programming languages step in to run code for such tasks on a dataset.
So, why would anyone want to analyze datasets using only LLMs and without such plugins?
Types of dataset analysis that LLMs are great at
LLMs are excellent at identifying patterns and trends which stems from their extensive training on diverse and voluminous data, enabling them to discern intricate patterns that may not be immediately apparent. For such pattern-based tasks, using LLMs alone may in fact produce better results within a shorter timeframe than using code.
This makes them well-suited for tasks based on pattern-finding within datasets, such as:
- Anomaly Detection: Spotting unusual data points.
- Clustering: Grouping similar data points.
- Cross-Column Relationships: Identifying trends across data columns.
- Textual Analysis: Categorizing text-based data.
- Trend Analysis: Recognizing patterns and trends over time.
How to Prompt an LLM to Do Better Data Analysis?
Let’s through an example of a prompting technique for doing data analysis. Use the following 4 prompt engineering techniques to analyze:
- Breaking down a complex task into simple steps
- Referencing intermediate outputs from each step
- Formatting the LLM’s response
- Separating the instructions from the dataset
Let’s begin with the prompt.
Example of Prompt for ChatGPT WITHOUT using the data analysis plugin
# CONTEXT #
I sell mattresses. I have a dataset of information on my customers: [year of birth, marital status, income, number of children, days since last purchase, amount spent].#############
# OBJECTIVE #
I want you use the dataset to cluster my customers into groups and then give me ideas on how to target my marketing efforts towards each group. Use this step-by-step process and do not use code:1. CLUSTERS: Use the columns of the dataset to cluster the rows of the dataset, such that customers within the same cluster have similar column values while customers in different clusters have distinctly different column values. Ensure that each row only belongs to 1 cluster.
For each cluster found,
2. CLUSTER_INFORMATION: Describe the cluster in terms of the dataset columns.
3. CLUSTER_NAME: Interpret [CLUSTER_INFORMATION] to obtain a short name for the customer group in this cluster.
4. MARKETING_IDEAS: Generate ideas to market my product to this customer group.
5. RATIONALE: Explain why [MARKETING_IDEAS] is relevant and effective for this customer group.#############
# STYLE #
Business analytics report#############
# TONE #
Professional, technical#############
# AUDIENCE #
My marketing team. Convince them that your marketing strategy is well thought-out and fully backed by data.#############
# RESPONSE: MARKDOWN REPORT #
<For each cluster in [CLUSTERS]>
— Customer Group: [CLUSTER_NAME]
— Profile: [CLUSTER_INFORMATION]
— Marketing Ideas: [MARKETING_IDEAS]
— Rationale: [RATIONALE]<Annex>
Give a table of the list of row numbers belonging to each cluster, in order to back up your analysis. Use these table headers: [[CLUSTER_NAME], List of Rows].#############
# START ANALYSIS #
If you understand, ask me for my dataset.
Technique 1: Breaking down a complex task into simple steps
LLMs are great at performing simple tasks, but not so great at complex ones. As such, with complex tasks like this one, it is important to break down the task into simple step-by-step instructions for the LLM to follow. The idea is to give the LLM the steps that you yourself would take to execute the task.
While you could give ChatGPT the overall task of “Cluster the customers into groups and then give ideas on how to market to each group” with step-by-step instructions, LLMs are significantly more likely to deliver the correct results.
Technique 2: Referencing intermediate outputs from each step
When providing the step-by-step process to the LLM, we give the intermediate output from each step a capitalized VARIABLE_NAME
, namely CLUSTERS
, CLUSTER_INFORMATION
, CLUSTER_NAME
, MARKETING_IDEAS
and RATIONALE
.
Capitalization is used to differentiate these variable names from the body of instructions given. These intermediate outputs can later be referenced using square brackets as [VARIABLE_NAME]
.
Technique 3: Formatting the LLM’s response
Asking for a markdown report format can simplify and beautify the LLM’s response. Furthermore, having variable names from intermediate outputs again comes in handy here to dictate the structure of the report. In fact, you could even subsequently ask ChatGPT to provide the report as a downloadable file, allowing you to work off of its response in writing your final report.
Technique 4: Separating the task instructions from the dataset
You’ll notice in the prompt that we never gave the dataset to the LLM in our first prompt. Instead, the prompt gives only the task instructions for the dataset analysis, with “If you understand, ask me for my dataset” added to the bottom. But why separate the instructions from the dataset?
The straightforward answer is that LLMs have a limit to their context window, or how much information they can take as input in 1 prompt. A long prompt combining both instructions and data might exceed this limit, leading to truncation and loss of information.
The more intricate answer is that separating the instructions and the dataset helps the LLM maintain clarity in understanding each, with lower likelihood of missing out information. In some cases, you might have experienced scenarios where ChatGPT “forgets” a certain instruction you gave as part of a longer prompt — for example, if you asked for a 200-word response and the LLM gives you a longer paragraph back. By receiving the instructions first, before the dataset that the instructions are for, the LLM can first digest what it should do, before executing it on the dataset provided next. Note however that this separation of instructions and dataset can only be achieved with chat LLMs as they maintain a conversational memory, unlike completion LLMs such as Perplexity which do not.
How good is ChatGPT’s Advanced Data Analysis plugin?
With the Advanced Data Analysis plugin right now, it appears that executing simpler tasks on datasets such as calculating descriptive statistics or creating graphs can be easily done, but more advanced tasks that require computing of algorithms may sometimes result in errors and no outputs, due to computational limits with ChatGPT.
So do actually analyze datasets using LLMs? The answer is it depends on the type of analysis.
For tasks requiring precise mathematical calculations or complex, rule-based processing, conventional data analysis remains a better option. For tasks based on pattern-recognition, it can be challenging or more time-consuming to execute using conventional programming approaches. LLMs are pretty decent at such tasks, and can even provide additional outputs such as annexes to back up its analysis, and full analysis reports in markdown formatting.
Here is a small breakdown of techniques with GPT-4
Technique | Description of Technique | GPT-4 with Data Analysis Plugin | GPT-4 without Plugin |
---|---|---|---|
Descriptive Statistics | Basic statistical calculations like mean, median, mode, standard deviation, etc., to summarize data. | Yes | Limited |
Correlation Analysis | Determining the relationship and measure of association between two variables. | Yes | Limited |
Time Series Analysis | Analyzing data points collected or indexed in time order, often to identify trends, cycles, or seasonal patterns. | Yes | Limited |
Linear Regression | A linear approach to modeling the relationship between a dependent variable and one or more independent variables. | Yes | Limited |
Hypothesis Testing | Testing an assumption regarding a population parameter. | Yes | Limited |
Cluster Analysis | Grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. | Yes | Limited |
Machine Learning Models | Advanced predictive models like random forests, neural networks, etc., for prediction or classification tasks. | No | No |
Principal Component Analysis | A technique used to emphasize variation and bring out strong patterns in a dataset, reducing the dimensionality. | Yes | Limited |
Natural Language Processing | Processing and analyzing large amounts of natural language data, like text classification or sentiment analysis. | Yes | Yes |
Advanced Statistical Modeling | More complex statistical methods like Bayesian analysis, multivariate regression, etc. | No | No |
Image or Video Analysis | Analyzing visual data, like image recognition or processing video data. | No | No |
Real-time Data Analysis | Analyzing data that is being continuously updated in real-time. | No | No |
Big Data Analysis | Handling and processing extremely large datasets that require specialized big data technologies. | No | No |
Interactive Visualizations | Creating dynamic and interactive data visualizations, often used in dashboards and data apps. | No | No |
Closing Thoughts: When to Use LLMs for Data Analysis
Ultimately, the decision to utilize LLMs hinges on the nature of the task at hand, balancing the strengths of LLMs in pattern-recognition against the precision and specificity offered by traditional data analysis techniques through R or Python.