Prompting for Data Transformation: How to Use AI to Clean Messy Datasets
Master the art of using AI prompts to clean, transform, and normalize messy data. Includes practical templates and real-world examples.
The Problem: You have a messy CSV file with inconsistent formatting, missing values, and duplicate entries. Cleaning it manually would take hours.
What if you could describe what you want in plain English and let AI do the heavy lifting?
This guide shows you how to use AI prompts to transform chaotic data into clean, analysis-ready datasets.
Why AI Excels at Data Cleaning
Traditional Approach
- Write custom scripts for each dataset
- Handle edge cases manually
- Debug regex patterns
- Time-consuming and error-prone
AI-Powered Approach
- Describe the transformation in natural language
- AI generates the code
- Iterate quickly on edge cases
- Reusable prompts for similar tasks
The 5-Step Prompting Framework
Step 1: Describe the Current State
I have a CSV file with customer data. Issues:
- Names are in ALL CAPS
- Phone numbers have inconsistent formats
- Email addresses have leading/trailing spaces
- Some rows are duplicates
Step 2: Define the Desired State
I need:
- Names in Title Case
- Phone numbers in format: (XXX) XXX-XXXX
- Trimmed email addresses
- Duplicates removed (based on email)
Step 3: Specify Constraints
Requirements:
- Use Python with pandas
- Preserve original data in a backup column
- Log all transformations
- Handle missing values by filling with "N/A"
Step 4: Provide Examples
Input example:
JOHN DOE,5551234567, john@example.com
Expected output:
John Doe,(555) 123-4567,john@example.com
Step 5: Request Validation
Include:
- Data quality checks
- Row count before/after
- Summary of changes made
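The validation requested in Step 5 can be sketched in a few lines of pandas. The column names and the duplicate rule here are hypothetical, chosen only to illustrate the before/after row counts and change summary:

```python
import pandas as pd

# Hypothetical raw data: one duplicate email
df = pd.DataFrame({
    "name": ["ALICE SMITH", "BOB JONES", "ALICE SMITH"],
    "email": ["a@x.com", "b@x.com", "a@x.com"],
})

rows_before = len(df)
df_clean = df.drop_duplicates(subset=["email"])
rows_after = len(df_clean)

# Summary of changes
print(f"Rows before: {rows_before}")
print(f"Rows after:  {rows_after}")
print(f"Removed:     {rows_before - rows_after}")
```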
Copy-Paste Prompt Templates
Template 1: Data Cleaning
I have a [FILE_TYPE] with [DATA_DESCRIPTION].
Current issues:
- [ISSUE_1]
- [ISSUE_2]
- [ISSUE_3]
I need to:
- [TRANSFORMATION_1]
- [TRANSFORMATION_2]
- [TRANSFORMATION_3]
Constraints:
- Use [LANGUAGE/LIBRARY]
- Preserve original data
- Handle missing values by [STRATEGY]
Example:
Input: [EXAMPLE_INPUT]
Output: [EXAMPLE_OUTPUT]
Include validation checks and a summary of changes.
Template 2: Format Standardization
Standardize [FIELD_NAME] in my dataset.
Current formats:
- [FORMAT_1]
- [FORMAT_2]
- [FORMAT_3]
Target format: [TARGET_FORMAT]
Handle invalid entries by [STRATEGY].
Write a [LANGUAGE] function that:
- Takes a string input
- Returns standardized output
- Logs invalid entries
Template 3: Data Enrichment
I have a dataset with [FIELDS].
I need to add:
- [NEW_FIELD_1] derived from [SOURCE]
- [NEW_FIELD_2] calculated as [FORMULA]
Rules:
- [RULE_1]
- [RULE_2]
Use [LANGUAGE] and return a new dataframe with added columns.
Real-World Examples
Example 1: Cleaning Contact Data
Prompt:
I have a CSV with contact information. Clean it using Python and pandas.
Issues:
- Names are inconsistent (some UPPERCASE, some lowercase)
- Phone numbers vary: "555-1234", "(555) 1234", "5551234"
- Emails have spaces and mixed case
Transformations:
- Names: Title Case
- Phones: (XXX) XXX-XXXX format
- Emails: lowercase, trimmed
Handle missing values: replace with "Unknown"
Example:
Input: "ALICE SMITH","5551234567"," Alice@Example.COM "
Output: "Alice Smith","(555) 123-4567","alice@example.com"
AI-Generated Code:
import pandas as pd
import re

def clean_contacts(df):
    # Backup original data
    df['name_original'] = df['name']
    df['phone_original'] = df['phone']
    df['email_original'] = df['email']

    # Clean names
    df['name'] = df['name'].str.title().fillna('Unknown')

    # Clean phones
    def format_phone(phone):
        if pd.isna(phone):
            return 'Unknown'
        digits = re.sub(r'\D', '', str(phone))
        if len(digits) == 10:
            return f'({digits[:3]}) {digits[3:6]}-{digits[6:]}'
        return phone

    df['phone'] = df['phone'].apply(format_phone)

    # Clean emails
    df['email'] = df['email'].str.strip().str.lower().fillna('Unknown')
    return df

# Validation
print(f"Rows before: {len(df)}")
df_clean = clean_contacts(df)
print(f"Rows after: {len(df_clean)}")

Example 2: Normalizing Dates
Prompt:
Standardize date formats in my dataset using Python.
Current formats:
- "01/15/2024"
- "2024-01-15"
- "Jan 15, 2024"
- "15-Jan-24"
Target format: YYYY-MM-DD
Handle invalid dates by setting to None.
Return a function that takes a date string and returns standardized format.
AI-Generated Code:
from dateutil import parser
import pandas as pd

def standardize_date(date_str):
    if pd.isna(date_str):
        return None
    try:
        dt = parser.parse(str(date_str))
        return dt.strftime('%Y-%m-%d')
    except (ValueError, OverflowError):
        return None
df['date_standardized'] = df['date'].apply(standardize_date)

Example 3: Removing Duplicates with Logic
Prompt:
Remove duplicate rows from my customer dataset using Python.
Duplicate logic:
- Consider duplicates if email matches
- Keep the row with the most recent "created_at" date
- If dates are equal, keep the first occurrence
Return cleaned dataframe and count of removed duplicates.
AI-Generated Code:
def remove_duplicates(df):
    original_count = len(df)

    # Sort by email and created_at (descending)
    df_sorted = df.sort_values(['email', 'created_at'],
                               ascending=[True, False])

    # Keep first occurrence (most recent)
    df_clean = df_sorted.drop_duplicates(subset=['email'],
                                         keep='first')

    removed_count = original_count - len(df_clean)
    print(f"Removed {removed_count} duplicates")
    return df_clean

Advanced Techniques
Chained Transformations
Instead of one massive prompt, chain multiple steps:
Step 1:
Clean the phone numbers in my dataset. [DETAILS]
Step 2:
Now standardize the email addresses. [DETAILS]
Step 3:
Finally, remove duplicates based on email. [DETAILS]
Conditional Logic
Transform the "status" field:
- If "active" or "ACTIVE" → "Active"
- If "inactive", "disabled", "suspended" → "Inactive"
- If "pending" or "new" → "Pending"
- All others → "Unknown"
Use Python with a mapping dictionary.
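The mapping-dictionary approach the prompt asks for might look like this sketch. The status values come from the rules above; lowercasing first lets "active" and "ACTIVE" hit the same key, and anything unmapped falls through to "Unknown":

```python
import pandas as pd

df = pd.DataFrame({"status": ["active", "ACTIVE", "disabled", "new", "archived"]})

STATUS_MAP = {
    "active": "Active",
    "inactive": "Inactive", "disabled": "Inactive", "suspended": "Inactive",
    "pending": "Pending", "new": "Pending",
}

# .map() returns NaN for unmapped values, which fillna turns into "Unknown"
df["status"] = df["status"].str.lower().map(STATUS_MAP).fillna("Unknown")
```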
Data Validation Prompts
After cleaning, validate my dataset:
- Check for remaining null values
- Verify email format (regex)
- Ensure phone numbers match pattern
- Confirm no duplicates exist
Return a validation report with:
- Total rows
- Issues found
- Percentage of clean data
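A validation report along these lines can be assembled in pandas. This is a minimal sketch: the column names, regexes, and the rough "one issue = one dirty row" percentage are illustrative assumptions, not a definitive implementation:

```python
import pandas as pd

def validation_report(df):
    """Build a simple data-quality report; columns and regexes are illustrative."""
    email_ok = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
    phone_ok = df["phone"].str.match(r"^\(\d{3}\) \d{3}-\d{4}$", na=False)
    issues = (
        int(df.isna().sum().sum())                    # remaining nulls
        + int((~email_ok).sum())                      # bad emails
        + int((~phone_ok).sum())                      # bad phones
        + int(df.duplicated(subset=["email"]).sum())  # duplicate emails
    )
    return {
        "total_rows": len(df),
        "issues_found": issues,
        "percent_clean": round(100 * (len(df) - issues) / len(df), 1),
    }

df = pd.DataFrame({
    "email": ["a@x.com", "not-an-email"],
    "phone": ["(555) 123-4567", "(555) 999-0000"],
})
report = validation_report(df)
```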
Best Practices
1. Start Small
Test prompts on a sample (100 rows) before processing the full dataset.
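In pandas, pulling that test sample is one line; a fixed random_state keeps the sample reproducible while you iterate on the prompt:

```python
import pandas as pd

# Hypothetical full dataset
df = pd.DataFrame({"value": range(1000)})

# Test transformations on a reproducible 100-row sample first
sample = df.sample(n=100, random_state=42)
```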
2. Preserve Original Data
Always keep a backup column or file before transformations.
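A backup column costs one assignment before the transformation touches the data (the phone formatting here is just an example transformation):

```python
import pandas as pd

df = pd.DataFrame({"phone": ["5551234567"]})

# Keep the untouched values in a backup column before transforming in place
df["phone_original"] = df["phone"]
df["phone"] = df["phone"].str.replace(
    r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", regex=True
)
```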
3. Log Everything
Track what changed, when, and why.
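One lightweight way to do this (a sketch, not the only option) is an in-memory change log that records what changed, when, and why for each transformation:

```python
from datetime import datetime, timezone

change_log = []

def log_change(column, before, after, reason):
    """Append one record per transformation: what changed, when, and why."""
    change_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "column": column,
        "before": before,
        "after": after,
        "reason": reason,
    })

# Example entry for a hypothetical email cleanup
log_change("email", " A@X.COM ", "a@x.com", "trimmed and lowercased")
```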
4. Validate Output
Never trust AI output blindly. Always validate:
- Row counts
- Data types
- Value ranges
- Format consistency
5. Iterate on Edge Cases
If AI misses an edge case, add it to your prompt:
Also handle:
- Phone numbers with extensions (e.g., "555-1234 x567")
- International formats (e.g., "+1-555-1234")
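After adding those edge cases to the prompt, the regenerated phone formatter might grow along these lines. This is a sketch under two assumptions: extensions look like "x567" or "ext 567", and "international" means a leading US country code; anything else is left untouched for manual review:

```python
import re

def format_phone(raw):
    """Format US numbers as (XXX) XXX-XXXX, keeping any extension (sketch)."""
    s = str(raw)
    # Peel off a trailing extension like "x567" or "ext. 567"
    m = re.search(r"(?:x|ext\.?)\s*(\d+)\s*$", s, flags=re.IGNORECASE)
    ext = m.group(1) if m else None
    if m:
        s = s[:m.start()]
    digits = re.sub(r"\D", "", s)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading US country code
    if len(digits) != 10:
        return raw           # leave unrecognized formats for manual review
    out = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return f"{out} x{ext}" if ext else out
```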
Common Pitfalls
1. Vague Descriptions
❌ "Clean my data"
✅ "Remove leading/trailing spaces from all string columns"
2. No Examples
❌ "Standardize dates"
✅ "Convert '01/15/2024' to '2024-01-15'"
3. Missing Edge Cases
❌ "Format phone numbers"
✅ "Format phone numbers, handling 7-digit, 10-digit, and international formats"
4. No Validation
❌ Run transformation and assume it worked
✅ Include validation checks in the prompt
Conclusion
AI-powered data cleaning is a game-changer. By writing clear, detailed prompts with examples and constraints, you can transform messy datasets in minutes instead of hours.
Key Takeaways:
- Describe current state and desired state clearly
- Provide concrete examples
- Specify constraints and edge cases
- Always validate output
- Iterate on prompts for better results
Bonus: Prompt Library
Remove Nulls:
Replace all null/NaN values in [COLUMN] with [VALUE]. Use Python pandas.
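The code this prompt typically produces is a one-line fillna; column name and fill value here are placeholders:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Austin", np.nan, "Boston"]})

# fillna replaces both None and NaN with the given value
df["city"] = df["city"].fillna("Unknown")
```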
Merge Columns:
Create a new column [NAME] by combining [COL1] and [COL2] with [SEPARATOR].
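For string columns, str.cat handles the combination; the column names and separator below are example placeholders:

```python
import pandas as pd

df = pd.DataFrame({"first": ["Alice", "Bob"], "last": ["Smith", "Jones"]})

# str.cat joins two string columns element-wise with a separator
df["full_name"] = df["first"].str.cat(df["last"], sep=" ")
```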
Split Column:
Split [COLUMN] into [COL1] and [COL2] using [DELIMITER]. Handle cases where delimiter is missing.
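A sketch of the missing-delimiter handling the prompt asks for: str.split with expand=True yields None in the second column when the delimiter is absent, which fillna can then normalize (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"full_name": ["Alice Smith", "Cher"]})

# n=1 splits on the first delimiter only; rows without it get None
parts = df["full_name"].str.split(" ", n=1, expand=True)
df["first"] = parts[0]
df["last"] = parts[1].fillna("")
```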
Type Conversion:
Convert [COLUMN] from [TYPE1] to [TYPE2]. Handle invalid values by [STRATEGY].
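For string-to-number conversion, one common strategy for invalid values is pd.to_numeric with errors="coerce"; the column and values below are placeholders:

```python
import pandas as pd

df = pd.DataFrame({"price": ["19.99", "N/A", "5"]})

# errors="coerce" turns unparseable values into NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")
```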