Prompting for Data Transformation: How to Use AI to Clean Messy Datasets
Master the art of using AI prompts to clean, transform, and normalize messy data. Includes practical templates and real-world examples.
The Problem: You have a messy CSV file with inconsistent formatting, missing values, and duplicate entries. Cleaning it manually would take hours.
What if you could describe what you want in plain English and let AI do the heavy lifting?
This guide shows you how to use AI prompts to transform chaotic data into clean, analysis-ready datasets.
Why AI Excels at Data Cleaning
Traditional Approach
- Write custom scripts for each dataset
- Handle edge cases manually
- Debug regex patterns
- Time-consuming and error-prone
AI-Powered Approach
- Describe the transformation in natural language
- AI generates the code
- Iterate quickly on edge cases
- Reusable prompts for similar tasks
The 5-Step Prompting Framework
Step 1: Describe the Current State
I have a CSV file with customer data. Issues:
- Names are in ALL CAPS
- Phone numbers have inconsistent formats
- Email addresses have leading/trailing spaces
- Some rows are duplicates
Step 2: Define the Desired State
I need:
- Names in Title Case
- Phone numbers in format: (XXX) XXX-XXXX
- Trimmed email addresses
- Duplicates removed (based on email)
Step 3: Specify Constraints
Requirements:
- Use Python with pandas
- Preserve original data in a backup column
- Log all transformations
- Handle missing values by filling with "N/A"
Step 4: Provide Examples
Input example:
JOHN DOE,5551234567, john@example.com
Expected output:
John Doe,(555) 123-4567,john@example.com
Step 5: Request Validation
Include:
- Data quality checks
- Row count before/after
- Summary of changes made
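The validation requested in Step 5 can be sketched in a few lines of pandas. The column names and the duplicate rule here are hypothetical, chosen only to illustrate the before/after row counts and change summary:

```python
import pandas as pd

# Hypothetical raw data: one duplicate email
df = pd.DataFrame({
    "name": ["ALICE SMITH", "BOB JONES", "ALICE SMITH"],
    "email": ["a@x.com", "b@x.com", "a@x.com"],
})

rows_before = len(df)
df_clean = df.drop_duplicates(subset=["email"])
rows_after = len(df_clean)

# Summary of changes
print(f"Rows before: {rows_before}")
print(f"Rows after:  {rows_after}")
print(f"Removed:     {rows_before - rows_after}")
```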
Copy-Paste Prompt Templates
Template 1: Data Cleaning
I have a [FILE_TYPE] with [DATA_DESCRIPTION].
Current issues:
- [ISSUE_1]
- [ISSUE_2]
- [ISSUE_3]
I need to:
- [TRANSFORMATION_1]
- [TRANSFORMATION_2]
- [TRANSFORMATION_3]
Constraints:
- Use [LANGUAGE/LIBRARY]
- Preserve original data
- Handle missing values by [STRATEGY]
Example:
Input: [EXAMPLE_INPUT]
Output: [EXAMPLE_OUTPUT]
Include validation checks and a summary of changes.
Template 2: Format Standardization
Standardize [FIELD_NAME] in my dataset.
Current formats:
- [FORMAT_1]
- [FORMAT_2]
- [FORMAT_3]
Target format: [TARGET_FORMAT]
Handle invalid entries by [STRATEGY].
Write a [LANGUAGE] function that:
- Takes a string input
- Returns standardized output
- Logs invalid entries
Template 3: Data Enrichment
I have a dataset with [FIELDS].
I need to add:
- [NEW_FIELD_1] derived from [SOURCE]
- [NEW_FIELD_2] calculated as [FORMULA]
Rules:
- [RULE_1]
- [RULE_2]
Use [LANGUAGE] and return a new dataframe with added columns.
Real-World Examples
Example 1: Cleaning Contact Data
Prompt:
I have a CSV with contact information. Clean it using Python and pandas.
Issues:
- Names are inconsistent (some UPPERCASE, some lowercase)
- Phone numbers vary: "555-1234", "(555) 1234", "5551234"
- Emails have spaces and mixed case
Transformations:
- Names: Title Case
- Phones: (XXX) XXX-XXXX format
- Emails: lowercase, trimmed
Handle missing values: replace with "Unknown"
Example:
Input: "ALICE SMITH","5551234567"," Alice@Example.COM "
Output: "Alice Smith","(555) 123-4567","alice@example.com"
AI-Generated Code:
import pandas as pd
import re

def clean_contacts(df):
    # Backup original data
    df['name_original'] = df['name']
    df['phone_original'] = df['phone']
    df['email_original'] = df['email']

    # Clean names
    df['name'] = df['name'].str.title().fillna('Unknown')

    # Clean phones
    def format_phone(phone):
        if pd.isna(phone):
            return 'Unknown'
        digits = re.sub(r'\D', '', str(phone))
        if len(digits) == 10:
            return f'({digits[:3]}) {digits[3:6]}-{digits[6:]}'
        return phone

    df['phone'] = df['phone'].apply(format_phone)

    # Clean emails
    df['email'] = df['email'].str.strip().str.lower().fillna('Unknown')
    return df

# Validation
print(f"Rows before: {len(df)}")
df_clean = clean_contacts(df)
print(f"Rows after: {len(df_clean)}")

Example 2: Normalizing Dates
Prompt:
Standardize date formats in my dataset using Python.
Current formats:
- "01/15/2024"
- "2024-01-15"
- "Jan 15, 2024"
- "15-Jan-24"
Target format: YYYY-MM-DD
Handle invalid dates by setting to None.
Return a function that takes a date string and returns standardized format.
AI-Generated Code:
from dateutil import parser
import pandas as pd

def standardize_date(date_str):
    if pd.isna(date_str):
        return None
    try:
        dt = parser.parse(str(date_str))
        return dt.strftime('%Y-%m-%d')
    except (ValueError, OverflowError):
        return None
df['date_standardized'] = df['date'].apply(standardize_date)

Example 3: Removing Duplicates with Logic
Prompt:
Remove duplicate rows from my customer dataset using Python.
Duplicate logic:
- Consider duplicates if email matches
- Keep the row with the most recent "created_at" date
- If dates are equal, keep the first occurrence
Return cleaned dataframe and count of removed duplicates.
AI-Generated Code:
def remove_duplicates(df):
    original_count = len(df)

    # Sort by email and created_at (descending)
    df_sorted = df.sort_values(['email', 'created_at'],
                               ascending=[True, False])

    # Keep first occurrence (most recent)
    df_clean = df_sorted.drop_duplicates(subset=['email'],
                                         keep='first')

    removed_count = original_count - len(df_clean)
    print(f"Removed {removed_count} duplicates")
    return df_clean

Advanced Techniques
Chained Transformations
Instead of one massive prompt, chain multiple steps:
Step 1:
Clean the phone numbers in my dataset. [DETAILS]
Step 2:
Now standardize the email addresses. [DETAILS]
Step 3:
Finally, remove duplicates based on email. [DETAILS]
Conditional Logic
Transform the "status" field:
- If "active" or "ACTIVE" → "Active"
- If "inactive", "disabled", "suspended" → "Inactive"
- If "pending" or "new" → "Pending"
- All others → "Unknown"
Use Python with a mapping dictionary.
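The mapping-dictionary approach the prompt asks for might look like this sketch. The status values come from the rules above; lowercasing first lets "active" and "ACTIVE" hit the same key, and anything unmapped falls through to "Unknown":

```python
import pandas as pd

df = pd.DataFrame({"status": ["active", "ACTIVE", "disabled", "new", "archived"]})

STATUS_MAP = {
    "active": "Active",
    "inactive": "Inactive", "disabled": "Inactive", "suspended": "Inactive",
    "pending": "Pending", "new": "Pending",
}

# .map() returns NaN for unmapped values, which fillna turns into "Unknown"
df["status"] = df["status"].str.lower().map(STATUS_MAP).fillna("Unknown")
```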
Data Validation Prompts
After cleaning, validate my dataset:
- Check for remaining null values
- Verify email format (regex)
- Ensure phone numbers match pattern
- Confirm no duplicates exist
Return a validation report with:
- Total rows
- Issues found
- Percentage of clean data
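A validation report along these lines can be assembled in pandas. This is a minimal sketch: the column names, regexes, and the rough "one issue = one dirty row" percentage are illustrative assumptions, not a definitive implementation:

```python
import pandas as pd

def validation_report(df):
    """Build a simple data-quality report; columns and regexes are illustrative."""
    email_ok = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
    phone_ok = df["phone"].str.match(r"^\(\d{3}\) \d{3}-\d{4}$", na=False)
    issues = (
        int(df.isna().sum().sum())                    # remaining nulls
        + int((~email_ok).sum())                      # bad emails
        + int((~phone_ok).sum())                      # bad phones
        + int(df.duplicated(subset=["email"]).sum())  # duplicate emails
    )
    return {
        "total_rows": len(df),
        "issues_found": issues,
        "percent_clean": round(100 * (len(df) - issues) / len(df), 1),
    }

df = pd.DataFrame({
    "email": ["a@x.com", "not-an-email"],
    "phone": ["(555) 123-4567", "(555) 999-0000"],
})
report = validation_report(df)
```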
Best Practices
1. Start Small
Test prompts on a sample (100 rows) before processing the full dataset.
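In pandas, pulling that test sample is one line; a fixed random_state keeps the sample reproducible while you iterate on the prompt:

```python
import pandas as pd

# Hypothetical full dataset
df = pd.DataFrame({"value": range(1000)})

# Test transformations on a reproducible 100-row sample first
sample = df.sample(n=100, random_state=42)
```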
2. Preserve Original Data
Always keep a backup column or file before transformations.
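A backup column costs one assignment before the transformation touches the data (the phone formatting here is just an example transformation):

```python
import pandas as pd

df = pd.DataFrame({"phone": ["5551234567"]})

# Keep the untouched values in a backup column before transforming in place
df["phone_original"] = df["phone"]
df["phone"] = df["phone"].str.replace(
    r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", regex=True
)
```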
3. Log Everything
Track what changed, when, and why.
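One lightweight way to do this (a sketch, not the only option) is an in-memory change log that records what changed, when, and why for each transformation:

```python
from datetime import datetime, timezone

change_log = []

def log_change(column, before, after, reason):
    """Append one record per transformation: what changed, when, and why."""
    change_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "column": column,
        "before": before,
        "after": after,
        "reason": reason,
    })

# Example entry for a hypothetical email cleanup
log_change("email", " A@X.COM ", "a@x.com", "trimmed and lowercased")
```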
4. Validate Output
Never trust AI output blindly. Always validate:
- Row counts
- Data types
- Value ranges
- Format consistency
5. Iterate on Edge Cases
If AI misses an edge case, add it to your prompt:
Also handle:
- Phone numbers with extensions (e.g., "555-1234 x567")
- International formats (e.g., "+1-555-1234")
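After adding those edge cases to the prompt, the regenerated phone formatter might grow along these lines. This is a sketch under two assumptions: extensions look like "x567" or "ext 567", and "international" means a leading US country code; anything else is left untouched for manual review:

```python
import re

def format_phone(raw):
    """Format US numbers as (XXX) XXX-XXXX, keeping any extension (sketch)."""
    s = str(raw)
    # Peel off a trailing extension like "x567" or "ext. 567"
    m = re.search(r"(?:x|ext\.?)\s*(\d+)\s*$", s, flags=re.IGNORECASE)
    ext = m.group(1) if m else None
    if m:
        s = s[:m.start()]
    digits = re.sub(r"\D", "", s)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading US country code
    if len(digits) != 10:
        return raw           # leave unrecognized formats for manual review
    out = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return f"{out} x{ext}" if ext else out
```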
Common Pitfalls
1. Vague Descriptions
❌ "Clean my data"
✅ "Remove leading/trailing spaces from all string columns"
2. No Examples
❌ "Standardize dates"
✅ "Convert '01/15/2024' to '2024-01-15'"
3. Missing Edge Cases
❌ "Format phone numbers"
✅ "Format phone numbers, handling 7-digit, 10-digit, and international formats"
4. No Validation
❌ Run transformation and assume it worked
✅ Include validation checks in the prompt
Conclusion
AI-powered data cleaning is a game-changer. By writing clear, detailed prompts with examples and constraints, you can transform messy datasets in minutes instead of hours.
Key Takeaways:
- Describe current state and desired state clearly
- Provide concrete examples
- Specify constraints and edge cases
- Always validate output
- Iterate on prompts for better results
Bonus: Prompt Library
Remove Nulls:
Replace all null/NaN values in [COLUMN] with [VALUE]. Use Python pandas.
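The code this prompt typically produces is a one-line fillna; column name and fill value here are placeholders:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Austin", np.nan, "Boston"]})

# fillna replaces both None and NaN with the given value
df["city"] = df["city"].fillna("Unknown")
```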
Merge Columns:
Create a new column [NAME] by combining [COL1] and [COL2] with [SEPARATOR].
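For string columns, str.cat handles the combination; the column names and separator below are example placeholders:

```python
import pandas as pd

df = pd.DataFrame({"first": ["Alice", "Bob"], "last": ["Smith", "Jones"]})

# str.cat joins two string columns element-wise with a separator
df["full_name"] = df["first"].str.cat(df["last"], sep=" ")
```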
Split Column:
Split [COLUMN] into [COL1] and [COL2] using [DELIMITER]. Handle cases where delimiter is missing.
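A sketch of the missing-delimiter handling the prompt asks for: str.split with expand=True yields None in the second column when the delimiter is absent, which fillna can then normalize (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"full_name": ["Alice Smith", "Cher"]})

# n=1 splits on the first delimiter only; rows without it get None
parts = df["full_name"].str.split(" ", n=1, expand=True)
df["first"] = parts[0]
df["last"] = parts[1].fillna("")
```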
Type Conversion:
Convert [COLUMN] from [TYPE1] to [TYPE2]. Handle invalid values by [STRATEGY].
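For string-to-number conversion, one common strategy for invalid values is pd.to_numeric with errors="coerce"; the column and values below are placeholders:

```python
import pandas as pd

df = pd.DataFrame({"price": ["19.99", "N/A", "5"]})

# errors="coerce" turns unparseable values into NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")
```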