🚀 Executive Summary
TL;DR: This technical blog post addresses the time-consuming problem of manually matching expense receipts to credit card transactions. It provides a Python script utilizing OCR (Tesseract/pytesseract) and Pandas to automate the extraction of data from receipts and match it against transaction rows, significantly reducing manual effort.
🎯 Key Takeaways
- The solution leverages `pytesseract` for Optical Character Recognition (OCR) to extract raw text from various receipt image formats (JPG, PNG, etc.).
- Pandas is used to load and clean transaction data from CSV files, ensuring ‘Amount’ is numeric (absolute value) and ‘Date’ is a proper datetime object for reliable processing.
- The core matching logic identifies receipts by searching for the transaction’s formatted amount within the OCR-extracted text, with a recommendation for adding date-based matching for increased robustness.
Automate Expense Receipt Matching with Transaction Rows
Alright, let’s talk about a real time-sink: manually matching expense receipts to credit card statements. For a while, I was burning a couple of hours every week just cross-referencing PDFs and JPEGs with a transaction CSV for our team’s cloud-spend. It’s tedious, error-prone, and frankly, a terrible use of a senior engineer’s time. I built this Python script to reclaim those hours, turning a manual chore into a quick, automated process. This is the kind of automation that frees you up to solve actual problems.
Prerequisites
Before we dive in, make sure you have the following ready:
- Python 3.x installed on your system.
- Basic familiarity with the Python Pandas library for data manipulation.
- A folder containing your receipt images (e.g., JPG, PNG).
- A CSV export of your transaction data (columns like Date, Amount, Description).
- The Tesseract OCR engine installed. This is a system-level dependency that the Python library will call. You can find installers for your OS on the official Tesseract GitHub page.
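Since Tesseract is the one prerequisite that pip cannot install for you, it can help to confirm it is reachable before running anything else. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are made up for illustration, and it assumes the binary is called `tesseract` and lives on your PATH.

```python
import shutil
import subprocess


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


def tesseract_version(binary_path):
    """Return the first line of `tesseract --version`, or None if the call fails."""
    try:
        result = subprocess.run(
            [binary_path, "--version"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Fall back to stderr in case a build prints its banner there.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it before running the OCR step.")
    else:
        print(f"Found Tesseract at {path}: {tesseract_version(path)}")
```

If the binary is installed somewhere that is not on PATH, pytesseract lets you point at it explicitly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.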
The Step-by-Step Guide
Step 1: Setting Up Your Project
I’ll skip the standard virtualenv setup since you likely have your own workflow for that. Just make sure you’re in a clean project directory. You’ll need to install a few libraries for this to work. In your terminal, run `pip install pandas pytesseract python-dotenv`: pandas for data handling, pytesseract for the OCR (it pulls in Pillow, which provides the `PIL` import used in Step 2), and python-dotenv to manage any configuration variables you might need later.
Pro Tip: I always recommend keeping your input files organized. Create a structure like this:
/your_project_folder/
    script.py (Our script)
    transactions.csv (Your bank data)
    /receipts/ (A folder for all receipt images)
    /processed/ (An empty folder for matched receipts)
Step 2: Extracting Text from Receipts with OCR
First, we need to read the receipts. Our goal is to convert every image into raw text. We’ll use the pytesseract library, which is a Python wrapper for Google’s Tesseract-OCR Engine. We’ll create a function that loops through all files in our receipts directory, runs OCR on them, and stores the extracted text in a dictionary, mapping the filename to its text content.
    import os
    import pytesseract
    from PIL import Image

    # File extensions treated as receipt images (leading dot avoids odd matches).
    IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')

    def ocr_receipts_in_folder(folder_path):
        """
        Scans all image files in a directory and returns their text content.
        """
        receipt_texts = {}
        print(f"Scanning receipts in {folder_path}...")
        for filename in os.listdir(folder_path):
            if filename.lower().endswith(IMAGE_EXTENSIONS):
                try:
                    file_path = os.path.join(folder_path, filename)
                    # The context manager closes the image file after OCR.
                    with Image.open(file_path) as image:
                        text = pytesseract.image_to_string(image)
                    if text:
                        receipt_texts[filename] = text
                        print(f" - Successfully extracted text from {filename}")
                except Exception as e:
                    print(f" - Could not process {filename}. Error: {e}")
        return receipt_texts
Step 3: Loading and Cleaning Transaction Data
Next, we’ll use Pandas to load our bank transaction data. The key here is to clean it up for reliable matching. We’ll load the CSV, make sure the ‘Amount’ column is a numeric type (and positive, since expenses are often negative in bank statements), and ensure the ‘Date’ column is a proper datetime object.
    import pandas as pd

    def load_transactions(csv_path):
        """
        Loads and cleans the transaction data from a CSV file.
        """
        df = pd.read_csv(csv_path)
        # Ensure column names are clean
        df.columns = [col.strip().lower() for col in df.columns]
        # Clean up the amount column
        df['amount'] = pd.to_numeric(df['amount'].astype(str).str.replace(r'[$,]', '', regex=True))
        df['amount'] = df['amount'].abs()  # Use absolute value for matching
        # Convert date column to datetime objects
        df['date'] = pd.to_datetime(df['date'])
        print("Transactions loaded and cleaned successfully.")
        return df
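The regex above covers dollar signs and thousands separators. Some exports also write negatives in accounting style, such as `(45.00)`, which `pd.to_numeric` would reject. Here is a hedged sketch of a per-value parser for those cases; `parse_amount` is a hypothetical helper, not part of the script above, and it assumes a dot is the decimal separator.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(raw):
    """Convert a statement amount string to a positive Decimal, or None if unparseable."""
    text = str(raw).strip()
    # Accounting exports sometimes write negatives as (12.34); the sign is
    # discarded anyway, so the parentheses are simply stripped.
    text = text.strip("()")
    # Keep only digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return abs(Decimal(cleaned))
    except InvalidOperation:
        return None


if __name__ == "__main__":
    for sample in ["$1,234.56", "(45.00)", "-12.5", "n/a"]:
        print(sample, "->", parse_amount(sample))
```

Values that come back as `None` would need to be dropped or flagged for review before matching, since the matching step formats every amount with two decimals.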
Step 4: The Matching Logic
This is the core of our script. We’ll iterate through each transaction row. For each one, we’ll search through all the receipt texts we extracted. A match is found if the transaction amount appears somewhere in the receipt’s text.
Pro Tip: Matching on the amount alone can lead to false positives if you have multiple transactions for the same amount. A more robust approach, which I use in my production setups, is to also check whether a date found on the receipt falls within a few days of the transaction date. For simplicity, we’ll stick to amount-matching here, but it’s an important enhancement to consider.
    def match_receipts_to_transactions(transactions_df, receipt_texts):
        """
        Matches receipts to transactions based on the amount.
        """
        matches = []
        unmatched_transactions = []
        # Work on a copy so the caller's dict is left untouched
        available_receipts = receipt_texts.copy()
        for index, row in transactions_df.iterrows():
            transaction_amount = f"{row['amount']:.2f}"
            found_match = False
            # Search for the amount in available receipts
            for filename, text in available_receipts.items():
                if transaction_amount in text:
                    matches.append({
                        'transaction_date': row['date'],
                        'transaction_description': row['description'],
                        'transaction_amount': row['amount'],
                        'matched_receipt': filename
                    })
                    # Remove the receipt from the pool to prevent re-matching.
                    # Deleting mid-loop is safe only because we break right after.
                    del available_receipts[filename]
                    found_match = True
                    break  # Move to the next transaction
            if not found_match:
                unmatched_transactions.append(row)
        print(f"Matching complete. Found {len(matches)} matches.")
        return pd.DataFrame(matches), pd.DataFrame(unmatched_transactions)
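The date-based check mentioned in the Pro Tip could look something like the sketch below. It is an illustration under assumptions, not the production logic referred to above: the names `DATE_PATTERNS`, `extract_dates`, and `date_within_window` are hypothetical, and it assumes receipts print dates as either `YYYY-MM-DD` or US-style `MM/DD/YYYY`.

```python
import re
from datetime import date, datetime, timedelta

# Assumed receipt date layouts; extend this list to fit your own receipts.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "%m/%d/%Y"),
]


def extract_dates(text):
    """Return every date in the OCR text that parses under one of the assumed layouts."""
    found = []
    for pattern, layout in DATE_PATTERNS:
        for token in pattern.findall(text):
            try:
                found.append(datetime.strptime(token, layout).date())
            except ValueError:
                # Looks like a date but is not a valid one (e.g. 13/45/2024); skip it.
                continue
    return found


def date_within_window(text, transaction_date, days=3):
    """True if any date found in the text is within `days` of the transaction date."""
    window = timedelta(days=days)
    return any(abs(d - transaction_date) <= window for d in extract_dates(text))


if __name__ == "__main__":
    sample = "CORNER CAFE 03/14/2024 TOTAL 12.50"
    print(date_within_window(sample, date(2024, 3, 15)))
```

To combine it with the amount check, the condition in the inner loop would become something like `transaction_amount in text and date_within_window(text, row['date'].date())`, since the cleaned `date` column holds pandas timestamps rather than plain dates.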
Step 5: Running the Script and Reviewing Output
Finally, let’s tie it all together in a main execution block. This will call our functions in order and save the results to new CSV files—one for the successful matches and one for the transactions that still need a receipt. This gives you a clean to-do list.
    if __name__ == "__main__":
        RECEIPT_FOLDER = 'receipts'
        TRANSACTION_FILE = 'transactions.csv'

        # Step 1: Process Receipts
        receipt_data = ocr_receipts_in_folder(RECEIPT_FOLDER)

        # Step 2: Process Transactions
        transactions = load_transactions(TRANSACTION_FILE)

        # Step 3: Match them
        matched_df, unmatched_df = match_receipts_to_transactions(transactions, receipt_data)

        # Step 4: Save results
        if not matched_df.empty:
            matched_df.to_csv('matched_expenses.csv', index=False)
            print("Saved matched expenses to 'matched_expenses.csv'")
        if not unmatched_df.empty:
            unmatched_df.to_csv('unmatched_transactions.csv', index=False)
            print("Saved unmatched transactions to 'unmatched_transactions.csv'")
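The project layout from Step 1 includes a `processed` folder for matched receipts, but the script as written never moves anything into it. One way to close that loop is sketched below. `move_matched_receipts` is a hypothetical helper, and skipping files that are missing or already present in the target folder is an assumption about the behavior you would want.

```python
import shutil
from pathlib import Path


def move_matched_receipts(filenames, receipt_folder="receipts", processed_folder="processed"):
    """Move each matched receipt file into the processed folder; return the names moved."""
    source_dir = Path(receipt_folder)
    target_dir = Path(processed_folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in filenames:
        source = source_dir / name
        target = target_dir / name
        # Skip files that are missing or would overwrite an earlier receipt.
        if not source.is_file() or target.exists():
            continue
        shutil.move(str(source), str(target))
        moved.append(name)
    return moved
```

After the matches are saved, a call such as `move_matched_receipts(matched_df['matched_receipt'])` would leave only the still-unmatched images in `receipts` for the next run.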
Common Pitfalls
Here is where I usually mess up, or where I see new devs on my team get stuck:
- Imperfect OCR: Tesseract is powerful, but it’s not magic. A blurry photo, a weird font, or a crumpled receipt can result in garbage text. That’s why the output of this script should always be a “draft” for a human to quickly review, not the final word.
- Format Mismatches: The number one bug is a format mismatch. Your bank statement might show `$1,234.56` but the receipt text just has `1234.56`. My cleaning function handles the basics, but be prepared to tweak the string replacement and numeric conversion logic based on your specific data sources.
- Overly Strict Matching: Don’t try to match the full transaction description like “UBER TRIP 5XFGH”. The receipt will never contain that. Stick to universal data points like amount and date. They are far more reliable.
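The format-mismatch pitfall cuts both ways: a plain substring check misses `1,234.56` when it searches for `1234.56`, and it also finds `5.00` inside `15.00`. A rough sketch of a more tolerant check follows. The helpers `amount_pattern` and `amount_in_text` are hypothetical, and the sketch assumes non-negative amounts with a dot as the decimal separator and optional comma thousands separators.

```python
import re


def amount_pattern(amount):
    """Build a regex matching an amount with or without thousands separators."""
    whole, cents = f"{amount:.2f}".split(".")
    # Allow an optional comma before each group of three digits: 1234 or 1,234.
    digits = []
    for position, digit in enumerate(whole):
        remaining = len(whole) - position
        if position > 0 and remaining % 3 == 0:
            digits.append(",?")
        digits.append(digit)
    # The lookarounds stop 5.00 from matching inside 15.00 or 5.001.
    return re.compile(r"(?<![\d.,])" + "".join(digits) + r"\." + cents + r"(?!\d)")


def amount_in_text(amount, text):
    """True if the amount appears in the text as a standalone number."""
    return amount_pattern(amount).search(text) is not None


if __name__ == "__main__":
    print(amount_in_text(1234.56, "AMOUNT DUE 1,234.56"))
    print(amount_in_text(5.00, "TOTAL 15.00"))
```

In the matching function, `amount_in_text(row['amount'], text)` could stand in for the `transaction_amount in text` test. It does not help with OCR misreads such as a `5` scanned as an `S`, so the human review step still applies.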
Conclusion
This script isn’t a silver bullet, but it transforms a multi-hour manual task into a 15-minute review session. It’s a classic DevOps win: automate the repetitive, low-value work so you can focus on engineering challenges. You can easily extend this by scheduling it with a cron job (e.g., `0 2 * * 1 python3 script.py` to run it every Monday at 2 AM) or by adding more sophisticated matching logic. Feel free to adapt this for your own workflows, and ping me if you have any questions.
🤖 Frequently Asked Questions
❓ How does the script automate expense receipt matching?
The script automates matching by first using `pytesseract` to perform OCR on receipt images, extracting their text content. It then loads and cleans transaction data from a CSV using Pandas, and finally matches transactions to receipts by searching for the transaction’s amount within the extracted receipt text.
❓ How does this compare to alternatives for expense management?
Compared to manual matching, this script turns a multi-hour task into a 15-minute review. It isn’t a silver bullet, and commercial expense management systems are likely more comprehensive, but it offers a customizable, open-source way to automate repetitive, low-value work.
❓ What are common implementation pitfalls when setting up this automation?
Common pitfalls include imperfect OCR results from blurry or poorly formatted receipts, format mismatches in transaction data (e.g., currency symbols or date formats), and overly strict matching logic that might miss valid connections. It’s crucial to review the output and adapt cleaning functions.