Gathering the Data
Identify Sources
Challenge
When I first started this project, I assumed there might only be a dozen militaria websites out there. Collecting felt like such a niche hobby that I didn’t expect much data to work with. Instead, I discovered a surprisingly large and diverse network of dealers — some selling broad ranges of militaria, others specializing in just one thing, like flags, medals, or even cookware. Many of these shops were small, often run by a single person, which meant product data varied wildly in quality and format. That inconsistency quickly became one of the biggest hurdles.
Solution
To keep the data focused and manageable, I chose to target independent militaria stores instead of giant platforms like eBay, Amazon, or Facebook. While those had scale, they also came with too much noise, scraping complexity, and legal grey areas. By concentrating on smaller shops, I could build a dataset that was both cleaner and more meaningful. Over time, I grew the list to about 100 trusted websites, each catering to different corners of the collecting community.
Result
What started as a hunch turned into a mapped ecosystem of militaria dealers, from mom-and-pop shops to niche specialists. This gave me the diverse, real-world data I needed — messy, but relevant — and provided the foundation to start structuring and analyzing the market.
Once I had the right data sources, the next step was to design a scraper that could handle their diversity.
(Photo: One of my favorite shops before it closed — exactly the kind of independent source this project was built on.)
Build the Scraper
Challenge
Every militaria site does things differently — messy HTML, hidden price structures, inconsistent “sold” markers, and image galleries that never follow the same rules. On top of that, I needed to track changes over time instead of overwriting data. The real challenge was building something that could handle 100+ unique layouts without me constantly rewriting code.
Solution
I built a Python scraper framework that runs on AWS EC2 and is driven by JSON configs. Instead of hardcoding, I define selectors, price rules, and availability logic per site. Each site gets its own config file, which keeps the core logic clean and safe.
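To make that concrete, here is a minimal sketch of what config-driven extraction can look like. The config fields (product_tile, title_selector, price_selector, sold_marker) and the HTML are illustrative stand-ins, not the production schema:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical per-site config; the real schema differs, this only
# illustrates keeping selectors out of the scraper code itself.
EXAMPLE_CONFIG = json.loads("""
{
  "product_tile": "div.product",
  "title_selector": "h2.title",
  "price_selector": "span.price",
  "sold_marker": "SOLD"
}
""")

# Toy listing page standing in for a dealer site.
EXAMPLE_HTML = """
<div class="product">
  <h2 class="title">M43 Field Cap</h2>
  <span class="price">$120.00</span>
</div>
<div class="product">
  <h2 class="title">Trench Art Lighter</h2>
  <span class="price">SOLD</span>
</div>
"""

def extract_tiles(html: str, cfg: dict) -> list[dict]:
    """Apply one site's selectors to a listing page and return raw product tiles."""
    soup = BeautifulSoup(html, "html.parser")
    tiles = []
    for tile in soup.select(cfg["product_tile"]):
        title = tile.select_one(cfg["title_selector"])
        price = tile.select_one(cfg["price_selector"])
        price_text = price.get_text(strip=True) if price else ""
        tiles.append({
            "title": title.get_text(strip=True) if title else "",
            "price": price_text,
            "available": cfg["sold_marker"] not in price_text,
        })
    return tiles

print(extract_tiles(EXAMPLE_HTML, EXAMPLE_CONFIG))
```

Adding a dealer then means writing another config like this one; the extraction function itself never changes.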
The scraper went through three major refactors as I learned what worked and what didn’t:
Started on a Raspberry Pi → moved to AWS for stability and storage.
Replaced fragile eval() parsing → JSON config system that’s secure and maintainable.
Redesigned the database → now stores historical price/availability changes instead of overwriting them.
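The third refactor is easiest to show at the database level: rather than overwriting a product row, each observed state is appended as a new row. A rough sketch with psycopg2, using a hypothetical price_history table and made-up connection details:

```python
import psycopg2

# Hypothetical connection settings and table, shown only to illustrate
# append-only history instead of overwriting a single product row.
conn = psycopg2.connect(
    host="example-rds-endpoint", dbname="militaria",
    user="scraper", password="***",
)

def record_observation(product_id: int, price: float, available: bool) -> None:
    """Append the latest observed state instead of replacing the old one."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO price_history (product_id, price, available, observed_at)
            VALUES (%s, %s, %s, NOW())
            """,
            (product_id, price, available),
        )

record_observation(42, 120.00, True)
```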
I also tuned performance so the system could run constantly without burning money or stressing dealer servers:
Reduced server load by only pulling full details when changes were detected.
Improved image/price extraction across dozens of layouts.
Scheduled lightweight checks every 10 minutes, plus full refreshes every 12 hours.
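The change-detection point boils down to fingerprinting the cheap listing-page data and only requesting the full product page when the fingerprint moves. A simplified sketch (the field names follow the earlier example config, not the real schema):

```python
import hashlib

def tile_fingerprint(tile: dict) -> str:
    """Cheap fingerprint of the data already visible on the listing page."""
    raw = f"{tile['title']}|{tile['price']}|{tile['available']}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def needs_full_fetch(tile: dict, stored_fingerprint: str | None) -> bool:
    """Only hit the full product page when the listing-level data has changed."""
    return stored_fingerprint is None or tile_fingerprint(tile) != stored_fingerprint
```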
Result
The current system processes 350k+ products across 100+ sites. Adding a new site means writing a JSON config, not touching the scraper code. The framework is now flexible, efficient, and production-ready — the foundation for everything else in this project.
With the framework in place, I turned to optimizing how it actually worked day-to-day.
Optimizing the Workflow
The diagram shows the evolution of my scraper from a brute-force approach to an optimized, efficient workflow.
Old method (left): The system would pull every product URL from the database and then check each one individually on the dealer’s site. This guaranteed accuracy but was painfully inefficient: thousands of unnecessary requests were made just to confirm that items hadn’t changed.
New method (right): Instead of starting from the database, the scraper now begins at the live site. It iterates through product listing pages, extracting all visible items. Once no new products are found, it stops scraping further pages—avoiding wasted work. The collected list is then compared against database entries to identify:
New products → inserted into the database.
Still-listed products → availability refreshed, price/metadata updated.
Missing products → marked as sold/unavailable.
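In code, that comparison is just set arithmetic over product URLs. A minimal sketch of the idea (the real version runs against the database rather than in-memory sets):

```python
def diff_listings(live_urls: set[str], stored_urls: set[str]) -> dict[str, set[str]]:
    """Split live vs. stored URLs into the three update buckets."""
    return {
        "new": live_urls - stored_urls,           # insert into the database
        "still_listed": live_urls & stored_urls,  # refresh price/metadata
        "missing": stored_urls - live_urls,       # mark as sold/unavailable
    }

buckets = diff_listings(
    live_urls={"/p/helmet-m40", "/p/medal-ek2"},
    stored_urls={"/p/medal-ek2", "/p/flag-km"},
)
# -> new: {"/p/helmet-m40"}, still_listed: {"/p/medal-ek2"}, missing: {"/p/flag-km"}
```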
Why it matters
This shift cut out tens of thousands of redundant lookups per run. By focusing on what the site currently shows, the scraper is faster, lighter on both my servers and the dealers’ websites, and far easier to maintain. Logging is still in place at every step, so debugging and auditing remain straightforward, but the workflow is now streamlined enough to run continuously at scale.
System at a Glance
Think of the scraper as a conveyor belt: raw pages go in, structured knowledge comes out. Each step has a single job, and together they turn messy dealer websites into clean, usable data.
Start with the site itself
The scraper doesn’t rely on assumptions. It goes straight to the dealer’s live product pages, just like a customer would, and starts reading what’s actually there.
Turn pages into product lists
On each page, it identifies product tiles (the little blocks with title, price, and a thumbnail). This gives a quick snapshot of what’s being sold right now.
Compare to memory (the database)
The scraper cross-checks those live tiles with what’s already stored:
If something’s new → it gets added.
If something changed → it gets updated.
If something disappeared → it’s marked as sold.
Zoom in on details
When needed, the program clicks into the full product page to collect richer info like descriptions, multiple photos, or metadata.
Store it cleanly
Every product is normalized: prices into a single currency, titles cleaned, images uploaded into structured folders. The result is a dataset that looks like it all came from one source, even though it didn’t.
Keep a log of everything
Nothing happens in the dark — every step is logged. That means I can always answer: What changed? When? Why?
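The “store it cleanly” step is where per-site messiness gets flattened. The sketch below captures the spirit: pull a number out of whatever price string a dealer uses, convert it to one currency, and tidy the title. The exchange rates here are placeholders, not what the pipeline actually uses:

```python
import re

# Illustrative exchange rates only; the real pipeline uses current rates.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}
CURRENCY_SIGNS = {"$": "USD", "€": "EUR", "£": "GBP"}

def normalize_price(raw: str) -> float | None:
    """Turn strings like '€1,250.00' or '$120' into a USD amount."""
    currency = next((code for sign, code in CURRENCY_SIGNS.items() if sign in raw), "USD")
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if not match:
        return None  # e.g. "SOLD" or "price on request"
    amount = float(match.group().replace(",", ""))
    return round(amount * RATES_TO_USD[currency], 2)

def normalize_title(raw: str) -> str:
    """Collapse whitespace so titles from different dealers look consistent."""
    return re.sub(r"\s+", " ", raw).strip()

print(normalize_price("€1,250.00"))            # -> 1350.0
print(normalize_title("  WWII  M1 Helmet  "))  # -> "WWII M1 Helmet"
```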
Balancing Freshness & Efficiency
Challenge
My early scraper design tried to re-crawl everything on every run. It worked, but it was painfully slow, burned unnecessary AWS resources, and put way too much strain on dealer websites.
Solution
I redesigned the update cycle to balance freshness with efficiency:
Lightweight checks – Every 10 minutes, the scraper quickly checks front pages for new products or changes.
Full refreshes – Every 12 hours, it does a deep crawl to catch any missed updates.
Change detection – It only pulls full product details when differences are detected, instead of downloading everything each time.
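Stripped down to the standard library, the cycle looks roughly like this (in production these run as scheduled jobs on EC2, and light_check/full_refresh are stand-ins for the real routines):

```python
import time

LIGHT_CHECK_INTERVAL = 10 * 60        # seconds: quick front-page scan
FULL_REFRESH_INTERVAL = 12 * 60 * 60  # seconds: deep crawl of every site

def light_check() -> None:
    """Scan listing pages and fetch details only for changed items."""
    ...

def full_refresh() -> None:
    """Deep crawl to catch anything the light checks missed."""
    ...

def run_forever() -> None:
    last_full = 0.0
    while True:
        light_check()
        if time.time() - last_full >= FULL_REFRESH_INTERVAL:
            full_refresh()
            last_full = time.time()
        time.sleep(LIGHT_CHECK_INTERVAL)
```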
Result
This change cut scraping time from hours to minutes, reduced AWS costs, and made the system friendlier to dealer servers. The data stays fresh enough for real-time insights without overloading infrastructure or source sites.
(Above) Most new listings on Bevo Militaria appear in the early morning (UTC), with activity peaking early in the week. This pattern is valuable for collectors looking to spot rare items quickly or track pricing trends in real time.
Adapting to Edge Cases
Challenge
Not every site played nice. Some relied heavily on JavaScript, others had unpredictable layouts, and my early approach of brute-forcing or using eval() for site-specific logic was fragile and insecure. On top of that, my first setup overwrote old data whenever prices or descriptions changed — which meant losing valuable historical information.
Solution
I reworked the system to handle these edge cases in a safer and more maintainable way:
Dynamic websites – Added Selenium support for JavaScript-heavy pages.
Site-specific configs – Built a JSON configuration format for selectors, availability rules, and post-processing. Much safer and easier to maintain than the old approach.
Historical tracking – Updated the database to store past prices and availability instead of overwriting them.
Smarter availability checks – Instead of brute-forcing every URL, the scraper now checks just the first few pages for new products and uses elimination to mark older ones as unavailable.
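For the JavaScript-heavy sites, the fix is to render the page in a headless browser before handing the HTML to the usual parsing path. A rough sketch with Selenium’s headless Chrome; the requires_js flag is an illustrative config field, not the real one:

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_html(url: str, requires_js: bool) -> str:
    """Plain HTTP for static sites; headless Chrome when the site needs JavaScript."""
    if not requires_js:
        return requests.get(url, timeout=30).text
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# Either way, downstream code sees the same thing: parsed HTML.
soup = BeautifulSoup(fetch_html("https://example.com/militaria", requires_js=False), "html.parser")
```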
Result
These changes made the scraper more secure, faster, and cheaper to run. What used to be fragile and costly is now a system that can adapt to dynamic sites, preserve historical data, and stay efficient at scale.
Skills and Technologies
Data Collection & Processing
Python Scraper Framework – Custom-built, JSON-config driven scrapers for 100+ dealer sites.
Libraries: BeautifulSoup, Selenium, pandas, psycopg2, logging.
Historical Tracking – Database schema designed to store product changes over time.
Machine Learning & AI
Custom ML Model (scikit-learn) – TF-IDF + Logistic Regression classifier trained on 67k labeled products.
Confidence Thresholding – Per-class thresholds with calibrated probabilities.
OpenAI GPT API Fallback – Structured prompts for reliable classification when ML confidence is low.
Streamlit Labeling Tool – Built UI for batch labeling, human-in-the-loop corrections, and dataset growth.
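As a rough illustration of how those ML pieces fit together (placeholder data, labels, and threshold, not the trained model’s actual values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in dataset; the real model is trained on ~67k labeled products.
titles = ["WWII German M40 helmet", "US Army M1 helmet liner", "RAF pilot wings badge"]
labels = ["Germany", "United States", "United Kingdom"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(titles, labels)

CONFIDENCE_THRESHOLD = 0.80  # placeholder; the real system uses per-class thresholds

def classify(title: str) -> str | None:
    """Return a label when the model is confident, else None (defer to the GPT fallback)."""
    probs = model.predict_proba([title])[0]
    best = probs.argmax()
    if probs[best] >= CONFIDENCE_THRESHOLD:
        return model.classes_[best]
    return None  # low confidence -> route to the OpenAI fallback
```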
Web Development & Applications
Django Web App (milivault.com) – Search, filter, and user collections powered by AWS + PostgreSQL.
Streamlit Tools – Labeling, SQL explorer, and batch confirmation UIs for internal workflows.
Render – Hosting for the Django site, integrated with AWS backend services.
Cloud Infrastructure
AWS EC2 – Runs scrapers and scheduled jobs.
AWS RDS (PostgreSQL) – Central database for 350k+ products and structured metadata.
AWS S3 – Storage for millions of product images.
IAM Roles & Security Groups – Configured for secure service-to-service communication.
Database Management
PostgreSQL – Optimized for cleaning, batch updates, and historical queries.
Schema Design – Structured to support classification fields (nation, conflict, item type) and user confirmations.
Query Optimization – Tuned for large-scale SELECT and UPDATE operations.
Version Control & Deployment
GitHub – Version control, collaboration, and issue tracking.
Secure Deployments – Automated deployments via SSH, with logging and monitoring for stability.