First Experience with Anti-Crisis Data Engineering Measures

Mikhail
A few days ago I ran into a challenging task: processing the attributes of a site catalog, that is, the characteristics users typically filter by. For example, when you compare products on a marketplace, you usually see two items side by side with their characteristic fields. The raw data amounted to 5,000+ entries. The task was to work out which attribute characteristics to use for comparison. The site is a substantial e-commerce store, with plenty of categories and product cards.

The Problem

I started exporting current attributes and discovered they were not clustered into groups at all — essentially a massive pile of raw data. The only known information was a column with attribute families.

I tried formulas and scripts for an initial clustering, planning to then filter by category and remove the unnecessary attributes, but it did not work.

Time Pressure

At a certain point, time started running out. I decided to take a risk and connected AI to Google Sheets via API. Yes, data scientists will judge me for this — my apologies.

The Idea

What if I take not one LLM, but two, create 2 database copies, and have one LLM process data from the beginning while another processes from the end? Then elegantly merge everything and write a script to check for discrepancies after AI processing.

Spoiler: it worked!

After preliminary tests, two candidates were selected:

  • Claude Sonnet 4.5
    • Best for complex reasoning tasks
    • Higher cost but more reliable output
  • DeepSeek V3.2-Exp (Thinking Mode)
    • Excellent throughput and rate limits
    • Cost-effective for bulk processing

The processing pipeline followed these steps:

  1. Export raw attribute data from the database
  2. Split dataset into two halves
    1. First half: rows 1–2624 for Claude
    2. Second half: rows 2625–5247 for DeepSeek
  3. Configure API parameters and rate limiting
  4. Run parallel processing overnight
  5. Merge results and validate discrepancies
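Step 2 can be sketched as a small helper. This is a minimal sketch, not the script actually used: `splitInHalves` is a hypothetical name, and rows are assumed to be numbered from 1, as in a spreadsheet.

```javascript
// Hypothetical helper: split a 1-based row range into two contiguous halves.
// When the row count is odd, the first half gets the extra row.
function splitInHalves(totalRows) {
  if (!Number.isInteger(totalRows) || totalRows < 2) {
    throw new RangeError("totalRows must be an integer >= 2");
  }
  const middle = Math.ceil(totalRows / 2);
  return {
    first: { start: 1, end: middle },
    second: { start: middle + 1, end: totalRows },
  };
}

const halves = splitInHalves(5247);
console.log(halves.first);  // { start: 1, end: 2624 }
console.log(halves.second); // { start: 2625, end: 5247 }
```

With 5,247 rows this reproduces the two ranges listed in the steps above.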

The Hybrid Approach

The beauty was that Claude processed attributes from row 1 upward, while DeepSeek worked backwards from the last row (about 5k) toward row 1, so the two met roughly halfway. This is what made it possible to solve the task under time pressure.
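One way to coordinate two workers that approach each other is a pair of cursors that claim batches from opposite ends until they cross. The sketch below is an assumption about how such a meeting point could be handled, not the original script; `createTwoEndedQueue` and its methods are hypothetical names.

```javascript
// Hypothetical sketch: two cursors claim batches from opposite ends of a
// 1-based row range until they meet, so no row is claimed twice.
function createTwoEndedQueue(totalRows, batchSize) {
  let low = 1;          // next unclaimed row from the start
  let high = totalRows; // next unclaimed row from the end

  return {
    // Claim the next batch from the start; returns null when nothing is left.
    takeFromStart() {
      if (low > high) return null;
      const end = Math.min(low + batchSize - 1, high);
      const batch = { start: low, end };
      low = end + 1;
      return batch;
    },
    // Claim the next batch from the end; returns null when nothing is left.
    takeFromEnd() {
      if (low > high) return null;
      const start = Math.max(high - batchSize + 1, low);
      const batch = { start, end: high };
      high = start - 1;
      return batch;
    },
  };
}

const queue = createTwoEndedQueue(100, 30);
console.log(queue.takeFromStart()); // { start: 1, end: 30 }
console.log(queue.takeFromEnd());   // { start: 71, end: 100 }
console.log(queue.takeFromStart()); // { start: 31, end: 60 }
console.log(queue.takeFromEnd());   // { start: 61, end: 70 }
console.log(queue.takeFromStart()); // null
```

Compared with a fixed split, this lets the faster model take more rows, at the cost of needing shared state between the two workers.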

Implementation

What Went Wrong

Clustering consumed enormous amounts of tokens, burning through my funds.

Root Cause

The initial data was too sparse for clustering. Besides my three-level category tree and attribute family, there was nothing else.

DeepSeek and Claude frequently hallucinated, returning category values that did not exactly match the existing ones. Setting temperature to 0.1 helped, but processing speed dropped significantly.
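A cheap guard against this kind of hallucination is to accept a category only when it exactly matches one from the catalog. This is a minimal sketch under that assumption; `checkCategories` and the sample categories are made up for illustration.

```javascript
// Hypothetical guard: accept a model's category only if it exactly matches
// one of the known catalog categories (after trimming whitespace).
function checkCategories(answers, allowedCategories) {
  const allowed = new Set(allowedCategories);
  const accepted = [];
  const rejected = [];
  for (const answer of answers) {
    const category = String(answer.category ?? "").trim();
    if (allowed.has(category)) {
      accepted.push({ ...answer, category });
    } else {
      rejected.push(answer); // candidate for a retry or manual review
    }
  }
  return { accepted, rejected };
}

// Made-up sample data for illustration only.
const allowedCategories = ["Laptops", "Monitors", "Keyboards"];
const answers = [
  { row: 1, category: "Laptops" },
  { row: 2, category: " Monitors " },
  { row: 3, category: "Laptop computers" }, // not in the list
];
const result = checkCategories(answers, allowedCategories);
console.log(result.accepted.length); // 2
console.log(result.rejected[0].row); // 3
```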

Increasing the batch size quickly ran into Claude's rate limits (DeepSeek handled it better). The cause was a misconfigured API delay.
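A fixed delay is easy to get wrong, so one option is to adapt it: back off sharply after a rate-limit response and relax slowly after successes. The sketch below only illustrates that idea; the multipliers, bounds, and the name `nextDelay` are assumptions, not values from my setup.

```javascript
// Hypothetical adaptive delay: double the wait after a rate-limit error,
// shrink it by 10% after a success, and stay within [minMs, maxMs].
function nextDelay(currentMs, wasRateLimited, minMs = 500, maxMs = 60000) {
  const proposed = wasRateLimited ? currentMs * 2 : currentMs * 0.9;
  return Math.round(Math.min(maxMs, Math.max(minMs, proposed)));
}

let delay = 800;
delay = nextDelay(delay, true);  // 1600
delay = nextDelay(delay, true);  // 3200
delay = nextDelay(delay, false); // 2880
console.log(delay); // 2880
```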

The Pivot

I had to quickly rethink the approach. Decision: hand the attributes to the LLMs without clustering. The models would only analyze a few data columns and return a final score on a 5-point scale with minimal comments.
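With the task reduced to a score per row, the reply becomes easy to validate mechanically. The sketch below assumes the model is asked to reply with a JSON array of `{ row, score, comment }` objects; that reply format and the name `parseScores` are assumptions for illustration.

```javascript
// Hypothetical parser: the reply is assumed to be a JSON array of
// { row, score, comment } objects. Scores must be integers from 1 to 5.
function parseScores(replyText, expectedRows) {
  let items;
  try {
    items = JSON.parse(replyText);
  } catch {
    return { valid: [], invalidRows: [...expectedRows] };
  }
  if (!Array.isArray(items)) {
    return { valid: [], invalidRows: [...expectedRows] };
  }
  const expected = new Set(expectedRows);
  const valid = [];
  for (const item of items) {
    const okScore =
      Number.isInteger(item?.score) && item.score >= 1 && item.score <= 5;
    if (okScore && expected.has(item.row)) {
      valid.push({
        row: item.row,
        score: item.score,
        comment: String(item.comment ?? ""),
      });
      expected.delete(item.row); // a repeated row in the reply is ignored
    }
  }
  // Rows that were requested but got no usable score need a retry.
  return { valid, invalidRows: [...expected] };
}

const reply = '[{"row":1,"score":4,"comment":"useful"},{"row":2,"score":7}]';
const parsed = parseScores(reply, [1, 2, 3]);
console.log(parsed.valid.length); // 1
console.log(parsed.invalidRows);  // [ 2, 3 ]
```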

Final Results

Here are the final parameters that made it work:

config.js

```javascript
// Processing parameters
const CONFIG = {
  BATCH_SIZE: 30,
  API_DELAY_CLAUDE: 800,    // ms
  API_DELAY_DEEPSEEK: 500,
  MAX_RETRIES: 3,

  // Rate limit management
  RATE_LIMIT_PAUSE: 60000,
  ADAPTIVE_DELAYS: true,
};

// Model settings
const MODEL_CONFIG = {
  temperature: 0.1,
  max_tokens: 4000,
};
```

Key Achievement

This approach reduced token consumption by a factor of 4 and avoided hitting rate limits.
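To show how parameters like these can fit together, here is a minimal sketch of a batch loop with retries. It is not the original script: `callModel` is a hypothetical stand-in for the real API request, `config.API_DELAY` stands for the per-model delay, and a rate limit is assumed to surface as an error with `status === 429`.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function toBatches(rows, size) {
  const batches = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}

const realSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical processing loop. Each batch gets one attempt plus up to
// config.MAX_RETRIES retries before the error is rethrown.
async function processRows(rows, callModel, config, sleep = realSleep) {
  const results = [];
  for (const batch of toBatches(rows, config.BATCH_SIZE)) {
    let attempt = 0;
    for (;;) {
      try {
        results.push(...(await callModel(batch)));
        break;
      } catch (error) {
        attempt += 1;
        if (attempt > config.MAX_RETRIES) throw error;
        // Long pause on a rate limit, normal delay on other failures.
        const pause =
          error.status === 429 ? config.RATE_LIMIT_PAUSE : config.API_DELAY;
        await sleep(pause);
      }
    }
    await sleep(config.API_DELAY); // spacing between successful batches
  }
  return results;
}
```

For the Claude half, `config.API_DELAY` would be 800 ms; for the DeepSeek half, 500. The `sleep` parameter is injectable only so the loop can be tested without real waiting.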

Processing Time

LLMs worked autonomously, even overnight:

| Metric | Value |
|---|---|
| Claude processing time | 17 hours |
| DeepSeek processing time | 11 hours |
| Data discrepancy rate | 3% |
| Manual verification | 4 hours |
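A discrepancy check after merging can be sketched as follows. The definition used here (rows scored differently by both models, plus rows scored by neither) is an assumption for illustration, as are the name `mergeAndCheck` and the sample data; it is not necessarily how the 3% above was computed.

```javascript
// Hypothetical merge check. Each result is { row, score }. Rows scored by
// both models (e.g. around the meeting point) are compared; rows scored by
// neither are reported as missing. On a conflict the first score is kept.
function mergeAndCheck(claudeResults, deepseekResults, totalRows) {
  const merged = new Map();
  const conflicts = [];
  for (const item of claudeResults) merged.set(item.row, item.score);
  for (const item of deepseekResults) {
    if (merged.has(item.row) && merged.get(item.row) !== item.score) {
      conflicts.push(item.row); // both models scored it, but differently
    } else {
      merged.set(item.row, item.score);
    }
  }
  const missing = [];
  for (let row = 1; row <= totalRows; row++) {
    if (!merged.has(row)) missing.push(row);
  }
  const discrepancyRate = (conflicts.length + missing.length) / totalRows;
  return { merged, conflicts, missing, discrepancyRate };
}

// Made-up sample data for illustration only.
const check = mergeAndCheck(
  [{ row: 1, score: 5 }, { row: 2, score: 3 }, { row: 3, score: 4 }],
  [{ row: 3, score: 2 }, { row: 5, score: 1 }],
  5
);
console.log(check.conflicts);       // [ 3 ]
console.log(check.missing);         // [ 4 ]
console.log(check.discrepancyRate); // 0.4
```

The flagged rows are the ones that would go to manual verification.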

Model Comparison

Here's how the models compared across different criteria:

| Feature | Claude Sonnet 4.5 | DeepSeek V3.2 | GPT-4o |
|---|---|---|---|
| Batch processing | | | |
| Rate limit handling | | | |
| Accuracy | | | |
| Cost efficiency | | | |
| Speed | Medium | Fast | Medium |
| Context window | 200K | 128K | 128K |

Final Thoughts

Contrary to my expectations, both models managed to process the entire list. The prompt really pulled its weight here. The hybrid approach with two LLMs working from opposite ends proved to be a viable anti-crisis measure when traditional methods fail.

Sometimes the unconventional solution is exactly what you need when time is against you.