Aidaptive × Deuna

Smartrouter
Project

Phase 1 — Development
Last updated 2026-06-11
Start date 2026-02-18
Phase 1  ·  Development  ·  Volaris Delivery

Smartrouter AI/ML Integration

All productionization complete. 2 models (LogReg + DNN) for Volaris smart routing, continuous daily training from Snowflake, GPU-accelerated, full CI/CD, and served in production.

  • Phase: Production (all systems live)
  • Models: 2 (LogReg + DNN)
  • Latency target: p95 <200ms (model inference <50ms)
  • Gaps closed: 13/14 (~101d of ~104.5d saved)
⚡TL;DR Summary
Everything important in one place — read this first

What This Is

All productionization is complete. 2 ML models (Logistic Regression + Deep Neural Network) for Volaris smart routing are running in production with continuous daily training from Snowflake data, GPU-accelerated training (NVIDIA L4), full CI/CD pipeline, and automated model promotion with quality gates.

The Problem

  • Static routing rules → suboptimal acceptance rates
  • No intelligent PSP failover during outages
  • Retries not optimized by timing, route, or message
  • No feedback loop — outcomes don't improve future decisions

The Solution (5 Use Cases)

  • P-01 — PSP outage detection & failover
  • P-02 — Optimize existing static routing rules
  • P-03 — Per-transaction processor ranking
  • P-04 — Authorization message manipulation
  • P-05 — Retry optimization

Phase 1 Target — Volaris Merchant

  • Worldpay (ID: 76)
  • MIT (ID: 85)
  • Elavon (cards)
  • Amex (Amex cards)

  • Remaining: ~3.5d (lineage polish); ~101d saved of ~104.5d original
  • Gaps closed: 13 of 14 fully done (97% of the original effort estimate)
  • Status: Production, all systems live (daily training · GPU · CI/CD · monitoring)

✅ Production — What's Live

  • ✓ 2 ML models — LogReg (4 per-processor) + DNN (multi-output, 4 heads) both in production
  • ✓ Daily automated training — Lambda + EventBridge (2 AM UTC), GitHub Actions cron
  • ✓ GPU-accelerated — g6.2xlarge (NVIDIA L4, 24GB VRAM), mixed precision, DNN ~13 min
  • ✓ Deep ClearML integration — metrics, ROC/PR curves, confusion matrices, hyperparams
  • ✓ 5-gate quality suite — min_data, AUC≥0.65, regression, stability, completeness
  • ✓ Full CI/CD — CodePipeline + CodeDeploy + 127 training + 138 Go + 15 sidecar tests
  • ✓ Model serving sidecar — FastAPI on port 8081, hot-reload, 4 processors, 140-feature encoder
  • ✓ A/B testing — control groups, shadow mode, deterministic bucketing, multi-model-type experiments
  • ✓ S3 versioned storage — sequential versions (v1, v2, ...), rollback manifests, sidecar bucket mirroring
  • ✓ OTEL + Prometheus + Grafana — 15+ metric types, distributed tracing, CloudWatch dashboard
  • ✓ Terraform infrastructure — Spacelift-managed, dev + prod, GPU instances, ClearML ECS/Fargate
  • ✓ Snowflake ingestion — memory-efficient streaming, S3 parquet caching, 12-week lookback

🔨 Remaining (~3.5d — polish only)

  • ✗ Lineage tracking queryable store — G-10 (~1.5d remaining)
  • ✗ Lineage documentation — G-10 (~0.5d)

📊 Production Architecture

Snowflake → S3 parquet cache
  ↓
LogReg Pipeline (CPU)
DNN Pipeline (GPU g6.2xlarge)
  ↓
5-Gate Quality Suite
  ↓
S3 Versioned Models
  ↓ (sync to EFS)
Sidecar (FastAPI :8081)
  ↓
Go API (:8080) → Production

DATA-Athena-Snowflake Testing

127 tests · 16 test files · Maturity 4/5
Comprehensive coverage: pipeline orchestration, DNN training (masked BCE, GPU OOM), evaluator, quality gates, promoter, data ingestion, preprocessing, ClearML tracker, rollback, model config. Synthetic fixtures for reproducibility.

athena-platform Testing

138 Go + 15 Python tests · Maturity 4/5
Domain services + repositories well covered. Model registry, experiment assignment, bucketing, shadow mode tested. Sidecar: smart router strategy, model types, encoders. CI enforced on every PR.

Current Blockers

Item | Owner | Status
Deuna corp accounts — Rakesh & Naoki | TBD | Not needed
Code / repo access — Naoki | TBD | Done
Are ATHIA_* tables live in Deuna's Snowflake? | Israel | Confirmed ✓
Are SageMaker endpoints live today? | Rakesh | Resolved ✓
Payment volume through routing engine? | Israel | Open question
GPU instance for training | Deuna | Resolved ✓ — g6.2xlarge (NVIDIA L4) deployed
AWS resource access for Rakesh | Deuna | Resolved ✓ — full access granted
✈️ See full Volaris delivery task list — 65 tasks across 8 phases (Production — all systems live) →
🏠Project Overview
What this project is and what success looks like
🎯

Purpose

Assess the effort required to integrate Athia AI/ML into Deuna's payment routing. Produce a clear work breakdown and estimate before any implementation begins.

✅

Phase 0 Deliverables

  • Full schema & data understanding
  • Effort estimate per workstream
  • Risks and open questions resolved
  • Recommended build order
🏆

Long-Term Success

  • Measurable approval lift
  • Stability during PSP outages
  • Latency: p95 < 200ms
  • Closed feedback/learning loop

✅ In Scope (Phase 0)

  • Understand Deuna's data, schema, routing rules
  • Assess Athia platform gaps vs. what's needed
  • Size effort for P-01 through P-05 use cases
  • Identify all dependencies, blockers, risks

🚫 Out of Scope (Phase 0)

  • Any implementation or code delivery
  • 3DS optimization (Phase 2)
  • User-facing messaging (Phase 3)
  • Installment optimization
📅Timeline & Phases
Project phases from assessment to full delivery
🔍

Phase 0 — Assess Level of Effort  Done ✓

2 days · $6K budget · Completed 2026-02-19
Nail down all the work required. Produce a detailed estimate with confidence before committing to delivery.

🚀

Phase 1 — Model in Production  Pending

2 weeks · Core delivery
Model running in production for 2 processors with basic feature store. Target merchant: Volaris.

📊

Phase 2 — Monitoring + Experimentation  Pending

Week 3 · Add monitoring and integrate with A/B experimentation infrastructure.

⚙️

Phase 3 — Drift Detection, CI/CD, Ramp-Up  Pending

TBD · Drift detection, CI/CD pipeline, experiment ramp-up, additional model techniques.

Phase 1 Delivery Plan — Volaris Merchant

65 tasks · ~64 person-days · 5–6 weeks with 3 engineers · TensorFlow ecosystem

Sub-Phase | Focus | Tasks | Effort | Key Notes
0 — Service Architecture | Design + scaffold 7 TF service shells | 11 | 14.5d | NEW · Rakesh: architecture, API contracts, TF integration plan. Team: 7 service shells
1 — Discovery & EDA | Understand Volaris data | 10 | 7.5d | Approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, A/B sample size
2 — Feature Engineering | Build ML features (Feature Service + TF Transform) | 9 | 8.5d | Card BIN/brand, RFM, retry context, rolling health scores, Amex hard-rule bypass
3 — Model Development | Train models (tf.keras) | 8 | 9d | TF DNN, wide-and-deep, TF Decision Forests via Training Pipeline + Eval Service
4 — Outage Detection | P-01: failover for 4 PSPs | 6 | 6.5d | Rolling health score (5–15 min window), auto-failover, recovery detection (1–2% sampling), alerts
5 — Message Manipulation | P-04: CIT/MIT experiment | 5 | 5.5d | CIT/MIT audit, approval delta by toggle × processor × card type, new athena-platform endpoint, A/B test
6 — Platform Integration | Register models, wire Deuna | 8 | 6.5d | Triton ✓ · ExperimentService one-call API + built-in shadow mode. Requires Deuna eng coordination.
7 — Monitoring & Feedback | Dashboards, retraining, review | 8 | 6d | 2 tasks already done: ATHIA tables confirmed live, ATHIA_STAGE_OUTCOMES deployed
Total | | 65 | ~64d | Phase 0 first (Rakesh design ∥ team scaffolding); Phases 1–2 sequential; 3–5 parallel after Phase 2
👥Team
People involved and their roles
Deuna (Client)
Name | Role
Reks | CEO & Co-Founder
Chema | Co-Founder
Pablo | CTO — Executive Sponsor
Israel | Data POC — Snowflake & Data Access
Farhan | Claude / LLM Access POC
Mark Walick | Product Management Lead
Aidaptive (Contractor)
Name | Role
Rakesh | CEO
Naoki | Solutions Architect
Rene | ML Engineer
Kedar | Backend / Data Engineer
🎯Use Cases (P-01 to P-05)
The five P0 use cases to be delivered
🏢

Phase 1 Target Merchant: Volaris  Decided 2026-02-19

Volaris selected over Cinépolis. Known PSPs: Worldpay (ID: 76), MIT (ID: 85), Elavon (cards), Amex (Amex cards) — 4 processors total with routing policies per currency. Cinépolis deferred: only shows Cybersource (a gateway), actual processor unknown.

P-01

Outage Detection & Failover

Detect PSP failures via persistent timeout codes. Auto fail-over and fail-back using random sampling of downed PSP to detect recovery.
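A minimal sketch of the fail-over and fail-back idea described above: degraded PSPs are skipped, but a small random share of traffic is still sent to a downed PSP so that recovery can be observed. The 2% probe rate follows the "1–2% sampling" figure quoted elsewhere on this page; the function name, its arguments, and the "fall back to the top-ranked PSP when everything is down" rule are assumptions made for illustration.

```python
import random


def choose_processor(ranked, down, probe_rate=0.02, rng=random):
    """Pick a PSP from a preference-ordered list.

    ranked: processors in preference order (best first).
    down: set of processors currently considered degraded.
    Returns (processor, is_probe), where is_probe marks recovery-test traffic.
    """
    down_candidates = [p for p in ranked if p in down]
    # Send a small random sample to a downed PSP to detect recovery.
    if down_candidates and rng.random() < probe_rate:
        return down_candidates[0], True
    # Otherwise route to the best processor that is not degraded.
    for p in ranked:
        if p not in down:
            return p, False
    # Everything is down: fall back to the top-ranked PSP (an assumption).
    return ranked[0], False
```

Passing `rng` explicitly makes the sampling behaviour testable with a seeded `random.Random` instance.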

P-02

Routing Optimizer

Optimize Deuna's existing static routing rules based on historical outcomes. Build on existing rules engine rather than starting from scratch.

P-03

Per-Transaction Route Selection

Rank top 3 payment processors per transaction in real time based on prior outcomes, card signals, and merchant context.
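As a rough illustration of the ranking step, the sketch below orders processors by a predicted approval probability and keeps the top three. The scores in the usage note are invented, the function name is hypothetical, and the Amex short-circuit mirrors the hard rule described in task V-17 (Amex cards always go to the Amex processor, bypassing ML). How the production ranker breaks ties or treats the Amex processor for non-Amex cards is not stated on this page.

```python
from typing import Dict, List, Optional


def rank_processors(
    approval_prob: Dict[str, float],
    card_brand: Optional[str] = None,
    k: int = 3,
) -> List[str]:
    """Return up to k processors, highest predicted approval probability first."""
    # Hard rule: Amex cards bypass the model entirely.
    if card_brand == "amex":
        return ["amex"]
    # Sort by descending probability; ties are broken alphabetically (an assumption).
    return sorted(approval_prob, key=lambda p: (-approval_prob[p], p))[:k]
```

For example, with hypothetical scores `{"worldpay": 0.81, "elavon": 0.84, "mit_bulk": 0.77, "amex": 0.40}` the call returns `["elavon", "worldpay", "mit_bulk"]`.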

P-04

Message Manipulation

Toggle CIT/MIT, AVS, MCC variables in authorization request messages. Provide top 3 configuration recommendations per transaction.

P-05

Retry Optimization

Optimize when, how, and where to retry declined transactions. Focused on MIT (merchant-initiated) and subscription payments. Enterprise darktime reduction. Delayed retry based on processor reputation.

🗄️Data & Schema
Snowflake database overview — extracted 2026-02-18

Connection

VLTAXPW-RMONTES

Database: PAYMENT_ML

Access: Read-only

ABTESTING Schema

Denormalized flat join of all views. Best starting point for EDA. No complex joins needed.

ALL_VIEWS_FLAT · ALL_PAYMENT_EVENTS_FLAT

SOURCES Schema

15 clean views: orders, payments, attempts, events, user profiles, routing logs, merchant rules, airline data.

View | Why It Matters | Use Cases
VW_ATHENA_PAYMENT_ATTEMPT | Full retry chain per payment; processor, error codes, hard/soft decline, DYNAMIC_ROUTING_DETAIL JSON | P-03, P-05
VW_SMART_ROUTING_ATTEMPTS | Live routing engine log: algorithm type, latency, skip reasons — direct latency signal for p95 <200ms | P-01, P-02
VW_ROUTING_MERCHANT_RULE | Existing static rules engine — foundation for routing optimizer. SHADOW_MODE column suggests testing infrastructure exists. | P-02
ABTESTING.ALL_VIEWS_FLAT | Everything joined in one table — best for initial EDA | EDA

Feature Group | Key Columns | Use
Retry history | NUM_ATTEMPTS_ORDER, PREVIOUS_ORDER_ERROR_CODE, AVG_SEC_BETWEEN_PAYMENT_ATTEMPS | P-05
Error signals | ERROR_CODE, ERROR_CATEGORY, HARD_SOFT | P-03, P-05
Card signals | CARD_BIN, CARD_BRAND, BANK, CARD_COUNTRY | P-03
User behavior | TARGET_USER_FRAUD_RATE_COHORT, TOTA_MINUTES_BROWSING, RFM values | P-03
Message config | MCI_MSI_TYPE, ORDER_MCI_MSI_TYPE, PAYMENT_ATTEMPT_METHOD_TYPE | P-04
Geo & Device | ORDER_COUNTRY_CODE, TARGET_USER_BROWSER, TARGET_USER_DEVICE | P-03
🔍Repository Analysis
Findings from both Deuna GitHub repos — re-analyzed 2026-04-19 · Both repos in production · DATA-Athena-Snowflake: 2 training pipelines (LogReg + DNN) · athena-platform: ML serving with hot-reload sidecar

DATA-Athena-Snowflake

github.com/DUNA-E-Commmerce/DATA-Athena-Snowflake · Production — 2 training pipelines (LogReg + DNN), GPU-accelerated, daily automated training

Python / TensorFlow / ClearML

✅ Production Status (Updated 2026-04-19)

2 training pipelines in production: LogReg (4 per-processor models) + DNN (multi-output neural net with 4 heads). Daily automated training via Lambda + EventBridge. GPU-accelerated on g6.2xlarge (NVIDIA L4). Deep ClearML integration with metrics, ROC/PR curves, confusion matrices. 5-gate quality suite blocks bad model promotion. 127 tests. S3 versioned model storage with rollback manifests.

LLM Workflows (11 stimuli)

Workflow | Status
Acceptance rate analysis | Done (v0_1, v1_0)
Fraud card analysis | Done
Metrics anomaly detection | Done
Chatbot / data analyst | Done
Strategy generation director | Partial — Matcher has exit()
Cost optimization | Early stage
Retry optimization | Missing (P-05 gap)

ML Training Platform (now on main)

Service | Status
Training pipeline (Snowflake ML) | Done
LLM training orchestrator (GPT-4 + RAG) | Done
LLM experiment designer (GPT-4 + RAG) | Done
Data quality validator | Done
Model registry (auto table creation) | Done
Feature extractor | Done
Feedback collector (webhook/API/batch) | Done
Schema discovery (ChromaDB) | Done
Athia event ingestion | Done
Model deployer → athena-platform | Done — S3 promotion + sidecar mirror + hot-reload

Architecture — Multi-Agent Pattern

FastAPI + LangGraph · Stimulus-response orchestration · LLM backends: Claude (primary), GPT-4 (fallback)

Request → StimulusRegistry → OrchestratorWorkflow → Branch (DAG of Nodes)
        → AgentWorkflow (LangGraph StateGraph) → Response

11 stimuli: acceptance_rate_analysis · fraud_card_analysis · metrics_anomaly
            user_question · data_analyst · researcher_assistance · deep_exploration
            element_edition · knowledge_expert · strategy_generation · cost_optimization

End-to-End Training Flow (Production)

Daily 2 AM UTC (Lambda + EventBridge)
  → Snowflake Data Ingestion (memory-efficient streaming, S3 parquet cache)
  → Preprocessing (z-score normalization + one-hot encoding, 140 features)
  ├── LogReg Pipeline (4 per-processor models, CPU, ~1.5 min)
  └── DNN Pipeline (multi-output 64→32→4 heads, GPU g6.2xlarge, ~13 min)
  → 5-Gate Quality Suite (min_data, AUC≥0.65, regression, stability, completeness)
  → ClearML Metrics Logging (ROC/PR curves, confusion matrices, hyperparams)
  → S3 Model Promotion (versioned: v1, v2, ... + rollback manifest)
  → Sidecar Bucket Mirror → EFS → Hot-Reload
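The preprocessing step in the flow above (z-score normalization plus one-hot encoding) can be illustrated with a small stdlib-only sketch. The point it tries to show is that statistics and vocabularies are fitted once at training time and then reused unchanged at serving time. The feature names `amount` and `card_brand` and all function names are assumptions; the real 140-feature encoder is not shown on this page.

```python
from statistics import mean, pstdev


def fit_zscore(values):
    """Learn mean and standard deviation from training data."""
    mu, sigma = mean(values), pstdev(values)
    return mu, sigma or 1.0  # guard against zero variance


def zscore(x, mu, sigma):
    """Standardize one numeric value with previously fitted statistics."""
    return (x - mu) / sigma


def fit_one_hot(categories):
    """Learn a stable, sorted vocabulary from training data."""
    return sorted(set(categories))


def one_hot(value, vocab):
    """Encode one categorical value; unseen values map to all zeros."""
    return [1.0 if value == v else 0.0 for v in vocab]


def encode(row, amount_stats, brand_vocab):
    """Build a feature vector from one transaction row (hypothetical fields)."""
    return [zscore(row["amount"], *amount_stats)] + one_hot(row["card_brand"], brand_vocab)
```

Mapping unseen categories to an all-zero vector is one common choice; a dedicated "unknown" slot is an equally reasonable alternative.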

Training Pipeline Architecture

Pipeline flow: Training Decision → Data Prep → Feature Extraction → Validation → Experiment Design → Training → Model Selection → Deployment

Services

Service | Purpose
TrainingPipeline | Full training execution (plan/run/deploy)
LLMTrainingOrchestrator | LLM + RAG decision engine (RETRAIN_NOW / SCHEDULED / SKIP)
LLMExperimentDesigner | Designs 5-10 experiments using GPT-4/Claude with RAG
ModelDeployer | Exports to EFS, registers deployment, creates canary config
TrainingPlanner | Dry-run mode ("terraform plan" for ML)
FeatureExtractor | Auto-extracts features, creates training dataset views
DataQualityValidator | Schema, statistical, temporal bias, drift validation
FeedbackCollector | Webhook, API polling, batch feedback collection
ModelRegistry | Model CRUD, prediction/feedback schema management
SchemaDiscovery | Auto-discovers training tables via LLM
LLMProvider | Unified Claude/GPT-4 interface with auto-fallback

API Endpoints

Endpoint | Purpose
POST /api/v1/training/plan/{model_type} | Dry-run plan
POST /api/v1/training/run/{model_type} | Execute training
POST /api/v1/training/decision/{model_type} | LLM decision
POST /api/v1/experiments/design/{model_type} | Design experiments

Resolved Since Last Update

  • ✓ Deployment fully automated — S3 promotion + sidecar bucket mirror + hot-reload (was: manual)
  • ✓ Rollback capability complete — versioned S3 + rollback manifests + sidecar hot-reload (was: partial)
  • ✓ GPU training deployed — g6.2xlarge with mixed precision (was: CPU-only, 3hr runs)
  • ✓ CI/CD complete — GitHub Actions + CodePipeline + CodeDeploy (was: partial)
  • ✓ DNN pipeline added — multi-output neural net with 2-18% AUC improvement over LogReg

Remaining (minor)

  • ✗ Formal lineage queryable store — G-10 (~1.5d)

🧪 Testing Coverage

127 training tests · 16 test files · Maturity 4/5

Layer | Coverage | Notes
Pipeline orchestration | Comprehensive | test_pipeline.py, test_multi_output_pipeline.py — stage execution, metric logging
DNN training | Comprehensive | test_multi_output_trainer.py — masked BCE, batched validation, GPU OOM
Quality gates | Comprehensive | test_quality_gates.py — all 5 gates tested
Model promotion | Comprehensive | test_promoter.py — S3 upload, versioning, rollback manifests
Data ingestion | Comprehensive | test_data_ingestion.py — Snowflake streaming, S3 caching
Preprocessing | Comprehensive | test_preprocessing.py — z-score, OHE, config management
ClearML tracker | Comprehensive | test_tracker.py — task creation, offline mode, metrics
Lambda handlers | Comprehensive | test_multi_output_handler.py — instance lifecycle, SSM commands

Maturity: 4/5 — Strong. Comprehensive training pipeline coverage with synthetic fixtures for reproducibility. Unit + integration tests with markers (slow, integration, unit). Coverage tracking via pytest-cov.
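The "masked BCE" loss mentioned for the multi-output DNN can be sketched in plain Python. The idea, as commonly applied to multi-head models, is that a head contributes to the loss only when a label exists for it, presumably because a transaction has an observed outcome only for the processor(s) it was actually sent to. That interpretation, the function name, and the per-sample averaging are assumptions; the production loss is implemented in the training code and is not reproduced here.

```python
import math


def masked_bce(y_true, y_pred, mask, eps=1e-7):
    """Mean binary cross-entropy over the heads where mask is truthy.

    y_true: 0/1 labels per head; y_pred: predicted probabilities per head;
    mask: 1 where the head has an observed label, 0 otherwise.
    """
    total, count = 0.0, 0
    for t, p, m in zip(y_true, y_pred, mask):
        if not m:
            continue  # no observed outcome for this head, so no gradient signal
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0
```

With four heads and only the first one labeled, the three unlabeled predictions have no effect on the result, however confident or wrong they are.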

athena-platform

github.com/DUNA-E-Commmerce/athena-platform

Go / Gin

✅ Production Status (Updated 2026-04-19)

Full ML serving platform in production. Serves both LogReg and DNN models via FastAPI sidecar (port 8081) with hot-reload from EFS. 4 processors (worldpay, elavon, mit_bulk, amex), 140-feature encoder, strategy pattern (XGBoost LTR / TF per-processor). A/B testing with control groups, shadow mode, deterministic bucketing. Multi-model-type experiments (LogReg vs DNN). 138 Go + 15 Python tests. OTEL + Prometheus + Grafana monitoring.

✅ Key Capabilities — All Production-Ready

  • Model Serving: FastAPI sidecar, EFS models, hot-reload (POST /models/reload)
  • A/B Testing: Deterministic bucketing, control groups, shadow mode, auto-winner
  • Monitoring: 15+ OTEL metrics, Prometheus, Grafana, CloudWatch
  • Model Registry: ExperimentService one-call API, model type in version (logreg/dnn)
  • Event Logging: Async Snowflake ingestion
  • Auto-Winner: Statistical significance, auto-promotion

ML Inference Types (already in registry)

Type | Maps To
processor_selector | P-03
retry_predictor | P-05
retry_sequence | P-05
installment_optimizer | Out of scope

Snowflake Tables

Table | Status
ATHIA_PREDICTIONS | Active
ATHIA_FEEDBACK | Active
ATHIA_TRAINING_DATASET | Active
ATHIA_EXPERIMENT_LIFT | Active
ATHIA_STAGE_OUTCOMES | Deployed (feat/ATH-0000)
ATHIA_SESSION_SUMMARY | Deployed (feat/ATH-0000)
ATHIA_MULTI_STAGE_ANALYSIS | New (feat/ATH-0000)
ATHIA_MODEL_METRICS | New (feat/ATH-0000)
ML_MODEL_REGISTRY | New (feat/ATH-0000)

Architecture — Clean Architecture (Go/Gin)

PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS · mTLS enforced on /api/v1/ml/predict/*

REST Handlers (V1 + V2, Gin)  ← mTLS on /ml/predict/*
        ↓
Controllers (~30 implementations)
        ↓
Domain Services (44 packages)  ← constructor injection throughout
        ↓
Repositories (43 GORM implementations)  ← in-memory SQLite for tests
        ↓
PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS (model storage)

A/B Experimentation — Auto-Winner Guardrails

Stats

p-value < 0.05
Min 1000 samples/variant
Min 7 days runtime

Lift

Min 1% absolute lift
Deterministic bucketing
SHA256(transaction_id)
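Deterministic bucketing with SHA256(transaction_id) can be sketched as below. Hashing the transaction ID means the same transaction always lands in the same variant without storing any assignment state. The bucket count, the 50/50 split, and the function names are illustrative assumptions; the platform's actual split logic may differ.

```python
import hashlib


def bucket(transaction_id: str, n_buckets: int = 100) -> int:
    """Map a transaction ID to a stable bucket in [0, n_buckets)."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a bucket.
    return int.from_bytes(digest[:8], "big") % n_buckets


def assign_variant(transaction_id: str, treatment_pct: int = 50) -> str:
    """Assign 'treatment' to the first treatment_pct buckets, else 'control'."""
    return "treatment" if bucket(transaction_id) < treatment_pct else "control"
```

Because the assignment depends only on the ID, retries of the same transaction stay in the same arm, which keeps experiment results from being contaminated by re-bucketing.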

Guardrails

≤10% latency regression
≥−5% revenue regression
Dry-run mode (safe default)
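The guardrails above can be combined into a single promotion check, sketched here. The thresholds (p-value < 0.05, at least 1000 samples per variant, at least 7 days, at least 1% absolute lift, at most 10% latency regression, revenue change no worse than −5%) are taken from this page. The choice of a two-sided two-proportion z-test is an assumption, since the page does not say which significance test the platform uses, and all function and parameter names are hypothetical.

```python
import math


def two_proportion_p_value(succ_a, n_a, succ_b, n_b):
    """Two-sided p-value for a difference in approval rates (pooled z-test)."""
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variance, so no evidence of a difference
    z = (succ_b / n_b - succ_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def can_auto_promote(succ_c, n_c, succ_t, n_t, days_running,
                     latency_regression, revenue_change):
    """Apply the statistical, lift, and safety guardrails to a treatment arm.

    latency_regression and revenue_change are fractions (0.10 means 10%).
    """
    lift = succ_t / n_t - succ_c / n_c
    return (
        min(n_c, n_t) >= 1000
        and days_running >= 7
        and lift >= 0.01
        and two_proportion_p_value(succ_c, n_c, succ_t, n_t) < 0.05
        and latency_regression <= 0.10
        and revenue_change >= -0.05
    )
```

In dry-run mode, which the page describes as the safe default, a check like this would only report its verdict rather than trigger a promotion.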

🧪 Testing Coverage

138 Go test files · 15 Python test files · Maturity 4/5

Layer | Coverage | Notes
Domain services (44) | 44/44 | All have test files
Repositories (43) | ~43/43 | In-memory SQLite isolation
V1 REST handlers (18) | 15/18 (83%) | agent, workspaces, elements missing
Auth middleware | Tested | JWT + API key covered
V2 REST handlers (18) | 0/18 (0%) | Entire new API version untested
Bedrock client | 0% | Excluded from coverage config
auth, bedrock, element, workspace services | 0% | 4 domain services with no tests
Bootstrap / DI graph | Skipped | TODO: testcontainers

Maturity: ~4/5 — Strong. Domain and repository layers well covered. Triton merge adds 32 new tests. V2 API (18 handlers) and Bedrock ML inference path are still untested. CI threshold is only 20% and internal/clients/ is excluded from coverage entirely.

Testing Comparison — Both Repos

Metric | DATA-Athena-Snowflake | athena-platform
Test files | 16 training test files | 138 Go + 15 Python
Test count | 127 tests | 138+ Go test files
Core product tested? | Yes — all pipeline stages, DNN training, quality gates | Yes — domain, repositories, model registry, experiments
CI enforced? | Yes — daily cron + PR checks | Yes — every PR
Fixtures | Synthetic data fixtures for reproducibility | In-memory SQLite isolation
Maturity | 4/5 — Strong | 4/5 — Strong
⚡Training Platform Gaps
14 gaps — ~104.5d original estimate · ~101d saved · ~3.5d remaining (lineage polish only)

Status

Production

All systems live and running daily

Architecture

2 models (LogReg + DNN) · GPU training · Daily automated retraining · Full CI/CD

Total Effort

~3.5d remaining

of ~104.5d original · ~101d saved (97%)

Progress Summary — 2026-04-19

  • 13 gaps fully done (G-01–G-09, G-11–G-14)
  • 1 nearly done: G-10 (Lineage)
  • 0 partial
  • 0 not started

Stage | Gap | Category | Priority | Original | Status
1 – Foundation | G-02 Orchestration | Infrastructure | High | 8d | Done ✓ — Lambda/SSM/EC2 + SageMaker + daily schedules + GPU lifecycle
1 – Foundation | G-08 Feature Store | ML Infra | High | 13.5d | Done ✓ — 140-feature encoder + preprocess_config + sidecar serving
1 – Foundation | G-04 Data Validation | Data Quality | High | 7d | Done ✓ — 5-gate quality suite
2 – Automation | G-03 CI/CD Pipeline | DevOps | High | 9d | Done ✓ — GitHub Actions + CodePipeline + CodeDeploy. Dev + Prod
2 – Automation | G-06 Deployment Automation | Automation | High | 7.5d | Done ✓ — Full automated deployment with quality gates
2 – Automation | G-07 Model Registration | Automation | Medium | 5.5d | Done ✓ — ExperimentService + S3 versioning
2 – Automation | G-01 Automated Retraining | Automation | High | 10d | Done ✓ — Daily automated training (LogReg + DNN)
3 – Governance | G-13 Versioning Workflow | Governance | High | 5d | Done ✓ — Sequential S3 versioning + model type in version
3 – Governance | G-10 Lineage Tracking | Governance | Medium | 6.5d | Nearly Done (~1.5d left) — ClearML task hierarchy + S3 manifests. Queryable store TODO
3 – Governance | G-14 Rollback Capability | Reliability | High | 5d | Done ✓ — S3 versioned + rollback manifests + sidecar hot-reload
4 – Observability | G-05 Model Monitoring | Observability | High | 8d | Done ✓ — OTEL + Prometheus + Grafana + ClearML
4 – Observability | G-09 Drift Detection | Observability | Medium | 7d | Done ✓ — Feature + prediction + concept drift
5 – ML Quality | G-11 Hyperparameter Tuning | ML Quality | Medium | 5.5d | Done ✓ — Configurable via env vars, per-env tuning
5 – ML Quality | G-12 Algorithm Comparison | ML Quality | Medium | 7d | Done ✓ — LogReg vs DNN via A/B testing
Remaining Effort by Stage
  • Stage 1 – Foundation: 0d ✓
  • Stage 2 – Automation: 0d ✓
  • Stage 3 – Governance: ~1.5d (G-10 lineage)
  • Stage 4 – Observability: 0d ✓
  • Stage 5 – ML Quality: 0d ✓
📄Statement of Work — Delivery
Volaris Phase 1 · 65 tasks · ~64 person-days · 5–6 weeks (3 engineers) · TensorFlow ecosystem

Engagement Summary

Full implementation of Athia AI/ML smartrouting for Deuna — Volaris merchant. 7 delivery phases covering P-01 (outage failover), P-03 (per-transaction routing), P-04 (message manipulation), and P-05 (retry optimization). ~61.5 person-days of prior Deuna codebase work reduces original scope significantly.

54 delivery tasks · ~51 person-days · 5–6 wks calendar time

Aidaptive Team — Roles & Responsibilities

Name | Role | Responsibilities | Days
Rakesh | Project Lead & Strategy | Client coordination (Pablo, Israel), architecture decisions, Phase 6 oversight, post-launch review | 5.5d
Naoki | Solutions Architect | athena-platform Go dev, outage detection, message manipulation API, model serving (Triton), CI/CD, Phase 6 integration | 14.6d
Rene | ML Engineer | Feature engineering, model training (processor_selector, retry_predictor, retry_sequence), data quality, drift detection, retraining pipeline | 15d
Kedar | Data & Backend Engineer | Snowflake EDA, data pipelines, training datasets, feature feeds, Grafana dashboards, monitoring | 16d
Total | | | ~51.1d

Effort by Phase

Phase | Focus | Owners | Days | Milestone
1 — Discovery & EDA | Understand Volaris data | Kedar (4.5d) · Rene (2d) · Rakesh (0.5d) · Naoki (0.5d) | 7.5d | Kick-off (20%)
2 — Feature Engineering | Build ML feature set | Rene (4d) · Kedar (3d) · Naoki (1d) · Rakesh (0.5d) | 8.5d | Phase 2 complete (20%)
3 — Model Development | Train P-03 + P-05 models | Rene (6d) · Kedar (2d) · Naoki (1d) | 9d | Phase 3 complete (20%)
4 — Outage Detection | P-01: failover for 4 PSPs | Naoki (4.5d) · Rakesh (1d) · Kedar (1d) | 6.5d | Phase 6 complete (30%)
5 — Message Manipulation | P-04: CIT/MIT experiment | Naoki (2d) · Rene (1.5d) · Kedar (1.5d) · Rakesh (0.5d) | 5.5d |
6 — Platform Integration | Register models, wire Deuna (+G-06 close) | Naoki (5.1d) · Rakesh (2d) · Kedar (1d) | 8.1d |
7 — Monitoring & Feedback | Dashboards, retraining, review | Kedar (3d) · Rene (1.5d) · Rakesh (1d) · Naoki (0.5d) | 6d | Phase 7 complete (10%)
Total | | | ~51d |

Delivery Timeline (6-week plan)

Week | Days | Phases Active | Who | Key Milestone
Week 1 | 1–5 | Phase 1 (EDA) · Phase 2 start Day 3 | Kedar · Rene · Rakesh (Day 1) | EDA complete; feature schema draft
Week 2 | 6–10 | Phase 2 (Features) · Phase 3 start Day 8 | Rene · Kedar · Naoki | Feature set locked; training dataset built
Week 3 | 11–15 | Phase 3 (Models) · Phase 4 (Outage) parallel | Rene (models) · Naoki (outage) | Models packaged; outage detection built
Week 4 | 16–20 | Phase 4 tail · Phase 5 (CIT/MIT) · Phase 6 prep | Naoki · Rene · Kedar · Rakesh | API contract with Deuna eng signed
Week 5 | 21–25 | Phase 6 (Integration) | Naoki · Rakesh | ⚠ Triton branch must be merged by Day 18 · Integration live in shadow mode
Week 6 | 26–30 | Phase 7 (Monitoring & Review) | Kedar · Rene · Rakesh | Dashboards live · retraining scheduled · post-launch report

Critical path: Phases 1–2 sequential. Phases 3–5 can run in parallel. Phase 6 requires (a) models complete, (b) Triton branch merged, (c) 1-week Deuna engineering lead time for API contract. Phase 7 requires Phase 6 live.

Assumptions

  • Snowflake access (PAYMENT_ML) remains available read-only
  • Deuna eng available for API contract in Week 4 (Pablo / Israel)
  • Triton branch merged to main by end of Week 3
  • Staging environment available for Phase 6 integration tests
  • ATHIA_PREDICTIONS + ATHIA_FEEDBACK remain live throughout

Success Criteria

  • processor_selector live for ≥1 Volaris PSP
  • ≥1% absolute approval rate lift (A/B test at significance)
  • ≥5% retry success rate improvement vs. baseline
  • PSP failover within 1 routing cycle of threshold breach
  • p95 latency <200ms end-to-end (model inference <50ms)
  • 48h shadow run complete with documented comparison
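The latency criterion above can be checked from raw timing samples with a small helper, sketched here using the nearest-rank percentile definition. Which percentile definition the monitoring stack uses is not stated on this page, and applying the 95th percentile to the 50ms model-inference budget is an assumption, as are the function names.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_targets(end_to_end_ms, inference_ms):
    """True when p95 end-to-end < 200ms and p95 model inference < 50ms."""
    return percentile(end_to_end_ms, 95) < 200 and percentile(inference_ms, 95) < 50
```

A tail-focused metric such as p95 is used instead of the mean because a handful of slow routing decisions can hide behind a healthy average.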
✈️Volaris Smartrouting — Delivery Tasks
65 tasks across 8 phases — All productionization complete. 2 models (LogReg + DNN) running daily in production.

✅ Production Status (2026-04-19): All Systems Live

2 ML models in production: LogReg (4 per-processor models) + DNN (multi-output neural net, 4 heads). Daily automated training from Snowflake via Lambda + EventBridge. GPU-accelerated (g6.2xlarge NVIDIA L4). 5-gate quality suite. Full CI/CD. ClearML experiment tracking. S3 versioned model storage with rollback.

7 Service Shells: Data Pipelines · Feature Service · Training Pipelines · Model Management · Eval Service · Evaluation Framework · Experiment System
Design Tasks (Rakesh): V-D01: Service architecture · V-D02: API contracts · V-D03: TF ecosystem integration plan
Updated Totals: 65 tasks (was 54) · ~64d total (was ~49.5d) · Phase 0 adds ~14.5d · 3 engineers ~5–6 weeks

  • 4 PSPs (Worldpay · MIT · Elavon · Amex)
  • 65 total tasks (+11 new)
  • ~64d total effort (+14.5d from Phase 0)
  • 5–6 wks with 3 engineers in parallel
Phase 0 — NEW

Service Architecture & Shell Setup

Design 7 service boundaries (Rakesh), define API contracts, scaffold all service shells using TensorFlow ecosystem: Data Pipelines (TFX), Feature Service (TF Transform), Training Pipelines (TFX Trainer), Model Management, Eval Service (TFMA), Evaluation Framework, Experiment System.

📐 V-D01–D03 (Rakesh design) + V-S01–S08 (team scaffolding)

11 tasks · 14.5d · TensorFlow
Phase 1

Discovery & EDA

Understand Volaris transaction data — approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, sample size for A/B test.

10 tasks · 7.5d
Phase 2

Feature Engineering

Card BIN/brand, transaction context, user RFM, retry history, rolling processor health scores, Amex hard-rule bypass, training dataset build.

9 tasks · 8.5d
Phase 3

Model Development

Train processor_selector, retry_predictor, retry_sequence for Volaris 4 PSPs using tf.keras. DNN vs. wide-and-deep vs. TF Decision Forests comparison. TFMA per-slice evaluation.

🔧 TensorFlow ecosystem: tf.keras training via Training Pipeline service, TFMA evaluation via Eval Service, SavedModel export via Model Management.

8 tasks · 9d · Infra ready
Phase 4 — P-01

Outage Detection

Rolling health score per PSP, failover to next-best Volaris processor, recovery detection via 1–2% sampling, alerts on state changes.

6 tasks · 6.5d
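A rolling health score per PSP, as described for this phase, can be sketched with a time-windowed queue. The 10-minute default sits inside the 5–15 minute window mentioned in the delivery plan. The class name, the definition of the score as the share of successful outcomes, and the choice to treat a PSP with no recent traffic as healthy are all assumptions for illustration.

```python
from collections import deque


class RollingHealth:
    """Share of successful outcomes for one PSP over a sliding time window."""

    def __init__(self, window_s: float = 600.0):
        self.window_s = window_s
        self.events = deque()  # (timestamp_s, ok) pairs, oldest first

    def record(self, ts: float, ok: bool) -> None:
        """Add one outcome observed at time ts (seconds)."""
        self.events.append((ts, ok))
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()

    def score(self, now: float) -> float:
        """Return the success share in the window, in [0.0, 1.0]."""
        self._evict(now)
        if not self.events:
            return 1.0  # no recent traffic: assume healthy (a design assumption)
        return sum(ok for _, ok in self.events) / len(self.events)
```

A failover rule could then mark a PSP as degraded when `score(now)` drops below a chosen threshold; the threshold itself is the subject of task V-28 and is not defined here.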
Phase 5 — P-04

Message Manipulation

CIT/MIT audit for Volaris, approval delta by toggle × processor × card type, experiment design, new athena-platform endpoint, A/B test.

5 tasks · 5.5d
Phase 6

Platform Integration

Register models in athena-platform, create Volaris-scoped experiment, API contract with Deuna eng, shadow mode validation before live traffic.

✅ Triton branch: ExperimentService one-call API (V-39–41) + built-in shadow mode (V-46) reduce effort by ~1d.

8 tasks · 6.5d (was 7.5d) · Triton ✓
Phase 7

Monitoring & Feedback Loop

  • V-47: Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake. Done ✓ — confirmed data live in Deuna's Snowflake (2026-02-24)
  • V-48: Deploy ATHIA_STAGE_OUTCOMES table. Done ✓ — deployed in feat/ATH-0000
  • V-49–50: Approval rate + model performance Grafana dashboards
  • V-51–54: Retraining trigger, scheduled pipeline, auto-winner, post-launch review. Ready ✓ — LLM orchestrator + training pipeline built

8 tasks · 6d · 2 tasks done/ready

All 65 Tasks

# | Task | Phase | Owner | Effort | Status
V-D01 | Design overall service architecture — 7 service boundaries, data flow, inter-service communication | 0 – Architecture | Rakesh | 2d | Design
V-D02 | Define API contracts for all 7 services — OpenAPI specs, error handling, versioning | 0 – Architecture | Rakesh | 1.5d | Design
V-D03 | Design TensorFlow ecosystem integration — map TFX components to services, TF Serving format, TFDV/TFMA | 0 – Architecture | Rakesh | 1d | Design
V-S01 | Scaffold Data Pipeline service — TFX ExampleGen + StatisticsGen, Snowflake adapter, TFDV schema | 0 – Shell | Kedar | 1.5d |
V-S02 | Scaffold Feature Service — TF Transform preprocessing_fn, feature store API, real-time endpoint | 0 – Shell | Rene | 1.5d |
V-S03 | Scaffold Training Pipeline service — TFX Trainer + tf.keras, Keras Tuner, training history | 0 – Shell | Rene | 1.5d |
V-S04 | Scaffold Model Management service — registry CRUD, SavedModel storage, lifecycle, version comparison | 0 – Shell | Naoki | 1d |
V-S05 | Scaffold Eval Service — TFMA integration, per-slice metrics, model blessing/rejection API | 0 – Shell | Rene | 1d |
V-S06 | Scaffold Evaluation Framework — A/B stat engine, winner detection, latency/revenue guardrails | 0 – Shell | Naoki | 1.5d |
V-S07 | Scaffold Experiment System — experiment CRUD, traffic splitting, variants, shadow mode orchestration | 0 – Shell | Naoki | 1.5d |
V-S08 | Set up shared TF dependencies — tensorflow, tfx, tf-transform, tfma, tfdv, keras-tuner + Docker base | 0 – Shell | Kedar | 0.5d |
V-01 | Filter Volaris transactions — date range, volume, monthly trend | 1 – EDA | Kedar | 0.5d |
V-02 | Per-processor approval rates (Worldpay, MIT, Elavon, Amex) by card type, currency, amount | 1 – EDA | Kedar | 1d |
V-03 | Retry pattern analysis — attempts per order, processor retry-to, 1st/2nd/3rd attempt success rates | 1 – EDA | Kedar | 1d |
V-04 | Explore DYNAMIC_ROUTING_DETAIL JSON — extract all keys and values | 1 – EDA | Kedar | 1d |
V-05 | Map Volaris routing rules from VW_ROUTING_MERCHANT_RULE* views | 1 – EDA | Kedar | 0.5d |
V-06 | Analyze smart routing log — algorithm types, skip rates, p95 latency baseline | 1 – EDA | Kedar | 0.5d |
V-07 | Hard vs. soft decline distribution by processor and error code | 1 – EDA | Rene | 1d |
V-08 | Profile airline-specific features — flight, passenger, booking window signal | 1 – EDA | Rene | 0.5d |
V-09 | A/B test sample size check — daily volume per processor ≥ 1000/variant in 7 days? | 1 – EDA | Rene | 0.5d |
V-10 | EDA summary report — approval rates, error taxonomy, processor share, correlations | 1 – EDA | Rene + Rakesh | 1d |
V-11 | Define Volaris feature schema — all features, types, sources, compute latency | 2 – Features | Rene + Naoki | 1d |
V-12 | Card-level features — BIN, brand, bank, type, country; historical approval rate per BIN × processor | 2 – Features | Rene | 1d |
V-13 | Transaction-level features — amount, currency, CIT/MIT, MCC, flight order type | 2 – Features | Rene | 1d |
V-14 | User-level features — RFM, fraud rate cohort, tenure, browsing signals | 2 – Features | Rene | 0.5d |
V-15 | Retry-context features — previous processor, error code, time since attempt, attempt number | 2 – Features | Kedar | 1d |
V-16 | Processor-state features — rolling approval/timeout/decline rate at 15-min, 1h, 24h windows | 2 – Features | Kedar | 1.5d |
V-17 | Amex hard-rule — always route Amex cards to Amex processor; bypass ML | 2 – Features | Naoki | 0.5d |
V-18 | Build training dataset — join features onto labeled outcomes; train/val/test split | 2 – Features | Kedar | 1d | Ready ✓ ATHIA_TRAINING_DATASET view + feature_extractor.py
V-19 | Feature quality validation — nulls, skew, leakage risk, outcome correlation | 2 – Features | Rene | 1d | Ready ✓ data_quality_validator.py (834 lines)
V-20 | Train processor_selector v1 — rank 4 PSPs by approval probability (tf.keras DNN) | 3 – Models | Rene | 2d | TF · Training Pipeline service
V-21 | Evaluate processor_selector — AUC, lift vs. static rules, per-processor accuracy, latency | 3 – Models | Rene | 1d | Ready ✓ Metrics auto-calculated by pipeline
V-22 | Train retry_predictor v1 — predict retry approval probability | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_predictor type
V-23 | Train retry_sequence v1 — optimal processor order for retry | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_sequence type
V-24 | Evaluate retry models — success rate lift, processor fatigue patterns | 3 – Models | Rene | 1d | Ready ✓ Evaluation framework in pipeline
V-25 | Architecture comparison — DNN vs. wide-and-deep vs. TF Decision Forests; select champion | 3 – Models | Rene | 1d | TF · Replaces XGBoost vs. LR comparison
V-26 | Inference latency test — all models under 50ms budget | 3 – Models | Naoki | 0.5d |
V-27 | Package models — serialize, write model card (schema, features, metrics) | 3 – Models | Kedar | 0.5d | Ready ✓ model_registry.py auto-creates tables + stores metadata
V-28 | Define outage signal — timeout/error code thresholds for PSP-down detection | 4 – P-01 | Rakesh + Naoki | 1d |
V-29 | Rolling processor health score — sliding 5–15 min window per PSP | 4 – P-01 | Naoki | 1.5d |
V-30 | Failover logic — skip degraded PSP, route to next-best Volaris processor | 4 – P-01 | Naoki | 1.5d |
V-31Recovery detection — 1–2% sampling of down PSP; auto-restore on consecutive wins4 – P-01Naoki1d
V-32Outage simulation tests — inject failures per PSP; verify failover + recovery4 – P-01Naoki1d
V-33Outage alerting — Slack/PagerDuty on PSP state changes4 – P-01Kedar0.5d
V-34Audit CIT/MIT usage for Volaris — current distribution across PSPs5 – P-04Kedar0.5d
V-35Approval delta by CIT vs MIT per processor — statistical test5 – P-04Rene1d
V-36Design message manipulation experiment — CIT/MIT × processor × card type matrix5 – P-04Rene + Rakesh1d
V-37Implement message recommendation API in athena-platform5 – P-04Naoki2d
V-38Run A/B test — approval rate with vs. without message recommendations5 – P-04Kedar1d
V-39Register processor_selector in MODEL_ARTIFACTS (version, Triton backend ref, feature schema)6 – IntegrationNaoki0.3dReady ✓ POST /api/v1/ml/models (Triton branch ExperimentService)
V-40Register retry_predictor + retry_sequence in MODEL_ARTIFACTS6 – IntegrationNaoki0.3dReady ✓ Same — ExperimentService handles all 3 model types
V-41Create Volaris-scoped experiment — merchant filter, 10% treatment split, shadow mode, guardrails6 – IntegrationNaoki0.5dReady ✓ POST /api/v1/ml/experiments — variants + models in one call (Triton branch)
V-42Validate experiment assignment — SHA256 bucketing determinism for Volaris6 – IntegrationNaoki0.5d
V-43API contract with Deuna engineering — define POST /api/v1/ml/predict request/response for Volaris6 – IntegrationRakesh1d
V-44Deuna payment service integration — Deuna calls athena-platform at routing decision point6 – IntegrationRakesh + Naoki2d
V-45End-to-end integration test — full flow: Deuna → athena-platform → model → ranked PSPs6 – IntegrationNaoki + Kedar1d
V-46Shadow mode — 48h logging without acting; compare predicted vs. actual outcomes6 – IntegrationKedar0.5dReady ✓ is_shadow_mode=true built-in (Triton branch); set up + monitor only
V-47Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake7 – MonitoringKedar0.5dDone ✓ Confirmed data live in Deuna's Snowflake (2026-02-24)
V-48Deploy ATHIA_STAGE_OUTCOMES table in Snowflake7 – MonitoringKedar0.5dDone ✓ Deployed in feat/ATH-0000 SQL
V-49Volaris approval rate dashboard — daily/hourly per PSP vs. baseline7 – MonitoringKedar1d
V-50Model performance dashboard — prediction confidence, rank accuracy, retry lift7 – MonitoringKedar1d
V-51Define retraining trigger — approval rate drop or AUC drop thresholds7 – MonitoringRene0.5dReady ✓ llm_training_orchestrator.py makes RETRAIN_NOW / SCHEDULED / SKIP decisions
V-52Schedule weekly retraining — auto-register new version from latest ATHIA_TRAINING_DATASET7 – MonitoringRene1dReady ✓ training_pipeline.py + orchestrator built; configure for Volaris cadence
V-53Confirm auto-winner worker runs for Volaris experiment with correct guardrails7 – MonitoringNaoki0.5d
V-54Post-launch review — 2-week lift analysis: approval rate, outage response, retry success7 – MonitoringRakesh1d
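Tasks V-41 and V-42 depend on deterministic SHA256 bucketing with a 10% treatment split. The sketch below shows one common way to build such an assignment. The key format (`experiment_id:unit_id`), the 10,000-bucket resolution, and both function names are assumptions for illustration, not the athena-platform implementation.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumed resolution; the real service may use another value


def bucket_for(experiment_id: str, unit_id: str) -> int:
    """Map (experiment, unit) to a stable bucket in [0, NUM_BUCKETS)."""
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to a bucket.
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def assign_variant(experiment_id: str, unit_id: str, treatment_share: float = 0.10) -> str:
    """Return 'treatment' for the lowest treatment_share of buckets, else 'control'."""
    threshold = int(treatment_share * NUM_BUCKETS)
    return "treatment" if bucket_for(experiment_id, unit_id) < threshold else "control"
```

Because the bucket depends only on the hashed key, the same unit always lands in the same variant, which is the property V-42 sets out to validate.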
💡Codebase Improvement Suggestions
What needs to change in both repos to reach production-grade quality

🔴 Critical — Do These First

Action | Repo | Effort
Build retry_optimization_requested stimulus — P-05 is entirely missing from LLM platform | DATA-Athena-Snowflake | 3d
Complete Strategy Director — replace exit() placeholder & dummy ranker prompts | DATA-Athena-Snowflake | 2d
Add tests for all 18 V2 REST handlers — 0% coverage on new API version | athena-platform | 4d
Add tests for Bedrock client + Bedrock domain service — production-critical, currently excluded | athena-platform | 1.5d
Add route, service & client layer tests — all 14 routes, 13 services, 3 clients at 0% | DATA-Athena-Snowflake | 7d
Remove internal/clients/ from coverage exclusions in CI | athena-platform | 0.5d
Python / LangGraph DATA-Athena-Snowflake
Testing
  • ✗ Unit test all route handlers with FastAPI TestClient + mocked services
  • ✗ Unit test all 13 services — mock Snowflake sessions and clients
  • ✗ Unit test core multi-agent framework: AgentWorkflow, AgentStrategy, node/edge composition
  • ✗ Add per-branch tests for all 11 stimulus branches (mock LLM responses with fixtures)
  • ✗ Unify CI into a single pytest run — replace fragmented per-domain workflows
  • ✗ Enable pytest-cov with 60% minimum threshold enforced in CI
Architecture
  • ✗ Circuit breaker in AgentWorkflow — isolate node failures, prevent cascade
  • ✗ Enable OpenTelemetry tracing — already in codebase, just commented out
  • ✗ Replace hardcoded thresholds (15% drop, 60–80 min windows) with configurable params
  • ✗ Add LLM prompt injection guards — sanitize user inputs before system prompts
  • ✗ Standardize tool definition — unify @create_tool vs. manual; add versioning
  • ✗ Centralize config — replace scattered load_dotenv with Pydantic Settings schema
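The circuit-breaker item above can be sketched as a small wrapper around a single workflow node. This is a minimal illustration only: `NodeCircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical names and values, and nothing here reflects the actual AgentWorkflow API.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class NodeCircuitBreaker:
    """Hypothetical breaker that isolates repeated failures of one node."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, node: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("node temporarily disabled")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = node(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open the breaker
            raise
        self._failures = 0  # any success closes the breaker fully
        return result
```

While the breaker is open, callers get a fast `CircuitOpenError` instead of waiting on a failing node, so one bad node cannot stall the rest of the graph.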
Go / Gin athena-platform
Testing
  • ✗ Add tests for all 18 V2 handlers — entire new API version at 0%
  • ✗ Test Bedrock client & service — excluded from coverage, production-critical
  • ✗ Raise CI threshold 20% → 60%; remove internal/clients/ exclusion
  • ✗ Bootstrap integration test with testcontainers-go — verify DI graph
  • ✗ Benchmark tests for /ml/predict, /feedback, experiment assignment
  • ✗ Contract tests for Snowflake & Bedrock APIs — catch schema drift early
Architecture
  • ✗ Event-driven model registry cache invalidation — remove 24h stale assignment risk
  • ✗ Experiment context middleware — auto-propagate session/experiment IDs per request
  • ✗ Abstract *gin.Context from controllers — transport-agnostic, easier to test
  • ✗ Deploy ATHIA_STAGE_OUTCOMES + ATHIA_SESSION_SUMMARY Snowflake tables
  • ✗ SageMaker model warm-up — cold starts can breach p95 < 200ms target
  • ✗ Production Grafana dashboards + alerts — config exists locally, not deployed

Full Priority Order

Priority | Action | Repo | Effort
Critical | Build retry optimization stimulus (P-05) | DATA-Athena-Snowflake | 3d
Critical | Complete Strategy Director matcher + ranker | DATA-Athena-Snowflake | 2d
Critical | Add tests for all 18 V2 REST handlers | athena-platform | 4d
Critical | Add tests for Bedrock client + service | athena-platform | 1.5d
Critical | Add route, service & client tests (all at 0%) | DATA-Athena-Snowflake | 7d
Critical | Remove internal/clients/ from coverage exclusions | athena-platform | 0.5d
High | Multi-agent framework + branch tests (11 branches) | DATA-Athena-Snowflake | 6d
High | Circuit breaker in AgentWorkflow | DATA-Athena-Snowflake | 2d
High | Enable OpenTelemetry tracing | DATA-Athena-Snowflake | 1.5d
High | Raise CI coverage threshold to 60% | athena-platform | 0.5d
High | Bootstrap integration test (testcontainers) | athena-platform | 1.5d
High | Event-driven model registry cache invalidation | athena-platform | 1.5d
High | Deploy ATHIA_STAGE_OUTCOMES + SESSION_SUMMARY tables | athena-platform | 1d
High | SageMaker model warm-up (latency target risk) | athena-platform | 1d
Medium | Adaptive thresholds (replace hardcoded values) | DATA-Athena-Snowflake | 2d
Medium | Experiment context middleware | athena-platform | 2d
Medium | Production Grafana dashboards + alert rules | athena-platform | 2d
Medium | Benchmark tests for hot endpoints | athena-platform | 1d
Medium | Unified CI test suite + coverage enforcement | DATA-Athena-Snowflake | 1.5d
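For planning, the effort column above can be totalled per priority tier. The snippet below is just that arithmetic, with the day values copied from the table; the function name is illustrative.

```python
from collections import defaultdict

# (priority, effort in days), copied row by row from the priority table
ITEMS = [
    ("Critical", 3), ("Critical", 2), ("Critical", 4),
    ("Critical", 1.5), ("Critical", 7), ("Critical", 0.5),
    ("High", 6), ("High", 2), ("High", 1.5), ("High", 0.5),
    ("High", 1.5), ("High", 1.5), ("High", 1), ("High", 1),
    ("Medium", 2), ("Medium", 2), ("Medium", 2), ("Medium", 1), ("Medium", 1.5),
]


def effort_by_priority(items):
    """Sum effort (days) per priority tier."""
    totals = defaultdict(float)
    for priority, days in items:
        totals[priority] += days
    return dict(totals)


totals = effort_by_priority(ITEMS)
print(totals)                # {'Critical': 18.0, 'High': 15.0, 'Medium': 8.5}
print(sum(totals.values()))  # 41.5
```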
📝Daily Updates
Aidaptive engineering activity across both codebases
2026-06-11
Engineer | Repo | Commits
Rakesh | DATA-Athena-Snowflake | 0114867 Add BIN 546759 → MIT_BULK override; fix ECS poll; 36e2a2c Add refresh_bin_overrides.py; 4227ebb Add serving-time WORLDPAY overrides; 1ecc87c Add MIT_BULK neg-weight; e1fae15 Fix override detection; 965adc4 Revert Adyen processor merge (PR #1389)
Summary: BIN-level routing override automation — managing specific BINs causing mis-routing; reverted Adyen DNN head (needs more data)
2026-06-10
Engineer | Repo | Commits
Rakesh | DATA-Athena-Snowflake | 8fd7755 fix: update Snowflake account to VLTAXPW-YN70854
Summary: Migrated all scripts to new Snowflake instance VLTAXPW-YN70854
2026-06-09
Engineer | Repo | Commits
Rakesh | DATA-Athena-Snowflake | 7c07eae feat(ATH-1272): BIN routing override config and automation script; 7433b34 fix: correct warehouse in routing scripts; 410e512 fix: PAYMENTS_ML warehouse in update_bin_routing_rules
Summary: BIN routing override automation wired up; warehouse config fixes across pipeline scripts
2026-06-08
Engineer | Repo | Commits
Rakesh | DATA-Athena-Snowflake | 96e572e feat(ATH-1375): add Adyen as 5th processor head in Volaris DNN; 7eb8b4e feat: persist BIN neg-weight config across retrains; aec0094 test(ATH-1352): neg-weight config tests; 4a6af72 fix(ATH-1277): align Snowflake warehouse defaults
Rakesh | athena-platform | c0cb96d feat(ATH-1375): add Adyen BIN/brand approval rate fields to serving layer; 583522a fix(ATH-1277): close DNN confidence gap — channel, timing, flight data
Summary: Added Adyen as 5th DNN processor head (training + serving); BIN neg-weight config persisted; DNN confidence gap closed with new features
2026-06-04
Engineer | Repo | Commits
Rakesh | DATA-Athena-Snowflake | c5b40d9 fix(volaris): correct bin_processor_rates formula
Rakesh | athena-platform | 8abd1b4 test(ATH-1352): expand BIN rate penalty coverage to 26 tests
Summary: Formula fix for BIN processor rates; expanded serving-layer BIN penalty test coverage
2026-06-02
Engineer | Repo | Commits
Rakesh | DATA-Athena-Snowflake | db2d8ea fix(ATH-1277): write version.json to latest/ prefix on every promotion; 42a3887 feat(ATH-1277): restore automatic latest/ writes
Summary: Fixed version.json promotion so latest/ pointer always stays current on S3
2026-06-01
Engineer | Repo | Commits
Rakesh | DATA-Athena-Snowflake | dde18a4 feat(training): BIN neg-weight boosting enabled by default for 481516/416916 → WORLDPAY; e0dec88 feat(training): add BIN-level negative signal boosting for mis-routed processor pairs
Rakesh | athena-platform | 6691322 feat(ATH-1352): Thompson Sampling for Volaris smart router; a74d8a5 feat(ATH-1352): serving-layer BIN approval rate penalty
Summary: Thompson Sampling merged to smart router; BIN approval rate penalty live in serving; BIN neg-weight training improvement for mis-routed WORLDPAY BINs
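For readers unfamiliar with Thompson Sampling, the general technique can be sketched as follows: keep approval and decline counts per processor, draw one sample from each processor's Beta posterior, and route to the highest draw. This is a textbook Beta-Bernoulli version under a uniform prior, not the code merged in 6691322; the function name and the counts in the usage note are made up for illustration.

```python
import random
from typing import Dict, Tuple


def thompson_pick(stats: Dict[str, Tuple[int, int]], rng: random.Random) -> str:
    """stats maps processor -> (approvals, declines); returns the sampled-best processor."""
    if not stats:
        raise ValueError("no processors to choose from")
    best, best_draw = "", -1.0
    for processor, (approved, declined) in stats.items():
        # Beta(1 + approvals, 1 + declines) is the posterior under a uniform prior.
        draw = rng.betavariate(1 + approved, 1 + declined)
        if draw > best_draw:
            best, best_draw = processor, draw
    return best
```

With illustrative counts such as `{"WORLDPAY": (80, 20), "ADYEN": (0, 0)}`, the processor with no history still wins a share of draws, which is how the method explores new routes without abandoning a proven one.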
2026-06-08
Goal: Analyze Volaris DNN model performance in production; identify new processor impact; stabilize to best-performing model version.
Engineer | Work Done
Rakesh | Deep analysis of Volaris DNN transactions. Identified that Adyen was added as a new processor in mid-May, so the model has no training history for it and needs retraining once 12 weeks of Adyen data are available. Compared model versions v54 and v56 extensively; v54 is the stronger performer. Coordinated with Jose to revert production serving to v54 and disabled daily automated training in favor of a controlled manual training cadence going forward.
2026-06-07
Goal: Audit BIN-level penalty configuration and improve model calibration for underperforming card BINs.
EngineerWork Done
RakeshAudited all BIN-level penalty configurations and calibrated the model to correctly handle bad BINs (card BINs with historically low approval rates). This calibration work was the key driver behind v54 and its improved production performance.
2026-06-06
Goal: Build a dedicated transaction analysis tool for debugging Volaris DNN model performance.
EngineerWork Done
RakeshCreated a dedicated Transaction Explorer tool (transactions.html) — a live Snowflake-backed analyzer for inspecting individual transactions across the Deuna ML ecosystem. Displays all ML features fed to the DNN model, raw request/response payloads from the sidecar, payment attempt lifecycle (first try + retries), and feedback/outcome records. Supports sampling by processor, model version (DNN vs heuristic), and outcome correctness — especially useful for debugging Volaris DNN model performance discrepancies.
2026-06-05
Goal: Restore ML training pipelines after Snowflake migration; refresh all dashboards on new instance.
EngineerWork Done
RakeshFixed all ML training pipelines failing due to access issues post Snowflake migration to Ohio region (new instance: VLTAXPW-YN70854). Updated all scripts and config to point to new account/credentials. Verified pipeline connectivity and ran full metrics refresh — confirmed 884K attempts, 80.4% approval rate across all dashboards.
2026-06-04
Goal: Validate Volaris DNN model progression and debug BIN-level penalty not taking effect in production.
Engineer | Work Done
Rakesh | Analyzed Volaris DNN model versions v43–v48 to confirm iterative improvement across versions. Investigated why the BIN-level penalty is not applying in production; the current hypothesis is that the penalty config was loaded onto the sidecar rather than the serving path. Debugging in progress.
2026-06-03
Goal: Refresh all metrics dashboards with live Snowflake data; analyze MCO model readiness for general merchant rollout.
EngineerWork Done
RakeshRan full Snowflake refresh across metrics and merchant dashboards (822K attempts, 80.8% approval rate). Added daily $ lift chart for Volaris DNN vs Heuristic comparison. Fixed model comparison chart rendering (Chart.js scale ID issue). Analyzed MCO model to confirm it is ready for regular (non-Volaris) merchant MCO rollout — confirmed bias removal logic is still functioning correctly.
2026-05-31
Goal: Ensure training data quality and model promotion safety for Volaris DNN pipeline.
EngineerWork Done
RakeshDeep analysis of data and schema — identified Elavon and Worldpay transactions stuck in "processing" status, polluting training data with ambiguous outcomes. Pinging Deuna team to resolve. Added quality metric gates to model promotion pipeline — model will not be promoted if it does not outperform the currently serving model in offline evaluation.
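The promotion rule described above (do not promote unless the candidate outperforms the serving model offline) reduces to a simple comparison. The sketch below assumes a single offline metric, AUC, and a hypothetical `min_gain` margin; the real pipeline's gates and metric names are not shown in this log and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    version: str
    auc: float  # assumed offline metric; the actual gates may use others


def should_promote(candidate: EvalResult, serving: EvalResult, min_gain: float = 0.0) -> bool:
    """Promote only if the candidate beats the serving model by more than min_gain."""
    return candidate.auc - serving.auc > min_gain
```

A tie is treated as "do not promote", so the serving model is only replaced when there is a measurable offline improvement.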
2026-05-28
Goal: Ship serving change for MCO model on Volaris transactions.
Engineer | Work Done
Rakesh | Added a serving change so that Volaris transactions call the MCO model using flight data. Verified the MCO model in shadow mode for Volaris transactions; serving-code fixes are in progress.
2026-05-27
Goal: Deep-dive Volaris and MCO metrics, resolve sampling bias in model comparison.
EngineerWork Done
RakeshAdded detailed analysis for Volaris and MCO on metrics dashboard and scripts. Extensive analysis to identify and resolve sampling bias in DNN vs heuristic comparison — implemented stratified amount-bin matching. Audited MCO for Volaris and identified serving change needed to route Volaris transactions through MCO model using flight data.
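Stratified amount-bin matching addresses the case where the two arms see different transaction-amount mixes, which can distort a naive approval-rate comparison. The sketch below shows one generic way to do it: compare the arms within each amount bin, then average the per-bin differences weighted by bin size. The function name, arm labels, and bin edges are assumptions; the analysis scripts referenced above may weight bins differently.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Iterable, List, Tuple

Txn = Tuple[str, float, bool]  # (arm, amount, approved)


def stratified_lift(txns: Iterable[Txn], edges: List[float],
                    arm_a: str = "dnn", arm_b: str = "heuristic") -> float:
    """Approval-rate difference (arm_a - arm_b) averaged over amount bins,
    each bin weighted by its number of transactions across both arms."""
    counts = defaultdict(lambda: [0, 0])  # (bin, arm) -> [approved, total]
    for arm, amount, approved in txns:
        cell = counts[(bisect_right(edges, amount), arm)]
        cell[0] += int(approved)
        cell[1] += 1
    weighted, total_weight = 0.0, 0
    for b in range(len(edges) + 1):
        a_ok, a_n = counts[(b, arm_a)]
        b_ok, b_n = counts[(b, arm_b)]
        if a_n == 0 or b_n == 0:
            continue  # bin not observed in both arms: no matched comparison
        weight = a_n + b_n
        weighted += weight * (a_ok / a_n - b_ok / b_n)
        total_weight += weight
    return weighted / total_weight if total_weight else 0.0
```

If one arm handles mostly high-value tickets (which tend to approve less often), the naive rates can favor the other arm even when the first arm is better within every bin; the stratified figure removes that effect.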
2026-05-25
Goal: Investigate and fix model version logging issue affecting dashboards.
EngineerWork Done
RakeshIdentified bug where "latest" was being logged as the model version instead of the actual version (e.g. v35) — causing confusion in dashboards.
2026-05-24
Goal: Validate best Volaris DNN model version and explore simulation options.
Engineer | Work Done
Rakesh | Extensive analysis confirmed that v34 is the best Volaris DNN model version. Explored offline simulation but determined it would be difficult because counterfactual outcomes are not available.
2026-05-20
Goal: Quality improvements to Volaris DNN model to close gap vs control.
EngineerWork Done
RakeshAnalyzed and added multiple minor quality enhancements to Volaris DNN model — getting close to consistently beating control, even if by a small margin.
2026-05-19
Goal: Temperature-based calibration for Volaris DNN + bias analysis on control/experiment groups.
EngineerWork Done
RakeshAdded temperature-based calibration to Volaris DNN training model along with serving changes to use calibration. Analyzed biases in Volaris control and experiment groups.
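Temperature-based calibration generally means dividing the model's logits by a scalar T before the softmax: T > 1 softens over-confident probabilities, T < 1 sharpens them, and the ranking of processors is unchanged. The sketch below shows only that scaling step. How T is fitted in the Volaris pipeline is not described here, and the function name is illustrative.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Because training and serving must apply the same T, the calibration value has to ship with the model artifact, which is consistent with the serving changes noted above.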
2026-05-18
Goal: Fix non-performing BINs via override rules; explore confidence score + temperature approach for model fallback.
Engineer | Work Done
Rakesh | Continued analyzing Volaris DNN model performance. Routed non-performing Visa and MC BINs through ad hoc override rules and added a detailed analysis to ticket ATH-1272. Next: tune the model to use confidence scores and temperature-based analysis, exploring per-processor confidence scaling with a threshold below which routing falls back to heuristics.
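The fallback idea being explored here can be sketched as a threshold check on the model's top choice. Everything below is hypothetical: the function name, the per-processor threshold map, and the 0.6 default are placeholders, since the entry describes this as exploratory work rather than a finished design.

```python
from typing import Dict, Optional


def route(probs: Dict[str, float], heuristic_choice: str,
          thresholds: Optional[Dict[str, float]] = None,
          default_threshold: float = 0.6) -> str:
    """Use the model's top processor only if its calibrated probability clears
    that processor's threshold; otherwise fall back to the heuristic choice."""
    thresholds = thresholds or {}
    top = max(probs, key=probs.get)
    needed = thresholds.get(top, default_threshold)
    return top if probs[top] >= needed else heuristic_choice
```

For example, `route({"WORLDPAY": 0.55, "ELAVON": 0.45}, heuristic_choice="ELAVON")` returns the heuristic choice because the model's top probability is below the default threshold.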
2026-05-17
Goal: Fix underperforming BIN ranges in Volaris DNN and submit override rules PR.
Engineer | Work Done
Rakesh | Analyzed Volaris DNN model performance for specific BIN ranges where it is underperforming. Added ad hoc rules to override model results in those failing BIN ranges. Submitted PR, scheduled to go out Monday.
2026-05-14
Goal: Extend MCO models for airline tasks and fix GPU OOM in training pipeline.
Engineer | Work Done
Rakesh | Added capability to MCO models to handle airline tasks (Volaris-specific flows). Fixed the MCO training pipeline so the GPU no longer runs out of memory; training now finishes and produces a model. Deep analysis of Volaris routing model performance found a niche error where 3 specific MC BINs were failing 100% of the time; added BIN-related features to the training pipeline and trained a new model.
2026-05-13
Milestone: All done — latest model trained and deployed in prod, no more known issues.
Engineer | Work Done
Rakesh | Worked through the night to fix everything that was broken for Volaris DNN: the model was not being retrained daily on the latest trends, and this is now fully resolved. Fixed the ClearML integration issue, deployed a full training run end-to-end, and loaded the model from S3 to EFS for serving. Also fixed the MCO training pipeline, which was failing. Added an alerting check for when the daily pipeline run does not happen, and a scheduler that reloads the latest model version at 7am PST every day so production always has the latest model. Verified everything end-to-end: all previously discussed ideas are complete, the latest model is trained and deployed in prod, and there are no known issues. Confirmed Volaris routing is back to break-even.
2026-05-12
Goal: Debug MCO shadow mode and connect remaining model heads to production.
Engineer | Work Done
Rakesh | Debugged the MCO model in shadow mode to ensure everything is hooked up correctly. Added the last 2 remaining heads of the MCO model; all heads are now called in production.
2026-05-04
Goal: Confirm DNN back in production and validate data distribution matches training.
EngineerWork Done
RakeshDNN Volaris model is back up and running in production, handling live traffic with correct data as expected. Full analysis completed confirming everything is hooked up and connected end-to-end. Data distribution matches what the model was trained on — strong signal that the model should perform well for Volaris. Added serving change to handle other heads of MCO model so it can be used in payment services for whichever flow.
2026-05-03
Goal: Add MCO dashboard and verify full system connectivity.
EngineerWork Done
RakeshAdded MCO dashboard for monitoring. Completed full analysis confirming all components are hooked up and connected end-to-end.
2026-04-24
Goal: Get DNN model fully functional in production with 1-year training data.
EngineerWork Done
RakeshDNN model finally enabled in production — fixed many bugs so that production DNN functions correctly. Training with 1 year of data now succeeds. DNN serving live traffic alongside LogReg in A/B experiment.
2026-04-23
Goal: Monitor DNN model in production, fix bugs, retrain with 1 year of data.
EngineerWork Done
RakeshAdded model comparison graphs to metrics dashboard (DNN vs LogReg vs Heuristic — accuracy, approval rate, latency, volume). DNN model went to production yesterday. Monitoring production DNN, fixed one bug in pipeline. Retraining with 1 year of data — will load to production once training completes.
2026-04-19
Milestone: All productionization complete — 2 models running daily in dev and prod.
EngineerWork Done
RakeshAll productionization complete. 2 models (LogReg + DNN) for Volaris smart routing running in production. Continuous daily training from Snowflake data. GPU-accelerated (g6.2xlarge NVIDIA L4). Full CI/CD (GitHub Actions + CodePipeline + CodeDeploy). 5-gate quality suite. Deep ClearML integration. 13 of 14 gaps closed (~101d of ~104.5d saved). Only lineage tracking polish remaining (~1.5d).
2026-04-15
Goal: Have both Log Reg and DNN pipelines running regularly in dev and prod, pushing models to S3 buckets.
EngineerWork Done
RakeshChanged training pipelines to use GPU instances. Implemented deeper ClearML integration in training pipeline — extracting detailed metrics from each training run into ClearML for monitoring and debugging.
2026-04-14
Goal: Get ClearML integration working in dev and deploy DNN pipeline end-to-end.
EngineerWork Done
RakeshSpent ~30 hours debugging dev servers with ClearML integration — blocked by access issues. Worked with team to resolve access, dev pipeline now working correctly. Attempted DNN pipeline deployment — ran for 3 hours and failed. Long turnaround time makes iteration impractical without GPU.
Blockers:
  • GPU needed — training requires GPU instance; CPU-based runs too slow for practical iteration (3hr+ per attempt)
  • AWS access for Rakesh — need direct access to AWS resources (console/CLI) to debug and iterate efficiently
2026-04-11
Goal: Have both model techniques (DNN and Log Reg) running e2e daily, pushing models in dev and prod.
Engineer | Work Done
Rakesh | Implemented Terraform for the DNN pipeline; deployment to dev is in progress. PRs still need to be submitted and merged to main/qa so the pipelines can run via CI/CD.
2026-04-10
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
EngineerWork Done
RakeshDebugging why ClearML is not registering all metrics from each training run. Next: deploy DNN model and fix production integration to ensure every training run model reaches production automatically
2026-04-07
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
EngineerWork Done
RakeshFixing Volaris training pipeline run in dev environment. Helping team with prod deployment
2026-04-05
Goal: Both training pipelines (dolphin SageMaker + Volaris smart router) running daily with ClearML tracking.
EngineerWork Done
RakeshDeployed e2e SageMaker dolphin pipeline (PreprocessData ✅, TrainModel ✅, EvaluateModel fix pushed). Deployed Volaris smart router daily training infra. ClearML integration confirmed working. Implemented Volaris pipeline CLI with Snowflake connection. Added 13 tests for setup_pipeline. Fixed multiple SageMaker issues: pipeline name mismatch, model.save(), RegisterModel, FrameworkProcessor for eval.
Blockers:
  • PR #1132 (DATA-Athena-Snowflake) needs approval to merge to qa — blocks CodePipeline deployment
  • CodeDeploy DeployEC2 stage failing — scripts need debugging after merge
  • SSO lacks sagemaker:CreatePipeline for local runs
Pending PRs:
  • #1132 DATA-Athena-Snowflake → qa (ClearML, Lambda handler, SageMaker fixes)
  • #26 terraform-athia → main (Dolphin SageMaker pipeline infra)
  • #27 terraform-athia → main (Volaris daily training infra)
2026-04-04
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
EngineerWork Done
RakeshDeployed e2e SageMaker training pipeline via Spacelift. Consolidated PRs #24+#25 → #26. Fixed Spacelift project_root, cleaned orphaned state. Enabled daily training for both pipelines. Set up ClearML creds in Secrets Manager + EC2. Created Spacelift stack for Volaris. Renamed model-artifacts → volaris-model-artifacts.
2026-04-02
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
EngineerWork Done
RakeshFighting Terraform and Spacelift configuration issues. ClearML successfully deployed in dev environment
2026-04-01
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
Engineer | Work Done
Rakesh | Deployed ClearML using Terraform and Spacelift in the dev environment and handed it over to the team. Ran a full analysis of the Snowflake data with Rene and generated a list of suggestions for the team, published on the metrics dashboard.
Rene | Ran a full analysis of the Snowflake data with Rakesh and generated a list of suggestions for the team.
2026-03-31
Goal: Deploy entire training platform using TF and Spacelift.
Engineer | Work Done
Rakesh | Deploying entire training platform end to end. Learning Spacelift for infra deployment now that access has been granted. Deploying ClearML to monitor all training. Wrote an analysis script that monitors model performance and drives the metrics dashboard directly from Snowflake, which is better for analysis and generating insights.
Rene | Analyzing model performance with past data. Waiting for the experiment to be re-enabled; current data is not sufficient for significant results.
2026-03-30
Goal: Have clear idea of metrics for measuring model performance and deploy entire training platform.
EngineerWork Done
RakeshAnalyzing model performance metrics, working on DNN model optimization for Volaris smart router
2026-03-28
Goal: Train one more DNN model for Volaris smart router.
EngineerWork Done
RakeshTraining DNN model for Volaris smart router — also serves as example on how to use the TFX pipeline
ReneData analysis to identify patterns based on Volaris data
2026-03-27
Goal: Get TF LR model in production.
EngineerWork Done
RakeshVerified everything working in production. Helping with questions about the model. Double-checked traffic ramp-up and analyzing how to do post-launch analysis
2026-03-26
Goal: Integrate new model in serving stack now that everything is working end to end.
EngineerWork Done
RakeshIntegrated model in serving stack with sidecar approach, loading models from S3/EFS. Removed all parallel serving libraries no longer needed in training directory
2026-03-25
Goal: Deploy everything in production for Volaris to get real data flowing.
EngineerWork Done
ReneCreated and incorporated AMEX model, added to the serving mix
RakeshBuilt AMEX model with Rene. Working with Deuna team on deploying everything in production for Volaris
2026-03-24
Goal: Have TF model up and running in production integrated with the Go API.
EngineerWork Done
RakeshAnalyzing code with Naoki to determine serving approach — sidecar vs deploying servomatic. Building production flow to save and load models from S3 & deploying servomatic binary
NaokiEvaluating sidecar vs servomatic deployment for integrating TF model with the Go API
ReneContinuing to iterate on the TF model and data analysis. Converting LR model to TF format for use in servomatic binary for online eval
2026-03-23
Goal: Hook this model into production to have everything connected end to end.
EngineerWork Done
ReneFirst regression model trained on Volaris data — evaluating quality in offline mode, initial results look promising
RakeshContinuing to analyze approach to serve the model as servomatic with Naoki
NaokiAnalyzing serving architecture to connect trained model via servomatic in production
2026-03-20
Goal: Have one model trained on Volaris data.
EngineerWork Done
ReneIterating on data shape analysis and building first model version
RakeshWorking with Rene on first model; analyzing serving code with Naoki to plan production integration
NaokiAnalyzing serving code with Rakesh to determine how to connect model in production
2026-03-19
Goal: Have first model ready at least in offline mode in the coming days.
EngineerWork Done
RakeshFirst analysis of the data with Rene — analyzing best approach to build processor selector model
ReneStarted on first model based on current understanding of data and features
NaokiWorking with Rene on integrating S3 file loading into TFX data loader for e2e training and eval
Blockers: None — have a few questions to confirm our understanding, will batch and ask together
2026-03-18
Goal: Have first ML model for selecting the right processor for every Volaris transaction.
EngineerWork Done
ReneLooking at data shape for Volaris to train first processor selector model
KedarWorking on data pipeline
NaokiContinues setting up good practices (code quality, CI, testing patterns)
RakeshWriting smartrouter service
Blockers cleared: AWS access granted · Code review done, all code merged to qa
2026-03-16
Goal: Submit everything and build data pipeline to extract Volaris data from Snowflake.
EngineerWork Done
RakeshIterated on experiment and metrics framework to make everything work locally and in tests
NaokiIterated on improving code and ramping up
KedarLooking at feature extraction from Snowflake
ReneWorking on simple first model
Blockers: PR review pending · AWS access for POC server (from 2026-03-13)
2026-03-15
Goal: Get the training platform implemented and in shape, ready for feature engineering and training of the Volaris model.
Engineer | Work Done
Rakesh | Added Evaluation Service (uses Model Service + Feature Service to evaluate TensorFlow models). Added e2e tests for all 3 services. Added experiment and metrics framework to track all training pipelines. Demo training pipeline working end to end. PR waiting for review.
Blockers: PR review pending · AWS access for POC server (from 2026-03-13)
2026-03-14
Engineer | Work Done
Rakesh | Built foundational services: Model Service and Feature Service, with tests and scaffolding to support TensorFlow-trained models.
2026-03-13
Goal: Get everything running tests regularly and pushing to dev server automatically — getting comfortable with the current stack.
EngineerWork Done
NaokiFixed broken tests to get everything running locally. Looking at setting up automated deployment in dev environment for services
KedarGot repo and environment access figured out. Looking into Snowflake data schema
ReneGot repo and environment access figured out. Looking at training pipeline code
RakeshUpdated deuna.aidaptive.com with latest repo analysis and refreshed task list. Synced athena-platform (v0.15.5, Triton merged)
Blocker: Waiting for AWS access to deploy on POC server
❓Open Questions
Items that need answers before effort estimates are finalized
1. Are ATHIA_PREDICTIONS / ATHIA_FEEDBACK tables populated in Deuna's Snowflake today? (Confirmed ✓ 2026-02-24)
   Confirmed: data is live in Deuna's Snowflake (verified 2026-02-24).

2. Are SageMaker endpoints live for processor_selector / retry_predictor?
   Or are they placeholders only? Rakesh to confirm.

3. Is there a live model in MODEL_ARTIFACTS that Deuna's payment service is calling today?
   Rakesh to confirm.

4. What is the current payment volume through the routing engine?
   Minimum 1,000 transactions per variant needed for A/B test statistical validity. Ask Israel.

5. Who owns the athena-platform Go repo deployments?
   Aidaptive or Deuna infra? Affects Phase 1 deployment planning. Clarify with Pablo.

6. When will feature/llm-driven-ml-training (Triton IS) merge to main? (New)
   This PR closes G-06 and defines the production model serving backend (Triton vs. SageMaker). Its merge timeline directly sets the Phase 6 integration schedule. Ask Pablo.
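
The per-variant volume in question 4 can be sanity-checked with the standard two-proportion sample-size formula. The sketch below is illustrative only: the 80% baseline and 85% variant acceptance rates, the 5% significance level, and the 80% power are assumptions rather than project figures, and `sample_size_per_variant` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p_baseline: float, p_variant: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Transactions needed per variant for a two-sided two-proportion z-test.

    All default values are assumptions chosen for illustration.
    """
    if p_baseline == p_variant:
        raise ValueError("baseline and variant rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    effect = p_variant - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Hypothetical acceptance rates: 80% on the control route, 85% on the variant.
    print(sample_size_per_variant(0.80, 0.85))
```

Under these assumed rates the formula asks for roughly 900 transactions per variant, which is in line with the 1,000 minimum. Detecting a smaller lift would need considerably more volume.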

🔑Access & Blockers
Pending provisioning items
Item | Owner | Status
Snowflake access — Rakesh | Israel (Deuna) | ✓ Done (2026-02-18)
Snowflake access — Naoki | Rakesh + Naoki | ✓ Done (2026-02-19)
Code / repo access — Rakesh | Pablo (Deuna) | ✓ Done (2026-02-19)
Claude / LLM access & budget | Pablo → Farhan | ✓ Done (2026-02-19)
Code / repo access — Naoki | TBD | ✓ Done
Deuna corp accounts — Rakesh & Naoki | TBD | Pending
Claude Code credits — Rakesh & Naoki | — | Not needed
Deploy ATHIA_STAGE_OUTCOMES + ATHIA_SESSION_SUMMARY in Snowflake | Rakesh | ✓ Done (feat/ATH-0000)
Build retry_optimization_requested workflow | Rakesh | Pending
Quick Links
Service | URL | Details
AWS Console (SSO) | deunaio.awsapps.com | Deuna AWS account access
Snowflake | vltaxpw-rmontes.snowflakecomputing.com | Account: VLTAXPW-RMONTES · DB: PAYMENT_ML · Warehouse: PAYMENT_ML · Read-only
Athia Experiments Dashboard | insights.deuna.com | Model performance data for processor selector experiments
ClearML (Prod) | athia-ml.deuna.io | ML experiment tracking & training monitoring — production
ClearML (Dev) | athia-ml.dev.deuna.io | ML experiment tracking & training monitoring — dev environment
Spacelift | duna-e-commmerce.app.spacelift.io | Infrastructure governance & Terraform deployment
Terraform Repo | github.com/DUNA-E-Commmerce/terraform-athia | All Athia infrastructure as code
Development Rules
Rule | Details
AWS Resource Tags | All AWS resources must include: CreatedBy=aidaptive, ServiceName=smartrouter, Environment=POC
Infrastructure as Code | All infrastructure via Terraform only — no manual AWS console resource creation
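
The tagging rule lends itself to an automated check. The following is a minimal sketch, assuming resource tags are available as a plain key/value mapping (for example, extracted from planned resources in a CI step). The helper name `tag_violations` and that workflow are hypothetical; only the three required tag values come from the rule above.

```python
# Required tags, taken from the AWS Resource Tags rule.
REQUIRED_TAGS = {
    "CreatedBy": "aidaptive",
    "ServiceName": "smartrouter",
    "Environment": "POC",
}


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return a human-readable list of missing or mismatched required tags."""
    problems = []
    for key, expected in REQUIRED_TAGS.items():
        actual = tags.get(key)
        if actual is None:
            problems.append(f"missing tag {key}")
        elif actual != expected:
            problems.append(f"tag {key}={actual!r}, expected {expected!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical resource that lacks the Environment tag.
    resource_tags = {"CreatedBy": "aidaptive", "ServiceName": "smartrouter"}
    for problem in tag_violations(resource_tags):
        print(problem)
```

An empty result means the resource satisfies the rule. Extra tags beyond the required three are allowed in this sketch.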
Decisions Log
Date | Decision | Rationale | Made By
2026-04-19 | All productionization complete — 2 models (LogReg + DNN) running daily in dev + prod | GPU-accelerated training, full CI/CD, 5-gate quality suite, ClearML tracking, S3 versioned storage. 13/14 gaps closed (~101d saved) | Rakesh
2026-03-24 | Serving migrated from DATA-Athena-Snowflake to athia-model-server sidecar in athena-platform | Clean separation — training repo (Python) vs serving repo (Go + Python sidecar) | Rakesh
2026-03-13 | Adopted TensorFlow ecosystem (TFX, TF Serving, TFDV, TFMA, TF Transform) for all ML work | Replaces Snowflake ML / XGBoost / scikit-learn. Unified training → validation → serving pipeline with production-grade tooling | Rakesh
2026-03-13 | Added Phase 0 — 7 service shells + 3 design tasks (Rakesh) before Volaris feature work | Service architecture: Data Pipelines, Feature Service, Training Pipelines, Model Mgmt, Eval Service, Evaluation Framework, Experiment System | Rakesh
2026-03-13 | Both repos switched to main branch — feat/ATH-0000 and Triton branches both merged | All ML training pipeline and Triton serving code now on main; no more feature branch tracking needed | Rakesh
2026-03-13 | Triton branch merged to main — confirms deployment architecture | feature/llm-driven-ml-training merged; Triton IS, ExperimentService, shadow mode now in production codebase | Deuna Engineering
2026-02-18 | Latency target updated: p95 <50ms → p95 <200ms | Revised from original SOW spec | Rakesh (w/ Pablo)
2026-02-19 | Phase 1 target merchant set to Volaris (not Cinépolis) | Volaris has known PSPs (Worldpay ID:76, MIT ID:85, Elavon, Amex); Cinépolis only shows Cybersource gateway — processor unknown | Mark Walick
2026-02-20 | Repo analysis scoped to branch feat/ATH-0000-athia-ml-llm-schema-discovery (not main) | This branch contains the active ML platform development; main does not reflect current capabilities | Pablo
2026-02-26 | athena-platform feature/llm-driven-ml-training (Triton IS) identified as the production model serving path | Triton IS + shadow mode + ExperimentService provides complete training→serving pipeline; replaces manual SageMaker endpoint registration; closes G-06 | Rakesh
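
The p95 <200ms latency decision can be verified against measured request latencies. This is a minimal sketch, assuming latencies are collected as a list of millisecond values and using the nearest-rank percentile definition. Both function names are hypothetical, and the project's actual monitoring may compute percentiles differently.

```python
from math import ceil

P95_TARGET_MS = 200.0  # latency target from the 2026-02-18 decision


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_ms: list[float],
                         target_ms: float = P95_TARGET_MS) -> bool:
    """True when the p95 latency is strictly below the target."""
    return percentile_nearest_rank(samples_ms, 95) < target_ms


if __name__ == "__main__":
    # Hypothetical measurements: most requests are fast, 10% are slow.
    latencies = [50.0] * 90 + [500.0] * 10
    print(percentile_nearest_rank(latencies, 95), meets_latency_target(latencies))
```

The hypothetical sample fails the target even though 90% of requests are fast, which illustrates why a p95 target is stricter than a check on the average.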
📎References & Documents
Key links and documents for this project
📄
Project Plan — v21 (Latest)
2026-03-24 · All productionization complete, ~101d saved, ~3.5d remaining
📊
Data Dictionary
Google Sheets — Deuna data field definitions
🗺️
Athia Data Model
LucidChart — system architecture diagram
🐍
DATA-Athena-Snowflake
LLM analytics platform (Python / LangGraph)
🐹
athena-platform
ML serving + A/B testing platform (Go / Gin)
🗄️
Snowflake Schema Reference
2026-02-18 · Extracted from PAYMENT_ML database
Internal Only
