Smartrouter AI/ML Integration
All productionization complete. 2 models (LogReg + DNN) for Volaris smart routing, continuous daily training from Snowflake, GPU-accelerated, full CI/CD, and served in production.
What This Is
All productionization is complete. Two ML models (Logistic Regression and a Deep Neural Network) serve Volaris smart routing in production, with continuous daily training on Snowflake data, GPU-accelerated training (NVIDIA L4), a full CI/CD pipeline, and automated model promotion behind quality gates.
The Problem
- Static routing rules → suboptimal acceptance rates
- No intelligent PSP failover during outages
- Retries not optimized by timing, route, or message
- No feedback loop — outcomes don't improve future decisions
The Solution (5 Use Cases)
- P-01 — PSP outage detection & failover
- P-02 — Optimize existing static routing rules
- P-03 — Per-transaction processor ranking
- P-04 — Authorization message manipulation
- P-05 — Retry optimization
Phase 1 Target — Volaris Merchant
✅ Production — What's Live
- 2 ML models — LogReg (4 per-processor) + DNN (multi-output, 4 heads) both in production
- Daily automated training — Lambda + EventBridge (2 AM UTC), GitHub Actions cron
- GPU-accelerated — g6.2xlarge (NVIDIA L4, 24GB VRAM), mixed precision, DNN ~13 min
- Deep ClearML integration — metrics, ROC/PR curves, confusion matrices, hyperparams
- 5-gate quality suite — min_data, AUC≥0.65, regression, stability, completeness
- Full CI/CD — CodePipeline + CodeDeploy + 127 training + 138 Go + 15 sidecar tests
- Model serving sidecar — FastAPI on port 8081, hot-reload, 4 processors, 140-feature encoder
- A/B testing — control groups, shadow mode, deterministic bucketing, multi-model-type experiments
- S3 versioned storage — sequential versions (v1, v2, ...), rollback manifests, sidecar bucket mirroring
- OTEL + Prometheus + Grafana — 15+ metric types, distributed tracing, CloudWatch dashboard
- Terraform infrastructure — Spacelift-managed, dev + prod, GPU instances, ClearML ECS/Fargate
- Snowflake ingestion — memory-efficient streaming, S3 parquet caching, 12-week lookback
🔨 Remaining (~3.5d — polish only)
- Lineage tracking queryable store — G-10 (~1.5d remaining)
- Lineage documentation — G-10 (~0.5d)
📊 Production Architecture
Snowflake → S3 parquet cache
↓
LogReg Pipeline (CPU) · DNN Pipeline (GPU g6.2xlarge)
↓
5-Gate Quality Suite
↓
S3 Versioned Models
↓ (sync to EFS)
Sidecar (FastAPI :8081)
↓
Go API (:8080) → Production
DATA-Athena-Snowflake Testing
127 tests · 16 test files · Maturity 4/5
Comprehensive coverage: pipeline orchestration, DNN training (masked BCE, GPU OOM), evaluator, quality gates, promoter, data ingestion, preprocessing, ClearML tracker, rollback, model config. Synthetic fixtures for reproducibility.
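The masked BCE mentioned in the DNN tests can be illustrated with a minimal sketch. It assumes (the source does not say so) that each transaction carries an observed outcome only for the processor it was routed to, so the other heads are masked out of the loss. The function name `masked_bce` is hypothetical, and plain Python stands in for the training framework.

```python
import math

def masked_bce(probs, labels, mask, eps=1e-7):
    """Mean binary cross-entropy over the heads where mask is truthy.

    probs, labels and mask are equal-length lists, one entry per processor head.
    Heads with mask == 0 (no observed outcome) contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for p, y, m in zip(probs, labels, mask):
        if not m:
            continue
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0

# Only head 0 has an observed outcome; the other three heads are ignored.
loss = masked_bce([0.8, 0.1, 0.5, 0.9], [1, 0, 0, 0], [1, 0, 0, 0])
```

With a single unmasked head the loss reduces to the ordinary cross-entropy for that head.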
athena-platform Testing
138 Go + 15 Python tests · Maturity 4/5
Domain services + repositories well covered. Model registry, experiment assignment, bucketing, shadow mode tested. Sidecar: smart router strategy, model types, encoders. CI enforced on every PR.
Current Blockers
| Item | Owner | Status |
|---|---|---|
| Deuna corp accounts — Rakesh & Naoki | TBD | Not needed |
| Code / repo access — Naoki | TBD | Done |
| Are ATHIA_* tables live in Deuna's Snowflake? | Israel | Confirmed ✓ |
| Are SageMaker endpoints live today? | Rakesh | Resolved ✓ |
| Payment volume through routing engine? | Israel | Open question |
| GPU instance for training | Deuna | Resolved ✓ — g6.2xlarge (NVIDIA L4) deployed |
| AWS resource access for Rakesh | Deuna | Resolved ✓ — full access granted |
Purpose
Assess the effort required to integrate Athia AI/ML into Deuna's payment routing. Produce a clear work breakdown and estimate before any implementation begins.
Phase 0 Deliverables
- Full schema & data understanding
- Effort estimate per workstream
- Risks and open questions resolved
- Recommended build order
Long-Term Success
- Measurable approval lift
- Stability during PSP outages
- Latency: p95 < 200ms
- Closed feedback/learning loop
✅ In Scope (Phase 0)
- Understand Deuna's data, schema, routing rules
- Assess Athia platform gaps vs. what's needed
- Size effort for P-01 through P-05 use cases
- Identify all dependencies, blockers, risks
🚫 Out of Scope (Phase 0)
- Any implementation or code delivery
- 3DS optimization (Phase 2)
- User-facing messaging (Phase 3)
- Installment optimization
Phase 0 — Assess Level of Effort · Done ✓
2 days · $6K budget · Completed 2026-02-19
Nail down all the work required. Produce a detailed estimate with confidence before committing to delivery.
Phase 1 — Model in Production · Pending
2 weeks · Core delivery
Model running in production for 2 processors with basic feature store. Target merchant: Volaris.
Phase 2 — Monitoring + Experimentation · Pending
Week 3 · Add monitoring and integrate with A/B experimentation infrastructure.
Phase 3 — Drift Detection, CI/CD, Ramp-Up · Pending
TBD · Drift detection, CI/CD pipeline, experiment ramp-up, additional model techniques.
Phase 1 Delivery Plan — Volaris Merchant
65 tasks · ~64 person-days · 5–6 weeks with 3 engineers · TensorFlow ecosystem
| Sub-Phase | Focus | Tasks | Effort | Key Notes |
|---|---|---|---|---|
| 0 — Service Architecture | Design + scaffold 7 TF service shells | 11 | 14.5d | NEW Rakesh: architecture, API contracts, TF integration plan. Team: 7 service shells |
| 1 — Discovery & EDA | Understand Volaris data | 10 | 7.5d | Approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, A/B sample size |
| 2 — Feature Engineering | Build ML features (Feature Service + TF Transform) | 9 | 8.5d | Card BIN/brand, RFM, retry context, rolling health scores, Amex hard-rule bypass |
| 3 — Model Development | Train models (tf.keras) | 8 | 9d | TF DNN, wide-and-deep, TF Decision Forests via Training Pipeline + Eval Service |
| 4 — Outage Detection | P-01: failover for 4 PSPs | 6 | 6.5d | Rolling health score (5–15 min window), auto-failover, recovery detection (1–2% sampling), alerts |
| 5 — Message Manipulation | P-04: CIT/MIT experiment | 5 | 5.5d | CIT/MIT audit, approval delta by toggle × processor × card type, new athena-platform endpoint, A/B test |
| 6 — Platform Integration | Register models, wire Deuna | 8 | 6.5d | Triton ✓ ExperimentService one-call API + built-in shadow mode. Requires Deuna eng coordination. |
| 7 — Monitoring & Feedback | Dashboards, retraining, review | 8 | 6d | 2 tasks already done: ATHIA tables confirmed live, ATHIA_STAGE_OUTCOMES deployed |
| Total | | 65 | ~64d | Phase 0 first (Rakesh design ∥ team scaffolding); Phases 1–2 sequential; 3–5 parallel after Phase 2 |
| Name | Role |
|---|---|
| Reks | CEO & Co-Founder |
| Chema | Co-Founder |
| Pablo | CTO — Executive Sponsor |
| Israel | Data POC — Snowflake & Data Access |
| Farhan | Claude / LLM Access POC |
| Mark Walick | Product Management Lead |
| Name | Role |
|---|---|
| Rakesh | CEO |
| Naoki | Solutions Architect |
| Rene | ML Engineer |
| Kedar | Backend / Data Engineer |
Phase 1 Target Merchant: Volaris · Decided 2026-02-19
Volaris selected over Cinépolis. Known PSPs: Worldpay (ID: 76), MIT (ID: 85), Elavon (cards), Amex (Amex cards) — 4 processors total with routing policies per currency. Cinépolis deferred: only shows Cybersource (a gateway), actual processor unknown.
Outage Detection & Failover
Detect PSP failures via persistent timeout codes. Auto fail-over and fail-back using random sampling of downed PSP to detect recovery.
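The fail-back idea above (random sampling of a downed PSP to detect recovery) could be sketched as follows. The class name `RecoveryProbe`, the 2% probe rate, and the three-wins-in-a-row restore rule are illustrative assumptions, not values confirmed by the source.

```python
import random

class RecoveryProbe:
    """Tracks a downed PSP and decides when to send probe traffic to it."""

    def __init__(self, sample_rate=0.02, wins_to_restore=3, rng=None):
        self.sample_rate = sample_rate          # fraction of traffic used as probes
        self.wins_to_restore = wins_to_restore  # hypothetical restore threshold
        self.rng = rng or random.Random()
        self.down = False
        self.consecutive_wins = 0

    def mark_down(self):
        self.down = True
        self.consecutive_wins = 0

    def should_route(self):
        """True if this transaction should be sent to the PSP."""
        if not self.down:
            return True
        return self.rng.random() < self.sample_rate

    def record_probe(self, success):
        """Update state from a probe outcome; restore after enough wins in a row."""
        if not self.down:
            return
        self.consecutive_wins = self.consecutive_wins + 1 if success else 0
        if self.consecutive_wins >= self.wins_to_restore:
            self.down = False
            self.consecutive_wins = 0
```

While the PSP is marked down, only the sampled fraction of traffic reaches it; a failed probe resets the win counter so a flapping PSP is not restored early.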
Routing Optimizer
Optimize Deuna's existing static routing rules based on historical outcomes. Build on existing rules engine rather than starting from scratch.
Per-Transaction Route Selection
Rank top 3 payment processors per transaction in real time based on prior outcomes, card signals, and merchant context.
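As a rough illustration of the top-3 ranking, the sketch below sorts processors by a predicted approval probability. The function name `rank_processors` and the scores are made up for the example; the processor names are the four listed for Volaris elsewhere in this document.

```python
def rank_processors(scores, top_k=3):
    """Return the top_k processor names by descending predicted approval probability.

    Ties are broken alphabetically so the ordering is stable across runs.
    """
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered[:top_k]]

# Hypothetical per-processor scores for one transaction.
scores = {"worldpay": 0.91, "elavon": 0.87, "mit_bulk": 0.93, "amex": 0.40}
ranking = rank_processors(scores)
```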
Message Manipulation
Toggle CIT/MIT, AVS, MCC variables in authorization request messages. Provide top 3 configuration recommendations per transaction.
Retry Optimization
Optimize when, how, and where to retry declined transactions. MIT/subs focused. Enterprise darktime reduction. Delayed retry based on processor reputation.
Connection
VLTAXPW-RMONTES
Database: PAYMENT_ML
Access: Read-only
ABTESTING Schema
Denormalized flat join of all views. Best starting point for EDA. No complex joins needed.
ALL_VIEWS_FLAT ALL_PAYMENT_EVENTS_FLAT
SOURCES Schema
15 clean views: orders, payments, attempts, events, user profiles, routing logs, merchant rules, airline data.
| View | Why It Matters | Use Cases |
|---|---|---|
| VW_ATHENA_PAYMENT_ATTEMPT | Full retry chain per payment; processor, error codes, hard/soft decline, DYNAMIC_ROUTING_DETAIL JSON | P-03 P-05 |
| VW_SMART_ROUTING_ATTEMPTS | Live routing engine log: algorithm type, latency, skip reasons — direct latency signal for p95 <200ms | P-01 P-02 |
| VW_ROUTING_MERCHANT_RULE | Existing static rules engine — foundation for routing optimizer. SHADOW_MODE column suggests testing infrastructure exists. | P-02 |
| ABTESTING.ALL_VIEWS_FLAT | Everything joined in one table — best for initial EDA | EDA |
| Feature Group | Key Columns | Use |
|---|---|---|
| Retry history | NUM_ATTEMPTS_ORDER, PREVIOUS_ORDER_ERROR_CODE, AVG_SEC_BETWEEN_PAYMENT_ATTEMPS | P-05 |
| Error signals | ERROR_CODE, ERROR_CATEGORY, HARD_SOFT | P-03, P-05 |
| Card signals | CARD_BIN, CARD_BRAND, BANK, CARD_COUNTRY | P-03 |
| User behavior | TARGET_USER_FRAUD_RATE_COHORT, TOTA_MINUTES_BROWSING, RFM values | P-03 |
| Message config | MCI_MSI_TYPE, ORDER_MCI_MSI_TYPE, PAYMENT_ATTEMPT_METHOD_TYPE | P-04 |
| Geo & Device | ORDER_COUNTRY_CODE, TARGET_USER_BROWSER, TARGET_USER_DEVICE | P-03 |
DATA-Athena-Snowflake
github.com/DUNA-E-Commmerce/DATA-Athena-Snowflake · Production — 2 training pipelines (LogReg + DNN), GPU-accelerated, daily automated training
✅ Production Status (Updated 2026-04-19)
2 training pipelines in production: LogReg (4 per-processor models) + DNN (multi-output neural net with 4 heads). Daily automated training via Lambda + EventBridge. GPU-accelerated on g6.2xlarge (NVIDIA L4). Deep ClearML integration with metrics, ROC/PR curves, confusion matrices. 5-gate quality suite blocks bad model promotion. 127 tests. S3 versioned model storage with rollback manifests.
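A promotion gate of this kind might look like the sketch below. Only the AUC ≥ 0.65 threshold and the five gate names come from this document; the other thresholds, the `Candidate` fields, and the interpretation of each gate are assumptions for illustration.

```python
from dataclasses import dataclass

EXPECTED_PROCESSORS = {"worldpay", "elavon", "mit_bulk", "amex"}

@dataclass
class Candidate:
    n_rows: int          # training rows available
    auc: float           # candidate model AUC
    previous_auc: float  # AUC of the currently promoted model
    fold_aucs: list      # AUC per validation fold
    processors: set      # processors the candidate covers

def run_quality_gates(c, min_rows=1000, min_auc=0.65,
                      max_auc_drop=0.02, max_fold_spread=0.05):
    """Return {gate name: passed}. Only min_auc is taken from the document."""
    return {
        "min_data": c.n_rows >= min_rows,
        "auc": c.auc >= min_auc,
        "regression": c.auc >= c.previous_auc - max_auc_drop,
        "stability": max(c.fold_aucs) - min(c.fold_aucs) <= max_fold_spread,
        "completeness": c.processors == EXPECTED_PROCESSORS,
    }

def can_promote(c):
    """A candidate is promoted only if every gate passes."""
    return all(run_quality_gates(c).values())
```

Returning the per-gate results rather than a single boolean makes it easy to log which gate blocked a promotion.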
LLM Workflows (11 stimuli)
| Workflow | Status |
|---|---|
| Acceptance rate analysis | Done (v0_1, v1_0) |
| Fraud card analysis | Done |
| Metrics anomaly detection | Done |
| Chatbot / data analyst | Done |
| Strategy generation director | Partial — Matcher has exit() |
| Cost optimization | Early stage |
| Retry optimization | Missing (P-05 gap) |
ML Training Platform (now on main)
| Service | Status |
|---|---|
| Training pipeline (Snowflake ML) | Done |
| LLM training orchestrator (GPT-4 + RAG) | Done |
| LLM experiment designer (GPT-4 + RAG) | Done |
| Data quality validator | Done |
| Model registry (auto table creation) | Done |
| Feature extractor | Done |
| Feedback collector (webhook/API/batch) | Done |
| Schema discovery (ChromaDB) | Done |
| Athia event ingestion | Done |
| Model deployer → athena-platform | Done — S3 promotion + sidecar mirror + hot-reload |
Architecture — Multi-Agent Pattern
FastAPI + LangGraph · Stimulus-response orchestration · LLM backends: Claude (primary), GPT-4 (fallback)
Request → StimulusRegistry → OrchestratorWorkflow → Branch (DAG of Nodes)
→ AgentWorkflow (LangGraph StateGraph) → Response
11 stimuli: acceptance_rate_analysis · fraud_card_analysis · metrics_anomaly
user_question · data_analyst · researcher_assistance · deep_exploration
element_edition · knowledge_expert · strategy_generation · cost_optimization
End-to-End Training Flow (Production)
Daily 2 AM UTC (Lambda + EventBridge)
→ Snowflake Data Ingestion (memory-efficient streaming, S3 parquet cache)
→ Preprocessing (z-score normalization + one-hot encoding, 140 features)
   ├── LogReg Pipeline (4 per-processor models, CPU, ~1.5 min)
   └── DNN Pipeline (multi-output 64→32→4 heads, GPU g6.2xlarge, ~13 min)
→ 5-Gate Quality Suite (min_data, AUC≥0.65, regression, stability, completeness)
→ ClearML Metrics Logging (ROC/PR curves, confusion matrices, hyperparams)
→ S3 Model Promotion (versioned: v1, v2, ... + rollback manifest)
→ Sidecar Bucket Mirror → EFS → Hot-Reload
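The preprocessing step (z-score normalization plus one-hot encoding) can be sketched in a few lines. The field names `amount` and `card_brand`, the category list, and the helper names are hypothetical; the production encoder produces 140 features, which this toy example does not attempt to reproduce.

```python
from statistics import mean, pstdev

def fit_zscore(values):
    """Return (mean, std) for a numeric column; std falls back to 1.0 if constant."""
    mu = mean(values)
    sigma = pstdev(values)
    return mu, sigma or 1.0

def zscore(value, mu, sigma):
    return (value - mu) / sigma

def one_hot(value, categories):
    """Unknown values encode as all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode(row, amount_stats, brands):
    """Encode one row as [z-scored amount] + one-hot card brand."""
    mu, sigma = amount_stats
    return [zscore(row["amount"], mu, sigma)] + one_hot(row["card_brand"], brands)

brands = ["amex", "mastercard", "visa"]
stats = fit_zscore([10.0, 20.0, 30.0])
vector = encode({"amount": 20.0, "card_brand": "visa"}, stats, brands)
```

Fitting the statistics once and reusing them at serving time keeps training and inference encodings aligned, which is presumably what a shared preprocessing config is for.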
Training Pipeline Architecture
Pipeline flow: Training Decision → Data Prep → Feature Extraction → Validation → Experiment Design → Training → Model Selection → Deployment
Services
| Service | Purpose |
|---|---|
| TrainingPipeline | Full training execution (plan/run/deploy) |
| LLMTrainingOrchestrator | LLM + RAG decision engine (RETRAIN_NOW / SCHEDULED / SKIP) |
| LLMExperimentDesigner | Designs 5-10 experiments using GPT-4/Claude with RAG |
| ModelDeployer | Exports to EFS, registers deployment, creates canary config |
| TrainingPlanner | Dry-run mode ("terraform plan" for ML) |
| FeatureExtractor | Auto-extracts features, creates training dataset views |
| DataQualityValidator | Schema, statistical, temporal bias, drift validation |
| FeedbackCollector | Webhook, API polling, batch feedback collection |
| ModelRegistry | Model CRUD, prediction/feedback schema management |
| SchemaDiscovery | Auto-discovers training tables via LLM |
| LLMProvider | Unified Claude/GPT-4 interface with auto-fallback |
API Endpoints
| Endpoint | Purpose |
|---|---|
| POST /api/v1/training/plan/{model_type} | Dry-run plan |
| POST /api/v1/training/run/{model_type} | Execute training |
| POST /api/v1/training/decision/{model_type} | LLM decision |
| POST /api/v1/experiments/design/{model_type} | Design experiments |
Resolved Since Last Update
- Deployment fully automated — S3 promotion + sidecar bucket mirror + hot-reload (was: manual)
- Rollback capability complete — versioned S3 + rollback manifests + sidecar hot-reload (was: partial)
- GPU training deployed — g6.2xlarge with mixed precision (was: CPU-only, 3hr runs)
- CI/CD complete — GitHub Actions + CodePipeline + CodeDeploy (was: partial)
- DNN pipeline added — multi-output neural net with 2-18% AUC improvement over LogReg
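Sequential versioning with a rollback manifest, as described in the list above, could be sketched like this. The manifest fields and both function names are hypothetical; the real pipeline writes to S3, which is left out so the example stays self-contained.

```python
import re

def next_version(existing):
    """Given labels such as ['v1', 'v2'], return the next sequential label."""
    numbers = [int(m.group(1)) for v in existing
               if (m := re.fullmatch(r"v(\d+)", v))]
    return f"v{max(numbers, default=0) + 1}"

def build_rollback_manifest(model_type, new_version, previous_version):
    """Record which version to restore if new_version has to be rolled back."""
    return {
        "model_type": model_type,        # e.g. "logreg" or "dnn"
        "version": new_version,
        "rollback_to": previous_version,  # None for the very first version
    }

versions = ["v1", "v2", "v10"]
new = next_version(versions)
manifest = build_rollback_manifest("dnn", new, "v10")
```

Parsing the numeric suffix avoids the lexicographic trap where "v10" would sort before "v2".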
Remaining (minor)
- Formal lineage queryable store — G-10 (~1.5d)
🧪 Testing Coverage
| Layer | Coverage | Notes |
|---|---|---|
| Pipeline orchestration | Comprehensive | test_pipeline.py, test_multi_output_pipeline.py — stage execution, metric logging |
| DNN training | Comprehensive | test_multi_output_trainer.py — masked BCE, batched validation, GPU OOM |
| Quality gates | Comprehensive | test_quality_gates.py — all 5 gates tested |
| Model promotion | Comprehensive | test_promoter.py — S3 upload, versioning, rollback manifests |
| Data ingestion | Comprehensive | test_data_ingestion.py — Snowflake streaming, S3 caching |
| Preprocessing | Comprehensive | test_preprocessing.py — z-score, OHE, config management |
| ClearML tracker | Comprehensive | test_tracker.py — task creation, offline mode, metrics |
| Lambda handlers | Comprehensive | test_multi_output_handler.py — instance lifecycle, SSM commands |
Maturity: 4/5 — Strong. Comprehensive training pipeline coverage with synthetic fixtures for reproducibility. Unit + integration tests with markers (slow, integration, unit). Coverage tracking via pytest-cov.
athena-platform
github.com/DUNA-E-Commmerce/athena-platform
✅ Production Status (Updated 2026-04-19)
Full ML serving platform in production. Serves both LogReg and DNN models via FastAPI sidecar (port 8081) with hot-reload from EFS. 4 processors (worldpay, elavon, mit_bulk, amex), 140-feature encoder, strategy pattern (XGBoost LTR / TF per-processor). A/B testing with control groups, shadow mode, deterministic bucketing. Multi-model-type experiments (LogReg vs DNN). 138 Go + 15 Python tests. OTEL + Prometheus + Grafana monitoring.
✅ Key Capabilities — All Production-Ready
- Model Serving: FastAPI sidecar, EFS models, hot-reload (POST /models/reload).
- A/B Testing: Deterministic bucketing, control groups, shadow mode, auto-winner.
- Monitoring: 15+ OTEL metrics, Prometheus, Grafana, CloudWatch.
- Model Registry: ExperimentService one-call API, model type in version (logreg/dnn).
- Event Logging: Async Snowflake ingestion.
- Auto-Winner: Statistical significance, auto-promotion.
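Deterministic bucketing can be illustrated with a short sketch. This document names SHA256(transaction_id) as the bucketing key; the 100-bucket split, the function names, and the variant labels are assumptions.

```python
import hashlib

def bucket(transaction_id, n_buckets=100):
    """Map a transaction ID to a stable bucket in [0, n_buckets)."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def assign_variant(transaction_id, treatment_pct=50):
    """Same ID always lands in the same variant, with no stored state."""
    return "treatment" if bucket(transaction_id) < treatment_pct else "control"

first = assign_variant("txn-0001")
second = assign_variant("txn-0001")  # identical to `first` by construction
```

Because assignment is a pure function of the ID, retries of the same transaction see the same variant and the split can be reproduced offline during analysis.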
ML Inference Types (already in registry)
| Type | Maps To |
|---|---|
| processor_selector | P-03 |
| retry_predictor | P-05 |
| retry_sequence | P-05 |
| installment_optimizer | Out of scope |
Snowflake Tables
| Table | Status |
|---|---|
| ATHIA_PREDICTIONS | Active |
| ATHIA_FEEDBACK | Active |
| ATHIA_TRAINING_DATASET | Active |
| ATHIA_EXPERIMENT_LIFT | Active |
| ATHIA_STAGE_OUTCOMES | Deployed (feat/ATH-0000) |
| ATHIA_SESSION_SUMMARY | Deployed (feat/ATH-0000) |
| ATHIA_MULTI_STAGE_ANALYSIS | New (feat/ATH-0000) |
| ATHIA_MODEL_METRICS | New (feat/ATH-0000) |
| ML_MODEL_REGISTRY | New (feat/ATH-0000) |
Architecture — Clean Architecture (Go/Gin)
PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS · mTLS enforced on /api/v1/ml/predict/*
REST Handlers (V1 + V2, Gin) ← mTLS on /ml/predict/*
↓
Controllers (~30 implementations)
↓
Domain Services (44 packages) ← constructor injection throughout
↓
Repositories (43 GORM implementations) ← in-memory SQLite for tests
↓
PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS (model storage)
A/B Experimentation — Auto-Winner Guardrails
Stats
- p-value < 0.05
- Min 1000 samples/variant
- Min 7 days runtime
Lift
- Min 1% absolute lift
- Deterministic bucketing
- SHA256(transaction_id)
Guardrails
- ≤10% latency regression
- ≥−5% revenue regression
- Dry-run mode (safe default)
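The thresholds above can be combined into a single winner check, sketched below. The source does not say which significance test is used, so the pooled two-proportion z-test here is an assumption, as are the function names and the input dictionary fields.

```python
import math

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value for a difference in approval rates (pooled z-test)."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (success_b / n_b - success_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

def is_winner(control, treatment, days_running):
    """control/treatment: dicts with approved, n, p95_latency_ms, revenue."""
    lift = treatment["approved"] / treatment["n"] - control["approved"] / control["n"]
    p = two_proportion_p_value(control["approved"], control["n"],
                               treatment["approved"], treatment["n"])
    checks = [
        p < 0.05,                                    # statistical significance
        min(control["n"], treatment["n"]) >= 1000,   # min samples per variant
        days_running >= 7,                           # min runtime
        lift >= 0.01,                                # min 1% absolute lift
        treatment["p95_latency_ms"] <= 1.10 * control["p95_latency_ms"],
        treatment["revenue"] >= 0.95 * control["revenue"],
    ]
    return all(checks)
```

In dry-run mode a result like this would presumably be logged rather than acted on, leaving promotion as a manual step.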
🧪 Testing Coverage
| Layer | Coverage | Notes |
|---|---|---|
| Domain services (44) | 44/44 | All have test files |
| Repositories (43) | ~43/43 | In-memory SQLite isolation |
| V1 REST handlers (18) | 15/18 (83%) | agent, workspaces, elements missing |
| Auth middleware | Tested | JWT + API key covered |
| V2 REST handlers (18) | 0/18 (0%) | Entire new API version untested |
| Bedrock client | 0% | Excluded from coverage config |
| auth, bedrock, element, workspace services | 0% | 4 domain services with no tests |
| Bootstrap / DI graph | Skipped | TODO: testcontainers |
Maturity: ~4/5 — Strong. Domain and repository layers well covered. Triton merge adds 32 new tests. V2 API (18 handlers) and Bedrock ML inference path are still untested. CI threshold is only 20% and internal/clients/ is excluded from coverage entirely.
Testing Comparison — Both Repos
| Metric | DATA-Athena-Snowflake | athena-platform |
|---|---|---|
| Test files | 16 training test files | 138 Go + 15 Python |
| Test count | 127 tests | 138+ Go test files |
| Core product tested? | Yes — all pipeline stages, DNN training, quality gates | Yes — domain, repositories, model registry, experiments |
| CI enforced? | Yes — daily cron + PR checks | Yes — every PR |
| Fixtures | Synthetic data fixtures for reproducibility | In-memory SQLite isolation |
| Maturity | 4/5 — Strong | 4/5 — Strong |
Status
Production
All systems live and running daily
Architecture
2 models (LogReg + DNN) · GPU training · Daily automated retraining · Full CI/CD
Total Effort
~3.5d remaining
of ~104.5d original · ~101d saved (97%)
Progress Summary — 2026-04-19
| Stage | Gap | Category | Priority | Original | Status |
|---|---|---|---|---|---|
| 1 – Foundation | G-02 Orchestration | Infrastructure | High | 8d | Done ✓ — Lambda/SSM/EC2 + SageMaker + daily schedules + GPU lifecycle |
| 1 – Foundation | G-08 Feature Store | ML Infra | High | 13.5d | Done ✓ — 140-feature encoder + preprocess_config + sidecar serving |
| 1 – Foundation | G-04 Data Validation | Data Quality | High | 7d | Done ✓ — 5-gate quality suite |
| 2 – Automation | G-03 CI/CD Pipeline | DevOps | High | 9d | Done ✓ — GitHub Actions + CodePipeline + CodeDeploy. Dev + Prod |
| 2 – Automation | G-06 Deployment Automation | Automation | High | 7.5d | Done ✓ — Full automated deployment with quality gates |
| 2 – Automation | G-07 Model Registration | Automation | Medium | 5.5d | Done ✓ — ExperimentService + S3 versioning |
| 2 – Automation | G-01 Automated Retraining | Automation | High | 10d | Done ✓ — Daily automated training (LogReg + DNN) |
| 3 – Governance | G-13 Versioning Workflow | Governance | High | 5d | Done ✓ — Sequential S3 versioning + model type in version |
| 3 – Governance | G-10 Lineage Tracking | Governance | Medium | 6.5d | Nearly Done (~1.5d left) — ClearML task hierarchy + S3 manifests. Queryable store TODO |
| 3 – Governance | G-14 Rollback Capability | Reliability | High | 5d | Done ✓ — S3 versioned + rollback manifests + sidecar hot-reload |
| 4 – Observability | G-05 Model Monitoring | Observability | High | 8d | Done ✓ — OTEL + Prometheus + Grafana + ClearML |
| 4 – Observability | G-09 Drift Detection | Observability | Medium | 7d | Done ✓ — Feature + prediction + concept drift |
| 5 – ML Quality | G-11 Hyperparameter Tuning | ML Quality | Medium | 5.5d | Done ✓ — Configurable via env vars, per-env tuning |
| 5 – ML Quality | G-12 Algorithm Comparison | ML Quality | Medium | 7d | Done ✓ — LogReg vs DNN via A/B testing |
Engagement Summary
Full implementation of Athia AI/ML smartrouting for Deuna — Volaris merchant. 7 delivery phases covering P-01 (outage failover), P-03 (per-transaction routing), P-04 (message manipulation), and P-05 (retry optimization). ~61.5 person-days of prior Deuna codebase work reduces original scope significantly.
Aidaptive Team — Roles & Responsibilities
| Name | Role | Responsibilities | Days |
|---|---|---|---|
| Rakesh | Project Lead & Strategy | Client coordination (Pablo, Israel), architecture decisions, Phase 6 oversight, post-launch review | 5.5d |
| Naoki | Solutions Architect | athena-platform Go dev, outage detection, message manipulation API, model serving (Triton), CI/CD, Phase 6 integration | 14.6d |
| Rene | ML Engineer | Feature engineering, model training (processor_selector, retry_predictor, retry_sequence), data quality, drift detection, retraining pipeline | 15d |
| Kedar | Data & Backend Engineer | Snowflake EDA, data pipelines, training datasets, feature feeds, Grafana dashboards, monitoring | 16d |
| Total | | | ~51.1d |
Effort by Phase
| Phase | Focus | Owners | Days | Milestone |
|---|---|---|---|---|
| 1 — Discovery & EDA | Understand Volaris data | Kedar (4.5d) · Rene (2d) · Rakesh (0.5d) · Naoki (0.5d) | 7.5d | Kick-off (20%) |
| 2 — Feature Engineering | Build ML feature set | Rene (4d) · Kedar (3d) · Naoki (1d) · Rakesh (0.5d) | 8.5d | Phase 2 complete (20%) |
| 3 — Model Development | Train P-03 + P-05 models | Rene (6d) · Kedar (2d) · Naoki (1d) | 9d | Phase 3 complete (20%) |
| 4 — Outage Detection | P-01: failover for 4 PSPs | Naoki (4.5d) · Rakesh (1d) · Kedar (1d) | 6.5d | |
| 5 — Message Manipulation | P-04: CIT/MIT experiment | Naoki (2d) · Rene (1.5d) · Kedar (1.5d) · Rakesh (0.5d) | 5.5d | |
| 6 — Platform Integration | Register models, wire Deuna +G-06 close | Naoki (5.1d) · Rakesh (2d) · Kedar (1d) | 8.1d | Phase 6 complete (30%) |
| 7 — Monitoring & Feedback | Dashboards, retraining, review | Kedar (3d) · Rene (1.5d) · Rakesh (1d) · Naoki (0.5d) | 6d | Phase 7 complete (10%) |
| Total | | | ~51d | |
Delivery Timeline (6-week plan)
| Week | Days | Phases Active | Who | Key Milestone |
|---|---|---|---|---|
| Week 1 | 1–5 | Phase 1 (EDA) · Phase 2 start Day 3 | Kedar · Rene · Rakesh (Day 1) | EDA complete; feature schema draft |
| Week 2 | 6–10 | Phase 2 (Features) · Phase 3 start Day 8 | Rene · Kedar · Naoki | Feature set locked; training dataset built |
| Week 3 | 11–15 | Phase 3 (Models) · Phase 4 (Outage) parallel | Rene (models) · Naoki (outage) | Models packaged; outage detection built |
| Week 4 | 16–20 | Phase 4 tail · Phase 5 (CIT/MIT) · Phase 6 prep | Naoki · Rene · Kedar · Rakesh | API contract with Deuna eng signed |
| Week 5 | 21–25 | Phase 6 (Integration) | Naoki · Rakesh | ⚠ Triton branch must be merged by Day 18 · Integration live in shadow mode |
| Week 6 | 26–30 | Phase 7 (Monitoring & Review) | Kedar · Rene · Rakesh | Dashboards live · retraining scheduled · post-launch report |
Critical path: Phases 1–2 sequential. Phases 3–5 can run in parallel. Phase 6 requires (a) models complete, (b) Triton branch merged, (c) 1-week Deuna engineering lead time for API contract. Phase 7 requires Phase 6 live.
Assumptions
- Snowflake access (PAYMENT_ML) remains available read-only
- Deuna eng available for API contract in Week 4 (Pablo / Israel)
- Triton branch merged to main by end of Week 3
- Staging environment available for Phase 6 integration tests
- ATHIA_PREDICTIONS + ATHIA_FEEDBACK remain live throughout
Success Criteria
- processor_selector live for ≥1 Volaris PSP
- ≥1% absolute approval rate lift (A/B test at significance)
- ≥5% retry success rate improvement vs. baseline
- PSP failover within 1 routing cycle of threshold breach
- p95 latency <200ms end-to-end (model inference <50ms)
- 48h shadow run complete with documented comparison
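The latency criterion can be checked against recorded samples with a small helper. This sketch uses the nearest-rank percentile definition and assumes the 50 ms inference budget is also measured at p95, which the source does not state; the function names are hypothetical.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def meets_latency_budget(end_to_end_ms, inference_ms):
    """True if p95 end-to-end is under 200 ms and p95 inference is under 50 ms."""
    return p95(end_to_end_ms) < 200 and p95(inference_ms) < 50

# One slow outlier in 100 requests does not move the p95.
ok = meets_latency_budget([100] * 99 + [500], [10] * 100)
```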
✅ Production Status (2026-04-19): All Systems Live
2 ML models in production: LogReg (4 per-processor models) + DNN (multi-output neural net, 4 heads). Daily automated training from Snowflake via Lambda + EventBridge. GPU-accelerated (g6.2xlarge NVIDIA L4). 5-gate quality suite. Full CI/CD. ClearML experiment tracking. S3 versioned model storage with rollback.
Data Pipelines · Feature Service · Training Pipelines · Model Management · Eval Service · Evaluation Framework · Experiment System
V-D01: Service architecture · V-D02: API contracts · V-D03: TF ecosystem integration plan
65 tasks (was 54) · ~64d total (was ~49.5d) · Phase 0 adds ~14.5d · 3 engineers ~5–6 weeks
Service Architecture & Shell Setup
Design 7 service boundaries (Rakesh), define API contracts, scaffold all service shells using TensorFlow ecosystem: Data Pipelines (TFX), Feature Service (TF Transform), Training Pipelines (TFX Trainer), Model Management, Eval Service (TFMA), Evaluation Framework, Experiment System.
📐 V-D01–D03 (Rakesh design) + V-S01–S08 (team scaffolding)
Discovery & EDA
Understand Volaris transaction data — approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, sample size for A/B test.
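The A/B sample-size question reduces to simple arithmetic, sketched below. The 1000 samples per variant and the 7-day window are the auto-winner minimums stated elsewhere in this document; the even traffic split and the function names are assumptions.

```python
import math

def days_to_reach_sample(daily_volume, n_variants=2, min_per_variant=1000):
    """Days until every variant has min_per_variant transactions, assuming an even split."""
    per_variant_per_day = daily_volume / n_variants
    if per_variant_per_day <= 0:
        return None  # no traffic: the target is never reached
    return math.ceil(min_per_variant / per_variant_per_day)

def enough_volume(daily_volume, max_days=7, n_variants=2, min_per_variant=1000):
    """True if the sample target is reachable within max_days."""
    days = days_to_reach_sample(daily_volume, n_variants, min_per_variant)
    return days is not None and days <= max_days

# Hypothetical processor with 400 transactions per day, split across 2 variants.
days_needed = days_to_reach_sample(400)
```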
Feature Engineering
Card BIN/brand, transaction context, user RFM, retry history, rolling processor health scores, Amex hard-rule bypass, training dataset build.
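The Amex hard-rule bypass could sit in front of the model as a plain conditional, as in this sketch. The brand label "amex" and the function name `route` are assumptions; for non-Amex cards the example simply takes the first entry of whatever ranking the model produced.

```python
def route(card_brand, model_ranking):
    """Amex cards always go to the Amex processor; other cards use the model ranking."""
    if card_brand.strip().lower() == "amex":
        return "amex"  # hard rule: the ML ranking is not consulted
    return model_ranking[0]

amex_choice = route("Amex", ["worldpay", "elavon", "mit_bulk"])
visa_choice = route("visa", ["worldpay", "elavon", "mit_bulk"])
```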
Model Development
Train processor_selector, retry_predictor, retry_sequence for Volaris 4 PSPs using tf.keras. DNN vs. wide-and-deep vs. TF Decision Forests comparison. TFMA per-slice evaluation.
🔧 TensorFlow ecosystem: tf.keras training via Training Pipeline service, TFMA evaluation via Eval Service, SavedModel export via Model Management.
Outage Detection
Rolling health score per PSP, failover to next-best Volaris processor, recovery detection via 1–2% sampling, alerts on state changes.
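A rolling health score per PSP might be kept with a time-bounded queue, as sketched here. The 10-minute window falls inside the 5–15 minute range given in the delivery plan; the class name, the 0.5 degradation threshold, and the choice to treat an empty window as healthy are assumptions. Timestamps are expected in non-decreasing order.

```python
from collections import deque

class RollingHealth:
    """Success rate of a PSP over the last window_seconds."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, success) in arrival order

    def record(self, timestamp, success):
        self.events.append((timestamp, bool(success)))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def score(self, now):
        self._evict(now)
        if not self.events:
            return 1.0  # no recent data: treated as healthy (assumption)
        return sum(ok for _, ok in self.events) / len(self.events)

def is_degraded(health, now, threshold=0.5):
    """Hypothetical failover trigger: score below threshold."""
    return health.score(now) < threshold
```

One instance per PSP would be kept, and a degraded PSP would be skipped in favor of the next-best processor.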
Message Manipulation
CIT/MIT audit for Volaris, approval delta by toggle × processor × card type, experiment design, new athena-platform endpoint, A/B test.
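The approval delta by toggle × processor × card type can be computed with a single pass over labeled attempts. The row fields (`processor`, `card_type`, `toggle`, `approved`) and the function name are hypothetical, and the sample rows are invented for the example.

```python
from collections import defaultdict

def approval_delta(rows):
    """Return {(processor, card_type): CIT approval rate minus MIT approval rate}.

    Only cells observed under both toggles are included.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [approved, total]
    for r in rows:
        key = (r["processor"], r["card_type"], r["toggle"])
        counts[key][0] += int(r["approved"])
        counts[key][1] += 1
    deltas = {}
    for (proc, card, toggle), (ok, total) in counts.items():
        if toggle != "CIT":
            continue
        mit = counts.get((proc, card, "MIT"))  # .get does not create missing keys
        if mit:
            deltas[(proc, card)] = ok / total - mit[0] / mit[1]
    return deltas

def _row(proc, card, toggle, approved):
    return {"processor": proc, "card_type": card, "toggle": toggle, "approved": approved}

rows = (
    [_row("worldpay", "credit", "CIT", a) for a in (True, True, True, False)]
    + [_row("worldpay", "credit", "MIT", a) for a in (True, False)]
    + [_row("elavon", "debit", "CIT", True)]  # no MIT rows, so no delta
)
deltas = approval_delta(rows)
```

A raw delta like this is only a starting point; the plan pairs it with a statistical test before any toggle is recommended.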
Platform Integration
Register models in athena-platform, create Volaris-scoped experiment, API contract with Deuna eng, shadow mode validation before live traffic.
✅ Triton branch: ExperimentService one-call API (V-39–41) + built-in shadow mode (V-46) reduce effort by ~1d.
Monitoring & Feedback Loop
- Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake: Done ✓ (confirmed data live in Deuna's Snowflake, 2026-02-24)
- Deploy ATHIA_STAGE_OUTCOMES table: Done ✓ (deployed in feat/ATH-0000)
- Approval rate + model performance Grafana dashboards
- Retraining trigger, scheduled pipeline, auto-winner, post-launch review: Ready ✓ (LLM orchestrator + training pipeline built)
All 65 Tasks
| # | Task | Phase | Owner | Effort | Status |
|---|---|---|---|---|---|
| V-D01 | Design overall service architecture — 7 service boundaries, data flow, inter-service communication | 0 – Architecture | Rakesh | 2d | Design |
| V-D02 | Define API contracts for all 7 services — OpenAPI specs, error handling, versioning | 0 – Architecture | Rakesh | 1.5d | Design |
| V-D03 | Design TensorFlow ecosystem integration — map TFX components to services, TF Serving format, TFDV/TFMA | 0 – Architecture | Rakesh | 1d | Design |
| V-S01 | Scaffold Data Pipeline service — TFX ExampleGen + StatisticsGen, Snowflake adapter, TFDV schema | 0 – Shell | Kedar | 1.5d | |
| V-S02 | Scaffold Feature Service — TF Transform preprocessing_fn, feature store API, real-time endpoint | 0 – Shell | Rene | 1.5d | |
| V-S03 | Scaffold Training Pipeline service — TFX Trainer + tf.keras, Keras Tuner, training history | 0 – Shell | Rene | 1.5d | |
| V-S04 | Scaffold Model Management service — registry CRUD, SavedModel storage, lifecycle, version comparison | 0 – Shell | Naoki | 1d | |
| V-S05 | Scaffold Eval Service — TFMA integration, per-slice metrics, model blessing/rejection API | 0 – Shell | Rene | 1d | |
| V-S06 | Scaffold Evaluation Framework — A/B stat engine, winner detection, latency/revenue guardrails | 0 – Shell | Naoki | 1.5d | |
| V-S07 | Scaffold Experiment System — experiment CRUD, traffic splitting, variants, shadow mode orchestration | 0 – Shell | Naoki | 1.5d | |
| V-S08 | Set up shared TF dependencies — tensorflow, tfx, tf-transform, tfma, tfdv, keras-tuner + Docker base | 0 – Shell | Kedar | 0.5d | |
| V-01 | Filter Volaris transactions — date range, volume, monthly trend | 1 – EDA | Kedar | 0.5d | |
| V-02 | Per-processor approval rates (Worldpay, MIT, Elavon, Amex) by card type, currency, amount | 1 – EDA | Kedar | 1d | |
| V-03 | Retry pattern analysis — attempts per order, processor retry-to, 1st/2nd/3rd attempt success rates | 1 – EDA | Kedar | 1d | |
| V-04 | Explore DYNAMIC_ROUTING_DETAIL JSON — extract all keys and values | 1 – EDA | Kedar | 1d | |
| V-05 | Map Volaris routing rules from VW_ROUTING_MERCHANT_RULE* views | 1 – EDA | Kedar | 0.5d | |
| V-06 | Analyze smart routing log — algorithm types, skip rates, p95 latency baseline | 1 – EDA | Kedar | 0.5d | |
| V-07 | Hard vs. soft decline distribution by processor and error code | 1 – EDA | Rene | 1d | |
| V-08 | Profile airline-specific features — flight, passenger, booking window signal | 1 – EDA | Rene | 0.5d | |
| V-09 | A/B test sample size check — daily volume per processor ≥ 1000/variant in 7 days? | 1 – EDA | Rene | 0.5d | |
| V-10 | EDA summary report — approval rates, error taxonomy, processor share, correlations | 1 – EDA | Rene + Rakesh | 1d | |
| V-11 | Define Volaris feature schema — all features, types, sources, compute latency | 2 – Features | Rene + Naoki | 1d | |
| V-12 | Card-level features — BIN, brand, bank, type, country; historical approval rate per BIN × processor | 2 – Features | Rene | 1d | |
| V-13 | Transaction-level features — amount, currency, CIT/MIT, MCC, flight order type | 2 – Features | Rene | 1d | |
| V-14 | User-level features — RFM, fraud rate cohort, tenure, browsing signals | 2 – Features | Rene | 0.5d | |
| V-15 | Retry-context features — previous processor, error code, time since attempt, attempt number | 2 – Features | Kedar | 1d | |
| V-16 | Processor-state features — rolling approval/timeout/decline rate at 15-min, 1h, 24h windows | 2 – Features | Kedar | 1.5d | |
| V-17 | Amex hard-rule — always route Amex cards to Amex processor; bypass ML | 2 – Features | Naoki | 0.5d | |
| V-18 | Build training dataset — join features onto labeled outcomes; train/val/test split | 2 – Features | Kedar | 1d | Ready ✓ ATHIA_TRAINING_DATASET view + feature_extractor.py |
| V-19 | Feature quality validation — nulls, skew, leakage risk, outcome correlation | 2 – Features | Rene | 1d | Ready ✓ data_quality_validator.py (834 lines) |
| V-20 | Train processor_selector v1 — rank 4 PSPs by approval probability (tf.keras DNN) | 3 – Models | Rene | 2d | TF Training Pipeline service |
| V-21 | Evaluate processor_selector — AUC, lift vs. static rules, per-processor accuracy, latency | 3 – Models | Rene | 1d | Ready ✓ Metrics auto-calculated by pipeline |
| V-22 | Train retry_predictor v1 — predict retry approval probability | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_predictor type |
| V-23 | Train retry_sequence v1 — optimal processor order for retry | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_sequence type |
| V-24 | Evaluate retry models — success rate lift, processor fatigue patterns | 3 – Models | Rene | 1d | Ready ✓ Evaluation framework in pipeline |
| V-25 | Architecture comparison — DNN vs. wide-and-deep vs. TF Decision Forests; select champion | 3 – Models | Rene | 1d | TF Replaces XGBoost vs. LR comparison |
| V-26 | Inference latency test — all models under 50ms budget | 3 – Models | Naoki | 0.5d | |
| V-27 | Package models — serialize, write model card (schema, features, metrics) | 3 – Models | Kedar | 0.5d | Ready ✓ model_registry.py auto-creates tables + stores metadata |
| V-28 | Define outage signal — timeout/error code thresholds for PSP-down detection | 4 – P-01 | Rakesh + Naoki | 1d | |
| V-29 | Rolling processor health score — sliding 5–15 min window per PSP | 4 – P-01 | Naoki | 1.5d | |
| V-30 | Failover logic — skip degraded PSP, route to next-best Volaris processor | 4 – P-01 | Naoki | 1.5d | |
| V-31 | Recovery detection — 1–2% sampling of down PSP; auto-restore on consecutive wins | 4 – P-01 | Naoki | 1d | |
| V-32 | Outage simulation tests — inject failures per PSP; verify failover + recovery | 4 – P-01 | Naoki | 1d | |
| V-33 | Outage alerting — Slack/PagerDuty on PSP state changes | 4 – P-01 | Kedar | 0.5d | |
| V-34 | Audit CIT/MIT usage for Volaris — current distribution across PSPs | 5 – P-04 | Kedar | 0.5d | |
| V-35 | Approval delta by CIT vs MIT per processor — statistical test | 5 – P-04 | Rene | 1d | |
| V-36 | Design message manipulation experiment — CIT/MIT × processor × card type matrix | 5 – P-04 | Rene + Rakesh | 1d | |
| V-37 | Implement message recommendation API in athena-platform | 5 – P-04 | Naoki | 2d | |
| V-38 | Run A/B test — approval rate with vs. without message recommendations | 5 – P-04 | Kedar | 1d | |
| V-39 | Register processor_selector in MODEL_ARTIFACTS (version, Triton backend ref, feature schema) | 6 – Integration | Naoki | 0.3d | Ready ✓ POST /api/v1/ml/models (Triton branch ExperimentService) |
| V-40 | Register retry_predictor + retry_sequence in MODEL_ARTIFACTS | 6 – Integration | Naoki | 0.3d | Ready ✓ Same — ExperimentService handles all 3 model types |
| V-41 | Create Volaris-scoped experiment — merchant filter, 10% treatment split, shadow mode, guardrails | 6 – Integration | Naoki | 0.5d | Ready ✓ POST /api/v1/ml/experiments — variants + models in one call (Triton branch) |
| V-42 | Validate experiment assignment — SHA256 bucketing determinism for Volaris | 6 – Integration | Naoki | 0.5d | |
| V-43 | API contract with Deuna engineering — define POST /api/v1/ml/predict request/response for Volaris | 6 – Integration | Rakesh | 1d | |
| V-44 | Deuna payment service integration — Deuna calls athena-platform at routing decision point | 6 – Integration | Rakesh + Naoki | 2d | |
| V-45 | End-to-end integration test — full flow: Deuna → athena-platform → model → ranked PSPs | 6 – Integration | Naoki + Kedar | 1d | |
| V-46 | Shadow mode — 48h logging without acting; compare predicted vs. actual outcomes | 6 – Integration | Kedar | 0.5d | Ready ✓ is_shadow_mode=true built-in (Triton branch); set up + monitor only |
| V-47 | Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake | 7 – Monitoring | Kedar | 0.5d | Done ✓ Confirmed data live in Deuna's Snowflake (2026-02-24) |
| V-48 | Deploy ATHIA_STAGE_OUTCOMES table in Snowflake | 7 – Monitoring | Kedar | 0.5d | Done ✓ Deployed in feat/ATH-0000 SQL |
| V-49 | Volaris approval rate dashboard — daily/hourly per PSP vs. baseline | 7 – Monitoring | Kedar | 1d | |
| V-50 | Model performance dashboard — prediction confidence, rank accuracy, retry lift | 7 – Monitoring | Kedar | 1d | |
| V-51 | Define retraining trigger — approval rate drop or AUC drop thresholds | 7 – Monitoring | Rene | 0.5d | Ready ✓ llm_training_orchestrator.py makes RETRAIN_NOW / SCHEDULED / SKIP decisions |
| V-52 | Schedule weekly retraining — auto-register new version from latest ATHIA_TRAINING_DATASET | 7 – Monitoring | Rene | 1d | Ready ✓ training_pipeline.py + orchestrator built; configure for Volaris cadence |
| V-53 | Confirm auto-winner worker runs for Volaris experiment with correct guardrails | 7 – Monitoring | Naoki | 0.5d | |
| V-54 | Post-launch review — 2-week lift analysis: approval rate, outage response, retry success | 7 – Monitoring | Rakesh | 1d | |
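
V-41 and V-42 call for a 10% treatment split with deterministic SHA256 bucketing. The following is a minimal Python sketch of how such an assignment could work, not the ExperimentService implementation. The function name `assign_variant`, the `experiment_id:unit_id` key format, and the 10,000-bucket resolution are all illustrative assumptions.

```python
import hashlib


def assign_variant(experiment_id: str, unit_id: str, treatment_pct: float = 10.0) -> str:
    """Map a unit (for example a transaction or user ID) to 'treatment' or 'control'.

    Hypothetical helper: the key format and bucket count are assumptions.
    """
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce to [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # treatment_pct is a percentage, so 10.0 covers buckets 0..999.
    return "treatment" if bucket < treatment_pct * 100 else "control"
```

Because the variant depends only on the hashed key, the same unit lands in the same variant on every call and on every host, which is the property V-42 sets out to validate.
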
🔴 Critical — Do These First
| Action | Repo | Effort |
|---|---|---|
| Build retry_optimization_requested stimulus — P-05 is entirely missing from LLM platform | DATA-Athena-Snowflake | 3d |
| Complete Strategy Director — replace exit() placeholder & dummy ranker prompts | DATA-Athena-Snowflake | 2d |
| Add tests for all 18 V2 REST handlers — 0% coverage on new API version | athena-platform | 4d |
| Add tests for Bedrock client + Bedrock domain service — production-critical, currently excluded | athena-platform | 1.5d |
| Add route, service & client layer tests — all 14 routes, 13 services, 3 clients at 0% | DATA-Athena-Snowflake | 7d |
| Remove internal/clients/ from coverage exclusions in CI | athena-platform | 0.5d |
- Unit test all route handlers with `FastAPI TestClient` + mocked services
- Unit test all 13 services — mock Snowflake sessions and clients
- Unit test core multi-agent framework: `AgentWorkflow`, `AgentStrategy`, node/edge composition
- Add per-branch tests for all 11 stimulus branches (mock LLM responses with fixtures)
- Unify CI into a single `pytest` run — replace fragmented per-domain workflows
- Enable `pytest-cov` with 60% minimum threshold enforced in CI
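
As an illustration of the "mock Snowflake sessions" item above, here is a small Python sketch of a service test that never touches Snowflake. `ApprovalRateService`, the `processor_stats` table, and the query are hypothetical. The session interface (`sql(...).collect()`) is only loosely modeled on a Snowpark-style session and is an assumption about how the real services are wired.

```python
from unittest.mock import MagicMock


class ApprovalRateService:
    """Hypothetical service that reads an approval rate through an injected session."""

    def __init__(self, session):
        self._session = session

    def approval_rate(self, processor: str) -> float:
        rows = self._session.sql(
            "SELECT approved, total FROM processor_stats WHERE processor = ?",
            params=[processor],
        ).collect()
        approved, total = rows[0]
        # Guard against division by zero when a processor has no traffic.
        return approved / total if total else 0.0


def test_approval_rate_uses_session():
    session = MagicMock()
    # Stub the chained call session.sql(...).collect() with one fixture row.
    session.sql.return_value.collect.return_value = [(80, 100)]
    service = ApprovalRateService(session)
    assert service.approval_rate("WORLDPAY") == 0.8
    session.sql.assert_called_once()
```

Injecting the session through the constructor is what makes the mock possible; services that create their own connections internally would first need that seam.
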
- Circuit breaker in `AgentWorkflow` — isolate node failures, prevent cascade
- Enable OpenTelemetry tracing — already in codebase, just commented out
- Replace hardcoded thresholds (15% drop, 60–80 min windows) with configurable params
- Add LLM prompt injection guards — sanitize user inputs before system prompts
- Standardize tool definition — unify `@create_tool` vs. manual; add versioning
- Centralize config — replace scattered `load_dotenv` with Pydantic Settings schema
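
The circuit breaker item above could take roughly the following shape. This is a generic Python sketch of the pattern, not code from `AgentWorkflow`. The class name, its parameters (`max_failures`, `reset_after_s`), and the fallback behavior are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker sketch; all names and defaults are hypothetical."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, node_fn, *args, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback  # open: skip the node entirely
            # Half-open: allow one trial call; a single failure reopens the circuit.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = node_fn(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback
        self._failures = 0
        return result
```

Wrapping each node call in its own breaker keeps one failing node from repeatedly stalling the rest of the workflow, at the cost of returning a fallback value while the circuit is open.
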
- Add tests for all 18 V2 handlers — entire new API version at 0%
- Test Bedrock client & service — excluded from coverage, production-critical
- Raise CI threshold 20% → 60%; remove `internal/clients/` exclusion
- Bootstrap integration test with `testcontainers-go` — verify DI graph
- Benchmark tests for `/ml/predict`, `/feedback`, experiment assignment
- Contract tests for Snowflake & Bedrock APIs — catch schema drift early
- Event-driven model registry cache invalidation — remove 24h stale assignment risk
- Experiment context middleware — auto-propagate session/experiment IDs per request
- Abstract `*gin.Context` from controllers — transport-agnostic, easier to test
- Deploy `ATHIA_STAGE_OUTCOMES` + `ATHIA_SESSION_SUMMARY` Snowflake tables
- SageMaker model warm-up — cold starts can breach p95 < 200ms target
- Production Grafana dashboards + alerts — config exists locally, not deployed
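
To make the cache invalidation item concrete, here is a language-neutral sketch written in Python (the athena-platform service itself is Go). It assumes a cache that keeps a TTL as a safety net and also drops entries when a "model registered" event arrives. `ModelRegistryCache`, `on_model_registered`, and the loader callable are hypothetical names.

```python
import time


class ModelRegistryCache:
    """Hypothetical cache: TTL as a safety net, events for prompt invalidation."""

    def __init__(self, loader, ttl_s=24 * 3600, clock=time.monotonic):
        self._loader = loader   # callable: model name -> registry entry
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}      # model name -> (entry, loaded_at)

    def get(self, name):
        cached = self._entries.get(name)
        if cached is not None and self._clock() - cached[1] < self._ttl_s:
            return cached[0]
        entry = self._loader(name)
        self._entries[name] = (entry, self._clock())
        return entry

    def on_model_registered(self, name):
        """Event handler: drop the cached entry so the next get() reloads it."""
        self._entries.pop(name, None)
```

With only the TTL, a newly registered version can stay invisible for up to the full 24 hours; the event handler shortens that window to the delivery latency of the event.
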
Full Priority Order
| Priority | Action | Repo | Effort |
|---|---|---|---|
| Critical | Build retry optimization stimulus (P-05) | DATA-Athena-Snowflake | 3d |
| Critical | Complete Strategy Director matcher + ranker | DATA-Athena-Snowflake | 2d |
| Critical | Add tests for all 18 V2 REST handlers | athena-platform | 4d |
| Critical | Add tests for Bedrock client + service | athena-platform | 1.5d |
| Critical | Add route, service & client tests (all at 0%) | DATA-Athena-Snowflake | 7d |
| Critical | Remove internal/clients/ from coverage exclusions | athena-platform | 0.5d |
| High | Multi-agent framework + branch tests (11 branches) | DATA-Athena-Snowflake | 6d |
| High | Circuit breaker in AgentWorkflow | DATA-Athena-Snowflake | 2d |
| High | Enable OpenTelemetry tracing | DATA-Athena-Snowflake | 1.5d |
| High | Raise CI coverage threshold to 60% | athena-platform | 0.5d |
| High | Bootstrap integration test (testcontainers) | athena-platform | 1.5d |
| High | Event-driven model registry cache invalidation | athena-platform | 1.5d |
| High | Deploy ATHIA_STAGE_OUTCOMES + SESSION_SUMMARY tables | athena-platform | 1d |
| High | SageMaker model warm-up (latency target risk) | athena-platform | 1d |
| Medium | Adaptive thresholds (replace hardcoded values) | DATA-Athena-Snowflake | 2d |
| Medium | Experiment context middleware | athena-platform | 2d |
| Medium | Production Grafana dashboards + alert rules | athena-platform | 2d |
| Medium | Benchmark tests for hot endpoints | athena-platform | 1d |
| Medium | Unified CI test suite + coverage enforcement | DATA-Athena-Snowflake | 1.5d |
| Engineer | Repo | Commits |
|---|---|---|
| Rakesh | DATA-Athena-Snowflake | 0114867 Add BIN 546759 → MIT_BULK override; fix ECS poll; 36e2a2c Add refresh_bin_overrides.py; 4227ebb Add serving-time WORLDPAY overrides; 1ecc87c Add MIT_BULK neg-weight; e1fae15 Fix override detection; 965adc4 Revert Adyen processor merge (PR #1389) |
| Engineer | Repo | Commits |
|---|---|---|
| Rakesh | DATA-Athena-Snowflake | 8fd7755 fix: update Snowflake account to VLTAXPW-YN70854 |
| Engineer | Repo | Commits |
|---|---|---|
| Rakesh | DATA-Athena-Snowflake | 7c07eae feat(ATH-1272): BIN routing override config and automation script; 7433b34 fix: correct warehouse in routing scripts; 410e512 fix: PAYMENTS_ML warehouse in update_bin_routing_rules |
| Engineer | Repo | Commits |
|---|---|---|
| Rakesh | DATA-Athena-Snowflake | 96e572e feat(ATH-1375): add Adyen as 5th processor head in Volaris DNN; 7eb8b4e feat: persist BIN neg-weight config across retrains; aec0094 test(ATH-1352): neg-weight config tests; 4a6af72 fix(ATH-1277): align Snowflake warehouse defaults |
| Rakesh | athena-platform | c0cb96d feat(ATH-1375): add Adyen BIN/brand approval rate fields to serving layer; 583522a fix(ATH-1277): close DNN confidence gap — channel, timing, flight data |
| Engineer | Repo | Commits |
|---|---|---|
| Rakesh | DATA-Athena-Snowflake | c5b40d9 fix(volaris): correct bin_processor_rates formula |
| Rakesh | athena-platform | 8abd1b4 test(ATH-1352): expand BIN rate penalty coverage to 26 tests |
| Engineer | Repo | Commits |
|---|---|---|
| Rakesh | DATA-Athena-Snowflake | db2d8ea fix(ATH-1277): write version.json to latest/ prefix on every promotion; 42a3887 feat(ATH-1277): restore automatic latest/ writes |
| Engineer | Repo | Commits |
|---|---|---|
| Rakesh | DATA-Athena-Snowflake | dde18a4 feat(training): BIN neg-weight boosting enabled by default for 481516/416916 → WORLDPAY; e0dec88 feat(training): add BIN-level negative signal boosting for mis-routed processor pairs |
| Rakesh | athena-platform | 6691322 feat(ATH-1352): Thompson Sampling for Volaris smart router; a74d8a5 feat(ATH-1352): serving-layer BIN approval rate penalty |
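
Commit 6691322 introduces Thompson Sampling for the Volaris smart router. For readers unfamiliar with the technique, the following is a textbook Beta-Bernoulli sketch in Python. It is not the code from that commit; `ThompsonRouter`, the uniform Beta(1, 1) prior, and treating each approval as a Bernoulli reward are assumptions.

```python
import random


class ThompsonRouter:
    """Textbook Beta-Bernoulli Thompson Sampling over processors (illustrative)."""

    def __init__(self, processors, seed=None):
        self._rng = random.Random(seed)
        # Per processor: [alpha, beta] of a Beta posterior, starting from Beta(1, 1).
        self._stats = {p: [1.0, 1.0] for p in processors}

    def choose(self):
        # Draw one plausible approval rate per processor and pick the largest draw.
        samples = {p: self._rng.betavariate(a, b) for p, (a, b) in self._stats.items()}
        return max(samples, key=samples.get)

    def update(self, processor, approved):
        # An approval increments alpha, a decline increments beta.
        self._stats[processor][0 if approved else 1] += 1.0
```

Processors with little data have wide posteriors and still get sampled occasionally, so exploration tapers off on its own as evidence accumulates.
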
| Engineer | Work Done |
|---|---|
| Rakesh | Deep analysis of Volaris DNN transactions. Identified that Adyen was added as a new processor in mid-May, so the model has no training history for it and needs retraining once 12 weeks of Adyen data is available. Compared model versions v54 and v56 extensively; v54 is the stronger performer. Coordinated with Jose to revert production serving to v54 and disabled daily automated training so that training can follow a controlled manual cadence going forward. |
| Engineer | Work Done |
|---|---|
| Rakesh | Audited all BIN-level penalty configurations and calibrated the model to correctly handle bad BINs (card BINs with historically low approval rates). This calibration work was the key driver behind v54 and its improved production performance. |
| Engineer | Work Done |
|---|---|
| Rakesh | Created a dedicated Transaction Explorer tool (transactions.html) — a live Snowflake-backed analyzer for inspecting individual transactions across the Deuna ML ecosystem. Displays all ML features fed to the DNN model, raw request/response payloads from the sidecar, payment attempt lifecycle (first try + retries), and feedback/outcome records. Supports sampling by processor, model version (DNN vs heuristic), and outcome correctness — especially useful for debugging Volaris DNN model performance discrepancies. |
| Engineer | Work Done |
|---|---|
| Rakesh | Fixed all ML training pipelines failing due to access issues post Snowflake migration to Ohio region (new instance: VLTAXPW-YN70854). Updated all scripts and config to point to new account/credentials. Verified pipeline connectivity and ran full metrics refresh — confirmed 884K attempts, 80.4% approval rate across all dashboards. |
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzed Volaris DNN model versions v43–v48 to confirm iterative improvement across versions. Investigated bin-level penalty not applying in production — current hypothesis is that the bin-level penalty config has been loaded onto the sidecar rather than the serving path; debugging in progress. |
| Engineer | Work Done |
|---|---|
| Rakesh | Ran full Snowflake refresh across metrics and merchant dashboards (822K attempts, 80.8% approval rate). Added daily $ lift chart for Volaris DNN vs Heuristic comparison. Fixed model comparison chart rendering (Chart.js scale ID issue). Analyzed MCO model to confirm it is ready for regular (non-Volaris) merchant MCO rollout — confirmed bias removal logic is still functioning correctly. |
| Engineer | Work Done |
|---|---|
| Rakesh | Deep analysis of data and schema — identified Elavon and Worldpay transactions stuck in "processing" status, polluting training data with ambiguous outcomes. Pinging Deuna team to resolve. Added quality metric gates to model promotion pipeline — model will not be promoted if it does not outperform the currently serving model in offline evaluation. |
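
The promotion gate described above ("will not be promoted if it does not outperform the currently serving model") can be sketched as a small comparison function. This is an assumption-laden Python illustration, not the pipeline's gate code: `should_promote`, the `auc` metric key, and `min_gain` are hypothetical, while the 0.65 AUC floor comes from the quality suite listed earlier on this page.

```python
def should_promote(candidate, serving, min_auc=0.65, min_gain=0.0):
    """Return (promote, reason) from offline metric dicts that carry an 'auc' key.

    Hypothetical helper; the real gates also cover data volume, stability
    and completeness, which are omitted here.
    """
    if candidate["auc"] < min_auc:
        return False, f"candidate AUC {candidate['auc']:.3f} below floor {min_auc:.2f}"
    if serving is None:
        return True, "no serving model to compare against"
    gain = candidate["auc"] - serving["auc"]
    if gain <= min_gain:
        return False, f"AUC gain {gain:+.3f} does not exceed {min_gain:+.3f}"
    return True, f"AUC gain {gain:+.3f} over serving model"
```

Returning the reason alongside the decision makes it straightforward to log why a daily run was or was not promoted.
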
| Engineer | Work Done |
|---|---|
| Rakesh | Added serving change for calling MCO model for Volaris transactions using flight data. Verified MCO model shadow mode for Volaris transactions — fixing serving code. |
| Engineer | Work Done |
|---|---|
| Rakesh | Added detailed analysis for Volaris and MCO on metrics dashboard and scripts. Extensive analysis to identify and resolve sampling bias in DNN vs heuristic comparison — implemented stratified amount-bin matching. Audited MCO for Volaris and identified serving change needed to route Volaris transactions through MCO model using flight data. |
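
One way to read "stratified amount-bin matching" is: bucket each arm's transactions by amount, keep the same number of transactions per bucket in both arms, and only then compare approval rates. The Python sketch below follows that reading. The bin edges, function names, and the `(amount, approved)` row shape are assumptions, and the actual analysis scripts may differ.

```python
import random
from collections import defaultdict


def amount_bin(amount, edges=(50.0, 200.0, 1000.0)):
    """Index of the amount bucket; the edges are illustrative, not from the source."""
    return sum(amount >= e for e in edges)


def matched_approval_rates(arm_a, arm_b, seed=0):
    """Down-sample both arms to equal counts per amount bin, then compare rates.

    Each arm is a list of (amount, approved) pairs.
    """
    rng = random.Random(seed)

    def by_bin(rows):
        groups = defaultdict(list)
        for amount, approved in rows:
            groups[amount_bin(amount)].append(approved)
        return groups

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else float("nan")

    a_bins, b_bins = by_bin(arm_a), by_bin(arm_b)
    a_kept, b_kept = [], []
    for b in sorted(a_bins.keys() & b_bins.keys()):
        n = min(len(a_bins[b]), len(b_bins[b]))
        a_kept += rng.sample(a_bins[b], n)
        b_kept += rng.sample(b_bins[b], n)
    return rate(a_kept), rate(b_kept)
```

If one arm receives mostly small tickets and the other mostly large ones, the raw approval rates differ even when the routing quality is identical; matching per bin removes that part of the gap.
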
| Engineer | Work Done |
|---|---|
| Rakesh | Identified bug where "latest" was being logged as the model version instead of the actual version (e.g. v35) — causing confusion in dashboards. |
| Engineer | Work Done |
|---|---|
| Rakesh | Did extensive analysis to confirm that v34 is the best model for Volaris DNN. Explored offline simulation but determined it would be tricky due to lack of counterfactuals. |
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzed and added multiple minor quality enhancements to Volaris DNN model — getting close to consistently beating control, even if by a small margin. |
| Engineer | Work Done |
|---|---|
| Rakesh | Added temperature-based calibration to Volaris DNN training model along with serving changes to use calibration. Analyzed biases in Volaris control and experiment groups. |
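
Temperature-based calibration usually means dividing the model's logits by a scalar T fitted on held-out data so that predicted probabilities match observed approval rates. The Python sketch below shows that standard technique for a single binary head, fitting T by grid search. Whether the Volaris DNN uses one temperature or one per processor head is not stated here, and the function names and grid are assumptions.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrated_probability(logit, temperature):
    """Scale the logit by 1/T before the sigmoid; T > 1 softens overconfidence."""
    return sigmoid(logit / temperature)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of binary labels at a given temperature."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(calibrated_probability(z, temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid (0.25 to 5.0) that minimizes held-out NLL."""
    grid = grid or [0.25 + 0.05 * i for i in range(96)]
    return min(grid, key=lambda t: nll(logits, labels, t))
```

Temperature scaling leaves the ranking of processors within a head unchanged; it only changes how confident the scores look, which matters when scores are compared against thresholds at serving time.
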
| Engineer | Work Done |
|---|---|
| Rakesh | Continued analyzing Volaris DNN model performance. Routed underperforming Visa and MC BINs through ad hoc override rules. Added detailed analysis to ticket ATH-1272. Next: tune the model to use confidence scores and temperature-based analysis, exploring per-processor confidence scaling with a threshold that falls back to heuristics when the model is not confident. |
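
The confidence-threshold fallback being explored above could look roughly like this. It is a Python sketch under stated assumptions: `route`, the per-processor `thresholds` mapping, and the 0.6 default are hypothetical, and the processor names are used only as example keys.

```python
def route(probabilities, heuristic_choice, thresholds=None, default_threshold=0.6):
    """Use the model's top processor only when its score clears that processor's threshold.

    probabilities: processor name -> predicted approval probability.
    Returns (processor, source) where source is 'model' or 'heuristic'.
    """
    thresholds = thresholds or {}
    if not probabilities:
        return heuristic_choice, "heuristic"
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] >= thresholds.get(best, default_threshold):
        return best, "model"
    return heuristic_choice, "heuristic"
```

For example, `route({"WORLDPAY": 0.55, "ELAVON": 0.40}, "MIT")` falls back to the heuristic choice under the default threshold, while lowering the WORLDPAY threshold to 0.5 would let the model's pick through.
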
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzed Volaris DNN model performance for specific BIN ranges where it is underperforming. Added ad hoc rules to override model results in those failing BIN ranges. Submitted PR, scheduled to go out Monday. |
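
A BIN override of this kind amounts to a lookup that runs after the model has made its choice. The Python sketch below assumes a plain mapping from 6-digit BIN to processor. The single entry mirrors the `546759 → MIT_BULK` override named in the commit log above; the function name and data structure are hypothetical.

```python
# 6-digit BIN -> processor. The entry mirrors the override named in commit 0114867;
# the dict-based structure itself is an assumption for illustration.
BIN_OVERRIDES = {"546759": "MIT_BULK"}


def apply_bin_override(card_bin, model_choice, overrides=BIN_OVERRIDES):
    """Return the override processor for a listed BIN, else the model's choice."""
    return overrides.get(card_bin[:6], model_choice)
```

Keeping the overrides in data rather than code lets them be refreshed without retraining or redeploying the model.
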
| Engineer | Work Done |
|---|---|
| Rakesh | Added capability to MCO models to handle airline tasks (Volaris-specific flows). Fixed MCO model training pipeline so that the GPU no longer goes OOM; training now finishes and generates a model successfully. Deep analysis on Volaris routing model performance found a niche error where 3 specific MC BINs were failing 100% of the time; added BIN-related features to training pipeline and trained a new model. |
| Engineer | Work Done |
|---|---|
| Rakesh | Fixed ClearML integration issue. Deployed full training run end-to-end. Loaded model from S3 to EFS for serving. Added an alerting check for when the daily pipeline run doesn't happen. Added a scheduler to reload the latest model version at 7am PST every day so production always has the latest model. All previously discussed ideas completed: latest model trained and deployed in prod, no more known issues. Pulled an all-nighter to fix everything that was broken for Volaris DNN; it was not being trained daily with the latest trends, and this is now fully resolved. Analyzed and fixed the last mile of all breakages and verified everything working end-to-end. Fixed the MCO training pipeline, which was also failing. Confirmed Volaris routing is back to break-even. |
| Engineer | Work Done |
|---|---|
| Rakesh | Debugged MCO model in shadow mode to ensure everything is hooked up correctly and working. Added the last 2 remaining heads of the MCO model; all heads are now being called in production. |
| Engineer | Work Done |
|---|---|
| Rakesh | DNN Volaris model is back up and running in production, handling live traffic with correct data as expected. Full analysis completed confirming everything is hooked up and connected end-to-end. Data distribution matches what the model was trained on — strong signal that the model should perform well for Volaris. Added serving change to handle other heads of MCO model so it can be used in payment services for whichever flow. |
| Engineer | Work Done |
|---|---|
| Rakesh | Added MCO dashboard for monitoring. Completed full analysis confirming all components are hooked up and connected end-to-end. |
| Engineer | Work Done |
|---|---|
| Rakesh | DNN model finally enabled in production — fixed many bugs so that production DNN functions correctly. Training with 1 year of data now succeeds. DNN serving live traffic alongside LogReg in A/B experiment. |
| Engineer | Work Done |
|---|---|
| Rakesh | Added model comparison graphs to metrics dashboard (DNN vs LogReg vs Heuristic — accuracy, approval rate, latency, volume). DNN model went to production yesterday. Monitoring production DNN, fixed one bug in pipeline. Retraining with 1 year of data — will load to production once training completes. |
| Engineer | Work Done |
|---|---|
| Rakesh | All productionization complete. 2 models (LogReg + DNN) for Volaris smart routing running in production. Continuous daily training from Snowflake data. GPU-accelerated (g6.2xlarge NVIDIA L4). Full CI/CD (GitHub Actions + CodePipeline + CodeDeploy). 5-gate quality suite. Deep ClearML integration. 13 of 14 gaps closed (~101d of ~104.5d saved). Only lineage tracking polish remaining (~1.5d). |
| Engineer | Work Done |
|---|---|
| Rakesh | Changed training pipelines to use GPU instances. Implemented deeper ClearML integration in training pipeline — extracting detailed metrics from each training run into ClearML for monitoring and debugging. |
| Engineer | Work Done |
|---|---|
| Rakesh | Spent ~30 hours debugging dev servers with ClearML integration — blocked by access issues. Worked with team to resolve access, dev pipeline now working correctly. Attempted DNN pipeline deployment — ran for 3 hours and failed. Long turnaround time makes iteration impractical without GPU. |
- GPU needed — training requires GPU instance; CPU-based runs too slow for practical iteration (3hr+ per attempt)
- AWS access for Rakesh — need direct access to AWS resources (console/CLI) to debug and iterate efficiently
| Engineer | Work Done |
|---|---|
| Rakesh | Implemented DNN pipeline Terraform; deploying in dev. Need to get PRs submitted and merged to main/qa so pipelines can run via CI/CD. |
| Engineer | Work Done |
|---|---|
| Rakesh | Debugging why ClearML is not registering all metrics from each training run. Next: deploy DNN model and fix production integration to ensure every training run model reaches production automatically |
| Engineer | Work Done |
|---|---|
| Rakesh | Fixing Volaris training pipeline run in dev environment. Helping team with prod deployment |
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed e2e SageMaker dolphin pipeline (PreprocessData ✅, TrainModel ✅, EvaluateModel fix pushed). Deployed Volaris smart router daily training infra. ClearML integration confirmed working. Implemented Volaris pipeline CLI with Snowflake connection. Added 13 tests for setup_pipeline. Fixed multiple SageMaker issues: pipeline name mismatch, model.save(), RegisterModel, FrameworkProcessor for eval. |
- PR #1132 (DATA-Athena-Snowflake) needs approval to merge to `qa` — blocks CodePipeline deployment
- CodeDeploy DeployEC2 stage failing — scripts need debugging after merge
- SSO lacks `sagemaker:CreatePipeline` for local runs
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed e2e SageMaker training pipeline via Spacelift. Consolidated PRs #24+#25 → #26. Fixed Spacelift project_root, cleaned orphaned state. Enabled daily training for both pipelines. Set up ClearML creds in Secrets Manager + EC2. Created Spacelift stack for Volaris. Renamed model-artifacts → volaris-model-artifacts. |
| Engineer | Work Done |
|---|---|
| Rakesh | Fighting Terraform and Spacelift configuration issues. ClearML successfully deployed in dev environment |
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed ClearML using Terraform and Spacelift in dev environment and handed over to team. Full data analysis of Snowflake data with Rene — generated list of suggestions for team, published at metrics dashboard |
| Rene | Full data analysis of Snowflake data with Rakesh — generated list of suggestions for team |
| Engineer | Work Done |
|---|---|
| Rakesh | Deploying entire training platform end to end. Learning Spacelift for infra deployment (finally got access). Deploying ClearML to monitor all training. Wrote an analysis script that monitors model performance and drives the metrics dashboard directly from Snowflake, which is better for analysis and generating insights |
| Rene | Analyzing model performance with past data. Waiting for experiment to be enabled again — current data not significant enough |
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzing model performance metrics, working on DNN model optimization for Volaris smart router |
| Engineer | Work Done |
|---|---|
| Rakesh | Training DNN model for Volaris smart router — also serves as example on how to use the TFX pipeline |
| Rene | Data analysis to identify patterns based on Volaris data |
| Engineer | Work Done |
|---|---|
| Rakesh | Verified everything working in production. Helping with questions about the model. Double-checked traffic ramp-up and analyzing how to do post-launch analysis |
| Engineer | Work Done |
|---|---|
| Rakesh | Integrated model in serving stack with sidecar approach, loading models from S3/EFS. Removed all parallel serving libraries no longer needed in training directory |
| Engineer | Work Done |
|---|---|
| Rene | Created and incorporated AMEX model, added to the serving mix |
| Rakesh | Built AMEX model with Rene. Working with Deuna team on deploying everything in production for Volaris |
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzing code with Naoki to determine serving approach — sidecar vs deploying servomatic. Building production flow to save and load models from S3 & deploying servomatic binary |
| Naoki | Evaluating sidecar vs servomatic deployment for integrating TF model with the Go API |
| Rene | Continuing to iterate on the TF model and data analysis. Converting LR model to TF format for use in servomatic binary for online eval |
| Engineer | Work Done |
|---|---|
| Rene | First regression model trained on Volaris data — evaluating quality in offline mode, initial results look promising |
| Rakesh | Continuing to analyze approach to serve the model as servomatic with Naoki |
| Naoki | Analyzing serving architecture to connect trained model via servomatic in production |
| Engineer | Work Done |
|---|---|
| Rene | Iterating on data shape analysis and building first model version |
| Rakesh | Working with Rene on first model; analyzing serving code with Naoki to plan production integration |
| Naoki | Analyzing serving code with Rakesh to determine how to connect model in production |
| Engineer | Work Done |
|---|---|
| Rakesh | First analysis of the data with Rene — analyzing best approach to build processor selector model |
| Rene | Started on first model based on current understanding of data and features |
| Naoki | Working with Rene on integrating S3 file loading into TFX data loader for e2e training and eval |
| Engineer | Work Done |
|---|---|
| Rene | Looking at data shape for Volaris to train first processor selector model |
| Kedar | Working on data pipeline |
| Naoki | Continuing to set up good practices (code quality, CI, testing patterns) |
| Rakesh | Writing smartrouter service |
| Engineer | Work Done |
|---|---|
| Rakesh | Iterated on experiment and metrics framework to make everything work locally and in tests |
| Naoki | Iterated on improving code and ramping up |
| Kedar | Looking at feature extraction from Snowflake |
| Rene | Working on simple first model |
| Engineer | Work Done |
|---|---|
| Rakesh | Added Evaluation Service (uses Model Service + Feature Service to evaluate TensorFlow models). Added e2e tests for all 3 services. Added experiment and metrics framework to track all training pipelines. Demo training pipeline working end to end. PR waiting for review |
| Engineer | Work Done |
|---|---|
| Rakesh | Built foundational services: Model Service and Feature Service with tests and scaffoldings to support TensorFlow trained models |
| Engineer | Work Done |
|---|---|
| Naoki | Fixed broken tests to get everything running locally. Looking at setting up automated deployment in dev environment for services |
| Kedar | Got repo and environment access figured out. Looking into Snowflake data schema |
| Rene | Got repo and environment access figured out. Looking at training pipeline code |
| Rakesh | Updated deuna.aidaptive.com with latest repo analysis and refreshed task list. Synced athena-platform (v0.15.5, Triton merged) |
Are ATHIA_PREDICTIONS / ATHIA_FEEDBACK tables populated in Deuna's Snowflake today? Confirmed ✓ 2026-02-24
Confirmed — data is live in Deuna's Snowflake (verified 2026-02-24).
Are SageMaker endpoints live for processor_selector / retry_predictor?
Or are they placeholders only? — Rakesh to confirm
Is there a live model in MODEL_ARTIFACTS that Deuna's payment service is calling today?
Rakesh to confirm
What is the current payment volume through the routing engine?
Minimum 1,000 transactions per variant needed for A/B test statistical validity — Ask Israel
Who owns the athena-platform Go repo deployments?
Aidaptive or Deuna infra? Affects Phase 1 deployment planning — Clarify with Pablo
When will feature/llm-driven-ml-training (Triton IS) merge to main? New
This PR closes G-06 and defines the production model serving backend (Triton vs. SageMaker). Its merge timeline directly sets the Phase 6 integration schedule — ask Pablo.
| Item | Owner | Status |
|---|---|---|
| Snowflake access — Rakesh | Israel (Deuna) | ✓ Done (2026-02-18) |
| Snowflake access — Naoki | Rakesh + Naoki | ✓ Done (2026-02-19) |
| Code / repo access — Rakesh | Pablo (Deuna) | ✓ Done (2026-02-19) |
| Claude / LLM access & budget | Pablo → Farhan | ✓ Done (2026-02-19) |
| Code / repo access — Naoki | TBD | ✓ Done |
| Deuna corp accounts — Rakesh & Naoki | TBD | Pending |
| Claude Code credits — Rakesh & Naoki | — | Not needed |
| Deploy ATHIA_STAGE_OUTCOMES + ATHIA_SESSION_SUMMARY in Snowflake | Rakesh | ✓ Done (feat/ATH-0000) |
| Build retry_optimization_requested workflow | Rakesh | Pending |
| Service | URL | Details |
|---|---|---|
| AWS Console (SSO) | deunaio.awsapps.com | Deuna AWS account access |
| Snowflake | vltaxpw-rmontes.snowflakecomputing.com | Account: VLTAXPW-RMONTES · DB: PAYMENT_ML · Warehouse: PAYMENT_ML · Read-only |
| Athia Experiments Dashboard | insights.deuna.com | Model performance data for processor selector experiments |
| ClearML (Prod) | athia-ml.deuna.io | ML experiment tracking & training monitoring — production |
| ClearML (Dev) | athia-ml.dev.deuna.io | ML experiment tracking & training monitoring — dev environment |
| Spacelift | duna-e-commmerce.app.spacelift.io | Infrastructure governance & Terraform deployment |
| Terraform Repo | github.com/DUNA-E-Commmerce/terraform-athia | All Athia infrastructure as code |
| Rule | Details |
|---|---|
| AWS Resource Tags | All AWS resources must include: CreatedBy=aidaptive, ServiceName=smartrouter, Environment=POC |
| Infrastructure as Code | All infrastructure via Terraform only — no manual AWS console resource creation |
| Date | Decision | Rationale | Made By |
|---|---|---|---|
| 2026-04-19 | All productionization complete — 2 models (LogReg + DNN) running daily in dev + prod | GPU-accelerated training, full CI/CD, 5-gate quality suite, ClearML tracking, S3 versioned storage. 13/14 gaps closed (~101d saved) | Rakesh |
| 2026-03-24 | Serving migrated from DATA-Athena-Snowflake to athia-model-server sidecar in athena-platform | Clean separation — training repo (Python) vs serving repo (Go + Python sidecar) | Rakesh |
| 2026-03-13 | Adopted TensorFlow ecosystem (TFX, TF Serving, TFDV, TFMA, TF Transform) for all ML work | Replaces Snowflake ML / XGBoost / scikit-learn. Unified training → validation → serving pipeline with production-grade tooling | Rakesh |
| 2026-03-13 | Added Phase 0 — 7 service shells + 3 design tasks (Rakesh) before Volaris feature work | Service architecture: Data Pipelines, Feature Service, Training Pipelines, Model Mgmt, Eval Service, Evaluation Framework, Experiment System | Rakesh |
| 2026-03-13 | Both repos switched to main branch — feat/ATH-0000 and Triton branches both merged | All ML training pipeline and Triton serving code now on main; no more feature branch tracking needed | Rakesh |
| 2026-03-13 | Triton branch merged to main — confirms deployment architecture | feature/llm-driven-ml-training merged; Triton IS, ExperimentService, shadow mode now in production codebase | Deuna Engineering |
| 2026-02-18 | Latency target updated: p95 <50ms → p95 <200ms | Revised from original SOW spec | Rakesh (w/ Pablo) |
| 2026-02-19 | Phase 1 target merchant set to Volaris (not Cinépolis) | Volaris has known PSPs (Worldpay ID:76, MIT ID:85, Elavon, Amex); Cinépolis only shows Cybersource gateway — processor unknown | Mark Walick |
| 2026-02-20 | Repo analysis scoped to branch feat/ATH-0000-athia-ml-llm-schema-discovery (not main) | This branch contains the active ML platform development; main does not reflect current capabilities | Pablo |
| 2026-02-26 | athena-platform feature/llm-driven-ml-training (Triton IS) identified as the production model serving path | Triton IS + shadow mode + ExperimentService provides complete training→serving pipeline; replaces manual SageMaker endpoint registration; closes G-06 | Rakesh |