Smartrouter AI/ML Integration
Planning phase complete. Now building the Athia AI/ML integration into Deuna's payment routing service — focused on the Volaris merchant task list with 3 engineers.
What This Is
The planning phase is complete. We are now in active development — implementing the full Volaris task list to integrate Athia AI/ML into Deuna's payment routing service. A team of 3 engineers is focused on delivering the 65-task Volaris plan across eight sub-phases (0–7).
The Problem
- Static routing rules → suboptimal acceptance rates
- No intelligent PSP failover during outages
- Retries not optimized by timing, route, or message
- No feedback loop — outcomes don't improve future decisions
The Solution (5 Use Cases)
- P-01 — PSP outage detection & failover
- P-02 — Optimize existing static routing rules
- P-03 — Per-transaction processor ranking
- P-04 — Authorization message manipulation
- P-05 — Retry optimization
Phase 1 Target — Volaris Merchant
✅ Already Built (Reduces Effort)
- ML serving API — processor_selector & retry_predictor endpoints live (v0.15.5)
- Model registry — artifact + experiment + variant tables + CRUD API
- A/B testing — auto-winner with statistical guardrails (p<0.05, 7 day min, 1000 samples)
- Snowflake feedback loop — PREDICTIONS + FEEDBACK + TRAINING_DATASET + STAGE_OUTCOMES tables active
- Full ML training pipeline — LR, RF, XGBoost via Snowflake ML (G-01, G-04, G-07, G-09, G-11, G-12)
- Triton IS + shadow mode + ExperimentService API — G-06 done
- OTEL metrics pipeline + Grafana CloudWatch dashboard — G-05 done (v0.15.1)
- Dynamic per-model encoders + A/B model version routing — G-13 done (v0.15.2+)
- Processor Selector v2 — 54-feature XGBoost encoder, fully tested
- ML platform services (PR pending) — Model Artifact, Feature, Evaluation, TFX, Experiment Tracking
- CI improvements — black, isort, uv migration, pytest fixtures (G-03 partial)
- Volaris training pipeline — data ingestion → preprocessing → trainer → evaluator → promoter, S3 artifact storage
- Volaris TF smart router sidecar — 3 TF SavedModels, 140-feature encoder, Go API integration with feature flag
🔨 Still Needs to Be Built (~15.5d)
- Feature store real-time serving layer — G-08 (~3.5d remaining)
- Orchestration DAG (Airflow/Prefect) — G-02 (~2d remaining)
- CI/CD deployment pipeline — G-03 (~3d remaining)
- Lineage tracking — G-10 (~5.5d)
- Rollback API — G-14 (~0.5d remaining)
- Retry optimization workflow (P-05) — stimulus still missing
- Strategy Director — matcher & ranker nodes are placeholder code
DATA-Athena-Snowflake Testing
~25% coverage · Maturity 3/5
Now training-only (serving removed via PR #1114). Volaris training pipeline complete. 30 tests. CI improved: black, isort enforcement, uv migration. Core multi-agent framework still largely untested.
athena-platform Testing
~35% coverage · Maturity 4/5
PR #215 — Volaris TF smart router sidecar. 3 TF SavedModels, 140-feature encoder, Go API integration with feature flag. 221 Python + 13 Go tests. V2 API (18 handlers) and Bedrock client still at 0%. CI threshold only 20%.
Current Blockers
| Item | Owner | Status |
|---|---|---|
| Deuna corp accounts — Rakesh & Naoki | TBD | Not needed |
| Code / repo access — Naoki | TBD | Done |
| Are ATHIA_* tables live in Deuna's Snowflake? | Israel | Confirmed ✓ |
| Are SageMaker endpoints live today? | Rakesh | Open question |
| Payment volume through routing engine? | Israel | Open question |
| GPU instance for training — CPU too slow (3hr+ per run) | Deuna | Blocking |
| AWS resource access for Rakesh (console/CLI) | Deuna | Blocking |
Purpose
Assess the effort required to integrate Athia AI/ML into Deuna's payment routing. Produce a clear work breakdown and estimate before any implementation begins.
Phase 0 Deliverables
- Full schema & data understanding
- Effort estimate per workstream
- Risks and open questions resolved
- Recommended build order
Long-Term Success
- Measurable approval lift
- Stability during PSP outages
- Latency: p95 < 200ms
- Closed feedback/learning loop
✅ In Scope (Phase 0)
- Understand Deuna's data, schema, routing rules
- Assess Athia platform gaps vs. what's needed
- Size effort for P-01 through P-05 use cases
- Identify all dependencies, blockers, risks
🚫 Out of Scope (Phase 0)
- Any implementation or code delivery
- 3DS optimization (Phase 2)
- User-facing messaging (Phase 3)
- Installment optimization
Phase 0 — Assess Level of Effort Done ✓
2 days · $6K budget · Completed 2026-02-19
Nail down all the work required. Produce a detailed estimate with confidence before committing to delivery.
Phase 1 — Model in Production Pending
2 weeks · Core delivery
Model running in production for 2 processors with basic feature store. Target merchant: Volaris.
Phase 2 — Monitoring + Experimentation Pending
Week 3 · Add monitoring and integrate with A/B experimentation infrastructure.
Phase 3 — Drift Detection, CI/CD, Ramp-Up Pending
TBD · Drift detection, CI/CD pipeline, experiment ramp-up, additional model techniques.
Phase 1 Delivery Plan — Volaris Merchant
65 tasks · ~64 person-days · 5–6 weeks with 3 engineers · TensorFlow ecosystem
| Sub-Phase | Focus | Tasks | Effort | Key Notes |
|---|---|---|---|---|
| 0 — Service Architecture | Design + scaffold 7 TF service shells | 11 | 14.5d | NEW Rakesh: architecture, API contracts, TF integration plan. Team: 7 service shells |
| 1 — Discovery & EDA | Understand Volaris data | 10 | 7.5d | Approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, A/B sample size |
| 2 — Feature Engineering | Build ML features (Feature Service + TF Transform) | 9 | 8.5d | Card BIN/brand, RFM, retry context, rolling health scores, Amex hard-rule bypass |
| 3 — Model Development | Train models (tf.keras) | 8 | 9d | TF DNN, wide-and-deep, TF Decision Forests via Training Pipeline + Eval Service |
| 4 — Outage Detection | P-01: failover for 4 PSPs | 6 | 6.5d | Rolling health score (5–15 min window), auto-failover, recovery detection (1–2% sampling), alerts |
| 5 — Message Manipulation | P-04: CIT/MIT experiment | 5 | 5.5d | CIT/MIT audit, approval delta by toggle × processor × card type, new athena-platform endpoint, A/B test |
| 6 — Platform Integration | Register models, wire Deuna | 8 | 6.5d | Triton ✓ ExperimentService one-call API + built-in shadow mode. Requires Deuna eng coordination. |
| 7 — Monitoring & Feedback | Dashboards, retraining, review | 8 | 6d | 2 tasks already done: ATHIA tables confirmed live, ATHIA_STAGE_OUTCOMES deployed |
| Total | | 65 | ~64d | Phase 0 first (Rakesh design ∥ team scaffolding); Phases 1–2 sequential; 3–5 parallel after Phase 2 |
Deuna Team
| Name | Role |
|---|---|
| Reks | CEO & Co-Founder |
| Chema | Co-Founder |
| Pablo | CTO — Executive Sponsor |
| Israel | Data POC — Snowflake & Data Access |
| Farhan | Claude / LLM Access POC |
| Mark Walick | Product Management Lead |
Aidaptive Team
| Name | Role |
|---|---|
| Rakesh | CEO |
| Naoki | Solutions Architect |
| Rene | ML Engineer |
| Kedar | Backend / Data Engineer |
Phase 1 Target Merchant: Volaris Decided 2026-02-19
Volaris selected over Cinépolis. Known PSPs: Worldpay (ID: 76), MIT (ID: 85), Elavon (cards), Amex (Amex cards) — 4 processors total with routing policies per currency. Cinépolis deferred: only shows Cybersource (a gateway), actual processor unknown.
Outage Detection & Failover
Detect PSP failures via persistent timeout codes. Auto fail-over and fail-back using random sampling of downed PSP to detect recovery.
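The fail-over/fail-back loop described above can be sketched in a few lines. This is an illustrative Python sketch, not Deuna's implementation: the window length, down threshold, recovery-win counter, and the `PSPHealth` class name are all assumptions chosen for clarity.

```python
import time
from collections import deque

# Illustrative constants — the real system uses a 5–15 min window and
# 1–2% recovery sampling; exact values here are assumptions.
WINDOW_SECONDS = 10 * 60        # sliding window for the health score
DOWN_THRESHOLD = 0.5            # health score below this marks the PSP down
RECOVERY_WINS_NEEDED = 3        # consecutive sampled successes before fail-back

class PSPHealth:
    def __init__(self):
        self.outcomes = deque()       # (timestamp, success) pairs in the window
        self.down = False
        self.consecutive_wins = 0

    def record(self, success, now=None):
        """Record one attempt outcome; while down, callers would only record
        the 1–2% of traffic sampled to probe the degraded PSP."""
        now = now or time.time()
        self.outcomes.append((now, success))
        # Evict outcomes that fell out of the sliding window.
        while self.outcomes and self.outcomes[0][0] < now - WINDOW_SECONDS:
            self.outcomes.popleft()
        if self.down:
            self.consecutive_wins = self.consecutive_wins + 1 if success else 0
            if self.consecutive_wins >= RECOVERY_WINS_NEEDED:
                self.down = False         # fail-back: PSP considered recovered
        elif self.score() < DOWN_THRESHOLD:
            self.down = True              # fail-over: router skips this PSP

    def score(self):
        if not self.outcomes:
            return 1.0                    # no recent data: assume healthy
        return sum(ok for _, ok in self.outcomes) / len(self.outcomes)
```

A router would consult `health.down` when ranking processors and route around any PSP currently marked down.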
Routing Optimizer
Optimize Deuna's existing static routing rules based on historical outcomes. Build on existing rules engine rather than starting from scratch.
Per-Transaction Route Selection
Rank top 3 payment processors per transaction in real time based on prior outcomes, card signals, and merchant context.
Message Manipulation
Toggle CIT/MIT, AVS, MCC variables in authorization request messages. Provide top 3 configuration recommendations per transaction.
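The "top 3 configuration recommendations" idea can be illustrated by enumerating toggle combinations and ranking them by a predicted approval score. Everything here is hypothetical: the toggle names, the candidate values, and the toy scorer stand in for the real model.

```python
from itertools import product

# Hypothetical toggle space — names and values are assumptions for illustration.
TOGGLES = {
    "cit_mit": ["CIT", "MIT"],
    "send_avs": [True, False],
    "mcc_override": [None, "4511"],   # airline MCC used as an example
}

def top_configs(score_config, k=3):
    """Rank every toggle combination by a scoring function and keep the top k."""
    combos = [dict(zip(TOGGLES, values)) for values in product(*TOGGLES.values())]
    return sorted(combos, key=score_config, reverse=True)[:k]

# Toy stand-in for the real approval model: prefers CIT with AVS enabled.
toy = lambda c: (c["cit_mit"] == "CIT") * 0.6 + c["send_avs"] * 0.3
best = top_configs(toy)
```

In production the scorer would be a trained model conditioned on processor and card type, and the recommendations would be attached per transaction.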
Retry Optimization
Optimize when, how, and where to retry declined transactions. MIT/subs focused. Enterprise darktime reduction. Delayed retry based on processor reputation.
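One piece of this, "delayed retry based on processor reputation", can be sketched as a simple backoff curve: the worse a processor's recent reputation, the longer the wait before retrying on it. The decay shape and delay bounds below are illustrative assumptions, not tuned values.

```python
def retry_delay_seconds(reputation, base=30, max_delay=3600):
    """Map a processor reputation in [0, 1] to a retry delay.
    Healthy processors (reputation near 1) are retried quickly;
    degraded ones back off toward max_delay. Constants are assumptions."""
    reputation = min(max(reputation, 0.0), 1.0)
    # Quadratic curve: delay grows faster as reputation degrades.
    return base + (max_delay - base) * (1.0 - reputation) ** 2
```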
Connection
VLTAXPW-RMONTES
Database: PAYMENT_ML
Access: Read-only
ABTESTING Schema
Denormalized flat join of all views. Best starting point for EDA. No complex joins needed.
ALL_VIEWS_FLAT · ALL_PAYMENT_EVENTS_FLAT
SOURCES Schema
15 clean views: orders, payments, attempts, events, user profiles, routing logs, merchant rules, airline data.
| View | Why It Matters | Use Cases |
|---|---|---|
| VW_ATHENA_PAYMENT_ATTEMPT | Full retry chain per payment; processor, error codes, hard/soft decline, DYNAMIC_ROUTING_DETAIL JSON | P-03, P-05 |
| VW_SMART_ROUTING_ATTEMPTS | Live routing engine log: algorithm type, latency, skip reasons — direct latency signal for p95 <200ms | P-01, P-02 |
| VW_ROUTING_MERCHANT_RULE | Existing static rules engine — foundation for routing optimizer. SHADOW_MODE column suggests testing infrastructure exists. | P-02 |
| ABTESTING.ALL_VIEWS_FLAT | Everything joined in one table — best for initial EDA | EDA |
| Feature Group | Key Columns | Use |
|---|---|---|
| Retry history | NUM_ATTEMPTS_ORDER, PREVIOUS_ORDER_ERROR_CODE, AVG_SEC_BETWEEN_PAYMENT_ATTEMPS | P-05 |
| Error signals | ERROR_CODE, ERROR_CATEGORY, HARD_SOFT | P-03, P-05 |
| Card signals | CARD_BIN, CARD_BRAND, BANK, CARD_COUNTRY | P-03 |
| User behavior | TARGET_USER_FRAUD_RATE_COHORT, TOTA_MINUTES_BROWSING, RFM values | P-03 |
| Message config | MCI_MSI_TYPE, ORDER_MCI_MSI_TYPE, PAYMENT_ATTEMPT_METHOD_TYPE | P-04 |
| Geo & Device | ORDER_COUNTRY_CODE, TARGET_USER_BROWSER, TARGET_USER_DEVICE | P-03 |
main · DATA-Athena-Snowflake now training-only (PR #1114) · athena-platform PR #215 — Volaris TF smart router sidecar
DATA-Athena-Snowflake
github.com/DUNA-E-Commmerce/DATA-Athena-Snowflake · branch: main (feat/ATH-0000 merged — 182 files, +29K lines)
✅ Key Finding (Updated 2026-03-24)
Now training-only — serving removed via PR #1114, migrated to athia-model-server sidecar in athena-platform. Volaris training pipeline complete: data ingestion → preprocessing → trainer → evaluator → promoter, with S3 artifact storage. 30 tests. Clean separation: training repo (Python) vs serving repo (Go + Python sidecar).
LLM Workflows (11 stimuli)
| Workflow | Status |
|---|---|
| Acceptance rate analysis | Done (v0_1, v1_0) |
| Fraud card analysis | Done |
| Metrics anomaly detection | Done |
| Chatbot / data analyst | Done |
| Strategy generation director | Partial — Matcher has exit() |
| Cost optimization | Early stage |
| Retry optimization | Missing (P-05 gap) |
ML Training Platform (now on main)
| Service | Status |
|---|---|
| Training pipeline (Snowflake ML) | Done |
| LLM training orchestrator (GPT-4 + RAG) | Done |
| LLM experiment designer (GPT-4 + RAG) | Done |
| Data quality validator | Done |
| Model registry (auto table creation) | Done |
| Feature extractor | Done |
| Feedback collector (webhook/API/batch) | Done |
| Schema discovery (ChromaDB) | Done |
| Athia event ingestion | Done |
| Model deployer → athena-platform | Partial — EFS export done; API integration manual |
Architecture — Multi-Agent Pattern
FastAPI + LangGraph · Stimulus-response orchestration · LLM backends: Claude (primary), GPT-4 (fallback)
Request → StimulusRegistry → OrchestratorWorkflow → Branch (DAG of Nodes)
→ AgentWorkflow (LangGraph StateGraph) → Response
11 stimuli: acceptance_rate_analysis · fraud_card_analysis · metrics_anomaly
user_question · data_analyst · researcher_assistance · deep_exploration
element_edition · knowledge_expert · strategy_generation · cost_optimization
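The stimulus-to-handler dispatch at the front of this flow can be sketched as a plain registry. The real system is FastAPI + LangGraph; the registry API below (register/dispatch) is an assumption for illustration, with one handler named after a real stimulus.

```python
# Minimal sketch of a stimulus registry — not the actual Athia API.
class StimulusRegistry:
    def __init__(self):
        self._handlers = {}

    def register(self, stimulus):
        """Decorator that binds a handler function to a stimulus name."""
        def wrap(fn):
            self._handlers[stimulus] = fn
            return fn
        return wrap

    def dispatch(self, stimulus, payload):
        try:
            handler = self._handlers[stimulus]
        except KeyError:
            raise ValueError(f"unknown stimulus: {stimulus}")
        # In the real flow this hands off to OrchestratorWorkflow -> branch DAG.
        return handler(payload)

registry = StimulusRegistry()

@registry.register("acceptance_rate_analysis")
def acceptance_rate_analysis(payload):
    return {"stimulus": "acceptance_rate_analysis",
            "merchant": payload.get("merchant")}
```

The missing `retry_optimization_requested` stimulus (P-05 gap) would slot in as one more registered handler.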
End-to-End Training Flow
POST /training/run/processor_selector → Schema Discovery (ChromaDB index) → Data Quality Validation (temporal bias · class balance · outlier · concept drift) → LLM Training Orchestrator (GPT-4 + RAG → RETRAIN_NOW / SCHEDULED / SKIP) → LLM Experiment Designer (7 experiments, simple→complex) → Snowflake ML Training (LR, RF, XGBoost · 80/20 temporal split) → Best model selected by F1 score → Export to EFS + athena-platform payload prepared → Results stored in ML_TRAINING_RUNS
Training Pipeline Architecture
Pipeline flow: Training Decision → Data Prep → Feature Extraction → Validation → Experiment Design → Training → Model Selection → Deployment
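The step chain above can be expressed as composable functions passing a shared state. These are stubs standing in for the real services named in the table below; the dict-based state and the placeholder F1 scores are purely illustrative.

```python
# Stub pipeline mirroring the flow: decision -> features -> train/select.
def training_decision(state):
    # LLMTrainingOrchestrator stand-in: RETRAIN_NOW / SCHEDULED / SKIP.
    state["decision"] = "RETRAIN_NOW"
    return state

def extract_features(state):
    # FeatureExtractor stand-in: creates a training dataset view.
    state["dataset"] = f"TRAINING_DATASET_{state['model_type'].upper()}"
    return state

def train_and_select(state):
    # Placeholder (model, F1) results; best model is selected by F1 score.
    runs = [("lr", 0.71), ("rf", 0.78), ("xgboost", 0.83)]
    state["champion"] = max(runs, key=lambda r: r[1])[0]
    return state

def run_pipeline(model_type):
    state = {"model_type": model_type}
    for step in (training_decision, extract_features, train_and_select):
        state = step(state)
        if state.get("decision") == "SKIP":
            break   # the orchestrator can short-circuit the whole run
    return state
```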
Services
| Service | Purpose |
|---|---|
| TrainingPipeline | Full training execution (plan/run/deploy) |
| LLMTrainingOrchestrator | LLM + RAG decision engine (RETRAIN_NOW / SCHEDULED / SKIP) |
| LLMExperimentDesigner | Designs 5–10 experiments using GPT-4/Claude with RAG |
| ModelDeployer | Exports to EFS, registers deployment, creates canary config |
| TrainingPlanner | Dry-run mode ("terraform plan" for ML) |
| FeatureExtractor | Auto-extracts features, creates training dataset views |
| DataQualityValidator | Schema, statistical, temporal bias, drift validation |
| FeedbackCollector | Webhook, API polling, batch feedback collection |
| ModelRegistry | Model CRUD, prediction/feedback schema management |
| SchemaDiscovery | Auto-discovers training tables via LLM |
| LLMProvider | Unified Claude/GPT-4 interface with auto-fallback |
API Endpoints
| Endpoint | Purpose |
|---|---|
| POST /api/v1/training/plan/{model_type} | Dry-run plan |
| POST /api/v1/training/run/{model_type} | Execute training |
| POST /api/v1/training/decision/{model_type} | LLM decision |
| POST /api/v1/experiments/design/{model_type} | Design experiments |
Remaining Gaps
- No retry LLM workflow — retry_optimization_requested stimulus still missing (P-05)
- Strategy Director matcher/ranker still incomplete — exit() placeholder in matcher (P-02)
- Deployment integration incomplete — model_deployer.py exports to EFS but does NOT call the athena-platform API (~1.5d left)
- No data lineage tracking — TrainingDatasetVersion not implemented (G-10)
- No rollback capability (G-14 partial — shadow mode + is_default merged to main; rollback API not built)
- SQL injection risks in schema_discovery.py, training_planner.py, athia_ingestion.py
- No circuit breakers — LangGraph node failures still cascade
- Fragile LLM JSON parsing — all services extract JSON by searching for braces (not robust)
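For the fragile JSON parsing, a sturdier approach than slicing between the first and last brace is to let the JSON decoder itself find a complete object. A minimal sketch of that fix, using only the standard library:

```python
import json

def extract_json(llm_text):
    """Return the first complete JSON object embedded in LLM output.
    Walks candidate '{' positions and uses raw_decode, which succeeds only
    on a well-formed object — unlike naive brace searching."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(llm_text):
        if ch != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(llm_text, start)
            return obj
        except json.JSONDecodeError:
            continue   # stray brace in prose; try the next candidate
    raise ValueError("no JSON object found in LLM output")
```

This tolerates stray braces in surrounding prose and fails loudly instead of returning truncated fragments.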
🧪 Testing Coverage
| Layer | Coverage | Notes |
|---|---|---|
| Metrics layer | 70–80% | Well tested — 5 files |
| Deployment / config | 50–60% | Validation tests solid |
| Model registry (new) | ~60% | Happy-path CRUD; no error cases |
| Schema discovery (new) | ~45% | Discovery + semantic search; no edge cases |
| Experiment designer (new) | ~40% | RAG + ChromaDB; LLM-dependent, not mocked |
| Training orchestrator (new) | ~35% | Happy-path; external deps not mocked |
| Feedback collector (new) | ~35% | Batch/polling skipped; no duplicates |
| Route handlers | ~0% | Still largely untested |
| Multi-agent core + 11 branches | ~5% | 1 manual test — not in pytest |
| Lambda / AgentCore entrypoints | 0% | 4 entrypoints — no tests |
Maturity: 3/5 — Moderate. ML training services are well-architected with growing test coverage after merge. Tests are still integration-only, depend on live Snowflake + OpenAI, and cover happy-path only. Multi-agent core still a black box. Not suitable for CI without mocking.
athena-platform
github.com/DUNA-E-Commmerce/athena-platform
✅ Key Finding (Updated 2026-03-24)
PR #215 — Volaris TF smart router sidecar. 3 TF SavedModels (processor_selector, retry_predictor, retry_sequence), 140-feature encoder, Go API integration with feature flag. 221 Python + 13 Go tests. Now the single serving repo — training migrated to DATA-Athena-Snowflake. Still at v0.15.5 with Triton IS, shadow mode, ExperimentService, Grafana CloudWatch dashboard, and auto versioning.
✅ Triton Branch — Merged to Main
G-06 deployment gap nearly closed (~1.5d remaining). Provides partial G-14 rollback via shadow mode. Now in main: Triton IS sidecar, ModelConversionManager (sklearn→Triton), ServiceWithTriton (readiness checks), ExperimentService (one-call experiment creation), shadow mode seed experiments for all 3 model types, max_variants constraint. 32/32 new tests passing. Active branches in review: feat/new-encoder-v2, feature/experiment-api-cleanup.
ML Inference Types (already in registry)
| Type | Maps To |
|---|---|
| processor_selector | P-03 |
| retry_predictor | P-05 |
| retry_sequence | P-05 |
| installment_optimizer | Out of scope |
Snowflake Tables
| Table | Status |
|---|---|
| ATHIA_PREDICTIONS | Active |
| ATHIA_FEEDBACK | Active |
| ATHIA_TRAINING_DATASET | Active |
| ATHIA_EXPERIMENT_LIFT | Active |
| ATHIA_STAGE_OUTCOMES | Deployed (feat/ATH-0000) |
| ATHIA_SESSION_SUMMARY | Deployed (feat/ATH-0000) |
| ATHIA_MULTI_STAGE_ANALYSIS | New (feat/ATH-0000) |
| ATHIA_MODEL_METRICS | New (feat/ATH-0000) |
| ML_MODEL_REGISTRY | New (feat/ATH-0000) |
Architecture — Clean Architecture (Go/Gin)
PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS · mTLS enforced on /api/v1/ml/predict/*
REST Handlers (V1 + V2, Gin) ← mTLS on /ml/predict/*
↓
Controllers (~30 implementations)
↓
Domain Services (44 packages) ← constructor injection throughout
↓
Repositories (43 GORM implementations) ← in-memory SQLite for tests
↓
PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS (model storage)
A/B Experimentation — Auto-Winner Guardrails
Stats
- p-value < 0.05
- Min 1000 samples/variant
- Min 7 days runtime
Lift
- Min 1% absolute lift
Deterministic bucketing
- SHA256(transaction_id)
Guardrails
- ≤10% latency regression
- ≥−5% revenue regression
- Dry-run mode (safe default)
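The bucketing and guardrail rules above are mechanical enough to sketch directly. The threshold values come from this document; the function shapes are assumptions about the real ExperimentService, not its API.

```python
import hashlib

def assign_variant(transaction_id, variants=("control", "treatment")):
    """Deterministic bucketing: SHA256(transaction_id) mod #variants,
    so the same transaction always lands in the same variant."""
    digest = hashlib.sha256(transaction_id.encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def can_declare_winner(p_value, samples_per_variant, days_running,
                       abs_lift, latency_regression, revenue_delta):
    """Auto-winner check: every statistical and guardrail condition must hold."""
    return (p_value < 0.05                  # statistical significance
            and samples_per_variant >= 1000  # min samples per variant
            and days_running >= 7            # min runtime
            and abs_lift >= 0.01             # >=1% absolute lift
            and latency_regression <= 0.10   # <=10% latency regression
            and revenue_delta >= -0.05)      # revenue may not drop >5%
```

In dry-run mode (the safe default), a passing check would only be logged, never acted on.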
🧪 Testing Coverage
| Layer | Coverage | Notes |
|---|---|---|
| Domain services (44) | 44/44 | All have test files |
| Repositories (43) | ~43/43 | In-memory SQLite isolation |
| V1 REST handlers (18) | 15/18 (83%) | agent, workspaces, elements missing |
| Auth middleware | Tested | JWT + API key covered |
| V2 REST handlers (18) | 0/18 (0%) | Entire new API version untested |
| Bedrock client | 0% | Excluded from coverage config |
| auth, bedrock, element, workspace services | 0% | 4 domain services with no tests |
| Bootstrap / DI graph | Skipped | TODO: testcontainers |
Maturity: ~4/5 — Strong. Domain and repository layers well covered. Triton merge adds 32 new tests. V2 API (18 handlers) and Bedrock ML inference path are still untested. CI threshold is only 20% and internal/clients/ is excluded from coverage entirely.
Testing Comparison — Both Repos
| Metric | DATA-Athena-Snowflake | athena-platform |
|---|---|---|
| Test functions | 450+ (+6 new integration files) | 777 |
| Test files | 34 (28 + 6 new) | 126 |
| Est. coverage | ~20% | ~25–30% |
| Core product tested? | No — multi-agent core untested; new ML tests need live deps | Partial — V2 + Bedrock missing |
| CI enforced? | Partial (fragmented) | Yes — every PR |
| Coverage threshold | None enforced | 20% (too low; target: 65%) |
| Maturity | 3/5 — Moderate | ~4/5 — Strong |
Timeline (2 engineers)
Full delivery: ~2 weeks (was 5–6 wks before savings)
MVP (Stages 1–2 only): ~1 week
1 engineer: ~3 weeks
Build Order
Stage 1 must complete before Stage 2. Stages 3–5 can overlap with late Stage 2.
Triton branch merged to main — deployment architecture confirmed.
Total Effort
~15.5d remaining
of ~104.5d original · ~89d saved
Progress Summary — 2026-03-24
| Stage | Gap | Category | Priority | Original | Status |
|---|---|---|---|---|---|
| 1 – Foundation | G-02 Orchestration | Infrastructure | High | 8d | Nearly Done (~1d left) — ~88% complete. SageMaker + Spacelift deployed, daily schedules active |
| 1 – Foundation | G-08 Feature Store | ML Infra | High | 13.5d | Nearly Done (~3.5d left) — ~74% complete, 140-feature encoder in sidecar |
| 1 – Foundation | G-04 Data Validation | Data Quality | High | 7d | Done ✓ |
| 2 – Automation | G-03 CI/CD Pipeline | DevOps | High | 9d | Partial (~2d left) — Quality gates done. Spacelift + GitHub Actions CI/CD. Integration tests partial. Staging/prod TODO |
| 2 – Automation | G-06 Deployment Automation | Automation | High | 7.5d | Done ✓ |
| 2 – Automation | G-07 Model Registration | Automation | Medium | 5.5d | Done ✓ |
| 2 – Automation | G-01 Automated Retraining | Automation | High | 10d | Done ✓ |
| 3 – Governance | G-13 Versioning Workflow | Governance | High | 5d | Done ✓ |
| 3 – Governance | G-10 Lineage Tracking | Governance | Medium | 6.5d | Partial (~4d left) — ~35% complete. ClearML online tracking at athia-ml.dev.deuna.io, pipeline→training task hierarchy |
| 3 – Governance | G-14 Rollback Capability | Reliability | High | 5d | Nearly Done (~0.5d left) — ~90% complete, rollback API pending |
| 4 – Observability | G-05 Model Monitoring | Observability | High | 8d | Done ✓ |
| 4 – Observability | G-09 Drift Detection | Observability | Medium | 7d | Done ✓ |
| 5 – ML Quality | G-11 Hyperparameter Tuning | ML Quality | Medium | 5.5d | Done ✓ |
| 5 – ML Quality | G-12 Algorithm Comparison | ML Quality | Medium | 7d | Done ✓ |
Engagement Summary
Full implementation of Athia AI/ML smartrouting for Deuna — Volaris merchant. 7 delivery phases covering P-01 (outage failover), P-03 (per-transaction routing), P-04 (message manipulation), and P-05 (retry optimization). ~61.5 person-days of prior Deuna codebase work reduces original scope significantly.
Aidaptive Team — Roles & Responsibilities
| Name | Role | Responsibilities | Days |
|---|---|---|---|
| Rakesh | Project Lead & Strategy | Client coordination (Pablo, Israel), architecture decisions, Phase 6 oversight, post-launch review | 5.5d |
| Naoki | Solutions Architect | athena-platform Go dev, outage detection, message manipulation API, model serving (Triton), CI/CD, Phase 6 integration | 14.6d |
| Rene | ML Engineer | Feature engineering, model training (processor_selector, retry_predictor, retry_sequence), data quality, drift detection, retraining pipeline | 15d |
| Kedar | Data & Backend Engineer | Snowflake EDA, data pipelines, training datasets, feature feeds, Grafana dashboards, monitoring | 16d |
| Total | | | ~51.1d |
Effort by Phase
| Phase | Focus | Owners | Days | Milestone |
|---|---|---|---|---|
| 1 — Discovery & EDA | Understand Volaris data | Kedar (4.5d) · Rene (2d) · Rakesh (0.5d) · Naoki (0.5d) | 7.5d | Kick-off (20%) |
| 2 — Feature Engineering | Build ML feature set | Rene (4d) · Kedar (3d) · Naoki (1d) · Rakesh (0.5d) | 8.5d | Phase 2 complete (20%) |
| 3 — Model Development | Train P-03 + P-05 models | Rene (6d) · Kedar (2d) · Naoki (1d) | 9d | Phase 3 complete (20%) |
| 4 — Outage Detection | P-01: failover for 4 PSPs | Naoki (4.5d) · Rakesh (1d) · Kedar (1d) | 6.5d | Phase 6 complete (30%) |
| 5 — Message Manipulation | P-04: CIT/MIT experiment | Naoki (2d) · Rene (1.5d) · Kedar (1.5d) · Rakesh (0.5d) | 5.5d | |
| 6 — Platform Integration | Register models, wire Deuna +G-06 close | Naoki (5.1d) · Rakesh (2d) · Kedar (1d) | 8.1d | |
| 7 — Monitoring & Feedback | Dashboards, retraining, review | Kedar (3d) · Rene (1.5d) · Rakesh (1d) · Naoki (0.5d) | 6d | Phase 7 complete (10%) |
| Total | | | ~51d | |
Delivery Timeline (6-week plan)
| Week | Days | Phases Active | Who | Key Milestone |
|---|---|---|---|---|
| Week 1 | 1–5 | Phase 1 (EDA) · Phase 2 start Day 3 | Kedar · Rene · Rakesh (Day 1) | EDA complete; feature schema draft |
| Week 2 | 6–10 | Phase 2 (Features) · Phase 3 start Day 8 | Rene · Kedar · Naoki | Feature set locked; training dataset built |
| Week 3 | 11–15 | Phase 3 (Models) · Phase 4 (Outage) parallel | Rene (models) · Naoki (outage) | Models packaged; outage detection built |
| Week 4 | 16–20 | Phase 4 tail · Phase 5 (CIT/MIT) · Phase 6 prep | Naoki · Rene · Kedar · Rakesh | API contract with Deuna eng signed |
| Week 5 | 21–25 | Phase 6 (Integration) | Naoki · Rakesh | ⚠ Triton branch must be merged by Day 18 · Integration live in shadow mode |
| Week 6 | 26–30 | Phase 7 (Monitoring & Review) | Kedar · Rene · Rakesh | Dashboards live · retraining scheduled · post-launch report |
Critical path: Phases 1–2 sequential. Phases 3–5 can run in parallel. Phase 6 requires (a) models complete, (b) Triton branch merged, (c) 1-week Deuna engineering lead time for API contract. Phase 7 requires Phase 6 live.
Assumptions
- Snowflake access (PAYMENT_ML) remains available read-only
- Deuna eng available for API contract in Week 4 (Pablo / Israel)
- Triton branch merged to main by end of Week 3
- Staging environment available for Phase 6 integration tests
- ATHIA_PREDICTIONS + ATHIA_FEEDBACK remain live throughout
Success Criteria
- processor_selector live for ≥1 Volaris PSP
- ≥1% absolute approval rate lift (A/B test at significance)
- ≥5% retry success rate improvement vs. baseline
- PSP failover within 1 routing cycle of threshold breach
- p95 latency <200ms end-to-end (model inference <50ms)
- 48h shadow run complete with documented comparison
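Verifying the p95 < 200ms criterion is a one-liner once latencies are collected. A standard-library sketch, with made-up sample latencies:

```python
from statistics import quantiles

def p95(latencies_ms):
    """95th percentile of observed latencies (milliseconds).
    quantiles(n=20) yields 19 cut points; index 18 is the 95% point."""
    return quantiles(latencies_ms, n=20)[18]

# Illustrative end-to-end latency samples — not real measurements.
samples = [40, 55, 60, 62, 70, 75, 80, 90, 110, 150] * 10
assert p95(samples) < 200   # success criterion from above
```

The same check with a 50ms budget would apply to the model-inference slice alone.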
🔧 Architecture Update (2026-03-13): TensorFlow Ecosystem + 7 Service Shells
Adopted TensorFlow ecosystem (tf.keras, TFX, TF Serving, TFDV, TFMA, TF Transform) for all ML work. Added Phase 0 with 11 new tasks: 3 design tasks (Rakesh) + 8 service shell scaffolds. Replaces Snowflake ML / XGBoost / scikit-learn.
Data Pipelines · Feature Service · Training Pipelines · Model Management · Eval Service · Evaluation Framework · Experiment System
V-D01: Service architecture · V-D02: API contracts · V-D03: TF ecosystem integration plan
65 tasks (was 54) · ~64d total (was ~49.5d) · Phase 0 adds ~14.5d · 3 engineers ~5–6 weeks
Service Architecture & Shell Setup
Design 7 service boundaries (Rakesh), define API contracts, scaffold all service shells using TensorFlow ecosystem: Data Pipelines (TFX), Feature Service (TF Transform), Training Pipelines (TFX Trainer), Model Management, Eval Service (TFMA), Evaluation Framework, Experiment System.
📐 V-D01–D03 (Rakesh design) + V-S01–S08 (team scaffolding)
Discovery & EDA
Understand Volaris transaction data — approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, sample size for A/B test.
Feature Engineering
Card BIN/brand, transaction context, user RFM, retry history, rolling processor health scores, Amex hard-rule bypass, training dataset build.
Model Development
Train processor_selector, retry_predictor, retry_sequence for Volaris 4 PSPs using tf.keras. DNN vs. wide-and-deep vs. TF Decision Forests comparison. TFMA per-slice evaluation.
🔧 TensorFlow ecosystem: tf.keras training via Training Pipeline service, TFMA evaluation via Eval Service, SavedModel export via Model Management.
Outage Detection
Rolling health score per PSP, failover to next-best Volaris processor, recovery detection via 1–2% sampling, alerts on state changes.
Message Manipulation
CIT/MIT audit for Volaris, approval delta by toggle × processor × card type, experiment design, new athena-platform endpoint, A/B test.
Platform Integration
Register models in athena-platform, create Volaris-scoped experiment, API contract with Deuna eng, shadow mode validation before live traffic.
✅ Triton branch: ExperimentService one-call API (V-39–41) + built-in shadow mode (V-46) reduce effort by ~1d.
Monitoring & Feedback Loop
Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake
Done ✓ — confirmed data live in Deuna's Snowflake (2026-02-24)
Deploy ATHIA_STAGE_OUTCOMES table
Done ✓ — deployed in feat/ATH-0000
Approval rate + model performance Grafana dashboards
Retraining trigger, scheduled pipeline, auto-winner, post-launch review
Ready ✓ — LLM orchestrator + training pipeline built
All 65 Tasks
| # | Task | Phase | Owner | Effort | Status |
|---|---|---|---|---|---|
| V-D01 | Design overall service architecture — 7 service boundaries, data flow, inter-service communication | 0 – Architecture | Rakesh | 2d | Design |
| V-D02 | Define API contracts for all 7 services — OpenAPI specs, error handling, versioning | 0 – Architecture | Rakesh | 1.5d | Design |
| V-D03 | Design TensorFlow ecosystem integration — map TFX components to services, TF Serving format, TFDV/TFMA | 0 – Architecture | Rakesh | 1d | Design |
| V-S01 | Scaffold Data Pipeline service — TFX ExampleGen + StatisticsGen, Snowflake adapter, TFDV schema | 0 – Shell | Kedar | 1.5d | |
| V-S02 | Scaffold Feature Service — TF Transform preprocessing_fn, feature store API, real-time endpoint | 0 – Shell | Rene | 1.5d | |
| V-S03 | Scaffold Training Pipeline service — TFX Trainer + tf.keras, Keras Tuner, training history | 0 – Shell | Rene | 1.5d | |
| V-S04 | Scaffold Model Management service — registry CRUD, SavedModel storage, lifecycle, version comparison | 0 – Shell | Naoki | 1d | |
| V-S05 | Scaffold Eval Service — TFMA integration, per-slice metrics, model blessing/rejection API | 0 – Shell | Rene | 1d | |
| V-S06 | Scaffold Evaluation Framework — A/B stat engine, winner detection, latency/revenue guardrails | 0 – Shell | Naoki | 1.5d | |
| V-S07 | Scaffold Experiment System — experiment CRUD, traffic splitting, variants, shadow mode orchestration | 0 – Shell | Naoki | 1.5d | |
| V-S08 | Set up shared TF dependencies — tensorflow, tfx, tf-transform, tfma, tfdv, keras-tuner + Docker base | 0 – Shell | Kedar | 0.5d | |
| V-01 | Filter Volaris transactions — date range, volume, monthly trend | 1 – EDA | Kedar | 0.5d | |
| V-02 | Per-processor approval rates (Worldpay, MIT, Elavon, Amex) by card type, currency, amount | 1 – EDA | Kedar | 1d | |
| V-03 | Retry pattern analysis — attempts per order, processor retry-to, 1st/2nd/3rd attempt success rates | 1 – EDA | Kedar | 1d | |
| V-04 | Explore DYNAMIC_ROUTING_DETAIL JSON — extract all keys and values | 1 – EDA | Kedar | 1d | |
| V-05 | Map Volaris routing rules from VW_ROUTING_MERCHANT_RULE* views | 1 – EDA | Kedar | 0.5d | |
| V-06 | Analyze smart routing log — algorithm types, skip rates, p95 latency baseline | 1 – EDA | Kedar | 0.5d | |
| V-07 | Hard vs. soft decline distribution by processor and error code | 1 – EDA | Rene | 1d | |
| V-08 | Profile airline-specific features — flight, passenger, booking window signal | 1 – EDA | Rene | 0.5d | |
| V-09 | A/B test sample size check — daily volume per processor ≥ 1000/variant in 7 days? | 1 – EDA | Rene | 0.5d | |
| V-10 | EDA summary report — approval rates, error taxonomy, processor share, correlations | 1 – EDA | Rene + Rakesh | 1d | |
| V-11 | Define Volaris feature schema — all features, types, sources, compute latency | 2 – Features | Rene + Naoki | 1d | |
| V-12 | Card-level features — BIN, brand, bank, type, country; historical approval rate per BIN × processor | 2 – Features | Rene | 1d | |
| V-13 | Transaction-level features — amount, currency, CIT/MIT, MCC, flight order type | 2 – Features | Rene | 1d | |
| V-14 | User-level features — RFM, fraud rate cohort, tenure, browsing signals | 2 – Features | Rene | 0.5d | |
| V-15 | Retry-context features — previous processor, error code, time since attempt, attempt number | 2 – Features | Kedar | 1d | |
| V-16 | Processor-state features — rolling approval/timeout/decline rate at 15-min, 1h, 24h windows | 2 – Features | Kedar | 1.5d | |
| V-17 | Amex hard-rule — always route Amex cards to Amex processor; bypass ML | 2 – Features | Naoki | 0.5d | |
| V-18 | Build training dataset — join features onto labeled outcomes; train/val/test split | 2 – Features | Kedar | 1d | Ready ✓ ATHIA_TRAINING_DATASET view + feature_extractor.py |
| V-19 | Feature quality validation — nulls, skew, leakage risk, outcome correlation | 2 – Features | Rene | 1d | Ready ✓ data_quality_validator.py (834 lines) |
| V-20 | Train processor_selector v1 — rank 4 PSPs by approval probability (tf.keras DNN) | 3 – Models | Rene | 2d | TF Training Pipeline service |
| V-21 | Evaluate processor_selector — AUC, lift vs. static rules, per-processor accuracy, latency | 3 – Models | Rene | 1d | Ready ✓ Metrics auto-calculated by pipeline |
| V-22 | Train retry_predictor v1 — predict retry approval probability | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_predictor type |
| V-23 | Train retry_sequence v1 — optimal processor order for retry | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_sequence type |
| V-24 | Evaluate retry models — success rate lift, processor fatigue patterns | 3 – Models | Rene | 1d | Ready ✓ Evaluation framework in pipeline |
| V-25 | Architecture comparison — DNN vs. wide-and-deep vs. TF Decision Forests; select champion | 3 – Models | Rene | 1d | TF Replaces XGBoost vs. LR comparison |
| V-26 | Inference latency test — all models under 50ms budget | 3 – Models | Naoki | 0.5d | |
| V-27 | Package models — serialize, write model card (schema, features, metrics) | 3 – Models | Kedar | 0.5d | Ready ✓ model_registry.py auto-creates tables + stores metadata |
| V-28 | Define outage signal — timeout/error code thresholds for PSP-down detection | 4 – P-01 | Rakesh + Naoki | 1d | |
| V-29 | Rolling processor health score — sliding 5–15 min window per PSP | 4 – P-01 | Naoki | 1.5d | |
| V-30 | Failover logic — skip degraded PSP, route to next-best Volaris processor | 4 – P-01 | Naoki | 1.5d | |
| V-31 | Recovery detection — 1–2% sampling of down PSP; auto-restore on consecutive wins | 4 – P-01 | Naoki | 1d | |
| V-32 | Outage simulation tests — inject failures per PSP; verify failover + recovery | 4 – P-01 | Naoki | 1d | |
| V-33 | Outage alerting — Slack/PagerDuty on PSP state changes | 4 – P-01 | Kedar | 0.5d | |
| V-34 | Audit CIT/MIT usage for Volaris — current distribution across PSPs | 5 – P-04 | Kedar | 0.5d | |
| V-35 | Approval delta by CIT vs MIT per processor — statistical test | 5 – P-04 | Rene | 1d | |
| V-36 | Design message manipulation experiment — CIT/MIT × processor × card type matrix | 5 – P-04 | Rene + Rakesh | 1d | |
| V-37 | Implement message recommendation API in athena-platform | 5 – P-04 | Naoki | 2d | |
| V-38 | Run A/B test — approval rate with vs. without message recommendations | 5 – P-04 | Kedar | 1d | |
| V-39 | Register processor_selector in MODEL_ARTIFACTS (version, Triton backend ref, feature schema) | 6 – Integration | Naoki | 0.3d | Ready ✓ POST /api/v1/ml/models (Triton branch ExperimentService) |
| V-40 | Register retry_predictor + retry_sequence in MODEL_ARTIFACTS | 6 – Integration | Naoki | 0.3d | Ready ✓ Same — ExperimentService handles all 3 model types |
| V-41 | Create Volaris-scoped experiment — merchant filter, 10% treatment split, shadow mode, guardrails | 6 – Integration | Naoki | 0.5d | Ready ✓ POST /api/v1/ml/experiments — variants + models in one call (Triton branch) |
| V-42 | Validate experiment assignment — SHA256 bucketing determinism for Volaris | 6 – Integration | Naoki | 0.5d | |
| V-43 | API contract with Deuna engineering — define POST /api/v1/ml/predict request/response for Volaris | 6 – Integration | Rakesh | 1d | |
| V-44 | Deuna payment service integration — Deuna calls athena-platform at routing decision point | 6 – Integration | Rakesh + Naoki | 2d | |
| V-45 | End-to-end integration test — full flow: Deuna → athena-platform → model → ranked PSPs | 6 – Integration | Naoki + Kedar | 1d | |
| V-46 | Shadow mode — 48h logging without acting; compare predicted vs. actual outcomes | 6 – Integration | Kedar | 0.5d | Ready ✓ is_shadow_mode=true built-in (Triton branch); set up + monitor only |
| V-47 | Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake | 7 – Monitoring | Kedar | 0.5d | Done ✓ Confirmed data live in Deuna's Snowflake (2026-02-24) |
| V-48 | Deploy ATHIA_STAGE_OUTCOMES table in Snowflake | 7 – Monitoring | Kedar | 0.5d | Done ✓ Deployed in feat/ATH-0000 SQL |
| V-49 | Volaris approval rate dashboard — daily/hourly per PSP vs. baseline | 7 – Monitoring | Kedar | 1d | |
| V-50 | Model performance dashboard — prediction confidence, rank accuracy, retry lift | 7 – Monitoring | Kedar | 1d | |
| V-51 | Define retraining trigger — approval rate drop or AUC drop thresholds | 7 – Monitoring | Rene | 0.5d | Ready ✓ llm_training_orchestrator.py makes RETRAIN_NOW / SCHEDULED / SKIP decisions |
| V-52 | Schedule weekly retraining — auto-register new version from latest ATHIA_TRAINING_DATASET | 7 – Monitoring | Rene | 1d | Ready ✓ training_pipeline.py + orchestrator built; configure for Volaris cadence |
| V-53 | Confirm auto-winner worker runs for Volaris experiment with correct guardrails | 7 – Monitoring | Naoki | 0.5d | |
| V-54 | Post-launch review — 2-week lift analysis: approval rate, outage response, retry success | 7 – Monitoring | Rakesh | 1d | |
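Tasks V-28 through V-31 describe the P-01 loop: score each PSP over a sliding window, skip degraded PSPs at routing time, and restore them once sampled traffic recovers. A minimal sketch of the health-score and failover pieces follows; the class/function names, the 15-minute window, and the 80% degraded threshold are illustrative assumptions, not the production values (those come out of V-28).

```python
from collections import deque
import time

class ProcessorHealth:
    """Sliding-window approval-rate score for one PSP (V-29 sketch)."""

    def __init__(self, window_s=900.0, degraded_below=0.80):
        self.window_s = window_s          # e.g. a 15-minute sliding window
        self.degraded_below = degraded_below
        self.events = deque()             # (timestamp, approved: bool)

    def record(self, approved, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, approved))
        self._evict(now)

    def score(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.events:
            return 1.0                    # no recent data: assume healthy
        return sum(a for _, a in self.events) / len(self.events)

    def is_degraded(self, now=None):
        return self.score(now) < self.degraded_below

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

def rank_healthy(psps, health, now=None):
    """Failover step (V-30 sketch): drop degraded PSPs, best score first."""
    live = [p for p in psps if not health[p].is_degraded(now)]
    return sorted(live, key=lambda p: health[p].score(now), reverse=True)
```

Recovery detection (V-31) would sit on top of this: keep routing a 1–2% sample to a degraded PSP and clear its state after consecutive approvals.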
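V-42 validates that experiment assignment is deterministic under SHA256 bucketing. The idea can be sketched as follows; the function name, key format, and the 10% treatment split are assumptions for illustration (the split matches the V-41 experiment config, but the real hashing key is defined by the platform, not here).

```python
import hashlib

def assign_variant(subject_id, experiment_id, treatment_pct=0.10):
    """Deterministically bucket a subject into treatment or control.

    The same (experiment_id, subject_id) pair always hashes to the same
    bucket, so assignment is stable across requests and replicas, and
    roughly `treatment_pct` of subjects land in treatment.
    """
    digest = hashlib.sha256(f"{experiment_id}:{subject_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    return "treatment" if bucket < treatment_pct else "control"
```

Determinism is what makes shadow-mode comparison (V-46) and later lift analysis (V-54) trustworthy: a subject never flips buckets mid-experiment.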
🔴 Critical — Do These First
| Action | Repo | Effort |
|---|---|---|
| Build retry_optimization_requested stimulus — P-05 is entirely missing from LLM platform | DATA-Athena-Snowflake | 3d |
| Complete Strategy Director — replace `exit()` placeholder & dummy ranker prompts | DATA-Athena-Snowflake | 2d |
| Add tests for all 18 V2 REST handlers — 0% coverage on new API version | athena-platform | 4d |
| Add tests for Bedrock client + Bedrock domain service — production-critical, currently excluded | athena-platform | 1.5d |
| Add route, service & client layer tests — all 14 routes, 13 services, 3 clients at 0% | DATA-Athena-Snowflake | 7d |
| Remove internal/clients/ from coverage exclusions in CI | athena-platform | 0.5d |
- Unit test all route handlers with `FastAPI TestClient` + mocked services
- Unit test all 13 services — mock Snowflake sessions and clients
- Unit test core multi-agent framework: `AgentWorkflow`, `AgentStrategy`, node/edge composition
- Add per-branch tests for all 11 stimulus branches (mock LLM responses with fixtures)
- Unify CI into a single `pytest` run — replace fragmented per-domain workflows
- Enable `pytest-cov` with 60% minimum threshold enforced in CI
- Circuit breaker in `AgentWorkflow` — isolate node failures, prevent cascade
- Enable OpenTelemetry tracing — already in codebase, just commented out
- Replace hardcoded thresholds (15% drop, 60–80 min windows) with configurable params
- Add LLM prompt injection guards — sanitize user inputs before system prompts
- Standardize tool definition — unify `@create_tool` vs. manual; add versioning
- Centralize config — replace scattered `load_dotenv` calls with a Pydantic Settings schema
- Add tests for all 18 V2 handlers — entire new API version at 0%
- Test Bedrock client & service — excluded from coverage, production-critical
- Raise CI threshold 20% → 60%; remove `internal/clients/` exclusion
- Bootstrap integration test with `testcontainers-go` — verify DI graph
- Benchmark tests for `/ml/predict`, `/feedback`, experiment assignment
- Contract tests for Snowflake & Bedrock APIs — catch schema drift early
- Event-driven model registry cache invalidation — remove 24h stale-assignment risk
- Experiment context middleware — auto-propagate session/experiment IDs per request
- Abstract `*gin.Context` from controllers — transport-agnostic, easier to test
- Deploy `ATHIA_STAGE_OUTCOMES` + `ATHIA_SESSION_SUMMARY` Snowflake tables
- SageMaker model warm-up — cold starts can breach the p95 < 200ms target
- Production Grafana dashboards + alerts — config exists locally, not deployed
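The circuit-breaker item above can be sketched in a few lines. This is a hypothetical illustration, not the `AgentWorkflow` implementation: class name, thresholds, and reset timing are assumptions; the point is the state machine (closed → open on repeated failures → half-open after a cooldown) that stops one failing node from cascading.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for isolating a failing workflow node."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None = closed; timestamp = open

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Open: fail fast instead of hammering the broken node.
                raise RuntimeError("circuit open — node isolated")
            # Half-open: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

In the workflow setting, the `RuntimeError` path would route to a fallback node or a degraded response rather than propagating the original failure.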
Full Priority Order
| Priority | Action | Repo | Effort |
|---|---|---|---|
| Critical | Build retry optimization stimulus (P-05) | DATA-Athena-Snowflake | 3d |
| Critical | Complete Strategy Director matcher + ranker | DATA-Athena-Snowflake | 2d |
| Critical | Add tests for all 18 V2 REST handlers | athena-platform | 4d |
| Critical | Add tests for Bedrock client + service | athena-platform | 1.5d |
| Critical | Add route, service & client tests (all at 0%) | DATA-Athena-Snowflake | 7d |
| Critical | Remove internal/clients/ from coverage exclusions | athena-platform | 0.5d |
| High | Multi-agent framework + branch tests (11 branches) | DATA-Athena-Snowflake | 6d |
| High | Circuit breaker in AgentWorkflow | DATA-Athena-Snowflake | 2d |
| High | Enable OpenTelemetry tracing | DATA-Athena-Snowflake | 1.5d |
| High | Raise CI coverage threshold to 60% | athena-platform | 0.5d |
| High | Bootstrap integration test (testcontainers) | athena-platform | 1.5d |
| High | Event-driven model registry cache invalidation | athena-platform | 1.5d |
| High | Deploy ATHIA_STAGE_OUTCOMES + SESSION_SUMMARY tables | athena-platform | 1d |
| High | SageMaker model warm-up (latency target risk) | athena-platform | 1d |
| Medium | Adaptive thresholds (replace hardcoded values) | DATA-Athena-Snowflake | 2d |
| Medium | Experiment context middleware | athena-platform | 2d |
| Medium | Production Grafana dashboards + alert rules | athena-platform | 2d |
| Medium | Benchmark tests for hot endpoints | athena-platform | 1d |
| Medium | Unified CI test suite + coverage enforcement | DATA-Athena-Snowflake | 1.5d |
| Engineer | Work Done |
|---|---|
| Rakesh | Changed training pipelines to use GPU instances. Implemented deeper ClearML integration in training pipeline — extracting detailed metrics from each training run into ClearML for monitoring and debugging. |
| Engineer | Work Done |
|---|---|
| Rakesh | Spent ~30 hours debugging dev servers with ClearML integration — blocked by access issues. Worked with team to resolve access, dev pipeline now working correctly. Attempted DNN pipeline deployment — ran for 3 hours and failed. Long turnaround time makes iteration impractical without GPU. |
- GPU needed — training requires GPU instance; CPU-based runs too slow for practical iteration (3hr+ per attempt)
- AWS access for Rakesh — need direct access to AWS resources (console/CLI) to debug and iterate efficiently
| Engineer | Work Done |
|---|---|
| Rakesh | Implemented DNN pipeline Terraform and deploying in dev. Need to get PRs submitted and merged to main/qa so pipelines can run via CI/CD |
| Engineer | Work Done |
|---|---|
| Rakesh | Debugging why ClearML is not registering all metrics from each training run. Next: deploy DNN model and fix production integration to ensure every training run model reaches production automatically |
| Engineer | Work Done |
|---|---|
| Rakesh | Fixing Volaris training pipeline run in dev environment. Helping team with prod deployment |
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed e2e SageMaker dolphin pipeline (PreprocessData ✅, TrainModel ✅, EvaluateModel fix pushed). Deployed Volaris smart router daily training infra. ClearML integration confirmed working. Implemented Volaris pipeline CLI with Snowflake connection. Added 13 tests for setup_pipeline. Fixed multiple SageMaker issues: pipeline name mismatch, model.save(), RegisterModel, FrameworkProcessor for eval. |
- PR #1132 (DATA-Athena-Snowflake) needs approval to merge to `qa` — blocks CodePipeline deployment
- CodeDeploy DeployEC2 stage failing — scripts need debugging after merge
- SSO lacks `sagemaker:CreatePipeline` for local runs
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed e2e SageMaker training pipeline via Spacelift. Consolidated PRs #24+#25 → #26. Fixed Spacelift project_root, cleaned orphaned state. Enabled daily training for both pipelines. Set up ClearML creds in Secrets Manager + EC2. Created Spacelift stack for Volaris. Renamed model-artifacts → volaris-model-artifacts. |
| Engineer | Work Done |
|---|---|
| Rakesh | Fighting Terraform and Spacelift configuration issues. ClearML successfully deployed in dev environment |
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed ClearML using Terraform and Spacelift in dev environment and handed over to team. Full data analysis of Snowflake data with Rene — generated list of suggestions for team, published at metrics dashboard |
| Rene | Full data analysis of Snowflake data with Rakesh — generated list of suggestions for team |
| Engineer | Work Done |
|---|---|
| Rakesh | Deploying entire training platform end to end. Learning Spacelift for infra deployment — finally got access. Deploying ClearML to monitor all training. Wrote analysis script to monitor model performance driving metrics dashboard directly from Snowflake — better for analysis and generating insights |
| Rene | Analyzing model performance with past data. Waiting for experiment to be enabled again — current data not significant enough |
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzing model performance metrics, working on DNN model optimization for Volaris smart router |
| Engineer | Work Done |
|---|---|
| Rakesh | Training DNN model for Volaris smart router — also serves as example on how to use the TFX pipeline |
| Rene | Data analysis to identify patterns based on Volaris data |
| Engineer | Work Done |
|---|---|
| Rakesh | Verified everything working in production. Helping with questions about the model. Double-checked traffic ramp-up and analyzing how to do post-launch analysis |
| Engineer | Work Done |
|---|---|
| Rakesh | Integrated model in serving stack with sidecar approach, loading models from S3/EFS. Removed all parallel serving libraries no longer needed in training directory |
| Engineer | Work Done |
|---|---|
| Rene | Created and incorporated AMEX model, added to the serving mix |
| Rakesh | Built AMEX model with Rene. Working with Deuna team on deploying everything in production for Volaris |
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzing code with Naoki to determine serving approach — sidecar vs deploying servomatic. Building production flow to save and load models from S3 & deploying servomatic binary |
| Naoki | Evaluating sidecar vs servomatic deployment for integrating TF model with the Go API |
| Rene | Continuing to iterate on the TF model and data analysis. Converting LR model to TF format for use in servomatic binary for online eval |
| Engineer | Work Done |
|---|---|
| Rene | First regression model trained on Volaris data — evaluating quality in offline mode, initial results look promising |
| Rakesh | Continuing to analyze approach to serve the model as servomatic with Naoki |
| Naoki | Analyzing serving architecture to connect trained model via servomatic in production |
| Engineer | Work Done |
|---|---|
| Rene | Iterating on data shape analysis and building first model version |
| Rakesh | Working with Rene on first model; analyzing serving code with Naoki to plan production integration |
| Naoki | Analyzing serving code with Rakesh to determine how to connect model in production |
| Engineer | Work Done |
|---|---|
| Rakesh | First analysis of the data with Rene — analyzing best approach to build processor selector model |
| Rene | Started on first model based on current understanding of data and features |
| Naoki | Working with Rene on integrating S3 file loading into TFX data loader for e2e training and eval |
| Engineer | Work Done |
|---|---|
| Rene | Looking at data shape for Volaris to train first processor selector model |
| Kedar | Working on data pipeline |
| Naoki | Continuing to set up good practices (code quality, CI, testing patterns) |
| Rakesh | Writing smartrouter service |
| Engineer | Work Done |
|---|---|
| Rakesh | Iterated on experiment and metrics framework to make everything work locally and in tests |
| Naoki | Iterated on improving code and ramping up |
| Kedar | Looking at feature extraction from Snowflake |
| Rene | Working on simple first model |
| Engineer | Work Done |
|---|---|
| Rakesh | Added Evaluation Service (uses Model Service + Feature Service to evaluate TensorFlow models). Added e2e tests for all 3 services. Added experiment and metrics framework to track all training pipelines. Demo training pipeline working end to end. PR waiting for review |
| Engineer | Work Done |
|---|---|
| Rakesh | Built foundational services: Model Service and Feature Service with tests and scaffoldings to support TensorFlow trained models |
| Engineer | Work Done |
|---|---|
| Naoki | Fixed broken tests to get everything running locally. Looking at setting up automated deployment in dev environment for services |
| Kedar | Got repo and environment access figured out. Looking into Snowflake data schema |
| Rene | Got repo and environment access figured out. Looking at training pipeline code |
| Rakesh | Updated deuna.aidaptive.com with latest repo analysis and refreshed task list. Synced athena-platform (v0.15.5, Triton merged) |
Are ATHIA_PREDICTIONS / ATHIA_FEEDBACK tables populated in Deuna's Snowflake today?
Confirmed ✓ — data is live in Deuna's Snowflake (verified 2026-02-24).
Are SageMaker endpoints live for processor_selector / retry_predictor?
Or are they placeholders only? — Rakesh to confirm
Is there a live model in MODEL_ARTIFACTS that Deuna's payment service is calling today?
Rakesh to confirm
What is the current payment volume through the routing engine?
Minimum 1,000 transactions per variant needed for A/B test statistical validity — Ask Israel
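For context on that 1,000-per-variant guardrail: it is a platform minimum, not a power calculation. A standard two-proportion sample-size estimate (normal approximation) shows that detecting a small approval-rate lift needs considerably more traffic. The baseline and lift figures below are illustrative assumptions, not Volaris numbers.

```python
from statistics import NormalDist

def n_per_variant(p_control, p_treatment, alpha=0.05, power=0.80):
    """Required sample size per arm for a two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_control + p_treatment) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment)) ** 0.5) ** 2
    return num / (p_control - p_treatment) ** 2

# Detecting a 2pp lift (85% → 87% approval) at alpha=0.05, power=0.8
# needs roughly 4,700 transactions per variant — well above the
# 1,000-transaction guardrail floor.
```

So the volume question matters twice: the guardrail sets when the auto-winner may act at all, while the actual lift we expect sets how long the experiment must run.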
Who owns the athena-platform Go repo deployments?
Aidaptive or Deuna infra? Affects Phase 1 deployment planning — Clarify with Pablo
When will feature/llm-driven-ml-training (Triton IS) merge to main? New
This PR closes G-06 and defines the production model serving backend (Triton vs. SageMaker). Its merge timeline directly sets the Phase 6 integration schedule — ask Pablo.
| Item | Owner | Status |
|---|---|---|
| Snowflake access — Rakesh | Israel (Deuna) | ✓ Done (2026-02-18) |
| Snowflake access — Naoki | Rakesh + Naoki | ✓ Done (2026-02-19) |
| Code / repo access — Rakesh | Pablo (Deuna) | ✓ Done (2026-02-19) |
| Claude / LLM access & budget | Pablo → Farhan | ✓ Done (2026-02-19) |
| Code / repo access — Naoki | TBD | ✓ Done |
| Deuna corp accounts — Rakesh & Naoki | TBD | Pending |
| Claude Code credits — Rakesh & Naoki | — | Not needed |
| Deploy ATHIA_STAGE_OUTCOMES + ATHIA_SESSION_SUMMARY in Snowflake | Rakesh | ✓ Done (feat/ATH-0000) |
| Build retry_optimization_requested workflow | Rakesh | Pending |
| Service | URL | Details |
|---|---|---|
| AWS Console (SSO) | deunaio.awsapps.com | Deuna AWS account access |
| Snowflake | vltaxpw-rmontes.snowflakecomputing.com | Account: VLTAXPW-RMONTES · DB: PAYMENT_ML · Warehouse: PAYMENT_ML · Read-only |
| Athia Experiments Dashboard | insights.deuna.com | Model performance data for processor selector experiments |
| ClearML (Prod) | athia-ml.deuna.io | ML experiment tracking & training monitoring — production |
| ClearML (Dev) | athia-ml.dev.deuna.io | ML experiment tracking & training monitoring — dev environment |
| Spacelift | duna-e-commmerce.app.spacelift.io | Infrastructure governance & Terraform deployment |
| Terraform Repo | github.com/DUNA-E-Commmerce/terraform-athia | All Athia infrastructure as code |
| Rule | Details |
|---|---|
| AWS Resource Tags | All AWS resources must include: CreatedBy=aidaptive, ServiceName=smartrouter, Environment=POC |
| Infrastructure as Code | All infrastructure via Terraform only — no manual AWS console resource creation |
| Date | Decision | Rationale | Made By |
|---|---|---|---|
| 2026-03-24 | Serving migrated from DATA-Athena-Snowflake to athia-model-server sidecar in athena-platform | Clean separation — training repo (Python) vs serving repo (Go + Python sidecar) | Rakesh |
| 2026-03-13 | Adopted TensorFlow ecosystem (TFX, TF Serving, TFDV, TFMA, TF Transform) for all ML work | Replaces Snowflake ML / XGBoost / scikit-learn. Unified training → validation → serving pipeline with production-grade tooling | Rakesh |
| 2026-03-13 | Added Phase 0 — 7 service shells + 3 design tasks (Rakesh) before Volaris feature work | Service architecture: Data Pipelines, Feature Service, Training Pipelines, Model Mgmt, Eval Service, Evaluation Framework, Experiment System | Rakesh |
| 2026-03-13 | Both repos switched to main branch — feat/ATH-0000 and Triton branches both merged | All ML training pipeline and Triton serving code now on main; no more feature branch tracking needed | Rakesh |
| 2026-03-13 | Triton branch merged to main — confirms deployment architecture | feature/llm-driven-ml-training merged; Triton IS, ExperimentService, shadow mode now in production codebase | Deuna Engineering |
| 2026-02-18 | Latency target updated: p95 <50ms → p95 <200ms | Revised from original SOW spec | Rakesh (w/ Pablo) |
| 2026-02-19 | Phase 1 target merchant set to Volaris (not Cinépolis) | Volaris has known PSPs (Worldpay ID:76, MIT ID:85, Elavon, Amex); Cinépolis only shows Cybersource gateway — processor unknown | Mark Walick |
| 2026-02-20 | Repo analysis scoped to branch feat/ATH-0000-athia-ml-llm-schema-discovery (not main) | This branch contains the active ML platform development; main does not reflect current capabilities | Pablo |
| 2026-02-26 | athena-platform feature/llm-driven-ml-training (Triton IS) identified as the production model serving path | Triton IS + shadow mode + ExperimentService provides complete training→serving pipeline; replaces manual SageMaker endpoint registration; closes G-06 | Rakesh |