Deuna × Aidaptive

Last updated 2026-04-15
Start date 2026-02-18
Phase 1  ·  Development  ·  Volaris Delivery
Phase 1 — Development

Smartrouter AI/ML Integration

Planning phase complete. Now building the Athia AI/ML integration into Deuna's payment routing service — focused on the Volaris merchant task list with 3 engineers.

Phase: Development (Phase 1)
Engineers: 3 (parallel delivery)
Latency target: p95 <200ms (revised from 50ms)
Training gaps: 14 gaps (~104.5 person-days)
⚡TL;DR Summary
Everything important in one place — read this first

What This Is

The planning phase is complete. We are now in active development, implementing the full Volaris task list to integrate Athia AI/ML into Deuna's payment routing service. A team of 3 engineers is delivering the 65-task Volaris plan across 8 phases (Phase 0 through Phase 7).

The Problem

  • Static routing rules → suboptimal acceptance rates
  • No intelligent PSP failover during outages
  • Retries not optimized by timing, route, or message
  • No feedback loop — outcomes don't improve future decisions

The Solution (5 Use Cases)

  • P-01 — PSP outage detection & failover
  • P-02 — Optimize existing static routing rules
  • P-03 — Per-transaction processor ranking
  • P-04 — Authorization message manipulation
  • P-05 — Retry optimization

Phase 1 Target — Volaris Merchant

PSPs: Worldpay (ID: 76) · MIT (ID: 85) · Elavon (cards) · Amex (Amex cards)
Remaining build effort: ~15.5d (~89d saved across merged + feature branches)
Full delivery: ~2 wks (2 engineers · was 5–6 wks before savings)
MVP (Stages 1–2): ~1.5 wks (Foundation + Automation only)

✅ Already Built (Reduces Effort)

  • ✓ ML serving API — processor_selector & retry_predictor endpoints live (v0.15.5)
  • ✓ Model registry — artifact + experiment + variant tables + CRUD API
  • ✓ A/B testing — auto-winner with statistical guardrails (p<0.05, 7 day min, 1000 samples)
  • ✓ Snowflake feedback loop — PREDICTIONS + FEEDBACK + TRAINING_DATASET + STAGE_OUTCOMES tables active
  • ✓ Full ML training pipeline — LR, RF, XGBoost via Snowflake ML (G-01, G-04, G-07, G-09, G-11, G-12)
  • ✓ Triton IS + shadow mode + ExperimentService API — G-06 done
  • ✓ OTEL metrics pipeline + Grafana CloudWatch dashboard — G-05 done (v0.15.1)
  • ✓ Dynamic per-model encoders + A/B model version routing — G-13 done (v0.15.2+)
  • ✓ Processor Selector v2 — 54-feature XGBoost encoder, fully tested
  • ✓ ML platform services (PR pending) — Model Artifact, Feature, Evaluation, TFX, Experiment Tracking
  • ✓ CI improvements — black, isort, uv migration, pytest fixtures (G-03 partial)
  • ✓ Volaris training pipeline — data ingestion → preprocessing → trainer → evaluator → promoter, S3 artifact storage
  • ✓ Volaris TF smart router sidecar — 3 TF SavedModels, 140-feature encoder, Go API integration with feature flag

🔨 Still Needs to Be Built (~15.5d)

  • ✗ Feature store real-time serving layer — G-08 (~3.5d remaining)
  • ✗ Orchestration DAG (Airflow/Prefect) — G-02 (~2d remaining)
  • ✗ CI/CD deployment pipeline — G-03 (~3d remaining)
  • ✗ Lineage tracking — G-10 (~5.5d)
  • ✗ Rollback API — G-14 (~0.5d remaining)
  • ✗ Retry optimization workflow (P-05) — stimulus still missing
  • ✗ Strategy Director — matcher & ranker nodes are placeholder code

DATA-Athena-Snowflake Testing

~25% coverage · Maturity 3/5
Now training-only (serving removed via PR #1114). Volaris training pipeline complete. 30 tests. CI improved: black, isort enforcement, uv migration. Core multi-agent framework still largely untested.

athena-platform Testing

~35% coverage · Maturity 4/5
PR #215 — Volaris TF smart router sidecar. 3 TF SavedModels, 140-feature encoder, Go API integration with feature flag. 221 Python + 13 Go tests. V2 API (18 handlers) and Bedrock client still at 0%. CI threshold only 20%.

Current Blockers

Item | Owner | Status
Deuna corp accounts — Rakesh & Naoki | TBD | Not needed
Code / repo access — Naoki | TBD | Done
Are ATHIA_* tables live in Deuna's Snowflake? | Israel | Confirmed ✓
Are SageMaker endpoints live today? | Rakesh | Open question
Payment volume through routing engine? | Israel | Open question
GPU instance for training — CPU too slow (3hr+ per run) | Deuna | Blocking
AWS resource access for Rakesh (console/CLI) | Deuna | Blocking
✈️ See full Volaris delivery task list — 65 tasks across 8 phases (TensorFlow ecosystem) →
🏠Project Overview
What this project is and what success looks like
🎯

Purpose

Assess the effort required to integrate Athia AI/ML into Deuna's payment routing. Produce a clear work breakdown and estimate before any implementation begins.

✅

Phase 0 Deliverables

  • Full schema & data understanding
  • Effort estimate per workstream
  • Risks and open questions resolved
  • Recommended build order
🏆

Long-Term Success

  • Measurable approval lift
  • Stability during PSP outages
  • Latency: p95 < 200ms
  • Closed feedback/learning loop

✅ In Scope (Phase 0)

  • Understand Deuna's data, schema, routing rules
  • Assess Athia platform gaps vs. what's needed
  • Size effort for P-01 through P-05 use cases
  • Identify all dependencies, blockers, risks

🚫 Out of Scope (Phase 0)

  • Any implementation or code delivery
  • 3DS optimization (Phase 2)
  • User-facing messaging (Phase 3)
  • Installment optimization
📅Timeline & Phases
Project phases from assessment to full delivery
🔍

Phase 0 — Assess Level of Effort  Done ✓

2 days · $6K budget · Completed 2026-02-19
Nail down all the work required. Produce a detailed estimate with confidence before committing to delivery.

🚀

Phase 1 — Model in Production  Pending

2 weeks · Core delivery
Model running in production for 2 processors with basic feature store. Target merchant: Volaris.

📊

Phase 2 — Monitoring + Experimentation  Pending

Week 3 · Add monitoring and integrate with A/B experimentation infrastructure.

⚙️

Phase 3 — Drift Detection, CI/CD, Ramp-Up  Pending

TBD · Drift detection, CI/CD pipeline, experiment ramp-up, additional model techniques.

Phase 1 Delivery Plan — Volaris Merchant

65 tasks · ~64 person-days · 5–6 weeks with 3 engineers · TensorFlow ecosystem

Sub-Phase | Focus | Tasks | Effort | Key Notes
0 — Service Architecture | Design + scaffold 7 TF service shells | 11 | 14.5d | NEW. Rakesh: architecture, API contracts, TF integration plan. Team: 7 service shells
1 — Discovery & EDA | Understand Volaris data | 10 | 7.5d | Approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, A/B sample size
2 — Feature Engineering | Build ML features (Feature Service + TF Transform) | 9 | 8.5d | Card BIN/brand, RFM, retry context, rolling health scores, Amex hard-rule bypass
3 — Model Development | Train models (tf.keras) | 8 | 9d | TF DNN, wide-and-deep, TF Decision Forests via Training Pipeline + Eval Service
4 — Outage Detection | P-01: failover for 4 PSPs | 6 | 6.5d | Rolling health score (5–15 min window), auto-failover, recovery detection (1–2% sampling), alerts
5 — Message Manipulation | P-04: CIT/MIT experiment | 5 | 5.5d | CIT/MIT audit, approval delta by toggle × processor × card type, new athena-platform endpoint, A/B test
6 — Platform Integration | Register models, wire Deuna | 8 | 6.5d | Triton ✓. ExperimentService one-call API + built-in shadow mode. Requires Deuna eng coordination.
7 — Monitoring & Feedback | Dashboards, retraining, review | 8 | 6d | 2 tasks already done: ATHIA tables confirmed live, ATHIA_STAGE_OUTCOMES deployed
Total | | 65 | ~64d | Phase 0 first (Rakesh design ∥ team scaffolding); Phases 1–2 sequential; 3–5 parallel after Phase 2
👥Team
People involved and their roles
Deuna (Client)
Name | Role
Reks | CEO & Co-Founder
Chema | Co-Founder
Pablo | CTO — Executive Sponsor
Israel | Data POC — Snowflake & Data Access
Farhan | Claude / LLM Access POC
Mark Walick | Product Management Lead
Aidaptive (Contractor)
Name | Role
Rakesh | CEO
Naoki | Solutions Architect
Rene | ML Engineer
Kedar | Backend / Data Engineer
🎯Use Cases (P-01 to P-05)
The five P0 use cases to be delivered
🏢

Phase 1 Target Merchant: Volaris  Decided 2026-02-19

Volaris selected over Cinépolis. Known PSPs: Worldpay (ID: 76), MIT (ID: 85), Elavon (cards), Amex (Amex cards) — 4 processors total with routing policies per currency. Cinépolis deferred: only shows Cybersource (a gateway), actual processor unknown.

P-01

Outage Detection & Failover

Detect PSP failures via persistent timeout codes. Automatically fail over to another PSP, then fail back by randomly sampling traffic to the downed PSP to detect recovery.
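
The failover idea above can be sketched as follows. This is a minimal illustration only, not the planned implementation: the class and function names, the 10-minute window, the 20-event minimum, and the 50% timeout threshold are all assumptions chosen for the example. The 2% probe rate sits within the 1–2% sampling range mentioned in the delivery plan.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class PspHealth:
    """Sliding-window timeout tracker for one PSP (thresholds are illustrative)."""

    def __init__(self, window_seconds: float = 600.0, min_events: int = 20,
                 max_timeout_rate: float = 0.5) -> None:
        self.window_seconds = window_seconds
        self.min_events = min_events
        self.max_timeout_rate = max_timeout_rate
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, timed_out)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the sliding window.
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()

    def record(self, timed_out: bool, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._events.append((now, timed_out))
        self._evict(now)

    def is_down(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        self._evict(now)
        if len(self._events) < self.min_events:
            return False  # not enough evidence to declare an outage
        timeouts = sum(1 for _, timed_out in self._events if timed_out)
        return timeouts / len(self._events) > self.max_timeout_rate


def choose_psp(ranked_psps: List[str], health: Dict[str, PspHealth],
               probe_rate: float = 0.02,
               rng: Callable[[], float] = random.random,
               now: Optional[float] = None) -> str:
    """Pick the first healthy PSP; send a small sample to a down PSP as a probe."""
    for psp in ranked_psps:
        if not health[psp].is_down(now) or rng() < probe_rate:
            return psp
    return ranked_psps[0]  # every PSP looks down: fall back to the top-ranked one
```

In this sketch, fail-back happens implicitly: probe transactions that succeed are recorded as non-timeouts, the windowed timeout rate drops, and the PSP is treated as healthy again.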

P-02

Routing Optimizer

Optimize Deuna's existing static routing rules based on historical outcomes. Build on existing rules engine rather than starting from scratch.

P-03

Per-Transaction Route Selection

Rank top 3 payment processors per transaction in real time based on prior outcomes, card signals, and merchant context.
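
As a rough sketch of the ranking step, assuming a model has already produced one approval probability per processor: the function name and input shape are hypothetical. The Amex bypass mirrors the hard rule listed in the Volaris task list (V-17), where Amex cards always go to the Amex processor without consulting the model.

```python
from typing import Dict, List


def rank_processors(approval_probability: Dict[str, float], card_brand: str,
                    top_k: int = 3) -> List[str]:
    """Return up to top_k processors, highest predicted approval first."""
    # Hard rule: Amex cards bypass ML and go straight to the Amex processor.
    if card_brand.upper() == "AMEX":
        return ["Amex"]
    ranked = sorted(approval_probability,
                    key=lambda psp: approval_probability[psp],
                    reverse=True)
    return ranked[:top_k]
```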

P-04

Message Manipulation

Toggle CIT/MIT, AVS, MCC variables in authorization request messages. Provide top 3 configuration recommendations per transaction.

P-05

Retry Optimization

Optimize when, how, and where to retry declined transactions. Focused on MIT and subscription transactions. Enterprise darktime reduction. Delayed retry based on processor reputation.

🗄️Data & Schema
Snowflake database overview — extracted 2026-02-18

Connection

VLTAXPW-RMONTES

Database: PAYMENT_ML

Access: Read-only

ABTESTING Schema

Denormalized flat join of all views. Best starting point for EDA. No complex joins needed.

ALL_VIEWS_FLAT ALL_PAYMENT_EVENTS_FLAT

SOURCES Schema

15 clean views: orders, payments, attempts, events, user profiles, routing logs, merchant rules, airline data.

View | Why It Matters | Use Cases
VW_ATHENA_PAYMENT_ATTEMPT | Full retry chain per payment; processor, error codes, hard/soft decline, DYNAMIC_ROUTING_DETAIL JSON | P-03, P-05
VW_SMART_ROUTING_ATTEMPTS | Live routing engine log: algorithm type, latency, skip reasons — direct latency signal for p95 <200ms | P-01, P-02
VW_ROUTING_MERCHANT_RULE | Existing static rules engine — foundation for routing optimizer. SHADOW_MODE column suggests testing infrastructure exists. | P-02
ABTESTING.ALL_VIEWS_FLAT | Everything joined in one table — best for initial EDA | EDA

Feature Group | Key Columns | Use
Retry history | NUM_ATTEMPTS_ORDER, PREVIOUS_ORDER_ERROR_CODE, AVG_SEC_BETWEEN_PAYMENT_ATTEMPS | P-05
Error signals | ERROR_CODE, ERROR_CATEGORY, HARD_SOFT | P-03, P-05
Card signals | CARD_BIN, CARD_BRAND, BANK, CARD_COUNTRY | P-03
User behavior | TARGET_USER_FRAUD_RATE_COHORT, TOTA_MINUTES_BROWSING, RFM values | P-03
Message config | MCI_MSI_TYPE, ORDER_MCI_MSI_TYPE, PAYMENT_ATTEMPT_METHOD_TYPE | P-04
Geo & Device | ORDER_COUNTRY_CODE, TARGET_USER_BROWSER, TARGET_USER_DEVICE | P-03
🔍Repository Analysis
Findings from both Deuna GitHub repos — re-analyzed 2026-03-24 · Both repos tracked on main · DATA-Athena-Snowflake now training-only (PR #1114) · athena-platform PR #215 — Volaris TF smart router sidecar

DATA-Athena-Snowflake

github.com/DUNA-E-Commmerce/DATA-Athena-Snowflake · branch: main (feat/ATH-0000 merged — 182 files, +29K lines)

Python / LangGraph / Snowflake ML

✅ Key Finding (Updated 2026-03-24)

Now training-only — serving removed via PR #1114, migrated to athia-model-server sidecar in athena-platform. Volaris training pipeline complete: data ingestion → preprocessing → trainer → evaluator → promoter, with S3 artifact storage. 30 tests. Clean separation: training repo (Python) vs serving repo (Go + Python sidecar).

LLM Workflows (11 stimuli)

Workflow | Status
Acceptance rate analysis | Done (v0_1, v1_0)
Fraud card analysis | Done
Metrics anomaly detection | Done
Chatbot / data analyst | Done
Strategy generation director | Partial — Matcher has exit()
Cost optimization | Early stage
Retry optimization | Missing (P-05 gap)

ML Training Platform (now on main)

Service | Status
Training pipeline (Snowflake ML) | Done
LLM training orchestrator (GPT-4 + RAG) | Done
LLM experiment designer (GPT-4 + RAG) | Done
Data quality validator | Done
Model registry (auto table creation) | Done
Feature extractor | Done
Feedback collector (webhook/API/batch) | Done
Schema discovery (ChromaDB) | Done
Athia event ingestion | Done
Model deployer → athena-platform | Partial — EFS export done; API integration manual

Architecture — Multi-Agent Pattern

FastAPI + LangGraph · Stimulus-response orchestration · LLM backends: Claude (primary), GPT-4 (fallback)

Request → StimulusRegistry → OrchestratorWorkflow → Branch (DAG of Nodes)
        → AgentWorkflow (LangGraph StateGraph) → Response

11 stimuli: acceptance_rate_analysis · fraud_card_analysis · metrics_anomaly
            user_question · data_analyst · researcher_assistance · deep_exploration
            element_edition · knowledge_expert · strategy_generation · cost_optimization

End-to-End Training Flow

POST /training/run/processor_selector
  → Schema Discovery (ChromaDB index)
  → Data Quality Validation (temporal bias · class balance · outlier · concept drift)
  → LLM Training Orchestrator (GPT-4 + RAG → RETRAIN_NOW / SCHEDULED / SKIP)
  → LLM Experiment Designer (7 experiments, simple→complex)
  → Snowflake ML Training (LR, RF, XGBoost · 80/20 temporal split)
  → Best model selected by F1 score
  → Export to EFS + athena-platform payload prepared
  → Results stored in ML_TRAINING_RUNS
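
Two steps in the flow above, the 80/20 temporal split and best-model selection by F1 score, can be illustrated with a small standalone sketch. This does not reproduce the Snowflake ML pipeline. The function names and the row format (a list of dicts with a timestamp key) are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def temporal_split(rows: List[dict], timestamp_key: str,
                   train_fraction: float = 0.8) -> Tuple[List[dict], List[dict]]:
    """Train on the oldest rows and evaluate on the newest (no shuffling)."""
    ordered = sorted(rows, key=lambda row: row[timestamp_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_model(predictions: Dict[str, Sequence[int]],
                      y_true: Sequence[int]) -> str:
    """Return the name of the candidate whose predictions score the highest F1."""
    return max(predictions, key=lambda name: f1_score(y_true, predictions[name]))
```

Splitting by time rather than at random keeps future transactions out of the training set, which matters when approval behavior shifts over time.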

Training Pipeline Architecture

Pipeline flow: Training Decision → Data Prep → Feature Extraction → Validation → Experiment Design → Training → Model Selection → Deployment

Services

Service | Purpose
TrainingPipeline | Full training execution (plan/run/deploy)
LLMTrainingOrchestrator | LLM + RAG decision engine (RETRAIN_NOW / SCHEDULED / SKIP)
LLMExperimentDesigner | Designs 5-10 experiments using GPT-4/Claude with RAG
ModelDeployer | Exports to EFS, registers deployment, creates canary config
TrainingPlanner | Dry-run mode ("terraform plan" for ML)
FeatureExtractor | Auto-extracts features, creates training dataset views
DataQualityValidator | Schema, statistical, temporal bias, drift validation
FeedbackCollector | Webhook, API polling, batch feedback collection
ModelRegistry | Model CRUD, prediction/feedback schema management
SchemaDiscovery | Auto-discovers training tables via LLM
LLMProvider | Unified Claude/GPT-4 interface with auto-fallback

API Endpoints

Endpoint | Purpose
POST /api/v1/training/plan/{model_type} | Dry-run plan
POST /api/v1/training/run/{model_type} | Execute training
POST /api/v1/training/decision/{model_type} | LLM decision
POST /api/v1/experiments/design/{model_type} | Design experiments

Remaining Gaps

  • ✗ No retry LLM workflow — retry_optimization_requested stimulus still missing (P-05)
  • ✗ Strategy Director matcher/ranker still incomplete — exit() placeholder in matcher (P-02)
  • ✗ Deployment integration incomplete — model_deployer.py exports to EFS but does NOT call athena-platform API (~1.5d left)
  • ✗ No data lineage tracking — TrainingDatasetVersion not implemented (G-10)
  • ✗ No rollback capability (G-14 partial — shadow mode + is_default merged to main; rollback API not built)
  • ✗ SQL injection risks in schema_discovery.py, training_planner.py, athia_ingestion.py
  • ✗ No circuit breakers — LangGraph node failures still cascade
  • ✗ Fragile LLM JSON parsing — all services extract JSON by searching for braces (not robust)
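
For the last gap, one possible hardening is to let the standard-library JSON decoder find the end of the object instead of matching braces by hand. The sketch below is a suggestion only. The helper name is hypothetical and it is not taken from the repository.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first valid JSON object embedded in free-form LLM output."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores
            # whatever follows, so braces inside string values are handled.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None
```

Returning None instead of raising lets the caller decide whether to retry the LLM call or fall back to a default decision.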

🧪 Testing Coverage

Est. coverage: ~25% · Test functions: 450+ · Test files: 34

Layer | Coverage | Notes
Metrics layer | 70–80% | Well tested — 5 files
Deployment / config | 50–60% | Validation tests solid
Model registry (new) | ~60% | Happy-path CRUD; no error cases
Schema discovery (new) | ~45% | Discovery + semantic search; no edge cases
Experiment designer (new) | ~40% | RAG + ChromaDB; LLM-dependent, not mocked
Training orchestrator (new) | ~35% | Happy-path; external deps not mocked
Feedback collector (new) | ~35% | Batch/polling skipped; no duplicates
Route handlers | ~0% | Still largely untested
Multi-agent core + 11 branches | ~5% | 1 manual test — not in pytest
Lambda / AgentCore entrypoints | 0% | 4 entrypoints — no tests

Maturity: 3/5 — Moderate. ML training services are well-architected with growing test coverage after merge. Tests are still integration-only, depend on live Snowflake + OpenAI, and cover happy-path only. Multi-agent core still a black box. Not suitable for CI without mocking.

athena-platform

github.com/DUNA-E-Commmerce/athena-platform

Go / Gin

✅ Key Finding (Updated 2026-03-24)

PR #215 — Volaris TF smart router sidecar. 3 TF SavedModels (processor_selector, retry_predictor, retry_sequence), 140-feature encoder, Go API integration with feature flag. 221 Python + 13 Go tests. Now the single serving repo — training migrated to DATA-Athena-Snowflake. Still at v0.15.5 with Triton IS, shadow mode, ExperimentService, Grafana CloudWatch dashboard, and auto versioning.

✅ Triton Branch — Merged to Main

G-06 deployment gap nearly closed (~1.5d remaining). Provides partial G-14 rollback via shadow mode. Now in main: Triton IS sidecar, ModelConversionManager (sklearn→Triton), ServiceWithTriton (readiness checks), ExperimentService (one-call experiment creation), shadow mode seed experiments for all 3 model types, max_variants constraint. 32/32 new tests passing. Active branches in review: feat/new-encoder-v2, feature/experiment-api-cleanup.

ML Inference Types (already in registry)

Type | Maps To
processor_selector | P-03
retry_predictor | P-05
retry_sequence | P-05
installment_optimizer | Out of scope

Snowflake Tables

Table | Status
ATHIA_PREDICTIONS | Active
ATHIA_FEEDBACK | Active
ATHIA_TRAINING_DATASET | Active
ATHIA_EXPERIMENT_LIFT | Active
ATHIA_STAGE_OUTCOMES | Deployed (feat/ATH-0000)
ATHIA_SESSION_SUMMARY | Deployed (feat/ATH-0000)
ATHIA_MULTI_STAGE_ANALYSIS | New (feat/ATH-0000)
ATHIA_MODEL_METRICS | New (feat/ATH-0000)
ML_MODEL_REGISTRY | New (feat/ATH-0000)

Architecture — Clean Architecture (Go/Gin)

PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS · mTLS enforced on /api/v1/ml/predict/*

REST Handlers (V1 + V2, Gin)  ← mTLS on /ml/predict/*
        ↓
Controllers (~30 implementations)
        ↓
Domain Services (44 packages)  ← constructor injection throughout
        ↓
Repositories (43 GORM implementations)  ← in-memory SQLite for tests
        ↓
PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS (model storage)

A/B Experimentation — Auto-Winner Guardrails

Stats

p-value < 0.05
Min 1000 samples/variant
Min 7 days runtime

Lift

Min 1% absolute lift
Deterministic bucketing
SHA256(transaction_id)
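
A minimal sketch of deterministic bucketing with SHA256(transaction_id): the same transaction always lands in the same variant, with no stored assignment state. The function name and the choice to use the first 8 bytes of the digest are assumptions. The actual Go implementation in athena-platform may differ.

```python
import hashlib
from typing import List


def assign_variant(transaction_id: str, variants: List[str]) -> str:
    """Map a transaction to a variant using a stable hash of its ID."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```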

Guardrails

≤10% latency regression
≥−5% revenue regression
Dry-run mode (safe default)
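
The guardrails above can be combined into a single check along these lines. This is a sketch under stated assumptions: the data class, the function names, and the use of a pooled two-proportion z-test for the p-value are illustrative choices, and revenue is compared per transaction. The source lists the thresholds but not the statistical test used.

```python
import math
from dataclasses import dataclass


@dataclass
class VariantStats:
    samples: int             # transactions routed to this variant
    approvals: int           # approved transactions
    p95_latency_ms: float
    revenue_per_txn: float


def two_proportion_p_value(a: VariantStats, b: VariantStats) -> float:
    """Two-sided p-value for a difference in approval rates (pooled z-test)."""
    pooled = (a.approvals + b.approvals) / (a.samples + b.samples)
    se = math.sqrt(pooled * (1 - pooled) * (1 / a.samples + 1 / b.samples))
    if se == 0:
        return 1.0
    z = (b.approvals / b.samples - a.approvals / a.samples) / se
    return math.erfc(abs(z) / math.sqrt(2))


def can_declare_winner(control: VariantStats, treatment: VariantStats,
                       days_running: int) -> bool:
    """Apply the listed stats, lift, and guardrail thresholds to one treatment."""
    lift = treatment.approvals / treatment.samples - control.approvals / control.samples
    return (
        days_running >= 7                                            # min runtime
        and min(control.samples, treatment.samples) >= 1000          # min samples/variant
        and lift >= 0.01                                             # min 1% absolute lift
        and two_proportion_p_value(control, treatment) < 0.05        # significance
        and treatment.p95_latency_ms <= control.p95_latency_ms * 1.10    # latency guardrail
        and treatment.revenue_per_txn >= control.revenue_per_txn * 0.95  # revenue guardrail
    )
```

In dry-run mode, a result of True would only be logged rather than used to promote the variant.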

🧪 Testing Coverage

Est. coverage: ~25–30% · Test functions: 777 · Test files: 126

Layer | Coverage | Notes
Domain services (44) | 44/44 | All have test files
Repositories (43) | ~43/43 | In-memory SQLite isolation
V1 REST handlers (18) | 15/18 (83%) | agent, workspaces, elements missing
Auth middleware | Tested | JWT + API key covered
V2 REST handlers (18) | 0/18 (0%) | Entire new API version untested
Bedrock client | 0% | Excluded from coverage config
auth, bedrock, element, workspace services | 0% | 4 domain services with no tests
Bootstrap / DI graph | Skipped | TODO: testcontainers

Maturity: ~4/5 — Strong. Domain and repository layers well covered. Triton merge adds 32 new tests. V2 API (18 handlers) and Bedrock ML inference path are still untested. CI threshold is only 20% and internal/clients/ is excluded from coverage entirely.

Testing Comparison — Both Repos

Metric | DATA-Athena-Snowflake | athena-platform
Test functions | 450+ (+6 new integration files) | 777
Test files | 34 (28 + 6 new) | 126
Est. coverage | ~20% | ~25–30%
Core product tested? | No — multi-agent core untested; new ML tests need live deps | Partial — V2 + Bedrock missing
CI enforced? | Partial (fragmented) | Yes — every PR
Coverage threshold | None enforced | 20% (too low; target: 65%)
Maturity | 2.5/5 — Low | ~4/5 — Strong
⚡Training Platform Gaps
14 gaps — ~104.5d original estimate · ~89d saved across merged + feature branches · ~15.5d remaining

Timeline (2 engineers)

Full delivery: ~2 weeks (was 5–6 wks before savings)

MVP (Stages 1–2 only): ~1 week

1 engineer: ~3 weeks

Build Order

Stage 1 must complete before Stage 2. Stages 3–5 can overlap with late Stage 2.

Triton branch merged to main — deployment architecture confirmed.

Total Effort

~15.5d remaining

of ~104.5d original · ~89d saved

Progress Summary — 2026-03-24

Gaps fully done: 9 (G-01, G-04, G-05, G-06, G-07, G-09, G-11, G-12, G-13)
Nearly done: 3 (G-02, G-08, G-14)
Partial: 2 (G-03, G-10)
Not started: 0

Stage | Gap | Category | Priority | Original | Status
1 – Foundation | G-02 Orchestration | Infrastructure | High | 8d | Nearly Done (~1d left) — ~88% complete. SageMaker + Spacelift deployed, daily schedules active
1 – Foundation | G-08 Feature Store | ML Infra | High | 13.5d | Nearly Done (~3.5d left) — ~74% complete, 140-feature encoder in sidecar
1 – Foundation | G-04 Data Validation | Data Quality | High | 7d | Done ✓
2 – Automation | G-03 CI/CD Pipeline | DevOps | High | 9d | Partial (~2d left) — Quality gates done. Spacelift + GitHub Actions CI/CD. Integration tests partial. Staging/prod TODO
2 – Automation | G-06 Deployment Automation | Automation | High | 7.5d | Done ✓
2 – Automation | G-07 Model Registration | Automation | Medium | 5.5d | Done ✓
2 – Automation | G-01 Automated Retraining | Automation | High | 10d | Done ✓
3 – Governance | G-13 Versioning Workflow | Governance | High | 5d | Done ✓
3 – Governance | G-10 Lineage Tracking | Governance | Medium | 6.5d | Partial (~4d left) — ~35% complete. ClearML online tracking at athia-ml.dev.deuna.io, pipeline→training task hierarchy
3 – Governance | G-14 Rollback Capability | Reliability | High | 5d | Nearly Done (~0.5d left) — ~90% complete, rollback API pending
4 – Observability | G-05 Model Monitoring | Observability | High | 8d | Done ✓
4 – Observability | G-09 Drift Detection | Observability | Medium | 7d | Done ✓
5 – ML Quality | G-11 Hyperparameter Tuning | ML Quality | Medium | 5.5d | Done ✓
5 – ML Quality | G-12 Algorithm Comparison | ML Quality | Medium | 7d | Done ✓

Remaining Effort by Stage
Stage 1 – Foundation: ~5.5d
Stage 2 – Automation: ~3d
Stage 3 – Governance: ~6d
Stage 4 – Observability: ~1d
Stage 5 – ML Quality: 0d ✓
📄Statement of Work — Delivery
Volaris Phase 1 · 65 tasks · ~64 person-days · 5–6 weeks (3 engineers) · TensorFlow ecosystem

Engagement Summary

Full implementation of Athia AI/ML smartrouting for Deuna — Volaris merchant. 7 delivery phases covering P-01 (outage failover), P-03 (per-transaction routing), P-04 (message manipulation), and P-05 (retry optimization). ~61.5 person-days of prior Deuna codebase work reduces original scope significantly.

Delivery tasks: 54 · Person-days: ~51d · Calendar time: 5–6 wks

Aidaptive Team — Roles & Responsibilities

Name | Role | Responsibilities | Days
Rakesh | Project Lead & Strategy | Client coordination (Pablo, Israel), architecture decisions, Phase 6 oversight, post-launch review | 5.5d
Naoki | Solutions Architect | athena-platform Go dev, outage detection, message manipulation API, model serving (Triton), CI/CD, Phase 6 integration | 14.6d
Rene | ML Engineer | Feature engineering, model training (processor_selector, retry_predictor, retry_sequence), data quality, drift detection, retraining pipeline | 15d
Kedar | Data & Backend Engineer | Snowflake EDA, data pipelines, training datasets, feature feeds, Grafana dashboards, monitoring | 16d
Total | | | ~51.1d

Effort by Phase

Phase | Focus | Owners | Days | Milestone
1 — Discovery & EDA | Understand Volaris data | Kedar (4.5d) · Rene (2d) · Rakesh (0.5d) · Naoki (0.5d) | 7.5d | Kick-off (20%)
2 — Feature Engineering | Build ML feature set | Rene (4d) · Kedar (3d) · Naoki (1d) · Rakesh (0.5d) | 8.5d | Phase 2 complete (20%)
3 — Model Development | Train P-03 + P-05 models | Rene (6d) · Kedar (2d) · Naoki (1d) | 9d | Phase 3 complete (20%)
4 — Outage Detection | P-01: failover for 4 PSPs | Naoki (4.5d) · Rakesh (1d) · Kedar (1d) | 6.5d | Phase 6 complete (30%)
5 — Message Manipulation | P-04: CIT/MIT experiment | Naoki (2d) · Rene (1.5d) · Kedar (1.5d) · Rakesh (0.5d) | 5.5d |
6 — Platform Integration | Register models, wire Deuna + G-06 close | Naoki (5.1d) · Rakesh (2d) · Kedar (1d) | 8.1d |
7 — Monitoring & Feedback | Dashboards, retraining, review | Kedar (3d) · Rene (1.5d) · Rakesh (1d) · Naoki (0.5d) | 6d | Phase 7 complete (10%)
Total | | | ~51d |

Delivery Timeline (6-week plan)

Week | Days | Phases Active | Who | Key Milestone
Week 1 | 1–5 | Phase 1 (EDA) · Phase 2 start Day 3 | Kedar · Rene · Rakesh (Day 1) | EDA complete; feature schema draft
Week 2 | 6–10 | Phase 2 (Features) · Phase 3 start Day 8 | Rene · Kedar · Naoki | Feature set locked; training dataset built
Week 3 | 11–15 | Phase 3 (Models) · Phase 4 (Outage) parallel | Rene (models) · Naoki (outage) | Models packaged; outage detection built
Week 4 | 16–20 | Phase 4 tail · Phase 5 (CIT/MIT) · Phase 6 prep | Naoki · Rene · Kedar · Rakesh | API contract with Deuna eng signed
Week 5 | 21–25 | Phase 6 (Integration) | Naoki · Rakesh | ⚠ Triton branch must be merged by Day 18 · Integration live in shadow mode
Week 6 | 26–30 | Phase 7 (Monitoring & Review) | Kedar · Rene · Rakesh | Dashboards live · retraining scheduled · post-launch report

Critical path: Phases 1–2 sequential. Phases 3–5 can run in parallel. Phase 6 requires (a) models complete, (b) Triton branch merged, (c) 1-week Deuna engineering lead time for API contract. Phase 7 requires Phase 6 live.

Assumptions

  • Snowflake access (PAYMENT_ML) remains available read-only
  • Deuna eng available for API contract in Week 4 (Pablo / Israel)
  • Triton branch merged to main by end of Week 3
  • Staging environment available for Phase 6 integration tests
  • ATHIA_PREDICTIONS + ATHIA_FEEDBACK remain live throughout

Success Criteria

  • processor_selector live for ≥1 Volaris PSP
  • ≥1% absolute approval rate lift (A/B test at significance)
  • ≥5% retry success rate improvement vs. baseline
  • PSP failover within 1 routing cycle of threshold breach
  • p95 latency <200ms end-to-end (model inference <50ms)
  • 48h shadow run complete with documented comparison
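
To make the latency criterion concrete, here is a small sketch of a p95 check over a list of measured latencies. It uses the nearest-rank percentile definition, which is an assumption. Monitoring tools such as Grafana may interpolate differently. The function names are hypothetical.

```python
import math
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: smallest value with at least 95% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def meets_latency_target(latencies_ms: Sequence[float], target_ms: float = 200.0) -> bool:
    """True when the p95 of the samples is strictly below the target."""
    return p95(latencies_ms) < target_ms
```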
✈️Volaris Smartrouting — Delivery Tasks
65 tasks across 8 phases to deliver AI-powered routing for Volaris — TensorFlow ecosystem, Phase 1 target client

🔧 Architecture Update (2026-03-13): TensorFlow Ecosystem + 7 Service Shells

Adopted TensorFlow ecosystem (tf.keras, TFX, TF Serving, TFDV, TFMA, TF Transform) for all ML work. Added Phase 0 with 11 new tasks: 3 design tasks (Rakesh) + 8 service shell scaffolds. Replaces Snowflake ML / XGBoost / scikit-learn.

7 Service Shells
Data Pipelines · Feature Service · Training Pipelines · Model Management · Eval Service · Evaluation Framework · Experiment System
Design Tasks (Rakesh)
V-D01: Service architecture · V-D02: API contracts · V-D03: TF ecosystem integration plan
Updated Totals
65 tasks (was 54) · ~64d total (was ~49.5d) · Phase 0 adds ~14.5d · 3 engineers ~5–6 weeks
PSPs: 4 (Worldpay · MIT · Elavon · Amex)
Total tasks: 65 (+11 new)
Total effort: ~64d (+14.5d, Phase 0)
Duration: 5–6 wks (3 engineers in parallel)
Phase 0 — NEW

Service Architecture & Shell Setup

Design 7 service boundaries (Rakesh), define API contracts, scaffold all service shells using TensorFlow ecosystem: Data Pipelines (TFX), Feature Service (TF Transform), Training Pipelines (TFX Trainer), Model Management, Eval Service (TFMA), Evaluation Framework, Experiment System.

📐 V-D01–D03 (Rakesh design) + V-S01–S08 (team scaffolding)

11 tasks · 14.5d · TensorFlow
Phase 1

Discovery & EDA

Understand Volaris transaction data — approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, sample size for A/B test.

10 tasks · 7.5d
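
The A/B sample-size question in this phase comes down to simple arithmetic: does a processor's daily volume, split across variants, reach the 1000-samples-per-variant minimum within the 7-day minimum runtime? A hypothetical helper, assuming an even traffic split:

```python
def has_enough_volume(daily_volume: float, variants: int = 2,
                      min_samples_per_variant: int = 1000, days: int = 7) -> bool:
    """Check whether an even split reaches the per-variant minimum within `days`."""
    samples_per_variant = daily_volume * days / variants
    return samples_per_variant >= min_samples_per_variant
```

For example, 300 transactions per day across two variants gives 1050 per variant in 7 days, which clears the bar, while 200 per day gives only 700.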
Phase 2

Feature Engineering

Card BIN/brand, transaction context, user RFM, retry history, rolling processor health scores, Amex hard-rule bypass, training dataset build.

9 tasks · 8.5d
Phase 3

Model Development

Train processor_selector, retry_predictor, retry_sequence for Volaris 4 PSPs using tf.keras. DNN vs. wide-and-deep vs. TF Decision Forests comparison. TFMA per-slice evaluation.

🔧 TensorFlow ecosystem: tf.keras training via Training Pipeline service, TFMA evaluation via Eval Service, SavedModel export via Model Management.

8 tasks · 9d · Infra ready
Phase 4 — P-01

Outage Detection

Rolling health score per PSP, failover to next-best Volaris processor, recovery detection via 1–2% sampling, alerts on state changes.

6 tasks · 6.5d
Phase 5 — P-04

Message Manipulation

CIT/MIT audit for Volaris, approval delta by toggle × processor × card type, experiment design, new athena-platform endpoint, A/B test.

5 tasks · 5.5d
Phase 6

Platform Integration

Register models in athena-platform, create Volaris-scoped experiment, API contract with Deuna eng, shadow mode validation before live traffic.

✅ Triton branch: ExperimentService one-call API (V-39–41) + built-in shadow mode (V-46) reduce effort by ~1d.

8 tasks · 6.5d (was 7.5d) · Triton ✓
Phase 7

Monitoring & Feedback Loop

V-47: Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake. Done ✓ — confirmed data live in Deuna's Snowflake (2026-02-24)
V-48: Deploy ATHIA_STAGE_OUTCOMES table. Done ✓ — deployed in feat/ATH-0000
V-49–50: Approval rate + model performance Grafana dashboards
V-51–54: Retraining trigger, scheduled pipeline, auto-winner, post-launch review. Ready ✓ — LLM orchestrator + training pipeline built

8 tasks · 6d · 2 tasks done/ready

All 65 Tasks

# | Task | Phase | Owner | Effort | Status
V-D01 | Design overall service architecture — 7 service boundaries, data flow, inter-service communication | 0 – Architecture | Rakesh | 2d | Design
V-D02 | Define API contracts for all 7 services — OpenAPI specs, error handling, versioning | 0 – Architecture | Rakesh | 1.5d | Design
V-D03 | Design TensorFlow ecosystem integration — map TFX components to services, TF Serving format, TFDV/TFMA | 0 – Architecture | Rakesh | 1d | Design
V-S01 | Scaffold Data Pipeline service — TFX ExampleGen + StatisticsGen, Snowflake adapter, TFDV schema | 0 – Shell | Kedar | 1.5d |
V-S02 | Scaffold Feature Service — TF Transform preprocessing_fn, feature store API, real-time endpoint | 0 – Shell | Rene | 1.5d |
V-S03 | Scaffold Training Pipeline service — TFX Trainer + tf.keras, Keras Tuner, training history | 0 – Shell | Rene | 1.5d |
V-S04 | Scaffold Model Management service — registry CRUD, SavedModel storage, lifecycle, version comparison | 0 – Shell | Naoki | 1d |
V-S05 | Scaffold Eval Service — TFMA integration, per-slice metrics, model blessing/rejection API | 0 – Shell | Rene | 1d |
V-S06 | Scaffold Evaluation Framework — A/B stat engine, winner detection, latency/revenue guardrails | 0 – Shell | Naoki | 1.5d |
V-S07 | Scaffold Experiment System — experiment CRUD, traffic splitting, variants, shadow mode orchestration | 0 – Shell | Naoki | 1.5d |
V-S08 | Set up shared TF dependencies — tensorflow, tfx, tf-transform, tfma, tfdv, keras-tuner + Docker base | 0 – Shell | Kedar | 0.5d |
V-01 | Filter Volaris transactions — date range, volume, monthly trend | 1 – EDA | Kedar | 0.5d |
V-02 | Per-processor approval rates (Worldpay, MIT, Elavon, Amex) by card type, currency, amount | 1 – EDA | Kedar | 1d |
V-03 | Retry pattern analysis — attempts per order, processor retry-to, 1st/2nd/3rd attempt success rates | 1 – EDA | Kedar | 1d |
V-04 | Explore DYNAMIC_ROUTING_DETAIL JSON — extract all keys and values | 1 – EDA | Kedar | 1d |
V-05 | Map Volaris routing rules from VW_ROUTING_MERCHANT_RULE* views | 1 – EDA | Kedar | 0.5d |
V-06 | Analyze smart routing log — algorithm types, skip rates, p95 latency baseline | 1 – EDA | Kedar | 0.5d |
V-07 | Hard vs. soft decline distribution by processor and error code | 1 – EDA | Rene | 1d |
V-08 | Profile airline-specific features — flight, passenger, booking window signal | 1 – EDA | Rene | 0.5d |
V-09 | A/B test sample size check — daily volume per processor ≥ 1000/variant in 7 days? | 1 – EDA | Rene | 0.5d |
V-10 | EDA summary report — approval rates, error taxonomy, processor share, correlations | 1 – EDA | Rene + Rakesh | 1d |
V-11 | Define Volaris feature schema — all features, types, sources, compute latency | 2 – Features | Rene + Naoki | 1d |
V-12 | Card-level features — BIN, brand, bank, type, country; historical approval rate per BIN × processor | 2 – Features | Rene | 1d |
V-13 | Transaction-level features — amount, currency, CIT/MIT, MCC, flight order type | 2 – Features | Rene | 1d |
V-14 | User-level features — RFM, fraud rate cohort, tenure, browsing signals | 2 – Features | Rene | 0.5d |
V-15 | Retry-context features — previous processor, error code, time since attempt, attempt number | 2 – Features | Kedar | 1d |
V-16 | Processor-state features — rolling approval/timeout/decline rate at 15-min, 1h, 24h windows | 2 – Features | Kedar | 1.5d |
V-17 | Amex hard-rule — always route Amex cards to Amex processor; bypass ML | 2 – Features | Naoki | 0.5d |
V-18 | Build training dataset — join features onto labeled outcomes; train/val/test split | 2 – Features | Kedar | 1d | Ready ✓ ATHIA_TRAINING_DATASET view + feature_extractor.py
V-19 | Feature quality validation — nulls, skew, leakage risk, outcome correlation | 2 – Features | Rene | 1d | Ready ✓ data_quality_validator.py (834 lines)
V-20 | Train processor_selector v1 — rank 4 PSPs by approval probability (tf.keras DNN) | 3 – Models | Rene | 2d | TF · Training Pipeline service
V-21 | Evaluate processor_selector — AUC, lift vs. static rules, per-processor accuracy, latency | 3 – Models | Rene | 1d | Ready ✓ Metrics auto-calculated by pipeline
V-22 | Train retry_predictor v1 — predict retry approval probability | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_predictor type
V-23 | Train retry_sequence v1 — optimal processor order for retry | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_sequence type
V-24 | Evaluate retry models — success rate lift, processor fatigue patterns | 3 – Models | Rene | 1d | Ready ✓ Evaluation framework in pipeline
V-25 | Architecture comparison — DNN vs. wide-and-deep vs. TF Decision Forests; select champion | 3 – Models | Rene | 1d | TF · Replaces XGBoost vs. LR comparison
V-26 | Inference latency test — all models under 50ms budget | 3 – Models | Naoki | 0.5d |
V-27 | Package models — serialize, write model card (schema, features, metrics) | 3 – Models | Kedar | 0.5d | Ready ✓ model_registry.py auto-creates tables + stores metadata
V-28 | Define outage signal — timeout/error code thresholds for PSP-down detection | 4 – P-01 | Rakesh + Naoki | 1d |
V-29 | Rolling processor health score — sliding 5–15 min window per PSP | 4 – P-01 | Naoki | 1.5d |
V-30 | Failover logic — skip degraded PSP, route to next-best Volaris processor | 4 – P-01 | Naoki | 1.5d |
V-31 | Recovery detection — 1–2% sampling of down PSP; auto-restore on consecutive wins | 4 – P-01 | Naoki | 1d |
V-32 | Outage simulation tests — inject failures per PSP; verify failover + recovery | 4 – P-01 | Naoki | 1d |
V-33 | Outage alerting — Slack/PagerDuty on PSP state changes | 4 – P-01 | Kedar | 0.5d |
V-34Audit CIT/MIT usage for Volaris — current distribution across PSPs5 – P-04Kedar0.5d
V-35Approval delta by CIT vs MIT per processor — statistical test5 – P-04Rene1d
V-36Design message manipulation experiment — CIT/MIT × processor × card type matrix5 – P-04Rene + Rakesh1d
V-37Implement message recommendation API in athena-platform5 – P-04Naoki2d
V-38Run A/B test — approval rate with vs. without message recommendations5 – P-04Kedar1d
V-39Register processor_selector in MODEL_ARTIFACTS (version, Triton backend ref, feature schema)6 – IntegrationNaoki0.3dReady ✓ POST /api/v1/ml/models (Triton branch ExperimentService)
V-40Register retry_predictor + retry_sequence in MODEL_ARTIFACTS6 – IntegrationNaoki0.3dReady ✓ Same — ExperimentService handles all 3 model types
V-41Create Volaris-scoped experiment — merchant filter, 10% treatment split, shadow mode, guardrails6 – IntegrationNaoki0.5dReady ✓ POST /api/v1/ml/experiments — variants + models in one call (Triton branch)
V-42Validate experiment assignment — SHA256 bucketing determinism for Volaris6 – IntegrationNaoki0.5d
V-43API contract with Deuna engineering — define POST /api/v1/ml/predict request/response for Volaris6 – IntegrationRakesh1d
V-44Deuna payment service integration — Deuna calls athena-platform at routing decision point6 – IntegrationRakesh + Naoki2d
V-45End-to-end integration test — full flow: Deuna → athena-platform → model → ranked PSPs6 – IntegrationNaoki + Kedar1d
V-46Shadow mode — 48h logging without acting; compare predicted vs. actual outcomes6 – IntegrationKedar0.5dReady ✓ is_shadow_mode=true built-in (Triton branch); set up + monitor only
V-47Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake7 – MonitoringKedar0.5dDone ✓ Confirmed data live in Deuna's Snowflake (2026-02-24)
V-48Deploy ATHIA_STAGE_OUTCOMES table in Snowflake7 – MonitoringKedar0.5dDone ✓ Deployed in feat/ATH-0000 SQL
V-49Volaris approval rate dashboard — daily/hourly per PSP vs. baseline7 – MonitoringKedar1d
V-50Model performance dashboard — prediction confidence, rank accuracy, retry lift7 – MonitoringKedar1d
V-51Define retraining trigger — approval rate drop or AUC drop thresholds7 – MonitoringRene0.5dReady ✓ llm_training_orchestrator.py makes RETRAIN_NOW / SCHEDULED / SKIP decisions
V-52Schedule weekly retraining — auto-register new version from latest ATHIA_TRAINING_DATASET7 – MonitoringRene1dReady ✓ training_pipeline.py + orchestrator built; configure for Volaris cadence
V-53Confirm auto-winner worker runs for Volaris experiment with correct guardrails7 – MonitoringNaoki0.5d
V-54Post-launch review — 2-week lift analysis: approval rate, outage response, retry success7 – MonitoringRakesh1d
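For V-41 and V-42, the determinism check can be pictured with a small sketch of hash-based bucketing. This is an illustration only: the key format (`experiment_id:unit_id`), the 10,000-bucket granularity, and the function name `assign_variant` are assumptions, not the scheme athena-platform actually uses. It shows the property V-42 needs to verify: the same input always lands in the same variant, and roughly 10% of units fall into treatment.

```python
import hashlib


def assign_variant(experiment_id: str, unit_id: str, treatment_pct: float = 10.0) -> str:
    """Deterministically map a unit (e.g. an order ID) to 'treatment' or 'control'.

    Hypothetical scheme: SHA-256 over "experiment_id:unit_id", first 8 bytes
    reduced to one of 10,000 buckets (0.01% granularity).
    """
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < treatment_pct * 100 else "control"


if __name__ == "__main__":
    # Hypothetical IDs; a real check would replay actual Volaris identifiers.
    ids = [f"order-{i}" for i in range(20_000)]
    first = [assign_variant("volaris-exp", i) for i in ids]
    second = [assign_variant("volaris-exp", i) for i in ids]
    assert first == second  # same input, same variant
    share = first.count("treatment") / len(first)
    print(f"treatment share: {share:.3f}")
```

Because the assignment depends only on the hashed key, it needs no stored state and can be recomputed by any service that knows the experiment ID.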
💡Codebase Improvement Suggestions
What needs to change in both repos to reach production-grade quality

🔴 Critical — Do These First

Action | Repo | Effort
Build retry_optimization_requested stimulus — P-05 is entirely missing from LLM platform | DATA-Athena-Snowflake | 3d
Complete Strategy Director — replace exit() placeholder & dummy ranker prompts | DATA-Athena-Snowflake | 2d
Add tests for all 18 V2 REST handlers — 0% coverage on new API version | athena-platform | 4d
Add tests for Bedrock client + Bedrock domain service — production-critical, currently excluded | athena-platform | 1.5d
Add route, service & client layer tests — all 14 routes, 13 services, 3 clients at 0% | DATA-Athena-Snowflake | 7d
Remove internal/clients/ from coverage exclusions in CI | athena-platform | 0.5d
Python / LangGraph DATA-Athena-Snowflake
Testing
  • ✗ Unit test all route handlers with FastAPI TestClient + mocked services
  • ✗ Unit test all 13 services — mock Snowflake sessions and clients
  • ✗ Unit test core multi-agent framework: AgentWorkflow, AgentStrategy, node/edge composition
  • ✗ Add per-branch tests for all 11 stimulus branches (mock LLM responses with fixtures)
  • ✗ Unify CI into a single pytest run — replace fragmented per-domain workflows
  • ✗ Enable pytest-cov with 60% minimum threshold enforced in CI
Architecture
  • ✗ Circuit breaker in AgentWorkflow — isolate node failures, prevent cascade
  • ✗ Enable OpenTelemetry tracing — already in codebase, just commented out
  • ✗ Replace hardcoded thresholds (15% drop, 60–80 min windows) with configurable params
  • ✗ Add LLM prompt injection guards — sanitize user inputs before system prompts
  • ✗ Standardize tool definition — unify @create_tool vs. manual; add versioning
  • ✗ Centralize config — replace scattered load_dotenv with Pydantic Settings schema
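To make the circuit-breaker item above concrete, here is a minimal sketch of the pattern in plain Python. It is not code from DATA-Athena-Snowflake: the class name `CircuitBreaker`, its parameters, and the idea of wrapping each workflow node call in `breaker.call(...)` are assumptions for illustration. After a configurable number of consecutive failures the breaker opens and rejects calls immediately; once a cool-down has elapsed it lets one trial call through and closes again only if that call succeeds.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: fall through and allow a single trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

One breaker instance per node keeps a failing node from dragging down the others, which is the isolation the item asks for.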
Go / Gin athena-platform
Testing
  • ✗ Add tests for all 18 V2 handlers — entire new API version at 0%
  • ✗ Test Bedrock client & service — excluded from coverage, production-critical
  • ✗ Raise CI threshold 20% → 60%; remove internal/clients/ exclusion
  • ✗ Bootstrap integration test with testcontainers-go — verify DI graph
  • ✗ Benchmark tests for /ml/predict, /feedback, experiment assignment
  • ✗ Contract tests for Snowflake & Bedrock APIs — catch schema drift early
Architecture
  • ✗ Event-driven model registry cache invalidation — remove 24h stale assignment risk
  • ✗ Experiment context middleware — auto-propagate session/experiment IDs per request
  • ✗ Abstract *gin.Context from controllers — transport-agnostic, easier to test
  • ✗ Deploy ATHIA_STAGE_OUTCOMES + ATHIA_SESSION_SUMMARY Snowflake tables
  • ✗ SageMaker model warm-up — cold starts can breach p95 < 200ms target
  • ✗ Production Grafana dashboards + alerts — config exists locally, not deployed
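The benchmark and warm-up items above both come down to checking measured latencies against the p95 < 200ms target. The sketch below shows one way to do that check using the nearest-rank percentile definition. It is written in Python for illustration even though athena-platform is a Go service, and the function names and sample latencies are assumptions.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_latency_target(latencies_ms: Sequence[float], target_ms: float = 200.0) -> bool:
    """True when the p95 of the measured latencies is strictly below the target."""
    return percentile_nearest_rank(latencies_ms, 95) < target_ms


if __name__ == "__main__":
    # Hypothetical measurements: mostly warm requests plus a few slow cold starts.
    warm = [40.0] * 90
    cold = [900.0] * 10
    print("warm only:", meets_latency_target(warm))
    print("with cold starts:", meets_latency_target(warm + cold))
```

With 10% of requests hitting a cold start, the p95 is dominated by the slow requests, which is why warm-up matters for this target.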

Full Priority Order

Priority | Action | Repo | Effort
Critical | Build retry optimization stimulus (P-05) | DATA-Athena-Snowflake | 3d
Critical | Complete Strategy Director matcher + ranker | DATA-Athena-Snowflake | 2d
Critical | Add tests for all 18 V2 REST handlers | athena-platform | 4d
Critical | Add tests for Bedrock client + service | athena-platform | 1.5d
Critical | Add route, service & client tests (all at 0%) | DATA-Athena-Snowflake | 7d
Critical | Remove internal/clients/ from coverage exclusions | athena-platform | 0.5d
High | Multi-agent framework + branch tests (11 branches) | DATA-Athena-Snowflake | 6d
High | Circuit breaker in AgentWorkflow | DATA-Athena-Snowflake | 2d
High | Enable OpenTelemetry tracing | DATA-Athena-Snowflake | 1.5d
High | Raise CI coverage threshold to 60% | athena-platform | 0.5d
High | Bootstrap integration test (testcontainers) | athena-platform | 1.5d
High | Event-driven model registry cache invalidation | athena-platform | 1.5d
High | Deploy ATHIA_STAGE_OUTCOMES + SESSION_SUMMARY tables | athena-platform | 1d
High | SageMaker model warm-up (latency target risk) | athena-platform | 1d
Medium | Adaptive thresholds (replace hardcoded values) | DATA-Athena-Snowflake | 2d
Medium | Experiment context middleware | athena-platform | 2d
Medium | Production Grafana dashboards + alert rules | athena-platform | 2d
Medium | Benchmark tests for hot endpoints | athena-platform | 1d
Medium | Unified CI test suite + coverage enforcement | DATA-Athena-Snowflake | 1.5d
📝Daily Updates
Aidaptive engineering activity across both codebases
2026-04-15
Goal: Have both Log Reg and DNN pipelines running regularly in dev and prod, pushing models to S3 buckets.
Engineer | Work Done
Rakesh | Changed training pipelines to use GPU instances. Implemented deeper ClearML integration in training pipeline — extracting detailed metrics from each training run into ClearML for monitoring and debugging.
2026-04-14
Goal: Get ClearML integration working in dev and deploy DNN pipeline end-to-end.
Engineer | Work Done
Rakesh | Spent ~30 hours debugging dev servers with ClearML integration — blocked by access issues. Worked with team to resolve access; dev pipeline now working correctly. Attempted DNN pipeline deployment — ran for 3 hours and failed. Long turnaround time makes iteration impractical without GPU.
Blockers:
  • GPU needed — training requires GPU instance; CPU-based runs too slow for practical iteration (3hr+ per attempt)
  • AWS access for Rakesh — need direct access to AWS resources (console/CLI) to debug and iterate efficiently
2026-04-11
Goal: Have both model techniques (DNN and Log Reg) running e2e daily, pushing models in dev and prod.
Engineer | Work Done
Rakesh | Implemented DNN pipeline Terraform; deploying in dev. Need to get PRs submitted and merged to main/qa so pipelines can run via CI/CD.
2026-04-10
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
Engineer | Work Done
Rakesh | Debugging why ClearML is not registering all metrics from each training run. Next: deploy DNN model and fix production integration to ensure every training run model reaches production automatically.
2026-04-07
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
Engineer | Work Done
Rakesh | Fixing Volaris training pipeline run in dev environment. Helping team with prod deployment.
2026-04-05
Goal: Both training pipelines (dolphin SageMaker + Volaris smart router) running daily with ClearML tracking.
Engineer | Work Done
Rakesh | Deployed e2e SageMaker dolphin pipeline (PreprocessData ✅, TrainModel ✅, EvaluateModel fix pushed). Deployed Volaris smart router daily training infra. ClearML integration confirmed working. Implemented Volaris pipeline CLI with Snowflake connection. Added 13 tests for setup_pipeline. Fixed multiple SageMaker issues: pipeline name mismatch, model.save(), RegisterModel, FrameworkProcessor for eval.
Blockers:
  • PR #1132 (DATA-Athena-Snowflake) needs approval to merge to qa — blocks CodePipeline deployment
  • CodeDeploy DeployEC2 stage failing — scripts need debugging after merge
  • SSO lacks sagemaker:CreatePipeline for local runs
Pending PRs:
  • #1132 DATA-Athena-Snowflake → qa (ClearML, Lambda handler, SageMaker fixes)
  • #26 terraform-athia → main (Dolphin SageMaker pipeline infra)
  • #27 terraform-athia → main (Volaris daily training infra)
2026-04-04
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
Engineer | Work Done
Rakesh | Deployed e2e SageMaker training pipeline via Spacelift. Consolidated PRs #24+#25 → #26. Fixed Spacelift project_root, cleaned orphaned state. Enabled daily training for both pipelines. Set up ClearML creds in Secrets Manager + EC2. Created Spacelift stack for Volaris. Renamed model-artifacts → volaris-model-artifacts.
2026-04-02
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
Engineer | Work Done
Rakesh | Fighting Terraform and Spacelift configuration issues. ClearML successfully deployed in dev environment.
2026-04-01
Goal: Have end to end training pipeline for TFX running on AWS and integrated into Deuna stack for daily run.
Engineer | Work Done
Rakesh | Deployed ClearML using Terraform and Spacelift in dev environment and handed over to team. Full data analysis of Snowflake data with Rene — generated list of suggestions for team, published on the metrics dashboard.
Rene | Full data analysis of Snowflake data with Rakesh — generated list of suggestions for team.
2026-03-31
Goal: Deploy entire training platform using TF and Spacelift.
Engineer | Work Done
Rakesh | Deploying entire training platform end to end. Learning Spacelift for infra deployment — finally got access. Deploying ClearML to monitor all training. Wrote analysis script that monitors model performance and drives the metrics dashboard directly from Snowflake — better for analysis and generating insights.
Rene | Analyzing model performance with past data. Waiting for experiment to be enabled again — current data not significant enough.
2026-03-30
Goal: Have clear idea of metrics for measuring model performance and deploy entire training platform.
Engineer | Work Done
Rakesh | Analyzing model performance metrics; working on DNN model optimization for Volaris smart router.
2026-03-28
Goal: Train one more DNN model for Volaris smart router.
Engineer | Work Done
Rakesh | Training DNN model for Volaris smart router — also serves as an example of how to use the TFX pipeline.
Rene | Data analysis to identify patterns based on Volaris data.
2026-03-27
Goal: Get TF LR model in production.
Engineer | Work Done
Rakesh | Verified everything working in production. Helping with questions about the model. Double-checked traffic ramp-up; analyzing how to do post-launch analysis.
2026-03-26
Goal: Integrate new model in serving stack now that everything is working end to end.
Engineer | Work Done
Rakesh | Integrated model in serving stack with sidecar approach, loading models from S3/EFS. Removed all parallel serving libraries no longer needed in training directory.
2026-03-25
Goal: Deploy everything in production for Volaris to get real data flowing.
Engineer | Work Done
Rene | Created and incorporated AMEX model, added to the serving mix.
Rakesh | Built AMEX model with Rene. Working with Deuna team on deploying everything in production for Volaris.
2026-03-24
Goal: Have TF model up and running in production integrated with the Go API.
Engineer | Work Done
Rakesh | Analyzing code with Naoki to determine serving approach — sidecar vs deploying servomatic. Building production flow to save and load models from S3 & deploying servomatic binary.
Naoki | Evaluating sidecar vs servomatic deployment for integrating TF model with the Go API.
Rene | Continuing to iterate on the TF model and data analysis. Converting LR model to TF format for use in servomatic binary for online eval.
2026-03-23
Goal: Hook this model into production to have everything connected end to end.
Engineer | Work Done
Rene | First regression model trained on Volaris data — evaluating quality in offline mode, initial results look promising.
Rakesh | Continuing to analyze approach to serve the model as servomatic with Naoki.
Naoki | Analyzing serving architecture to connect trained model via servomatic in production.
2026-03-20
Goal: Have one model trained on Volaris data.
Engineer | Work Done
Rene | Iterating on data shape analysis and building first model version.
Rakesh | Working with Rene on first model; analyzing serving code with Naoki to plan production integration.
Naoki | Analyzing serving code with Rakesh to determine how to connect model in production.
2026-03-19
Goal: Have first model ready at least in offline mode in the coming days.
Engineer | Work Done
Rakesh | First analysis of the data with Rene — analyzing best approach to build processor selector model.
Rene | Started on first model based on current understanding of data and features.
Naoki | Working with Rene on integrating S3 file loading into TFX data loader for e2e training and eval.
Blockers: None — have a few questions to confirm our understanding, will batch and ask together
2026-03-18
Goal: Have first ML model for selecting the right processor for every Volaris transaction.
Engineer | Work Done
Rene | Looking at data shape for Volaris to train first processor selector model.
Kedar | Working on data pipeline.
Naoki | Continuing to set up good practices (code quality, CI, testing patterns).
Rakesh | Writing smartrouter service.
Blockers cleared: AWS access granted · Code review done, all code merged to qa
2026-03-16
Goal: Submit everything and build data pipeline to extract Volaris data from Snowflake.
Engineer | Work Done
Rakesh | Iterated on experiment and metrics framework to make everything work locally and in tests.
Naoki | Iterated on improving code and ramping up.
Kedar | Looking at feature extraction from Snowflake.
Rene | Working on simple first model.
Blockers: PR review pending · AWS access for POC server (from 2026-03-13)
2026-03-15
Goal: Have training platform implemented in shape and be ready for feature engineering and training for Volaris model.
Engineer | Work Done
Rakesh | Added Evaluation Service (uses Model Service + Feature Service to evaluate TensorFlow models). Added e2e tests for all 3 services. Added experiment and metrics framework to track all training pipelines. Demo training pipeline working end to end. PR waiting for review.
Blockers: PR review pending · AWS access for POC server (from 2026-03-13)
2026-03-14
Engineer | Work Done
Rakesh | Built foundational services: Model Service and Feature Service, with tests and scaffolding to support TensorFlow-trained models.
2026-03-13
Goal: Get everything running tests regularly and pushing to dev server automatically — getting comfortable with the current stack.
Engineer | Work Done
Naoki | Fixed broken tests to get everything running locally. Looking at setting up automated deployment in dev environment for services.
Kedar | Got repo and environment access figured out. Looking into Snowflake data schema.
Rene | Got repo and environment access figured out. Looking at training pipeline code.
Rakesh | Updated deuna.aidaptive.com with latest repo analysis and refreshed task list. Synced athena-platform (v0.15.5, Triton merged).
Blocker: Waiting for AWS access to deploy on POC server
❓Open Questions
Items that need answers before effort estimates are finalized
1. Are ATHIA_PREDICTIONS / ATHIA_FEEDBACK tables populated in Deuna's Snowflake today? Confirmed ✓ 2026-02-24
   Confirmed — data is live in Deuna's Snowflake (verified 2026-02-24).

2. Are SageMaker endpoints live for processor_selector / retry_predictor?
   Or are they placeholders only? — Rakesh to confirm

3. Is there a live model in MODEL_ARTIFACTS that Deuna's payment service is calling today?
   Rakesh to confirm

4. What is the current payment volume through the routing engine?
   Minimum 1,000 transactions per variant needed for A/B test statistical validity — Ask Israel

5. Who owns the athena-platform Go repo deployments?
   Aidaptive or Deuna infra? Affects Phase 1 deployment planning — Clarify with Pablo

6. When will feature/llm-driven-ml-training (Triton IS) merge to main? New
   This PR closes G-06 and defines the production model serving backend (Triton vs. SageMaker). Its merge timeline directly sets the Phase 6 integration schedule — ask Pablo.
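The volume question (item 4, and task V-09) reduces to simple arithmetic once a daily volume is known: how many days does a variant need to collect 1,000 transactions? The sketch below works that out. The function names and the example volumes are hypothetical, and it only checks the volume threshold stated above; it is not a statistical power analysis.

```python
import math


def days_to_reach_sample(daily_volume: int, variant_share: float,
                         min_per_variant: int = 1000) -> int:
    """Days needed for a variant receiving `variant_share` of traffic
    to accumulate `min_per_variant` transactions."""
    if daily_volume <= 0:
        raise ValueError("daily_volume must be positive")
    if not 0 < variant_share <= 1:
        raise ValueError("variant_share must be in (0, 1]")
    return math.ceil(min_per_variant / (daily_volume * variant_share))


def feasible_within(daily_volume: int, variant_share: float,
                    window_days: int = 7, min_per_variant: int = 1000) -> bool:
    """True when the smallest variant reaches the minimum within the window."""
    return days_to_reach_sample(daily_volume, variant_share, min_per_variant) <= window_days


if __name__ == "__main__":
    # Hypothetical per-processor daily volumes; real numbers are still to be confirmed.
    for volume in (300, 1500, 10_000):
        days = days_to_reach_sample(volume, variant_share=0.10)
        print(f"{volume}/day at 10% treatment: {days} day(s), "
              f"within 7 days: {feasible_within(volume, 0.10)}")
```

With a 10% treatment split the treatment arm is the bottleneck, so the check should always use the smallest variant share.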

🔑Access & Blockers
Pending provisioning items
Item | Owner | Status
Snowflake access — Rakesh | Israel (Deuna) | ✓ Done (2026-02-18)
Snowflake access — Naoki | Rakesh + Naoki | ✓ Done (2026-02-19)
Code / repo access — Rakesh | Pablo (Deuna) | ✓ Done (2026-02-19)
Claude / LLM access & budget | Pablo → Farhan | ✓ Done (2026-02-19)
Code / repo access — Naoki | TBD | ✓ Done
Deuna corp accounts — Rakesh & Naoki | TBD | Pending
Claude Code credits — Rakesh & Naoki | — | Not needed
Deploy ATHIA_STAGE_OUTCOMES + ATHIA_SESSION_SUMMARY in Snowflake | Rakesh | ✓ Done (feat/ATH-0000)
Build retry_optimization_requested workflow | Rakesh | Pending
Quick Links
Service | URL | Details
AWS Console (SSO) | deunaio.awsapps.com | Deuna AWS account access
Snowflake | vltaxpw-rmontes.snowflakecomputing.com | Account: VLTAXPW-RMONTES · DB: PAYMENT_ML · Warehouse: PAYMENT_ML · Read-only
Athia Experiments Dashboard | insights.deuna.com | Model performance data for processor selector experiments
ClearML (Prod) | athia-ml.deuna.io | ML experiment tracking & training monitoring — production
ClearML (Dev) | athia-ml.dev.deuna.io | ML experiment tracking & training monitoring — dev environment
Spacelift | duna-e-commmerce.app.spacelift.io | Infrastructure governance & Terraform deployment
Terraform Repo | github.com/DUNA-E-Commmerce/terraform-athia | All Athia infrastructure as code
Development Rules
Rule | Details
AWS Resource Tags | All AWS resources must include: CreatedBy=aidaptive, ServiceName=smartrouter, Environment=POC
Infrastructure as Code | All infrastructure via Terraform only — no manual AWS console resource creation
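The tagging rule above is easy to check mechanically. The sketch below compares a resource's tags against the three required key/value pairs. The function name `tag_violations` and the idea of running it in review or CI are assumptions; how the tags are read from Terraform or AWS is out of scope here.

```python
from typing import Dict, List, Mapping

# Required tags, taken from the AWS Resource Tags rule.
REQUIRED_TAGS: Dict[str, str] = {
    "CreatedBy": "aidaptive",
    "ServiceName": "smartrouter",
    "Environment": "POC",
}


def tag_violations(tags: Mapping[str, str]) -> List[str]:
    """Return a human-readable list of missing or mismatched required tags."""
    problems: List[str] = []
    for key, expected in REQUIRED_TAGS.items():
        actual = tags.get(key)
        if actual is None:
            problems.append(f"missing tag {key}")
        elif actual != expected:
            problems.append(f"tag {key}={actual!r}, expected {expected!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical resource tags for illustration.
    example = {"CreatedBy": "aidaptive", "ServiceName": "smartrouter", "Team": "ml"}
    for problem in tag_violations(example):
        print(problem)
```

Extra tags are allowed; only the three required pairs are enforced.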
Decisions Log
Date | Decision | Rationale | Made By
2026-03-24 | Serving migrated from DATA-Athena-Snowflake to athia-model-server sidecar in athena-platform | Clean separation — training repo (Python) vs serving repo (Go + Python sidecar) | Rakesh
2026-03-13 | Adopted TensorFlow ecosystem (TFX, TF Serving, TFDV, TFMA, TF Transform) for all ML work | Replaces Snowflake ML / XGBoost / scikit-learn. Unified training → validation → serving pipeline with production-grade tooling | Rakesh
2026-03-13 | Added Phase 0 — 7 service shells + 3 design tasks (Rakesh) before Volaris feature work | Service architecture: Data Pipelines, Feature Service, Training Pipelines, Model Mgmt, Eval Service, Evaluation Framework, Experiment System | Rakesh
2026-03-13 | Both repos switched to main branch — feat/ATH-0000 and Triton branches both merged | All ML training pipeline and Triton serving code now on main; no more feature branch tracking needed | Rakesh
2026-03-13 | Triton branch merged to main — confirms deployment architecture | feature/llm-driven-ml-training merged; Triton IS, ExperimentService, shadow mode now in production codebase | Deuna Engineering
2026-02-18 | Latency target updated: p95 <50ms → p95 <200ms | Revised from original SOW spec | Rakesh (w/ Pablo)
2026-02-19 | Phase 1 target merchant set to Volaris (not Cinépolis) | Volaris has known PSPs (Worldpay ID:76, MIT ID:85, Elavon, Amex); Cinépolis only shows Cybersource gateway — processor unknown | Mark Walick
2026-02-20 | Repo analysis scoped to branch feat/ATH-0000-athia-ml-llm-schema-discovery (not main) | This branch contains the active ML platform development; main does not reflect current capabilities | Pablo
2026-02-26 | athena-platform feature/llm-driven-ml-training (Triton IS) identified as the production model serving path | Triton IS + shadow mode + ExperimentService provides complete training→serving pipeline; replaces manual SageMaker endpoint registration; closes G-06 | Rakesh
📎References & Documents
Key links and documents for this project
📄 Project Plan — v20 (Latest)
2026-03-16 · ML platform services + v0.15.5 sync, ~89d saved, ~15.5d remaining
📊 Data Dictionary
Google Sheets — Deuna data field definitions
🗺️ Athia Data Model
LucidChart — system architecture diagram
🐍 DATA-Athena-Snowflake
LLM analytics platform (Python / LangGraph)
🐹 athena-platform
ML serving + A/B testing platform (Go / Gin)
🗄️ Snowflake Schema Reference
2026-02-18 · Extracted from PAYMENT_ML database
Internal Only