Smartrouter AI/ML Integration
Planning phase complete. Now building the Athia AI/ML integration into Deuna's payment routing service — focused on the Volaris merchant task list with 3 engineers.
What This Is
The planning phase is complete. We are now in active development — implementing the full Volaris task list to integrate Athia AI/ML into Deuna's payment routing service. A team of 3 engineers is focused on delivering the 65-task Volaris plan across eight sub-phases (0–7).
The Problem
- Static routing rules → suboptimal acceptance rates
- No intelligent PSP failover during outages
- Retries not optimized by timing, route, or message
- No feedback loop — outcomes don't improve future decisions
The Solution (5 Use Cases)
- P-01 — PSP outage detection & failover
- P-02 — Optimize existing static routing rules
- P-03 — Per-transaction processor ranking
- P-04 — Authorization message manipulation
- P-05 — Retry optimization
Phase 1 Target — Volaris Merchant
✅ Already Built (Reduces Effort)
- ML serving API — processor_selector & retry_predictor endpoints live (v0.15.5)
- Model registry — artifact + experiment + variant tables + CRUD API
- A/B testing — auto-winner with statistical guardrails (p<0.05, 7 day min, 1000 samples)
- Snowflake feedback loop — PREDICTIONS + FEEDBACK + TRAINING_DATASET + STAGE_OUTCOMES tables active
- Full ML training pipeline — LR, RF, XGBoost via Snowflake ML (G-01, G-04, G-07, G-09, G-11, G-12)
- Triton IS + shadow mode + ExperimentService API — G-06 done
- OTEL metrics pipeline + Grafana CloudWatch dashboard — G-05 done (v0.15.1)
- Dynamic per-model encoders + A/B model version routing — G-13 done (v0.15.2+)
- Processor Selector v2 — 54-feature XGBoost encoder, fully tested
- ML platform services (PR pending) — Model Artifact, Feature, Evaluation, TFX, Experiment Tracking
- CI improvements — black, isort, uv migration, pytest fixtures (G-03 partial)
- Volaris training pipeline — data ingestion → preprocessing → trainer → evaluator → promoter, S3 artifact storage
- Volaris TF smart router sidecar — 3 TF SavedModels, 140-feature encoder, Go API integration with feature flag
🔨 Still Needs to Be Built (~15.5d)
- Feature store real-time serving layer — G-08 (~3.5d remaining)
- Orchestration DAG (Airflow/Prefect) — G-02 (~2d remaining)
- CI/CD deployment pipeline — G-03 (~3d remaining)
- Lineage tracking — G-10 (~5.5d)
- Rollback API — G-14 (~0.5d remaining)
- Retry optimization workflow (P-05) — stimulus still missing
- Strategy Director — matcher & ranker nodes are placeholder code
DATA-Athena-Snowflake Testing
~25% coverage · Maturity 3/5
Now training-only (serving removed via PR #1114). Volaris training pipeline complete. 30 tests. CI improved: black, isort enforcement, uv migration. Core multi-agent framework still largely untested.
athena-platform Testing
~35% coverage · Maturity 4/5
PR #215 — Volaris TF smart router sidecar. 3 TF SavedModels, 140-feature encoder, Go API integration with feature flag. 221 Python + 13 Go tests. V2 API (18 handlers) and Bedrock client still at 0%. CI threshold only 20%.
Current Blockers
| Item | Owner | Status |
|---|---|---|
| Deuna corp accounts — Rakesh & Naoki | TBD | Not needed |
| Code / repo access — Naoki | TBD | Done |
| Are ATHIA_* tables live in Deuna's Snowflake? | Israel | Confirmed ✓ |
| Are SageMaker endpoints live today? | Rakesh | Open question |
| Payment volume through routing engine? | Israel | Open question |
| GPU instance for training — CPU too slow (3hr+ per run) | Deuna | Blocking |
| AWS resource access for Rakesh (console/CLI) | Deuna | Blocking |
Purpose
Assess the effort required to integrate Athia AI/ML into Deuna's payment routing. Produce a clear work breakdown and estimate before any implementation begins.
Phase 0 Deliverables
- Full schema & data understanding
- Effort estimate per workstream
- Risks and open questions resolved
- Recommended build order
Long-Term Success
- Measurable approval lift
- Stability during PSP outages
- Latency: p95 < 200ms
- Closed feedback/learning loop
✅ In Scope (Phase 0)
- Understand Deuna's data, schema, routing rules
- Assess Athia platform gaps vs. what's needed
- Size effort for P-01 through P-05 use cases
- Identify all dependencies, blockers, risks
🚫 Out of Scope (Phase 0)
- Any implementation or code delivery
- 3DS optimization (Phase 2)
- User-facing messaging (Phase 3)
- Installment optimization
Phase 0 — Assess Level of Effort Done ✓
2 days · $6K budget · Completed 2026-02-19
Nail down all the work required. Produce a detailed estimate with confidence before committing to delivery.
Phase 1 — Model in Production Pending
2 weeks · Core delivery
Model running in production for 2 processors with basic feature store. Target merchant: Volaris.
Phase 2 — Monitoring + Experimentation Pending
Week 3 · Add monitoring and integrate with A/B experimentation infrastructure.
Phase 3 — Drift Detection, CI/CD, Ramp-Up Pending
TBD · Drift detection, CI/CD pipeline, experiment ramp-up, additional model techniques.
Phase 1 Delivery Plan — Volaris Merchant
65 tasks · ~64 person-days · 5–6 weeks with 3 engineers · TensorFlow ecosystem
| Sub-Phase | Focus | Tasks | Effort | Key Notes |
|---|---|---|---|---|
| 0 — Service Architecture | Design + scaffold 7 TF service shells | 11 | 14.5d | NEW Rakesh: architecture, API contracts, TF integration plan. Team: 7 service shells |
| 1 — Discovery & EDA | Understand Volaris data | 10 | 7.5d | Approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, A/B sample size |
| 2 — Feature Engineering | Build ML features (Feature Service + TF Transform) | 9 | 8.5d | Card BIN/brand, RFM, retry context, rolling health scores, Amex hard-rule bypass |
| 3 — Model Development | Train models (tf.keras) | 8 | 9d | TF DNN, wide-and-deep, TF Decision Forests via Training Pipeline + Eval Service |
| 4 — Outage Detection | P-01: failover for 4 PSPs | 6 | 6.5d | Rolling health score (5–15 min window), auto-failover, recovery detection (1–2% sampling), alerts |
| 5 — Message Manipulation | P-04: CIT/MIT experiment | 5 | 5.5d | CIT/MIT audit, approval delta by toggle × processor × card type, new athena-platform endpoint, A/B test |
| 6 — Platform Integration | Register models, wire Deuna | 8 | 6.5d | Triton ✓ ExperimentService one-call API + built-in shadow mode. Requires Deuna eng coordination. |
| 7 — Monitoring & Feedback | Dashboards, retraining, review | 8 | 6d | 2 tasks already done: ATHIA tables confirmed live, ATHIA_STAGE_OUTCOMES deployed |
| Total | | 65 | ~64d | Phase 0 first (Rakesh design ∥ team scaffolding); Phases 1–2 sequential; 3–5 parallel after Phase 2 |
Deuna Team
| Name | Role |
|---|---|
| Reks | CEO & Co-Founder |
| Chema | Co-Founder |
| Pablo | CTO — Executive Sponsor |
| Israel | Data POC — Snowflake & Data Access |
| Farhan | Claude / LLM Access POC |
| Mark Walick | Product Management Lead |
Aidaptive Team
| Name | Role |
|---|---|
| Rakesh | CEO |
| Naoki | Solutions Architect |
| Rene | ML Engineer |
| Kedar | Backend / Data Engineer |
Phase 1 Target Merchant: Volaris Decided 2026-02-19
Volaris selected over Cinépolis. Known PSPs: Worldpay (ID: 76), MIT (ID: 85), Elavon (cards), Amex (Amex cards) — 4 processors total with routing policies per currency. Cinépolis deferred: only shows Cybersource (a gateway), actual processor unknown.
Outage Detection & Failover
Detect PSP failures via persistent timeout codes. Auto fail-over and fail-back using random sampling of downed PSP to detect recovery.
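The fail-over/fail-back loop described above can be sketched in a few lines. This is an illustrative Python sketch, not Deuna's implementation: the window length, down threshold, recovery-win counter, and the `PSPHealth` class name are all assumptions chosen for clarity.

```python
import time
from collections import deque

# Illustrative constants — the real system uses a 5–15 min window and
# 1–2% recovery sampling; exact values here are assumptions.
WINDOW_SECONDS = 10 * 60        # sliding window for the health score
DOWN_THRESHOLD = 0.5            # health score below this marks the PSP down
RECOVERY_WINS_NEEDED = 3        # consecutive sampled successes before fail-back

class PSPHealth:
    def __init__(self):
        self.outcomes = deque()       # (timestamp, success) pairs in the window
        self.down = False
        self.consecutive_wins = 0

    def record(self, success, now=None):
        """Record one attempt outcome; while down, callers would only record
        the 1–2% of traffic sampled to probe the degraded PSP."""
        now = now or time.time()
        self.outcomes.append((now, success))
        # Evict outcomes that fell out of the sliding window.
        while self.outcomes and self.outcomes[0][0] < now - WINDOW_SECONDS:
            self.outcomes.popleft()
        if self.down:
            self.consecutive_wins = self.consecutive_wins + 1 if success else 0
            if self.consecutive_wins >= RECOVERY_WINS_NEEDED:
                self.down = False         # fail-back: PSP considered recovered
        elif self.score() < DOWN_THRESHOLD:
            self.down = True              # fail-over: router skips this PSP

    def score(self):
        if not self.outcomes:
            return 1.0                    # no recent data: assume healthy
        return sum(ok for _, ok in self.outcomes) / len(self.outcomes)
```

A router would consult `health.down` when ranking processors and route around any PSP currently marked down.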
Routing Optimizer
Optimize Deuna's existing static routing rules based on historical outcomes. Build on existing rules engine rather than starting from scratch.
Per-Transaction Route Selection
Rank top 3 payment processors per transaction in real time based on prior outcomes, card signals, and merchant context.
Message Manipulation
Toggle CIT/MIT, AVS, MCC variables in authorization request messages. Provide top 3 configuration recommendations per transaction.
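The "top 3 configuration recommendations" idea can be illustrated by enumerating toggle combinations and ranking them by a predicted approval score. Everything here is hypothetical: the toggle names, the candidate values, and the toy scorer stand in for the real model.

```python
from itertools import product

# Hypothetical toggle space — names and values are assumptions for illustration.
TOGGLES = {
    "cit_mit": ["CIT", "MIT"],
    "send_avs": [True, False],
    "mcc_override": [None, "4511"],   # airline MCC used as an example
}

def top_configs(score_config, k=3):
    """Rank every toggle combination by a scoring function and keep the top k."""
    combos = [dict(zip(TOGGLES, values)) for values in product(*TOGGLES.values())]
    return sorted(combos, key=score_config, reverse=True)[:k]

# Toy stand-in for the real approval model: prefers CIT with AVS enabled.
toy = lambda c: (c["cit_mit"] == "CIT") * 0.6 + c["send_avs"] * 0.3
best = top_configs(toy)
```

In production the scorer would be a trained model conditioned on processor and card type, and the recommendations would be attached per transaction.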
Retry Optimization
Optimize when, how, and where to retry declined transactions. MIT/subs focused. Enterprise darktime reduction. Delayed retry based on processor reputation.
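One piece of this, "delayed retry based on processor reputation", can be sketched as a simple backoff curve: the worse a processor's recent reputation, the longer the wait before retrying on it. The decay shape and delay bounds below are illustrative assumptions, not tuned values.

```python
def retry_delay_seconds(reputation, base=30, max_delay=3600):
    """Map a processor reputation in [0, 1] to a retry delay.
    Healthy processors (reputation near 1) are retried quickly;
    degraded ones back off toward max_delay. Constants are assumptions."""
    reputation = min(max(reputation, 0.0), 1.0)
    # Quadratic curve: delay grows faster as reputation degrades.
    return base + (max_delay - base) * (1.0 - reputation) ** 2
```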
Connection
VLTAXPW-RMONTES
Database: PAYMENT_ML
Access: Read-only
ABTESTING Schema
Denormalized flat join of all views. Best starting point for EDA. No complex joins needed.
ALL_VIEWS_FLAT · ALL_PAYMENT_EVENTS_FLAT
SOURCES Schema
15 clean views: orders, payments, attempts, events, user profiles, routing logs, merchant rules, airline data.
| View | Why It Matters | Use Cases |
|---|---|---|
| VW_ATHENA_PAYMENT_ATTEMPT | Full retry chain per payment; processor, error codes, hard/soft decline, DYNAMIC_ROUTING_DETAIL JSON | P-03, P-05 |
| VW_SMART_ROUTING_ATTEMPTS | Live routing engine log: algorithm type, latency, skip reasons — direct latency signal for p95 <200ms | P-01, P-02 |
| VW_ROUTING_MERCHANT_RULE | Existing static rules engine — foundation for routing optimizer. SHADOW_MODE column suggests testing infrastructure exists. | P-02 |
| ABTESTING.ALL_VIEWS_FLAT | Everything joined in one table — best for initial EDA | EDA |
| Feature Group | Key Columns | Use |
|---|---|---|
| Retry history | NUM_ATTEMPTS_ORDER, PREVIOUS_ORDER_ERROR_CODE, AVG_SEC_BETWEEN_PAYMENT_ATTEMPS | P-05 |
| Error signals | ERROR_CODE, ERROR_CATEGORY, HARD_SOFT | P-03, P-05 |
| Card signals | CARD_BIN, CARD_BRAND, BANK, CARD_COUNTRY | P-03 |
| User behavior | TARGET_USER_FRAUD_RATE_COHORT, TOTA_MINUTES_BROWSING, RFM values | P-03 |
| Message config | MCI_MSI_TYPE, ORDER_MCI_MSI_TYPE, PAYMENT_ATTEMPT_METHOD_TYPE | P-04 |
| Geo & Device | ORDER_COUNTRY_CODE, TARGET_USER_BROWSER, TARGET_USER_DEVICE | P-03 |
main · DATA-Athena-Snowflake now training-only (PR #1114) · athena-platform PR #215 — Volaris TF smart router sidecar
DATA-Athena-Snowflake
github.com/DUNA-E-Commmerce/DATA-Athena-Snowflake · branch: main (feat/ATH-0000 merged — 182 files, +29K lines)
✅ Key Finding (Updated 2026-03-24)
Now training-only — serving removed via PR #1114, migrated to athia-model-server sidecar in athena-platform. Volaris training pipeline complete: data ingestion → preprocessing → trainer → evaluator → promoter, with S3 artifact storage. 30 tests. Clean separation: training repo (Python) vs serving repo (Go + Python sidecar).
LLM Workflows (11 stimuli)
| Workflow | Status |
|---|---|
| Acceptance rate analysis | Done (v0_1, v1_0) |
| Fraud card analysis | Done |
| Metrics anomaly detection | Done |
| Chatbot / data analyst | Done |
| Strategy generation director | Partial — Matcher has exit() |
| Cost optimization | Early stage |
| Retry optimization | Missing (P-05 gap) |
ML Training Platform (now on main)
| Service | Status |
|---|---|
| Training pipeline (Snowflake ML) | Done |
| LLM training orchestrator (GPT-4 + RAG) | Done |
| LLM experiment designer (GPT-4 + RAG) | Done |
| Data quality validator | Done |
| Model registry (auto table creation) | Done |
| Feature extractor | Done |
| Feedback collector (webhook/API/batch) | Done |
| Schema discovery (ChromaDB) | Done |
| Athia event ingestion | Done |
| Model deployer → athena-platform | Partial — EFS export done; API integration manual |
Architecture — Multi-Agent Pattern
FastAPI + LangGraph · Stimulus-response orchestration · LLM backends: Claude (primary), GPT-4 (fallback)
Request → StimulusRegistry → OrchestratorWorkflow → Branch (DAG of Nodes)
→ AgentWorkflow (LangGraph StateGraph) → Response
11 stimuli: acceptance_rate_analysis · fraud_card_analysis · metrics_anomaly
user_question · data_analyst · researcher_assistance · deep_exploration
element_edition · knowledge_expert · strategy_generation · cost_optimization
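The stimulus-to-handler dispatch at the front of this flow can be sketched as a plain registry. The real system is FastAPI + LangGraph; the registry API below (register/dispatch) is an assumption for illustration, with one handler named after a real stimulus.

```python
# Minimal sketch of a stimulus registry — not the actual Athia API.
class StimulusRegistry:
    def __init__(self):
        self._handlers = {}

    def register(self, stimulus):
        """Decorator that binds a handler function to a stimulus name."""
        def wrap(fn):
            self._handlers[stimulus] = fn
            return fn
        return wrap

    def dispatch(self, stimulus, payload):
        try:
            handler = self._handlers[stimulus]
        except KeyError:
            raise ValueError(f"unknown stimulus: {stimulus}")
        # In the real flow this hands off to OrchestratorWorkflow -> branch DAG.
        return handler(payload)

registry = StimulusRegistry()

@registry.register("acceptance_rate_analysis")
def acceptance_rate_analysis(payload):
    return {"stimulus": "acceptance_rate_analysis",
            "merchant": payload.get("merchant")}
```

The missing `retry_optimization_requested` stimulus (P-05 gap) would slot in as one more registered handler.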
End-to-End Training Flow
POST /training/run/processor_selector → Schema Discovery (ChromaDB index) → Data Quality Validation (temporal bias · class balance · outlier · concept drift) → LLM Training Orchestrator (GPT-4 + RAG → RETRAIN_NOW / SCHEDULED / SKIP) → LLM Experiment Designer (7 experiments, simple→complex) → Snowflake ML Training (LR, RF, XGBoost · 80/20 temporal split) → Best model selected by F1 score → Export to EFS + athena-platform payload prepared → Results stored in ML_TRAINING_RUNS
Training Pipeline Architecture
Pipeline flow: Training Decision → Data Prep → Feature Extraction → Validation → Experiment Design → Training → Model Selection → Deployment
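The step chain above can be expressed as composable functions passing a shared state. These are stubs standing in for the real services named in the table below; the dict-based state and the placeholder F1 scores are purely illustrative.

```python
# Stub pipeline mirroring the flow: decision -> features -> train/select.
def training_decision(state):
    # LLMTrainingOrchestrator stand-in: RETRAIN_NOW / SCHEDULED / SKIP.
    state["decision"] = "RETRAIN_NOW"
    return state

def extract_features(state):
    # FeatureExtractor stand-in: creates a training dataset view.
    state["dataset"] = f"TRAINING_DATASET_{state['model_type'].upper()}"
    return state

def train_and_select(state):
    # Placeholder (model, F1) results; best model is selected by F1 score.
    runs = [("lr", 0.71), ("rf", 0.78), ("xgboost", 0.83)]
    state["champion"] = max(runs, key=lambda r: r[1])[0]
    return state

def run_pipeline(model_type):
    state = {"model_type": model_type}
    for step in (training_decision, extract_features, train_and_select):
        state = step(state)
        if state.get("decision") == "SKIP":
            break   # the orchestrator can short-circuit the whole run
    return state
```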
Services
| Service | Purpose |
|---|---|
| TrainingPipeline | Full training execution (plan/run/deploy) |
| LLMTrainingOrchestrator | LLM + RAG decision engine (RETRAIN_NOW / SCHEDULED / SKIP) |
| LLMExperimentDesigner | Designs 5–10 experiments using GPT-4/Claude with RAG |
| ModelDeployer | Exports to EFS, registers deployment, creates canary config |
| TrainingPlanner | Dry-run mode ("terraform plan" for ML) |
| FeatureExtractor | Auto-extracts features, creates training dataset views |
| DataQualityValidator | Schema, statistical, temporal bias, drift validation |
| FeedbackCollector | Webhook, API polling, batch feedback collection |
| ModelRegistry | Model CRUD, prediction/feedback schema management |
| SchemaDiscovery | Auto-discovers training tables via LLM |
| LLMProvider | Unified Claude/GPT-4 interface with auto-fallback |
API Endpoints
| Endpoint | Purpose |
|---|---|
| POST /api/v1/training/plan/{model_type} | Dry-run plan |
| POST /api/v1/training/run/{model_type} | Execute training |
| POST /api/v1/training/decision/{model_type} | LLM decision |
| POST /api/v1/experiments/design/{model_type} | Design experiments |
Remaining Gaps
- No retry LLM workflow — retry_optimization_requested stimulus still missing (P-05)
- Strategy Director matcher/ranker still incomplete — exit() placeholder in matcher (P-02)
- Deployment integration incomplete — model_deployer.py exports to EFS but does NOT call the athena-platform API (~1.5d left)
- No data lineage tracking — TrainingDatasetVersion not implemented (G-10)
- No rollback capability (G-14 partial — shadow mode + is_default merged to main; rollback API not built)
- SQL injection risks in schema_discovery.py, training_planner.py, athia_ingestion.py
- No circuit breakers — LangGraph node failures still cascade
- Fragile LLM JSON parsing — all services extract JSON by searching for braces (not robust)
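For the fragile JSON parsing, a sturdier approach than slicing between the first and last brace is to let the JSON decoder itself find a complete object. A minimal sketch of that fix, using only the standard library:

```python
import json

def extract_json(llm_text):
    """Return the first complete JSON object embedded in LLM output.
    Walks candidate '{' positions and uses raw_decode, which succeeds only
    on a well-formed object — unlike naive brace searching."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(llm_text):
        if ch != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(llm_text, start)
            return obj
        except json.JSONDecodeError:
            continue   # stray brace in prose; try the next candidate
    raise ValueError("no JSON object found in LLM output")
```

This tolerates stray braces in surrounding prose and fails loudly instead of returning truncated fragments.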
🧪 Testing Coverage
| Layer | Coverage | Notes |
|---|---|---|
| Metrics layer | 70–80% | Well tested — 5 files |
| Deployment / config | 50–60% | Validation tests solid |
| Model registry (new) | ~60% | Happy-path CRUD; no error cases |
| Schema discovery (new) | ~45% | Discovery + semantic search; no edge cases |
| Experiment designer (new) | ~40% | RAG + ChromaDB; LLM-dependent, not mocked |
| Training orchestrator (new) | ~35% | Happy-path; external deps not mocked |
| Feedback collector (new) | ~35% | Batch/polling skipped; no duplicates |
| Route handlers | ~0% | Still largely untested |
| Multi-agent core + 11 branches | ~5% | 1 manual test — not in pytest |
| Lambda / AgentCore entrypoints | 0% | 4 entrypoints — no tests |
Maturity: 3/5 — Moderate. ML training services are well-architected with growing test coverage after merge. Tests are still integration-only, depend on live Snowflake + OpenAI, and cover happy-path only. Multi-agent core still a black box. Not suitable for CI without mocking.
athena-platform
github.com/DUNA-E-Commmerce/athena-platform
✅ Key Finding (Updated 2026-03-24)
PR #215 — Volaris TF smart router sidecar. 3 TF SavedModels (processor_selector, retry_predictor, retry_sequence), 140-feature encoder, Go API integration with feature flag. 221 Python + 13 Go tests. Now the single serving repo — training migrated to DATA-Athena-Snowflake. Still at v0.15.5 with Triton IS, shadow mode, ExperimentService, Grafana CloudWatch dashboard, and auto versioning.
✅ Triton Branch — Merged to Main
G-06 deployment gap nearly closed (~1.5d remaining). Provides partial G-14 rollback via shadow mode. Now in main: Triton IS sidecar, ModelConversionManager (sklearn→Triton), ServiceWithTriton (readiness checks), ExperimentService (one-call experiment creation), shadow mode seed experiments for all 3 model types, max_variants constraint. 32/32 new tests passing. Active branches in review: feat/new-encoder-v2, feature/experiment-api-cleanup.
ML Inference Types (already in registry)
| Type | Maps To |
|---|---|
| processor_selector | P-03 |
| retry_predictor | P-05 |
| retry_sequence | P-05 |
| installment_optimizer | Out of scope |
Snowflake Tables
| Table | Status |
|---|---|
| ATHIA_PREDICTIONS | Active |
| ATHIA_FEEDBACK | Active |
| ATHIA_TRAINING_DATASET | Active |
| ATHIA_EXPERIMENT_LIFT | Active |
| ATHIA_STAGE_OUTCOMES | Deployed (feat/ATH-0000) |
| ATHIA_SESSION_SUMMARY | Deployed (feat/ATH-0000) |
| ATHIA_MULTI_STAGE_ANALYSIS | New (feat/ATH-0000) |
| ATHIA_MODEL_METRICS | New (feat/ATH-0000) |
| ML_MODEL_REGISTRY | New (feat/ATH-0000) |
Architecture — Clean Architecture (Go/Gin)
PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS · mTLS enforced on /api/v1/ml/predict/*
REST Handlers (V1 + V2, Gin) ← mTLS on /ml/predict/*
↓
Controllers (~30 implementations)
↓
Domain Services (44 packages) ← constructor injection throughout
↓
Repositories (43 GORM implementations) ← in-memory SQLite for tests
↓
PostgreSQL (RDS Multi-AZ) + Redis (ElastiCache) + EFS (model storage)
A/B Experimentation — Auto-Winner Guardrails
Stats
- p-value < 0.05
- Min 1000 samples/variant
- Min 7 days runtime
Lift
- Min 1% absolute lift
Deterministic bucketing
- SHA256(transaction_id)
Guardrails
- ≤10% latency regression
- ≥−5% revenue regression
- Dry-run mode (safe default)
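The bucketing and guardrail rules above are mechanical enough to sketch directly. The threshold values come from this document; the function shapes are assumptions about the real ExperimentService, not its API.

```python
import hashlib

def assign_variant(transaction_id, variants=("control", "treatment")):
    """Deterministic bucketing: SHA256(transaction_id) mod #variants,
    so the same transaction always lands in the same variant."""
    digest = hashlib.sha256(transaction_id.encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def can_declare_winner(p_value, samples_per_variant, days_running,
                       abs_lift, latency_regression, revenue_delta):
    """Auto-winner check: every statistical and guardrail condition must hold."""
    return (p_value < 0.05                  # statistical significance
            and samples_per_variant >= 1000  # min samples per variant
            and days_running >= 7            # min runtime
            and abs_lift >= 0.01             # >=1% absolute lift
            and latency_regression <= 0.10   # <=10% latency regression
            and revenue_delta >= -0.05)      # revenue may not drop >5%
```

In dry-run mode (the safe default), a passing check would only be logged, never acted on.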
🧪 Testing Coverage
| Layer | Coverage | Notes |
|---|---|---|
| Domain services (44) | 44/44 | All have test files |
| Repositories (43) | ~43/43 | In-memory SQLite isolation |
| V1 REST handlers (18) | 15/18 (83%) | agent, workspaces, elements missing |
| Auth middleware | Tested | JWT + API key covered |
| V2 REST handlers (18) | 0/18 (0%) | Entire new API version untested |
| Bedrock client | 0% | Excluded from coverage config |
| auth, bedrock, element, workspace services | 0% | 4 domain services with no tests |
| Bootstrap / DI graph | Skipped | TODO: testcontainers |
Maturity: ~4/5 — Strong. Domain and repository layers well covered. Triton merge adds 32 new tests. V2 API (18 handlers) and Bedrock ML inference path are still untested. CI threshold is only 20% and internal/clients/ is excluded from coverage entirely.
Testing Comparison — Both Repos
| Metric | DATA-Athena-Snowflake | athena-platform |
|---|---|---|
| Test functions | 450+ (+6 new integration files) | 777 |
| Test files | 34 (28 + 6 new) | 126 |
| Est. coverage | ~20% | ~25–30% |
| Core product tested? | No — multi-agent core untested; new ML tests need live deps | Partial — V2 + Bedrock missing |
| CI enforced? | Partial (fragmented) | Yes — every PR |
| Coverage threshold | None enforced | 20% (too low; target: 65%) |
| Maturity | 3/5 — Moderate | ~4/5 — Strong |
Timeline (2 engineers)
Full delivery: ~2 weeks (was 5–6 wks before savings)
MVP (Stages 1–2 only): ~1 week
1 engineer: ~3 weeks
Build Order
Stage 1 must complete before Stage 2. Stages 3–5 can overlap with late Stage 2.
Triton branch merged to main — deployment architecture confirmed.
Total Effort
~15.5d remaining
of ~104.5d original · ~89d saved
Progress Summary — 2026-03-24
| Stage | Gap | Category | Priority | Original | Status |
|---|---|---|---|---|---|
| 1 – Foundation | G-02 Orchestration | Infrastructure | High | 8d | Nearly Done (~1d left) — ~88% complete. SageMaker + Spacelift deployed, daily schedules active |
| 1 – Foundation | G-08 Feature Store | ML Infra | High | 13.5d | Nearly Done (~3.5d left) — ~74% complete, 140-feature encoder in sidecar |
| 1 – Foundation | G-04 Data Validation | Data Quality | High | 7d | Done ✓ |
| 2 – Automation | G-03 CI/CD Pipeline | DevOps | High | 9d | Partial (~2d left) — Quality gates done. Spacelift + GitHub Actions CI/CD. Integration tests partial. Staging/prod TODO |
| 2 – Automation | G-06 Deployment Automation | Automation | High | 7.5d | Done ✓ |
| 2 – Automation | G-07 Model Registration | Automation | Medium | 5.5d | Done ✓ |
| 2 – Automation | G-01 Automated Retraining | Automation | High | 10d | Done ✓ |
| 3 – Governance | G-13 Versioning Workflow | Governance | High | 5d | Done ✓ |
| 3 – Governance | G-10 Lineage Tracking | Governance | Medium | 6.5d | Partial (~4d left) — ~35% complete. ClearML online tracking at athia-ml.dev.deuna.io, pipeline→training task hierarchy |
| 3 – Governance | G-14 Rollback Capability | Reliability | High | 5d | Nearly Done (~0.5d left) — ~90% complete, rollback API pending |
| 4 – Observability | G-05 Model Monitoring | Observability | High | 8d | Done ✓ |
| 4 – Observability | G-09 Drift Detection | Observability | Medium | 7d | Done ✓ |
| 5 – ML Quality | G-11 Hyperparameter Tuning | ML Quality | Medium | 5.5d | Done ✓ |
| 5 – ML Quality | G-12 Algorithm Comparison | ML Quality | Medium | 7d | Done ✓ |
Engagement Summary
Full implementation of Athia AI/ML smartrouting for Deuna — Volaris merchant. 7 delivery phases covering P-01 (outage failover), P-03 (per-transaction routing), P-04 (message manipulation), and P-05 (retry optimization). ~61.5 person-days of prior Deuna codebase work reduces original scope significantly.
Aidaptive Team — Roles & Responsibilities
| Name | Role | Responsibilities | Days |
|---|---|---|---|
| Rakesh | Project Lead & Strategy | Client coordination (Pablo, Israel), architecture decisions, Phase 6 oversight, post-launch review | 5.5d |
| Naoki | Solutions Architect | athena-platform Go dev, outage detection, message manipulation API, model serving (Triton), CI/CD, Phase 6 integration | 14.6d |
| Rene | ML Engineer | Feature engineering, model training (processor_selector, retry_predictor, retry_sequence), data quality, drift detection, retraining pipeline | 15d |
| Kedar | Data & Backend Engineer | Snowflake EDA, data pipelines, training datasets, feature feeds, Grafana dashboards, monitoring | 16d |
| Total | | | ~51.1d |
Effort by Phase
| Phase | Focus | Owners | Days | Milestone |
|---|---|---|---|---|
| 1 — Discovery & EDA | Understand Volaris data | Kedar (4.5d) · Rene (2d) · Rakesh (0.5d) · Naoki (0.5d) | 7.5d | Kick-off (20%) |
| 2 — Feature Engineering | Build ML feature set | Rene (4d) · Kedar (3d) · Naoki (1d) · Rakesh (0.5d) | 8.5d | Phase 2 complete (20%) |
| 3 — Model Development | Train P-03 + P-05 models | Rene (6d) · Kedar (2d) · Naoki (1d) | 9d | Phase 3 complete (20%) |
| 4 — Outage Detection | P-01: failover for 4 PSPs | Naoki (4.5d) · Rakesh (1d) · Kedar (1d) | 6.5d | Phase 6 complete (30%) |
| 5 — Message Manipulation | P-04: CIT/MIT experiment | Naoki (2d) · Rene (1.5d) · Kedar (1.5d) · Rakesh (0.5d) | 5.5d | |
| 6 — Platform Integration | Register models, wire Deuna +G-06 close | Naoki (5.1d) · Rakesh (2d) · Kedar (1d) | 8.1d | |
| 7 — Monitoring & Feedback | Dashboards, retraining, review | Kedar (3d) · Rene (1.5d) · Rakesh (1d) · Naoki (0.5d) | 6d | Phase 7 complete (10%) |
| Total | | | ~51d | |
Delivery Timeline (6-week plan)
| Week | Days | Phases Active | Who | Key Milestone |
|---|---|---|---|---|
| Week 1 | 1–5 | Phase 1 (EDA) · Phase 2 start Day 3 | Kedar · Rene · Rakesh (Day 1) | EDA complete; feature schema draft |
| Week 2 | 6–10 | Phase 2 (Features) · Phase 3 start Day 8 | Rene · Kedar · Naoki | Feature set locked; training dataset built |
| Week 3 | 11–15 | Phase 3 (Models) · Phase 4 (Outage) parallel | Rene (models) · Naoki (outage) | Models packaged; outage detection built |
| Week 4 | 16–20 | Phase 4 tail · Phase 5 (CIT/MIT) · Phase 6 prep | Naoki · Rene · Kedar · Rakesh | API contract with Deuna eng signed |
| Week 5 | 21–25 | Phase 6 (Integration) | Naoki · Rakesh | ⚠ Triton branch must be merged by Day 18 · Integration live in shadow mode |
| Week 6 | 26–30 | Phase 7 (Monitoring & Review) | Kedar · Rene · Rakesh | Dashboards live · retraining scheduled · post-launch report |
Critical path: Phases 1–2 sequential. Phases 3–5 can run in parallel. Phase 6 requires (a) models complete, (b) Triton branch merged, (c) 1-week Deuna engineering lead time for API contract. Phase 7 requires Phase 6 live.
Assumptions
- Snowflake access (PAYMENT_ML) remains available read-only
- Deuna eng available for API contract in Week 4 (Pablo / Israel)
- Triton branch merged to main by end of Week 3
- Staging environment available for Phase 6 integration tests
- ATHIA_PREDICTIONS + ATHIA_FEEDBACK remain live throughout
Success Criteria
- processor_selector live for ≥1 Volaris PSP
- ≥1% absolute approval rate lift (A/B test at significance)
- ≥5% retry success rate improvement vs. baseline
- PSP failover within 1 routing cycle of threshold breach
- p95 latency <200ms end-to-end (model inference <50ms)
- 48h shadow run complete with documented comparison
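Verifying the p95 < 200ms criterion is a one-liner once latencies are collected. A standard-library sketch, with made-up sample latencies:

```python
from statistics import quantiles

def p95(latencies_ms):
    """95th percentile of observed latencies (milliseconds).
    quantiles(n=20) yields 19 cut points; index 18 is the 95% point."""
    return quantiles(latencies_ms, n=20)[18]

# Illustrative end-to-end latency samples — not real measurements.
samples = [40, 55, 60, 62, 70, 75, 80, 90, 110, 150] * 10
assert p95(samples) < 200   # success criterion from above
```

The same check with a 50ms budget would apply to the model-inference slice alone.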
🔧 Architecture Update (2026-03-13): TensorFlow Ecosystem + 7 Service Shells
Adopted TensorFlow ecosystem (tf.keras, TFX, TF Serving, TFDV, TFMA, TF Transform) for all ML work. Added Phase 0 with 11 new tasks: 3 design tasks (Rakesh) + 8 service shell scaffolds. Replaces Snowflake ML / XGBoost / scikit-learn.
Data Pipelines · Feature Service · Training Pipelines · Model Management · Eval Service · Evaluation Framework · Experiment System
V-D01: Service architecture · V-D02: API contracts · V-D03: TF ecosystem integration plan
65 tasks (was 54) · ~64d total (was ~49.5d) · Phase 0 adds ~14.5d · 3 engineers ~5–6 weeks
Service Architecture & Shell Setup
Design 7 service boundaries (Rakesh), define API contracts, scaffold all service shells using TensorFlow ecosystem: Data Pipelines (TFX), Feature Service (TF Transform), Training Pipelines (TFX Trainer), Model Management, Eval Service (TFMA), Evaluation Framework, Experiment System.
📐 V-D01–D03 (Rakesh design) + V-S01–S08 (team scaffolding)
Discovery & EDA
Understand Volaris transaction data — approval rates per PSP, retry patterns, routing rules, DYNAMIC_ROUTING_DETAIL JSON, sample size for A/B test.
Feature Engineering
Card BIN/brand, transaction context, user RFM, retry history, rolling processor health scores, Amex hard-rule bypass, training dataset build.
Model Development
Train processor_selector, retry_predictor, retry_sequence for Volaris 4 PSPs using tf.keras. DNN vs. wide-and-deep vs. TF Decision Forests comparison. TFMA per-slice evaluation.
🔧 TensorFlow ecosystem: tf.keras training via Training Pipeline service, TFMA evaluation via Eval Service, SavedModel export via Model Management.
Outage Detection
Rolling health score per PSP, failover to next-best Volaris processor, recovery detection via 1–2% sampling, alerts on state changes.
Message Manipulation
CIT/MIT audit for Volaris, approval delta by toggle × processor × card type, experiment design, new athena-platform endpoint, A/B test.
Platform Integration
Register models in athena-platform, create Volaris-scoped experiment, API contract with Deuna eng, shadow mode validation before live traffic.
✅ Triton branch: ExperimentService one-call API (V-39–41) + built-in shadow mode (V-46) reduce effort by ~1d.
Monitoring & Feedback Loop
Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake
Done ✓ — confirmed data live in Deuna's Snowflake (2026-02-24)
Deploy ATHIA_STAGE_OUTCOMES table
Done ✓ — deployed in feat/ATH-0000
Approval rate + model performance Grafana dashboards
Retraining trigger, scheduled pipeline, auto-winner, post-launch review
Ready ✓ — LLM orchestrator + training pipeline built
All 65 Tasks
| # | Task | Phase | Owner | Effort | Status |
|---|---|---|---|---|---|
| V-D01 | Design overall service architecture — 7 service boundaries, data flow, inter-service communication | 0 – Architecture | Rakesh | 2d | Design |
| V-D02 | Define API contracts for all 7 services — OpenAPI specs, error handling, versioning | 0 – Architecture | Rakesh | 1.5d | Design |
| V-D03 | Design TensorFlow ecosystem integration — map TFX components to services, TF Serving format, TFDV/TFMA | 0 – Architecture | Rakesh | 1d | Design |
| V-S01 | Scaffold Data Pipeline service — TFX ExampleGen + StatisticsGen, Snowflake adapter, TFDV schema | 0 – Shell | Kedar | 1.5d | |
| V-S02 | Scaffold Feature Service — TF Transform preprocessing_fn, feature store API, real-time endpoint | 0 – Shell | Rene | 1.5d | |
| V-S03 | Scaffold Training Pipeline service — TFX Trainer + tf.keras, Keras Tuner, training history | 0 – Shell | Rene | 1.5d | |
| V-S04 | Scaffold Model Management service — registry CRUD, SavedModel storage, lifecycle, version comparison | 0 – Shell | Naoki | 1d | |
| V-S05 | Scaffold Eval Service — TFMA integration, per-slice metrics, model blessing/rejection API | 0 – Shell | Rene | 1d | |
| V-S06 | Scaffold Evaluation Framework — A/B stat engine, winner detection, latency/revenue guardrails | 0 – Shell | Naoki | 1.5d | |
| V-S07 | Scaffold Experiment System — experiment CRUD, traffic splitting, variants, shadow mode orchestration | 0 – Shell | Naoki | 1.5d | |
| V-S08 | Set up shared TF dependencies — tensorflow, tfx, tf-transform, tfma, tfdv, keras-tuner + Docker base | 0 – Shell | Kedar | 0.5d | |
| V-01 | Filter Volaris transactions — date range, volume, monthly trend | 1 – EDA | Kedar | 0.5d | |
| V-02 | Per-processor approval rates (Worldpay, MIT, Elavon, Amex) by card type, currency, amount | 1 – EDA | Kedar | 1d | |
| V-03 | Retry pattern analysis — attempts per order, processor retry-to, 1st/2nd/3rd attempt success rates | 1 – EDA | Kedar | 1d | |
| V-04 | Explore DYNAMIC_ROUTING_DETAIL JSON — extract all keys and values | 1 – EDA | Kedar | 1d | |
| V-05 | Map Volaris routing rules from VW_ROUTING_MERCHANT_RULE* views | 1 – EDA | Kedar | 0.5d | |
| V-06 | Analyze smart routing log — algorithm types, skip rates, p95 latency baseline | 1 – EDA | Kedar | 0.5d | |
| V-07 | Hard vs. soft decline distribution by processor and error code | 1 – EDA | Rene | 1d | |
| V-08 | Profile airline-specific features — flight, passenger, booking window signal | 1 – EDA | Rene | 0.5d | |
| V-09 | A/B test sample size check — daily volume per processor ≥ 1000/variant in 7 days? | 1 – EDA | Rene | 0.5d | |
| V-10 | EDA summary report — approval rates, error taxonomy, processor share, correlations | 1 – EDA | Rene + Rakesh | 1d | |
| V-11 | Define Volaris feature schema — all features, types, sources, compute latency | 2 – Features | Rene + Naoki | 1d | |
| V-12 | Card-level features — BIN, brand, bank, type, country; historical approval rate per BIN × processor | 2 – Features | Rene | 1d | |
| V-13 | Transaction-level features — amount, currency, CIT/MIT, MCC, flight order type | 2 – Features | Rene | 1d | |
| V-14 | User-level features — RFM, fraud rate cohort, tenure, browsing signals | 2 – Features | Rene | 0.5d | |
| V-15 | Retry-context features — previous processor, error code, time since attempt, attempt number | 2 – Features | Kedar | 1d | |
| V-16 | Processor-state features — rolling approval/timeout/decline rate at 15-min, 1h, 24h windows | 2 – Features | Kedar | 1.5d | |
| V-17 | Amex hard-rule — always route Amex cards to Amex processor; bypass ML | 2 – Features | Naoki | 0.5d | |
| V-18 | Build training dataset — join features onto labeled outcomes; train/val/test split | 2 – Features | Kedar | 1d | Ready ✓ ATHIA_TRAINING_DATASET view + feature_extractor.py |
| V-19 | Feature quality validation — nulls, skew, leakage risk, outcome correlation | 2 – Features | Rene | 1d | Ready ✓ data_quality_validator.py (834 lines) |
| V-20 | Train processor_selector v1 — rank 4 PSPs by approval probability (tf.keras DNN) | 3 – Models | Rene | 2d | TF Training Pipeline service |
| V-21 | Evaluate processor_selector — AUC, lift vs. static rules, per-processor accuracy, latency | 3 – Models | Rene | 1d | Ready ✓ Metrics auto-calculated by pipeline |
| V-22 | Train retry_predictor v1 — predict retry approval probability | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_predictor type |
| V-23 | Train retry_sequence v1 — optimal processor order for retry | 3 – Models | Rene | 1.5d | Ready ✓ Training pipeline supports retry_sequence type |
| V-24 | Evaluate retry models — success rate lift, processor fatigue patterns | 3 – Models | Rene | 1d | Ready ✓ Evaluation framework in pipeline |
| V-25 | Architecture comparison — DNN vs. wide-and-deep vs. TF Decision Forests; select champion | 3 – Models | Rene | 1d | TF Replaces XGBoost vs. LR comparison |
| V-26 | Inference latency test — all models under 50ms budget | 3 – Models | Naoki | 0.5d | |
| V-27 | Package models — serialize, write model card (schema, features, metrics) | 3 – Models | Kedar | 0.5d | Ready ✓ model_registry.py auto-creates tables + stores metadata |
| V-28 | Define outage signal — timeout/error code thresholds for PSP-down detection | 4 – P-01 | Rakesh + Naoki | 1d | |
| V-29 | Rolling processor health score — sliding 5–15 min window per PSP | 4 – P-01 | Naoki | 1.5d | |
| V-30 | Failover logic — skip degraded PSP, route to next-best Volaris processor | 4 – P-01 | Naoki | 1.5d | |
| V-31 | Recovery detection — 1–2% sampling of down PSP; auto-restore on consecutive wins | 4 – P-01 | Naoki | 1d | |
| V-32 | Outage simulation tests — inject failures per PSP; verify failover + recovery | 4 – P-01 | Naoki | 1d | |
| V-33 | Outage alerting — Slack/PagerDuty on PSP state changes | 4 – P-01 | Kedar | 0.5d | |
| V-34 | Audit CIT/MIT usage for Volaris — current distribution across PSPs | 5 – P-04 | Kedar | 0.5d | |
| V-35 | Approval delta by CIT vs MIT per processor — statistical test | 5 – P-04 | Rene | 1d | |
| V-36 | Design message manipulation experiment — CIT/MIT × processor × card type matrix | 5 – P-04 | Rene + Rakesh | 1d | |
| V-37 | Implement message recommendation API in athena-platform | 5 – P-04 | Naoki | 2d | |
| V-38 | Run A/B test — approval rate with vs. without message recommendations | 5 – P-04 | Kedar | 1d | |
| V-39 | Register processor_selector in MODEL_ARTIFACTS (version, Triton backend ref, feature schema) | 6 – Integration | Naoki | 0.3d | Ready ✓ POST /api/v1/ml/models (Triton branch ExperimentService) |
| V-40 | Register retry_predictor + retry_sequence in MODEL_ARTIFACTS | 6 – Integration | Naoki | 0.3d | Ready ✓ Same — ExperimentService handles all 3 model types |
| V-41 | Create Volaris-scoped experiment — merchant filter, 10% treatment split, shadow mode, guardrails | 6 – Integration | Naoki | 0.5d | Ready ✓ POST /api/v1/ml/experiments — variants + models in one call (Triton branch) |
| V-42 | Validate experiment assignment — SHA256 bucketing determinism for Volaris | 6 – Integration | Naoki | 0.5d | |
| V-43 | API contract with Deuna engineering — define POST /api/v1/ml/predict request/response for Volaris | 6 – Integration | Rakesh | 1d | |
| V-44 | Deuna payment service integration — Deuna calls athena-platform at routing decision point | 6 – Integration | Rakesh + Naoki | 2d | |
| V-45 | End-to-end integration test — full flow: Deuna → athena-platform → model → ranked PSPs | 6 – Integration | Naoki + Kedar | 1d | |
| V-46 | Shadow mode — 48h logging without acting; compare predicted vs. actual outcomes | 6 – Integration | Kedar | 0.5d | Ready ✓ is_shadow_mode=true built-in (Triton branch); set up + monitor only |
| V-47 | Confirm ATHIA_PREDICTIONS + ATHIA_FEEDBACK are live in Deuna's Snowflake | 7 – Monitoring | Kedar | 0.5d | Done ✓ Confirmed data live in Deuna's Snowflake (2026-02-24) |
| V-48 | Deploy ATHIA_STAGE_OUTCOMES table in Snowflake | 7 – Monitoring | Kedar | 0.5d | Done ✓ Deployed in feat/ATH-0000 SQL |
| V-49 | Volaris approval rate dashboard — daily/hourly per PSP vs. baseline | 7 – Monitoring | Kedar | 1d | |
| V-50 | Model performance dashboard — prediction confidence, rank accuracy, retry lift | 7 – Monitoring | Kedar | 1d | |
| V-51 | Define retraining trigger — approval rate drop or AUC drop thresholds | 7 – Monitoring | Rene | 0.5d | Ready ✓ llm_training_orchestrator.py makes RETRAIN_NOW / SCHEDULED / SKIP decisions |
| V-52 | Schedule weekly retraining — auto-register new version from latest ATHIA_TRAINING_DATASET | 7 – Monitoring | Rene | 1d | Ready ✓ training_pipeline.py + orchestrator built; configure for Volaris cadence |
| V-53 | Confirm auto-winner worker runs for Volaris experiment with correct guardrails | 7 – Monitoring | Naoki | 0.5d | |
| V-54 | Post-launch review — 2-week lift analysis: approval rate, outage response, retry success | 7 – Monitoring | Rakesh | 1d | |
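Tasks V-28 through V-31 describe the P-01 loop: score each PSP over a sliding window, skip degraded PSPs at routing time, and restore them once sampled traffic recovers. A minimal sketch of the health-score and failover pieces follows; the class/function names, the 15-minute window, and the 80% degraded threshold are illustrative assumptions, not the production values (those come out of V-28).

```python
from collections import deque
import time

class ProcessorHealth:
    """Sliding-window approval-rate score for one PSP (V-29 sketch)."""

    def __init__(self, window_s=900.0, degraded_below=0.80):
        self.window_s = window_s          # e.g. a 15-minute sliding window
        self.degraded_below = degraded_below
        self.events = deque()             # (timestamp, approved: bool)

    def record(self, approved, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, approved))
        self._evict(now)

    def score(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.events:
            return 1.0                    # no recent data: assume healthy
        return sum(a for _, a in self.events) / len(self.events)

    def is_degraded(self, now=None):
        return self.score(now) < self.degraded_below

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

def rank_healthy(psps, health, now=None):
    """Failover step (V-30 sketch): drop degraded PSPs, best score first."""
    live = [p for p in psps if not health[p].is_degraded(now)]
    return sorted(live, key=lambda p: health[p].score(now), reverse=True)
```

Recovery detection (V-31) would sit on top of this: keep routing a 1–2% sample to a degraded PSP and clear its state after consecutive approvals.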
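V-42 validates that experiment assignment is deterministic under SHA256 bucketing. The idea can be sketched as follows; the function name, key format, and the 10% treatment split are assumptions for illustration (the split matches the V-41 experiment config, but the real hashing key is defined by the platform, not here).

```python
import hashlib

def assign_variant(subject_id, experiment_id, treatment_pct=0.10):
    """Deterministically bucket a subject into treatment or control.

    The same (experiment_id, subject_id) pair always hashes to the same
    bucket, so assignment is stable across requests and replicas, and
    roughly `treatment_pct` of subjects land in treatment.
    """
    digest = hashlib.sha256(f"{experiment_id}:{subject_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    return "treatment" if bucket < treatment_pct else "control"
```

Determinism is what makes shadow-mode comparison (V-46) and later lift analysis (V-54) trustworthy: a subject never flips buckets mid-experiment.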
🔴 Critical — Do These First
| Action | Repo | Effort |
|---|---|---|
| Build retry_optimization_requested stimulus — P-05 is entirely missing from LLM platform | DATA-Athena-Snowflake | 3d |
| Complete Strategy Director — replace `exit()` placeholder & dummy ranker prompts | DATA-Athena-Snowflake | 2d |
| Add tests for all 18 V2 REST handlers — 0% coverage on new API version | athena-platform | 4d |
| Add tests for Bedrock client + Bedrock domain service — production-critical, currently excluded | athena-platform | 1.5d |
| Add route, service & client layer tests — all 14 routes, 13 services, 3 clients at 0% | DATA-Athena-Snowflake | 7d |
| Remove internal/clients/ from coverage exclusions in CI | athena-platform | 0.5d |
- Unit test all route handlers with `FastAPI TestClient` + mocked services
- Unit test all 13 services — mock Snowflake sessions and clients
- Unit test core multi-agent framework: `AgentWorkflow`, `AgentStrategy`, node/edge composition
- Add per-branch tests for all 11 stimulus branches (mock LLM responses with fixtures)
- Unify CI into a single `pytest` run — replace fragmented per-domain workflows
- Enable `pytest-cov` with 60% minimum threshold enforced in CI
- Circuit breaker in `AgentWorkflow` — isolate node failures, prevent cascade
- Enable OpenTelemetry tracing — already in codebase, just commented out
- Replace hardcoded thresholds (15% drop, 60–80 min windows) with configurable params
- Add LLM prompt injection guards — sanitize user inputs before system prompts
- Standardize tool definition — unify `@create_tool` vs. manual; add versioning
- Centralize config — replace scattered `load_dotenv` calls with a Pydantic Settings schema
- Add tests for all 18 V2 handlers — entire new API version at 0%
- Test Bedrock client & service — excluded from coverage, production-critical
- Raise CI threshold 20% → 60%; remove `internal/clients/` exclusion
- Bootstrap integration test with `testcontainers-go` — verify DI graph
- Benchmark tests for `/ml/predict`, `/feedback`, experiment assignment
- Contract tests for Snowflake & Bedrock APIs — catch schema drift early
- Event-driven model registry cache invalidation — remove 24h stale-assignment risk
- Experiment context middleware — auto-propagate session/experiment IDs per request
- Abstract `*gin.Context` from controllers — transport-agnostic, easier to test
- Deploy `ATHIA_STAGE_OUTCOMES` + `ATHIA_SESSION_SUMMARY` Snowflake tables
- SageMaker model warm-up — cold starts can breach the p95 < 200ms target
- Production Grafana dashboards + alerts — config exists locally, not deployed
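The circuit-breaker item above can be sketched in a few lines. This is a hypothetical illustration, not the `AgentWorkflow` implementation: class name, thresholds, and reset timing are assumptions; the point is the state machine (closed → open on repeated failures → half-open after a cooldown) that stops one failing node from cascading.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for isolating a failing workflow node."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None = closed; timestamp = open

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Open: fail fast instead of hammering the broken node.
                raise RuntimeError("circuit open — node isolated")
            # Half-open: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

In the workflow setting, the `RuntimeError` path would route to a fallback node or a degraded response rather than propagating the original failure.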
Full Priority Order
| Priority | Action | Repo | Effort |
|---|---|---|---|
| Critical | Build retry optimization stimulus (P-05) | DATA-Athena-Snowflake | 3d |
| Critical | Complete Strategy Director matcher + ranker | DATA-Athena-Snowflake | 2d |
| Critical | Add tests for all 18 V2 REST handlers | athena-platform | 4d |
| Critical | Add tests for Bedrock client + service | athena-platform | 1.5d |
| Critical | Add route, service & client tests (all at 0%) | DATA-Athena-Snowflake | 7d |
| Critical | Remove internal/clients/ from coverage exclusions | athena-platform | 0.5d |
| High | Multi-agent framework + branch tests (11 branches) | DATA-Athena-Snowflake | 6d |
| High | Circuit breaker in AgentWorkflow | DATA-Athena-Snowflake | 2d |
| High | Enable OpenTelemetry tracing | DATA-Athena-Snowflake | 1.5d |
| High | Raise CI coverage threshold to 60% | athena-platform | 0.5d |
| High | Bootstrap integration test (testcontainers) | athena-platform | 1.5d |
| High | Event-driven model registry cache invalidation | athena-platform | 1.5d |
| High | Deploy ATHIA_STAGE_OUTCOMES + SESSION_SUMMARY tables | athena-platform | 1d |
| High | SageMaker model warm-up (latency target risk) | athena-platform | 1d |
| Medium | Adaptive thresholds (replace hardcoded values) | DATA-Athena-Snowflake | 2d |
| Medium | Experiment context middleware | athena-platform | 2d |
| Medium | Production Grafana dashboards + alert rules | athena-platform | 2d |
| Medium | Benchmark tests for hot endpoints | athena-platform | 1d |
| Medium | Unified CI test suite + coverage enforcement | DATA-Athena-Snowflake | 1.5d |
| Engineer | Work Done |
|---|---|
| Rakesh | Changed training pipelines to use GPU instances. Implemented deeper ClearML integration in training pipeline — extracting detailed metrics from each training run into ClearML for monitoring and debugging. |
| Engineer | Work Done |
|---|---|
| Rakesh | Spent ~30 hours debugging dev servers with ClearML integration — blocked by access issues. Worked with team to resolve access, dev pipeline now working correctly. Attempted DNN pipeline deployment — ran for 3 hours and failed. Long turnaround time makes iteration impractical without GPU. |
- GPU needed — training requires GPU instance; CPU-based runs too slow for practical iteration (3hr+ per attempt)
- AWS access for Rakesh — need direct access to AWS resources (console/CLI) to debug and iterate efficiently
| Engineer | Work Done |
|---|---|
| Rakesh | Implemented DNN pipeline Terraform and deploying in dev. Need to get PRs submitted and merged to main/qa so pipelines can run via CI/CD |
| Engineer | Work Done |
|---|---|
| Rakesh | Debugging why ClearML is not registering all metrics from each training run. Next: deploy DNN model and fix production integration to ensure every training run model reaches production automatically |
| Engineer | Work Done |
|---|---|
| Rakesh | Fixing Volaris training pipeline run in dev environment. Helping team with prod deployment |
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed e2e SageMaker dolphin pipeline (PreprocessData ✅, TrainModel ✅, EvaluateModel fix pushed). Deployed Volaris smart router daily training infra. ClearML integration confirmed working. Implemented Volaris pipeline CLI with Snowflake connection. Added 13 tests for setup_pipeline. Fixed multiple SageMaker issues: pipeline name mismatch, model.save(), RegisterModel, FrameworkProcessor for eval. |
- PR #1132 (DATA-Athena-Snowflake) needs approval to merge to `qa` — blocks CodePipeline deployment
- CodeDeploy DeployEC2 stage failing — scripts need debugging after merge
- SSO lacks `sagemaker:CreatePipeline` for local runs
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed e2e SageMaker training pipeline via Spacelift. Consolidated PRs #24+#25 → #26. Fixed Spacelift project_root, cleaned orphaned state. Enabled daily training for both pipelines. Set up ClearML creds in Secrets Manager + EC2. Created Spacelift stack for Volaris. Renamed model-artifacts → volaris-model-artifacts. |
| Engineer | Work Done |
|---|---|
| Rakesh | Fighting Terraform and Spacelift configuration issues. ClearML successfully deployed in dev environment |
| Engineer | Work Done |
|---|---|
| Rakesh | Deployed ClearML using Terraform and Spacelift in dev environment and handed over to team. Full data analysis of Snowflake data with Rene — generated list of suggestions for team, published at metrics dashboard |
| Rene | Full data analysis of Snowflake data with Rakesh — generated list of suggestions for team |
| Engineer | Work Done |
|---|---|
| Rakesh | Deploying entire training platform end to end. Learning Spacelift for infra deployment — finally got access. Deploying ClearML to monitor all training. Wrote analysis script to monitor model performance driving metrics dashboard directly from Snowflake — better for analysis and generating insights |
| Rene | Analyzing model performance with past data. Waiting for experiment to be enabled again — current data not significant enough |
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzing model performance metrics, working on DNN model optimization for Volaris smart router |
| Engineer | Work Done |
|---|---|
| Rakesh | Training DNN model for Volaris smart router — also serves as example on how to use the TFX pipeline |
| Rene | Data analysis to identify patterns based on Volaris data |
| Engineer | Work Done |
|---|---|
| Rakesh | Verified everything working in production. Helping with questions about the model. Double-checked traffic ramp-up and analyzing how to do post-launch analysis |
| Engineer | Work Done |
|---|---|
| Rakesh | Integrated model in serving stack with sidecar approach, loading models from S3/EFS. Removed all parallel serving libraries no longer needed in training directory |
| Engineer | Work Done |
|---|---|
| Rene | Created and incorporated AMEX model, added to the serving mix |
| Rakesh | Built AMEX model with Rene. Working with Deuna team on deploying everything in production for Volaris |
| Engineer | Work Done |
|---|---|
| Rakesh | Analyzing code with Naoki to determine serving approach — sidecar vs deploying servomatic. Building production flow to save and load models from S3 & deploying servomatic binary |
| Naoki | Evaluating sidecar vs servomatic deployment for integrating TF model with the Go API |
| Rene | Continuing to iterate on the TF model and data analysis. Converting LR model to TF format for use in servomatic binary for online eval |
| Engineer | Work Done |
|---|---|
| Rene | First regression model trained on Volaris data — evaluating quality in offline mode, initial results look promising |
| Rakesh | Continuing to analyze approach to serve the model as servomatic with Naoki |
| Naoki | Analyzing serving architecture to connect trained model via servomatic in production |
| Engineer | Work Done |
|---|---|
| Rene | Iterating on data shape analysis and building first model version |
| Rakesh | Working with Rene on first model; analyzing serving code with Naoki to plan production integration |
| Naoki | Analyzing serving code with Rakesh to determine how to connect model in production |
| Engineer | Work Done |
|---|---|
| Rakesh | First analysis of the data with Rene — analyzing best approach to build processor selector model |
| Rene | Started on first model based on current understanding of data and features |
| Naoki | Working with Rene on integrating S3 file loading into TFX data loader for e2e training and eval |
| Engineer | Work Done |
|---|---|
| Rene | Looking at data shape for Volaris to train first processor selector model |
| Kedar | Working on data pipeline |
| Naoki | Continuing to set up good practices (code quality, CI, testing patterns) |
| Rakesh | Writing smartrouter service |
| Engineer | Work Done |
|---|---|
| Rakesh | Iterated on experiment and metrics framework to make everything work locally and in tests |
| Naoki | Iterated on improving code and ramping up |
| Kedar | Looking at feature extraction from Snowflake |
| Rene | Working on simple first model |
| Engineer | Work Done |
|---|---|
| Rakesh | Added Evaluation Service (uses Model Service + Feature Service to evaluate TensorFlow models). Added e2e tests for all 3 services. Added experiment and metrics framework to track all training pipelines. Demo training pipeline working end to end. PR waiting for review |
| Engineer | Work Done |
|---|---|
| Rakesh | Built foundational services: Model Service and Feature Service with tests and scaffoldings to support TensorFlow trained models |
| Engineer | Work Done |
|---|---|
| Naoki | Fixed broken tests to get everything running locally. Looking at setting up automated deployment in dev environment for services |
| Kedar | Got repo and environment access figured out. Looking into Snowflake data schema |
| Rene | Got repo and environment access figured out. Looking at training pipeline code |
| Rakesh | Updated deuna.aidaptive.com with latest repo analysis and refreshed task list. Synced athena-platform (v0.15.5, Triton merged) |
Are ATHIA_PREDICTIONS / ATHIA_FEEDBACK tables populated in Deuna's Snowflake today?
Confirmed ✓ — data is live in Deuna's Snowflake (verified 2026-02-24).
Are SageMaker endpoints live for processor_selector / retry_predictor?
Or are they placeholders only? — Rakesh to confirm
Is there a live model in MODEL_ARTIFACTS that Deuna's payment service is calling today?
Rakesh to confirm
What is the current payment volume through the routing engine?
Minimum 1,000 transactions per variant needed for A/B test statistical validity — Ask Israel
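For context on that 1,000-per-variant guardrail: it is a platform minimum, not a power calculation. A standard two-proportion sample-size estimate (normal approximation) shows that detecting a small approval-rate lift needs considerably more traffic. The baseline and lift figures below are illustrative assumptions, not Volaris numbers.

```python
from statistics import NormalDist

def n_per_variant(p_control, p_treatment, alpha=0.05, power=0.80):
    """Required sample size per arm for a two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_control + p_treatment) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment)) ** 0.5) ** 2
    return num / (p_control - p_treatment) ** 2

# Detecting a 2pp lift (85% → 87% approval) at alpha=0.05, power=0.8
# needs roughly 4,700 transactions per variant — well above the
# 1,000-transaction guardrail floor.
```

So the volume question matters twice: the guardrail sets when the auto-winner may act at all, while the actual lift we expect sets how long the experiment must run.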
Who owns the athena-platform Go repo deployments?
Aidaptive or Deuna infra? Affects Phase 1 deployment planning — Clarify with Pablo
When will feature/llm-driven-ml-training (Triton IS) merge to main? New
This PR closes G-06 and defines the production model serving backend (Triton vs. SageMaker). Its merge timeline directly sets the Phase 6 integration schedule — ask Pablo.
| Item | Owner | Status |
|---|---|---|
| Snowflake access — Rakesh | Israel (Deuna) | ✓ Done (2026-02-18) |
| Snowflake access — Naoki | Rakesh + Naoki | ✓ Done (2026-02-19) |
| Code / repo access — Rakesh | Pablo (Deuna) | ✓ Done (2026-02-19) |
| Claude / LLM access & budget | Pablo → Farhan | ✓ Done (2026-02-19) |
| Code / repo access — Naoki | TBD | ✓ Done |
| Deuna corp accounts — Rakesh & Naoki | TBD | Pending |
| Claude Code credits — Rakesh & Naoki | — | Not needed |
| Deploy ATHIA_STAGE_OUTCOMES + ATHIA_SESSION_SUMMARY in Snowflake | Rakesh | ✓ Done (feat/ATH-0000) |
| Build retry_optimization_requested workflow | Rakesh | Pending |
| Service | URL | Details |
|---|---|---|
| AWS Console (SSO) | deunaio.awsapps.com | Deuna AWS account access |
| Snowflake | vltaxpw-rmontes.snowflakecomputing.com | Account: VLTAXPW-RMONTES · DB: PAYMENT_ML · Warehouse: PAYMENT_ML · Read-only |
| Athia Experiments Dashboard | insights.deuna.com | Model performance data for processor selector experiments |
| ClearML (Prod) | athia-ml.deuna.io | ML experiment tracking & training monitoring — production |
| ClearML (Dev) | athia-ml.dev.deuna.io | ML experiment tracking & training monitoring — dev environment |
| Spacelift | duna-e-commmerce.app.spacelift.io | Infrastructure governance & Terraform deployment |
| Terraform Repo | github.com/DUNA-E-Commmerce/terraform-athia | All Athia infrastructure as code |
| Rule | Details |
|---|---|
| AWS Resource Tags | All AWS resources must include: CreatedBy=aidaptive, ServiceName=smartrouter, Environment=POC |
| Infrastructure as Code | All infrastructure via Terraform only — no manual AWS console resource creation |
| Date | Decision | Rationale | Made By |
|---|---|---|---|
| 2026-03-24 | Serving migrated from DATA-Athena-Snowflake to athia-model-server sidecar in athena-platform | Clean separation — training repo (Python) vs serving repo (Go + Python sidecar) | Rakesh |
| 2026-03-13 | Adopted TensorFlow ecosystem (TFX, TF Serving, TFDV, TFMA, TF Transform) for all ML work | Replaces Snowflake ML / XGBoost / scikit-learn. Unified training → validation → serving pipeline with production-grade tooling | Rakesh |
| 2026-03-13 | Added Phase 0 — 7 service shells + 3 design tasks (Rakesh) before Volaris feature work | Service architecture: Data Pipelines, Feature Service, Training Pipelines, Model Mgmt, Eval Service, Evaluation Framework, Experiment System | Rakesh |
| 2026-03-13 | Both repos switched to main branch — feat/ATH-0000 and Triton branches both merged | All ML training pipeline and Triton serving code now on main; no more feature branch tracking needed | Rakesh |
| 2026-03-13 | Triton branch merged to main — confirms deployment architecture | feature/llm-driven-ml-training merged; Triton IS, ExperimentService, shadow mode now in production codebase | Deuna Engineering |
| 2026-02-18 | Latency target updated: p95 <50ms → p95 <200ms | Revised from original SOW spec | Rakesh (w/ Pablo) |
| 2026-02-19 | Phase 1 target merchant set to Volaris (not Cinépolis) | Volaris has known PSPs (Worldpay ID:76, MIT ID:85, Elavon, Amex); Cinépolis only shows Cybersource gateway — processor unknown | Mark Walick |
| 2026-02-20 | Repo analysis scoped to branch feat/ATH-0000-athia-ml-llm-schema-discovery (not main) | This branch contains the active ML platform development; main does not reflect current capabilities | Pablo |
| 2026-02-26 | athena-platform feature/llm-driven-ml-training (Triton IS) identified as the production model serving path | Triton IS + shadow mode + ExperimentService provides complete training→serving pipeline; replaces manual SageMaker endpoint registration; closes G-06 | Rakesh |