Deployment Guide¶
Infrastructure-as-Code deployment using Terraform and GitHub Actions.
Overview¶
Arbiter-Bot runs on AWS with a fully automated CI/CD pipeline:
| Component | Technology | Purpose |
|---|---|---|
| Compute | ECS Fargate | Serverless containers |
| Database | Aurora PostgreSQL | Persistent storage |
| Cache | ElastiCache Redis | Session and rate limit state |
| Networking | VPC + ALB | Isolation and load balancing |
| CI/CD | GitHub Actions | Automated deployment |
| IaC | Terraform | Infrastructure provisioning |
Architecture¶
┌───────────────────────────────────────────────────────────────┐
│ AWS Region (us-east-1)                                        │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ VPC                                                       │ │
│ │ ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │ │
│ │ │ Public      │  │ Private     │  │ Database            │ │ │
│ │ │ Subnets     │  │ Subnets     │  │ Subnets             │ │ │
│ │ │             │  │             │  │                     │ │ │
│ │ │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────────────┐ │ │ │
│ │ │ │   ALB   │ │  │ │   ECS   │ │  │ │ Aurora Postgres │ │ │ │
│ │ │ │         │─┼──┼─│ Fargate │─┼──┼─│                 │ │ │ │
│ │ │ └─────────┘ │  │ └─────────┘ │  │ └─────────────────┘ │ │ │
│ │ │             │  │             │  │                     │ │ │
│ │ │             │  │             │  │ ┌─────────────────┐ │ │ │
│ │ │             │  │             │  │ │ ElastiCache     │ │ │ │
│ │ │             │  │             │  │ │ Redis           │ │ │ │
│ │ │             │  │             │  │ └─────────────────┘ │ │ │
│ │ └─────────────┘  └─────────────┘  └─────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
Prerequisites¶
AWS Requirements¶
| Requirement | Details |
|---|---|
| AWS Account | With programmatic access |
| Region | us-east-1 (optimized for exchange latency) |
| IAM Role | OIDC provider for GitHub Actions |
| ECR Repository | Container image storage |
| Secrets Manager | Credential storage |
Local Tools¶
# Install required tools
brew install terraform awscli
# Verify versions
terraform --version # >= 1.0
aws --version # >= 2.0
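The version check above can be scripted so that a setup script fails early on an outdated tool. The following is a minimal sketch: `version_ge` and `check_tools` are hypothetical helper names, the parsing assumes the usual `Terraform v1.x.y` and `aws-cli/2.x.y ...` output formats, and `sort -V` is assumed to be available (GNU coreutils and recent macOS).

```shell
#!/usr/bin/env bash

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Sorts the two versions and checks that B is the smaller (or equal) one.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# check_tools: hypothetical helper that applies the minimums listed above.
check_tools() {
  local tf_version aws_version
  tf_version="$(terraform version | head -n1 | sed 's/^Terraform v//')"
  # AWS CLI v1 printed its version to stderr, so capture both streams.
  aws_version="$(aws --version 2>&1 | sed -E 's|^aws-cli/([0-9.]+).*|\1|')"

  version_ge "$tf_version" "1.0" \
    || { echo "Terraform $tf_version is older than 1.0" >&2; return 1; }
  version_ge "$aws_version" "2.0" \
    || { echo "AWS CLI $aws_version is older than 2.0" >&2; return 1; }
  echo "terraform $tf_version and aws-cli $aws_version satisfy the minimums"
}
```

Calling `check_tools` at the top of a setup script stops it before any Terraform command runs against an unsupported toolchain.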
GitHub Secrets¶
Configure these secrets in your GitHub repository:
| Secret | Description |
|---|---|
| `AWS_ROLE_ARN` | IAM role ARN for OIDC authentication |
| `POLY_PRIVATE_KEY` | Polymarket signing key |
| `POLY_API_KEY` | Polymarket API key |
| `POLY_API_SECRET` | Polymarket API secret |
| `POLY_API_PASSPHRASE` | Polymarket API passphrase |
| `KALSHI_KEY_ID` | Kalshi API key ID |
| `KALSHI_PRIVATE_KEY` | Kalshi RSA private key (PEM) |
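One way to load these secrets is the GitHub CLI. The sketch below assumes `gh` is installed and authenticated for the repository and that each value is already present in the local environment under the same name. The function names `missing_secrets` and `push_secrets` are hypothetical.

```shell
#!/usr/bin/env bash

# Secret names taken from the table above.
REQUIRED_SECRETS="AWS_ROLE_ARN POLY_PRIVATE_KEY POLY_API_KEY POLY_API_SECRET POLY_API_PASSPHRASE KALSHI_KEY_ID KALSHI_PRIVATE_KEY"

# missing_secrets: print every required name that is unset or empty locally.
missing_secrets() {
  local name
  for name in $REQUIRED_SECRETS; do
    [ -n "${!name:-}" ] || echo "$name"
  done
}

# push_secrets: upload each value; refuses to run if any value is missing.
push_secrets() {
  local missing name
  missing="$(missing_secrets)"
  if [ -n "$missing" ]; then
    echo "Missing values for:" $missing >&2
    return 1
  fi
  for name in $REQUIRED_SECRETS; do
    # The value is piped on stdin so it does not show up in the process list.
    printf '%s' "${!name}" | gh secret set "$name"
  done
}
```

Running `missing_secrets` on its own is a quick pre-flight check before `push_secrets` touches the repository.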
Terraform Modules¶
The infrastructure is organized into reusable modules:
infrastructure/terraform/
├── main.tf           # Root module orchestration
├── variables.tf      # Input variables
├── outputs.tf        # Output values
├── backend.tf        # S3 state backend
└── modules/
    ├── vpc/          # Network infrastructure
    ├── secrets/      # Secrets Manager
    ├── rds/          # Aurora PostgreSQL
    ├── elasticache/  # Redis cluster
    └── ecs/          # Fargate services
VPC Module¶
Creates isolated network infrastructure:
module "vpc" {
source = "./modules/vpc"
environment = var.environment
vpc_cidr = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
# Subnet configuration
public_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
private_subnets = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
database_subnets = ["10.0.21.0/24", "10.0.22.0/24", "10.0.23.0/24"]
}
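The subnet lists above follow a simple numbering scheme inside `10.0.0.0/16`: each tier gets a block of ten third-octet values (public 1-3, private 11-13, database 21-23), one `/24` per availability zone. The sketch below reproduces that scheme. `subnet_cidrs` is a hypothetical helper, and the offsets 0, 10, and 20 are inferred from the values shown rather than taken from the module itself.

```shell
#!/usr/bin/env bash

# subnet_cidrs TIER_OFFSET AZ_COUNT: print one /24 per availability zone.
# The third octet is TIER_OFFSET plus the 1-based zone index.
subnet_cidrs() {
  local offset="$1" count="$2" i
  for ((i = 1; i <= count; i++)); do
    echo "10.0.$((offset + i)).0/24"
  done
}

# Example: subnet_cidrs 0 3   lists the public subnets,
#          subnet_cidrs 10 3  lists the private subnets,
#          subnet_cidrs 20 3  lists the database subnets.
```

Keeping a gap between tiers leaves room to add availability zones without renumbering existing subnets.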
Secrets Module¶
Manages exchange credentials securely:
module "secrets" {
source = "./modules/secrets"
environment = var.environment
secrets = {
polymarket = {
private_key = var.poly_private_key
api_key = var.poly_api_key
api_secret = var.poly_api_secret
passphrase = var.poly_api_passphrase
}
kalshi = {
key_id = var.kalshi_key_id
private_key = var.kalshi_private_key
}
}
}
RDS Module¶
Aurora PostgreSQL for persistent storage:
module "rds" {
source = "./modules/rds"
environment = var.environment
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.database_subnet_ids
security_group_id = module.vpc.database_security_group_id
# Instance configuration
instance_class = "db.r6g.large"
engine_version = "15.4"
allocated_storage = 100
# High availability
multi_az = var.environment == "prod"
backup_retention = 7
}
ElastiCache Module¶
Redis for session and rate limit state:
module "elasticache" {
source = "./modules/elasticache"
environment = var.environment
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.database_subnet_ids
security_group_id = module.vpc.cache_security_group_id
# Cluster configuration
node_type = "cache.r6g.large"
num_cache_nodes = var.environment == "prod" ? 3 : 1
engine_version = "7.0"
}
ECS Module¶
Fargate services for application workloads:
module "ecs" {
source = "./modules/ecs"
environment = var.environment
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
alb_security_group = module.vpc.alb_security_group_id
# Service images
trading_core_image = var.trading_core_image
telegram_bot_image = var.telegram_bot_image
web_api_image = var.web_api_image
# Resource allocation
services = {
trading_core = {
cpu = 4096 # 4 vCPU
memory = 8192 # 8 GB
count = var.environment == "prod" ? 2 : 1
}
telegram_bot = {
cpu = 512 # 0.5 vCPU
memory = 1024 # 1 GB
count = 1
}
web_api = {
cpu = 1024 # 1 vCPU
memory = 2048 # 2 GB
count = var.environment == "prod" ? 2 : 1
}
}
# Secrets references
secrets_arn = module.secrets.secrets_arn
# Database connection
database_url = module.rds.connection_string
redis_url = module.elasticache.connection_string
}
Service Configuration¶
Trading Core¶
The main arbitrage engine with high resource allocation:
| Setting | Value | Rationale |
|---|---|---|
| CPU | 4 vCPU | Low-latency computation |
| Memory | 8 GB | Order book caching |
| Replicas | 2 (prod) | High availability |
| Health Check | /health | Liveness probe |
Telegram Bot¶
User interface service with minimal resources:
| Setting | Value | Rationale |
|---|---|---|
| CPU | 0.5 vCPU | I/O bound workload |
| Memory | 1 GB | Session state |
| Replicas | 1 | Stateless |
| Health Check | /health | Liveness probe |
Web API¶
gRPC/REST API service:
| Setting | Value | Rationale |
|---|---|---|
| CPU | 1 vCPU | Request handling |
| Memory | 2 GB | Connection pooling |
| Replicas | 2 (prod) | Load balancing |
| Health Check | /health | Liveness probe |
CI/CD Pipeline¶
GitHub Actions workflow with four stages:
┌──────────┐    ┌──────────┐    ┌──────────────┐    ┌─────────────────┐
│ Validate │───>│ Build    │───>│ Deploy Infra │───>│ Deploy Services │
│          │    │          │    │              │    │                 │
│ - Lint   │    │ - Cargo  │    │ - Terraform  │    │ - ECS Update    │
│ - Test   │    │ - Docker │    │   Plan/Apply │    │ - Health Check  │
│ - SAST   │    │ - Push   │    │              │    │                 │
└──────────┘    └──────────┘    └──────────────┘    └─────────────────┘
Workflow Stages¶
1. Validate¶
validate:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Lint
      run: cargo fmt --check
    - name: Clippy
      run: cargo clippy -- -D warnings
    - name: Test
      run: cargo test --all-features
    - name: Security Scan
      run: cargo audit
2. Build¶
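build:
  needs: validate
  runs-on: ubuntu-latest
  permissions:
    id-token: write # required to request the OIDC token for AWS
    contents: read
  steps:
    - uses: actions/checkout@v4
    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
        aws-region: us-east-1
    - name: Build Release
      run: cargo build --release
    - name: Build Docker Image
      run: docker build -t arbiter-engine .
    - name: Push to ECR
      run: |
        aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
        docker tag arbiter-engine:latest $ECR_REGISTRY/arbiter-engine:${{ github.sha }}
        docker push $ECR_REGISTRY/arbiter-engine:${{ github.sha }}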
3. Deploy Infrastructure¶
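deploy-infra:
  needs: build
  runs-on: ubuntu-latest
  environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
  permissions:
    id-token: write # required to request the OIDC token for AWS
    contents: read
  defaults:
    run:
      working-directory: infrastructure/terraform
  steps:
    - uses: actions/checkout@v4
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
        aws-region: us-east-1
    - name: Terraform Init
      run: terraform init
    - name: Terraform Plan
      run: terraform plan -out=tfplan
    - name: Terraform Apply
      run: terraform apply tfplan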
4. Deploy Services¶
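deploy-services:
  needs: deploy-infra
  runs-on: ubuntu-latest
  permissions:
    id-token: write # required to request the OIDC token for AWS
    contents: read
  steps:
    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
        aws-region: us-east-1
    - name: Update ECS Service
      run: |
        aws ecs update-service \
          --cluster arbiter-${{ env.ENVIRONMENT }} \
          --service trading-core \
          --force-new-deployment
    - name: Wait for Deployment
      run: |
        aws ecs wait services-stable \
          --cluster arbiter-${{ env.ENVIRONMENT }} \
          --services trading-core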
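The workflow step above redeploys only `trading-core`. If all three services should be rolled in one pass, the same two AWS CLI calls can be looped. This is a sketch under assumptions: the ECS service names `telegram-bot` and `web-api` are guesses modeled on `trading-core`, and `deploy_services` and the `DRY_RUN` switch are hypothetical.

```shell
#!/usr/bin/env bash

# Only trading-core appears in the workflow; the other two names are assumed.
SERVICES="trading-core telegram-bot web-api"

# deploy_services ENVIRONMENT: force a new deployment of every service,
# then block until all of them report a stable state.
# Set DRY_RUN=1 to print the commands instead of running them.
deploy_services() {
  local cluster="arbiter-$1" service
  for service in $SERVICES; do
    ${DRY_RUN:+echo} aws ecs update-service \
      --cluster "$cluster" \
      --service "$service" \
      --force-new-deployment
  done
  # services-stable accepts several service names in one call.
  ${DRY_RUN:+echo} aws ecs wait services-stable \
    --cluster "$cluster" \
    --services $SERVICES
}
```

`DRY_RUN=1 deploy_services staging` prints the four commands, which is a cheap way to review the cluster and service names before a real run.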
Environment Promotion¶
| Branch | Environment | Approval |
|---|---|---|
| feature/* | - | - |
| develop | staging | Automatic |
| main | production | Manual (deploy-prod) |
Production deployments require the deploy-prod GitHub environment approval.
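The promotion table can be expressed as a small lookup that a workflow step could call with the `GITHUB_REF` value GitHub Actions provides. This is an illustrative sketch: `environment_for_ref` is a hypothetical name, and the mapping simply mirrors the table.

```shell
#!/usr/bin/env bash

# environment_for_ref REF: print the deployment target for a Git ref.
# Fails (and prints nothing) for refs that are not deployed, such as feature/*.
environment_for_ref() {
  case "$1" in
    refs/heads/main)    echo "production" ;;
    refs/heads/develop) echo "staging" ;;
    *)                  return 1 ;;
  esac
}

# Example use inside a workflow step:
#   target="$(environment_for_ref "$GITHUB_REF")" || { echo "no deployment for $GITHUB_REF"; exit 0; }
```

Returning a failure for unlisted branches keeps feature branches from deploying by accident if the workflow trigger is ever widened.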
Manual Deployment¶
Initial Setup¶
# Configure AWS credentials
aws configure
# Initialize Terraform
cd infrastructure/terraform
terraform init
# Create one workspace per environment (each command also switches to the new workspace)
terraform workspace new staging
terraform workspace new prod
Deploy to Staging¶
terraform workspace select staging
terraform plan -var-file=environments/staging.tfvars -out=tfplan
terraform apply tfplan
Deploy to Production¶
terraform workspace select prod
# Plan with production variables
terraform plan -var-file=environments/prod.tfvars -out=tfplan
# Review plan carefully, then apply
terraform apply tfplan
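The staging and production steps differ only in the workspace name and variable file, so they can share one wrapper. The sketch below assumes it is run from `infrastructure/terraform`. The function names `tfvars_for` and `plan_and_apply` and the interactive confirmation for `prod` are illustrative additions, not part of the documented procedure.

```shell
#!/usr/bin/env bash

# tfvars_for ENV: print the variable file for a known environment.
tfvars_for() {
  case "$1" in
    staging|prod) echo "environments/$1.tfvars" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# plan_and_apply ENV: select the workspace, write a plan, then apply that plan.
plan_and_apply() {
  local env="$1" var_file answer
  var_file="$(tfvars_for "$env")" || return 1
  terraform workspace select "$env" || return 1
  terraform plan -var-file="$var_file" -out=tfplan || return 1
  if [ "$env" = "prod" ]; then
    # Give the operator a chance to review the plan output first.
    read -r -p "Apply this plan to prod? [y/N] " answer
    [ "$answer" = "y" ] || { echo "Aborted."; return 1; }
  fi
  terraform apply tfplan
}
```

Applying the saved `tfplan` file, rather than planning again, guarantees that what was reviewed is what gets applied.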
Environment Variables¶
Create environment-specific variable files:
# environments/staging.tfvars
environment = "staging"
trading_core_image = "123456789.dkr.ecr.us-east-1.amazonaws.com/arbiter-engine:staging"
telegram_bot_image = "123456789.dkr.ecr.us-east-1.amazonaws.com/telegram-bot:staging"
web_api_image = "123456789.dkr.ecr.us-east-1.amazonaws.com/web-api:staging"
# Scaling
trading_core_count = 1
web_api_count = 1
# Database
rds_instance_class = "db.t3.medium"
rds_multi_az = false
# environments/prod.tfvars
environment = "prod"
trading_core_image = "123456789.dkr.ecr.us-east-1.amazonaws.com/arbiter-engine:v1.2.3"
telegram_bot_image = "123456789.dkr.ecr.us-east-1.amazonaws.com/telegram-bot:v1.2.3"
web_api_image = "123456789.dkr.ecr.us-east-1.amazonaws.com/web-api:v1.2.3"
# Scaling
trading_core_count = 2
web_api_count = 2
# Database
rds_instance_class = "db.r6g.large"
rds_multi_az = true
Rollback Procedures¶
Service Rollback¶
Roll back to a previous task definition:
# List recent task definitions
aws ecs list-task-definitions \
--family-prefix arbiter-trading-core \
--sort DESC \
--max-items 5
# Update service to previous version
aws ecs update-service \
--cluster arbiter-prod \
--service trading-core \
--task-definition arbiter-trading-core:42
# Wait for rollback
aws ecs wait services-stable \
--cluster arbiter-prod \
--services trading-core
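The revision number in the example above (`:42`) has to be looked up by hand. The sketch below automates that lookup by taking the second-newest registered revision. This is an assumption worth checking: the second-newest registered revision is not necessarily the one that was last deployed. `previous_task_definition` and `rollback_service` are hypothetical names.

```shell
#!/usr/bin/env bash

# previous_task_definition: read task definition ARNs (newest first, separated
# by tabs or newlines) on stdin and print the second one.
previous_task_definition() {
  tr '\t' '\n' | grep ':task-definition/' | sed -n '2p'
}

# rollback_service CLUSTER SERVICE FAMILY: point the service at the previous
# revision of FAMILY and wait until the service is stable again.
rollback_service() {
  local cluster="$1" service="$2" family="$3" target
  target="$(aws ecs list-task-definitions \
    --family-prefix "$family" \
    --sort DESC \
    --query 'taskDefinitionArns' \
    --output text | previous_task_definition)"
  [ -n "$target" ] || { echo "no earlier revision found for $family" >&2; return 1; }
  echo "Rolling $service back to $target"
  aws ecs update-service --cluster "$cluster" --service "$service" --task-definition "$target"
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}

# Example: rollback_service arbiter-prod trading-core arbiter-trading-core
```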
Infrastructure Rollback¶
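Terraform has no built-in undo. To roll back an infrastructure change, re-apply a known-good configuration, for example by pinning the previous image tag:
# List the resources tracked in the current state
terraform state list
# Re-apply the ECS module with the previous image (use -target with caution)
terraform apply -target=module.ecs -var="trading_core_image=previous-image:tag"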
Monitoring¶
CloudWatch Metrics¶
Key metrics to monitor:
| Metric | Threshold | Action |
|---|---|---|
| ECS CPUUtilization | > 80% | Scale out |
| ECS MemoryUtilization | > 85% | Scale out |
| RDS Connections | > 80% of max | Investigate |
| ALB TargetResponseTime | > 500ms | Investigate |
| ALB HTTPCode_5XX_Count | > 10/min | Alert |
Alarms¶
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
alarm_name = "arbiter-${var.environment}-high-cpu"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
period = 60
statistic = "Average"
threshold = 80
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = aws_ecs_cluster.main.name
ServiceName = aws_ecs_service.trading_core.name
}
}
Log Groups¶
All services log to CloudWatch Logs:
| Log Group | Retention |
|---|---|
| /ecs/arbiter-trading-core | 30 days |
| /ecs/arbiter-telegram-bot | 14 days |
| /ecs/arbiter-web-api | 14 days |
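If a log group's retention ever drifts from the table (for example after being created by hand), it can be reset with the AWS CLI. This is a sketch only: `set_log_retention` and `DRY_RUN` are hypothetical, and where Terraform manages the log groups the retention should be changed there instead.

```shell
#!/usr/bin/env bash

# set_log_retention: apply the retention periods listed in the table above.
# Set DRY_RUN=1 to print the commands instead of running them.
set_log_retention() {
  local entry group days
  for entry in \
    "/ecs/arbiter-trading-core:30" \
    "/ecs/arbiter-telegram-bot:14" \
    "/ecs/arbiter-web-api:14"; do
    group="${entry%%:*}"   # text before the colon
    days="${entry##*:}"    # text after the colon
    ${DRY_RUN:+echo} aws logs put-retention-policy \
      --log-group-name "$group" \
      --retention-in-days "$days"
  done
}
```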
Security Considerations¶
Network Security¶
- All services run in private subnets
- Database subnets have no internet access
- ALB is the only public-facing component
- Security groups restrict traffic to required ports
Secrets Management¶
- All credentials are stored in AWS Secrets Manager
- ECS tasks use IAM roles for Secrets Manager access
- Secrets are rotated using Secrets Manager's built-in rotation
IAM Policies¶
Follow the least-privilege principle:
data "aws_iam_policy_document" "ecs_task" {
statement {
effect = "Allow"
actions = [
"secretsmanager:GetSecretValue"
]
resources = [
module.secrets.secrets_arn
]
}
statement {
effect = "Allow"
actions = [
"logs:CreateLogStream",
"logs:PutLogEvents"
]
resources = [
"${aws_cloudwatch_log_group.main.arn}:*"
]
}
}
Troubleshooting¶
ECS Task Failures¶
# Check stopped task reason
aws ecs describe-tasks \
--cluster arbiter-prod \
--tasks arn:aws:ecs:us-east-1:123456789:task/arbiter-prod/abc123
# View container logs
aws logs get-log-events \
--log-group-name /ecs/arbiter-trading-core \
--log-stream-name ecs/trading-core/abc123
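The two commands above are linked by the task ID: the log stream name in the example ends with the last segment of the task ARN. The sketch below derives one from the other and trims the `describe-tasks` output to the stop reasons. It assumes the `ecs/<container>/<task-id>` stream naming shown in the example. The function names are hypothetical.

```shell
#!/usr/bin/env bash

# log_stream_for_task TASK_ARN [CONTAINER]: build the log stream name for a task.
# The task ID is the last path segment of the ARN.
log_stream_for_task() {
  local task_arn="$1" container="${2:-trading-core}"
  echo "ecs/${container}/${task_arn##*/}"
}

# stopped_reason CLUSTER TASK_ARN: print the task-level stop reason followed by
# the first container's reason, instead of the full JSON document.
stopped_reason() {
  aws ecs describe-tasks \
    --cluster "$1" \
    --tasks "$2" \
    --query 'tasks[0].[stoppedReason, containers[0].reason]' \
    --output text
}

# Example:
#   arn=arn:aws:ecs:us-east-1:123456789:task/arbiter-prod/abc123
#   stopped_reason arbiter-prod "$arn"
#   aws logs get-log-events \
#     --log-group-name /ecs/arbiter-trading-core \
#     --log-stream-name "$(log_stream_for_task "$arn")"
```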
Database Connection Issues¶
# Test connectivity from bastion
psql -h aurora-endpoint.us-east-1.rds.amazonaws.com \
-U arbiter \
-d arbiter_prod
# Check security group rules
aws ec2 describe-security-groups \
--group-ids sg-12345678
Terraform State Issues¶
# Reconcile state with real infrastructure (replaces the deprecated `terraform refresh`)
terraform apply -refresh-only
# Import existing resource
terraform import aws_ecs_cluster.main arbiter-prod
# Remove resource from state (doesn't delete)
terraform state rm aws_ecs_service.old_service
Related Documentation¶
- ADR-010: Deployment Architecture - Architecture decision
- Environment Variables - Configuration reference
- Security Guide - Security considerations
- Multi-Tenancy - Tenant isolation