Deployment Guide

Infrastructure-as-Code deployment using Terraform and GitHub Actions.

Overview

Arbiter-Bot runs on AWS with a fully automated CI/CD pipeline:

Component    Technology          Purpose
Compute      ECS Fargate         Serverless containers
Database     Aurora PostgreSQL   Persistent storage
Cache        ElastiCache Redis   Session and rate limit state
Networking   VPC + ALB           Isolation and load balancing
CI/CD        GitHub Actions      Automated deployment
IaC          Terraform           Infrastructure provisioning

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         AWS Region (us-east-1)                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                           VPC                              │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐│  │
│  │  │ Public      │  │ Private     │  │ Database            ││  │
│  │  │ Subnets     │  │ Subnets     │  │ Subnets             ││  │
│  │  │             │  │             │  │                     ││  │
│  │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────────────┐ ││  │
│  │  │ │   ALB   │ │  │ │   ECS   │ │  │ │ Aurora Postgres │ ││  │
│  │  │ │         │─┼──┼─│ Fargate │─┼──┼─│                 │ ││  │
│  │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────────────┘ ││  │
│  │  │             │  │             │  │                     ││  │
│  │  │             │  │             │  │ ┌─────────────────┐ ││  │
│  │  │             │  │             │  │ │ ElastiCache     │ ││  │
│  │  │             │  │             │  │ │ Redis           │ ││  │
│  │  │             │  │             │  │ └─────────────────┘ ││  │
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘│  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Prerequisites

AWS Requirements

Requirement       Details
AWS Account       With programmatic access
Region            us-east-1 (optimized for exchange latency)
IAM Role          OIDC provider for GitHub Actions
ECR Repository    Container image storage
Secrets Manager   Credential storage
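
The OIDC role from the table can be provisioned in Terraform; a minimal sketch, assuming a repository named your-org/arbiter-bot (adjust the sub condition to your own repo):

resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # Restrict which repository may assume this role
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:your-org/arbiter-bot:*"]
    }
  }
}

resource "aws_iam_role" "github_actions" {
  name               = "arbiter-github-actions"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}

The role's ARN is what goes into the AWS_ROLE_ARN secret below.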

Local Tools

# Install required tools (Terraform now ships via the HashiCorp tap)
brew tap hashicorp/tap
brew install hashicorp/tap/terraform awscli

# Verify versions
terraform --version  # >= 1.0
aws --version        # >= 2.0

GitHub Secrets

Configure these secrets in your GitHub repository:

Secret                Description
AWS_ROLE_ARN          IAM role ARN for OIDC authentication
POLY_PRIVATE_KEY      Polymarket signing key
POLY_API_KEY          Polymarket API key
POLY_API_SECRET       Polymarket API secret
POLY_API_PASSPHRASE   Polymarket API passphrase
KALSHI_KEY_ID         Kalshi API key ID
KALSHI_PRIVATE_KEY    Kalshi RSA private key (PEM)
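
These can be set from the command line with the GitHub CLI. A dry-run sketch that only prints the commands (the repo name is a placeholder; pipe the output to sh to execute):

```shell
# Print one `gh secret set` command per required secret (dry run).
for name in AWS_ROLE_ARN POLY_PRIVATE_KEY POLY_API_KEY POLY_API_SECRET \
            POLY_API_PASSPHRASE KALSHI_KEY_ID KALSHI_PRIVATE_KEY; do
  echo "gh secret set $name --repo your-org/arbiter-bot"
done
```

gh secret set reads the value from stdin, so multi-line values such as the Kalshi PEM key can be supplied with e.g. `gh secret set KALSHI_PRIVATE_KEY < key.pem`.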

Terraform Modules

The infrastructure is organized into reusable modules:

infrastructure/terraform/
├── main.tf              # Root module orchestration
├── variables.tf         # Input variables
├── outputs.tf           # Output values
├── backend.tf           # S3 state backend
└── modules/
    ├── vpc/             # Network infrastructure
    ├── secrets/         # Secrets Manager
    ├── rds/             # Aurora PostgreSQL
    ├── elasticache/     # Redis cluster
    └── ecs/             # Fargate services

VPC Module

Creates isolated network infrastructure:

module "vpc" {
  source = "./modules/vpc"

  environment        = var.environment
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Subnet configuration
  public_subnets   = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  private_subnets  = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
  database_subnets = ["10.0.21.0/24", "10.0.22.0/24", "10.0.23.0/24"]
}

Secrets Module

Manages exchange credentials securely:

module "secrets" {
  source = "./modules/secrets"

  environment = var.environment

  secrets = {
    polymarket = {
      private_key = var.poly_private_key
      api_key     = var.poly_api_key
      api_secret  = var.poly_api_secret
      passphrase  = var.poly_api_passphrase
    }
    kalshi = {
      key_id      = var.kalshi_key_id
      private_key = var.kalshi_private_key
    }
  }
}

RDS Module

Aurora PostgreSQL for persistent storage:

module "rds" {
  source = "./modules/rds"

  environment       = var.environment
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.database_subnet_ids
  security_group_id = module.vpc.database_security_group_id

  # Instance configuration
  instance_class     = "db.r6g.large"
  engine_version     = "15.4"
  allocated_storage  = 100

  # High availability
  multi_az           = var.environment == "prod"
  backup_retention   = 7
}

ElastiCache Module

Redis for session and rate limit state:

module "elasticache" {
  source = "./modules/elasticache"

  environment       = var.environment
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.database_subnet_ids
  security_group_id = module.vpc.cache_security_group_id

  # Cluster configuration
  node_type          = "cache.r6g.large"
  num_cache_nodes    = var.environment == "prod" ? 3 : 1
  engine_version     = "7.0"
}

ECS Module

Fargate services for application workloads:

module "ecs" {
  source = "./modules/ecs"

  environment        = var.environment
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnet_ids
  alb_security_group = module.vpc.alb_security_group_id

  # Service images
  trading_core_image  = var.trading_core_image
  telegram_bot_image  = var.telegram_bot_image
  web_api_image       = var.web_api_image

  # Resource allocation
  services = {
    trading_core = {
      cpu    = 4096   # 4 vCPU
      memory = 8192   # 8 GB
      count  = var.environment == "prod" ? 2 : 1
    }
    telegram_bot = {
      cpu    = 512    # 0.5 vCPU
      memory = 1024   # 1 GB
      count  = 1
    }
    web_api = {
      cpu    = 1024   # 1 vCPU
      memory = 2048   # 2 GB
      count  = var.environment == "prod" ? 2 : 1
    }
  }

  # Secrets references
  secrets_arn = module.secrets.secrets_arn

  # Database connection
  database_url = module.rds.connection_string
  redis_url    = module.elasticache.connection_string
}

Service Configuration

Trading Core

The main arbitrage engine with high resource allocation:

Setting        Value      Rationale
CPU            4 vCPU     Low-latency computation
Memory         8 GB       Order book caching
Replicas       2 (prod)   High availability
Health Check   /health    Liveness probe

Telegram Bot

User interface service with minimal resources:

Setting        Value      Rationale
CPU            0.5 vCPU   I/O-bound workload
Memory         1 GB       Session state
Replicas       1          Stateless
Health Check   /health    Liveness probe

Web API

gRPC/REST API service:

Setting        Value      Rationale
CPU            1 vCPU     Request handling
Memory         2 GB       Connection pooling
Replicas       2 (prod)   Load balancing
Health Check   /health    Liveness probe
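
Fargate only accepts specific CPU/memory pairings, so the allocations above must stay within the documented ranges. A quick sanity check (the ranges are a simplified subset of the AWS Fargate task-size table; it does not enforce the per-GiB increments):

```shell
# Return success if the CPU (units) / memory (MiB) pair falls inside the
# Fargate-supported range for that CPU size.
valid_combo() {
  case "$1" in
    512)  [ "$2" -ge 1024 ] && [ "$2" -le 4096 ] ;;
    1024) [ "$2" -ge 2048 ] && [ "$2" -le 8192 ] ;;
    2048) [ "$2" -ge 4096 ] && [ "$2" -le 16384 ] ;;
    4096) [ "$2" -ge 8192 ] && [ "$2" -le 30720 ] ;;
    *) return 1 ;;
  esac
}

valid_combo 4096 8192 && echo "trading_core: ok"
valid_combo 512 1024  && echo "telegram_bot: ok"
valid_combo 1024 2048 && echo "web_api: ok"
```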

CI/CD Pipeline

GitHub Actions workflow with four stages:

┌──────────┐    ┌──────────┐    ┌──────────────┐    ┌─────────────────┐
│ Validate │───>│  Build   │───>│ Deploy Infra │───>│ Deploy Services │
│          │    │          │    │              │    │                 │
│ - Lint   │    │ - Cargo  │    │ - Terraform  │    │ - ECS Update    │
│ - Test   │    │ - Docker │    │   Plan/Apply │    │ - Health Check  │
│ - SAST   │    │ - Push   │    │              │    │                 │
└──────────┘    └──────────┘    └──────────────┘    └─────────────────┘

Workflow Stages

1. Validate

validate:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4

    - name: Lint
      run: cargo fmt --check

    - name: Clippy
      run: cargo clippy -- -D warnings

    - name: Test
      run: cargo test --all-features

    - name: Security Scan
      run: cargo audit

2. Build

build:
  needs: validate
  runs-on: ubuntu-latest
  steps:
    - name: Build Release
      run: cargo build --release

    - name: Build Docker Image
      run: docker build -t arbiter-engine .

    - name: Push to ECR
      run: |
        aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
        docker tag arbiter-engine:latest $ECR_REGISTRY/arbiter-engine:${{ github.sha }}
        docker push $ECR_REGISTRY/arbiter-engine:${{ github.sha }}
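
The push step above assumes AWS credentials and $ECR_REGISTRY are already available in the job. With the OIDC setup from Prerequisites, that is typically done with the standard aws-actions steps earlier in the same job; a sketch:

# Sketch: obtain short-lived credentials via OIDC, then log in to ECR.
- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: us-east-1

- name: Login to Amazon ECR
  id: ecr-login
  uses: aws-actions/amazon-ecr-login@v2

The amazon-ecr-login step exposes the registry URI as a step output (steps.ecr-login.outputs.registry), which can populate $ECR_REGISTRY.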

3. Deploy Infrastructure

deploy-infra:
  needs: build
  runs-on: ubuntu-latest
  environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
  steps:
    - name: Terraform Init
      run: terraform init

    - name: Terraform Plan
      run: terraform plan -out=tfplan

    - name: Terraform Apply
      run: terraform apply tfplan

4. Deploy Services

deploy-services:
  needs: deploy-infra
  runs-on: ubuntu-latest
  steps:
    - name: Update ECS Service
      run: |
        aws ecs update-service \
          --cluster arbiter-${{ env.ENVIRONMENT }} \
          --service trading-core \
          --force-new-deployment

    - name: Wait for Deployment
      run: |
        aws ecs wait services-stable \
          --cluster arbiter-${{ env.ENVIRONMENT }} \
          --services trading-core

Environment Promotion

Branch      Environment   Approval
feature/*   -             -
develop     staging       Automatic
main        production    Manual (deploy-prod)

Production deployments require the deploy-prod GitHub environment approval.
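
The branch-to-environment mapping is driven by the workflow trigger; a minimal sketch, assuming the deploy workflow reacts only to pushes on these two branches:

# Sketch: pushes to develop deploy staging automatically; pushes to main
# target the approval-gated production environment (see the deploy-infra
# stage's conditional environment expression above).
on:
  push:
    branches: [develop, main]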

Manual Deployment

Initial Setup

# Configure AWS credentials
aws configure

# Initialize Terraform
cd infrastructure/terraform
terraform init

# Create workspace for environment
terraform workspace new staging
terraform workspace new prod

Deploy to Staging

terraform workspace select staging

terraform plan -var-file=environments/staging.tfvars -out=tfplan
terraform apply tfplan

Deploy to Production

terraform workspace select prod

# Plan with production variables
terraform plan -var-file=environments/prod.tfvars -out=tfplan

# Review plan carefully, then apply
terraform apply tfplan

Environment Variables

Create environment-specific variable files:

# environments/staging.tfvars
environment = "staging"

trading_core_image  = "123456789.dkr.ecr.us-east-1.amazonaws.com/arbiter-engine:staging"
telegram_bot_image  = "123456789.dkr.ecr.us-east-1.amazonaws.com/telegram-bot:staging"
web_api_image       = "123456789.dkr.ecr.us-east-1.amazonaws.com/web-api:staging"

# Scaling
trading_core_count = 1
web_api_count      = 1

# Database
rds_instance_class = "db.t3.medium"
rds_multi_az       = false

# environments/prod.tfvars
environment = "prod"

trading_core_image  = "123456789.dkr.ecr.us-east-1.amazonaws.com/arbiter-engine:v1.2.3"
telegram_bot_image  = "123456789.dkr.ecr.us-east-1.amazonaws.com/telegram-bot:v1.2.3"
web_api_image       = "123456789.dkr.ecr.us-east-1.amazonaws.com/web-api:v1.2.3"

# Scaling
trading_core_count = 2
web_api_count      = 2

# Database
rds_instance_class = "db.r6g.large"
rds_multi_az       = true

Rollback Procedures

Service Rollback

Roll back to a previous task definition:

# List recent task definitions
aws ecs list-task-definitions \
  --family-prefix arbiter-trading-core \
  --sort DESC \
  --max-items 5

# Update service to previous version
aws ecs update-service \
  --cluster arbiter-prod \
  --service trading-core \
  --task-definition arbiter-trading-core:42

# Wait for rollback
aws ecs wait services-stable \
  --cluster arbiter-prod \
  --services trading-core

Infrastructure Rollback

Terraform itself keeps no history, but the versioned S3 state backend retains previous state files, and a targeted apply can revert an individual change:

# List resources tracked in the current state
terraform state list

# Revert a single change by re-applying the previous value (use with caution)
terraform apply -target=module.ecs -var="trading_core_image=previous-image:tag"

Monitoring

CloudWatch Metrics

Key metrics to monitor:

Metric                   Threshold      Action
ECS CPUUtilization       > 80%          Scale out
ECS MemoryUtilization    > 85%          Scale out
RDS Connections          > 80% of max   Investigate
ALB TargetResponseTime   > 500 ms       Investigate
ALB HTTPCode_5XX_Count   > 10/min       Alert

Alarms

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "arbiter-${var.environment}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 60
  statistic           = "Average"
  threshold           = 80
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.trading_core.name
  }
}

Log Groups

All services log to CloudWatch Logs:

Log Group                   Retention
/ecs/arbiter-trading-core   30 days
/ecs/arbiter-telegram-bot   14 days
/ecs/arbiter-web-api        14 days
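
Retention is set per log group in Terraform; a sketch for one group (resource name hypothetical):

resource "aws_cloudwatch_log_group" "trading_core" {
  name              = "/ecs/arbiter-trading-core"
  retention_in_days = 30
}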

Security Considerations

Network Security

  • All services run in private subnets
  • Database subnets have no internet access
  • ALB is the only public-facing component
  • Security groups restrict traffic to required ports
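
The "required ports" rule is expressed as security-group references rather than CIDR ranges, so only traffic from the ALB's security group can reach the ECS tasks. A sketch (resource names and the service port are assumptions):

# Only the ALB security group may reach the service port on ECS tasks.
resource "aws_security_group_rule" "alb_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs.id
  source_security_group_id = aws_security_group.alb.id
}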

Secrets Management

  • All credentials stored in AWS Secrets Manager
  • ECS tasks use IAM roles for Secrets Manager access
  • Secrets rotated automatically via Secrets Manager rotation schedules
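
Rotation can be wired up in Terraform; a sketch, assuming a rotation Lambda and secret resource already exist under these (hypothetical) names:

resource "aws_secretsmanager_secret_rotation" "exchange_keys" {
  secret_id           = aws_secretsmanager_secret.exchange_keys.id
  rotation_lambda_arn = aws_lambda_function.rotator.arn

  rotation_rules {
    automatically_after_days = 30
  }
}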

IAM Policies

Follow least-privilege principle:

data "aws_iam_policy_document" "ecs_task" {
  statement {
    effect = "Allow"
    actions = [
      "secretsmanager:GetSecretValue"
    ]
    resources = [
      module.secrets.secrets_arn
    ]
  }

  statement {
    effect = "Allow"
    actions = [
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = [
      "${aws_cloudwatch_log_group.main.arn}:*"
    ]
  }
}

Troubleshooting

ECS Task Failures

# Check stopped task reason
aws ecs describe-tasks \
  --cluster arbiter-prod \
  --tasks arn:aws:ecs:us-east-1:123456789:task/arbiter-prod/abc123

# View container logs
aws logs get-log-events \
  --log-group-name /ecs/arbiter-trading-core \
  --log-stream-name ecs/trading-core/abc123

Database Connection Issues

# Test connectivity from bastion
psql -h aurora-endpoint.us-east-1.rds.amazonaws.com \
     -U arbiter \
     -d arbiter_prod

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids sg-12345678

Terraform State Issues

# Reconcile state with real infrastructure
# (deprecated as a standalone command; newer Terraform prefers: terraform apply -refresh-only)
terraform refresh

# Import existing resource
terraform import aws_ecs_cluster.main arbiter-prod

# Remove resource from state (doesn't delete)
terraform state rm aws_ecs_service.old_service