Troubleshooting Guide¶
Common issues and solutions for AWS Marketplace deployments of Aletyx Decision Control.
CloudFormation Stack Issues¶
Stack Creation Failed¶
Check the Events Tab:
- Go to CloudFormation console
- Select your stack
- Click Events tab
- Look for resources with CREATE_FAILED status
- Read the "Status reason" column for error details
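The same check can be run from the command line. The following is a minimal sketch, assuming the AWS CLI is configured for the stack's region; `failed_events` is a hypothetical helper name and `my-stack` is a placeholder.

```shell
# List resources that failed during stack creation, with the reason for each.
# "failed_events" is a hypothetical helper; pass your own stack name.
failed_events() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
    --output table
}

# Usage: failed_events my-stack
```

The first failure in time order is usually the root cause; later failures are often cancellations triggered by it.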
Common CloudFormation Errors¶
"The subnet ID 'subnet-xxx' does not exist"¶
Cause: Selected subnet is in wrong region or doesn't exist
Fix: Verify you selected a subnet in the same region as your stack
"Subnets specified should be in distinct availability zones"¶
Cause: Both subnets are in the same Availability Zone (Production only)
Fix: Select Subnet2 from a different AZ than Subnet1
# Check subnet AZs
aws ec2 describe-subnets \
--subnet-ids subnet-111 subnet-222 \
--query 'Subnets[].[SubnetId,AvailabilityZone]' \
--output table
"Cannot find version 14.9 for postgres"¶
Cause: Invalid PostgreSQL version
Fix: Template should use PostgreSQL 15.x (verify template is latest version)
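To see which PostgreSQL versions RDS currently offers, something like the following may help. It is a sketch only: `list_postgres_versions` is a hypothetical helper, and the `us-east-1` default is an assumption taken from the other examples in this guide.

```shell
# Print the PostgreSQL engine versions RDS offers in a region.
# "list_postgres_versions" is a hypothetical helper name.
list_postgres_versions() {
  aws rds describe-db-engine-versions \
    --engine postgres \
    --region "${1:-us-east-1}" \
    --query 'DBEngineVersions[].EngineVersion' \
    --output text
}

# Usage: list_postgres_versions us-east-1
```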
"The volume 'vol-xxx' is not in the same availability zone"¶
Cause: EBS volume and instance in different AZs
Fix: This is a template bug - update to latest template version
Application Not Accessible¶
Symptom: Cannot reach http://ec2-xxx.compute-1.amazonaws.com/
Diagnosis Steps¶
1. Verify instance is running:
aws ec2 describe-instances \
--instance-ids i-xxxxx \
--query 'Reservations[0].Instances[0].State.Name'
# Should return: "running"
2. Check security group allows your IP:
aws ec2 describe-security-groups --group-ids sg-xxxxx
# Verify inbound rules include your IP on port 80
3. Test port connectivity:
4. Check application logs (via SSH or SSM):
# Sandbox logs
sudo tail -f /var/log/aletyx-sandbox.log
# Production logs
sudo tail -f /var/log/aletyx-production.log
5. Check application is running:
# Check Java process
ps aux | grep java
# Check port 8080 listening
sudo netstat -tlnp | grep 8080
# Check Docker container
sudo docker ps
sudo docker logs decision-control --tail 100
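Step 3 of the diagnosis steps above does not show a command. One way to test port connectivity is with `nc`, which this guide also uses for SSH checks. The sketch below assumes a netcat build that supports `-z` and `-w`; `check_port` is a hypothetical helper name.

```shell
# Report whether a TCP port accepts connections.
# nc exits 0 when the connection succeeds, non-zero when it is refused or times out.
check_port() {
  host=$1
  port=${2:-80}
  if nc -z -w 5 "$host" "$port"; then
    echo "open: $host:$port"
  else
    echo "unreachable: $host:$port"
  fi
}

# Usage: check_port ec2-xxx.compute-1.amazonaws.com 80
```

An immediate failure usually means nothing is listening yet, while a failure after the full 5-second wait points to a security group or network rule dropping the traffic.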
Solution: Application Still Starting¶
Symptom: Connection refused immediately after stack creation
Cause: Application takes 2-3 minutes to start
Fix: Wait and retry
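Rather than retrying by hand, the wait can be scripted. This is a rough sketch: `wait_for_app` is a hypothetical helper, and the default of 30 attempts at 10-second intervals is an assumption chosen to comfortably cover the 2-3 minute startup.

```shell
# Poll a URL until the server answers or the attempts run out.
# curl exits 0 for any HTTP response (even an error page) and non-zero
# when the connection is refused or times out.
wait_for_app() {
  url=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "up after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still down after $attempts attempt(s)"
  return 1
}

# Usage: wait_for_app http://ec2-xxx.compute-1.amazonaws.com/
```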
Solution: Security Group Blocking¶
Symptom: Connection times out
Cause: Security group doesn't allow your IP
Fix: Update security group or stack parameter
# Get your current IP
MY_IP=$(curl -s https://checkip.amazonaws.com)
# Update security group
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxx \
--protocol tcp \
--port 80 \
--cidr "${MY_IP}/32"
HTTPS/SSL Issues¶
Symptom: HTTPS not working, certificate errors
DNS Not Resolving¶
Symptom: DNS problem: NXDOMAIN looking up A for my-app.example.com
Diagnosis:
Causes & Fixes:
- Route 53 hosted zone doesn't exist
- DNS record not created
- DNS propagation delay (wait up to 5 minutes)
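The causes above can be checked with `dig` and the Route 53 CLI. The following sketch assumes `dig` is installed and that the domain is hosted in Route 53 in the current account; `check_dns` is a hypothetical helper, and both arguments are placeholders.

```shell
# Resolve the A record, then look for a hosted zone with the given name.
check_dns() {
  domain=$1   # e.g. my-app.example.com
  zone=$2     # e.g. example.com
  echo "A records for $domain:"
  dig +short "$domain" A
  echo "Hosted zones named $zone:"
  # Route 53 stores zone names with a trailing dot.
  aws route53 list-hosted-zones-by-name \
    --dns-name "$zone" \
    --query "HostedZones[?Name=='${zone}.'].[Id,Name]" \
    --output text
}

# Usage: check_dns my-app.example.com example.com
```

Empty output under the first heading means the record does not resolve yet; empty output under the second means no matching hosted zone was found.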
Let's Encrypt Validation Failed¶
Symptom: Failed to verify challenge or The client lacks sufficient authorization
Diagnosis:
Causes & Fixes:
- Port 80 not accessible from 0.0.0.0/0
- nginx not configured properly
- Application blocking ACME challenge
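For the first cause, a quick way to confirm whether port 80 is open to the internet is to read the security group's ingress rules. This is a sketch with hypothetical helper names; it only inspects rules whose port range starts at 80, so a wider range such as 0-65535 would not be detected.

```shell
# Print the IPv4 CIDR ranges allowed to reach port 80 on a security group.
port80_cidrs() {
  aws ec2 describe-security-groups \
    --group-ids "$1" \
    --query 'SecurityGroups[0].IpPermissions[?FromPort==`80`].IpRanges[].CidrIp' \
    --output text
}

# Succeed only if one of those ranges is 0.0.0.0/0.
port80_is_public() {
  port80_cidrs "$1" | tr '\t' '\n' | grep -qxF '0.0.0.0/0'
}

# Usage: port80_is_public sg-xxxxx && echo "port 80 is public"
```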
Certificate Expired¶
Symptom: SSL certificate problem: certificate has expired
Diagnosis:
echo | openssl s_client -servername my-app.example.com \
-connect my-app.example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Check if notAfter has passed
Causes & Fixes:
- Auto-renewal failed
- Port 80 blocked during renewal:
  - Ensure security group allows HTTP from 0.0.0.0/0
- Force manual renewal
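A manual renewal might look like the sketch below. It assumes the certificate is managed by certbot, which the `/etc/letsencrypt/live/` paths in this guide suggest but do not confirm; `renew_cert` is a hypothetical helper name.

```shell
# Rehearse, then force, a certificate renewal and reload nginx.
renew_cert() {
  # Rehearse the renewal against the Let's Encrypt staging environment first.
  sudo certbot renew --dry-run || return 1
  # Force a real renewal even if the certificate is not yet due.
  # Let's Encrypt rate limits apply, so avoid running this repeatedly.
  sudo certbot renew --force-renewal || return 1
  # Reload nginx so it picks up the new certificate files.
  sudo systemctl reload nginx
}

# Usage: renew_cert
```

If the dry run fails, the function stops before touching the live certificate, and the certbot output usually indicates whether port 80 or DNS is at fault.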
nginx Not Starting¶
Symptom: nginx: [emerg] cannot load certificate
Diagnosis:
# Test nginx configuration
sudo nginx -t
# Check certificate files exist
sudo ls -la /etc/letsencrypt/live/my-app.example.com/
Fixes:
# Restart nginx
sudo systemctl restart nginx
# If still failing, check logs
sudo journalctl -u nginx -n 100
Database Connection Issues (Production)¶
Symptom: Application cannot connect to RDS
RDS Instance Not Available¶
Diagnosis:
aws rds describe-db-instances \
--db-instance-identifier mydb \
--query 'DBInstances[0].DBInstanceStatus'
# Should return: "available"
If status is "creating": Wait 8-10 minutes for RDS to finish provisioning
Security Group Blocking EC2→RDS¶
Diagnosis:
# Get RDS security group
RDS_SG=$(aws rds describe-db-instances \
--db-instance-identifier mydb \
--query 'DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
--output text)
# Check inbound rules
aws ec2 describe-security-groups \
--group-ids $RDS_SG \
--query 'SecurityGroups[0].IpPermissions'
Expected: Should allow port 5432 from EC2 security group
Fix (if missing):
aws ec2 authorize-security-group-ingress \
--group-id $RDS_SG \
--protocol tcp \
--port 5432 \
--source-group sg-ec2-xxxxx
Test Database Connectivity¶
From EC2 instance:
# Open a shell on the instance (SSM Session Manager)
aws ssm start-session --target i-xxxxx
# Test connection
psql -h mydb.c9akciq32.us-east-1.rds.amazonaws.com \
-U aletyxadmin \
-d decision_control \
-p 5432
# If connection fails, check:
# 1. Database endpoint is correct
# 2. Database is available
# 3. Security groups allow connection
Wrong Database Credentials¶
Diagnosis:
Fix: Update credentials to match RDS parameters
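One way to confirm whether a username and password are accepted is a one-off, non-interactive query. The sketch below assumes `psql` is installed on the instance; `test_db_login` is a hypothetical helper, and the password is supplied through the standard `PGPASSWORD` environment variable.

```shell
# Try a trivial query without prompting; psql reads the password from PGPASSWORD.
# Arguments: host, user, database.
test_db_login() {
  if psql -h "$1" -U "$2" -d "$3" -p 5432 -w -c 'SELECT 1' >/dev/null 2>&1; then
    echo "login ok"
  else
    echo "login failed"
  fi
}

# Usage:
#   PGPASSWORD='your-password' test_db_login \
#     mydb.c9akciq32.us-east-1.rds.amazonaws.com aletyxadmin decision_control
```

A "login failed" result can also mean the database is unreachable, so rule out the connectivity checks above before changing credentials.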
Performance Issues¶
High CPU Usage¶
Diagnosis:
# Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
Solutions:
- Upgrade instance type:
  - Sandbox: t3.medium → t3.large
  - Production: m5.xlarge → m5.2xlarge
- Check for runaway processes
- Review application logs for errors
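On the instance itself, the last two checks could be sketched as follows. Both helper names are hypothetical; `ps --sort` assumes the procps version of `ps` found on most Linux distributions, and the log path is whichever of the log files listed earlier applies to your edition.

```shell
# Show the N processes using the most CPU (plus the header line).
top_cpu() {
  ps aux --sort=-%cpu | head -n "$(( ${1:-5} + 1 ))"
}

# Show the last N lines of a log file that mention "error" (case-insensitive).
recent_errors() {
  grep -i 'error' "$1" | tail -n "${2:-20}"
}

# Usage:
#   top_cpu 5
#   recent_errors /var/log/aletyx-sandbox.log 20   # run as root if the log is protected
```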
High Database Connections (Production)¶
Diagnosis:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=mydb \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Maximum
Solutions:
- Increase max_connections
- Check for connection leaks
- Upgrade RDS instance class:
  - db.t3.medium → db.m5.large
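Before changing anything, it can help to compare the configured limit with what is actually in use. The following sketch queries PostgreSQL directly; `db_connection_report` is a hypothetical helper, and it assumes `psql` can already authenticate (for example through `PGPASSWORD`).

```shell
# Arguments: host, user, database.
db_connection_report() {
  # Configured ceiling for concurrent connections.
  psql -h "$1" -U "$2" -d "$3" -At -c 'SHOW max_connections;'
  # Current sessions grouped by state. Many long-lived "idle in transaction"
  # sessions often point to a leak in the application's connection handling.
  psql -h "$1" -U "$2" -d "$3" -At -c \
    'SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;'
}

# Usage: db_connection_report mydb.c9akciq32.us-east-1.rds.amazonaws.com aletyxadmin decision_control
```

On RDS, `max_connections` is normally changed through the DB parameter group rather than inside the database.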
Slow Application Response¶
Diagnosis:
# Test response time
time curl http://ec2-xxx.compute-1.amazonaws.com/
# Check disk I/O
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name DiskReadBytes \
--dimensions Name=InstanceId,Value=i-xxxxx \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
Solutions:
- Check application logs for slow queries
- Increase instance size (more vCPU, RAM)
- Upgrade to c5 series (compute-optimized)
- Enable RDS read replicas (production high traffic)
Access and Permission Issues¶
SSM Session Manager Not Working¶
Symptom: TargetNotConnected error when starting session
Diagnosis:
# Check instance has SSM agent
aws ssm describe-instance-information \
--filters "Key=InstanceIds,Values=i-xxxxx"
Causes & Fixes:
- Instance doesn't have internet access:
  - Needs NAT Gateway or IGW for SSM
  - Or use VPC endpoints for SSM
- IAM role not attached
- SSM agent not running
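Whether an IAM role is attached can be checked from outside the instance. This is a sketch with hypothetical helper names; it only confirms that an instance profile exists, not that its policies grant Session Manager access.

```shell
# Print the instance profile ARN, or "None" when no role is attached.
instance_profile_arn() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].IamInstanceProfile.Arn' \
    --output text
}

# Report the result and fail when the instance has no profile.
check_instance_profile() {
  arn=$(instance_profile_arn "$1")
  if [ -z "$arn" ] || [ "$arn" = "None" ]; then
    echo "no IAM role attached to $1"
    return 1
  fi
  echo "instance profile: $arn"
}

# Usage: check_instance_profile i-xxxxx
```

If you can still reach the instance over SSH, `sudo systemctl status amazon-ssm-agent` shows whether the agent is running (that is the service name on Amazon Linux; other distributions may differ).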
SSH Connection Refused¶
Symptom: Connection refused when trying to SSH
Diagnosis:
1. Check security group allows port 22 from your IP
2. Verify you have the correct private key
3. Check instance is running
Solutions:
# Test port 22 connectivity
nc -zv 54.123.45.67 22
# Fix key permissions
chmod 400 ~/.ssh/your-key.pem
# Use correct username
ssh -i ~/.ssh/your-key.pem ec2-user@54.123.45.67
Billing and Cost Issues¶
Unexpected Charges¶
Diagnosis:
# Check running instances
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].[InstanceId,InstanceType,LaunchTime]' \
--output table
Common Causes:
- Forgot to stop/delete instances
- EBS volumes not deleted with instances
- RDS backups accumulating
- Elastic IP charges (when not attached)
Solutions:
# Stop instance (Sandbox - stops compute charges)
aws ec2 stop-instances --instance-ids i-xxxxx
# Delete stack (removes everything)
aws cloudformation delete-stack --stack-name my-stack
# Check for orphaned volumes
aws ec2 describe-volumes \
--filters "Name=status,Values=available"
# Delete unused volumes
aws ec2 delete-volume --volume-id vol-xxxxx
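The commands above cover instances and volumes. For the other two causes listed, unattached Elastic IPs and accumulating RDS backups, a sketch along these lines may help; both helper names are hypothetical, and the snapshot listing covers manual snapshots only, on the assumption that these are the ones most likely to outlive a deleted database.

```shell
# Elastic IPs that are allocated but not associated with anything.
unattached_eips() {
  aws ec2 describe-addresses \
    --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
    --output text
}

# Manual RDS snapshots with their size (GiB) and creation time.
manual_snapshots() {
  aws rds describe-db-snapshots \
    --snapshot-type manual \
    --query 'DBSnapshots[].[DBSnapshotIdentifier,AllocatedStorage,SnapshotCreateTime]' \
    --output text
}

# Usage:
#   unattached_eips
#   manual_snapshots
```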
Getting Help¶
Collect Diagnostic Information¶
Before contacting support, gather this information:
# 1. CloudFormation stack details
aws cloudformation describe-stacks \
--stack-name my-stack \
--region us-east-1 > stack-details.json
# 2. CloudFormation events
aws cloudformation describe-stack-events \
--stack-name my-stack \
--region us-east-1 > stack-events.json
# 3. Instance details
aws ec2 describe-instances \
--instance-ids i-xxxxx \
--region us-east-1 > instance-details.json
# 4. Application logs (last 100 lines)
ssh -i your-key.pem ec2-user@ec2-xxx.compute-1.amazonaws.com \
'sudo tail -n 100 /var/log/aletyx-*.log' > app-logs.txt
# 5. System logs
ssh -i your-key.pem ec2-user@ec2-xxx.compute-1.amazonaws.com \
'sudo journalctl -xe -n 100' > system-logs.txt
Support Channels¶
For AWS Marketplace Issues:
- Email: aws-support@aletyx.com
- Subject: Include "AWS Marketplace" and your stack name
- Include: All diagnostic information above
For AWS Infrastructure Issues:
- AWS Support Console: https://console.aws.amazon.com/support/
- Topic: EC2, RDS, CloudFormation, etc.
Documentation:
- Aletyx Docs: https://docs.aletyx.ai
- AWS Docs: https://docs.aws.amazon.com/
Next Steps¶
- Security Best Practices: Security Guide
- SSL/HTTPS Configuration: SSL/HTTPS Guide
- Sandbox Deployment: Sandbox Edition
- Production Deployment: Production Edition