SOA-C03 Cheatsheet - CloudOps Signals, Runbooks, Reliability, Security & Networking

High-signal SOA-C03 reference: monitoring/logging/remediation patterns, reliability and DR decisions, CloudFormation/SSM automation, security/compliance operations, and network troubleshooting defaults.

Keep this page open while drilling. SOA-C03 rewards structured operations thinking: signal -> diagnosis -> low-risk remediation -> verification.


Quick facts (SOA-C03)

Item | Value
Questions | 65 total
Scoring | 50 scored + 15 unscored (unscored items are not identified)
Question types | Multiple choice, multiple response
Time | 130 minutes
Passing score | 720 (scaled 100-1000)
Cost | 150 USD
Domains | D1 22% - D2 22% - D3 22% - D4 16% - D5 18%

Fast strategy

  • Start from the constraint in the last sentence (availability, compliance, latency, cost, operational effort).
  • Prefer the smallest safe operational change that addresses root cause.
  • For noisy incidents, choose approaches that improve signal quality first (better alarms, filtering, dashboards).
  • For repeated incidents, prefer automation (EventBridge + Lambda/SSM runbooks).

Final 20-minute recall (exam day)

Cue -> best answer (pattern map)

If the question says… | Usually best answer
Alarm fatigue / noisy incidents | Composite alarms + tuned thresholds + actionable routing
Repeatable remediation needed | EventBridge -> SSM Automation/Lambda runbook
Patch governance at scale | Systems Manager Patch Manager
Configuration drift detection | AWS Config rules + automatic remediation
Need secure shell-less instance access | Systems Manager Session Manager
Stack update failed | CloudFormation events + change sets + rollback analysis
Backup policy across accounts | AWS Backup plans/policies
Access denial investigation | IAM policy + resource policy + KMS key policy evaluation
Network reachability issue | Route table -> SG -> NACL -> endpoint/NAT path validation
Incident postmortem prevention | Runbook updates + alarm improvements + automation

Must-memorize SOA defaults

Topic | Fast recall
Core observability stack | CloudWatch metrics/logs, CloudTrail, X-Ray (where relevant)
RTO/RPO | Recovery time and data loss objectives drive backup/DR choice
Safe remediation order | Detect -> triage -> fix low blast radius -> verify -> automate
Operational preference | Managed services + automation over manual repetitive operations

Last-minute traps

  • Acting on one symptom metric without correlation to logs/traces/deploy timeline.
  • Running high-risk remediation before confirming blast radius.
  • Treating backups as compliant without restore testing.
  • Alerting on every metric spike instead of SLO-aligned sustained conditions.

1) CloudOps incident loop

    flowchart LR
      A[Detect signal] --> B[Triage severity]
      B --> C[Identify probable root cause]
      C --> D[Apply low-risk remediation]
      D --> E[Validate recovery]
      E --> F[Document + automate prevention]

Use this loop in scenario questions. Wrong answers often skip validation or choose high-blast-radius changes.


2) Monitoring and logging defaults (Domain 1)

Choose the right telemetry

Need | Best AWS signal
Resource/service health trends | CloudWatch metrics
Application/system event detail | CloudWatch Logs
API-level audit trail | CloudTrail
Network allow/deny and flow diagnosis | VPC Flow Logs

Alarm design defaults

  • Use alarm thresholds tied to SLO/error budgets where possible.
  • Use composite alarms to reduce alert noise.
  • Route alerts to SNS or EventBridge for automation paths.
  • Include runbook links in alarm descriptions for faster response.
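
To make the composite-alarm default concrete, here is a minimal sketch in plain Python (it does not call CloudWatch). It assumes three hypothetical child alarms and mirrors the shape of a composite rule expression such as `ALARM("api-5xx-rate") AND (ALARM("db-cpu-high") OR ALARM("targets-unhealthy"))`.

```python
def should_page(states):
    """Return True only when the symptom alarm AND at least one cause alarm fire.

    `states` maps alarm name -> state string ("OK", "ALARM", "INSUFFICIENT_DATA").
    All alarm names here are hypothetical.
    """
    def in_alarm(name):
        return states.get(name) == "ALARM"

    return in_alarm("api-5xx-rate") and (
        in_alarm("db-cpu-high") or in_alarm("targets-unhealthy")
    )


# A cause metric alone does not page anyone.
print(should_page({"api-5xx-rate": "OK", "db-cpu-high": "ALARM"}))     # False
# Symptom plus cause produces one page instead of two separate alerts.
print(should_page({"api-5xx-rate": "ALARM", "db-cpu-high": "ALARM"}))  # True
```

The point of the design is that child alarms keep recording state for dashboards, while only the combined condition is routed to people.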

Common D1 pitfalls

  • Alarming on raw spikes without sustained evaluation windows.
  • No distinction between symptom metrics and cause metrics.
  • Missing CloudWatch agent config on EC2/ECS/EKS.
  • Automation triggers without guardrails/permissions checks.
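
The first pitfall above is easier to see with numbers. The sketch below models an "M out of N datapoints" check in plain Python; the parameter names echo CloudWatch's `DatapointsToAlarm` and `EvaluationPeriods` settings, but the function and sample values are illustrative assumptions, not CloudWatch's implementation.

```python
def breaches_sustained(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Return True if at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are above `threshold`."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


one_off_spike = [35, 40, 97, 42, 38]   # hypothetical CPU % samples
sustained_load = [85, 91, 60, 88, 93]

print(breaches_sustained(one_off_spike, 80, 3, 5))   # False: 1 of 5 breaching
print(breaches_sustained(sustained_load, 80, 3, 5))  # True: 4 of 5 breaching
```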

3) Reliability and business continuity (Domain 2)

HA and scaling picks

Requirement | Typical choice
Multi-instance failover + balancing | ELB + Auto Scaling
Regional DNS failover patterns | Route 53 health checks + routing policy
Managed DB high availability | Multi-AZ for RDS/Aurora
Burst read/load reduction | CloudFront or ElastiCache

Backup/restore language you must apply correctly

  • RPO: acceptable data loss window.
  • RTO: acceptable restoration time.

If the question emphasizes strict RPO/RTO, prioritize the restore method and backup frequency that explicitly satisfy those targets.
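
As a worked example of applying those two definitions, the sketch below checks a backup plan against stated targets. It makes two simplifying assumptions: worst-case data loss equals the interval between backups, and recovery time equals the restore duration alone (no detection or decision time). The function name and numbers are hypothetical.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Compare a backup plan against RPO/RTO targets (all timedelta values)."""
    return {
        "rpo_met": backup_interval <= rpo,   # data lost since the last backup
        "rto_met": restore_duration <= rto,  # time needed to restore service
    }


# Daily backups with a 2-hour restore, against RPO 1 hour / RTO 4 hours.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(hours=2),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

Here the restore is fast enough, but daily backups cannot satisfy a 1-hour RPO, so the answer has to change backup frequency (or use continuous replication), not the restore method.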

Reliability anti-patterns

  • Single-AZ for critical stateful production workloads.
  • Backups with no restore test evidence.
  • Scaling policies with no cooldown/health alignment.

4) Deployment, provisioning, and automation (Domain 3)

Core service map

Need | Typical AWS answer
Declarative infrastructure | CloudFormation (or CDK)
Fleet ops and runbooks | Systems Manager
Event-driven operational actions | EventBridge + Lambda/SSM
Multi-account/region deployment sharing | StackSets / AWS RAM

CloudFormation troubleshooting checklist

  1. Validate IAM permissions for stack actions.
  2. Check resource dependency/order failures.
  3. Confirm subnet CIDR sizing and service quotas (limits).
  4. Review event log for first failing resource (not only terminal error).
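
Step 4 can be sketched as a small filter over stack events. The example assumes event records shaped like the ones CloudFormation returns (`Timestamp`, `LogicalResourceId`, `ResourceStatus`, `ResourceStatusReason`); the sample events, resource names, and reasons are made up for illustration.

```python
from datetime import datetime


def first_failure(events):
    """Return the earliest event whose status ends in _FAILED, or None."""
    failed = [e for e in events if e["ResourceStatus"].endswith("_FAILED")]
    return min(failed, key=lambda e: e["Timestamp"]) if failed else None


# Hypothetical events, newest first (the order stack events are usually listed in).
events = [
    {"Timestamp": datetime(2024, 1, 1, 12, 0, 9), "LogicalResourceId": "AppStack",
     "ResourceStatus": "ROLLBACK_IN_PROGRESS", "ResourceStatusReason": "illustrative"},
    {"Timestamp": datetime(2024, 1, 1, 12, 0, 7), "LogicalResourceId": "AppInstance",
     "ResourceStatus": "CREATE_FAILED", "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": datetime(2024, 1, 1, 12, 0, 5), "LogicalResourceId": "AppSubnet",
     "ResourceStatus": "CREATE_FAILED", "ResourceStatusReason": "illustrative CIDR conflict"},
]

root = first_failure(events)
print(root["LogicalResourceId"])  # AppSubnet
```

In this made-up timeline, `AppInstance` only failed because creation was cancelled after `AppSubnet` failed, which is why the earliest failure is the one to read.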

Automation rule of thumb

Automate repetitive, deterministic operations first: patching, restart/remediation runbooks, compliance drift checks, and standard incident responses.
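
For event-driven remediation, the trigger is usually an EventBridge event pattern. The sketch below shows a pattern for stopped EC2 instances and a deliberately minimal matcher in plain Python. The matcher is an assumption-laden toy: it handles only exact-value lists and nested objects, whereas real EventBridge patterns also support operators such as prefix and numeric matching. The instance ID is a placeholder.

```python
def matches(pattern, event):
    """Toy matcher: every pattern key must exist in the event; a list means
    'the event value is one of these'; a dict means 'match recursively'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(matches(pattern, event))  # True
```

Keeping the pattern narrow (one source, one detail-type, one state) is what keeps the automation deterministic: the runbook only fires for the condition it was written for.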


5) Security and compliance operations (Domain 4)

High-yield controls

Control goal | Typical services
Identity and least privilege | IAM, IAM Access Analyzer
Auditability | CloudTrail, AWS Config
Secrets and key management | Secrets Manager, KMS
Findings aggregation | Security Hub, GuardDuty, Inspector
Encryption in transit | ACM/TLS

Common exam patterns

  • Access denied: check identity policy, resource policy, and KMS key policy.
  • Compliance drift: Config rule failure -> remediation workflow.
  • Multi-account controls: Organizations/SCP boundaries and delegated operations.
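
The access-denied pattern can be reasoned about with a simplified model. The sketch below assumes a same-account request and covers only two rules: an explicit Deny always wins, and without any Allow the result is an implicit deny. It does not model SCPs, permissions boundaries, cross-account rules, or the KMS requirement that the key policy itself enables IAM policies. The request data is hypothetical.

```python
def evaluate(decisions):
    """decisions: one entry per applicable policy, each "Allow", "Deny", or None."""
    if "Deny" in decisions:
        return "explicit deny"
    if "Allow" in decisions:
        return "allow"
    return "implicit deny"


def blocked_actions(request):
    """Return only the actions that would not be allowed, with the reason."""
    results = {action: evaluate(decisions) for action, decisions in request.items()}
    return {action: reason for action, reason in results.items() if reason != "allow"}


# Reading an SSE-KMS encrypted object needs both actions to be allowed.
request = {
    "s3:GetObject": ["Allow", None],  # [identity policy, bucket policy]
    "kms:Decrypt": [None, None],      # [identity policy, key policy]
}

print(blocked_actions(request))  # {'kms:Decrypt': 'implicit deny'}
```

This is the typical exam trap: the S3 permission looks correct, and the denial actually comes from the missing KMS permission.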

6) Networking and content delivery troubleshooting (Domain 5)

VPC troubleshooting order

  1. Route tables
  2. Security groups (stateful)
  3. NACLs (stateless)
  4. Gateway/path (IGW, NAT, TGW, endpoints)
  5. DNS resolution (Route 53 / Resolver)

Network/data path service picks

Need | Typical AWS answer
Private access to AWS services | VPC endpoints / PrivateLink
CDN and edge caching | CloudFront
Global traffic acceleration | Global Accelerator
Hybrid/private connectivity | Site-to-Site VPN / Transit Gateway

Frequent D5 anti-patterns

  • Allowing traffic in the SG but blocking ephemeral-port return traffic in the NACL.
  • Assuming NAT provides inbound access.
  • CloudFront cache issue treated as origin outage.
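
The first anti-pattern above comes from NACLs being stateless. The sketch below models NACL evaluation in plain Python under simplifying assumptions: rules match on destination port only (no protocol or CIDR), are checked in ascending rule-number order, and fall through to an implicit deny. The rule sets and client port are hypothetical.

```python
def nacl_allows(rules, port):
    """First matching rule (lowest number) decides; no match means deny."""
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if low <= port <= high:
            return rule["action"] == "allow"
    return False


inbound = [{"number": 100, "ports": (443, 443), "action": "allow"}]
outbound_broken = [{"number": 100, "ports": (443, 443), "action": "allow"}]
outbound_fixed = [{"number": 100, "ports": (1024, 65535), "action": "allow"}]

client_port = 50000  # ephemeral source port chosen by the client

print(nacl_allows(inbound, 443))                  # True: request gets in
print(nacl_allows(outbound_broken, client_port))  # False: response is dropped
print(nacl_allows(outbound_fixed, client_port))   # True: response gets out
```

A security group would not need the outbound rule, because it tracks the connection and allows the response automatically.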

7) Troubleshooting playbooks you can reuse

5xx spike behind load balancer

  • Check target health first.
  • Correlate LB access logs + target logs + alarm timeline.
  • Validate autoscaling events and recent config/deploy changes.

Alarm noise flood

  • Replace independent symptom alarms with composite alarm logic.
  • Tune thresholds/evaluation periods from observed baseline.
  • Route only actionable alerts to incident channels.

Intermittent connectivity failure

  • Validate the route and SG path in both directions.
  • Inspect NACL rules for stateless return traffic blocks.
  • Use Reachability Analyzer and VPC Flow Logs for confirmation.

8) Cost-aware operations quick wins

  • Delete idle, unattached EBS volumes and expire stale snapshots under a retention policy.
  • Use lifecycle policies for S3/EFS where access patterns allow.
  • Reduce NAT egress by using VPC endpoints where applicable.
  • Right-size compute using utilization and recommendation signals.
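
As a rough illustration of the first quick win, the sketch below filters a list of volume records for unattached volumes older than a cutoff. It assumes records shaped like EC2 volume descriptions (`VolumeId`, `State`, `CreateTime`), where a `State` of `available` means the volume is not attached. The volume IDs, dates, and the 30-day cutoff are hypothetical, and anything flagged should be reviewed (and snapshotted if needed) before deletion.

```python
from datetime import datetime, timedelta


def stale_unattached(volumes, now, min_age):
    """Return IDs of volumes that are unattached and older than `min_age`."""
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and now - v["CreateTime"] >= min_age
    ]


volumes = [
    {"VolumeId": "vol-aaa", "State": "in-use", "CreateTime": datetime(2023, 1, 1)},
    {"VolumeId": "vol-bbb", "State": "available", "CreateTime": datetime(2023, 6, 1)},
    {"VolumeId": "vol-ccc", "State": "available", "CreateTime": datetime(2024, 2, 25)},
]

candidates = stale_unattached(volumes, datetime(2024, 3, 1), timedelta(days=30))
print(candidates)  # ['vol-bbb']
```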

Next steps

  • Use Resources to stay aligned to the official exam guide and AWS operations docs.
  • Use the FAQ when you need a quick reset on exam scope and candidate expectations.
  • Keep this page open while you drill remediation, backup, and network-troubleshooting scenarios.
