Determine High-Performing Data Ingestion and Transformation Solutions for SAA-C03

March 28, 2026 5 min read

Learn the Kinesis, Glue, Athena, EMR, DataSync, Lake Formation, and transfer-pattern choices AWS tests for SAA-C03 ingestion and transformation scenarios.


This newer SAA-C03 task group is about building data paths that scale cleanly from ingestion through transformation and analytics. The exam is not testing you as a dedicated data engineer. It is testing whether you can choose the right AWS-managed path for transfer, streaming, transformation, and analysis requirements.

What AWS is explicitly testing

The exam guide points to analytics and visualization services, ingestion patterns, transfer services such as DataSync and Storage Gateway, transformation services such as Glue, secure access to ingestion points, streaming services such as Kinesis, and format transformation choices.

The task behind the service list

This is really a pipeline-shape question:

how is the data arriving: batch, file-oriented, or streaming?
where is the data landing first: raw bucket, stream, or appliance-backed path?
what service is shaping it into an analytics-friendly form?
how are you controlling access to the ingestion point and the resulting lake?
what layer lets people query or visualize the result without overbuilding the platform?

Ingestion chooser

Requirement	Strongest first fit	Why
Real-time streaming ingestion	Kinesis	Purpose-built for streaming pipelines
Managed ETL and cataloging	Glue	Strong fit for transformation workflows and data catalog integration
Query data in place in S3	Athena	Fast analytical query pattern without managing clusters
Large-scale data processing cluster	EMR	Better fit for heavier distributed processing needs
Online or batch transfer into AWS storage	DataSync or Storage Gateway	Stronger than custom copy scripts for transfer patterns

Batch transfer, streaming, transformation, and visualization are different layers

Stage	Typical service fit	What the exam is really asking
Transfer into AWS	DataSync or Storage Gateway	How the data gets there reliably
Real-time streaming	Kinesis	How events flow continuously
Transformation and cataloging	Glue	How raw data becomes usable
Data lake governance	Lake Formation	How access is controlled and shared safely
Query and visualization	Athena and QuickSight	How people consume results without provisioning analytical clusters

If the problem is secure lake access, Athena alone is not the answer. If the problem is format conversion, DataSync alone is not the answer. SAA-C03 rewards the candidate who notices which stage is actually broken.
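
One way to practice spotting the broken stage is to treat the table above as a lookup. The sketch below is only a study aid: the stage labels and the `first_fit` helper are hypothetical names, and the mapping simply restates the table rather than any official AWS decision rule.

```python
# Study-aid lookup that restates the stage table above.
# The stage keys are hypothetical labels, not AWS terminology.
STAGE_TO_SERVICES = {
    "transfer": ["DataSync", "Storage Gateway"],
    "streaming": ["Kinesis"],
    "transformation": ["Glue"],
    "governance": ["Lake Formation"],
    "consumption": ["Athena", "QuickSight"],
}


def first_fit(stage):
    """Return the typical first-fit services for a pipeline stage."""
    if stage not in STAGE_TO_SERVICES:
        raise ValueError(f"unknown stage: {stage!r}")
    return STAGE_TO_SERVICES[stage]


# A "secure lake access" scenario is a governance problem, so Athena
# does not appear in the answer.
print(first_fit("governance"))
```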

Secure access to ingestion points and data lakes

A high-performing data pipeline still has to be a secure design.

Requirement	Strongest first fit	Why
Private ingestion into S3 from VPC-based workloads	VPC endpoint plus bucket policy and least-privilege IAM	Reduces public exposure and tightens the path
Central data lake access control across accounts or teams	Lake Formation	Stronger governance answer than scattered bucket permissions alone
Encryption and controlled key usage for pipeline data	S3 encryption plus KMS key policy design	Keeps pipeline access and data protection aligned
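
As a rough illustration of the first row, the sketch below builds a bucket policy that denies S3 requests that do not arrive through one specific VPC endpoint, using the `aws:SourceVpce` condition key. The helper name and the endpoint ID are placeholders. A real policy would usually need exceptions (for example, for administrative roles), so treat this as a starting shape rather than a finished control.

```python
import json


def vpce_only_bucket_policy(bucket, vpce_id):
    """Build a bucket policy that denies requests made outside one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and the objects inside it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Requests that do not come through this endpoint are denied.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# The endpoint ID below is a made-up placeholder.
policy = vpce_only_bucket_policy("raw-orders-bucket", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

The deny statement narrows the network path. Least-privilege IAM policies on the writing roles still decide who may put objects through that path.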

End-to-end lake and analytics path

    flowchart LR
      I["Transfer or streaming ingress"] --> R["Raw S3 landing zone"]
      R --> G["Glue transform and catalog"]
      G --> C["Curated S3 in Parquet or optimized layout"]
      C --> L["Lake Formation governance"]
      L --> A["Athena and QuickSight consumption"]

The exam often asks which stage is the real decision point. If the problem is transfer, Glue is usually not the answer. If the problem is transformation or query speed, the right answer is often a format-and-catalog decision rather than a bigger cluster.

Example: define a Kinesis ingestion path deliberately

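Resources:
  AppEventsStream:
    Type: AWS::Kinesis::Stream
    Properties:
      Name: app-events
      # Provisioned capacity: write and read throughput scale with shard count
      ShardCount: 2
      # 24 hours is the default; raise it when consumers need a longer replay window
      RetentionPeriodHours: 24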

What to notice:

the stream exists to absorb and distribute event flow, not to replace downstream analytics services
shard count sets throughput capacity, and the retention period sets how far back consumers can replay records
SAA-C03 expects you to separate ingestion capacity from transformation and query choices
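
To make the throughput side concrete, here is a back-of-the-envelope shard estimate. It assumes provisioned mode, where each shard accepts up to 1 MB/s or 1,000 records/s of writes and serves up to 2 MB/s of reads shared across consumers. The function name and the traffic figures are illustrative, and the result is a starting point rather than a sizing guarantee.

```python
import math

# Per-shard limits for provisioned-mode streams.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0  # shared by consumers that do not use enhanced fan-out


def estimate_shards(records_per_sec, avg_record_kb, consumers=1):
    """Estimate the shard count needed for the given write and read load."""
    ingress_mb = records_per_sec * avg_record_kb / 1024
    egress_mb = ingress_mb * consumers  # each consumer reads every record
    return max(
        1,
        math.ceil(ingress_mb / WRITE_MB_PER_SHARD),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(egress_mb / READ_MB_PER_SHARD),
    )


# Hypothetical load: 1,500 one-kilobyte events per second, one consumer.
print(estimate_shards(1500, 1))
```

With that hypothetical load the estimate comes out at two shards, which matches the `ShardCount: 2` in the template. Adding consumers can make the read limit the binding constraint even when writes fit comfortably.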

Example: transforming CSV into a query-friendly format

This is the kind of format-conversion move AWS wants you to recognize when the data arrives in one shape but must be queried efficiently in another.

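from pyspark.sql import SparkSession

# Reuse the active session if one exists; the app name is illustrative
spark = SparkSession.builder.appName("orders-csv-to-parquet").getOrCreate()

# Read the raw CSV with its header row; without a schema, columns load as strings
raw = spark.read.option("header", "true").csv("s3://raw-orders-bucket/orders/")

# Rewrite the data as Parquet in 8 partitions, replacing any earlier output
(raw
  .repartition(8)
  .write
  .mode("overwrite")
  .format("parquet")
  .save("s3://curated-orders-bucket/orders/"))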

What to notice:

the transformation is not only about cleaning data; it also changes the storage format to improve downstream analytics
repartitioning affects parallelism and file layout, which can matter for performance at scale
SAA-C03 may describe this indirectly as selecting the right configuration for ingestion or transforming data between formats such as CSV and Parquet
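
Converting the files is only half of the move: Athena also needs a table definition that points at the curated location. The sketch below builds an Athena-style `CREATE EXTERNAL TABLE` statement as a string. The helper name and the column names are hypothetical, and the columns are typed as strings because the CSV was read without a schema. In practice a Glue crawler could register the table instead.

```python
def parquet_table_ddl(table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet data in S3."""
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Column names are assumptions made up for this example.
ddl = parquet_table_ddl(
    "orders",
    {"order_id": "string", "customer_id": "string", "order_total": "string"},
    "s3://curated-orders-bucket/orders/",
)
print(ddl)
```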

Visualization and consumption choices

Do not stop at ingestion. AWS explicitly includes analytics and visualization use cases here.

Requirement	Strongest first fit	Why
SQL-style analysis directly on data in S3	Athena	Query-in-place answer with low ops
Governed data-lake sharing and permissions	Lake Formation	Controls lake access patterns more cleanly
Business dashboards on top of analytical data	QuickSight	Visualization layer rather than pipeline layer
Heavy distributed processing with cluster control	EMR	Better when managed query-in-place is not enough

Failure patterns worth recognizing

Symptom	Strongest first check	Why
Data arrives slowly from on-premises systems	Transfer method and network path	This is usually a transfer problem before it is a Glue or Athena problem
Data is present in S3 but analysts cannot query it effectively	Catalog and format layer	Query services work best when the data is shaped and described correctly
The team is managing clusters for simple transformation work	EMR versus managed ETL fit	The exam often prefers managed transformation when cluster control is unnecessary
Streaming consumers fall behind	Stream throughput and consumer design	This is an ingestion-capacity and consumer-scaling question, not a pure storage question
Analysts can query the lake but permissions are messy across teams	Governance layer fit	Lake Formation or tighter lake-access design may be the real missing layer

Common traps

picking EMR when the requirement is mostly managed ETL, not cluster management
using Athena as if it were an ingestion service
ignoring secure access design for buckets, transfer targets, or streaming entry points
treating Lake Formation as if it were the transformation engine instead of the governance layer
forgetting that file format choices can be the real performance answer
solving a batch problem with streaming tools or a streaming problem with slow file-oriented assumptions


Move next into 4. Cost-Optimized Architectures to study how the same storage, compute, database, and network choices change when cost becomes the deciding constraint.
