Conceptly

AWS Glue

Analytics · Serverless Data Integration (ETL) Service

Glue is the data preparation layer that discovers source data, catalogs its schema, and transforms it for downstream analytics stores. It combines cataloging and ETL so a data lake remains queryable instead of degrading into raw files.

[Figure: architecture diagram showing the data flow]

Why do you need it?

An S3 bucket can be full of files, but if table schemas and partition layouts drift from one dataset to the next, different query tools end up interpreting the same data differently. Without a preparation layer that discovers, normalizes, and transforms the raw data, the lake quickly degrades into a file dump.
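To make the drift problem concrete, here is a toy sketch in plain Python. It does not use any Glue API: the two dictionaries stand in for records from two files under the same S3 prefix, and the column names, values, and helper functions are all hypothetical.

```python
# Toy illustration only: plain dicts stand in for records read from two files
# in the same S3 prefix. Column names and values are hypothetical.

def infer_schema(record):
    """Map each column name to the Python type name of its value."""
    return {column: type(value).__name__ for column, value in record.items()}


def find_drift(schemas):
    """Return the columns whose presence or type differs across the schemas."""
    all_columns = set().union(*schemas)
    drifted = {}
    for column in sorted(all_columns):
        # None marks a schema where the column is missing entirely.
        seen = {schema.get(column) for schema in schemas}
        if len(seen) > 1:
            drifted[column] = seen
    return drifted


january = {"order_id": 1001, "amount": 19.99, "region": "eu"}
february = {"order_id": "1002", "amount": 5.0, "country": "de"}

drift = find_drift([infer_schema(january), infer_schema(february)])
for column, types in drift.items():
    print(column, sorted(str(t) for t in types))
```

Here `order_id` changes type and `region` is replaced by `country`, so a reader that assumes the January layout would misread the February file. A shared catalog gives every query tool one agreed definition to resolve such differences against.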

Why did this approach emerge?

Early data pipelines had batch scripts and table definitions scattered across multiple locations, making maintenance difficult. This is why services like Glue, which unify schema catalogs and ETL, became important.

How does it work inside?

Glue crawlers scan source data, infer schemas, and record them as tables in the Glue Data Catalog. ETL jobs then transform the data and deliver the output to targets such as S3 or Redshift. Jobs can run on Apache Spark (written in Scala or PySpark), as Python shell scripts, or on Ray. Because the catalog holds the table metadata, file-based data becomes much easier to query through Athena afterward.
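The crawl, catalog, and transform sequence described above might be driven from Python roughly as follows. This is a minimal sketch, not a production pipeline: it assumes a boto3 Glue client is passed in, that the ETL job already exists, and that the job script reads a hypothetical `--output_path` argument. Every resource name in the usage comment is made up.

```python
import time


def crawl_and_transform(glue, *, crawler_name, role_arn, database,
                        source_path, job_name, output_path, poll_seconds=15):
    """Catalog the files under source_path, then start an existing ETL job.

    `glue` is expected to behave like boto3.client("glue").
    """
    # 1. Define a crawler that writes the tables it discovers into `database`.
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": source_path}]},
    )

    # 2. Run it and wait until it returns to the READY state.
    glue.start_crawler(Name=crawler_name)
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # 3. Start the transformation job. "--output_path" is a hypothetical
    #    argument name that the job script itself would have to read.
    run = glue.start_job_run(
        JobName=job_name,
        Arguments={"--output_path": output_path},
    )
    return run["JobRunId"]


# Usage (requires AWS credentials; all names below are placeholders):
#   import boto3
#   run_id = crawl_and_transform(
#       boto3.client("glue"),
#       crawler_name="orders-crawler",
#       role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
#       database="sales_lake",
#       source_path="s3://example-raw-bucket/orders/",
#       job_name="orders-to-parquet",
#       output_path="s3://example-curated-bucket/orders/",
#   )
```

Passing the client in as a parameter keeps the function easy to exercise with a stub. A real pipeline would also need error handling and a timeout on the polling loop.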

What is it often confused with?

Glue and Athena are both close to the data lake, but their roles differ. Glue handles metadata cleanup and ETL preparation, while Athena is strong at querying the results with SQL. If the problem is cataloging schemas and transforming data, look at Glue; if the problem is querying data that already exists, look at Athena.
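For contrast, the Athena side of that split is just submitting SQL against a table that the catalog already describes. The sketch below assumes a boto3 Athena client is passed in. The database, table, and results location in the usage comment are hypothetical.

```python
def start_athena_query(athena, *, database, sql, results_path):
    """Submit a SQL query to Athena and return its execution id.

    `athena` is expected to behave like boto3.client("athena"). Athena only
    reads here: the table must already be defined in the catalog.
    """
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_path},
    )
    return response["QueryExecutionId"]


# Usage (requires AWS credentials; all names below are placeholders):
#   import boto3
#   query_id = start_athena_query(
#       boto3.client("athena"),
#       database="sales_lake",
#       sql="SELECT region, SUM(amount) FROM orders GROUP BY region",
#       results_path="s3://example-athena-results/",
#   )
```

Nothing in this call discovers schemas or rewrites files, which is the practical difference from the Glue crawler and job calls.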

When should you use it?

Glue is well suited to data lake cleansing, schema discovery, batch transformations, and preparing data for warehouse loading. It is not a good fit when the goal is simply to explore data with SQL without transforming it.

Data lake construction · ETL pipelines · Data catalog · Schema evolution