AWS Glue
Glue is the data preparation layer that discovers source data, organizes schema, and transforms it for downstream analytics stores. It combines cataloging and ETL so a data lake remains queryable instead of degrading into raw files.
An S3 bucket can be full of files, but if table schemas and partition layouts drift from one dataset to the next, query tools end up interpreting the same data in different ways. Without a preparation layer that discovers, normalizes, and transforms the raw data, the lake quickly degrades into a file dump.
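To make the drift problem concrete, here is a minimal sketch in plain Python that compares the fields and value types seen in two partitions of the same dataset. It only illustrates the idea and is not how Glue crawlers work internally; the partition names, field names, and helper functions are all hypothetical.

```python
from collections import defaultdict


def infer_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return dict(schema)


def find_drift(partitions):
    """Return the fields whose observed types differ, or that are missing, across partitions."""
    per_partition = {name: infer_schema(rows) for name, rows in partitions.items()}
    all_fields = set().union(*(s.keys() for s in per_partition.values()))
    drift = {}
    for field in sorted(all_fields):
        seen = {
            name: sorted(s.get(field, {"<missing>"}))
            for name, s in per_partition.items()
        }
        # More than one distinct type signature means the partitions disagree.
        if len({tuple(types) for types in seen.values()}) > 1:
            drift[field] = seen
    return drift


# Hypothetical sample rows from two daily partitions of one dataset.
partitions = {
    "dt=2024-01-01": [{"order_id": 1, "amount": 9.5}],
    "dt=2024-01-02": [{"order_id": "2", "amount": 12.0, "currency": "USD"}],
}
drift = find_drift(partitions)
for field, seen in drift.items():
    print(field, seen)
```

In this sample, `order_id` changes from an integer to a string and `currency` appears only in the second partition, which is the kind of disagreement a shared catalog is meant to surface and resolve.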
Early data pipelines had batch scripts and table definitions scattered across multiple locations, making maintenance difficult. This is why services like Glue, which unify schema catalogs and ETL, became important.
Glue discovers schemas with crawlers and records them in the Glue Data Catalog, transforms data with ETL jobs, and delivers the output to S3 or Redshift. Jobs can run on Spark (in Scala or PySpark), Python shell, or Ray engines. Once tables are cataloged, file-based data becomes much easier to query through Athena.
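The crawler-then-job flow described above could be set up roughly as follows. This is a sketch under stated assumptions: it builds the keyword arguments for the boto3 Glue client calls `create_crawler`, `start_crawler`, and `create_job`, and every name in it (bucket, role ARN, database, script path, worker sizing) is a placeholder rather than a recommended value.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the boto3 Glue client's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


def spark_job_request(name, role_arn, script_location):
    """Build keyword arguments for create_job, assuming a PySpark ETL script."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",  # assumed version, adjust as needed
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


def provision(glue, crawler_kwargs, job_kwargs):
    """Create and start the crawler, then register the ETL job.

    `glue` is expected to be a boto3 Glue client.
    """
    glue.create_crawler(**crawler_kwargs)
    glue.start_crawler(Name=crawler_kwargs["Name"])
    glue.create_job(**job_kwargs)


# All identifiers below are hypothetical placeholders.
crawler_kwargs = crawler_request(
    name="orders-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueDemoRole",
    database="sales_lake",
    s3_path="s3://example-bucket/raw/orders/",
)
job_kwargs = spark_job_request(
    name="clean-orders",
    role_arn="arn:aws:iam::123456789012:role/GlueDemoRole",
    script_location="s3://example-bucket/scripts/clean_orders.py",
)
```

With AWS credentials and a suitable IAM role in place, calling `provision(boto3.client("glue"), crawler_kwargs, job_kwargs)` would issue the three requests. Keeping the request builders separate from the client call makes the configuration easy to inspect before anything is created.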
Glue and Athena are both close to the data lake, but their roles differ. Glue handles metadata cleanup and ETL preparation, while Athena is strong at querying the results with SQL. If the problem is cataloging schemas and transforming data, look at Glue; if the problem is querying data that already exists, look at Athena.
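The Athena side of that split might look like the sketch below, which builds the arguments for the boto3 Athena client's `start_query_execution` call against a table that Glue has already cataloged. The database, table, `dt` partition column, and output location are assumptions made for illustration.

```python
def athena_query_request(database, table, partition_date, output_location, limit=10):
    """Build keyword arguments for the boto3 Athena client's start_query_execution call.

    Assumes the table is partitioned by a string column named `dt`.
    The values are interpolated directly, so pass only trusted input.
    """
    query = (
        f'SELECT * FROM "{database}"."{table}" '
        f"WHERE dt = '{partition_date}' LIMIT {int(limit)}"
    )
    return {
        "QueryString": query,
        # AwsDataCatalog is Athena's default catalog, backed by the Glue Data Catalog.
        "QueryExecutionContext": {"Database": database, "Catalog": "AwsDataCatalog"},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical names for a table registered by a Glue crawler.
request = athena_query_request(
    database="sales_lake",
    table="orders",
    partition_date="2024-01-02",
    output_location="s3://example-bucket/athena-results/",
)
print(request["QueryString"])
```

Nothing here transforms or catalogs data. The query relies entirely on the table definition already present in the catalog, which is the division of labor described above.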
Glue is well suited to data lake cleansing, schema discovery, batch transformations, and preparing data for warehouse loads. It is not a good fit when the goal is simply to explore data with SQL without transforming it.