Cannot Read from Athena Table: HIVE_CANNOT_OPEN_SPLIT: Not an Avro data file.

Hi,

I'm trying to read Avro files in Athena. The workflow goes from DynamoDB to Kinesis and then to Firehose, converting the format to Avro in Firehose using a Lambda and configuring the compression within Firehose to gzip.

The files land successfully and are partitioned correctly. I've manually validated the schema by downloading the files, which are gzipped.

The table is created via Athena using the DDL below (the columns and the schema literal are omitted here):

CREATE EXTERNAL TABLE mytablename ( )
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ }')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://XXXXXXXXXX/archive/';

But I'm still getting HIVE_CANNOT_OPEN_SPLIT: Not an Avro data file.

Kind Regards

2 Answers
Accepted Answer

Here is a working solution: https://docs.aws.amazon.com/athena/latest/ug/avro-serde.html. The example files there are stored in LOCATION 's3://athena-examples-us-west-1/flight/avro/'. Sadly, those files are not accessible in the S3 bucket, so I'm unable to clarify why they work and mine do not.

If you are getting "HIVE_CANNOT_OPEN_SPLIT: Not an Avro data file", the files in your S3 bucket are not in the correct format. I'm using Apache Avro in C# (https://github.com/apache/avro) to generate my files, and it seems there is something wrong with them; even the most basic files are not recognized.
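
For anyone hitting the same error, one quick check is whether a downloaded object actually starts with the Avro Object Container magic bytes once the gzip wrapper is removed. A minimal Python sketch (the file name is a placeholder):

import gzip

# Placeholder: one object downloaded from the table's S3 LOCATION.
path = "part-00000.avro.gz"

with gzip.open(path, "rb") as f:
    magic = f.read(4)

# A valid Avro Object Container File begins with the magic bytes b"Obj\x01";
# anything else is what produces "Not an Avro data file".
print("Avro container file:", magic == b"Obj\x01")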

Chris
answered 19 days ago

Hi Chris,

Please go through the steps and documentation links below; I hope they help you resolve your issue.

1. Ensure Correct Compression Configuration

Make sure the Firehose transformation Lambda is correctly compressing the files with gzip. Double-check the Lambda function to confirm it correctly handles compression after converting the data to Avro format.
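
For reference, the overall shape of such a transformation handler might look like the sketch below. The recordId / result / base64 data structure is the standard Firehose transformation contract; to_avro_container is a hypothetical placeholder for whatever writes the Avro Object Container bytes with the same schema as the table's avro.schema.literal.

import base64
import gzip
import json

def to_avro_container(record: dict) -> bytes:
    # Hypothetical placeholder: serialize the record as an Avro Object
    # Container File (e.g. fastavro.writer into a BytesIO) using the same
    # schema as avro.schema.literal.
    raise NotImplementedError

def lambda_handler(event, context):
    output = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        avro_bytes = to_avro_container(payload)
        # Compress in exactly one place: either here or via the delivery
        # stream's GZIP setting, not both, otherwise the delivered objects
        # end up double-compressed and unreadable as Avro.
        data_out = gzip.compress(avro_bytes)
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data_out).decode("utf-8"),
        })
    return {"records": output}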

2. Check S3 File Structure and Permissions

Verify the structure and permissions of the files in S3. Ensure the files are indeed in Avro format and properly gzipped. Sometimes, incorrect file permissions can cause issues with Athena.
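
A quick way to see what is actually stored under the table's LOCATION is to list the prefix with boto3; the bucket and prefix below are placeholders taken from the DDL:

import boto3

s3 = boto3.client("s3")

# Placeholders copied from the table LOCATION in the DDL.
bucket = "XXXXXXXXXX"
prefix = "archive/"

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=50)
for obj in resp.get("Contents", []):
    # Look for unexpected keys: stray non-Avro objects or zero-byte
    # "folder" placeholders under the LOCATION are also read by Athena
    # and can trigger the same split error.
    print(obj["Key"], obj["Size"])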

3. Correct SerDe and Compression Settings in Athena DDL

Make sure the DDL statement in Athena correctly specifies the SerDe and accounts for the gzip compression. Here's an example with placeholder columns (add the rest of your fields to both the column list and the schema literal) and the correct S3 path:

CREATE EXTERNAL TABLE mytablename (
  -- Specify your columns here, e.g.,
  id STRING,
  name STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal' = '{
    "type": "record",
    "name": "YourRecordName",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "name", "type": "string"}
    ]
  }'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://XXXXXXXXXX/archive/'
TBLPROPERTIES ('compressionType'='gzip');

4. Validate the Avro Schema

Ensure the Avro schema provided in avro.schema.literal exactly matches the schema used when writing the Avro files in your Firehose Lambda transformation. Any mismatch can cause Athena to fail when trying to read the files.
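
One way to check this is to print the writer schema embedded in a downloaded file and compare it with the literal in the DDL. A sketch using the fastavro library (the file name is a placeholder; if the object is not gzip-wrapped, open it normally):

import gzip
import json

from fastavro import reader  # pip install fastavro

# Placeholder: one object downloaded from the table's LOCATION.
with gzip.open("part-00000.avro.gz", "rb") as f:
    writer_schema = reader(f).writer_schema

# This is the schema embedded in the file header -- the one Athena must
# reconcile with avro.schema.literal. Compare field names and types
# against the literal in the DDL.
print(json.dumps(writer_schema, indent=2))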

5. Validate File Format

You mentioned manually validating the schema by downloading the files. Ensure that the files are correctly gzipped and in Avro format. You can use tools like avro-tools to inspect the schema of an Avro file:

avro-tools getschema yourfile.avro.gz

This command will output the schema, allowing you to confirm it matches the schema defined in your Athena table.

6. Recreate the Table

If the table was previously created and has been giving issues, consider dropping and recreating it to ensure there are no residual misconfigurations:

DROP TABLE IF EXISTS mytablename;

CREATE EXTERNAL TABLE mytablename (
  -- Specify your columns here
  id STRING,
  name STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal' = '{
    "type": "record",
    "name": "YourRecordName",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "name", "type": "string"}
    ]
  }'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://XXXXXXXXXX/archive/'
TBLPROPERTIES ('compressionType'='gzip');
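
Also note that after recreating a partitioned table, the partitions must be registered before Athena returns any rows. If the S3 prefixes use the Hive year=/month=/day= style, MSCK REPAIR TABLE picks them up; otherwise add each partition explicitly (the values and path below are only examples):

MSCK REPAIR TABLE mytablename;

-- Or, if the prefixes are not in key=value form, register partitions manually:
ALTER TABLE mytablename ADD IF NOT EXISTS
PARTITION (year = '2024', month = '06', day = '01')
LOCATION 's3://XXXXXXXXXX/archive/2024/06/01/';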

https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html

https://docs.aws.amazon.com/athena/latest/ug/

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

https://docs.aws.amazon.com/lambda/latest/dg/troubleshooting-execution.html

answered 20 days ago
  • Thanks @Pandurangaswamy. I've added the gzip compression but get the same error. Can you confirm whether I need to change the "New line delimiter" setting from "Not enabled" to "Enabled" so that each "line" becomes a new object? As I understand it, Avro does not need a newline character for each object, but I'm not sure whether Hive looks for this when splitting the rows.