Cannot Read from Athena Table: HIVE_CANNOT_OPEN_SPLIT: Not an Avro data file.

Hi,

I'm trying to read Avro files in Athena. The workflow goes from DynamoDB to Kinesis and then to Firehose, converting the format to Avro in Firehose using a Lambda and configuring the compression within Firehose to gzip.

The files land successfully and are partitioned correctly. I've manually validated the schema by downloading the files, which are gzipped.

The table is created via Athena using the DDL below (the columns and the schema literal are omitted here):

CREATE EXTERNAL TABLE mytablename ( )
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ }')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://XXXXXXXXXX/archive/';

But I'm still getting HIVE_CANNOT_OPEN_SPLIT: Not an Avro data file.

Kind Regards

2 Answers
Accepted Answer

Here is a working solution: https://docs.aws.amazon.com/athena/latest/ug/avro-serde.html. The example files there are stored in LOCATION 's3://athena-examples-us-west-1/flight/avro/'. Sadly, those files are not accessible in the S3 bucket, so I'm unable to clarify why they work and mine do not.

If you are getting "HIVE_CANNOT_OPEN_SPLIT: Not an Avro data file", the files in your S3 bucket are not in the correct format. I'm using Apache Avro in C# (https://github.com/apache/avro) to generate my files, and it seems there is something wrong with them; even the most basic files are not recognized.
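
For anyone hitting the same error, one quick check is whether a downloaded object actually starts with the Avro Object Container magic bytes once the gzip wrapper is removed. A minimal Python sketch (the file name is a placeholder):

import gzip

# Placeholder: one object downloaded from the table's S3 LOCATION.
path = "part-00000.avro.gz"

with gzip.open(path, "rb") as f:
    magic = f.read(4)

# A valid Avro Object Container File begins with the magic bytes b"Obj\x01";
# anything else is what produces "Not an Avro data file".
print("Avro container file:", magic == b"Obj\x01")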

Chris
answered 19 days ago

Hi Chris,

Please go through the steps and documentation links below; I hope they help you resolve your issue.

1. Ensure Correct Compression Configuration

Make sure the Firehose transformation Lambda is correctly compressing the files with gzip. Double-check the Lambda function to confirm it correctly handles compression after converting the data to Avro format.
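
For reference, the overall shape of such a transformation handler might look like the sketch below. The recordId / result / base64 data structure is the standard Firehose transformation contract; to_avro_container is a hypothetical placeholder for whatever writes the Avro Object Container bytes with the same schema as the table's avro.schema.literal.

import base64
import gzip
import json

def to_avro_container(record: dict) -> bytes:
    # Hypothetical placeholder: serialize the record as an Avro Object
    # Container File (e.g. fastavro.writer into a BytesIO) using the same
    # schema as avro.schema.literal.
    raise NotImplementedError

def lambda_handler(event, context):
    output = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        avro_bytes = to_avro_container(payload)
        # Compress in exactly one place: either here or via the delivery
        # stream's GZIP setting, not both, otherwise the delivered objects
        # end up double-compressed and unreadable as Avro.
        data_out = gzip.compress(avro_bytes)
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data_out).decode("utf-8"),
        })
    return {"records": output}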

2. Check S3 File Structure and Permissions

Verify the structure and permissions of the files in S3. Ensure the files are indeed in Avro format and properly gzipped. Sometimes, incorrect file permissions can cause issues with Athena.
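
A quick way to see what is actually stored under the table's LOCATION is to list the prefix with boto3; the bucket and prefix below are placeholders taken from the DDL:

import boto3

s3 = boto3.client("s3")

# Placeholders copied from the table LOCATION in the DDL.
bucket = "XXXXXXXXXX"
prefix = "archive/"

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=50)
for obj in resp.get("Contents", []):
    # Look for unexpected keys: stray non-Avro objects or zero-byte
    # "folder" placeholders under the LOCATION are also read by Athena
    # and can trigger the same split error.
    print(obj["Key"], obj["Size"])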

3. Correct SerDe and Compression Settings in Athena DDL

Make sure the DDL statement in Athena correctly specifies the SerDe and accounts for the gzip compression. Here's an example with placeholder columns (add the rest of your fields to both the column list and the schema literal) and the correct S3 path:

CREATE EXTERNAL TABLE mytablename (
  -- Specify your columns here, e.g.,
  id STRING,
  name STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal' = '{
    "type": "record",
    "name": "YourRecordName",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "name", "type": "string"}
    ]
  }'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://XXXXXXXXXX/archive/'
TBLPROPERTIES ('compressionType'='gzip');

4. Validate the Avro Schema

Ensure the Avro schema provided in avro.schema.literal exactly matches the schema used when writing the Avro files in your Firehose Lambda transformation. Any mismatch can cause Athena to fail when trying to read the files.
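
One way to check this is to print the writer schema embedded in a downloaded file and compare it with the literal in the DDL. A sketch using the fastavro library (the file name is a placeholder; if the object is not gzip-wrapped, open it normally):

import gzip
import json

from fastavro import reader  # pip install fastavro

# Placeholder: one object downloaded from the table's LOCATION.
with gzip.open("part-00000.avro.gz", "rb") as f:
    writer_schema = reader(f).writer_schema

# This is the schema embedded in the file header -- the one Athena must
# reconcile with avro.schema.literal. Compare field names and types
# against the literal in the DDL.
print(json.dumps(writer_schema, indent=2))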

5. Validate File Format

You mentioned manually validating the schema by downloading the files. Ensure that the files are correctly gzipped and in Avro format. You can use tools like avro-tools to inspect the schema of an Avro file:

avro-tools getschema yourfile.avro.gz

This command will output the schema, allowing you to confirm it matches the schema defined in your Athena table.

6. Recreate the Table

If the table was previously created and has been giving issues, consider dropping and recreating it to ensure there are no residual misconfigurations:

DROP TABLE IF EXISTS mytablename;

CREATE EXTERNAL TABLE mytablename (
  -- Specify your columns here
  id STRING,
  name STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal' = '{
    "type": "record",
    "name": "YourRecordName",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "name", "type": "string"}
    ]
  }'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://XXXXXXXXXX/archive/'
TBLPROPERTIES ('compressionType'='gzip');
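
Also note that after recreating a partitioned table, the partitions must be registered before Athena returns any rows. If the S3 prefixes use the Hive year=/month=/day= style, MSCK REPAIR TABLE picks them up; otherwise add each partition explicitly (the values and path below are only examples):

MSCK REPAIR TABLE mytablename;

-- Or, if the prefixes are not in key=value form, register partitions manually:
ALTER TABLE mytablename ADD IF NOT EXISTS
PARTITION (year = '2024', month = '06', day = '01')
LOCATION 's3://XXXXXXXXXX/archive/2024/06/01/';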

https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html

https://docs.aws.amazon.com/athena/latest/ug/

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

https://docs.aws.amazon.com/lambda/latest/dg/troubleshooting-execution.html

answered 20 days ago
  • Thanks @Pandurangaswamy. I've added the gzip compression but get the same error. Can you confirm whether I need to change the "New line delimiter" setting from "Not enabled" to "Enabled" so that each "line" becomes a new object? As I understand it, Avro does not need a newline character for each object, but I'm not sure whether Hive looks for this when splitting the rows.