Wednesday, September 3, 2014

Convert Avro file to Parquet format using Java

Parquet is a columnar data storage format; more details are available on the project's GitHub site.
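
To make "columnar" concrete, here is a minimal sketch in plain Java that pivots a few row-oriented records into one list of values per column. It only illustrates the layout idea; it is not how Parquet is implemented, and the class name `ColumnarSketch` and the sample data are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnarSketch {
    /** Pivots row-oriented records into one value list per column. */
    public static Map<String, List<Object>> toColumns(List<String> columnNames, List<Object[]> rows) {
        Map<String, List<Object>> columns = new LinkedHashMap<String, List<Object>>();
        for (String name : columnNames) {
            columns.put(name, new ArrayList<Object>());
        }
        for (Object[] row : rows) {
            // the i-th value of every row is appended to the i-th column
            for (int i = 0; i < columnNames.size(); i++) {
                columns.get(columnNames.get(i)).add(row[i]);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "name", "age");
        List<Object[]> rows = Arrays.<Object[]>asList(
                new Object[] {1, "alice", 30},
                new Object[] {2, "bob", 25});
        // a row layout keeps each record together; a column layout keeps each field together
        System.out.println(toColumns(names, rows));
        // prints {id=[1, 2], name=[alice, bob], age=[30, 25]}
    }
}
```

Keeping all values of one field together is what lets a columnar reader fetch only the columns a query needs.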

Avro is a binary data format that stores the schema needed to read the file alongside the data.

In this post we will see how to convert existing Avro files to Parquet files using a standalone Java program.
 args[0] is the input Avro file and args[1] is the output Parquet file.

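        // imports required by this snippet
        // (assumed here: the org.apache.parquet package names of Parquet 1.7.0 and later;
        // older releases use parquet.avro, parquet.hadoop.metadata and parquet.schema instead)
        import java.io.File;
        import java.io.IOException;
        import org.apache.avro.Schema;
        import org.apache.avro.file.DataFileReader;
        import org.apache.avro.file.FileReader;
        import org.apache.avro.generic.GenericDatumReader;
        import org.apache.avro.generic.GenericRecord;
        import org.apache.hadoop.fs.Path;
        import org.apache.parquet.avro.AvroParquetWriter;
        import org.apache.parquet.avro.AvroSchemaConverter;
        import org.apache.parquet.avro.AvroWriteSupport;
        import org.apache.parquet.hadoop.metadata.CompressionCodecName;
        import org.apache.parquet.schema.MessageType;

        // the statements below go inside main(String[] args) throws IOException

        // open the Avro file and read the schema embedded in it
        GenericDatumReader<Object> greader = new GenericDatumReader<Object>();
        FileReader<Object> fileReader = DataFileReader.openReader(new File(args[0]), greader);
        Schema avroSchema = fileReader.getSchema();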
         
        // generate the corresponding Parquet schema
        MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);

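        // create a WriteSupport object to serialize your Avro objects
        // (shown for reference only: it is not passed to the writer below,
        // because AvroParquetWriter builds its own write support from avroSchema)
        AvroWriteSupport writeSupport = new AvroWriteSupport(parquetSchema, avroSchema);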

        // choose compression scheme
        CompressionCodecName compressionCodecName = CompressionCodecName.UNCOMPRESSED;

        // set the Parquet block (row group) size and page size, in bytes
        int blockSize = 256 * 1024 * 1024; // 256 MiB
        int pageSize = 64 * 1024;          // 64 KiB

        // delete any previous output so the writer starts from a fresh file
        String outputFilename = args[1];
        File f = new File(outputFilename);
        if (f.exists()) {
            f.delete();
        }
        Path outputPath = new Path(outputFilename);

        // the ParquetWriter object that will consume Avro GenericRecords
        AvroParquetWriter<GenericRecord> parquetWriter = new AvroParquetWriter<GenericRecord>(outputPath,
                avroSchema, compressionCodecName, blockSize, pageSize);

       
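        // open the Avro file again, typed as GenericRecord, and copy every record to Parquet
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(new File(args[0]), new GenericDatumReader<GenericRecord>());
        try {
            while (reader.hasNext()) {
                GenericRecord record = reader.next();
                parquetWriter.write(record);
            }
        } finally {
            // release both readers; closing the writer also writes the Parquet footer
            reader.close();
            fileReader.close();
            parquetWriter.close();
        }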
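
A quick way to sanity-check the result is to look at the magic bytes of both files: an Avro container file starts with the bytes `O`, `b`, `j`, `1` (the last one being the byte value 1), and an unencrypted Parquet file starts and ends with the ASCII bytes `PAR1`. The sketch below uses only the JDK; the class name `FormatSniffer` and its helper methods are made up for this post, and passing the check only suggests the file type, it does not validate the contents.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

public class FormatSniffer {
    // an Avro object container file begins with the bytes 'O', 'b', 'j', 1
    public static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};
    // an unencrypted Parquet file begins and ends with the ASCII bytes "PAR1"
    public static final byte[] PARQUET_MAGIC = {'P', 'A', 'R', '1'};

    /** Returns true if the first four bytes of a file are the Avro magic. */
    public static boolean isAvroHeader(byte[] head) {
        return Arrays.equals(head, AVRO_MAGIC);
    }

    /** Returns true if both the first and the last four bytes are the Parquet magic. */
    public static boolean isParquetEnvelope(byte[] head, byte[] tail) {
        return Arrays.equals(head, PARQUET_MAGIC) && Arrays.equals(tail, PARQUET_MAGIC);
    }

    /** Reads four bytes at the given offset, or returns an empty array if out of range. */
    static byte[] read4(RandomAccessFile raf, long offset) throws IOException {
        if (offset < 0 || offset + 4 > raf.length()) {
            return new byte[0];
        }
        byte[] buf = new byte[4];
        raf.seek(offset);
        raf.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the Avro input, args[1] is the Parquet output, as in the converter
        try (RandomAccessFile in = new RandomAccessFile(new File(args[0]), "r");
             RandomAccessFile out = new RandomAccessFile(new File(args[1]), "r")) {
            System.out.println("input looks like Avro: " + isAvroHeader(read4(in, 0)));
            System.out.println("output looks like Parquet: "
                    + isParquetEnvelope(read4(out, 0), read4(out, out.length() - 4)));
        }
    }
}
```

Run it with the same two arguments as the converter once the conversion has finished.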