Parquet is a columnar data storage format; you can read more about it on its GitHub site.
Avro is a compact binary data format that stores the schema alongside the data, so the schema needed to read a file travels with it.
In this blog we will see how to convert existing Avro files to Parquet files using a standalone Java program.
args[0] is the input Avro file and args[1] is the output Parquet file.
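The program assumes the parquet-avro, avro, and Hadoop client libraries are on the classpath. A minimal Maven dependency sketch is below; the version numbers are assumptions, so check Maven Central for current releases:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.13.1</version> <!-- assumed version; check Maven Central -->
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.11.3</version> <!-- assumed version -->
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.6</version> <!-- assumed version; provides org.apache.hadoop.fs.Path -->
  </dependency>
</dependencies>
```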
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class AvroToParquet {
    public static void main(String[] args) throws Exception {
        // open the Avro container file; it carries its own schema
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File(args[0]), new GenericDatumReader<GenericRecord>());
        Schema avroSchema = reader.getSchema();

        // generate the corresponding Parquet schema
        MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);

        // the WriteSupport object that serializes Avro objects; shown for
        // illustration -- AvroParquetWriter creates its own internally
        AvroWriteSupport writeSupport = new AvroWriteSupport(parquetSchema, avroSchema);

        // choose a compression scheme
        CompressionCodecName compressionCodecName = CompressionCodecName.UNCOMPRESSED;

        // set Parquet file block size and page size values
        int blockSize = 256 * 1024 * 1024;
        int pageSize = 64 * 1024;

        // delete the output file if it already exists
        File outputFile = new File(args[1]);
        if (outputFile.exists()) {
            outputFile.delete();
        }
        Path outputPath = new Path(args[1]);

        // the ParquetWriter object that will consume Avro GenericRecords
        AvroParquetWriter<GenericRecord> parquetWriter = new AvroParquetWriter<GenericRecord>(
                outputPath, avroSchema, compressionCodecName, blockSize, pageSize);

        // copy every record from the Avro file into the Parquet file
        while (reader.hasNext()) {
            GenericRecord record = reader.next();
            parquetWriter.write(record);
        }
        reader.close();
        parquetWriter.close();
    }
}
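To check that the conversion worked, you can read the Parquet file back as Avro GenericRecords. A minimal sketch, assuming the same parquet-avro library as above (the single-argument AvroParquetReader constructor is deprecated in newer releases in favor of AvroParquetReader.builder(path).build()):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;

public class ParquetDump {
    public static void main(String[] args) throws Exception {
        // args[0] is the Parquet file to dump
        AvroParquetReader<GenericRecord> reader =
                new AvroParquetReader<GenericRecord>(new Path(args[0]));
        GenericRecord record;
        // read() returns null once all records have been consumed
        while ((record = reader.read()) != null) {
            System.out.println(record);
        }
        reader.close();
    }
}
```

Running it against the output of the converter should print one line per record from the original Avro file.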