Parquet is a columnar data storage format; you can read more about it on its GitHub site.
Avro stores data in a compact binary form together with the schema needed to read the file back.
In this blog we will see how to convert an existing Avro file to a Parquet file using a standalone Java program. In the code below, args[0] is the input Avro file and args[1] is the output Parquet file.
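To compile and run the program you will need the Avro, parquet-avro (and its Parquet dependencies), and Hadoop client jars on the classpath, since the writer API takes a Hadoop Path; any reasonably recent versions of those artifacts should work.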
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.FileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class AvroToParquet {
    public static void main(String[] args) throws IOException {
        // open the input Avro file once just to read its embedded schema
        GenericDatumReader<GenericRecord> greader = new GenericDatumReader<GenericRecord>();
        FileReader<GenericRecord> fileReader = DataFileReader.openReader(new File(args[0]), greader);
        Schema avroSchema = fileReader.getSchema();
        fileReader.close();

        // choose a compression scheme; SNAPPY or GZIP can be used instead
        CompressionCodecName compressionCodecName = CompressionCodecName.UNCOMPRESSED;

        // set Parquet file block size and page size values
        int blockSize = 256 * 1024 * 1024;
        int pageSize = 64 * 1024;

        // delete the output file if it already exists
        String outputFilename = args[1];
        File f = new File(outputFilename);
        if (f.exists()) {
            f.delete();
        }
        Path outputPath = new Path(outputFilename);

        // the ParquetWriter that will consume Avro GenericRecords; it generates
        // the corresponding Parquet schema from the Avro schema internally
        AvroParquetWriter<GenericRecord> parquetWriter = new AvroParquetWriter<GenericRecord>(
                outputPath, avroSchema, compressionCodecName, blockSize, pageSize);

        // stream every record from the Avro file into the Parquet file
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File(args[0]), new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
            GenericRecord record = reader.next();
            parquetWriter.write(record);
        }
        reader.close();
        parquetWriter.close();
    }
}
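Once the program has run, a quick way to sanity-check the output is to read the Parquet file back with AvroParquetReader from the same parquet-avro library and print each record. Below is a minimal sketch; the ParquetDump class name is mine, and it uses the single-argument AvroParquetReader constructor that takes the file Path.

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;

public class ParquetDump {
    public static void main(String[] args) throws Exception {
        // open the Parquet file written above; records come back as Avro GenericRecords
        AvroParquetReader<GenericRecord> parquetReader =
                new AvroParquetReader<GenericRecord>(new Path(args[0]));
        // read() returns null once all records have been consumed
        GenericRecord record;
        while ((record = parquetReader.read()) != null) {
            System.out.println(record);
        }
        parquetReader.close();
    }
}

If the record count and field values match the original Avro file, the conversion worked.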