Parquet Selectors Examples in Data Pipelines

the data pipeline functionality enables you to load data in the {{parquet}} data format use these examples to see how you can access the structure in different ways for an overview of loading parquet data, see docid ymealxr8i3ef2yhn9r8a work with parquet selectors the select statement for a parquet file uses the same syntax as the existing json selectors this selection applies to lists, structures, and maps the {{ocient}} system retrieves its list of requested columns from the select sql statement the columns correspond to the field names of the schema for the parquet files for the load the load retrieves required columns from the parquet source files and leaves the other columns alone this example snippet shows a list of \["id", "name", "array\[]", "a\[] project", "my map key name"] as the fields to select from the data the statement only retrieves the columns for \["id", "name", "array", "a", "my map"] in the row groups select $id as id, $name as name, $array\[] as array, $a\[] project as projection, $my map key name as key col you must select a leaf element, an array, or a tuple with your selector this is stricter than using json selectors, which can directly select array fields and json object fields example {"a" \[1,2,3], "b" {"c" 1}} you can extract with any of the selectors in json $a, $a\[], $b, $b c however, parquet only allows for the selectors $a\[], $b c this example assumes this schema // list\<string> (list non null, elements nullable) required group my list (list) { repeated group list { optional binary element (utf8); } } the selector must be $my list\[] , which includes the array syntax work with complex parquet selectors these examples show how to load a parquet list and convert it to an array and how to load a map and convert it to a tuple load list as an array example create the tips table to store the tip inquiries a city receives the columns are id — identifier as an integer categories — categories as a list of strings mayor — mayor as a string mentions — mentions as a list of strings service name — service name as a non nullable string tags — tags as a list of strings url — url as a string username — username as a string text — text description as a non nullable string venue name — venue name as a non nullable string create table tips( id int not null, categories varchar(255)\[], mayor varchar(255), mentions varchar(255)\[], service name varchar(255) not null, tags varchar(255)\[], url varchar(255), username varchar(255), text varchar(255) not null, venue name varchar(255) not null); create the tips pipeline data pipeline for loading the tip data specify the {{aws}} s3 source with bucket ocient examples , object key metabase samples/parquet/tips parquet , and endpoint https //s3 us east 1 amazonaws com the source format is parquet create pipeline tips pipeline source s3 bucket 'ocient examples' object key 'metabase samples/parquet/tips parquet' endpoint 'https //s3 us east 1 amazonaws com' extract format parquet into tips select $id as id, $source categories\[] as categories, $source mayor as mayor, $source mentions\[] as mentions, $source "service" as service name, $source tags\[] as tags, $source url as url, $source username as username, $text as text, $venue name as venue name; start the data pipeline the data pipeline loads the categories , mentions , and tags columns as arrays start pipeline tips pipeline; load map as a tuple example this example uses the file /tmp/to load/locations parquet that contains a map\<string, double> field named coordinates each value of the map field contains x and y entries for the coordinates create the locations table to store the map values the cartesian coords column is a tuple with two elements for the x and y coordinates create table locations( cartesian coords tuple<\<double, double>> not null); create the locations pipeline data pipeline for loading the maps data specify the filter for the parquet file /tmp/to load/locations parquet the source format is parquet select the x and y coordinates from the map create pipeline locations pipeline source filesystem filter '/tmp/to load/locations parquet' extract format parquet into locations select tuple($coordinates x, $coordinates y) as cartesian coords; start the data pipeline the data pipeline loads the cartesian coords column as a tuple start pipeline locations pipeline; related links docid\ jfqu osagg5enkvmeesnl docid\ y731i6lhout2b 7mc1jr8 docid\ jloppc7jopkao4enkj6 d docid\ hlhxjwwhmbdtuouffmzkl docid ymealxr8i3ef2yhn9r8a docid\ wvvm 7a 1h4g4zpdnhiem