Load Metadata and File-Based Partitioned Data in Data Pipelines

The data pipeline loading infrastructure enables the loading of certain metadata values of a source record into the System. Metadata represents information about a record that is not represented in the record data itself, e.g., the filename of the source record or the timestamp when a record was created in . You can also load data in partitioned files using naming standards with a filter set in the file path. Syntaxes The METADATA function can load a variety of metadata along with each record of the data pipeline, or it can use key-value pairs to return the value in the filename metadata. Load Metadata into a Column To load a metadata value into a column, use the METADATA function in the SELECT SQL statement and replace key with the key you want to load. You can use transformation functions to achieve the final data type or allow the pipeline to apply automatic casting.

SQL

METADATA(key)

Argument	Data Type	Description
`key`	string	The name of the specified metadata key. Example: `'topic'`

Retrieve a Value from Filename Metadata To retrieve a value from a key-value pair in the filename metadata, specify the first argument as 'hive_partition' and provide a search string for the key. In this case, the filename must follow Hive naming standards. The standard embeds field names and key-value pairs in path segments, such as /year=2019/month=2/data.parquet. If there are duplicate key-value pairs in the filename, the function uses the last pair. The function returns a string with the value of the associated key, and returns NULL if the file does not contain the search string or the filename does not follow Hive naming standards.

SQL

METADATA('hive_partition',search_string)

Argument	Data Type	Description
`search_string`	string	The search string to use as a key for searching the filename metadata to return the value from the key-value pair as a string.

Loads a variety of header metadata along with each record of the data pipeline.

SQL

METADATA('header', header_row_index, header_col_index)

Argument	Data Type	Description
`header_row_index`	integer	The index, which starts at 1, of the header row to load.
`header_col_index`	integer	The index, which starts at 1, of the header column to load.

Metadata fields cannot be combined in a transformation function with other source data.

Supported Values for the Metadata Key The metadata values you can load differ based on the source type. This table lists the available values for the metadata keys specified using the key argument by source type and their returned data type. Metadata key values are not case-sensitive.

Metadata Key	Source Type	Returned Value Data Type	Description
`row_id`	File Load and Kafka Load	UUID	A unique row identifier that is generated by the Ocient System during loading. The Ocient System guarantees this identifier is unique for each loaded record because the generation is based on internal mechanisms that enforce record deduplication.
`filename`	File Load	VARCHAR	The filename and full file path indicate the source of the record.
`file_modified_timestamp`	File Load	TIMESTAMP	The timestamp that indicates when the file was last modified on the data source.
`header`	Delimited File Load	VARCHAR	Returns the string representation of the header field for the specified header row and column. Returns NULL if any of these conditions exist: The specified field is not present. The specified field is empty, and the `EMPTY_FIELD_AS_NULL` delimited extract option is set to `true`. The specified field matches one of the strings in the `NULL_STRINGS` delimited extract option.
`key`	Kafka Load	VARBINARY	The bytes of the Kafka key for the record.
`line_number`	File Load	BIGINT	The line number in the file for the source of the record.
`offset`	Kafka Load	BIGINT	The Kafka offset of the record.
`partition`	Kafka Load	BIGINT	The name of the Kafka partition that indicates the source of the record.
`record`	File and Kafka Load	VARCHAR	The string representation of the full record. For delimited, JSON, and binary formats, the system represents the record in the same way as the source. For XML, ASN.1, and formats, the system represents the record as a JSON object.
`record_timestamp`	Kafka Load	TIMESTAMP	The timestamp that indicates when the record was created in the Kafka Broker.
`source_record_id`	File and Kafka Load	UUID	A unique source row identifier that is generated by the Ocient System during loading. The Ocient System guarantees this unique identifier for each source record with a specific SOURCE and EXTRACT configuration because the generation is based on source names, source record numbers, and source record offsets.
`topic`	Kafka Load	VARCHAR	The name of the Kafka topic that indicates the source of the record.

Examples Load Filename Metadata This example snippet uses the CREATE PIPELINE SQL statement to load data and the source filename with each record in a pipeline. For details about this pipeline definition and setup, see Data Pipeline Load of CSV Data from S3.

SQL

CREATE PIPELINE orders_pipeline
SOURCE
    S3
        ENDPOINT 'https://s3.us-east-1.amazonaws.com'
        BUCKET 'ocient-docs'
        FILTER 'metabase_samples/csv/orders.csv'
EXTRACT
    FORMAT csv
    NUM_HEADER_LINES 1
INTO public.orders
SELECT
    $1 as id,
    $2 as user_id,
    $3 as product_id,
    $4 as subtotal,
    $5 as tax,
    $6 as total,
    $7 as discount,
    $8 as created_at,
    $9 as quantity,
    METADATA('filename') AS filename_column;

Load Hive-Style Partitioned Data This example uses the CREATE PIPELINE SQL statement to load data and the value of the search string 'year' of the filename with each record in a pipeline. The S3 FILTER option contains the filter for the year in a file path. In turn, the METADATA function uses the filter 'year' value to load all records from files with the year 2019 in the specified path. For details about this pipeline definition and setup, see Data Pipeline Load of CSV Data from S3. The example assumes the filename follows Hive-style naming standards.

SQL

CREATE PIPELINE orders_pipeline
SOURCE
    S3
        ENDPOINT 'https://s3.us-east-1.amazonaws.com'
        BUCKET 'ocient-docs'
        FILTER '/year=2019/**'
EXTRACT
    FORMAT csv
    NUM_HEADER_LINES 1
INTO public.orders
SELECT
    $1 as id,
    $2 as user_id,
    $3 as product_id,
    $4 as subtotal,
    $5 as tax,
    $6 as total,
    $7 as discount,
    $8 as created_at,
    $9 as quantity,
    METADATA('hive_partition','year') AS hive;

Load Data CREATE PIPELINE Transform Data in Data Pipelines

​Related Links

Related Links