Extract Configuration
An extract configuration object. This object defines the way that data is extracted from the source records. The extract definition is applied to a specific topic or file group as defined in the source. Keys are file group names or topic names and values are Extract Settings.
Required keys:
- None - Extract defaults to JSON for all topics or file_groups unless provided
In the following example, two file groups are defined. The file groups defined in the source section are referenced in the extract section to define how data is extracted from the file group. For Kafka based loads, the topic names are used in the extract section to define extraction if required.
Similar to Source and Sink config, record_type instructs the Pipeline to instantiate a specific extraction implementation for a given type of source record such as Delimited Records (e.g., CSV, TSV), JSON Records, or Fixed Width Binary Records.
Type: | string |
Required: | No |
Default: | json |
Allowed values:
extract.<file_group_name>.null_strings
null_strings instructs the extraction process to convert a string literal such as NA to a null value in the transformation layer. This allows different null string values to be provided. null_strings must be an array of strings. If trim_whitespace is set to true and a string value is trimmed to become a null_string, that value converts to null.
Type: | string[] |
Required: | No |
Default: | |
The empty_as_null setting specifies whether empty fields should be treated as null. An empty field is defined as a string field containing no characters (its length is 0). When set to false, empty fields will be treated literally as empty strings. When set to true, empty fields will be treated as null in the transformation and loading process. If trim_whitespace is set to true and empty_as_null is set to true, a field that consists of only whitespace will also be converted to null.
Type: | boolean |
Required: | No |
Default: | True for the Delimited and FWB extract types, False for a JSON extract type |
The trim_whitespace setting specifies whether whitespace should be trimmed from the beginning and end of each string field as it is extracted.
Type: | boolean |
Required: | No |
Default: | False |
The encoding setting specifies the file encoding to use. Encodings should use standard encodings defined in the Java Internationalization Guide.
Type: | string |
Required: | No |
Default: | UTF8 |
replace_invalid_characters defines whether invalid characters based on the encoding type should be replaced with a replacement character (i.e. U+FFFD). If this setting is false and invalid characters are encountered, the record will not be loaded and a record extraction error will be logged.
Type: | boolean |
Required: | No |
Default: | False |
The JSON Extract type allows LAT to extract JSON data from the source records. There are no additional configuration settings available for JSON beyond the common Extract settings.
In this example JSON extract configuration, the replace_invalid_characters setting is used to convert any invalid UTF-8 characters with the UTF-8 replacement character. The file group name assigned to this file group was defined as "json_user_data" in the source section of the pipelines. When transforming this data, you can reference columns by the JSON object keys.
The Delimited Extract type allows the LAT to extract delimited data from the source records. The delimiter character is configurable to support CSV, TSV, or other delimited formats like pipe and semicolon. Fixed width files are a separate extraction type.
- Delimited files can have additional comment rows that are prefixed by a # character. These lines are not processed and will be skipped during loading.
- A maximum number of 1024 columns can be loaded in a single delimited file.
- You can enclose fields with an optional enclosure character. In these cases, the enclosure character is not retained in the loaded data.
- You can skip a configurable number of header lines during loading to prevent column headers from appearing as loaded data.
The system parses headers from the beginning of each file if you configure num_header_lines and do not configure headers.
The system automatically names columns if you do not configure num_header_lines and headers. The columns are named "$0", "$1", and so on consecutively instead.
- You can define multiple file groups. This allows different delimited extraction settings on each file group.
- The default record delimiter is a \n newline character. If source data differs, this should be overridden.
Records with improperly escaped "quote characters" or malformed "quoting" of column data will not parse. See the num_header_lines setting for more details.
- Control characters such as the NULL byte 0x0 are automatically processed as whitespace when parsing delimited records.
The field_delimiter specifies the string that should function as a column delimiter. The field_delimiter character applies to any unquoted and unescaped text when extracting records from the source file. This can be a single character or a multi-character string delimiter. Only one field_delimiter string is allowed per extract.
Type: | string |
Required: | No |
Default: | , |
The record_delimiter specifies the string that indicates the end of a record. The record_delimiter character applies to any unquoted and unescaped text when extracting records from the source file. This can be a single character or a two-character string delimiter. Only one record_delimiter string is allowed per extract.
The default is a standard newline character \n. Depending on the file, typical settings for this delimiter are a newline \n, a carriage return \r or a carriage return and newline \r\n.
Type: | string; maximum 2 characters |
Required: | No |
Default: | \n |
Also referred to as the "quote character," the field_optionally_enclosed_by settings specifies the character that is optionally used to enclose a field or column. The data between enclosure characters can contain delimiter characters. The enclosure character itself must be escaped inside of an enclosed field. Common values are " to enclose fields by double quotes. Because this setting defines an optional enclosure, not all fields in the record need to be enclosed. The escape character used to escape the enclosure character is separately configured and defaults to " per the RFC-4180 standard.
Type: | string; maximum 1 character |
Required: | No |
Default: | " |
Example:
In the following pipe delimited record, the double quote character can be specified so that the final column will be this field has a | in it when loaded into the database. If no enclosure character is specified, this record would yield four columns instead of three.
The escape settings specifies the character that is used as an escape occurrences of the enclosing character used to value so that the following character is escaped. This is typically used in enclosed columns to escape the enclosure character defined as field_optionally_enclosed_by.
Type: | string; maximum 1 character |
Required: | No |
Default: | " |
The num_header_lines setting defines the number of header lines in the source. The system reads these header lines but does not load them as data.
If you do not specify headers, the system uses the final line within this section to name the columns in each file. Then, the system uses these names in the Transform section to refer to column values. The system names columns for each file, so the names available in the Transform section might change if the file header changes.
This setting only applies to file-based source types.
Type: | int |
Required: | No |
Default: | 0 |
The headers setting defines the header labels associated with each column in a delimited file. This array of values is associated with the columns in order from left to right. The strings assigned to the column names are used in the Transform section to refer to column values. Any explicitly supplied headers that are an empty string "" or a null literal will be skipped during extraction. This can have performance benefits if all columns are not needed.
Type: | string[] |
Required: | No |
Default: | null |
The include_columns setting restricts extraction to the columns with the specified names. If you do not need all columns during transformation, you might improve performance when you use the include_columns setting.
If there are no headers and you specify this setting, the system names columns based on their original positions before other columns are filtered out. For example, if there are four columns and $0 and $3 are included, the system names these columns $0 and $3 during transformation, not $0 and $1.
If you specify the exclude_columns setting, you cannot specify this setting.
Type: | string[] |
Required: | No |
Default: | null |
The exclude_columns setting causes the LAT to not extract any of the specified columns. If you do not need all columns during transformation, you might improve performance when you use the exclude_columns setting.
If there are no headers and you specify this setting, the system names columns based on their original positions before other columns are filtered out. For example, if there are four columns and $1 and $2 are excluded, the system names these columns $0 and $3 during transformation, not $0 and $1.
If you specify the include_columns setting, you cannot specify this setting.
Type: | string[] |
Required: | No |
Default: | null |
The skip_blank_lines setting specifies whether lines with only whitespace should be skipped or processed.
Type: | boolean |
Required: | No |
Default: | False |
In this example extract configuration, a pipe delimited file type with five column names is provided. The file group name assigned to this file group was defined as "delimited_user_data" in the source section of the pipelines. Whitespace will be trimmed from each field and blank lines will be skipped instead of reporting errors. Fields with special characters can be escaped by ". Any literal values of NULL or N/A are converted to null. When transforming this data, columns can be referred to by id, username, last_login_at, first_name, and last_name. The fourth column is listed as null in the header, so it will be skipped during extraction. Because num_header_lines is not specified, the files are assumed to have no headers.
The Fixed Width Binary (FWB) Extract type allows the LAT to extract data from binary source records that have a fixed number of bytes per column. FWB source records typically come from sources like mainframe systems like the series that use Cobol Copybooks to produce character, integer, floating point, and decimal data in various byte formats. The FWB Extract type requires users to specify the byte offset, width in bytes, and the type of extraction to use on the indicated bytes for each column. The FWB type also allows users to specify a record length in bytes rather than a character record delimiter. FWB files are typically extracted using encodings such as cp500 and cp037.
The following data types are supported when extracting Fixed Width Binary data. All data types require specification of offset and width to locate the bytes to extract in each record.
Data Type | Description | Supported Settings |
string | Represents bytes of character data following the encoding defined in the extract settings. Bytes are extracted into a string for transformation and loading. | encoding - inherited |
integer | Represents a signed integer. Uses the endianness setting to determine how bytes should be interpreted. The integer type can be used for 1, 2, 4, or 8 bytes of data. These correspond to mainframe concepts of a Byte, Half-word, Full-word, and Double-word. | endianness - inherited |
unsigned_integer | endianness - inherited | |
float |
| |
packed_decimal | A Packed Binary Coded Decimal format for decimal precision that packs two integral values in each byte and encodes the sign in the final byte. The number of decimal points represented by the data is supplied with a separate extraction setting. The sign is encoded in the 4 least significant bits of the final byte. A value of 0xD or 0xB indicates a negative sign. Any other value including the common values of 0xF, 0xE, 0xC or 0xA indicating positive or unsigned. | endianness - inherited decimals - number of digits to place the right of the decimal point sign_format - defaults to 'trailing' byte format indicating that the sign is encoded in the most significant four bits of the final byte; provided for future extensibility. |
zoned_decimal | A Zoned Binary Coded Decimal format for decimal precision that packs two integral values in each byte and encodes the sign in the final byte. The number of decimal points represented by the data is supplied with a separate extraction setting. The sign is encoded in the 4 most significant bits of the final byte. A value of 0xD, 0xB, 0x7 indicates a negative sign. Values of 0xF, 0xE, 0xC or 0xA indicate positive or unsigned. | endianness - inherited decimals - number of digits to place the right of the decimal point sign_format - defaults to 'trailing' byte format indicating that the sign is encoded in the most significant four bits of the final byte; provided for future extensibility. |
The record_width setting specifies the total number of bytes in each record. Files are read serially, and the next record is assumed to start on the following byte. The record_width can be greater than the number of bytes used in extracting the individual columns.
Type: | integer |
Required: | Yes |
Default: | |
The padding_character setting specifies a character in a fixed width file type that should be removed from extracted character data. After extracting a character field, any leading or trailing occurrences of this character will be stripped from the ends of the character data. Typical examples are the space character or an underscore.
Type: | string |
Required: | No |
Default: | " " (in the specified encoding) |
The endianness setting specifies whether the extraction of numeric fields should use Big Endian or Little Endian byte ordering to process extracted bytes. endianness impacts integer, packed_decimal, and zoned_decimal column types.
Type: | string |
Required: | No |
Default: | big |
Allowed values:
- big: Big Endian
- little: Little Endian
extract.<file_group_name>.columns
A extract columns configuration object. A collection of columns and their associated extraction configuration. Keys are column names and values are Column Extracts configuration.
Required keys:
Column extract configurations for Fixed Width Binary records.
A column extract configuration object keyed by a column name. A column’s value is defined as an extract definition. The extract definition will indicate the offset and width to locate the bytes to extract, the data type to use in the extraction, and any additional parameters about the bytes to extract.
Required keys:
- offset
- width
- data_type
Optional keys:
- decimals
- sign_format
Number of bytes to skip from the beginning of the record before beginning to extract bytes for this column. Note that offset is relative to the beginning of the record, so arbitrary offsets can be used including overlapping sets of bytes with other columns.
Type: | integer |
Required: | Yes |
Default: | |
Number of bytes to extract from the offset location.
Type: | integer |
Required: | Yes |
Default: | |
Data type to extract from the bytes that are specified by offset and width. See Fixed Width Binary Data Types for reference on available data types.
Type: | string |
Required: | Yes |
Default: | string |
Allowed values:
- string
- integer
- unsigned_integer
- float
- packed_decimal
- zoned_decimal
Note that each data_type has specific settings that are allowed to control the extraction to that data_type. Some of these are inherited from the overall extract configuration, while others are defined on each column.
The number of numeric values that are placed to the right of the decimal point in a zoned_decimal or packed_decimal data type. For example, if decimals is set to 2` and the data in a zoned_decimal represents the numerals 12345, this would create the decimal 123.45` in the extracted value.
decimals configures the scale of the extracted decimal and the total number of digits present in the source data determines the precision. If insufficient scale or precision exists in the target Ocient column, data will fail to load.
Type: | integer |
Required: | No |
Default: | 0 |
Applies to:
- packed_decimal
- zoned_decimal
In decimal types, the sign for the number can be encoded in different ways. Currently, only the trailing format is supported which requires the sign to be encoded in 4 bits of the last byte in the column data. This setting is provided for future extensibility. The values that indicate a positive or negative sign are listed in Fixed Width Binary Data Types.
Type: | string |
Required: | No |
Default: | trailing |
Allowed values:
- trailing - the sign is encoded in the most significant 4 bits of the last byte in the binary coded decimal.
Applies to:
- packed_decimal
- zoned_decimal
The following example illustrates extracting fixed width binary data from a file group that was given the name "order_files" in the source section of the pipeline. The file has 3100 bytes per record and is only extracting a few columns.
Load Data