LAT Extract Configuration

data pipelines are now the preferred method for loading data into the ocient system for details, see docid\ zncvnrhsf6fg1yvqk6mxt an extract configuration object this object defines the way that data is extracted from the source records the extract definition is applied to a specific topic or file group as defined in the source keys are file group names or topic names and values are docid\ p makey5cmtzhn iw2zon for file based loads, this should match a file group name defined in the source file groups section for {{kafka}} based loads this should match a topic defined in kafka required keys none extract defaults to json for all topics or file groups unless provided example extract in the following example, two file groups are defined the file groups defined in the source section are referenced in the extract section to define how data is extracted from the file group for kafka based loads, the topic names are used in the extract section to define extraction if required { , "source" { "type" "s3", "endpoint" "http //some endpoint/", "bucket" "my bucket", "file groups" { "delimited user data" { "prefix" "crm/users/", "file matcher syntax" "glob", "file matcher pattern" " csv", "sort type" "lexicographic" }, "delimited order data" { "prefix" "crm/orders/", "file matcher syntax" "glob", "file matcher pattern" " json", "sort type" "lexicographic" } } }, "extract" { "delimited user data" { "record type" "delimited", "headers" \["id", "username", "last login at", null, "first name", "last name"], "field delimiter" "|", "trim whitespace" true, "skip blank lines" true, "field optionally enclosed by" "\\"", "null strings" \["null", "n/a"] }, "delimited order data" { "record type" "delimited", "num header lines" 1, "exclude columns" \["user id", "order total"], "field delimiter" "," } }, "transform" { } } extract settings extract \<file group name> record type similar to source and sink config, record type instructs the pipeline to instantiate a specific extraction implementation for a given type of source record such as delimited records (e g , csv, tsv), json records, or fixed width binary records 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type allowed values json see docid\ p makey5cmtzhn iw2zon for additional configuration delimited see docid\ p makey5cmtzhn iw2zon for additional configuration fixed width binary see docid\ p makey5cmtzhn iw2zon for additional configuration shared extract configuration extract \<file group name> null strings null strings instructs the extraction process to convert a string literal such as na to a null value in the transformation layer this allows different null string values to be provided null strings must be an array of strings if trim whitespace is set to true and a string value is trimmed to become a null string , that value converts to null left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> empty as null the empty as null setting specifies whether empty fields should be treated as null an empty field is defined as a string field containing no characters (its length is 0) when set to false, empty fields will be treated literally as empty strings when set to true, empty fields will be treated as null in the transformation and loading process if trim whitespace is set to true and empty as null is set to true, a field that consists of only whitespace will also be converted to null left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> trim whitespace the trim whitespace setting specifies whether whitespace should be trimmed from the beginning and end of each string field as it is extracted 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> encoding the encoding setting specifies the file encoding to use encodings should use standard {{java}} encodings defined in the https //docs oracle com/en/java/javase/17/intl/supported encodings html#guid 187ba718 195f 4c39 b0d5 f3fdf02c7205 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> replace invalid characters replace invalid characters defines whether invalid characters based on the encoding type should be replaced with a replacement character (i e u+fffd ) if this setting is false and invalid characters are encountered, the record will not be loaded and a record extraction error will be logged left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type json extract the json extract type allows lat to extract json data from the source records there are no additional configuration settings available for json beyond the common extract settings example json extract configuration in this example json extract configuration, the replace invalid characters setting is used to convert any invalid utf 8 characters with the utf 8 replacement character the file group name assigned to this file group was defined as "json user data" in the source section of the pipelines when transforming this data, you can reference columns by the json object keys { "extract" { "json user data" { "record type" "json", "replace invalid characters" true } }, } delimited extract the delimited extract type allows the lat to extract delimited data from the source records the delimiter character is configurable to support csv, tsv, or other delimited formats like pipe and semicolon fixed width files are a separate extraction type notes on delimited files delimited files can have additional comment rows that are prefixed by a # character these lines are not processed and will be skipped during loading a maximum number of 1024 columns can be loaded in a single delimited file you can enclose fields with an optional enclosure character in these cases, the enclosure character is not retained in the loaded data you can skip a configurable number of header lines during loading to prevent column headers from appearing as loaded data the system parses headers from the beginning of each file if you configure docid\ p makey5cmtzhn iw2zon and do not configure docid\ p makey5cmtzhn iw2zon the system automatically names columns if you do not configure docid\ p makey5cmtzhn iw2zon and docid\ p makey5cmtzhn iw2zon the columns are named "$0", "$1", and so on consecutively instead you can define multiple file groups this allows different delimited extraction settings on each file group the default record delimiter is a \n newline character if source data differs, this should be overridden records with improperly escaped "quote characters" or malformed "quoting" of column data will not parse see the docid\ p makey5cmtzhn iw2zon setting for more details control characters such as the null byte 0x0 are automatically processed as whitespace when parsing delimited records delimited configuration extract \<file group name> field delimiter the field delimiter specifies the string that should function as a column delimiter the field delimiter character applies to any unquoted and unescaped text when extracting records from the source file this can be a single character or a multi character string delimiter only one field delimiter string is allowed per extract 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> record delimiter the record delimiter specifies the string that indicates the end of a record the record delimiter character applies to any unquoted and unescaped text when extracting records from the source file this can be a single character or a two character string delimiter only one record delimiter string is allowed per extract the default is a standard newline character \n depending on the file, typical settings for this delimiter are a newline \n , a carriage return \r or a carriage return and newline \r\n 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> field optionally enclosed by also referred to as the "quote character," the field optionally enclosed by settings specifies the character that is optionally used to enclose a field or column the data between enclosure characters can contain delimiter characters the enclosure character itself must be escaped inside of an enclosed field common values are " to enclose fields by double quotes because this setting defines an optional enclosure, not all fields in the record need to be enclosed the escape character used to escape the enclosure character is separately configured and defaults to " per the rfc 4180 standard 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type example in the following pipe delimited record, the double quote character can be specified so that the final column will be this field has a | in it when loaded into the database if no enclosure character is specified, this record would yield four columns instead of three a|b|"this field has a | in it" extract \<file group name> escape the escape settings specifies the character that is used as an escape occurrences of the enclosing character used to value so that the following character is escaped this is typically used in enclosed columns to escape the enclosure character defined as field optionally enclosed by 198,430 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> num header lines the num header lines setting defines the number of header lines in the source the system reads these header lines but does not load them as data if you do not specify docid\ p makey5cmtzhn iw2zon , the system uses the final line within this section to name the columns in each file then, the system uses these names in the transform section to refer to column values the system names columns for each file, so the names available in the transform section might change if the file header changes this setting only applies to file based source types 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> headers the headers setting defines the header labels associated with each column in a delimited file this array of values is associated with the columns in order from left to right the strings assigned to the column names are used in the transform section to refer to column values any explicitly supplied headers that are an empty string "" or a null literal will be skipped during extraction this can have performance benefits if all columns are not needed 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> include columns the include columns setting restricts extraction to the columns with the specified names if you do not need all columns during transformation, you might improve performance when you use the include columns setting if there are no headers and you specify this setting, the system names columns based on their original positions before other columns are filtered out for example, if there are four columns and $0 and $3 are included, the system names these columns $0 and $3 during transformation, not $0 and $1 if you specify the exclude columns setting, you cannot specify this setting 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> exclude columns the exclude columns setting causes the lat to not extract any of the specified columns if you do not need all columns during transformation, you might improve performance when you use the exclude columns setting if there are no headers and you specify this setting, the system names columns based on their original positions before other columns are filtered out for example, if there are four columns and $1 and $2 are excluded, the system names these columns $0 and $3 during transformation, not $0 and $1 if you specify the include columns setting, you cannot specify this setting 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> skip blank lines the skip blank lines setting specifies whether lines with only whitespace should be skipped or processed left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> skip comment lines the skip comment lines setting specifies whether lines with a leading comment character should be skipped or processed type boolean required no default true extract \<file group name> comment character the comment character setting specifies which character represents the start of a comment line type string; maximum 1 character required no default '#' example delimited extract configuration in this example extract configuration, a pipe delimited file type with five column names is provided the file group name assigned to this file group was defined as "delimited user data" in the source section of the pipelines whitespace will be trimmed from each field and blank lines will be skipped instead of reporting errors fields with special characters can be escaped by " any literal values of null or n/a are converted to null when transforming this data, columns can be referred to by id , username , last login at , first name , and last name the fourth column is listed as null in the header, so it will be skipped during extraction because num header lines is not specified, the files are assumed to have no headers { "extract" { "delimited user data" { "record type" "delimited", "headers" \["id", "username", "last login at", null, "first name", "last name"], "field delimiter" "|", "trim whitespace" true, "skip blank lines" true, "include columns" \["id", "username", "last name"], "field optionally enclosed by" "\\"", "null strings" \["null", "n/a"] }, }, } fixed width binary extract the fixed width binary (fwb) extract type allows the lat to extract data from binary source records that have a fixed number of bytes per column fwb source records typically come from sources like mainframe systems like the {{ibmz}} series that use cobol copybooks to produce character, integer, floating point, and decimal data in various byte formats the fwb extract type requires users to specify the byte offset, width in bytes, and the type of extraction to use on the indicated bytes for each column the fwb type also allows users to specify a record length in bytes rather than a character record delimiter fwb files are typically extracted using encodings such as cp500 and cp037 fixed width binary data types the following data types are supported when extracting fixed width binary data all data types require specification of offset and width to locate the bytes to extract in each record 152,257,115 true unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type unhandled content type fixed width binary configuration extract \<file group name> record width the record width setting specifies the total number of bytes in each record files are read serially, and the next record is assumed to start on the following byte the record width can be greater than the number of bytes used in extracting the individual columns 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> padding character the padding character setting specifies a character in a fixed width file type that should be removed from extracted character data after extracting a character field, any leading or trailing occurrences of this character will be stripped from the ends of the character data typical examples are the space character or an underscore 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> endianness the endianness setting specifies whether the extraction of numeric fields should use big endian or little endian byte ordering to process extracted bytes endianness impacts integer, packed decimal, and zoned decimal column types 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type allowed values big big endian little little endian extract \<file group name> columns a extract columns configuration object a collection of columns and their associated extraction configuration keys are column names and values are docid\ p makey5cmtzhn iw2zon required keys docid\ p makey5cmtzhn iw2zon column extracts column extract configurations for fixed width binary records extract \<file group name> columns \<column> a column extract configuration object keyed by a column name a column’s value is defined as an extract definition the extract definition will indicate the offset and width to locate the bytes to extract, the data type to use in the extraction, and any additional parameters about the bytes to extract required keys offset width data type optional keys decimals sign format extract \<file group name> columns \<column> offset number of bytes to skip from the beginning of the record before beginning to extract bytes for this column note that offset is relative to the beginning of the record, so arbitrary offsets can be used including overlapping sets of bytes with other columns 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> columns \<column> width number of bytes to extract from the offset location 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type extract \<file group name> columns \<column> data type data type to extract from the bytes that are specified by offset and width see docid\ p makey5cmtzhn iw2zon for reference on available data types 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type allowed values string integer unsigned integer float packed decimal zoned decimal note that each data type has specific settings that are allowed to control the extraction to that data type some of these are inherited from the overall extract configuration, while others are defined on each column extract \<file group name> columns \<column> decimals the number of numeric values that are placed to the right of the decimal point in a zoned decimal or packed decimal data type for example, if decimals is set to 2` and the data in a zoned decimal represents the numerals 12345 , this would create the decimal 123 45` in the extracted value decimals configures the scale of the extracted decimal and the total number of digits present in the source data determines the precision if insufficient scale or precision exists in the target ocient column, data will fail to load 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type applies to packed decimal zoned decimal extract \<file group name> columns \<column> sign format in decimal types, the sign for the number can be encoded in different ways currently, only the trailing format is supported which requires the sign to be encoded in 4 bits of the last byte in the column data this setting is provided for future extensibility the values that indicate a positive or negative sign are listed in docid\ p makey5cmtzhn iw2zon 0 left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type left unhandled content type allowed values trailing the sign is encoded in the most significant 4 bits of the last byte in the binary coded decimal applies to packed decimal zoned decimal fixed width binary extract example the following example illustrates extracting fixed width binary data from a file group that was given the name "order files" in the source section of the pipeline the file has 3100 bytes per record and is only extracting a few columns { "extract" { "order files" { "record type" "fixed width binary", "record width" 3100, "endianness" "big", "columns" { "order id" { "offset" 0, "width" 4, "data type" "integer" }, "username" { "offset" 4, "width" 20, "data type" "string" }, "order total" { "offset" 24, "width" 10, "data type" "packed decimal", "decimals" 2 }, "sales tax total" { "offset" 34, "width" 8, "data type" "zoned decimal", "decimals" 3 } } } } } related links docid\ zjlpeecj74 lp0ciflqh0