The process of breaking content up into translatable strings is known as string parsing. Whenever Smartling captures content from your file, app, or website, the system parses or breaks it down into small, discrete entities. This makes it easier to re-order a translation, whether for formatting or linguistic reasons. There are two levels of parsing: strings and segments.
Strings
A string is a unit of content. Smartling extracts strings from the captured content based on the paragraph markers found in your content. Those markers vary depending on the content or file type (see the table below).
Segments
A large string containing many words may be parsed into segments, which are delimited by end-of-sentence punctuation marks such as periods, semicolons, and question marks. Segments are only visible in the CAT Tool.
Please note:
If a section of text consisting of multiple sentences is surrounded by a single set of tags, it cannot be split into multiple segments.
For example:
In order for the string to be parsed into segments, the opening and closing tags must both be contained within the same segment:
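The two cases above can be illustrated with a minimal Python sketch. This is not Smartling's actual parser: the function names, the punctuation set, and the simple tag counting are all assumptions made for illustration. It proposes a boundary after each sentence-ending mark and accepts it only if every tag opened in the candidate segment is also closed there.

```python
import re

# Candidate boundary: sentence-ending punctuation, any closing tags, then whitespace.
# The punctuation set is an assumption based on the marks named in this article.
BOUNDARY = re.compile(r"[.!?;](?:</\w+>)*\s+")
# Matches opening and closing tags; group 1 is "/" for a closing tag.
TAG = re.compile(r"<(/?)\w+[^>]*>")


def tag_depth(text: str) -> int:
    """Return the number of tags opened but not closed within text."""
    depth = 0
    for match in TAG.finditer(text):
        if match.group(0).endswith("/>"):
            continue  # self-closing tags do not need a partner
        depth += -1 if match.group(1) else 1
    return depth


def split_into_segments(string: str) -> list[str]:
    """Split a string into segments, never separating a tag from its partner."""
    text = string.strip()
    cuts = [m.end() for m in BOUNDARY.finditer(text)] + [len(text)]
    segments, start = [], 0
    for cut in cuts:
        candidate = text[start:cut]
        # Accept the boundary only when the candidate's tags are balanced.
        if tag_depth(candidate) == 0 and candidate.strip():
            segments.append(candidate.strip())
            start = cut
    if start < len(text):
        # Unbalanced leftover (malformed markup) stays together as one segment.
        segments.append(text[start:].strip())
    return segments
```

With this sketch, `<b>First sentence. Second sentence.</b>` stays a single segment because the only candidate boundary falls between the opening and closing tag, while `<b>First sentence.</b> <i>Second sentence.</i>` yields two segments because each sentence carries its own complete tag pair. Abbreviations and nested tag names are deliberately not handled.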
Parsing Per File Format
Parsing rules differ depending on the file type being processed. With an Excel file, for example, the content of each cell becomes a separate string in Smartling and each string gets a different variant. The table below shows the basic string segmentation approach for various file types, as well as how the variant attribute is set.
Content format | What the parser uses to define the boundaries of the string | Notes |
Excel | Cell | In the same column is recommended. |
Word | Paragraph or line break | Not a new line or sentence ending in a full stop/period. |
PowerPoint | Paragraph/text box | The strings are arranged in order of when each text box was created, not where it is placed on the slide. |
InDesign | Paragraph within a text frame | |
CSV | Cell | Optionally segment HTML strings into additional strings based on block-level tags. |
Text file | Newline | |
HTML | Block-level tag | For example, DIV or LI, but not SPAN or A. See Capturing Content for a full list of block-level tags. |
Key-based file formats, such as Java Properties, XLIFF, iOS, Android | Key | Optionally split HTML strings into additional strings based on block-level tags. |
JSON | Element or object | Optionally segment HTML strings into additional strings based on block-level tags. |
XML | Element or object | Optionally segment HTML strings into additional strings based on block-level tags. |
GDN HTML | Block-level tag | |
Strings API | Defined explicitly in API call | |
Markdown | Markdown codes corresponding to block-level HTML tags | |
Directives
Parsing behavior can be modified through the use of directives and rules. For example, file-parsing directives specify which columns of a CSV file to extract as source text for translation and which to use as keys/variants:
# smartling.source_key_paths=1
# smartling.paths=2
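To make the effect of these two directives concrete, here is a small Python sketch that reads them from the top of a CSV file and pairs each key with its source text. It is an illustration only, not Smartling's implementation: the function name is hypothetical, column numbers are assumed to be 1-based, and each directive is assumed to hold a single column number.

```python
import csv
import io


def parse_csv_with_directives(text: str) -> dict[str, str]:
    """Map each key cell to its source-text cell, guided by inline directives."""
    directives = {}
    data_lines = []
    prefix = "# smartling."
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            # "# smartling.paths=2" -> {"paths": "2"}
            name, _, value = stripped[len(prefix):].partition("=")
            directives[name.strip()] = value.strip()
        elif stripped:
            data_lines.append(line)

    # Assumption: directive values are 1-based column numbers.
    key_col = int(directives["source_key_paths"]) - 1
    source_col = int(directives["paths"]) - 1

    # Simplification: cells that contain line breaks are not supported here.
    rows = csv.reader(io.StringIO("\n".join(data_lines)))
    return {row[key_col]: row[source_col] for row in rows}
```

Given a file whose data rows are `greeting,Hello world` and `farewell,"Goodbye, and thanks"`, the sketch treats column 1 as the key and column 2 as the translatable string, so only `Hello world` and `Goodbye, and thanks` would be sent for translation.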
Another important file-parsing directive which can modify the format of strings is:
# smartling.string_format_paths= [file format type, e.g.: HTML]
This directive can be used to specify a single path for a format, or a comma-separated array of paths. You might use the latter if you have a resource file containing disparate kinds of strings that are used in different contexts. For example, a JSON file might contain application UI strings, with standard string formatting and placeholders, together with one or a few keys whose values are large HTML documents (for example, Terms of Service or a Privacy Policy).
The advantage of specifying the string format as HTML for paths in a non-HTML file is that large strings are broken down into smaller strings. This benefits both your translation resources and your translation memory leverage. Note that in doing so you lose any "keys", although you still have "variants". In most cases, such as translation Jobs, this is acceptable. It is not acceptable if you are importing translations from a file, because keys are critical for alignment.
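The following Python sketch shows the general idea of breaking one large HTML value into smaller strings at block-level tags. It is a rough approximation rather than Smartling's parser: the class and function names are hypothetical, and `BLOCK_TAGS` is only an illustrative subset of block-level tags.

```python
from html.parser import HTMLParser

# Illustrative subset only; the real list of block-level tags is longer.
BLOCK_TAGS = {"div", "p", "li", "ul", "ol", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Collect the text between block-level tags, keeping inline tags in place."""

    def __init__(self):
        super().__init__()
        self.strings = []
        self._buffer = []

    def flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.strings.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()  # a block-level tag starts a new string
        else:
            self._buffer.append(self.get_starttag_text())  # keep inline tags

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()  # a block-level tag ends the current string
        else:
            self._buffer.append(f"</{tag}>")

    def handle_data(self, data):
        self._buffer.append(data)


def split_html_string(html: str) -> list[str]:
    """Return the smaller strings found inside one HTML value."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()
    return splitter.strings
```

For a value such as `<h1>Terms</h1><p>Use is <b>free</b>.</p><p>No warranty.</p>`, the sketch returns three strings: `Terms`, `Use is <b>free</b>.`, and `No warranty.`. The inline `<b>` tag stays inside its string, while the original key no longer maps to a single string, which is why key-based alignment is lost.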
Furthermore, if you specify multiple string formats and choose both ICU and HTML, ICU formatting rules take precedence over HTML.
Strings and Segments in the CAT Tool
Larger strings may be further divided into segments, which are only visible in the CAT Tool. A segment is usually a sentence: a sentence-ending punctuation mark such as a period (.), exclamation point (!), or question mark (?) starts a new segment.
The following example shows an entire string (denoted with a green vertical bar) that has been parsed or broken down into two segments. A Translator, Editor, or Reviewer will then be able to translate or edit each of the corresponding segments.
Merge Segments
If you're translating a string with multiple segments, you have the option to merge segments. Mouse over the Merge segment into next icon. Alternatively, you can use the shortcut that you've set in your keyboard settings.