Writing a custom lexer
Many programs already come with their own input formats, and switching to a different format requires some way to provide backward compatibility for older inputs. When transitioning to TOML Fortran, reading the input will go through the provided TOML data structures. If the previous input format is sufficiently compatible, it can be parsed into a matching TOML data structure, allowing a seamless move to the TOML format going forward while still supporting previously written inputs.
This tutorial is meant to show how lexing in TOML Fortran works and to enable the reader to implement a custom lexer for their own format. There is no guarantee that every input format can be ported by creating a custom lexer, since the format needs to fulfill some basic requirements, like providing typed values. For this tutorial, we will choose JSON as our input format and walk through all the steps to create a new lexer from scratch.
Note
The choice of JSON for this tutorial is not a coincidence. TOML Fortran does implement this lexer to parse JSON files into TOML data structures to support the encoding tests in the validation suite of BurntSushi/toml-test.
Important
This tutorial makes partial use of the internal API of TOML Fortran.
Identifying limitations
Before we start to implement our custom lexer, we need to identify any limitations of the TOML data structures to represent our custom format.
TOML documents always have a table at the document root; there is no way to represent a JSON array or a single value at the root of a TOML document. Furthermore, JSON supports the value type null, which is not representable in TOML. We have two choices here: either we flag null values as invalid tokens, or we replace them in the lexer with something else, like an empty table.
Finally, there are other details we have to take into account, like how JSON handles duplicate keys; for most of these implementation-dependent cases we will follow the rules TOML provides.
This tutorial by no means aims to offer a fully compliant parser, as we already fail for top-level arrays or null-type values.
For a custom format, this might be even more challenging, especially if the format is defined by only a single implementation.
Note
Writing a compliant JSON parser can quickly become quite challenging (see Parsing JSON is a Minefield).
Format limitations can go both ways, of course: there are also features in TOML we cannot express in JSON. However, since we want to map JSON to TOML and not the other way round, we do not have to worry about limitations present in JSON. Every feature available in TOML but not representable in the previous input format is an incentive to switch to the new format.
Note
For the actual application of the JSON parser in the validation suite, this problem is solved by using only strings to represent values and adding type annotations. In TOML Fortran these annotations are mapped back by pruning the read JSON data structure. The pruning is done via a visitor, which is accepted after the data structure has been completely parsed.
Creating the lexer
First, we start by creating a new subclass of the abstract base class (ABC) imported from the tomlf_de_abc module.
use tomlf_constants, only : tfc, tfi, tfr, toml_escape
use tomlf_datetime, only : toml_datetime
use tomlf_de_abc, only : abstract_lexer
use tomlf_de_token, only : toml_token, token_kind
use tomlf_error, only : toml_error, make_error
use tomlf_utils, only : read_whole_file, read_whole_line
implicit none
private
public :: json_lexer
public :: new_lexer_from_file, new_lexer_from_unit, new_lexer_from_string
!> Tokenizer for JSON documents
type, extends(abstract_lexer) :: json_lexer
!> Name of the source file, used for error reporting
character(len=:), allocatable :: filename
!> Current internal position in the source chunk
integer :: pos = 0
!> Current source chunk
character(:, tfc), allocatable :: chunk
!> Additional tokens to insert before the actual token stream
integer :: prelude = 2
contains
!> Obtain the next token
procedure :: next
!> Extract a string from a token
procedure :: extract_string
!> Extract an integer from a token
procedure :: extract_integer
!> Extract a float from a token
procedure :: extract_float
!> Extract a boolean from a token
procedure :: extract_bool
!> Extract a timestamp from a token
procedure :: extract_datetime
!> Get information about source
procedure :: get_info
end type json_lexer
Next, we create a constructor that consumes an input file and turns it into a string for the lexer to advance through.
!> Create a new instance of a lexer by reading from a file
subroutine new_lexer_from_file(lexer, filename, error)
!> Instance of the lexer
type(json_lexer), intent(out) :: lexer
!> Name of the file to read from
character(len=*), intent(in) :: filename
!> Error code
type(toml_error), allocatable, intent(out) :: error
integer :: stat
lexer%filename = filename
call read_whole_file(filename, lexer%chunk, stat)
if (stat /= 0) call make_error(error, "Could not open file '"//filename//"'")
end subroutine new_lexer_from_file
Reading from a formatted unit is less efficient than reading the whole file with direct access, but it is needed in case we are dealing with standard input. We make sure to error out for direct access or stream access units, since we cannot reliably read those.
!> Create a new instance of a lexer by reading from a unit.
!>
!> Currently, only sequential access units can be processed by this constructor.
subroutine new_lexer_from_unit(lexer, io, error)
!> Instance of the lexer
type(json_lexer), intent(out) :: lexer
!> Unit to read from
integer, intent(in) :: io
!> Error code
type(toml_error), allocatable, intent(out) :: error
character(:, tfc), allocatable :: source, line
integer, parameter :: bufsize = 512
character(bufsize, tfc) :: filename, mode
integer :: stat
inquire(unit=io, access=mode, name=filename)
select case(trim(mode))
case default
stat = 1
case("sequential", "SEQUENTIAL")
allocate(character(0) :: source)
do
call read_whole_line(io, line, stat)
if (stat > 0) exit
source = source // line // toml_escape%newline
if (stat < 0) then
if (is_iostat_end(stat)) stat = 0
exit
end if
end do
call new_lexer_from_string(lexer, source)
end select
if (len_trim(filename) > 0) lexer%filename = trim(filename)
if (stat /= 0) call make_error(error, "Failed to read from unit")
end subroutine new_lexer_from_unit
Finally, we sometimes also need to read from a string, so we add a constructor which creates a lexer from a string input.
!> Create a new instance of a lexer by reading from a string.
subroutine new_lexer_from_string(lexer, string)
!> Instance of the lexer
type(json_lexer), intent(out) :: lexer
!> String to read from
character(len=*), intent(in) :: string
lexer%chunk = string
end subroutine new_lexer_from_string
The parser might need access to some of the internal data of the lexer; this is provided via the get_info procedure.
!> Extract information about the source
subroutine get_info(lexer, meta, output)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Query about the source
character(*, tfc), intent(in) :: meta
!> Metadata about the source
character(:, tfc), allocatable, intent(out) :: output
select case(meta)
case("source")
output = lexer%chunk // toml_escape%newline
case("filename")
if (allocated(lexer%filename)) output = lexer%filename
end select
end subroutine get_info
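To illustrate how these queries look from the caller's side, here is a small, purely hypothetical driver program (it assumes the lexer module is named tjson_lexer, as in the full source at the end of this section):
program query_lexer_info
   use tjson_lexer, only : json_lexer, new_lexer_from_string
   implicit none
   type(json_lexer) :: lexer
   character(len=:), allocatable :: source, filename

   call new_lexer_from_string(lexer, '{"a": 1}')
   ! The "source" query returns the stored chunk with a trailing newline appended
   call lexer%get_info("source", source)
   ! The "filename" query leaves the output unallocated when reading from a string
   call lexer%get_info("filename", filename)
   print '(a)', source
end program query_lexer_info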
Identifying tokens
Now that we can instantiate the lexer, we need to implement the recognition of tokens; this is done with the next method. We start by creating the actual tokenization step in the next_token procedure, which we will call from the next method.
!> Actually generate the next token, unbuffered version
subroutine next_token(lexer, token)
!> Instance of the lexer
class(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
As a first action, we will advance the internal state of the lexer by consuming the last token. For convenience, we save the position in the source string in the pos and prev variables.
integer :: prev, pos
! Consume current token
lexer%pos = lexer%pos + token%last - token%first + 1
prev = lexer%pos
pos = lexer%pos
Next, we check whether we have exhausted the input string and, if so, return the end-of-file token.
! If lexer is exhausted, return EOF as early as possible
if (pos > len(lexer%chunk)) then
token = toml_token(token_kind%eof, prev, pos)
return
end if
Now we can inspect the current character of the source string and decide which token it should be labeled as. The character set is quite simple: we have to consider opening and closing brackets and braces (for arrays and tables, respectively), commas, colons, strings, and whitespace. We will explicitly produce whitespace tokens here rather than skipping them, since the parser can gracefully handle whitespace. However, we have to keep in mind that newlines carry semantic meaning in TOML while they are only whitespace in JSON; therefore, we will only ever produce whitespace tokens, never newline tokens.
We use a select case statement to decide which token to produce.
select case(lexer%chunk(pos:pos))
case(" ", toml_escape%tabulator, toml_escape%newline, toml_escape%carriage_return)
do while(any(lexer%chunk(pos+1:pos+1) == [" ", toml_escape%tabulator, &
& toml_escape%newline, toml_escape%carriage_return]) .and. pos < len(lexer%chunk))
pos = pos + 1
end do
token = toml_token(token_kind%whitespace, prev, pos)
return
case(":")
token = toml_token(token_kind%equal, prev, pos)
return
case("{")
token = toml_token(token_kind%lbrace, prev, pos)
return
case("}")
token = toml_token(token_kind%rbrace, prev, pos)
return
case("[")
token = toml_token(token_kind%lbracket, prev, pos)
return
case("]")
token = toml_token(token_kind%rbracket, prev, pos)
return
case('"')
call next_string(lexer, token)
return
case("-", "0":"9")
call next_number(lexer, token)
if (token%kind /= token_kind%invalid) return
case("t", "f")
call next_boolean(lexer, token)
return
case("n")
call next_null(lexer, token)
return
case(",")
token = toml_token(token_kind%comma, prev, pos)
return
end select
! If the current token is invalid, advance to the next terminator
do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
pos = pos + 1
end do
token = toml_token(token_kind%invalid, prev, pos)
end subroutine next_token
To wrap up the lexing, we try to identify unknown tokens as well as possible by advancing to the next terminating character. As terminating characters we choose whitespace as well as the structural characters, and place them as a parameter in the module scope.
character(*, tfc), parameter :: terminated = " {}[],:"//&
& toml_escape%tabulator//toml_escape%newline//toml_escape%carriage_return
Note
We are cheating a bit here since we declare the colon as an equal token. This way we can use the same lexer for both JSON and TOML and still have the same parsing rules.
One special case to consider is literals, like strings, numbers, or booleans. To avoid cluttering the logic here, we create separate routines for lexing the respective literal values. For obtaining string values we implement next_string. Here we cannot simply advance to the next quote character, since we need to handle escape characters gracefully. While doing so we can also ensure that the escape sequences found are valid and not malformed.
!> Process next string token
subroutine next_string(lexer, token)
!> Instance of the lexer
type(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
character(1, tfc) :: ch
character(*, tfc), parameter :: valid_escape = 'btnfr\"'
integer :: prev, pos
logical :: escape, valid
prev = lexer%pos
pos = lexer%pos
valid = .true.
escape = .false.
do while(pos < len(lexer%chunk))
pos = pos + 1
ch = lexer%chunk(pos:pos)
valid = valid .and. valid_string(ch)
if (escape) then
escape = .false.
valid = valid .and. verify(ch, valid_escape) == 0
cycle
end if
escape = ch == toml_escape%backslash
if (ch == '"') exit
if (ch == toml_escape%newline) then
pos = pos - 1
valid = .false.
exit
end if
end do
valid = valid .and. lexer%chunk(pos:pos) == '"' .and. pos /= prev
token = toml_token(merge(token_kind%string, token_kind%invalid, valid), prev, pos)
end subroutine next_string
Strings can only contain printable characters; therefore we check each character for validity using the small valid_string function.
!> Validate characters in string, non-printable characters are invalid in this context
pure function valid_string(ch) result(valid)
character(1, tfc), intent(in) :: ch
logical :: valid
character(1, tfc), parameter :: x00 = achar(int(z"00")), x08 = achar(int(z"08")), &
& x0b = achar(int(z"0b")), x1f = achar(int(z"1f")), x7f = achar(int(z"7f"))
valid = &
& .not.(x00 <= ch .and. ch <= x08) .and. &
& .not.(x0b <= ch .and. ch <= x1f) .and. &
& ch /= x7f
end function valid_string
We also need to identify numbers, which map to either integers or floats in TOML; this is done in next_number.
!> Process next number token, can produce either integer or floats
subroutine next_number(lexer, token)
!> Instance of the lexer
type(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
integer :: prev, pos, point, expo
logical :: minus, zero, first
character(1, tfc) :: ch
integer, parameter :: offset(*) = [0, 1, 2]
prev = lexer%pos
pos = lexer%pos
token = toml_token(token_kind%invalid, prev, pos)
point = 0
expo = 0
zero = .false.
first = .true.
minus = lexer%chunk(pos:pos) == "-"
if (minus) pos = pos + 1
do while(pos <= len(lexer%chunk))
ch = lexer%chunk(pos:pos)
if (ch == ".") then
if (point > 0 .or. expo > 0) return
zero = .false.
point = pos
pos = pos + 1
cycle
end if
if (ch == "e" .or. ch == "E") then
if (expo > 0) return
zero = .false.
expo = pos
pos = pos + 1
cycle
end if
if (ch == "+" .or. ch == "-") then
if (.not.any(lexer%chunk(pos-1:pos-1) == ["e", "E"])) return
pos = pos + 1
cycle
end if
if (verify(ch, "0123456789") == 0) then
if (zero) return
zero = first .and. ch == "0"
first = .false.
pos = pos + 1
cycle
end if
exit
end do
if (any([expo, point] == pos-1)) return
token = toml_token(merge(token_kind%float, token_kind%int, any([expo, point] > 0)), &
& prev, pos-1)
end subroutine next_number
To support boolean values we implement a next_boolean procedure.
!> Process next boolean token
subroutine next_boolean(lexer, token)
!> Instance of the lexer
type(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
integer :: pos, prev
prev = lexer%pos
pos = lexer%pos
do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
pos = pos + 1
end do
select case(lexer%chunk(prev:pos))
case default
token = toml_token(token_kind%invalid, prev, pos)
case("true", "false")
token = toml_token(token_kind%bool, prev, pos)
end select
end subroutine next_boolean
Finally, we also want to parse null values using the next_null procedure.
!> Process next null token
subroutine next_null(lexer, token)
!> Instance of the lexer
type(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
integer :: pos, prev
prev = lexer%pos
pos = lexer%pos
do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
pos = pos + 1
end do
token = toml_token( &
& merge(token_kind%nil, token_kind%invalid, lexer%chunk(prev:pos) == "null"), &
& prev, pos)
end subroutine next_null
With this logic available we can now generate all required tokens for parsing JSON.
Tip
Moving most of the validation logic into the tokenization simplifies the actual extraction of the values, as we have to deal with fewer edge cases.
Now we can wrap up the next procedure. Instead of directly returning the token, we make some adjustments to the token stream here. In general, this is the right place to buffer tokens, perform overflow checks, or detect unclosed groups; we will only use it to insert two additional tokens that inject a top-level key.
!> Advance to the next token in the lexer
subroutine next(lexer, token)
!> Instance of the lexer
class(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
type(toml_token), parameter :: prelude(2) = &
[toml_token(token_kind%equal, 0, 0), toml_token(token_kind%keypath, 1, 0)]
if (lexer%prelude > 0) then
token = prelude(lexer%prelude)
lexer%prelude = lexer%prelude - 1
return
end if
call next_token(lexer, token)
end subroutine next
This directs the parser to leave the root document scope, where newlines are semantically relevant, since we cannot produce such newline tokens in our JSON lexer.
Exercise
The nil token will make the parser skip the respective value.
If we want to support null values, how would we have to modify our lexer to instead produce, for example, an empty table {}, i.e. an lbrace and an rbrace token?
Extracting values
Before we can connect our lexer to the existing TOML parser we have to implement the extraction of the values. The parser itself will use the extract member functions to obtain values for the respective tokens and never directly access the character stream.
To extract the string value we implement the extract_string procedure. We also use the extract_string routine to catch the keypath token we inserted into the token stream and return the expected dummy value.
!> Extract string value of token
subroutine extract_string(lexer, token, string)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract string value from
type(toml_token), intent(in) :: token
!> String value of token
character(len=:), allocatable, intent(out) :: string
integer :: it, length
logical :: escape
character(1, tfc) :: ch
length = token%last - token%first + 1
select case(token%kind)
case(token_kind%keypath) ! dummy token inserted by lexer prelude
string = "_"
case(token_kind%string)
string = ""
escape = .false.
do it = token%first + 1, token%last - 1
ch = lexer%chunk(it:it)
if (escape) then
escape = .false.
select case(ch)
case(toml_escape%dquote, toml_escape%backslash); string = string // ch
case("b"); string = string // toml_escape%bspace
case("t"); string = string // toml_escape%tabulator
case("n"); string = string // toml_escape%newline
case("r"); string = string // toml_escape%carriage_return
case("f"); string = string // toml_escape%formfeed
end select
cycle
end if
escape = ch == toml_escape%backslash
if (.not.escape) string = string // ch
end do
end select
end subroutine extract_string
Similarly, we implement extract_integer; instead of using an internal read, we implement the conversion ourselves. Note that the digits are accumulated in the negative range, so that the most negative integer can be represented without overflow, and the sign is applied at the end.
!> Extract integer value of token
subroutine extract_integer(lexer, token, val)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract integer value from
type(toml_token), intent(in) :: token
!> Integer value of token
integer(tfi), intent(out) :: val
integer :: first, it, tmp
character(1, tfc) :: ch
character(*, tfc), parameter :: num = "0123456789"
if (token%kind /= token_kind%int) return
val = 0
first = token%first
if (lexer%chunk(first:first) == "-") first = first + 1
if (lexer%chunk(first:first) == "0") return
do it = first, token%last
ch = lexer%chunk(it:it)
tmp = scan(num, ch) - 1
if (tmp < 0) cycle
val = val * 10 - tmp
end do
if (lexer%chunk(token%first:token%first) /= "-") val = -val
end subroutine extract_integer
For floating point numbers, implemented in extract_float, we just use an internal read.
!> Extract floating point value of token
subroutine extract_float(lexer, token, val)
use, intrinsic :: ieee_arithmetic, only : ieee_value, &
& ieee_positive_inf, ieee_negative_inf, ieee_quiet_nan
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract floating point value from
type(toml_token), intent(in) :: token
!> Floating point value of token
real(tfr), intent(out) :: val
integer :: stat
if (token%kind /= token_kind%float) return
read(lexer%chunk(token%first:token%last), *, iostat=stat) val
end subroutine extract_float
The last tokens we can produce and extract from our lexer are boolean values, which we handle in extract_bool.
!> Extract boolean value of token
subroutine extract_bool(lexer, token, val)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract boolean value from
type(toml_token), intent(in) :: token
!> Boolean value of token
logical, intent(out) :: val
if (token%kind /= token_kind%bool) return
val = lexer%chunk(token%first:token%last) == "true"
end subroutine extract_bool
We create a stub routine for extract_datetime, since we cannot produce this token from JSON.
!> Extract datetime value of token
subroutine extract_datetime(lexer, token, val)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract datetime value from
type(toml_token), intent(in) :: token
!> Datetime value of token
type(toml_datetime), intent(out) :: val
associate(lexer => lexer, token => token) ! ignore unused dummy arguments
end associate
end subroutine extract_datetime
This provides our lexer with full functionality regarding the extraction of values needed for parsing and creating data structures.
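Before wiring everything up, we can already drive the lexer by hand. The following sketch is purely illustrative (it assumes the module name tjson_lexer and the public toml_token and token_kind entities from the full source below) and prints every string value found in a small JSON snippet:
program inspect_tokens
   use tjson_lexer, only : json_lexer, new_lexer_from_string, toml_token, token_kind
   implicit none
   type(json_lexer) :: lexer
   type(toml_token) :: token
   character(len=:), allocatable :: val

   call new_lexer_from_string(lexer, '{"a": [1, true]}')
   do
      ! The first two tokens are the key path prelude injected by the next method
      call lexer%next(token)
      if (token%kind == token_kind%eof) exit
      if (token%kind /= token_kind%string) cycle
      call lexer%extract_string(token, val)
      print '(a)', "string token: "//val
   end do
end program inspect_tokens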
full source
For completeness, here is again the full source of our lexer implementation.
module tjson_lexer
use tomlf_constants, only : tfc, tfi, tfr, toml_escape
use tomlf_datetime, only : toml_datetime
use tomlf_de_abc, only : abstract_lexer
use tomlf_de_token, only : toml_token, token_kind
use tomlf_error, only : toml_error, make_error
use tomlf_utils, only : read_whole_file, read_whole_line
implicit none
private
public :: json_lexer
public :: new_lexer_from_file, new_lexer_from_unit, new_lexer_from_string
public :: toml_token, token_kind
!> Tokenizer for JSON documents
type, extends(abstract_lexer) :: json_lexer
!> Name of the source file, used for error reporting
character(len=:), allocatable :: filename
!> Current internal position in the source chunk
integer :: pos = 0
!> Current source chunk
character(:, tfc), allocatable :: chunk
!> Additional tokens to insert before the actual token stream
integer :: prelude = 2
contains
!> Obtain the next token
procedure :: next
!> Extract a string from a token
procedure :: extract_string
!> Extract an integer from a token
procedure :: extract_integer
!> Extract a float from a token
procedure :: extract_float
!> Extract a boolean from a token
procedure :: extract_bool
!> Extract a timestamp from a token
procedure :: extract_datetime
!> Get information about source
procedure :: get_info
end type json_lexer
character(*, tfc), parameter :: terminated = " {}[],:"//&
& toml_escape%tabulator//toml_escape%newline//toml_escape%carriage_return
contains
!> Create a new instance of a lexer by reading from a file
subroutine new_lexer_from_file(lexer, filename, error)
!> Instance of the lexer
type(json_lexer), intent(out) :: lexer
!> Name of the file to read from
character(len=*), intent(in) :: filename
!> Error code
type(toml_error), allocatable, intent(out) :: error
integer :: stat
lexer%filename = filename
call read_whole_file(filename, lexer%chunk, stat)
if (stat /= 0) call make_error(error, "Could not open file '"//filename//"'")
end subroutine new_lexer_from_file
!> Create a new instance of a lexer by reading from a unit.
!>
!> Currently, only sequential access units can be processed by this constructor.
subroutine new_lexer_from_unit(lexer, io, error)
!> Instance of the lexer
type(json_lexer), intent(out) :: lexer
!> Unit to read from
integer, intent(in) :: io
!> Error code
type(toml_error), allocatable, intent(out) :: error
character(:, tfc), allocatable :: source, line
integer, parameter :: bufsize = 512
character(bufsize, tfc) :: filename, mode
integer :: stat
inquire(unit=io, access=mode, name=filename)
select case(trim(mode))
case default
stat = 1
case("sequential", "SEQUENTIAL")
allocate(character(0) :: source)
do
call read_whole_line(io, line, stat)
if (stat > 0) exit
source = source // line // toml_escape%newline
if (stat < 0) then
if (is_iostat_end(stat)) stat = 0
exit
end if
end do
call new_lexer_from_string(lexer, source)
end select
if (len_trim(filename) > 0) lexer%filename = trim(filename)
if (stat /= 0) call make_error(error, "Failed to read from unit")
end subroutine new_lexer_from_unit
!> Create a new instance of a lexer by reading from a string.
subroutine new_lexer_from_string(lexer, string)
!> Instance of the lexer
type(json_lexer), intent(out) :: lexer
!> String to read from
character(len=*), intent(in) :: string
lexer%chunk = string
end subroutine new_lexer_from_string
!> Extract information about the source
subroutine get_info(lexer, meta, output)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Query about the source
character(*, tfc), intent(in) :: meta
!> Metadata about the source
character(:, tfc), allocatable, intent(out) :: output
select case(meta)
case("source")
output = lexer%chunk // toml_escape%newline
case("filename")
if (allocated(lexer%filename)) output = lexer%filename
end select
end subroutine get_info
!> Advance to the next token in the lexer
subroutine next(lexer, token)
!> Instance of the lexer
class(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
type(toml_token), parameter :: prelude(2) = &
[toml_token(token_kind%equal, 0, 0), toml_token(token_kind%keypath, 1, 0)]
if (lexer%prelude > 0) then
token = prelude(lexer%prelude)
lexer%prelude = lexer%prelude - 1
return
end if
call next_token(lexer, token)
end subroutine next
!> Actually generate the next token, unbuffered version
subroutine next_token(lexer, token)
!> Instance of the lexer
class(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
integer :: prev, pos
! Consume current token
lexer%pos = lexer%pos + token%last - token%first + 1
prev = lexer%pos
pos = lexer%pos
! If lexer is exhausted, return EOF as early as possible
if (pos > len(lexer%chunk)) then
token = toml_token(token_kind%eof, prev, pos)
return
end if
select case(lexer%chunk(pos:pos))
case(" ", toml_escape%tabulator, toml_escape%newline, toml_escape%carriage_return)
do while(any(lexer%chunk(pos+1:pos+1) == [" ", toml_escape%tabulator, &
& toml_escape%newline, toml_escape%carriage_return]) .and. pos < len(lexer%chunk))
pos = pos + 1
end do
token = toml_token(token_kind%whitespace, prev, pos)
return
case(":")
token = toml_token(token_kind%equal, prev, pos)
return
case("{")
token = toml_token(token_kind%lbrace, prev, pos)
return
case("}")
token = toml_token(token_kind%rbrace, prev, pos)
return
case("[")
token = toml_token(token_kind%lbracket, prev, pos)
return
case("]")
token = toml_token(token_kind%rbracket, prev, pos)
return
case('"')
call next_string(lexer, token)
return
case("-", "0":"9")
call next_number(lexer, token)
if (token%kind /= token_kind%invalid) return
case("t", "f")
call next_boolean(lexer, token)
return
case("n")
call next_null(lexer, token)
return
case(",")
token = toml_token(token_kind%comma, prev, pos)
return
end select
! If the current token is invalid, advance to the next terminator
do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
pos = pos + 1
end do
token = toml_token(token_kind%invalid, prev, pos)
end subroutine next_token
!> Process next string token
subroutine next_string(lexer, token)
!> Instance of the lexer
type(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
character(1, tfc) :: ch
character(*, tfc), parameter :: valid_escape = 'btnfr\"'
integer :: prev, pos
logical :: escape, valid
prev = lexer%pos
pos = lexer%pos
valid = .true.
escape = .false.
do while(pos < len(lexer%chunk))
pos = pos + 1
ch = lexer%chunk(pos:pos)
valid = valid .and. valid_string(ch)
if (escape) then
escape = .false.
valid = valid .and. verify(ch, valid_escape) == 0
cycle
end if
escape = ch == toml_escape%backslash
if (ch == '"') exit
if (ch == toml_escape%newline) then
pos = pos - 1
valid = .false.
exit
end if
end do
valid = valid .and. lexer%chunk(pos:pos) == '"' .and. pos /= prev
token = toml_token(merge(token_kind%string, token_kind%invalid, valid), prev, pos)
end subroutine next_string
!> Process next number token, can produce either integer or floats
subroutine next_number(lexer, token)
!> Instance of the lexer
type(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
integer :: prev, pos, point, expo
logical :: minus, zero, first
character(1, tfc) :: ch
integer, parameter :: offset(*) = [0, 1, 2]
prev = lexer%pos
pos = lexer%pos
token = toml_token(token_kind%invalid, prev, pos)
point = 0
expo = 0
zero = .false.
first = .true.
minus = lexer%chunk(pos:pos) == "-"
if (minus) pos = pos + 1
do while(pos <= len(lexer%chunk))
ch = lexer%chunk(pos:pos)
if (ch == ".") then
if (point > 0 .or. expo > 0) return
zero = .false.
point = pos
pos = pos + 1
cycle
end if
if (ch == "e" .or. ch == "E") then
if (expo > 0) return
zero = .false.
expo = pos
pos = pos + 1
cycle
end if
if (ch == "+" .or. ch == "-") then
if (.not.any(lexer%chunk(pos-1:pos-1) == ["e", "E"])) return
pos = pos + 1
cycle
end if
if (verify(ch, "0123456789") == 0) then
if (zero) return
zero = first .and. ch == "0"
first = .false.
pos = pos + 1
cycle
end if
exit
end do
if (any([expo, point] == pos-1)) return
token = toml_token(merge(token_kind%float, token_kind%int, any([expo, point] > 0)), &
& prev, pos-1)
end subroutine next_number
!> Process next boolean token
subroutine next_boolean(lexer, token)
!> Instance of the lexer
type(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
integer :: pos, prev
prev = lexer%pos
pos = lexer%pos
do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
pos = pos + 1
end do
select case(lexer%chunk(prev:pos))
case default
token = toml_token(token_kind%invalid, prev, pos)
case("true", "false")
token = toml_token(token_kind%bool, prev, pos)
end select
end subroutine next_boolean
!> Process next null token
subroutine next_null(lexer, token)
!> Instance of the lexer
type(json_lexer), intent(inout) :: lexer
!> Current token
type(toml_token), intent(inout) :: token
integer :: pos, prev
prev = lexer%pos
pos = lexer%pos
do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
pos = pos + 1
end do
token = toml_token( &
& merge(token_kind%nil, token_kind%invalid, lexer%chunk(prev:pos) == "null"), &
& prev, pos)
end subroutine next_null
!> Validate characters in string, non-printable characters are invalid in this context
pure function valid_string(ch) result(valid)
character(1, tfc), intent(in) :: ch
logical :: valid
character(1, tfc), parameter :: x00 = achar(int(z"00")), x08 = achar(int(z"08")), &
& x0b = achar(int(z"0b")), x1f = achar(int(z"1f")), x7f = achar(int(z"7f"))
valid = &
& .not.(x00 <= ch .and. ch <= x08) .and. &
& .not.(x0b <= ch .and. ch <= x1f) .and. &
& ch /= x7f
end function valid_string
!> Extract string value of token
subroutine extract_string(lexer, token, string)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract string value from
type(toml_token), intent(in) :: token
!> String value of token
character(len=:), allocatable, intent(out) :: string
integer :: it, length
logical :: escape
character(1, tfc) :: ch
length = token%last - token%first + 1
select case(token%kind)
case(token_kind%keypath) ! dummy token inserted by lexer prelude
string = "_"
case(token_kind%string)
string = ""
escape = .false.
do it = token%first + 1, token%last - 1
ch = lexer%chunk(it:it)
if (escape) then
escape = .false.
select case(ch)
case(toml_escape%dquote, toml_escape%backslash); string = string // ch
case("b"); string = string // toml_escape%bspace
case("t"); string = string // toml_escape%tabulator
case("n"); string = string // toml_escape%newline
case("r"); string = string // toml_escape%carriage_return
case("f"); string = string // toml_escape%formfeed
end select
cycle
end if
escape = ch == toml_escape%backslash
if (.not.escape) string = string // ch
end do
end select
end subroutine extract_string
!> Extract integer value of token
subroutine extract_integer(lexer, token, val)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract integer value from
type(toml_token), intent(in) :: token
!> Integer value of token
integer(tfi), intent(out) :: val
integer :: first, it, tmp
character(1, tfc) :: ch
character(*, tfc), parameter :: num = "0123456789"
if (token%kind /= token_kind%int) return
val = 0
first = token%first
if (lexer%chunk(first:first) == "-") first = first + 1
if (lexer%chunk(first:first) == "0") return
do it = first, token%last
ch = lexer%chunk(it:it)
tmp = scan(num, ch) - 1
if (tmp < 0) cycle
val = val * 10 - tmp
end do
if (lexer%chunk(token%first:token%first) /= "-") val = -val
end subroutine extract_integer
!> Extract floating point value of token
subroutine extract_float(lexer, token, val)
use, intrinsic :: ieee_arithmetic, only : ieee_value, &
& ieee_positive_inf, ieee_negative_inf, ieee_quiet_nan
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract floating point value from
type(toml_token), intent(in) :: token
!> Floating point value of token
real(tfr), intent(out) :: val
integer :: stat
if (token%kind /= token_kind%float) return
read(lexer%chunk(token%first:token%last), *, iostat=stat) val
end subroutine extract_float
!> Extract boolean value of token
subroutine extract_bool(lexer, token, val)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract boolean value from
type(toml_token), intent(in) :: token
!> Boolean value of token
logical, intent(out) :: val
if (token%kind /= token_kind%bool) return
val = lexer%chunk(token%first:token%last) == "true"
end subroutine extract_bool
!> Extract datetime value of token
subroutine extract_datetime(lexer, token, val)
!> Instance of the lexer
class(json_lexer), intent(in) :: lexer
!> Token to extract datetime value from
type(toml_token), intent(in) :: token
!> Datetime value of token
type(toml_datetime), intent(out) :: val
associate(lexer => lexer, token => token) ! ignore unused dummy arguments
end associate
end subroutine extract_datetime
end module tjson_lexer
Verifying the lexer
We could jump right into connecting our lexer with the parser, but we have not yet verified that the tokenization and value extraction work as expected. For this purpose, we will create some unit tests using the test-drive framework.
As the entry point for our tester, we will use the standard wrapper for launching test suites.
tester program
Taken from the test-drive README
!> Wrapper for the testsuites
program tester
use, intrinsic :: iso_fortran_env, only : error_unit
use testdrive, only : run_testsuite, new_testsuite, testsuite_type, &
& select_suite, run_selected, get_argument
use test_lexer, only : collect_lexer
implicit none
integer :: stat, is
character(len=:), allocatable :: suite_name, test_name
type(testsuite_type), allocatable :: testsuites(:)
character(len=*), parameter :: fmt = '("#", *(1x, a))'
stat = 0
testsuites = [ &
& new_testsuite("lexer", collect_lexer) &
& ]
call get_argument(1, suite_name)
call get_argument(2, test_name)
if (allocated(suite_name)) then
is = select_suite(testsuites, suite_name)
if (is > 0 .and. is <= size(testsuites)) then
if (allocated(test_name)) then
write(error_unit, fmt) "Suite:", testsuites(is)%name
call run_selected(testsuites(is)%collect, test_name, error_unit, stat)
if (stat < 0) then
error stop 1
end if
else
write(error_unit, fmt) "Testing:", testsuites(is)%name
call run_testsuite(testsuites(is)%collect, error_unit, stat)
end if
else
write(error_unit, fmt) "Available testsuites"
do is = 1, size(testsuites)
write(error_unit, fmt) "-", testsuites(is)%name
end do
error stop 1
end if
else
do is = 1, size(testsuites)
write(error_unit, fmt) "Testing:", testsuites(is)%name
call run_testsuite(testsuites(is)%collect, error_unit, stat)
end do
end if
if (stat > 0) then
write(error_unit, '(i0, 1x, a)') stat, "test(s) failed!"
error stop 1
end if
end program tester
Our actual test suite for the lexer is based on a routine called check_token, which creates a new lexer from a string and retrieves all tokens while comparing them with a reference token stream. We can then implement our checks by providing a string and a list of tokens to see whether our lexer identifies the expected tokens correctly. For visualization, we use the tomlf_diagnostic module to label the tokens in the actual source string.
module test_lexer
use testdrive
use tjson_lexer
use tomlf_constants, only : tfi, tfr, nl => TOML_NEWLINE
implicit none
public :: collect_lexer
contains
!> Collect all exported unit tests
subroutine collect_lexer(testsuite)
!> Collection of tests
type(unittest_type), allocatable, intent(out) :: testsuite(:)
testsuite = [ &
& new_unittest("bool-true", bool_true), &
& new_unittest("bool-falsey", bool_falsey), &
& new_unittest("empty", empty), &
& new_unittest("float-point", float_point), &
& new_unittest("float-double-point", float_double_point), &
& new_unittest("integer-limits", integer_limits), &
& new_unittest("string", string), &
& new_unittest("whitespace-mixed", whitespace_mixed)]
end subroutine collect_lexer
subroutine empty(error)
!> Error handling
type(error_type), allocatable, intent(out) :: error
call check_token(error, "", &
& [token_kind%eof])
end subroutine empty
subroutine bool_falsey(error)
!> Error handling
type(error_type), allocatable, intent(out) :: error
call check_token(error, "falsey", &
& [token_kind%invalid, token_kind%eof])
end subroutine bool_falsey
subroutine bool_true(error)
!> Error handling
type(error_type), allocatable, intent(out) :: error
call check_token(error, "true", &
& [token_kind%bool, token_kind%eof])
end subroutine bool_true
subroutine float_point(error)
!> Error handling
type(error_type), allocatable, intent(out) :: error
call check_token(error, "3.14,+3.14,-3.14,0.123", &
& [token_kind%float, token_kind%comma, token_kind%invalid, token_kind%comma, &
& token_kind%float, token_kind%comma, token_kind%float, token_kind%eof])
end subroutine float_point
subroutine float_double_point(error)
!> Error handling
type(error_type), allocatable, intent(out) :: error
call check_token(error, "0..1,0.1.2", &
& [token_kind%invalid, token_kind%comma, token_kind%invalid, token_kind%eof])
end subroutine float_double_point
subroutine integer_limits(error)
!> Error handling
type(error_type), allocatable, intent(out) :: error
call check_token(error, "9223372036854775807,-9223372036854775808", &
& [token_kind%int, token_kind%comma, token_kind%int, token_kind%eof])
end subroutine integer_limits
subroutine string(error)
!> Error handling
type(error_type), allocatable, intent(out) :: error
call check_token(error, """something"",""anything""", &
& [token_kind%string, token_kind%comma, token_kind%string, token_kind%eof])
end subroutine string
subroutine whitespace_mixed(error)
!> Error handling
type(error_type), allocatable, intent(out) :: error
call check_token(error, achar(9)//" "//achar(9), &
& [token_kind%whitespace, token_kind%eof])
end subroutine whitespace_mixed
subroutine check_token(error, string, expected)
use tomlf_diagnostic, only : render, toml_label, toml_level
use tomlf_de_token, only : stringify
use tomlf_terminal, only : toml_terminal
!> Error handling
type(error_type), allocatable, intent(out) :: error
!> String to be parsed
character(len=*), intent(in) :: string
!> Expected token kind
integer, intent(in) :: expected(:)
integer :: it
logical :: okay
character(len=:), allocatable :: msg
type(json_lexer) :: lexer
type(toml_token) :: token
type(toml_label), allocatable :: label(:)
call new_lexer_from_string(lexer, string)
lexer%prelude = 0
allocate(label(0))
do it = 1, size(expected)
call lexer%next(token)
okay = token%kind == expected(it)
label = [label, toml_label(merge(toml_level%info, toml_level%error, okay), &
& token%first, token%last, stringify(token), .not.okay)]
msg = render(string//nl, [label(size(label))], toml_terminal(.true.))
call check(error, token%kind, expected(it), &
& "Expected '"//stringify(toml_token(expected(it)))// &
& "' but got '"//stringify(token)//"'"//nl//msg)
if (allocated(error)) exit
end do
if (.not.allocated(error)) then
msg = render(string//nl, label, toml_terminal(.true.))
print '(a)', msg
end if
end subroutine check_token
end module test_lexer
These are only a couple of tests; there are many more cases to consider for a robust lexer.
Exercise
Write at least ten more tests for edge cases in the lexer. Make sure to include invalid cases and ensure that even invalid tokens are generated correctly.
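As a starting point, a check for an unterminated string could look like the following sketch. The test routine and its name are hypothetical; it would also need to be registered in collect_lexer, for example as new_unittest("string-unclosed", string_unclosed).
subroutine string_unclosed(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error
   ! A string without a closing quote must be reported as a single invalid token
   call check_token(error, '"unterminated', &
      & [token_kind%invalid, token_kind%eof])
end subroutine string_unclosed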
Connecting to the parser
Now that we have verified the tokenization process in our lexer we can connect our custom lexer to the default TOML parser.
For this purpose, we define convenience interfaces called json_load / json_loads, similar to the available toml_load / toml_loads interfaces. Unlike the TOML-related load interfaces, we will also use them to implement the necessary post-processing steps for the data structure.
module tjson_parser
use tomlf_constants, only : tfc, tfi, tfr, toml_type
use tomlf_datetime, only : toml_datetime
use tomlf_de_context, only : toml_context
use tjson_lexer, only : json_lexer, new_lexer_from_string, new_lexer_from_unit, &
& new_lexer_from_file
use tomlf_de_parser, only : parse, toml_parser_config
use tomlf_diagnostic, only : toml_level
use tomlf_build, only : get_value
use tomlf_error, only : toml_error
use tomlf_type, only : toml_table, toml_value, cast_to_table, &
& toml_visitor, toml_array, toml_keyval, toml_key, len
implicit none
private
public :: json_load, json_loads
public :: toml_context, toml_parser_config, toml_level
!> Load a TOML data structure from the provided source
interface json_load
module procedure :: json_load_file
module procedure :: json_load_unit
end interface json_load
!> Load a TOML data structure from a string
interface json_loads
module procedure :: json_load_string
end interface json_loads
The json_load interface is implemented by json_load_file and json_load_unit. The former is a wrapper using the new_lexer_from_file constructor.
!> Load TOML data structure from file
subroutine json_load_file(object, filename, config, context, error)
!> Instance of the TOML data structure, not allocated in case of error
class(toml_value), allocatable, intent(out) :: object
!> Name of the file to load
character(*, tfc), intent(in) :: filename
!> Configuration for the parser
type(toml_parser_config), intent(in), optional :: config
!> Context tracking the origin of the data structure to allow rich reports
type(toml_context), intent(out), optional :: context
!> Error handling, provides detailed diagnostic in case of error
type(toml_error), allocatable, intent(out), optional :: error
type(json_lexer) :: lexer
type(toml_error), allocatable :: error_
type(toml_table), allocatable :: table
call new_lexer_from_file(lexer, filename, error_)
if (.not.allocated(error_)) then
call parse(lexer, table, config, context, error)
if (allocated(table)) call prune(object, table)
else
if (present(error)) call move_alloc(error_, error)
end if
end subroutine json_load_file
The latter wraps the new_lexer_from_unit constructor.
!> Load TOML data structure from unit
subroutine json_load_unit(object, io, config, context, error)
!> Instance of the TOML data structure, not allocated in case of error
class(toml_value), allocatable, intent(out) :: object
!> Unit to read from
integer, intent(in) :: io
!> Configuration for the parser
type(toml_parser_config), intent(in), optional :: config
!> Context tracking the origin of the data structure to allow rich reports
type(toml_context), intent(out), optional :: context
!> Error handling, provides detailed diagnostic in case of error
type(toml_error), allocatable, intent(out), optional :: error
type(json_lexer) :: lexer
type(toml_error), allocatable :: error_
type(toml_table), allocatable :: table
call new_lexer_from_unit(lexer, io, error_)
if (.not.allocated(error_)) then
call parse(lexer, table, config, context, error)
if (allocated(table)) call prune(object, table)
else
if (present(error)) call move_alloc(error_, error)
end if
end subroutine json_load_unit
Finally, we also provide json_loads by implementing json_load_string using our new_lexer_from_string constructor.
!> Load TOML data structure from string
subroutine json_load_string(object, string, config, context, error)
!> Instance of the TOML data structure, not allocated in case of error
class(toml_value), allocatable, intent(out) :: object
!> String containing TOML document
character(*, tfc), intent(in) :: string
!> Configuration for the parser
type(toml_parser_config), intent(in), optional :: config
!> Context tracking the origin of the data structure to allow rich reports
type(toml_context), intent(out), optional :: context
!> Error handling, provides detailed diagnostic in case of error
type(toml_error), allocatable, intent(out), optional :: error
type(json_lexer) :: lexer
type(toml_table), allocatable :: table
call new_lexer_from_string(lexer, string)
call parse(lexer, table, config, context, error)
if (allocated(table)) call prune(object, table)
end subroutine json_load_string
These wrappers are so far very straightforward: they first set up a lexer instance and then invoke the parse procedure, which constructs the actual parser instance and processes the token stream. After a successful run, the table instance will be allocated; for the post-processing, we invoke the prune routine.
subroutine prune(object, table)
!> Instance of the TOML data structure, not allocated in case of error
class(toml_value), allocatable, intent(inout) :: object
!> Instance of the TOML data structure, not allocated in case of error
type(toml_table), allocatable, intent(inout) :: table
call table%pop("_", object)
end subroutine prune
Here we effectively remove the first child from the root table and return it as a polymorphic toml_value. This has the advantage that we can support arrays and values at the root level with our JSON loader.
Tip
The user can dispatch the value using a select type construct or by creating a view using the cast_to_table / cast_to_array / cast_to_keyval functions.
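A minimal usage sketch, assuming the tjson_parser module from this section, could dispatch on the type of the root value like this:
program demo_json_load
   use tjson_parser, only : json_loads
   use tomlf_type, only : toml_value, toml_table, toml_array
   implicit none
   class(toml_value), allocatable :: object

   call json_loads(object, '{"key": "value", "numbers": [1, 2, 3]}')
   if (.not.allocated(object)) error stop "parsing failed"

   ! Dispatch on the dynamic type of the pruned root value
   select type(object)
   class is(toml_table)
      print '(a)', "root value is a table"
   class is(toml_array)
      print '(a)', "root value is an array"
   class default
      print '(a)', "root value is a single value"
   end select
end program demo_json_load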
full source
For completeness, here is again the full source of our parser implementation.
Note that this full version also contains a toml_visitor implementation to prune the type annotations used in the validation test suite to represent TOML values.
module tjson_parser
use tomlf_constants, only : tfc, tfi, tfr, toml_type
use tomlf_datetime, only : toml_datetime
use tomlf_de_context, only : toml_context
use tjson_lexer, only : json_lexer, new_lexer_from_string, new_lexer_from_unit, &
& new_lexer_from_file
use tomlf_de_parser, only : parse, toml_parser_config
use tomlf_diagnostic, only : toml_level
use tomlf_build, only : get_value
use tomlf_error, only : toml_error
use tomlf_type, only : toml_table, toml_value, cast_to_table, &
& toml_visitor, toml_array, toml_keyval, toml_key, len
implicit none
private
public :: json_load, json_loads
public :: toml_context, toml_parser_config, toml_level
!> Load a TOML data structure from the provided source
interface json_load
module procedure :: json_load_file
module procedure :: json_load_unit
end interface json_load
!> Load a TOML data structure from a string
interface json_loads
module procedure :: json_load_string
end interface json_loads
!> Implement pruning of annotated values as visitor
type, extends(toml_visitor) :: json_prune
contains
!> Traverse the AST and prune all annotated values
procedure :: visit
end type json_prune
contains
!> Load TOML data structure from file
subroutine json_load_file(object, filename, config, context, error)
!> Instance of the TOML data structure, not allocated in case of error
class(toml_value), allocatable, intent(out) :: object
!> Name of the file to load
character(*, tfc), intent(in) :: filename
!> Configuration for the parser
type(toml_parser_config), intent(in), optional :: config
!> Context tracking the origin of the data structure to allow rich reports
type(toml_context), intent(out), optional :: context
!> Error handling, provides detailed diagnostic in case of error
type(toml_error), allocatable, intent(out), optional :: error
type(json_lexer) :: lexer
type(toml_error), allocatable :: error_
type(toml_table), allocatable :: table
call new_lexer_from_file(lexer, filename, error_)
if (.not.allocated(error_)) then
call parse(lexer, table, config, context, error)
if (allocated(table)) call prune(object, table)
else
if (present(error)) call move_alloc(error_, error)
end if
end subroutine json_load_file
!> Load TOML data structure from unit
subroutine json_load_unit(object, io, config, context, error)
!> Instance of the TOML data structure, not allocated in case of error
class(toml_value), allocatable, intent(out) :: object
!> Unit to read from
integer, intent(in) :: io
!> Configuration for the parser
type(toml_parser_config), intent(in), optional :: config
!> Context tracking the origin of the data structure to allow rich reports
type(toml_context), intent(out), optional :: context
!> Error handling, provides detailed diagnostic in case of error
type(toml_error), allocatable, intent(out), optional :: error
type(json_lexer) :: lexer
type(toml_error), allocatable :: error_
type(toml_table), allocatable :: table
call new_lexer_from_unit(lexer, io, error_)
if (.not.allocated(error_)) then
call parse(lexer, table, config, context, error)
if (allocated(table)) call prune(object, table)
else
if (present(error)) call move_alloc(error_, error)
end if
end subroutine json_load_unit
!> Load TOML data structure from string
subroutine json_load_string(object, string, config, context, error)
!> Instance of the TOML data structure, not allocated in case of error
class(toml_value), allocatable, intent(out) :: object
!> String containing TOML document
character(*, tfc), intent(in) :: string
!> Configuration for the parser
type(toml_parser_config), intent(in), optional :: config
!> Context tracking the origin of the data structure to allow rich reports
type(toml_context), intent(out), optional :: context
!> Error handling, provides detailed diagnostic in case of error
type(toml_error), allocatable, intent(out), optional :: error
type(json_lexer) :: lexer
type(toml_table), allocatable :: table
call new_lexer_from_string(lexer, string)
call parse(lexer, table, config, context, error)
if (allocated(table)) call prune(object, table)
end subroutine json_load_string
!> Prune the artificial root table inserted by the lexer
subroutine prune(object, table)
!> Instance of the TOML data structure, not allocated in case of error
class(toml_value), allocatable, intent(inout) :: object
!> Instance of the TOML data structure, not allocated in case of error
type(toml_table), allocatable, intent(inout) :: table
type(json_prune) :: pruner
call table%pop("_", object)
if (allocated(object)) call object%accept(pruner)
end subroutine prune
!> Visit a TOML value
subroutine visit(self, val)
!> Instance of the JSON pruner
class(json_prune), intent(inout) :: self
!> TOML value to visit
class(toml_value), intent(inout) :: val
select type(val)
class is(toml_array)
call visit_array(self, val)
class is(toml_table)
call visit_table(self, val)
end select
end subroutine visit
!> Visit a TOML array
subroutine visit_array(visitor, array)
!> Instance of the JSON pruner
class(json_prune), intent(inout) :: visitor
!> TOML value to visit
type(toml_array), intent(inout) :: array
class(toml_value), allocatable :: val, tmp
character(kind=tfc, len=:), allocatable :: str
type(toml_key), allocatable :: vt(:)
integer :: i, n, stat
n = len(array)
do i = 1, n
call array%shift(val)
select type(val)
class default
call val%accept(visitor)
class is(toml_table)
call val%get_keys(vt)
if (val%has_key("type") .and. val%has_key("value") .and. size(vt)==2) then
call get_value(val, "type", str)
call prune_value(tmp, val, str)
call val%destroy
call tmp%accept(visitor)
call array%push_back(tmp, stat)
cycle
else
call val%accept(visitor)
end if
end select
call array%push_back(val, stat)
end do
end subroutine visit_array
!> Visit a TOML table
subroutine visit_table(visitor, table)
!> Instance of the JSON pruner
class(json_prune), intent(inout) :: visitor
!> TOML table to visit
type(toml_table), intent(inout) :: table
class(toml_value), pointer :: ptr
class(toml_value), allocatable :: val
character(kind=tfc, len=:), allocatable :: str
type(toml_key), allocatable :: list(:), vt(:)
integer :: i, n, stat
call table%get_keys(list)
n = size(list, 1)
do i = 1, n
call table%get(list(i)%key, ptr)
select type(ptr)
class default
call ptr%accept(visitor)
class is(toml_table)
call ptr%get_keys(vt)
if (ptr%has_key("type") .and. ptr%has_key("value") .and. size(vt)==2) then
call get_value(ptr, "type", str)
call prune_value(val, ptr, str)
call val%accept(visitor)
call table%delete(list(i)%key)
call table%push_back(val, stat)
else
call ptr%accept(visitor)
end if
end select
end do
end subroutine visit_table
subroutine prune_value(val, table, str)
!> Actual TOML value
class(toml_value), allocatable, intent(out) :: val
!> TOML table to prune
type(toml_table), intent(inout) :: table
!> Value kind
character(kind=tfc, len=*), intent(in) :: str
class(toml_value), pointer :: ptr
character(:, tfc), pointer :: sval
integer :: stat
type(toml_datetime) :: dval
integer(tfi) :: ival
real(tfr) :: fval
call table%get("value", ptr)
allocate(val, source=ptr)
if (allocated(table%key)) then
val%key = table%key
else
deallocate(val%key)
end if
select type(val)
class is(toml_keyval)
call val%get(sval)
select case(str)
case("date", "time", "datetime", "date-local", "time-local", "datetime-local")
dval = toml_datetime(sval)
call val%set(dval)
case("bool")
call val%set(sval == "true")
case("integer")
read(sval, *, iostat=stat) ival
if (stat == 0) then
call val%set(ival)
end if
case("float")
read(sval, *, iostat=stat) fval
if (stat == 0) then
call val%set(fval)
end if
end select
end select
end subroutine prune_value
end module tjson_parser
Summary
Now we have a working lexer that can tokenize JSON documents into tokens the TOML parser can process. The lexer implemented in TOML Fortran follows a similar construction, with the difference that the TOML grammar is much more complicated to lex than JSON.
Important
In this tutorial, you have learned about the tokenization process used in TOML Fortran. You can now
implement a custom lexer based on the TOML tokens
verify your lexer against an expected token stream
adjust the token stream to direct the parsing process
add a post-processing step to prune the resulting data structure