Writing a custom lexer#

Difficulty: Intermediate

Many programs already come with their own input formats, and switching to a different format requires some way to keep backward compatibility for older inputs. When transitioning to TOML Fortran, the input is read into the provided TOML data structures. If the previous input format is sufficiently compatible, it can be parsed into a matching TOML data structure, allowing a seamless switch to the TOML format going forward while still supporting previously written inputs.

This tutorial is meant to teach how lexing in TOML Fortran works and to enable the reader to implement a custom lexer for their own format. There is no guarantee that every input format can be ported by creating a custom lexer, since the format needs to fulfill some basic requirements, like providing typed values. For this tutorial, we choose JSON as our input format and walk through all the steps to create a new lexer from scratch.

Note

The choice of JSON for this tutorial is not a coincidence. TOML Fortran does implement this lexer to parse JSON files into TOML data structures to support the encoding tests in the validation suite of BurntSushi/toml-test.

Important

This tutorial makes partial use of the internal API of TOML Fortran.

Identifying limitations#

Before we start to implement our custom lexer, we need to identify any limitations of the TOML data structures in representing our custom format. TOML documents always have a table at the document root; there is no way to represent a JSON array or a single value at the root in TOML. Furthermore, JSON supports the value type null, which is not representable in TOML. We have two choices here: either we flag null values as invalid tokens, or we replace them in the lexer with something else, like an empty table. Finally, there are other details to take into account, like how JSON handles duplicate keys; for most of these implementation-dependent cases we will follow the rules TOML provides.

This tutorial by no means aims to offer a fully compliant parser, as we already fail for top-level arrays and null values. For a custom format, this might be even more challenging, especially if the format is defined by only a single implementation.

Note

Writing a compliant JSON parser can quickly become quite challenging (see Parsing JSON is a Minefield).

Format limitations can go both ways, of course: there are also features in TOML we cannot express in JSON. However, since we want to map JSON to TOML and not the other way round, we do not have to worry about these limitations of JSON. Every feature available in TOML that is not representable in the previous input format is an incentive to switch to the new format.

Note

For the actual application of the JSON parser in the validation suite, this problem is solved by using only strings to represent values and adding type annotations. In TOML Fortran these annotations are mapped back by pruning the parsed JSON data structure. The pruning is done via a visitor, which is accepted after the data structure has been completely parsed.

Creating the lexer#

First, we create a new subclass of the abstract base class (ABC) imported from the tomlf_de_abc module.

src/json_lexer.f90 (json_lexer)#
   use tomlf_constants, only : tfc, tfi, tfr, toml_escape
   use tomlf_datetime, only : toml_datetime
   use tomlf_de_abc, only : abstract_lexer
   use tomlf_de_token, only : toml_token, token_kind
   use tomlf_error, only : toml_error, make_error
   use tomlf_utils, only : read_whole_file, read_whole_line
   implicit none
   private

   public :: json_lexer
   public :: new_lexer_from_file, new_lexer_from_unit, new_lexer_from_string

   !> Tokenizer for JSON documents
   type, extends(abstract_lexer) :: json_lexer
      !> Name of the source file, used for error reporting
      character(len=:), allocatable :: filename
      !> Current internal position in the source chunk
      integer :: pos = 0
      !> Current source chunk
      character(:, tfc), allocatable :: chunk
      !> Additional tokens to insert before the actual token stream
      integer :: prelude = 2
   contains
      !> Obtain the next token
      procedure :: next
      !> Extract a string from a token
      procedure :: extract_string
      !> Extract an integer from a token
      procedure :: extract_integer
      !> Extract a float from a token
      procedure :: extract_float
      !> Extract a boolean from a token
      procedure :: extract_bool
      !> Extract a timestamp from a token
      procedure :: extract_datetime
      !> Get information about source
      procedure :: get_info
   end type json_lexer

Next, we add a constructor that consumes an input file and turns it into a string for the lexer to advance through.

src/json_lexer.f90 (new_lexer_from_file)#
!> Create a new instance of a lexer by reading from a file
subroutine new_lexer_from_file(lexer, filename, error)
   !> Instance of the lexer
   type(json_lexer), intent(out) :: lexer
   !> Name of the file to read from
   character(len=*), intent(in) :: filename
   !> Error code
   type(toml_error), allocatable, intent(out) :: error

   integer :: stat

   lexer%filename = filename
   call read_whole_file(filename, lexer%chunk, stat)

   if (stat /= 0) call make_error(error, "Could not open file '"//filename//"'")
end subroutine new_lexer_from_file

Reading from a formatted unit is less efficient than reading the whole file at once, but it is needed in case we are dealing with standard input. We make sure to error out for direct access and stream access units, since we cannot reliably read those.

src/json_lexer.f90 (new_lexer_from_unit)#
!> Create a new instance of a lexer by reading from a unit.
!>
!> Currently, only sequential access units can be processed by this constructor.
subroutine new_lexer_from_unit(lexer, io, error)
   !> Instance of the lexer
   type(json_lexer), intent(out) :: lexer
   !> Unit to read from
   integer, intent(in) :: io
   !> Error code
   type(toml_error), allocatable, intent(out) :: error

   character(:, tfc), allocatable :: source, line
   integer, parameter :: bufsize = 512
   character(bufsize, tfc) :: filename, mode
   integer :: stat

   inquire(unit=io, access=mode, name=filename)
   select case(trim(mode))
   case default
      stat = 1

   case("sequential", "SEQUENTIAL")
      allocate(character(0) :: source)
      do 
         call read_whole_line(io, line, stat)
         if (stat > 0) exit
         source = source // line // toml_escape%newline
         if (stat < 0) then
            if (is_iostat_end(stat)) stat = 0
            exit
         end if
      end do
      call new_lexer_from_string(lexer, source)
   end select
   if (len_trim(filename) > 0) lexer%filename = trim(filename)

   if (stat /= 0) call make_error(error, "Failed to read from unit")
end subroutine new_lexer_from_unit

Finally, we sometimes also need to read from a string, so we add a constructor which creates a lexer from a string input.

src/json_lexer.f90 (new_lexer_from_string)#
!> Create a new instance of a lexer by reading from a string.
subroutine new_lexer_from_string(lexer, string)
   !> Instance of the lexer
   type(json_lexer), intent(out) :: lexer
   !> String to read from
   character(len=*), intent(in) :: string

   lexer%chunk = string
end subroutine new_lexer_from_string
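
With the three constructors in place, creating a lexer instance looks like the following sketch of a hypothetical driver; the file name data.json and the use of the standard input unit are assumptions made for illustration:

program demo_constructors
   use, intrinsic :: iso_fortran_env, only : input_unit
   use tjson_lexer, only : json_lexer, new_lexer_from_file, &
      & new_lexer_from_unit, new_lexer_from_string
   use tomlf_error, only : toml_error
   implicit none
   type(json_lexer) :: lexer
   type(toml_error), allocatable :: error

   ! From a string held in memory
   call new_lexer_from_string(lexer, '{"a": 1}')

   ! From a file on disk (hypothetical file name)
   call new_lexer_from_file(lexer, "data.json", error)
   if (allocated(error)) print '(a)', error%message

   ! From a connected sequential unit, here standard input (reads until end-of-file)
   call new_lexer_from_unit(lexer, input_unit, error)
   if (allocated(error)) print '(a)', error%message
end program demo_constructors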

The parser might need access to some of the internal data of the lexer; this is provided via the get_info procedure.

src/json_lexer.f90 (get_info)#
!> Extract information about the source
subroutine get_info(lexer, meta, output)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Query about the source
   character(*, tfc), intent(in) :: meta
   !> Metadata about the source
   character(:, tfc), allocatable, intent(out) :: output

   select case(meta)
   case("source")
      output = lexer%chunk // toml_escape%newline
   case("filename")
      if (allocated(lexer%filename)) output = lexer%filename
   end select
end subroutine get_info
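
A query as the parser would issue it might look like this minimal sketch (the driver program is hypothetical):

program demo_get_info
   use tjson_lexer, only : json_lexer, new_lexer_from_string
   implicit none
   type(json_lexer) :: lexer
   character(len=:), allocatable :: source

   call new_lexer_from_string(lexer, '{"a": 1}')

   ! The parser issues this query when it needs the source text for a diagnostic
   call lexer%get_info("source", source)
   print '(a)', source
end program demo_get_info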

Identifying tokens#

Now that we can instantiate the lexer, we need to implement the recognition of tokens; this is done with the next method. We start with the actual tokenization step in the next_token procedure, which we will call from the next method.

src/json_lexer.f90 (next_token)#
!> Actually generate the next token, unbuffered version
subroutine next_token(lexer, token)
   !> Instance of the lexer
   class(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

As a first action, we will advance the internal state of the lexer by consuming the last token. For convenience, we save the position in the source string in the pos and prev variables.

src/json_lexer.f90 (next_token, continued)#
   integer :: prev, pos

   ! Consume current token
   lexer%pos = lexer%pos + token%last - token%first + 1
   prev = lexer%pos
   pos = lexer%pos

Next we check whether we have exhausted the input string; if so, we return the end-of-file token.

src/json_lexer.f90 (next_token, continued)#
   ! If lexer is exhausted, return EOF as early as possible
   if (pos > len(lexer%chunk)) then
      token = toml_token(token_kind%eof, prev, pos)
      return
   end if

Now we can inspect the current character of the source string and decide which token it should be labeled as. The character set is quite simple: we have to consider opening and closing brackets and braces (for arrays and tables, respectively), commas, colons, strings, and whitespace. We explicitly produce whitespace tokens here rather than skipping them, since the parser can handle whitespace gracefully. However, we have to keep in mind that newlines have semantic meaning in TOML, while in JSON they are just whitespace; therefore we fold newlines into plain whitespace tokens.

We use a select case statement to decide which token to produce.

src/json_lexer.f90 (next_token, continued)#
   select case(lexer%chunk(pos:pos))
   case(" ", toml_escape%tabulator, toml_escape%newline, toml_escape%carriage_return)
      do while(any(lexer%chunk(pos+1:pos+1) == [" ", toml_escape%tabulator, &
            & toml_escape%newline, toml_escape%carriage_return]) .and. pos < len(lexer%chunk))
         pos = pos + 1
      end do
      token = toml_token(token_kind%whitespace, prev, pos)
      return
   case(":")
      token = toml_token(token_kind%equal, prev, pos)
      return
   case("{")
      token = toml_token(token_kind%lbrace, prev, pos)
      return
   case("}")
      token = toml_token(token_kind%rbrace, prev, pos)
      return
   case("[")
      token = toml_token(token_kind%lbracket, prev, pos)
      return
   case("]")
      token = toml_token(token_kind%rbracket, prev, pos)
      return
   case('"')
      call next_string(lexer, token)
      return
   case("-", "0":"9")
      call next_number(lexer, token)
      if (token%kind /= token_kind%invalid) return
   case("t", "f")
      call next_boolean(lexer, token)
      return
   case("n")
      call next_null(lexer, token)
      return
   case(",")
      token = toml_token(token_kind%comma, prev, pos)
      return
   end select

   ! If the current token is invalid, advance to the next terminator
   do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
      pos = pos + 1
   end do
   token = toml_token(token_kind%invalid, prev, pos)
end subroutine next_token

To wrap up the lexing, we try to classify unknown tokens as well as possible by advancing to the next terminating character. As terminating characters we choose the whitespace characters as well as the structural characters of JSON, and place them as a parameter in the module scope.

src/json_lexer.f90 (terminated)#
   character(*, tfc), parameter :: terminated = " {}[],:"//&
      & toml_escape%tabulator//toml_escape%newline//toml_escape%carriage_return

Note

We are cheating a bit here by labeling the colon as an equal token. This way the TOML parser can apply its usual key-value parsing rules to JSON documents without modification.

One special case to consider is literal values, like strings, numbers, or booleans. To avoid cluttering the logic here, we create separate routines for lexing the respective literal values. For string values we implement next_string. Here we cannot simply advance to the next quote character, since we need to handle escape sequences gracefully. While doing so we can also ensure that the escape sequences found are valid and not malformed.

src/json_lexer.f90 (next_string)#
!> Process next string token
subroutine next_string(lexer, token)
   !> Instance of the lexer
   type(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   character(1, tfc) :: ch
   character(*, tfc), parameter :: valid_escape = 'btnfr\"'
   integer :: prev, pos
   logical :: escape, valid

   prev = lexer%pos
   pos = lexer%pos

   valid = .true.
   escape = .false.

   do while(pos < len(lexer%chunk))
      pos = pos + 1
      ch = lexer%chunk(pos:pos)
      valid = valid .and. valid_string(ch)
      if (escape) then
         escape = .false.
         valid = valid .and. verify(ch, valid_escape) == 0
         cycle
      end if
      escape = ch == toml_escape%backslash
      if (ch == '"') exit
      if (ch == toml_escape%newline) then
         pos = pos - 1
         valid = .false.
         exit
      end if
   end do

   valid = valid .and. lexer%chunk(pos:pos) == '"' .and. pos /= prev
   token = toml_token(merge(token_kind%string, token_kind%invalid, valid), prev, pos)
end subroutine next_string

Strings can only contain printable characters; therefore we check each character with a small valid_string function.

src/json_lexer.f90 (valid_string)#
!> Validate characters in string, non-printable characters are invalid in this context
pure function valid_string(ch) result(valid)
   character(1, tfc), intent(in) :: ch
   logical :: valid

   character(1, tfc), parameter :: x00 = achar(int(z"00")), x08 = achar(int(z"08")), &
      & x0b = achar(int(z"0b")), x1f = achar(int(z"1f")), x7f = achar(int(z"7f"))

   valid = &
      & .not.(x00 <= ch .and. ch <= x08) .and. &
      & .not.(x0b <= ch .and. ch <= x1f) .and. &
      & ch /= x7f
end function valid_string

We also need to identify numbers, which map to either integers or floats in TOML; this is done in next_number.

src/json_lexer.f90 (next_number)#
!> Process next number token, can produce either integer or floats
subroutine next_number(lexer, token)
   !> Instance of the lexer
   type(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   integer :: prev, pos, point, expo
   logical :: minus, zero, first
   character(1, tfc) :: ch
   integer, parameter :: offset(*) = [0, 1, 2]

   prev = lexer%pos
   pos = lexer%pos
   token = toml_token(token_kind%invalid, prev, pos)

   point = 0
   expo = 0
   zero = .false.
   first = .true.
   minus = lexer%chunk(pos:pos) == "-"
   if (minus) pos = pos + 1

   do while(pos <= len(lexer%chunk))
      ch = lexer%chunk(pos:pos)
      if (ch == ".") then
         if (point > 0 .or. expo > 0) return
         zero = .false.
         point = pos
         pos = pos + 1
         cycle
      end if

      if (ch == "e" .or. ch == "E") then
         if (expo > 0) return
         zero = .false.
         expo = pos
         pos = pos + 1
         cycle
      end if

      if (ch == "+" .or. ch == "-") then
         if (.not.any(lexer%chunk(pos-1:pos-1) == ["e", "E"])) return
         pos = pos + 1
         cycle
      end if

      if (verify(ch, "0123456789") == 0) then
         if (zero) return
         zero = first .and. ch == "0"
         first = .false.
         pos = pos + 1
         cycle
      end if

      exit
   end do

   if (any([expo, point] == pos-1)) return
   token = toml_token(merge(token_kind%float, token_kind%int, any([expo, point] > 0)), &
      & prev, pos-1)
end subroutine next_number

To support boolean values we implement a next_boolean procedure.

src/json_lexer.f90 (next_boolean)#
!> Process next boolean token
subroutine next_boolean(lexer, token)
   !> Instance of the lexer
   type(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   integer :: pos, prev

   prev = lexer%pos
   pos = lexer%pos

   do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
      pos = pos + 1
   end do

   select case(lexer%chunk(prev:pos))
   case default
      token = toml_token(token_kind%invalid, prev, pos)
   case("true", "false")
      token = toml_token(token_kind%bool, prev, pos)
   end select
end subroutine next_boolean

Finally, we also want to parse null values using the next_null procedure.

src/json_lexer.f90 (next_null)#
!> Process next null token
subroutine next_null(lexer, token)
   !> Instance of the lexer
   type(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   integer :: pos, prev

   prev = lexer%pos
   pos = lexer%pos

   do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
      pos = pos + 1
   end do

   token = toml_token( &
      & merge(token_kind%nil, token_kind%invalid, lexer%chunk(prev:pos) == "null"), &
      & prev, pos)
end subroutine next_null

With this logic available we can now generate all required tokens for parsing JSON.

Tip

Moving most of the validation logic into the tokenization simplifies the actual extraction of the values, as we have to deal with fewer edge cases.

Now we can wrap up the next procedure. Instead of directly returning the token, we make some adjustments to the token stream here. In general, this is the right place to buffer tokens, perform overflow checks, or detect unclosed groups; we only use it to insert two additional tokens that inject a top-level key.

src/json_lexer.f90 (next)#
!> Advance to the next token in the lexer
subroutine next(lexer, token)
   !> Instance of the lexer
   class(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   type(toml_token), parameter :: prelude(2) = &
      [toml_token(token_kind%equal, 0, 0), toml_token(token_kind%keypath, 1, 0)]

   if (lexer%prelude > 0) then
      token = prelude(lexer%prelude)
      lexer%prelude = lexer%prelude - 1
      return
   end if

   call next_token(lexer, token)
end subroutine next

This prelude directs the parser to immediately leave the root table, where newlines would be semantically relevant, since our JSON lexer cannot produce such newline tokens.
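
To make the effect of the prelude concrete, the following sketch lists the token kinds our lexer emits for the made-up input {"a": 1}; note how the injected prelude reads as an assignment to the dummy key and how the colon appears as another equal token:

! Token kinds produced for the input {"a": 1}, including the injected prelude
integer, parameter :: expected(*) = [ &
   & token_kind%keypath, token_kind%equal, &  ! injected prelude, reads as:  _ =
   & token_kind%lbrace, &                     ! {
   & token_kind%string, &                     ! "a"
   & token_kind%equal, &                      ! the colon, labeled as equal
   & token_kind%whitespace, &                 ! single blank after the colon
   & token_kind%int, &                        ! 1
   & token_kind%rbrace, &                     ! }
   & token_kind%eof]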

Exercise

The nil token will make the parser skip the respective value. If we want to support null values, how would we have to modify our lexer to produce, for example, an empty table {} instead, i.e. an lbrace and an rbrace token?

Extracting values#

Before we can connect our lexer to the existing TOML parser we have to implement the extraction of the values. The parser itself will use the extract member functions to obtain values for the respective tokens and never directly access the character stream.

To extract string values we implement the extract_string procedure. We also use extract_string to catch the keypath token we inserted into the token stream and return the desired dummy key for it.

src/json_lexer.f90 (extract_string)#
!> Extract string value of token
subroutine extract_string(lexer, token, string)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract string value from
   type(toml_token), intent(in) :: token
   !> String value of token
   character(len=:), allocatable, intent(out) :: string

   integer :: it, length
   logical :: escape
   character(1, tfc) :: ch

   length = token%last - token%first + 1

   select case(token%kind)
   case(token_kind%keypath)  ! dummy token inserted by lexer prelude
      string = "_"
   case(token_kind%string)
      string = ""
      escape = .false.
      do it = token%first + 1, token%last - 1
         ch = lexer%chunk(it:it)
         if (escape) then
            escape = .false.
            select case(ch)
            case(toml_escape%dquote, toml_escape%backslash); string = string // ch
            case("b"); string = string // toml_escape%bspace
            case("t"); string = string // toml_escape%tabulator
            case("n"); string = string // toml_escape%newline
            case("r"); string = string // toml_escape%carriage_return
            case("f"); string = string // toml_escape%formfeed
            end select
            cycle
         end if
         escape = ch == toml_escape%backslash
         if (.not.escape) string = string // ch
      end do
   end select
end subroutine extract_string

Similarly, we implement extract_integer; instead of using an internal read, we do the conversion ourselves. The digits are accumulated into the negative range and the sign is flipped at the end, which allows the most negative representable integer to be read without overflow.

src/json_lexer.f90 (extract_integer)#
!> Extract integer value of token
subroutine extract_integer(lexer, token, val)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract integer value from
   type(toml_token), intent(in) :: token
   !> Integer value of token
   integer(tfi), intent(out) :: val

   integer :: first, it, tmp
   character(1, tfc) :: ch
   character(*, tfc), parameter :: num = "0123456789"

   if (token%kind /= token_kind%int) return

   val = 0
   first = token%first

   if (lexer%chunk(first:first) == "-") first = first + 1
   if (lexer%chunk(first:first) == "0") return

   do it = first, token%last
      ch = lexer%chunk(it:it)
      tmp = scan(num, ch) - 1
      if (tmp < 0) cycle
      val = val * 10 - tmp
   end do

   if (lexer%chunk(token%first:token%first) /= "-") val = -val
end subroutine extract_integer
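
As a small worked example of the sign handling (not additional library code, just the arithmetic spelled out):

! Accumulation for the token "-12":
!   start          val =  0
!   digit "1"      val =  0*10 - 1 = -1
!   digit "2"      val = -1*10 - 2 = -12
! The leading "-" means the sign is not flipped at the end, so val stays -12.
! For the token "12" the same accumulation also yields -12, which is then
! negated to 12. Accumulating on the negative side is what allows
! -9223372036854775808 to be read, since its magnitude does not fit into a
! positive integer(tfi).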

For floating point numbers, implemented in extract_float, we just use an internal read.

src/json_lexer.f90 (extract_float)#
!> Extract floating point value of token
subroutine extract_float(lexer, token, val)
   use, intrinsic :: ieee_arithmetic, only : ieee_value, &
      & ieee_positive_inf, ieee_negative_inf, ieee_quiet_nan
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract floating point value from
   type(toml_token), intent(in) :: token
   !> Floating point value of token
   real(tfr), intent(out) :: val

   integer :: stat

   if (token%kind /= token_kind%float) return

   read(lexer%chunk(token%first:token%last), *, iostat=stat) val
end subroutine extract_float

The last kind of token we can produce and extract from our lexer is the boolean value, which we implement in extract_bool.

src/json_lexer.f90 (extract_boolean)#
!> Extract boolean value of token
subroutine extract_bool(lexer, token, val)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract boolean value from
   type(toml_token), intent(in) :: token
   !> Boolean value of token
   logical, intent(out) :: val

   if (token%kind /= token_kind%bool) return

   val = lexer%chunk(token%first:token%last) == "true"
end subroutine extract_bool

We create a mocked routine for extract_datetime since we cannot produce this token in JSON.

src/json_lexer.f90 (extract_datetime)#
!> Extract datetime value of token
subroutine extract_datetime(lexer, token, val)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract datetime value from
   type(toml_token), intent(in) :: token
   !> Datetime value of token
   type(toml_datetime), intent(out) :: val

   associate(lexer => lexer, token => token)  ! ignore unused dummy arguments
   end associate
end subroutine extract_datetime

This provides our lexer with full functionality regarding the extraction of values needed for parsing and creating data structures.
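
As a quick sanity check, a hypothetical driver program can walk the token stream and extract typed values through the public interface of the module; the input string and program name below are made up for illustration:

program demo_extract
   use tjson_lexer, only : json_lexer, new_lexer_from_string, toml_token, token_kind
   use tomlf_constants, only : tfi
   implicit none
   type(json_lexer) :: lexer
   type(toml_token) :: token
   integer(tfi) :: ival
   character(len=:), allocatable :: str

   call new_lexer_from_string(lexer, '{"answer": 42}')

   do
      call lexer%next(token)
      if (token%kind == token_kind%eof) exit
      select case(token%kind)
      case(token_kind%string, token_kind%keypath)
         call lexer%extract_string(token, str)
         print '(a)', "string:  "//str
      case(token_kind%int)
         call lexer%extract_integer(token, ival)
         print '(a, i0)', "integer: ", ival
      end select
   end do
end program demo_extract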

full source

For completeness here is again the full source of our lexer implementation.

src/json_lexer.f90#
module tjson_lexer
   use tomlf_constants, only : tfc, tfi, tfr, toml_escape
   use tomlf_datetime, only : toml_datetime
   use tomlf_de_abc, only : abstract_lexer
   use tomlf_de_token, only : toml_token, token_kind
   use tomlf_error, only : toml_error, make_error
   use tomlf_utils, only : read_whole_file, read_whole_line
   implicit none
   private

   public :: json_lexer
   public :: new_lexer_from_file, new_lexer_from_unit, new_lexer_from_string
   public :: toml_token, token_kind

   !> Tokenizer for JSON documents
   type, extends(abstract_lexer) :: json_lexer
      !> Name of the source file, used for error reporting
      character(len=:), allocatable :: filename
      !> Current internal position in the source chunk
      integer :: pos = 0
      !> Current source chunk
      character(:, tfc), allocatable :: chunk
      !> Additional tokens to insert before the actual token stream
      integer :: prelude = 2
   contains
      !> Obtain the next token
      procedure :: next
      !> Extract a string from a token
      procedure :: extract_string
      !> Extract an integer from a token
      procedure :: extract_integer
      !> Extract a float from a token
      procedure :: extract_float
      !> Extract a boolean from a token
      procedure :: extract_bool
      !> Extract a timestamp from a token
      procedure :: extract_datetime
      !> Get information about source
      procedure :: get_info
   end type json_lexer

   character(*, tfc), parameter :: terminated = " {}[],:"//&
      & toml_escape%tabulator//toml_escape%newline//toml_escape%carriage_return

contains

!> Create a new instance of a lexer by reading from a file
subroutine new_lexer_from_file(lexer, filename, error)
   !> Instance of the lexer
   type(json_lexer), intent(out) :: lexer
   !> Name of the file to read from
   character(len=*), intent(in) :: filename
   !> Error code
   type(toml_error), allocatable, intent(out) :: error

   integer :: stat

   lexer%filename = filename
   call read_whole_file(filename, lexer%chunk, stat)

   if (stat /= 0) call make_error(error, "Could not open file '"//filename//"'")
end subroutine new_lexer_from_file

!> Create a new instance of a lexer by reading from a unit.
!>
!> Currently, only sequential access units can be processed by this constructor.
subroutine new_lexer_from_unit(lexer, io, error)
   !> Instance of the lexer
   type(json_lexer), intent(out) :: lexer
   !> Unit to read from
   integer, intent(in) :: io
   !> Error code
   type(toml_error), allocatable, intent(out) :: error

   character(:, tfc), allocatable :: source, line
   integer, parameter :: bufsize = 512
   character(bufsize, tfc) :: filename, mode
   integer :: stat

   inquire(unit=io, access=mode, name=filename)
   select case(trim(mode))
   case default
      stat = 1

   case("sequential", "SEQUENTIAL")
      allocate(character(0) :: source)
      do 
         call read_whole_line(io, line, stat)
         if (stat > 0) exit
         source = source // line // toml_escape%newline
         if (stat < 0) then
            if (is_iostat_end(stat)) stat = 0
            exit
         end if
      end do
      call new_lexer_from_string(lexer, source)
   end select
   if (len_trim(filename) > 0) lexer%filename = trim(filename)

   if (stat /= 0) call make_error(error, "Failed to read from unit")
end subroutine new_lexer_from_unit

!> Create a new instance of a lexer by reading from a string.
subroutine new_lexer_from_string(lexer, string)
   !> Instance of the lexer
   type(json_lexer), intent(out) :: lexer
   !> String to read from
   character(len=*), intent(in) :: string

   lexer%chunk = string
end subroutine new_lexer_from_string

!> Extract information about the source
subroutine get_info(lexer, meta, output)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Query about the source
   character(*, tfc), intent(in) :: meta
   !> Metadata about the source
   character(:, tfc), allocatable, intent(out) :: output

   select case(meta)
   case("source")
      output = lexer%chunk // toml_escape%newline
   case("filename")
      if (allocated(lexer%filename)) output = lexer%filename
   end select
end subroutine get_info

!> Advance to the next token in the lexer
subroutine next(lexer, token)
   !> Instance of the lexer
   class(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   type(toml_token), parameter :: prelude(2) = &
      [toml_token(token_kind%equal, 0, 0), toml_token(token_kind%keypath, 1, 0)]

   if (lexer%prelude > 0) then
      token = prelude(lexer%prelude)
      lexer%prelude = lexer%prelude - 1
      return
   end if

   call next_token(lexer, token)
end subroutine next

!> Actually generate the next token, unbuffered version
subroutine next_token(lexer, token)
   !> Instance of the lexer
   class(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   integer :: prev, pos

   ! Consume current token
   lexer%pos = lexer%pos + token%last - token%first + 1
   prev = lexer%pos
   pos = lexer%pos

   ! If lexer is exhausted, return EOF as early as possible
   if (pos > len(lexer%chunk)) then
      token = toml_token(token_kind%eof, prev, pos)
      return
   end if

   select case(lexer%chunk(pos:pos))
   case(" ", toml_escape%tabulator, toml_escape%newline, toml_escape%carriage_return)
      do while(any(lexer%chunk(pos+1:pos+1) == [" ", toml_escape%tabulator, &
            & toml_escape%newline, toml_escape%carriage_return]) .and. pos < len(lexer%chunk))
         pos = pos + 1
      end do
      token = toml_token(token_kind%whitespace, prev, pos)
      return
   case(":")
      token = toml_token(token_kind%equal, prev, pos)
      return
   case("{")
      token = toml_token(token_kind%lbrace, prev, pos)
      return
   case("}")
      token = toml_token(token_kind%rbrace, prev, pos)
      return
   case("[")
      token = toml_token(token_kind%lbracket, prev, pos)
      return
   case("]")
      token = toml_token(token_kind%rbracket, prev, pos)
      return
   case('"')
      call next_string(lexer, token)
      return
   case("-", "0":"9")
      call next_number(lexer, token)
      if (token%kind /= token_kind%invalid) return
   case("t", "f")
      call next_boolean(lexer, token)
      return
   case("n")
      call next_null(lexer, token)
      return
   case(",")
      token = toml_token(token_kind%comma, prev, pos)
      return
   end select

   ! If the current token is invalid, advance to the next terminator
   do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
      pos = pos + 1
   end do
   token = toml_token(token_kind%invalid, prev, pos)
end subroutine next_token

!> Process next string token
subroutine next_string(lexer, token)
   !> Instance of the lexer
   type(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   character(1, tfc) :: ch
   character(*, tfc), parameter :: valid_escape = 'btnfr\"'
   integer :: prev, pos
   logical :: escape, valid

   prev = lexer%pos
   pos = lexer%pos

   valid = .true.
   escape = .false.

   do while(pos < len(lexer%chunk))
      pos = pos + 1
      ch = lexer%chunk(pos:pos)
      valid = valid .and. valid_string(ch)
      if (escape) then
         escape = .false.
         valid = valid .and. verify(ch, valid_escape) == 0
         cycle
      end if
      escape = ch == toml_escape%backslash
      if (ch == '"') exit
      if (ch == toml_escape%newline) then
         pos = pos - 1
         valid = .false.
         exit
      end if
   end do

   valid = valid .and. lexer%chunk(pos:pos) == '"' .and. pos /= prev
   token = toml_token(merge(token_kind%string, token_kind%invalid, valid), prev, pos)
end subroutine next_string

!> Process next number token, can produce either integer or floats
subroutine next_number(lexer, token)
   !> Instance of the lexer
   type(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   integer :: prev, pos, point, expo
   logical :: minus, zero, first
   character(1, tfc) :: ch
   integer, parameter :: offset(*) = [0, 1, 2]

   prev = lexer%pos
   pos = lexer%pos
   token = toml_token(token_kind%invalid, prev, pos)

   point = 0
   expo = 0
   zero = .false.
   first = .true.
   minus = lexer%chunk(pos:pos) == "-"
   if (minus) pos = pos + 1

   do while(pos <= len(lexer%chunk))
      ch = lexer%chunk(pos:pos)
      if (ch == ".") then
         if (point > 0 .or. expo > 0) return
         zero = .false.
         point = pos
         pos = pos + 1
         cycle
      end if

      if (ch == "e" .or. ch == "E") then
         if (expo > 0) return
         zero = .false.
         expo = pos
         pos = pos + 1
         cycle
      end if

      if (ch == "+" .or. ch == "-") then
         if (.not.any(lexer%chunk(pos-1:pos-1) == ["e", "E"])) return
         pos = pos + 1
         cycle
      end if

      if (verify(ch, "0123456789") == 0) then
         if (zero) return
         zero = first .and. ch == "0"
         first = .false.
         pos = pos + 1
         cycle
      end if

      exit
   end do

   if (any([expo, point] == pos-1)) return
   token = toml_token(merge(token_kind%float, token_kind%int, any([expo, point] > 0)), &
      & prev, pos-1)
end subroutine next_number

!> Process next boolean token
subroutine next_boolean(lexer, token)
   !> Instance of the lexer
   type(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   integer :: pos, prev

   prev = lexer%pos
   pos = lexer%pos

   do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
      pos = pos + 1
   end do

   select case(lexer%chunk(prev:pos))
   case default
      token = toml_token(token_kind%invalid, prev, pos)
   case("true", "false")
      token = toml_token(token_kind%bool, prev, pos)
   end select
end subroutine next_boolean

!> Process next null token
subroutine next_null(lexer, token)
   !> Instance of the lexer
   type(json_lexer), intent(inout) :: lexer
   !> Current token
   type(toml_token), intent(inout) :: token

   integer :: pos, prev

   prev = lexer%pos
   pos = lexer%pos

   do while(verify(lexer%chunk(pos+1:pos+1), terminated) > 0 .and. pos < len(lexer%chunk))
      pos = pos + 1
   end do

   token = toml_token( &
      & merge(token_kind%nil, token_kind%invalid, lexer%chunk(prev:pos) == "null"), &
      & prev, pos)
end subroutine next_null

!> Validate characters in string, non-printable characters are invalid in this context
pure function valid_string(ch) result(valid)
   character(1, tfc), intent(in) :: ch
   logical :: valid

   character(1, tfc), parameter :: x00 = achar(int(z"00")), x08 = achar(int(z"08")), &
      & x0b = achar(int(z"0b")), x1f = achar(int(z"1f")), x7f = achar(int(z"7f"))

   valid = &
      & .not.(x00 <= ch .and. ch <= x08) .and. &
      & .not.(x0b <= ch .and. ch <= x1f) .and. &
      & ch /= x7f
end function valid_string

!> Extract string value of token
subroutine extract_string(lexer, token, string)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract string value from
   type(toml_token), intent(in) :: token
   !> String value of token
   character(len=:), allocatable, intent(out) :: string

   integer :: it, length
   logical :: escape
   character(1, tfc) :: ch

   length = token%last - token%first + 1

   select case(token%kind)
   case(token_kind%keypath)  ! dummy token inserted by lexer prelude
      string = "_"
   case(token_kind%string)
      string = ""
      escape = .false.
      do it = token%first + 1, token%last - 1
         ch = lexer%chunk(it:it)
         if (escape) then
            escape = .false.
            select case(ch)
            case(toml_escape%dquote, toml_escape%backslash); string = string // ch
            case("b"); string = string // toml_escape%bspace
            case("t"); string = string // toml_escape%tabulator
            case("n"); string = string // toml_escape%newline
            case("r"); string = string // toml_escape%carriage_return
            case("f"); string = string // toml_escape%formfeed
            end select
            cycle
         end if
         escape = ch == toml_escape%backslash
         if (.not.escape) string = string // ch
      end do
   end select
end subroutine extract_string

!> Extract integer value of token
subroutine extract_integer(lexer, token, val)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract integer value from
   type(toml_token), intent(in) :: token
   !> Integer value of token
   integer(tfi), intent(out) :: val

   integer :: first, it, tmp
   character(1, tfc) :: ch
   character(*, tfc), parameter :: num = "0123456789"

   if (token%kind /= token_kind%int) return

   val = 0
   first = token%first

   if (lexer%chunk(first:first) == "-") first = first + 1
   if (lexer%chunk(first:first) == "0") return

   do it = first, token%last
      ch = lexer%chunk(it:it)
      tmp = scan(num, ch) - 1
      if (tmp < 0) cycle
      val = val * 10 - tmp
   end do

   if (lexer%chunk(token%first:token%first) /= "-") val = -val
end subroutine extract_integer

!> Extract floating point value of token
subroutine extract_float(lexer, token, val)
   use, intrinsic :: ieee_arithmetic, only : ieee_value, &
      & ieee_positive_inf, ieee_negative_inf, ieee_quiet_nan
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract floating point value from
   type(toml_token), intent(in) :: token
   !> Floating point value of token
   real(tfr), intent(out) :: val

   integer :: stat

   if (token%kind /= token_kind%float) return

   read(lexer%chunk(token%first:token%last), *, iostat=stat) val
end subroutine extract_float

!> Extract boolean value of token
subroutine extract_bool(lexer, token, val)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract boolean value from
   type(toml_token), intent(in) :: token
   !> Boolean value of token
   logical, intent(out) :: val

   if (token%kind /= token_kind%bool) return

   val = lexer%chunk(token%first:token%last) == "true"
end subroutine extract_bool

!> Extract datetime value of token
subroutine extract_datetime(lexer, token, val)
   !> Instance of the lexer
   class(json_lexer), intent(in) :: lexer
   !> Token to extract datetime value from
   type(toml_token), intent(in) :: token
   !> Datetime value of token
   type(toml_datetime), intent(out) :: val

   associate(lexer => lexer, token => token)  ! ignore unused dummy arguments
   end associate
end subroutine extract_datetime

end module tjson_lexer

Verifying the lexer#

We could jump right into connecting our lexer with the parser, but we have not yet verified that the tokenization and value extraction work as expected. For this purpose, we will create some unit tests using the test-drive framework.

As the entry point for our tester, we will use the standard wrapper for launching test suites.

tester program

Taken from the test-drive README

test/main.f90#
!> Wrapper for the testsuites
program tester
   use, intrinsic :: iso_fortran_env, only : error_unit
   use testdrive, only : run_testsuite, new_testsuite, testsuite_type, &
      & select_suite, run_selected, get_argument
   use test_lexer, only : collect_lexer
   implicit none
   integer :: stat, is
   character(len=:), allocatable :: suite_name, test_name
   type(testsuite_type), allocatable :: testsuites(:)
   character(len=*), parameter :: fmt = '("#", *(1x, a))'

   stat = 0

   testsuites = [ &
      & new_testsuite("lexer", collect_lexer) &
      & ]

   call get_argument(1, suite_name)
   call get_argument(2, test_name)

   if (allocated(suite_name)) then
      is = select_suite(testsuites, suite_name)
      if (is > 0 .and. is <= size(testsuites)) then
         if (allocated(test_name)) then
            write(error_unit, fmt) "Suite:", testsuites(is)%name
            call run_selected(testsuites(is)%collect, test_name, error_unit, stat)
            if (stat < 0) then
               error stop 1
            end if
         else
            write(error_unit, fmt) "Testing:", testsuites(is)%name
            call run_testsuite(testsuites(is)%collect, error_unit, stat)
         end if
      else
         write(error_unit, fmt) "Available testsuites"
         do is = 1, size(testsuites)
            write(error_unit, fmt) "-", testsuites(is)%name
         end do
         error stop 1
      end if
   else
      do is = 1, size(testsuites)
         write(error_unit, fmt) "Testing:", testsuites(is)%name
         call run_testsuite(testsuites(is)%collect, error_unit, stat)
      end do
   end if

   if (stat > 0) then
      write(error_unit, '(i0, 1x, a)') stat, "test(s) failed!"
      error stop 1
   end if
end program tester

Our actual test suite for the lexer is based on a routine called check_token, which creates a new lexer from a string and retrieves all tokens while comparing them with a reference token stream. We can then implement our checks by providing a string and a list of expected tokens to verify that our lexer identifies them correctly. For visualization, we use the tomlf_diagnostic module to label the tokens in the actual source string.

test/test_lexer.f90#
module test_lexer
   use testdrive
   use tjson_lexer
   use tomlf_constants, only : tfi, tfr, nl => TOML_NEWLINE
   implicit none

   public :: collect_lexer

contains

!> Collect all exported unit tests
subroutine collect_lexer(testsuite)
   !> Collection of tests
   type(unittest_type), allocatable, intent(out) :: testsuite(:)

   testsuite = [ &
      & new_unittest("bool-true", bool_true), &
      & new_unittest("bool-falsey", bool_falsey), &
      & new_unittest("empty", empty), &
      & new_unittest("float-point", float_point), &
      & new_unittest("float-double-point", float_double_point), &
      & new_unittest("integer-limits", integer_limits), &
      & new_unittest("string", string), &
      & new_unittest("whitespace-mixed", whitespace_mixed)]

end subroutine collect_lexer

subroutine empty(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   call check_token(error, "", &
      & [token_kind%eof])
end subroutine empty

subroutine bool_falsey(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   call check_token(error, "falsey", &
      & [token_kind%invalid, token_kind%eof])
end subroutine bool_falsey

subroutine bool_true(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   call check_token(error, "true", &
      & [token_kind%bool, token_kind%eof])
end subroutine bool_true

subroutine float_point(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   call check_token(error, "3.14,+3.14,-3.14,0.123", &
      & [token_kind%float, token_kind%comma, token_kind%invalid, token_kind%comma, &
      &  token_kind%float, token_kind%comma, token_kind%float, token_kind%eof])
end subroutine float_point

subroutine float_double_point(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   call check_token(error, "0..1,0.1.2", &
      & [token_kind%invalid, token_kind%comma, token_kind%invalid, token_kind%eof])
end subroutine float_double_point

subroutine integer_limits(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   call check_token(error, "9223372036854775807,-9223372036854775808", &
      & [token_kind%int, token_kind%comma, token_kind%int, token_kind%eof])
end subroutine integer_limits

subroutine string(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   call check_token(error, """something"",""anything""", &
      & [token_kind%string, token_kind%comma, token_kind%string, token_kind%eof])
end subroutine string

subroutine whitespace_mixed(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   call check_token(error, achar(9)//" "//achar(9), &
      & [token_kind%whitespace, token_kind%eof])
end subroutine whitespace_mixed

subroutine check_token(error, string, expected)
   use tomlf_diagnostic, only : render, toml_label, toml_level
   use tomlf_de_token, only : stringify
   use tomlf_terminal, only : toml_terminal
   !> Error handling
   type(error_type), allocatable, intent(out) :: error
   !> String to be parsed
   character(len=*), intent(in) :: string
   !> Expected token kind
   integer, intent(in) :: expected(:)

   integer :: it
   logical :: okay
   character(len=:), allocatable :: msg
   type(json_lexer) :: lexer
   type(toml_token) :: token
   type(toml_label), allocatable :: label(:)

   call new_lexer_from_string(lexer, string)
   lexer%prelude = 0

   allocate(label(0))
   do it = 1, size(expected)
      call lexer%next(token)
      okay = token%kind == expected(it)
      label = [label, toml_label(merge(toml_level%info, toml_level%error, okay), &
         &     token%first, token%last, stringify(token), .not.okay)]
      msg = render(string//nl, [label(size(label))], toml_terminal(.true.))
      call check(error, token%kind, expected(it), &
         & "Expected '"//stringify(toml_token(expected(it)))// &
         & "' but got '"//stringify(token)//"'"//nl//msg)
      if (allocated(error)) exit
   end do
   if (.not.allocated(error)) then
      msg = render(string//nl, label, toml_terminal(.true.))
      print '(a)', msg
   end if
end subroutine check_token

end module test_lexer

These are only a couple of tests; there are many more cases to consider for a robust lexer.
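
As a template for such cases, a sketch of one additional test could check that an unterminated string is flagged as invalid; the subroutine below follows the same pattern as the existing tests and would still have to be registered in collect_lexer via new_unittest (the name string-unclosed is only a suggestion):

subroutine string_unclosed(error)
   !> Error handling
   type(error_type), allocatable, intent(out) :: error

   ! A missing closing quote must yield an invalid token rather than a string
   call check_token(error, '"unterminated', &
      & [token_kind%invalid, token_kind%eof])
end subroutine string_unclosed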

Exercise

Write at least ten more tests for edge cases in the lexer. Make sure to include invalid cases and ensure that even invalid tokens are generated correctly.

Connecting to the parser#

Now that we have verified the tokenization process in our lexer we can connect our custom lexer to the default TOML parser.

For this purpose we define convenience interfaces called json_load / json_loads, similar to the available toml_load / toml_loads interfaces. Unlike the TOML-related load interfaces, we also use them to implement the necessary post-processing steps for the data structure.

src/json_parser.f90#
module tjson_parser
   use tomlf_constants, only : tfc, tfi, tfr, toml_type
   use tomlf_datetime, only : toml_datetime
   use tomlf_de_context, only : toml_context
   use tjson_lexer, only : json_lexer, new_lexer_from_string, new_lexer_from_unit, &
      & new_lexer_from_file
   use tomlf_de_parser, only : parse, toml_parser_config
   use tomlf_diagnostic, only : toml_level
   use tomlf_build, only : get_value
   use tomlf_error, only : toml_error
   use tomlf_type, only : toml_table, toml_value, cast_to_table, &
      & toml_visitor, toml_array, toml_keyval, toml_key, len
   implicit none
   private

   public :: json_load, json_loads
   public :: toml_context, toml_parser_config, toml_level


   !> Load a TOML data structure from the provided source
   interface json_load
      module procedure :: json_load_file
      module procedure :: json_load_unit
   end interface json_load

   !> Load a TOML data structure from a string
   interface json_loads
      module procedure :: json_load_string
   end interface json_loads

The json_load interface is implemented by json_load_file and json_load_unit. The former is a wrapper around the new_lexer_from_file constructor.

src/json_parser.f90 (json_load_file)#
!> Load TOML data structure from file
subroutine json_load_file(object, filename, config, context, error)
   !> Instance of the TOML data structure, not allocated in case of error
   class(toml_value), allocatable, intent(out) :: object
   !> Name of the file to load
   character(*, tfc), intent(in) :: filename
   !> Configuration for the parser
   type(toml_parser_config), intent(in), optional :: config
   !> Context tracking the origin of the data structure to allow rich reports
   type(toml_context), intent(out), optional :: context
   !> Error handling, provides detailed diagnostic in case of error
   type(toml_error), allocatable, intent(out), optional :: error

   type(json_lexer) :: lexer
   type(toml_error), allocatable :: error_
   type(toml_table), allocatable :: table

   call new_lexer_from_file(lexer, filename, error_)
   if (.not.allocated(error_)) then
      call parse(lexer, table, config, context, error)
      if (allocated(table)) call prune(object, table)
   else
      if (present(error)) call move_alloc(error_, error)
   end if
end subroutine json_load_file

The latter wraps the new_lexer_from_unit constructor.

src/json_parser.f90 (json_load_unit)#
!> Load TOML data structure from unit
subroutine json_load_unit(object, io, config, context, error)
   !> Instance of the TOML data structure, not allocated in case of error
   class(toml_value), allocatable, intent(out) :: object
   !> Unit to read from
   integer, intent(in) :: io
   !> Configuration for the parser
   type(toml_parser_config), intent(in), optional :: config
   !> Context tracking the origin of the data structure to allow rich reports
   type(toml_context), intent(out), optional :: context
   !> Error handling, provides detailed diagnostic in case of error
   type(toml_error), allocatable, intent(out), optional :: error

   type(json_lexer) :: lexer
   type(toml_error), allocatable :: error_
   type(toml_table), allocatable :: table

   call new_lexer_from_unit(lexer, io, error_)
   if (.not.allocated(error_)) then
      call parse(lexer, table, config, context, error)
      if (allocated(table)) call prune(object, table)
   else
      if (present(error)) call move_alloc(error_, error)
   end if
end subroutine json_load_unit

Finally, we also provide json_loads by implementing json_load_string using our new_lexer_from_string constructor.

src/json_parser.f90 (json_load_string)#
!> Load TOML data structure from string
subroutine json_load_string(object, string, config, context, error)
   !> Instance of the TOML data structure, not allocated in case of error
   class(toml_value), allocatable, intent(out) :: object
   !> String containing TOML document
   character(*, tfc), intent(in) :: string
   !> Configuration for the parser
   type(toml_parser_config), intent(in), optional :: config
   !> Context tracking the origin of the data structure to allow rich reports
   type(toml_context), intent(out), optional :: context
   !> Error handling, provides detailed diagnostic in case of error
   type(toml_error), allocatable, intent(out), optional :: error

   type(json_lexer) :: lexer
   type(toml_table), allocatable :: table

   call new_lexer_from_string(lexer, string)
   call parse(lexer, table, config, context, error)
   if (allocated(table)) call prune(object, table)
end subroutine json_load_string

These wrappers are so far very straightforward: they first set up a lexer instance and then invoke the parse procedure, which constructs the actual parser instance and processes the token stream. After a successful run the table instance will be allocated; for the post-processing we invoke the prune routine.

src/json_parser.f90 (prune)#
subroutine prune(object, table)
   !> Instance of the TOML data structure, not allocated in case of error
   class(toml_value), allocatable, intent(inout) :: object
   !> Instance of the TOML data structure, not allocated in case of error
   type(toml_table), allocatable, intent(inout) :: table

   call table%pop("_", object)
end subroutine prune

Here we effectively remove the first child from the root table and return it as a polymorphic toml_value. This has the advantage that we can support arrays and values at the root level with our JSON loader.

Tip

The user can dispatch the value using a select type construct or by creating a view using the cast_to_table / cast_to_array / cast_to_keyval functions.
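
For illustration, a hypothetical driver could load a JSON string with json_loads and dispatch on the returned polymorphic value; the input string and the use of len from tomlf_type are assumptions made for this sketch:

program demo_load
   use tjson_parser, only : json_loads
   use tomlf_type, only : toml_value, toml_table, toml_array, len
   use tomlf_error, only : toml_error
   implicit none
   class(toml_value), allocatable :: object
   type(toml_error), allocatable :: error

   call json_loads(object, '{"a": 1, "b": [true, false]}', error=error)
   if (allocated(error)) then
      print '(a)', error%message
      stop 1
   end if

   ! Dispatch on the dynamic type of the returned value
   select type(object)
   type is(toml_table)
      print '(a, i0, a)', "top-level table with ", len(object), " entries"
   type is(toml_array)
      print '(a, i0, a)', "top-level array with ", len(object), " entries"
   end select
end program demo_load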

full source

For completeness here is again the full source of our parser implementation.

Note that this implementation also contains a toml_visitor to prune the type annotations used in the validation test suite to represent TOML values.

src/json_parser.f90#
module tjson_parser
   use tomlf_constants, only : tfc, tfi, tfr, toml_type
   use tomlf_datetime, only : toml_datetime
   use tomlf_de_context, only : toml_context
   use tjson_lexer, only : json_lexer, new_lexer_from_string, new_lexer_from_unit, &
      & new_lexer_from_file
   use tomlf_de_parser, only : parse, toml_parser_config
   use tomlf_diagnostic, only : toml_level
   use tomlf_build, only : get_value
   use tomlf_error, only : toml_error
   use tomlf_type, only : toml_table, toml_value, cast_to_table, &
      & toml_visitor, toml_array, toml_keyval, toml_key, len
   implicit none
   private

   public :: json_load, json_loads
   public :: toml_context, toml_parser_config, toml_level


   !> Load a TOML data structure from the provided source
   interface json_load
      module procedure :: json_load_file
      module procedure :: json_load_unit
   end interface json_load

   !> Load a TOML data structure from a string
   interface json_loads
      module procedure :: json_load_string
   end interface json_loads

   !> Implement pruning of annotated values as visitor
   type, extends(toml_visitor) :: json_prune
   contains
      !> Traverse the AST and prune all annotated values
      procedure :: visit
   end type json_prune

contains

!> Load TOML data structure from file
subroutine json_load_file(object, filename, config, context, error)
   !> Instance of the TOML data structure, not allocated in case of error
   class(toml_value), allocatable, intent(out) :: object
   !> Name of the file to load
   character(*, tfc), intent(in) :: filename
   !> Configuration for the parser
   type(toml_parser_config), intent(in), optional :: config
   !> Context tracking the origin of the data structure to allow rich reports
   type(toml_context), intent(out), optional :: context
   !> Error handling, provides detailed diagnostic in case of error
   type(toml_error), allocatable, intent(out), optional :: error

   type(json_lexer) :: lexer
   type(toml_error), allocatable :: error_
   type(toml_table), allocatable :: table

   call new_lexer_from_file(lexer, filename, error_)
   if (.not.allocated(error_)) then
      call parse(lexer, table, config, context, error)
      if (allocated(table)) call prune(object, table)
   else
      if (present(error)) call move_alloc(error_, error)
   end if
end subroutine json_load_file

!> Load TOML data structure from unit
subroutine json_load_unit(object, io, config, context, error)
   !> Instance of the TOML data structure, not allocated in case of error
   class(toml_value), allocatable, intent(out) :: object
   !> Unit to read from
   integer, intent(in) :: io
   !> Configuration for the parser
   type(toml_parser_config), intent(in), optional :: config
   !> Context tracking the origin of the data structure to allow rich reports
   type(toml_context), intent(out), optional :: context
   !> Error handling, provides detailed diagnostic in case of error
   type(toml_error), allocatable, intent(out), optional :: error

   type(json_lexer) :: lexer
   type(toml_error), allocatable :: error_
   type(toml_table), allocatable :: table

   call new_lexer_from_unit(lexer, io, error_)
   if (.not.allocated(error_)) then
      call parse(lexer, table, config, context, error)
      if (allocated(table)) call prune(object, table)
   else
      if (present(error)) call move_alloc(error_, error)
   end if
end subroutine json_load_unit

!> Load TOML data structure from string
subroutine json_load_string(object, string, config, context, error)
   !> Instance of the TOML data structure, not allocated in case of error
   class(toml_value), allocatable, intent(out) :: object
   !> String containing TOML document
   character(*, tfc), intent(in) :: string
   !> Configuration for the parser
   type(toml_parser_config), intent(in), optional :: config
   !> Context tracking the origin of the data structure to allow rich reports
   type(toml_context), intent(out), optional :: context
   !> Error handling, provides detailed diagnostic in case of error
   type(toml_error), allocatable, intent(out), optional :: error

   type(json_lexer) :: lexer
   type(toml_table), allocatable :: table

   call new_lexer_from_string(lexer, string)
   call parse(lexer, table, config, context, error)
   if (allocated(table)) call prune(object, table)
end subroutine json_load_string

!> Prune the artificial root table inserted by the lexer
subroutine prune(object, table)
   !> Instance of the TOML data structure, not allocated in case of error
   class(toml_value), allocatable, intent(inout) :: object
   !> Instance of the TOML data structure, not allocated in case of error
   type(toml_table), allocatable, intent(inout) :: table

   type(json_prune) :: pruner

   call table%pop("_", object)

   if (allocated(object)) call object%accept(pruner)
end subroutine prune

!> Visit a TOML value
subroutine visit(self, val)
   !> Instance of the JSON pruner
   class(json_prune), intent(inout) :: self
   !> TOML value to visit
   class(toml_value), intent(inout) :: val

   select type(val)
   class is(toml_array)
      call visit_array(self, val)
   class is(toml_table)
      call visit_table(self, val)
   end select
end subroutine visit

!> Visit a TOML array
subroutine visit_array(visitor, array)
   !> Instance of the JSON pruner
   class(json_prune), intent(inout) :: visitor
   !> TOML value to visit
   type(toml_array), intent(inout) :: array

   class(toml_value), allocatable :: val, tmp
   character(kind=tfc, len=:), allocatable :: str
   type(toml_key), allocatable :: vt(:)
   integer :: i, n, stat

   n = len(array)
   do i = 1, n
      call array%shift(val)
      select type(val)
      class default
         call val%accept(visitor)
      class is(toml_table)
         call val%get_keys(vt)
         if (val%has_key("type") .and. val%has_key("value") .and. size(vt)==2) then
            call get_value(val, "type", str)
            call prune_value(tmp, val, str)
            call val%destroy
            call tmp%accept(visitor)
            call array%push_back(tmp, stat)
            cycle
         else
            call val%accept(visitor)
         end if
      end select
      call array%push_back(val, stat)
   end do
end subroutine visit_array

!> Visit a TOML table
subroutine visit_table(visitor, table)
   !> Instance of the JSON pruner
   class(json_prune), intent(inout) :: visitor
   !> TOML table to visit
   type(toml_table), intent(inout) :: table

   class(toml_value), pointer :: ptr
   class(toml_value), allocatable :: val
   character(kind=tfc, len=:), allocatable :: str
   type(toml_key), allocatable :: list(:), vt(:)
   integer :: i, n, stat

   call table%get_keys(list)
   n = size(list, 1)

   do i = 1, n
      call table%get(list(i)%key, ptr)
      select type(ptr)
      class default
         call ptr%accept(visitor)
      class is(toml_table)
         call ptr%get_keys(vt)
         if (ptr%has_key("type") .and. ptr%has_key("value") .and. size(vt)==2) then
            call get_value(ptr, "type", str)
            call prune_value(val, ptr, str)
            call val%accept(visitor)
            call table%delete(list(i)%key)
            call table%push_back(val, stat)
         else
            call ptr%accept(visitor)
         end if
      end select
   end do
end subroutine visit_table

subroutine prune_value(val, table, str)
   !> Actual TOML value
   class(toml_value), allocatable, intent(out) :: val
   !> TOML table to prune
   type(toml_table), intent(inout) :: table
   !> Value kind
   character(kind=tfc, len=*), intent(in) :: str

   class(toml_value), pointer :: ptr
   character(:, tfc), pointer :: sval
   integer :: stat
   type(toml_datetime) :: dval
   integer(tfi) :: ival
   real(tfr) :: fval

   call table%get("value", ptr)
   allocate(val, source=ptr)
   if (allocated(table%key)) then
      val%key = table%key
   else
      deallocate(val%key)
   end if

   select type(val)
   class is(toml_keyval)
      call val%get(sval)
      select case(str)
      case("date", "time", "datetime", "date-local", "time-local", "datetime-local")
         dval = toml_datetime(sval)
         call val%set(dval)
      case("bool")
         call val%set(sval == "true")
      case("integer")
         read(sval, *, iostat=stat) ival
         if (stat == 0) then
            call val%set(ival)
         end if
      case("float")
         read(sval, *, iostat=stat) fval
         if (stat == 0) then
            call val%set(fval)
         end if
      end select
   end select
end subroutine prune_value

end module tjson_parser

Summary#

Now we have a working lexer that can tokenize JSON documents into tokens the TOML parser can process. The lexer implemented in TOML Fortran follows a similar construction, with the difference that the TOML grammar is much more complicated to tokenize than JSON.

Important

In this tutorial, you have learned about the tokenization process used in TOML Fortran. You can now

  • implement a custom lexer based on the TOML tokens

  • verify your lexer against an expected token stream

  • adjust the token stream to direct the parsing process

  • add a post-processing step to prune the resulting data structure