performance – Simple parser using flex and c++


This is an alternative parser based on the specifications from this question. Briefly stated, the input file is a text file which has at least 33 fields separated by semicolons.

If the fourth field begins with either a T or an E, the line is valid and a subset of it written to the output file. Specifically, fields as numbered from $0$, should be output in this order: $ {0, 2, 3, 4, 5, 6, 10, 9, 11, 7, 32}$, each separated by a comma. All other fields are discarded.

%{
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <algorithm>
#include <experimental/iterator>
#include <iterator>
#undef YY_DECL
#define YY_DECL int FileLexer::yylex()


class FileLexer : public yyFlexLexer {
public:
    FileLexer(std::istream& in, std::ostream& out) :
        yyFlexLexer{in, out},
        out{out}
    {}
    using FlexLexer::yylex;
    /// the yylex function is automatically created by Flex.
    virtual int yylex();
private:
    /// pointer to the current value
    std::vector<std::string> vec;
    std::ostream& out;
    unsigned fieldcount{0};
    bool valid{true};
};


%}
%option warn nodefault batch noyywrap c++
%option yyclass="FileLexer"

FIELD  (^;n)*
DELIM   ; 
%%

{DELIM} { }
n      {
            if (valid && fieldcount >= 33) {
                std::copy(vec.begin(), vec.end(), std::experimental::make_ostream_joiner(out, ","));
                out << 'n';
            }
            vec.clear();
            fieldcount = 0;
            valid = true;
            return 1;
        }
{FIELD} {
            if (valid) {
                switch (fieldcount++) {
                    case 0:
                    case 1:
                    case 4:
                    case 5:
                    case 6:
                    case 7:
                    case 9:
                    case 32:
                        vec.push_back(yytext);
                        break;
                    case 3:
                        if (yytext(0) == 'E' || yytext(0) == 'T') {
                            vec.push_back(yytext);
                            valid = true;
                        } else {
                            valid = false;
                        }
                        break;
                    case 10:
                        {
                            auto n{vec.size()};
                            vec.push_back(yytext);
                            std::iter_swap(vec.begin()+n, vec.begin()+n-2);
                        }
                        break;
                    case 11:
                        {
                            auto n{vec.size()};
                            vec.push_back(yytext);
                            std::iter_swap(vec.begin()+n, vec.begin()+n-1);
                        }
                        break;

                }
            }
        }
%%
int main(int argc, char *argv()) {
    if (argc >= 3) {
        std::ifstream in{argv(1)};
        std::ofstream out{argv(2)};
        FileLexer lexer{in, out};
        while (lexer.yylex() != 0)
        {}
    }
}

Compile with:

flex -o parsefile.cpp lexer.l 
g++ -O2 -std=gnu++17 parsefile.cpp -o parsefile

This works but is slow (2.165 s) on my machine, with the same million-line input file as mentioned in my answer to the other question.

I tried it a few different ways but was unable to get a version that was faster than the PHP code in the other question. The switch statement logic is arguably a bit overly clever and stores only the needed fields in the desired order, but the speed was about the same as the straightforward implementation.

If it matters, I’m using gcc version 10.1 and flex 2.6.4 on a 64-bit Linux machine.