
In C++ what is the fastest way to read all the words from a collection of large text files?

I have a collection of large text files and I want my code to read them word by word, storing each word in a string variable. I wrote a simple prototype that reads words from a collection of files and prints them (the real code won't print anything), but my actual program will process around 20 files and hundreds of thousands of words. It seems like the performance of this code won't be great. Is there anything I can do to improve the speed?

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main() {
    string s;
    string array[3] = {"text1.txt", "text2.txt", "text3.txt"};
    for (int i = 0; i < 3; i++) {
        ifstream fileread;
        fileread.open(array[i]);
        while (fileread >> s) {
            cout << s << endl;
        }
    }
    return 0;
}


1 Answer


You are already putting the words in a string variable, and the thing that's most likely slowing your code down is printing every word to std::cout.
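
For example, here's a minimal sketch (using the same placeholder filenames as in the question) of that same loop with the printing removed and the words stored in a vector instead. Dropping the cout << s << endl (endl also flushes the stream every time) usually makes a big difference on its own:

#include <fstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<string> words;
    string array[3] = {"text1.txt", "text2.txt", "text3.txt"};
    for (int i = 0; i < 3; i++) {
        ifstream fileread(array[i]);
        string s;
        while (fileread >> s) {
            words.push_back(s); // store the word instead of printing it
        }
    }
    // use the words here instead of printing them one by one
    return 0;
}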

If you have an extremely fast disk (like a ramdisk) you could in theory speed things up by reading from multiple files at once.

I've put together an example (requiring C++17 or later) of how that could be done. It also measures the time it takes and displays some statistics when the reading is done. The code is commented to explain what it does. To compare with the speed of reading the files serially, just remove the std::execution::par argument from the std::for_each call.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <execution>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <numeric>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

int cppmain(const std::string_view program, std::vector<std::string> files) {
    if(files.empty()) {
        std::cerr << "USAGE: " << program << " <files...>
";
        return 1;
    }

    // measure how long time it takes to read all the files
    auto start = std::chrono::steady_clock::now();

    // a map of filenames mapping to a vector of words in each file
    std::unordered_map<std::string, std::vector<std::string>> file_words_map;

    // create the filename Keys in the map - this is needed to not insert keys
    // in the map object itself from multiple threads simultaneously.
    for(auto& file : files) file_words_map[file];

    // make sure the files vector only contains unique filenames
    if(files.size() != file_words_map.size()) {
        // we had duplicate filenames - remove them
        files.resize(file_words_map.size());
        std::transform(file_words_map.begin(), file_words_map.end(),
                       files.begin(),
                       [](auto& file_words) { return file_words.first; });
    }

    // read from multiple files simultaneously
    std::for_each(std::execution::par, files.begin(), files.end(),
                  [&file_words_map](auto& file) {                      
                      std::ifstream is(file); // open a file for reading
                      if(is) {                // and check that it opened ok
                          // get a reference to this file's vector in the map
                          auto& words = file_words_map[file];

                          // copy all words from the file into the words vector
                          std::copy(std::istream_iterator<std::string>(is),
                                    std::istream_iterator<std::string>{},
                                    std::back_inserter(words));
                      }
                  });

    // all files read, calculate how long time it took
    auto elapsed_time = std::chrono::steady_clock::now() - start;

    // calculate how many words we read in total
    auto total =
        std::accumulate(file_words_map.begin(), file_words_map.end(), 0ULL,
                        [](const auto val, const auto& file_words) {
                            return val + file_words.second.size();
                        });

    // statistics
    constexpr std::size_t max_words_to_display = 5;

    for(const auto& [file, words] : file_words_map) {
        std::cout << std::setw(50) << std::left << file << ' ' << std::setw(10)
                  << std::right << words.size() << "\n ";

        std::copy_n(words.begin(), std::min(max_words_to_display, words.size()),
                    std::ostream_iterator<std::string>(std::cout, ", "));

        if(words.size() > max_words_to_display) std::cout << "...";
        std::cout << '\n';
    }

    std::cout << "Read " << total << " words from " << file_words_map.size()
              << " files in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed_time).count()
              << " ms.\n";

    return 0;
}

int main(int argc, char* argv[]) {
    return cppmain(argv[0], {argv + 1, argv + argc});
}

Example output:

...
handler.cpp                                               155
 #include, <condition_variable>, #include, <functional>, #include, ...
pybind.cpp                                                162
 #include, <iostream>, #include, <pybind11.h>, #include, ...
readip.cpp                                                296
 #include, <netinet/ip.h>, #include, <fstream>, #include, ...
realdist.cpp                                              299
 #include, <cmath>, #include, <iostream>, #include, ...
strerr.cpp                                                125
 #include, <iostream>, using, namespace, std;, ...
rec.cpp                                                    91
 #include, <filesystem>, #include, <iostream>, #include, ...
Read 239715 words from 1621 files in 11 ms.

Note: You could actually run the for_each loop over file_words_map directly instead of using the files vector and then looking up each words vector in the map like the code above does. That would be a lot cleaner and wouldn't require removing duplicates from the vector, but for me this approach is 4-5 times faster, and since speed is what matters here, that's what I used. You can experiment with it yourself and compare the results.
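
For reference, a sketch (untested here, shown only to illustrate the alternative) of what that map-based variant could look like, replacing the files-based for_each above:

    std::for_each(std::execution::par, file_words_map.begin(), file_words_map.end(),
                  [](auto& file_words) {
                      // file_words.first is the filename, file_words.second its word vector
                      std::ifstream is(file_words.first);
                      if(is) {
                          std::copy(std::istream_iterator<std::string>(is),
                                    std::istream_iterator<std::string>{},
                                    std::back_inserter(file_words.second));
                      }
                  });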

If you're using g++ or clang++, compile with:

-O3 -std=c++17 -ltbb -pthread

In MSVC you don't need to specify the libraries.
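
For example, with g++ and assuming the source is saved as read_words.cpp (the filename is just for illustration):

g++ -O3 -std=c++17 -pthread read_words.cpp -ltbb -o read_words
./read_words *.txt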

