You are already putting the words in a string variable, and the thing that's most likely slowing it down is printing the actual words to std::cout.
If you have an extremely fast disk (like a ramdisk) you could in theory speed things up by reading from multiple files at once.
I've put together an idea (requiring C++17 or later) of how that could be done. It measures the time the reading takes and displays some statistics when it's done, and the code is commented to explain what it does. To compare with the speed of reading the files serially, just remove the std::execution::par argument to std::for_each.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <execution>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <numeric>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>
int cppmain(const std::string_view program, std::vector<std::string> files) {
    if(files.empty()) {
        std::cerr << "USAGE: " << program << " <files...>\n";
        return 1;
    }
    // measure how long it takes to read all the files
    auto start = std::chrono::steady_clock::now();

    // a map of filenames mapping to a vector of the words in each file
    std::unordered_map<std::string, std::vector<std::string>> file_words_map;

    // create the filename keys in the map up front - this is needed to avoid
    // inserting keys into the map object itself from multiple threads
    // simultaneously.
    for(auto& file : files) file_words_map[file];

    // make sure the files vector only contains unique filenames
    if(files.size() != file_words_map.size()) {
        // we had duplicate filenames - remove them
        files.resize(file_words_map.size());
        std::transform(file_words_map.begin(), file_words_map.end(),
                       files.begin(),
                       [](auto& file_words) { return file_words.first; });
    }

    // read from multiple files simultaneously
    std::for_each(std::execution::par, files.begin(), files.end(),
                  [&file_words_map](auto& file) {
                      std::ifstream is(file); // open a file for reading
                      if(is) {                // and check that it opened ok
                          // get a reference to this file's vector in the map
                          auto& words = file_words_map[file];
                          // copy all words from the file into the words vector
                          std::copy(std::istream_iterator<std::string>(is),
                                    std::istream_iterator<std::string>{},
                                    std::back_inserter(words));
                      }
                  });

    // all files read - calculate how long it took
    auto elapsed_time = std::chrono::steady_clock::now() - start;

    // calculate how many words we read in total
    auto total =
        std::accumulate(file_words_map.begin(), file_words_map.end(), 0ULL,
                        [](const auto val, const auto& file_words) {
                            return val + file_words.second.size();
                        });

    // statistics
    constexpr std::size_t max_words_to_display = 5;
    for(const auto& [file, words] : file_words_map) {
        std::cout << std::setw(50) << std::left << file << ' ' << std::setw(10)
                  << std::right << words.size() << "\n";
        std::copy_n(words.begin(), std::min(max_words_to_display, words.size()),
                    std::ostream_iterator<std::string>(std::cout, ", "));
        if(words.size() > max_words_to_display) std::cout << "...";
        std::cout << '\n';
    }
    std::cout << "Read " << total << " words from " << file_words_map.size()
              << " files in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(
                     elapsed_time).count()
              << " ms.\n";
    return 0;
}

int main(int argc, char* argv[]) {
    return cppmain(argv[0], {argv + 1, argv + argc});
}
Example output:
...
handler.cpp 155
#include, <condition_variable>, #include, <functional>, #include, ...
pybind.cpp 162
#include, <iostream>, #include, <pybind11.h>, #include, ...
readip.cpp 296
#include, <netinet/ip.h>, #include, <fstream>, #include, ...
realdist.cpp 299
#include, <cmath>, #include, <iostream>, #include, ...
strerr.cpp 125
#include, <iostream>, using, namespace, std;, ...
rec.cpp 91
#include, <filesystem>, #include, <iostream>, #include, ...
Read 239715 words from 1621 files in 11 ms.
Note: You could actually run the for_each loop over the file_words_map directly instead of going through the files vector and then looking up the words vector in the map like the code above does. That would be a lot cleaner and wouldn't require removing duplicates from the vector, but for me, this approach is 4-5 times faster, and since speed is what matters here, I used it. You can experiment with that yourself to see the result.
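For reference, a minimal sketch of that cleaner alternative could look like this (read_all is a hypothetical helper name; it assumes the map is already filled with the filename keys, which are unique by construction):

```cpp
#include <algorithm>
#include <execution>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Run the parallel for_each over the map's elements directly. Each
// thread only writes to its own value (the words vector), so no
// separate files vector or duplicate removal is needed.
void read_all(
    std::unordered_map<std::string, std::vector<std::string>>& file_words_map) {
    std::for_each(std::execution::par, file_words_map.begin(),
                  file_words_map.end(), [](auto& file_words) {
                      std::ifstream is(file_words.first);
                      if(is) {
                          std::copy(std::istream_iterator<std::string>(is),
                                    std::istream_iterator<std::string>{},
                                    std::back_inserter(file_words.second));
                      }
                  });
}
```

One possible reason this can be slower in practice: an unordered_map's forward iterators give the parallel algorithm less information to partition the work evenly than a vector's random-access iterators do.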
If you're using g++ or clang++, compile with:
-O3 -std=c++17 -ltbb -pthread
With MSVC you don't need to specify the libraries.
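As a concrete command line (assuming the source is saved as main.cpp; the output name is arbitrary):

```shell
# GCC/Clang: the parallel algorithms backend needs TBB and threads
g++ -O3 -std=c++17 main.cpp -o wordcount -ltbb -pthread
```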