Problem
I'm trying to scan a drive directory (recursively walk all the paths) and write the paths to a file as they are found, using fs.createWriteStream in order to keep the memory usage low. However, it doesn't work: memory usage reaches 2GB during the scan.
Expected
I was expecting fs.createWriteStream to automatically handle memory/disk usage at all times, keeping memory usage at a minimum with back-pressure.
Code
const fs = require('fs')
const walkdir = require('walkdir')

let dir = 'C:/'
let options = {
  "max_depth": 0,
  "track_inodes": true,
  "return_object": false,
  "no_return": true,
}

const wstream = fs.createWriteStream("C:/Users/USERNAME/Desktop/paths.txt")
let walker = walkdir(dir, options)

walker.on('path', (path) => {
  wstream.write(path + '\n')
})

walker.on('end', (path) => {
  wstream.end()
})
Is it because I'm not using .pipe()? I tried creating a new Stream.Readable({ read() {} }) and then, inside the 'path' event handler, pushing paths into it with readable.push(path), but that didn't really work.
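For reference, here is a hedged sketch (untested, and assuming walkdir's pause()/resume() methods, which the drain-based answer below also relies on) of how that Stream.Readable attempt could be made to respect back-pressure: pause the walker whenever push() returns false, and resume it from read(), which Node calls when the consumer is ready for more data.

const { Readable } = require('stream')

const readable = new Readable({
  // Node calls read() when the pipe destination can take more data,
  // so this is the place to resume the walk.
  read() { walker.resume() }
})

walker.on('path', (p) => {
  // push() returning false means the internal buffer is full: pause the walk.
  if (!readable.push(p + '\n')) walker.pause()
})
walker.on('end', () => readable.push(null))

readable.pipe(wstream)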
UPDATE:
Method 2:
I tried the drain method proposed in the answers, but it doesn't help much. It does reduce memory usage to ~500MB (which is still too much for a stream), but it slows the code down significantly (from seconds to minutes).
Method 3:
I also tried using readdirp. It uses even less memory (~400MB) and is faster, but I don't know how to pause it and use the drain method there to reduce the memory usage further (one possible way is sketched after the snippet):
const readdirp = require('readdirp')

let dir = 'C:/'
const wstream = fs.createWriteStream("C:/Users/USERNAME/Desktop/paths.txt")

readdirp(dir, {alwaysStat: false, type: 'files_directories'})
  .on('data', (entry) => {
    wstream.write(`${entry.fullPath}\n`)
  })
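A hedged sketch of that pause/drain combination (not benchmarked): since readdirp() returns a Readable stream, the stream itself can be paused whenever wstream.write() reports back-pressure and resumed on 'drain':

const readdirp = require('readdirp')
const fs = require('fs')

const wstream = fs.createWriteStream("C:/Users/USERNAME/Desktop/paths.txt")
const entries = readdirp('C:/', {alwaysStat: false, type: 'files_directories'})

entries.on('data', (entry) => {
  if (!wstream.write(`${entry.fullPath}\n`)) {
    entries.pause()                               // stop producing entries
    wstream.once('drain', () => entries.resume()) // continue once the buffer empties
  }
})
entries.on('end', () => wstream.end())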
Method 4:
I also tried doing this operation with a custom recursive walker. Even though it uses only ~30MB of memory, which is what I wanted, it is about 10 times slower than the readdirp method, and it is synchronous, which is undesirable (an asynchronous variant is sketched after the snippet):
const fs = require('fs')
const path = require('path')

let dir = 'C:/'

function customRecursiveWalker(dir) {
  fs.readdirSync(dir).forEach(file => {
    let fullPath = path.join(dir, file)
    // Folders
    if (fs.lstatSync(fullPath).isDirectory()) {
      fs.appendFileSync("C:/Users/USERNAME/Desktop/paths.txt", `${fullPath}\n`)
      customRecursiveWalker(fullPath)
    }
    // Files
    else {
      fs.appendFileSync("C:/Users/USERNAME/Desktop/paths.txt", `${fullPath}\n`)
    }
  })
}
customRecursiveWalker(dir)
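For comparison, a hedged sketch of an asynchronous variant of the same walker (assuming Node's fs.promises and events.once). It writes through a single stream and awaits 'drain' whenever write() returns false, so memory stays bounded without blocking the event loop:

const fs = require('fs')
const path = require('path')
const { once } = require('events')

async function walk(dir, out) {
  let entries
  try {
    entries = await fs.promises.readdir(dir, { withFileTypes: true })
  } catch (err) {
    return // e.g. EPERM on protected folders; decide what to skip in real use
  }
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name)
    // Respect back-pressure: wait for 'drain' when the write buffer is full
    if (!out.write(`${fullPath}\n`)) await once(out, 'drain')
    if (entry.isDirectory()) await walk(fullPath, out)
  }
}

const out = fs.createWriteStream("C:/Users/USERNAME/Desktop/paths.txt")
walk('C:/', out).then(() => out.end())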
Preliminary observation: you've attempted to get the results you want using multiple approaches. One complication when comparing them is that they do not all do the same work. If you run tests on a file tree that contains only regular files and no mount points, you can probably compare the approaches fairly, but once you add mount points, symbolic links, etc., you may get different memory and time statistics merely because one approach excludes files that another approach includes.
I initially attempted a solution using readdirp, but unfortunately that library appears buggy to me. Running it on my system here, I got inconsistent results: one run would output 10MB of data, another run with the same input parameters would output 22MB, then I'd get another number, and so on. I looked at the code and found that it does not respect the return value of push:
_push(entry) {
  if (this.readable) {
    this.push(entry);
  }
}
As per the documentation the push method may return a false value, in which case the Readable stream should stop producing data and wait until _read is called again. readdirp entirely ignores that part of the specification. It is crucial to pay attention to the return value of push to get proper handling of back-pressure. There are also other things that seemed questionable in that code.
So I abandoned that and worked on a proof of concept showing how it could be done. The crucial parts are:
When the push method returns false it is imperative to stop adding data to the stream. Instead, we record where we were, and stop.
We start again only when _read is called.
If you uncomment the console.log statements that print START and STOP, you'll see them printed in succession on the console. We start, produce data until Node tells us to stop, then we stop until Node tells us to start again, and so on.
const stream = require("stream");
const fs = require("fs");
const { readdir, lstat } = fs.promises;
const path = require("path");
class Walk extends stream.Readable {
constructor(root, maxDepth = Infinity) {
super();
this._maxDepth = maxDepth;
// These fields allow us to remember where we were when we have to pause our
// work.
// The path of the directory to process when we resume processing, and the
// depth of this directory.
this._curdir = [root, 1];
// The directories still to process.
this._dirs = [this._curdir];
// The list of files to process when we resume processing.
this._files = [];
// The location in `this._files` where to continue processing when we resume.
this._ix = 0;
// A flag recording whether or not the fetching of files is currently going
// on.
this._started = false;
}
async _fetch() {
// Recall where we were by loading the state in local variables.
let files = this._files;
let dirs = this._dirs;
let [dir, depth] = this._curdir;
let ix = this._ix;
while (true) {
// If we've gone past the end of the files we were processing, then
// just forget about them. This simplifies the code that follows a bit.
if (ix >= files.length) {
ix = 0;
files = [];
}
// Read directories until we have files to process.
while (!files.length) {
// We've read everything, end the stream.
if (dirs.length === 0) {
// This is how the stream API requires us to indicate the stream has
// ended.
this.push(null);
// We're no longer running.
this._started = false;
return;
}
// Here, we get the next directory to process and get the list of
// files in it.
[dir, depth] = dirs.pop();
try {
files = await readdir(dir, { withFileTypes: true });
}
catch (ex) {
// This is a proof-of-concept. In a real application, you should
// determine what exceptions you want to ignore (e.g. EPERM).
}
}
// Process each file.
for (; ix < files.length; ++ix) {
const dirent = files[ix];
// Don't include in the results those files that are not directories,
// files or symbolic links.
if (!(dirent.isFile() || dirent.isDirectory() || dirent.isSymbolicLink())) {
continue;
}
const fullPath = path.join(dir, dirent.name);
if (dirent.isDirectory() && depth < this._maxDepth) {
// Keep track that we need to walk this directory.
dirs.push([fullPath, depth + 1]);
}
// Finally, we can put the data into the stream!
if (!this.push(`${fullPath}\n`)) {
// If the push returned false, we have to stop pushing results to the
// stream until _read is called again, so we have to stop.
// Uncomment this if you want to see when the stream stops.
// console.log("STOP");
// Record where we were in our processing.
this._files = files;
// The element at ix *has* been processed, so ix + 1.
this._ix = ix + 1;
this._curdir = [dir, depth];
// We're stopping, so indicate that!
this._started = false;
return;
}
}
}
}
async _read() {
// Do not start the process that puts data on the stream over and over
// again.
if (this._started) {
return;
}
this._started = true; // Yep, we've started.
// Uncomment this if you want to see when the stream starts.
// console.log("START");
await this._fetch();
}
}
// Change the paths to something that makes sense for you.
stream.pipeline(new Walk("/home/", 5),
fs.createWriteStream("/tmp/paths3.txt"),
(err) => console.log("ended with", err));
When I run the first attempt you made with walkdir here, I get the following statistics:
Elapsed time (wall clock): 59 sec
Maximum resident set size: 2.90 GB
When I use the code I've shown above:
Elapsed time (wall clock): 35 sec
Maximum resident set size: 0.1 GB
The file tree I use for the tests produces a file listing of 792 MB.
You could exploit the return value of WritableStream.write(): it essentially tells you whether you should keep writing or not. A WritableStream has an internal threshold (the highWaterMark) above which write() starts returning false, meaning the buffer should first be flushed out. The drain event is emitted once that buffer has been flushed, i.e. once you can safely call WritableStream.write() again without piling data up in RAM. Luckily for you, walkdir lets you control the process: the emitter it returns exposes pause() (pause the walk; no more events will be emitted until resume) and resume() (resume the walk), so you can pause and resume the walk according to the state of your write stream. Try this:
let is_emitter_paused = false;

// `walker` is the emitter returned by walkdir(dir, options)
wstream.on('drain', (evt) => {
  if (is_emitter_paused) {
    walker.resume();
  }
});

walker.on('path', function(path, stat) {
  is_emitter_paused = !wstream.write(path + '\n');
  if (is_emitter_paused) {
    walker.pause();
  }
});
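A hedged side note: the threshold mentioned above is the stream's highWaterMark, which you can raise when creating the write stream to trade a little more memory for fewer pause/resume cycles, for example:

const wstream = fs.createWriteStream("C:/Users/USERNAME/Desktop/paths.txt", {
  highWaterMark: 1024 * 1024 // let write() buffer up to 1 MiB before reporting back-pressure
})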
Here's an implementation inspired by Louis's answer. I think it's a bit easier to follow, and in my minimal testing it performs about the same.
const fs = require('fs');
const path = require('path');
const stream = require('stream');
class Walker extends stream.Readable {
constructor(root = process.cwd(), maxDepth = Infinity) {
super();
// Dirs to process
this._dirs = [{ path: root, depth: 0 }];
// Max traversal depth
this._maxDepth = maxDepth;
// Files to flush
this._files = [];
}
_drain() {
while (this._files.length > 0) {
const file = this._files.pop();
if (file.isFile() || file.isDirectory() || file.isSymbolicLink()) {
const filePath = path.join(this._dir.path, file.name);
if (file.isDirectory() && this._maxDepth > this._dir.depth) {
// Add directory to be walked at a later time
this._dirs.push({ path: filePath, depth: this._dir.depth + 1 });
}
if (!this.push(`${filePath}\n`)) {
// Halt walking
return false;
}
}
}
if (this._dirs.length === 0) {
// Walking complete
this.push(null);
return false;
}
// Continue walking
return true;
}
async _step() {
try {
this._dir = this._dirs.pop();
this._files = await fs.promises.readdir(this._dir.path, { withFileTypes: true });
} catch (e) {
this.emit('error', e); // Uh oh...
}
}
async _walk() {
this.walking = true;
while (this._drain()) {
await this._step();
}
this.walking = false;
}
_read() {
if (!this.walking) {
this._walk();
}
}
}
stream.pipeline(new Walker('some/dir/path', 5),
fs.createWriteStream('output.txt'),
(err) => console.log('ended with', err));
Related
Following is the code to create a 2d matrix in javascript:
function Create2DArray(rows) {
var arr = [];
for (var i=0;i<rows;i++) {
arr[i] = [];
}
return arr;
}
now I have a couple of 2d matrices inside an array:
const matrices = []
for(let i=1; i<10000; i++){
matrices.push(new Create2DArray(i*100))
}
// I'm just mocking it here. In reality we have data available in matrix form.
I want to do operations on each matrix like this:
for(let i=0; i<matrices.length; i++){
...domeAnythingWithEachMatrix()
}
& since it will be a computationally expensive process, I would like to do it via a web worker so that the main thread is not blocked.
I'm using paralleljs for this purpose since it provides a nice API for multithreading. (Or should I use a native Web Worker? Please suggest.)
update() {
for(let i=0; i<matrices.length; i++){
var p = new Parallel(matrices[i]);
p.spawn(function (matrix) {
return doanythingOnMatrix(matrix)
// can be anything like transpose, scaling, translate etc...
}).then(function (matrix) {
// return back so that I can use those values to update the DOM, or directly update the DOM here
// suggest a best way so that I can prevent crashes and improve performance.
});
}
requestAnimationFrame(update)
}
So my question is what is the best way of doing this?
Is it ok to use a new Webworker or Parallel instance inside a for loop?
Would it cause memory issues?
Or is it ok to create a global instance of Parallel or Webworker and use it for manipulating each matrix?
Or suggest a better approach.
I'm using Parallel.js as an alternative to Web Workers.
Is it ok to use parallel.js for multithreading? (Or do I need to use the native Webworker?)
In reality, the matrices would contain position data & this data is processed by the Webworker or parallel.js instance behind the scenes and returns the processed result back to the main app, which is then used to draw items / update canvas
UPDATE NOTE
Actually, this is an animation. So it will have to be updated for each matrix during each tick.
Currently, I'm creating a new instance of Parallel inside the for loop. I fear that this is an unconventional approach, or that it could cause memory leaks. I need the best way of doing this. Please suggest.
UPDATE
This is my example:
Following our discussion in the comments, here is an attempt at using chunks. The data is processed in groups of 10 (a chunk), so that you can receive their results regularly, and we only start the animation after receiving 200 of them (buffer) to get a head start (think of it like a video stream). These values may need to be adjusted depending on how long each matrix takes to process.
That being said, you added details afterwards about the lag you get. I'm not sure whether this will solve it, or whether the problem lies in your canvas update function. It's just a path to explore:
/*
* A helper function to process data in chunks
*/
async function processInChunks({ items, processingFunc, chunkSize, bufferSize, onData, onComplete }) {
const results = [];
// For each group of {chunkSize} items
for (let i = 0; i < items.length; i += chunkSize) {
// Process this group in parallel
const p = new Parallel( items.slice(i, i + chunkSize) );
// p.map is not a real Promise, so we create one
// to be able to await it
const chunkResults = await new Promise(resolve => {
return p.map(processingFunc).then(resolve);
});
// Add to the results
results.push(...chunkResults);
// Pass the results to a callback if we're above the {bufferSize}
if (i >= bufferSize && typeof onData === 'function') {
// Flush the results
onData(results.splice(0, results.length));
}
}
// In case there was less data than the wanted {bufferSize},
// pass the results anyway
if (results.length) {
onData(results.splice(0, results.length));
}
if (typeof onComplete === 'function') {
onComplete();
}
}
/*
* Usage
*/
// For the demo, a fake matrix Array
const matrices = new Array(3000).fill(null).map((_, i) => i + 1);
const results = [];
let animationRunning = false;
// For the demo, a function which takes time to complete
function doAnythingWithMatrix(matrix) {
const start = new Date().getTime();
while (new Date().getTime() - start < 30) { /* sleep */ }
return matrix;
}
processInChunks({
items: matrices,
processingFunc: doAnythingWithMatrix,
chunkSize: 10, // Receive results after each group of 10
bufferSize: 200, // But wait for at least 200 before starting to receive them
onData: (chunkResults) => {
results.push(...chunkResults);
if (!animationRunning) { runAnimation(); }
},
onComplete: () => {
console.log('All the matrices were processed');
}
});
function runAnimation() {
animationRunning = results.length > 0;
if (animationRunning) {
updateCanvas(results.shift());
requestAnimationFrame(runAnimation);
}
}
function updateCanvas(currentMatrixResult) {
// Just for the demo, we're not really using a canvas
canvas.innerHTML = `Frame ${currentMatrixResult} out of ${matrices.length}`;
info.innerHTML = results.length;
}
<script src="https://unpkg.com/paralleljs@1.0/lib/parallel.js"></script>
<h1 id="canvas">Buffering...</h1>
<h3>(we've got a headstart of <span id="info">0</span> matrix results)</h3>
I am trying to create 150 million lines of data and write the data into a csv file so that I can insert the data into different databases with little modification.
I am using a few functions to generate seemingly random data and pushing the data into the writable stream.
The code that I have right now is unsuccessful at handling the memory issue.
After a few hours of research, I am starting to think that I should not be pushing each piece of data at the end of the for loop, because it seems that the pipe method simply cannot handle garbage collection this way.
Also, I found a few StackOverFlow answers and NodeJS docs that recommend against using push at all.
However, I am very new to NodeJS and I feel like I am blocked and do not know how to proceed from here.
If someone can provide me any guidance on how to proceed and give me an example, I would really appreciate it.
Below is a part of my code to give you a better understanding of what I am trying to achieve.
P.S. -
I have found a way to successfully handle the memory issue without using the pipe method at all -- I used the drain event -- but I had to start from scratch, and now I am curious to know if there is a simple way to handle this memory issue without completely changing this bit of code.
Also, I have been trying to avoid using any library because I feel like there should be a relatively easy tweak to make this work without using a library but please tell me if I am wrong. Thank you in advance.
// This is my target number of data
const targetDataNum = 150000000;
// Create readable stream
const readableStream = new Stream.Readable({
read() {}
});
// Create writable stream
const writableStream = fs.createWriteStream('./database/RDBMS/test.csv');
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');
// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
const id = i;
const body = lorem.paragraph(1);
const date = randomDate(new Date(2014, 0, 1), new Date());
const dp = randomNumber(1, 1000);
const data = `${id},${body},${date},${dp}\n`;
readableStream.push(data, 'utf8');
};
// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);
// End the stream
readableStream.push(null);
Since you're new to streams, maybe start with an easier abstraction: generators. Generators generate data only when it is consumed (just like Streams should), but they don't have buffering and complicated constructors and methods.
This is just your for loop, moved into a generator function:
function * generateData(targetDataNum) {
for (var i = 1; i <= targetDataNum; i += 1) {
const id = i;
const body = lorem.paragraph(1);
const date = randomDate(new Date(2014, 0, 1), new Date());
const dp = randomNumber(1, 1000);
yield `${id},${body},${date},${dp}\n`;
}
}
In Node 12, you can create a Readable stream directly from any iterable, including generators and async generators:
const { Readable } = require('stream')

const stream = Readable.from(generateData(targetDataNum), {encoding: 'utf8'})
stream.pipe(writableStream)
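Similarly (a hedged sketch, with a stand-in for any real per-row async work), an async generator works with Readable.from() too, which is handy if a row needs an awaited lookup before it can be emitted:

async function * generateDataAsync(targetDataNum) {
  for (var i = 1; i <= targetDataNum; i += 1) {
    const body = await Promise.resolve(lorem.paragraph(1)) // stand-in for real async work
    yield `${i},${body}\n`
  }
}

Readable.from(generateDataAsync(targetDataNum), {encoding: 'utf8'}).pipe(writableStream)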
I suggest trying a solution like the following:
const { Readable } = require('readable-stream');
class CustomReadable extends Readable {
constructor(max, options = {}) {
super(options);
this.targetDataNum = max;
this.i = 1;
}
_read(size) {
if (this.i <= this.targetDataNum) {
// your code to build the csv content
this.push(data, 'utf8');
return;
}
this.push(null);
}
}
const rs = new CustomReadable(150000000);
rs.pipe(ws);
Just complete it with your portion of code to fill the CSV and create the writable stream.
With this solution you leave calling push to the internal _read stream method, which is invoked repeatedly until this.push(null) is called. Probably, before, you were filling the internal stream buffer too fast by calling push manually in a loop, which caused the out-of-memory error.
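For illustration, here is a hedged completion of that _read body, reusing the lorem, randomDate and randomNumber helpers from the question (assumed to be in scope):

_read(size) {
  if (this.i <= this.targetDataNum) {
    const date = randomDate(new Date(2014, 0, 1), new Date());
    const row = `${this.i},${lorem.paragraph(1)},${date},${randomNumber(1, 1000)}\n`;
    this.i += 1;
    this.push(row, 'utf8'); // one row per _read call; push(null) below ends the stream
    return;
  }
  this.push(null);
}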
Try piping to the WritableStream before you start pumping data into the ReadableStream, and yield before you write the next chunk.
...
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');
// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);
// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
const id = i;
const body = lorem.paragraph(1);
const date = randomDate(new Date(2014, 0, 1), new Date());
const dp = randomNumber(1, 1000);
const data = `${id},${body},${date},${dp}\n`;
readableStream.push(data, 'utf8');
// somehow YIELD for the STREAM to drain out.
};
...
The entire Stream implementation of Node.js relies on the fact that the wire is slow and that the CPU can actually have a downtime before the next chunk of data comes in from the stream source or till the next chunk of data has been written to the stream destination.
In the current implementation, since the for-loop has booked up the CPU, there is no downtime for the actual piping of the data to the write stream. You will be able to catch this if you run watch cat test.csv, whose output will not change while the loop is running.
As (I am sure) you know, pipe helps in guaranteeing that the data you are working with is buffered in memory only in chunks and not as a whole. But that guarantee only holds true if the CPU gets enough downtime to actually drain the data.
Having said all that, I wrapped your entire code into an async IIFE and ran it with an await on a setImmediate-based Promise, which ensures that I yield for the stream to drain the data.
let fs = require('fs');
let Stream = require('stream');
(async function () {
// This is my target number of data
const targetDataNum = 150000000;
// Create readable stream
const readableStream = new Stream.Readable({
read() { }
});
// Create writable stream
const writableStream = fs.createWriteStream('./test.csv');
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');
// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);
// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
console.log(`Pushing ${i}`);
const id = i;
const body = `body${i}`;
const date = `date${i}`;
const dp = `dp${i}`;
const data = `${id},${body},${date},${dp}\n`;
readableStream.push(data, 'utf8');
await new Promise(resolve => setImmediate(resolve));
};
// End the stream
readableStream.push(null);
})();
This is what top looks like pretty much the whole time I am running this.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15213 binaek ** ** ****** ***** ***** * ***.* 0.5 *:**.** node
Notice the %MEM which stays more-or-less static.
You were running out of memory because you were pre-generating all the data in memory before you wrote any of it to disk. Instead, you need a strategy to write it as you generate it so you don't have to hold large amounts of data in memory.
It does not seem like you need .pipe() here because you control the generation of the data (it's not coming from some random readStream).
So, you can just generate the data and immediately write it and handle the drain event when needed. Here's a runnable example (this creates a very large file):
const {once} = require('events');
const fs = require('fs');
// This is my target number of data
const targetDataNum = 150000000;
async function run() {
// Create writable stream
const writableStream = fs.createWriteStream('./test.csv');
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');
// Then, push a number of data to the readable stream (150M in this case)
for (let i = 1; i <= targetDataNum; i += 1) {
const id = i;
const body = lorem.paragraph(1);
const date = randomDate(new Date(2014, 0, 1), new Date());
const dp = randomNumber(1, 1000);
const data = `${id},${body},${date},${dp}\n`;
const canWriteMore = writableStream.write(data);
if (!canWriteMore) {
// wait for stream to be ready for more writing
await once(writableStream, "drain");
}
}
writableStream.end();
}
run().then(() => {
console.log("done");
}).catch(err => {
console.log("got rejection: ", err);
});
// placeholders for the functions that were being used
function randomDate(low, high) {
let rand = randomNumber(low.getTime(), high.getTime());
return new Date(rand);
}
function randomNumber(low, high) {
return Math.floor(Math.random() * (high - low)) + low;
}
const lorem = {
paragraph: function() {
return "random paragraph";
}
}
I have a Nodejs server that is being used to create about 1200 pdf forms that can be downloaded by a client later. They are being created using pdfmake and then output to a server folder. When I execute the code as written, Node.js runs out of memory at about 350 documents. I know there must be a better way to save, but I cannot seem to figure it out.
The below method is being called by a map of an array of data from a Mongoose query. The relevant code for creating and saving the form is as follows:
const whichForm = certList => {
certList.map(cert => {
if (cert.Cert_Details !== null) {
switch (cert.GWMA) {
case 'OA':
case 'PC':
// Don't provide reports for Feedlots
if (cert.Cert_Details.cert_type !== null) {
if (cert.Cert_Details.cert_type === 'Irrigation') {
createOAReport(cert);
}
}
break;
case 'FA':
// Don't provide reports for Feedlots
if (cert.Cert_Details.cert_type === 'Irrigation') {
createFAReport(cert);
}
break;
}
}
}
}
Different File:
const PdfPrinter = require('pdfmake/src/printer');
const fs = require('fs');
const path = require('path');
const createOAReport = data => {
console.log('PC or OA Cert ', data.Cert_ID);
// console.log(data);
let all_meters_maint = [];
data.Flowmeters.map(flowmeter => {
// Each Flow meter
// console.log(`Inside Flowmeter ${flowmeter}`);
if (flowmeter.Active === true) {
let fm_maint = [];
fm_maint.push({
text: `Meter Serial Number: ${flowmeter.Meter_Details.Serial_num}`
});
fm_maint.push({
text: `Type of Meter: ${flowmeter.Meter_Details.Manufacturer}`
});
fm_maint.push({ text: `Units: ${flowmeter.Meter_Details.units}`});
fm_maint.push({ text: `Factor: ${flowmeter.Meter_Details.factor}`});
all_meters_maint.push(fm_maint);
}
}); // end of data.Flowmeters.map
docDefinition.content.push({
style: 'tableExample',
table: {
widths: [200, 200, '*', '*'],
body: all_meters_maint
},
layout: 'noBorders'
});
const fonts = {
  Roboto: {
    normal: path.join(__dirname, '../', '/fonts/Roboto-Regular.ttf'),
    bold: path.join(__dirname, '../', '/fonts/Roboto-Medium.ttf'),
    italics: path.join(__dirname, '../', '/fonts/Roboto-Italic.ttf'),
    bolditalics: path.join(__dirname, '../', '/fonts/Roboto-MediumItalic.ttf')
  }
};
const printer = new PdfPrinter(fonts);
const pdfDoc = printer.createPdfKitDocument(docDefinition);
// Build file path
const fullfilePath = path.join(
__dirname,
'../',
'/public/pdffiles/',
`${data.Cert_ID}.pdf`
);
pdfDoc.pipe(fs.createWriteStream(fullfilePath));
pdfDoc.end();
};
Is there a different way to save the files that don't force them to be in a stream and will not be kept in memory?
Before we get to the answer, I'm making one huge assumption based on the information in the question. The question states create about 1200 pdf forms. Which means I'm assuming in the function whichForm the parameter certList is an array of 1200 items. Or should I say 1200 items that will call the createOAReport method. You get the idea. I'm assuming the problem is that we are calling that method to create the PDFs 1200 times within that Array.map method. Which makes sense I believe given the question and context of the code.
On to the answer. The major problem is you aren't just trying to create 1200 pdfs. You are trying to create 1200 pdfs asynchronously, which of course puts a strain on the system trying to do all of that work all at once. Maybe even more so on a single thread system like Node.js.
The easy hacky solution is to just increase the memory of Node.js. By using the --max-old-space-size flag and setting the memory size in MB when running your node command. You can find more information about this at this tutorial. But the short version is a command like node --max-old-space-size=8192 main.js. That would increase the memory size of Node.js to 8192 MB or 8 GB.
Few problems with that method. Mainly it's not super scalable. What if someday you have 5000 pdfs you want to create? You'd have to increase that memory size again. And maybe increase the specs on the machine it's being run on.
The second solution, which you could actually probably do with the first solution, is to make this process not asynchronous. Depending on many factors and how optimized the current system is, chances are this will increase the amount of time it takes to create all of these PDFs.
This process is kinda a two step process to code it in. First is to setup your createOAReport function to return a promise to indicate when it's done. The second step is to change your whichForm function to limit how many items can be running asynchronously at any single point in time.
You will have to of course play around with the system to determine how many items you want to run at one time without overloading the system. Fine-tuning that number is not something I focused on, and of course you could probably increase that number by increasing the memory you give Node.js as well.
And of course, there are TONS of different ways to do this. I have a few ideas of methods that are better than the one I'm going to show here, but are a lot more complicated. The foundational idea of limiting how many items are running at once remains the same tho. You can optimize it to fit your needs.
I've developed systems like this before, but I don't think the way I've done it is the best or cleanest way to do it. But at the end of this question I've attached some sample code for your example trying to illustrate my point.
const _ = require('lodash');
const MAX_RUNNING_PROMISES = 10; // You will have to play with this number to get it right for your needs
const whichForm = async certList => {
// If certList is ["a", "b", "c", "d"]
// And we run the following function with MAX_RUNNING_PROMISES = 2
// array would equal [["a", "b"], ["c", "d"]]
certList = _.chunk(certList, MAX_RUNNING_PROMISES);
// Of course you can use something other than Lodash here, but I chose it because it's the first thing that came to mind
for (let i = 0; i < certList.length; i++) {
const certArray = certList[i];
// The following line will wait until all the promises have been resolved or completed before moving on
await Promise.all(certArray.map(cert => {
if (cert.Cert_Details !== null) {
switch (cert.GWMA) {
case 'OA':
case 'PC':
// Don't provide reports for Feedlots
if (cert.Cert_Details.cert_type !== null) {
if (cert.Cert_Details.cert_type === 'Irrigation') {
return createOAReport(cert);
}
}
break;
case 'FA':
// Don't provide reports for Feedlots
if (cert.Cert_Details.cert_type === 'Irrigation') {
return createFAReport(cert);
}
break;
}
}
}));
}
}
Then, for your other file, we just have to convert it to return a promise.
const PdfPrinter = require('pdfmake/src/printer');
const fs = require('fs');
const createOAReport = data => {
return new Promise((resolve, reject) => {
console.log('PC or OA Cert ', data.Cert_ID);
// console.log(data);
let all_meters_maint = [];
const flowmeter = data.Flowmeters[0];
if (flowmeter.Active === true) {
let fm_maint = [];
fm_maint.push({
text: `Meter Serial Number: ${flowmeter.Meter_Details.Serial_num}`
});
fm_maint.push({
text: `Type of Meter: ${flowmeter.Meter_Details.Manufacturer}`
});
fm_maint.push({
text: `Units: ${flowmeter.Meter_Details.units}`
});
fm_maint.push({
text: `Factor: ${flowmeter.Meter_Details.factor}`
});
all_meters_maint.push(fm_maint);
}
docDefinition.content.push({
style: 'tableExample',
table: {
widths: [200, 200, '*', '*'],
body: all_meters_maint
},
layout: 'noBorders'
});
const fonts = {
Roboto: {
normal: path.join(__dirname, '../', '/fonts/Roboto-Regular.ttf'),
bold: path.join(__dirname, '../', '/fonts/Roboto-Medium.ttf'),
italics: path.join(__dirname, '../', '/fonts/Roboto-Italic.ttf'),
bolditalics: path.join(__dirname, '../', '/fonts/Roboto-MediumItalic.ttf')
}
};
const printer = new PdfPrinter(fonts);
const pdfDoc = printer.createPdfKitDocument(docDefinition);
// Build file path
const fullfilePath = path.join(
__dirname,
'../',
'/public/pdffiles/',
`${data.Cert_ID}.pdf`
);
const writeStream = fs.createWriteStream(fullfilePath);
// Resolve the promise once the PDF has been fully written to disk
writeStream.on('finish', resolve);
pdfDoc.pipe(writeStream);
pdfDoc.end();
});
};
I just realized, after getting really far into this answer, that my original assumption is incorrect, since some of those PDFs might be created within the second function via the data.Flowmeters.map call. So although I'm not going to demonstrate it, you will have to apply the same ideas I have given throughout this answer to that system as well. For now, I have removed that section and am just using the first item in that array, since it's just an example.
You might want to restructure your code once you have an idea of this and just have one function that handles creating the PDF, and not have as many .map method calls all over the place. Abstract the .map methods out and keep it separate from the PDF creation process. That way it'd be easier to limit how many PDFs are being created at a single time.
It'd also be a good idea to add in some error handling around all of these processes.
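As a hedged sketch of that error handling (assuming Node 12.9+ for Promise.allSettled), the await Promise.all(...) inside whichForm's chunk loop could be swapped for allSettled, so one failed certificate doesn't abort the whole chunk; the switch logic is elided here for brevity:

// Inside the for loop of whichForm, in place of the await Promise.all(...) call:
const settled = await Promise.allSettled(certArray.map(cert => createOAReport(cert)));
settled
  .filter(result => result.status === 'rejected')
  .forEach(result => console.error('PDF generation failed:', result.reason));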
NOTE I didn't actually test this code at all, so there might be some bugs with it. But the overall ideas and principals still apply.
Right now I have a simple.test.js file that generates calls to test based on simplified call/response files (so we don't need to write a .test.js for each of these simplified cases). For reference I'll include the file here:
'use strict';
const api = require('./api');
const SCRIPT_NAME_KEY = Symbol('script name key'),
fs = require('fs'),
path = require('path');
const generateTests = (dir) => {
const relPath = path.relative(__dirname, dir);
let query, resultScripts = [], resultSqls = [];
for (let entry of fs.readdirSync(dir)) {
if (entry[0] === '-')
continue;
let fqEntry = path.join(dir, entry);
if (fs.statSync(fqEntry).isDirectory()) {
generateTests(fqEntry);
continue;
}
if (entry === 'query.json')
query = fqEntry;
else if (entry.endsWith('.sql'))
resultSqls.push(fqEntry);
else if (entry.endsWith('.js') && !entry.endsWith('.test.js'))
resultScripts.push(fqEntry);
}
if (!query && resultScripts.length === 0 && resultSqls.length === 0)
return;
if (!query)
throw `${relPath} contains result script(s)/sql(s) but no query.json`;
if (resultScripts.length === 0 && resultSqls.length === 0)
throw `${relPath} contains a query.json file but no result script(s)/sql(s)`;
try {
query = require(query);
} catch (ex) {
throw `${relPath} query.json could not be parsed`;
}
for (let x = 0; x < resultScripts.length; x++) {
let scriptName = path.basename(resultScripts[x]);
console.log('scriptName', scriptName);
try {
resultScripts[x] = require(resultScripts[x]);
} catch (ex) {
throw `${relPath} result script ${scriptName} could not be parsed`;
}
resultScripts[x][SCRIPT_NAME_KEY] = scriptName;
}
test(`ST:${relPath}`, () => api.getSqls(query).then(resp => {
if (resultScripts.length === 0) {
expect(resp.err).toBeFalsy();
expect(resp.data).toBeAllValidSql();
} else {
for (const script of resultScripts)
expect({ n: script[SCRIPT_NAME_KEY], r: script(resp, script[SCRIPT_NAME_KEY]) }).toPass();
}
for (const sql of resultSqls)
expect(resp.data).toIncludeSql(fs.readFileSync(sql, 'utf8'));
}));
};
expect.extend({
toPass(actual) {
const pass = actual.r === void 0 || actual.r === null || !!actual.r.pass;
return {
pass: pass,
message: pass ? null : () => actual.r.message || `${actual.n} check failed!`
}
}
});
generateTests(path.join(__dirname, 'SimpleTests'));
This works really great! It runs immediately when the .test.js file is loaded by Jest and generates a test for each folder containing the valid files.
However, I now have a need to generate a test per record in a database. From what I can tell most of the available modules that provide DB functionality work on the premise of promises (and reasonably so!). So now I need to wait for a query to come back BEFORE I generate the tests.
This is what I'm trying:
'use strict';
const api = require('./api');
api.getAllReportsThroughSideChannel().then((reports) => {
for (const report of reports) {
test(`${report.Name} (${report.Id} - ${report.OwnerUsername})`, () => {
// ...
});
}
});
However when I do this I get:
FAIL ./reports.test.js
● Test suite failed to run
Your test suite must contain at least one test.
at ../node_modules/jest/node_modules/jest-cli/build/TestScheduler.js:256:22
As one might expect, the promise gets created but doesn't get a chance to actually trigger the generation of tests until after Jest has already expected to receive a list of tests from the file.
One thing I considered was to have a test that itself is a promise that checks out all the reports, but then it would fail on the first expect that results in a failure, and we want to get a list of all reports that fail tests. What we really want is a separate test for each.
I guess ultimately the question I want to know is if it is possible for the generation of tests to be done via a promise (rather then the tests themselves).
There is a TON of resources for Jest out there, after searching I didn't find anything that applies to my question, so apologies if I just missed it somehow.
Ok, after a few days of looking through docs and code, it's looking more and more like this simply cannot be done in Jest (or, probably more correctly, it goes counter to Jest's testing philosophies).
As such, I have created a step prior to running the Jest runtime proper that simply downloads the results of the query to a file; then I use the file to synchronously generate the test cases.
I would LOVE it if someone could propose a better solution, though.
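For anyone landing here, a hedged sketch of that pre-step (the file names are assumptions; globalSetup is a standard Jest config option): dump the query results to a JSON file before the test files are collected, then generate the tests synchronously from that file.

// jest.config.js
// module.exports = { globalSetup: './fetch-reports.js' };

// fetch-reports.js
const fs = require('fs');
const api = require('./api');
module.exports = async () => {
  const reports = await api.getAllReportsThroughSideChannel();
  fs.writeFileSync('./reports.json', JSON.stringify(reports));
};

// reports.test.js
const reports = require('./reports.json');
for (const report of reports) {
  test(`${report.Name} (${report.Id} - ${report.OwnerUsername})`, () => {
    // ...
  });
}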
Dear Javascript Guru's:
I have the following requirements:
Process a large array in batches of 1000 (or any arbitrary size).
When each batch is processed, update the UI to show our progress.
When all batches have been processed, continue with the next step.
For example:
function process_array(batch_size) {
var da_len = data_array.length;
var idx = 0;
function process_batch() {
var idx_end = Math.min(da_len, idx + batch_size);
while (idx < idx_end) {
// do the voodoo we need to do
}
}
// This loop kills the browser ...
while (idx < da_len) {
setTimeout(process_batch, 10);
// Show some progress (no luck) ...
show_progress(idx);
}
}
// Process array ...
process_array(1000);
// Continue with next task ...
// BUT NOT UNTIL WE HAVE FINISHED PROCESSING THE ARRAY!!!
Since I am new to javascript, I discovered that everything is done on a single thread and as such, one needs to get a little creative with regard to processing and updating the UI. I have found some examples using recursive setTimeout calls, (one key difference is I have to wait until the array has been fully processed before continuing), but I cannot seem to get things working as described above.
Also -- I am in need of a "pure" javascript solution -- no third party libraries or the use of web workers (that are not fully supported).
Any (and all) guidance would be appreciated.
Thanks in advance.
You can make a stream from the array with stream-array and use batch-stream to group it into batches, so that you can stream the data to the UI in batches.
In JavaScript, when executing scripts in an HTML page, the page becomes unresponsive until the script is finished. This is because JavaScript is single-threaded.
You could consider using a web worker, which runs in the background, independently of other scripts, without affecting the performance of the page.
In this case the user can continue to do whatever they want in the UI.
You can send and receive messages from the web worker.
More info on Web Worker here.
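A minimal hedged sketch of that message flow (the worker file name and the updateUI/processItem helpers are placeholders):

// main.js
const worker = new Worker('worker.js');
worker.onmessage = (e) => updateUI(e.data);    // results arrive asynchronously
worker.postMessage(data_array.slice(0, 1000)); // hand a batch to the worker

// worker.js
onmessage = (e) => {
  const result = e.data.map(processItem);      // heavy work happens off the main thread
  postMessage(result);
};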
So part of the magic of recursion is really thinking about the things that you need to pass in, to make it work.
And in JS (and other functional languages) that frequently involves functions.
function processBatch (remaining, processed, batchSize,
transform, onComplete, onProgress) {
if (!remaining.length) {
return onComplete(processed);
}
const batch = remaining.slice(0, batchSize);
const tail = remaining.slice(batchSize);
const totalProcessed = processed.concat(batch.map(transform));
return scheduleBatch(tail, totalProcessed, batchSize,
transform, onComplete, onProgress);
}
function scheduleBatch (remaining, processed, batchSize,
transform, onComplete, onProgress) {
onProgress(processed, remaining, batchSize);
setTimeout(() => processBatch(remaining, processed, batchSize,
transform, onComplete, onProgress));
}
const noop = () => {};
const identity = x => x;
function processArray (array, batchSize, transform, onComplete, onProgress) {
scheduleBatch(
array,
[],
batchSize,
transform || identity,
onComplete || noop,
onProgress || noop
);
}
This can be simplified extremely, and the reality is that I'm just having a little fun here, but if you follow the trail, you should see recursion in a closed system that works with an arbitrary transform, on arbitrary objects, of arbitrary array lengths, with arbitrary code-execution when complete, and when each batch is completed and scheduling the next run.
To be honest, you could even swap this implementation out for a custom scheduler, by changing 3 lines of code or so, and then you could log whatever you wanted...
const numbers = [1, 2, 3, 4, 5, 6];
const batchSize = 2;
const showWhenDone = numbers => console.log(`Done with: ${numbers}`);
const showProgress = (processed, remaining) =>
  console.log(`${processed.length} done; ${remaining.length} to go`);
const quintuple = x => x * 5;
processArray(
numbers,
batchSize,
quintuple,
showWhenDone,
showProgress
);
// 0 done; 6 to go
// 2 done; 4 to go
// 4 done; 2 to go
// Done with: 5, 10, 15, 20, 25, 30
Overkill? Oh yes. But worth familiarizing yourself with the concepts, if you're going to spend some time in the language.
Thank-you all for your comments and suggestions.
Below is the code that I settled on. It works for any task (in my case, processing an array) and gives the browser time to update the UI if need be.
The do_task function starts an anonymous function via setInterval that alternates between two steps -- processing the array in batches and showing the progress -- and this continues until all elements in the array have been processed.
function do_task() {
  const k_task_process_array = 1;
  const k_task_show_progress = 2;
  var working = false;
  var task_step = k_task_process_array;
  var batch_size = 1000;
  var idx = 0;
  var idx_end = 0;
  var da_len = data_array.length;

  // Start the task ...
  var task_id = setInterval(function () {
    if (!working) {
      working = true;
      switch (task_step) {
        case k_task_process_array:
          idx_end = Math.min(idx + batch_size, da_len);
          while (idx < idx_end) {
            // do the voodoo we need to do ...
            idx++;
          }
          task_step = k_task_show_progress;
          working = false;
          break;
        default:
          // Show progress here ...
          // Continue processing array ...
          task_step = k_task_process_array;
          working = false;
      }
      // Check if done ...
      if (idx >= da_len) {
        clearInterval(task_id);
        task_id = null;
      }
      working = false;
    }
  }, 1);
}
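A hedged usage note: to satisfy the "continue only when the array is finished" requirement, the same code can be wrapped in a Promise and resolved right where the interval is cleared, for example:

function do_task_async() {
  return new Promise(function (resolve) {
    // ... same body as do_task(), but call resolve() immediately after
    // clearInterval(task_id);
  });
}

do_task_async().then(function () {
  // Continue with next task ...
});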
}