How I Optimized the Process of Reading Files in Node.js

In my application, I work with CSV files in Node.js that can be very large (sometimes over 200 MB). I have two main APIs: one fetches rows from a CSV file (including counting all rows), and another lists all files associated with a user. While the file listing API responds quickly (around 870 ms), the API that processes CSV files takes about 6 seconds. Below, I explain the code I wrote, how it works, and the additional steps I took to optimize the process. I also share some extra practice functionality to help you experiment with these optimizations.

The Code Explained

Fetching and Processing CSV Files

The first API is responsible for fetching CSV file data as an array. It performs two main tasks:

  • Counting the total number of rows.
  • Fetching a subset of rows based on pagination parameters.

Main Function: getResultAsArray

async getResultAsArray(slotId: number, page = 1, size = 10) {
  const fileInfo = await this.getResultFile(slotId);

  if (!fileInfo)
    return {
      length: 0,
      data: [],
    };

  const skip = (page - 1 < 1 ? 0 : page - 1) * size;
  const length = await this.csvService.getFileRowsCount(fileInfo.filePath);

  if (skip >= length) throw new BadRequestException('Index Out of Range');

  const res = await this.csvService.getFileManyRows(fileInfo.filePath, {
    withHeaders: true,
    size,
    skip,
  });

  return {
    length: length,
    data: res,
  };
}

How It Works:

  • File Information: It retrieves the file info using getResultFile().
  • Row Counting: It calculates how many rows are in the CSV using getFileRowsCount().
  • Pagination: It calculates the number of rows to skip based on the current page and size.
  • Row Fetching: It fetches the required rows with getFileManyRows() and returns the total row count along with the data.
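
For context, here is roughly how a call to this method looks from a controller; the service property name and slot id below are made up, purely for illustration:

// Hypothetical usage: page 3 with size 10 skips 20 rows and returns rows 21-30.
const { length, data } = await this.resultsService.getResultAsArray(42, 3, 10);
// `length` is the total row count; `data` holds at most `size` rows.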

Counting Rows with Streams: getFileRowsCount

async getFileRowsCount(path: string) {
  const stream = fs.createReadStream(path);
  return this.getRowsCountByStream(stream);
}

private getRowsCountByStream(stream: Readable) {
  return new Promise<number>((resolve, reject) => {
    let count = 0;

    stream
      .pipe(csv())
      .on('data', () => {
        ++count;
      })
      .on('end', () => {
        resolve(count);
      })
      .on('error', (error) => reject(error));
  });
}

How It Works:

  • Stream Creation: A readable stream is created from the file.
  • CSV Parsing: The stream is piped through a CSV parser, and each data event increments the row counter.
  • Completion: When the stream ends, the total count is resolved.
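
Parsing every row just to count it is the expensive part. A rough alternative I considered is counting newline bytes directly, which skips the CSV parser entirely; this is only a sketch, and it miscounts if fields contain quoted line breaks or the file has no trailing newline:

import * as fs from 'fs';

// Approximate row count by counting LF (0x0a) bytes in raw chunks.
// Subtract 1 for the header row if the file has one.
function countNewlines(path: string): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    let count = 0;
    fs.createReadStream(path)
      .on('data', (chunk: Buffer) => {
        for (let i = 0; i < chunk.length; i++) {
          if (chunk[i] === 0x0a) count++;
        }
      })
      .on('end', () => resolve(count))
      .on('error', reject);
  });
}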

Fetching Specific Rows: getFileManyRows

async getFileManyRows<T>(
  path: string,
  options: CsvOptions & {
    skip: number;
    size?: number;
  } = { skip: 0 },
) {
  return new Promise<T[]>((resolve, reject) => {
    if (options.size < 0 || options.skip < 0) {
      reject(new Error('Index is out of range'));
      return;
    }

    let i = 0;
    const result: T[] = [];

    const readStream = fs.createReadStream(path);

    readStream
      .pipe(
        csv({
          headers: options?.withHeaders ? undefined : false,
        }),
      )
      .on('data', (data) => {
        if (
          i >= options.skip &&
          (!options.size || result.length < options.size)
        )
          result.push(options?.withHeaders ? data : Object.values(data));

        if (options.size && result.length == options.size) {
          readStream.destroy();
          resolve(result);
        }

        ++i;
      })
      .on('end', () => {
        resolve(result);
      })
      .on('error', (error) => reject(error));
  });
}

How It Works:

  • Skipping and Limiting Rows: The code uses a counter i to skip the first options.skip rows. Then it pushes rows into the result until it reaches the specified size.
  • Early Termination: Once the desired number of rows is collected, it destroys the stream to prevent further unnecessary processing.
  • Promise Handling: The promise resolves when the stream ends or if an error occurs.
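
As a concrete example of the options, a call like the one below (with made-up values) skips the first 20 data rows and returns the next 10, keyed by the header names:

// Hypothetical call: rows 21-30 as objects keyed by column headers.
const rows = await this.csvService.getFileManyRows<Record<string, string>>(
  fileInfo.filePath,
  { withHeaders: true, skip: 20, size: 10 },
);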

Fetching User Files

The second API fetches all CSV files associated with a user by reading directories and file stats:

async getResultsFromStorage(userId: number) {
  const slots = await this.userRepository
    .createQueryBuilder('user')
    .leftJoin('user.slots', 'slot')
    .select([
      'slot.id as "id"',
      'slot.name as "name"',
      'slot.s3Folder as "s3Folder"',
    ])
    .where('user.id = :userId', { userId })
    .orderBy('slot.id', 'DESC')
    .getRawMany();

  const filteredList: StorageFile[] = [];

  const userFolder = await this.getUserS3Folder(userId);

  for (const slot of slots) {
    try {
      const slotFolderPath = this.slotsService.getSlotFolderPath(
        userFolder,
        slot.s3Folder,
      );
      const files = await fsPromises.readdir(slotFolderPath);

      for (const file of files) {
        if (extname(file) === '.csv') {
          const filePath = join(slotFolderPath, file);
          const stats = await fsPromises.stat(filePath);

          if (stats.isFile())
            filteredList.push({
              name: file,
              path: `${userFolder}/${slot.s3Folder}/${file}`,
              size: stats.size,
              mtime: stats.mtime,
              extname: '.csv',
              slotId: slot.id,
              slotName: slot.name,
            });
        }
      }
    } catch (err) {}
  }

  return filteredList;
}

How It Works:

  • Querying Slots: It fetches user slots from the database.
  • Directory Reading: For each slot, it reads the directory where CSV files are stored.
  • File Filtering: It filters for .csv files, checks their stats, and adds them to the result if they are valid files.
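
The StorageFile type itself isn't shown above; reconstructed from the fields pushed into filteredList, it looks roughly like this:

// Reconstructed from the fields used above; not the exact declaration.
interface StorageFile {
  name: string;
  path: string;
  size: number;
  mtime: Date;
  extname: string;
  slotId: number;
  slotName: string;
}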

Optimizing the File Reading Process

The CSV processing API can be slow when dealing with very large files because it reads through the entire file to count rows and then processes rows for pagination. Here are some strategies I used to optimize the process:

Caching Row Counts

If the CSV file rarely changes, cache the row count after the first calculation. This avoids reading the entire file on subsequent requests.

private rowCountCache: Map<string, number> = new Map();

async getFileRowsCount(path: string) {
  if (this.rowCountCache.has(path)) {
    return this.rowCountCache.get(path)!;
  }
  const stream = fs.createReadStream(path);
  const count = await this.getRowsCountByStream(stream);
  this.rowCountCache.set(path, count);
  return count;
}

Parallel Processing of Files

For the API that lists files, I optimized directory and file-stat retrieval by running these tasks in parallel: instead of sequential loops, Promise.all handles multiple asynchronous file system operations concurrently.

async getResultsFromStorage(userId: number) {
  const slots = await this.userRepository
    .createQueryBuilder('user')
    .leftJoin('user.slots', 'slot')
    .select([
      'slot.id as "id"',
      'slot.name as "name"',
      'slot.s3Folder as "s3Folder"',
    ])
    .where('user.id = :userId', { userId })
    .orderBy('slot.id', 'DESC')
    .getRawMany();

  const userFolder = await this.getUserS3Folder(userId);

  const fileFetchPromises = slots.map(async (slot) => {
    try {
      const slotFolderPath = this.slotsService.getSlotFolderPath(
        userFolder,
        slot.s3Folder,
      );
      const files = await fsPromises.readdir(slotFolderPath);
      const csvFiles = files.filter((file) => extname(file) === '.csv');

      const statsPromises = csvFiles.map(async (file) => {
        const filePath = join(slotFolderPath, file);
        const stats = await fsPromises.stat(filePath);
        if (stats.isFile()) {
          return {
            name: file,
            path: `${userFolder}/${slot.s3Folder}/${file}`,
            size: stats.size,
            mtime: stats.mtime,
            extname: '.csv',
            slotId: slot.id,
            slotName: slot.name,
          };
        }
      });
      return Promise.all(statsPromises);
    } catch (err) {
      return [];
    }
  });

  const results = await Promise.all(fileFetchPromises);
  // Flatten the results array
  return results.flat().filter(Boolean);
}
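
One caveat: Promise.all starts every readdir and stat call at once, which can open a lot of file handles for users with many slots. A small batching helper (my own sketch, not part of the original service) caps the concurrency:

// Process items in fixed-size batches to bound concurrent fs operations.
async function mapInBatches<T, R>(
  items: T[],
  batchSize: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const out: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    out.push(...(await Promise.all(batch.map(fn))));
  }
  return out;
}

Swapping the slots.map(...) plus Promise.all pair for something like mapInBatches(slots, 5, ...) keeps most of the parallelism benefit without unbounded fan-out.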

Stream Backpressure and Memory Management

When processing huge files, be mindful of memory usage. The code already destroys the stream once the desired number of rows is collected. You can further optimize by:

  • Monitoring backpressure.
  • Adjusting the highWaterMark parameter in fs.createReadStream() if necessary.
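
For example, raising highWaterMark from the default 64 KiB lets each read pull larger chunks from disk inside getFileRowsCount or getFileManyRows; the 1 MiB value below is just a starting point to benchmark, not a recommendation:

// Larger read buffer: fewer, bigger chunks per 'data' event (1 MiB here).
const stream = fs.createReadStream(path, { highWaterMark: 1024 * 1024 });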

Additional Practice Functionality

To further experiment with optimization, I added the following practice functionalities:

Retry Mechanism for File Reading

Sometimes reading large files can fail due to transient errors. Adding a simple retry mechanism can make the system more resilient:

async function readFileWithRetry<T>(readFn: () => Promise<T>, retries = 3): Promise<T> {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await readFn();
    } catch (error) {
      if (attempt === retries - 1) {
        throw error;
      }
      // Optionally, add a delay here before retrying.
    }
  }
  throw new Error('Failed to read file after retries');
}

You can wrap calls like fsPromises.stat(filePath) or your CSV reading functions with readFileWithRetry to see how it performs under flaky conditions.
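
For instance, wrapping the stat call from the listing code above (purely illustrative):

// Retry a flaky stat call up to 3 times before giving up.
const stats = await readFileWithRetry(() => fsPromises.stat(filePath));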

Performance Logging Middleware

For further debugging and performance tuning, I implemented middleware to log the time taken by each file read operation:

import { Request, Response, NextFunction } from 'express';

export function performanceLogger(req: Request, res: Response, next: NextFunction) {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const end = process.hrtime.bigint();
    const durationMs = Number(end - start) / 1e6;
    console.log(`Request processed in ${durationMs.toFixed(2)}ms`);
  });
  next();
}

Apply this middleware to your CSV endpoints to monitor performance improvements as you optimize.
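
In a NestJS application the middleware can be registered during bootstrap; the import paths and the /results prefix below are assumptions for illustration:

import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
import { performanceLogger } from './performance-logger.middleware';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  // Log timing only for the CSV result endpoints (hypothetical prefix).
  app.use('/results', performanceLogger);
  await app.listen(3000);
}
bootstrap();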

Final Thoughts

Optimizing file processing in Node.js, especially for large CSV files, is a multi-faceted challenge. I learned that:

  • Caching can dramatically reduce processing time for repeated requests.
  • Parallelizing asynchronous operations minimizes waiting times for IO-bound tasks.
  • Stream management and error handling are critical for robust performance.

By implementing these strategies and adding extra functionalities like retry mechanisms and performance logging, I’ve managed to improve the efficiency and resilience of my file processing APIs. I encourage you to experiment with these ideas in your projects and adjust them according to your application’s needs.
