介绍
高效处理大型 csv 文件是许多应用程序中的常见要求,从数据分析到 etl(提取、转换、加载)过程。在本文中,我想对四种流行编程语言(golang、带有 nestjs 的 nodejs、php 和 python)在 macbook pro m1 上处理大型 csv 文件的性能进行基准测试。我的目标是确定哪种语言可以为该任务提供最佳性能。
测试环境
硬件:macbook pro m1,256gb ssd,8gb ram
软件:
- macos 索诺玛 14.5
- php 8.3.6
- golang 1.22.4
- node.js 22.0.0 与 nestjs
- python 3.12.3
测试数据
我使用了一个名为 sales_data.csv 的合成 csv 文件,其中包含大约 100 万行,每行包含交易详细信息,例如 transaction_id、product_id、数量、价格和时间戳。
任务描述
对于每种语言,脚本执行以下任务:
立即学习“PHP免费学习笔记(深入)”;
- 读取 csv 文件。
- 计算总销售额。
- 识别销量最高的产品。
执行
以下是每种语言使用的脚本:
go 语言脚本:
销售.go
package main import ( "encoding/csv" "fmt" "os" "strconv" "time" ) func main() { start := time.now() file, err := os.open("../generate-csv/sales_data.csv") if err != nil { fmt.println("error:", err) return } defer file.close() reader := csv.newreader(file) _, _ = reader.read() // skip header totalsales := 0.0 productsales := make(map[string]float64) for { line, err := reader.read() if err != nil { break } productid := line[1] quantity, _ := strconv.atoi(line[2]) price, _ := strconv.parsefloat(line[3], 64) total := float64(quantity) * price totalsales += total productsales[productid] += total } var topproduct string var topsales float64 for product, sales := range productsales { if sales > topsales { topproduct = product topsales = sales } } elapsed := time.since(start) fmt.printf("golang execution time: %sn", elapsed) fmt.printf("total sales: $%.2fn", totalsales) fmt.printf("top product: %s with sales $%.2fn", topproduct, topsales) }
nestjs脚本:
csv.service.ts
import { injectable } from '@nestjs/common'; import * as fs from 'fs'; import * as fastcsv from 'fast-csv'; // path file csv const global_csv_path = '../generate-csv/sales_data.csv'; @injectable() @injectable() export class csvservice { async parsecsv(): promise { return new promise((resolve, reject) => { const starttime = process.hrtime(); let totalsales = 0; const productsales: { [key: string]: number } = {}; fs.createreadstream(global_csv_path) .pipe(fastcsv.parse({ headers: true, delimiter: ',' })) .on('data', (row) => { const productid = row.product_id; const quantity = parseint(row.quantity, 10); const price = parsefloat(row.price); const total = quantity * price; totalsales += total; if (!productsales[productid]) { productsales[productid] = 0; } productsales[productid] += total; }) .on('end', () => { const topproduct = object.keys(productsales).reduce((a, b) => productsales[a] > productsales[b] ? a : b, ); const topproductsales = productsales[topproduct] || 0; const endtime = process.hrtime(starttime); const nestexecutiontime = endtime[0] + endtime[1] / 1e9; console.log(`nestjs execution time: ${nestexecutiontime} seconds`); console.log(`total sales: $${totalsales}`); console.log( `top product: ${topproduct} with sales $${topproductsales}`, ); resolve({ nestexecutiontime, totalsales, topproductsales, }); }) .on('error', (error) => reject(error)); }); } }
csv.controller.ts
import { controller, get } from '@nestjs/common'; import { csvservice } from './csv.service'; @controller('csv') export class csvcontroller { constructor(private readonly csvservice: csvservice) {} @get('parse') async parsecsv(): promise { return this.csvservice.parsecsv(); } }
php脚本
销售.php
<?php $start_time = microtime(true); $file = fopen("../generate-csv/sales_data.csv", "r"); $total_sales = 0; $product_sales = []; fgetcsv($file); // skip header while (($line = fgetcsv($file)) !== false) { $product_id = $line[1]; $quantity = (int)$line[2]; $price = (float)$line[3]; $total = $quantity * $price; $total_sales += $total; if (!isset($product_sales[$product_id])) { $product_sales[$product_id] = 0; } $product_sales[$product_id] += $total; } fclose($file); arsort($product_sales); $top_product = array_key_first($product_sales); $end_time = microtime(true); $execution_time = ($end_time - $start_time); echo "php execution time: ".$execution_time." secondsn"; echo "total sales: $".$total_sales."n"; echo "top product: ".$top_product." with sales $".$product_sales[$top_product]."n";
python脚本
import csv import time # Input file name config input_file = '../generate-csv/sales_data.csv' def parse_csv(file_path): start_time = time.time() total_sales = 0 product_sales = {} with open(file_path, mode='r') as file: reader = csv.DictReader(file) for row in reader: product_id = row['product_id'] quantity = int(row['quantity']) price = float(row['price']) total = quantity * price total_sales += total if product_id not in product_sales: product_sales[product_id] = 0 product_sales[product_id] += total top_product = max(product_sales, key=product_sales.get) execution_time = time.time() - start_time return { 'total_sales': total_sales, 'top_product': top_product, 'top_product_sales': product_sales[top_product], 'execution_time': execution_time, } if __name__ == "__main__": result = parse_csv(input_file) print(f"Python Execution time: {result['execution_time']:.2f} seconds") print(f"Total Sales: ${result['total_sales']:.2f}") print(f"Top Product: {result['top_product']} with sales ${ result['top_product_sales']:.2f}")
结果
以下是我们基准测试的结果:
戈兰
- 执行时间:466.69975ms
- 总销售额:$274654985.36
- 顶级产品:产品 1126,销售额 $305922.81
nestjs
- 执行时间:6.730134208秒
- 总销售额:$274654985.36000216
- 顶级产品:1126,销售额 $305922.8099999997
php
- 执行时间:1.5142710208893秒
- 总销售额:$274654985.36
- 顶级产品:1126 销售额 $305922.81
python
- 执行时间:2.56秒
- 总销售额:$274654985.36
- 顶级产品:1126 销售额 $305922.81
分析
我的基准测试揭示了一些有趣的见解:
执行时间:golang 在执行时间方面表现最好,php8 紧随其后,而 nestjs 完成任务的时间最长。
内存使用:build nestjs 表现出高效的内存使用,而 python 表现出更高的内存消耗。
易于实现:golang 提供了最简单的实现,而 nestjs 需要更多的代码行和复杂性。
结论
根据我的发现,golang 提供了最佳的性能速度和内存效率,使其成为处理大型数据集的绝佳选择。
完整代码
您可以在我的 github 存储库上获取完整代码
csv-解析-战斗.