Friday, September 11, 2020

Python - Split one CSV file into multiple ones

 

import csv
import os


def split(filehandler, delimiter=',', row_limit=500000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    # Split the rows read from filehandler into numbered files of at most
    # row_limit data rows each, repeating the header row when keep_headers is set.
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    # Keep the file object itself so each piece can be closed explicitly.
    current_out_file = open(current_out_path, 'w', newline='')
    try:
        current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
        current_limit = row_limit
        if keep_headers:
            headers = next(reader)
            current_out_writer.writerow(headers)
        for i, row in enumerate(reader):
            if i + 1 > current_limit:
                # The current piece is full: close it and open the next one.
                current_out_file.close()
                current_piece += 1
                current_limit = row_limit * current_piece
                current_out_path = os.path.join(
                    output_path,
                    output_name_template % current_piece
                )
                current_out_file = open(current_out_path, 'w', newline='')
                current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
                if keep_headers:
                    current_out_writer.writerow(headers)
            current_out_writer.writerow(row)
    finally:
        current_out_file.close()


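# Look for the input file next to this script.
file_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), "your_file_name.csv")
# newline='' lets the csv module handle line endings; the with block closes the file.
with open(file_path, 'r', newline='') as source_file:
    split(source_file)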
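
For comparison, here is a minimal sketch of the same idea written with itertools.islice, which pulls a fixed number of rows at a time from the reader. The names split_csv_in_chunks and demo, the sample columns, and the row_limit of 2 are hypothetical and only there for illustration. It also differs from the split function above in two ways: it holds up to row_limit rows in memory at once, and it writes no output file at all when the source has no data rows.

```python
import csv
import itertools
import os
import tempfile


def split_csv_in_chunks(source_path, output_dir, row_limit=2, keep_headers=True):
    """Write source_path into numbered pieces of at most row_limit data rows.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    written = []
    with open(source_path, 'r', newline='') as source_file:
        reader = csv.reader(source_file)
        # Read the header once so it can be repeated at the top of every piece.
        headers = next(reader, None) if keep_headers else None
        for piece in itertools.count(1):
            # islice takes at most row_limit rows from the shared reader.
            rows = list(itertools.islice(reader, row_limit))
            if not rows:
                break
            out_path = os.path.join(output_dir, 'output_%s.csv' % piece)
            with open(out_path, 'w', newline='') as out_file:
                writer = csv.writer(out_file)
                if headers is not None:
                    writer.writerow(headers)
                writer.writerows(rows)
            written.append(out_path)
    return written


def demo():
    """Split a five-row sample file and return the rows found in each piece."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, 'sample.csv')
        with open(sample_path, 'w', newline='') as sample_file:
            writer = csv.writer(sample_file)
            writer.writerow(['id', 'name'])
            writer.writerows([[str(n), 'row%d' % n] for n in range(1, 6)])
        pieces = []
        for path in split_csv_in_chunks(sample_path, tmp_dir, row_limit=2):
            with open(path, 'r', newline='') as piece_file:
                pieces.append(list(csv.reader(piece_file)))
        return pieces
```

Calling demo() splits five data rows with a limit of two, so it should produce three pieces, each starting with the header row and the last one holding the single leftover row.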

Thursday, September 10, 2020

New era

Recently, I've been learning PySpark and related tools such as AWS EMR, AWS Glue, AWS QuickSight, SageMaker, Databricks, and Athena. It is a new way of processing data (well, compared to what I've been doing), with more of the work done in the cloud. I'm very excited and will keep learning... more posts to come later :)