Load a very large file into a MySQL database table using MySQL's built-in bulk-load utility (LOAD DATA LOCAL INFILE) and Python, on an ordinary local machine. I am using 8 GB of RAM and a 4-core consumer-grade processor.
Step 1:
Install required libraries:
Python library required: mysql-connector-python
pip3 install mysql-connector-python
OR
pip install mysql-connector-python
Step 2:
Setting the variables:
- Data File: the data file, with its full path.
- Delimiter: the field delimiter, e.g. ',' or '\t'
- New Line (Line Termination) Character: the character that terminates each line.
- Header: set this to True. This tutorial assumes the first line of the file contains the column names.
- MySQL Host: hostname/IP of the MySQL server.
- MySQL Port: port of the MySQL server.
- MySQL DB: the MySQL database.
- MySQL Table: the table into which you want to import the data.
- MySQL User: MySQL username.
- MySQL Password: password for that user.
# Variables/Configurations
data_file = r'/home/subha/Desktop/myworks/dataFiles/imdb/name.basics.tsv/data.tsv'
delimiter='\t'
newline = '\n'
header=True
mysql_host = "localhost"
mysql_port = "3306"
mysql_db = "myworks"
mysql_table = "imdb_name_basics_stg"
mysql_user = "your username"
mysql_password = "your password"
Note: I am using the IMDb name.basics dataset, which has around 12 million rows and is about 738 MB on disk. The file can be downloaded here:
https://datasets.imdbws.com/name.basics.tsv.gz
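To fetch and unpack the dataset, shell commands along the following lines should work (a sketch; the actual download and decompression are commented out since the compressed file is a few hundred megabytes):

```shell
# Sketch: download and decompress the IMDb name.basics dataset.
url="https://datasets.imdbws.com/name.basics.tsv.gz"
gz="$(basename "$url")"   # name.basics.tsv.gz
tsv="${gz%.gz}"           # name.basics.tsv
# Uncomment to actually fetch and unpack:
# curl -LO "$url"
# gunzip -k "$gz"         # -k keeps the original archive
echo "Will produce: $tsv"
```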
Step 3:
Import the Python library
# import the mysql library
import mysql.connector as mysql
Step 4:
Read the first line for the header/columns.
# read first line of the input file to create header
with open(data_file, "r") as inFile:
    header = inFile.readline()
Step 5:
Extract the column list
# Extract the column list
column_list = header.split(delimiter)
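Note that the last entry of this list keeps the trailing newline from readline(); the later steps strip it with replace(). A defensive variant (my sketch, not part of the original script) strips stray whitespace up front so the list is clean from the start:

```python
# Sketch: split a header line and strip stray whitespace/newlines per column.
def parse_header(header_line, delimiter):
    return [col.strip() for col in header_line.rstrip("\n").split(delimiter)]

print(parse_header("nconst\tprimaryName\tbirthYear\n", "\t"))
# → ['nconst', 'primaryName', 'birthYear']
```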
Step 6:
Construct Create Table Statement
# Construct Create table statement
create_table_statement = "CREATE TABLE " + mysql_db + "." + mysql_table + "("
for a_field in column_list:
    create_table_statement += a_field.replace("\n", "") + " text,"
create_table_statement = create_table_statement[:-1] + ")"
print("Create Table Statement:")
print(create_table_statement)
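The same statement can be built a little more compactly with str.join, which avoids trimming the trailing comma (a sketch using the configuration names above, not the script's exact code):

```python
# Sketch: build the CREATE TABLE statement with join instead of slicing.
def build_create_table(db, table, columns):
    cols_sql = ", ".join(col + " text" for col in columns)
    return "CREATE TABLE " + db + "." + table + "(" + cols_sql + ")"

print(build_create_table("myworks", "imdb_name_basics_stg",
                         ["nconst", "primaryName"]))
# → CREATE TABLE myworks.imdb_name_basics_stg(nconst text, primaryName text)
```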
Step 7:
Connecting to the MySQL DB
# Connect to mysql DB
db = mysql.connect(
    host = mysql_host,
    port = int(mysql_port),
    user = mysql_user,
    password = mysql_password,
    database = mysql_db,
    allow_local_infile = True
)
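Note: allow_local_infile=True only covers the client side. The server must also have the local_infile system variable enabled, otherwise the load fails with a "Loading local data is disabled" error. You can check and enable it from any MySQL session with sufficient privileges:

```sql
-- Check whether the server allows LOAD DATA LOCAL INFILE
SHOW VARIABLES LIKE 'local_infile';
-- Enable it (requires SUPER or SYSTEM_VARIABLES_ADMIN)
SET GLOBAL local_infile = 1;
```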
Step 8:
Creating Cursor:
cursor = db.cursor()
Step 9:
Creating Table
# Creating table (DDL hence auto commit)
cursor.execute(create_table_statement)
Step 10:
Construct Load Statement
# Constructing load statement
load_statement = ("LOAD DATA LOCAL INFILE '" + data_file + "'"
                  + "\nINTO TABLE " + mysql_db + "." + mysql_table
                  + "\nFIELDS TERMINATED BY '" + delimiter + "'"
                  + "\nLINES TERMINATED BY '" + newline + "'"
                  + "\nIGNORE 1 LINES\n(")
for a_field in column_list:
    load_statement += a_field.replace("\n", "") + ","
load_statement = load_statement[:-1] + ");"
print("Load Table Statement:")
print(load_statement)
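As with the CREATE TABLE statement, this string can also be assembled with join, which sidesteps the trailing-comma slice (a sketch using the same configuration names):

```python
# Sketch: build the LOAD DATA statement with join instead of slicing.
def build_load_statement(data_file, db, table, delimiter, newline, columns):
    return ("LOAD DATA LOCAL INFILE '" + data_file + "'"
            + "\nINTO TABLE " + db + "." + table
            + "\nFIELDS TERMINATED BY '" + delimiter + "'"
            + "\nLINES TERMINATED BY '" + newline + "'"
            + "\nIGNORE 1 LINES\n(" + ",".join(columns) + ");")

print(build_load_statement("/tmp/data.tsv", "myworks", "t",
                           "\t", "\n", ["a", "b"]))
```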
Step 11:
Loading the file and finishing
cursor.execute(load_statement)
db.commit()
db.close()
print("Complete")
We use the following statement to load the data file into the MySQL table:
LOAD DATA LOCAL INFILE '/home/subha/Desktop/myworks/dataFiles/imdb/name.basics.tsv/data.tsv'
INTO TABLE myworks.imdb_name_basics_stg
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles);
The load takes around 3 to 4 minutes for all 12M records. IMDb publishes even larger files, but even those load in roughly 10 to 12 minutes at most with this method.
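If you want to confirm those numbers on your own machine, wrapping the execute call with time.perf_counter gives the elapsed load time (a sketch; cursor and load_statement come from the script above, and the execute line is commented out so the snippet runs standalone):

```python
import time

t0 = time.perf_counter()
# cursor.execute(load_statement)   # the actual bulk load from the script
elapsed = time.perf_counter() - t0
print(f"Load finished in {elapsed:.1f} s")
```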
Here is the entire code together. Just change the configuration section and run the script.
# Variables/Configurations
data_file = r'/home/subha/Desktop/myworks/dataFiles/imdb/name.basics.tsv/data.tsv'
delimiter='\t'
newline = '\n'
header=True
mysql_host = "localhost"
mysql_port = "3306"
mysql_db = "myworks"
mysql_table = "imdb_name_basics_stg"
mysql_user = "your username"
mysql_password = "your password"
# import the mysql library
import mysql.connector as mysql
# ----------------------------------
# read first line of the input file to create header
with open(data_file, "r") as inFile:
    header = inFile.readline()
# ---------------------------------------------------
# Extract column lists
column_list = header.split(delimiter)
# ---------------------------------------------------
# Construct Create table statement
create_table_statement = "CREATE TABLE " + mysql_db + "." + mysql_table + "("
for a_field in column_list:
    create_table_statement += a_field.replace("\n", "") + " text,"
create_table_statement = create_table_statement[:-1] + ")"
print("Create Table Statement:")
print(create_table_statement)
print("---------------------------------------------------------------------------")
# ---------------------------------------------------
# Connect to mysql DB
db = mysql.connect(
    host = mysql_host,
    port = int(mysql_port),
    user = mysql_user,
    password = mysql_password,
    database = mysql_db,
    allow_local_infile = True
)
# ---------------------------------------------------
cursor = db.cursor()
# Creating table (DDL hence auto commit)
cursor.execute(create_table_statement)
# --------------------------------------
# Constructing load statement
load_statement = ("LOAD DATA LOCAL INFILE '" + data_file + "'"
                  + "\nINTO TABLE " + mysql_db + "." + mysql_table
                  + "\nFIELDS TERMINATED BY '" + delimiter + "'"
                  + "\nLINES TERMINATED BY '" + newline + "'"
                  + "\nIGNORE 1 LINES\n(")
for a_field in column_list:
    load_statement += a_field.replace("\n", "") + ","
load_statement = load_statement[:-1] + ");"
print("Load Table Statement:")
print(load_statement)
print("---------------------------------------------------------------------------")
# -------------------------------------
cursor.execute(load_statement)
db.commit()
db.close()
print("Complete")
Happy Coding!!