How to use Sqoop in Hadoop

2018-06-11 11:27:36
install
#download Sqoop 1.4.7
[root@localhost ~]# wget https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@localhost ~]# tar zxf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@localhost ~]# mv sqoop-1.4.7.bin__hadoop-2.6.0 /usr/local/sqoop
#download the MySQL JDBC driver; after untarring, copy the jar into Sqoop's lib directory
[root@localhost ~]# wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.11.tar.gz
[root@localhost ~]# tar zxf mysql-connector-java-8.0.11.tar.gz
[root@localhost ~]# cp mysql-connector-java-8.0.11/mysql-connector-java-8.0.11.jar /usr/local/sqoop/lib/
[root@localhost ~]# cd /usr/local/sqoop/conf && cp sqoop-env-template.sh sqoop-env.sh
[root@localhost ~]# cat sqoop-env.sh
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/local/hadoop
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/local/hadoop/share/hadoop/mapreduce
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
#export HIVE_HOME=
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
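
To check that the install and the Hadoop paths above are picked up, a quick sanity check (assuming /usr/local/sqoop is the install path, as used throughout this post):

[root@localhost ~]# /usr/local/sqoop/bin/sqoop version
#warnings about unset HBASE_HOME / HCAT_HOME / ACCUMULO_HOME are harmless if those services are not used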

#test
/usr/local/sqoop/bin/sqoop list-databases \
--connect jdbc:mysql://<dburi>/<dbname> \
--username <username> --password <password>
/usr/local/sqoop/bin/sqoop list-tables \
--connect jdbc:mysql://<dburi>/<dbname> \
--username <username> --password <password>
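
Putting the password on the command line leaves it in the shell history. As a sketch of a safer variant, Sqoop can prompt for the password with -P (or read it from a file with --password-file):

/usr/local/sqoop/bin/sqoop list-databases \
--connect jdbc:mysql://<dburi>/<dbname> \
--username <username> -P
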
application scenarios

Tip: before running these commands, switch to the hadoop user first (su hadoop).

  1. mysql -> hdfs
  2. hdfs -> mysql
  3. mysql -> hive
  4. hive -> mysql
  5. use sql as import condition
  6. mysql -> hbase
  • from mysql to hdfs
    --check-column (col): Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)
    --incremental (mode): Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
    --last-value (value): Specifies the maximum value of the check column from the previous import.

    sqoop import --connect jdbc:mysql://<dburi>/<dbname> \
    --username <username> --password <password> \
    --table <tablename> --check-column <col> --incremental <mode> --last-value <value> --target-dir <hdfs-dir>

    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table pv_daxue \
    --target-dir /usr/sqoop/daxue

    #save as Parquet (Sqoop can also write text, sequence and Avro files via --as-textfile, --as-sequencefile, --as-avrodatafile)
    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table pv_daxue --target-dir /usr/sqoop/daxue \
    --as-parquetfile

    #import only the id and account columns, as text
    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table pv_daxue --target-dir /usr/sqoop/daxue \
    --columns id,account \
    --as-textfile

    Tip: --as-sequencefile, --as-avrodatafile and --as-parquetfile are not supported together with --direct when importing from MySQL.
    # Append mode: after inserting one more record in MySQL, run an incremental import again
    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table pv_daxue --check-column id \
    --incremental append --target-dir /usr/sqoop/daxue \
    --last-value 5

    # Lastmodified mode
    #first import
    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table pv_daxue --target-dir /usr/sqoop/daxue -m 1

    bash-4.1$ hadoop fs -cat /usr/sqoop/daxue/part-m-00000
    1,hello,2018-06-12 23:48:32.0
    2,word,2018-06-12 23:48:32.0
    3,marry,2018-06-12 23:48:32.0
    4,tony,2018-06-12 23:48:32.0
    5,jack,2018-06-12 23:48:33.0
    6,james,2018-06-12 23:52:03.0
    # after inserting one more record, run a lastmodified incremental import
    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table pv_daxue --check-column last_mod --incremental lastmodified --last-value "2018-06-12 10:52:03" \
    --target-dir /usr/sqoop/daxue -m 1 --append

    18/06/12 10:59:48 INFO mapreduce.ImportJobBase: Transferred 60 bytes in 4.2309 seconds (14.1813 bytes/sec)
    18/06/12 10:59:48 INFO mapreduce.ImportJobBase: Retrieved 2 records
    bash-4.1$ hadoop fs -cat /usr/sqoop/daxue/part-m-00001
    6,james,2018-06-12 23:52:03.0
    7,hello,2018-06-12 23:58:08.0

    # merge mode: after executing "update pv_daxue set name = 'Hello' where id = 1;" in MySQL, re-import and merge on the key
    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table pv_daxue --check-column last_mod --incremental lastmodified --last-value "2018-06-12 23:52:03" \
    --target-dir /usr/sqoop/daxue -m 1 --merge-key id

    bash-4.1$ hadoop fs -cat /usr/sqoop/daxue/part-r-00000
    1,Hello,2018-06-13 00:07:41.0
    2,word,2018-06-12 23:48:32.0
    3,marry,2018-06-12 23:48:32.0
    4,tony,2018-06-12 23:48:32.0
    5,jack,2018-06-12 23:48:33.0
    6,james,2018-06-12 23:52:03.0
    7,me,2018-06-12 23:58:08.0
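
    With a saved Sqoop job the --last-value bookkeeping is handled automatically: the job stores the last imported value in Sqoop's metastore and reuses it on the next run. A minimal sketch (the job name daxue_append is made up):

    /usr/local/sqoop/bin/sqoop job --create daxue_append -- import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table pv_daxue --check-column id --incremental append \
    --target-dir /usr/sqoop/daxue

    # each execution picks up where the previous run stopped
    /usr/local/sqoop/bin/sqoop job --exec daxue_append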
  • from hdfs to mysql
    Note: MySQL 5.5.45+, 5.6.26+ and 5.7.6+ require an SSL connection by default unless an explicit option is set, hence useSSL=false in the connection URL below.

    sqoop export --connect jdbc:mysql://<dburi>/<dbname> \
    --username <username> --password <password> \
    --table <tablename> --export-dir <hdfs-dir>

    /usr/local/sqoop/bin/sqoop export \
    --connect "jdbc:mysql://192.168.100.112/datbase?useSSL=false&useUnicode=true&characterEncoding=utf-8" \
    --username username --password password \
    --table pv_daxue --export-dir /usr/sqoop/daxue
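
    By default sqoop export only INSERTs rows; to also update rows that already exist in MySQL, the export can be keyed. A sketch (assuming id is the primary key):

    /usr/local/sqoop/bin/sqoop export \
    --connect "jdbc:mysql://192.168.100.112/datbase?useSSL=false&useUnicode=true&characterEncoding=utf-8" \
    --username username --password password \
    --table pv_daxue --export-dir /usr/sqoop/daxue \
    --update-key id --update-mode allowinsert
    # updateonly would only update matching rows; allowinsert turns the export into an upsert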
  • from mysql to hive

    sqoop import --connect jdbc:mysql://<dburi>/<dbname> \
    --username <username> --password <password> \
    --table <tablename> --check-column <col> --incremental <mode> --last-value <value> \
    --fields-terminated-by "\t" --lines-terminated-by "\n" \
    --hive-import --target-dir <hdfs-dir> --hive-table <hive-tablename>

    /usr/local/sqoop/bin/sqoop import \
    --connect "jdbc:mysql://192.168.100.112/datbase?useSSL=false&useUnicode=true&characterEncoding=utf-8" \
    --username username --password password \
    --table hive_table --hive-import --hive-database database --hive-table hive_test \
    --delete-target-dir --split-by id
    # add --create-hive-table to let Sqoop create the Hive table (the job fails if it already exists)


    # default MySQL-to-Hive type mapping (override with --map-column-hive, as in the example below)
    MySQL(bigint) --> Hive(bigint)
    MySQL(tinyint) --> Hive(tinyint)
    MySQL(int) --> Hive(int)
    MySQL(double) --> Hive(double)
    MySQL(bit) --> Hive(boolean)
    MySQL(varchar) --> Hive(string)
    MySQL(decimal) --> Hive(double)
    MySQL(date/timestamp) --> Hive(string)


    /usr/local/sqoop/bin/sqoop import \
    --connect "jdbc:mysql://192.168.100.112/datbase?useSSL=false&useUnicode=true&characterEncoding=utf-8" \
    --username username --password password \
    --table hive_table --hive-import \
    --map-column-hive cost="DECIMAL",date="DATE" \
    --hive-database database --hive-table hive_test \
    --delete-target-dir --split-by id
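
    If the target Hive table is partitioned, a single partition can be loaded with --hive-partition-key / --hive-partition-value. A sketch (the partition column dt and its value are assumptions):

    /usr/local/sqoop/bin/sqoop import \
    --connect "jdbc:mysql://192.168.100.112/datbase?useSSL=false&useUnicode=true&characterEncoding=utf-8" \
    --username username --password password \
    --table hive_table --hive-import --hive-database database --hive-table hive_test \
    --hive-partition-key dt --hive-partition-value '2018-06-12' \
    --delete-target-dir --split-by id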
  • from hive to mysql

    # Same as exporting from HDFS to MySQL above; just point --export-dir at the HDFS path backing the Hive table
    /usr/local/sqoop/bin/sqoop export \
    --connect "jdbc:mysql://192.168.100.112/datbase?useSSL=false&useUnicode=true&characterEncoding=utf-8" \
    --username username --password password \
    --table customer --export-dir /user/hive/warehouse/user.db/customer \
    --fields-terminated-by '\001'   # '\001' is Hive's default field delimiter
  • use sql as import condition

    sqoop import --connect jdbc:mysql://<dburi>/<dbname> \
    --username <username> --password <password> \
    --query <query-sql> --split-by <sp-column> \
    --hive-import --hive-table <hive-tablename> --target-dir <hdfs-dir>

    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --query 'select id,account from version where account="ddd" and $CONDITIONS' \
    --target-dir /usr/sqoop/daxue --delete-target-dir -m 1 \
    --as-parquetfile
    # --query cannot be combined with --table; parallel query imports also need --split-by (or -m 1)
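
    For a simple filter, --where together with --table and --columns is an alternative to writing the full --query by hand. A sketch against the same table and condition:

    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --table version --columns id,account --where 'account="ddd"' \
    --target-dir /usr/sqoop/daxue --delete-target-dir -m 1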
  • from mysql to hbase

    /usr/local/sqoop/bin/sqoop import \
    --connect jdbc:mysql://192.168.100.112/datbase?zeroDateTimeBehavior=CONVERT_TO_NULL \
    --username username --password password \
    --query 'select id,account from version where account="ddd" and $CONDITIONS ' \
    --hbase-table pv_daxue --hbase-create-table \
    --hbase-row-key id --split-by date -m 7 \
    --column-family tiger
Problems encountered
  • ERROR tool.ImportTool: Import failed: java.io.FileNotFoundException


    18/06/11 15:42:43 ERROR tool.ImportTool: Import failed: java.io.FileNotFoundException: File does not exist: hdfs://172.16.56.143:8020/usr/local/sqoop-1.4.7.bin__hadoop-2.6.0/lib/parquet-jackson-1.6.0.jar

    # fix: create the same path on HDFS and upload Sqoop's lib jars so the MapReduce tasks can find them
    [hadoop@node1 conf]$ /usr/local/hadoop/bin/hadoop fs -mkdir -p /usr/local/sqoop-1.4.7.bin__hadoop-2.6.0/lib
    [hadoop@node1 conf]$ /usr/local/hadoop/bin/hadoop fs -put /usr/local/sqoop-1.4.7.bin__hadoop-2.6.0/lib/* hdfs://172.16.56.143:8020/usr/local/sqoop-1.4.7.bin__hadoop-2.6.0/lib
  • Caused by: com.mysql.cj.exceptions.CJException: The connection property 'zeroDateTimeBehavior' acceptable values are: 'CONVERT_TO_NULL', 'EXCEPTION' or 'ROUND'. The value 'convertToNull' is not acceptable.

    # fix: use the new-style value in the connection URL:
    jdbc:mysql://localhost:3306/database?zeroDateTimeBehavior=CONVERT_TO_NULL

    If LZO is configured, you may see the following error; run "hadoop checknative" to check which native codecs are available.

  • ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
    java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.

    # install lzo support
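
    A rough sketch of one way to add it (package names and paths are assumptions; see the hadoop-lzo project for the authoritative steps):

    # install the native LZO library, then put the hadoop-lzo jar on Hadoop's classpath
    yum install -y lzo lzo-devel lzop
    cp hadoop-lzo-*.jar /usr/local/hadoop/share/hadoop/common/lib/
    # register com.hadoop.compression.lzo.LzoCodec in the io.compression.codecs property of core-site.xml,
    # then re-check with "hadoop checknative" and re-run the Sqoop job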
  • No primary key could be found for table score

    Specify one with --split-by or perform a sequential import with '-m 1'.
  • ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.
    ERROR tool.ImportTool: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

    # fix: add this to ~/.bash_profile so Sqoop can find the Hive classes, then reload it:
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hive/lib/*
    source ~/.bash_profile
  • FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.)

    mysql> alter database hive character set latin1;
  • java.lang.RuntimeException: Can't parse input data: '1Hello2018-06-13 00:07:41.0'

    --fields-terminated-by '\001'   # the exported data uses Hive's default '\001' delimiter, not commas


