Hadoop Cmd
/root/hadoop-0.19.0/bin/hadoop jar $HADOOP_STREAMING_JAR \
  -D mapred.reduce.tasks=0 \
  -input /mnt/hgfs/code/streaming/feed.xml \
  -output `pwd`/output \
  -mapper 'echo.pl' \
  -inputreader "StreamXmlRecordReader,begin=<product ,end=</product>"
echo.pl
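#!/usr/bin/env perl
# Streaming mapper: number each line received on STDIN and wrap it in colons
# so that leading and trailing whitespace is visible in the output.
use strict;
use warnings;

my $count = 0;
while (my $line = <STDIN>) {
    chomp $line;
    $count++;
    print "$count\t:$line:\n";
}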
Input
000000 3c 72 6f 6f 74 3e 0a 3c 73 65 63 6f 6e 64 3e 32 ><root>.<second>2<
000010 3c 2f 73 65 63 6f 6e 64 3e 0a 3c 70 72 6f 64 75 ></second>.<produ<
000020 63 74 20 3e 09 0a 73 74 65 76 65 3c 2f 70 72 6f >ct >..steve</pro<
000030 64 75 63 74 3e 09 0a 3c 70 72 6f 64 75 63 74 20 >duct>..<product <
000040 3e 0a 6c 69 6e 65 20 74 77 6f 0a 6c 69 6e 65 20 >>.line two.line <
000050 74 68 72 65 65 0a 3c 2f 70 72 6f 64 75 63 74 3e >three.</product><
000060 0a 0a 3c 2f 72 6f 6f 74 3e 0a >..</root>.<
00006a
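To reproduce the test, the input can be rebuilt byte for byte from the dump above. This is a small sketch; the file name feed.xml in the current directory is an assumption taken from the -input path, and the tabs and the blank line are written as escapes so they survive copy and paste.

```shell
# Rebuild the sample input from the hex dump: note the tab after the first
# "<product >", the tab after the first "</product>", and the blank line
# before "</root>".
printf '<root>\n<second>2</second>\n<product >\t\nsteve</product>\t\n<product >\nline two\nline three\n</product>\n\n</root>\n' > feed.xml

# The dump ends at offset 0x6a, so the file should be 106 bytes long.
wc -c < feed.xml
```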
Output
000000 31 09 3a 3c 70 72 6f 64 75 63 74 20 3e 09 3a 0a >1.:<product >.:.<
000010 32 09 3a 73 74 65 76 65 3c 2f 70 72 6f 64 75 63 >2.:steve</produc<
000020 74 3e 09 3a 0a 33 09 3a 3c 70 72 6f 64 75 63 74 >t>.:.3.:<product<
000030 20 3e 3a 0a 34 09 3a 6c 69 6e 65 20 74 77 6f 3a > >:.4.:line two:<
000040 0a 35 09 3a 6c 69 6e 65 20 74 68 72 65 65 3a 0a >.5.:line three:.<
000050 36 09 3a 3c 2f 70 72 6f 64 75 63 74 3e 09 3a 0a >6.:</product>.:.<
000060
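The output can be approximated locally without Hadoop. The sketch below is not the StreamXmlRecordReader implementation; it is a hypothetical stand-in that assumes what the dump suggests: each span from the begin marker to the end marker becomes one record, the record is passed to the mapper as the key followed by a tab and an empty value, and embedded newlines are passed through untouched. The function names extract_records and number_lines are made up for illustration, and number_lines mimics echo.pl.

```shell
# extract_records BEGIN END: read stdin and print every BEGIN...END span,
# each followed by a tab (key/value separator, empty value) and a newline.
extract_records() {
  awk -v bmark="$1" -v emark="$2" '
    { text = text $0 "\n" }                 # slurp stdin, restoring each newline
    END {
      while ((s = index(text, bmark)) > 0) {
        text = substr(text, s)               # skip ahead to the begin marker
        e = index(text, emark)
        if (e == 0) break                    # no closing marker: stop
        # Emit the record as the key, then a tab and an empty value.
        printf "%s\t\n", substr(text, 1, e + length(emark) - 1)
        text = substr(text, e + length(emark))
      }
    }'
}

# number_lines: same formatting as echo.pl (line number, tab, colon-wrapped line).
number_lines() {
  awk '{ printf "%d\t:%s:\n", NR, $0 }'
}
```

Running extract_records '<product ' '</product>' < feed.xml | number_lines against the sample input yields the same six lines as the dump above. Under these assumptions the tab on line 1 comes from the input itself, while the tabs on lines 2 and 6 are the key/value separator. Because each record contains newlines, the mapper sees six lines rather than two records.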
Tuesday, January 27, 2009
Hadoop Streaming using the StreamXmlRecordReader