sillycat
Perl Huge XML Solution(1)Split Files and Multiple Threads

 

1. Upgrade Perl
>sudo yum install cpan

>sudo cpan
cpan>install Bundle::CPAN
cpan>reload cpan

cpan>upgrade
This did not work. It failed with the error message:
make NO isa perl

Solution:
> sudo yum install perl-Config*

Upgrading Perl itself still did not work, but I could install the modules I needed one by one:
cpan> install Time::Piece
cpan> install Path::Class
cpan> install autodie
cpan> install Thread::Queue
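
Before running the scripts below, it can help to confirm that the four modules really are loadable. The following is only a minimal sketch; the helper name missing_modules is made up for this example and is not part of any of the modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the names of the modules that this perl cannot load.
# (missing_modules is a hypothetical helper for this example.)
sub missing_modules {
    my @modules = @_;
    my @missing;
    for my $module (@modules) {
        # Turn Some::Module into Some/Module.pm so that require() can
        # take it as a file name and no string eval is needed.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(qw(Time::Piece Path::Class autodie Thread::Queue));
print @missing
    ? "Missing: @missing\n"
    : "All modules are installed.\n";
```

Anything reported as missing can then be installed from the cpan shell as shown above.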

2. Split the File
split_hero.pl
#!/usr/bin/perl

use strict;
use warnings;
use Time::Piece;
use Path::Class;
use autodie; # die if there is a problem reading or writing a file

my $OutputSize = 0;
my $OutputCount = 0;
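my $MaxSize = 100_000_000;     # start a new chunk once the current one exceeds ~100 MB
my $HugeFileName = "data/728"; # reads data/728.xml; the directory data/728/ must already exist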

print localtime->strftime('%Y-%m-%d %X') . "\n";

my $out;
open(my $in, '<', $HugeFileName . '.xml') or die "input: $!\n";
while(<$in>) {
    if(!$out) {
        $OutputCount++;
        $OutputSize = 0;
        open($out, '>', $HugeFileName . "/output$OutputCount.xml") or die "output: $!\n";
        # the first chunk keeps the header lines of the original file
        unless($OutputCount==1) {
            print $out qq{<?xml version='1.0' encoding='UTF-8'?>\n};
            print $out qq{<source>\n};
        }
    }
    print $out $_;
    $OutputSize += length($_);
    if(m|</job>|i) { # only cut right after a complete <job> record
        if($OutputSize > $MaxSize) {
            print $out "</source>\n";
            close($out);
            $out = undef;
        }
    }
}
close($out) if $out; # close the last chunk if it is still open
close($in);

my @files = glob($HugeFileName . "/*.xml");

my $dir = dir($HugeFileName);
my $list_file = $dir->file("file_list");
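# '>>' appends, so remove an old file_list before running the script again
my $list_file_handle = $list_file->open('>>') or die "file_list: $!\n";

foreach my $file (@files) {
   $list_file_handle->print($file . "\n");
   print "$file\n";
}
$list_file_handle->close;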

print localtime->strftime('%Y-%m-%d %X') . "\n";
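
Because the script cuts only after a line that contains </job>, the chunks together should contain exactly as many such lines as the original file. The sketch below checks that. It is not part of the original script, and the sub names count_job_ends and chunks_match_original are made up for this example.

```perl
use strict;
use warnings;

# Count the lines of a file that contain a closing </job> tag,
# using the same case-insensitive match as the split script.
sub count_job_ends {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot read $path: $!\n";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ m{</job>}i;
    }
    close($fh);
    return $count;
}

# True if all chunks together hold as many </job> lines as the original.
sub chunks_match_original {
    my ($original, @chunks) = @_;
    my $total = 0;
    $total += count_job_ends($_) for @chunks;
    return $total == count_job_ends($original);
}

# Usage, with the paths from the script above:
# my @chunks = glob("data/728/output*.xml");
# print chunks_match_original("data/728.xml", @chunks) ? "OK\n" : "MISMATCH\n";
```

This reads the big file a second time, so it roughly doubles the I/O. It is meant as a one-off check, not as a step of every run.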

3. Multiple Threads in Perl
#!/usr/bin/perl

use strict;
use warnings;

use threads;
use Thread::Queue;

my $nthreads = 5;

my $process_q = Thread::Queue->new();
my $failed_q  = Thread::Queue->new();

#This is a subroutine, but it runs 'as a thread'.
#When it starts, it inherits the program state 'as is', e.g.
#the variable declarations above all apply, but changes to
#values within the program are 'thread local' unless the
#variable is declared as 'shared'.
#Behind the scenes, Thread::Queue objects are 'shared' arrays.

sub worker {

    #NB - this will sit in a loop indefinitely until you close the queue
    #using $process_q->end().
    #We do this once we've queued all the things we want to process,
    #and the sub then completes and exits neatly.
    #However, if you _don't_ end it, this will sit waiting forever.
    #'defined' keeps the loop alive for false items such as "0".
    while ( defined( my $server = $process_q->dequeue() ) ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/sbin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}

#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
my @tasks = <$input_fh>;
print "number of tasks = " . scalar(@tasks) . "\n";
$process_q->enqueue(@tasks);
close($input_fh);

#We 'end' $process_q. Once we do, no more items may be inserted,
#and 'dequeue' returns undef when the queue is empty.
#This means our worker threads (in their 'while' loop) will then exit.
$process_q->end();

#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}

#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}

#collate results. ('synchronise' operation)
while ( defined( my $server = $failed_q->dequeue_nb() ) ) {
    print "$server failed to ping\n";
}

I changed that a little so that the worker calls a PHP script instead of ping:
my $result = `php src/import.php 728 $server`;
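
The backtick call above passes the whole command line through the shell, so a file name containing spaces or shell metacharacters would be misread. If the output of the PHP script does not need to be captured, the list form of system() avoids the shell. The following is a minimal sketch under that assumption; the helper name run_command is made up for this example.

```perl
use strict;
use warnings;

# Run a command without going through the shell and return its exit code.
# With more than one argument, system() passes each argument as-is.
# (run_command is a hypothetical helper for this example.)
sub run_command {
    my @command = @_;
    my $status = system(@command);
    die "could not start '$command[0]': $!\n" if $status == -1;
    die "'$command[0]' was killed by signal " . ($status & 127) . "\n"
        if $status & 127;
    return $status >> 8;
}

# Usage inside the worker, with the same arguments as the line above:
# my $exit_code = run_command('php', 'src/import.php', '728', $server);
# $failed_q->enqueue($server) if $exit_code != 0;
```

Unlike backticks, system() does not capture the output. Whatever the PHP script prints goes straight to the terminal.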

4. Test Results
Splitting the huge XML file (4.5 GB) on a machine with 2 CPU cores and 4 GB of memory took 00:02:05:
start 04:17:24
end   04:19:29

Sending to Redis/SQS on a machine with 2 CPU cores and 4 GB of memory took 00:03:12:
start 04:23:46
end   04:26:58
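
The durations follow from the two timestamps printed by each run. A small sketch with Time::Piece reproduces the arithmetic. The sub name elapsed is made up for this example, and it assumes both stamps fall on the same day.

```perl
use strict;
use warnings;
use Time::Piece;

# Return the time between two HH:MM:SS stamps of the same day,
# formatted as HH:MM:SS. (elapsed is a hypothetical helper.)
sub elapsed {
    my ($start, $end) = @_;
    my $format = '%H:%M:%S';
    my $t_start = Time::Piece->strptime($start, $format);
    my $t_end   = Time::Piece->strptime($end,   $format);
    # Subtracting two Time::Piece objects gives a Time::Seconds object.
    my $seconds = ($t_end - $t_start)->seconds;
    return sprintf('%02d:%02d:%02d',
        int($seconds / 3600), int(($seconds % 3600) / 60), $seconds % 60);
}

print elapsed('04:17:24', '04:19:29'), "\n";   # prints 00:02:05
print elapsed('04:23:46', '04:26:58'), "\n";   # prints 00:03:12
```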

