Perl script to parse XML with lookup

Completed Posted Jan 2, 2007 Paid on delivery

I need a Perl program to take an XML file and break it into several (about 10) smaller XML files, each containing a subset of the information. In addition, one piece of data is a person's name, which needs to be replaced with an ID number based on a lookup from a delimited .TXT file.

## Deliverables

This is basically taking one XML file and breaking it into a series of smaller files in a slightly different format (with one lookup file for IDs). It's for one user (me) on my Windows laptop with ActivePerl. Security is not a concern.

The structures of these files seem like a straightforward translation (plus an ID lookup).

A big part of parsing games is reading the play-by-play data contained in "[login to view URL]" files, but these are often missing data. This can be solved by converting a file called "[login to view URL]", which is always correct, into these "[login to view URL]"-type files.

Each game has one such inning_#.XML file for each inning played (typically 9 but could be 6 in a rainout or 15 in extra-inning games).

What I would need is to convert every [login to view URL] and [login to view URL] to generate multiple [login to view URL] through [login to view URL] files (and extra innings that may exist after inning_9).

1) The inning_#.XML file is a series of <atbat> elements, each with the count (b for balls, s for strikes, o for outs), batter id, pitcher id, <des> (summary of the at-bat in plain English), and a <pitch des> that describes each pitch. The top of the inning is labeled <top> and the bottom <bottom>.

It is preceded by <inning num>, and <next>="Y" if there is a subsequent inning and "N" if this is the last inning of the game.
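As a rough illustration of the target format, here is a minimal Perl sketch that renders one inning file from in-memory data. The attribute names (`num`, `next`, `batter`, `pitcher`), the nesting and the indentation are assumptions inferred from the description above, not taken from a real inning file, so they should be checked against an existing inning_#.xml before use. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Escape XML special characters so free-text descriptions are safe in attributes.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Render one <atbat> with its nested <pitch> elements.
# $ab is a hash ref: { b, s, o, batter, pitcher, des, pitches => [ ... ] }
sub atbat_xml {
    my ($ab) = @_;
    my $xml = sprintf qq{    <atbat b="%d" s="%d" o="%d" batter="%s" pitcher="%s" des="%s">\n},
        $ab->{b}, $ab->{s}, $ab->{o}, $ab->{batter}, $ab->{pitcher}, xml_escape($ab->{des});
    for my $pitch (@{ $ab->{pitches} || [] }) {
        $xml .= sprintf qq{      <pitch des="%s" />\n}, xml_escape($pitch);
    }
    $xml .= "    </atbat>\n";
    return $xml;
}

# Render a whole inning; $has_next is true if another inning follows.
sub inning_xml {
    my ($num, $has_next, $top, $bottom) = @_;
    my $xml = sprintf qq{<inning num="%d" next="%s">\n}, $num, ($has_next ? 'Y' : 'N');
    $xml .= "  <top>\n"    . join('', map { atbat_xml($_) } @$top)    . "  </top>\n";
    $xml .= "  <bottom>\n" . join('', map { atbat_xml($_) } @$bottom) . "  </bottom>\n";
    $xml .= "</inning>\n";
    return $xml;
}
```

Passing an empty list for the bottom half yields an empty <bottom> element, which may or may not match how real files represent an unplayed half inning.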

In the [login to view URL]:

- Focus only on the <atbat> and <play-by-play inning> tags. It shows the <pbp> which is translated into the <des> of the [login to view URL] files. The <result> is translated into the <pitch des> for [login to view URL] files.

- You can ignore the <pitching team>, <batting team>, <stats>, <batting>, <linescore>, <scoring-summary> tags.

- <pitch-by-pitch inning> has "t" for top of inning and "b" for bottom of inning. For example, <pitch-by-pitch inning="7t"> contents would go into [login to view URL] labeled <top> and <pitch-by-pitch inning="7b"> contents would go into [login to view URL] <bottom>.

- <play-by-play inning> also has balls, strikes and outs that go into the inning_#.XML. For example, b="1" s="1" o="1". In [login to view URL] it is listed after the description, but in inning_#.XML it is before the description.
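The inning attribute described above can be split into an inning number and a half with one regular expression. This is a small sketch; the function name `parse_half_inning` is hypothetical, and it assumes the attribute is always digits followed by a single "t" or "b".

```perl
use strict;
use warnings;

# Split an inning attribute such as "7t" or "10b" into (number, half).
# Returns an empty list if the value does not have the expected shape.
sub parse_half_inning {
    my ($attr) = @_;
    return unless defined $attr && $attr =~ /^(\d+)([tb])$/;
    return ($1, $2 eq 't' ? 'top' : 'bottom');
}
```

For example, `my ($num, $half) = parse_half_inning('7t');` gives inning 7 and the half 'top', which tells the script which inning file and which of <top> or <bottom> the contents belong to.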

2) The directories of the "input" files are: ./month_??/day_??/gid_2006_??_??_???mlb_???mlb_1/[login to view URL] (such as ./month_04/day_03/gid_2006_04_03_slnmlb_phimlb_1/[login to view URL])

./[login to view URL]

The directories of the "output" files are:

./month_??/day_??/gid_2006_??_??_???mlb_???mlb_1/inning/inning_?.xml

(such as ./month_04/day_03/gid_2006_04_03_slnmlb_phimlb_1/inning/[login to view URL];[login to view URL];[login to view URL];...)
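The directory patterns above map directly onto Perl's built-in `glob`. The sketch below finds game directories and builds an output path for a given inning; the names `game_dirs` and `inning_path` are hypothetical. Forward slashes are used throughout on the assumption that ActivePerl accepts them on Windows, which is normally the case.

```perl
use strict;
use warnings;

# Glob pattern for game directories, following the layout given above.
my $GAME_GLOB = './month_??/day_??/gid_2006_??_??_???mlb_???mlb_1';

# Return every matching game directory under the current directory.
sub game_dirs {
    return grep { -d $_ } glob($GAME_GLOB);
}

# Build the output path for one inning of one game.
sub inning_path {
    my ($game_dir, $inning) = @_;
    return "$game_dir/inning/inning_$inning.xml";
}
```

The script would still need to create the inning subdirectory when it does not exist (for example with `mkdir`) before writing each file.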

3) [login to view URL] has batter and pitcher id numbers only for starters, in the <lineup team> tags. For substitute players, you must do a lookup in the [login to view URL] file.

3) The [login to view URL] file has a list of teams, player ids and player names
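Loading that lookup file into a hash keyed by player name could look like the sketch below. The file layout is an assumption: the posting only says the file is delimited and lists teams, player ids and player names, so the tab delimiter and the column order (team, id, name) used here are guesses that would need adjusting to the real file. The function name `load_roster` is hypothetical.

```perl
use strict;
use warnings;

# Read roster lines from an open filehandle and return a hash ref that maps
# lower-cased "first last" names to player ids.
# ASSUMPTION: tab-delimited columns in the order team, player id, player name.
sub load_roster {
    my ($fh) = @_;
    my %id_for;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;          # tolerate Windows line endings
        next unless $line =~ /\S/;     # skip blank lines
        my ($team, $pid, $name) = split /\t/, $line;
        next unless defined $name;     # skip malformed rows
        $id_for{ lc $name } = $pid;
    }
    return \%id_for;
}
```

Keying by name alone means two players with the same name on different teams would collide; keying by team plus name would avoid that if it turns out to matter.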

4) Substitute <pitch des> in gameday_Syn to the following in [login to view URL] files:

"hit_into_play_score" to "In play, run-scoring play",

"hit_into_play" to "In play, out(s) recorded",

"hit_into_play_no_out" to "In play, no out recorded",

5) Note that in MLB, if the home team is leading after the top of the 9th inning, the bottom of the 9th is not played, so don't worry if you don't see any data in those cases.
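The three substitutions listed in item 4 fit naturally into a lookup hash. A sketch follows; passing unmapped values through unchanged is an assumption, since the posting does not say what other result values exist or how they should be handled.

```perl
use strict;
use warnings;

# Result codes and the <pitch des> text they translate to (from item 4).
my %PITCH_DES = (
    hit_into_play_score  => 'In play, run-scoring play',
    hit_into_play        => 'In play, out(s) recorded',
    hit_into_play_no_out => 'In play, no out recorded',
);

# Translate one result code; values without a mapping pass through unchanged.
sub translate_pitch_des {
    my ($code) = @_;
    return exists $PITCH_DES{$code} ? $PITCH_DES{$code} : $code;
}
```

An exact-match hash lookup also sidesteps the fact that "hit_into_play" is a prefix of the other two codes, which could cause wrong replacements if the translation were done with a chain of substitutions.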

Q&A:

1) I can't tell who's pitching and batting in any given atbat or inning with 100% accuracy. The only place I can find the batter information is in the pbp fields, which, since it's "spoken word" text, makes it much more difficult to parse and do lookups. Just from a cursory look, it seems like the batter is the first person mentioned in the text string unless there's a colon in the string, which would indicate a lineup change or something like that. Is that correct? For the pitcher, I can't see any way to determine who they are at all without looking at all the "Pitcher Change" entries. Is that how I'm supposed to determine that?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The player ids go round-robin from the [login to view URL] file, tagged by <lineup team>:

<lineup team="cin">

<batter pid="408252" position="2B" batting_order="1" />

<batter pid="150479" position="SS" batting_order="2" />

<batter pid="150472" position="CF" batting_order="3" />

<batter pid="110383" position="1B" batting_order="4" />

<batter pid="276055" position="LF" batting_order="5" />

<batter pid="400290" position="RF" batting_order="6" />

<batter pid="429665" position="3B" batting_order="7" />

<batter pid="424325" position="C" batting_order="8" />

<batter pid="434298" position="P" batting_order="9" />

The team bats through the order (the first batter is batting_order=1, the second is batting_order=2, and so on). After #9 it cycles back to the first batter. So one approach is to loop through the <batter pid> numbers until there is a "Substitution".
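The round-robin idea can be sketched as a small lineup tracker that hands out the next batter's pid and wraps around after the ninth slot. The function names are hypothetical, and the data structure is just one way to hold the state.

```perl
use strict;
use warnings;

# Create a lineup tracker from the pids listed in batting order.
sub new_lineup {
    my (@pids) = @_;
    return { order => [@pids], next => 0 };
}

# Return the pid of the batter who is up, then advance to the next slot,
# wrapping back to the first batter after the last one.
sub next_batter {
    my ($lineup) = @_;
    my $pid = $lineup->{order}[ $lineup->{next} ];
    $lineup->{next} = ($lineup->{next} + 1) % scalar @{ $lineup->{order} };
    return $pid;
}

# Put a new player into the slot held by an old one (after a substitution).
sub substitute {
    my ($lineup, $old_pid, $new_pid) = @_;
    for my $slot (@{ $lineup->{order} }) {
        $slot = $new_pid if $slot eq $old_pid;
    }
    return;
}
```

Presumably the tracker should advance only when an at-bat is completed, not for action lines such as stolen bases or substitutions, but that would need confirming against real game data.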

Perhaps you can do this approach:

For batters:

In Gameday_Syn, look at <pbp description>.

* IF <pbp description> starts with "With", such as "With Eric Bruntlett batting, Brad Ausmus caught stealing 2nd base, catcher David Ross to shortstop Felipe Lopez to first baseman Rich Aurilia"

then in the inning_#.xml add an <action> line such as <action b="2" s="0" o="0" des="With Eric Bruntlett batting, Brad Ausmus caught stealing 2nd base, catcher David Ross to shortstop Felipe Lopez to first baseman Rich Aurilia" />

NOTE: THERE ARE NO batter="####" AND NO pitcher="####", so no need to look up or translate, just put in the

description without modifying. Easy!

* IF <pbp description> contains "Substitution" anywhere in the text, add an <action> line as above, with no batter ids and no pitcher ids. Just the text, such as: <action b="0" s="0" o="1" des="Offensive Substitution: Pinch hitter Edgar Renteria replaces Chad Paronto." />

* IF <pbp description> starts with "Offensive Substitution", then in addition to the <action> line, the new <batter pid> gets looked up in the ROSTER. For example, for "Offensive Substitution: Pinch hitter Andy Green replaces Jose Valverde" you would look up "Andy Green" in the ROSTER file and, moving forward, have his <batter id> take that place in the batting_order.

* IF <pbp description> starts with "Pitcher Change" add an <action> line as above.

* OTHERWISE find the batterid by looking up in the ROSTER file first and last name. The first name is in the

<pbp description> up to the first space, the last name is from there until the 2nd space.
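The batter rules above can be sketched as a single classifier over the description text. This is only an illustration under the assumptions stated in the rules: names are exactly two words, and pinch-hitter lines follow the "Pinch hitter A B replaces C D" shape shown in the examples. The function name `classify_pbp` and the returned keys are hypothetical.

```perl
use strict;
use warnings;

# Classify a play-by-play description according to the rules above.
# Returns a hash ref with a 'type' key and, where relevant, player names.
sub classify_pbp {
    my ($des) = @_;

    # Pinch hitter: an <action> line plus a batting-order update.
    if ($des =~ /^Offensive Substitution: Pinch hitter (\S+ \S+) replaces (\S+ \S+?)\.?\s*$/) {
        return { type => 'action', new_batter => $1, old_batter => $2 };
    }

    # Any other action line: the text is copied through with no ids.
    if ($des =~ /^With / || $des =~ /Substitution/ || $des =~ /^Pitcher Change/) {
        return { type => 'action' };
    }

    # Otherwise an at-bat: the first two words are taken as the batter's name.
    if ($des =~ /^(\S+) (\S+)/) {
        return { type => 'atbat', batter => "$1 $2" };
    }

    return { type => 'unknown' };
}
```

The two-word name rule will misread names with suffixes or more than two parts, so unmatched roster lookups are worth logging rather than silently skipping.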

For pitchers:

* In Gameday_Syn, look at <lineup team><pitcher pid="#####" />. This will be the first pitcher for the team.

* The <home-team team>'s pitcher pitches during the TOP of each inning.

* The <away-team team>'s pitcher pitches during the BOTTOM of each inning.

* The pitcher pid will stay the same until there is a "Pitcher Change" encountered in the <pbp description>.

At that point look at the <pbp description> text which reads like "Pitcher Change: Russ Springer replaces ..."

The text after the ": " and before the next space is the first name ("Russ") and until the next space

is the last name ("Springer"). You can then look up the pitcher pid in the ROSTER table.

* The new pitcher pid stays the same until the game ends or there is another "Pitcher Change:" in the description.
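The pitcher rules can be sketched the same way: keep one current pid per side and swap it when a "Pitcher Change" line appears. The sketch assumes the change always applies to the team pitching in the current half inning, that names are two words, and that the roster is a hash of lower-cased names to pids. All function names are hypothetical.

```perl
use strict;
use warnings;

# Extract the incoming pitcher's name from a "Pitcher Change" description.
# Returns undef if the text is not a pitcher change.
sub new_pitcher_name {
    my ($des) = @_;
    return $des =~ /^Pitcher Change: (\S+) (\S+)/ ? "$1 $2" : undef;
}

# The home team pitches in the top of an inning, the away team in the bottom.
sub pitching_side {
    my ($half) = @_;          # 'top' or 'bottom'
    return $half eq 'top' ? 'home' : 'away';
}

# Return the pid of the pitcher for this half inning, updating it first if
# the description is a pitcher change.
# $current is { home => pid, away => pid }; $roster maps names to pids.
sub apply_pitcher_change {
    my ($current, $half, $des, $roster) = @_;
    my $side = pitching_side($half);
    my $name = new_pitcher_name($des);
    if (defined $name) {
        my $pid = $roster->{ lc $name };
        if (defined $pid) {
            $current->{$side} = $pid;
        }
        else {
            warn "No roster entry for pitcher '$name'\n";   # keep the old pid
        }
    }
    return $current->{$side};
}
```

The initial values for the home and away entries would come from the <pitcher pid> in each team's <lineup team> block.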

## Platform

Windows XP with ActivePerl

Engineering MySQL Perl PHP Software Architecture Software Testing

Project ID: #3972318

About the project

10 proposals Remote project Active Jan 11, 2007

Awarded to:

mytopcode

See private message.

$212.5 USD in 21 days
(53 reviews)
5.2

10 freelancers are bidding an average of $184 for this job

skywriter14

See private message.

$127.5 USD in 21 days
(22 reviews)
4.9
gromilink

See private message.

$255 USD in 21 days
(17 reviews)
4.5
neatcodersl

See private message.

$255 USD in 21 days
(32 reviews)
4.9
biravw

See private message.

$153 USD in 21 days
(15 reviews)
4.3
ajaysbritto

See private message.

$127.5 USD in 21 days
(7 reviews)
2.6
smartcoder12

See private message.

$169.15 USD in 21 days
(2 reviews)
2.3
dilipdesavali

See private message.

$127.5 USD in 21 days
(2 reviews)
2.0
lsengel

See private message.

$170 USD in 21 days
(0 reviews)
0.0
coutinhovw

See private message.

$238 USD in 21 days
(0 reviews)
0.0