Search for ways in OpenStreetMap extract
up vote
4
down vote
favorite
What does it do ?
Looks in big XML files (1-3 GB) for given parameters that I need in my project, appends them to lists and finally exports both of them to the CSV files.
My XML scheme
I need to get child named 'v', which is value of specific child 'k' nested in tag named 'tag' which is nested within parent 'way'.
What I tried
IBM's take on big xml files
Plus various scripts from StackExchange.
Performance
My function needs 50% less time to parse XML than above method, any other I tried or I was able to try, because with some I was unable to figure things out and given up on them.
My goal
To get some tips on my code, hopefully find way to speed it up.
I need to parse about 20 - 30 GB of XML files (as mentioned before single file is 1-3 GB), whole procedure is really time consuming - It takes about 12 hours up to 2 days of non stop parsing.
Variables description:
xml - path to xml file
list_agis - list of IDs, list which length is number between 5k - 1 kk
parent - way
child - tag
child_atribitute - k
child_value_1 - highway
child_value_2 - track
name_id, name_atribute, name_file, path_csv, part_name - variables needed to create csv file name
My code
def save_to_csv(
list_1, list_2, name_1,
name_2, csv_name, catalogue,
part_name):
"""
Saves to CSV, based on 2 lists.
"""
raw_data = {name_1: list_1,
name_2: list_2}
df = pd.DataFrame(raw_data, columns=[name_1, name_2])
df.to_csv(
'{0}{1}_{2}.csv'.format(catalogue, part_name, csv_name),
index=False, header=True, encoding = 'CP1250')
def xml_parser(
xml, lista_agis, parent,
atribiute_parent, child, child_atribiute,
child_value_1, child_value_2, name_file,
sciezka_csv, name_id, name_atribiute,
part_name):
"""
Function to pick from xml files tag values.
Firstly it creates tree of xml file and then
goes each level town and when final condtion is fullfiled
id and value from xml file is appended to list in the end of
xml file list is saved to CSV.
"""
rootElement = ET.parse(xml).getroot()
list_id =
list_value =
for subelement in rootElement:
if subelement.tag == parent:
if subelement.get(atribiute_parent) in lista_agis:
for sselement in subelement:
if sselement.tag == child:
if sselement.attrib[child_atribiute] == child_value_1:
list_id.append(
subelement.get(atribiute_parent))
list_value.append(
sselement.get(child_value_2))
save_to_csv(
list_id, list_value, name_id,
name_atribiute, name_file,
sciezka_csv, part_name)
python parsing xml
|
show 1 more comment
up vote
4
down vote
favorite
What does it do ?
Looks in big XML files (1-3 GB) for given parameters that I need in my project, appends them to lists and finally exports both of them to the CSV files.
My XML scheme
I need to get child named 'v', which is value of specific child 'k' nested in tag named 'tag' which is nested within parent 'way'.
What I tried
IBM's take on big xml files
Plus various scripts from StackExchange.
Performance
My function needs 50% less time to parse XML than above method, any other I tried or I was able to try, because with some I was unable to figure things out and given up on them.
My goal
To get some tips on my code, hopefully find way to speed it up.
I need to parse about 20 - 30 GB of XML files (as mentioned before single file is 1-3 GB), whole procedure is really time consuming - It takes about 12 hours up to 2 days of non stop parsing.
Variables description:
xml - path to xml file
list_agis - list of IDs, list which length is number between 5k - 1 kk
parent - way
child - tag
child_atribitute - k
child_value_1 - highway
child_value_2 - track
name_id, name_atribute, name_file, path_csv, part_name - variables needed to create csv file name
My code
def save_to_csv(
list_1, list_2, name_1,
name_2, csv_name, catalogue,
part_name):
"""
Saves to CSV, based on 2 lists.
"""
raw_data = {name_1: list_1,
name_2: list_2}
df = pd.DataFrame(raw_data, columns=[name_1, name_2])
df.to_csv(
'{0}{1}_{2}.csv'.format(catalogue, part_name, csv_name),
index=False, header=True, encoding = 'CP1250')
def xml_parser(
xml, lista_agis, parent,
atribiute_parent, child, child_atribiute,
child_value_1, child_value_2, name_file,
sciezka_csv, name_id, name_atribiute,
part_name):
"""
Function to pick from xml files tag values.
Firstly it creates tree of xml file and then
goes each level town and when final condtion is fullfiled
id and value from xml file is appended to list in the end of
xml file list is saved to CSV.
"""
rootElement = ET.parse(xml).getroot()
list_id =
list_value =
for subelement in rootElement:
if subelement.tag == parent:
if subelement.get(atribiute_parent) in lista_agis:
for sselement in subelement:
if sselement.tag == child:
if sselement.attrib[child_atribiute] == child_value_1:
list_id.append(
subelement.get(atribiute_parent))
list_value.append(
sselement.get(child_value_2))
save_to_csv(
list_id, list_value, name_id,
name_atribiute, name_file,
sciezka_csv, part_name)
python parsing xml
1
Welcome on code review. When you ask to be reviewed from international people, try to give english code. You have more chance to get useful response if code have meaning for reviewers.
– Calak
Nov 21 at 15:06
Is that OSM XML? Can you avoid the overhead of XML by using a PBF version of your OSM dump instead? (C++ certainly has good libraries for reading OSM PBF; I'd be very surprised if Python doesn't).
– Toby Speight
Nov 21 at 15:28
2
Yes that is OSM XML. Im gonna dive into that pbf files then. Thanks! edit: I think this is my solution. imposm.org/docs/imposm.parser/latest
– JuniorPythonNewbie
Nov 21 at 15:31
2
You probably want to use a stream parser, instead of loading the entire document into memory and then parsing it. See this answer on Stack Overflow.
– AJNeufeld
Nov 21 at 15:50
1
k
andv
are "attributes", not "children".
– Reinderien
Nov 21 at 16:55
|
show 1 more comment
up vote
4
down vote
favorite
up vote
4
down vote
favorite
What does it do ?
Looks in big XML files (1-3 GB) for given parameters that I need in my project, appends them to lists and finally exports both of them to the CSV files.
My XML scheme
I need to get child named 'v', which is value of specific child 'k' nested in tag named 'tag' which is nested within parent 'way'.
What I tried
IBM's take on big xml files
Plus various scripts from StackExchange.
Performance
My function needs 50% less time to parse XML than above method, any other I tried or I was able to try, because with some I was unable to figure things out and given up on them.
My goal
To get some tips on my code, hopefully find way to speed it up.
I need to parse about 20 - 30 GB of XML files (as mentioned before single file is 1-3 GB), whole procedure is really time consuming - It takes about 12 hours up to 2 days of non stop parsing.
Variables description:
xml - path to xml file
list_agis - list of IDs, list which length is number between 5k - 1 kk
parent - way
child - tag
child_atribitute - k
child_value_1 - highway
child_value_2 - track
name_id, name_atribute, name_file, path_csv, part_name - variables needed to create csv file name
My code
def save_to_csv(
list_1, list_2, name_1,
name_2, csv_name, catalogue,
part_name):
"""
Saves to CSV, based on 2 lists.
"""
raw_data = {name_1: list_1,
name_2: list_2}
df = pd.DataFrame(raw_data, columns=[name_1, name_2])
df.to_csv(
'{0}{1}_{2}.csv'.format(catalogue, part_name, csv_name),
index=False, header=True, encoding = 'CP1250')
def xml_parser(
xml, lista_agis, parent,
atribiute_parent, child, child_atribiute,
child_value_1, child_value_2, name_file,
sciezka_csv, name_id, name_atribiute,
part_name):
"""
Function to pick from xml files tag values.
Firstly it creates tree of xml file and then
goes each level town and when final condtion is fullfiled
id and value from xml file is appended to list in the end of
xml file list is saved to CSV.
"""
rootElement = ET.parse(xml).getroot()
list_id =
list_value =
for subelement in rootElement:
if subelement.tag == parent:
if subelement.get(atribiute_parent) in lista_agis:
for sselement in subelement:
if sselement.tag == child:
if sselement.attrib[child_atribiute] == child_value_1:
list_id.append(
subelement.get(atribiute_parent))
list_value.append(
sselement.get(child_value_2))
save_to_csv(
list_id, list_value, name_id,
name_atribiute, name_file,
sciezka_csv, part_name)
python parsing xml
What does it do ?
Looks in big XML files (1-3 GB) for given parameters that I need in my project, appends them to lists and finally exports both of them to the CSV files.
My XML scheme
I need to get child named 'v', which is value of specific child 'k' nested in tag named 'tag' which is nested within parent 'way'.
What I tried
IBM's take on big xml files
Plus various scripts from StackExchange.
Performance
My function needs 50% less time to parse XML than above method, any other I tried or I was able to try, because with some I was unable to figure things out and given up on them.
My goal
To get some tips on my code, hopefully find way to speed it up.
I need to parse about 20 - 30 GB of XML files (as mentioned before single file is 1-3 GB), whole procedure is really time consuming - It takes about 12 hours up to 2 days of non stop parsing.
Variables description:
xml - path to xml file
list_agis - list of IDs, list which length is number between 5k - 1 kk
parent - way
child - tag
child_atribitute - k
child_value_1 - highway
child_value_2 - track
name_id, name_atribute, name_file, path_csv, part_name - variables needed to create csv file name
My code
def save_to_csv(
list_1, list_2, name_1,
name_2, csv_name, catalogue,
part_name):
"""
Saves to CSV, based on 2 lists.
"""
raw_data = {name_1: list_1,
name_2: list_2}
df = pd.DataFrame(raw_data, columns=[name_1, name_2])
df.to_csv(
'{0}{1}_{2}.csv'.format(catalogue, part_name, csv_name),
index=False, header=True, encoding = 'CP1250')
def xml_parser(
xml, lista_agis, parent,
atribiute_parent, child, child_atribiute,
child_value_1, child_value_2, name_file,
sciezka_csv, name_id, name_atribiute,
part_name):
"""
Function to pick from xml files tag values.
Firstly it creates tree of xml file and then
goes each level town and when final condtion is fullfiled
id and value from xml file is appended to list in the end of
xml file list is saved to CSV.
"""
rootElement = ET.parse(xml).getroot()
list_id =
list_value =
for subelement in rootElement:
if subelement.tag == parent:
if subelement.get(atribiute_parent) in lista_agis:
for sselement in subelement:
if sselement.tag == child:
if sselement.attrib[child_atribiute] == child_value_1:
list_id.append(
subelement.get(atribiute_parent))
list_value.append(
sselement.get(child_value_2))
save_to_csv(
list_id, list_value, name_id,
name_atribiute, name_file,
sciezka_csv, part_name)
python parsing xml
python parsing xml
edited Nov 21 at 15:34
Toby Speight
22.6k537109
22.6k537109
asked Nov 21 at 14:59
JuniorPythonNewbie
504
504
1
Welcome on code review. When you ask to be reviewed from international people, try to give english code. You have more chance to get useful response if code have meaning for reviewers.
– Calak
Nov 21 at 15:06
Is that OSM XML? Can you avoid the overhead of XML by using a PBF version of your OSM dump instead? (C++ certainly has good libraries for reading OSM PBF; I'd be very surprised if Python doesn't).
– Toby Speight
Nov 21 at 15:28
2
Yes that is OSM XML. Im gonna dive into that pbf files then. Thanks! edit: I think this is my solution. imposm.org/docs/imposm.parser/latest
– JuniorPythonNewbie
Nov 21 at 15:31
2
You probably want to use a stream parser, instead of loading the entire document into memory and then parsing it. See this answer on Stack Overflow.
– AJNeufeld
Nov 21 at 15:50
1
k
andv
are "attributes", not "children".
– Reinderien
Nov 21 at 16:55
|
show 1 more comment
1
Welcome on code review. When you ask to be reviewed from international people, try to give english code. You have more chance to get useful response if code have meaning for reviewers.
– Calak
Nov 21 at 15:06
Is that OSM XML? Can you avoid the overhead of XML by using a PBF version of your OSM dump instead? (C++ certainly has good libraries for reading OSM PBF; I'd be very surprised if Python doesn't).
– Toby Speight
Nov 21 at 15:28
2
Yes that is OSM XML. Im gonna dive into that pbf files then. Thanks! edit: I think this is my solution. imposm.org/docs/imposm.parser/latest
– JuniorPythonNewbie
Nov 21 at 15:31
2
You probably want to use a stream parser, instead of loading the entire document into memory and then parsing it. See this answer on Stack Overflow.
– AJNeufeld
Nov 21 at 15:50
1
k
andv
are "attributes", not "children".
– Reinderien
Nov 21 at 16:55
1
1
Welcome on code review. When you ask to be reviewed from international people, try to give english code. You have more chance to get useful response if code have meaning for reviewers.
– Calak
Nov 21 at 15:06
Welcome on code review. When you ask to be reviewed from international people, try to give english code. You have more chance to get useful response if code have meaning for reviewers.
– Calak
Nov 21 at 15:06
Is that OSM XML? Can you avoid the overhead of XML by using a PBF version of your OSM dump instead? (C++ certainly has good libraries for reading OSM PBF; I'd be very surprised if Python doesn't).
– Toby Speight
Nov 21 at 15:28
Is that OSM XML? Can you avoid the overhead of XML by using a PBF version of your OSM dump instead? (C++ certainly has good libraries for reading OSM PBF; I'd be very surprised if Python doesn't).
– Toby Speight
Nov 21 at 15:28
2
2
Yes that is OSM XML. Im gonna dive into that pbf files then. Thanks! edit: I think this is my solution. imposm.org/docs/imposm.parser/latest
– JuniorPythonNewbie
Nov 21 at 15:31
Yes that is OSM XML. Im gonna dive into that pbf files then. Thanks! edit: I think this is my solution. imposm.org/docs/imposm.parser/latest
– JuniorPythonNewbie
Nov 21 at 15:31
2
2
You probably want to use a stream parser, instead of loading the entire document into memory and then parsing it. See this answer on Stack Overflow.
– AJNeufeld
Nov 21 at 15:50
You probably want to use a stream parser, instead of loading the entire document into memory and then parsing it. See this answer on Stack Overflow.
– AJNeufeld
Nov 21 at 15:50
1
1
k
and v
are "attributes", not "children".– Reinderien
Nov 21 at 16:55
k
and v
are "attributes", not "children".– Reinderien
Nov 21 at 16:55
|
show 1 more comment
1 Answer
1
active
oldest
votes
up vote
0
down vote
Introduction
Unfortunately imposm is just for Python 2, my project is in Python 3. I think lxml library looks promising. I wrote simple code to test it, right now it is based on just 2 nodes.
Picture of nodes
I attach picture from XML file so one can see what I am dealing with.
How it works
It is iterating over l_id (which is list of ids) using etree.parse and findall.
First inner loop gathers dictionary where given id is.
Second inner loop gathers dictionary where chosen value is.
Loop for dict_ids_all appends to new list only ids from dictionary.
Loop for dict_ids_all appends to new list only value from dictionary.
My code
tree = lxml.etree.parse(r'path to xml')
l_dict_ids_all =
l_dict_values_all=
l_only_id =
l_only_values =
l_id = ['"35121262"', '"35121263"']
name = '"name"'
for id in l_id:
for tag in tree.findall('//node[@id={0}]'.format(id)):
l_dict_ids_all.append(tag.attrib)
for tag in tree.findall('//node[@id={0}]//tag[@k={1}]'.format(id,name)):
l_dict_values_all.append(tag.attrib)
#printing is only for review purpose
print('Full id dict')
print(l_dict_ids_all)
print('Full Value dict')
print(l_dict_values_all)
print('Only ID list')
for element in l_dict_ids_all:
l_only_id.append(element['id'])
print(l_only_id)
print('Only Value list')
for element in l_dict_values_all:
l_only_values.append(element['k'])
print(l_only_values)
Output
Full id dict
[{'id': '35121262', 'visible': 'true', 'version': '17',
'changeset': '41419943', 'timestamp': '2016-08-12T22:24:23Z', 'user':
'kocio', 'uid': '52087', 'lat': '52.1560439', 'lon': '21.0346808'},
{'id': '35121263', 'visible': 'true', 'version': '16', 'changeset':
'41419943', 'timestamp': '2016-08-12T22:24:20Z', 'user': 'kocio',
'uid': '52087', 'lat': '52.1492285', 'lon': '21.0461042'}]
Full Value dict [{'k': 'name', 'v': 'Stokłosy'}, {'k': 'name', 'v': 'Imielin'}]
Only ID list ['35121262', '35121263']
Only Value list ['name', 'name']
What I tried
I am aware that creating list and using it to append items to new list is wrong, but whenever I tried something like this:
l_dict_ids_all.append(tag.attrib[0]['id'])
Received an error :
TypeError Traceback (most recent call)
ipython-input-91-8b0a49bc5f35 in ()
7 for id in l_id:
8 for tag in tree.findall('//node[@id={0}]'.format(id)):
----> 9 l_dict_ids_all.append(tag.attrib[0]['id'])
10
src/lxml/etree.pyx in lxml.etree._Attrib.getitem()
src/lxml/apihelpers.pxi in lxml.etree._getAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNodeAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNsTag()
src/lxml/apihelpers.pxi in lxml.etree.__getNsTag()
src/lxml/apihelpers.pxi in lxml.etree._utf8()
TypeError: Argument must be bytes or unicode, got 'int'
My goal/problem
Code is working, but I want to make it better.
I need to get rid of 2 out 4 lists which I create at the begging.
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
Introduction
Unfortunately imposm is just for Python 2, my project is in Python 3. I think lxml library looks promising. I wrote simple code to test it, right now it is based on just 2 nodes.
Picture of nodes
I attach picture from XML file so one can see what I am dealing with.
How it works
It is iterating over l_id (which is list of ids) using etree.parse and findall.
First inner loop gathers dictionary where given id is.
Second inner loop gathers dictionary where chosen value is.
Loop for dict_ids_all appends to new list only ids from dictionary.
Loop for dict_ids_all appends to new list only value from dictionary.
My code
tree = lxml.etree.parse(r'path to xml')
l_dict_ids_all =
l_dict_values_all=
l_only_id =
l_only_values =
l_id = ['"35121262"', '"35121263"']
name = '"name"'
for id in l_id:
for tag in tree.findall('//node[@id={0}]'.format(id)):
l_dict_ids_all.append(tag.attrib)
for tag in tree.findall('//node[@id={0}]//tag[@k={1}]'.format(id,name)):
l_dict_values_all.append(tag.attrib)
#printing is only for review purpose
print('Full id dict')
print(l_dict_ids_all)
print('Full Value dict')
print(l_dict_values_all)
print('Only ID list')
for element in l_dict_ids_all:
l_only_id.append(element['id'])
print(l_only_id)
print('Only Value list')
for element in l_dict_values_all:
l_only_values.append(element['k'])
print(l_only_values)
Output
Full id dict
[{'id': '35121262', 'visible': 'true', 'version': '17',
'changeset': '41419943', 'timestamp': '2016-08-12T22:24:23Z', 'user':
'kocio', 'uid': '52087', 'lat': '52.1560439', 'lon': '21.0346808'},
{'id': '35121263', 'visible': 'true', 'version': '16', 'changeset':
'41419943', 'timestamp': '2016-08-12T22:24:20Z', 'user': 'kocio',
'uid': '52087', 'lat': '52.1492285', 'lon': '21.0461042'}]
Full Value dict [{'k': 'name', 'v': 'Stokłosy'}, {'k': 'name', 'v': 'Imielin'}]
Only ID list ['35121262', '35121263']
Only Value list ['name', 'name']
What I tried
I am aware that creating list and using it to append items to new list is wrong, but whenever I tried something like this:
l_dict_ids_all.append(tag.attrib[0]['id'])
Received an error :
TypeError Traceback (most recent call)
ipython-input-91-8b0a49bc5f35 in ()
7 for id in l_id:
8 for tag in tree.findall('//node[@id={0}]'.format(id)):
----> 9 l_dict_ids_all.append(tag.attrib[0]['id'])
10
src/lxml/etree.pyx in lxml.etree._Attrib.getitem()
src/lxml/apihelpers.pxi in lxml.etree._getAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNodeAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNsTag()
src/lxml/apihelpers.pxi in lxml.etree.__getNsTag()
src/lxml/apihelpers.pxi in lxml.etree._utf8()
TypeError: Argument must be bytes or unicode, got 'int'
My goal/problem
Code is working, but I want to make it better.
I need to get rid of 2 out 4 lists which I create at the begging.
add a comment |
up vote
0
down vote
Introduction
Unfortunately imposm is just for Python 2, my project is in Python 3. I think lxml library looks promising. I wrote simple code to test it, right now it is based on just 2 nodes.
Picture of nodes
I attach picture from XML file so one can see what I am dealing with.
How it works
It is iterating over l_id (which is list of ids) using etree.parse and findall.
First inner loop gathers dictionary where given id is.
Second inner loop gathers dictionary where chosen value is.
Loop for dict_ids_all appends to new list only ids from dictionary.
Loop for dict_ids_all appends to new list only value from dictionary.
My code
tree = lxml.etree.parse(r'path to xml')
l_dict_ids_all =
l_dict_values_all=
l_only_id =
l_only_values =
l_id = ['"35121262"', '"35121263"']
name = '"name"'
for id in l_id:
for tag in tree.findall('//node[@id={0}]'.format(id)):
l_dict_ids_all.append(tag.attrib)
for tag in tree.findall('//node[@id={0}]//tag[@k={1}]'.format(id,name)):
l_dict_values_all.append(tag.attrib)
#printing is only for review purpose
print('Full id dict')
print(l_dict_ids_all)
print('Full Value dict')
print(l_dict_values_all)
print('Only ID list')
for element in l_dict_ids_all:
l_only_id.append(element['id'])
print(l_only_id)
print('Only Value list')
for element in l_dict_values_all:
l_only_values.append(element['k'])
print(l_only_values)
Output
Full id dict
[{'id': '35121262', 'visible': 'true', 'version': '17',
'changeset': '41419943', 'timestamp': '2016-08-12T22:24:23Z', 'user':
'kocio', 'uid': '52087', 'lat': '52.1560439', 'lon': '21.0346808'},
{'id': '35121263', 'visible': 'true', 'version': '16', 'changeset':
'41419943', 'timestamp': '2016-08-12T22:24:20Z', 'user': 'kocio',
'uid': '52087', 'lat': '52.1492285', 'lon': '21.0461042'}]
Full Value dict [{'k': 'name', 'v': 'Stokłosy'}, {'k': 'name', 'v': 'Imielin'}]
Only ID list ['35121262', '35121263']
Only Value list ['name', 'name']
What I tried
I am aware that creating list and using it to append items to new list is wrong, but whenever I tried something like this:
l_dict_ids_all.append(tag.attrib[0]['id'])
Received an error :
TypeError Traceback (most recent call)
ipython-input-91-8b0a49bc5f35 in ()
7 for id in l_id:
8 for tag in tree.findall('//node[@id={0}]'.format(id)):
----> 9 l_dict_ids_all.append(tag.attrib[0]['id'])
10
src/lxml/etree.pyx in lxml.etree._Attrib.getitem()
src/lxml/apihelpers.pxi in lxml.etree._getAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNodeAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNsTag()
src/lxml/apihelpers.pxi in lxml.etree.__getNsTag()
src/lxml/apihelpers.pxi in lxml.etree._utf8()
TypeError: Argument must be bytes or unicode, got 'int'
My goal/problem
Code is working, but I want to make it better.
I need to get rid of 2 out 4 lists which I create at the begging.
add a comment |
up vote
0
down vote
up vote
0
down vote
Introduction
Unfortunately imposm is just for Python 2, my project is in Python 3. I think lxml library looks promising. I wrote simple code to test it, right now it is based on just 2 nodes.
Picture of nodes
I attach picture from XML file so one can see what I am dealing with.
How it works
It is iterating over l_id (which is list of ids) using etree.parse and findall.
First inner loop gathers dictionary where given id is.
Second inner loop gathers dictionary where chosen value is.
Loop for dict_ids_all appends to new list only ids from dictionary.
Loop for dict_ids_all appends to new list only value from dictionary.
My code
tree = lxml.etree.parse(r'path to xml')
l_dict_ids_all =
l_dict_values_all=
l_only_id =
l_only_values =
l_id = ['"35121262"', '"35121263"']
name = '"name"'
for id in l_id:
for tag in tree.findall('//node[@id={0}]'.format(id)):
l_dict_ids_all.append(tag.attrib)
for tag in tree.findall('//node[@id={0}]//tag[@k={1}]'.format(id,name)):
l_dict_values_all.append(tag.attrib)
#printing is only for review purpose
print('Full id dict')
print(l_dict_ids_all)
print('Full Value dict')
print(l_dict_values_all)
print('Only ID list')
for element in l_dict_ids_all:
l_only_id.append(element['id'])
print(l_only_id)
print('Only Value list')
for element in l_dict_values_all:
l_only_values.append(element['k'])
print(l_only_values)
Output
Full id dict
[{'id': '35121262', 'visible': 'true', 'version': '17',
'changeset': '41419943', 'timestamp': '2016-08-12T22:24:23Z', 'user':
'kocio', 'uid': '52087', 'lat': '52.1560439', 'lon': '21.0346808'},
{'id': '35121263', 'visible': 'true', 'version': '16', 'changeset':
'41419943', 'timestamp': '2016-08-12T22:24:20Z', 'user': 'kocio',
'uid': '52087', 'lat': '52.1492285', 'lon': '21.0461042'}]
Full Value dict [{'k': 'name', 'v': 'Stokłosy'}, {'k': 'name', 'v': 'Imielin'}]
Only ID list ['35121262', '35121263']
Only Value list ['name', 'name']
What I tried
I am aware that creating list and using it to append items to new list is wrong, but whenever I tried something like this:
l_dict_ids_all.append(tag.attrib[0]['id'])
Received an error :
TypeError Traceback (most recent call)
ipython-input-91-8b0a49bc5f35 in ()
7 for id in l_id:
8 for tag in tree.findall('//node[@id={0}]'.format(id)):
----> 9 l_dict_ids_all.append(tag.attrib[0]['id'])
10
src/lxml/etree.pyx in lxml.etree._Attrib.getitem()
src/lxml/apihelpers.pxi in lxml.etree._getAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNodeAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNsTag()
src/lxml/apihelpers.pxi in lxml.etree.__getNsTag()
src/lxml/apihelpers.pxi in lxml.etree._utf8()
TypeError: Argument must be bytes or unicode, got 'int'
My goal/problem
Code is working, but I want to make it better.
I need to get rid of 2 out 4 lists which I create at the begging.
Introduction
Unfortunately imposm is just for Python 2, my project is in Python 3. I think lxml library looks promising. I wrote simple code to test it, right now it is based on just 2 nodes.
Picture of nodes
I attach picture from XML file so one can see what I am dealing with.
How it works
It is iterating over l_id (which is list of ids) using etree.parse and findall.
First inner loop gathers dictionary where given id is.
Second inner loop gathers dictionary where chosen value is.
Loop for dict_ids_all appends to new list only ids from dictionary.
Loop for dict_ids_all appends to new list only value from dictionary.
My code
tree = lxml.etree.parse(r'path to xml')
l_dict_ids_all =
l_dict_values_all=
l_only_id =
l_only_values =
l_id = ['"35121262"', '"35121263"']
name = '"name"'
for id in l_id:
for tag in tree.findall('//node[@id={0}]'.format(id)):
l_dict_ids_all.append(tag.attrib)
for tag in tree.findall('//node[@id={0}]//tag[@k={1}]'.format(id,name)):
l_dict_values_all.append(tag.attrib)
#printing is only for review purpose
print('Full id dict')
print(l_dict_ids_all)
print('Full Value dict')
print(l_dict_values_all)
print('Only ID list')
for element in l_dict_ids_all:
l_only_id.append(element['id'])
print(l_only_id)
print('Only Value list')
for element in l_dict_values_all:
l_only_values.append(element['k'])
print(l_only_values)
Output
Full id dict
[{'id': '35121262', 'visible': 'true', 'version': '17',
'changeset': '41419943', 'timestamp': '2016-08-12T22:24:23Z', 'user':
'kocio', 'uid': '52087', 'lat': '52.1560439', 'lon': '21.0346808'},
{'id': '35121263', 'visible': 'true', 'version': '16', 'changeset':
'41419943', 'timestamp': '2016-08-12T22:24:20Z', 'user': 'kocio',
'uid': '52087', 'lat': '52.1492285', 'lon': '21.0461042'}]
Full Value dict [{'k': 'name', 'v': 'Stokłosy'}, {'k': 'name', 'v': 'Imielin'}]
Only ID list ['35121262', '35121263']
Only Value list ['name', 'name']
What I tried
I am aware that creating list and using it to append items to new list is wrong, but whenever I tried something like this:
l_dict_ids_all.append(tag.attrib[0]['id'])
Received an error :
TypeError Traceback (most recent call)
ipython-input-91-8b0a49bc5f35 in ()
7 for id in l_id:
8 for tag in tree.findall('//node[@id={0}]'.format(id)):
----> 9 l_dict_ids_all.append(tag.attrib[0]['id'])
10
src/lxml/etree.pyx in lxml.etree._Attrib.getitem()
src/lxml/apihelpers.pxi in lxml.etree._getAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNodeAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._getNsTag()
src/lxml/apihelpers.pxi in lxml.etree.__getNsTag()
src/lxml/apihelpers.pxi in lxml.etree._utf8()
TypeError: Argument must be bytes or unicode, got 'int'
My goal/problem
Code is working, but I want to make it better.
I need to get rid of 2 out 4 lists which I create at the begging.
answered Nov 23 at 11:56
JuniorPythonNewbie
504
504
add a comment |
add a comment |
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208158%2fsearch-for-ways-in-openstreetmap-extract%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Welcome on code review. When you ask to be reviewed from international people, try to give english code. You have more chance to get useful response if code have meaning for reviewers.
– Calak
Nov 21 at 15:06
Is that OSM XML? Can you avoid the overhead of XML by using a PBF version of your OSM dump instead? (C++ certainly has good libraries for reading OSM PBF; I'd be very surprised if Python doesn't).
– Toby Speight
Nov 21 at 15:28
2
Yes that is OSM XML. Im gonna dive into that pbf files then. Thanks! edit: I think this is my solution. imposm.org/docs/imposm.parser/latest
– JuniorPythonNewbie
Nov 21 at 15:31
2
You probably want to use a stream parser, instead of loading the entire document into memory and then parsing it. See this answer on Stack Overflow.
– AJNeufeld
Nov 21 at 15:50
1
k
andv
are "attributes", not "children".– Reinderien
Nov 21 at 16:55