Counting SQL GUIDs from a server log and printing the stats, improved
This is a continuation of my original question, after the improvements suggested by @Reinderien and some of my own.
The approach I've taken is fairly obvious, and I'm not using any parallel processing. I think there's scope for improvement, because I know of a Rust crate, Rayon, that could have run the steps I'm currently running in parallel. I'll explain below why I think this is possible.
"""
Find the number of 'exceptions' and 'added' event's in the exception log
with respect to the device ID.
author: clmno
date: 2018-12-23
updated: 2018-12-27
"""
from time import time
import re
def timer(fn):
""" Used to time a function's execution"""
def f(*args, **kwargs):
before = time()
rv = fn(*args, **kwargs)
after = time()
print("elapsed", after - before)
return rv
return f
#compile the regex globally
re_prefix = '.*?'
re_guid='([A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12})'
rg = re.compile(re_prefix+re_guid, re.IGNORECASE|re.DOTALL)
def find_sql_guid(txt):
""" From the passed in txt, find the SQL guid using re"""
m = rg.search(txt)
if m:
guid1 = m.group(1)
else:
print("ERROR: No SQL guid in line. Check the code")
exit(-1)
return guid1
@timer
def find_device_IDs(file_obj, element):
""" Find the element (type: str) within the file (file path is
provide as arg). Then find the SQL guid from the line at hand.
(Each line has a SQL guid)
Return a dict of {element: [<list of SQL guids>]}
"""
lines = set()
for line in file_obj:
if element in line:
#find the sql-guid from the line-str & append
lines.add(find_sql_guid(line))
file_obj.seek(0)
return lines
@timer
def find_num_occurences(file_obj, key, search_val, unique_values):
""" Find and append SQL guids that are in a line that contains a string
that's in search_val into 'exception' and 'added'
Return a dict of {'exception':set(<set of SQL guids>),
'added': set(<set of SQL guids>)}
"""
lines = {'exception':set(), 'added': set()}
for line in file_obj:
for value in unique_values:
if value in line:
if search_val[0] in line:
lines['exception'].add(value)
elif search_val[1] in line:
lines['added'].add(value)
file_obj.seek(0)
return lines
def print_stats(num_exceptions_dict):
for key in num_exceptions_dict.keys():
print("{} added ".format(key) +
str(len(list(num_exceptions_dict[key]["added"]))))
print("{} exceptions ".format(key) +
str(len(list(num_exceptions_dict[key]["exception"]))))
if __name__ == "__main__":
path = 'log/server.log'
search_list = ('3BAA5C42', '3BAA5B84', '3BAA5C57', '3BAA5B67')
with open(path) as file_obj:
#find every occurance of device ID and find their corresponding SQL
# guids (unique ID)
unique_ids_dict = {
element: find_device_IDs(file_obj, element)
for element in search_list
}
#Now for each unique ID find if string ["Exception occurred",
# "Packet record has been added"] is found in it's SQL guid list.
search_with_in_deviceID = ("Exception occurred",
"Packet record has been added")
#reset the file pointer
file_obj.seek(0)
num_exceptions_dict = {
elem: find_num_occurences(file_obj, elem, search_with_in_deviceID,
unique_ids_dict[elem])
for elem in search_list
}
print_stats(num_exceptions_dict)
And here's a small server log for you to experiment on.
Improvements
- More pythonic, with some help from Reinderien.
- Opening the file only once. Can't see a significant change in execution speed.
- Using a better data-structure model. Was using dicts everywhere; sets made more sense.
My current approach is to:
- Find the device IDs (e.g. 3BAA5C42) and their corresponding SQL GUIDs.
- For each SQL GUID, find whether it resulted in an exception or added event. Store them in a dict.
- Print the stats.
Parallelize
Steps one and two just go through the file searching for a particular string and perform a set of instructions once the string is found. Each of those pieces of work (both within steps one and two, and steps one and two as a whole) is independent of the others, so running them in parallel makes sense.
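Here's a rough, untested sketch of what I mean for step one, using the standard library's multiprocessing.Pool (the worker name guids_for_device and the module-level PATH constant are just mine for illustration):

from multiprocessing import Pool
import re

rg = re.compile(r'([A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}'
                r'-[A-Z0-9]{4}-[A-Z0-9]{12})', re.IGNORECASE)
PATH = 'log/server.log'


def guids_for_device(element):
    # Each worker opens its own handle: open file objects cannot be
    # pickled and shared between processes.
    guids = set()
    with open(PATH) as file_obj:
        for line in file_obj:
            if element in line:
                m = rg.search(line)
                if m:
                    guids.add(m.group(1))
    return guids


if __name__ == "__main__":
    search_list = ('3BAA5C42', '3BAA5B84', '3BAA5C57', '3BAA5B67')
    with Pool(len(search_list)) as pool:
        results = pool.map(guids_for_device, search_list)
    unique_ids_dict = dict(zip(search_list, results))
    print(unique_ids_dict)

Threads alone wouldn't help here, since the scanning is CPU-bound and the GIL serialises it; separate processes sidestep that, though each one still re-reads the whole file.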
How should I go about improving this code?
And I've read of Donald Knuth's famous string-search algorithm, Knuth-Morris-Pratt. What I'm doing is a string search. Just wondering. – clmno, Dec 27 at 6:36
Your dictionary comprehension to find the unique IDs does not work. After find_device_IDs has run for the first time, you have reached the end of the file, so all except one ID are empty sets. A file_obj.seek(0) at the end of the function would help for now. – Graipher, Dec 27 at 10:03
The signature of that function should also be def find_device_IDs(file_obj, element):, otherwise it just accesses the global variable file_obj. – Graipher, Dec 27 at 10:08
@Graipher Fixed that in the code and tried running again. Sadly, there isn't a significant improvement over my previous version (it's 45 s now; it used to be 47 s). – clmno, Dec 27 at 10:22
Will write an answer to see if you can get away with a single pass over the file. – Graipher, Dec 27 at 10:25
1 Answer
As evidenced by your timings, a major bottleneck of your code is having to read the file multiple times. At least it is now only opened once, but the actual content is still read eight times (once for each unique ID, so four times in your example, and then once more per ID for the exception/added search).
First, let's reduce this to two passes, once for the IDs and once for the exceptions/added events:
from collections import defaultdict


@timer
def find_device_IDs(file_obj, search_list):
    """Find each element of search_list (type: str) within the file
    object, then extract the SQL guid from each matching line.
    (Each line has a SQL guid.)
    Return a dict of {element: set(<SQL guids>)}
    """
    sql_guids = defaultdict(set)
    for line in file_obj:
        for element in search_list:
            if element in line:
                # Find the sql-guid in the line and add it
                sql_guids[element].add(find_sql_guid(line))
    return sql_guids
The exception/added finding function is a bit more complicated. Here we first need to invert the dictionary:
device_ids = {sql_guid: device_id
              for device_id, values in unique_ids_dict.items()
              for sql_guid in values}
# {'0af229d1-283e-4575-a818-901617a762a7': '3BAA5C57',
#  '2f4a7f93-d7ed-4514-bef0-9bb0f025ecd3': '3BAA5C42',
#  '4e720c6e-1866-4c9b-b967-dfab049266fb': '3BAA5B67',
#  '85708e5d-768d-4a90-ab71-60a737de96e3': '3BAA5B67',
#  'e268b224-bfb7-40c7-8ae5-500eaecb292b': '3BAA5B84',
#  'e4ced298-530c-41cc-98a7-42a2e4fe5987': '3BAA5B67'}
Then we can use that:
@timer
def find_num_occurences(file_obj, sql_guids, search_vals):
    device_ids = {sql_guid: device_id
                  for device_id, values in sql_guids.items()
                  for sql_guid in values}
    data = defaultdict(lambda: defaultdict(set))
    for line in file_obj:
        for sql_guid, device_id in device_ids.items():
            if sql_guid in line:
                for key, search_val in search_vals.items():
                    if search_val in line:
                        data[device_id][key].add(sql_guid)
    return data
The usage is almost the same as your code:
with open(path) as file_obj:
    device_ids = ('3BAA5C42', '3BAA5B84', '3BAA5C57', '3BAA5B67')
    sql_guids = find_device_IDs(file_obj, device_ids)
    file_obj.seek(0)
    search_with_in_deviceID = {"exception": "Exception occurred",
                               "added": "Packet record has been added"}
    print(find_num_occurences(file_obj, sql_guids, search_with_in_deviceID))
# defaultdict(<function __main__.find_num_occurences.<locals>.<lambda>>,
#             {'3BAA5B67': defaultdict(set,
#                  {'added': {'4e720c6e-1866-4c9b-b967-dfab049266fb'},
#                   'exception': {'85708e5d-768d-4a90-ab71-60a737de96e3',
#                                 'e4ced298-530c-41cc-98a7-42a2e4fe5987'}}),
#              '3BAA5B84': defaultdict(set,
#                  {'added': {'e268b224-bfb7-40c7-8ae5-500eaecb292b'}}),
#              '3BAA5C42': defaultdict(set,
#                  {'added': {'2f4a7f93-d7ed-4514-bef0-9bb0f025ecd3'}}),
#              '3BAA5C57': defaultdict(set,
#                  {'added': {'0af229d1-283e-4575-a818-901617a762a7'}})})
You can actually get this down to a single pass, by collecting the events for all SQL guids and only joining them at the end with the device IDs you are actually searching for:
def get_data(file_obj, device_ids, search_vals):
    sql_guid_to_device_id = {}
    data = defaultdict(set)
    for line in file_obj:
        # Search for an sql_guid
        m = rg.search(line)
        if m:
            sql_guid = m.group(1)
            # Add to the guid -> device ID mapping
            for device_id in device_ids:
                if device_id in line:
                    sql_guid_to_device_id[sql_guid] = device_id
            # Add to exceptions/added
            for key, search_val in search_vals.items():
                if search_val in line:
                    data[sql_guid].add(key)
    return sql_guid_to_device_id, data


def merge(sql_guid_to_device_id, data):
    data2 = defaultdict(lambda: defaultdict(set))
    for sql_guid, values in data.items():
        if sql_guid in sql_guid_to_device_id:
            for key in values:
                data2[sql_guid_to_device_id[sql_guid]][key].add(sql_guid)
    return data2
With the following usage:
with open(path) as file_obj:
    device_ids = ('3BAA5C42', '3BAA5B84', '3BAA5C57', '3BAA5B67')
    search_with_in_deviceID = {"exception": "Exception occurred",
                               "added": "Packet record has been added"}
    sql_guid_to_device_id, data = get_data(file_obj, device_ids,
                                           search_with_in_deviceID)
    data2 = merge(sql_guid_to_device_id, data)
    for device_id, values in data2.items():
        for key, sql_guids in values.items():
            print(f"{device_id} {key} {len(sql_guids)}")
# 3BAA5B67 exception 2
# 3BAA5B67 added 1
# 3BAA5C42 added 1
# 3BAA5B84 added 1
# 3BAA5C57 added 1
get_data, data and data2 still need better names...
Other than that, this should be faster because it reads the file only once. It does consume more memory, though, because it also saves exception/added events for SQL guids which you later don't need. If this trade-off is not worth it, go back to the first half of this answer.
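If the extra memory is a concern, one possible compromise is to filter during the pass instead of merging at the end. This is an untested sketch, and it only works under the assumption that the line mapping a SQL guid to a device ID always appears in the log before that guid's exception/added lines (it reuses rg and defaultdict from above):

def get_data_filtered(file_obj, device_ids, search_vals):
    """Single pass that only keeps events for guids already mapped
    to one of the searched device IDs."""
    sql_guid_to_device_id = {}
    data = defaultdict(lambda: defaultdict(set))
    for line in file_obj:
        m = rg.search(line)
        if not m:
            continue
        sql_guid = m.group(1)
        # Record the guid -> device ID mapping as soon as we see it
        for device_id in device_ids:
            if device_id in line:
                sql_guid_to_device_id[sql_guid] = device_id
        device_id = sql_guid_to_device_id.get(sql_guid)
        if device_id is None:
            # Assumes the mapping line came first; otherwise events
            # for this guid would be silently dropped here.
            continue
        for key, search_val in search_vals.items():
            if search_val in line:
                data[device_id][key].add(sql_guid)
    return data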
The one pass solution was pretty awesome! It has brought it down to 6 seconds! TIL: the fewer the IO operations, the faster your code. Coming from hardware/fw, we always use IOs and rarely have enough RAM. ty – clmno, Dec 27 at 12:18