Aggregate Pandas Columns on Geospatial Distance
$begingroup$
I have a dataframe with three columns: latitude, longitude and median_income. For each observation, I need to compute the average median income of all points within x km of that point and store it in a fourth column.
I wrote three functions and call them through apply, hoping that would be fast. However, the dataframe takes hours to process. I haven't seen an error yet, so it appears to be working correctly.
I found the haversine implementation on this site. I use it to calculate the distance between two points given as lat/lon.
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    # Calculate the great circle distance between two points
    # on the earth (specified in decimal degrees)
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r
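For comparison, here is a minimal sketch of the same formula written with NumPy so that it accepts whole arrays as well as scalars. The name haversine_np and the sample coordinates are illustrative assumptions, not part of my project code.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same radius as the scalar version above

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs are decimal degrees, scalars or arrays."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against rounding pushing `a` slightly outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Distance from one point to several points in a single call (made-up sample data)
lons = np.array([-122.23, -122.22, -118.24])
lats = np.array([37.88, 37.86, 34.05])
dists = haversine_np(-122.23, 37.88, lons, lats)
```

Because the arithmetic is applied element-wise, one call produces the distance from a point to every row, which avoids calling the scalar function once per row.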
My hav_checker function computes the haversine distance between one row and a given point. Applied across the whole frame, it yields a column of distances from that point to every row.
def hav_checker(row, lon, lat):
    hav = haversine(row['longitude'], row['latitude'], lon, lat)
    return hav
My value_grabber function uses the distances computed by hav_checker to return the mean of my target column (median_income) over the rows that fall within the threshold.
For reference, I am using the California housing dataset to build this out.
def value_grabber(row, frame, threshold, target_col):
    frame = frame.copy()
    frame['hav'] = frame.apply(hav_checker, lon=row['longitude'], lat=row['latitude'], axis=1)
    mean_tar = frame.loc[frame.loc[:, 'hav'] <= threshold, target_col].mean()
    return mean_tar
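The inner frame.apply in value_grabber runs a Python-level loop over every row for each outer row, and the frame is copied each time. A rough sketch of the same per-point calculation with array operations might look like the following; it assumes the coordinates and the target column are available as NumPy arrays, and the names haversine_np and mean_within are hypothetical.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km for scalars or arrays given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def mean_within(lon0, lat0, lons, lats, values, threshold_km):
    """Mean of `values` over all points within `threshold_km` of (lon0, lat0)."""
    dist = haversine_np(lon0, lat0, lons, lats)  # one distance per point
    return values[dist <= threshold_km].mean()   # boolean mask selects neighbours

# Made-up sample data: two nearby points and one far away
lons = np.array([-122.23, -122.22, -118.24])
lats = np.array([37.88, 37.86, 34.05])
income = np.array([8.0, 6.0, 2.0])
within_3km = mean_within(lons[0], lats[0], lons, lats, income, 3.0)
within_1km = mean_within(lons[0], lats[0], lons, lats, income, 1.0)
```

The point itself is always at distance 0, so the mask is never empty when the query point is one of the rows.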
I am trying to add these three columns to my original dataframe for a feature engineering project within a larger class project.
df['MedianIncomeWithin3KM'] = df.apply(value_grabber, frame=df, threshold=3, target_col='median_income', axis=1)
df['MedianIncomeWithin1KM'] = df.apply(value_grabber, frame=df, threshold=1, target_col='median_income', axis=1)
df['MedianIncomeWithinHalfKM'] = df.apply(value_grabber, frame=df, threshold=.5, target_col='median_income', axis=1)
I have been able to do this successfully with explicit loops, but that is extremely time-intensive and I need a faster solution.
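The three apply calls above recompute every pairwise distance once per threshold. One possible direction, sketched here under assumptions, is to compute the distances once in row chunks with NumPy broadcasting and derive all thresholds from the same distance block. The function name neighborhood_means, the chunk size, and the sample data are illustrative, not taken from my project.

```python
import numpy as np

def neighborhood_means(lons, lats, values, thresholds_km, chunk=256):
    """For each point, mean of `values` over the points within each threshold (km).

    Returns an array of shape (n_points, n_thresholds).
    """
    lon = np.radians(np.asarray(lons, dtype=float))
    lat = np.radians(np.asarray(lats, dtype=float))
    values = np.asarray(values, dtype=float)
    out = np.empty((lon.size, len(thresholds_km)))
    for start in range(0, lon.size, chunk):
        stop = start + chunk  # slicing past the end of the array is safe
        # Broadcast a (rows, 1) block against all n points -> (rows, n) distances
        dlon = lon[None, :] - lon[start:stop, None]
        dlat = lat[None, :] - lat[start:stop, None]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
        dist = 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        for j, t in enumerate(thresholds_km):
            mask = dist <= t
            # Each point is within 0 km of itself, so the counts are never zero
            out[start:stop, j] = np.where(mask, values, 0.0).sum(axis=1) / mask.sum(axis=1)
    return out

# Made-up sample data: two nearby points and one far away
lons = np.array([-122.23, -122.22, -118.24])
lats = np.array([37.88, 37.86, 34.05])
income = np.array([8.0, 6.0, 2.0])
means = neighborhood_means(lons, lats, income, thresholds_km=(3, 1, 0.5))
```

Each column of the result could then be assigned to the dataframe one at a time, for example df['MedianIncomeWithin3KM'] = means[:, 0]. Chunking keeps the temporary distance block at chunk × n values instead of n × n; the work is still quadratic in the number of rows, so this only removes the Python-level overhead.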
python numpy geospatial
New contributor
$endgroup$
asked 1 hour ago
krewsayder