Aggregate Pandas Columns on Geospatial Distance





$begingroup$


I have a dataframe with three columns: latitude, longitude and median_income. For each observation, I need the average median income of all points within x km of that point, stored in a fourth column.



I wrote three functions and call them through apply, hoping this would be fast. However, the dataframes take hours to process. I haven't seen an error yet, so the code appears to be working correctly.



I found the haversine formula on this site. I am using it to calculate the distance between two points from their lat/lon coordinates.



from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    # Calculate the great circle distance between two points
    # on the earth (specified in decimal degrees)

    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r
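A quick sanity check of the formula, as a minimal sketch: two points on the same meridian, one degree of latitude apart, should be r * pi / 180 km apart (about 111.19 km for r = 6371). The coordinates and the names `one_degree_km` and `expected_km` are illustrative only; the function is repeated so the block runs on its own.

```python
from math import radians, cos, sin, asin, sqrt, pi

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371

# Same longitude, latitudes one degree apart (illustrative coordinates).
one_degree_km = haversine(-122.0, 37.0, -122.0, 38.0)

# Arc length of one degree on a sphere of radius 6371 km.
expected_km = 6371 * pi / 180

# A point is at distance zero from itself.
zero_km = haversine(-122.0, 37.0, -122.0, 37.0)
```

Note the argument order: longitude comes before latitude in every call.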


My hav_checker function returns the haversine distance between a row and a given point. Applied to every row of the dataframe, it gives a column holding each row's distance to the current point.



def hav_checker(row, lon, lat):

    hav = haversine(row['longitude'], row['latitude'], lon, lat)

    return hav


My value_grabber function uses the distance column produced by hav_checker to return the mean of my target column (median_income) over all rows within the threshold.



For reference, I am using the California housing dataset to build this out.



def value_grabber(row, frame, threshold, target_col):

    frame = frame.copy()

    # distance from every row of the frame to the current row
    frame['hav'] = frame.apply(hav_checker, lon=row['longitude'], lat=row['latitude'], axis=1)

    # mean of the target column over the rows within the threshold (in km)
    mean_tar = frame.loc[frame.loc[:, 'hav'] <= threshold, target_col].mean()

    return mean_tar


I am trying to add these three columns to my original dataframe for a feature engineering project within a larger class project.



df['MedianIncomeWithin3KM'] = df.apply(value_grabber, frame=df, threshold=3, target_col='median_income', axis=1)

df['MedianIncomeWithin1KM'] = df.apply(value_grabber, frame=df, threshold=1, target_col='median_income', axis=1)

df['MedianIncomeWithinHalfKM'] = df.apply(value_grabber, frame=df, threshold=.5, target_col='median_income', axis=1)
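To make the behaviour concrete, here is a small sketch that runs the same three functions on a made-up three-row frame. The frame `toy` and its values are invented for illustration and are not taken from the California housing data. Two points sit about 1.1 km apart and the third is roughly 110 km away.

```python
from math import radians, cos, sin, asin, sqrt

import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371

def hav_checker(row, lon, lat):
    # Distance from this row to the point (lon, lat).
    return haversine(row['longitude'], row['latitude'], lon, lat)

def value_grabber(row, frame, threshold, target_col):
    frame = frame.copy()
    frame['hav'] = frame.apply(hav_checker, lon=row['longitude'],
                               lat=row['latitude'], axis=1)
    return frame.loc[frame['hav'] <= threshold, target_col].mean()

# Made-up data: rows 0 and 1 are ~1.1 km apart, row 2 is ~110 km away.
toy = pd.DataFrame({
    'longitude': [-122.0, -122.0, -122.0],
    'latitude': [37.0, 37.01, 38.0],
    'median_income': [2.0, 4.0, 10.0],
})

within_3km = toy.apply(value_grabber, frame=toy, threshold=3,
                       target_col='median_income', axis=1)
within_1km = toy.apply(value_grabber, frame=toy, threshold=1,
                       target_col='median_income', axis=1)

print(within_3km.tolist())  # [3.0, 3.0, 10.0]
print(within_1km.tolist())  # [2.0, 4.0, 10.0]
```

Each row is at distance 0 from itself, so its own value is always part of the mean. Every outer row also triggers a full inner apply over the frame, so the work grows with the square of the number of rows.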


I have been able to do this successfully with looping, but it is extremely time-intensive and I need a faster solution.
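One direction that might be faster, as a sketch under stated assumptions rather than a tested solution: compute the pairwise haversine distances with NumPy broadcasting instead of nested apply calls. The function `mean_within_radius` and its `chunk_size` parameter are hypothetical names, the sample points are made up, and the code assumes the target column has no missing values. It has not been benchmarked against the full dataset.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def mean_within_radius(lon_deg, lat_deg, values, threshold_km, chunk_size=1024):
    """Mean of `values` over all points within `threshold_km` of each point.

    Hypothetical helper: rows are processed in chunks so that each temporary
    distance array holds at most chunk_size x n entries.
    """
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    values = np.asarray(values, dtype=float)
    out = np.empty(len(values))

    for start in range(0, len(values), chunk_size):
        stop = start + chunk_size
        # Broadcast a (chunk, 1) column against the full (1, n) row.
        dlon = lon[None, :] - lon[start:stop, None]
        dlat = lat[None, :] - lat[start:stop, None]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat[None, :])
             * np.sin(dlon / 2) ** 2)
        # Clip guards against rounding pushing `a` slightly outside [0, 1].
        dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        mask = dist_km <= threshold_km
        # Every point is within distance 0 of itself, so no row count is zero.
        out[start:stop] = (mask * values[None, :]).sum(axis=1) / mask.sum(axis=1)
    return out

# Made-up points: the first two are ~1.1 km apart, the third is ~110 km away.
lon = [-122.0, -122.0, -122.0]
lat = [37.0, 37.01, 38.0]
income = [2.0, 4.0, 10.0]

within_3km = mean_within_radius(lon, lat, income, threshold_km=3)
within_1km = mean_within_radius(lon, lat, income, threshold_km=1)
```

With the dataframe from the question, the call would presumably look like `mean_within_radius(df['longitude'], df['latitude'], df['median_income'], 3)`. The number of distance computations is still quadratic in the number of rows; the sketch only moves that work from Python-level apply calls into array operations.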

















$endgroup$



















      python numpy geospatial






      asked 1 hour ago









      krewsayder



