Linear Regression on Pandas
























$begingroup$


I'm working on a simple statistics problem with Pandas and sklearn. I'm aware that my code is ugly, but how can I improve it?



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sphist.csv")
df["Date"] = pd.to_datetime(df["Date"])
df.sort_values(["Date"], inplace=True)
df["day_5"] = np.nan
df["day_30"] = np.nan
df["std_5"] = np.nan


for i in range(30, len(df)):
    last_5 = df.iloc[i-5:i, 4]
    last_30 = df.iloc[i-30:i, 4]
    df.iloc[i, -3] = last_5.mean()
    df.iloc[i, -2] = last_30.mean()
    df.iloc[i, -1] = last_5.std()

df = df.iloc[30:]
df.dropna(axis=0, inplace=True)

train = df[df["Date"] < datetime(2013, 1, 1)]
test = df[df["Date"] >= datetime(2013, 1, 1)]
# print(train.head(), test.head())

X_cols = ["day_5", "day_30", "std_5"]
y_col = "Close"

lr = LinearRegression()
lr.fit(train[X_cols], train[y_col])
yhat = lr.predict(test[X_cols])
mse = mean_squared_error(yhat, test[y_col])
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE, not MSE divided by the sample count
score = lr.score(test[X_cols], test[y_col])

print(rmse, score)

plt.scatter(yhat, test[y_col], c="k", s=1)
plt.plot([.95*yhat.min(), 1.05*yhat.max()], [.95*yhat.min(), 1.05*yhat.max()], c="r")
plt.show()



  1. It relies on hard-coded iloc indices, which are hard to read and maintain. How can I change it to use column names/row names?

  2. The code looks messy. Any advice on improving it?
















New contributor




BurgerBurglar is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$
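For the first question above (replacing the hard-coded iloc indices), one possible direction is to build the features from the "Close" column by name with pandas rolling windows. This is only a minimal sketch, not a definitive rewrite: it assumes that column position 4 in sphist.csv is the "Close" column, the helper name add_rolling_features is hypothetical, and the small synthetic frame stands in for the real file.

```python
import numpy as np
import pandas as pd


def add_rolling_features(df, price_col="Close"):
    """Return a sorted copy of df with features built from previous rows only."""
    out = df.sort_values("Date").reset_index(drop=True)
    # shift(1) excludes the current row, mirroring df.iloc[i-5:i] in the loop
    previous = out[price_col].shift(1)
    out["day_5"] = previous.rolling(5).mean()
    out["day_30"] = previous.rolling(30).mean()
    out["std_5"] = previous.rolling(5).std()
    # Rows without a full 30-row history contain NaN and are dropped
    return out.dropna(subset=["day_5", "day_30", "std_5"])


# Synthetic stand-in for sphist.csv (made-up values, for illustration only)
demo = pd.DataFrame({
    "Date": pd.date_range("2012-01-02", periods=40, freq="D"),
    "Close": np.arange(40, dtype=float),
})

features = add_rolling_features(demo)

# Split by date using the column name instead of positional indexing
cutoff = pd.Timestamp("2012-02-05")
train = features[features["Date"] < cutoff]
test = features[features["Date"] >= cutoff]
```

With the real data, the same helper could be applied to the frame returned by pd.read_csv("sphist.csv") after converting "Date" with pd.to_datetime, and the cutoff would be pd.Timestamp("2013-01-01"). Because the features are assigned by column name, the result does not depend on the column order in the file.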

















      python pandas














      asked 8 mins ago









      BurgerBurglar
