Groupby class and count missing values in features












14














I have a problem and I cannot find any solution in the web or documentation, even if I think that it is very trivial.



What do I want to do?



I have a dataframe like this



CLASS FEATURE1 FEATURE2 FEATURE3
X A NaN NaN
X NaN A NaN
B A A A


I want to group by the label(CLASS) and display the number of NaN-Values that are counted in every feature so that it looks like this. The purpose of this is to get a general idea how missing values are distributed over the different classes.



CLASS FEATURE1 FEATURE2 FEATURE3
X 1 1 2
B 0 0 0


I know how to recieve the amount of nonnull-Values - df.groupby['CLASS'].count()



Is there something similar for the NaN-Values?



I tried to subtract the count() from the size() but it returned an unformatted output filled with the value NaN










share|improve this question









New contributor




FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

























    14














    I have a problem and I cannot find any solution in the web or documentation, even if I think that it is very trivial.



    What do I want to do?



    I have a dataframe like this



    CLASS FEATURE1 FEATURE2 FEATURE3
    X A NaN NaN
    X NaN A NaN
    B A A A


    I want to group by the label(CLASS) and display the number of NaN-Values that are counted in every feature so that it looks like this. The purpose of this is to get a general idea how missing values are distributed over the different classes.



    CLASS FEATURE1 FEATURE2 FEATURE3
    X 1 1 2
    B 0 0 0


    I know how to recieve the amount of nonnull-Values - df.groupby['CLASS'].count()



    Is there something similar for the NaN-Values?



    I tried to subtract the count() from the size() but it returned an unformatted output filled with the value NaN










    share|improve this question









    New contributor




    FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.























      14












      14








      14







      I have a problem and I cannot find any solution in the web or documentation, even if I think that it is very trivial.



      What do I want to do?



      I have a dataframe like this



      CLASS FEATURE1 FEATURE2 FEATURE3
      X A NaN NaN
      X NaN A NaN
      B A A A


      I want to group by the label(CLASS) and display the number of NaN-Values that are counted in every feature so that it looks like this. The purpose of this is to get a general idea how missing values are distributed over the different classes.



      CLASS FEATURE1 FEATURE2 FEATURE3
      X 1 1 2
      B 0 0 0


      I know how to recieve the amount of nonnull-Values - df.groupby['CLASS'].count()



      Is there something similar for the NaN-Values?



      I tried to subtract the count() from the size() but it returned an unformatted output filled with the value NaN










      share|improve this question









      New contributor




      FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      I have a problem and I cannot find any solution in the web or documentation, even if I think that it is very trivial.



      What do I want to do?



      I have a dataframe like this



      CLASS FEATURE1 FEATURE2 FEATURE3
      X A NaN NaN
      X NaN A NaN
      B A A A


      I want to group by the label(CLASS) and display the number of NaN-Values that are counted in every feature so that it looks like this. The purpose of this is to get a general idea how missing values are distributed over the different classes.



      CLASS FEATURE1 FEATURE2 FEATURE3
      X 1 1 2
      B 0 0 0


      I know how to recieve the amount of nonnull-Values - df.groupby['CLASS'].count()



      Is there something similar for the NaN-Values?



      I tried to subtract the count() from the size() but it returned an unformatted output filled with the value NaN







      python pandas dataframe group-by nan






      share|improve this question









      New contributor




      FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      share|improve this question









      New contributor




      FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      share|improve this question




      share|improve this question








      edited Dec 27 '18 at 16:11









      coldspeed

      120k19119194




      120k19119194






      New contributor




      FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      asked Dec 27 '18 at 15:15









      FelTry2

      955




      955




      New contributor




      FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.





      New contributor





      FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






      FelTry2 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.
























          3 Answers
          3






          active

          oldest

          votes


















          14














          Compute a mask with isna, then group and find the sum:



          df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

          CLASS FEATURE1 FEATURE2 FEATURE3
          0 X 1.0 1.0 2.0
          1 B 0.0 0.0 0.0




          Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



          df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


          Or,



          g = df.groupby('CLASS')
          g.count().rsub(g.size(), axis=0)




                 FEATURE1  FEATURE2  FEATURE3
          CLASS
          B 0 0 0
          X 1 1 2




          There are quite a few good answers, so here are some timeits for your perusal:



          df_ = df
          df = pd.concat([df_] * 10000)

          %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
          %timeit df.set_index('CLASS').isna().sum(level=0)
          %%timeit
          g = df.groupby('CLASS')
          g.count().rsub(g.size(), axis=0)

          11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
          9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
          6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


          Actual performance depends on your data and setup, so your mileage may vary.






          share|improve this answer



















          • 2




            Amazing - thank you very, very much!
            – FelTry2
            Dec 27 '18 at 15:37



















          12














          You can use set_index and sum:



          df.set_index('CLASS').isna().sum(level=0)


          Output:



                 FEATURE1  FEATURE2  FEATURE3
          CLASS
          X 1.0 1.0 2.0
          B 0.0 0.0 0.0





          share|improve this answer





























            7














            Using the diff between count and size



            g=df.groupby('CLASS')

            -g.count().sub(g.size(),0)

            FEATURE1 FEATURE2 FEATURE3
            CLASS
            B 0 0 0
            X 1 1 2


            And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



            pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
            Out[468]:
            FEATURE1 FEATURE2 FEATURE3
            B 0 0 0
            X 1 1 2





            share|improve this answer



















            • 2




              That's a good one. +1
              – Scott Boston
              Dec 27 '18 at 15:26











            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });






            FelTry2 is a new contributor. Be nice, and check out our Code of Conduct.










            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53947196%2fgroupby-class-and-count-missing-values-in-features%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            3 Answers
            3






            active

            oldest

            votes








            3 Answers
            3






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            14














            Compute a mask with isna, then group and find the sum:



            df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

            CLASS FEATURE1 FEATURE2 FEATURE3
            0 X 1.0 1.0 2.0
            1 B 0.0 0.0 0.0




            Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



            df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


            Or,



            g = df.groupby('CLASS')
            g.count().rsub(g.size(), axis=0)




                   FEATURE1  FEATURE2  FEATURE3
            CLASS
            B 0 0 0
            X 1 1 2




            There are quite a few good answers, so here are some timeits for your perusal:



            df_ = df
            df = pd.concat([df_] * 10000)

            %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
            %timeit df.set_index('CLASS').isna().sum(level=0)
            %%timeit
            g = df.groupby('CLASS')
            g.count().rsub(g.size(), axis=0)

            11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


            Actual performance depends on your data and setup, so your mileage may vary.






            share|improve this answer



















            • 2




              Amazing - thank you very, very much!
              – FelTry2
              Dec 27 '18 at 15:37
















            14














            Compute a mask with isna, then group and find the sum:



            df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

            CLASS FEATURE1 FEATURE2 FEATURE3
            0 X 1.0 1.0 2.0
            1 B 0.0 0.0 0.0




            Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



            df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


            Or,



            g = df.groupby('CLASS')
            g.count().rsub(g.size(), axis=0)




                   FEATURE1  FEATURE2  FEATURE3
            CLASS
            B 0 0 0
            X 1 1 2




            There are quite a few good answers, so here are some timeits for your perusal:



            df_ = df
            df = pd.concat([df_] * 10000)

            %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
            %timeit df.set_index('CLASS').isna().sum(level=0)
            %%timeit
            g = df.groupby('CLASS')
            g.count().rsub(g.size(), axis=0)

            11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


            Actual performance depends on your data and setup, so your mileage may vary.






            share|improve this answer



















            • 2




              Amazing - thank you very, very much!
              – FelTry2
              Dec 27 '18 at 15:37














            14












            14








            14






            Compute a mask with isna, then group and find the sum:



            df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

            CLASS FEATURE1 FEATURE2 FEATURE3
            0 X 1.0 1.0 2.0
            1 B 0.0 0.0 0.0




            Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



            df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


            Or,



            g = df.groupby('CLASS')
            g.count().rsub(g.size(), axis=0)




                   FEATURE1  FEATURE2  FEATURE3
            CLASS
            B 0 0 0
            X 1 1 2




            There are quite a few good answers, so here are some timeits for your perusal:



            df_ = df
            df = pd.concat([df_] * 10000)

            %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
            %timeit df.set_index('CLASS').isna().sum(level=0)
            %%timeit
            g = df.groupby('CLASS')
            g.count().rsub(g.size(), axis=0)

            11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


            Actual performance depends on your data and setup, so your mileage may vary.






            share|improve this answer














            Compute a mask with isna, then group and find the sum:



            df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

            CLASS FEATURE1 FEATURE2 FEATURE3
            0 X 1.0 1.0 2.0
            1 B 0.0 0.0 0.0




            Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



            df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


            Or,



            g = df.groupby('CLASS')
            g.count().rsub(g.size(), axis=0)




                   FEATURE1  FEATURE2  FEATURE3
            CLASS
            B 0 0 0
            X 1 1 2




            There are quite a few good answers, so here are some timeits for your perusal:



            df_ = df
            df = pd.concat([df_] * 10000)

            %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
            %timeit df.set_index('CLASS').isna().sum(level=0)
            %%timeit
            g = df.groupby('CLASS')
            g.count().rsub(g.size(), axis=0)

            11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


            Actual performance depends on your data and setup, so your mileage may vary.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Dec 27 '18 at 15:29

























            answered Dec 27 '18 at 15:16









            coldspeed

            120k19119194




            120k19119194








            • 2




              Amazing - thank you very, very much!
              – FelTry2
              Dec 27 '18 at 15:37














            • 2




              Amazing - thank you very, very much!
              – FelTry2
              Dec 27 '18 at 15:37








            2




            2




            Amazing - thank you very, very much!
            – FelTry2
            Dec 27 '18 at 15:37




            Amazing - thank you very, very much!
            – FelTry2
            Dec 27 '18 at 15:37













            12














            You can use set_index and sum:



            df.set_index('CLASS').isna().sum(level=0)


            Output:



                   FEATURE1  FEATURE2  FEATURE3
            CLASS
            X 1.0 1.0 2.0
            B 0.0 0.0 0.0





            share|improve this answer


























              12














              You can use set_index and sum:



              df.set_index('CLASS').isna().sum(level=0)


              Output:



                     FEATURE1  FEATURE2  FEATURE3
              CLASS
              X 1.0 1.0 2.0
              B 0.0 0.0 0.0





              share|improve this answer
























                12












                12








                12






                You can use set_index and sum:



                df.set_index('CLASS').isna().sum(level=0)


                Output:



                       FEATURE1  FEATURE2  FEATURE3
                CLASS
                X 1.0 1.0 2.0
                B 0.0 0.0 0.0





                share|improve this answer












                You can use set_index and sum:



                df.set_index('CLASS').isna().sum(level=0)


                Output:



                       FEATURE1  FEATURE2  FEATURE3
                CLASS
                X 1.0 1.0 2.0
                B 0.0 0.0 0.0






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Dec 27 '18 at 15:18









                Scott Boston

                51.3k72955




                51.3k72955























                    7














                    Using the diff between count and size



                    g=df.groupby('CLASS')

                    -g.count().sub(g.size(),0)

                    FEATURE1 FEATURE2 FEATURE3
                    CLASS
                    B 0 0 0
                    X 1 1 2


                    And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



                    pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
                    Out[468]:
                    FEATURE1 FEATURE2 FEATURE3
                    B 0 0 0
                    X 1 1 2





                    share|improve this answer



















                    • 2




                      That's a good one. +1
                      – Scott Boston
                      Dec 27 '18 at 15:26
















                    7














                    Using the diff between count and size



                    g=df.groupby('CLASS')

                    -g.count().sub(g.size(),0)

                    FEATURE1 FEATURE2 FEATURE3
                    CLASS
                    B 0 0 0
                    X 1 1 2


                    And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



                    pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
                    Out[468]:
                    FEATURE1 FEATURE2 FEATURE3
                    B 0 0 0
                    X 1 1 2





                    share|improve this answer



















                    • 2




                      That's a good one. +1
                      – Scott Boston
                      Dec 27 '18 at 15:26














                    7












                    7








                    7






                    Using the diff between count and size



                    g=df.groupby('CLASS')

                    -g.count().sub(g.size(),0)

                    FEATURE1 FEATURE2 FEATURE3
                    CLASS
                    B 0 0 0
                    X 1 1 2


                    And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



                    pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
                    Out[468]:
                    FEATURE1 FEATURE2 FEATURE3
                    B 0 0 0
                    X 1 1 2





                    share|improve this answer














                    Using the diff between count and size



                    g=df.groupby('CLASS')

                    -g.count().sub(g.size(),0)

                    FEATURE1 FEATURE2 FEATURE3
                    CLASS
                    B 0 0 0
                    X 1 1 2


                    And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



                    pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
                    Out[468]:
                    FEATURE1 FEATURE2 FEATURE3
                    B 0 0 0
                    X 1 1 2






                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited Dec 27 '18 at 16:25

























                    answered Dec 27 '18 at 15:19









                    W-B

                    101k73163




                    101k73163








                    • 2




                      That's a good one. +1
                      – Scott Boston
                      Dec 27 '18 at 15:26














                    • 2




                      That's a good one. +1
                      – Scott Boston
                      Dec 27 '18 at 15:26








                    2




                    2




                    That's a good one. +1
                    – Scott Boston
                    Dec 27 '18 at 15:26




                    That's a good one. +1
                    – Scott Boston
                    Dec 27 '18 at 15:26










                    FelTry2 is a new contributor. Be nice, and check out our Code of Conduct.










                    draft saved

                    draft discarded


















                    FelTry2 is a new contributor. Be nice, and check out our Code of Conduct.













                    FelTry2 is a new contributor. Be nice, and check out our Code of Conduct.












                    FelTry2 is a new contributor. Be nice, and check out our Code of Conduct.
















                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.





                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                    Please pay close attention to the following guidance:


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53947196%2fgroupby-class-and-count-missing-values-in-features%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Сан-Квентин

                    8-я гвардейская общевойсковая армия

                    Алькесар