Data Structures for Counting Duplicates and using std::vector::erase
up vote
4
down vote
favorite
Problem
Dupe detection for a vector of ints. I simply want a count of the unique input characters that occurred at least twice. The goal is to count a dupe only once and ignore that input character if another dupe of it is seen in the future. A test input could look something like this vector<int> test = { 4,5,9,6,9,9,6,3,4 };
Looking for Feedback on
Looking for basic feedback on the data structures I'm using and the possibility of using the vector erase method to iterate and take advantage of the space allocated to my numbers vector instead of using a map to not count dups more than once. Any C++ 11 or 17 features I can take advantage of here too?
int countDuplicates(vector<int> numbers) {
int dups = 0;
set<int> s;
map<int, int> m;
for (int n : numbers) {
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
}
return dups;
}
c++ vectors hash-map set
add a comment |
up vote
4
down vote
favorite
Problem
Dupe detection for a vector of ints. I simply want a count of the unique input characters that occurred at least twice. The goal is to count a dupe only once and ignore that input character if another dupe of it is seen in the future. A test input could look something like this vector<int> test = { 4,5,9,6,9,9,6,3,4 };
Looking for Feedback on
Looking for basic feedback on the data structures I'm using and the possibility of using the vector erase method to iterate and take advantage of the space allocated to my numbers vector instead of using a map to not count dups more than once. Any C++ 11 or 17 features I can take advantage of here too?
int countDuplicates(vector<int> numbers) {
int dups = 0;
set<int> s;
map<int, int> m;
for (int n : numbers) {
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
}
return dups;
}
c++ vectors hash-map set
Not really enough for a full answer, butm.insert(pair<int, int>(n, 0))
can be replaced with simplym.emplace(n, 0)
saving you from writing out the pair constructor.
– Kyle
Nov 29 at 8:04
add a comment |
up vote
4
down vote
favorite
up vote
4
down vote
favorite
Problem
Dupe detection for a vector of ints. I simply want a count of the unique input characters that occurred at least twice. The goal is to count a dupe only once and ignore that input character if another dupe of it is seen in the future. A test input could look something like this vector<int> test = { 4,5,9,6,9,9,6,3,4 };
Looking for Feedback on
Looking for basic feedback on the data structures I'm using and the possibility of using the vector erase method to iterate and take advantage of the space allocated to my numbers vector instead of using a map to not count dups more than once. Any C++ 11 or 17 features I can take advantage of here too?
int countDuplicates(vector<int> numbers) {
int dups = 0;
set<int> s;
map<int, int> m;
for (int n : numbers) {
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
}
return dups;
}
c++ vectors hash-map set
Problem
Dupe detection for a vector of ints. I simply want a count of the unique input characters that occurred at least twice. The goal is to count a dupe only once and ignore that input character if another dupe of it is seen in the future. A test input could look something like this vector<int> test = { 4,5,9,6,9,9,6,3,4 };
Looking for Feedback on
Looking for basic feedback on the data structures I'm using and the possibility of using the vector erase method to iterate and take advantage of the space allocated to my numbers vector instead of using a map to not count dups more than once. Any C++ 11 or 17 features I can take advantage of here too?
int countDuplicates(vector<int> numbers) {
int dups = 0;
set<int> s;
map<int, int> m;
for (int n : numbers) {
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
}
return dups;
}
c++ vectors hash-map set
c++ vectors hash-map set
edited Nov 29 at 0:05
asked Nov 28 at 21:05
greg
29017
29017
Not really enough for a full answer, butm.insert(pair<int, int>(n, 0))
can be replaced with simplym.emplace(n, 0)
saving you from writing out the pair constructor.
– Kyle
Nov 29 at 8:04
add a comment |
Not really enough for a full answer, butm.insert(pair<int, int>(n, 0))
can be replaced with simplym.emplace(n, 0)
saving you from writing out the pair constructor.
– Kyle
Nov 29 at 8:04
Not really enough for a full answer, but
m.insert(pair<int, int>(n, 0))
can be replaced with simply m.emplace(n, 0)
saving you from writing out the pair constructor.– Kyle
Nov 29 at 8:04
Not really enough for a full answer, but
m.insert(pair<int, int>(n, 0))
can be replaced with simply m.emplace(n, 0)
saving you from writing out the pair constructor.– Kyle
Nov 29 at 8:04
add a comment |
2 Answers
2
active
oldest
votes
up vote
7
down vote
accepted
Basic Algorithm
At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.
In that case, I think I'd do something like this:
int count_dupes(std::vector<int> const &inputs) {
std::map<int, int> counts;
for (auto i : inputs)
++counts[i];
return std::count_if(counts.begin(), counts.end(),
(auto const &p) { return p.second >= 2; });
}
I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int
(and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.
Parameter Passing
Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.
Logical Comparisons
Comparing a Boolean value to true
or false
is generally a poor idea. if (x==true)
is equivalent to if (x)
and if (x == false)
is equivalent to if (!x)
. Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true
or false
. For example, s.insert(n).second == false
wold be better written as: if (!s.insert(n).second)
.
Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second)
. I've written C and C++ long enough that I have no difficulty with reading !
as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.
Formatting/Indentation
At least to me, this indentation looks a bit odd:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.
If my understanding is correct,count_if
iterates over counts and incrementsp
when it finds a unique input that occurred at least twice.++counts[i]
is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort withfalse
and!
, I've usedfalse
more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
– greg
Nov 29 at 0:04
1
Yes,++counts[i]
will insert a new record fori
if it hasn't been inserted yet. That newly inserted record will have its count at 0. The++
will then increment the current count.count_if
just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returnedtrue
.
– Jerry Coffin
Nov 29 at 1:39
2
It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, applystd::unique
and getstd::distance
between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
– Incomputable
Nov 29 at 11:37
You might want to attach a caveat to the suggestion to use an array, sinceint
has a much wider range of values thanchar
(so we're not in quite the same context as that other question).
– Toby Speight
Nov 29 at 11:49
1
@Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadlystd::unique
doesn't do the job if what we need to count is the number of elements appearing at least twice
– papagaga
Nov 29 at 11:57
add a comment |
up vote
5
down vote
I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.
When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique
, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::unique
s interface in our largely similar problem? I would say so:
#include <algorithm>
template <typename Iterator>
int count_duplicates(Iterator first, Iterator last) {
// requires a sorted range
int count = 0;
while (true) {
first = std::adjacent_find(first, last);
if (first == last) return count;
first = std::adjacent_find(++first, last, std::not_equal_to<>());
++count;
}
}
Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map
and then has in all cases a complexity of O(n*log(n))
for populating it + O(n)
for counting elements with a frequency higher than 1:
if the input range is already sorted, this algorithm has
O(n)
complexity, which is betterif the input range is disposable but not sorted, this algorithm has the same complexity (
O(n*log(n))
for prior sorting andO(n)
for counting), but doesn't allocate memory and has better cache localityif the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality
On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n))
to O(n)
when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.
EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:
#include <vector>
#include <algorithm>
#include <iostream>
template <typename Iterator>
Iterator one_of_duplicates(Iterator first, Iterator last) {
// requires a sorted input
auto current = first;
while (true) {
// find a duplicated element, move it behind 'first'
// and find the next different element
current = std::adjacent_find(current, last);
if (current == last) return first;
*first++ = std::move(*current);
std::cerr << *current << std::endl;
current = std::adjacent_find(current, last, std::not_equal_to<>());
}
}
int main() {
std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
std::sort(data.begin(), data.end());
data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
for (auto i : data) std::cout << i << ',';
}
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208648%2fdata-structures-for-counting-duplicates-and-using-stdvectorerase%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
7
down vote
accepted
Basic Algorithm
At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.
In that case, I think I'd do something like this:
int count_dupes(std::vector<int> const &inputs) {
std::map<int, int> counts;
for (auto i : inputs)
++counts[i];
return std::count_if(counts.begin(), counts.end(),
(auto const &p) { return p.second >= 2; });
}
I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int
(and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.
Parameter Passing
Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.
Logical Comparisons
Comparing a Boolean value to true
or false
is generally a poor idea. if (x==true)
is equivalent to if (x)
and if (x == false)
is equivalent to if (!x)
. Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true
or false
. For example, s.insert(n).second == false
wold be better written as: if (!s.insert(n).second)
.
Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second)
. I've written C and C++ long enough that I have no difficulty with reading !
as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.
Formatting/Indentation
At least to me, this indentation looks a bit odd:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.
If my understanding is correct,count_if
iterates over counts and incrementsp
when it finds a unique input that occurred at least twice.++counts[i]
is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort withfalse
and!
, I've usedfalse
more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
– greg
Nov 29 at 0:04
1
Yes,++counts[i]
will insert a new record fori
if it hasn't been inserted yet. That newly inserted record will have its count at 0. The++
will then increment the current count.count_if
just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returnedtrue
.
– Jerry Coffin
Nov 29 at 1:39
2
It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, applystd::unique
and getstd::distance
between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
– Incomputable
Nov 29 at 11:37
You might want to attach a caveat to the suggestion to use an array, sinceint
has a much wider range of values thanchar
(so we're not in quite the same context as that other question).
– Toby Speight
Nov 29 at 11:49
1
@Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadlystd::unique
doesn't do the job if what we need to count is the number of elements appearing at least twice
– papagaga
Nov 29 at 11:57
add a comment |
up vote
7
down vote
accepted
Basic Algorithm
At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.
In that case, I think I'd do something like this:
int count_dupes(std::vector<int> const &inputs) {
std::map<int, int> counts;
for (auto i : inputs)
++counts[i];
return std::count_if(counts.begin(), counts.end(),
(auto const &p) { return p.second >= 2; });
}
I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int
(and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.
Parameter Passing
Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.
Logical Comparisons
Comparing a Boolean value to true
or false
is generally a poor idea. if (x==true)
is equivalent to if (x)
and if (x == false)
is equivalent to if (!x)
. Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true
or false
. For example, s.insert(n).second == false
wold be better written as: if (!s.insert(n).second)
.
Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second)
. I've written C and C++ long enough that I have no difficulty with reading !
as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.
Formatting/Indentation
At least to me, this indentation looks a bit odd:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.
If my understanding is correct,count_if
iterates over counts and incrementsp
when it finds a unique input that occurred at least twice.++counts[i]
is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort withfalse
and!
, I've usedfalse
more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
– greg
Nov 29 at 0:04
1
Yes,++counts[i]
will insert a new record fori
if it hasn't been inserted yet. That newly inserted record will have its count at 0. The++
will then increment the current count.count_if
just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returnedtrue
.
– Jerry Coffin
Nov 29 at 1:39
2
It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, applystd::unique
and getstd::distance
between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
– Incomputable
Nov 29 at 11:37
You might want to attach a caveat to the suggestion to use an array, sinceint
has a much wider range of values thanchar
(so we're not in quite the same context as that other question).
– Toby Speight
Nov 29 at 11:49
1
@Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadlystd::unique
doesn't do the job if what we need to count is the number of elements appearing at least twice
– papagaga
Nov 29 at 11:57
add a comment |
up vote
7
down vote
accepted
up vote
7
down vote
accepted
Basic Algorithm
At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.
In that case, I think I'd do something like this:
int count_dupes(std::vector<int> const &inputs) {
std::map<int, int> counts;
for (auto i : inputs)
++counts[i];
return std::count_if(counts.begin(), counts.end(),
(auto const &p) { return p.second >= 2; });
}
I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int
(and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.
Parameter Passing
Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.
Logical Comparisons
Comparing a Boolean value to true
or false
is generally a poor idea. if (x==true)
is equivalent to if (x)
and if (x == false)
is equivalent to if (!x)
. Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true
or false
. For example, s.insert(n).second == false
wold be better written as: if (!s.insert(n).second)
.
Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second)
. I've written C and C++ long enough that I have no difficulty with reading !
as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.
Formatting/Indentation
At least to me, this indentation looks a bit odd:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.
Basic Algorithm
At least if I understand the intent correctly, you simply want a count of the unique input characters that occurred at least twice.
In that case, I think I'd do something like this:
int count_dupes(std::vector<int> const &inputs) {
std::map<int, int> counts;
for (auto i : inputs)
++counts[i];
return std::count_if(counts.begin(), counts.end(),
(auto const &p) { return p.second >= 2; });
}
I'd also consider using an array instead of a map, as outlined in an answer to an earlier question: https://codereview.stackexchange.com/a/208502/489 --but this can depend on the range of values you're dealing with. With a 16-bit int, it's no problem at all on most machines. With a 32-bit int
(and no other constraints on values) it's still possible on many machines, but probably impractical. For arbitrary 64-bit int, an array won't be practical.
Parameter Passing
Right now, you're passing the input by value. This means when you call the function with some vector, a copy of the original vector will normally be made and passed to the function. As a general rule, something like a vector that's potentially large and slow to copy should be passed by reference to const, as shown in the code above.
Logical Comparisons
Comparing a Boolean value to true
or false
is generally a poor idea. if (x==true)
is equivalent to if (x)
and if (x == false)
is equivalent to if (!x)
. Normally, if it's Boolean in nature, a variable should be given a name that reflects that nature, and should be used directly rather than being compared to true
or false
. For example, s.insert(n).second == false
wold be better written as: if (!s.insert(n).second)
.
Some people (understandably, I guess) prefer to use the written form: if if (not s.insert(n).second)
. I've written C and C++ long enough that I have no difficulty with reading !
as meaning "not", but especially if it may be read by people less accustomed to programming, it may make more sense to use the words instead of symbols.
Formatting/Indentation
At least to me, this indentation looks a bit odd:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
If you use indentation like that consistently, I guess it's not necessarily terrible, but I think more people are accustomed to something more like this:
if (s.insert(n).second == false && m.find(n) == m.end()) {
dups++;
m.insert(pair<int, int>(n,0));
// better to remove from vector than increase space with the map?
// numbers.erase(remove(numbers.begin(), numbers.end(), n), numbers.end());
} else {
s.insert(n);
}
...where each closing brace is vertically aligned with the beginning of the block it closes. As a side-note, there are almost endless debates about the efficacy of various bracing styles. I'm not going to advocate for or against any of the well known styles, but I think there's a fair amount to be gained from using a style that's well known, and then using it consistently. I don't see much to gain from style that's different from what almost anybody else uses.
edited Nov 29 at 15:16
answered Nov 28 at 23:24
Jerry Coffin
27.8k460125
27.8k460125
If my understanding is correct,count_if
iterates over counts and incrementsp
when it finds a unique input that occurred at least twice.++counts[i]
is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort withfalse
and!
, I've usedfalse
more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
– greg
Nov 29 at 0:04
1
Yes,++counts[i]
will insert a new record fori
if it hasn't been inserted yet. That newly inserted record will have its count at 0. The++
will then increment the current count.count_if
just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returnedtrue
.
– Jerry Coffin
Nov 29 at 1:39
2
It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, applystd::unique
and getstd::distance
between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
– Incomputable
Nov 29 at 11:37
You might want to attach a caveat to the suggestion to use an array, sinceint
has a much wider range of values thanchar
(so we're not in quite the same context as that other question).
– Toby Speight
Nov 29 at 11:49
1
@Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadlystd::unique
doesn't do the job if what we need to count is the number of elements appearing at least twice
– papagaga
Nov 29 at 11:57
add a comment |
If my understanding is correct,count_if
iterates over counts and incrementsp
when it finds a unique input that occurred at least twice.++counts[i]
is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort withfalse
and!
, I've usedfalse
more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.
– greg
Nov 29 at 0:04
1
Yes,++counts[i]
will insert a new record fori
if it hasn't been inserted yet. That newly inserted record will have its count at 0. The++
will then increment the current count.count_if
just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returnedtrue
.
– Jerry Coffin
Nov 29 at 1:39
2
It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, applystd::unique
and getstd::distance
between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.
– Incomputable
Nov 29 at 11:37
You might want to attach a caveat to the suggestion to use an array, sinceint
has a much wider range of values thanchar
(so we're not in quite the same context as that other question).
– Toby Speight
Nov 29 at 11:49
1
@Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadlystd::unique
doesn't do the job if what we need to count is the number of elements appearing at least twice
– papagaga
Nov 29 at 11:57
If my understanding is correct,
count_if
iterates over counts and increments p
when it finds a unique input that occurred at least twice. ++counts[i]
is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false
and !
, I've used false
more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.– greg
Nov 29 at 0:04
If my understanding is correct,
count_if
iterates over counts and increments p
when it finds a unique input that occurred at least twice. ++counts[i]
is a very clean map update, I've not seen this before and took me a moment to understand, but both the key and value are updated. Agree on logical comparison feedback. I've gone back and fort with false
and !
, I've used false
more often for readability, but I'm at the point in my life we're brevity and speed are becoming more important. Brace indent was a paste error, but good catch and raised my awareness of this detail.– greg
Nov 29 at 0:04
1
1
Yes,
++counts[i]
will insert a new record for i
if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++
will then increment the current count. count_if
just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true
.– Jerry Coffin
Nov 29 at 1:39
Yes,
++counts[i]
will insert a new record for i
if it hasn't been inserted yet. That newly inserted record will have its count at 0. The ++
will then increment the current count. count_if
just counts the number of elements in a collection that meet the specified criteria, so it basically just counts and returns the number of items for which your predicate returned true
.– Jerry Coffin
Nov 29 at 1:39
2
2
It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply
std::unique
and get std::distance
between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.– Incomputable
Nov 29 at 11:37
It would also be quite easy to do with the original version, as it makes a copy. One could then sort copied vector, apply
std::unique
and get std::distance
between begin and returned iterator. It is not really certain which version is better though. I've written this comment with relation to the one above, but it seems to be deleted now.– Incomputable
Nov 29 at 11:37
You might want to attach a caveat to the suggestion to use an array, since
int
has a much wider range of values than char
(so we're not in quite the same context as that other question).– Toby Speight
Nov 29 at 11:49
You might want to attach a caveat to the suggestion to use an array, since
int
has a much wider range of values than char
(so we're not in quite the same context as that other question).– Toby Speight
Nov 29 at 11:49
1
1
@Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly
std::unique
doesn't do the job if what we need to count is the number of elements appearing at least twice– papagaga
Nov 29 at 11:57
@Incomputable: yes, it was my comment, I had some doubts about its validity after reading the original question again. It might have been justified. / sadly
std::unique
doesn't do the job if what we need to count is the number of elements appearing at least twice– papagaga
Nov 29 at 11:57
add a comment |
up vote
5
down vote
I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.
When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique
, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::unique
s interface in our largely similar problem? I would say so:
#include <algorithm>
template <typename Iterator>
int count_duplicates(Iterator first, Iterator last) {
// requires a sorted range
int count = 0;
while (true) {
first = std::adjacent_find(first, last);
if (first == last) return count;
first = std::adjacent_find(++first, last, std::not_equal_to<>());
++count;
}
}
Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map
and then has in all cases a complexity of O(n*log(n))
for populating it + O(n)
for counting elements with a frequency higher than 1:
if the input range is already sorted, this algorithm has
O(n)
complexity, which is betterif the input range is disposable but not sorted, this algorithm has the same complexity (
O(n*log(n))
for prior sorting andO(n)
for counting), but doesn't allocate memory and has better cache localityif the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality
On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n))
to O(n)
when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.
EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:
#include <vector>
#include <algorithm>
#include <iostream>
template <typename Iterator>
Iterator one_of_duplicates(Iterator first, Iterator last) {
// requires a sorted input
auto current = first;
while (true) {
// find a duplicated element, move it behind 'first'
// and find the next different element
current = std::adjacent_find(current, last);
if (current == last) return first;
*first++ = std::move(*current);
std::cerr << *current << std::endl;
current = std::adjacent_find(current, last, std::not_equal_to<>());
}
}
int main() {
std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
std::sort(data.begin(), data.end());
data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
for (auto i : data) std::cout << i << ',';
}
add a comment |
up vote
5
down vote
I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.
When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique
, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::unique
s interface in our largely similar problem? I would say so:
#include <algorithm>
template <typename Iterator>
int count_duplicates(Iterator first, Iterator last) {
// requires a sorted range
int count = 0;
while (true) {
first = std::adjacent_find(first, last);
if (first == last) return count;
first = std::adjacent_find(++first, last, std::not_equal_to<>());
++count;
}
}
Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map
and then has in all cases a complexity of O(n*log(n))
for populating it + O(n)
for counting elements with a frequency higher than 1:
if the input range is already sorted, this algorithm has
O(n)
complexity, which is betterif the input range is disposable but not sorted, this algorithm has the same complexity (
O(n*log(n))
for prior sorting andO(n)
for counting), but doesn't allocate memory and has better cache localityif the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality
On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n))
to O(n)
when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.
EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:
#include <vector>
#include <algorithm>
#include <iostream>
template <typename Iterator>
Iterator one_of_duplicates(Iterator first, Iterator last) {
// requires a sorted input
auto current = first;
while (true) {
// find a duplicated element, move it behind 'first'
// and find the next different element
current = std::adjacent_find(current, last);
if (current == last) return first;
*first++ = std::move(*current);
std::cerr << *current << std::endl;
current = std::adjacent_find(current, last, std::not_equal_to<>());
}
}
int main() {
std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
std::sort(data.begin(), data.end());
data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
for (auto i : data) std::cout << i << ',';
}
add a comment |
up vote
5
down vote
up vote
5
down vote
I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.
When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique
, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::unique
s interface in our largely similar problem? I would say so:
#include <algorithm>
template <typename Iterator>
int count_duplicates(Iterator first, Iterator last) {
// requires a sorted range
int count = 0;
while (true) {
first = std::adjacent_find(first, last);
if (first == last) return count;
first = std::adjacent_find(++first, last, std::not_equal_to<>());
++count;
}
}
Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map
and then has in all cases a complexity of O(n*log(n))
for populating it + O(n)
for counting elements with a frequency higher than 1:
if the input range is already sorted, this algorithm has
O(n)
complexity, which is betterif the input range is disposable but not sorted, this algorithm has the same complexity (
O(n*log(n))
for prior sorting andO(n)
for counting), but doesn't allocate memory and has better cache localityif the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality
On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n))
to O(n)
when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.
EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:
#include <vector>
#include <algorithm>
#include <iostream>
template <typename Iterator>
Iterator one_of_duplicates(Iterator first, Iterator last) {
// requires a sorted input
auto current = first;
while (true) {
// find a duplicated element, move it behind 'first'
// and find the next different element
current = std::adjacent_find(current, last);
if (current == last) return first;
*first++ = std::move(*current);
std::cerr << *current << std::endl;
current = std::adjacent_find(current, last, std::not_equal_to<>());
}
}
int main() {
std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
std::sort(data.begin(), data.end());
data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
for (auto i : data) std::cout << i << ',';
}
I don't agree with @JerryCoffin on two accounts: algorithm and paramater passing, the latter being a consequence of the former. That's why I submit this extra review, even if @JerryCoffin's has already been accepted, and even if I agree with the other points he made.
When you design an algorithm, especially in C++, you want it to be as efficient as possible, in as many situations as possible. It's a good idea to take a look at existing algorithms in the standard library to see how it can be achieved, all the more when there is an algorithm there that is closely related to the one you're designing: std::unique
, that removes all but the first of consecutive equivalent elements. What's interesting is 1) that it operates on a sorted range and 2) that it modifies the input sequence: thus it makes it optimal when the input sequence is already sorted, and also when it's disposable. Can we benefit from std::unique
s interface in our largely similar problem? I would say so:
#include <algorithm>
template <typename Iterator>
int count_duplicates(Iterator first, Iterator last) {
// requires a sorted range
int count = 0;
while (true) {
first = std::adjacent_find(first, last);
if (first == last) return count;
first = std::adjacent_find(++first, last, std::not_equal_to<>());
++count;
}
}
Let's now compare with @JerryCoffin's proposed solution, which allocates memory for a std::map
and then has in all cases a complexity of O(n*log(n))
for populating it + O(n)
for counting elements with a frequency higher than 1:
if the input range is already sorted, this algorithm has
O(n)
complexity, which is betterif the input range is disposable but not sorted, this algorithm has the same complexity (
O(n*log(n))
for prior sorting andO(n)
for counting), but doesn't allocate memory and has better cache localityif the input is neither sorted nor disposable, we have the same complexity and memory requirements (we need to copy the input range) but we keep the better cache locality
On the other hand it lacks the possibility of relying on a more efficient structure to count the occurrences of each element, such as an array or a hash table. We could then theoretically go from O(n*log(n))
to O(n)
when looking for duplicates. But I'm still unconvinced because those data structures would be oversized if the input range has a small alphabet.
EDIT: I think I've read the submitted code and the question a bit too fast. If what we need is not only to count elements appearing at least twice, but erasing other elements of the vector, then the solution is different, even if most building blocks remain:
#include <vector>
#include <algorithm>
#include <iostream>
template <typename Iterator>
Iterator one_of_duplicates(Iterator first, Iterator last) {
// requires a sorted input
auto current = first;
while (true) {
// find a duplicated element, move it behind 'first'
// and find the next different element
current = std::adjacent_find(current, last);
if (current == last) return first;
*first++ = std::move(*current);
std::cerr << *current << std::endl;
current = std::adjacent_find(current, last, std::not_equal_to<>());
}
}
int main() {
std::vector<int> data = { 0, 1, 2, 3, 4, 5, 1, 2, 2, 3, 5, 5, 5 };
std::sort(data.begin(), data.end());
data.erase(one_of_duplicates(data.begin(), data.end()), data.end());
for (auto i : data) std::cout << i << ',';
}
edited Nov 29 at 15:40
answered Nov 29 at 11:26
papagaga
4,089221
4,089221
add a comment |
add a comment |
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208648%2fdata-structures-for-counting-duplicates-and-using-stdvectorerase%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Not really enough for a full answer, but
m.insert(pair<int, int>(n, 0))
can be replaced with simplym.emplace(n, 0)
saving you from writing out the pair constructor.– Kyle
Nov 29 at 8:04