Disk I/O & size performance
I have a process that needs to do a lot of disk writes but no reads. I can either write a lot of small files (~1,000,000,000 files, which is what I'm currently doing) or write a few big files.
Small files are ~2 KB on average, but since I have a 4096-byte block size, I'm losing about half of my disk capacity.
However, as a lot of threads need to write at the same time, wouldn't big files be a bottleneck, since each thread will need to open the file, write, then close it?
To summarize, which is best for I/O and space optimization:
- A lot of small files
- A few big files
linux hard-drive
Why not use a database instead? This will hand your problem over to a system designed to handle writes, caching, etc.
– davidgo
Jan 27 at 18:57
All threads in a process use the same file descriptor (which belongs to the process, not to individual threads).
– xenoid
Jan 27 at 19:48
@davidgo Which databases are suitable for that?
– o2640110
Jan 27 at 20:50
That really depends on your data, especially how the data is going to be retrieved and worked with. For a standard SQL-type database, look into MySQL or PostgreSQL, but depending on the nature of your data, the risks related to lost data, etc., a NoSQL database (which I'm not that familiar with, but they come in lots of flavours) may be better - en.m.wikipedia.org/wiki/NoSQL
– davidgo
Jan 27 at 21:06
Databases do not use file descriptors (in the interface with programs) - typical SQL databases treat each row separately unless you add locks or constraints.
– davidgo
Jan 27 at 21:08
edited Jan 28 at 0:38
o2640110
asked Jan 27 at 17:27
o2640110
1 Answer
The easiest approach might be to let write caching determine how often actual HDD (or SSD) writes are made. You can turn write caching on or off at the OS level, or experiment with various hdparm settings. This enables tuning without altering your application. See Unix StackExchange on tuning.
Another possibility is to write to a RAM disk and periodically move the data to the HDD.
Caveat: delaying writes increases the risk of data loss on a crash or power failure, though if you're using a laptop or a PC with a UPS, that might not be an issue.
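If changing the application is an option, the same "buffer in RAM, write in bulk" idea can also be done in user space. The following is only a minimal sketch, assuming the small records can be packed into one big file with a length prefix; `BatchedAppender`, `read_records`, `demo`, and the 1 MiB flush threshold are hypothetical names and values chosen for illustration, not something from the question.

```python
import os
import struct
import tempfile
import threading


class BatchedAppender:
    """Hypothetical helper: packs many small records into one big file."""

    def __init__(self, path, flush_bytes=1 << 20):
        self._file = open(path, "ab")      # opened once, shared by all threads
        self._lock = threading.Lock()      # serialises access to the buffer
        self._buffer = bytearray()
        self._flush_bytes = flush_bytes    # assumed threshold: 1 MiB

    def write(self, record: bytes) -> None:
        with self._lock:
            # 4-byte little-endian length prefix so records can be split later
            self._buffer += struct.pack("<I", len(record)) + record
            if len(self._buffer) >= self._flush_bytes:
                self._flush_locked()

    def _flush_locked(self) -> None:
        # Caller must hold the lock: one large write instead of many small ones
        self._file.write(self._buffer)
        self._buffer.clear()

    def close(self) -> None:
        with self._lock:
            self._flush_locked()
            self._file.flush()
            os.fsync(self._file.fileno())  # ask the OS to push data to the device
            self._file.close()


def read_records(path):
    """Read back the length-prefixed records (only used for checking)."""
    records = []
    with open(path, "rb") as f:
        while header := f.read(4):
            (size,) = struct.unpack("<I", header)
            records.append(f.read(size))
    return records


def demo(n_threads=8, per_thread=1000, record_size=2048):
    path = os.path.join(tempfile.mkdtemp(), "records.bin")
    appender = BatchedAppender(path)

    def worker(tid):
        payload = bytes([tid]) * record_size   # stand-in for a ~2 KB record
        for _ in range(per_thread):
            appender.write(payload)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    appender.close()
    return path, os.path.getsize(path)


if __name__ == "__main__":
    path, size = demo()
    print(path, size)
```

Here the threads never open or close the file themselves; they only take a short lock to append to an in-memory buffer. Packed records also avoid the per-file rounding to a 4096-byte block. The caveat above still applies: anything sitting in the buffer is lost if the process dies before the next flush.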
I didn't know about write caching, quite interesting
– o2640110
Jan 27 at 23:39
answered Jan 27 at 18:03
DrMoishe Pippik