Message boards : Number crunching : How to Fix the current Issues - One man's opinion
Author | Message |
---|---|
W-K 666 Joined: 18 May 99 Posts: 19477 Credit: 40,757,560 RAC: 67 |
The deadline for AP is 25 days and I haven't seen any problems with that. As AP tasks take longer to crunch than MB, I would think a maximum of 21 days for MB would be reasonable. |
CAPT Siran d'Vel'nahr Joined: 23 May 99 Posts: 7379 Credit: 44,181,323 RAC: 238 |
Greetings, What you guys may or may not know or understand is that there is already a 2nd project here at SETI. It's called Beta. That project resides on the same servers as SETI Prime does. What you are suggesting is for a 3rd project to be installed. I don't see the point. Just one man's opinion on this topic. ;) Have a great day! :) Siran CAPT Siran d'Vel'nahr - L L & P _\\// Winders 11 OS? "What a piece of junk!" - L. Skywalker "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
Retvari Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
Greetings,
Perhaps the Beta should be created so it can handle a few task-size doublings now, and several more in the future. |
Retvari Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
That trend will follow the uptime/downtime ratio of this project (plus many other aspects).
This amount is exponentially decaying as we go back in time, but the volunteers of this project can provide the computing power to convert (even re-calculate) that amount of data (as the computing power is exponentially growing), but I'm not sure if it should be converted at all. The architecture of the science database can be changed without changing the meaning of the data in it, so this project can use a different architecture in the future.
Not as exponential as you might think, as the number of active users has diminished over that period. For several reasons, BOINC and credit screw to name but two.
The goal should be to reduce downtime (ideally to 0), as the frequent and extended downtime periods resulted in counterproductive user action. |
CAPT Siran d'Vel'nahr Joined: 23 May 99 Posts: 7379 Credit: 44,181,323 RAC: 238 |
Greetings,
Perhaps the Beta should be created so it can handle a few task-size doublings now, and several more in the future.
Hi Retvari, Beta IS a project in and of itself and does not need to be created. It already exists on the same servers as SETI Prime does. It is there to test new apps and server software, hence the name Beta. I don't see messing with Beta when Prime needs more fixing. I am currently without any work on my main host and have been for quite some time now. My Pis and Linux PC have just over a day's work each, and my laptop just over 2 days. Beta is shut down right now, I assume, so that the SETI team can concentrate on fixing Prime. Have a great day! :) Siran [edit] My main just got some WUs. Woohoo! :) [/edit] CAPT Siran d'Vel'nahr - L L & P _\\// Winders 11 OS? "What a piece of junk!" - L. Skywalker "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
rob smith Joined: 7 Mar 03 Posts: 22647 Credit: 416,307,556 RAC: 380 |
What do you actually mean by "doubling task size"? Do you mean just adding more data points to increase the file size from 700k to 1400k? Do you mean doubling the resolution, so doubling the file size? Do you mean putting two data sets into one file, so doubling the file size? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
rob smith Joined: 7 Mar 03 Posts: 22647 Credit: 416,307,556 RAC: 380 |
Where do you get that Beta is shut down just now? There are tasks ready to send, the splitters are not disabled, the board is alive and kicking. Remember Beta is not about processing "real" data, it is for testing something prior to release on main, and if there is nothing in the Beta test schedule just now then it just sits there idle until such time as there is something to test. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
CAPT Siran d'Vel'nahr Joined: 23 May 99 Posts: 7379 Credit: 44,181,323 RAC: 238 |
Where do you get that Beta is shut down just now? Hi Rob, I was going by something I read here in the forum several days ago. When I got the link to the server status page at Beta, I was just looking at the server names and nothing else. I suppose if I'd looked at the other stats I would not have made that statement. My bad. Sorry. :( Have a great day! :) Siran CAPT Siran d'Vel'nahr - L L & P _\\// Winders 11 OS? "What a piece of junk!" - L. Skywalker "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14686 Credit: 200,643,578 RAC: 874 |
Where do you get that Beta is shut down just now?
The next thing to test will be the 715 server code fix, so BOINC can proceed with a full 'Server Stable v1.2.1' release for the benefit of other projects. I still got an 'internal server error' with anonymous platform when I tested this morning. Unfortunately, we only have one Eric. |
Retvari Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
What do you actually mean by "doubling task size"?
I would go for the 1st option. A task which covers a longer period in time would also mean that less overlap (= less network traffic, less disk space) is needed for data processing / transfer. The ideal solution would be to send as much data to a host as its actual device (CPU/GPU) could process in 1~2 hours. For example, a very fast host would receive up to 256 times longer chunks of data to process. I can easily spot tasks which were processed by my wingman over 400 times more slowly. In other words, my host puts a 400 times higher load on the servers than the other host does. This is not necessary. The way the data is split between hosts should be adapted to reduce the workload on the servers, as future GPUs will be even faster. I'm aware that the storage limits of the given workunit for the found spikes / pulses / triplets / Gaussians should be increased as well. The 2nd option is also viable, but the 3rd wouldn't change things much. The point is to reduce the number of tasks out in the field, and the number of server-client transactions, to make it easier for the servers to handle their job. |
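To make the idea concrete, here is a minimal sketch of that power-of-two chunk sizing. The names and numbers (BASELINE_FLOPS, MAX_MULTIPLIER, the speed figures) are illustrative assumptions, not anything from the SETI@home code:

```python
# Minimal sketch of power-of-two chunk sizing based on host speed.
# BASELINE_FLOPS stands in for a slow "reference" device; MAX_MULTIPLIER
# is the 256x upper bound mentioned in the post. Both are assumptions.

BASELINE_FLOPS = 20e9          # assumed speed of a slow reference device
MAX_MULTIPLIER = 256           # upper bound mentioned in the post

def chunk_multiplier(device_flops: float) -> int:
    """Return how many baseline-sized chunks to bundle into one task,
    rounded down to a power of two and capped at MAX_MULTIPLIER."""
    ratio = max(1.0, device_flops / BASELINE_FLOPS)
    power = 1
    while power * 2 <= ratio and power * 2 <= MAX_MULTIPLIER:
        power *= 2
    return power

# Example: a GPU 400x faster than the baseline gets 256x larger tasks,
# so it contacts the scheduler far less often for the same data volume.
print(chunk_multiplier(20e9))   # 1   (slow CPU)
print(chunk_multiplier(8e12))   # 256 (fast GPU, capped)
```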
Retvari Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
Beta IS a project in and of itself and does not need to be created. It already exists on the same servers as SETI Prime does. It is there to test new apps and server software, hence the name Beta. I don't see messing with Beta when Prime needs more fixing.
We're discussing ideas this project needs to adopt to get fixed for good. That's what Beta could be used for. Tinkering with the old stuff can't achieve that in the long term. |
rob smith Joined: 7 Mar 03 Posts: 22647 Credit: 416,307,556 RAC: 380 |
There are a couple of things to consider with simply doubling the number of data points delivered as a task. Currently each successive pair of work units overlaps in the data; this has an impact on the analysis which is not easy to predict. Also, as has already been identified, the maximum number of signals per task is set to 30, and this is set in the servers; to change it would mean a revision to the database structure. As to the concept of determining task size by predicting the performance of the host, and so predicting the calculation time: this is fraught with difficulties, as one would end up with multiple task sizes. That would mean the splitters would have to split each task (or group of tasks) for each "type" of host, so one would lose the diversity of processors that is inherent in the fixed-size, randomly assigned method of working currently employed. It would work if there were no overlap between work units - I think some other projects that do not have overlapping data do use this sort of approach. Sadly, I suspect we are stuck with the 700k task size unless the project wants to re-structure the database and re-write all the applications to cope with an increase in the permitted number of signals. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
J. Mileski Joined: 9 Jun 02 Posts: 632 Credit: 172,116,532 RAC: 572 |
|
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
No, it can't, unless they develop an AP splitter for GBT work. Not sure whether a new application is needed though. Someone here knows the answer definitively. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
The clients could process the bigger workunits in several parts, producing multiple independent sets of results, each covering a similar time window as before. The assimilators would have more work to do per workunit but not any more per source tape than with the small workunits.
Wouldn't that just produce the same number of results as the present system, and we would still have 11 million "Results returned and awaiting validation"?
No, because this result set would still be just one result file, so only one database row is needed to reference it. |
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
I'm aware that the storage limits of the given workunit for the found spikes / pulses / triplets / Gaussians should be increased as well.
If you have really long work units, then increasing the limits for returned signals is not enough. An RFI spike will fill any reasonable limit and this will then mask all the good parts of the data. Bigger time windows mean more observation time is lost due to these events. This is why I suggested in another post that the clients would process the long workunits in multiple parts that would match the size of the current workunits and produce result data separately for each part. So you could have result overflow for one part but good results for the rest. |
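A rough sketch of that per-part processing idea. The names here (analyse_window, SIGNAL_LIMIT, process_long_workunit) are hypothetical, not the real client API; the point is that an overflowing part only truncates its own results:

```python
# Sketch: process each current-sized window of a long workunit independently,
# so an RFI overflow in one window does not mask the rest of the task.

SIGNAL_LIMIT = 30   # current per-task limit mentioned earlier in the thread

def analyse_window(window):
    """Stand-in for the real analysis; returns a list of detected signals."""
    return window.get("signals", [])

def process_long_workunit(windows):
    """Return a result record per window, flagging overflow individually."""
    results = []
    for i, window in enumerate(windows):
        signals = analyse_window(window)
        overflow = len(signals) > SIGNAL_LIMIT
        results.append({
            "part": i,
            "overflow": overflow,
            "signals": signals[:SIGNAL_LIMIT],
        })
    return results   # all parts still go back in a single result file

# Example: part 1 overflows with RFI, but parts 0 and 2 still return good data.
windows = [{"signals": [1, 2]}, {"signals": list(range(500))}, {"signals": [3]}]
for part in process_long_workunit(windows):
    print(part["part"], part["overflow"], len(part["signals"]))
```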
alanb1951 Joined: 25 May 99 Posts: 10 Credit: 6,904,127 RAC: 34 |
I'm aware that the storage limits of the given workunit for the found spikes / pulses / triplets / Gaussians should be increased as well.
If you have really long work units, then increasing the limits for returned signals is not enough. An RFI spike will fill any reasonable limit and this will then mask all the good parts of the data. Bigger time windows mean more observation time is lost due to these events.
MilkyWay@home already does this batching up of sub-tasks. That works provided the different parts all validate! However, if one sub-task fails to validate, the whole batch of tasks is sent out to another client to resolve the mismatch, and they keep trying until two clients send in a matched set or the limits are hit; fortunately, MilkyWay tasks are reasonably short... I suspect that almost any attempt to deal with validation errors in a more sub-task related way would introduce unwanted levels of complexity; perhaps not an issue if there were enough developer time available, but I fear that is not the case. However, something needs to be done; I'm just not sure what! Cheers - Al. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Please forgive me, but I have another opinion that could help to solve the current issues, since changing the WU size or server configuration (separating the types of hosts as suggested) at this moment will only add more gasoline to the already burning problem. In the past, when we talked about the spoofed client, somebody posted (I don't remember who) that the problem with the DB was not its size but the number of queries/second. I disagreed at the time, since I have always believed size matters. In the end we could all see the decision to keep the release of the spoofed client in a closed loop was right; the impact of changing the user-side limits hit the servers hard. Now I can see the real answer is BOTH, and that is why the fix requires solving both at the same time. Fixing one without fixing the other could make us circle around and never solve the entire problem. In desperate times we need to take desperate measures. A few controversial actions must be taken:
- go back to the old and well-tested validation-of-2 system.
- stop sending new WUs to hosts with faulty GPUs/drivers.
- drastically reduce the deadline of new WUs.
- reduce the limits even more.
- reduce the number of days of the client WU cache size.
- resend the expired/not validated WUs with a short deadline (maybe a week) only to the fastest hosts with high APR.
- stop making changes on the servers until the system is back to working fine.
This will all take time to do its job and reduce both the number of queries/second and the DB size; be aware that time will be weeks, as there are so many WUs with deadlines in March and probably even later, sitting around waiting for a wingman who will probably never appear. To reduce the size of the DB we need to try to clear the WUs as fast as possible. Yes, I know these are all controversial, but after about a month we can all see something extreme must be done, for the good of the project and the sanity of the DB. Keeping the project producing a lot of new WUs/day without solving the core of the problem will make us fall into a black hole of no return. Then, after the system is back to stability and all is working fine, some measures could be slowly changed. One at a time. On one point we need to agree: the decision to raise the limits and change the server version without any major test was wrong and left us with all this mess. The problem with the cross-validation caused by the faulty GPUs/drivers just added more gasoline to the fire. My 0.02
Retvari Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
If you have really long work units, then increasing the limits for returned signals is not enough. An RFI spike will fill any reasonable limit and this will then mask all the good parts of the data. Bigger time windows mean more observation time is lost due to these events.
RFI spikes can be easily detected and omitted from the result by the app, so no observation time would be lost.
This is why I suggested in another post that the clients would process the long workunits in multiple parts that would match the size of the current workunits and produce result data separately for each part. So you could have result overflow for one part but good results for the rest.
This would leave the load on the servers unchanged. Further tweaking and optimizing of client behavior would make the servers' job harder; this isn't the right way. There's no easy way to fix the problems we face. |
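As a toy illustration of detecting and omitting RFI before it fills the reporting limit, assuming a simple median-threshold rule (the real client's RFI handling is far more elaborate, and RFI_FACTOR is an invented number):

```python
# Toy RFI rejection: drop any sample whose power exceeds RFI_FACTOR times the
# median of its window, so a spike never eats into the signal-reporting limit.

from statistics import median

RFI_FACTOR = 10.0   # assumed threshold, purely illustrative

def strip_rfi(powers):
    """Return the samples considered clean, dropping suspected RFI spikes."""
    m = median(powers)
    return [p for p in powers if p <= RFI_FACTOR * m]

# Example: one huge spike is dropped, the rest of the window is kept.
window = [1.0, 1.2, 0.9, 250.0, 1.1]
print(strip_rfi(window))   # [1.0, 1.2, 0.9, 1.1]
```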
Retvari Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
In my opinion this project needs a new splitting / validation process which is able to handle the ultra-high performance of present and future GPUs as well as the oldest CPUs. It could be achieved by sending larger chunks of data to fast hosts (expanding in powers of 2, limited by the actual processing speed of the slowest device (GPU/CPU) in the given system). It needs a new client app also, as it should omit the parts of the data poisoned by RFI. I think the transition to that adaptive splitting algorithm needs to happen now. Please share your ideas! (Apart from "it can't be done".) |