[mnet-devel] Re: [mnet-bugs] [ mnet-Bugs-908261 ] memory usage explosion
Arno Waschk
hamamatsu at gmx.de
Tue Apr 20 20:51:26 BST 2004
hm, from what i have seen lately i think the non-blockwrangler mem usage
is somewhat tolerable. At least much better than what i saw earlier. so
i'll vote for putting it into 0.6.3. Clever patch, BTW! Arno
On Tue, 20 Apr 2004 08:32:42 -0700, SourceForge.net
<noreply at sourceforge.net> wrote:
> Bugs item #908261, was opened at 2004-03-02 12:36
> Message generated for change (Comment added) made by zooko
> You can respond by visiting:
> https://sourceforge.net/tracker/?func=detail&atid=436453&aid=908261&group_id=43482
>
> Category: Network / EGTP
> Group: v0.6.2
> Status: Open
> Resolution: None
> Priority: 5
> Submitted By: Zooko O'Whielacronx (zooko)
>> Assigned to: Arno Waschk (arnowa)
> Summary: memory usage explosion
>
> Initial Comment:
> Arno reports that after a temporary network disconnect,
> Mnet started using > 200 MB (?? I don't remember what
> he said exactly).
>
> This could be a bug in peerman, but on the other hand
> it could also be a lot of other things. It is easy to
> start gulping down memory in Python, and hard to find
> out where you are doing it after the fact.
>
> ----------------------------------------------------------------------
>
>> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-04-20 15:32
>
> Message:
> Logged In: YES
> user_id=52562
>
> Assigning to arno so he can decide if it is okay to bump it
> to v0.6.3. Not that I have any idea of what *else* to do
> with it. By the way, I applied that patch the reduces mem
> usage by a few MB.
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-04-20 15:03
>
> Message:
> Logged In: YES
> user_id=52562
>
> Well, I've experimented and experimented. I have a patch
> that reduces memory usage by about 6 to 12 MB, but I'm not
> sure it is worth putting that patch into v0.6.2.
>
> I've never seen memory usage exceed 200 MB.
>
> I've only very rarely seen it get up to 175 MB.
>
> I'm about to give up and list it as a "known bug" in v0.6.2.
>
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-04-17 22:51
>
> Message:
> Logged In: YES
> user_id=52562
>
> I'm testing a patch which prints out diags that will
> identify if msgs held in the closure of
> post-timeout-callbacks is the problem.
>
> If the diags show that this is a problem, then I'll fix it.
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-04-17 10:12
>
> Message:
> Logged In: YES
> user_id=52562
>
> Uh-oh.
>
> Last night I ran a new node on my Mac laptop. It is behind
> NAT, and its relay server was <nw31m>. This morning <nw31m>
> is using 175 MB RAM!
>
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-04-15 16:47
>
> Message:
> Logged In: YES
> user_id=52562
>
> Assigning to Arno so that he can be responsible for closing it.
>
> Arno is a good person to be the "issue closer" because he is good at
> finding bugs and bad behavior.
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-04-13 15:24
>
> Message:
> Logged In: YES
> user_id=52562
>
> Fixed?
>
> I've seen my relay server process lots of messages, and
> store messages in RAM for slow delivery, and so forth, and I
> haven't seen any memory usage explosion. It's been up for
> two days in the most recent trial.
>
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-18 12:19
>
> Message:
> Logged In: YES
> user_id=52562
>
> I committed a patch which completely changes the way relay
> servers store messages for slow delivery. The new way
> actually calculates the aggregate size of all stored
> messages, and drops messages (randomly) if it goes over the
> limit.
>
> Unfortunately, since I started a node that runs this patch
> yesterday, there haven't been any stored messages on my
> relay server. This is a problem with testing things on the
> real live public Mnet.
>
> I guess I'll just wait patiently.
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-16 14:53
>
> Message:
> Logged In: YES
> user_id=52562
>
> Another issue is that it looks a lot like messages stored by
> the relay server for slow delivery don't get garbage
> collected after they are delivered. See this:
> 2004-03-16_14:11:07 (XXXDEBUG) get_mem_usage_virt_and_res():
> (49090560, 8204)
> 2004-03-16_14:11:43 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips:
> 0, senders: 0, msgs: 0, sum of sizes of msgs: 0,
> len(self.mtm._cm._map_inmsgid_to_i
> 2004-03-16_14:15:43 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips:
> 1, senders: 2, msgs: 37, sum of sizes of msgs: 2949124,
> len(self.mtm._cm._map_inmsg
> 2004-03-16_14:16:07 (XXXDEBUG) get_mem_usage_virt_and_res():
> (58159104, 10399)
> 2004-03-16_14:16:43 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips:
> 0, senders: 0, msgs: 0, sum of sizes of msgs: 0, le
> 2004-03-16_14:21:07 (XXXDEBUG) get_mem_usage_virt_and_res():
> (58155008, 10399)
> n(self.mtm._cm._map_inmsgid_to_i
>
> It looks a lot like storing those messages cost 10 MB (why
> ??) and then the 10 MBs is never recovered for the rest of
> the life of the program.
>
> Now a big question I have is: suppose 37 new messages come
> in, weighing 2949124 bytes in total, then will the RAM usage
> got up *another* 10 MB, or will the memory already allocated
> be reused??
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-14 12:14
>
> Message:
> Logged In: YES
> user_id=52562
>
> Okay, between the first and second samples there were a
> large number of relay server and "lookup contact info"
> messages. Between the second and third samples there were
> an even larger number of the same. Each sample interval is
> about 5 minutes, and during a normal sample interval we
> exchanged maybe 50 or so messages total. During the first
> memory-increasing sample interval, we exchanged 1500 or so.
> During the second memory-increasing sample interval, we
> exchanged 2100 or so (again, almost every one was a relay
> server message or a "lookup contact info").
>
> Note that it cannot be a memory leak that occurs with every
> such message, because over the other dozens of hours of
> operation, thousands of messages go through with zero
> increase in memory usage.
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-14 12:00
>
> Message:
> Logged In: YES
> user_id=52562
>
> Another interesting fact: all of the messages storing had
> finished *before* the second increase in RAM happened:
>
> 2004-03-13_22:15:54 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips: 1, senders: 3, msgs: 104, sum
> of sizes of msgs: 5376044, len(self.mtm._cm._map_inmsg
> 2004-03-13_22:16:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (72863744, 13933)
> 2004-03-13_22:16:54 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips: 1, senders: 3, msgs: 104, sum
> of sizes of msgs: 10899508, len(self.mtm._cm._map_inms
> 2004-03-13_22:20:55 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips: 1, senders: 3, msgs: 104, sum
> of sizes of msgs: 10899508, len(self.mtm._cm._map_inms
> 2004-03-13_22:21:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (150106112, 32814)
> 2004-03-13_22:21:55 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips: 1, senders: 3, msgs: 104, sum
> of sizes of msgs: 10899508, len(self.mtm._cm._map_inms
>
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-14 11:57
>
> Message:
> Logged In: YES
> user_id=52562
>
> This is a strong correlation of events, but it is a mystery
> why the amount of RAM used would go up superlinearly with
> the amount of bytes of message stored.
> From the first sample to the second, the RAM went up
> 27,004,928 virt / 6,609 res. From the second sample to the
> third, the RAM went up 77,242,368 virt, / 18,881 res. From
> the first sample to the second, the bytes of stored message
> went up 5,374,132. From the second sample to the third, the
> bytes of stored message went up another 5,523,464. Why
> would storing 5 MB of messages take 27 MB virt, and then
> storing another 5 MB of messages takes another 77 MB virt ?
>
>
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-14 11:50
>
> Message:
> Logged In: YES
> user_id=52562
>
> Okay, one possible clue is that <nw31m>, acting as relay
> server, stored some messages for slow delivery during that time:
>
>
> 2004-03-13_22:11:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (45858816, 7324)
> 2004-03-13_22:10:54 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips: 1, senders: 2, msgs: 2, sum of
> sizes of msgs: 1912, len(self.mtm._cm._map_inmsgid_to
> 2004-03-13_22:15:54 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips: 1, senders: 3, msgs: 104, sum
> of sizes of msgs: 5376044, len(self.mtm._cm._map_inmsg
> 2004-03-13_22:16:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (72863744, 13933)
> 2004-03-13_22:16:54 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips: 1, senders: 3, msgs: 104, sum
> of sizes of msgs: 10899508, len(self.mtm._cm._map_inms
> 2004-03-13_22:21:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (150106112, 32814)
> 2004-03-13_22:21:55 (RelayServer) <RelayServerHandlers
> 1>._doq_loop_debug(); recips: 1, senders: 3, msgs: 104, sum
> of sizes of msgs: 10899508, len(self.mtm._cm._map_inms
>
>
> And it has continued to store them (and to occupy that much
> RAM) for the next 14 hours or so.
>
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-14 11:35
>
> Message:
> Logged In: YES
> user_id=52562
>
> I observed this again. After being up for two days, my node
> (running HEAD == v0.6.3) went from 45 MB virt/7 MB res to
> 149 MB virt/32 MB res and stayed there. This happened over
> 10 minutes. Here are the relevant log entries:
>
> 2004-03-12_16:10:28 () protocol version: 0.9991
> 2004-03-12_16:10:28 () pyutil library version: 0.0.4.10-STABLE
> 2004-03-12_16:10:28 () broker version: 0.6.3, nano version
> 373, flags UNSTABLE
> 2004-03-12_16:10:28 () python version: 2.3.3 (#1, Mar 3
> 2004, 07:41:22)
> : [GCC 2.95.4 20011002 (Debian prerelease)]
> 2004-03-12_16:10:28 () Crypto++ version: 5.0
> 2004-03-12_16:10:28 () Berkeley DB version: (4, 1, 25)
> 2004-03-12_16:10:28 () PyBSDDB version: 4.1.6
> 2004-03-12_16:10:46 (Broker) This Broker's ID is: <nw31m>
> (full name: nw31me8ggzzah1td6uc1wrhau7hhsuqe) or in
> mojosixbit encoding: FTMloOY1745KI_TZKhOYn3nLTcg
> 2004-03-13_22:06:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (45846528, 7322)
> 2004-03-13_22:11:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (45858816, 7324)
> 2004-03-13_22:16:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (72863744, 13933)
> 2004-03-13_22:21:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (150106112, 32814)
> 2004-03-13_22:26:21 (XXXDEBUG) get_mem_usage_virt_and_res():
> (149262336, 32608)
>
> increasing priority of this issue and investigating in my
> log file.
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-12 21:38
>
> Message:
> Logged In: YES
> user_id=52562
>
>
> Arno is now testing several new patches, the combination of
> which may help with this problem.
>
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-11 12:24
>
> Message:
> Logged In: YES
> user_id=52562
>
> Downgrading priority because it doesn't happen all that often.
>
> There is a patch in v0.6.3 [1] that probably helps with this
> quite a lot (as well as improving network usage in general).
>
> The improvements in bootpage loading and hello'ing that we
> are currently planning might also reduce the frequency and
> intensity of these memory usage explosions.
>
> Hopefully Arno will close this or bump it to "v0.6.3" group
> after he tries out these patches.
>
>
> [1] peerman.py, rev. 1.38:
> http://cvs.sourceforge.net/viewcvs.py/mnet/mnet/common/peerman.py?r1=1.38
>
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-08 14:37
>
> Message:
> Logged In: YES
> user_id=52562
>
> Personally I'm inclined to leave this as a Known Bug in
> v0.6.2. Arno and I have investigated extensively, and I
> definitely understand why this happens sometimes. I don't
> know if my explanation explains all occurrences. Anyway, I
> don't see a way to *really* make sure this can never happen.
> I do have a patch that demultiplexes multiple overlapping
> "lookup contact info"'s to the same MT for the same peerId.
> I will CVS commit that patch to the HEAD branch and then
> maybe Arno will consider whether it is worth porting that
> patch to the branch_v062 branch. I think that patch
> probably reduces or eliminates the memory usage explosion in
> the case of a network outage, but I haven't tested that yet.
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-07 13:39
>
> Message:
> Logged In: YES
> user_id=52562
>
> I have not observed this event except when all MTs were
> unreachable. Arno says he has observed it even when MTs
> were reachable. Or, actually, maybe the bug he was talking
> about wasn't memory-usage-explosion, but a different one
> involving the node not being able to talk to any peers.
> Maybe Arno should post his own words to the issue tracker
> himself. ;-)
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-06 21:36
>
> Message:
> Logged In: YES
> user_id=52562
>
> Okay, this is almost certainly the same as the "message
> storm" bug.
> I have run a good test and I see what is very likely the
> problem. I have a fix in mind, and it is a fix that I've
> implemented before. It is also a fix that Arno and I have
> previously discussed: making multiple successive calls to
> peerman.lookup_contact_info() with the *same* MT and the
> *same* peerId get demultiplexed so that they result in only
> a single outgoing "lookup" msg.
> That may not be the only problem, but it is worth doing.
> Now, when will I implement this? MAYBE I will wake up
> early, before Irby does, tomorrow morning and do it then...
>
> ----------------------------------------------------------------------
>
> Comment By: Zooko O'Whielacronx (zooko)
> Date: 2004-03-06 12:07
>
> Message:
> Logged In: YES
> user_id=52562
>
> Increasing priority, because my server is currently using
> 343 MB virt, 230 MB resident, even though it has been doing
> nothing but serving for a couple of days.
>
> ----------------------------------------------------------------------
>
> Comment By: Arno Waschk (arnowa)
> Date: 2004-03-03 13:10
>
> Message:
> Logged In: YES
> user_id=642335
>
> it looks like the broker does not nicely reconnect to the
> netwórk, and peerman does not handle that... will look into
> that as soon as i find some time
>
> ----------------------------------------------------------------------
>
> You can respond by visiting:
> https://sourceforge.net/tracker/?func=detail&atid=436453&aid=908261&group_id=43482
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: IBM Linux Tutorials
> Free Linux tutorial presented by Daniel Robbins, President and CEO of
> GenToo technologies. Learn everything from fundamentals to system
> administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> _______________________________________________
> mnet-bugs mailing list
> mnet-bugs at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/mnet-bugs
>
--
http://www.arnowaschk.de
-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
mnet-devel mailing list
mnet-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mnet-devel
More information about the Mnet-devel
mailing list