Debugging Ironic Python Agent

<Nisha> [...] debug IPA when the deploy ramdisk is built using DIB... earlier I used to put pdb in the IPA source code on the server, and it used to break there when we ran /usr/bin/ironic-python-agent
<Nisha> but now since the code is in venv, it doesnt break
<Nisha> do you know how to actually trace the IPA now when i have the iso built using DIB
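Note: one possible reason a breakpoint is never hit once IPA lives in a venv is that the edited file is not the copy the venv's interpreter imports. The log does not confirm this is the cause; it is an assumption. A minimal sketch, assuming a Python 3 interpreter, for checking which file a module is loaded from (the helper name `module_source_path` is made up for illustration):

```python
import importlib.util


def module_source_path(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Run this with the venv's interpreter; "ironic_python_agent" is the
    # package name assumed here for IPA.
    print(module_source_path("ironic_python_agent"))
```

If the printed path is not the file that was edited, the pdb call was added to a copy that never runs.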
<sinval> lucasagomes, sambetts that's right, so, during power==none the driver will raise exception, but for other cases when power==on or power==off, the current implementation does not raise anything, because validate is not called...
<lucasagomes> Nisha, hi there. I'm thinking here :-/ do you need to dynamically debug it with break points? Or would adding a bunch of debug messages do the job?
<sinval> do you guys have any suggestion?
<sinval> is there a heartbeat for the node?
<sambetts> sinval: not once deployed
<lucasagomes> sinval, not off the top of my head
<lucasagomes> sinval, there's a heartbeat only while the ramdisk is running for cleaning or deploy
<lucasagomes> sinval, maybe this is something that a driver periodic task for oneview can do ?
<lucasagomes> I know we discussed periodic task to replace the driver, but maybe we can use it now just to validate()
<lucasagomes> replace the daemon*
<sambetts> sinval: you have two options, put ownership checking logic in the actual get_power_state function which will be called periodically, or add a driver periodic task to check the ownership for all nodes
<sinval> well, but, it will be the same case of a daemon, the task can be slower than the error
<sinval> or not?
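Note: the two options sambetts describes can be sketched as follows. This is not the OneView driver's code; `FakePowerDriver`, `NodeNotOwned` and `check_all_nodes` are hypothetical names, and ownership is modelled as a plain dict.

```python
class NodeNotOwned(Exception):
    """Hypothetical error: the node is no longer owned by Ironic."""


class FakePowerDriver:
    """Hypothetical stand-in for a driver's power interface."""

    def __init__(self, owners, power_states, me="ironic"):
        self.owners = owners              # node -> current owner
        self.power_states = power_states  # node -> power state
        self.me = me

    def validate(self, node):
        if self.owners.get(node) != self.me:
            raise NodeNotOwned(node)

    def get_power_state(self, node):
        # Option 1: check ownership each time the power state is read,
        # which happens on every periodic power sync.
        self.validate(node)
        return self.power_states[node]


def check_all_nodes(driver, nodes):
    """Option 2: body of a separate periodic task; returns failing nodes."""
    failed = []
    for node in nodes:
        try:
            driver.validate(node)
        except NodeNotOwned:
            failed.append(node)
    return failed
```

Either way the check only runs at polling time, so a change made between two runs goes unnoticed until the next one.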
<dtantsur> see you tomorrow
<thiagop> see ya dtantsur
<openstackgerrit> Milan Kováčik proposed openstack/ironic-inspector: Add discover nodes exercise https://review.openstack.org/276107
<sinval> bye dtantsur
<sambetts> o/ dtantsur
<thiagop> sambetts lucasagomes good options, we'll consider that
<sinval> sambetts: makes sense
<lucasagomes> dtantsur, night
<sambetts> sinval: any periodic task is going to have some delay in detecting the error compared to listening on a bus, but your periodic task could listen on the bus for events, and listen for a server profile change event or something and then trigger an error state event
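Note: a rough sketch of the idea above, a periodic task that drains events from a bus and flags nodes whose server profile changed. The bus is modelled with a standard-library `queue.Queue`; the event shape, the `server_profile_changed` type and the `mark_error` callback are assumptions for illustration, not part of any real OneView or Ironic API.

```python
import queue


def drain_events(bus, mark_error):
    """One run of a hypothetical periodic task: handle every queued event.

    Returns the number of events consumed.
    """
    handled = 0
    while True:
        try:
            event = bus.get_nowait()
        except queue.Empty:
            return handled
        if event.get("type") == "server_profile_changed":
            # Hypothetical reaction: record the node as being in error.
            mark_error(event["node"])
        handled += 1
```

For example, with `errors = []` and `drain_events(bus, errors.append)`, every node with a queued profile change ends up in `errors`.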
<thiagop> thiagop: probably we should file a bug about that since it already affects the merged driver
<sinval> sambetts, the bus will solve everything :)
<sinval> sambetts, I mean, I'm just trying to see if we could come up with a quick and effective solution, before going for a bus
<sambetts> yup, my concern is that, that is definitely an error state, but an external daemon can't put a node into an error state
<Nisha> lucasagomes, dynamically debug
<JayF> jroll: something to look into: If Ironic is ever going to provision network switches, you should look into ONIE
<JayF> jroll: basically Ironic could tie into ONIE and support all opencompute switches
<JayF> ( http://onie.org/ ) -- basically an open source install environment for switches
<jroll> vdrok: go ahead man, thank you for the help
<sinval> sambetts, you're right
<jroll> JayF: ++ for ONIE
<jroll> I was looking at that a bit earlier
<sinval> sambetts, just 1 minute, I'll be back soon
* lucasagomes looked at ONIE some time ago
<sambetts> sinval: Sure :)
<lucasagomes> that's cool
<openstackgerrit> Merged openstack/ironic: Add documentation for the IPMITool driver https://review.openstack.org/290718
<lucasagomes> Nisha, hmm... the only way I think about that is kinda hacky. But, if you modify the ironic-agent element to not start the ironic-python-agent service at boot up time
<lucasagomes> Nisha, and you enable a way to log into the ramdisk once it's booted (dynamic-login element)
<lucasagomes> you could then log in, and start the service manually
<lucasagomes> and debug it there
<Nisha> yes i have login enabled
<lucasagomes> not super straightforward but I don't think we have plenty of options here
<lucasagomes> Nisha, right... you can also set the deploy_callback_timeout config option to 0
<Nisha> lucasagomes, i am facing issues there only in debugging
<sinval> sambetts: cool, so, should we register a bug for that issue
<lucasagomes> that will disable the call-back timeout and give you time to debug the service
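Note: the option referred to here is, as far as is known, `deploy_callback_timeout` in the `[conductor]` section of ironic.conf, where 0 disables the timeout. A small sketch that checks a config text for it; the sample text and the 1800-second fallback are assumptions, so check them against the deployed Ironic release.

```python
import configparser

# Assumed sample of the relevant ironic.conf fragment.
SAMPLE_CONF = """
[conductor]
deploy_callback_timeout = 0
"""


def callback_timeout_disabled(conf_text):
    """Return True if the deploy callback timeout is set to 0 (disabled)."""
    # interpolation=None so '%' characters elsewhere in the file are harmless
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    timeout = parser.getint("conductor", "deploy_callback_timeout",
                            fallback=1800)  # assumed default
    return timeout == 0
```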
<sinval> and we could add the validate call during get_power_state()
<lucasagomes> Nisha, which issue?
<sambetts> sinval: if it affects your current driver, if not I would add something in your spec to cover if
<sambetts> it*
* lucasagomes thanks NobodyCam for approving the ipmitool doc patch (-:
<sinval> sambetts: yeah, it affects our current driver, so by registering the bug we could track this
<sinval> sambetts: because, the same case can happen for pre-allocation (current version) and dynamic allocation (next versions)
<sinval> lucasagomes: do you have an opinion about that? ^
* lucasagomes reads
<lucasagomes> sinval, well not really... the event based approach (bus) will always beat a polling based approach (check every X time)
<Nisha> lucasagomes, i want to debug _get_partition in IPA when it installs bootloader into the server during localboot....
<lucasagomes> sinval, beat in detection time I mean
<lucasagomes> Nisha, right, so if you modify the code to add a break point there
<lucasagomes> Nisha, then you modify the ironic-agent element to not start the ipa service at boot time
<lucasagomes> Nisha, then enable a way to log in to the ramdisk
<lucasagomes> you can do that no?
<Nisha> i modified the code, added the pdb also
<lucasagomes> cause you: 1. tell ironic to deploy the node, the node will boot up
<lucasagomes> 2. access the node and start IPA manually /usr/bin/ironic-python-agent
<lucasagomes> 3. debug it
<lucasagomes> pdb /usr/bin ...
<lucasagomes> Nisha, right
<Nisha> oh u mean i shud add pdb in front of binary ...
<Nisha> thanks i will try this way
<lucasagomes> Nisha, maybe it was my mistake... I don't think you may need it
<lucasagomes> if you just run the ironic-python-agent manually in your current user session I believe it will stop at the break point
<lucasagomes> if not yeah you can run it as a script
<sinval> lucasagomes, that's right, but the first idea is fixing the miserable failure by putting the "validate_if_node_is_mine" inside the get_power_state, this should work until we come up with the periodic task + bus
<lucasagomes> Nisha, python -m pdb /usr/bin...
<sinval> lucasagomes: periodic + bus is the target
* lucasagomes reads the scrollback
<sinval> hahaha
<lucasagomes> sinval, right and the second suggestion is to add a driver periodic task that will call validate() (so we don't mess up with the sync_power_state one)
<lucasagomes> but both approaches are not perfect they leave a window of time where things can go bad
<lucasagomes> which can be solved by the event-approach in the future
<lucasagomes> sinval, maybe I'm missing what is the question here
* lucasagomes long day, I'm sorry, brain is a bit slow at the moment
<sinval> lucasagomes: hahaha, you answered my question, we should go for periodic + bus from now on
<sinval> lucasagomes, sambetts: I'm registering the bug about that issue, thank you very much
<lucasagomes> sinval, ack :-D
<lucasagomes> sinval, well thank YOU
<sinval> I'll have to step back a little bit, see you guys tomorrow o/
<lucasagomes> see ya
<lucasagomes> I will also call it a day, gotta get the train back home yet
<lucasagomes> have a great evening all, talk soon
* lucasagomes is now known as lucas-afk
<sinval> night lucas-afk
* sinval is now known as sinval-afk
<mgould> lucas-afk, good night!


last updated march 2016