-
Notifications
You must be signed in to change notification settings - Fork 41
Sendfil Service #512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sendfil Service #512
Conversation
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
textileben
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Docker sections look good to me.
You might want to have someone else review the Filecoin portion as I'm not familiar with how the network messaging works.
Signed-off-by: Aaron Sutula <hi@asutula.com>
| "listenAddr": { | ||
| Key: "listen_addr", | ||
| DefValue: "127.0.0.1:5000", | ||
| }, | ||
| "lotusAddr": { | ||
| Key: "lotus_addr", | ||
| DefValue: "127.0.0.1:7777", | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice these are now just plain host:port, no multiaddr. This depends on an open Powergate PR.
| service SendFilService { | ||
| rpc SendFil(SendFilRequest) returns (SendFilResponse) {} | ||
| rpc Txn(TxnRequest) returns (TxnResponse) {} | ||
| rpc ListTxns(ListTxnsRequest) returns (ListTxnsResponse) {} | ||
| rpc Summary(SummaryRequest) returns (SummaryResponse) {} | ||
| } No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the API. Pretty straight forward, lets you send fil, query for a specific transaction, list transactions, and get a summary. The summary is mostly intended for monitoring so we can keep an eye on things.
| google.protobuf.Timestamp after = 1; | ||
| google.protobuf.Timestamp before = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can generate a summary for a range of time or if you don't provide any constraints, all time.
| int64 count_txns = 1; | ||
| int64 count_pending = 2; | ||
| int64 count_active = 3; | ||
| int64 count_failed = 4; | ||
| int64 count_waiting = 5; | ||
| int64 count_from_addrs = 6; | ||
| int64 count_to_addrs = 7; | ||
| int64 total_nano_fil_sent = 8; | ||
| double avg_nano_fil_sent = 9; | ||
| int64 max_nano_fil_sent = 10; | ||
| int64 min_nano_fil_sent = 11; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss how these should all be interpreted, but the idea is to have an idea of what "normal" looks like and keep an eye out for any deviations from that. A monitoring system can generate a summary, say every 30 min for the last hour (or more frequently), and we'll get some rolling data which could be nice.
|
|
||
| message TxnRequest { | ||
| string message_cid = 1; | ||
| bool wait = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both SendFil and Txn take a wait parameter. If true, the response won't come until the transaction is active on chain or times out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so confirmed some comment above. Just a thought about the risky-ness of the API timing out or having some network error. I don't have the big picture yet, but maybe that's recoverable by querying later and discovering the sendfil pending/succeeding even if the client never got the initial API return (due to the timeout/network error)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea exactly. We use a reasonable timeout while waiting for the message in the lotus api. If that timeout expires, the error returned here will say that is what happened. The client is free to make another request. The service will also retry any previously timed out waits.
|
|
||
| enum MessageState { | ||
| MESSAGE_STATE_UNSPECIFIED = 0; | ||
| MESSAGE_STATE_PENDING = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message hasn't been detected as active on chain.
| enum MessageState { | ||
| MESSAGE_STATE_UNSPECIFIED = 0; | ||
| MESSAGE_STATE_PENDING = 1; | ||
| MESSAGE_STATE_ACTIVE = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self explanatory.
| MESSAGE_STATE_UNSPECIFIED = 0; | ||
| MESSAGE_STATE_PENDING = 1; | ||
| MESSAGE_STATE_ACTIVE = 2; | ||
| MESSAGE_STATE_FAILED = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a final state that can't be recovered from. The message is permanently marked as failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Txn.failure_msg will explain why.
| message SendFilRequest { | ||
| string from = 1; | ||
| string to = 2; | ||
| int64 amount_nano_fil = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the use of nano fil everywhere in this api.
| if err != nil { | ||
| ch <- waitResult{err: fmt.Errorf("creating lotus client: %v", err)} | ||
| return | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anywhere we come across a temporary or recoverable error, we send that error on the result chan and return. In that case, the txn will be waited again in the future.
| lastestMsgCid, err := tx.latestMsgCid() | ||
| if err != nil { | ||
| tx.MessageState = pb.MessageState_MESSAGE_STATE_FAILED | ||
| tx.FailureMsg = fmt.Sprintf("getting latest message cid from txn: %v", err) | ||
| tx.UpdatedAt = time.Now() | ||
| log.Warnf("failing txn with err: %s", tx.FailureMsg) | ||
| if err := s.updateTxn(ctx, tx); err != nil { | ||
| ch <- waitResult{err: err} | ||
| return | ||
| } | ||
| ch <- waitResult{txn: tx} | ||
| return | ||
| } | ||
| c, err := cid.Decode(lastestMsgCid.Cid) | ||
| if err != nil { | ||
| tx.MessageState = pb.MessageState_MESSAGE_STATE_FAILED | ||
| tx.FailureMsg = fmt.Sprintf("decoding latest message cid for txn: %v", err) | ||
| tx.UpdatedAt = time.Now() | ||
| log.Warnf("failing txn with err: %s", tx.FailureMsg) | ||
| if err := s.updateTxn(ctx, tx); err != nil { | ||
| ch <- waitResult{err: err} | ||
| return | ||
| } | ||
| ch <- waitResult{txn: tx} | ||
| return | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If for some strange reason we can't make sense of the cid associated with the txn, that is a fatal/final error condition.
| "google.golang.org/grpc/status" | ||
| "google.golang.org/grpc/test/bufconn" | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to be very thorough for these tests. There are an unreasonable number of permutations of all the ListTxns options, so just tried to test the obvious ones there.
| os.Exit(exitVal) | ||
| } | ||
|
|
||
| func TestRestartWaiting(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very important one. It tests that the waiting state is properly initialized upon service restart.
| more := true | ||
| var moreToken int64 = 0 | ||
| for more { | ||
| req := &pb.ListTxnsRequest{CreatedAfter: txFirst.CreatedAt, CreatedBefore: txLast.CreatedAt, Limit: 3, Ascending: ascending} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This paging test makes sure that paging works properly when CreatedAfter and CreatedBefore constraints are also provided, both ascending and descending. This was tricky because paging also internally uses a created_at filter via the MoreToken... so it works!
Signed-off-by: Aaron Sutula <hi@asutula.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I left some comments in the review for some consideration. But maybe the plan is merging and continue in other PR maybe?
Some extra question: Are there some plans to include some sanity checks regarding big send fil amounts or similar? As to avoid errors ins getting clients sending too much FIL maybe by error or something out of control?
BTW, I left comments in order as I was thinking things. So if any other does the review, can follow some line of thought, even if some extra comment is discovered later.
api/sendfild/Dockerfile
Outdated
| COPY --from=0 /usr/lib/*-linux-gnu*/libcrypto.so* /usr/lib/ | ||
|
|
||
| # listenAddrs | ||
| EXPOSE 7006 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be 5000 considering new standards?
api/sendfild/Dockerfile.dev
Outdated
| EXPOSE 10006 | ||
| EXPOSE 8010 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same about 1006?
Not sure about 8010 should be changed.
|
|
||
| func SendFilWait() SendFilOption { | ||
| return func(req *pb.SendFilRequest) { | ||
| req.Wait = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early in review so I'm not entirely sure, but I can imagine that Wait blocks until confirmation.
Open question: should the "Wait" thing be encouraged if there's risk that the API timeouts and somehow the "send fil" continues after the API call errored? Maybe safer to discourage this use? (assuming my guess now of what "wait" means)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good point. I think it is normal and fine to want to wait here, and there is a chance that the underlying lotus call to StateMsgWait will timeout. This doesn't meant the transaction failed or timed out, just that our monitoring of it timed out. We need to make that clear to the caller. I think providing a proper error message in this case is the key.
api/sendfild/service/service.go
Outdated
| return ch | ||
| } | ||
|
|
||
| func (s *Service) Close() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that Close() doesn't wait for any go-routine created by wait(..) finishing it's job. I think that can't lead to a bad thing since partial writes doesn't leave things in a bad state.
The only thing I can imagine is that s.server.Stop(..) might timeout since there're current API calls waiting for that go-routine signaling that the txn got on chain. So most probably it can be common that the <t.C below will fire and kill the API processings (which I guess if fine for other APIs?).
Other option is to signal here in Close() any running go-routine from wait to stop early, so we avoid that timeout.
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
|
Ok @jsign I fixed the basic things based on your feedback. I'm going to merge this into
|
This is going to be merged into #480 and integrated into the
filrewardsdservice code, but splitting this out into a new PR just for clarity.Ok code all commented.
This PR includes:
StateWaitMsgand a ticker to make sure all transaction records are updated with the latest data available from lotus.waitparam when sending fil and getting a txn record so the caller can be notified when the message is active on chain.Lots of detail left in the code comments.