Skip to content

Conversation

@asutula
Copy link
Member

@asutula asutula commented Mar 3, 2021

This is going to be merged into #480 and integrated into the filrewardsd service code, but splitting this out into a new PR just for clarity.

Ok code all commented.

This PR includes:

  • Service for sending fil, listing transactions, as well as a summary api for reporting.
  • Persists all txn data in mongo
  • Utilizes lotus StateWaitMsg and a ticker to make sure all transaction records are updated with the latest data available from lotus.
  • Supports a wait param when sending fil and getting a txn record so the caller can be notified when the message is active on chain.

Lots of detail left in the code comments.

Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
@asutula asutula requested a review from textileben March 3, 2021 00:46
@asutula asutula self-assigned this Mar 3, 2021
asutula added 2 commits March 3, 2021 09:52
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Copy link
Contributor

@textileben textileben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Docker sections look good to me.

You might want to have someone else review the Filecoin portion as I'm not familiar with how the network messaging works.

Signed-off-by: Aaron Sutula <hi@asutula.com>
Comment on lines +37 to +44
"listenAddr": {
Key: "listen_addr",
DefValue: "127.0.0.1:5000",
},
"lotusAddr": {
Key: "lotus_addr",
DefValue: "127.0.0.1:7777",
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice these are now just plain host:port, no multiaddr. This depends on an open Powergate PR.

Comment on lines 98 to 103
service SendFilService {
rpc SendFil(SendFilRequest) returns (SendFilResponse) {}
rpc Txn(TxnRequest) returns (TxnResponse) {}
rpc ListTxns(ListTxnsRequest) returns (ListTxnsResponse) {}
rpc Summary(SummaryRequest) returns (SummaryResponse) {}
} No newline at end of file
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the API. Pretty straight forward, lets you send fil, query for a specific transaction, list transactions, and get a summary. The summary is mostly intended for monitoring so we can keep an eye on things.

Comment on lines +80 to +81
google.protobuf.Timestamp after = 1;
google.protobuf.Timestamp before = 2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can generate a summary for a range of time or if you don't provide any constraints, all time.

Comment on lines +85 to +95
int64 count_txns = 1;
int64 count_pending = 2;
int64 count_active = 3;
int64 count_failed = 4;
int64 count_waiting = 5;
int64 count_from_addrs = 6;
int64 count_to_addrs = 7;
int64 total_nano_fil_sent = 8;
double avg_nano_fil_sent = 9;
int64 max_nano_fil_sent = 10;
int64 min_nano_fil_sent = 11;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss how these should all be interpreted, but the idea is to have an idea of what "normal" looks like and keep an eye out for any deviations from that. A monitoring system can generate a summary, say every 30 min for the last hour (or more frequently), and we'll get some rolling data which could be nice.


message TxnRequest {
string message_cid = 1;
bool wait = 2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both SendFil and Txn take a wait parameter. If true, the response won't come until the transaction is active on chain or times out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so confirmed some comment above. Just a thought about the risky-ness of the API timing out or having some network error. I don't have the big picture yet, but maybe that's recoverable by querying later and discovering the sendfil pending/succeeding even if the client never got the initial API return (due to the timeout/network error)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea exactly. We use a reasonable timeout while waiting for the message in the lotus api. If that timeout expires, the error returned here will say that is what happened. The client is free to make another request. The service will also retry any previously timed out waits.


enum MessageState {
MESSAGE_STATE_UNSPECIFIED = 0;
MESSAGE_STATE_PENDING = 1;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message hasn't been detected as active on chain.

enum MessageState {
MESSAGE_STATE_UNSPECIFIED = 0;
MESSAGE_STATE_PENDING = 1;
MESSAGE_STATE_ACTIVE = 2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self explanatory.

MESSAGE_STATE_UNSPECIFIED = 0;
MESSAGE_STATE_PENDING = 1;
MESSAGE_STATE_ACTIVE = 2;
MESSAGE_STATE_FAILED = 3;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a final state that can't be recovered from. The message is permanently marked as failed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Txn.failure_msg will explain why.

message SendFilRequest {
string from = 1;
string to = 2;
int64 amount_nano_fil = 3;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the use of nano fil everywhere in this api.

Comment on lines 548 to 551
if err != nil {
ch <- waitResult{err: fmt.Errorf("creating lotus client: %v", err)}
return
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anywhere we come across a temporary or recoverable error, we send that error on the result chan and return. In that case, the txn will be waited again in the future.

Comment on lines 553 to 578
lastestMsgCid, err := tx.latestMsgCid()
if err != nil {
tx.MessageState = pb.MessageState_MESSAGE_STATE_FAILED
tx.FailureMsg = fmt.Sprintf("getting latest message cid from txn: %v", err)
tx.UpdatedAt = time.Now()
log.Warnf("failing txn with err: %s", tx.FailureMsg)
if err := s.updateTxn(ctx, tx); err != nil {
ch <- waitResult{err: err}
return
}
ch <- waitResult{txn: tx}
return
}
c, err := cid.Decode(lastestMsgCid.Cid)
if err != nil {
tx.MessageState = pb.MessageState_MESSAGE_STATE_FAILED
tx.FailureMsg = fmt.Sprintf("decoding latest message cid for txn: %v", err)
tx.UpdatedAt = time.Now()
log.Warnf("failing txn with err: %s", tx.FailureMsg)
if err := s.updateTxn(ctx, tx); err != nil {
ch <- waitResult{err: err}
return
}
ch <- waitResult{txn: tx}
return
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If for some strange reason we can't make sense of the cid associated with the txn, that is a fatal/final error condition.

"google.golang.org/grpc/status"
"google.golang.org/grpc/test/bufconn"
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to be very thorough for these tests. There are an unreasonable number of permutations of all the ListTxns options, so just tried to test the obvious ones there.

os.Exit(exitVal)
}

func TestRestartWaiting(t *testing.T) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very important one. It tests that the waiting state is properly initialized upon service restart.

more := true
var moreToken int64 = 0
for more {
req := &pb.ListTxnsRequest{CreatedAfter: txFirst.CreatedAt, CreatedBefore: txLast.CreatedAt, Limit: 3, Ascending: ascending}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paging test makes sure that paging works properly when CreatedAfter and CreatedBefore constraints are also provided, both ascending and descending. This was tricky because paging also internally uses a created_at filter via the MoreToken... so it works!

@asutula asutula marked this pull request as ready for review March 3, 2021 22:07
@asutula asutula requested a review from jsign March 3, 2021 22:08
Signed-off-by: Aaron Sutula <hi@asutula.com>
Copy link
Contributor

@jsign jsign left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I left some comments in the review for some consideration. But maybe the plan is merging and continue in other PR maybe?

Some extra question: Are there some plans to include some sanity checks regarding big send fil amounts or similar? As to avoid errors ins getting clients sending too much FIL maybe by error or something out of control?

BTW, I left comments in order as I was thinking things. So if any other does the review, can follow some line of thought, even if some extra comment is discovered later.

COPY --from=0 /usr/lib/*-linux-gnu*/libcrypto.so* /usr/lib/

# listenAddrs
EXPOSE 7006
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 5000 considering new standards?

Comment on lines 28 to 29
EXPOSE 10006
EXPOSE 8010
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same about 1006?
Not sure about 8010 should be changed.


func SendFilWait() SendFilOption {
return func(req *pb.SendFilRequest) {
req.Wait = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early in review so I'm not entirely sure, but I can imagine that Wait blocks until confirmation.

Open question: should the "Wait" thing be encouraged if there's risk that the API timeouts and somehow the "send fil" continues after the API call errored? Maybe safer to discourage this use? (assuming my guess now of what "wait" means)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point. I think it is normal and fine to want to wait here, and there is a chance that the underlying lotus call to StateMsgWait will timeout. This doesn't meant the transaction failed or timed out, just that our monitoring of it timed out. We need to make that clear to the caller. I think providing a proper error message in this case is the key.

return ch
}

func (s *Service) Close() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that Close() doesn't wait for any go-routine created by wait(..) finishing it's job. I think that can't lead to a bad thing since partial writes doesn't leave things in a bad state.

The only thing I can imagine is that s.server.Stop(..) might timeout since there're current API calls waiting for that go-routine signaling that the txn got on chain. So most probably it can be common that the <t.C below will fire and kill the API processings (which I guess if fine for other APIs?).

Other option is to signal here in Close() any running go-routine from wait to stop early, so we avoid that timeout.

asutula added 7 commits March 4, 2021 20:51
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
Signed-off-by: Aaron Sutula <hi@asutula.com>
@asutula
Copy link
Member Author

asutula commented Mar 8, 2021

Ok @jsign I fixed the basic things based on your feedback. I'm going to merge this into asutula/fil-rewards-bookkeeping and follow up with another PR that addresses the more complex issues of:

  • removing more from list results
  • dealing with no nanosecond precision in mongo for moreToken
  • correctly using channels to notify all interested callers of updates to messages

@asutula asutula merged commit aaf4d6e into asutula/fil-rewards-bookkeeping Mar 8, 2021
@asutula asutula deleted the asutula/sendfil branch March 8, 2021 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants