Benchmarks
Image Captioning on MSCOCO (Cross-Entropy Loss)

| Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| | 75.3 | 59.0 | 45.4 | 35.0 | 26.7 | 55.6 | 107.7 | 19.7 |
| | 76.4 | 60.6 | 46.9 | 36.1 | 27.6 | 56.6 | 113.0 | 20.4 |
| | 76.3 | 60.3 | 46.6 | 36.0 | 27.6 | 56.6 | 113.1 | 20.7 |
| | 76.8 | 61.1 | 47.6 | 36.9 | 28.2 | 57.2 | 116.3 | 21.2 |
| | 76.4 | 60.3 | 46.5 | 35.8 | 28.2 | 56.7 | 116.6 | 21.3 |
| | 76.3 | 60.2 | 46.4 | 35.6 | 28.1 | 56.5 | 116.0 | 21.2 |
| | 77.5 | 61.9 | 48.3 | 37.5 | 28.6 | 57.6 | 120.7 | 21.9 |
| | 75.5 | 59.4 | 45.7 | 34.9 | 28.7 | 56.7 | 116.3 | 22.0 |

Image Captioning on MSCOCO (CIDEr Score Optimization)

| Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| | 77.9 | 61.5 | 46.7 | 35.0 | 27.1 | 56.3 | 117.0 | 20.5 |
| | 79.4 | 63.5 | 48.9 | 37.1 | 27.9 | 57.6 | 123.1 | 21.3 |
| | 80.1 | 64.3 | 49.7 | 37.7 | 28.0 | 58.0 | 124.7 | 21.5 |
| | 80.2 | 64.7 | 50.3 | 38.5 | 28.5 | 58.4 | 127.2 | 22.1 |
| | 80.5 | 65.4 | 51.1 | 39.2 | 29.1 | 58.7 | 130.0 | 23.0 |
| | 80.7 | 65.5 | 51.4 | 39.6 | 29.2 | 58.9 | 131.1 | 22.9 |
| | 80.4 | 65.2 | 51.0 | 39.2 | 29.4 | 59.0 | 131.0 | 23.2 |
| | 81.3 | 66.3 | 52.0 | 40.1 | 29.6 | 59.8 | 132.6 | 23.4 |

Video Captioning on MSVD

| Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| | 77.0 | 65.6 | 56.9 | 48.1 | 32.4 | 68.1 | 73.1 | 4.8 |
| | 80.4 | 68.9 | 60.1 | 51.0 | 33.5 | 70.0 | 77.2 | 4.9 |
| | 79.0 | 67.6 | 58.5 | 49.4 | 33.3 | 68.7 | 80.3 | 4.9 |
| | 81.6 | 70.4 | 61.3 | 51.7 | 34.1 | 70.4 | 77.8 | 5.0 |

Video Captioning on MSR-VTT

| Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| | 73.6 | 60.8 | 49.0 | 38.6 | 26.0 | 58.3 | 41.1 | 5.6 |
| | 74.3 | 61.8 | 50.3 | 39.9 | 26.4 | 59.4 | 42.9 | 5.8 |
| | 75.4 | 62.3 | 50.0 | 39.2 | 26.5 | 58.7 | 44.0 | 5.9 |
| | 76.4 | 62.3 | 49.9 | 38.9 | 26.3 | 59.0 | 40.7 | 5.7 |

Visual Question Answering

| Model | Overall | Yes/No | Number | Other |
|---|---|---|---|---|
| | 70.1 | 86.8 | 53.7 | 59.6 |
| | 71.9 | 88.3 | 54.3 | 62.0 |

Caption-based image retrieval on Flickr30k

| Model | R1 | R5 | R10 |
|---|---|---|---|
| | 61.6 | 87.7 | 92.8 |
| | 62.0 | 86.6 | 92.4 |

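The R1, R5, and R10 columns are most naturally read as Recall@K: the percentage of caption queries whose ground-truth image appears among the top K retrieved images. The sketch below illustrates that standard definition on toy data. The function name `recall_at_k`, the image identifiers, and the assumption of exactly one ground-truth image per caption are illustrative choices, not taken from the benchmark code.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                target_ids: Sequence[str],
                k: int) -> float:
    """Percentage of queries whose ground-truth item is in the top-k results.

    ranked_ids[i] is the retrieval ranking (best first) for query i,
    target_ids[i] is the single ground-truth item for query i (an assumption).
    """
    if len(ranked_ids) != len(target_ids) or not target_ids:
        raise ValueError("need one non-empty ranking per target")
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


# Toy example: four caption queries over a gallery of three images.
ranked = [
    ["img_a", "img_b", "img_c"],
    ["img_c", "img_a", "img_b"],
    ["img_b", "img_c", "img_a"],
    ["img_a", "img_c", "img_b"],
]
targets = ["img_a", "img_a", "img_a", "img_b"]

for k in (1, 2, 3):
    print(f"R{k} = {recall_at_k(ranked, targets, k):.1f}")
```

Because the top-K set grows with K, Recall@K never decreases as K increases, which matches the ordering of the R1, R5, and R10 values in the table.
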
Visual commonsense reasoning

| Model | Q -> A | QA -> R | Q -> AR |
|---|---|---|---|
| | 73.0 | 75.3 | 55.4 |
| | 75.0 | 76.5 | 57.7 |